# Incident response

An alert fired in Google Chat. Breathe. You probably have 5-15 minutes before it gets worse. Walk through this page top-to-bottom.
## Step 0 — Read the alert carefully

Every alert message carries enough information to triage:
| Field in the Chat message | Tells you |
|---|---|
| `[CRITICAL]` / `[WARNING]` / `[INFO]` severity | How urgent it is |
| `alertname` (the heading) | Which rule fired — maps to a runbook section below |
| `namespace` / `pod` / `node` | Which resource, in which tenant |
| `summary` + `description` | The specific "what happened", usually with a hint at "what to do" |
If the alertname appears in the table below, jump to its section.
## Step 1 — Is a human impacted right now?

Before digging into logs, check customer-facing signals:
- Wecare External Probes dashboard: are the "Is wecare up?" panels green or red?
- Open https://www.wecare.gr in a private browser window.
- Check the Traefik 5xx rate on the Wecare Operations dashboard.
If customers are seeing errors or downtime: communicate first (post a brief heads-up in the ops Chat space), then triage. If not: quiet investigation is fine.
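The "is it up?" check can be scripted for repeatability. A minimal sketch — the URL is the one from this page, and `probe_status` needs outbound network access, so treat it as illustrative rather than a monitoring replacement:

```shell
#!/bin/sh
# probe_status URL -> prints the HTTP status code curl saw
# ("000" on connection failure, i.e. the site is unreachable).
probe_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# classify CODE -> "up" for 2xx/3xx, "down" for anything else (4xx/5xx/000).
classify() {
  case "$1" in
    2??|3??) echo up ;;
    *)       echo down ;;
  esac
}

# Usage: classify "$(probe_status https://www.wecare.gr)"
```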
## Step 2 — Pick a runbook
| Alert name pattern | Quick triage |
|---|---|
| `WecareNodeNotReady` / `KubeNodeNotReady` | Node is NotReady |
| `RedisDown` | Redis pod down |
| `RedisEndpointEmpty` | Redis endpoints empty |
| `RedisReplicationBroken` / `RedisSentinelPodsDown` | Redis replication / sentinel |
| `MariaDBDown` | MariaDB down |
| `MariaDBReplicationLagCritical` | MariaDB replication lag |
| `ExternalEndpointDown` | External endpoint down |
| `TraefikWecareHigh5xxRate` | Traefik 5xx spike |
| `FPMPoolSaturation` / `FPMListenQueueBacklog` | PHP-FPM saturated |
| `KubePodCrashLooping` / `PodStuckTerminating` | Pod not happy |
| `Watchdog` | ✅ ignore — this is the "pipeline is alive" heartbeat |
## Runbooks
### Node is NotReady

Likely: the kubelet on the node hung, or the VM itself died.

```shell
kubectl get nodes -o wide
kubectl describe node <node-name> | tail -30
```

If the node is a wecare-cache or wecare-db node and pods on it are critical, delegate to the k8s-manager agent — drain + failover has nuances per component.
For a control-plane master node that is NotReady, check the Hetzner console (server console page) — it may need a hard reset. Never hard-reset more than one master at a time (etcd quorum).
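To see at a glance which nodes are unhealthy, the `kubectl get nodes` output can be filtered with plain text tools (so this also works on saved output). Note that a cordoned node shows `Ready,SchedulingDisabled` and will be listed too:

```shell
#!/bin/sh
# not_ready_nodes reads `kubectl get nodes` output on stdin and prints the
# NAME of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage: kubectl get nodes | not_ready_nodes
```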
### Redis pod down

```shell
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app=redis-cache
kubectl -n ecommercen-clients-wecare-infrastructure logs redis-cache-0 -c redis --tail=50
```

Common patterns:

- `OOMKilled` → the pod hit its memory limit and the operator restarts it; check that `maxmemory` aligns with the pod memory limit.
- AUTH errors in the logs → the redis-auth secret changed without a full rolling restart.
- Pod keeps restarting → `kubectl describe pod` for the real reason.
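If you suspect the AUTH-error case, you can verify the current secret against the live pod. A sketch using the namespace, secret, and pod names from this runbook:

```shell
#!/bin/sh
# decode_b64: kubectl stores secret values base64-encoded; decode from stdin.
decode_b64() { base64 -d; }

# redis_password: fetch and decode the current redis-auth password.
redis_password() {
  kubectl -n ecommercen-clients-wecare-infrastructure \
    get secret redis-auth -o jsonpath='{.data.password}' | decode_b64
}

# Then verify Redis accepts it (expect PONG; an AUTH mismatch errors instead):
#   kubectl -n ecommercen-clients-wecare-infrastructure exec redis-cache-0 \
#     -c redis -- redis-cli -a "$(redis_password)" ping
```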
### Redis endpoints empty

The master/replica label got out of sync with reality. Check the redis-role labels:

```shell
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app=redis-cache --show-labels
```

If no pod has redis-role=master, the operator hasn't reconciled. Usually: restart the redis-operator pod in the op-redis namespace. If that doesn't work, the db-manager agent has the recovery procedure.
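The missing-master condition can be turned into a quick yes/no by filtering the `--show-labels` output:

```shell
#!/bin/sh
# has_master reads `kubectl get pods ... --show-labels` output on stdin and
# succeeds (exit 0) only if some pod carries the redis-role=master label.
has_master() {
  grep -q 'redis-role=master'
}

# Usage:
#   kubectl -n ecommercen-clients-wecare-infrastructure get pods \
#     -l app=redis-cache --show-labels | has_master \
#     || echo "no master - restart redis-operator in op-redis"
```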
### Redis replication / sentinel

Replication is broken between master and replica, OR sentinel doesn't have quorum.

```shell
kubectl -n ecommercen-clients-wecare-infrastructure exec redis-cache-0 -c redis -- \
  redis-cli -a "$(kubectl ... get secret redis-auth -o jsonpath='{.data.password}' | base64 -d)" info replication
```

Check `master_link_status` and `connected_slaves`. If replication is stuck, a rolling restart of the replica pod often resyncs it. Delegate to the db-manager agent for anything complex.
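Both fields can be checked mechanically from the `info replication` output. A sketch — Redis INFO uses CRLF line endings, hence the `tr`:

```shell
#!/bin/sh
# replication_ok reads `redis-cli info replication` output on stdin and
# exits 0 only when the replica reports master_link_status:up.
replication_ok() {
  tr -d '\r' | grep -qx 'master_link_status:up'
}

# connected_slaves reads the same output and prints the replica count.
connected_slaves() {
  tr -d '\r' | sed -n 's/^connected_slaves://p'
}
```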
### MariaDB down

Check the pod:

```shell
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app.kubernetes.io/name=mariadb
kubectl -n ecommercen-clients-wecare-infrastructure logs <pod> -c mariadb --tail=100
```

Common causes: storage full, OOMKilled, or a config error after a myCnf change. The mariadb-operator has its own reconcile loop — give it a minute before intervening.

Never try to manually fix the MariaDB data directory. Delegate to the db-manager agent.
### MariaDB replication lag
Replica is behind master. Open the Wecare Database dashboard in Grafana and look at the lag panel over time.
- Short spike (1-5 min): usually self-resolves. Wait.
- Sustained (>10 min climbing): likely a large transaction or hot-row contention on the master. db-manager agent.
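The wait/escalate decision above can be expressed as a tiny helper. The 300 s and 600 s cutoffs are illustrative, mirroring the 1-5 min and >10 min bands:

```shell
#!/bin/sh
# classify_lag SECONDS -> wait | watch | escalate, per the guidance above.
classify_lag() {
  if   [ "$1" -le 300 ]; then echo wait      # short spike, usually self-resolves
  elif [ "$1" -le 600 ]; then echo watch     # keep an eye on the Grafana panel
  else                        echo escalate  # sustained - hand to the db-manager agent
  fi
}
```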
### External endpoint down
Blackbox probe says the site is unreachable from outside the cluster.
- Can you reach it from a browser? If yes, likely a probe config issue — check blackbox-exporter logs.
- Can't reach it externally? Check the traffic path: Cloudflare status page → Cloudflared tunnel pod logs → Traefik → backend service.
- The Grafana "External Probes" dashboard shows TLS days remaining — if the cert expired, cert-manager is the culprit (rare, since renewals are automatic).
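If you need to verify expiry by hand, the days-remaining arithmetic looks like this. A sketch using GNU `date`; the input is the kind of string `openssl x509 -noout -enddate` emits:

```shell
#!/bin/sh
# days_until EXPIRY -> whole days from now until EXPIRY
# (negative if the certificate has already expired).
days_until() {
  echo $(( ( $(date -d "$1" +%s) - $(date +%s) ) / 86400 ))
}

# Usage (hypothetical, against the live endpoint):
#   exp=$(echo | openssl s_client -connect www.wecare.gr:443 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until "$exp"
```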
### Traefik 5xx spike

The application is returning errors. The Grafana Operations dashboard shows you which service and which status codes.

- 502/503: backend down or slow. Check PHP-FPM saturated next.
- 500: application error. `kubectl logs` the app-web pods for error traces.
- 504: backend timeout. Usually a slow DB query or external API call. Grafana → Loki for slow-query logs.
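The status-code branching above, as a helper you can point at a code from the dashboard — the strings just echo the bullets, so this is a memo, not an API:

```shell
#!/bin/sh
# triage_5xx CODE -> prints the first place to look, mirroring the list above.
triage_5xx() {
  case "$1" in
    502|503) echo "backend down or slow - check PHP-FPM saturation" ;;
    500)     echo "application error - kubectl logs the app-web pods" ;;
    504)     echo "backend timeout - check slow queries in Loki" ;;
    *)       echo "not a known 5xx pattern - start from the app logs" ;;
  esac
}
```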
### PHP-FPM saturated
FPM pool ran out of workers. You'll see it in the Wecare App Health dashboard (listen queue >0, pool utilisation near 100%, max_children_reached counter climbing).
Short-term: KEDA should already scale app-web pods up automatically. If it's already at max, raise max temporarily.
Long-term: find the slow route. phpfpm_slow_requests metric + slow-query log in Loki tells you which endpoint is the culprit.
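For eyeballing saturation from raw FPM numbers, utilisation is just active workers over `max_children`. A sketch (integer percentage; where you pull the two numbers from — exporter metrics or the FPM status page — depends on your setup):

```shell
#!/bin/sh
# pool_utilisation ACTIVE MAX -> integer percentage of the pool in use;
# sustained values near 100 plus a non-empty listen queue = saturation.
pool_utilisation() {
  echo $(( $1 * 100 / $2 ))
}
```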
### Pod not happy
CrashLoopBackOff, ImagePullBackOff, stuck Terminating, etc. See the Check app status runbook — it's the dedicated triage.
## Step 3 — After it's over
If the alert fired for >5 minutes or was customer-visible:
- Drop a short note in a new file under `memory/` (or a Chat thread) describing what happened, what you did, and what you'd do differently. Future-you will thank you.
- If the alert's threshold was wrong (false positive, or should've fired earlier), adjust the PrometheusRule in `manifests_v1/app-constructs/kube-prometheus-stack/setup/prometheusrule-*.yaml`.
- If there's a recurring pattern, file a GitHub issue.
## When to escalate vs. DIY
| Situation | Do |
|---|---|
| You're confident you know what it is | Handle it, commit the fix |
| You're unsure, but nothing's on fire | Ask in Chat, read the agent docs |
| Customer-facing + you're unsure | Communicate first, then escalate to whoever has the context (dev team for app errors, GitOps person for infra) |
| Multi-system weirdness + time pressure | Delegate to Claude — the k8s-manager / network-expert / db-manager agents know the context and will triage faster than ad-hoc |
## Further reading
- Daily checks — catches a lot of issues before they become alerts
- Check app status — the dedicated app-level triage
- Claude agents — who to delegate to for what