Incident response

An alert fired in Google Chat. Breathe. You probably have 5-15 minutes before it gets worse. Walk through this page top-to-bottom.

Step 0 — Read the alert carefully

Every alert message carries enough info to triage:

| Field in the Chat message | Tells you |
| --- | --- |
| [CRITICAL] / [WARNING] / [INFO] severity | How urgent |
| alertname (the heading) | Which rule fired — maps to a runbook section below |
| namespace / pod / node | Which resource, in which tenant |
| summary + description | The specific "what happened", usually with a hint at "what to do" |

If the alertname appears in the table below, jump to its section.
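When the alert arrives as plain text, those fields can be pulled out with a line of shell. This is a sketch under the assumption that fields appear as comma-separated "name: value" pairs; adjust the pattern to your actual Chat alert template:

```bash
# Sketch: extract a named field from an alert message on stdin.
# Assumes fields look like "name: value" separated by commas
# (an assumption; match it to the real Chat template).
alert_field() {
  grep -o "$1: [^,]*" | head -n1 | cut -d' ' -f2-
}
```

For example, `printf '%s' "$msg" | alert_field alertname` prints the rule name to jump to in the table below.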

Step 1 — Is a human impacted right now?

Before digging into logs, check customer-facing signals:

  • Wecare External Probes dashboard: are the "Is wecare up?" panels green or red?
  • Open https://www.wecare.gr in a private browser window.
  • Check the Traefik 5xx rate on the Wecare Operations dashboard.

If customers are seeing errors or downtime: communicate first (post a brief heads-up in the ops Chat space), then triage. If not: quiet investigation is fine.

Step 2 — Pick a runbook

| Alert name pattern | Quick triage |
| --- | --- |
| WecareNodeNotReady / KubeNodeNotReady | Node is NotReady |
| RedisDown | Redis pod down |
| RedisEndpointEmpty | Redis endpoints empty |
| RedisReplicationBroken / RedisSentinelPodsDown | Redis replication / sentinel |
| MariaDBDown | MariaDB down |
| MariaDBReplicationLagCritical | MariaDB replication lag |
| ExternalEndpointDown | External endpoint down |
| TraefikWecareHigh5xxRate | Traefik 5xx spike |
| FPMPoolSaturation / FPMListenQueueBacklog | PHP-FPM saturated |
| KubePodCrashLooping / PodStuckTerminating | Pod not happy |
| Watchdog | ✅ ignore — this is the "pipeline is alive" heartbeat |

Runbooks

Node is NotReady

Likely: kubelet on the node hung, or the VM itself died.

```bash
kubectl get nodes -o wide
kubectl describe node <node-name> | tail -30
```

If the node is a wecare-cache or wecare-db node, and pods on it are critical, delegate to the k8s-manager agent — drain + failover has nuances per component.

For a control-plane master node NotReady, check the Hetzner console (server console page) — may need hard reset. Never hard-reset more than one master at a time (etcd quorum).
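To see at a glance which nodes are unhealthy, the `kubectl get nodes` output can be filtered down. A minimal sketch; note it treats anything whose STATUS column is not exactly "Ready" as suspect, so cordoned nodes (Ready,SchedulingDisabled) will also show up:

```bash
# Sketch: print only the nodes whose STATUS column is not "Ready".
# Typical use: kubectl get nodes | not_ready
not_ready() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}
```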

Redis pod down

```bash
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app=redis-cache
kubectl -n ecommercen-clients-wecare-infrastructure logs redis-cache-0 -c redis --tail=50
```

Common patterns:

  • OOMKilled → pod hit memory limit, operator restarts it; check maxmemory vs pod memory limit alignment.
  • AUTH errors in logs → redis-auth secret changed without a full rolling restart.
  • Pod keeps restarting → check kubectl describe pod for the real reason.
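For the OOMKilled pattern, the maxmemory vs pod-limit alignment can be sanity-checked numerically. A sketch; the 80% headroom figure is an assumption (a common rule of thumb so the kernel does not OOM-kill the pod before Redis can evict keys), not project policy:

```bash
# Sketch: check that Redis maxmemory leaves headroom under the
# container memory limit. The 0.8 factor is an assumption.
check_headroom() {  # $1 = redis maxmemory bytes, $2 = pod memory limit bytes
  awk -v mm="$1" -v lim="$2" 'BEGIN {
    if (mm <= lim * 0.8) print "ok"; else print "too-tight"
  }'
}
```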

Redis endpoints empty

The master/replica labels got out of sync with reality. Check the redis-role labels:

```bash
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app=redis-cache --show-labels
```

If no pod has redis-role=master, the operator hasn't reconciled. Usually: restart the redis-operator pod in op-redis namespace. If that doesn't work, the db-manager agent has the recovery procedure.
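A quick way to turn that label listing into a yes/no answer:

```bash
# Sketch: count lines carrying redis-role=master in the
# `get pods --show-labels` output. 0 means the operator
# has not elected a master yet.
count_masters() {
  grep -c 'redis-role=master' || true
}
```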

Redis replication / sentinel

Replication broken between master + replica, OR sentinel doesn't have quorum.

```bash
kubectl -n ecommercen-clients-wecare-infrastructure exec redis-cache-0 -c redis -- \
  redis-cli -a "$(kubectl ... get secret redis-auth -o jsonpath='{.data.password}' | base64 -d)" info replication
```

Check master_link_status, connected_slaves. If replication is stuck, often a rolling restart of the slave pod resyncs it. Delegate to the db-manager agent for anything complex.
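Those two fields can also be checked mechanically. A sketch that reads the `info replication` output on stdin and prints a one-word verdict, under the assumption that a replica is healthy when master_link_status is up and a master is healthy when it has at least one connected slave:

```bash
# Sketch: one-word replication verdict from `info replication` output.
replication_health() {
  awk -F: '
    { sub(/\r$/, "") }                     # redis-cli output uses CRLF line endings
    /^master_link_status:/ { link = $2 }
    /^connected_slaves:/   { slaves = $2 + 0 }
    END { if (link == "up" || slaves >= 1) print "ok"; else print "broken" }
  '
}
```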

MariaDB down

Check the pod:

```bash
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app.kubernetes.io/name=mariadb
kubectl -n ecommercen-clients-wecare-infrastructure logs <pod> -c mariadb --tail=100
```

Common: storage full, OOMKilled, config error after a myCnf change. The mariadb-operator has its own reconcile — give it a minute before intervening.
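The storage-full case is cheap to rule out first. A read-only sketch to run against `df -h` output from inside the container; the 90% threshold is an assumption:

```bash
# Sketch: flag a df -h data-volume line once usage crosses 90%.
# Typical use:
#   kubectl -n <ns> exec <pod> -c mariadb -- df -h /var/lib/mysql | disk_alarm
disk_alarm() {
  awk 'NR > 1 { use = $5 + 0; if (use >= 90) print $6, "FULL(" $5 ")"; else print $6, "ok" }'
}
```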

Never try to manually fix MariaDB data directory. Delegate to the db-manager agent.

MariaDB replication lag

Replica is behind master. Open the Wecare Database dashboard in Grafana and look at the lag panel over time.

  • Short spike (1-5 min): usually self-resolves. Wait.
  • Sustained (>10 min climbing): likely a large transaction or hot-row contention on the master. db-manager agent.
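The two bullets above reduce to a tiny decision rule. Thresholds mirror the bullets and are a starting point, not policy:

```bash
# Sketch: wait / watch / escalate based on how long the lag has persisted.
lag_action() {  # $1 = minutes the lag has been sustained
  if [ "$1" -le 5 ]; then echo "wait"                       # short spike: self-resolves
  elif [ "$1" -gt 10 ]; then echo "escalate-to-db-manager"  # sustained and climbing
  else echo "watch"                                         # grey zone: keep an eye on it
  fi
}
```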

External endpoint down

Blackbox probe says the site is unreachable from outside the cluster.

  1. Can you reach it from a browser? If yes, likely a probe config issue — check blackbox-exporter logs.
  2. Can't reach it externally? Check the traffic path: Cloudflare status page → Cloudflared tunnel pod logs → Traefik → backend service.
  3. Grafana "External Probes" dashboard shows TLS days remaining — if cert expired, cert-manager is the culprit (rare since renewals are automatic).
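For point 3, the days-remaining number can also be computed by hand from the certificate's notAfter date. A sketch assuming GNU date; the second argument exists so the function is testable, in practice you would pass the current time:

```bash
# Sketch: whole days between a reference time and the cert expiry.
# Get the expiry with:
#   openssl s_client -connect www.wecare.gr:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
days_left() {  # $1 = expiry date string, $2 = reference "now"
  exp=$(date -d "$1" +%s)
  now=$(date -d "$2" +%s)
  echo $(( (exp - now) / 86400 ))
}
```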

Traefik 5xx spike

Application is returning errors. Grafana Operations dashboard shows you which service and which status codes.

  • 502/503: backend down or slow. Check the PHP-FPM saturated section next.
  • 500: application error. Run kubectl logs on the app-web pods for error traces.
  • 504: backend timeout. Usually a slow DB query or external API call. Check Grafana → Loki for slow-query logs.
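The dashboard's error-rate number is just two counters divided, so it can be recomputed by hand to sanity-check a noisy panel (assumption: the panel shows 5xx requests as a percentage of all requests over the window):

```bash
# Sketch: 5xx error rate as a percentage, one decimal place.
error_rate() {  # $1 = 5xx request count, $2 = total request count
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) print "0.0"; else printf "%.1f\n", 100 * e / t
  }'
}
```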

PHP-FPM saturated

FPM pool ran out of workers. You'll see it in the Wecare App Health dashboard (listen queue >0, pool utilisation near 100%, max_children_reached counter climbing).

Short-term: KEDA should already scale app-web pods up automatically. If it's already at max, raise max temporarily.

Long-term: find the slow route. The phpfpm_slow_requests metric plus the slow-query log in Loki tell you which endpoint is the culprit.
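The saturation signals above reduce to numbers you can eyeball. A sketch with field names borrowed from the FPM status page; the thresholds are assumptions:

```bash
# Sketch: classify an FPM pool from three status-page numbers.
# Any listen-queue backlog, or utilisation near 100%, counts as saturated.
fpm_state() {  # $1 = active processes, $2 = pm.max_children, $3 = listen queue
  util=$(( 100 * $1 / $2 ))
  if [ "$3" -gt 0 ] || [ "$util" -ge 95 ]; then
    echo "saturated ($util%)"
  else
    echo "ok ($util%)"
  fi
}
```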

Pod not happy

CrashLoopBackOff, ImagePullBackOff, stuck Terminating, etc. See the Check app status runbook — it's the dedicated triage.

Step 3 — After it's over

If the alert fired for >5 minutes or was customer-visible:

  1. Drop a short note in a new file under memory/ (or a Chat thread) describing what happened, what you did, and what you'd do differently. Future-you will thank you.
  2. If the alert's threshold was wrong (false positive, or should've fired earlier), adjust the PrometheusRule in manifests_v1/app-constructs/kube-prometheus-stack/setup/prometheusrule-*.yaml.
  3. If there's a recurring pattern, file a GitHub issue.

When to escalate vs. DIY

| Situation | Do |
| --- | --- |
| You're confident you know what it is | Handle it, commit the fix |
| You're unsure, but nothing's on fire | Ask in Chat, read the agent docs |
| Customer-facing + you're unsure | Communicate first, then escalate to whoever has the context (dev team for app errors, GitOps person for infra) |
| Multi-system weirdness + time pressure | Delegate to Claude — the k8s-manager / network-expert / db-manager agents know the context and will triage faster than ad-hoc |

Further reading

Internal documentation — Advisable only