Incident response

An alert fired in Google Chat. Breathe. You probably have 5-15 minutes before it gets worse. Walk through this page top-to-bottom.

Step 0 — Read the alert carefully

Every alert message carries enough info to triage:

| Field in the Chat message | Tells you |
| --- | --- |
| [CRITICAL] / [WARNING] / [INFO] severity | How urgent |
| alertname (the heading) | Which rule fired — maps to a runbook section below |
| namespace / pod / node | Which resource, in which tenant |
| summary + description | The specific "what happened", usually with a hint at "what to do" |

If the alertname appears in the table below, jump to its section.
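When the alert arrives as plain text, those fields can be pulled out with a line of shell. This is a sketch under the assumption that fields appear as comma-separated "name: value" pairs; adjust the pattern to your actual Chat alert template:

```bash
# Sketch: extract a named field from an alert message on stdin.
# Assumes fields look like "name: value" separated by commas
# (an assumption; match it to the real Chat template).
alert_field() {
  grep -o "$1: [^,]*" | head -n1 | cut -d' ' -f2-
}
```

For example, `printf '%s' "$msg" | alert_field alertname` prints the rule name to jump to in the table below.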

Step 1 — Is a human impacted right now?

Before digging into logs, check customer-facing signals:

  • Wecare External Probes dashboard: are the "Is wecare up?" panels green or red?
  • Open https://www.wecare.gr in a private browser window.
  • Check the Traefik 5xx rate on the Wecare Operations dashboard.

If customers are seeing errors or downtime: communicate first (post a brief heads-up in the ops Chat space), then triage. If not: quiet investigation is fine.

Step 2 — Pick a runbook

| Alert name pattern | Quick triage |
| --- | --- |
| WecareNodeNotReady / KubeNodeNotReady | Node is NotReady |
| RedisDown | Redis pod down |
| RedisEndpointEmpty | Redis endpoints empty |
| RedisReplicationBroken / RedisSentinelPodsDown | Redis replication / sentinel |
| MariaDBDown | MariaDB down |
| MariaDBReplicationLagCritical | MariaDB replication lag |
| ExternalEndpointDown | External endpoint down |
| TraefikWecareHigh5xxRate | Traefik 5xx spike |
| FPMPoolSaturation / FPMListenQueueBacklog | PHP-FPM saturated |
| KubePodCrashLooping / PodStuckTerminating | Pod not happy |
| Watchdog | ✅ ignore — this is the "pipeline is alive" heartbeat |

Runbooks

Node is NotReady

Likely: kubelet on the node hung, or the VM itself died.

```bash
kubectl get nodes -o wide
kubectl describe node <node-name> | tail -30
```

If the node is a wecare-cache or wecare-db node, and pods on it are critical, delegate to the k8s-manager agent — drain + failover has nuances per component.

For a control-plane master node NotReady, check the Hetzner console (server console page) — may need hard reset. Never hard-reset more than one master at a time (etcd quorum).
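To see at a glance which nodes are unhealthy, the `kubectl get nodes` output can be filtered down. A minimal sketch; note it treats anything whose STATUS column is not exactly "Ready" as suspect, so cordoned nodes (Ready,SchedulingDisabled) will also show up:

```bash
# Sketch: print only the nodes whose STATUS column is not "Ready".
# Typical use: kubectl get nodes | not_ready
not_ready() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}
```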

Redis pod down

```bash
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app=redis-cache
kubectl -n ecommercen-clients-wecare-infrastructure logs redis-cache-0 -c redis --tail=50
```

Common patterns:

  • OOMKilled → pod hit memory limit, operator restarts it; check maxmemory vs pod memory limit alignment.
  • AUTH errors in logs → redis-auth secret changed without a full rolling restart.
  • Pod keeps restarting → check kubectl describe pod for the real reason.
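For the OOMKilled pattern, the maxmemory vs pod-limit alignment can be sanity-checked numerically. A sketch; the 80% headroom figure is an assumption (a common rule of thumb so the kernel does not OOM-kill the pod before Redis can evict keys), not project policy:

```bash
# Sketch: check that Redis maxmemory leaves headroom under the
# container memory limit. The 0.8 factor is an assumption.
check_headroom() {  # $1 = redis maxmemory bytes, $2 = pod memory limit bytes
  awk -v mm="$1" -v lim="$2" 'BEGIN {
    if (mm <= lim * 0.8) print "ok"; else print "too-tight"
  }'
}
```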

Redis endpoints empty

The master/replica labels got out of sync with reality. Check the redis-role labels:

```bash
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app=redis-cache --show-labels
```

If no pod has redis-role=master, the operator hasn't reconciled. Usually: restart the redis-operator pod in op-redis namespace. If that doesn't work, the db-manager agent has the recovery procedure.
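A quick way to turn that label listing into a yes/no answer:

```bash
# Sketch: count lines carrying redis-role=master in the
# `get pods --show-labels` output. 0 means the operator
# has not elected a master yet.
count_masters() {
  grep -c 'redis-role=master' || true
}
```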

Redis replication / sentinel

Replication broken between master + replica, OR sentinel doesn't have quorum.

```bash
kubectl -n ecommercen-clients-wecare-infrastructure exec redis-cache-0 -c redis -- \
  redis-cli -a "$(kubectl ... get secret redis-auth -o jsonpath='{.data.password}' | base64 -d)" info replication
```

Check master_link_status, connected_slaves. If replication is stuck, often a rolling restart of the slave pod resyncs it. Delegate to the db-manager agent for anything complex.
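Those two fields can also be checked mechanically. A sketch that reads the `info replication` output on stdin and prints a one-word verdict, under the assumption that a replica is healthy when master_link_status is up and a master is healthy when it has at least one connected slave:

```bash
# Sketch: one-word replication verdict from `info replication` output.
replication_health() {
  awk -F: '
    { sub(/\r$/, "") }                     # redis-cli output uses CRLF line endings
    /^master_link_status:/ { link = $2 }
    /^connected_slaves:/   { slaves = $2 + 0 }
    END { if (link == "up" || slaves >= 1) print "ok"; else print "broken" }
  '
}
```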

MariaDB down

Check the pod:

```bash
kubectl -n ecommercen-clients-wecare-infrastructure get pods -l app.kubernetes.io/name=mariadb
kubectl -n ecommercen-clients-wecare-infrastructure logs <pod> -c mariadb --tail=100
```

Common: storage full, OOMKilled, config error after a myCnf change. The mariadb-operator has its own reconcile — give it a minute before intervening.
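The storage-full case is cheap to rule out first. A read-only sketch to run against `df -h` output from inside the container; the 90% threshold is an assumption:

```bash
# Sketch: flag a df -h data-volume line once usage crosses 90%.
# Typical use:
#   kubectl -n <ns> exec <pod> -c mariadb -- df -h /var/lib/mysql | disk_alarm
disk_alarm() {
  awk 'NR > 1 { use = $5 + 0; if (use >= 90) print $6, "FULL(" $5 ")"; else print $6, "ok" }'
}
```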

Never try to manually fix MariaDB data directory. Delegate to the db-manager agent.

MariaDB replication lag

Replica is behind master. Open the Wecare Database dashboard in Grafana and look at the lag panel over time.

  • Short spike (1-5 min): usually self-resolves. Wait.
  • Sustained (>10 min climbing): likely a large transaction or hot-row contention on the master. db-manager agent.
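The two bullets above reduce to a tiny decision rule. Thresholds mirror the bullets and are a starting point, not policy:

```bash
# Sketch: wait / watch / escalate based on how long the lag has persisted.
lag_action() {  # $1 = minutes the lag has been sustained
  if [ "$1" -le 5 ]; then echo "wait"                       # short spike: self-resolves
  elif [ "$1" -gt 10 ]; then echo "escalate-to-db-manager"  # sustained and climbing
  else echo "watch"                                         # grey zone: keep an eye on it
  fi
}
```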

External endpoint down

Blackbox probe says the site is unreachable from outside the cluster.

  1. Can you reach it from a browser? If yes, likely a probe config issue — check blackbox-exporter logs.
  2. Can't reach it externally? Check the traffic path: Cloudflare status page → Cloudflared tunnel pod logs → Traefik → backend service.
  3. Grafana "External Probes" dashboard shows TLS days remaining — if cert expired, cert-manager is the culprit (rare since renewals are automatic).
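For point 3, the days-remaining number can also be computed by hand from the certificate's notAfter date. A sketch assuming GNU date; the second argument exists so the function is testable, in practice you would pass the current time:

```bash
# Sketch: whole days between a reference time and the cert expiry.
# Get the expiry with:
#   openssl s_client -connect www.wecare.gr:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
days_left() {  # $1 = expiry date string, $2 = reference "now"
  exp=$(date -d "$1" +%s)
  now=$(date -d "$2" +%s)
  echo $(( (exp - now) / 86400 ))
}
```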

Traefik 5xx spike

Application is returning errors. Grafana Operations dashboard shows you which service and which status codes.

  • 502/503: backend down or slow. Check the PHP-FPM saturated section next.
  • 500: application error. Run kubectl logs on the app-web pods for error traces.
  • 504: backend timeout. Usually a slow DB query or external API call. Check Grafana → Loki for slow-query logs.
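The dashboard's error-rate number is just two counters divided, so it can be recomputed by hand to sanity-check a noisy panel (assumption: the panel shows 5xx requests as a percentage of all requests over the window):

```bash
# Sketch: 5xx error rate as a percentage, one decimal place.
error_rate() {  # $1 = 5xx request count, $2 = total request count
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) print "0.0"; else printf "%.1f\n", 100 * e / t
  }'
}
```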

PHP-FPM saturated

FPM pool ran out of workers. You'll see it in the Wecare App Health dashboard (listen queue >0, pool utilisation near 100%, max_children_reached counter climbing).

Short-term: KEDA should already scale app-web pods up automatically. If it's already at max, raise max temporarily.

Long-term: find the slow route. The phpfpm_slow_requests metric plus the slow-query log in Loki tell you which endpoint is the culprit.
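The saturation signals above reduce to numbers you can eyeball. A sketch with field names borrowed from the FPM status page; the thresholds are assumptions:

```bash
# Sketch: classify an FPM pool from three status-page numbers.
# Any listen-queue backlog, or utilisation near 100%, counts as saturated.
fpm_state() {  # $1 = active processes, $2 = pm.max_children, $3 = listen queue
  util=$(( 100 * $1 / $2 ))
  if [ "$3" -gt 0 ] || [ "$util" -ge 95 ]; then
    echo "saturated ($util%)"
  else
    echo "ok ($util%)"
  fi
}
```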

Pod not happy

CrashLoopBackOff, ImagePullBackOff, stuck Terminating, etc. See the Check app status runbook — it's the dedicated triage.

Step 3 — After it's over

If the alert fired for >5 minutes or was customer-visible:

  1. Drop a short note in a new file under memory/ (or a Chat thread) describing what happened, what you did, and what you'd do differently. Future-you will thank you.
  2. If the alert's threshold was wrong (false positive, or should've fired earlier), adjust the PrometheusRule in manifests_v1/app-constructs/kube-prometheus-stack/setup/prometheusrule-*.yaml.
  3. If there's a recurring pattern, file a GitHub issue.

When to escalate vs. DIY

| Situation | Do |
| --- | --- |
| You're confident you know what it is | Handle it, commit the fix |
| You're unsure, but nothing's on fire | Ask in Chat, read the agent docs |
| Customer-facing + you're unsure | Communicate first, then escalate to whoever has the context (dev team for app errors, GitOps person for infra) |
| Multi-system weirdness + time pressure | Delegate to Claude — the k8s-manager / network-expert / db-manager agents know the context and will triage faster than ad-hoc |

Further reading

Internal documentation — Advisable only