
Daily checks

A short morning routine. If all four of these are green, the cluster is fine and you can close this page. If any is red, the Incident response runbook has the next steps.

Time budget: 2 minutes.

1. Google Chat — alert channels

Open Google Chat. Look at:

  • K8S Alerts (infra/platform alerts)
  • Wecare Alerts (tenant-specific alerts)

TIP

Green state = the last message in each space is an older "resolved" ✅ or the standing Watchdog heartbeat. If you see fresh 🚨 CRITICAL or ⚠️ WARNING messages that haven't been followed by a ✅ resolved, that's a live alert.

If either channel is shouting, jump to Incident response and stop here.

2. Grafana — three dashboards

Grafana: grafana-ecnv4-mgmt.ecommercen.com (sign in with your @advisable.com Google account).

Open these in order:

a. Wecare Operations

Folder: Wecare, dashboard: Wecare Operations.

Quick scan:

  • Traffic graph (requests/sec) — is there traffic? Is it normal for the hour of day?
  • 5xx error rate — should be near 0.
  • Response-time P95 — should be below 2s typically.

b. Wecare External Probes (customer-facing)

Folder: Wecare, dashboard: Wecare External Probes.

The "Is wecare up?" stat panel at the top should show UP on all three (www.wecare.gr, api.wecare.gr, docs.wecare.gr). If any shows DOWN, see Incident response.
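If the dashboard itself is unreachable, a rough command-line cross-check of the same three hostnames might look like the sketch below. It assumes the sites answer over HTTPS on `/` and that any 2xx or 3xx response counts as up; the dashboard's probes may use different paths or criteria, and the function names `classify_code` and `probe_host` are illustrative only.

```shell
#!/usr/bin/env bash
# Rough cross-check for the three customer-facing hosts.
# Assumptions (not taken from the dashboard): HTTPS on "/", and 2xx/3xx means UP.

# Map an HTTP status code to UP or DOWN.
classify_code() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) echo "UP" ;;
    *) echo "DOWN" ;;
  esac
}

# Probe one host. curl reports 000 when the connection fails or times out.
probe_host() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/")
  echo "$1 $(classify_code "$code") ($code)"
}

# Usage:
#   for h in www.wecare.gr api.wecare.gr docs.wecare.gr; do probe_host "$h"; done
```

A DOWN from this sketch is only a hint: a host that legitimately returns 404 or 401 on `/` would be reported as DOWN here while the real probe still shows UP.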

c. Wecare Redis Health

Folder: Wecare, dashboard: Wecare Redis Health.

Quickly scan the "Overall status" row:

  • Redis pods up: should be 4/4.
  • Sentinels running: should be 4/4.
  • Service endpoints ready: all four (cache-master, cache-replica, session-master, session-replica) should be green 1's.
  • Node Ready state: both wecare-cache-* should show green "Ready".

3. ArgoCD — everything synced?

From a terminal, run:

bash
argocd app list

You're looking for two columns: STATUS (should be Synced) and HEALTH (should be Healthy). If anything says OutOfSync or Degraded, investigate — start with:

bash
argocd app get <app-name>
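With many apps, scanning the two columns by eye gets tedious. The sketch below filters the default `argocd app list` table down to rows that are not Synced and Healthy. It finds the STATUS and HEALTH columns by header name; the function name `unhealthy_apps` is illustrative, and the approach assumes no column to the left of HEALTH is blank in any row.

```shell
# Print apps whose STATUS is not Synced or whose HEALTH is not Healthy.
# Reads the `argocd app list` table on stdin and locates columns by header name.
unhealthy_apps() {
  awk '
    NR == 1 {
      # Header row: remember which field numbers hold STATUS and HEALTH.
      for (i = 1; i <= NF; i++) {
        if ($i == "STATUS") s = i
        if ($i == "HEALTH") h = i
      }
      next
    }
    # Data rows: print name, sync status, and health for anything not green.
    s && h && ($s != "Synced" || $h != "Healthy") { print $1, $s, $h }
  '
}

# Usage:
#   argocd app list | unhealthy_apps
```

No output means every app is Synced and Healthy. States such as Progressing or Missing are also printed, since they are not Healthy.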

ArgoCD web UI

Prefer a UI? Open argocd.ecnv4-mgmt.ecommercen.com, sign in with Google. Same information, clickable.

4. Nodes — are they all Ready?

bash
kubectl get nodes -o wide

No kubeconfig yet?

If kubectl errors out with "connection refused" or "no configuration has been provided", you don't have a working kubeconfig on this machine. See Authenticating to the cluster for how to grab one and set your current-context.

All nodes should show Ready. Any NotReady or SchedulingDisabled needs attention unless you know it's intentional (e.g., you cordoned a node for maintenance).
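To show only the nodes that need a look, one option is to filter on the STATUS column. This is a minimal sketch; the function name `not_ready_nodes` is illustrative, and it expects the `--no-headers` form of the output, where column 1 is the node name and column 2 is the status.

```shell
# List nodes whose STATUS is anything other than plain "Ready".
# Reads `kubectl get nodes --no-headers` on stdin.
# A cordoned node shows up as "Ready,SchedulingDisabled", so it is listed too.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready with scheduling enabled.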

Common counts to verify:

  • 3 master nodes (ecnv4k8s-hel1-1..3)
  • 4 wecare-web nodes (wecare-web-1..4)
  • 2 wecare-db nodes (wecare-db-1..2)
  • 2 wecare-cache nodes (wecare-cache-1..2)
  • Sometimes extra autoscaler-* nodes if load has been high — these come and go.
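The counts above can also be checked mechanically. The sketch below counts nodes per name prefix and compares each count to an expected value; the function name `check_node_counts` and the `prefix=count` argument format are assumptions made for this example, not an existing tool.

```shell
# Compare the number of nodes per name prefix against expected counts.
# Reads `kubectl get nodes --no-headers` on stdin.
# Arguments are prefix=count pairs, e.g. wecare-web-=4
check_node_counts() {
  nodes=$(awk '{ print $1 }')   # keep only the node names
  rc=0
  for pair in "$@"; do
    prefix=${pair%%=*}
    want=${pair##*=}
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    have=$(printf '%s\n' "$nodes" | grep -c -- "^$prefix" || true)
    if [ "$have" -eq "$want" ]; then
      echo "OK   $prefix $have/$want"
    else
      echo "DIFF $prefix $have/$want"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   kubectl get nodes --no-headers | check_node_counts \
#     ecnv4k8s-hel1-=3 wecare-web-=4 wecare-db-=2 wecare-cache-=2
```

The autoscaler-* nodes are deliberately left out of the usage example, since their number varies with load.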

Green = done

If all four checks passed, you're done. The cluster is fine. Close this page and get on with your actual work.

Not green? → jump to the right runbook

| What you saw | Go to |
| --- | --- |
| Chat alert fired | Incident response |
| External probe showing DOWN | Incident response |
| ArgoCD OutOfSync or Degraded | Check app status |
| Node NotReady | Reboot & patch or Incident response |
| Traffic anomaly in Wecare Operations | Incident response |

Internal documentation — Advisable only