Daily checks
A short morning routine. If all four of these are green, the cluster is fine and you can close this page. If any is red, the Incident response runbook has the next steps.
Time budget: 2 minutes.
1. Google Chat — alert channels
Open Google Chat. Look at:
- K8S Alerts (infra/platform alerts)
- Wecare Alerts (tenant-specific alerts)
TIP
Green state = the last message in each space is an older "resolved" ✅ or the standing Watchdog heartbeat. If you see fresh 🚨 CRITICAL or ⚠️ WARNING messages that haven't been followed by a ✅ resolved, that's a live alert.
If either channel is shouting, jump to Incident response and stop here.
2. Grafana — three dashboards
Grafana: grafana-ecnv4-mgmt.ecommercen.com (sign in with Google, @advisable.com).
Open these in order:
a. Wecare Operations
Folder: Wecare, dashboard: Wecare Operations.
Quick scan:
- Traffic graph (requests/sec): is there traffic? Is it normal for the hour of day?
- 5xx error rate: should be near 0.
- Response-time P95: should typically be below 2s.
b. Wecare External Probes (customer-facing)
Folder: Wecare, dashboard: Wecare External Probes.
The "Is wecare up?" stat panel at the top should show UP on all three (www.wecare.gr, api.wecare.gr, docs.wecare.gr). If any shows DOWN, see Incident response.
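If you want a quick second opinion from your own machine, the sketch below requests each of the three hostnames once and prints UP or DOWN. It is not how the dashboard probes work. The function names, the https scheme, the 10-second timeout, and the rule that any 2xx or 3xx response counts as UP are all assumptions made for illustration.

```shell
# Map an HTTP status code to the UP/DOWN wording used on the dashboard.
# Assumption: 2xx and 3xx count as UP; the real probes may be stricter.
probe_state() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) echo UP ;;
    *) echo DOWN ;;
  esac
}

# Request each customer-facing host once and print "<host> UP|DOWN".
# curl reports code 000 when it cannot connect, which maps to DOWN.
probe_wecare() {
  for host in www.wecare.gr api.wecare.gr docs.wecare.gr; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$host/") || true
    echo "$host $(probe_state "$code")"
  done
}

# Usage (run manually): probe_wecare
```

A DOWN from your laptop with UP on the dashboard may just mean a local network problem, so treat the dashboard as the source of truth.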
c. Wecare Redis Health
Folder: Wecare, dashboard: Wecare Redis Health.
Quick scan of the "Overall status" row:
- Redis pods up: should be 4/4.
- Sentinels running: should be 4/4.
- Service endpoints ready: all four (cache-master, cache-replica, session-master, session-replica) should be green 1's.
- Node Ready state: both wecare-cache-* nodes should show green "Ready".
3. ArgoCD — everything synced?
From a terminal, run:
argocd app list
You're looking for two columns: STATUS (should be Synced) and HEALTH (should be Healthy). If anything says OutOfSync or Degraded, investigate, starting with:
argocd app get <app-name>
ArgoCD web UI
Prefer a UI? Open argocd.ecnv4-mgmt.ecommercen.com, sign in with Google. Same information, clickable.
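With many applications, scanning the two columns by eye is easy to get wrong. The sketch below filters the table printed by argocd app list down to the rows that are not Synced and Healthy. The function name argocd_unhealthy is hypothetical, and the filter assumes the default table output with an aligned header row containing STATUS and HEALTH.

```shell
# Print "<name> <status> <health>" for every app that is not Synced + Healthy.
# Reads the default table output of `argocd app list` on stdin.
# Columns are located by the header's character offsets, so an empty
# cell earlier in a row (e.g. NAMESPACE) does not shift the result.
argocd_unhealthy() {
  awk '
    NR == 1 {
      s = index($0, "STATUS")
      h = index($0, "HEALTH")
      if (!s || !h) exit 2   # header not recognised
      next
    }
    NF == 0 { next }
    {
      split(substr($0, s), a, " "); status = a[1]
      split(substr($0, h), b, " "); health = b[1]
      if (status != "Synced" || health != "Healthy")
        print $1, status, health
    }
  '
}

# Usage (run manually): argocd app list | argocd_unhealthy
```

No output means every application is Synced and Healthy. Anything it prints is a candidate for argocd app get <app-name>.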
4. Nodes — are they all Ready?
kubectl get nodes -o wide
No kubeconfig yet?
If kubectl errors out with "connection refused" or "no configuration has been provided", you don't have a working kubeconfig on this machine. See Authenticating to the cluster for how to grab one and set your current-context.
All nodes should show Ready. Any NotReady or SchedulingDisabled needs attention unless you know it's intentional (e.g., you cordoned a node for maintenance).
Common counts to verify:
- 3 master nodes (ecnv4k8s-hel1-1..3)
- 4 wecare-web nodes (wecare-web-1..4)
- 2 wecare-db nodes (wecare-db-1..2)
- 2 wecare-cache nodes (wecare-cache-1..2)
- Sometimes extra autoscaler-* nodes if load has been high; these come and go.
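The status check and the per-group counts can also be done mechanically. This is a minimal sketch, assuming the default kubectl get nodes column order (NAME first, STATUS second). The function names nodes_needing_attention and count_nodes are hypothetical helpers, not part of kubectl.

```shell
# Print "<node> <status>" for every node whose STATUS is not exactly "Ready".
# This catches NotReady as well as Ready,SchedulingDisabled.
# Reads `kubectl get nodes --no-headers` output on stdin.
nodes_needing_attention() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Count the nodes whose name starts with the prefix given as $1.
count_nodes() {
  awk -v p="$1" 'index($1, p) == 1 { n++ } END { print n + 0 }'
}

# Usage (run manually):
#   kubectl get nodes --no-headers | nodes_needing_attention
#   kubectl get nodes --no-headers | count_nodes wecare-web-    # expect 4
#   kubectl get nodes --no-headers | count_nodes wecare-cache-  # expect 2
```

An empty result from nodes_needing_attention means every node is Ready and schedulable. A cordoned node will show up here even when the cordon is intentional.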
Green = done
If all four checks passed, you're done and the cluster is fine. Close this page and go back to your actual work.
Not green? → jump to the right runbook
| What you saw | Go to |
|---|---|
| Chat alert fired | Incident response |
| External probe showing DOWN | Incident response |
| ArgoCD OutOfSync or Degraded | Check app status |
| Node NotReady | Reboot & patch or Incident response |
| Traffic anomaly in Wecare Operations | Incident response |