Daily checks

A short morning routine. If all four of these are green, the cluster is fine and you can close this page. If any is red, the Incident response runbook has the next steps.

Time budget: 2 minutes.

1. Google Chat — alert channels

Open Google Chat. Look at:

K8S Alerts (infra/platform alerts)
Wecare Alerts (tenant-specific alerts)

TIP

Green state = the last message in each space is either an older "resolved" ✅ or the standing Watchdog heartbeat. A fresh 🚨 CRITICAL or ⚠️ WARNING message with no ✅ resolved after it is a live alert.

If either channel is shouting, jump to Incident response and stop here.

2. Grafana — three dashboards

Grafana: grafana-ecnv4-mgmt.ecommercen.com (sign in with Google, @advisable.com).

Open these in order:

a. Wecare Operations

Folder: Wecare, dashboard: Wecare Operations.

Quick scan:

Traffic graph (requests/sec): is there traffic, and is the level normal for this hour of day?
5xx error rate: should be near 0.
Response-time P95: should typically be below 2 s.

b. Wecare External Probes (customer-facing)

Folder: Wecare, dashboard: Wecare External Probes.

The "Is wecare up?" stat panel at the top should show UP on all three (www.wecare.gr, api.wecare.gr, docs.wecare.gr). If any shows DOWN, see Incident response.
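If the dashboard itself is unreachable, a rough cross-check from a terminal is possible. The sketch below is not how the dashboard's probes work; it assumes the three hosts answer over HTTPS at `/` and that any 2xx or 3xx response counts as UP. `probe_state` and `check_hosts` are hypothetical helper names.

```bash
# Hypothetical helper: map an HTTP status code to UP or DOWN.
# Assumption: 2xx and 3xx mean UP; the real probes may apply stricter rules.
probe_state() {
  case "$1" in
    2??|3??) echo UP ;;
    *)       echo DOWN ;;
  esac
}

# Hypothetical helper: request each host once and print its state.
# curl reports 000 when it cannot connect, which maps to DOWN.
check_hosts() {
  for host in "$@"; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$host/") || code=000
    printf '%s %s (HTTP %s)\n' "$host" "$(probe_state "$code")" "$code"
  done
}
```

Usage: `check_hosts www.wecare.gr api.wecare.gr docs.wecare.gr`. Treat the result as a hint only; the Grafana panel remains the reference.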

c. Wecare Redis Health

Folder: Wecare, dashboard: Wecare Redis Health.

Quickly scan the "Overall status" row:

Redis pods up: should be 4/4.
Sentinels running: should be 4/4.
Service endpoints ready: all four (cache-master, cache-replica, session-master, session-replica) should show a green 1.
Node Ready state: both wecare-cache-* nodes should show a green "Ready".

3. ArgoCD — everything synced?

From a terminal, run:

```bash
argocd app list
```

You're looking for two columns: STATUS (should be Synced) and HEALTH (should be Healthy). If anything says OutOfSync or Degraded, investigate — start with:

```bash
argocd app get <app-name>
```
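With many apps, scanning the two columns by eye gets tedious. The following is a minimal sketch of a filter that prints only the rows that are not green. `argocd_not_green` is a hypothetical helper name, and the sketch assumes the default table output has a header row containing STATUS and HEALTH and no empty cells to the left of those columns.

```bash
# Hypothetical helper: read the `argocd app list` table on stdin and print
# every app whose STATUS is not Synced or whose HEALTH is not Healthy.
# Column positions are taken from the header row rather than hard-coded.
# Exit status is 1 if anything was printed, 0 otherwise.
argocd_not_green() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "STATUS") s = i
        if ($i == "HEALTH") h = i
      }
      next
    }
    s && h && ($s != "Synced" || $h != "Healthy") {
      print $1, $s, $h
      bad = 1
    }
    END { exit (bad + 0) }
  '
}
```

Usage: `argocd app list | argocd_not_green`. No output means every app is Synced and Healthy.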

ArgoCD web UI

Prefer a UI? Open argocd.ecnv4-mgmt.ecommercen.com and sign in with Google. It shows the same information in clickable form.

4. Nodes — are they all Ready?

```bash
kubectl get nodes -o wide
```

No kubeconfig yet?

If kubectl errors out with "connection refused" or "no configuration has been provided", you don't have a working kubeconfig on this machine. See Authenticating to the cluster for how to grab one and set your current-context.

All nodes should show Ready. Any NotReady or SchedulingDisabled needs attention unless you know it's intentional (e.g., you cordoned a node for maintenance).
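To surface only the nodes that need a look, the STATUS column can be filtered. This is a small sketch, assuming the usual `kubectl get nodes --no-headers` layout with the node name in the first column and the status in the second; `nodes_not_ready` is a hypothetical helper name.

```bash
# Hypothetical helper: read `kubectl get nodes --no-headers` on stdin and
# print every node whose STATUS is anything other than exactly "Ready".
# This also catches combined states such as "Ready,SchedulingDisabled".
# Exit status is 1 if anything was printed, 0 otherwise.
nodes_not_ready() {
  awk '$2 != "Ready" { print $1, $2; bad = 1 } END { exit (bad + 0) }'
}
```

Usage: `kubectl get nodes --no-headers | nodes_not_ready`. An empty result means every node is Ready and schedulable.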

Common counts to verify:

3 master nodes (ecnv4k8s-hel1-1..3)
4 wecare-web nodes (wecare-web-1..4)
2 wecare-db nodes (wecare-db-1..2)
2 wecare-cache nodes (wecare-cache-1..2)
Extra autoscaler-* nodes may also appear when load has been high; these come and go.
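The counts above can be checked mechanically. The sketch below takes the expected numbers from the list and infers the name prefixes from the example node names, so adjust both if the naming differs. `node_count_check` is a hypothetical helper name; autoscaler-* nodes are ignored because their number varies.

```bash
# Hypothetical helper: read `kubectl get nodes --no-headers` on stdin and
# compare the number of nodes per name prefix with the expected count.
# Prints one line per mismatch; exit status is 1 on any mismatch.
node_count_check() {
  awk '
    BEGIN {
      # Expected counts from the runbook; prefixes are an assumption.
      want["ecnv4k8s-hel1-"] = 3
      want["wecare-web-"]    = 4
      want["wecare-db-"]     = 2
      want["wecare-cache-"]  = 2
    }
    {
      # Count the node under every prefix its name starts with.
      for (p in want) if (index($1, p) == 1) have[p]++
    }
    END {
      for (p in want) {
        n = have[p] + 0
        if (n != want[p]) {
          printf "%s* expected %d, found %d\n", p, want[p], n
          bad = 1
        }
      }
      exit (bad + 0)
    }
  '
}
```

Usage: `kubectl get nodes --no-headers | node_count_check`. No output means all four groups have the expected size.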

Green = done

If all four checks passed, you're done. The cluster is fine. Close this page and get back to your actual work.

Not green? → jump to the right runbook

| What you saw | Go to |
| --- | --- |
| Chat alert fired | Incident response |
| External probe showing DOWN | Incident response |
| ArgoCD `OutOfSync` or `Degraded` | Check app status |
| Node NotReady | Reboot & patch or Incident response |
| Traffic anomaly in Wecare Operations | Incident response |
