Observability

Three questions you'll always want to answer:

  1. What's happening right now? — Metrics (Prometheus + Grafana dashboards)
  2. What happened in the last few minutes? — Logs (Loki in Grafana's Explore view)
  3. Who do I need to wake up, and when? — Alerts (Alertmanager → Google Chat)

Each question has one tool, and all of them are integrated through Grafana.

The stack

| Tool | What it does | Where |
| --- | --- | --- |
| Prometheus | Pulls metrics from ~everything every 30s; stores them as time-series | observability namespace |
| Grafana | Renders dashboards and is the search entry point for Loki + Prometheus | grafana-ecnv4-mgmt.ecommercen.com |
| Loki | Log aggregation; think "Prometheus for logs" | observability namespace |
| Alertmanager | Receives fired alerts from Prometheus, routes + groups + notifies | observability namespace |
| Google Chat bridge | Translates Alertmanager webhook payloads → Google Chat messages | observability namespace, a tiny in-cluster Python service |
| Blackbox exporter | Probes external URLs (www.wecare.gr, etc) to answer "is it up from outside?" | observability namespace |

Everything is deployed via the app-kube-prometheus-stack ArgoCD app (plus a couple of sibling apps for Loki + blackbox).

How Prometheus knows what to scrape

Three Custom Resources the Prometheus operator watches:

  • ServiceMonitor — "scrape the pods behind this Service". Used for most stateful services (MariaDB, PostgreSQL, most operators).
  • PodMonitor — "scrape these pods directly". Used when there's no Service (KEDA operator, some sidecars).
  • Probe — "hit this external URL via blackbox-exporter and record up/down + latency". Used for www.wecare.gr and friends.

Each one carries a release: app-kube-prometheus-stack label so Prometheus picks it up. If you add something new to monitor, choose the right CR kind and include that label.
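
As a sketch, a ServiceMonitor for a hypothetical tenant service might look like this (the name, namespace, and port are illustrative; the release label is the part Prometheus keys on):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-mariadb                  # hypothetical name
  namespace: observability
  labels:
    release: app-kube-prometheus-stack   # required so Prometheus discovers it
spec:
  selector:
    matchLabels:
      app: example-mariadb               # must match the target Service's labels
  endpoints:
    - port: metrics                      # named port on the Service
      interval: 30s                      # matches the global scrape interval
```

A PodMonitor looks almost identical but selects pod labels directly instead of a Service.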

Where the alert rules live

manifests_v1/app-constructs/kube-prometheus-stack/setup/:

  • prometheusrule-platform.yaml — cluster-wide rules (node Ready, pods stuck Terminating, KEDA operator down, etc).
  • prometheusrule-wecare.yaml — wecare-tenant rules (Redis down, MariaDB replication lag, Traefik 5xx rate, external probe down, FPM pool saturation).

Each rule has:

  • An expr — PromQL that evaluates to non-empty when the alert should fire.
  • A for — how long the condition must hold before firing (dampens spikes).
  • labels.severity — critical | warning | info.
  • labels.namespace — drives where in Google Chat the alert lands (see next section).
  • annotations.summary + description — the human-readable message.
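
Put together, a minimal rule with all four parts might look like this (the alert name, expression, and thresholds are illustrative, not copied from the real files):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  labels:
    release: app-kube-prometheus-stack
spec:
  groups:
    - name: example
      rules:
        - alert: RedisDown
          expr: redis_up{namespace=~"ecommercen-wecare-.*"} == 0
          for: 5m                 # condition must hold 5 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: "Redis is down in {{ $labels.namespace }}"
            description: "redis_up has been 0 for 5 minutes."
```

Here the namespace label comes from the metric itself, which is what the Alertmanager routing below matches on.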

Routing: Google Chat

Alertmanager groups alerts and sends them to the in-cluster gchat-bridge service, which translates and forwards to Google Chat webhooks.

Chat spaces are cluster-scoped, not tenant-scoped. Different webhooks inside one space answer the question "whose alert is this?". Today there is a single Chat space — currently called K8S Alerts — with multiple incoming webhooks configured inside it. Each webhook has its own bot display name, so alerts from different tenants land side-by-side in the same feed but are easy to tell apart by author:

  • a wecare webhook posts under a wecare-branded bot name, for anything matching the wecare namespace regex
  • a system webhook posts under a platform-branded bot name, for everything else (nodes, etcd, cluster-scoped resources, other tenants)

The gchat-bridge Python service is what makes this work. Alertmanager's webhook_configs URL ends in a ?target=wecare or ?target=system query param; the bridge reads it, picks the matching webhook URL from the alertmanager-gchat-webhook sealed secret (keys wecare and system), and forwards the rendered card there. The Alertmanager routing rule is still a plain namespace regex — what changed is that each receiver's destination is a webhook inside the shared space, not a separate space of its own.
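
A minimal sketch of that selection logic, assuming the secret's two keys are surfaced to the bridge as environment variables (the variable names and the fall-back-to-system behaviour are assumptions, not the real service):

```python
import os
from urllib.parse import parse_qs, urlparse

# Assumed mapping: sealed-secret keys "wecare"/"system" exposed as env vars.
WEBHOOKS = {
    "wecare": os.environ.get(
        "GCHAT_WEBHOOK_WECARE",
        "https://chat.googleapis.com/v1/spaces/EXAMPLE/messages?key=wecare",
    ),
    "system": os.environ.get(
        "GCHAT_WEBHOOK_SYSTEM",
        "https://chat.googleapis.com/v1/spaces/EXAMPLE/messages?key=system",
    ),
}

def pick_webhook(request_url: str) -> str:
    """Choose the Chat webhook URL from the ?target= query param.

    Unknown or missing targets fall back to "system" so no alert is dropped.
    """
    params = parse_qs(urlparse(request_url).query)
    target = params.get("target", ["system"])[0]
    return WEBHOOKS.get(target, WEBHOOKS["system"])

# Alertmanager posts to e.g. http://gchat-bridge/alert?target=wecare
print(pick_webhook("http://gchat-bridge/alert?target=wecare"))
```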

Routing logic is a simple namespace-regex in values.yaml:

```yaml
route:
  receiver: gchat-system           # default — ?target=system
  routes:
    - matchers:
        - namespace =~ "ecommercen-clients-wecare-.*|ecommercen-wecare-.*"
      receiver: gchat-wecare       # ?target=wecare

Adding a new tenant = extend the regex (or add a new route) + create a new incoming webhook inside the same Chat space + store its URL under a new key in the alertmanager-gchat-webhook sealed secret + re-seal + teach the bridge about the new ?target= value.
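
For example, onboarding a hypothetical acme tenant would extend the routing block roughly like this (the tenant name, receiver, and namespace pattern are illustrative):

```yaml
route:
  receiver: gchat-system           # default — ?target=system
  routes:
    - matchers:
        - namespace =~ "ecommercen-clients-wecare-.*|ecommercen-wecare-.*"
      receiver: gchat-wecare       # ?target=wecare
    - matchers:
        - namespace =~ "ecommercen-clients-acme-.*"   # hypothetical new tenant
      receiver: gchat-acme         # ?target=acme — bridge must know this key
```

The sealed secret would also need a matching acme key holding the new webhook URL.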

Later: a second Chat space for Critical alerts (cross-tenant, severity=critical) is planned but not yet built. The idea is that anything page-worthy — regardless of tenant — mirrors into a dedicated space that on-call watches, while the tenant-flavoured feed in K8S Alerts stays as the full-fidelity stream. When it lands it'll be another webhook target in the bridge, nothing more exotic than that.

Where dashboards live

Grafana dashboards are committed to the repo as ConfigMap resources with the label grafana_dashboard: "1". A Grafana sidecar watches for them and provisions them into the UI automatically — nothing saved through Grafana's own "Save" button persists (the instance is ephemeral).

Location: manifests_v1/app-constructs/kube-prometheus-stack/setup/dashboard-*.yaml.
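
The shape of such a ConfigMap is roughly this (the name and filename are illustrative, and the JSON body is truncated):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-wecare-operations     # illustrative name
  labels:
    grafana_dashboard: "1"              # the sidecar watches for this label
data:
  wecare-operations.json: |
    { "title": "Wecare Operations", "panels": [] }
```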

Relevant "Wecare" folder dashboards:

  • Wecare Operations — request rate, 5xx, response time overview
  • Wecare External Probes — "is wecare up?" per URL
  • Wecare Redis Health — cache + session metrics
  • Wecare Database — MariaDB + MaxScale
  • Wecare App Health — PHP-FPM pool saturation + nginx connection states

A quick "where's X?" map

| I want to see | Go to |
| --- | --- |
| Is the site up? | Wecare External Probes |
| What's the request rate right now? | Wecare Operations dashboard |
| Why is a pod unhappy? | Explore → Loki, query {namespace="ecommercen-clients-wecare"} |
| Did we get any alerts last night? | Google Chat → K8S Alerts (scroll up; the bot author per message shows which tenant) |
| Is the Prometheus I'm reading from healthy? | kubectl -n observability get pods, or Alertmanager's Watchdog alert |

Further reading

Internal documentation — Advisable only