Observability

Three questions you'll always want to answer:

  1. What's happening right now? — Metrics (Prometheus + Grafana dashboards)
  2. What happened in the last few minutes? — Logs (Loki in Grafana's Explore view)
  3. Who do I need to wake up, and when? — Alerts (Alertmanager → Google Chat)

Each question has one tool, and all of them are integrated through Grafana.

The stack

| Tool | What it does | Where |
| --- | --- | --- |
| Prometheus | Pulls metrics from ~everything every 30s; stores them as time-series | observability namespace |
| Grafana | Renders dashboards and is the search entry point for Loki + Prometheus | grafana-ecnv4-mgmt.ecommercen.com |
| Loki | Log aggregation; think "Prometheus for logs" | observability namespace |
| Alertmanager | Receives fired alerts from Prometheus, routes + groups + notifies | observability namespace |
| Google Chat bridge | Translates Alertmanager webhook payloads → Google Chat messages | observability namespace, a tiny in-cluster Python service |
| Blackbox exporter | Probes external URLs (www.wecare.gr, etc) to answer "is it up from outside?" | observability namespace |

Everything is deployed via the app-kube-prometheus-stack ArgoCD app (plus a couple of sibling apps for Loki + blackbox).

How Prometheus knows what to scrape

Three Custom Resources the Prometheus operator watches:

  • ServiceMonitor — "scrape the pods behind this Service". Used for most stateful services (MariaDB, PostgreSQL, most operators).
  • PodMonitor — "scrape these pods directly". Used when there's no Service (KEDA operator, some sidecars).
  • Probe — "hit this external URL via blackbox-exporter and record up/down + latency". Used for www.wecare.gr and friends.

Each one carries a release: app-kube-prometheus-stack label so Prometheus picks it up. If you add something new to monitor, choose the right CR kind and include that label.
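
As a sketch, a ServiceMonitor for a hypothetical tenant service might look like this (the name, namespace, and port are illustrative; the release label is the part Prometheus keys on):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-mariadb                  # hypothetical name
  namespace: observability
  labels:
    release: app-kube-prometheus-stack   # required so Prometheus discovers it
spec:
  selector:
    matchLabels:
      app: example-mariadb               # must match the target Service's labels
  endpoints:
    - port: metrics                      # named port on the Service
      interval: 30s                      # matches the global scrape interval
```

A PodMonitor looks almost identical but selects pod labels directly instead of a Service.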

Where the alert rules live

manifests_v1/app-constructs/kube-prometheus-stack/setup/:

  • prometheusrule-platform.yaml — cluster-wide rules (node Ready, pods stuck Terminating, KEDA operator down, etc).
  • prometheusrule-wecare.yaml — wecare-tenant rules (Redis down, MariaDB replication lag, Traefik 5xx rate, external probe down, FPM pool saturation).

Each rule has:

  • An expr — PromQL that evaluates to non-empty when the alert should fire.
  • A for — how long the condition must hold before firing (dampens spikes).
  • labels.severity — critical | warning | info.
  • labels.namespace — drives where in Google Chat the alert lands (see next section).
  • annotations.summary + description — the human-readable message.
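
Put together, a minimal rule with all four parts might look like this (the alert name, expression, and thresholds are illustrative, not copied from the real files):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  labels:
    release: app-kube-prometheus-stack
spec:
  groups:
    - name: example
      rules:
        - alert: RedisDown
          expr: redis_up{namespace=~"ecommercen-wecare-.*"} == 0
          for: 5m                 # condition must hold 5 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: "Redis is down in {{ $labels.namespace }}"
            description: "redis_up has been 0 for 5 minutes."
```

Here the namespace label comes from the metric itself, which is what the Alertmanager routing below matches on.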

Routing: Google Chat

Alertmanager groups alerts and sends them to the in-cluster gchat-bridge service, which translates and forwards to Google Chat webhooks.

Chat spaces are cluster-scoped, not tenant-scoped. Different webhooks inside one space answer the question "whose alert is this?". Today there is a single Chat space — currently called K8S Alerts — with multiple incoming webhooks configured inside it. Each webhook has its own bot display name, so alerts from different tenants land side-by-side in the same feed but are easy to tell apart by author:

  • a wecare webhook posts under a wecare-branded bot name, for anything matching the wecare namespace regex
  • a system webhook posts under a platform-branded bot name, for everything else (nodes, etcd, cluster-scoped resources, other tenants)

The gchat-bridge Python service is what makes this work. Alertmanager's webhook_configs URL ends in a ?target=wecare or ?target=system query param; the bridge reads it, picks the matching webhook URL from the alertmanager-gchat-webhook sealed secret (keys wecare and system), and forwards the rendered card there. The Alertmanager routing rule is still a plain namespace regex — what changed is that each receiver's destination is a webhook inside the shared space, not a separate space of its own.
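
A minimal sketch of that selection logic, assuming the secret's two keys are surfaced to the bridge as environment variables (the variable names and the fall-back-to-system behaviour are assumptions, not the real service):

```python
import os
from urllib.parse import parse_qs, urlparse

# Assumed mapping: sealed-secret keys "wecare"/"system" exposed as env vars.
WEBHOOKS = {
    "wecare": os.environ.get(
        "GCHAT_WEBHOOK_WECARE",
        "https://chat.googleapis.com/v1/spaces/EXAMPLE/messages?key=wecare",
    ),
    "system": os.environ.get(
        "GCHAT_WEBHOOK_SYSTEM",
        "https://chat.googleapis.com/v1/spaces/EXAMPLE/messages?key=system",
    ),
}

def pick_webhook(request_url: str) -> str:
    """Choose the Chat webhook URL from the ?target= query param.

    Unknown or missing targets fall back to "system" so no alert is dropped.
    """
    params = parse_qs(urlparse(request_url).query)
    target = params.get("target", ["system"])[0]
    return WEBHOOKS.get(target, WEBHOOKS["system"])

# Alertmanager posts to e.g. http://gchat-bridge/alert?target=wecare
print(pick_webhook("http://gchat-bridge/alert?target=wecare"))
```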

Routing logic is a simple namespace-regex in values.yaml:

```yaml
route:
  receiver: gchat-system           # default — ?target=system
  routes:
    - matchers:
        - namespace =~ "ecommercen-clients-wecare-.*|ecommercen-wecare-.*"
      receiver: gchat-wecare       # ?target=wecare

Adding a new tenant = extend the regex (or add a new route) + create a new incoming webhook inside the same Chat space + store its URL under a new key in the alertmanager-gchat-webhook sealed secret + re-seal + teach the bridge about the new ?target= value.
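
For example, onboarding a hypothetical acme tenant would extend the routing block roughly like this (the tenant name, receiver, and namespace pattern are illustrative):

```yaml
route:
  receiver: gchat-system           # default — ?target=system
  routes:
    - matchers:
        - namespace =~ "ecommercen-clients-wecare-.*|ecommercen-wecare-.*"
      receiver: gchat-wecare       # ?target=wecare
    - matchers:
        - namespace =~ "ecommercen-clients-acme-.*"   # hypothetical new tenant
      receiver: gchat-acme         # ?target=acme — bridge must know this key
```

The sealed secret would also need a matching acme key holding the new webhook URL.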

Later: a second Chat space for Critical alerts (cross-tenant, severity=critical) is planned but not yet built. The idea is that anything page-worthy — regardless of tenant — mirrors into a dedicated space that on-call watches, while the tenant-flavoured feed in K8S Alerts stays as the full-fidelity stream. When it lands it'll be another webhook target in the bridge, nothing more exotic than that.

Where dashboards live

Grafana dashboards are committed to the repo as ConfigMap resources with the label grafana_dashboard: "1". A Grafana sidecar watches for them and provisions them into the UI automatically — nothing saved through Grafana's own "Save" button persists (the instance is ephemeral).

Location: manifests_v1/app-constructs/kube-prometheus-stack/setup/dashboard-*.yaml.
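
The shape of such a ConfigMap is roughly this (the name and filename are illustrative, and the JSON body is truncated):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-wecare-operations     # illustrative name
  labels:
    grafana_dashboard: "1"              # the sidecar watches for this label
data:
  wecare-operations.json: |
    { "title": "Wecare Operations", "panels": [] }
```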

Relevant "Wecare" folder dashboards:

  • Wecare Operations — request rate, 5xx, response time overview
  • Wecare External Probes — "is wecare up?" per URL
  • Wecare Redis Health — cache + session metrics
  • Wecare Database — MariaDB + MaxScale
  • Wecare App Health — PHP-FPM pool saturation + nginx connection states

A quick "where's X?" map

| I want to see | Go to |
| --- | --- |
| Is the site up? | Wecare External Probes |
| What's the request rate right now? | Wecare Operations dashboard |
| Why is a pod unhappy? | Explore → Loki, query {namespace="ecommercen-clients-wecare"} |
| Did we get any alerts last night? | Google Chat → K8S Alerts (scroll up; the bot author per message shows which tenant) |
| Is the Prometheus I'm reading from healthy? | kubectl -n observability get pods, or Alertmanager's Watchdog alert |

Further reading

Internal documentation — Advisable only