Infrastructure inventory

A snapshot of every moving piece the cluster is made of — servers, load balancer, network, object storage, external services. This page is the single answer to "what's in our cluster?". It's kept in sync with terraform/main.tf and the related app-construct values files; if you change infrastructure, update this page.

Last-audited: 2026-04-18.

Terraform is the source of truth

Everything under "Nodes" is declared in terraform/main.tf and applied via ./tf.sh. Delegate any node add/remove to the terraform-manager agent rather than editing the file directly. This page is a human-readable projection; terraform wins if they disagree.

Topology at a glance

Why the LB targets the masters

Traefik is configured with a toleration for node-role.kubernetes.io/control-plane:NoSchedule and priorityClassName: system-cluster-critical, and has no tolerations for the tenant-pool taints (wecare, wecare-mariadb, wecare-cache), so the only nodes it can schedule on are the three control-plane masters. The Hetzner LB therefore forwards ports 80/443/3306 to the masters' NodePorts, not to the wecare-web nodes; the Traefik pods on the masters then route in-cluster via app-svc to the app-web pods on the wecare-web pool. The masters are doing two jobs: Kubernetes control plane and ingress.
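The effective scheduling config can be sketched as a fragment of the Traefik Helm values (a sketch, not the actual values file; key names follow the upstream Traefik chart):

```yaml
# Tolerate only the control-plane taint; the tenant-pool taints
# (wecare, wecare-mariadb, wecare-cache) are NOT tolerated, so the
# three masters are the only schedulable nodes.
priorityClassName: system-cluster-critical
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```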

Nodes

All nodes are in Hetzner's hel1 location (Helsinki, Finland), Ubuntu 24.04, joined to RKE2 v1.31.2+rke2r1 via cloud-init, attached to the network-k8s private network and the firewall-k8s firewall.

Control plane (masters)

3 servers running the API server, scheduler, controller-manager, and embedded etcd. Shared-CPU Hetzner tier.

| Name | Type | vCPU/RAM | Private IP | Public IP | Role |
| --- | --- | --- | --- | --- | --- |
| ecnv4k8s-hel1-1 | cx42 | 8 / 16 GiB | 10.1.0.4 | 135.181.32.184 | primary (Cilium k8sServiceHost) |
| ecnv4k8s-hel1-2 | cx42 | 8 / 16 GiB | 10.1.0.2 | 95.217.177.124 | secondary |
| ecnv4k8s-hel1-3 | cx42 | 8 / 16 GiB | 10.1.0.3 | 65.109.9.80 | secondary |

Lifecycle-ignored fields: labels, network, user_data. Terraform ignores drift in these fields, so changing them requires a full re-provision. The RKE2 join token is loaded into var.rke2_token by tf.sh from Vaultwarden (item ecnv4 - rke2 token).
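In Terraform this is expressed with a lifecycle block; an illustrative sketch (the real resource definition lives in terraform/main.tf):

```hcl
resource "hcloud_server" "master" {
  # ...name, server_type, location, user_data, etc.

  lifecycle {
    # Terraform ignores drift in these fields, so changes here require
    # a full node re-provision rather than an in-place update.
    ignore_changes = [labels, network, user_data]
  }
}
```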

Wecare MariaDB pool (wecare-db)

Dedicated AMD EPYC servers reserved for the MariaDB primary+replica and MaxScale. Data lives on the node's local NVMe via the local-path StorageClass.

| Name | Type | vCPU/RAM | Private IP |
| --- | --- | --- | --- |
| wecare-db-1 | ccx33 | 8 dedicated / 32 GiB | 10.1.0.7 |
| wecare-db-2 | ccx33 | 8 dedicated / 32 GiB | 10.1.0.6 |

Labels: node-role.kubernetes.io/db="", wecare-mariadb="". Taint: wecare-mariadb:NoSchedule.

Only pods with a matching toleration (MariaDB, MaxScale, redis-session, and their backup jobs) land here.
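A minimal sketch of the scheduling stanza such a pod spec carries (standard Kubernetes fields; values taken from the labels and taint above):

```yaml
nodeSelector:
  wecare-mariadb: ""          # pin to the DB pool's label
tolerations:
  - key: wecare-mariadb       # tolerate the pool's NoSchedule taint
    operator: Exists
    effect: NoSchedule
```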

Wecare Redis cache pool (wecare-cache)

Dedicated AMD EPYC servers for Redis cache and Sentinel. Less RAM than the DB pool: the cache is fully in-memory, but its footprint is smaller than the database's hot working set.

| Name | Type | vCPU/RAM | Private IP |
| --- | --- | --- | --- |
| wecare-cache-1 | ccx23 | 4 dedicated / 16 GiB | 10.1.0.12 |
| wecare-cache-2 | ccx23 | 4 dedicated / 16 GiB | 10.1.0.13 |

Labels: node-role.kubernetes.io/cache="", wecare-cache="". Taint: wecare-cache:NoSchedule.

Note: redis-cache lands here; redis-session actually lands on the MariaDB pool (see Databases for the rationale).

Wecare web pool (wecare)

Dedicated AMD EPYC servers running nginx + php-fpm (the app-web Deployment). Four static nodes today; the cluster-autoscaler provisions additional CCX13 nodes dynamically up to 12 total when KEDA drives the pod count up.

| Name | Type | vCPU/RAM | Private IP |
| --- | --- | --- | --- |
| wecare-web-1 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.8 |
| wecare-web-2 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.9 |
| wecare-web-3 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.10 |
| wecare-web-4 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.11 |

Labels: node-role.kubernetes.io/app="", wecare="", topology.kubernetes.io/region=hel1, topology.kubernetes.io/zone=hel1-dc2. Taint: wecare:NoSchedule.

Cluster-autoscaler pool (dynamic wecare-web)

Ephemeral CCX13 nodes provisioned by the cluster-autoscaler when workloads can't fit on the static four. Transition state today — autoscaler coexists with the static nodes (Phase 3); Phase 4 plan is to drop the static four in favor of fully autoscaled capacity.

| Property | Value |
| --- | --- |
| Autoscaling group name | wecare-web |
| minSize | 0 (Phase 3 testing; will move to 2 in Phase 4) |
| maxSize | 12 |
| instanceType | CCX13 |
| region | hel1 |
| Scale-down threshold | 50% utilization, unneeded 10m, delay-after-add 10m |
| max-node-provision-time | 45m |

The autoscaler applies the same wecare="" label and wecare:NoSchedule taint as the Terraform-managed webs via its HCLOUD_CLUSTER_CONFIG secret (built by untracked/secrets/build-autoscaler-config.sh — requires Bitwarden unlocked). Config source: manifests_v1/app-constructs/cluster-autoscaler/values.yaml.
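The shape of that secret is roughly the following JSON (an assumed sketch of the hcloud autoscaler's cluster-config format; the authoritative version is produced by build-autoscaler-config.sh, and the cloud-init payload is omitted here):

```json
{
  "nodeConfigs": {
    "wecare-web": {
      "labels": { "wecare": "" },
      "taints": [
        { "key": "wecare", "value": "", "effect": "NoSchedule" }
      ]
    }
  }
}
```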

Network

Private network

| Property | Value |
| --- | --- |
| Hetzner network name | network-k8s |
| IP range | 10.1.0.0/16 |
| Subnet | 10.1.0.0/24 (eu-central cloud zone) |
| Gateway | 10.1.0.1 |
| Private interface on nodes | enp7s0 |

Kubernetes CIDRs

| CIDR | Purpose |
| --- | --- |
| 10.42.0.0/16 | Pod network (Cilium clusterCIDR) |
| 10.43.0.0/16 | ClusterIP / Service range |
| 10.1.0.0/24 | Node IPs (Hetzner private) |

See Cluster networking for how pods talk to each other inside this space.

Firewall (firewall-k8s)

Applied to every node. Only four inbound rules — everything else relies on the Hetzner private network being unreachable from the outside.

| Direction | Protocol | Port | Source | Purpose |
| --- | --- | --- | --- | --- |
| in | tcp | 22 | 0.0.0.0/0, ::/0 | SSH |
| in | tcp | 6443 | 0.0.0.0/0, ::/0 | kube-apiserver |
| in | tcp | 30000-32767 | 0.0.0.0/0, ::/0 | Kubernetes NodePort (TCP) |
| in | udp | 30000-32767 | 0.0.0.0/0, ::/0 | Kubernetes NodePort (UDP) |
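In Terraform the rules take roughly this shape (illustrative; the real definition lives in terraform/main.tf):

```hcl
resource "hcloud_firewall" "k8s" {
  name = "firewall-k8s"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "6443"                  # kube-apiserver
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  # ...plus rules for 22/tcp (SSH) and 30000-32767 (NodePort, tcp + udp)
}
```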

Load balancer

One Hetzner Cloud Load Balancer provisioned by HCCM (Hetzner Cloud Controller Manager) via the Traefik Service type=LoadBalancer:

| Property | Value |
| --- | --- |
| Name | ecnv4-lb |
| Hetzner annotation | load-balancer.hetzner.cloud/name: "ecnv4-lb" |
| Proxy Protocol | enabled (uses-proxyprotocol: "true") |
| Private-IP target | enabled (use-private-ip: "true") |
| Fronts | Traefik pods: HTTP (80), HTTPS (443), MySQL (3306) |

Path A (public HTTP) and Path C (external ERP on 3306) both terminate here. See Ingress, TLS & Cloudflare for the full traffic flow.

Managed by HCCM declaratively through annotations on Traefik's Service — do NOT edit the LB in the Hetzner console.
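A sketch of what HCCM sees on the Service (annotation keys are the standard load-balancer.hetzner.cloud/ set; the real Service is rendered from the Traefik chart):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  annotations:
    load-balancer.hetzner.cloud/name: "ecnv4-lb"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"  # Proxy Protocol on
    load-balancer.hetzner.cloud/use-private-ip: "true"      # target nodes over the private net
spec:
  type: LoadBalancer
  # ports 80/443/3306 exposed via Traefik entrypoints
```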

Object storage (S3-compatible)

Hetzner's own S3-compatible Object Storage, at endpoint hel1.your-objectstorage.com. Two buckets, one shared credential Secret (ecn-bucket-access, sealed; keys FILES_S3_ACCESS_KEY_ID, FILES_S3_SECRET_ACCESS_KEY).

ecn-private — backups

Where MariaDB backups land. Infrastructure-level bucket, not served to any application.

| Property | Value |
| --- | --- |
| Bucket | ecn-private |
| Writers | Backup CR (mysqldump / LogicalBackup) and PhysicalBackup CR, both in manifests_v1/app-constructs/ecommercen-clients/wecare/infrastructure/prod/ |
| Prefixes | wecare/mariadb-sql-backup/, wecare/mariadb-backup/ |
| Retention | Per CR maxRetention; see the backup manifests |
| Readers | Restore jobs; the db-manager agent when investigating |

wecare — app file storage

Where the wecare app writes uploaded files (product images, user documents, exports, etc.). The app-web pods read and write it via the AWS SDK; files are also served back to browsers through an nginx proxy_pass reverse proxy that hides the bucket endpoint.

| Property | Value |
| --- | --- |
| Bucket | wecare |
| Env vars | FILES_S3_BUCKET=wecare, FILES_S3_ENDPOINT=https://hel1.your-objectstorage.com, FILES_S3_DEFAULT_REGION=hel1, FILES_S3_PREFIX="" (root) |
| Config source | manifests_v1/app-constructs/ecommercen-clients/wecare/adveshop4/base/app.yaml (the envs ConfigMap) |
| Readers / writers | app-web pods (nginx proxy + php-fpm direct SDK access) |
| Access pattern | Private bucket; the nginx sidecar fronts it with proxy_pass ${FILES_S3_ENDPOINT}/${FILES_S3_BUCKET}/... so browsers never see the raw storage URL |
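The proxy can be sketched as an nginx location block (illustrative only; the /files/ path prefix is a hypothetical example, and the real config is templated from the FILES_S3_* env vars):

```nginx
location /files/ {
    # Rewrite to the private bucket; browsers never see this URL.
    # ${FILES_S3_ENDPOINT}/${FILES_S3_BUCKET} is resolved at config render time.
    proxy_pass https://hel1.your-objectstorage.com/wecare/;
}
```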

New clients get their own bucket (name matches the tenant's FILES_S3_BUCKET env, set by the client-onboarder agent during onboarding). Bucket creation itself happens in the Hetzner Object Storage console — it isn't declared in this repo.

External services

| Service | Role | Where configured |
| --- | --- | --- |
| Cloudflare DNS | Orange-cloud proxy for *.wecare.gr + *.ecommercen.com; CNAMEs for *ecnv4-mgmt.ecommercen.com | Cloudflare dashboard (not in repo) |
| Cloudflare Tunnel | Outbound tunnel for management URLs | manifests_v1/app-constructs/cloudflared/configmap.yaml |
| Cloudflare Access | SSO gate for management URLs (@advisable.com + Service Auth) | Cloudflare dashboard |
| Bitwarden / Vaultwarden | Source of truth for plaintext secrets before sealing; HCLOUD_TOKEN + rke2 token for tf.sh | Bitwarden cloud + bw-unlock.sh |
| GitHub (Advisable-com/ecnv4_manifests) | Git remote for GitOps; GitHub Actions for the docs build | .github/workflows/docs.yml |
| ghcr.io | Container registry (app images, operator images) | Image references in manifests |
| docker-registry1.mariadb.com | MariaDB + MaxScale official images | mariadb.yaml, maxscale.yaml |

Monthly footprint (reference only)

Hetzner pricing changes — check the console for live numbers. Rough figures at time of writing:

| Line item | Count | Approx |
| --- | --- | --- |
| cx42 master | 3 | ~€36 |
| ccx33 wecare-db | 2 | ~€132 |
| ccx23 wecare-cache | 2 | ~€66 |
| ccx13 wecare-web | 4 | ~€48 |
| Autoscaler headroom (up to 12 × ccx13) | 0–12 | up to ~€144 extra |
| Hetzner LB | 1 | ~€6 |
| Object Storage | 1 | usage-based (~€5 floor) |
| Private network | 1 | included |
| Cloudflare | n/a | free/Pro tier (external) |
| Total (static) | | ≈ €293/mo |

Numbers from project memory and the Terraform-defined inventory; validate against the Hetzner invoice for month-accurate totals.

What's NOT in this inventory

  • MetalLB — installed in manifests_v1/app-constructs/metallb/ but disabled (no app-*.yaml in apps-enabled/). HCCM handles the LB; MetalLB would only matter in a bare-metal deployment.
  • bootstrapFrom on the MariaDB CR — commented out in mariadb.yaml because the bootstrap pods that consume PhysicalBackups fail on start. The hourly PhysicalBackup CronJob itself is running — snapshots land in S3 normally. Fresh replicas just fall back to native GTID streaming clone from the primary instead of pulling the snapshot. See Databases.
  • Non-wecare clients — the ApplicationSet generates per-client apps from config.json files, but only wecare has production manifests today. Any new client would be added via the client-onboarder agent; see Add a client.

Planned expansion

Dedicated observability node pool

Today Prometheus, Loki, Grafana, Alertmanager, Traefik, and the Kubernetes control plane (including embedded etcd) all run on the same three cx42 masters. This is fine under normal load but two things are pushing us toward a split:

  1. Observability load is bursty. A heavy scrape cycle or a Loki compactor run can spike CPU and memory on a master, the same node that is also serving kube-apiserver and participating in etcd quorum. Under pressure, ecnv4k8s-hel1-1 has occasionally restarted etcd (tracked in project memory as one of the observability follow-ups).
  2. Resource headroom is getting thin. Prometheus TSDB + Loki chunks are memory-hungry and grow with tenant count. Master nodes are cx42 (shared CPU, 16 GiB RAM) — adequate today, tight as more clients onboard.

The plan — add a dedicated observability node pool alongside the existing pools:

  • Server type: likely ccx23 or ccx33 (dedicated EPYC, 16–32 GiB RAM) depending on retention targets.
  • Node label: node-role.kubernetes.io/observability="" + a specific label like ecnv4-observability="".
  • Taint: ecnv4-observability:NoSchedule.
  • Tolerations + nodeSelector on the kube-prometheus-stack, Loki, and Grafana Helm values so their pods land there exclusively.
  • Masters keep Traefik (it's a small footprint and stays close to the control plane for the ingress path).

Not scheduled yet — this is a post-wecare-go-live item. When it happens, it'll be added to terraform/main.tf (same pattern as the existing wecare-db / wecare-cache pools) and the observability-layer Helm values updated to include the new scheduling rules.
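The scheduling rules would mirror the existing tenant pools; a sketch of the values fragment each observability chart would get (standard Kubernetes fields; labels and taint from the plan above):

```yaml
nodeSelector:
  ecnv4-observability: ""
tolerations:
  - key: ecnv4-observability
    operator: Exists
    effect: NoSchedule
```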

See Observability for how these components currently compose, and the "Observability follow-ups" memory for the other five deferred items bundled with this one.

Further reading

Internal documentation — Advisable only