Infrastructure inventory
A snapshot of every moving piece the cluster is made of — servers, load balancer, network, object storage, external services. This page is the single answer to "what's in our cluster?". It's kept in sync with terraform/main.tf and the related app-construct values files; if you change infrastructure, update this page.
Last-audited: 2026-04-18.
Terraform is the source of truth
Everything under "Nodes" is declared in terraform/main.tf and applied via ./tf.sh. Delegate any node add/remove to the terraform-manager agent rather than editing the file directly. This page is a human-readable projection; terraform wins if they disagree.
Topology at a glance
Why the LB targets the masters
Traefik is configured with tolerations for node-role.kubernetes.io/control-plane:NoSchedule and priorityClassName: system-cluster-critical, and has no tolerations for the tenant-pool taints (wecare, wecare-mariadb, wecare-cache). So the only nodes it can schedule on are the three control-plane masters. The Hetzner LB therefore forwards ports 80/443/3306 to the masters' NodePorts, not the wecare-web nodes — Traefik inside those master pods then routes in-cluster via app-svc to app-web pods on the wecare-web pool. The masters are doing two jobs: Kubernetes control plane and ingress.
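The scheduling constraints described above can be sketched as a Traefik Helm values fragment (illustrative only — the real values live in the Traefik app-construct, and this is not copied from it):

```yaml
# Illustrative fragment — shows why Traefik can only land on the masters:
# it tolerates the control-plane taint but carries no toleration for the
# tenant-pool taints (wecare, wecare-mariadb, wecare-cache).
priorityClassName: system-cluster-critical
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```

With no toleration for any tenant taint, the scheduler's only feasible nodes are the three masters.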
Nodes
All nodes are in Hetzner's hel1 location (Helsinki, Finland), Ubuntu 24.04, joined to RKE2 v1.31.2+rke2r1 via cloud-init, attached to the network-k8s private network and the firewall-k8s firewall.
Control plane (masters)
3 servers running the API server, scheduler, controller-manager, and embedded etcd. Shared-CPU Hetzner tier.
| Name | Type | vCPU/RAM | Private IP | Public IP | Role |
|---|---|---|---|---|---|
| ecnv4k8s-hel1-1 | cx42 | 8/16 GiB | 10.1.0.4 | 135.181.32.184 | primary (Cilium k8sServiceHost) |
| ecnv4k8s-hel1-2 | cx42 | 8/16 GiB | 10.1.0.2 | 95.217.177.124 | secondary |
| ecnv4k8s-hel1-3 | cx42 | 8/16 GiB | 10.1.0.3 | 65.109.9.80 | secondary |
Lifecycle-ignored fields: labels, network, user_data — Terraform won't roll these on drift, so changes need a full re-provision. RKE2 join token is loaded into var.rke2_token by tf.sh from Vaultwarden (item ecnv4 - rke2 token).
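In Terraform the lifecycle-ignored fields translate to an ignore_changes block roughly like this (a sketch of the pattern, not the literal contents of terraform/main.tf):

```hcl
resource "hcloud_server" "master" {
  # ... name, server_type = "cx42", location = "hel1", etc.

  lifecycle {
    # Drift in these fields is ignored: Terraform will not roll the node
    # to reconcile them, so changing any of them means a full re-provision.
    ignore_changes = [labels, network, user_data]
  }
}
```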
Wecare MariaDB pool (wecare-db)
Dedicated AMD EPYC servers reserved for the MariaDB primary+replica and MaxScale. Data lives on the node's local NVMe via the local-path StorageClass.
| Name | Type | vCPU/RAM | Private IP |
|---|---|---|---|
| wecare-db-1 | ccx33 | 8 dedicated / 32 GiB | 10.1.0.7 |
| wecare-db-2 | ccx33 | 8 dedicated / 32 GiB | 10.1.0.6 |
Labels: node-role.kubernetes.io/db="", wecare-mariadb="". Taint: wecare-mariadb:NoSchedule.
Only pods with a matching toleration (MariaDB, MaxScale, redis-session, their backup jobs) land here.
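A pod pinned to this pool needs both a matching toleration and a nodeSelector, roughly like this (illustrative pod-spec fragment, not copied from the manifests):

```yaml
# Illustrative fragment for a workload pinned to the wecare-db pool:
# the nodeSelector attracts it to the labeled nodes, the toleration
# lets it past the wecare-mariadb:NoSchedule taint.
nodeSelector:
  wecare-mariadb: ""
tolerations:
  - key: wecare-mariadb
    operator: Exists
    effect: NoSchedule
```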
Wecare Redis cache pool (wecare-cache)
Dedicated AMD EPYC servers for Redis cache and Sentinel. Less RAM than the DB pool — Redis keeps everything in memory, but the cache working set is much smaller than the database's.
| Name | Type | vCPU/RAM | Private IP |
|---|---|---|---|
| wecare-cache-1 | ccx23 | 4 dedicated / 16 GiB | 10.1.0.12 |
| wecare-cache-2 | ccx23 | 4 dedicated / 16 GiB | 10.1.0.13 |
Labels: node-role.kubernetes.io/cache="", wecare-cache="". Taint: wecare-cache:NoSchedule.
Note: redis-cache lands here; redis-session actually lands on the MariaDB pool (see Databases for the rationale).
Wecare web pool (wecare)
Dedicated AMD EPYC servers running nginx + php-fpm (the app-web Deployment). Four static nodes today; the cluster-autoscaler provisions additional CCX13 nodes dynamically up to 12 total when KEDA drives the pod count up.
| Name | Type | vCPU/RAM | Private IP |
|---|---|---|---|
| wecare-web-1 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.8 |
| wecare-web-2 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.9 |
| wecare-web-3 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.10 |
| wecare-web-4 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.11 |
Labels: node-role.kubernetes.io/app="", wecare="", topology.kubernetes.io/region=hel1, topology.kubernetes.io/zone=hel1-dc2. Taint: wecare:NoSchedule.
Cluster-autoscaler pool (dynamic wecare-web)
Ephemeral CCX13 nodes provisioned by the cluster-autoscaler when workloads can't fit on the static four. Transition state today — autoscaler coexists with the static nodes (Phase 3); Phase 4 plan is to drop the static four in favor of fully autoscaled capacity.
| Property | Value |
|---|---|
| Autoscaling group name | wecare-web |
| minSize | 0 (Phase 3 testing; will move to 2 in Phase 4) |
| maxSize | 12 |
| instanceType | CCX13 |
| region | hel1 |
| Scale-down threshold | 50% utilization, unneeded 10m, delay-after-add 10m |
| max-node-provision-time | 45m |
The autoscaler applies the same wecare="" label and wecare:NoSchedule taint as the Terraform-managed webs via its HCLOUD_CLUSTER_CONFIG secret (built by untracked/secrets/build-autoscaler-config.sh — requires Bitwarden unlocked). Config source: manifests_v1/app-constructs/cluster-autoscaler/values.yaml.
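The table above maps onto the autoscaler chart values roughly like this (a hedged sketch — key names approximate the upstream cluster-autoscaler chart; the real keys are in manifests_v1/app-constructs/cluster-autoscaler/values.yaml):

```yaml
# Illustrative values fragment, not copied from the repo.
autoscalingGroups:
  - name: wecare-web
    minSize: 0          # Phase 3 testing; planned 2 in Phase 4
    maxSize: 12
    instanceType: CCX13
    region: hel1
extraArgs:
  scale-down-utilization-threshold: "0.5"
  scale-down-unneeded-time: 10m
  scale-down-delay-after-add: 10m
  max-node-provision-time: 45m
```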
Network
Private network
| Property | Value |
|---|---|
| Hetzner network name | network-k8s |
| IP range | 10.1.0.0/16 |
| Subnet | 10.1.0.0/24 (eu-central cloud zone) |
| Gateway | 10.1.0.1 |
| Private interface on nodes | enp7s0 |
Kubernetes CIDRs
| CIDR | Purpose |
|---|---|
| 10.42.0.0/16 | Pod network (Cilium clusterCIDR) |
| 10.43.0.0/16 | ClusterIP / Service range |
| 10.1.0.0/24 | Node IPs (Hetzner private) |
See Cluster networking for how pods talk to each other inside this space.
Firewall (firewall-k8s)
Applied to every node. Only four inbound rules — everything else relies on the Hetzner private network being unreachable from the outside.
| Direction | Protocol | Port | Source | Purpose |
|---|---|---|---|---|
| in | tcp | 22 | 0.0.0.0/0, ::/0 | SSH |
| in | tcp | 6443 | 0.0.0.0/0, ::/0 | kube-apiserver |
| in | tcp | 30000-32767 | 0.0.0.0/0, ::/0 | Kubernetes NodePort (TCP) |
| in | udp | 30000-32767 | 0.0.0.0/0, ::/0 | Kubernetes NodePort (UDP) |
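In Terraform these four rules look roughly like the following (a sketch using the hcloud provider's rule schema; the exact structure in terraform/main.tf may differ):

```hcl
resource "hcloud_firewall" "k8s" {
  name = "firewall-k8s"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"            # SSH
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "6443"          # kube-apiserver
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "30000-32767"   # Kubernetes NodePort (TCP)
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "30000-32767"   # Kubernetes NodePort (UDP)
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}
```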
Load balancer
One Hetzner Cloud Load Balancer provisioned by HCCM (Hetzner Cloud Controller Manager) via the Traefik Service type=LoadBalancer:
| Property | Value |
|---|---|
| Name | ecnv4-lb |
| Hetzner annotation | load-balancer.hetzner.cloud/name: "ecnv4-lb" |
| Proxy Protocol | enabled (uses-proxyprotocol: "true") |
| Private-IP target | enabled (use-private-ip: "true") |
| Fronts | Traefik pods — HTTP (80), HTTPS (443), MySQL (3306) |
Path A (public HTTP) and Path C (external ERP on 3306) both terminate here. See Ingress, TLS & Cloudflare for the full traffic flow.
Managed by HCCM declaratively through annotations on Traefik's Service — do NOT edit the LB in the Hetzner console.
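The annotations that drive HCCM sit on Traefik's Service, roughly like this (the annotation keys are the real load-balancer.hetzner.cloud ones from the table; the rest of the manifest is an illustrative sketch):

```yaml
# Illustrative Service fragment — HCCM reads these annotations and
# reconciles the Hetzner LB to match; hand edits in the console get reverted.
apiVersion: v1
kind: Service
metadata:
  name: traefik
  annotations:
    load-balancer.hetzner.cloud/name: "ecnv4-lb"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
    load-balancer.hetzner.cloud/use-private-ip: "true"
spec:
  type: LoadBalancer
  ports:
    - name: web
      port: 80
    - name: websecure
      port: 443
    - name: mysql
      port: 3306
```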
Object storage (S3-compatible)
Hetzner's own S3-compatible Object Storage, at endpoint hel1.your-objectstorage.com. Two buckets, one shared credential Secret (ecn-bucket-access, sealed; keys FILES_S3_ACCESS_KEY_ID, FILES_S3_SECRET_ACCESS_KEY).
ecn-private — backups
Where MariaDB backups land. Infrastructure-level bucket, not served to any application.
| Property | Value |
|---|---|
| Bucket | ecn-private |
| Writers | Backup CR (mysqldump / LogicalBackup) and PhysicalBackup CR — both in manifests_v1/app-constructs/ecommercen-clients/wecare/infrastructure/prod/ |
| Prefixes | wecare/mariadb-sql-backup/, wecare/mariadb-backup/ |
| Retention | Per CR maxRetention — see the backup manifests |
| Readers | Restore jobs, the db-manager agent when investigating |
wecare — app file storage
Where the wecare app writes uploaded files (product images, user documents, exports, etc.). Read and written by the app-web pods via the AWS SDK; also served back to browsers via an nginx proxy_pass reverse proxy that hides the bucket endpoint.
| Property | Value |
|---|---|
| Bucket | wecare |
| Env vars | FILES_S3_BUCKET=wecare, FILES_S3_ENDPOINT=https://hel1.your-objectstorage.com, FILES_S3_DEFAULT_REGION=hel1, FILES_S3_PREFIX="" (root) |
| Config source | manifests_v1/app-constructs/ecommercen-clients/wecare/adveshop4/base/app.yaml (the envs ConfigMap) |
| Readers / writers | app-web pods (nginx proxy + php-fpm direct SDK access) |
| Access pattern | Private bucket; the nginx sidecar fronts it with proxy_pass ${FILES_S3_ENDPOINT}/${FILES_S3_BUCKET}/... so browsers never see the raw storage URL |
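The proxy_pass pattern from the table can be sketched as follows (hypothetical nginx fragment — the location path and any headers are assumptions; the real block lives in the app-web nginx config):

```nginx
# Illustrative only — serves bucket objects to browsers without
# exposing the raw object-storage URL.
location /files/ {
    proxy_pass https://hel1.your-objectstorage.com/wecare/;
    proxy_set_header Host hel1.your-objectstorage.com;
}
```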
New clients get their own bucket (name matches the tenant's FILES_S3_BUCKET env, set by the client-onboarder agent during onboarding). Bucket creation itself happens in the Hetzner Object Storage console — it isn't declared in this repo.
External services
| Service | Role | Where configured |
|---|---|---|
| Cloudflare DNS | Orange-cloud proxy for *.wecare.gr + *.ecommercen.com; CNAMEs for *ecnv4-mgmt.ecommercen.com | Cloudflare dashboard (not in repo) |
| Cloudflare Tunnel | Outbound tunnel for management URLs | manifests_v1/app-constructs/cloudflared/configmap.yaml |
| Cloudflare Access | SSO gate for management URLs (@advisable.com + Service Auth) | Cloudflare dashboard |
| Bitwarden / Vaultwarden | Source of truth for plaintext secrets before sealing; HCLOUD_TOKEN + rke2 token for tf.sh | Bitwarden cloud + bw-unlock.sh |
| GitHub (Advisable-com/ecnv4_manifests) | Git remote for GitOps; GitHub Actions for the docs build | .github/workflows/docs.yml |
| ghcr.io | Container registry (app images, operator images) | Image references in manifests |
| docker-registry1.mariadb.com | MariaDB + MaxScale official images | mariadb.yaml, maxscale.yaml |
Monthly footprint (reference only)
Hetzner pricing changes — check the console for live numbers. Rough figures at time of writing:
| Line item | Count | Approx |
|---|---|---|
| cx42 master | 3 | ~€36 |
| ccx33 wecare-db | 2 | ~€132 |
| ccx23 wecare-cache | 2 | ~€66 |
| ccx13 wecare-web | 4 | ~€48 |
| Autoscaler headroom (up to 12 × ccx13) | 0–12 | up to ~€144 extra |
| Hetzner LB | 1 | ~€6 |
| Object Storage | 1 | usage-based (~€5 floor) |
| Private network | 1 | included |
| Cloudflare | — | free/Pro tier — external |
| Total (static) | — | ≈ €293/mo |
Numbers from project memory and the Terraform-defined inventory; validate against the Hetzner invoice for month-accurate totals.
What's NOT in this inventory
- MetalLB — installed in manifests_v1/app-constructs/metallb/ but disabled (no app-*.yaml in apps-enabled/). HCCM handles the LB; MetalLB would only matter in a bare-metal deployment.
- bootstrapFrom on the MariaDB CR — commented out in mariadb.yaml because the bootstrap pods that consume PhysicalBackups fail on start. The hourly PhysicalBackup CronJob itself is running — snapshots land in S3 normally. Fresh replicas just fall back to native GTID streaming clone from the primary instead of pulling the snapshot. See Databases.
- Non-wecare clients — the ApplicationSet generates per-client apps from config.json files, but only wecare has production manifests today. Any new client would be added via the client-onboarder agent; see Add a client.
Planned expansion
Dedicated observability node pool
Today Prometheus, Loki, Grafana, Alertmanager, Traefik, and the Kubernetes control plane (including embedded etcd) all run on the same three cx42 masters. This is fine under normal load but two things are pushing us toward a split:
- Observability load is bursty. A big scrape interval tick or a Loki compactor run can spike CPU and memory on a master — the same node that's also serving kube-apiserver and arbitrating etcd. Under pressure, ecnv4k8s-hel1-1 has occasionally restarted etcd (tracked in project memory as one of the observability follow-ups).
- Resource headroom is getting thin. Prometheus TSDB + Loki chunks are memory-hungry and grow with tenant count. Master nodes are cx42 (shared CPU, 16 GiB RAM) — adequate today, tight as more clients onboard.
The plan — add a dedicated observability node pool alongside the existing pools:
- Server type: likely ccx23 or ccx33 (dedicated EPYC, 16–32 GiB RAM) depending on retention targets.
- Node label: node-role.kubernetes.io/observability="" + a specific label like ecnv4-observability="".
- Taint: ecnv4-observability:NoSchedule.
- Tolerations + nodeSelector on the kube-prometheus-stack, Loki, and Grafana Helm values so their pods land there exclusively.
- Masters keep Traefik (it's a small footprint and stays close to the control plane for the ingress path).
Not scheduled yet — this is a post-wecare-go-live item. When it happens, it'll be added to terraform/main.tf (same pattern as the existing wecare-db / wecare-cache pools) and the observability-layer Helm values updated to include the new scheduling rules.
See Observability for how these components currently compose, and the "Observability follow-ups" memory for the other five deferred items bundled with this one.
Further reading
- terraform/main.tf — the ground truth for every server on this page
- Cluster networking — what happens between these nodes once traffic is in the private network
- Databases — detailed topology of the DB + cache pools
- Autoscaling — how the wecare-web pool grows under load
- Scale the cluster — runbook for changing node counts
- Reboot & patch — runbook for OS updates across this fleet