Infrastructure inventory

A snapshot of every moving piece the cluster is made of — servers, load balancer, network, object storage, external services. This page is the single answer to "what's in our cluster?". It's kept in sync with terraform/main.tf and the related app-construct values files; if you change infrastructure, update this page.

Last-audited: 2026-04-18.

Terraform is the source of truth

Everything under "Nodes" is declared in terraform/main.tf and applied via ./tf.sh. Delegate any node add/remove to the terraform-manager agent rather than editing the file directly. This page is a human-readable projection; terraform wins if they disagree.

Topology at a glance

Why the LB targets the masters

Traefik is configured with a toleration for node-role.kubernetes.io/control-plane:NoSchedule and priorityClassName: system-cluster-critical, and has no tolerations for the tenant-pool taints (wecare, wecare-mariadb, wecare-cache), so the only nodes it can schedule on are the three control-plane masters. The Hetzner LB therefore forwards ports 80/443/3306 to the masters' NodePorts, not to the wecare-web nodes; the Traefik pods on the masters then route in-cluster via app-svc to the app-web pods on the wecare-web pool. The masters are doing two jobs: Kubernetes control plane and ingress.
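The effective scheduling config can be sketched as a fragment of the Traefik Helm values (a sketch, not the actual values file; key names follow the upstream Traefik chart):

```yaml
# Tolerate only the control-plane taint; the tenant-pool taints
# (wecare, wecare-mariadb, wecare-cache) are NOT tolerated, so the
# three masters are the only schedulable nodes.
priorityClassName: system-cluster-critical
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```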

Nodes

All nodes are in Hetzner's hel1 location (Helsinki, Finland), Ubuntu 24.04, joined to RKE2 v1.31.2+rke2r1 via cloud-init, attached to the network-k8s private network and the firewall-k8s firewall.

Control plane (masters)

3 servers running the API server, scheduler, controller-manager, and embedded etcd. Shared-CPU Hetzner tier.

| Name | Type | vCPU/RAM | Private IP | Public IP | Role |
| --- | --- | --- | --- | --- | --- |
| ecnv4k8s-hel1-1 | cx42 | 8 / 16 GiB | 10.1.0.4 | 135.181.32.184 | primary (Cilium k8sServiceHost) |
| ecnv4k8s-hel1-2 | cx42 | 8 / 16 GiB | 10.1.0.2 | 95.217.177.124 | secondary |
| ecnv4k8s-hel1-3 | cx42 | 8 / 16 GiB | 10.1.0.3 | 65.109.9.80 | secondary |

Lifecycle-ignored fields: labels, network, user_data. Terraform ignores drift in these fields, so changing them requires a full re-provision. The RKE2 join token is loaded into var.rke2_token by tf.sh from Vaultwarden (item ecnv4 - rke2 token).
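In Terraform this is expressed with a lifecycle block; an illustrative sketch (the real resource definition lives in terraform/main.tf):

```hcl
resource "hcloud_server" "master" {
  # ...name, server_type, location, user_data, etc.

  lifecycle {
    # Terraform ignores drift in these fields, so changes here require
    # a full node re-provision rather than an in-place update.
    ignore_changes = [labels, network, user_data]
  }
}
```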

Wecare MariaDB pool (wecare-db)

Dedicated AMD EPYC servers reserved for the MariaDB primary+replica and MaxScale. Data lives on the node's local NVMe via the local-path StorageClass.

| Name | Type | vCPU/RAM | Private IP |
| --- | --- | --- | --- |
| wecare-db-1 | ccx33 | 8 dedicated / 32 GiB | 10.1.0.7 |
| wecare-db-2 | ccx33 | 8 dedicated / 32 GiB | 10.1.0.6 |

Labels: node-role.kubernetes.io/db="", wecare-mariadb="". Taint: wecare-mariadb:NoSchedule.

Only pods with a matching toleration (MariaDB, MaxScale, redis-session, and their backup jobs) land here.
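A minimal sketch of the scheduling stanza such a pod spec carries (standard Kubernetes fields; values taken from the labels and taint above):

```yaml
nodeSelector:
  wecare-mariadb: ""          # pin to the DB pool's label
tolerations:
  - key: wecare-mariadb       # tolerate the pool's NoSchedule taint
    operator: Exists
    effect: NoSchedule
```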

Wecare Redis cache pool (wecare-cache)

Dedicated AMD EPYC servers for Redis cache and Sentinel. Less RAM than the DB pool: the cache is fully in-memory, but its footprint is smaller than the database's hot working set.

| Name | Type | vCPU/RAM | Private IP |
| --- | --- | --- | --- |
| wecare-cache-1 | ccx23 | 4 dedicated / 16 GiB | 10.1.0.12 |
| wecare-cache-2 | ccx23 | 4 dedicated / 16 GiB | 10.1.0.13 |

Labels: node-role.kubernetes.io/cache="", wecare-cache="". Taint: wecare-cache:NoSchedule.

Note: redis-cache lands here; redis-session actually lands on the MariaDB pool (see Databases for the rationale).

Wecare web pool (wecare)

Dedicated AMD EPYC servers running nginx + php-fpm (the app-web Deployment). Four static nodes today; the cluster-autoscaler provisions additional CCX13 nodes dynamically up to 12 total when KEDA drives the pod count up.

| Name | Type | vCPU/RAM | Private IP |
| --- | --- | --- | --- |
| wecare-web-1 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.8 |
| wecare-web-2 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.9 |
| wecare-web-3 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.10 |
| wecare-web-4 | ccx13 | 2 dedicated / 8 GiB | 10.1.0.11 |

Labels: node-role.kubernetes.io/app="", wecare="", topology.kubernetes.io/region=hel1, topology.kubernetes.io/zone=hel1-dc2. Taint: wecare:NoSchedule.

Cluster-autoscaler pool (dynamic wecare-web)

Ephemeral CCX13 nodes provisioned by the cluster-autoscaler when workloads can't fit on the static four. Transition state today — autoscaler coexists with the static nodes (Phase 3); Phase 4 plan is to drop the static four in favor of fully autoscaled capacity.

| Property | Value |
| --- | --- |
| Autoscaling group name | wecare-web |
| minSize | 0 (Phase 3 testing; will move to 2 in Phase 4) |
| maxSize | 12 |
| instanceType | CCX13 |
| region | hel1 |
| Scale-down threshold | 50% utilization, unneeded 10m, delay-after-add 10m |
| max-node-provision-time | 45m |

The autoscaler applies the same wecare="" label and wecare:NoSchedule taint as the Terraform-managed webs via its HCLOUD_CLUSTER_CONFIG secret (built by untracked/secrets/build-autoscaler-config.sh — requires Bitwarden unlocked). Config source: manifests_v1/app-constructs/cluster-autoscaler/values.yaml.
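The shape of that secret is roughly the following JSON (an assumed sketch of the hcloud autoscaler's cluster-config format; the authoritative version is produced by build-autoscaler-config.sh, and the cloud-init payload is omitted here):

```json
{
  "nodeConfigs": {
    "wecare-web": {
      "labels": { "wecare": "" },
      "taints": [
        { "key": "wecare", "value": "", "effect": "NoSchedule" }
      ]
    }
  }
}
```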

Network

Private network

| Property | Value |
| --- | --- |
| Hetzner network name | network-k8s |
| IP range | 10.1.0.0/16 |
| Subnet | 10.1.0.0/24 (eu-central cloud zone) |
| Gateway | 10.1.0.1 |
| Private interface on nodes | enp7s0 |

Kubernetes CIDRs

| CIDR | Purpose |
| --- | --- |
| 10.42.0.0/16 | Pod network (Cilium clusterCIDR) |
| 10.43.0.0/16 | ClusterIP / Service range |
| 10.1.0.0/24 | Node IPs (Hetzner private) |

See Cluster networking for how pods talk to each other inside this space.

Firewall (firewall-k8s)

Applied to every node. Only four inbound rules — everything else relies on the Hetzner private network being unreachable from the outside.

| Direction | Protocol | Port | Source | Purpose |
| --- | --- | --- | --- | --- |
| in | tcp | 22 | 0.0.0.0/0, ::/0 | SSH |
| in | tcp | 6443 | 0.0.0.0/0, ::/0 | kube-apiserver |
| in | tcp | 30000-32767 | 0.0.0.0/0, ::/0 | Kubernetes NodePort (TCP) |
| in | udp | 30000-32767 | 0.0.0.0/0, ::/0 | Kubernetes NodePort (UDP) |
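In Terraform the rules take roughly this shape (illustrative; the real definition lives in terraform/main.tf):

```hcl
resource "hcloud_firewall" "k8s" {
  name = "firewall-k8s"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "6443"                  # kube-apiserver
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  # ...plus rules for 22/tcp (SSH) and 30000-32767 (NodePort, tcp + udp)
}
```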

Load balancer

One Hetzner Cloud Load Balancer provisioned by HCCM (Hetzner Cloud Controller Manager) via the Traefik Service type=LoadBalancer:

| Property | Value |
| --- | --- |
| Name | ecnv4-lb |
| Hetzner annotation | load-balancer.hetzner.cloud/name: "ecnv4-lb" |
| Proxy Protocol | enabled (uses-proxyprotocol: "true") |
| Private-IP target | enabled (use-private-ip: "true") |
| Fronts | Traefik pods: HTTP (80), HTTPS (443), MySQL (3306) |

Path A (public HTTP) and Path C (external ERP on 3306) both terminate here. See Ingress, TLS & Cloudflare for the full traffic flow.

Managed by HCCM declaratively through annotations on Traefik's Service — do NOT edit the LB in the Hetzner console.
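A sketch of what HCCM sees on the Service (annotation keys are the standard load-balancer.hetzner.cloud/ set; the real Service is rendered from the Traefik chart):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  annotations:
    load-balancer.hetzner.cloud/name: "ecnv4-lb"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"  # Proxy Protocol on
    load-balancer.hetzner.cloud/use-private-ip: "true"      # target nodes over the private net
spec:
  type: LoadBalancer
  # ports 80/443/3306 exposed via Traefik entrypoints
```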

Object storage (S3-compatible)

Hetzner's own S3-compatible Object Storage, at endpoint hel1.your-objectstorage.com. Two buckets, one shared credential Secret (ecn-bucket-access, sealed; keys FILES_S3_ACCESS_KEY_ID, FILES_S3_SECRET_ACCESS_KEY).

ecn-private — backups

Where MariaDB backups land. Infrastructure-level bucket, not served to any application.

| Property | Value |
| --- | --- |
| Bucket | ecn-private |
| Writers | Backup CR (mysqldump / LogicalBackup) and PhysicalBackup CR, both in manifests_v1/app-constructs/ecommercen-clients/wecare/infrastructure/prod/ |
| Prefixes | wecare/mariadb-sql-backup/, wecare/mariadb-backup/ |
| Retention | Per CR maxRetention; see the backup manifests |
| Readers | Restore jobs; the db-manager agent when investigating |

wecare — app file storage

Where the wecare app writes uploaded files (product images, user documents, exports, etc.). The app-web pods read and write it via the AWS SDK; files are also served back to browsers through an nginx proxy_pass reverse proxy that hides the bucket endpoint.

| Property | Value |
| --- | --- |
| Bucket | wecare |
| Env vars | FILES_S3_BUCKET=wecare, FILES_S3_ENDPOINT=https://hel1.your-objectstorage.com, FILES_S3_DEFAULT_REGION=hel1, FILES_S3_PREFIX="" (root) |
| Config source | manifests_v1/app-constructs/ecommercen-clients/wecare/adveshop4/base/app.yaml (the envs ConfigMap) |
| Readers / writers | app-web pods (nginx proxy + php-fpm direct SDK access) |
| Access pattern | Private bucket; the nginx sidecar fronts it with proxy_pass ${FILES_S3_ENDPOINT}/${FILES_S3_BUCKET}/... so browsers never see the raw storage URL |
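The proxy can be sketched as an nginx location block (illustrative only; the /files/ path prefix is a hypothetical example, and the real config is templated from the FILES_S3_* env vars):

```nginx
location /files/ {
    # Rewrite to the private bucket; browsers never see this URL.
    # ${FILES_S3_ENDPOINT}/${FILES_S3_BUCKET} is resolved at config render time.
    proxy_pass https://hel1.your-objectstorage.com/wecare/;
}
```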

New clients get their own bucket (name matches the tenant's FILES_S3_BUCKET env, set by the client-onboarder agent during onboarding). Bucket creation itself happens in the Hetzner Object Storage console — it isn't declared in this repo.

External services

| Service | Role | Where configured |
| --- | --- | --- |
| Cloudflare DNS | Orange-cloud proxy for *.wecare.gr + *.ecommercen.com; CNAMEs for *ecnv4-mgmt.ecommercen.com | Cloudflare dashboard (not in repo) |
| Cloudflare Tunnel | Outbound tunnel for management URLs | manifests_v1/app-constructs/cloudflared/configmap.yaml |
| Cloudflare Access | SSO gate for management URLs (@advisable.com + Service Auth) | Cloudflare dashboard |
| Bitwarden / Vaultwarden | Source of truth for plaintext secrets before sealing; HCLOUD_TOKEN + rke2 token for tf.sh | Bitwarden cloud + bw-unlock.sh |
| GitHub (Advisable-com/ecnv4_manifests) | Git remote for GitOps; GitHub Actions for the docs build | .github/workflows/docs.yml |
| ghcr.io | Container registry (app images, operator images) | Image references in manifests |
| docker-registry1.mariadb.com | MariaDB + MaxScale official images | mariadb.yaml, maxscale.yaml |

Monthly footprint (reference only)

Hetzner pricing changes — check the console for live numbers. Rough figures at time of writing:

| Line item | Count | Approx |
| --- | --- | --- |
| cx42 master | 3 | ~€36 |
| ccx33 wecare-db | 2 | ~€132 |
| ccx23 wecare-cache | 2 | ~€66 |
| ccx13 wecare-web | 4 | ~€48 |
| Autoscaler headroom (up to 12 × ccx13) | 0–12 | up to ~€144 extra |
| Hetzner LB | 1 | ~€6 |
| Object Storage | 1 | usage-based (~€5 floor) |
| Private network | 1 | included |
| Cloudflare | n/a | free/Pro tier (external) |
| Total (static) | | ≈ €293/mo |

Numbers from project memory and the Terraform-defined inventory; validate against the Hetzner invoice for month-accurate totals.

What's NOT in this inventory

  • MetalLB — installed in manifests_v1/app-constructs/metallb/ but disabled (no app-*.yaml in apps-enabled/). HCCM handles the LB; MetalLB would only matter in a bare-metal deployment.
  • bootstrapFrom on the MariaDB CR — commented out in mariadb.yaml because the bootstrap pods that consume PhysicalBackups fail on start. The hourly PhysicalBackup CronJob itself is running — snapshots land in S3 normally. Fresh replicas just fall back to native GTID streaming clone from the primary instead of pulling the snapshot. See Databases.
  • Non-wecare clients — the ApplicationSet generates per-client apps from config.json files, but only wecare has production manifests today. Any new client would be added via the client-onboarder agent; see Add a client.

Planned expansion

Dedicated observability node pool

Today Prometheus, Loki, Grafana, Alertmanager, Traefik, and the Kubernetes control plane (including embedded etcd) all run on the same three cx42 masters. This is fine under normal load but two things are pushing us toward a split:

  1. Observability load is bursty. A heavy scrape cycle or a Loki compactor run can spike CPU and memory on a master, the same node that is also serving kube-apiserver and participating in etcd quorum. Under pressure, ecnv4k8s-hel1-1 has occasionally restarted etcd (tracked in project memory as one of the observability follow-ups).
  2. Resource headroom is getting thin. Prometheus TSDB + Loki chunks are memory-hungry and grow with tenant count. Master nodes are cx42 (shared CPU, 16 GiB RAM) — adequate today, tight as more clients onboard.

The plan — add a dedicated observability node pool alongside the existing pools:

  • Server type: likely ccx23 or ccx33 (dedicated EPYC, 16–32 GiB RAM) depending on retention targets.
  • Node label: node-role.kubernetes.io/observability="" + a specific label like ecnv4-observability="".
  • Taint: ecnv4-observability:NoSchedule.
  • Tolerations + nodeSelector on the kube-prometheus-stack, Loki, and Grafana Helm values so their pods land there exclusively.
  • Masters keep Traefik (it's a small footprint and stays close to the control plane for the ingress path).

Not scheduled yet — this is a post-wecare-go-live item. When it happens, it'll be added to terraform/main.tf (same pattern as the existing wecare-db / wecare-cache pools) and the observability-layer Helm values updated to include the new scheduling rules.
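The scheduling rules would mirror the existing tenant pools; a sketch of the values fragment each observability chart would get (standard Kubernetes fields; labels and taint from the plan above):

```yaml
nodeSelector:
  ecnv4-observability: ""
tolerations:
  - key: ecnv4-observability
    operator: Exists
    effect: NoSchedule
```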

See Observability for how these components currently compose, and the "Observability follow-ups" memory for the other five deferred items bundled with this one.

Further reading

Internal documentation — Advisable only