Ansible playbooks (legacy)
Legacy infrastructure tooling
This is legacy infrastructure tooling. Terraform (terraform/main.tf + ./tf.sh) is the maintained path for cluster provisioning and node lifecycle — cloud-init in main.tf bootstraps new nodes into the cluster on first boot, so most of what these playbooks used to do now happens without Ansible at all. See Scale the cluster for the current approach.
The playbooks below still exist under ansible/playbooks/ and may still work, but they are not actively maintained and are not kept in sync with day-to-day practice. Before running any of them:
- Prefer the Terraform / operator path if one exists (see the "When you'd use it" column).
- If there's genuinely no alternative, read the playbook source first and confirm the variables, inventory, and RKE2 config schema still match the live cluster.
- Assume it will need patching.
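One low-risk way to vet a playbook before trusting it is a check-mode dry run against a single host. This is a sketch, not a guaranteed-safe path: it assumes the playbook's modules support check mode (not all do), and `worker-1` is a placeholder hostname.

```shell
cd ansible
# --check makes no changes; --diff shows what each task would change.
# --limit scopes the run to one host so a stale playbook fails small.
ansible-playbook -i inventory/hosts.yml playbooks/configure_rke2.yml \
  --check --diff --limit worker-1
```

If the diff output surprises you, that is the signal the playbook has drifted from the live cluster.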
These playbooks handle OS-level operations on the cluster nodes — anything kubelet/kubectl can't reach. Originally they ran from your workstation against Hetzner VMs via SSH, and a few are still useful as a last-resort fallback. Most have been superseded.
When a playbook genuinely needs to run:
```shell
cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/<name>.yml
```

The playbook table
| Playbook | Purpose | Destructive? | When you'd use it |
|---|---|---|---|
| `setup_rke2.yml` | Provision a brand-new RKE2 cluster on fresh Hetzner VMs — install RKE2, join masters + workers, apply initial CNI + HCCM config. | Only for new clusters | Historical bootstrap path only. The cluster already exists; you will almost never run this. New nodes join the existing cluster via Terraform + cloud-init, not via this playbook. |
| `configure_rke2.yml` | Apply runtime config changes to RKE2 (config.yaml edits, feature-gate flips). | Rolling restart of rke2-server / rke2-agent | Last-resort fallback when RKE2 server config must change cluster-wide. Verify the playbook still matches the current RKE2 config schema before running — it's infrequently updated. |
| `configure_cni_only.yml` | Reconfigure Cilium CNI only (bypasses full RKE2 reconfigure). | Cilium pod restart on all nodes | Rare. Cilium config is better managed via the ArgoCD-deployed Cilium Helm values these days. |
| `configure_netplan_routes.yml` | Push netplan route updates to nodes. | Network blip possible | Only if a network route change can't be expressed in Terraform's Hetzner network resources. Rare. |
| `configure_multipath.yml` | Set up multipath storage device aliases on nodes. | None | Not used — holdover from an earlier storage experiment. |
| `apt_update_and_upgrade.yml` | `apt update && apt upgrade` on all nodes, one at a time with drain/uncordon. | Pod evictions | kured handles routine security patching automatically. Only fall back to this playbook for an urgent CVE that kured's window hasn't caught yet. |
| `reboot.yml` | Coordinated rolling reboot: drain → reboot → wait for Ready → uncordon → next node. | Brief pod evictions | kured is the primary path. Use this playbook only when kured's defaults don't fit (e.g., you need a reboot right now and can't wait for kured's maintenance window). |
| `restart_rke2.yml` | Restart the rke2-server / rke2-agent systemd unit on nodes without rebooting the OS. | Transient API-server / kubelet unavailability per node | Last-resort fallback after a manual RKE2 config change. Prefer `kubectl rollout restart` for workloads; this is only for the RKE2 services themselves. |
| `setup_workers.yml` | Join a new worker node into an existing cluster. | None | Superseded. Today: edit `terraform/main.tf`, `./tf.sh plan -out=plans/<name>.tfplan`, `./tf.sh apply`. Cloud-init in main.tf handles the join. Don't use this playbook. |
| `teardown_rke2.yml` | Full cluster teardown — uninstall RKE2, wipe data dirs. | Destroys everything | Never, unless spinning down a one-off test cluster. Has protections but assume worst-case. |
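For reference, the Terraform path that supersedes `setup_workers.yml` uses the commands from the table above; `<name>` is whatever you call the plan file.

```shell
# Maintained path for adding a worker: first edit the node definition in
# terraform/main.tf, then plan and apply via the wrapper script.
./tf.sh plan -out=plans/<name>.tfplan
./tf.sh apply
# cloud-init in main.tf handles the RKE2 join on first boot -- no Ansible step.
```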
Key variables (if you do run one)
Most of the knobs live in ansible/playbooks/vars/main.yaml:
- `rke2_version` — which RKE2 release to install / upgrade to
- Node role mappings, taints, labels
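A hypothetical sketch of what `ansible/playbooks/vars/main.yaml` might look like — only `rke2_version` and the role/taint/label concepts are confirmed above; every key name and value here is illustrative, so check the real file before relying on any of it:

```yaml
# Illustrative only -- key names below are assumptions, not the real schema.
rke2_version: v1.28.9+rke2r1   # which RKE2 release to install / upgrade to
node_labels:                   # role mappings applied at join time
  masters:
    - node-role.kubernetes.io/control-plane=true
node_taints:
  masters:
    - node-role.kubernetes.io/control-plane=true:NoSchedule
```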
CNI-specific settings in ansible/playbooks/vars/cni.yaml:
- `cni_provider: cilium` (we use Cilium; `canal` is the other tested option)
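A minimal sketch of `ansible/playbooks/vars/cni.yaml` — only `cni_provider` is confirmed by the docs above; any other keys the real file holds are not shown here:

```yaml
# Only cni_provider is documented; "canal" is the other tested value.
cni_provider: cilium
```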
The ansible/inventory/hosts.yml file lists the actual Hetzner VMs grouped by role (masters, workers-per-client, autoscaler pool).
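The grouping described above might look roughly like this — a hypothetical sketch of `hosts.yml`, where all group names, hostnames, and IPs are placeholders (see the inventory-drift warning below before trusting the real file either):

```yaml
# Illustrative Ansible inventory shape; names and addresses are placeholders.
all:
  children:
    masters:
      hosts:
        master-1:
          ansible_host: 10.0.0.2
    workers:                 # per-client worker pools in practice
      hosts:
        worker-client-a-1:
          ansible_host: 10.0.0.10
    autoscaler:              # autoscaler-managed pool
      hosts: {}
```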
Inventory drift
The inventory and variable files have drifted from live cluster state in the past. Spot-check hostnames and IPs in hosts.yml against hcloud server list or kubectl get nodes -o wide before running anything that iterates over the inventory.
After setup_rke2.yml (historical note)
If the bootstrap playbook is ever run again, the initial kubeconfig is written to ansible/playbooks/tmp/rke.yml. For the existing cluster, grab the kubeconfig from the Bitwarden vault instead — see Authenticating to the cluster.
Common pitfalls
- SSH access: the playbooks assume your SSH public key is on every node. Hetzner Cloud injection handles this at VM creation time (now via Terraform cloud-init); if a key isn't there, you'll get auth errors.
- Concurrency: most playbooks run sequentially (not parallel) per node — this is intentional for drain/reboot flows. Don't force parallelism without knowing what it breaks.
- etcd quorum during master reboots: `reboot.yml` serializes masters one at a time. Don't short-circuit that loop.
- Bitwarden: some playbooks need Hetzner API tokens. Unlock via `scripts/bw-unlock.sh` first.
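The one-node-at-a-time pattern behind `reboot.yml` and `apt_update_and_upgrade.yml` can be sketched in plain shell. Node names and steps are placeholders — the real playbooks drain via kubectl and act on the nodes over SSH; here each step is just echoed to show the ordering that protects etcd quorum:

```shell
# Hypothetical sketch: each node is fully drained, rebooted, and uncordoned
# before the loop touches the next one. Parallelizing this across masters
# is exactly what breaks etcd quorum.
nodes="master-1 master-2 master-3"
for node in $nodes; do
  echo "drain $node"      # evict pods, cordon the node
  echo "reboot $node"     # reboot OS, wait for Ready
  echo "uncordon $node"   # only then move on to the next node
done
```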
Where each tool actually fits today
Three tools, three layers — but Terraform now owns the bulk of what Ansible used to do:
- Terraform (`terraform/` + `./tf.sh`) — the maintained path. Owns Hetzner Cloud resources (VMs, load balancers, firewalls, networks) and bootstraps them into the cluster via cloud-init on first boot. New nodes, new client pools, node replacement, network changes: Terraform.
- ArgoCD — everything running inside Kubernetes (pods, services, CRDs, Helm charts).
- Ansible — legacy. Patches to already-running nodes when cloud-init isn't enough and no operator covers it (see the table above — a handful of edge cases, plus kured as the automated replacement for rolling reboots / apt upgrades).
Further reading
- Scale the cluster — current Terraform-based node lifecycle
- Reboot & patch — kured + the rare ansible fallback
- Terraform entry point: `terraform/main.tf` (managed via `./tf.sh`)
- Ansible (legacy): `ansible/inventory/hosts.yml`, `ansible/playbooks/vars/`
- RKE2 docs: https://docs.rke2.io/
- Next: Claude agents — prefer terraform-manager for node lifecycle changes