Ansible playbooks (legacy)
Legacy infrastructure tooling
This is legacy infrastructure tooling. Terraform (terraform/main.tf + ./tf.sh) is the maintained path for cluster provisioning and node lifecycle — cloud-init in main.tf bootstraps new nodes into the cluster on first boot, so most of what these playbooks used to do now happens without Ansible at all. See Scale the cluster for the current approach.
The playbooks below still exist under ansible/playbooks/ and may still work, but they are not actively maintained and are not kept in sync with day-to-day practice. Before running any of them:
- Prefer the Terraform / operator path if one exists (see the "When you'd use it" column).
- If there's genuinely no alternative, read the playbook source first and confirm the variables, inventory, and RKE2 config schema still match the live cluster.
- Assume it will need patching.
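One low-risk way to vet a playbook before trusting it is a check-mode dry run against a single host. This is a sketch, not a guaranteed-safe path: it assumes the playbook's modules support check mode (not all do), and `worker-1` is a placeholder hostname.

```shell
cd ansible
# --check makes no changes; --diff shows what each task would change.
# --limit scopes the run to one host so a stale playbook fails small.
ansible-playbook -i inventory/hosts.yml playbooks/configure_rke2.yml \
  --check --diff --limit worker-1
```

If the diff output surprises you, that is the signal the playbook has drifted from the live cluster.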
These playbooks handle OS-level operations on the cluster nodes — anything kubelet/kubectl can't reach. Originally they ran from your workstation against Hetzner VMs via SSH, and a few are still useful as a last-resort fallback. Most have been superseded.
When a playbook genuinely needs to run:
```shell
cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/<name>.yml
```

The playbook table
| Playbook | Purpose | Destructive? | When you'd use it |
|---|---|---|---|
| `setup_rke2.yml` | Provision a brand-new RKE2 cluster on fresh Hetzner VMs — install RKE2, join masters + workers, apply initial CNI + HCCM config. | Only for new clusters | Historical bootstrap path only. The cluster already exists; you will almost never run this. New nodes join the existing cluster via Terraform + cloud-init, not via this playbook. |
| `configure_rke2.yml` | Apply runtime config changes to RKE2 (config.yaml edits, feature-gate flips). | Rolling restart of rke2-server / rke2-agent | Last-resort fallback when RKE2 server config must change cluster-wide. Verify the playbook still matches the current RKE2 config schema before running — it's infrequently updated. |
| `configure_cni_only.yml` | Reconfigure Cilium CNI only (bypasses full RKE2 reconfigure). | Cilium pod restart on all nodes | Rare. Cilium config is better managed via the ArgoCD-deployed Cilium Helm values these days. |
| `configure_netplan_routes.yml` | Push netplan route updates to nodes. | Network blip possible | Only if a network route change can't be expressed in Terraform's Hetzner network resources. Rare. |
| `configure_multipath.yml` | Set up multipath storage device aliases on nodes. | None | Not used — holdover from an earlier storage experiment. |
| `apt_update_and_upgrade.yml` | `apt update && apt upgrade` on all nodes, one at a time with drain/uncordon. | Pod evictions | kured handles routine security patching automatically. Only fall back to this playbook for an urgent CVE that kured's window hasn't caught yet. |
| `reboot.yml` | Coordinated rolling reboot: drain → reboot → wait for Ready → uncordon → next node. | Brief pod evictions | kured is the primary path. Use this playbook only when kured's defaults don't fit (e.g., you need a reboot right now and can't wait for kured's maintenance window). |
| `restart_rke2.yml` | Restart the rke2-server / rke2-agent systemd unit on nodes without rebooting the OS. | Transient API-server / kubelet unavailability per node | Last-resort fallback after a manual RKE2 config change. Prefer `kubectl rollout restart` for workloads; this is only for the RKE2 services themselves. |
| `setup_workers.yml` | Join a new worker node into an existing cluster. | None | Superseded. Today: edit `terraform/main.tf`, `./tf.sh plan -out=plans/<name>.tfplan`, `./tf.sh apply`. Cloud-init in main.tf handles the join. Don't use this playbook. |
| `teardown_rke2.yml` | Full cluster teardown — uninstall RKE2, wipe data dirs. | Destroys everything | Never, unless spinning down a one-off test cluster. Has protections but assume worst-case. |
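For reference, the Terraform path that supersedes `setup_workers.yml` uses the commands from the table above; `<name>` is whatever you call the plan file.

```shell
# Maintained path for adding a worker: first edit the node definition in
# terraform/main.tf, then plan and apply via the wrapper script.
./tf.sh plan -out=plans/<name>.tfplan
./tf.sh apply
# cloud-init in main.tf handles the RKE2 join on first boot -- no Ansible step.
```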
Key variables (if you do run one)
Most of the knobs live in ansible/playbooks/vars/main.yaml:
- `rke2_version` — which RKE2 release to install / upgrade to
- Node role mappings, taints, labels
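A hypothetical sketch of what `ansible/playbooks/vars/main.yaml` might look like — only `rke2_version` and the role/taint/label concepts are confirmed above; every key name and value here is illustrative, so check the real file before relying on any of it:

```yaml
# Illustrative only -- key names below are assumptions, not the real schema.
rke2_version: v1.28.9+rke2r1   # which RKE2 release to install / upgrade to
node_labels:                   # role mappings applied at join time
  masters:
    - node-role.kubernetes.io/control-plane=true
node_taints:
  masters:
    - node-role.kubernetes.io/control-plane=true:NoSchedule
```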
CNI-specific settings in ansible/playbooks/vars/cni.yaml:
- `cni_provider: cilium` (we use Cilium; `canal` is the other tested option)
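A minimal sketch of `ansible/playbooks/vars/cni.yaml` — only `cni_provider` is confirmed by the docs above; any other keys the real file holds are not shown here:

```yaml
# Only cni_provider is documented; "canal" is the other tested value.
cni_provider: cilium
```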
The ansible/inventory/hosts.yml file lists the actual Hetzner VMs grouped by role (masters, workers-per-client, autoscaler pool).
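The grouping described above might look roughly like this — a hypothetical sketch of `hosts.yml`, where all group names, hostnames, and IPs are placeholders (see the inventory-drift warning below before trusting the real file either):

```yaml
# Illustrative Ansible inventory shape; names and addresses are placeholders.
all:
  children:
    masters:
      hosts:
        master-1:
          ansible_host: 10.0.0.2
    workers:                 # per-client worker pools in practice
      hosts:
        worker-client-a-1:
          ansible_host: 10.0.0.10
    autoscaler:              # autoscaler-managed pool
      hosts: {}
```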
Inventory drift
The inventory and variable files have drifted from live cluster state in the past. Spot-check hostnames and IPs in hosts.yml against hcloud server list or kubectl get nodes -o wide before running anything that iterates over the inventory.
After setup_rke2.yml (historical note)
If the bootstrap playbook is ever run again, the initial kubeconfig is written to ansible/playbooks/tmp/rke.yml. For the existing cluster, grab the kubeconfig from the Bitwarden vault instead — see Authenticating to the cluster.
Common pitfalls
- SSH access: the playbooks assume your SSH public key is on every node. Hetzner Cloud injection handles this at VM creation time (now via Terraform cloud-init); if a key isn't there, you'll get auth errors.
- Concurrency: most playbooks run sequentially (not parallel) per node — this is intentional for drain/reboot flows. Don't force parallelism without knowing what it breaks.
- etcd quorum during master reboots: `reboot.yml` serializes masters one at a time. Don't short-circuit that loop.
- Bitwarden: some playbooks need Hetzner API tokens. Unlock via `scripts/bw-unlock.sh` first.
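The one-node-at-a-time pattern behind `reboot.yml` and `apt_update_and_upgrade.yml` can be sketched in plain shell. Node names and steps are placeholders — the real playbooks drain via kubectl and act on the nodes over SSH; here each step is just echoed to show the ordering that protects etcd quorum:

```shell
# Hypothetical sketch: each node is fully drained, rebooted, and uncordoned
# before the loop touches the next one. Parallelizing this across masters
# is exactly what breaks etcd quorum.
nodes="master-1 master-2 master-3"
for node in $nodes; do
  echo "drain $node"      # evict pods, cordon the node
  echo "reboot $node"     # reboot OS, wait for Ready
  echo "uncordon $node"   # only then move on to the next node
done
```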
Where each tool actually fits today
Three tools, three layers — but Terraform now owns the bulk of what Ansible used to do:
- Terraform (`terraform/` + `./tf.sh`) — the maintained path. Owns Hetzner Cloud resources (VMs, load balancers, firewalls, networks) and bootstraps them into the cluster via cloud-init on first boot. New nodes, new client pools, node replacement, network changes: Terraform.
- ArgoCD — everything running inside Kubernetes (pods, services, CRDs, Helm charts).
- Ansible — legacy. Patches to already-running nodes when cloud-init isn't enough and no operator covers it (see the table above — a handful of edge cases, plus kured as the automated replacement for rolling reboots / apt upgrades).
Further reading
- Scale the cluster — current Terraform-based node lifecycle
- Reboot & patch — kured + the rare ansible fallback
- Terraform entry point: `terraform/main.tf` (managed via `./tf.sh`)
- Ansible (legacy): `ansible/inventory/hosts.yml`, `ansible/playbooks/vars/`
- RKE2 docs: https://docs.rke2.io/
- Next: Claude agents — prefer terraform-manager for node lifecycle changes