Reboot & patch
Default answer: you don't. Kured handles routine security patching on its own, draining nodes one at a time. In normal operations nobody runs reboot by hand. This page is for the edge cases where you do.
Kured — the automated path
Kured (Kubernetes Reboot Daemon) runs as a DaemonSet on every node. Once per hour each pod checks `/var/run/reboot-required` on its host (Ubuntu creates that file after apt installs a kernel or library that needs a restart). When it appears, Kured:
- Acquires a cluster-wide lock via a DaemonSet annotation (only one node reboots at a time).
- Drains the node (5-minute timeout).
- Reboots the host via `nsenter` into PID 1.
- Waits for the node to come back Ready, releases the lock, uncordons.
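The sentinel check is easy to reproduce by hand when you want to know whether Kured is about to act on a node. A minimal sketch — the temp-dir copy just makes the snippet runnable anywhere; on a real node the path is `/var/run/reboot-required` itself:

```shell
#!/bin/sh
# Simulate Ubuntu's reboot sentinel, then run the same existence check
# that Kured's default --reboot-sentinel=/var/run/reboot-required performs.
SENTINEL="$(mktemp -d)/reboot-required"
touch "$SENTINEL"   # apt creates the real file after a kernel/libc update

if [ -f "$SENTINEL" ]; then
  echo "reboot required"
fi
```

On a real node, `/var/run/reboot-required.pkgs` lists the packages that triggered the sentinel.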
The manifests live in manifests_v1/app-constructs/kured/. Version, flags, and reboot windows are configured there — the Watchdog heartbeat + KubeNodeNotReady alert tell you if a reboot stalls.
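For orientation, a reboot window is expressed as container args on the Kured DaemonSet. The flags below are real Kured options; the values (and the image tag) are illustrative, not necessarily what the repo pins:

```yaml
# Excerpt of a Kured DaemonSet spec — values are examples only.
containers:
  - name: kured
    image: ghcr.io/kubereboot/kured:1.15.0   # version illustrative
    args:
      - --period=1h                  # how often to check the sentinel
      - --reboot-days=mon,tue,wed,thu,fri
      - --start-time=02:00           # only reboot inside this window
      - --end-time=05:00
      - --time-zone=Europe/Helsinki
```

Check the actual manifest in manifests_v1/app-constructs/kured/ before assuming any of these values.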
- 🟢 Don't do anything for routine patching. Kured will work through the cluster over a few hours.
- 🟠 Kured reboots one node at a time. If a workload doesn't tolerate a brief eviction, fix the PDB, don't disable Kured.
- 🔴 Don't suppress Kured with `--reboot-days=` unless you have a compensating control for CVE latency.
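The "fix the PDB" fix usually looks like this — a PodDisruptionBudget that lets the drain evict pods while keeping a floor of replicas up. Names and labels here are hypothetical:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: php-app          # hypothetical workload
  namespace: default
spec:
  minAvailable: 1        # a drain may evict, as long as one replica survives
  selector:
    matchLabels:
      app: php-app
```

Note the interaction: with a single replica and `minAvailable: 1`, the drain blocks forever — run at least two replicas, or use `maxUnavailable: 1` instead.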
When you actually need to reboot manually
- An urgent CVE needs the patched kernel now — you don't want to wait the hour for Kured to notice.
- A node is in a weird state (kubelet up but Pods aren't scheduling, network interfaces flapping, systemd failures). A clean reboot is cheaper than debugging.
- You're rolling an RKE2 or CNI change that requires the kubelet to re-read config.
For "replace the VM entirely" (kernel upgrade, instance resize, bad hardware) don't reboot — destroy and recreate via Terraform. Cloud-init re-joins the replacement to the cluster. See Scale the cluster.
The drain-reboot-uncordon pattern
```bash
# 1. Drain — evict pods, keep DaemonSets, tolerate emptyDir
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. Reboot (whichever access path works)
ssh root@<node-public-ip> systemctl reboot

# 3. Wait for Ready
kubectl get nodes -w    # Ctrl-C when <node> shows Ready

# 4. Uncordon
kubectl uncordon <node>
```

Verify workloads rebalanced:

```bash
kubectl get pods -A -o wide | grep <node>
```

- 🟠 `--delete-emptydir-data` is fine for our workloads (PHP scratch, warmup script). Double-check before draining anything with an `emptyDir` that matters.
- 🔴 Don't forget the uncordon. A cordoned Ready node looks healthy at a glance and silently starves capacity.
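Step 3 can be scripted instead of watched. A sketch of the poll loop — here `kubectl` is stubbed with a shell function so the snippet runs offline; drop the stub on a real cluster, where the jsonpath expression queries the node's actual Ready condition:

```shell
#!/bin/sh
# Offline demo: stub kubectl so the loop terminates immediately.
kubectl() { echo "True"; }          # remove this stub against a real cluster

NODE=worker-1                       # assumption: name of the drained node
until kubectl get node "$NODE" \
        -o 'jsonpath={.status.conditions[?(@.type=="Ready")].status}' \
      | grep -q True; do
  sleep 5
done
kubectl uncordon "$NODE" >/dev/null
echo "$NODE is Ready"
```

This is handy when the reboot is part of a longer script; interactively, `kubectl get nodes -w` is fine.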
Masters: one at a time, always
The three control-plane nodes (ecnv4k8s-hel1-1..3) form a three-member etcd cluster with a quorum of two. You can lose one without impact; lose two and etcd drops below quorum, stops accepting writes, and the API server becomes read-only at best.
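The arithmetic behind "quorum of two": an n-member etcd cluster needs a majority of n/2 + 1 members (integer division), and tolerates the remainder as failures:

```shell
#!/bin/sh
# Quorum math for etcd: majority = n/2 + 1 (integer division).
n=3
quorum=$(( n / 2 + 1 ))
echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
# → members=3 quorum=2 tolerated_failures=1
```

This is also why even member counts buy nothing: four masters still tolerate only one failure, same as three.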
- 🟢 Reboot masters serially — wait for the `etcd-<node>` Pod in `kube-system` to report Ready before touching the next one.
- 🔴 Never reboot two masters concurrently, and never run destructive ops against two masters at once.
Emergency: Hetzner console hard reset
When both the kubelet and SSH are unreachable, the node is wedged below the OS level and only the hypervisor can act on it. Last resort:
- Open the Hetzner Cloud console, find the server.
- Under Power → Reset (or Shut down + Power on for a clean stop first).
- Watch the serial console output for boot errors while it comes up.
- 🔴 Hard reset is equivalent to pulling the power cord. File systems go through journal replay on boot; occasionally you lose some writes. Always try `systemctl reboot` first if you have any way in.
Legacy: the Ansible playbooks
The repo still carries ansible/playbooks/reboot.yml (coordinated rolling reboot via the reboot module) and ansible/playbooks/apt_update_and_upgrade.yml (cluster-wide apt update && dist-upgrade). These predate Kured.
Ansible playbooks are not actively maintained
They still exist under ansible/playbooks/ but aren't kept in sync with day-to-day practice. Verify the playbook matches the current cluster state before running, and prefer Terraform-based or operator-based paths (Kured, cluster-autoscaler) where they apply.
Practical guidance:
- `reboot.yml` — fallback only. Use when Kured's defaults don't fit and you need the reboot right now. Drain manually around it if PDBs are strict.
- `apt_update_and_upgrade.yml` — fallback only. Use for urgent user-space CVEs that can't wait for Kured. It does not reboot; if the upgrade installs a new kernel, Kured picks up the sentinel within the hour.
Run either from the ansible/ directory after unlocking Bitwarden for any secrets it needs, and read the source first.
Further reading
- Ansible playbooks reference — full playbook catalog
- Incident response — if the reboot was triggered by an alert
- k8s primer — drain/cordon/uncordon semantics
- Scale the cluster — replace-the-VM path