Reboot & patch

Default answer: you don't. Kured handles routine security patching on its own, draining nodes one at a time. In normal operations nobody runs reboot by hand. This page is for the edge cases where you do.

Kured — the automated path

kured (Kubernetes Reboot Daemon) runs as a DaemonSet on every node. Once per hour each pod checks /var/run/reboot-required on its host (Ubuntu creates that file after apt installs a kernel or library needing a restart). When it appears, Kured:

  1. Acquires a cluster-wide lock via a DaemonSet annotation (only one node reboots at a time).
  2. Drains the node (5-minute timeout).
  3. Reboots the host via nsenter into PID 1.
  4. Waits for the node to come back Ready, releases the lock, uncordons.

The manifests live in manifests_v1/app-constructs/kured/. Version, flags, and reboot windows are configured there — the Watchdog heartbeat + KubeNodeNotReady alert tell you if a reboot stalls.
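The hourly check is easy to reproduce by hand when you want to know whether a node has a pending reboot. A minimal sketch, assuming kured's default sentinel path (it's configurable via the `--reboot-sentinel` flag — check the manifests if in doubt):

```bash
# Check the same sentinel kured polls once per hour. The path is kured's
# default; pass a different one if the manifests override --reboot-sentinel.
needs_reboot() {
  local sentinel="${1:-/var/run/reboot-required}"
  [ -f "$sentinel" ]   # exit 0 = reboot pending, exit 1 = clean
}

# Across the cluster (node names are placeholders):
#   for n in worker-1 worker-2; do
#     ssh root@"$n" test -f /var/run/reboot-required && echo "$n: reboot pending"
#   done
```

To see which node currently holds the lock, inspect the annotations on the kured DaemonSet (`kubectl -n kube-system get ds kured -o jsonpath='{.metadata.annotations}'`) — the exact lock key depends on the kured version.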

  • 🟢 Don't do anything for routine patching. Kured will work through the cluster over a few hours.
  • 🟠 Kured reboots one node at a time. If a workload doesn't tolerate a brief eviction, fix the PDB, don't disable Kured.
  • 🔴 Don't suppress Kured with --reboot-days= unless you have a compensating control for CVE latency.
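Reboot windows and cadence are plain container flags on the kured DaemonSet. An illustration only — the flag names are kured's documented ones, but the values below are examples, not our configuration (the real args live in manifests_v1/app-constructs/kured/):

```yaml
# Illustrative kured args for a weekday maintenance window:
command:
  - /usr/bin/kured
  - --period=1h                       # how often each pod re-checks the sentinel
  - --reboot-days=mon,tue,wed,thu,fri
  - --start-time=02:00
  - --end-time=05:00
  - --time-zone=Europe/Berlin
```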

When you actually need to reboot manually

  • An urgent CVE needs the patched kernel now — you don't want to wait the hour for Kured to notice.
  • A node is in a weird state (kubelet up but Pods aren't scheduling, network interfaces flapping, systemd failures). A clean reboot is cheaper than debugging.
  • You're rolling an RKE2 or CNI change that requires the kubelet to re-read config.

For "replace the VM entirely" (kernel upgrade, instance resize, bad hardware) don't reboot — destroy and recreate via Terraform. Cloud-init re-joins the replacement to the cluster. See Scale the cluster.
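The replace path boils down to one Terraform command. A sketch under assumptions — the resource address `hcloud_server.worker` is hypothetical; confirm the real one with `terraform state list`:

```bash
# Replace (destroy + recreate) one node instead of rebooting it in place.
# The resource address below is an assumption, not our real state path.
node_index=2
cmd="terraform apply -replace=hcloud_server.worker[${node_index}]"
echo "$cmd"
# Cloud-init on the fresh VM re-joins it to the cluster; drain the old node
# first so workloads move off gracefully.
```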

The drain-reboot-uncordon pattern

bash
# 1. Drain — evict pods, keep DaemonSets, tolerate emptyDir
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. Reboot (whichever access path works)
ssh root@<node-public-ip> systemctl reboot

# 3. Wait for Ready
kubectl get nodes -w       # Ctrl-C when <node> shows Ready

# 4. Uncordon
kubectl uncordon <node>

Verify workloads rebalanced:

bash
kubectl get pods -A -o wide | grep <node>

  • 🟠 --delete-emptydir-data is fine for our workloads (PHP scratch, warmup script). Double-check before draining anything with an emptyDir that matters.
  • 🔴 Don't forget the uncordon. A cordoned Ready node looks healthy at a glance and silently starves capacity.
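Because a cordoned node still reports Ready, the missed uncordon is easy to overlook. A small sketch that flags any node left SchedulingDisabled — here fed a captured `kubectl get nodes` listing so it runs stand-alone:

```bash
# Print nodes that are Ready but still cordoned (i.e. someone forgot step 4).
find_cordoned() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Live cluster: kubectl get nodes | find_cordoned
# Sample captured output so the sketch runs without a cluster:
find_cordoned <<'EOF'
NAME       STATUS                     ROLES    AGE   VERSION
worker-1   Ready                      <none>   40d   v1.28.9
worker-2   Ready,SchedulingDisabled   <none>   40d   v1.28.9
EOF
# → worker-2
```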

Masters: one at a time, always

The three control-plane nodes (ecnv4k8s-hel1-1..3) form a three-member etcd cluster, so quorum is two. You can lose one without impact; lose two and quorum breaks — etcd and the API server go read-only.
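The quorum arithmetic, for the record — etcd needs a strict majority of members:

```bash
# etcd quorum for an n-member cluster: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

echo "3 members -> quorum $(quorum 3), tolerates $(( 3 - $(quorum 3) )) failure(s)"
# → 3 members -> quorum 2, tolerates 1 failure(s)
```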

  • 🟢 Reboot masters serially — wait for the etcd-<node> Pod in kube-system to report Ready before touching the next one.
  • 🔴 Never reboot two masters concurrently, and never run destructive ops against two masters at once.
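The serial loop can be scripted. A sketch under assumptions — SSH by node name as root is one of several access paths (substitute the public IP if needed), the etcd pod is named `etcd-<node>` as described above, and the sleep/timeout values are guesses to tune:

```bash
# Reboot masters strictly one at a time; gate on etcd health between each.
for node in ecnv4k8s-hel1-1 ecnv4k8s-hel1-2 ecnv4k8s-hel1-3; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  ssh root@"$node" systemctl reboot
  sleep 90   # give the reboot time to actually take the node down
  # Block until this master's etcd member is back before touching the next.
  kubectl -n kube-system wait "pod/etcd-${node}" --for=condition=Ready --timeout=15m
  kubectl uncordon "$node"
done
```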

Emergency: Hetzner console hard reset

When both the kubelet and SSH are unreachable, the node is wedged and the only lever left is the hypervisor. Last resort:

  1. Open the Hetzner Cloud console, find the server.
  2. Under Power, choose Reset (or Shut down followed by Power on for a clean stop first).
  3. Watch the serial console output for boot errors while it comes up.

  • 🔴 Hard reset is equivalent to pulling the power cord. File systems go through journal replay on boot; occasionally you lose some writes. Always try systemctl reboot first if you have any way in.
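If you have the hcloud CLI configured, the same console buttons are scriptable (server name is a placeholder; these are the CLI's standard server subcommands):

```bash
# Hard reset == pulling the plug; prefer the clean pair when possible.
hcloud server reset <server-name>            # last resort
# Cleaner alternative — ACPI shutdown, then power back on:
#   hcloud server shutdown <server-name> && hcloud server poweron <server-name>
```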

Legacy: the Ansible playbooks

The repo still carries ansible/playbooks/reboot.yml (coordinated rolling reboot via the reboot module) and ansible/playbooks/apt_update_and_upgrade.yml (cluster-wide apt update && dist-upgrade). These predate Kured.

Ansible playbooks are not actively maintained

They still exist under ansible/playbooks/ but aren't kept in sync with day-to-day practice. Verify the playbook matches the current cluster state before running, and prefer Terraform-based or operator-based paths (Kured, cluster-autoscaler) where they apply.

Practical guidance:

  • reboot.yml — fallback only. Use when Kured's defaults don't fit and you need the reboot right now. Drain manually around it if PDBs are strict.
  • apt_update_and_upgrade.yml — fallback only. Use for urgent user-space CVEs that can't wait for Kured. It does not reboot; if the upgrade installs a new kernel, Kured picks up the sentinel within the hour.

Run either from the ansible/ directory after unlocking Bitwarden for any secrets it needs, and read the source first.
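If you do fall back to the playbooks, the invocations look roughly like this — the `--limit` pattern is an assumption; check the inventory for the real host and group names:

```bash
cd ansible/
# Constrain the blast radius with --limit, and dry-run first.
ansible-playbook playbooks/reboot.yml --limit worker-1 --check
ansible-playbook playbooks/reboot.yml --limit worker-1
```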
