Disaster recovery runbook
June 19, 2026
The single source of truth for "how do I get the platform back" — covering single-node loss, full-cluster loss, and credential rotation. Designed so that with this repo + the off-cluster artifacts listed below + ~30 minutes of manual control-plane work, prod can be reconstructed to a state indistinguishable from the day before the incident.
RPO target: 24 h (daily snapshots). RTO target: 4 h (mostly slack for manual Hetzner / Cloudflare clicks; the automated portion is < 15 minutes in CI).
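The RPO target can be checked mechanically against the age of the newest backup. The sketch below is a minimal illustration, not part of the platform tooling: the function name `within_rpo` and the variable `RPO_SECONDS` are hypothetical, and converting a backup timestamp to epoch seconds is left to the caller (with GNU date, `date -d "<timestamp>" +%s`).

```shell
# Sketch: is the newest backup still inside the 24 h RPO window?
# RPO_SECONDS and within_rpo are illustrative names, not existing tooling.
RPO_SECONDS=$((24 * 60 * 60))

# Usage: within_rpo <backup-completion-epoch> [now-epoch]
# Exit status 0 if the backup age is between 0 and RPO_SECONDS, 1 otherwise.
within_rpo() {
  rpo_backup_epoch=$1
  rpo_now_epoch=${2:-$(date +%s)}
  rpo_age=$((rpo_now_epoch - rpo_backup_epoch))
  [ "$rpo_age" -ge 0 ] && [ "$rpo_age" -le "$RPO_SECONDS" ]
}

# Example (epoch of the last backup supplied by the operator):
#   within_rpo "$LAST_BACKUP_EPOCH" || echo "RPO breached: newest backup is older than 24 h"
```

A backup timestamp in the future is treated as a failure as well, since it usually indicates clock skew rather than a fresh backup.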
Off-cluster artifacts you must keep safe
The repo + these items are the entire seed for a rebuild. Lose all of these simultaneously and you cannot recover.
| Artifact | Where it lives | Recovery if lost |
|---|---|---|
| SOPS Age private keys (one per env) | Secure vault + offline backup | Re-encrypt all *.enc.yaml (below) |
| OpenBao unseal key + root token | openbao-unseal Secret (Velero-backed) + operator vault | Restore the openbao-unseal Secret from the most recent Velero backup; the paired raft snapshot lives on the vault-snapshots PVC and in the R2 openbao-snapshots/ mirror, and the vault-config Job restores it automatically (openbao.md scenarios 2-3); only if every copy is gone, re-initialize OpenBao and re-seed KV — existing encrypted data is then unrecoverable |
| Cloudflare R2 access keys | Secure vault | Mint new in Cloudflare; SOPS-update |
| Hetzner Cloud API token | Secure vault | Mint new in Hetzner Cloud console |
| Cloudflare API token | Secure vault | Mint new in Cloudflare dashboard |
Recommendation: store these in a shared vault accessible by at least one additional trusted operator, plus an offline copy in a second physical location. For the SOPS Age keys, a hardware-backed pair (two YubiKeys via age-plugin-yubikey) is the strongest configuration; see crypto-custody.md for the full design and per-artifact threat model.
CI deploy credentials (the KUBE_CONFIG and TALOS_CONFIG secrets in the GitHub prod environment) are deliberately not in the table above. They are derived from the cluster (regenerated by ksail cluster create), so losing them costs nothing permanent. But they go stale on every cluster rebuild (new API endpoint, new Talos PKI) and must be refreshed, or the prod deploy pipeline cannot connect. See Scenario 9 below.
Scenario 1 — Single node loss
Expected behaviour: PDBs keep every multi-replica workload serving traffic.
Re-scale workers or re-run ksail cluster update to replace the lost node.
# Inspect state
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get pdb -A # all should show ALLOWED-DISRUPTIONS=1
# Replace the failed node (re-runs Hetzner provisioning for missing members)
ksail --config ksail.prod.yaml cluster update
If any workload is stuck in Pending because all replicas were on the dead node and the PDB is blocking eviction on the new one, force a rollout:
kubectl -n <ns> rollout restart deployment/<name>
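To spot which PDBs are currently blocking evictions without reading the table by eye, the disruption budget can be filtered from machine-readable output. This is a minimal sketch: `blocked_pdbs` is a hypothetical helper, and it assumes tab-separated `namespace/name<TAB>disruptionsAllowed` lines on stdin (`.status.disruptionsAllowed` is the PodDisruptionBudget status field).

```shell
# Sketch: print every PDB whose disruption budget is exhausted.
# Input: "namespace/name<TAB>disruptionsAllowed" lines on stdin.
blocked_pdbs() {
  awk -F '\t' '$2 == "0" { print $1 }'
}

# Example feed (requires cluster access):
#   kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' \
#     | blocked_pdbs
```

Any name printed here is a candidate for the forced rollout above once a replacement node is Ready.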
Scenario 2 — Planned rolling Talos / Kubernetes upgrade
Talos OS and Kubernetes upgrades are driven by the version pins, not by the
ISO. Bump spec.cluster.talos.version (Renovate bumps it together with the
matching machine.install.image installer tag in
talos/cluster/install-image.yaml)
and/or spec.cluster.kubernetesVersion, then re-run ksail cluster update.
KSail performs an in-place rolling upgrade — one node at a time, workers
first, rebooting each node into the new installer image (Kubernetes upgrades
roll the static control-plane pods and kubelets); PDBs and maxUnavailable: 0
keep workloads available across the reboots.
The Hetzner iso field is not an upgrade lever: a change to it is applied
in-place and only affects newly provisioned nodes (autoscaler scale-ups and
full rebuilds boot from it). Bump it so new nodes come up on the new version,
but a stale iso does not block the in-place upgrade of the existing nodes.
(This runbook previously said to bump the ISO to roll nodes — that was never how
ksail cluster update upgrades existing nodes.)
# Pre-flight: confirm every multi-replica workload has a PDB
kubectl get pdb -A
# Pre-flight: confirm RollingUpdate strategy uses maxUnavailable: 0
kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}{end}'
# Apply the upgrade (in-place rolling Talos OS + Kubernetes upgrade)
ksail --config ksail.prod.yaml cluster update
If anything reports maxUnavailable other than 0, that workload was
either added without an HA configuration or has a chart limitation — fix
before upgrading.
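The maxUnavailable pre-flight can be reduced to a list of offenders. The helper below is a sketch under stated assumptions: `non_ha_deployments` is a hypothetical name, and it expects tab-separated `namespace/name<TAB>maxUnavailable` lines, where an empty value means the field is unset (for example a Recreate strategy).

```shell
# Sketch: list Deployments whose rollingUpdate.maxUnavailable is not 0.
# Input: "namespace/name<TAB>maxUnavailable" lines on stdin.
non_ha_deployments() {
  awk -F '\t' '$2 != "0" { printf "%s\t%s\n", $1, ($2 == "" ? "<unset>" : $2) }'
}

# Example feed (requires cluster access):
#   kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}{end}' \
#     | non_ha_deployments
```

Empty output means every Deployment passes the check; anything printed should be fixed or consciously accepted before the upgrade starts.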
If a rolling upgrade is interrupted (a node fails to rejoin), the cluster is left mixed: some nodes upgraded, some not. KSail releases before the fix in devantler-tech/ksail#5359 read the cluster's current version from a single node, so the next cluster update mis-reads the cluster as already upgraded and silently skips the laggards (the deploy stays green while the stragglers never move). Recover by upgrading each stuck node directly, one at a time, preserving etcd quorum:
# <schematic-id>:<version> is the installer image from talos/cluster/install-image.yaml
talosctl --nodes <node-ip> upgrade \
  --image factory.talos.dev/installer/<schematic-id>:<version>
Once the platform tracks a KSail release containing the fix, cluster update resumes interrupted upgrades on its own.
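Because the deploy can stay green while stragglers remain, it helps to list the nodes that are not yet on the target Talos version. A minimal sketch follows; `upgrade_stragglers` is a hypothetical helper, and it assumes the node's `.status.nodeInfo.osImage` has the form `Talos (vX.Y.Z)`, which should be confirmed against a real node before relying on it.

```shell
# Sketch: print nodes whose OS image does not report the target Talos version.
# Input: "node<TAB>osImage" lines on stdin.
# Assumption: osImage looks like "Talos (v1.10.0)".
upgrade_stragglers() {
  awk -F '\t' -v target="$1" 'index($2, "(" target ")") == 0 { print $1 }'
}

# Example feed (requires cluster access; the version is a placeholder):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' \
#     | upgrade_stragglers v1.10.0
```

Each node printed is a candidate for the direct talosctl upgrade, taken one at a time.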
Scenario 3 — etcd corruption / control-plane loss
With Omni retired, there is no managed etcd snapshot. Recovery path is full cluster rebuild (Scenario 4) followed by Velero + CNPG restores. This is an accepted trade-off documented in the migration decision: workload state lives in R2-backed Velero and CNPG backups; the control plane is a cattle resource that ksail can re-provision in < 15 min.
Scenario 4 — Full cluster rebuild from zero
The "everything is gone" path. ~10 min of Hetzner provisioning + ~15 min of Flux reconciliation.
One-button path: run the DR - Rebuild Prod workflow (.github/workflows/dr-rebuild.yaml, workflow_dispatch, confirmation phrase REBUILD-PROD). It executes every step below from the CI runner (cluster create, Flux convergence, the Velero resource restore, and the OpenBao raft-snapshot recovery described in openbao.md scenario 3) and needs none of the (stale-after-rebuild) KUBE_CONFIG / TALOS_CONFIG secrets, because ksail cluster create writes fresh configs on the runner. The manual procedure below is the fallback when GitHub Actions itself is unavailable.
# 1. Set credentials locally
export HCLOUD_TOKEN=<hetzner-cloud-api-token>
export GHCR_TOKEN=<ghcr-pat-with-packages-read-write>
export SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt # points at the env's Age key
# 2. Boot a fresh cluster (ksail handles Talos boot, CCM, CSI, kubeconfig)
ksail --config ksail.prod.yaml cluster create
# 3. Bootstrap Flux from this repo
ksail --config ksail.prod.yaml workload push # packages -> GHCR
ksail --config ksail.prod.yaml workload reconcile # Flux pulls and applies
# 4. Wait for Flux to settle
flux get kustomizations -A
# Re-run if any are NotReady; expect convergence in 10-15 minutes
# 4b. ONLY if the OpenBao raft-snapshot recovery was impossible (no snapshot
# in R2 — the vault came up fresh): re-feed the user-fed secrets that
# SOPS deliberately does not seed (see the header of
# k8s/bases/infrastructure/vault-seed/push-secrets.yaml). Until then,
# cert-manager DNS01, external-dns, and fleetdm stay pending:
kubectl -n openbao exec openbao-0 -- \
bao kv put secret/infrastructure/dns/cloudflare api_token=<cloudflare-token>
kubectl -n openbao exec openbao-0 -- \
bao kv put secret/apps/fleetdm/license license-key=<fleet-license-jwt>
# 5. DNS — normally NO manual step: external-dns (hetzner overlay,
# policy: sync, gateway-httproute source) repoints the Cloudflare
# records at the new load balancer automatically once the HTTPRoutes
# are Ready and its Cloudflare token has re-synced from the vault.
# Verify, and only intervene if external-dns itself is broken:
kubectl -n external-dns logs deploy/external-dns | tail -20
kubectl -n kube-system get svc cilium-gateway-platform \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# Fallback only: update A/AAAA records for ${domain} at your DNS provider.
# 6. Restore Velero backups (apps + PVCs)
kubectl -n velero create -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: rebuild-$(date +%s)
  namespace: velero
spec:
  backupName: <pick-latest-from-velero-backup-get>
  includedNamespaces:
    - "*"
  excludedNamespaces:
    - kube-system
    - velero
EOF
# 7. (If any CNPG Cluster exists) restore from R2
kubectl cnpg restore <new-cluster-name> \
--backup <backup-name> \
--target-time '<RFC3339-timestamp-or-omit-for-latest>'
If this is the first time restoring after losing the SOPS keys, replace step 3 with the rotation flow in Scenario 6 first.
After a rebuild the cluster has a new API endpoint and a new Talos PKI, so the prod environment's KUBE_CONFIG / TALOS_CONFIG secrets are now stale. Refresh them per Scenario 9 before relying on the automated deploy pipeline, otherwise ksail cluster update in CI cannot reach the cluster.
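Step 6 leaves backupName as a placeholder to fill in by hand. One way to pick it is to sort the backups by completion time and keep the newest one that finished cleanly. The sketch below is illustrative: `latest_completed_backup` is a hypothetical helper, and it relies on Kubernetes timestamps being RFC 3339 in UTC, so that plain lexical sorting is also chronological.

```shell
# Sketch: print the name of the newest Velero backup in phase "Completed".
# Input: "completionTimestamp<TAB>name<TAB>phase" lines on stdin.
latest_completed_backup() {
  awk -F '\t' '$3 == "Completed" && $1 != "" { print $1 "\t" $2 }' \
    | LC_ALL=C sort \
    | tail -n 1 \
    | cut -f 2
}

# Example feed (requires cluster access):
#   kubectl -n velero get backups.velero.io -o jsonpath='{range .items[*]}{.status.completionTimestamp}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#     | latest_completed_backup
```

Backups in any other phase (for example PartiallyFailed) are skipped on purpose; restoring from one of those should be a deliberate decision.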
Scenario 5 — Velero / CNPG restore (single namespace or app)
Quick path for "I deleted the wrong PVC" or "this Postgres database needs to roll back to last night".
# Find the relevant backup
kubectl -n velero get backups
velero backup get # if velero CLI installed locally
# Namespace restore
kubectl -n velero create -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ns-restore-$(date +%s)
  namespace: velero
spec:
  backupName: daily-full-<date>
  includedNamespaces: ["<your-ns>"]
EOF
# CNPG point-in-time recovery (PITR is "free" once WAL archiving is on)
kubectl cnpg restore <new-cluster-name> \
--source-cluster <old-cluster> \
--target-time '2026-04-17T22:00:00Z'
Cross-provider / cross-distribution restore (StorageClass mapping). The backup data in R2 is storage-agnostic (Kopia repository), so there is no Longhorn/Hetzner dependency at the destination — this is what makes a GitOps migration to another distribution/provider possible. Two things to handle on the target cluster:
StorageClass names. Velero recreates each PVC with the same StorageClass name it had on the source (longhorn, hcloud). If the destination's classes are named differently (e.g. gp3, standard, local-path), those PVCs stay Pending. Map them before restoring with a velero.io/change-storage-class ConfigMap in the velero namespace (the value-less velero.io/plugin-config label is what Velero uses to discover plugin configuration):
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  longhorn: <target-default-storageclass>
  hcloud: <target-default-storageclass>
Restore-side capabilities. The destination Velero needs the node-agent (DaemonSet) to rehydrate Kopia data, both for FSB backups (openbao/hcloud) and for data-mover (Longhorn CSI) backups. For the data-mover backups it also needs features: EnableCSI; the target does not need Longhorn or any CSI-snapshot support of its own (the data is replayed into a fresh PVC by Kopia, per Velero's CSI snapshot data-movement restore).
Scenario 6 — SOPS Age key rotation
# 1. Generate a new key
age-keygen -o new-key.txt
NEW_PUB=$(grep '^# public key' new-key.txt | cut -d: -f2 | tr -d ' ')
# 2. Add the new pub key as a recipient *before* removing the old one
# (gives you a window where both keys can decrypt).
yq -i ".creation_rules[].age += \",\n$NEW_PUB\"" .sops.yaml
# 3. Re-encrypt every SOPS file with the new recipient list
find . -name '*.enc.yaml' -print0 | xargs -0 -n1 sops updatekeys --yes
# 4. Commit + merge. Verify Flux still decrypts (no errors in
# flux-system pods).
# 5. Rotate the new key into your secret store, distribute to operators.
# 6. Once everyone is on the new key, drop the old one from .sops.yaml
# and re-run sops updatekeys --yes one more time.
# 7. Securely destroy old-key.txt copies.
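Before dropping the old key in step 6, it is worth confirming that every encrypted file already lists the new recipient. The following is a minimal sketch: `files_missing_recipient` is a hypothetical helper, and it assumes SOPS's YAML metadata layout, in which each Age recipient appears on a `recipient: age1...` line.

```shell
# Sketch: list *.enc.yaml files that do NOT yet contain the given Age recipient.
# Usage: files_missing_recipient <age-public-key> [directory]
# Assumption: SOPS metadata stores recipients as "recipient: <public key>".
files_missing_recipient() {
  fmr_recipient=$1
  fmr_dir=${2:-.}
  find "$fmr_dir" -name '*.enc.yaml' -type f \
    -exec grep -L -F -e "recipient: $fmr_recipient" {} +
}

# Example, reusing NEW_PUB from step 1:
#   files_missing_recipient "$NEW_PUB" .
```

Empty output means every file was re-encrypted for the new key; any path printed still needs sops updatekeys.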
Scenario 7 — R2 / Cloudflare credential rotation
# 1. Mint a new R2 token in the Cloudflare dashboard (scoped to your
# <your-bucket> bucket only). DO NOT revoke the old one
# yet -- there is a window where both must work.
# 2. Update the encrypted secret in-place. The R2 keys are per-environment
# and live in the CLUSTER secret (variables-cluster), not the shared base.
sops --set '["stringData"]["r2_access_key_id"] "<new-id>"' \
k8s/clusters/prod/bootstrap/variables-cluster-secret.enc.yaml
sops --set '["stringData"]["r2_secret_access_key"] "<new-secret>"' \
k8s/clusters/prod/bootstrap/variables-cluster-secret.enc.yaml
# (repeat for k8s/clusters/local/bootstrap/ if rotating the local creds)
# 3. PR + merge. Flux propagates within one reconciliation cycle, and the
# hourly seed-r2-credentials PushSecret refreshes infrastructure/backup/r2
# in OpenBao, from where the Velero/CNPG ExternalSecrets re-sync.
# 4. Wait one Velero schedule + one CNPG WAL archive cycle to confirm
# the new credentials work end-to-end.
kubectl -n velero get backups.velero.io -w
kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg --tail=50
# 5. Revoke the old token in Cloudflare.
The Cloudflare API token (DNS01 + external-dns) is user-fed, not in SOPS — rotate it with a single vault write; the consuming ExternalSecrets re-sync within their 1h refresh interval:
kubectl -n openbao exec openbao-0 -- \
bao kv put secret/infrastructure/dns/cloudflare api_token=<new-token>
Encryption-at-rest verification
Run after any node replacement to confirm secrets are still ciphertext on disk.
# Pull a fresh etcd snapshot via talosctl
talosctl --nodes <cp-node> etcd snapshot /tmp/etcd.snapshot
# Inspect a Secret -- must NOT be plain text
etcdctl --endpoints unix:///tmp/etcd.snapshot \
get --prefix /registry/secrets/ | head -c 200
# Expect bytes that look like cipher (binary garbage). If you see
# Kubernetes Secret YAML, the EncryptionConfiguration was lost.
This check is deliberately not part of the CI restore drill — Talos verifies the encryption key at install time, so a CI assertion would add complexity for a structurally-enforced property (see restore-drill.md for the full rationale).
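As a rough offline alternative, the snapshot file can be scanned directly instead of being queried through an etcd endpoint. The sketch below is an assumption-laden heuristic, not a verified procedure: `secret_encryption_summary` is a hypothetical helper, and it relies on the API server prefixing encrypted values with `k8s:enc:<provider>:...` as described in the Kubernetes encryption-at-rest documentation. Because etcd keeps historical revisions, the counts are a signal rather than an exact tally.

```shell
# Sketch: count Secret keys and encryption markers in an etcd snapshot file.
# Usage: secret_encryption_summary <snapshot-file>
# Assumption: encrypted values carry the "k8s:enc:" prefix.
secret_encryption_summary() {
  ses_snapshot=$1
  ses_keys=$(grep -a -o '/registry/secrets/' "$ses_snapshot" | wc -l | tr -d ' ')
  ses_enc=$(grep -a -o 'k8s:enc:' "$ses_snapshot" | wc -l | tr -d ' ')
  echo "secret-keys=$ses_keys encrypted-markers=$ses_enc"
}

# Example:
#   secret_encryption_summary /tmp/etcd.snapshot
```

Many Secret keys alongside zero encryption markers would suggest the EncryptionConfiguration is no longer in effect and warrants a closer look.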
Local clusters
Local clusters are ephemeral and reconstructed from this repo on every
ksail cluster create. There is nothing meaningful to back up — the
restore procedure for local is:
ksail cluster delete
ksail cluster create
ksail workload push && ksail workload reconcile
CI exercises this on every PR (.github/workflows/ci.yaml), and also
exercises a Velero backup → restore against the in-cluster
MinIO so the prod code path is regression-tested.
Scenario 8 — Cluster Autoscaler issues
Autoscaler not scaling up
# Check for pending pods
kubectl get pods -A --field-selector=status.phase=Pending
# Inspect autoscaler logs
kubectl -n kube-system logs -l app.kubernetes.io/name=cluster-autoscaler --tail=200
# Check status ConfigMap
kubectl -n kube-system get cm cluster-autoscaler-status -o yaml
Common causes:
- Pool maxSize reached: increase max under the relevant pool in ksail.prod.yaml, then run ksail --config ksail.prod.yaml cluster update
- HCLOUD_TOKEN expired: rotate in SOPS secrets and GitHub environment secrets
Orphaned autoscaler nodes after cluster delete
ksail cluster delete may not remove servers created by the Cluster
Autoscaler. Clean up manually:
hcloud server list --selector cluster.autoscaler.nodeGroupLabel
# Delete each orphaned server
hcloud server delete <server-id>
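When several orphans are left behind, a dry-run plan is safer than deleting by hand. This sketch only prints the commands it would run: `orphan_delete_plan` is a hypothetical helper, and the `-o noheader -o columns=id` output flags in the example are an assumption about the hcloud CLI that should be checked against `hcloud server list --help`.

```shell
# Sketch: turn a list of server IDs into reviewable delete commands (dry run).
# Input: one server ID per line on stdin; non-numeric lines are skipped.
orphan_delete_plan() {
  while read -r odp_id _; do
    case $odp_id in
      '' | *[!0-9]*) echo "skipping non-numeric id: $odp_id" >&2 ;;
      *) echo "hcloud server delete $odp_id" ;;
    esac
  done
}

# Example feed (output flags are an assumption, see above):
#   hcloud server list --selector cluster.autoscaler.nodeGroupLabel -o noheader -o columns=id \
#     | orphan_delete_plan
```

Review the printed commands, then pipe them to `sh` or run them individually once the list looks right.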
Autoscaler node not joining cluster
# Check if the server was created in Hetzner
hcloud server list
# If the server exists but node doesn't appear in kubectl:
# The worker machine config may be invalid or stale.
# Re-run cluster update to regenerate worker config and re-apply:
ksail --config ksail.prod.yaml cluster update
Scenario 9 — Refresh CI deploy credentials after a cluster rebuild
The DR - Rebuild Prod workflow refreshes both secrets automatically at the end of a rebuild if a DR_GH_ADMIN_TOKEN secret (a fine-grained PAT with environment-secrets write on this repo) is configured; without it the workflow prints a warning and the manual procedure below applies.
The prod deploy pipeline (the merge-queue deploy-prod job in ci.yaml, and
the manual .github/workflows/cd.yaml) authenticates to the cluster with
two GitHub prod environment secrets:
| Secret | Restored to | Used by |
|---|---|---|
| KUBE_CONFIG | ~/.kube/config | ksail cluster update drift detection (kube API) |
| TALOS_CONFIG | ~/.talos/config | ksail cluster update machine-config / secret sync |
After a full rebuild (Scenario 4) the API endpoint and Talos PKI change,
so both secrets are stale. Symptom in CI: the
🩺 Verify prod cluster is reachable preflight fails, or
ksail cluster update reports a bogus "N configuration changes" plan (every
component shows Default/Disabled/None) and then fails with
connection refused / x509: certificate signed by unknown authority "talos".
# Run from a machine that can reach the rebuilt cluster (e.g. the homelab),
# after `ksail cluster create` has written fresh local configs.
# 1. Locate the fresh configs ksail produced during the rebuild:
# kubeconfig — ~/.kube/config (context: admin@prod)
# talosconfig — ~/.talos/config
# Sanity-check they point at the new cluster:
kubectl --context admin@prod get --raw='/readyz' # -> ok
talosctl --talosconfig ~/.talos/config version # -> talks to the nodes
# 2. Push both into the GitHub `prod` environment secrets used by CI/CD.
# (Drop `--env prod` if these are configured as repo-level secrets.)
gh secret set KUBE_CONFIG --env prod --repo devantler-tech/platform < ~/.kube/config
gh secret set TALOS_CONFIG --env prod --repo devantler-tech/platform < ~/.talos/config
# 3. Re-trigger the deploy (push/re-tag, or re-queue the PR). The reachability
# preflight should pass and `ksail cluster update` should report no spurious
# component changes.
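A cheap guard before pushing the secret is to confirm that the kubeconfig file really targets the rebuilt cluster's API endpoint. The helpers below are a sketch: `kubeconfig_servers` and `kubeconfig_points_at` are hypothetical names, the parsing is a plain text match on `server:` lines rather than a YAML parser, and the comparison assumes the file contains a single cluster entry.

```shell
# Sketch: extract the API server URL(s) from a kubeconfig file.
# Usage: kubeconfig_servers <kubeconfig-path>
kubeconfig_servers() {
  awk '$1 == "server:" { print $2 }' "$1"
}

# Usage: kubeconfig_points_at <kubeconfig-path> <expected-server-url>
# Exit status 0 only if the file's single server entry matches the expected URL.
kubeconfig_points_at() {
  [ "$(kubeconfig_servers "$1")" = "$2" ]
}

# Example (the endpoint is a placeholder for the new API address):
#   kubeconfig_points_at ~/.kube/config "https://<new-api-endpoint>:6443" \
#     || echo "kubeconfig still points at the old cluster; do not push it"
```

With cluster access, `kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'` gives the same value for the active context.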
Upstream context: ksail currently warns (rather than fails) when the cluster is unreachable and then proposes a full reinstall from a default baseline — tracked in devantler-tech/ksail#4868 and #4869. The preflight step in the deploy workflows is the local guard against that behaviour until those land.
Related documents
- Cryptographic custody — per-artifact threat model for SOPS keys, Talos PKI, OpenBao seal, cosign identity
- OpenBao DR — seal/unseal, root-token rotation, restore from Velero
- Node autoscaling — architecture, prerequisites, and troubleshooting
- Velero + CNPG → R2 — application/PV backups
- Alerting — automated detection of backup failures