Disaster recovery runbook

June 19, 2026

The single source of truth for "how do I get the platform back" — covering single-node loss, full-cluster loss, and credential rotation. Designed so that with this repo + the off-cluster artifacts listed below + ~30 minutes of manual control-plane work, prod can be reconstructed to a state indistinguishable from the day before the incident.

RPO target: 24 h (daily snapshots). RTO target: 4 h (mostly slack for manual Hetzner / Cloudflare clicks; the automated portion is < 15 minutes in CI).

Off-cluster artifacts you must keep safe

The repo + these items are the entire seed for a rebuild. Lose all of these simultaneously and you cannot recover.

Artifact	Where it lives	Recovery if lost
SOPS Age private keys (one per env)	Secure vault + offline backup	Re-encrypt all `*.enc.yaml` (below)
OpenBao unseal key + root token	`openbao-unseal` Secret (Velero-backed) + operator vault	Restore the `openbao-unseal` Secret from the most recent Velero backup; the paired raft snapshot lives on the `vault-snapshots` PVC and in the R2 `openbao-snapshots/` mirror, and the `vault-config` Job restores it automatically (openbao.md scenarios 2-3); only if every copy is gone, re-initialize OpenBao and re-seed KV — existing encrypted data is then unrecoverable
Cloudflare R2 access keys	Secure vault	Mint new in Cloudflare; SOPS-update
Hetzner Cloud API token	Secure vault	Mint new in Hetzner Cloud console
Cloudflare API token	Secure vault	Mint new in Cloudflare dashboard

Recommendation: store these in a shared vault accessible by at least one additional trusted operator, plus an offline copy in a second physical location. For the SOPS Age keys, a hardware-backed pair (two YubiKeys via age-plugin-yubikey) is the strongest configuration; see crypto-custody.md for the full design and per-artifact threat model.

CI deploy credentials — the KUBE_CONFIG and TALOS_CONFIG secrets in the GitHub prod environment — are deliberately not in the table above. They are derived from the cluster (regenerated by ksail cluster create), so losing them costs nothing permanent. But they go stale on every cluster rebuild (new API endpoint, new Talos PKI) and must be refreshed, or the prod deploy pipeline cannot connect. See Scenario 9 below.

Scenario 1 — Single node loss

Expected behaviour: PDBs keep every multi-replica workload serving traffic. Re-scale workers or re-run ksail cluster update to replace the lost node.

# Inspect state
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get pdb -A    # ALLOWED DISRUPTIONS should be >= 1 for every PDB

# Replace the failed node (re-runs Hetzner provisioning for missing members)
ksail --config ksail.prod.yaml cluster update

If a workload is stuck in Pending because all of its replicas were on the dead node and the PDB is blocking eviction on the replacement node, force a rollout:

kubectl -n <ns> rollout restart deployment/<name>
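With many PDBs the table output is easy to misread. A minimal sketch that prints only the budgets currently allowing zero disruptions; the function name `pdbs_blocking` is hypothetical, and it assumes the `status.disruptionsAllowed` field of the policy/v1 PodDisruptionBudget:

```shell
# Hypothetical helper: list PDBs that currently allow no disruptions.
# Prints "<namespace>/<name>" per blocking PDB; prints nothing if none block.
pdbs_blocking() {
  kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' |
    awk -F'\t' '$2 == "0" { print $1 }'
}
```

Run `pdbs_blocking` after the node is replaced; any workload it lists is a candidate for the rollout restart above.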

Scenario 2 — Planned rolling Talos / Kubernetes upgrade

Talos OS and Kubernetes upgrades are driven by the version pins, not by the ISO. Bump spec.cluster.talos.version (Renovate bumps it together with the matching machine.install.image installer tag in talos/cluster/install-image.yaml) and/or spec.cluster.kubernetesVersion, then re-run ksail cluster update. KSail performs an in-place rolling upgrade — one node at a time, workers first, rebooting each node into the new installer image (Kubernetes upgrades roll the static control-plane pods and kubelets); PDBs and maxUnavailable: 0 keep workloads available across the reboots.

The Hetzner iso field is not an upgrade lever: a change to it is applied in-place and only affects newly provisioned nodes (autoscaler scale-ups and full rebuilds boot from it). Bump it so new nodes come up on the new version, but a stale iso does not block the in-place upgrade of the existing nodes. (This runbook previously said to bump the ISO to roll nodes — that was never how ksail cluster update upgrades existing nodes.)

# Pre-flight: confirm every multi-replica workload has a PDB
kubectl get pdb -A

# Pre-flight: confirm RollingUpdate strategy uses maxUnavailable: 0
kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}{end}'

# Apply the upgrade (in-place rolling Talos OS + Kubernetes upgrade)
ksail --config ksail.prod.yaml cluster update

If anything reports maxUnavailable other than 0, that workload was either added without an HA configuration or has a chart limitation — fix before upgrading.
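On a large cluster it helps to print only the offenders. A minimal sketch, assuming the same jsonpath fields as the pre-flight above; the function name `non_ha_deployments` is hypothetical. It treats anything other than a literal 0 as an offender, so a value such as 0% is also listed for manual review:

```shell
# Hypothetical helper: print Deployments whose maxUnavailable is not 0.
# An empty value means the Deployment has no rollingUpdate settings
# (for example strategy type Recreate); it is reported as <unset>.
non_ha_deployments() {
  kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}{end}' |
    awk -F'\t' '$2 != "0" { printf "%s\t%s\n", $1, ($2 == "" ? "<unset>" : $2) }'
}
```

Empty output from `non_ha_deployments` means the pre-flight passed.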

If a rolling upgrade is interrupted (a node fails to rejoin), the cluster is left mixed: some nodes upgraded, some not. KSail releases before the fix in devantler-tech/ksail#5359 read the cluster's current version from a single node, so the next cluster update mis-reads the cluster as already upgraded and silently skips the laggards (the deploy stays green while the stragglers never move). Recover by upgrading each stuck node directly, one at a time, preserving etcd quorum:

# <schematic-id>:<version> is the installer image from talos/cluster/install-image.yaml
talosctl --nodes <node-ip> upgrade \
  --image factory.talos.dev/installer/<schematic-id>:<version>

Once the platform tracks a KSail release containing the fix, cluster update resumes interrupted upgrades on its own.
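To find the stuck nodes, compare each node's reported OS image with the target version. A minimal sketch; the function name `nodes_not_on` is hypothetical, and it assumes Talos reports `status.nodeInfo.osImage` in the form `Talos (<version>)`:

```shell
# Hypothetical helper: print nodes whose OS image is not the target Talos version.
# ASSUMPTION: osImage looks like "Talos (v1.10.4)"; adjust the comparison
# if your nodes report a different format.
nodes_not_on() {
  target="$1"
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' |
    awk -F'\t' -v want="Talos ($target)" '$2 != want { print $1 }'
}
```

For example, `nodes_not_on v1.10.4` lists the laggards. Upgrade them one at a time with the talosctl command above, waiting for each node to return to Ready before starting the next.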

Scenario 3 — etcd corruption / control-plane loss

With Omni retired, there is no managed etcd snapshot. Recovery path is full cluster rebuild (Scenario 4) followed by Velero + CNPG restores. This is an accepted trade-off documented in the migration decision: workload state lives in R2-backed Velero and CNPG backups; the control plane is a cattle resource that ksail can re-provision in < 15 min.

Scenario 4 — Full cluster rebuild from zero

The "everything is gone" path. ~10 min of Hetzner provisioning + ~15 min of Flux reconciliation.

One-button path: run the DR - Rebuild Prod workflow (.github/workflows/dr-rebuild.yaml, workflow_dispatch, confirmation phrase REBUILD-PROD). It executes every step below from the CI runner — cluster create, Flux convergence, the Velero resource restore, and the OpenBao raft-snapshot recovery (openbao.md scenario 3) — and needs none of the (stale-after-rebuild) KUBE_CONFIG/TALOS_CONFIG secrets, because ksail cluster create writes fresh configs on the runner. The manual procedure below is the fallback when GitHub Actions itself is unavailable.

# 1. Set credentials locally
export HCLOUD_TOKEN=<hetzner-cloud-api-token>
export GHCR_TOKEN=<ghcr-pat-with-packages-read-write>
export SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt  # points at the env's Age key

# 2. Boot a fresh cluster (ksail handles Talos boot, CCM, CSI, kubeconfig)
ksail --config ksail.prod.yaml cluster create

# 3. Bootstrap Flux from this repo
ksail --config ksail.prod.yaml workload push       # packages -> GHCR
ksail --config ksail.prod.yaml workload reconcile  # Flux pulls and applies

# 4. Wait for Flux to settle
flux get kustomizations -A
# Re-run if any are NotReady; expect convergence in 10-15 minutes

# 4b. ONLY if the OpenBao raft-snapshot recovery was impossible (no snapshot
#     in R2 — the vault came up fresh): re-feed the user-fed secrets that
#     SOPS deliberately does not seed (see the header of
#     k8s/bases/infrastructure/vault-seed/push-secrets.yaml). Until then,
#     cert-manager DNS01, external-dns, and fleetdm stay pending:
kubectl -n openbao exec openbao-0 -- \
  bao kv put secret/infrastructure/dns/cloudflare api_token=<cloudflare-token>
kubectl -n openbao exec openbao-0 -- \
  bao kv put secret/apps/fleetdm/license license-key=<fleet-license-jwt>

# 5. DNS — normally NO manual step: external-dns (hetzner overlay,
#    policy: sync, gateway-httproute source) repoints the Cloudflare
#    records at the new load balancer automatically once the HTTPRoutes
#    are Ready and its Cloudflare token has re-synced from the vault.
#    Verify, and only intervene if external-dns itself is broken:
kubectl -n external-dns logs deploy/external-dns | tail -20
kubectl -n kube-system get svc cilium-gateway-platform \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# Fallback only: update A/AAAA records for ${domain} at your DNS provider.

# 6. Restore Velero backups (apps + PVCs)
kubectl -n velero create -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: rebuild-$(date +%s)
  namespace: velero
spec:
  backupName: <pick-latest-from-velero-backup-get>
  includedNamespaces:
    - "*"
  excludedNamespaces:
    - kube-system
    - velero
EOF

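# 7. (If any CNPG Cluster exists) restore from R2. CNPG recovers by
#    bootstrapping a NEW Cluster from the object store rather than
#    restoring in place: apply a Cluster manifest whose
#    spec.bootstrap.recovery.source points at the R2 backup (add
#    recoveryTarget.targetTime as an RFC3339 timestamp, or omit for latest).
kubectl apply -f <recovery-cluster.yaml>
kubectl cnpg status <new-cluster-name>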

If this is the first time restoring after losing the SOPS keys, replace step 3 with the rotation flow in Scenario 6 first.

After a rebuild the cluster has a new API endpoint and a new Talos PKI, so the prod environment's KUBE_CONFIG / TALOS_CONFIG secrets are now stale. Refresh them per Scenario 9 before relying on the automated deploy pipeline, otherwise ksail cluster update in CI cannot reach the cluster.
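Step 6 needs the name of the most recent usable backup. A minimal sketch for picking it, assuming the Backup objects expose `status.phase` and `status.completionTimestamp` (RFC3339 in UTC, so a plain text sort orders them by time); the function name `latest_completed_backup` is hypothetical:

```shell
# Hypothetical helper: print the name of the newest Velero backup whose
# phase is Completed. Partially failed or in-progress backups are skipped.
latest_completed_backup() {
  kubectl -n velero get backups.velero.io -o jsonpath='{range .items[*]}{.status.completionTimestamp}{"\t"}{.status.phase}{"\t"}{.metadata.name}{"\n"}{end}' |
    awk -F'\t' '$2 == "Completed" { print $1 "\t" $3 }' |
    sort | tail -n 1 | cut -f2
}
```

The result can be substituted for the backupName placeholder in the Restore manifest, for example `backupName: $(latest_completed_backup)` inside the heredoc.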

Scenario 5 — Velero / CNPG restore (single namespace or app)

Quick path for "I deleted the wrong PVC" or "this Postgres database needs to roll back to last night".

# Find the relevant backup
kubectl -n velero get backups
velero backup get   # if velero CLI installed locally

# Namespace restore
kubectl -n velero create -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ns-restore-$(date +%s)
  namespace: velero
spec:
  backupName: daily-full-<date>
  includedNamespaces: ["<your-ns>"]
EOF

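# CNPG point-in-time recovery (PITR is "free" once WAL archiving is on).
# CNPG recovers by bootstrapping a NEW Cluster from the object store, not
# in place: apply a Cluster manifest whose spec.bootstrap.recovery.source
# names the old cluster and whose recoveryTarget.targetTime is the
# rollback point, e.g. '2026-04-17T22:00:00Z'.
kubectl -n <your-ns> apply -f <recovery-cluster.yaml>
kubectl cnpg status -n <your-ns> <new-cluster-name>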
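CloudNativePG performs recovery by bootstrapping a new Cluster from the archived base backup and WAL. A minimal sketch of such a manifest, rendered by a hypothetical `render_cnpg_recovery` function. The bucket path, endpoint, Secret name `r2-credentials` and its keys, instance count and storage size are all assumptions, not values from this repo, and it assumes the in-tree `barmanObjectStore` configuration rather than the Barman Cloud plugin:

```shell
# Sketch only: render a recovery Cluster manifest to stdout.
# Usage: render_cnpg_recovery <new-name> <old-cluster-name> [target-time]
# ASSUMPTIONS: bucket path, endpoint, Secret name/keys, instances, size.
render_cnpg_recovery() {
  new_name="$1"; source_name="$2"; target_time="${3:-}"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${new_name}
spec:
  instances: 3
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: ${source_name}
EOF
  # Only emit a recovery target when a timestamp was given; without one,
  # recovery replays WAL to the latest available point.
  if [ -n "$target_time" ]; then
    cat <<EOF
      recoveryTarget:
        targetTime: "${target_time}"
EOF
  fi
  cat <<EOF
  externalClusters:
    - name: ${source_name}
      barmanObjectStore:
        destinationPath: s3://<your-bucket>/<path>
        endpointURL: https://<account-id>.r2.cloudflarestorage.com
        s3Credentials:
          accessKeyId:
            name: r2-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: r2-credentials
            key: ACCESS_SECRET_KEY
EOF
}
```

For example, `render_cnpg_recovery pg-restored pg-main '2026-04-17T22:00:00Z' | kubectl -n <your-ns> apply -f -`. The new Cluster must use a different name from the old one, and the externalClusters entry keeps the old name so the archive is found under its original server name.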

Cross-provider / cross-distribution restore (StorageClass mapping). The backup data in R2 is storage-agnostic (Kopia repository), so there is no Longhorn/Hetzner dependency at the destination; this is what makes a GitOps migration to another distribution/provider possible. Two things to handle on the target cluster:

StorageClass names. Velero recreates each PVC with the same StorageClass name it had on the source (longhorn, hcloud). If the destination's classes are named differently (e.g. gp3, standard, local-path), those PVCs stay Pending. Map them before restoring with a velero.io/change-storage-class ConfigMap in the velero namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/change-storage-class: RestoreItemAction
data:
  longhorn: <target-default-storageclass>
  hcloud: <target-default-storageclass>

Restore-side capabilities. The destination Velero needs the node-agent (DaemonSet) to rehydrate Kopia data, both for FSB backups (openbao/hcloud) and for data-mover (Longhorn CSI) backups. For the data-mover backups it also needs features: EnableCSI; the target does not need Longhorn or any CSI-snapshot support of its own (the data is replayed into a fresh PVC by Kopia, per Velero's CSI snapshot data-movement restore).
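The StorageClass mapping can be generated instead of hand-written. A minimal sketch with two hypothetical helpers: `source_storage_classes` lists the class names PVCs actually use (run it against the source cluster while it still exists), and `render_sc_map` renders the ConfigMap mapping each of them to one target class:

```shell
# Hypothetical helper: distinct StorageClass names used by PVCs.
source_storage_classes() {
  kubectl get pvc -A -o jsonpath='{range .items[*]}{.spec.storageClassName}{"\n"}{end}' |
    awk 'NF' | sort -u
}

# Hypothetical helper: render the change-storage-class ConfigMap.
# Usage: render_sc_map <target-class> <source-class>...
render_sc_map() {
  target="$1"; shift
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/change-storage-class: RestoreItemAction
data:
EOF
  for src in "$@"; do
    printf '  %s: %s\n' "$src" "$target"
  done
}
```

For example, capture the names on the source with `source_storage_classes`, then on the target run `render_sc_map local-path longhorn hcloud | kubectl apply -f -` before creating the Restore.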

Scenario 6 — SOPS Age key rotation

# 1. Generate a new key
age-keygen -o new-key.txt
NEW_PUB=$(grep '^# public key' new-key.txt | cut -d: -f2 | tr -d ' ')

# 2. Add the new pub key as a recipient *before* removing the old one
#    (gives you a window where both keys can decrypt).
yq -i ".creation_rules[].age += \",\n$NEW_PUB\"" .sops.yaml

# 3. Re-encrypt every SOPS file with the new recipient list
find . -name '*.enc.yaml' -print0 | xargs -0 -n1 sops updatekeys --yes

# 4. Commit + merge. Verify Flux still decrypts (no errors in
#    flux-system pods).

# 5. Rotate the new key into your secret store, distribute to operators.

# 6. Once everyone is on the new key, drop the old one from .sops.yaml
#    and re-run sops updatekeys --yes one more time.

# 7. Securely destroy old-key.txt copies.
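Before dropping the old key in step 6 it is worth confirming that no file was skipped in step 3. A minimal sketch, assuming each encrypted file lists its Age recipients in plain text in its sops metadata; the function name `files_missing_recipient` is hypothetical:

```shell
# Hypothetical check: list SOPS files that do NOT yet name the given Age
# recipient. Empty output means every *.enc.yaml carries the new key.
files_missing_recipient() {
  pub="$1"; root="${2:-.}"
  # grep -L prints files without a match; its exit status differs between
  # grep versions, so it is deliberately ignored here.
  find "$root" -name '*.enc.yaml' -exec grep -L -F -e "$pub" {} + || true
}
```

Run `files_missing_recipient "$NEW_PUB"` from the repo root after step 3; re-run sops updatekeys on anything it prints.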

Scenario 7 — R2 / Cloudflare credential rotation

# 1. Mint a new R2 token in the Cloudflare dashboard (scoped to your
#    <your-bucket> bucket only). DO NOT revoke the old one
#    yet -- there is a window where both must work.

# 2. Update the encrypted secret in-place. The R2 keys are per-environment
#    and live in the CLUSTER secret (variables-cluster), not the shared base.
sops --set '["stringData"]["r2_access_key_id"] "<new-id>"' \
  k8s/clusters/prod/bootstrap/variables-cluster-secret.enc.yaml
sops --set '["stringData"]["r2_secret_access_key"] "<new-secret>"' \
  k8s/clusters/prod/bootstrap/variables-cluster-secret.enc.yaml
# (repeat for k8s/clusters/local/bootstrap/ if rotating the local creds)

# 3. PR + merge. Flux propagates within one reconciliation cycle, and the
#    hourly seed-r2-credentials PushSecret refreshes infrastructure/backup/r2
#    in OpenBao, from where the Velero/CNPG ExternalSecrets re-sync.

# 4. Wait one Velero schedule + one CNPG WAL archive cycle to confirm
#    the new credentials work end-to-end.
kubectl -n velero get backups.velero.io -w
kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg --tail=50

# 5. Revoke the old token in Cloudflare.

The Cloudflare API token (DNS01 + external-dns) is user-fed, not in SOPS — rotate it with a single vault write; the consuming ExternalSecrets re-sync within their 1h refresh interval:

kubectl -n openbao exec openbao-0 -- \
  bao kv put secret/infrastructure/dns/cloudflare api_token=<new-token>

Encryption-at-rest verification

Run after any node replacement to confirm secrets are still ciphertext on disk.

# Pull a fresh etcd snapshot via talosctl
talosctl --nodes <cp-node> etcd snapshot /tmp/etcd.snapshot

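# Inspect the snapshot -- Secret values must NOT be plain text.
# The snapshot is a database file, not a live endpoint, so scan it directly:
grep -a -o 'k8s:enc:[a-z]*:v[0-9]*:' /tmp/etcd.snapshot | sort | uniq -c
# Expect a non-zero count: every encrypted value carries a
# k8s:enc:<provider>:<version>: prefix. No matches means Secrets are
# stored unencrypted and the EncryptionConfiguration was lost.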

This check is deliberately not part of the CI restore drill — Talos verifies the encryption key at install time, so a CI assertion would add complexity for a structurally-enforced property (see restore-drill.md for the full rationale).

Local clusters

Local clusters are ephemeral and reconstructed from this repo on every ksail cluster create. There is nothing meaningful to back up — the restore procedure for local is:

ksail cluster delete
ksail cluster create
ksail workload push && ksail workload reconcile

CI exercises this on every PR (.github/workflows/ci.yaml), and also exercises a Velero backup → restore against the in-cluster MinIO so the prod code path is regression-tested.

Scenario 8 — Cluster Autoscaler issues

Autoscaler not scaling up

# Check for pending pods
kubectl get pods -A --field-selector=status.phase=Pending

# Inspect autoscaler logs
kubectl -n kube-system logs -l app.kubernetes.io/name=cluster-autoscaler --tail=200

# Check status ConfigMap
kubectl -n kube-system get cm cluster-autoscaler-status -o yaml

Common causes:

Pool maxSize reached — increase max under the relevant pool in ksail.prod.yaml, then run ksail --config ksail.prod.yaml cluster update
HCLOUD_TOKEN expired — rotate in SOPS secrets and GitHub environment secrets

Orphaned autoscaler nodes after cluster delete

ksail cluster delete may not remove servers created by the Cluster Autoscaler. Clean up manually:

hcloud server list --selector cluster.autoscaler.nodeGroupLabel
# Delete each orphaned server
hcloud server delete <server-id>
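With more than a handful of orphans, a reviewed delete plan is safer than deleting by hand. A minimal dry-run sketch; the function name `orphan_delete_plan` is hypothetical, and it assumes the hcloud CLI output options `-o noheader -o columns=id,name`:

```shell
# Hypothetical dry-run helper: print one delete command per server matching
# the selector. Nothing is deleted until the reviewed output is executed.
orphan_delete_plan() {
  selector="${1:-cluster.autoscaler.nodeGroupLabel}"
  hcloud server list --selector "$selector" -o noheader -o columns=id,name |
    awk 'NF >= 2 { printf "hcloud server delete %s  # %s\n", $1, $2 }'
}
```

Review the output of `orphan_delete_plan`, then execute it with `orphan_delete_plan | sh` once every listed server is confirmed to be an orphan.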

Autoscaler node not joining cluster

# Check if the server was created in Hetzner
hcloud server list

# If the server exists but node doesn't appear in kubectl:
# The worker machine config may be invalid or stale.
# Re-run cluster update to regenerate worker config and re-apply:
ksail --config ksail.prod.yaml cluster update

Scenario 9 — Refresh CI deploy credentials after a cluster rebuild

The DR - Rebuild Prod workflow refreshes both secrets automatically at the end of a rebuild if a DR_GH_ADMIN_TOKEN secret (a fine-grained PAT with environment-secrets write on this repo) is configured; without it the workflow prints a warning and the manual procedure below applies.

The prod deploy pipeline (the merge-queue deploy-prod job in ci.yaml, and the manual .github/workflows/cd.yaml) authenticates to the cluster with two GitHub prod environment secrets:

Secret	Restored to	Used by
`KUBE_CONFIG`	`~/.kube/config`	`ksail cluster update` drift detection (kube API)
`TALOS_CONFIG`	`~/.talos/config`	`ksail cluster update` machine-config / secret sync

After a full rebuild (Scenario 4) the API endpoint and Talos PKI change, so both secrets are stale. Symptom in CI: the 🩺 Verify prod cluster is reachable preflight fails, or ksail cluster update reports a bogus "N configuration changes" plan (every component shows Default/Disabled/None) and then fails with connection refused / x509: certificate signed by unknown authority "talos".

# Run from a machine that can reach the rebuilt cluster (e.g. the homelab),
# after `ksail cluster create` has written fresh local configs.

# 1. Locate the fresh configs ksail produced during the rebuild:
#    kubeconfig  — ~/.kube/config   (context: admin@prod)
#    talosconfig — ~/.talos/config
#    Sanity-check they point at the new cluster:
kubectl --context admin@prod get --raw='/readyz'      # -> ok
talosctl --talosconfig ~/.talos/config version        # -> talks to the nodes

# 2. Push both into the GitHub `prod` environment secrets used by CI/CD.
#    (Drop `--env prod` if these are configured as repo-level secrets.)
gh secret set KUBE_CONFIG  --env prod --repo devantler-tech/platform < ~/.kube/config
gh secret set TALOS_CONFIG --env prod --repo devantler-tech/platform < ~/.talos/config

# 3. Re-trigger the deploy (push/re-tag, or re-queue the PR). The reachability
#    preflight should pass and `ksail cluster update` should report no spurious
#    component changes.
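Steps 1 and 2 can be combined so that stale configs are never pushed. A minimal sketch; the function name `refresh_ci_secrets` is hypothetical, and it assumes that `talosctl version` exits non-zero when the nodes cannot be reached:

```shell
# Hypothetical wrapper around steps 1-2: push the secrets only if both
# fresh configs actually reach the rebuilt cluster.
refresh_ci_secrets() {
  repo="${1:-devantler-tech/platform}"
  kubectl --context admin@prod get --raw='/readyz' >/dev/null || {
    echo "kubeconfig cannot reach the API server; not pushing" >&2; return 1; }
  talosctl --talosconfig "$HOME/.talos/config" version >/dev/null || {
    echo "talosconfig cannot reach the nodes; not pushing" >&2; return 1; }
  gh secret set KUBE_CONFIG  --env prod --repo "$repo" < "$HOME/.kube/config" &&
  gh secret set TALOS_CONFIG --env prod --repo "$repo" < "$HOME/.talos/config"
}
```

Calling `refresh_ci_secrets` with no argument targets devantler-tech/platform; pass another owner/repo as the first argument if the secrets live elsewhere.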

Upstream context: ksail currently warns (rather than fails) when the cluster is unreachable and then proposes a full reinstall from a default baseline — tracked in devantler-tech/ksail#4868 and #4869. The preflight step in the deploy workflows is the local guard against that behaviour until those land.

Cryptographic custody — per-artifact threat model for SOPS keys, Talos PKI, OpenBao seal, cosign identity
OpenBao DR — seal/unseal, root-token rotation, restore from Velero
Node autoscaling — architecture, prerequisites, and troubleshooting
Velero + CNPG → R2 — application/PV backups
Alerting — automated detection of backup failures