foundation/dr/RUNBOOK.md
Andreas Niemann d807a45c79 feat(dr): disaster restore to a fresh VM + runbook (T13)
Rehearsed and validated. The destructive sibling of backup/restore.sh:
rebuilds the ENTIRE egg on a fresh, Docker-equipped VM from the offsite,
age-encrypted bundle, in the mandated order (CONTRACT_004 §4.4):
Vault -> Postgres -> RustFS -> Forgejo.

- restore-to-fresh-vm.sh (operator): pulls the disaster-survivable secret set
  from passphrase-encrypted config (age identity + Vault OLD unseal keys/root
  token), ships VERSIONS + the VM-side restorer, runs it (secrets on stdin).
- restore-to-fresh-vm-remote.sh (VM-side): decrypt+verify bundle; restore Vault
  (init throwaway -> raft snapshot restore -force -> re-unseal with OLD keys,
  with a settle+retry loop because -force re-seals asynchronously); read every
  other service's creds back out of the restored Vault; restore Postgres, RustFS
  (buckets + scoped service account + blobs), and Forgejo (full /data incl.
  app.ini); publish git :22 only when free.
- RUNBOOK.md: the human procedure, the {repo+passphrase+offsite} trust chain,
  and §5 re-establish-ingress (DNS, Caddy, runner, re-key).

Rehearsal (throwaway cx33, offsite source, then destroyed): DR RESTORE OK —
Vault unsealed with OLD keys, postgres rows=2, forge healthy against restored
DB+S3, `git clone ssh://git@<vm>:2222/olsitec/foundation.git` returns all 28
commits, ai-baseline present. Trust chain proven end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 23:58:07 +02:00


DR RUNBOOK — rebuild the foundation egg on a fresh VM (T13)

Realises CONTRACT_004 §4.4 (restore order) · Companion: dr/restore-to-fresh-vm.sh (orchestrator) + dr/restore-to-fresh-vm-remote.sh (VM-side). This is the destructive sibling of backup/restore.sh (the non-destructive scratch verifier).

0. When you are here

The Helsinki VM (or its Vault/data) is gone. You still have:

  • this git repo (olsitec/foundation) — including bootstrap/Pulumi.foundation.yaml, whose passphrase-encrypted secure: values hold the Vault OLD unseal keys + root token (CONTRACT_002 §2.4) and the age identity (CONTRACT_004 §4.3);
  • the master passphrase (pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE) — the one external secret; without it nothing below decrypts;
  • the offsite bundle (self-hosted MinIO olsitec-foundation/<TS>/) — RustFS is assumed lost, so restore reads offsite (--source off).

{repo + passphrase + offsite bundle} is sufficient to fully reconstitute the egg. Nothing else is needed — that is the whole point of the trust chain (PLAN-002 §4.1).

1. Pick the bundle

List candidates in the offsite store and choose the newest verified one:

ts=<UTC-YYYYMMDDTHHMMSSZ>     # e.g. the latest in olsitec-foundation/

A bundle is trustworthy only if backup/restore.sh <ts> off has passed for it (the weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.
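Picking the newest candidate can be sketched as follows. This assumes bundle names contain the fixed-width UTC stamp shown above, so lexical order equals chronological order. `newest_ts` is a hypothetical helper, not something in the repo. It only selects the newest stamp; whether that bundle has a green verify run must still be checked separately.

```shell
# newest_ts: read candidate bundle names on stdin (one per line) and print
# the newest UTC stamp found. ASSUMPTION: names embed YYYYMMDDTHHMMSSZ.
newest_ts() {
  grep -Eo '[0-9]{8}T[0-9]{6}Z' | LC_ALL=C sort -u | tail -n 1
}
```

Feed it the output of whatever client lists the offsite store, then assign the result to `ts`.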

2. Provision a fresh VM

Real DR — use the provision stack so the new VM becomes the managed home:

cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package) — backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up

The VM must have, at first boot: docker, age, zstd, jq (the cloud-init provides them). SSH is on :222 for a provision-stack VM (the vendored cloud-init moves it); pass --port 222 below.

Rehearsal — a throwaway VM created directly via the Hetzner API (cx33, debian-12, ssh key foundation-test-ssh-key, cloud-init installing docker+age+zstd+jq), sshd on :22. Destroy it immediately after (DELETE /v1/servers/<id>).
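Before starting the restore, it may be worth confirming that the first-boot tools really are present on the new VM. A minimal sketch, assuming a POSIX shell on the VM; `missing_tools` is a hypothetical helper, not part of the repo.

```shell
# missing_tools TOOL...: print the name of every TOOL that is not on PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Run on the VM as `missing_tools docker age zstd jq`; any output means the cloud-init did not finish or did not install that tool.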

3. Restore (Vault → Postgres → RustFS → Forgejo)

./dr/restore-to-fresh-vm.sh --host <new-ip> --port <22|222> --ts "$ts" --source off

What it does, in the mandated order (CONTRACT_004 §4.4 — starting Forgejo before Vault, Postgres, and RustFS are restored is a defect):

  1. Decrypt the bundle with the age identity; verify every artifact's MANIFEST sha256.
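  2. Vault — start a fresh raft node, initialise it with throwaway keys, run raft snapshot restore -force, then unseal with the OLD keys from config (the restore swaps in the snapshot's keyring, so the throwaway keys no longer apply). Vault is now the source of truth again and the OLD root token authenticates. Every other service's credentials are read back out of it.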
  3. Postgres — start with the super-password from Vault, restore postgres.sql.gz (recreates the forgejo role + DB; asserts "user" rows ≥ 1).
  4. RustFS — start with the admin keys from Vault, recreate the four buckets + the scoped service account (the exact serviceKeyId/Secret Forgejo's app.ini expects), sync rustfs-blobs back into the buckets.
  5. Forgejo — extract forgejo-repos.tar.zst into the data volume (git repos + app.ini, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject SECRET_KEY from Vault, start. Asserts healthz pass + olsitec/foundation.git present.

On success it prints DR RESTORE OK (<ts>): ….
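The checksum verification in step 1 can be illustrated with a small sketch. It assumes the decrypted bundle directory contains a MANIFEST in the `<sha256>  <relative path>` format written by `sha256sum`. The real MANIFEST layout is not shown in this runbook and may differ, and `verify_manifest` is a hypothetical name.

```shell
# verify_manifest DIR: check every file listed in DIR/MANIFEST against its
# recorded sha256. ASSUMPTION: MANIFEST is in sha256sum's own output format.
# Returns non-zero if MANIFEST is missing or any checksum does not match.
verify_manifest() {
  dir=$1
  [ -f "$dir/MANIFEST" ] || { echo "no MANIFEST in $dir" >&2; return 1; }
  (cd "$dir" && sha256sum -c MANIFEST)
}
```

Failing closed here matters: a restore that proceeds past a checksum mismatch would load a corrupt artifact into Vault or Postgres.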

4. What is NOT restored (recreatable — CONTRACT_004 §4.5)

Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's ephemeral registration, search indexes/caches. These come back in §5.

5. Re-establish ingress + management

  1. DNS — repoint forge/git/s3/vault.olsitec.net A records at the new IP (Cloudflare; the bootstrap's deployDns does this once the stack is re-adopted).
  2. Caddy + runner — re-adopt the stack so IaC manages the new VM:
    cd bootstrap
    pulumi config set foundation:vm.host <new-ip>
    pulumi config set foundation:vm.sshPort <22|222>
    ./run.sh up        # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
    
    up is idempotent against the already-restored containers it can adopt, and creates the recreatable ones (Caddy, runner). Verify: https://forge.olsitec.net returns HTTP 200 and git clone git@git.olsitec.net:olsitec/foundation.git succeeds.
  3. Re-key reminder (D2) — after a real disaster, rotate the Vault root token and the offsite creds (they were materialised on a possibly-compromised host): run pulumi up --replace against the relevant credential resources, then re-run a backup.
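The HTTP check from step 2 can be scripted so it is repeatable while DNS and the certificate settle. A minimal sketch, assuming `curl` is installed on the operator machine; `http_status` and `verify_ingress` are hypothetical helper names.

```shell
# http_status URL: print the HTTP status code for URL ("000" if unreachable).
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" || true
}

# verify_ingress URL: succeed only when URL answers with HTTP 200.
verify_ingress() {
  url=$1
  code=$(http_status "$url")
  if [ "$code" = "200" ]; then
    echo "OK $url"
  else
    echo "FAIL $url (HTTP $code)"
    return 1
  fi
}
```

For example, `verify_ingress https://forge.olsitec.net`. The git clone check is still a separate manual step.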

6. Gotchas (discovered during the T13 rehearsal)

  • The docker gid on the host is host-specific; the runner mounts the host socket (PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
  • raft snapshot restore -force re-seals the node (it swaps in the snapshot's keyring) — you MUST unseal again with the OLD keys, not the throwaway init keys.
  • Restore reads offsite by default. RustFS on the new VM starts EMPTY; its blobs come from the bundle, not from a surviving RustFS.
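Because the node re-seals after the forced snapshot restore, an unseal attempted too early can fail and needs retrying. The helper below is a generic sketch of such a settle-and-retry loop, not the code in restore-to-fresh-vm-remote.sh. `retry` is a hypothetical name, and the command it wraps is supplied by the caller.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping DELAY
# seconds between tries. Fails after ATTEMPTS unsuccessful tries.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

It would wrap whatever function performs the unseal with the OLD keys, for example `retry 10 3 unseal_with_old_keys`, where `unseal_with_old_keys` is an assumed wrapper and the attempt count and delay are illustrative.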