# DR RUNBOOK: rebuild the foundation egg on a fresh VM (T13)

**Realises** CONTRACT_004 §4.4 (restore order) · **Companion**: `dr/restore-to-fresh-vm.sh` (orchestrator) + `dr/restore-to-fresh-vm-remote.sh` (VM-side). This is the **destructive** sibling of `backup/restore.sh` (the non-destructive scratch verifier).

## 0. When you are here

The Helsinki VM (or its Vault/data) is gone. You still have:

- **this git repo** (`olsitec/foundation`), including `bootstrap/Pulumi.foundation.yaml`, whose passphrase-encrypted `secure:` values hold the Vault **OLD unseal keys + root token** (CONTRACT_002 §2.4) and the **age identity** (CONTRACT_004 §4.3);
- the **master passphrase** (`pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`): the one external secret; without it nothing below decrypts;
- the **offsite bundle** (self-hosted MinIO `olsitec-foundation//`): RustFS is assumed lost, so restore reads **offsite** (`--source off`).

`{repo + passphrase + offsite bundle}` is sufficient to fully reconstitute the egg. Nothing else is needed; that is the whole point of the trust chain (PLAN-002 §4.1).

## 1. Pick the bundle

List candidates in the offsite store and choose the newest verified one:

```
ts= # e.g. the latest in olsitec-foundation/
```

A bundle is trustworthy only if `backup/restore.sh off` has passed for it (the weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.

## 2. Provision a fresh VM

**Real DR**: use the provision stack so the new VM becomes the managed home:

```
cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package); see backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up
```

The VM must have, at first boot: **docker, age, zstd, jq** (the cloud-init provides them). SSH is on **:222** for a provision-stack VM (the vendored cloud-init moves it); pass `--port 222` below.

**Rehearsal**: a throwaway VM created directly via the Hetzner API (cx33, debian-12, ssh key `foundation-test-ssh-key`, cloud-init installing docker+age+zstd+jq), sshd on **:22**. Destroy it immediately after (`DELETE /v1/servers/`).

## 3. Restore (Vault → Postgres → RustFS → Forgejo)

```
./dr/restore-to-fresh-vm.sh --host --port <22|222> --ts "$ts" --source off
```

What it does, in the **mandated order** (CONTRACT_004 §4.4; starting Forgejo before 1–3 is a defect):

1. **Decrypt** the bundle with the age identity; verify every artifact's MANIFEST sha256.
2. **Vault**: start a fresh raft node, init a throwaway node, `raft snapshot restore -force`, then **unseal with the OLD keys** from config. Vault is now the source of truth again; the OLD root token authenticates. All other creds are read back out.
3. **Postgres**: start with the super-password from Vault, restore `postgres.sql.gz` (recreates the `forgejo` role + DB; asserts `"user"` rows ≥ 1).
4. **RustFS**: start with the admin keys from Vault, recreate the four buckets + the scoped service account (the exact `serviceKeyId/Secret` Forgejo's app.ini expects), sync `rustfs-blobs` back into the buckets.
5. **Forgejo**: extract `forgejo-repos.tar.zst` into the data volume (git repos + `app.ini`, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject `SECRET_KEY` from Vault, start. Asserts healthz pass + `olsitec/foundation.git` present.

On success it prints `DR RESTORE OK (): …`.

## 4. What is NOT restored (recreatable; CONTRACT_004 §4.5)

Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's ephemeral registration, search indexes/caches. These come back in §5.

## 5. Re-establish ingress + management

1. **DNS**: repoint `forge/git/s3/vault.olsitec.net` A records at the new IP (Cloudflare; the bootstrap's `deployDns` does this once the stack is re-adopted).
2. **Caddy + runner**: re-adopt the stack so IaC manages the new VM:

   ```
   cd bootstrap
   pulumi config set foundation:vm.host
   pulumi config set foundation:vm.sshPort <22|222>
   ./run.sh up   # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
   ```

   `up` is idempotent against the already-restored containers it can adopt, and creates the recreatable ones (Caddy, runner). Verify: `https://forge.olsitec.net` = 200, `git clone git@git.olsitec.net:olsitec/foundation.git`.
3. **Re-key reminder (D2)**: after a real disaster, rotate the Vault root token + the offsite creds (they were materialised on a possibly-compromised host): `pulumi up --replace` the relevant credential resources, then re-run a backup.

## 6. Gotchas (discovered during the T13 rehearsal)

- The **docker gid** on the host is host-specific; the runner mounts the host socket (PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
- `raft snapshot restore -force` **re-seals** the node (it swaps in the snapshot's keyring), so you MUST unseal again with the OLD keys, not the throwaway init keys.
- Restore reads **offsite** by default. RustFS on the new VM starts EMPTY; its blobs come from the bundle, not from a surviving RustFS.
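
The docker gid gotcha above, together with the first-boot tool requirement from §2 (docker, age, zstd, jq), lends itself to a quick pre-flight check on the fresh VM. The following is a minimal sketch only: the function names (`missing_tools`, `path_gid`, `group_gid`, `preflight`) are hypothetical and are not part of `dr/restore-to-fresh-vm.sh`, and the socket path `/var/run/docker.sock` and GNU `stat` (as on debian-12) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; NOT part of the repo's DR scripts.

# Print every tool from the argument list that is not on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print the numeric gid that owns a path (assumes GNU stat, as on Debian).
path_gid() {
  stat -c '%g' "$1"
}

# Print the gid of a named group from the system database (empty if absent).
group_gid() {
  getent group "$1" | cut -d: -f3
}

# Fail if a required tool is missing; otherwise report the docker gids so they
# can be compared with whatever gid the runner configuration expects.
preflight() {
  local missing sock_gid grp_gid
  missing="$(missing_tools docker age zstd jq)"
  if [ -n "$missing" ]; then
    printf 'missing tools: %s\n' "$(printf '%s' "$missing" | tr '\n' ' ')" >&2
    return 1
  fi
  # Assumed default socket path; adjust if the host uses a different one.
  sock_gid="$(path_gid /var/run/docker.sock)" || return 1
  grp_gid="$(group_gid docker)"
  printf 'docker.sock gid=%s, docker group gid=%s\n' "$sock_gid" "$grp_gid"
}
```

Sourcing this file on the VM and calling `preflight` before §3 returns non-zero when a required tool is absent, and otherwise prints the gids to check by hand against the runner's configuration.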