foundation/dr/RUNBOOK.md
Andreas Niemann d807a45c79 feat(dr): disaster restore to a fresh VM + runbook (T13)
Rehearsed and validated. The destructive sibling of backup/restore.sh:
rebuilds the ENTIRE egg on a fresh, Docker-equipped VM from the offsite,
age-encrypted bundle, in the mandated order (CONTRACT_004 §4.4):
Vault -> Postgres -> RustFS -> Forgejo.

- restore-to-fresh-vm.sh (operator): pulls the disaster-survivable secret set
  from passphrase-encrypted config (age identity + Vault OLD unseal keys/root
  token), ships VERSIONS + the VM-side restorer, runs it (secrets on stdin).
- restore-to-fresh-vm-remote.sh (VM-side): decrypt+verify bundle; restore Vault
  (init throwaway -> raft snapshot restore -force -> re-unseal with OLD keys,
  with a settle+retry loop because -force re-seals asynchronously); read every
  other service's creds back out of the restored Vault; restore Postgres, RustFS
  (buckets + scoped service account + blobs), and Forgejo (full /data incl.
  app.ini); publish git :22 only when free.
- RUNBOOK.md: the human procedure, the {repo+passphrase+offsite} trust chain,
  and §5 re-establish-ingress (DNS, Caddy, runner, re-key).

Rehearsal (throwaway cx33, offsite source, then destroyed): DR RESTORE OK —
Vault unsealed with OLD keys, postgres rows=2, forge healthy against restored
DB+S3, `git clone ssh://git@<vm>:2222/olsitec/foundation.git` returns all 28
commits, ai-baseline present. Trust chain proven end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 23:58:07 +02:00



# DR RUNBOOK — rebuild the foundation egg on a fresh VM (T13)
**Realises** CONTRACT_004 §4.4 (restore order) · **Companion**: `dr/restore-to-fresh-vm.sh`
(orchestrator) + `dr/restore-to-fresh-vm-remote.sh` (VM-side). This is the
**destructive** sibling of `backup/restore.sh` (the non-destructive scratch verifier).
## 0. When you are here
The Helsinki VM (or its Vault/data) is gone. You still have:
- **this git repo** (`olsitec/foundation`) — including `bootstrap/Pulumi.foundation.yaml`,
whose passphrase-encrypted `secure:` values hold the Vault **OLD unseal keys + root
token** (CONTRACT_002 §2.4) and the **age identity** (CONTRACT_004 §4.3);
- the **master passphrase** (`pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`) — the
one external secret; without it nothing below decrypts;
- the **offsite bundle** (self-hosted MinIO `olsitec-foundation/<TS>/`) — RustFS is
assumed lost, so restore reads **offsite** (`--source off`).
`{repo + passphrase + offsite bundle}` is sufficient to fully reconstitute the egg.
Nothing else is needed — that is the whole point of the trust chain (PLAN-002 §4.1).
## 1. Pick the bundle
List candidates in the offsite store and choose the newest verified one:
```
ts=<UTC-YYYYMMDDTHHMMSSZ> # e.g. the latest in olsitec-foundation/
```
A bundle is trustworthy only if `backup/restore.sh <ts> off` has passed for it (the
weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.
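
Choosing the newest timestamp can be scripted. The sketch below assumes only the `YYYYMMDDTHHMMSSZ` shape shown above and that the listing arrives on stdin, one entry per line; the function name `pick_newest_ts` is hypothetical and is not part of the repo's scripts. Fixed-width UTC timestamps sort chronologically when sorted as plain text, which is all it relies on.

```sh
# Read an object-store listing on stdin and print the newest bundle timestamp.
# ASSUMPTION: each bundle prefix contains a YYYYMMDDTHHMMSSZ timestamp.
pick_newest_ts() {
  # Extract every timestamp-shaped token, de-duplicate, keep the last (newest).
  grep -oE '[0-9]{8}T[0-9]{6}Z' | sort -u | tail -n 1
}
```

With the MinIO client and a hypothetical alias `offsite`, this might be used as `ts="$(mc ls offsite/olsitec-foundation/ | pick_newest_ts)"`. Newest does not mean verified: still confirm that the weekly verify job was green for that `ts`.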
## 2. Provision a fresh VM
**Real DR** — use the provision stack so the new VM becomes the managed home:
```
cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package) — backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up
```
The VM must have, at first boot: **docker, age, zstd, jq** (the cloud-init provides
them). SSH is on **:222** for a provision-stack VM (the vendored cloud-init moves it);
pass `--port 222` below.
**Rehearsal** — a throwaway VM created directly via the Hetzner API (cx33, debian-12,
ssh key `foundation-test-ssh-key`, cloud-init installing docker+age+zstd+jq), sshd on
**:22**. Destroy it immediately after (`DELETE /v1/servers/<id>`).
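
Before starting the restore it can be worth confirming that the four tools really are on the VM's `PATH`. A minimal sketch, assuming a POSIX shell on the VM; `missing_tools` is a hypothetical helper, not something the provision stack ships.

```sh
# Print each named tool that is NOT on PATH; return non-zero if any is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

Run on the new VM, `missing_tools docker age zstd jq` prints nothing and returns 0 when cloud-init has finished; any printed name means the VM is not ready yet.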
## 3. Restore (Vault → Postgres → RustFS → Forgejo)
```
./dr/restore-to-fresh-vm.sh --host <new-ip> --port <22|222> --ts "$ts" --source off
```
What it does, in the **mandated order** (CONTRACT_004 §4.4; starting Forgejo before
Vault, Postgres, and RustFS have been restored is a defect):
1. **Decrypt** the bundle with the age identity; verify every artifact's MANIFEST sha256.
2. **Vault** — start a fresh raft node, initialise it with throwaway keys, `raft snapshot
restore -force`, then **unseal with the OLD keys** from config. Vault is now the source of
truth again; the OLD root token authenticates. All other credentials are read back out of it.
3. **Postgres** — start with the super-password from Vault, restore `postgres.sql.gz`
(recreates the `forgejo` role + DB; asserts `"user"` rows ≥ 1).
4. **RustFS** — start with the admin keys from Vault, recreate the four buckets + the
scoped service account (the exact `serviceKeyId/Secret` Forgejo's app.ini expects),
sync `rustfs-blobs` back into the buckets.
5. **Forgejo** — extract `forgejo-repos.tar.zst` into the data volume (git repos +
`app.ini`, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject
`SECRET_KEY` from Vault, then start it. Asserts that healthz passes and that `olsitec/foundation.git` is present.
On success it prints `DR RESTORE OK (<ts>): …`.
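
The checksum verification in step 1 can be pictured as follows. This is a sketch only: it ASSUMES the bundle's `MANIFEST` uses the `sha256sum` line format (`<hex>  <name>`), which this runbook does not state, and `verify_manifest` is a hypothetical name rather than the function used in `restore-to-fresh-vm-remote.sh`.

```sh
# Verify every file listed in a manifest against its SHA-256 digest.
# ASSUMPTION: the manifest is in `sha256sum` format, with paths relative to <dir>.
verify_manifest() {
  dir="$1"
  manifest="${2:-MANIFEST}"
  # Run in a subshell so the caller's working directory is left untouched.
  ( cd "$dir" && sha256sum -c "$manifest" )
}
```

`verify_manifest /path/to/decrypted-bundle` returns non-zero if any artifact is missing or its digest differs, which is the point at which a restore should stop rather than continue with a damaged bundle.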
## 4. What is NOT restored (recreatable — CONTRACT_004 §4.5)
Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's
ephemeral registration, search indexes/caches. These come back in §5.
## 5. Re-establish ingress + management
1. **DNS** — repoint `forge/git/s3/vault.olsitec.net` A records at the new IP
(Cloudflare; the bootstrap's `deployDns` does this once the stack is re-adopted).
2. **Caddy + runner** — re-adopt the stack so IaC manages the new VM:
   ```
   cd bootstrap
   pulumi config set foundation:vm.host <new-ip>
   pulumi config set foundation:vm.sshPort <22|222>
   ./run.sh up # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
   ```
   `up` is idempotent against the already-restored containers it can adopt, and creates
   the recreatable ones (Caddy, runner). Verify that `https://forge.olsitec.net` returns 200
   and that `git clone git@git.olsitec.net:olsitec/foundation.git` succeeds.
3. **Re-key reminder (D2)** — after a real disaster, rotate the Vault root token + the
offsite creds (they were materialised on a possibly-compromised host): `pulumi up
--replace` the relevant credential resources, then re-run a backup.
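
The HTTP 200 check from step 2 can be scripted so it is usable in a wait loop while DNS and the certificate settle. A minimal sketch, assuming `curl` is available on the operator machine; `http_is_200` is a hypothetical helper name.

```sh
# Return 0 only if the given URL answers with HTTP status 200.
http_is_200() {
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")" || return 1
  [ "$code" = "200" ]
}
```

For example, `http_is_200 https://forge.olsitec.net && echo "forge is up"`. A connection or TLS failure makes `curl` exit non-zero, so the helper fails in that case as well.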
## 6. Gotchas (discovered during the T13 rehearsal)
- The **docker gid** on the host is host-specific; the runner mounts the host socket
(PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
- `raft snapshot restore -force` **re-seals** the node (it swaps in the snapshot's
keyring) — you MUST unseal again with the OLD keys, not the throwaway init keys.
- Restore reads **offsite** by default. RustFS on the new VM starts EMPTY; its blobs
come from the bundle, not from a surviving RustFS.
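
Because the re-seal after `raft snapshot restore -force` does not complete instantly, an unseal attempt made straight after the restore may need repeating. The generic retry wrapper below illustrates that settle-and-retry idea; it is a sketch, not the loop from `restore-to-fresh-vm-remote.sh`, and both `retry` and the `unseal_with_old_keys` command in the usage line are hypothetical names.

```sh
# Run a command until it succeeds: at most <attempts> tries, <delay> seconds apart.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1   # give up: every attempt failed
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

A call such as `retry 10 3 unseal_with_old_keys` would then try up to ten times, three seconds apart, and fail loudly if Vault is still sealed afterwards.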