foundation/dr/RUNBOOK.md

# DR RUNBOOK — rebuild the foundation egg on a fresh VM (T13)
**Realises** CONTRACT_004 §4.4 (restore order) · **Companion**: `dr/restore-to-fresh-vm.sh`
(orchestrator) + `dr/restore-to-fresh-vm-remote.sh` (VM-side). This is the
**destructive** sibling of `backup/restore.sh` (the non-destructive scratch verifier).
## 0. When you are here
The Helsinki VM (or its Vault/data) is gone. You still have:
- **this git repo** (`olsitec/foundation`) — including `bootstrap/Pulumi.foundation.yaml`,
whose passphrase-encrypted `secure:` values hold the Vault **OLD unseal keys + root
token** (CONTRACT_002 §2.4) and the **age identity** (CONTRACT_004 §4.3);
- the **master passphrase** (`pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`) — the
one external secret; without it nothing below decrypts;
- the **offsite bundle** (self-hosted MinIO `olsitec-foundation/<TS>/`) — RustFS is
assumed lost, so restore reads **offsite** (`--source off`).
`{repo + passphrase + offsite bundle}` is sufficient to fully reconstitute the egg.
Nothing else is needed — that is the whole point of the trust chain (PLAN-002 §4.1).
## 1. Pick the bundle
List candidates in the offsite store and choose the newest verified one:
```
ts=<UTC-YYYYMMDDTHHMMSSZ> # e.g. the latest in olsitec-foundation/
```
A bundle is trustworthy only if `backup/restore.sh <ts> off` has passed for it (the
weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.
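
A minimal sketch of picking the newest timestamp, assuming bundle prefixes are named in the fixed-width form `YYYYMMDDTHHMMSSZ` and arrive one per line on stdin. The helper name `latest_ts` is hypothetical, and how you list the offsite store is left to your MinIO client. Because the format is fixed-width UTC, lexical order equals chronological order. The sketch does not tell you whether the bundle has been verified.

```shell
# Hypothetical helper: print the newest bundle timestamp read from stdin.
# Assumes names look like 20250101T000000Z, optionally with a trailing slash.
latest_ts() {
  sed 's:/*$::' \
    | grep -E '^[0-9]{8}T[0-9]{6}Z$' \
    | LC_ALL=C sort \
    | tail -n 1
}

# Example with inline data instead of a real listing:
printf '%s\n' 20250101T000000Z/ 20250108T000000Z/ notes.txt | latest_ts
```

Entries that do not match the assumed pattern are ignored, so stray objects in the listing do not affect the result.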
## 2. Provision a fresh VM
**Real DR** — use the provision stack so the new VM becomes the managed home:
```
cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package) — backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up
```
The VM must have, at first boot: **docker, age, zstd, jq** (the cloud-init provides
them). SSH is on **:222** for a provision-stack VM (the vendored cloud-init moves it);
pass `--port 222` below.
**Rehearsal** — a throwaway VM created directly via the Hetzner API (cx33, debian-12,
ssh key `foundation-test-ssh-key`, cloud-init installing docker+age+zstd+jq), sshd on
**:22**. Destroy it immediately after (`DELETE /v1/servers/<id>`).
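
Before restoring, it may be worth confirming that the first-boot tools really are present. The sketch below is one way to do that on the new VM. The function name `missing_tools` is hypothetical and is not part of the restore scripts.

```shell
# Hypothetical preflight: print each required tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Run on the new VM (for example over ssh) before starting the restore.
missing="$(missing_tools docker age zstd jq)"
if [ -n "$missing" ]; then
  echo "missing on this host: $missing" >&2
fi
```

An empty result means all four tools resolve on `PATH`. It does not prove that the docker daemon is running.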
## 3. Restore (Vault → Postgres → RustFS → Forgejo)
```
./dr/restore-to-fresh-vm.sh --host <new-ip> --port <22|222> --ts "$ts" --source off
```
What it does, in the **mandated order** (CONTRACT_004 §4.4; starting Forgejo before
Vault, Postgres, and RustFS are restored is a defect):
1. **Decrypt** the bundle with the age identity; verify every artifact's MANIFEST sha256.
2. **Vault** — start a fresh raft node, initialise it with throwaway keys, run `raft snapshot
restore -force`, then **unseal with the OLD keys** from config. Vault is now the source of
truth again; the OLD root token authenticates. All other creds are read back out of it.
3. **Postgres** — start with the super-password from Vault, restore `postgres.sql.gz`
(recreates the `forgejo` role + DB; asserts `"user"` rows ≥ 1).
4. **RustFS** — start with the admin keys from Vault, recreate the four buckets + the
scoped service account (the exact `serviceKeyId/Secret` Forgejo's app.ini expects),
sync `rustfs-blobs` back into the buckets.
5. **Forgejo** — extract `forgejo-repos.tar.zst` into the data volume (git repos +
`app.ini`, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject
`SECRET_KEY` from Vault, start. Asserts healthz pass + `olsitec/foundation.git` present.
On success it prints `DR RESTORE OK (<ts>): …`.
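
The checksum verification in step 1 can be illustrated as follows. This is a sketch under stated assumptions, not the script's actual code: it assumes the decrypted bundle directory holds a file named `MANIFEST` in the `sha256sum` line format (`<hex>  <relative-path>`) and that GNU coreutils is available. The real MANIFEST layout may differ, and `verify_manifest` is a hypothetical name.

```shell
# Hypothetical check: verify every file listed in <dir>/MANIFEST.
# Assumes MANIFEST uses sha256sum's "<hex>  <relative-path>" line format.
verify_manifest() {
  ( cd "$1" && sha256sum --check --quiet MANIFEST )
}

# Self-contained demonstration on a scratch directory.
demo="$(mktemp -d)"
printf 'hello\n' > "$demo/artifact.bin"
( cd "$demo" && sha256sum artifact.bin > MANIFEST )
verify_manifest "$demo" && echo "manifest ok"
```

The function exits non-zero as soon as any listed file is missing or its digest differs, which is the property the restore relies on before touching Vault.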
## 4. What is NOT restored (recreatable — CONTRACT_004 §4.5)
Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's
ephemeral registration, search indexes/caches. These come back in §5.
## 5. Re-establish ingress + management
1. **DNS** — repoint `forge/git/s3/vault.olsitec.net` A records at the new IP
(Cloudflare; the bootstrap's `deployDns` does this once the stack is re-adopted).
2. **Caddy + runner** — re-adopt the stack so IaC manages the new VM:
```
cd bootstrap
pulumi config set foundation:vm.host <new-ip>
pulumi config set foundation:vm.sshPort <22|222>
./run.sh up # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
```
`up` is idempotent against the already-restored containers it can adopt, and creates
the recreatable ones (Caddy, runner). Verify: `https://forge.olsitec.net` = 200,
`git clone git@git.olsitec.net:olsitec/foundation.git`.
3. **Re-key reminder (D2)** — after a real disaster, rotate the Vault root token + the
offsite creds (they were materialised on a possibly-compromised host): `pulumi up
--replace` the relevant credential resources, then re-run a backup.
## 6. Gotchas (discovered during the T13 rehearsal)
- The **docker gid** on the host is host-specific; the runner mounts the host socket
(PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
- `raft snapshot restore -force` **re-seals** the node (it swaps in the snapshot's
keyring) — you MUST unseal again with the OLD keys, not the throwaway init keys.
- Restore reads **offsite** by default. RustFS on the new VM starts EMPTY; its blobs
come from the bundle, not from a surviving RustFS.
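
For the docker gid gotcha, a quick way to read the gid on the new host is sketched below. It assumes GNU `stat` (as on Debian) and the conventional socket path `/var/run/docker.sock`. The helper name `owner_gid` is hypothetical, and where the runner's expected gid is configured is not covered here.

```shell
# Hypothetical helper: print the numeric group id that owns a path.
# Uses GNU stat (-c '%g'), as found on Debian.
owner_gid() {
  stat -c '%g' "$1"
}

sock=/var/run/docker.sock   # conventional default; adjust if yours differs
if [ -S "$sock" ]; then
  echo "docker socket gid: $(owner_gid "$sock")"
fi
```

Compare the printed value with the gid the runner is configured to use before trusting runner jobs.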