Rehearsed and validated. The destructive sibling of backup/restore.sh:
rebuilds the ENTIRE egg on a fresh, Docker-equipped VM from the offsite,
age-encrypted bundle, in the mandated order (CONTRACT_004 §4.4):
Vault -> Postgres -> RustFS -> Forgejo.
- restore-to-fresh-vm.sh (operator): pulls the disaster-survivable secret set
from passphrase-encrypted config (age identity + Vault OLD unseal keys/root
token), ships VERSIONS + the VM-side restorer, runs it (secrets on stdin).
- restore-to-fresh-vm-remote.sh (VM-side): decrypt+verify bundle; restore Vault
(init throwaway -> raft snapshot restore -force -> re-unseal with OLD keys,
with a settle+retry loop because -force re-seals asynchronously); read every
other service's creds back out of the restored Vault; restore Postgres, RustFS
(buckets + scoped service account + blobs), and Forgejo (full /data incl.
app.ini); publish git :22 only when free.
- RUNBOOK.md: the human procedure, the {repo+passphrase+offsite} trust chain,
and §5 re-establish-ingress (DNS, Caddy, runner, re-key).
Rehearsal (throwaway cx33, offsite source, then destroyed): DR RESTORE OK —
Vault unsealed with OLD keys, postgres rows=2, forge healthy against restored
DB+S3, `git clone ssh://git@<vm>:2222/olsitec/foundation.git` returns all 28
commits, ai-baseline present. Trust chain proven end-to-end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
107 lines
5.2 KiB
Markdown
# DR RUNBOOK: rebuild the foundation egg on a fresh VM (T13)

**Realises** CONTRACT_004 §4.4 (restore order) · **Companion**: `dr/restore-to-fresh-vm.sh`
(orchestrator) + `dr/restore-to-fresh-vm-remote.sh` (VM-side). This is the
**destructive** sibling of `backup/restore.sh` (the non-destructive scratch verifier).

## 0. When you are here

The Helsinki VM (or its Vault/data) is gone. You still have:

- **this git repo** (`olsitec/foundation`), including `bootstrap/Pulumi.foundation.yaml`,
  whose passphrase-encrypted `secure:` values hold the Vault **OLD unseal keys + root
  token** (CONTRACT_002 §2.4) and the **age identity** (CONTRACT_004 §4.3);
- the **master passphrase** (`pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`), the
  one external secret; without it nothing below decrypts;
- the **offsite bundle** (self-hosted MinIO `olsitec-foundation/<TS>/`). RustFS is
  assumed lost, so restore reads **offsite** (`--source off`).

`{repo + passphrase + offsite bundle}` is sufficient to fully reconstitute the egg.
Nothing else is needed; that is the whole point of the trust chain (PLAN-002 §4.1).

## 1. Pick the bundle

List candidates in the offsite store and choose the newest verified one:

```
ts=<UTC-YYYYMMDDTHHMMSSZ> # e.g. the latest in olsitec-foundation/
```

A bundle is trustworthy only if `backup/restore.sh <ts> off` has passed for it (the
weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.

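Choosing "the latest" from a listing can be sketched as below. This is an illustration, not part of the repo: `latest_ts` is a hypothetical helper, and it assumes the listing has already been reduced to one bundle name per line in the fixed-width `YYYYMMDDTHHMMSSZ` form (optionally with a trailing slash). It only finds the newest name; it does not tell you whether the weekly verify passed for it.

```shell
# latest_ts (hypothetical helper): read names on stdin, keep only those shaped
# like YYYYMMDDTHHMMSSZ (an optional trailing "/" is dropped), print the newest.
# Fixed-width UTC timestamps sort chronologically under a plain lexical sort.
latest_ts() {
  sed -n 's|^\([0-9]\{8\}T[0-9]\{6\}Z\)/\{0,1\}$|\1|p' | sort | tail -n 1
}
```

Pipe whatever listing command you use for the offsite store into it, for example `ts=$(<your listing command> | latest_ts)`, then confirm that timestamp against the last green verify run before using it.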
## 2. Provision a fresh VM

**Real DR.** Use the provision stack so the new VM becomes the managed home:

```
cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package); see backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up
```

The VM must have, at first boot: **docker, age, zstd, jq** (the cloud-init provides
them). SSH is on **:222** for a provision-stack VM (the vendored cloud-init moves it);
pass `--port 222` below.

**Rehearsal.** A throwaway VM created directly via the Hetzner API (cx33, debian-12,
ssh key `foundation-test-ssh-key`, cloud-init installing docker+age+zstd+jq), sshd on
**:22**. Destroy it immediately after (`DELETE /v1/servers/<id>`).

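Before starting the restore it may be worth confirming that cloud-init really delivered the four tools. A minimal sketch, assuming you run it in a shell on the new VM; `missing_tools` is a hypothetical helper, not something shipped in `dr/`.

```shell
# missing_tools (hypothetical helper): print every named command that is not
# on PATH, one per line. Empty output means everything was found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

On the VM, `missing_tools docker age zstd jq` printing nothing means the prerequisites are in place; any name it prints still needs installing (or cloud-init has not finished yet).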
## 3. Restore (Vault → Postgres → RustFS → Forgejo)

```
./dr/restore-to-fresh-vm.sh --host <new-ip> --port <22|222> --ts "$ts" --source off
```

What it does, in the **mandated order** (CONTRACT_004 §4.4; starting Forgejo before
Vault, Postgres, and RustFS are restored is a defect):

1. **Decrypt** the bundle with the age identity; verify every artifact's MANIFEST sha256.
2. **Vault**: start a fresh raft node, init it with throwaway keys, run
   `raft snapshot restore -force`, then **unseal with the OLD keys** from config. Vault is
   now the source of truth again; the OLD root token authenticates. All other creds are
   read back out of it.
3. **Postgres**: start with the super-password from Vault, restore `postgres.sql.gz`
   (recreates the `forgejo` role + DB; asserts `"user"` rows ≥ 1).
4. **RustFS**: start with the admin keys from Vault, recreate the four buckets + the
   scoped service account (the exact `serviceKeyId/Secret` Forgejo's app.ini expects),
   sync `rustfs-blobs` back into the buckets.
5. **Forgejo**: extract `forgejo-repos.tar.zst` into the data volume (git repos +
   `app.ini`, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject
   `SECRET_KEY` from Vault, start. Asserts that healthz passes and that
   `olsitec/foundation.git` is present.

On success it prints `DR RESTORE OK (<ts>): …`.

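The integrity check in step 1 can be pictured roughly as follows. This is a sketch under a loud assumption: the runbook does not show the MANIFEST layout, so the code assumes it is in `sha256sum` format (hex digest, two spaces, path relative to the bundle directory). `verify_manifest` is a hypothetical name; the real check lives in `dr/restore-to-fresh-vm-remote.sh`.

```shell
# verify_manifest (hypothetical helper): succeed only if every file listed in
# <dir>/MANIFEST matches its recorded sha256.
# ASSUMPTION: MANIFEST uses sha256sum's "<hex>  <relative path>" format.
verify_manifest() {
  # Subshell so the cd does not leak; the per-file "OK"/"FAILED" lines go to
  # /dev/null and only the exit status is used.
  ( cd "$1" && sha256sum -c MANIFEST >/dev/null )
}
```

A caller would stop before touching Vault if this fails, for example `verify_manifest /path/to/decrypted-bundle || exit 1`, since a bundle that fails its checksums should never reach the restore steps.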
## 4. What is NOT restored (recreatable; CONTRACT_004 §4.5)

Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's
ephemeral registration, search indexes/caches. These come back in §5.

## 5. Re-establish ingress + management

1. **DNS**: repoint the `forge/git/s3/vault.olsitec.net` A records at the new IP
   (Cloudflare; the bootstrap's `deployDns` does this once the stack is re-adopted).
2. **Caddy + runner**: re-adopt the stack so IaC manages the new VM:
   ```
   cd bootstrap
   pulumi config set foundation:vm.host <new-ip>
   pulumi config set foundation:vm.sshPort <22|222>
   ./run.sh up # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
   ```
   `up` is idempotent against the already-restored containers it can adopt, and creates
   the recreatable ones (Caddy, runner). Verify that `https://forge.olsitec.net` returns
   200 and that `git clone git@git.olsitec.net:olsitec/foundation.git` succeeds.
3. **Re-key reminder (D2)**: after a real disaster, rotate the Vault root token + the
   offsite creds (they were materialised on a possibly-compromised host) by running
   `pulumi up --replace` on the relevant credential resources, then re-run a backup.

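The HTTP half of the verification in step 2 can be scripted. A minimal sketch, assuming `curl` is available on the operator machine; `http_status` and `forge_is_up` are hypothetical helper names and the 10-second timeout is an arbitrary choice.

```shell
# http_status (hypothetical helper): print only the HTTP status code for a URL.
# curl prints 000 when no HTTP response was received at all.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# forge_is_up (hypothetical helper): succeed only when the URL answers 200.
forge_is_up() {
  [ "$(http_status "$1")" = "200" ]
}
```

For example, `forge_is_up https://forge.olsitec.net && echo "ingress OK"`. A non-200 result right after `./run.sh up` may only mean the certificate or DNS change has not propagated yet, so retry before concluding the restore failed.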
## 6. Gotchas (discovered during the T13 rehearsal)

- The **docker gid** on the host is host-specific; the runner mounts the host socket
  (PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
- `raft snapshot restore -force` **re-seals** the node (it swaps in the snapshot's
  keyring), so you MUST unseal again with the OLD keys, not the throwaway init keys.
- Restore reads **offsite** by default. RustFS on the new VM starts EMPTY; its blobs
  come from the bundle, not from a surviving RustFS.
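
The commit notes mention a settle+retry loop around the re-unseal, because `-force` re-seals asynchronously. The pattern can be sketched as below; it is an illustration, not the restorer's actual code. `retry` and `unseal_with_old_keys` are hypothetical names, the attempt count and delay are placeholders, and `VAULT_ADDR` is assumed to point at the restored node. It relies on `vault status` exiting 0 only when the node is unsealed.

```shell
# retry (hypothetical helper): run a command until it succeeds, making at most
# <tries> attempts and sleeping <delay> seconds between them.
retry() {
  local tries=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    [ "$n" -ge "$tries" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# unseal_with_old_keys (hypothetical helper): submit each OLD key share, then
# report whether the node is unsealed (vault status: 0 unsealed, 2 sealed).
# NOTE: keys passed as arguments are visible in the process list; the real
# restorer receives its secrets on stdin instead.
unseal_with_old_keys() {
  local key
  for key in "$@"; do
    vault operator unseal "$key" >/dev/null || return 1
  done
  vault status >/dev/null
}
```

A call such as `retry 10 3 unseal_with_old_keys "$OLD_KEY_1" "$OLD_KEY_2" "$OLD_KEY_3"` (variable names are placeholders) keeps re-submitting the OLD shares until the node reports unsealed or the attempts run out.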