Rehearsed and validated. The destructive sibling of backup/restore.sh:
rebuilds the ENTIRE egg on a fresh, Docker-equipped VM from the offsite,
age-encrypted bundle, in the mandated order (CONTRACT_004 §4.4):
Vault -> Postgres -> RustFS -> Forgejo.
- restore-to-fresh-vm.sh (operator): pulls the disaster-survivable secret set
from passphrase-encrypted config (age identity + Vault OLD unseal keys/root
token), ships VERSIONS + the VM-side restorer, runs it (secrets on stdin).
- restore-to-fresh-vm-remote.sh (VM-side): decrypt+verify bundle; restore Vault
(init throwaway -> raft snapshot restore -force -> re-unseal with OLD keys,
with a settle+retry loop because -force re-seals asynchronously); read every
other service's creds back out of the restored Vault; restore Postgres, RustFS
(buckets + scoped service account + blobs), and Forgejo (full /data incl.
app.ini); publish git :22 only when free.
- RUNBOOK.md: the human procedure, the {repo+passphrase+offsite} trust chain,
and §5 re-establish-ingress (DNS, Caddy, runner, re-key).
Rehearsal (throwaway cx33, offsite source, then destroyed): DR RESTORE OK —
Vault unsealed with OLD keys, postgres rows=2, forge healthy against restored
DB+S3, `git clone ssh://git@<vm>:2222/olsitec/foundation.git` returns all 28
commits, ai-baseline present. Trust chain proven end-to-end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DR RUNBOOK — rebuild the foundation egg on a fresh VM (T13)
Realises CONTRACT_004 §4.4 (restore order) · Companion: dr/restore-to-fresh-vm.sh
(orchestrator) + dr/restore-to-fresh-vm-remote.sh (VM-side). This is the
destructive sibling of backup/restore.sh (the non-destructive scratch verifier).
0. When you are here
The Helsinki VM (or its Vault/data) is gone. You still have:
- this git repo (`olsitec/foundation`), including `bootstrap/Pulumi.foundation.yaml`, whose passphrase-encrypted `secure:` values hold the Vault OLD unseal keys + root token (CONTRACT_002 §2.4) and the age identity (CONTRACT_004 §4.3);
- the master passphrase (`pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`), the one external secret; without it nothing below decrypts;
- the offsite bundle (self-hosted MinIO `olsitec-foundation/<TS>/`). RustFS is assumed lost, so restore reads offsite (`--source off`).
{repo + passphrase + offsite bundle} is sufficient to fully reconstitute the egg.
Nothing else is needed — that is the whole point of the trust chain (PLAN-002 §4.1).
1. Pick the bundle
List candidates in the offsite store and choose the newest verified one:
ts=<UTC-YYYYMMDDTHHMMSSZ> # e.g. the latest in olsitec-foundation/
A bundle is trustworthy only if backup/restore.sh <ts> off has passed for it (the
weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.
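Picking the newest candidate name can be sketched as below. This is a minimal illustration rather than anything shipped in the repo: `latest_ts` is a hypothetical helper, and it assumes every timestamp has the exact shape `YYYYMMDDTHHMMSSZ`, so that lexicographic order matches chronological order. It only selects the newest name; whether that bundle passed the weekly verify still has to be confirmed separately.

```shell
# Hypothetical helper (not in the repo): read candidate names on stdin,
# keep only well-formed UTC timestamps, and print the newest one.
latest_ts() {
  grep -E '^[0-9]{8}T[0-9]{6}Z$' | sort | tail -n 1
}

# Example with made-up timestamps; anything that is not a timestamp is ignored.
printf '%s\n' 20250101T000000Z not-a-bundle 20250301T120000Z | latest_ts
```

The result would then be assigned to `ts` for the restore command in §3.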
2. Provision a fresh VM
Real DR — use the provision stack so the new VM becomes the managed home:
cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package) — backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up
The VM must have, at first boot: docker, age, zstd, jq (the cloud-init provides
them). SSH is on :222 for a provision-stack VM (the vendored cloud-init moves it);
pass --port 222 below.
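Before starting the restore it may be worth confirming that the first-boot tools really are present. The following is a sketch only: `missing_tools` is a hypothetical helper, not part of the provision stack, and it checks nothing beyond whether each name resolves on `PATH`.

```shell
# Hypothetical preflight (not in the repo): print every required tool that is
# not on PATH, and return non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# On the fresh VM one might run:
#   missing_tools docker age zstd jq || echo "cloud-init incomplete" >&2
```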
Rehearsal — a throwaway VM created directly via the Hetzner API (cx33, debian-12,
ssh key foundation-test-ssh-key, cloud-init installing docker+age+zstd+jq), sshd on
:22. Destroy it immediately after (DELETE /v1/servers/<id>).
3. Restore (Vault → Postgres → RustFS → Forgejo)
./dr/restore-to-fresh-vm.sh --host <new-ip> --port <22|222> --ts "$ts" --source off
What it does, in the mandated order (CONTRACT_004 §4.4 — starting Forgejo before 1–3 is a defect):
- Decrypt the bundle with the age identity; verify every artifact's MANIFEST sha256.
- Vault: start a fresh raft node, init a throwaway node, `raft snapshot restore -force`, then unseal with the OLD keys from config. Vault is now the source of truth again; the OLD root token authenticates. All other creds are read back out.
- Postgres: start with the super-password from Vault, restore `postgres.sql.gz` (recreates the `forgejo` role + DB; asserts `"user"` rows ≥ 1).
- RustFS: start with the admin keys from Vault, recreate the four buckets + the scoped service account (the exact `serviceKeyId/Secret` Forgejo's app.ini expects), sync `rustfs-blobs` back into the buckets.
- Forgejo: extract `forgejo-repos.tar.zst` into the data volume (git repos + `app.ini`, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject `SECRET_KEY` from Vault, start. Asserts healthz pass + `olsitec/foundation.git` present.
On success it prints DR RESTORE OK (<ts>): ….
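The first step, verifying each artifact against the MANIFEST, could look roughly like this. The sketch assumes the MANIFEST uses the plain `sha256sum` line format (`<sha256>  <relative path>`); the real restorer's format is not shown here and may differ. `verify_manifest` is a hypothetical name.

```shell
# Assumption: MANIFEST holds "<sha256>  <relative path>" lines as written by
# `sha256sum`. Hypothetical helper, not the repo's actual verifier.
verify_manifest() {
  dir=$1
  # Run in a subshell so the caller's working directory is left untouched.
  # sha256sum -c exits non-zero if any listed file is missing or mismatched.
  ( cd "$dir" && sha256sum -c MANIFEST >/dev/null )
}

# Possible use on the decrypted bundle directory (path is illustrative):
#   verify_manifest /tmp/bundle || { echo "MANIFEST mismatch, aborting" >&2; exit 1; }
```

Failing hard at this point keeps a corrupted artifact from reaching Vault or Postgres.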
4. What is NOT restored (recreatable — CONTRACT_004 §4.5)
Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's ephemeral registration, search indexes/caches. These come back in §5.
5. Re-establish ingress + management
- DNS: repoint the `forge/git/s3/vault.olsitec.net` A records at the new IP (Cloudflare; the bootstrap's `deployDns` does this once the stack is re-adopted).
- Caddy + runner: re-adopt the stack so IaC manages the new VM:
  cd bootstrap
  pulumi config set foundation:vm.host <new-ip>
  pulumi config set foundation:vm.sshPort <22|222>
  ./run.sh up   # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
  `up` is idempotent against the already-restored containers it can adopt, and creates the recreatable ones (Caddy, runner). Verify: `https://forge.olsitec.net` = 200, `git clone git@git.olsitec.net:olsitec/foundation.git`.
- Re-key reminder (D2): after a real disaster, rotate the Vault root token + the offsite creds (they were materialised on a possibly-compromised host): `pulumi up --replace` the relevant credential resources, then re-run a backup.
6. Gotchas (discovered during the T13 rehearsal)
- The docker gid on the host is host-specific; the runner mounts the host socket (PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
- `raft snapshot restore -force` re-seals the node (it swaps in the snapshot's keyring): you MUST unseal again with the OLD keys, not the throwaway init keys.
- Restore reads offsite by default. RustFS on the new VM starts EMPTY; its blobs come from the bundle, not from a surviving RustFS.
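Because the re-seal after `-force` happens asynchronously, the VM-side restorer wraps the unseal in a settle+retry loop. A generic version of such a loop might look like the sketch below; `retry` and `unseal_with_old_keys` are hypothetical names, and the attempt count and delay are illustrative, not the values the script uses.

```shell
# Hypothetical helper (not in the repo): run a command up to <attempts> times,
# sleeping <delay> seconds after each failed try.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1   # gave up: the command never succeeded
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Possible use after the snapshot restore; `unseal_with_old_keys` stands in for
# whatever feeds the OLD unseal keys to Vault:
#   retry 10 3 unseal_with_old_keys
```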