# DR RUNBOOK — rebuild the foundation egg on a fresh VM (T13)

**Realises** CONTRACT_004 §4.4 (restore order) · **Companion**: `dr/restore-to-fresh-vm.sh`
(orchestrator) + `dr/restore-to-fresh-vm-remote.sh` (VM-side). This is the
**destructive** sibling of `backup/restore.sh` (the non-destructive scratch verifier).

## 0. When you are here

The Helsinki VM (or its Vault/data) is gone. You still have:

- **this git repo** (`olsitec/foundation`) — including `bootstrap/Pulumi.foundation.yaml`,
  whose passphrase-encrypted `secure:` values hold the Vault **OLD unseal keys + root
  token** (CONTRACT_002 §2.4) and the **age identity** (CONTRACT_004 §4.3);
- the **master passphrase** (`pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`) — the
  one external secret; without it nothing below decrypts;
- the **offsite bundle** (self-hosted MinIO `olsitec-foundation/<TS>/`) — RustFS is
  assumed lost, so restore reads **offsite** (`--source off`).

`{repo + passphrase + offsite bundle}` is sufficient to fully reconstitute the egg.
Nothing else is needed — that is the whole point of the trust chain (PLAN-002 §4.1).

## 1. Pick the bundle

List candidates in the offsite store and choose the newest verified one:

```
ts=<UTC-YYYYMMDDTHHMMSSZ>   # e.g. the latest in olsitec-foundation/
```

A bundle is trustworthy only if `backup/restore.sh <ts> off` has passed for it (the
weekly verify job, CONTRACT_004 §4.6). Prefer the last green one.
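
Because the timestamp format sorts lexicographically in chronological order, finding the newest candidate can be scripted. The following is a minimal sketch, not part of the repo: `newest_ts` is a hypothetical helper, and the `mc ls` line in the comment assumes a MinIO client alias named `offsite`. It only finds the newest bundle; whether that bundle is green still has to be checked against the weekly verify job.

```shell
# Hypothetical helper: read a bucket listing on stdin and print the newest
# bundle timestamp of the form YYYYMMDDTHHMMSSZ (UTC). Lines without a
# timestamp are ignored; prints nothing if no timestamp is found.
newest_ts() {
  grep -oE '[0-9]{8}T[0-9]{6}Z' | sort -u | tail -n 1
}

# Possible use (the alias name "offsite" is an assumption):
#   ts="$(mc ls offsite/olsitec-foundation/ | newest_ts)"
```

If the newest bundle has not passed verification, pick the newest one that has and set `ts` by hand.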

## 2. Provision a fresh VM

**Real DR** — use the provision stack so the new VM becomes the managed home:

```
cd provision
export HCLOUD_TOKEN="$(pass olsicloud4/HCLOUD_TOKEN)"
# edit the server name if the old one still exists in Hetzner; the cloud-init already
# installs docker + age + zstd (jq is a base package) — backupTools in provision/index.ts.
PULUMI_CONFIG_PASSPHRASE=dev-validation-throwaway pulumi up
```

The VM must have, at first boot: **docker, age, zstd, jq** (the cloud-init provides
them). SSH is on **:222** for a provision-stack VM (the vendored cloud-init moves it);
pass `--port 222` to the restore script in §3.

**Rehearsal** — a throwaway VM created directly via the Hetzner API (cx33, debian-12,
ssh key `foundation-test-ssh-key`, cloud-init installing docker+age+zstd+jq), sshd on
**:22**. Destroy it immediately after (`DELETE /v1/servers/<id>`).
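
Before starting the restore, it can be worth confirming that the four required tools really are on the new VM. The sketch below is an illustration only; `missing_tools` is a hypothetical helper that is not in the repo, and it checks nothing beyond whether each name resolves on `PATH`.

```shell
# Hypothetical helper: print every named tool that is NOT on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# On the VM:
#   missing_tools docker age zstd jq || echo "cloud-init has not finished, or failed"
```

An empty result with exit status 0 means all four tools are present.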

## 3. Restore (Vault → Postgres → RustFS → Forgejo)

```
./dr/restore-to-fresh-vm.sh --host <new-ip> --port <22|222> --ts "$ts" --source off
```

What it does, in the **mandated order** (CONTRACT_004 §4.4 — starting Forgejo before
Vault, Postgres, and RustFS are restored is a defect):

1. **Decrypt** the bundle with the age identity; verify every artifact's MANIFEST sha256.
2. **Vault** — start a fresh raft node, init it as a throwaway node, run
   `raft snapshot restore -force`, then **unseal with the OLD keys** from config. Vault
   is now the source of truth again; the OLD root token authenticates. All other creds
   are read back out of Vault.
3. **Postgres** — start with the super-password from Vault, restore `postgres.sql.gz`
   (recreates the `forgejo` role + DB; asserts `"user"` rows ≥ 1).
4. **RustFS** — start with the admin keys from Vault, recreate the four buckets + the
   scoped service account (the exact `serviceKeyId/Secret` Forgejo's app.ini expects),
   then sync `rustfs-blobs` back into the buckets.
5. **Forgejo** — extract `forgejo-repos.tar.zst` into the data volume (git repos +
   `app.ini`, which already carries DB/S3 creds + INTERNAL_TOKEN/JWTs), inject
   `SECRET_KEY` from Vault, start. Asserts that healthz passes and that
   `olsitec/foundation.git` is present.

On success it prints `DR RESTORE OK (<ts>): …`.
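
The checksum verification in step 1 can be pictured with a short sketch. This is not the code in `dr/restore-to-fresh-vm-remote.sh`: it assumes the MANIFEST uses the `sha256sum` line format (`<hex>  <relative-path>`) and sits at the top of the decrypted bundle directory, neither of which this runbook confirms. `verify_manifest` is a hypothetical name.

```shell
# Assumption: MANIFEST is in `sha256sum` format and lives at the top of the
# decrypted bundle directory. The real layout may differ.
verify_manifest() {
  bundle_dir=$1
  # Subshell so the caller's working directory is left untouched.
  # --quiet suppresses the per-file "OK" lines; failures are still reported
  # and make the command exit non-zero.
  ( cd "$bundle_dir" && sha256sum --check --quiet MANIFEST )
}

# Possible use (the path is a placeholder):
#   verify_manifest /path/to/decrypted-bundle || exit 1
```

A non-zero exit means at least one artifact does not match its recorded hash, and nothing from that bundle should be restored.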

## 4. What is NOT restored (recreatable — CONTRACT_004 §4.5)

Container images (re-pulled by digest), Caddy ACME data (re-issued), the runner's
ephemeral registration, search indexes/caches. These come back in §5.

## 5. Re-establish ingress + management

1. **DNS** — repoint `forge/git/s3/vault.olsitec.net` A records at the new IP
   (Cloudflare; the bootstrap's `deployDns` does this once the stack is re-adopted).
2. **Caddy + runner** — re-adopt the stack so IaC manages the new VM:

   ```
   cd bootstrap
   pulumi config set foundation:vm.host <new-ip>
   pulumi config set foundation:vm.sshPort <22|222>
   ./run.sh up   # creates Caddy (re-issues LE cert via DNS-01), re-registers the runner
   ```

   `up` is idempotent against the already-restored containers it can adopt, and creates
   the recreatable ones (Caddy, runner). Verify: `https://forge.olsitec.net` returns
   HTTP 200 and `git clone git@git.olsitec.net:olsitec/foundation.git` succeeds.
3. **Re-key reminder (D2)** — after a real disaster, rotate the Vault root token + the
   offsite creds (they were materialised on a possibly-compromised host):
   `pulumi up --replace` the relevant credential resources, then re-run a backup.
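
The HTTP half of the verification in step 2 can be scripted. The sketch below assumes `curl` is available on the machine running the check; `http_status` and `expect_200` are hypothetical helper names, not part of the repo.

```shell
# Print only the HTTP status code for a URL. curl reports 000 when no
# response was received at all (DNS failure, refused connection, timeout).
http_status() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Succeed only if the URL answers with HTTP 200.
expect_200() {
  [ "$(http_status "$1")" = "200" ]
}

# Possible use once DNS has propagated and Caddy holds a certificate:
#   expect_200 https://forge.olsitec.net && echo "forge OK"
```

A failure right after `./run.sh up` may only mean that DNS has not propagated or the certificate is still being issued, so retry before investigating further.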

## 6. Gotchas (discovered during the T13 rehearsal)

- The **docker gid** on the host is host-specific; the runner mounts the host socket
  (PLAN-002 R5). On a fresh VM re-check the gid before trusting runner jobs.
- `raft snapshot restore -force` **re-seals** the node (it swaps in the snapshot's
  keyring) — you MUST unseal again with the OLD keys, not the throwaway init keys.
- Restore reads **offsite** by default. RustFS on the new VM starts EMPTY; its blobs
  come from the bundle, not from a surviving RustFS.
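
For the docker gid gotcha, one way to re-check the value on the new VM is to read the group that owns the socket. This sketch assumes GNU `stat` (as on debian-12) and Docker's default socket path `/var/run/docker.sock`; `path_gid` is a hypothetical helper, and how the runner consumes the gid is not covered by this runbook.

```shell
# Print the numeric group id that owns a path (GNU coreutils `stat`).
path_gid() {
  stat -c '%g' "$1"
}

# On the new VM (socket path is an assumption; adjust if yours differs):
#   path_gid /var/run/docker.sock
```

Compare the printed value with the gid the runner is configured to use before trusting runner jobs.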