docs(session): SESSION_003 (fenced runner fleet) + handover for next agent

Records the R5 fence work (build → harden → decoupled runners/ Pulumi stack → live cutover to foundation-runner-02 on crunchy01) and captures the operator's two new asks for the next session: a brix02 failover runner, and a k8s runner on crunchy's k3s for heavy (16CPU/64GB) seaspots-s57-utils jobs. Refreshes HANDOVER to prioritize those + the standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:39:20 +02:00 · 1bba311d60
commit 1bba311d60
parent 44a96d84eb
2 changed files with 160 additions and 60 deletions
--- a/documentation/sessions/HANDOVER.md
+++ b/documentation/sessions/HANDOVER.md
@@ -1,76 +1,82 @@
 # HANDOVER — next-session prompt (paste into a fresh context)

-> Living doc: overwritten each handover. The durable record is the dated
-> `SESSION_*` files. Latest state = `SESSION_2026-07-01_002.md`.
+> Living doc: overwritten each handover. Durable record = the dated `SESSION_*` files.
+> Latest state = `SESSION_2026-07-01_003.md` (read it first, then #002 + #001).

 ---

-Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**.
+Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**
+(remote VMs, k3s, Docker, secrets).

 ## Required reads (in `~/work/olsitec-foundation/foundation/`)
-1. `documentation/sessions/SESSION_2026-07-01_002.md` ← current state + known gaps + next steps
-2. `documentation/sessions/SESSION_2026-07-01_001.md` ← the prior session (gaps closed, T11/T13/T14-core)
-3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007`
-   (**ADR-007** is the control-plane mechanism the whole egg runs on — read it first)
-4. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract + the Forgejo-11 caller quirk
-5. `documentation/999_testing.md` ← the operator's acceptance-test plan (now implemented)
+1. `documentation/sessions/SESSION_2026-07-01_003.md` ← runner fleet + the NEW asks below
+2. `documentation/sessions/SESSION_2026-07-01_002.md` ← T14 + ecosystem CI · `_001.md` ← the egg
+3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (ADR-007 first)
+4. `runners/README.md` ← the decoupled runner-fleet stack (host prep, config, gotchas)
+5. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)

-## Where things stand
-**The egg is LIVE; T11/T13/T14 are DONE; the ecosystem CI (999_testing) is built and validated.**
-Six containers on `foundation-net` (postgres/rustfs/vault/caddy/forgejo/runner), all healthy.
-`https://forge.olsitec.net`=200; `git clone git@git.olsitec.net:olsitec/foundation.git` works; origin is
-Forgejo (master default). Backups age-encrypted + restore-verified (RustFS + offsite); DR scripted (`dr/`).
-Working tree clean on `master`.
+## Where things stand (all green / live)
+- **The egg is LIVE** (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI
+  (reusable workflows + selftest) green; `https://forge.olsitec.net`=200.
+- **The R5 fence is LIVE + codified.** `foundation-runner-02` (crunchy01 VM, `192.168.1.16`,
+  8c/32G, label `fenced`) runs ecosystem/untrusted jobs OFF the forge VM. It's managed by the
+  **`runners/`** Pulumi stack — an **isolated project** (`bootstrap` never imports it), so
+  foundation deploy/refresh never touches crunchy01. Stack = `crunchy`; config + state are
+  gitignored (operator workstation only).
+- Foundation repo `master` clean, all pushed.

-**CI on the runner, all green:**
- `ci.yml` (preflight + typecheck), `pulumi-preview.yml` (read-only drift/PR check),
-  `backup-verify.yml` (weekly + dispatch; RESTORE VERIFY PASS from offsite).
- `ecosystem-selftest.yml` — semantic-release bump sequence (1.0.0→1.1.0→1.1.1→2.0.0→3.0.0) +
-  eslint/yamllint non-zero-exit gates.
- `.forgejo/workflows/reusable-*.yml` (node-build, docker-build, lint, semantic-release) — the
-  ecosystem-CI reuse layer. Downstream repos call them as
-  `uses: olsitec/foundation/.forgejo/workflows/<x>.yml@master`. **Forgejo-11 quirk:** the calling job
-  MUST set `runs-on` (omitting it → silently zero runs; removed by a v15 upgrade) and use the SHORT
-  cross-repo ref (not a full URL). See `.forgejo/workflows/README.md`.
+## THIS session's work (operator asks, in priority order)
+### 1. brix02 runner with failover from crunchy01
+Add a runner on **brix02 (`192.168.1.3`)** that picks up jobs **only when crunchy01 is
+unavailable**. **Forgejo has no native standby** — same-label runners load-balance; offline
+ones get nothing. Choose with the operator:
+- *HA-on-outage (simple, recommended):* register brix02 with the SAME `fenced` label → when
+  crunchy is down, brix02 covers; when both are up, they share load.
+- *Strict standby (custom):* brix02 runner kept STOPPED + a watchdog (systemd timer polling
+  the Forgejo runners API) that starts it only when crunchy's runner is offline.
+The `runners/` stack is multi-host-capable: `cd runners && pulumi stack init brix02 &&
+pulumi config set host.address 192.168.1.3 && ... vm.name foundation-runner-03`. **FIRST**
+verify brix02 has KVM + libvirt + a LAN bridge (same host prep as crunchy — see runners/README
+§Host prep; brix02 is also the Graylog target, so it's a real box). Then `pulumi up` and prove
+a `fenced` job runs on it (and, for standby, that it idles while crunchy is up).

-`cd bootstrap && ./run.sh up` is idempotent and now also publishes `pulumi stack export` to RustFS
-(`bootstrap/state-publish.sh`) so the state-dependent CI has Pulumi state.
+### 2. k8s runner for heavy (16CPU/64GB) jobs on crunchy's k3s
+The seaspots GitLab pipelines (`~/work/seaspots/gitlab/pipelines/.gitlab-ci.yml`) run
+**seaspots-s57-utils** (ogr2ogr + tippecanoe; `registry.gitlab.com/seaspots/tools/
+seaspots-s57-utils:1.11.0`), `tags: [heavy-compute]`, needing **16+ CPU / 64+ GB RAM / 100+ GB
+disk** — they'd crush the 8c/32G VM runner. Stand up a **Forgejo runner inside crunchy's k3s
+cluster** (k8s-scheduled resources) with a distinct label (e.g. `heavy`), so `runs-on: heavy`
+jobs run there. DESIGN TASK (not started): Forgejo `act_runner` executes via **docker** or
+**host** mode — no mature native k8s executor (unlike GitLab). Evaluate act_runner as a k8s
+Deployment with big resource requests (host-mode or a DinD sidecar) vs. alternatives. Note
+crunchy's k3s already runs the GitLab runners (namespace `gitlab`) — do not disturb them.

 ## Operating essentials
- **VM**: `204.168.234.72`, admin SSH **:222**, key `~/.ssh/foundation-test_ed25519` (also the Forgejo
-  operator key). Git endpoint :22 (scp-form) + :2222.
- **Deploy**: `cd bootstrap && ./run.sh up`. Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
- **Vault reboot**: `bootstrap/vault-unseal.sh`. **Backup**: `backup/backup.sh [ts]`; **restore-verify**:
-  `backup/restore.sh <ts> [rfs|off]`. **DR**: `dr/restore-to-fresh-vm.sh` (+ `dr/RUNBOOK.md`).
- **Forge admin**: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
-  (If you change the admin password in the UI, the API steps that set CI secrets need the new value.)
- **CI image**: built on the VM (`/tmp/ci-image`, from `containers/ci-image/Dockerfile`), tag
-  `foundation-ci:latest`, used locally by the runner (`force_pull:false`). Rebuild on toolchain change:
-  `scp` the Dockerfile + `docker build -t foundation-ci:latest .` on the VM.
- **CI secrets** (repo-scoped on `olsitec/foundation`, set via the admin API): `PULUMI_CONFIG_PASSPHRASE`,
-  `SSH_PRIVATE_KEY`, `RUSTFS_ACCESS_KEY`, `RUSTFS_SECRET_KEY`.
+- **Forge VM**: `204.168.234.72`, SSH **:222**, key `~/.ssh/foundation-test_ed25519`.
+  Deploy: `cd bootstrap && ./run.sh up`. Passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
+  Forge admin: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
+- **crunchy01**: `root@192.168.1.2` (operator key in root's authorized_keys) OR `andiolsi`+sudo.
+  libvirt installed; pool `images`; `libvirt-bridge-forward.timer` active (kube-router-proof).
+  Runner fleet: `cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519;
+  export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE);
+  pulumi stack select crunchy`.
+- **CI image**: `foundation-ci:latest`, built on the forge VM (`/tmp/ci-image`); rebuild on toolchain change.
+- **Reuse mechanism**: Forgejo 11 reusable workflows work, but the CALLING job needs `runs-on`
+  and a SHORT cross-repo ref (`.forgejo/workflows/README.md`). Composite actions need a FULL-URL ref.

 ## Watchouts (HIGH-RISK)
- `pulumi-preview` shows a benign perpetual `~sshOpts` diff (the operator vs CI key path differ in the
-  docker provider) — informational; preview exits 0 on diffs by design. Don't add `--expect-no-changes`.
- `up --refresh` shows pessimistic `~triggers` replaces on the vault command chain (a preview artifact,
-  idempotent if applied). The VM sshd throttles bursts of docker-over-SSH → use `--parallel 1` for refresh,
-  or raise MaxStartups before wiring refresh into CI.
- Never print/commit the passphrase, Vault root token, or unseal keys (D2). Don't `pulumi up` the prod
-  `olsicloud4-*` stacks, and don't `up` the `provision` stack against the LIVE VM (it would recreate it).
- The runner holds the host Docker socket (root-equivalent). **R5 is deferred** (operator OK'd trusted
-  first-party CI on it) — fence it to a separate VM before any UNTRUSTED workflow. Commit atomically per task.
+- crunchy01 is a k3s node — the `physdev-is-bridged` FORWARD accept is what lets VMs reach the
+  LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (`gitlab`
+  namespace runners, `nominatim`, flannel/cni0).
+- Never commit the passphrase / Vault root token / unseal keys. `runners` stack state lives
+  only on the workstation (not backed up — a DR gap to address).
+- Stale offline `crunchy-runner` registration on the forge (from the retired hand-built VM) —
+  harmless; deregister at leisure. Don't `pulumi up` the prod `olsicloud4-*` stacks.

-## Next work (pick up here)
-1. **Package registry (Stage-2)** — populate the Forgejo package registry so cross-repo `@olsitec` deps
-   resolve: publish `olsicrypto`, `svelte-common`, … Then validate `docker-build` end-to-end for the two
-   registry-blocked candidates (**C1 seaspots-homepage**, **C5 token-service**) — pass an npmrc via the
-   action's `build-args`. (C2/C3/C4 already validated.)
-2. **R5 fence** — separate privileged runner VM (or socket-less DinD), labelled, before untrusted repos.
-3. **T15** — `index.ts` orchestration polish (phase marker still `T10-runner`) + Gate A/B comments +
-   `docs/DAY-ZERO-TIMELINE.md`.
-4. **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` `latest`, `IMAGE_CI` tag);
-   pre-bake pulumi plugins into `foundation-ci` (drop preview's per-run auto-install); register in Olsitec
-   MCP (D6); a Forgejo v15 upgrade would drop the reusable-workflow caller `runs-on`/short-ref quirks.
+## Standing backlog (after the two asks above)
+- **Package registry (Stage-2)** — publish `@olsitec` pkgs so the C1/C5 docker candidates build.
+- **T15** — index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
+- **Hardening** — pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade;
+  back up the `runners` stack state.

-Validate each task live (VM `./run.sh up` + the runner for CI) and commit per task.
+Validate each task live and commit atomically per task.
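
The strict-standby option in ask 1 of the new HANDOVER.md (brix02 runner kept stopped, started by a watchdog only while crunchy's runner is offline) comes down to a small decision rule. The sketch below shows that rule only, under stated assumptions: `RunnerStatus` and `standby_action` are hypothetical names, and how the runner list is fetched is deliberately left out, because the handover does not give the Forgejo runners API endpoint or its response shape.

```python
from dataclasses import dataclass
from typing import Iterable, Literal

Action = Literal["start", "stop", "none"]


@dataclass(frozen=True)
class RunnerStatus:
    """Hypothetical, already-parsed view of one registered runner."""
    name: str
    online: bool


def standby_action(
    runners: Iterable[RunnerStatus],
    primary: str,
    standby_running: bool,
) -> Action:
    """Decide what the watchdog should do with the standby runner.

    The primary counts as online only if a registration with that name
    exists AND reports online; a missing registration is treated as offline.
    """
    primary_online = any(r.name == primary and r.online for r in runners)
    if primary_online:
        # Primary is healthy: the standby should idle.
        return "stop" if standby_running else "none"
    # Primary is down or unknown: the standby should cover.
    return "none" if standby_running else "start"


if __name__ == "__main__":
    # Example: primary offline, standby currently stopped.
    fleet = [RunnerStatus("foundation-runner-02", online=False)]
    print(standby_action(fleet, "foundation-runner-02", standby_running=False))
```

A systemd timer on brix02 could call this once per poll and map `start`/`stop` onto the runner service. A real watchdog would also need to avoid stopping the standby while it is in the middle of a job, which this sketch does not handle.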