From 1bba311d60e6f9ecfad391bbb860040da4c9d2bc Mon Sep 17 00:00:00 2001
From: Andreas Niemann
Date: Wed, 1 Jul 2026 03:39:20 +0200
Subject: [PATCH] docs(session): SESSION_003 (fenced runner fleet) + handover for next agent
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Records the R5 fence work (build → harden → decoupled runners/ Pulumi
stack → live cutover to foundation-runner-02 on crunchy01) and captures
the operator's two new asks for the next session: a brix02 failover
runner, and a k8s runner on crunchy's k3s for heavy (16CPU/64GB)
seaspots-s57-utils jobs. Refreshes HANDOVER to prioritize those + the
standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 documentation/sessions/HANDOVER.md            | 126 +++++++++---------
 .../sessions/SESSION_2026-07-01_003.md        |  94 +++++++++++++
 2 files changed, 160 insertions(+), 60 deletions(-)
 create mode 100644 documentation/sessions/SESSION_2026-07-01_003.md

diff --git a/documentation/sessions/HANDOVER.md b/documentation/sessions/HANDOVER.md
index 7109246..2a25fb5 100644
--- a/documentation/sessions/HANDOVER.md
+++ b/documentation/sessions/HANDOVER.md
@@ -1,76 +1,82 @@
 # HANDOVER — next-session prompt (paste into a fresh context)

-> Living doc: overwritten each handover. The durable record is the dated
-> `SESSION_*` files. Latest state = `SESSION_2026-07-01_002.md`.
+> Living doc: overwritten each handover. Durable record = the dated `SESSION_*` files.
+> Latest state = `SESSION_2026-07-01_003.md` (read it first, then #002 + #001).

 ---

-Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**.
+Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**
+(remote VMs, k3s, Docker, secrets).

 ## Required reads (in `~/work/olsitec-foundation/foundation/`)
-1. `documentation/sessions/SESSION_2026-07-01_002.md` ← current state + known gaps + next steps
-2. `documentation/sessions/SESSION_2026-07-01_001.md` ← the prior session (gaps closed, T11/T13/T14-core)
-3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007`
-   (**ADR-007** is the control-plane mechanism the whole egg runs on — read it first)
-4. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract + the Forgejo-11 caller quirk
-5. `documentation/999_testing.md` ← the operator's acceptance-test plan (now implemented)
+1. `documentation/sessions/SESSION_2026-07-01_003.md` ← runner fleet + the NEW asks below
+2. `documentation/sessions/SESSION_2026-07-01_002.md` ← T14 + ecosystem CI · `_001.md` ← the egg
+3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (ADR-007 first)
+4. `runners/README.md` ← the decoupled runner-fleet stack (host prep, config, gotchas)
+5. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)

-## Where things stand
-**The egg is LIVE; T11/T13/T14 are DONE; the ecosystem CI (999_testing) is built and validated.**
-Six containers on `foundation-net` (postgres/rustfs/vault/caddy/forgejo/runner), all healthy.
-`https://forge.olsitec.net`=200; `git clone git@git.olsitec.net:olsitec/foundation.git` works; origin is
-Forgejo (master default). Backups age-encrypted + restore-verified (RustFS + offsite); DR scripted (`dr/`).
-Working tree clean on `master`.
+## Where things stand (all green / live)
+- **The egg is LIVE** (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI
+  (reusable workflows + selftest) green; `https://forge.olsitec.net`=200.
+- **The R5 fence is LIVE + codified.** `foundation-runner-02` (crunchy01 VM, `192.168.1.16`,
+  8c/32G, label `fenced`) runs ecosystem/untrusted jobs OFF the forge VM. It's managed by the
+  **`runners/`** Pulumi stack — an **isolated project** (`bootstrap` never imports it), so
+  foundation deploy/refresh never touches crunchy01. Stack = `crunchy`; config + state are
+  gitignored (operator workstation only).
+- Foundation repo `master` clean, all pushed.

-**CI on the runner, all green:**
-- `ci.yml` (preflight + typecheck), `pulumi-preview.yml` (read-only drift/PR check),
-  `backup-verify.yml` (weekly + dispatch; RESTORE VERIFY PASS from offsite).
-- `ecosystem-selftest.yml` — semantic-release bump sequence (1.0.0→1.1.0→1.1.1→2.0.0→3.0.0) +
-  eslint/yamllint non-zero-exit gates.
-- `.forgejo/workflows/reusable-*.yml` (node-build, docker-build, lint, semantic-release) — the
-  ecosystem-CI reuse layer. Downstream repos call them as
-  `uses: olsitec/foundation/.forgejo/workflows/.yml@master`. **Forgejo-11 quirk:** the calling job
-  MUST set `runs-on` (omitting it → silently zero runs; removed by a v15 upgrade) and use the SHORT
-  cross-repo ref (not a full URL). See `.forgejo/workflows/README.md`.
+## THIS session's work (operator asks, in priority order)
+### 1. brix02 runner with failover from crunchy01
+Add a runner on **brix02 (`192.168.1.3`)** that picks up jobs **only when crunchy01 is
+unavailable**. **Forgejo has no native standby** — same-label runners load-balance; offline
+ones get nothing. Choose with the operator:
+- *HA-on-outage (simple, recommended):* register brix02 with the SAME `fenced` label → when
+  crunchy is down brix02 covers; when both up they share load.
+- *Strict standby (custom):* brix02 runner kept STOPPED + a watchdog (systemd timer polling
+  the Forgejo runners API) that starts it only when crunchy's runner is offline.
+The `runners/` stack is multi-host-capable: `cd runners && pulumi stack init brix02 &&
+pulumi config set host.address 192.168.1.3 && ... vm.name foundation-runner-03`. **FIRST**
+verify brix02 has KVM + libvirt + a LAN bridge (same host prep as crunchy — see runners/README
+§Host prep; brix02 is also the Graylog target, so it's a real box). Then `pulumi up` and prove
+a `fenced` job runs on it (and, for standby, that it idles while crunchy is up).

-`cd bootstrap && ./run.sh up` is idempotent and now also publishes `pulumi stack export` to RustFS
-(`bootstrap/state-publish.sh`) so the state-dependent CI has Pulumi state.
+### 2. k8s runner for heavy (16CPU/64GB) jobs on crunchy's k3s
+The seaspots GitLab pipelines (`~/work/seaspots/gitlab/pipelines/.gitlab-ci.yml`) run
+**seaspots-s57-utils** (ogr2ogr + tippecanoe; `registry.gitlab.com/seaspots/tools/
+seaspots-s57-utils:1.11.0`), `tags: [heavy-compute]`, needing **16+ CPU / 64+ GB RAM / 100+ GB
+disk** — they'd crush the 8c/32G VM runner. Stand up a **Forgejo runner inside crunchy's k3s
+cluster** (k8s-scheduled resources) with a distinct label (e.g. `heavy`), so `runs-on: heavy`
+jobs run there. DESIGN TASK (not started): Forgejo `act_runner` executes via **docker** or
+**host** mode — no mature native k8s executor (unlike GitLab). Evaluate act_runner as a k8s
+Deployment with big resource requests (host-mode or a DinD sidecar) vs. alternatives. Note
+crunchy's k3s already runs the GitLab runners (namespace `gitlab`) — do not disturb them.

 ## Operating essentials
-- **VM**: `204.168.234.72`, admin SSH **:222**, key `~/.ssh/foundation-test_ed25519` (also the Forgejo
-  operator key). Git endpoint :22 (scp-form) + :2222.
-- **Deploy**: `cd bootstrap && ./run.sh up`. Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
-- **Vault reboot**: `bootstrap/vault-unseal.sh`. **Backup**: `backup/backup.sh [ts]`; **restore-verify**:
-  `backup/restore.sh [rfs|off]`. **DR**: `dr/restore-to-fresh-vm.sh` (+ `dr/RUNBOOK.md`).
-- **Forge admin**: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
-  (If you change the admin password in the UI, the API steps that set CI secrets need the new value.)
-- **CI image**: built on the VM (`/tmp/ci-image`, from `containers/ci-image/Dockerfile`), tag
-  `foundation-ci:latest`, used locally by the runner (`force_pull:false`). Rebuild on toolchain change:
-  `scp` the Dockerfile + `docker build -t foundation-ci:latest .` on the VM.
-- **CI secrets** (repo-scoped on `olsitec/foundation`, set via the admin API): `PULUMI_CONFIG_PASSPHRASE`,
-  `SSH_PRIVATE_KEY`, `RUSTFS_ACCESS_KEY`, `RUSTFS_SECRET_KEY`.
+- **Forge VM**: `204.168.234.72`, SSH **:222**, key `~/.ssh/foundation-test_ed25519`.
+  Deploy: `cd bootstrap && ./run.sh up`. Passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
+  Forge admin: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
+- **crunchy01**: `root@192.168.1.2` (operator key in root's authorized_keys) OR `andiolsi`+sudo.
+  libvirt installed; pool `images`; `libvirt-bridge-forward.timer` active (kube-router-proof).
+  Runner fleet: `cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519;
+  export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE);
+  pulumi stack select crunchy`.
+- **CI image**: `foundation-ci:latest`, built on the forge VM (`/tmp/ci-image`); rebuild on toolchain change.
+- **Reuse mechanism**: Forgejo 11 reusable workflows work but the CALLING job needs `runs-on` +
+  SHORT cross-repo ref (`.forgejo/workflows/README.md`). Composite actions need FULL-URL.

 ## Watchouts (HIGH-RISK)
-- `pulumi-preview` shows a benign perpetual `~sshOpts` diff (the operator vs CI key path differ in the
-  docker provider) — informational; preview exits 0 on diffs by design. Don't add `--expect-no-changes`.
-- `up --refresh` shows pessimistic `~triggers` replaces on the vault command chain (a preview artifact,
-  idempotent if applied). The VM sshd throttles bursts of docker-over-SSH → use `--parallel 1` for refresh,
-  or raise MaxStartups before wiring refresh into CI.
-- Never print/commit the passphrase, Vault root token, or unseal keys (D2). Don't `pulumi up` the prod
-  `olsicloud4-*` stacks, and don't `up` the `provision` stack against the LIVE VM (it would recreate it).
-- The runner holds the host Docker socket (root-equivalent). **R5 is deferred** (operator OK'd trusted
-  first-party CI on it) — fence it to a separate VM before any UNTRUSTED workflow. Commit atomically per task.
+- crunchy01 is a k3s node — the `physdev-is-bridged` FORWARD accept is what lets VMs reach the
+  LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (`gitlab`
+  namespace runners, `nominatim`, flannel/cni0).
+- Never commit the passphrase / Vault root token / unseal keys. `runners` stack state lives
+  only on the workstation (not backed up — a DR gap to address).
+- Stale offline `crunchy-runner` registration on the forge (from the retired hand-built VM) —
+  harmless; deregister at leisure. Don't `pulumi up` the prod `olsicloud4-*` stacks.

-## Next work (pick up here)
-1. **Package registry (Stage-2)** — populate the Forgejo package registry so cross-repo `@olsitec` deps
-   resolve: publish `olsicrypto`, `svelte-common`, … Then validate `docker-build` end-to-end for the two
-   registry-blocked candidates (**C1 seaspots-homepage**, **C5 token-service**) — pass an npmrc via the
-   action's `build-args`. (C2/C3/C4 already validated.)
-2. **R5 fence** — separate privileged runner VM (or socket-less DinD), labelled, before untrusted repos.
-3. **T15** — `index.ts` orchestration polish (phase marker still `T10-runner`) + Gate A/B comments +
-   `docs/DAY-ZERO-TIMELINE.md`.
-4. **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` `latest`, `IMAGE_CI` tag);
-   pre-bake pulumi plugins into `foundation-ci` (drop preview's per-run auto-install); register in Olsitec
-   MCP (D6); a Forgejo v15 upgrade would drop the reusable-workflow caller `runs-on`/short-ref quirks.
+## Standing backlog (after the two asks above)
+- **Package registry (Stage-2)** — publish `@olsitec` pkgs so the C1/C5 docker candidates build.
+- **T15** — index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
+- **Hardening** — pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade;
+  back up the `runners` stack state.

-Validate each task live (VM `./run.sh up` + the runner for CI) and commit per task.
+Validate each task live and commit atomically per task.
diff --git a/documentation/sessions/SESSION_2026-07-01_003.md b/documentation/sessions/SESSION_2026-07-01_003.md
new file mode 100644
index 0000000..4f3dbf9
--- /dev/null
+++ b/documentation/sessions/SESSION_2026-07-01_003.md
@@ -0,0 +1,94 @@
+# Session 2026-07-01 #003 — the fenced runner fleet (R5): build → harden → codify → cutover
+
+## What was done
+Continued from #002 (T14 + ecosystem CI done). Built the **R5 fence** the operator had
+deferred: a Forgejo Actions runner on a **separate VM on separate hardware** (crunchy01),
+so ecosystem/untrusted jobs (`runs-on: fenced`) run OFF the forge VM. Then hardened it,
+formalized it as a **decoupled Pulumi stack**, and did the **live cutover**.
+
+### 1. Fenced runner — built by hand, then proven
+- crunchy01 (`192.168.1.2`, 16c/128G, Debian 13, **k3s node** running the GitLab
+  runners + `nominatim`) had `/dev/kvm` + passwordless sudo but no libvirt. Installed
+  qemu-kvm/libvirt/virtinst/cloud-image-utils.
+- Created an **Ubuntu 24.04** VM on the LAN bridge `br0`, docker inside, registered a
+  Forgejo runner (label `fenced`) against `https://forge.olsitec.net`.
+- **Proven**: a `runs-on: fenced` job ran with kernel `6.8.0` (Ubuntu VM) + egress
+  `62.176.248.112` (site IP), vs the Hetzner forge VM's `6.1.0` / `204.168.234.72` —
+  i.e. it executed on crunchy, isolated from the forge.
+
+### 2. Hardening
+- **kube-router-proof firewall.** crunchy01's k3s/kube-router sets `FORWARD policy DROP` +
+  `br_netfilter=1`, which drops bridged VM↔LAN traffic (incl. the runner→forge poll).
+  Fix = `iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT`, re-asserted by a
+  **60s systemd timer** (`libvirt-bridge-forward.timer`) because kube-router flushes
+  iptables on resync (a boot-only unit isn't enough).
+- VM autostart; rotated the throwaway root console password; PTY console so
+  `virsh console` works.
+
+### 3. Formalized as a DECOUPLED Pulumi stack — `runners/`
+New isolated project (peer to `bootstrap/`, `provision/`, `offsite-backup/`).
+**Why decoupled** (operator's explicit concern, and the answer to "is this a Pulumi
+problem like it was with Terraform"): a `@pulumi/libvirt` provider dials the runner host
+on every up/refresh, so putting it in `bootstrap` would make the **foundation**
+undeployable/unrefreshable whenever crunchy01 is down/unreachable. Pulumi isolates this
+at the **stack boundary** — `bootstrap` never imports `runners/`. One-way dependency:
+`runners` mints a token FROM the forge, so it's "step-0 after the foundation stands".
+
+### 4. Live `pulumi up` cutover — DONE
+Ran the `crunchy` stack live: created **`foundation-runner-02`** (static `192.168.1.16`,
+**8c/32G**), registered the `fenced` runner, and a `runs-on: fenced` job ran GREEN on it.
+Then **retired the hand-built VM** (`foundation-runner-01`), so the Pulumi-managed
+runner-02 is the sole fenced runner. Bugs the live run surfaced (all fixed):
+NIC name isn't `enp1s0` → match `e*`; drop `qemuAgent:true` (blocks on the agent at
+create); `dialErrorLimit:30` for boot; fix the register token passing; host prep = root
+SSH + the `images` pool (crunchy has no `default` pool).
+
+## Current state
+- **`foundation-runner-02`** live on crunchy01 (`192.168.1.16`, 8c/32G), Pulumi-managed,
+  label `fenced`, executing jobs. Its runner is a docker container in the VM.
+- **`runners/`** project committed to the foundation repo (index.ts, README, Pulumi.yaml,
+  package.json). **`runners/Pulumi.crunchy.yaml` + `runners/state/` are gitignored**
+  (local to the operator workstation only — see Open threads: not backed up).
+- Access: crunchy01 `root@192.168.1.2` (operator key `~/.ssh/foundation-test_ed25519`
+  now in root's authorized_keys; also `andiolsi` + sudo). libvirt installed; `images`
+  pool; `libvirt-bridge-forward.timer` active. Forge VM `root@204.168.234.72:222`.
+- Foundation repo `master` clean, all pushed. Forge admin `platform-admin` / Vault
+  `foundation/forgejo/service-credentials:forgejoAdminPassword`.
+
+## NEW requirements from the operator (this session) — for the next agent
+1. **brix02 (`192.168.1.3`) runner with failover from crunchy01.** Only when crunchy01
+   is unavailable should brix02 pick up jobs. **Forgejo has no native standby**: same-label
+   runners load-balance, offline ones just get nothing. Two paths:
+   - *HA-on-outage (simple):* register brix02 with the SAME `fenced` label — when crunchy
+     is down, brix02 covers; when both up, they share load.
+   - *Strict standby (custom):* keep brix02's runner STOPPED + a watchdog (systemd timer
+     polling the Forgejo runners API) that starts it only when crunchy's runner is offline.
+   The `runners/` stack is already multi-host-capable via config — target brix02 with a
+   second stack (`pulumi stack init brix02`, `config set host.address 192.168.1.3`,
+   `vm.name foundation-runner-03`). **First verify brix02 has KVM + libvirt + a bridge**
+   (same host prep as crunchy; brix02 is also the Graylog target — see memory).
+2. **k8s runner for heavy jobs.** The seaspots GitLab pipelines
+   (`~/work/seaspots/gitlab/pipelines`, `.gitlab-ci.yml`) run **seaspots-s57-utils**
+   (`registry.gitlab.com/seaspots/tools/seaspots-s57-utils:1.11.0`: GDAL/ogr2ogr +
+   tippecanoe) — `tags: [heavy-compute]`, needs **16+ CPU / 64+ GB RAM / 100+ GB disk**.
+   These would crush the 8c/32G VM runner. The operator wants such heavy/containerized
+   jobs to run on a **Forgejo runner inside crunchy's k3s cluster** (k8s-scheduled
+   resources), with a distinct label (e.g. `heavy`/`k8s`). Design note: Forgejo
+   `act_runner` executes jobs via **docker** or **host** mode — it has no mature native
+   k8s executor like GitLab's. The next agent must evaluate: act_runner as a k8s
+   Deployment (big resource requests) using host-mode or a DinD sidecar, vs. another
+   approach. This is a design task, not yet started.
+
+## Open threads / backlog (from #002 + this session)
+- **`runners` stack state not backed up** — only on the operator workstation
+  (`runners/state/`, gitignored). A DR gap; consider backing it up like `bootstrap`'s.
+- **Stale `crunchy-runner` registration** on the forge (from the retired hand-built VM) —
+  offline, harmless; deregister at leisure (Forgejo runners admin API/UI).
+- **Package registry (Stage-2)** — publish `@olsitec` packages (olsicrypto, svelte-common)
+  so the two registry-blocked 999_testing candidates (seaspots-homepage, token-service)
+  build via `reusable-docker-build`.
+- **T15** — `index.ts` phase marker still `T10-runner`; Gate A/B comments; DAY-ZERO-TIMELINE.
+- **Hardening** — pin floating image refs; pre-bake pulumi plugins into foundation-ci;
+  MCP registration (D6); Forgejo v15 upgrade drops the reusable-workflow `runs-on` quirk.
+
+## Operating mode for next session: HIGH-RISK / INFRA (remote VMs, k3s, Docker, secrets).
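
Reviewer note on the kube-router-proof firewall fix recorded under "### 2. Hardening" in `SESSION_2026-07-01_003.md`: the re-assert step could be sketched as below. This is a minimal sketch under stated assumptions, not the script that `libvirt-bridge-forward.timer` actually runs. The function name `ensure_bridged_forward_accept` and the `IPT` override variable are hypothetical; only the rule itself (`-m physdev --physdev-is-bridged -j ACCEPT` on `FORWARD`) comes from the session notes. It relies on `iptables -C`, which checks whether a rule is already present, so the rule is re-inserted only after kube-router has flushed it.

```shell
#!/bin/sh
# Sketch: re-assert the bridged-traffic ACCEPT rule on the VM host.
# IPT is a hypothetical override so the function can be exercised without root.
IPT="${IPT:-iptables}"

ensure_bridged_forward_accept() {
    # -C exits 0 only when this exact rule already exists in the chain.
    if "$IPT" -C FORWARD -m physdev --physdev-is-bridged -j ACCEPT 2>/dev/null; then
        echo "present"
    else
        # -I without an index inserts at the head of FORWARD,
        # so the rule is evaluated before anything kube-router adds.
        "$IPT" -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT && echo "inserted"
    fi
}
```

A timer firing every 60 seconds, as the notes describe, would call this function on each tick. A boot-only unit would leave the rule missing after the next kube-router resync, which is why the check-then-insert form is useful here: repeated runs do not stack duplicate rules.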