foundation/documentation/sessions/HANDOVER.md
Andreas Niemann 1bba311d60
docs(session): SESSION_003 (fenced runner fleet) + handover for next agent
Records the R5 fence work (build → harden → decoupled runners/ Pulumi stack →
live cutover to foundation-runner-02 on crunchy01) and captures the operator's
two new asks for the next session: a brix02 failover runner, and a k8s runner on
crunchy's k3s for heavy (16CPU/64GB) seaspots-s57-utils jobs. Refreshes HANDOVER
to prioritize those + the standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:39:20 +02:00


HANDOVER — next-session prompt (paste into a fresh context)

Living doc: overwritten each handover. Durable record = the dated SESSION_* files. Latest state = SESSION_2026-07-01_003.md (read it first, then #002 + #001).


Continue the olsitec-foundation build. You are the Lead Agent, HIGH-RISK / INFRA mode (remote VMs, k3s, Docker, secrets).

Required reads (in ~/work/olsitec-foundation/foundation/)

  1. documentation/sessions/SESSION_2026-07-01_003.md ← runner fleet + the NEW asks below
  2. documentation/sessions/SESSION_2026-07-01_002.md ← T14 + ecosystem CI · _001.md ← the egg
  3. documentation/contracts/CONTRACT_001004 + decisions/ADR_004,005,006,007 (ADR-007 first)
  4. runners/README.md ← the decoupled runner-fleet stack (host prep, config, gotchas)
  5. .forgejo/workflows/README.md ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)

Where things stand (all green / live)

  • The egg is LIVE (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI (reusable workflows + selftest) green; https://forge.olsitec.net returns HTTP 200.
  • The R5 fence is LIVE + codified. foundation-runner-02 (crunchy01 VM, 192.168.1.16, 8c/32G, label fenced) runs ecosystem/untrusted jobs OFF the forge VM. It's managed by the runners/ Pulumi stack — an isolated project (bootstrap never imports it), so foundation deploy/refresh never touches crunchy01. Stack = crunchy; config + state are gitignored (operator workstation only).
  • Foundation repo master clean, all pushed.

THIS session's work (operator asks, in priority order)

1. brix02 runner with failover from crunchy01

Add a runner on brix02 (192.168.1.3) that picks up jobs only when crunchy01 is unavailable. Forgejo has no native standby — same-label runners load-balance; offline ones get nothing. Choose with the operator:

  • HA-on-outage (simple, recommended): register brix02 with the SAME fenced label → when crunchy is down brix02 covers; when both up they share load.
  • Strict standby (custom): brix02 runner kept STOPPED + a watchdog (systemd timer polling the Forgejo runners API) that starts it only when crunchy's runner is offline.

The runners/ stack is multi-host-capable: cd runners && pulumi stack init brix02 && pulumi config set host.address 192.168.1.3 && ... vm.name foundation-runner-03. FIRST verify brix02 has KVM + libvirt + a LAN bridge (same host prep as crunchy, see runners/README §Host prep; brix02 is also the Graylog target, so it's a real box). Then pulumi up and prove a fenced job runs on it (and, for standby, that it idles while crunchy is up).
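For the strict-standby option, the watchdog's core decision could look like the sketch below. It assumes the timer job has already obtained crunchy's runner status (querying the Forgejo runners API is left to the caller and not shown) and the standby unit's state from systemctl is-active. The function name decide_action and the status strings are hypothetical, not part of the runners/ stack.

```shell
#!/bin/sh
# Hypothetical watchdog step for the strict-standby option.
# Assumption: the caller passes the primary runner's status as
# "online" or "offline", and the standby unit state as reported
# by `systemctl is-active` ("active", "inactive", ...).

# decide_action PRIMARY_STATUS STANDBY_STATE
# Prints the action the watchdog should take: start, stop, or none.
decide_action() {
  primary_status=$1
  standby_state=$2

  case "$primary_status" in
    offline)
      # Primary is down: the standby should be running.
      if [ "$standby_state" = "active" ]; then echo none; else echo start; fi
      ;;
    online)
      # Primary is up: the standby should idle.
      if [ "$standby_state" = "active" ]; then echo stop; else echo none; fi
      ;;
    *)
      # Unknown status (API error, timeout): change nothing.
      echo none
      ;;
  esac
}
```

A timer-driven wrapper would map start and stop to systemctl start/stop on the brix02 runner unit. Treating an unknown status as "none" is a deliberate choice in this sketch, so that an API hiccup does not flap the standby.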

2. k8s runner for heavy (16CPU/64GB) jobs on crunchy's k3s

The seaspots GitLab pipelines (~/work/seaspots/gitlab/pipelines/.gitlab-ci.yml) run seaspots-s57-utils (ogr2ogr + tippecanoe; registry.gitlab.com/seaspots/tools/seaspots-s57-utils:1.11.0), tags: [heavy-compute], needing 16+ CPU / 64+ GB RAM / 100+ GB disk; they would crush the 8c/32G VM runner. Stand up a Forgejo runner inside crunchy's k3s cluster (k8s-scheduled resources) with a distinct label (e.g. heavy), so runs-on: heavy jobs run there.

DESIGN TASK (not started): Forgejo act_runner executes via docker or host mode; there is no mature native k8s executor (unlike GitLab). Evaluate act_runner as a k8s Deployment with big resource requests (host-mode or a DinD sidecar) vs. alternatives. Note crunchy's k3s already runs the GitLab runners (namespace gitlab): do not disturb them.
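To make the sizing argument concrete, a small preflight like the following could gate where a heavy job is allowed to land. This is only a sketch: fits_heavy is a hypothetical helper, and it assumes the caller supplies whole-number CPU, RAM (GB) and disk (GB) figures for the candidate node or runner.

```shell
#!/bin/sh
# fits_heavy CPUS MEM_GB DISK_GB
# Succeeds only if all three meet the heavy-compute minimums quoted
# above (16 CPU, 64 GB RAM, 100 GB disk).
# Assumption: inputs are whole numbers already converted to CPUs and GB.
fits_heavy() {
  [ "$1" -ge 16 ] && [ "$2" -ge 64 ] && [ "$3" -ge 100 ]
}
```

For example, fits_heavy 8 32 100 fails for the existing 8c/32G VM runner, while fits_heavy 16 64 100 succeeds.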

Operating essentials

  • Forge VM: 204.168.234.72, SSH :222, key ~/.ssh/foundation-test_ed25519. Deploy: cd bootstrap && ./run.sh up. Passphrase: pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE. Forge admin: platform-admin / Vault foundation/forgejo/service-credentials:forgejoAdminPassword.
  • crunchy01: root@192.168.1.2 (operator key in root's authorized_keys) OR andiolsi+sudo. libvirt installed; pool images; libvirt-bridge-forward.timer active (kube-router-proof). Runner fleet: cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519; export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE); pulumi stack select crunchy.
  • CI image: foundation-ci:latest, built on the forge VM (/tmp/ci-image); rebuild on toolchain change.
  • Reuse mechanism: Forgejo 11 reusable workflows work, but the CALLING job needs runs-on + a SHORT cross-repo ref (see .forgejo/workflows/README.md). Composite actions need a FULL-URL ref.
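Because the runner-fleet commands depend on three exported variables, a guard along these lines could be run before any pulumi call. It is a sketch under the assumption that the variable names are exactly those listed in the crunchy01 bullet; check_runner_env is a hypothetical helper and never prints the values themselves.

```shell
#!/bin/sh
# check_runner_env: verify the variables the runner-fleet commands rely on
# are set and non-empty. Prints each missing name to stderr and returns 1
# if any is missing; returns 0 otherwise.
check_runner_env() {
  missing=0
  for name in RUNNER_SSH_KEY_PATH PULUMI_BACKEND_URL PULUMI_CONFIG_PASSPHRASE; do
    # Indirect lookup via eval; names come from the fixed list above.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use would be check_runner_env && pulumi stack select crunchy, so a half-configured shell fails before it touches the stack.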

Watchouts (HIGH-RISK)

  • crunchy01 is a k3s node — the physdev-is-bridged FORWARD accept is what lets VMs reach the LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (gitlab namespace runners, nominatim, flannel/cni0).
  • Never commit the passphrase / Vault root token / unseal keys. runners stack state lives only on the workstation (not backed up — a DR gap to address).
  • Stale offline crunchy-runner registration on the forge (from the retired hand-built VM) is harmless; deregister at leisure.
  • Don't pulumi up the prod olsicloud4-* stacks.
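When a runner goes dark, the first check named above (the physdev-is-bridged FORWARD accept) could be scripted roughly as follows. The sketch assumes the rule lives in the filter table's FORWARD chain and appears in iptables -S output with --physdev-is-bridged and -j ACCEPT on the same line; has_bridge_accept is a hypothetical helper that only reads rule text on stdin.

```shell
#!/bin/sh
# has_bridge_accept: read `iptables -S FORWARD` style output on stdin and
# succeed if some rule both matches bridged traffic and jumps to ACCEPT.
has_bridge_accept() {
  # -e lets grep take a pattern that starts with a dash.
  grep -e '--physdev-is-bridged' | grep -q -e '-j ACCEPT'
}
```

On crunchy01 (as root) this would be fed with iptables -S FORWARD | has_bridge_accept, followed by systemctl is-active libvirt-bridge-forward.timer if the rule turns out to be missing.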

Standing backlog (after the two asks above)

  • Package registry (Stage-2) — publish @olsitec pkgs so the C1/C5 docker candidates build.
  • T15 — index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
  • Hardening — pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade; back up the runners stack state.

Validate each task live and commit atomically per task.