Records the R5 fence work (build → harden → decoupled runners/ Pulumi stack → live cutover to foundation-runner-02 on crunchy01) and captures the operator's two new asks for the next session: a brix02 failover runner, and a k8s runner on crunchy's k3s for heavy (16 CPU / 64 GB) seaspots-s57-utils jobs. Refreshes HANDOVER to prioritize those + the standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HANDOVER — next-session prompt (paste into a fresh context)
Living doc: overwritten each handover. Durable record = the dated
SESSION_* files. Latest state = SESSION_2026-07-01_003.md (read it first, then #002 + #001).
Continue the olsitec-foundation build. You are the Lead Agent, HIGH-RISK / INFRA mode (remote VMs, k3s, Docker, secrets).
Required reads (in ~/work/olsitec-foundation/foundation/)
- documentation/sessions/SESSION_2026-07-01_003.md ← runner fleet + the NEW asks below
- documentation/sessions/SESSION_2026-07-01_002.md ← T14 + ecosystem CI · _001.md ← the egg
- documentation/contracts/CONTRACT_001–004 + decisions/ADR_004,005,006,007 (ADR-007 first)
- runners/README.md ← the decoupled runner-fleet stack (host prep, config, gotchas)
- .forgejo/workflows/README.md ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)
Where things stand (all green / live)
- The egg is LIVE (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI
  (reusable workflows + selftest) green; https://forge.olsitec.net = 200.
- The R5 fence is LIVE + codified. foundation-runner-02 (crunchy01 VM, 192.168.1.16, 8c/32G,
  label fenced) runs ecosystem/untrusted jobs OFF the forge VM. It's managed by the runners/
  Pulumi stack, an isolated project (bootstrap never imports it), so foundation deploy/refresh
  never touches crunchy01. Stack = crunchy; config + state are gitignored (operator workstation only).
- Foundation repo master clean, all pushed.
THIS session's work (operator asks, in priority order)
1. brix02 runner with failover from crunchy01
Add a runner on brix02 (192.168.1.3) that picks up jobs only when crunchy01 is
unavailable. Forgejo has no native standby — same-label runners load-balance; offline
ones get nothing. Choose with the operator:
- HA-on-outage (simple, recommended): register brix02 with the SAME fenced label → when
  crunchy is down brix02 covers; when both are up they share load.
- Strict standby (custom): brix02 runner kept STOPPED + a watchdog (systemd timer polling
  the Forgejo runners API) that starts it only when crunchy's runner is offline.
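The strict-standby watchdog comes down to a small decision rule. A minimal sketch, assuming the timer's script has already fetched the crunchy runner's status from the forge: the status strings and the function name standby_action are hypothetical, and how the status is read from the Forgejo runners API is deliberately left out.

```shell
#!/bin/sh
# Sketch only: maps the primary runner's reported status to an action for the
# standby runner. The status strings are ASSUMED, not taken from the Forgejo API.
standby_action() {
    case "$1" in
        offline)            echo start ;;  # primary is down: bring the standby up
        online|idle|active) echo stop  ;;  # primary is healthy: keep the standby stopped
        *)                  echo hold  ;;  # unknown or no answer: change nothing, avoid flapping
    esac
}
```

A timer-driven script could pass "start" or "stop" on to systemctl for whatever unit runs the brix02 runner, and do nothing on "hold". Holding on an unknown status is a design choice: if the forge itself is unreachable, neither runner gets jobs, so toggling the standby would only add noise.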
The runners/ stack is multi-host-capable: cd runners && pulumi stack init brix02 && pulumi config set host.address 192.168.1.3 && ... vm.name foundation-runner-03. FIRST verify brix02 has KVM + libvirt + a LAN bridge (same host prep as crunchy; see runners/README §Host prep. brix02 is also the Graylog target, so it's a real box). Then pulumi up and prove a fenced job runs on it (and, for standby, that it idles while crunchy is up).
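The "FIRST verify" step could start with a rough preflight run on brix02 itself. This is a sketch under assumptions, not a substitute for runners/README §Host prep: the function name preflight is hypothetical, and finding any bridge interface is necessary but not sufficient, since it does not prove the bridge is attached to the LAN.

```shell
#!/bin/sh
# Rough preflight for a candidate runner host. Prints one line per check and
# returns non-zero if any check fails. Optional $1 overrides the KVM device path.
preflight() {
    kvm_dev="${1:-/dev/kvm}"
    rc=0
    # KVM: the character device must exist for hardware-accelerated VMs.
    if [ -c "$kvm_dev" ]; then
        echo "ok   kvm device $kvm_dev"
    else
        echo "FAIL kvm device $kvm_dev missing"; rc=1
    fi
    # libvirt: only checks that the virsh client is installed.
    if command -v virsh >/dev/null 2>&1; then
        echo "ok   virsh found"
    else
        echo "FAIL virsh not installed"; rc=1
    fi
    # Bridge: any bridge interface counts here; verify LAN attachment by hand.
    if ip -br link show type bridge 2>/dev/null | grep -q .; then
        echo "ok   bridge interface present"
    else
        echo "FAIL no bridge interface found"; rc=1
    fi
    return "$rc"
}
```

Calling preflight with no argument checks /dev/kvm; a non-zero exit means at least one prerequisite is missing and pulumi up should wait.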
2. k8s runner for heavy (16CPU/64GB) jobs on crunchy's k3s
The seaspots GitLab pipelines (~/work/seaspots/gitlab/pipelines/.gitlab-ci.yml) run
seaspots-s57-utils (ogr2ogr + tippecanoe; registry.gitlab.com/seaspots/tools/seaspots-s57-utils:1.11.0)
with tags: [heavy-compute], needing 16+ CPU / 64+ GB RAM / 100+ GB disk; they'd crush the
8c/32G VM runner. Stand up a Forgejo runner inside crunchy's k3s cluster (k8s-scheduled
resources) with a distinct label (e.g. heavy), so runs-on: heavy jobs run there.
DESIGN TASK (not started): Forgejo act_runner executes via docker or host mode, with no
mature native k8s executor (unlike GitLab). Evaluate act_runner as a k8s Deployment with
big resource requests (host-mode or a DinD sidecar) vs. alternatives. Note: crunchy's k3s
already runs the GitLab runners (namespace gitlab); do not disturb them.
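Before designing the Deployment it may help to confirm that a k3s node can hold one heavy job at all. A minimal sketch, assuming allocatable CPU is given in whole cores and memory in KiB, and reading "64 GB" as 64 GiB; the function name fits_heavy is hypothetical.

```shell
#!/bin/sh
# fits_heavy CPU_CORES MEM_KIB
# Exit 0 if a node with that allocatable capacity can hold one heavy job
# (16 CPU / 64 GiB, per the requirement above). ASSUMES integer inputs.
fits_heavy() {
    need_cpu=16
    need_mem_kib=$((64 * 1024 * 1024))   # 64 GiB expressed in KiB
    [ "$1" -ge "$need_cpu" ] && [ "$2" -ge "$need_mem_kib" ]
}
```

The inputs can be read from kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory. Kubernetes may report CPU in millicores (a trailing m) and memory with a Ki suffix, so the values need normalizing before they are passed in. The 100+ GB disk requirement is not covered by this check.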
Operating essentials
- Forge VM: 204.168.234.72, SSH :222, key ~/.ssh/foundation-test_ed25519. Deploy: cd bootstrap && ./run.sh up. Passphrase: pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE. Forge admin: platform-admin / Vault foundation/forgejo/service-credentials:forgejoAdminPassword.
- crunchy01: root@192.168.1.2 (operator key in root's authorized_keys) OR andiolsi + sudo. libvirt installed; pool images; libvirt-bridge-forward.timer active (kube-router-proof). Runner fleet: cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519; export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE); pulumi stack select crunchy.
- CI image: foundation-ci:latest, built on the forge VM (/tmp/ci-image); rebuild on toolchain change.
- Reuse mechanism: Forgejo 11 reusable workflows work, but the CALLING job needs runs-on + a SHORT cross-repo ref (.forgejo/workflows/README.md). Composite actions need a FULL-URL ref.
Watchouts (HIGH-RISK)
- crunchy01 is a k3s node: the physdev-is-bridged FORWARD accept is what lets VMs reach the LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (gitlab namespace runners, nominatim, flannel/cni0).
- Never commit the passphrase / Vault root token / unseal keys. runners stack state lives only on the workstation (not backed up; a DR gap to address).
- Stale offline crunchy-runner registration on the forge (from the retired hand-built VM): harmless; deregister at leisure. Don't pulumi up the prod olsicloud4-* stacks.
Standing backlog (after the two asks above)
- Package registry (Stage-2): publish @olsitec pkgs so the C1/C5 docker candidates build.
- T15: index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
- Hardening: pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade;
  back up the runners stack state.
Validate each task live and commit atomically per task.