foundation/documentation/sessions/SESSION_2026-07-01_003.md
Andreas Niemann 1bba311d60
docs(session): SESSION_003 (fenced runner fleet) + handover for next agent
Records the R5 fence work (build → harden → decoupled runners/ Pulumi stack →
live cutover to foundation-runner-02 on crunchy01) and captures the operator's
two new asks for the next session: a brix02 failover runner, and a k8s runner on
crunchy's k3s for heavy (16CPU/64GB) seaspots-s57-utils jobs. Refreshes HANDOVER
to prioritize those + the standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:39:20 +02:00


Session 2026-07-01 #003 — the fenced runner fleet (R5): build → harden → codify → cutover

What was done

Continued from #002 (T14 + ecosystem CI done). Built the R5 fence the operator had deferred: a Forgejo Actions runner on a separate VM on separate hardware (crunchy01), so ecosystem/untrusted jobs (runs-on: fenced) run OFF the forge VM. Then hardened it, formalized it as a decoupled Pulumi stack, and did the live cutover.

1. Fenced runner — built by hand, then proven

  • crunchy01 (192.168.1.2, 16c/128G, Debian 13, k3s node running the GitLab runners + nominatim) had /dev/kvm + passwordless sudo but no libvirt. Installed qemu-kvm/libvirt/virtinst/cloud-image-utils.
  • Created an Ubuntu 24.04 VM on the LAN bridge br0, docker inside, registered a Forgejo runner (label fenced) against https://forge.olsitec.net.
  • Proven: a runs-on: fenced job ran with kernel 6.8.0 (Ubuntu VM) + egress 62.176.248.112 (site IP), vs the Hetzner forge VM's 6.1.0 / 204.168.234.72 — i.e. it executed on crunchy, isolated from the forge.
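The proof above boils down to comparing two fingerprints a job can print about itself. A minimal sketch of that check, assuming the job reports its kernel release and egress IP as strings; the type and function names are hypothetical, only the values come from this session:

```typescript
// Fingerprints observed in this session. Names are hypothetical helpers,
// not part of the runners/ project.
interface HostFingerprint {
  kernelPrefix: string; // leading part of the kernel release string
  egressIp: string;     // public IP seen from outside
}

const FENCED_VM: HostFingerprint = { kernelPrefix: "6.8.0", egressIp: "62.176.248.112" };
const FORGE_VM: HostFingerprint = { kernelPrefix: "6.1.0", egressIp: "204.168.234.72" };

type Placement = "fenced" | "forge" | "unknown";

// Classify where a job ran. Both signals must agree, so a kernel upgrade or
// an IP change yields "unknown" instead of a false positive.
function classifyJobHost(kernelRelease: string, egressIp: string): Placement {
  const matches = (fp: HostFingerprint): boolean =>
    kernelRelease.startsWith(fp.kernelPrefix) && egressIp === fp.egressIp;
  if (matches(FENCED_VM)) return "fenced";
  if (matches(FORGE_VM)) return "forge";
  return "unknown";
}
```

Requiring both signals is a deliberate choice: either one alone could drift (kernel updates, ISP renumbering) and silently pass.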

2. Hardening

  • kube-router-proof firewall. crunchy01's k3s/kube-router sets FORWARD policy DROP + br_netfilter=1, which drops bridged VM↔LAN traffic (incl. the runner→forge poll). Fix = iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT, re-asserted by a 60s systemd timer (libvirt-bridge-forward.timer) because kube-router flushes iptables on resync (a boot-only unit isn't enough).
  • VM autostart; rotated the throwaway root console password; PTY console so virsh console works.
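For reference, the timer-based re-assertion described above could look roughly like the following. This is a sketch under assumptions, not the units actually installed on crunchy01: only the rule and the timer name come from this session, while the unit contents, the `iptables -C` check-before-insert guard, and the function name are illustrative.

```typescript
// The rule from the session notes, without the -I / -C verb.
const BRIDGE_RULE = "FORWARD -m physdev --physdev-is-bridged -j ACCEPT";

// Render a oneshot service plus a repeating timer as unit-file text.
// Assumption: the service checks for the rule first (iptables -C) and only
// inserts it when missing, so repeated runs do not stack duplicate rules.
function renderBridgeForwardUnits(intervalSeconds: number): { service: string; timer: string } {
  if (!Number.isInteger(intervalSeconds) || intervalSeconds <= 0) {
    throw new Error("intervalSeconds must be a positive integer");
  }
  const service = [
    "[Unit]",
    "Description=Re-assert ACCEPT for bridged traffic (kube-router flushes it)",
    "",
    "[Service]",
    "Type=oneshot",
    `ExecStart=/bin/sh -c 'iptables -C ${BRIDGE_RULE} 2>/dev/null || iptables -I ${BRIDGE_RULE}'`,
    "",
  ].join("\n");
  const timer = [
    "[Unit]",
    "Description=Periodically re-assert the bridged-forward rule",
    "",
    "[Timer]",
    "OnBootSec=30s",
    `OnUnitActiveSec=${intervalSeconds}s`,
    "AccuracySec=1s", // the systemd default of 1min would blur a 60s interval
    "",
    "[Install]",
    "WantedBy=timers.target",
    "",
  ].join("\n");
  return { service, timer };
}

// Usage: text for libvirt-bridge-forward.service / libvirt-bridge-forward.timer
const units = renderBridgeForwardUnits(60);
```

A timer named libvirt-bridge-forward.timer activates the service of the same base name by default, so no explicit Unit= line is needed.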

3. Formalized as a DECOUPLED Pulumi stack — runners/

New isolated project (peer to bootstrap/, provision/, offsite-backup/). Why decoupled (operator's explicit concern, and the answer to "is this a Pulumi problem like it was with Terraform"): a @pulumi/libvirt provider dials the runner host on every up/refresh, so putting it in bootstrap would make the foundation undeployable/unrefreshable whenever crunchy01 is down/unreachable. Pulumi isolates this at the stack boundary: bootstrap never imports runners/. One-way dependency: runners mints a token FROM the forge, so it's "step-0 after the foundation stands".
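To make the coupling concrete: the thing the provider dials is a libvirt connection URI derived from per-stack config. A minimal sketch, assuming the stack config carries a host address, an SSH user, and a VM name; the interface and function are hypothetical and do not reproduce the actual runners/index.ts:

```typescript
// Hypothetical shape of one runners/ stack's config. Field names mirror the
// host.address and vm.name keys mentioned in these notes; sshUser is assumed
// (host prep in this session set up root SSH).
interface RunnerStackConfig {
  hostAddress: string;
  sshUser: string;
  vmName: string;
}

// Build the libvirt-over-SSH URI a provider would dial on every up/refresh.
// Because this lives only in the runners/ stack, an unreachable host blocks
// that stack alone and never the foundation stacks.
function libvirtUri(cfg: RunnerStackConfig): string {
  if (!/^[A-Za-z0-9.-]+$/.test(cfg.hostAddress)) {
    throw new Error(`suspicious host address: ${cfg.hostAddress}`);
  }
  return `qemu+ssh://${cfg.sshUser}@${cfg.hostAddress}/system`;
}

// One config object per stack; each stack dials only its own host.
const crunchy: RunnerStackConfig = {
  hostAddress: "192.168.1.2",
  sshUser: "root",
  vmName: "foundation-runner-02",
};
const crunchyUri = libvirtUri(crunchy);
```

The same function serves any additional host by adding another stack with its own config, which is what keeps the project multi-host-capable without touching bootstrap.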

4. Live pulumi up cutover — DONE

Ran the crunchy stack live: created foundation-runner-02 (static 192.168.1.16, 8c/32G), registered the fenced runner, and a runs-on: fenced job ran GREEN on it. Then retired the hand-built VM (foundation-runner-01), so the Pulumi-managed runner-02 is the sole fenced runner. Bugs the live run surfaced (all fixed): NIC name isn't enp1s0 → match e*; drop qemuAgent:true (blocks on the agent at create); dialErrorLimit:30 for boot; fix the register token passing; host prep = root SSH + the images pool (crunchy has no default pool).
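The NIC-name fix can be illustrated with a cloud-init network config (version 2) that matches the interface by glob instead of hard-coding enp1s0. This is a sketch, not the config the stack actually renders: the /24 prefix, the gateway, and the DNS server are assumptions, and only the static address and the e* match come from this session.

```typescript
// Inputs for a single statically addressed NIC. Everything except the
// address is an assumed placeholder value.
interface StaticNic {
  address: string;   // e.g. 192.168.1.16 (from the session notes)
  prefix: number;    // assumed: 24
  gateway: string;   // assumed: 192.168.1.1
  dns: string[];     // assumed
}

// Render a version-2 network config. `match.name: "e*"` selects the NIC by
// glob, so the config keeps working whatever enpXsY name the guest assigns.
function renderNetworkConfig(nic: StaticNic): string {
  if (nic.prefix < 0 || nic.prefix > 32) throw new Error("prefix out of range");
  return [
    "version: 2",
    "ethernets:",
    "  lan0:",
    "    match:",
    '      name: "e*"',
    "    dhcp4: false",
    "    addresses:",
    `      - ${nic.address}/${nic.prefix}`,
    "    routes:",
    "      - to: default",
    `        via: ${nic.gateway}`,
    "    nameservers:",
    `      addresses: [${nic.dns.join(", ")}]`,
    "",
  ].join("\n");
}

const runner02Net = renderNetworkConfig({
  address: "192.168.1.16",
  prefix: 24,
  gateway: "192.168.1.1",
  dns: ["192.168.1.1"],
});
```

The glob assumes the VM has exactly one Ethernet-style interface; with several, a MAC-address match would be the safer selector.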

Current state

  • foundation-runner-02 live on crunchy01 (192.168.1.16, 8c/32G), Pulumi-managed, label fenced, executing jobs. Its runner is a docker container in the VM.
  • runners/ project committed to the foundation repo (index.ts, README, Pulumi.yaml, package.json). runners/Pulumi.crunchy.yaml + runners/state/ are gitignored (local to the operator workstation only — see Open threads: not backed up).
  • Access: crunchy01 root@192.168.1.2 (operator key ~/.ssh/foundation-test_ed25519 now in root's authorized_keys; also andiolsi + sudo). libvirt installed; images pool; libvirt-bridge-forward.timer active. Forge VM root@204.168.234.72:222.
  • Foundation repo master clean, all pushed. Forge admin platform-admin / Vault foundation/forgejo/service-credentials:forgejoAdminPassword.

NEW requirements from the operator (this session) — for the next agent

  1. brix02 (192.168.1.3) runner with failover from crunchy01. Only when crunchy01 is unavailable should brix02 pick up jobs. Forgejo has no native standby: same-label runners load-balance, offline ones just get nothing. Two paths:
    • HA-on-outage (simple): register brix02 with the SAME fenced label — when crunchy is down, brix02 covers; when both up, they share load.
    • Strict standby (custom): keep brix02's runner STOPPED + a watchdog (systemd timer polling the Forgejo runners API) that starts it only when crunchy's runner is offline.
     Either way, the runners/ stack is already multi-host-capable via config: target brix02 with a second stack (pulumi stack init brix02, config set host.address 192.168.1.3, vm.name foundation-runner-03). First verify brix02 has KVM + libvirt + a bridge (same host prep as crunchy; brix02 is also the Graylog target, see memory).
  2. k8s runner for heavy jobs. The seaspots GitLab pipelines (~/work/seaspots/gitlab/pipelines, .gitlab-ci.yml) run seaspots-s57-utils (registry.gitlab.com/seaspots/tools/seaspots-s57-utils:1.11.0: GDAL/ogr2ogr + tippecanoe) — tags: [heavy-compute], needs 16+ CPU / 64+ GB RAM / 100+ GB disk. These would crush the 8c/32G VM runner. The operator wants such heavy/containerized jobs to run on a Forgejo runner inside crunchy's k3s cluster (k8s-scheduled resources), with a distinct label (e.g. heavy/k8s). Design note: Forgejo act_runner executes jobs via docker or host mode — it has no mature native k8s executor like GitLab's. The next agent must evaluate: act_runner as a k8s Deployment (big resource requests) using host-mode or a DinD sidecar, vs. another approach. This is a design task, not yet started.
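For the strict-standby option in item 1, the watchdog's core is a small decision function. A minimal sketch, assuming the poll yields a list of runner names with an online/offline status; that response shape and all names below are hypothetical and should be checked against the real Forgejo runners API before use:

```typescript
// Hypothetical, simplified view of what a runners poll returns.
type RunnerStatus = "online" | "offline";
interface RunnerInfo {
  name: string;
  status: RunnerStatus;
}

type StandbyAction = "start" | "stop" | "none";

// Decide what to do with the standby runner on each timer tick.
// A primary that is missing from the list counts as offline.
function decideStandbyAction(
  runners: RunnerInfo[],
  primaryName: string,
  standbyRunning: boolean,
): StandbyAction {
  const primary = runners.find((r) => r.name === primaryName);
  const primaryOnline = primary !== undefined && primary.status === "online";
  if (!primaryOnline && !standbyRunning) return "start"; // cover the outage
  if (primaryOnline && standbyRunning) return "stop";    // hand back to primary
  return "none";
}
```

Matching on the exact primary name matters here, because stale registrations of retired runners stay listed as offline and must not be mistaken for the primary. A production watchdog would probably also want a few consecutive offline readings before starting the standby, to avoid flapping on a single missed poll.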

Open threads / backlog (from #002 + this session)

  • runners stack state not backed up — only on the operator workstation (runners/state/, gitignored). A DR gap; consider backing it up like bootstrap's.
  • Stale crunchy-runner registration on the forge (from the retired hand-built VM) — offline, harmless; deregister at leisure (Forgejo runners admin API/UI).
  • Package registry (Stage-2) — publish @olsitec packages (olsicrypto, svelte-common) so the two registry-blocked 999_testing candidates (seaspots-homepage, token-service) build via reusable-docker-build.
  • T15: index.ts phase marker still T10-runner; Gate A/B comments; DAY-ZERO-TIMELINE.
  • Hardening — pin floating image refs; pre-bake pulumi plugins into foundation-ci; MCP registration (D6); Forgejo v15 upgrade drops the reusable-workflow runs-on quirk.

Operating mode for next session: HIGH-RISK / INFRA (remote VMs, k3s, Docker, secrets).