foundation/documentation/sessions/SESSION_2026-07-01_003.md

# Session 2026-07-01 #003 — the fenced runner fleet (R5): build → harden → codify → cutover

## What was done
Continued from #002 (T14 + ecosystem CI done). Built the **R5 fence** the operator had
deferred: a Forgejo Actions runner on a **separate VM on separate hardware** (crunchy01),
so ecosystem/untrusted jobs (`runs-on: fenced`) run OFF the forge VM. Then hardened it,
formalized it as a **decoupled Pulumi stack**, and did the **live cutover**.

### 1. Fenced runner — built by hand, then proven
- crunchy01 (`192.168.1.2`, 16c/128G, Debian 13, **k3s node** running the GitLab
  runners + `nominatim`) had `/dev/kvm` + passwordless sudo but no libvirt. Installed
  qemu-kvm/libvirt/virtinst/cloud-image-utils.
- Created an **Ubuntu 24.04** VM on the LAN bridge `br0`, docker inside, registered a
  Forgejo runner (label `fenced`) against `https://forge.olsitec.net`.
- **Proven**: a `runs-on: fenced` job ran with kernel `6.8.0` (Ubuntu VM) + egress
  `62.176.248.112` (site IP), vs the Hetzner forge VM's `6.1.0` / `204.168.234.72` —
  i.e. it executed on crunchy, isolated from the forge.

### 2. Hardening
- **kube-router-proof firewall.** crunchy01's k3s/kube-router sets `FORWARD policy DROP`
  + `br_netfilter=1`, which drops bridged VM↔LAN traffic (incl. the runner→forge poll).
  Fix = `iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT`, re-asserted by a
  **60s systemd timer** (`libvirt-bridge-forward.timer`) because kube-router flushes
  iptables on resync (a boot-only unit isn't enough).
- VM autostart; rotated the throwaway root console password; PTY console so
  `virsh console` works.
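
The re-assert pattern above can be sketched as a pair of generated systemd units. This is a minimal sketch, not the units deployed on crunchy01: only the iptables rule, the 60s interval, and the timer name `libvirt-bridge-forward.timer` come from this session. The `-C` check-before-insert guard, the `/usr/sbin/iptables` path, the `OnBootSec` value, the unit descriptions, and all function names are assumptions.

```typescript
// Sketch only: renders a oneshot service plus a 60s timer that re-asserts
// the bridged-traffic ACCEPT rule. Unit contents are assumptions, not the
// files actually installed on crunchy01.

const RULE = "FORWARD -m physdev --physdev-is-bridged -j ACCEPT";

// `iptables -C` exits non-zero when the rule is absent, so the insert only
// runs after kube-router has flushed it. This avoids stacking duplicates.
function reassertCommand(iptables = "/usr/sbin/iptables"): string {
  return `${iptables} -C ${RULE} || ${iptables} -I ${RULE}`;
}

// Would be written to libvirt-bridge-forward.service (name inferred from
// the timer name; a timer activates the same-named service by default).
function renderService(): string {
  return [
    "[Unit]",
    "Description=Re-assert ACCEPT for bridged VM traffic",
    "",
    "[Service]",
    "Type=oneshot",
    `ExecStart=/bin/sh -c '${reassertCommand()}'`,
    "",
  ].join("\n");
}

// Would be written to libvirt-bridge-forward.timer.
function renderTimer(intervalSeconds = 60): string {
  return [
    "[Unit]",
    "Description=Periodic re-assert of the libvirt bridge FORWARD rule",
    "",
    "[Timer]",
    "OnBootSec=15s", // assumed boot delay
    `OnUnitActiveSec=${intervalSeconds}s`,
    "",
    "[Install]",
    "WantedBy=timers.target",
    "",
  ].join("\n");
}
```

Checking with `-C` before `-I` keeps the chain at one copy of the rule no matter how often the timer fires, which matters for a unit that runs every minute.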

### 3. Formalized as a DECOUPLED Pulumi stack — `runners/`
New isolated project (peer to `bootstrap/`, `provision/`, `offsite-backup/`).
**Why decoupled** (operator's explicit concern, and the answer to "is this a Pulumi
problem like it was with Terraform"): a `@pulumi/libvirt` provider dials the runner host
on every up/refresh, so putting it in `bootstrap` would make the **foundation**
undeployable/unrefreshable whenever crunchy01 is down/unreachable. Pulumi isolates this
at the **stack boundary** — `bootstrap` never imports `runners/`. One-way dependency:
`runners` mints a token FROM the forge, so it's "step-0 after the foundation stands".

### 4. Live `pulumi up` cutover — DONE
Ran the `crunchy` stack live: created **`foundation-runner-02`** (static `192.168.1.16`,
**8c/32G**), registered the `fenced` runner, and a `runs-on: fenced` job ran GREEN on it.
Then **retired the hand-built VM** (`foundation-runner-01`), so the Pulumi-managed
runner-02 is the sole fenced runner. Bugs the live run surfaced (all fixed):
- NIC name isn't `enp1s0` → match `e*`.
- Drop `qemuAgent:true` (it blocks on the agent at create).
- `dialErrorLimit:30` for boot.
- Fix the register token passing.
- Host prep = root SSH + the `images` pool (crunchy has no `default` pool).
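
The NIC-name fix can be illustrated with a network config that matches on a glob instead of a fixed interface name. This is a sketch under assumptions, not the contents of `runners/index.ts`: it assumes the VM takes a cloud-init network-config (version 2) document, and the `/24` prefix, the gateway, the DNS server, the `lan0` ID, and the function name are all hypothetical. Only the static address `192.168.1.16` and the `e*` match come from this session.

```typescript
// Sketch only: renders a cloud-init network-config (v2) document that pins
// a static address on whichever NIC matches "e*", rather than assuming the
// interface is called enp1s0.

interface StaticNic {
  address: string; // e.g. "192.168.1.16" (from this session)
  prefix: number;  // assumed: 24
  gateway: string; // hypothetical value, not recorded in this log
  dns: string[];   // hypothetical value, not recorded in this log
}

function renderNetworkConfig(nic: StaticNic): string {
  return [
    "network:",
    "  version: 2",
    "  ethernets:",
    "    lan0:", // arbitrary ID; the match block selects the device
    "      match:",
    '        name: "e*"',
    "      dhcp4: false",
    "      addresses:",
    `        - ${nic.address}/${nic.prefix}`,
    "      routes:",
    "        - to: default",
    `          via: ${nic.gateway}`,
    "      nameservers:",
    `        addresses: [${nic.dns.join(", ")}]`,
    "",
  ].join("\n");
}

// Example with placeholder gateway/DNS values.
const exampleConfig = renderNetworkConfig({
  address: "192.168.1.16",
  prefix: 24,
  gateway: "192.168.1.1",
  dns: ["192.168.1.1"],
});
```

A glob match is only safe while the VM has a single NIC. With more than one interface matching `e*`, matching on the MAC address would be the less ambiguous choice.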

## Current state
- **`foundation-runner-02`** live on crunchy01 (`192.168.1.16`, 8c/32G), Pulumi-managed,
  label `fenced`, executing jobs. Its runner is a docker container in the VM.
- **`runners/`** project committed to the foundation repo (index.ts, README, Pulumi.yaml,
  package.json). **`runners/Pulumi.crunchy.yaml` + `runners/state/` are gitignored**
  (local to the operator workstation only — see Open threads: not backed up).
- Access: crunchy01 `root@192.168.1.2` (operator key `~/.ssh/foundation-test_ed25519`
  now in root's authorized_keys; also `andiolsi` + sudo). libvirt installed; `images`
  pool; `libvirt-bridge-forward.timer` active. Forge VM `root@204.168.234.72:222`.
- Foundation repo `master` clean, all pushed. Forge admin `platform-admin` / Vault
  `foundation/forgejo/service-credentials:forgejoAdminPassword`.

## NEW requirements from the operator (this session) — for the next agent
1. **brix02 (`192.168.1.3`) runner with failover from crunchy01.** Only when crunchy01
   is unavailable should brix02 pick up jobs. **Forgejo has no native standby**: same-label
   runners load-balance, offline ones just get nothing. Two paths:
   - *HA-on-outage (simple):* register brix02 with the SAME `fenced` label — when crunchy
     is down, brix02 covers; when both up, they share load.
   - *Strict standby (custom):* keep brix02's runner STOPPED + a watchdog (systemd timer
     polling the Forgejo runners API) that starts it only when crunchy's runner is offline.
   The `runners/` stack is already multi-host-capable via config — target brix02 with a
   second stack (`pulumi stack init brix02`, `config set host.address 192.168.1.3`,
   `vm.name foundation-runner-03`). **First verify brix02 has KVM + libvirt + a bridge**
   (same host prep as crunchy; brix02 is also the Graylog target — see memory).
2. **k8s runner for heavy jobs.** The seaspots GitLab pipelines
   (`~/work/seaspots/gitlab/pipelines`, `.gitlab-ci.yml`) run **seaspots-s57-utils**
   (`registry.gitlab.com/seaspots/tools/seaspots-s57-utils:1.11.0`: GDAL/ogr2ogr +
   tippecanoe) — `tags: [heavy-compute]`, needs **16+ CPU / 64+ GB RAM / 100+ GB disk**.
   These would crush the 8c/32G VM runner. The operator wants such heavy/containerized
   jobs to run on a **Forgejo runner inside crunchy's k3s cluster** (k8s-scheduled
   resources), with a distinct label (e.g. `heavy`/`k8s`). Design note: Forgejo
   `act_runner` executes jobs via **docker** or **host** mode — it has no mature native
   k8s executor like GitLab's. The next agent must evaluate: act_runner as a k8s
   Deployment (big resource requests) using host-mode or a DinD sidecar, vs. another
   approach. This is a design task, not yet started.
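
For the strict-standby path in item 1, the watchdog's decision logic can be sketched as a pure function that is independent of how the runner list is fetched. Everything here is hypothetical: the `RunnerInfo` shape is not the Forgejo API's response format, the primary's registered name is a placeholder, and the three-miss threshold is an assumed way to avoid flapping on a single missed poll.

```typescript
// Sketch only: decides when the standby runner on brix02 should start/stop.
// The RunnerInfo shape is an assumption; map the real API response onto it.

interface RunnerInfo {
  name: string;
  labels: string[];
  online: boolean;
}

type Action = "start" | "stop" | "none";

interface WatchdogState {
  misses: number;          // consecutive polls with the primary offline
  standbyRunning: boolean; // whether the watchdog has started the standby
}

// The primary counts as online only if it is listed, carries the label,
// and reports online.
function isPrimaryOnline(
  runners: RunnerInfo[],
  primaryName: string,
  label = "fenced",
): boolean {
  return runners.some(
    (r) => r.name === primaryName && r.labels.includes(label) && r.online,
  );
}

// One watchdog tick: start the standby after `threshold` consecutive
// misses, stop it again as soon as the primary is back.
function step(
  prev: WatchdogState,
  primaryOnline: boolean,
  threshold = 3,
): { next: WatchdogState; action: Action } {
  if (primaryOnline) {
    const action: Action = prev.standbyRunning ? "stop" : "none";
    return { next: { misses: 0, standbyRunning: false }, action };
  }
  const misses = prev.misses + 1;
  if (!prev.standbyRunning && misses >= threshold) {
    return { next: { misses, standbyRunning: true }, action: "start" };
  }
  return {
    next: { misses, standbyRunning: prev.standbyRunning },
    action: "none",
  };
}
```

Two design points are left open by this sketch. A failed poll (forge unreachable) should probably skip the tick rather than count as a miss, since a standby cannot fetch jobs from an unreachable forge anyway. And a `stop` issued while the standby is mid-job would kill that job, so a real watchdog would need to drain first.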

## Open threads / backlog (from #002 + this session)
- **`runners` stack state not backed up** — only on the operator workstation
  (`runners/state/`, gitignored). A DR gap; consider backing it up like `bootstrap`'s.
- **Stale `crunchy-runner` registration** on the forge (from the retired hand-built VM) —
  offline, harmless; deregister at leisure (Forgejo runners admin API/UI).
- **Package registry (Stage-2)** — publish `@olsitec` packages (olsicrypto, svelte-common)
  so the two registry-blocked 999_testing candidates (seaspots-homepage, token-service)
  build via `reusable-docker-build`.
- **T15** — `index.ts` phase marker still `T10-runner`; Gate A/B comments; DAY-ZERO-TIMELINE.
- **Hardening** — pin floating image refs; pre-bake pulumi plugins into foundation-ci;
  MCP registration (D6); Forgejo v15 upgrade drops the reusable-workflow `runs-on` quirk.

## Operating mode for next session: HIGH-RISK / INFRA (remote VMs, k3s, Docker, secrets).

docs(session): SESSION_003 (fenced runner fleet) + handover for next agent

Records the R5 fence work (build → harden → decoupled runners/ Pulumi stack → live
cutover to foundation-runner-02 on crunchy01) and captures the operator's two new asks
for the next session: a brix02 failover runner, and a k8s runner on crunchy's k3s for
heavy (16CPU/64GB) seaspots-s57-utils jobs. Refreshes HANDOVER to prioritize those + the
standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:39:20 +02:00
