foundation/runners/README.md
Andreas Niemann cfa71847ba
feat(runners): decoupled Pulumi stack for the fenced runner fleet (R5)
A separate, isolated Pulumi project (peer to bootstrap/provision/offsite-backup)
that provisions runner VM(s) on a libvirt host and registers Forgejo Actions
runners with a distinct `fenced` label — so ecosystem/untrusted jobs run OFF the
forge VM.

Decoupled ON PURPOSE: a @pulumi/libvirt provider dials the runner host on every
up/refresh, so keeping it in `bootstrap` would make the foundation undeployable/
unrefreshable whenever the host (crunchy01) is down or unreachable (the Terraform
coupling trap). As its own stack, bootstrap never imports it — foundation ops
never touch crunchy01, and this stack's health is independent. One-way dependency:
it mints a runner token FROM the forge, i.e. runs after the foundation stands.

Codifies what was built + hardened by hand this session (runners/README.md):
Ubuntu VM on the LAN bridge (docker + qemu-guest-agent via cloud-init), the
kube-router-proof FORWARD timer, and runner registration. Typechecked; the live
`pulumi up` cutover from the hand-built VM is the remaining validation step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:15:39 +02:00

foundation-runners — the fenced Actions runner fleet (isolated stack)

Step 0 after the foundation stands. A separate Pulumi project/stack that provisions runner VM(s) on a libvirt host (crunchy01) and registers Forgejo Actions runners with a distinct label (fenced), so that ecosystem/untrusted jobs (runs-on: fenced) execute off the forge VM. This is the R5 fence.

Why a separate stack (decoupling)

A @pulumi/libvirt provider dials the runner host on every up/refresh/preview of the stack it lives in. If the runner VM lived in bootstrap, then crunchy01 being down, or being inaccessible to you, would break pulumi refresh/up of the foundation itself (the classic Terraform coupling trap). Pulumi isolates this at the stack boundary: a provider only initializes when its own stack runs. So the fleet is its own project, and bootstrap never imports it. Consequences:

  • Foundation deploy/refresh never touches crunchy01.
  • crunchy01 down ⇒ only this stack's up/refresh/preview is affected, and only when you run it.
  • One-way dependency: this stack mints a runner token from the forge, so it runs after the foundation is up.
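
Because only this stack ever dials the host, a reachability preflight belongs here and nowhere in the foundation workflow. The sketch below is one way to fail fast before pulumi up. It assumes the provider reaches the host over SSH on port 22 (suggested by RUNNER_SSH_KEY_PATH, but not confirmed by the code), and host_reachable is a hypothetical helper, not part of the stack.

```shell
#!/bin/sh
# Hypothetical preflight for the runners stack only.
# Assumption: the libvirt host is reached over SSH (TCP port 22 by default).
host_reachable() {
  host="$1"
  port="${2:-22}"
  # bash's /dev/tcp pseudo-path opens a TCP connection;
  # timeout bounds the attempt to 3 seconds.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

Usage, with the host address from the Deploy section: host_reachable 192.168.1.2 && (cd runners && pulumi up). Foundation commands need no such guard, since bootstrap never loads the libvirt provider.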

Host prep (one-time, kept OUT of this stack)

The libvirt provider needs something to connect to, so install libvirt on the host out-of-band (not via this stack), and ensure a LAN bridge exists:

sudo apt-get update
sudo apt-get install -y qemu-kvm libvirt-daemon-system libvirt-clients \
  bridge-utils dnsmasq qemu-utils virtinst cloud-image-utils
sudo systemctl enable --now libvirtd
# a LAN bridge (br0) enslaving the physical NIC must already exist (crunchy01 had it).
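
A quick sanity check after host prep could look like the sketch below. It assumes the bridge is named br0 as in the comment above. check_host_prep and the BRIDGE variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report which host prerequisites are missing.
# Assumption: the LAN bridge is called br0 unless BRIDGE is set.
check_host_prep() {
  bridge="${BRIDGE:-br0}"
  missing=""
  # Is the libvirt client tooling installed?
  command -v virsh >/dev/null 2>&1 || missing="$missing virsh"
  # /sys/class/net/<if>/bridge exists only for bridge interfaces.
  [ -d "/sys/class/net/$bridge/bridge" ] || missing="$missing $bridge"
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

Run check_host_prep on the host before the first pulumi up. A non-zero exit lists what still needs installing or creating.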

Deploy

export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519   # reaches host + VM (root)
cd runners
pulumi stack init crunchy         # isolated file backend, like bootstrap/provision
pulumi config set host.address 192.168.1.2
pulumi config set forge.address 204.168.234.72
pulumi up

pulumi up will: apply the kube-router-proof FORWARD timer on the host, create an Ubuntu VM on br0 (docker + qemu-guest-agent via cloud-init), mint a runner token from the forge, and register + run the fenced runner in the VM. Verify with a runs-on: fenced job on any repo.
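
For the verification step, a minimal smoke-test workflow is enough. The sketch below writes one into a repo checkout. The file name fenced-smoke.yml, the job name, and the helper write_fenced_smoke_workflow are illustrative assumptions; only the fenced label comes from this stack.

```shell
#!/bin/sh
# Hypothetical helper: drop a smoke-test workflow into a repo checkout.
# Forgejo Actions picks up workflows from .forgejo/workflows.
write_fenced_smoke_workflow() {
  repo_dir="$1"
  mkdir -p "$repo_dir/.forgejo/workflows"
  cat > "$repo_dir/.forgejo/workflows/fenced-smoke.yml" <<'EOF'
name: fenced-smoke
on: [push]
jobs:
  smoke:
    runs-on: fenced
    steps:
      - run: uname -a && echo "ran on the fenced runner"
EOF
}
```

Commit and push the generated file. If the job is picked up and turns green, the runner registered with the fenced label and can reach the forge.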

Cutover note. The first fenced runner was built by hand (SESSION_2026-07-01_003). A pulumi up here creates a fresh, declaratively managed VM; at cutover, either retire the hand-built foundation-runner-01 (virsh destroy, then virsh undefine) or set vm.name to a new name to run both side by side. This code is committed and typechecked; the live pulumi up cutover is the remaining validation step.
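
Retiring the hand-built VM can be sketched as follows. retire_vm and the DRY_RUN switch are hypothetical. The sketch defaults to printing the commands so nothing is destroyed by accident.

```shell
#!/bin/sh
# Hypothetical helper: retire a libvirt domain at cutover.
# DRY_RUN defaults to 1, which only prints what would be run.
retire_vm() {
  vm="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: virsh destroy $vm"
    echo "would run: virsh undefine $vm"
    return 0
  fi
  # destroy is a hard power-off; it errors if the domain is already
  # shut off, which is acceptable here.
  virsh destroy "$vm" || true
  # undefine removes the domain definition; disk images stay in place.
  virsh undefine "$vm"
}
```

Run retire_vm foundation-runner-01 to preview, then DRY_RUN=0 retire_vm foundation-runner-01 on the host once the new runner has passed a fenced job.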

Gotchas baked into the code (learned the hard way)

  • k3s host firewall. crunchy01 is a k3s node; kube-router sets the FORWARD policy to DROP and enables br_netfilter, so bridged VM↔LAN traffic is pushed through iptables and dropped. Fix: iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT, re-asserted by a 60s systemd timer (kube-router flushes iptables on resync, so a boot-only rule isn't enough).
  • Ubuntu, not Debian genericcloud. On the Debian image, cloud-init wrote a netplan config that the image never applied, so the VM got no IPv4 address (static or DHCP). Ubuntu 24.04 renders and applies it cleanly.
  • PTY console. The domain declares a pty serial console so virsh console <vm> works. (Don't back serial with a file — you lose interactive console.)
  • Docker socket gid. act_runner runs as uid 1000; the daemon container gets --group-add <docker gid> so it can reach /var/run/docker.sock.
  • IP is optional. The runner polls the forge outbound, so a fixed LAN IP isn't required — set vm.ipCidr empty for DHCP. Default is a static .15 for predictability.
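
The firewall re-assert from the first gotcha can be sketched as an idempotent script for the timer to run. This is an illustration, not the stack's actual unit: the check-before-insert step (iptables -C) is an assumption about how duplicates are avoided, and the IPTABLES override is a hypothetical hook for exercising the logic without root.

```shell
#!/bin/sh
# Sketch of what a 60s timer could run on the host.
# Assumption: IPTABLES may be overridden to test the logic without root.
assert_bridged_forward() {
  ipt="${IPTABLES:-iptables}"
  # -C checks whether the exact rule is present. Insert at the top of
  # FORWARD only when it is missing, so repeated runs never stack duplicates.
  if ! $ipt -C FORWARD -m physdev --physdev-is-bridged -j ACCEPT 2>/dev/null; then
    $ipt -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT
  fi
}
```

After kube-router flushes the chain, the check fails and the rule is re-inserted on the next tick. When the rule is still present, the run is a no-op.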