foundation/runners
Andreas Niemann 44a96d84eb
fix(runners): live-validated the crunchy stack; cutover done
Fixes found running `pulumi up` live against crunchy01 (foundation-runner-02,
static .16, 8c/32G — the new default sizing):

- network-config matches the NIC by glob (`match: {name: "e*"}`) instead of a
  hardcoded enp1s0 — the libvirt.Domain enumerated it differently, leaving the VM
  with no IP.
- drop `qemuAgent: true` — it blocks the provider on the guest agent (not up on a
  fresh boot) during create; we register over the static IP instead.
- runner-register connection gets `dialErrorLimit: 30` so it waits ~5 min for the
  VM to boot + apply its IP, landing the runner in a single `up`.
- fix the register token passing (the old /tmp/t hop was an ephemeral --rm
  container → empty token); pass it directly (pulumi redacts the secret).
- README: host prep (root SSH + the `images` pool), the exact stack config, and
  the cutover marked DONE — a `runs-on: fenced` job ran green on the Pulumi-managed
  runner-02; the hand-built VM was retired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:35:06 +02:00
index.ts fix(runners): live-validated the crunchy stack; cutover done 2026-07-01 03:35:06 +02:00
package.json feat(runners): decoupled Pulumi stack for the fenced runner fleet (R5) 2026-07-01 03:15:39 +02:00
Pulumi.yaml feat(runners): decoupled Pulumi stack for the fenced runner fleet (R5) 2026-07-01 03:15:39 +02:00
README.md fix(runners): live-validated the crunchy stack; cutover done 2026-07-01 03:35:06 +02:00
tsconfig.json feat(runners): decoupled Pulumi stack for the fenced runner fleet (R5) 2026-07-01 03:15:39 +02:00

foundation-runners — the fenced Actions runner fleet (isolated stack)

Step 0 once the foundation is up: a separate Pulumi project/stack that provisions runner VM(s) on a libvirt host (crunchy01) and registers Forgejo Actions runners with a distinct label (fenced). Ecosystem/untrusted jobs (runs-on: fenced) then execute off the forge VM. This is the R5 fence.

Why a separate stack (decoupling)

A @pulumi/libvirt provider dials the runner host on every up/refresh/preview of the stack it lives in. If the runner VM lived in bootstrap, then crunchy01 being down, or unreachable from your machine, would break pulumi refresh/up of the foundation itself (the classic Terraform coupling trap). Pulumi isolates this at the stack boundary: a provider only initializes when its own stack runs. So the fleet is its own project, and bootstrap never imports it. Consequences:

  • Foundation deploy/refresh never touches crunchy01.
  • crunchy01 down ⇒ only this stack's refresh is affected, and only when you run it.
  • One-way dependency: this stack mints a runner token from the forge, so it runs after the foundation is up.

Host prep (one-time, kept OUT of this stack)

The libvirt provider needs something to connect to, so install libvirt on the host out-of-band (not via this stack), and ensure a LAN bridge exists:

sudo apt-get update
sudo apt-get install -y qemu-kvm libvirt-daemon-system libvirt-clients \
  bridge-utils dnsmasq qemu-utils virtinst cloud-image-utils
sudo systemctl enable --now libvirtd
# a LAN bridge (br0) enslaving the physical NIC must already exist (crunchy01 had it).

Also required on the host, one-time:

  • root SSH via key — the @pulumi/libvirt provider and the host firewall command connect as root (add the operator pubkey to /root/.ssh/authorized_keys).
  • a libvirt storage pool — crunchy01 already had one named images (at /var/lib/libvirt/images), so the stack is configured with host.pool images. On a host with the conventional default pool, leave host.pool at its default.
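
The two host settings above (root SSH and the storage pool) could feed the provider roughly as follows. This is a minimal sketch, not the stack's actual code: the helper name hostSettings, the HostSettings shape, and the fallback pool name "default" are assumptions. The qemu+ssh://…/system form is the standard libvirt remote URI for the system daemon over SSH.

```typescript
// Sketch only: names and defaults here are assumptions, not taken from index.ts.
interface HostSettings {
  uri: string;  // libvirt connection URI the provider would dial
  pool: string; // storage pool that holds the VM disk
}

// Build the connection URI (as root, per the host prep) and resolve the pool.
function hostSettings(address: string, pool?: string): HostSettings {
  const host = address.trim();
  if (host === "") {
    throw new Error("host.address must not be empty");
  }
  return {
    uri: `qemu+ssh://root@${host}/system`,
    // Assumed fallback: libvirt's conventional pool name when host.pool is unset.
    pool: pool !== undefined && pool.trim() !== "" ? pool.trim() : "default",
  };
}

console.log(hostSettings("192.168.1.2", "images"));
```

On crunchy01 the pool resolves to images; on a host with only the conventional pool, omitting the second argument falls back to the assumed default.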

Deploy

export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519   # reaches host + VM (root)
cd runners
pulumi stack init crunchy         # isolated file backend, like bootstrap/provision
pulumi config set host.address 192.168.1.2
pulumi config set host.pool images          # crunchy01's pool (see host prep)
pulumi config set forge.address 204.168.234.72
pulumi config set vm.name foundation-runner-02
pulumi config set vm.ipCidr 192.168.1.16/24
pulumi up

pulumi up performs four steps: it applies the kube-router-proof FORWARD timer on the host, creates an Ubuntu VM on br0 (Docker and qemu-guest-agent installed via cloud-init), mints a runner token from the forge, and registers and starts the fenced runner in the VM. To verify, run a runs-on: fenced job on any repo.
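
The first step, the FORWARD timer, could be rendered as a systemd service plus timer pair. The sketch below only builds the unit file text. The iptables rule and the 60 s interval come from this README; the unit names, file paths, the OnBootSec value, and the -C check before -I are assumptions added for illustration.

```typescript
// Rule text as given in this README; everything else is hypothetical.
const FORWARD_RULE = "FORWARD -m physdev --physdev-is-bridged -j ACCEPT";

interface UnitFile {
  path: string;
  content: string;
}

// Render a oneshot service and a timer that re-runs it every intervalSeconds.
function forwardFixUnits(intervalSeconds: number = 60): UnitFile[] {
  // `iptables -C` checks for the rule; `-I` inserts it only when missing,
  // so repeated timer runs do not stack duplicate rules.
  const script =
    `iptables -C ${FORWARD_RULE} 2>/dev/null || iptables -I ${FORWARD_RULE}`;
  const service = [
    "[Unit]",
    "Description=Re-assert bridged FORWARD accept rule",
    "",
    "[Service]",
    "Type=oneshot",
    `ExecStart=/bin/sh -c '${script}'`,
    "",
  ].join("\n");
  const timer = [
    "[Unit]",
    "Description=Periodically re-assert bridged FORWARD accept rule",
    "",
    "[Timer]",
    "OnBootSec=10s", // assumed delay before the first run after boot
    `OnUnitActiveSec=${intervalSeconds}s`,
    "",
    "[Install]",
    "WantedBy=timers.target",
    "",
  ].join("\n");
  return [
    { path: "/etc/systemd/system/forward-fix.service", content: service },
    { path: "/etc/systemd/system/forward-fix.timer", content: timer },
  ];
}

for (const unit of forwardFixUnits()) {
  console.log(`# ${unit.path}\n${unit.content}`);
}
```

Checking before inserting keeps the chain from growing by one rule per minute, while still restoring the rule after kube-router flushes it.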

Cutover: DONE. pulumi up on the crunchy stack created foundation-runner-02 (static .16, 8c/32G) and registered the fenced runner, and a runs-on: fenced job ran green on it. The hand-built foundation-runner-01 was then retired (virsh destroy/undefine + disk removed), so the Pulumi-managed runner-02 is the sole fenced runner. (A now-offline crunchy-runner registration from the hand-built VM may still be listed on the forge. It is harmless; deregister at leisure.)

Gotchas baked into the code (learned the hard way)

  • k3s host firewall. crunchy01 is a k3s node; kube-router sets FORWARD policy DROP and br_netfilter=1, which drops bridged VM↔LAN traffic. The fix is iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT, re-asserted by a 60s systemd timer (kube-router flushes iptables on resync, so a boot-only rule isn't enough).
  • Ubuntu, not Debian genericcloud. On Debian, cloud-init wrote a netplan config that the image never applied, so the VM got no IPv4 address (static or DHCP). Ubuntu 24.04 renders and applies it cleanly.
  • NIC name-agnostic network-config. The cloud-init network-config matches the NIC by glob (match: {name: "e*"}) rather than a hardcoded enp1s0. The libvirt.Domain may enumerate the NIC as ens3 or similar, which left the VM with no IP until the match was made generic.
  • No qemuAgent: true. It makes the provider block on the guest agent (not up on a fresh boot) during create. We register over the VM's static IP, so it's not needed.
  • Register dial window. The runner-register command uses dialErrorLimit: 30 so it waits ~5 min for the VM to boot + apply its IP, landing the runner in a single up.
  • PTY console. The domain declares a pty serial console so virsh console <vm> works. (Don't back serial with a file — you lose interactive console.)
  • Docker socket gid. act_runner runs as uid 1000; the daemon container gets --group-add <docker gid> so it can reach /var/run/docker.sock.
  • IP is optional. The runner polls the forge outbound, so a fixed LAN IP isn't required; set vm.ipCidr to an empty value for DHCP. The default is a static .15 for predictability.
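
Two of the gotchas above (the NIC glob match and the optional IP) meet in the cloud-init network-config. The sketch below shows one way such a document could be rendered as version 2 YAML text. It is an illustration under assumptions: the function name renderNetworkConfig, the ethernets key primary, and the gateway 192.168.1.1 are hypothetical. Only the e* glob and the empty-means-DHCP rule come from this README.

```typescript
// Sketch only: renders cloud-init network-config (version 2) as YAML text.
function renderNetworkConfig(ipCidr: string, gateway?: string): string {
  const cidr = ipCidr.trim();
  const lines = [
    "version: 2",
    "ethernets:",
    "  primary:", // arbitrary ID; the match below selects the actual NIC
    "    match:",
    '      name: "e*"', // glob covers enp1s0, ens3, eth0, ...
  ];
  if (cidr === "") {
    // Empty vm.ipCidr means DHCP.
    lines.push("    dhcp4: true");
  } else {
    if (!/^\d{1,3}(\.\d{1,3}){3}\/\d{1,2}$/.test(cidr)) {
      throw new Error(`vm.ipCidr must look like a.b.c.d/nn, got "${cidr}"`);
    }
    lines.push("    dhcp4: false", "    addresses:", `      - ${cidr}`);
    if (gateway !== undefined && gateway !== "") {
      lines.push("    routes:", "      - to: default", `        via: ${gateway}`);
    }
  }
  return lines.join("\n") + "\n";
}

// Static address as used for foundation-runner-02; the gateway is an assumption.
console.log(renderNetworkConfig("192.168.1.16/24", "192.168.1.1"));
// DHCP variant.
console.log(renderNetworkConfig(""));
```

Matching by glob keeps the same document valid whichever name the guest kernel assigns to the virtio NIC.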