Fixes found running `pulumi up` live against crunchy01 (foundation-runner-02,
static .16, 8c/32G — the new default sizing):
- network-config matches the NIC by glob (`match: {name: "e*"}`) instead of a
hardcoded enp1s0 — the libvirt.Domain enumerated it differently, leaving the VM
with no IP.
- drop `qemuAgent: true` — it blocks the provider on the guest agent (not up on a
fresh boot) during create; we register over the static IP instead.
- runner-register connection gets `dialErrorLimit: 30` so it waits ~5 min for the
VM to boot + apply its IP, landing the runner in a single `up`.
- fix the register token passing (the old /tmp/t hop was an ephemeral --rm
container → empty token); pass it directly (pulumi redacts the secret).
- README: host prep (root SSH + the `images` pool), the exact stack config, and
the cutover marked DONE — a `runs-on: fenced` job ran green on the Pulumi-managed
runner-02; the hand-built VM was retired.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
foundation-runners — the fenced Actions runner fleet (isolated stack)
Step-0 after the foundation stands. A separate Pulumi project/stack that
provisions runner VM(s) on a libvirt host (crunchy01) and registers Forgejo Actions
runners with a distinct label (fenced), so ecosystem/untrusted jobs (runs-on: fenced) execute off the forge VM — the R5 fence.
Why a separate stack (decoupling)
A @pulumi/libvirt provider dials the runner host on every up/refresh/preview
of the stack it lives in. If the runner VM lived in bootstrap, then crunchy01 being
down — or you not having access to it — would break pulumi refresh/up of the
foundation itself (the classic Terraform coupling trap). Pulumi isolates this at
the stack boundary: a provider only initializes when its own stack runs. So the
fleet is its own project; bootstrap never imports it. Consequences:
- Foundation deploy/refresh never touches crunchy01.
- crunchy01 down ⇒ only this stack's refresh is affected, and only when you run it.
- One-way dependency: this stack mints a runner token from the forge, so it runs after the foundation is up.
Host prep (one-time, kept OUT of this stack)
The libvirt provider needs something to connect to, so install libvirt on the host out-of-band (not via this stack), and ensure a LAN bridge exists:
sudo apt-get update
sudo apt-get install -y qemu-kvm libvirt-daemon-system libvirt-clients \
bridge-utils dnsmasq qemu-utils virtinst cloud-image-utils
sudo systemctl enable --now libvirtd
# a LAN bridge (br0) enslaving the physical NIC must already exist (crunchy01 had it).
Also required on the host, one-time:
- root SSH via key: the `@pulumi/libvirt` provider and the host firewall command connect as `root` (add the operator pubkey to `/root/.ssh/authorized_keys`).
- a libvirt storage pool: crunchy01 already had one named `images` (at `/var/lib/libvirt/images`), so the stack is configured with `host.pool images`. On a host with the conventional `default` pool, leave `host.pool` at its default.
Deploy
export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519 # reaches host + VM (root)
cd runners
pulumi stack init crunchy # isolated file backend, like bootstrap/provision
pulumi config set host.address 192.168.1.2
pulumi config set host.pool images # crunchy01's pool (see host prep)
pulumi config set forge.address 204.168.234.72
pulumi config set vm.name foundation-runner-02
pulumi config set vm.ipCidr 192.168.1.16/24
pulumi up
`pulumi up` will: apply the kube-router-proof FORWARD timer on the host, create an
Ubuntu VM on br0 (docker + qemu-guest-agent via cloud-init), mint a runner token
from the forge, and register + run the fenced runner in the VM. Verify with a
`runs-on: fenced` job on any repo.
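As an illustration of how a value like `vm.ipCidr` could be interpreted before it reaches cloud-init, here is a minimal sketch. The helper name `parseIpCidr` and its return shape are hypothetical and not taken from the stack's source; it only assumes that an empty value means DHCP and that anything else should be IPv4 CIDR such as `192.168.1.16/24`.

```typescript
// Hypothetical helper, not from the stack's code: interprets a vm.ipCidr value.
interface ParsedCidr {
  address: string; // e.g. "192.168.1.16"
  prefix: number; // e.g. 24
}

// Returns null for an empty value (assumed to mean "use DHCP"), otherwise the
// address and prefix length. Throws on anything that is not IPv4 CIDR.
function parseIpCidr(value: string): ParsedCidr | null {
  const trimmed = value.trim();
  if (trimmed === "") return null;

  const match = /^(\d{1,3}(?:\.\d{1,3}){3})\/(\d{1,2})$/.exec(trimmed);
  if (!match) throw new Error(`vm.ipCidr is not IPv4 CIDR: ${trimmed}`);

  const address = match[1];
  const prefix = Number(match[2]);
  // Range checks the regex cannot express: each octet <= 255, prefix <= 32.
  const octetsOk = address.split(".").every((octet) => Number(octet) <= 255);
  if (!octetsOk || prefix > 32) {
    throw new Error(`vm.ipCidr is out of range: ${trimmed}`);
  }
  return { address, prefix };
}
```

Failing fast on a malformed value is a design choice for this sketch: a typo surfaces at preview time rather than as a VM that boots without an address.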
Cutover: DONE.
`pulumi up` on the `crunchy` stack created `foundation-runner-02` (static `.16`, 8c/32G), registered the `fenced` runner, and a `runs-on: fenced` job ran green on it. The hand-built `foundation-runner-01` was then retired (`virsh destroy`/`undefine` + disk removed), so the Pulumi-managed runner-02 is the sole fenced runner. (A now-offline `crunchy-runner` registration from the hand-built VM may still be listed on the forge; it is harmless, so deregister at leisure.)
Gotchas baked into the code (learned the hard way)
- k3s host firewall. crunchy01 is a k3s node; kube-router sets `FORWARD policy DROP` + `br_netfilter=1`, dropping bridged VM↔LAN traffic. Fix = `iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT`, re-asserted by a 60s systemd timer (kube-router flushes iptables on resync, so a boot-only rule isn't enough).
- Ubuntu, not Debian genericcloud. Debian's cloud-init wrote netplan the image never applied → no IPv4 (static or DHCP). Ubuntu 24.04 renders + applies cleanly.
- NIC name-agnostic network-config. The cloud-init network-config matches the NIC by glob (`match: {name: "e*"}`), not a hardcoded `enp1s0`. The libvirt.Domain may enumerate it as `ens3`/etc., which left the VM with no IP until matched generically.
- No `qemuAgent: true`. It makes the provider block on the guest agent (not up on a fresh boot) during create. We register over the VM's static IP, so it's not needed.
- Register dial window. The runner-register command uses `dialErrorLimit: 30` so it waits ~5 min for the VM to boot + apply its IP, landing the runner in a single `up`.
- PTY console. The domain declares a `pty` serial console so `virsh console <vm>` works. (Don't back serial with a file: you lose interactive console.)
- Docker socket gid. act_runner runs as uid 1000; the daemon container gets `--group-add <docker gid>` so it can reach `/var/run/docker.sock`.
- IP is optional. The runner polls the forge outbound, so a fixed LAN IP isn't required; set `vm.ipCidr` empty for DHCP. Default is a static `.15` for predictability.
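The NIC-glob and optional-IP points above can be sketched together as a small function that renders a cloud-init network-config (version 2). This is an illustrative sketch, not the stack's actual code: the function name `renderNetworkConfig`, the interface ID `lan`, and the `gateway` parameter (with the example value `192.168.1.1`) are assumptions introduced here.

```typescript
// Hypothetical helper: renders a cloud-init network-config (version 2) document.
// An empty ipCidr selects DHCP; otherwise the address is assigned statically.
// The gateway argument is an assumption for illustration, not from the README.
function renderNetworkConfig(ipCidr: string, gateway?: string): string {
  const lines: string[] = [
    "version: 2",
    "ethernets:",
    "  lan:", // arbitrary ID; the match block selects the real device
    "    match:",
    '      name: "e*"', // glob: covers enp1s0, ens3, eth0, ...
  ];

  const cidr = ipCidr.trim();
  if (cidr === "") {
    lines.push("    dhcp4: true");
  } else {
    lines.push("    dhcp4: false", "    addresses:", `      - ${cidr}`);
    if (gateway) {
      lines.push("    routes:", "      - to: default", `        via: ${gateway}`);
    }
  }
  return lines.join("\n") + "\n";
}

// Static example using the address from the Deploy section (gateway assumed).
console.log(renderNetworkConfig("192.168.1.16/24", "192.168.1.1"));
// DHCP example: vm.ipCidr left empty.
console.log(renderNetworkConfig(""));
```

Because the device is selected by `match` rather than by a fixed name, the same document should apply whether the guest enumerates the NIC as `enp1s0` or `ens3`.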