olsitec/foundation

Andreas Niemann 44a96d84eb

fix(runners): live-validated the crunchy stack; cutover done

Fixes found running `pulumi up` live against crunchy01 (foundation-runner-02,
static .16, 8c/32G — the new default sizing):

- network-config matches the NIC by glob (`match: {name: "e*"}`) instead of a
  hardcoded enp1s0 — the libvirt.Domain enumerated it differently, leaving the VM
  with no IP.
- drop `qemuAgent: true` — it blocks the provider on the guest agent (not up on a
  fresh boot) during create; we register over the static IP instead.
- runner-register connection gets `dialErrorLimit: 30` so it waits ~5 min for the
  VM to boot + apply its IP, landing the runner in a single `up`.
- fix the register token passing (the old /tmp/t hop was an ephemeral --rm
  container → empty token); pass it directly (pulumi redacts the secret).
- README: host prep (root SSH + the `images` pool), the exact stack config, and
  the cutover marked DONE — a `runs-on: fenced` job ran green on the Pulumi-managed
  runner-02; the hand-built VM was retired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-07-01 03:35:06 +02:00

foundation-runners — the fenced Actions runner fleet (isolated stack)

Step 0 once the foundation is up. This is a separate Pulumi project/stack that provisions runner VM(s) on a libvirt host (crunchy01) and registers Forgejo Actions runners with a distinct label (fenced), so that ecosystem/untrusted jobs (runs-on: fenced) execute off the forge VM. This is the R5 fence.

Why a separate stack (decoupling)

A @pulumi/libvirt provider dials the runner host on every up/refresh/preview of the stack it lives in. If the runner VM lived in bootstrap, then crunchy01 being down — or you not having access to it — would break pulumi refresh/up of the foundation itself (the classic Terraform coupling trap). Pulumi isolates this at the stack boundary: a provider only initializes when its own stack runs. So the fleet is its own project; bootstrap never imports it. Consequences:

Foundation deploy/refresh never touches crunchy01.
crunchy01 down ⇒ only this stack's refresh is affected, and only when you run it.
One-way dependency: this stack mints a runner token from the forge, so it runs after the foundation is up.

Host prep (one-time, kept OUT of this stack)

The libvirt provider needs something to connect to, so install libvirt on the host out-of-band (not via this stack), and ensure a LAN bridge exists:

sudo apt-get update
sudo apt-get install -y qemu-kvm libvirt-daemon-system libvirt-clients \
  bridge-utils dnsmasq qemu-utils virtinst cloud-image-utils
sudo systemctl enable --now libvirtd
# a LAN bridge (br0) enslaving the physical NIC must already exist (crunchy01 had it).

Also required on the host, one-time:

root SSH via key — the @pulumi/libvirt provider and the host firewall command connect as root (add the operator pubkey to /root/.ssh/authorized_keys).
a libvirt storage pool — crunchy01 already had one named images (at /var/lib/libvirt/images), so the stack is configured with host.pool images. On a host with the conventional default pool, leave host.pool at its default.
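
Both prerequisites can be checked from the operator machine before the first pulumi up. The following is a minimal sketch, not part of the stack: the function names are hypothetical, and it assumes the host is reached as root over a standard libvirt qemu+ssh URI, that the bridge is named br0, and that ssh and virsh are installed locally.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not part of the stack.

# Build the libvirt connection URI for root-over-SSH access (assumed transport).
libvirt_uri() {
  printf 'qemu+ssh://root@%s/system\n' "$1"
}

# Check that the LAN bridge and the storage pool exist on the host.
# $1 = host address, $2 = pool name (falls back to "default").
preflight() {
  host="$1"
  pool="${2:-default}"
  ssh "root@$host" ip link show br0 >/dev/null \
    || { echo "bridge br0 not found on $host" >&2; return 1; }
  virsh -c "$(libvirt_uri "$host")" pool-info "$pool" >/dev/null \
    || { echo "storage pool '$pool' not found on $host" >&2; return 1; }
  echo "host $host looks ready"
}
```

For crunchy01 this would be called as preflight 192.168.1.2 images; a non-zero return means one of the one-time steps above is still missing.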

Deploy

export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519   # reaches host + VM (root)
cd runners
pulumi stack init crunchy         # isolated file backend, like bootstrap/provision
pulumi config set host.address 192.168.1.2
pulumi config set host.pool images          # crunchy01's pool (see host prep)
pulumi config set forge.address 204.168.234.72
pulumi config set vm.name foundation-runner-02
pulumi config set vm.ipCidr 192.168.1.16/24
pulumi up

pulumi up will: apply the kube-router-proof FORWARD timer on the host, create an Ubuntu VM on br0 (docker + qemu-guest-agent via cloud-init), mint a runner token from the forge, and register + run the fenced runner in the VM. Verify with a runs-on: fenced job on any repo.
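
The FORWARD timer's job can be pictured as one idempotent rule assertion per tick, using the iptables rule described under Gotchas. This is a sketch under stated assumptions, not the stack's actual unit: the function name and the IPTABLES override are hypothetical, the script is assumed to run as root, and iptables -C is used to test for the rule first so repeated ticks do not stack duplicates.

```shell
#!/bin/sh
# Hypothetical sketch of what each timer tick could run; assumes root.
# IPTABLES is overridable only so the logic can be exercised without touching real rules.
IPTABLES="${IPTABLES:-iptables}"

# Insert the bridged-traffic ACCEPT rule at the top of FORWARD only if it is absent.
ensure_bridged_forward() {
  if "$IPTABLES" -C FORWARD -m physdev --physdev-is-bridged -j ACCEPT 2>/dev/null; then
    echo "rule present"
  else
    "$IPTABLES" -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT \
      && echo "rule inserted"
  fi
}
```

Calling ensure_bridged_forward from a oneshot service triggered every 60s would restore the rule after kube-router flushes it, while leaving the chain untouched when the rule is still there.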

Cutover: DONE. pulumi up on the crunchy stack created foundation-runner-02 (static .16, 8c/32G) and registered the fenced runner, and a runs-on: fenced job ran green on it. The hand-built foundation-runner-01 was then retired (virsh destroy/undefine + disk removed), so the Pulumi-managed runner-02 is the sole fenced runner. (A now-offline crunchy-runner registration from the hand-built VM may still be listed on the forge; it is harmless and can be deregistered at leisure.)

Gotchas baked into the code (learned the hard way)

k3s host firewall. crunchy01 is a k3s node; kube-router sets the FORWARD policy to DROP and, with br_netfilter active, bridged frames pass through iptables, so bridged VM↔LAN traffic is dropped. Fix = iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT, re-asserted by a 60s systemd timer (kube-router flushes iptables on resync, so a boot-only rule isn't enough).
Ubuntu, not Debian genericcloud. Debian's cloud-init wrote a netplan config that the image never applied → no IPv4 (static or DHCP). Ubuntu 24.04 renders and applies it cleanly.
NIC name-agnostic network-config. The cloud-init network-config matches the NIC by glob (match: {name: "e*"}), not a hardcoded enp1s0 — the libvirt.Domain may enumerate it as ens3/etc., which left the VM with no IP until matched generically.
No qemuAgent: true. It makes the provider block on the guest agent (not up on a fresh boot) during create. We register over the VM's static IP, so it's not needed.
Register dial window. The runner-register command uses dialErrorLimit: 30 so it waits ~5 min for the VM to boot + apply its IP, landing the runner in a single up.
PTY console. The domain declares a pty serial console so virsh console <vm> works. (Don't back serial with a file — you lose interactive console.)
Docker socket gid. act_runner runs as uid 1000; the daemon container gets --group-add <docker gid> so it can reach /var/run/docker.sock.
IP is optional. The runner polls the forge outbound, so a fixed LAN IP isn't required — set vm.ipCidr empty for DHCP. Default is a static .15 for predictability.
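
The NIC glob and the static/DHCP switch from the list above can be combined in one small renderer for the cloud-init network-config. This is an illustrative sketch in netplan v2 syntax, not the stack's own template: the function name, the lan identifier, and the gateway parameter are hypothetical (the README does not state a gateway).

```shell
#!/bin/sh
# Hypothetical renderer for a cloud-init network-config (netplan v2 syntax).
# $1 = address in CIDR form, or empty for DHCP
# $2 = default gateway (assumed parameter, used only for a static address)
render_network_config() {
  cidr="$1"
  gateway="$2"
  # Match any NIC whose name starts with "e" (enp1s0, ens3, eth0, ...).
  cat <<'EOF'
version: 2
ethernets:
  lan:
    match:
      name: "e*"
EOF
  if [ -z "$cidr" ]; then
    echo "    dhcp4: true"
  else
    cat <<EOF
    dhcp4: false
    addresses: [$cidr]
    routes:
      - to: default
        via: $gateway
EOF
  fi
}
```

For example, render_network_config 192.168.1.16/24 192.168.1.1 emits the static variant (the gateway address here is only a placeholder), while render_network_config "" emits the DHCP variant that corresponds to an empty vm.ipCidr.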
