foundation/runners/README.md

# foundation-runners — the fenced Actions runner fleet (isolated stack)

**Step-0 *after* the foundation stands.** A separate Pulumi project/stack that
provisions runner VM(s) on a libvirt host (crunchy01) and registers Forgejo Actions
runners with a distinct label (`fenced`), so that ecosystem/untrusted jobs
(`runs-on: fenced`) execute **off** the forge VM: the R5 fence.

## Why a separate stack (decoupling)

A `@pulumi/libvirt` provider dials the runner host on **every** `up`/`refresh`/`preview`
of the stack it lives in. If the runner VM lived in `bootstrap`, then crunchy01 being
down — or you not having access to it — would break `pulumi refresh`/`up` of the
**foundation itself** (the classic Terraform coupling trap). Pulumi isolates this at
the **stack boundary**: a provider only initializes when *its own* stack runs. So the
fleet is its own project; `bootstrap` never imports it. Consequences:

- Foundation deploy/refresh **never touches** crunchy01.
- crunchy01 down ⇒ only *this* stack's refresh is affected, and only when you run it.
- One-way dependency: this stack mints a runner token *from* the forge, so it runs
  **after** the foundation is up.

## Host prep (one-time, kept OUT of this stack)

The libvirt provider needs something to connect to, so install libvirt on the host
out-of-band (not via this stack), and ensure a LAN bridge exists:

```sh
sudo apt-get update
sudo apt-get install -y qemu-kvm libvirt-daemon-system libvirt-clients \
  bridge-utils dnsmasq qemu-utils virtinst cloud-image-utils
sudo systemctl enable --now libvirtd
# a LAN bridge (br0) enslaving the physical NIC must already exist (crunchy01 had it).
```
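
Before the first `pulumi up`, it can help to confirm that the host prep actually took. The following is a minimal sketch, not part of this stack: the function names and the `SYSFS_NET` override are hypothetical, and it assumes the bridge is called `br0` and that libvirt is reached through the system URI `qemu:///system`.

```sh
# Hypothetical preflight helpers; names are illustrative, not part of this stack.

# A Linux bridge exposes a "bridge" directory under its sysfs net entry.
# SYSFS_NET is overridable only so the check can be exercised off-host.
bridge_exists() {
  [ -d "${SYSFS_NET:-/sys/class/net}/$1/bridge" ]
}

# Report what is missing; returns non-zero if anything is.
host_prep_ok() {
  _rc=0
  command -v virsh >/dev/null 2>&1 || { echo "missing: virsh (libvirt-clients)"; _rc=1; }
  bridge_exists "${1:-br0}" || { echo "missing: bridge ${1:-br0}"; _rc=1; }
  return "$_rc"
}

# Usage on the host:
#   host_prep_ok br0 && virsh -c qemu:///system list --all
```

If `host_prep_ok` prints nothing and `virsh` lists domains without a connection error, the provider should have something to dial.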

## Deploy

```sh
export RUNNER_SSH_KEY_PATH="$HOME/.ssh/foundation-test_ed25519"   # reaches host + VM (root)
cd runners
pulumi stack init crunchy         # isolated file backend, like bootstrap/provision
pulumi config set host.address 192.168.1.2
pulumi config set forge.address 204.168.234.72
pulumi up
```

`pulumi up` will:

- apply the kube-router-proof FORWARD timer on the host,
- create an Ubuntu VM on `br0` (docker + qemu-guest-agent via cloud-init),
- mint a runner token from the forge, and
- register and start the `fenced` runner in the VM.

Verify with a `runs-on: fenced` job on any repo.
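
One way to run that check is to drop a throwaway workflow into a repo on the forge. This is a sketch under assumptions: the helper name, the file name `fenced-smoke.yml`, and the job contents are all illustrative, and it relies on Forgejo picking up workflows from `.forgejo/workflows/`.

```sh
# Hypothetical helper: write a one-job smoke workflow into a repo checkout.
# The file name and job contents are illustrative.
write_fenced_smoke() {
  repo_dir=${1:-.}
  mkdir -p "$repo_dir/.forgejo/workflows"
  cat > "$repo_dir/.forgejo/workflows/fenced-smoke.yml" <<'EOF'
name: fenced-smoke
on: [push]
jobs:
  smoke:
    runs-on: fenced
    steps:
      - run: hostname && uname -a
EOF
}

# Usage, from inside a clone of any repo on the forge:
#   write_fenced_smoke . && git add .forgejo && git commit -m "fenced smoke test" && git push
```

If the job is picked up and completes, the `fenced` label is registered and the runner is polling; if it stays queued, no runner with that label is online.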

> **Cutover note.** The first fenced runner was built by hand (SESSION_2026-07-01_003).
> A `pulumi up` here creates a *fresh* declarative VM; at cutover, retire the hand-built
> `foundation-runner-01` (`virsh destroy`, then `virsh undefine`), or point config at a
> new `vm.name` to run both. This code is committed and typechecked, but a live
> `pulumi up` cutover has not yet been run; that is the remaining validation step.

## Gotchas baked into the code (learned the hard way)

- **k3s host firewall.** crunchy01 is a k3s node; kube-router sets `FORWARD policy
  DROP` + `br_netfilter=1`, which drops bridged VM↔LAN traffic. The fix is `iptables -I
  FORWARD -m physdev --physdev-is-bridged -j ACCEPT`, re-asserted by a **60s systemd
  timer** (kube-router flushes iptables on resync, so a boot-only rule isn't enough).
- **Ubuntu, not Debian genericcloud.** Debian's cloud-init wrote a netplan config that
  the image never applied, leaving the VM with no IPv4 (static *or* DHCP). Ubuntu 24.04
  renders and applies it cleanly.
- **PTY console.** The domain declares a `pty` serial console so `virsh console <vm>`
  works. (Don't back serial with a file — you lose interactive console.)
- **Docker socket gid.** act_runner runs as uid 1000; the daemon container gets
  `--group-add <docker gid>` so it can reach `/var/run/docker.sock`.
- **IP is optional.** The runner polls the forge outbound, so a fixed LAN IP isn't
  required — set `vm.ipCidr` empty for DHCP. Default is a static `.15` for predictability.
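
Two of the gotchas above reduce to short shell idioms. The sketch below is illustrative rather than a copy of what this stack ships: the function names and the `IPTABLES` override are hypothetical. The first function checks for the rule with `iptables -C` before inserting it, so a timer can call it repeatedly without stacking duplicates; the second reads the socket's group id with GNU `stat`.

```sh
# Hypothetical helpers; function names are illustrative, not taken from this stack.

# Insert the bridged-traffic ACCEPT rule only if it is absent, so a timer can
# call this every minute without stacking duplicates. IPTABLES is overridable
# for dry runs. Needs root on the host.
ensure_bridged_forward_accept() {
  ipt=${IPTABLES:-iptables}
  "$ipt" -C FORWARD -m physdev --physdev-is-bridged -j ACCEPT 2>/dev/null ||
    "$ipt" -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT
}

# Print the numeric group id that owns the Docker socket, for --group-add.
docker_sock_gid() {
  stat -c '%g' "${1:-/var/run/docker.sock}"
}
```

In the VM, the second helper would feed `--group-add "$(docker_sock_gid)"` on the `docker run` that starts the runner daemon.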