foundation/documentation/sessions/HANDOVER.md
Andreas Niemann 1bba311d60
docs(session): SESSION_003 (fenced runner fleet) + handover for next agent
Records the R5 fence work (build → harden → decoupled runners/ Pulumi stack →
live cutover to foundation-runner-02 on crunchy01) and captures the operator's
two new asks for the next session: a brix02 failover runner, and a k8s runner on
crunchy's k3s for heavy (16CPU/64GB) seaspots-s57-utils jobs. Refreshes HANDOVER
to prioritize those + the standing backlog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 03:39:20 +02:00

# HANDOVER — next-session prompt (paste into a fresh context)
> Living doc: overwritten each handover. Durable record = the dated `SESSION_*` files.
> Latest state = `SESSION_2026-07-01_003.md` (read it first, then #002 + #001).
---
Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**
(remote VMs, k3s, Docker, secrets).
## Required reads (in `~/work/olsitec-foundation/foundation/`)
1. `documentation/sessions/SESSION_2026-07-01_003.md` ← runner fleet + the NEW asks below
2. `documentation/sessions/SESSION_2026-07-01_002.md` ← T14 + ecosystem CI · `_001.md` ← the egg
3. `documentation/contracts/CONTRACT_001004` + `decisions/ADR_004,005,006,007` (ADR-007 first)
4. `runners/README.md` ← the decoupled runner-fleet stack (host prep, config, gotchas)
5. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)
## Where things stand (all green / live)
- **The egg is LIVE** (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI
(reusable workflows + selftest) green; `https://forge.olsitec.net`=200.
- **The R5 fence is LIVE + codified.** `foundation-runner-02` (crunchy01 VM, `192.168.1.16`,
8c/32G, label `fenced`) runs ecosystem/untrusted jobs OFF the forge VM. It's managed by the
**`runners/`** Pulumi stack — an **isolated project** (`bootstrap` never imports it), so
foundation deploy/refresh never touches crunchy01. Stack = `crunchy`; config + state are
gitignored (operator workstation only).
- Foundation repo `master` clean, all pushed.
## THIS session's work (operator asks, in priority order)
### 1. brix02 runner with failover from crunchy01
Add a runner on **brix02 (`192.168.1.3`)** that picks up jobs **only when crunchy01 is
unavailable**. **Forgejo has no native standby mode**: online runners that share a label split
the jobs between them, and an offline runner simply receives none. Choose with the operator:
- *HA-on-outage (simple, recommended):* register brix02 with the SAME `fenced` label → when
crunchy is down, brix02 covers; when both are up, they share the load.
- *Strict standby (custom):* keep the brix02 runner STOPPED and add a watchdog (a systemd timer
polling the Forgejo runners API) that starts it only when crunchy's runner is offline.
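The strict-standby option comes down to one decision per timer tick. Below is a minimal sketch of that decision only, assuming the watchdog has already obtained crunchy's runner status and the local service state. The function name `decide_action`, the unit name `forgejo-runner.service`, and the shape of the status lookup are all hypothetical; the runners API path has not been verified against Forgejo 11.

```shell
#!/bin/sh
# decide_action PRIMARY_STATUS STANDBY_STATE
#   PRIMARY_STATUS: "online", or anything else (treated as offline)
#   STANDBY_STATE:  "active", or anything else (as printed by `systemctl is-active`)
# Prints exactly one of: start, stop, none
decide_action() {
    primary="$1"
    standby="$2"
    if [ "$primary" = "online" ]; then
        # Primary is healthy: the standby must stay idle.
        if [ "$standby" = "active" ]; then echo stop; else echo none; fi
    else
        # Primary is offline or its status is unknown: the standby should run.
        if [ "$standby" = "active" ]; then echo none; else echo start; fi
    fi
}

# Hypothetical wiring for the timer's service unit (names are assumptions):
#   primary=$(query_primary_status)   # placeholder for the runners API lookup
#   standby=$(systemctl is-active forgejo-runner.service)
#   case "$(decide_action "$primary" "$standby")" in
#       start) systemctl start forgejo-runner.service ;;
#       stop)  systemctl stop  forgejo-runner.service ;;
#   esac
```

Note the design choice: a failed or empty status lookup counts as "primary offline" and starts the standby. If the forge itself being unreachable should not trigger a start, the lookup needs to report that case separately.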
The `runners/` stack is multi-host-capable: `cd runners && pulumi stack init brix02 &&
pulumi config set host.address 192.168.1.3 && ... vm.name foundation-runner-03`. **FIRST**
verify brix02 has KVM + libvirt + a LAN bridge (same host prep as crunchy — see runners/README
§Host prep; brix02 is also the Graylog target, so it's a real box). Then `pulumi up` and prove
a `fenced` job runs on it (and, for standby, that it idles while crunchy is up).
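The "verify FIRST" step can be scripted as a quick preflight. This is a sketch under assumptions, not the check from `runners/README`: the function names are made up here, and finding a bridge interface does not prove it is the LAN bridge (it could equally be a NAT or CNI bridge), so the listed names still need a human look.

```shell
#!/bin/sh
# has_kvm [DEVICE]: succeed if the KVM character device exists.
has_kvm() {
    [ -c "${1:-/dev/kvm}" ]
}

# has_libvirt: succeed if the virsh client is on PATH.
has_libvirt() {
    command -v virsh >/dev/null 2>&1
}

# list_bridges [SYSFS_NET_DIR]: print the name of every Linux bridge interface.
# A bridge exposes a "bridge" directory under its sysfs entry.
list_bridges() {
    for d in "${1:-/sys/class/net}"/*/bridge; do
        [ -d "$d" ] || continue
        basename "$(dirname "$d")"
    done
}

# preflight: run all three checks, report each failure, return non-zero if any failed.
preflight() {
    rc=0
    has_kvm || { echo "FAIL: /dev/kvm missing"; rc=1; }
    has_libvirt || { echo "FAIL: virsh not found"; rc=1; }
    [ -n "$(list_bridges)" ] || { echo "FAIL: no bridge interface"; rc=1; }
    return "$rc"
}
```

Run `preflight` on brix02 itself; `list_bridges` on its own prints the candidate bridge names to compare against the crunchy setup.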
### 2. k8s runner for heavy (16CPU/64GB) jobs on crunchy's k3s
The seaspots GitLab pipelines (`~/work/seaspots/gitlab/pipelines/.gitlab-ci.yml`) run
**seaspots-s57-utils** (ogr2ogr + tippecanoe;
`registry.gitlab.com/seaspots/tools/seaspots-s57-utils:1.11.0`), `tags: [heavy-compute]`,
needing **16+ CPU / 64+ GB RAM / 100+ GB disk**; they'd crush the 8c/32G VM runner. Stand up a
**Forgejo runner inside crunchy's k3s
cluster** (k8s-scheduled resources) with a distinct label (e.g. `heavy`), so `runs-on: heavy`
jobs run there. DESIGN TASK (not started): Forgejo `act_runner` executes via **docker** or
**host** mode — no mature native k8s executor (unlike GitLab). Evaluate act_runner as a k8s
Deployment with big resource requests (host-mode or a DinD sidecar) vs. alternatives. Note
crunchy's k3s already runs the GitLab runners (namespace `gitlab`) — do not disturb them.
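The sizing argument above is three threshold comparisons. A small sketch, with the hypothetical helper `meets_heavy`, makes the floor explicit so any candidate (VM, node, or pod request) can be checked the same way. The thresholds are the ones quoted above; the helper itself is an illustration only.

```shell
#!/bin/sh
# meets_heavy CPUS MEM_GB DISK_GB
# Succeeds only if all three values meet the heavy-compute floor
# (16 CPU, 64 GB RAM, 100 GB disk). Arguments must be whole numbers.
meets_heavy() {
    [ "$1" -ge 16 ] && [ "$2" -ge 64 ] && [ "$3" -ge 100 ]
}
```

For example, `meets_heavy 8 32 100` fails for the 8c/32G VM runner, while `meets_heavy 16 64 100` passes. When feeding in values read from a live node, remember that usable memory is somewhat below the installed amount, so leave a margin.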
## Operating essentials
- **Forge VM**: `204.168.234.72`, SSH **:222**, key `~/.ssh/foundation-test_ed25519`.
Deploy: `cd bootstrap && ./run.sh up`. Passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
Forge admin: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
- **crunchy01**: `root@192.168.1.2` (operator key in root's authorized_keys) OR `andiolsi`+sudo.
libvirt installed; pool `images`; `libvirt-bridge-forward.timer` active (kube-router-proof).
Runner fleet: `cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519;
export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE);
pulumi stack select crunchy`.
- **CI image**: `foundation-ci:latest`, built on the forge VM (`/tmp/ci-image`); rebuild on toolchain change.
- **Reuse mechanism**: Forgejo 11 reusable workflows work, but the CALLING job needs `runs-on`
plus a SHORT cross-repo ref (see `.forgejo/workflows/README.md`). Composite actions need a FULL-URL ref.
## Watchouts (HIGH-RISK)
- crunchy01 is a k3s node — the `physdev-is-bridged` FORWARD accept is what lets VMs reach the
LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (`gitlab`
namespace runners, `nominatim`, flannel/cni0).
- Never commit the passphrase / Vault root token / unseal keys. `runners` stack state lives
only on the workstation (not backed up — a DR gap to address).
- A stale offline `crunchy-runner` registration remains on the forge (from the retired hand-built
VM); harmless, deregister at leisure.
- Don't `pulumi up` the prod `olsicloud4-*` stacks.
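For the "runner goes dark" case in the first watchout, the two checks can be scripted. This sketch assumes the accept rule has the plain form `-m physdev --physdev-is-bridged -j ACCEPT` in the `FORWARD` chain; the real rule on crunchy01 may carry extra matches, in which case `iptables -C` will report it missing. The function names are hypothetical, and `IPTABLES` / `SYSTEMCTL` are override hooks added here for dry runs.

```shell
#!/bin/sh
# bridged_forward_ok: succeed if the assumed FORWARD accept rule exists.
# `iptables -C` checks for a rule without modifying anything (needs root).
bridged_forward_ok() {
    "${IPTABLES:-iptables}" -C FORWARD -m physdev --physdev-is-bridged -j ACCEPT 2>/dev/null
}

# forward_timer_ok: succeed if the timer that re-applies the rule is active.
forward_timer_ok() {
    "${SYSTEMCTL:-systemctl}" is-active --quiet libvirt-bridge-forward.timer
}

# diagnose_runner_dark: print the state of both checks, one per line.
diagnose_runner_dark() {
    if bridged_forward_ok; then
        echo "forward rule: present"
    else
        echo "forward rule: MISSING"
    fi
    if forward_timer_ok; then
        echo "timer: active"
    else
        echo "timer: NOT active"
    fi
}
```

Run `diagnose_runner_dark` as root on crunchy01. It only reads state, so it is safe next to the k3s workloads.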
## Standing backlog (after the two asks above)
- **Package registry (Stage-2)** — publish `@olsitec` pkgs so the C1/C5 docker candidates build.
- **T15** — index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
- **Hardening** — pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade;
back up the `runners` stack state.
Validate each task live and commit atomically per task.
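The `runners` state backup listed under Hardening could start as a timestamped tarball of the local backend directory. This is a minimal sketch, assuming the state lives in `runners/state` (as the `PULUMI_BACKEND_URL` above suggests); the function name and archive naming are made up here, and where the archive ends up is left to the operator.

```shell
#!/bin/sh
# backup_state SRC_DIR DEST_DIR
# Archive SRC_DIR into DEST_DIR as a UTC-timestamped tarball and print the archive path.
backup_state() {
    src="$1"
    dest="$2"
    [ -d "$src" ] || { echo "no such state dir: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    archive="$dest/runners-state-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
    # -C makes paths inside the archive relative to the state directory's parent.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}
```

Usage would be along the lines of `backup_state ~/work/olsitec-foundation/foundation/runners/state /some/offsite/mount`, where the destination is a placeholder. Keep the archive out of git, like the state itself, and store it somewhere other than the workstation so it actually closes the DR gap.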