foundation/documentation/sessions/HANDOVER.md

# HANDOVER — next-session prompt (paste into a fresh context)

> Living doc: overwritten each handover. Durable record = the dated `SESSION_*` files.
> Latest state = `SESSION_2026-07-01_003.md` (read it first, then #002 + #001).

---

Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**
(remote VMs, k3s, Docker, secrets).

## Required reads (in `~/work/olsitec-foundation/foundation/`)
1. `documentation/sessions/SESSION_2026-07-01_003.md` ← runner fleet + the NEW asks below
2. `documentation/sessions/SESSION_2026-07-01_002.md` ← T14 + ecosystem CI · `_001.md` ← the egg
3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (ADR-007 first)
4. `runners/README.md` ← the decoupled runner-fleet stack (host prep, config, gotchas)
5. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)

## Where things stand (all green / live)
- **The egg is LIVE** (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI
  (reusable workflows + selftest) green; `https://forge.olsitec.net`=200.
- **The R5 fence is LIVE + codified.** `foundation-runner-02` (crunchy01 VM, `192.168.1.16`,
  8c/32G, label `fenced`) runs ecosystem/untrusted jobs OFF the forge VM. It's managed by the
  **`runners/`** Pulumi stack — an **isolated project** (`bootstrap` never imports it), so
  foundation deploy/refresh never touches crunchy01. Stack = `crunchy`; config + state are
  gitignored (operator workstation only).
- Foundation repo `master` clean, all pushed.

## THIS session's work (operator asks, in priority order)
### 1. brix02 runner with failover from crunchy01
Add a runner on **brix02 (`192.168.1.3`)** that picks up jobs **only when crunchy01 is
unavailable**. **Forgejo has no native standby** — same-label runners load-balance; offline
ones get nothing. Choose with the operator:
- *HA-on-outage (simple, recommended):* register brix02 with the SAME `fenced` label → when
  crunchy is down brix02 covers; when both up they share load.
- *Strict standby (custom):* brix02 runner kept STOPPED + a watchdog (systemd timer polling
  the Forgejo runners API) that starts it only when crunchy's runner is offline.
The `runners/` stack is multi-host-capable: `cd runners && pulumi stack init brix02 &&
pulumi config set host.address 192.168.1.3 && ... vm.name foundation-runner-03`. **FIRST**
verify brix02 has KVM + libvirt + a LAN bridge (same host prep as crunchy — see runners/README
§Host prep; brix02 is also the Graylog target, so it's a real box). Then `pulumi up` and prove
a `fenced` job runs on it (and, for standby, that it idles while crunchy is up).
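
The strict-standby option hinges on one decision: start brix02's runner only when crunchy's runner is offline, and stop it again once crunchy is back. Below is a minimal sketch of that decision logic only. It assumes the runners API response has already been fetched and parsed into a list of dicts; the field names `name` and `status` and the value `"online"` are assumptions for illustration, not confirmed against the Forgejo API, and the helper names are hypothetical.

```python
from typing import Iterable, Mapping


def primary_is_online(runners: Iterable[Mapping[str, str]], primary_name: str) -> bool:
    """Return True only if the primary runner is listed and reports online.

    ASSUMPTION: each runner is a mapping with "name" and "status" keys and
    "online" marks a healthy runner. Adjust to the real API payload.
    """
    for runner in runners:
        if runner.get("name") == primary_name:
            return runner.get("status") == "online"
    # A primary that is not registered at all is treated as unavailable.
    return False


def standby_action(
    runners: Iterable[Mapping[str, str]],
    primary_name: str,
    standby_running: bool,
) -> str:
    """Decide what the watchdog should do on this tick: "start", "stop" or "none"."""
    online = primary_is_online(runners, primary_name)
    if not online and not standby_running:
        return "start"  # primary is down and the standby is idle: take over
    if online and standby_running:
        return "stop"   # primary recovered: return the standby to idle
    return "none"       # already in the desired state


if __name__ == "__main__":
    # Hypothetical snapshot: the crunchy runner is offline, the standby is stopped.
    snapshot = [{"name": "foundation-runner-02", "status": "offline"}]
    print(standby_action(snapshot, "foundation-runner-02", standby_running=False))
```

A systemd timer could run a script like this and map `"start"`/`"stop"` onto starting or stopping the runner service. Note that a blunt `"stop"` would interrupt a job the standby is still executing, so a real watchdog would probably need to wait for the runner to go idle first.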

### 2. k8s runner for heavy (16CPU/64GB) jobs on crunchy's k3s
The seaspots GitLab pipelines (`~/work/seaspots/gitlab/pipelines/.gitlab-ci.yml`) run
**seaspots-s57-utils** (ogr2ogr + tippecanoe; `registry.gitlab.com/seaspots/tools/
seaspots-s57-utils:1.11.0`), `tags: [heavy-compute]`, needing **16+ CPU / 64+ GB RAM / 100+ GB
disk** — they'd crush the 8c/32G VM runner. Stand up a **Forgejo runner inside crunchy's k3s
cluster** (k8s-scheduled resources) with a distinct label (e.g. `heavy`), so `runs-on: heavy`
jobs run there. DESIGN TASK (not started): Forgejo `act_runner` executes via **docker** or
**host** mode — no mature native k8s executor (unlike GitLab). Evaluate act_runner as a k8s
Deployment with big resource requests (host-mode or a DinD sidecar) vs. alternatives. Note
crunchy's k3s already runs the GitLab runners (namespace `gitlab`) — do not disturb them.
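
As a starting point for the Deployment option, the sketch below builds a Kubernetes `apps/v1` Deployment manifest with the 16 CPU / 64 GB / 100 GB requests from the paragraph above and prints it as JSON (which `kubectl apply -f -` accepts). It is only a sketch under stated assumptions: the namespace `forgejo-runners`, the object names, and the image placeholder are hypothetical, and runner registration, executor mode (host vs. DinD sidecar) and volumes are deliberately left out because that is the open design question.

```python
import json


def heavy_runner_deployment(
    image: str,
    namespace: str = "forgejo-runners",  # ASSUMPTION: hypothetical namespace
    label: str = "heavy",
    cpu: str = "16",
    memory: str = "64Gi",
    disk: str = "100Gi",
) -> dict:
    """Build a Deployment manifest that requests heavy-compute resources."""
    if namespace == "gitlab":
        # The existing GitLab runners live there and must not be disturbed.
        raise ValueError("refusing to target the 'gitlab' namespace")
    labels = {"app": "forgejo-runner", "runner-label": label}
    resources = {"cpu": cpu, "memory": memory, "ephemeral-storage": disk}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": f"forgejo-runner-{label}",  # hypothetical object name
            "namespace": namespace,
            "labels": labels,
        },
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "runner",
                            "image": image,
                            # Requests drive scheduling; limits cap usage.
                            "resources": {
                                "requests": dict(resources),
                                "limits": dict(resources),
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image reference; substitute the runner image actually chosen.
    manifest = heavy_runner_deployment("example.invalid/forgejo-runner:TAG")
    print(json.dumps(manifest, indent=2))
```

Requests this large only schedule if a k3s node has that much allocatable capacity left after the existing workloads, so checking node allocatable resources first is probably worthwhile before applying anything.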

## Operating essentials
- **Forge VM**: `204.168.234.72`, SSH **:222**, key `~/.ssh/foundation-test_ed25519`.
  Deploy: `cd bootstrap && ./run.sh up`. Passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
  Forge admin: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
- **crunchy01**: `root@192.168.1.2` (operator key in root's authorized_keys) OR `andiolsi`+sudo.
  libvirt installed; pool `images`; `libvirt-bridge-forward.timer` active (kube-router-proof).
  Runner fleet: `cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519;
  export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE);
  pulumi stack select crunchy`.
- **CI image**: `foundation-ci:latest`, built on the forge VM (`/tmp/ci-image`); rebuild on toolchain change.
- **Reuse mechanism**: Forgejo 11 reusable workflows work but the CALLING job needs `runs-on`
  + SHORT cross-repo ref (`.forgejo/workflows/README.md`). Composite actions need FULL-URL.

## Watchouts (HIGH-RISK)
- crunchy01 is a k3s node — the `physdev-is-bridged` FORWARD accept is what lets VMs reach the
  LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (`gitlab`
  namespace runners, `nominatim`, flannel/cni0).
- Never commit the passphrase / Vault root token / unseal keys. `runners` stack state lives
  only on the workstation (not backed up — a DR gap to address).
- Stale offline `crunchy-runner` registration on the forge (from the retired hand-built VM) —
  harmless; deregister at leisure. Don't `pulumi up` the prod `olsicloud4-*` stacks.
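
For the first watchout, a quick way to confirm the bridged-forward rule is still in place is to scan the FORWARD chain dump. This is a minimal sketch, assuming the rule was installed with iptables and that the text comes from `iptables -S FORWARD` on crunchy01; the function name is hypothetical, and a host using a different firewall frontend would need a different check.

```python
def has_bridged_forward_accept(rules_text: str) -> bool:
    """Return True if a FORWARD rule matches bridged traffic and accepts it.

    ASSUMPTION: rules_text is the output of `iptables -S FORWARD`, one rule
    per line, e.g. "-A FORWARD -m physdev --physdev-is-bridged -j ACCEPT".
    """
    for line in rules_text.splitlines():
        tokens = line.split()
        if tokens[:2] != ["-A", "FORWARD"]:
            continue  # skip policy lines and rules for other chains
        if "--physdev-is-bridged" not in tokens or "-j" not in tokens:
            continue
        target_index = tokens.index("-j") + 1
        if target_index < len(tokens) and tokens[target_index] == "ACCEPT":
            return True
    return False


if __name__ == "__main__":
    sample = "\n".join(
        [
            "-P FORWARD DROP",
            "-A FORWARD -m physdev --physdev-is-bridged -j ACCEPT",
        ]
    )
    print(has_bridged_forward_accept(sample))
```

If this returns `False` while a runner is dark, the `libvirt-bridge-forward.timer` mentioned under Operating essentials is the next thing to inspect.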

## Standing backlog (after the two asks above)
- **Package registry (Stage-2)** — publish `@olsitec` pkgs so the C1/C5 docker candidates build.
- **T15** — index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
- **Hardening** — pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade;
  back up the `runners` stack state.

Validate each task live and commit atomically per task.