docs(handover): refresh for next session — asks #1/#2 done; PLAN-003/004 next
All checks were successful
CI / preflight (push) Successful in 4s
CI / typecheck (push) Successful in 14s
pulumi-preview / preview (push) Successful in 18s

Reflects ci-bot + registry push/repo-link + postgis k8s runner + MinIO creds in Vault;
points the next session at PLAN-004 (Forgejo15/OpenBao spike, primary) and PLAN-003
(org-as-code). Updated operating essentials (ci-bot git-push, k8s toolchain build/import,
backup/DR templates) and watchouts (spike on a throwaway VM, not the live forge).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Andreas Niemann 2026-07-01 16:58:18 +02:00
parent 47d12254e6
commit 325b66c2b6


@@ -1,7 +1,7 @@
# HANDOVER — next-session prompt (paste into a fresh context)
> Living doc: overwritten each handover. Durable record = the dated `SESSION_*` files.
> Latest state = `SESSION_2026-07-01_004.md` (read it first, then #003 + #002 + #001).
> Latest state = `SESSION_2026-07-01_005.md` (read it first, then #004 + #002 + #001).
---
@@ -9,167 +9,100 @@ Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK /
(remote VMs, k3s, Docker, secrets).
## Required reads (in `~/work/olsitec-foundation/foundation/`)
1. `documentation/sessions/SESSION_2026-07-01_004.md` ← brix02 failover runner (ask #1 DONE) + #003 runner fleet
2. `documentation/sessions/SESSION_2026-07-01_002.md` ← T14 + ecosystem CI · `_001.md` ← the egg
3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (ADR-007 first)
4. `runners/README.md` ← the decoupled runner-fleet stack (host prep, config, gotchas)
5. `.forgejo/workflows/README.md` ← the ecosystem-CI reusable-workflow contract (Forgejo-11 quirk)
1. `documentation/sessions/SESSION_2026-07-01_005.md` ← ci-bot + registry push/repo-link +
postgis k8s runner + MinIO creds in Vault (this session's work)
2. `documentation/sessions/SESSION_2026-07-01_004.md` (brix02 runner) · `_002.md` (T14 + ecosystem
CI) · `_001.md` (the egg)
3. `documentation/planning/PLAN-004-forgejo15-openbao-spike.md` ← the NEXT-ask research (spike)
4. `documentation/planning/PLAN-003-forgejo-org-and-ci-config.md` ← the org/CI-as-code plan
5. `ci-bot/README.md` · `runners/README.md` · `runners-k8s/README.md` · `.forgejo/workflows/README.md`
6. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (ADR-007 first)
## Where things stand (all green / live)
- **The egg is LIVE** (6 containers on the Hetzner forge VM); T11/T13/T14 done; ecosystem CI
(reusable workflows + selftest) green; `https://forge.olsitec.net`=200.
- **The R5 fence is LIVE + codified, now an HA pair.** Two `fenced` runners:
`foundation-runner-02` (crunchy01 `192.168.1.16`, 8c/32G) + `foundation-runner-03`
(brix02 `192.168.1.17`, 8c/32G). Both managed by the **`runners/`** Pulumi project — an
**isolated project** (`bootstrap` never imports it) — as stacks `crunchy` + `brix02`.
HA-on-outage: share `fenced` load when both up, either covers alone. Per-host config +
state gitignored (operator workstation only).
- **Stage-2 package registry DONE + all 5 `999_testing` CI candidates GREEN.** `@olsitec` npm
pkgs live in Forgejo's registry (public/anonymous read); the 5 candidates run as
`olsitec/{olsicrypto,document-engine,olsitrack-api,seaspots-homepage,token-service}`.
foundation-ci gained `docker buildx`. See `SESSION_2026-07-01_004.md §Stage-2` + `999_testing.md`.
- Foundation repo `master`: several commits (brix02 runner, k8s runner, buildx, Stage-2 docs) are
**local, not yet pushed** (forge git push needs a token; `foundation-test` SSH key isn't authorized).
- **The egg is LIVE** (6 containers on the Hetzner forge VM); ecosystem CI green; `forge.olsitec.net`=200.
- **ci-bot is LIVE + codified** ([`ci-bot/`](../../ci-bot/)): a non-admin org identity (`ci` team,
`repo.code`+`repo.packages` write) with `registry`(write:package) + `git-push`(write:repository)
tokens (root-600 files on the forge) and org Actions secrets `FORGE_REGISTRY_USER/TOKEN`. It is the
**git-push identity** for the forge (HTTPS + `http.extraHeader: Authorization: token …`; the
`foundation-test` SSH key is NOT a registered Forgejo key).
- **CI images push to the forge container registry, repo-linked.** `reusable-docker-build` does
login → single-manifest build (`--provenance=false`) → push → **auto-link via the package API**.
Proven: `token-service:ci` + `seaspots-homepage:ci` at `/olsitec/-/packages?repo=<repo>`. **4
Forgejo-11 quirks resolved** (see `.forgejo/workflows/README.md`): `secrets: inherit`; `@master`
reusable refs are stale-parsed → **pin the SHA**; `--provenance=false`; no label auto-link →
explicit link API. (Repo-packages URL is `/{owner}/-/packages?repo=` — there is NO `/{owner}/{repo}/packages`.)
- **Runner fleet:** `foundation-runner` (docker,dind), the `fenced` HA pair `-02`(crunchy `.16`)+
`-03`(brix02 `.17`), and the k8s toolchain runners **s57 + postgis** (`runners-k8s/`, ids 6+7).
Stale `crunchy-runner` (id 2) **deregistered**. postgis proven via the image's own
`/app/entrypoint.sh preflight` gate (citest-fenced run #8) — the fast quality gate (no 40-60 min build).
- **MinIO/S3 creds** for the seaspots pipelines are captured in **Vault at `foundation/seaspots/minio`**
(`AWS_ENDPOINT_URL`/`AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY`). **Not yet wired** to a Forgejo org
secret — that's PLAN-003 work. (Values transited chat → consider rotating.)
- Foundation `master` is fully pushed to the forge (as ci-bot over HTTPS).
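
The registry naming and the repo-filtered packages URL described above are easy to get wrong, so here is a minimal sketch that derives both from a repo name. The helper names (`image_ref`, `repo_packages_url`, `build_push_plan`) are hypothetical, and the printed `docker` commands are an assumption about how the login, single-manifest build, and push steps could be spelled, not a copy of what `reusable-docker-build` actually runs. The explicit link-API call is left out because its endpoint is not given in this handover.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; host and org come from the handover text.
FORGE_HOST="forge.olsitec.net"
FORGE_ORG="olsitec"

# Image reference in the forge container registry: <host>/<org>/<name>:<tag>
image_ref() {
  printf '%s/%s/%s:%s\n' "$FORGE_HOST" "$FORGE_ORG" "$1" "$2"
}

# Repo-filtered packages page. Note the "/-/packages?repo=" form;
# there is no "/<owner>/<repo>/packages" route.
repo_packages_url() {
  printf 'https://%s/%s/-/packages?repo=%s\n' "$FORGE_HOST" "$FORGE_ORG" "$1"
}

# Print (do not run) the login, single-manifest build, and push commands.
# FORGE_REGISTRY_USER / FORGE_REGISTRY_TOKEN are the org Actions secrets named above.
build_push_plan() {
  local ref
  ref="$(image_ref "$1" "$2")"
  printf 'docker login %s -u "$FORGE_REGISTRY_USER" --password-stdin <<<"$FORGE_REGISTRY_TOKEN"\n' "$FORGE_HOST"
  printf 'docker buildx build --provenance=false --load -t %s .\n' "$ref"
  printf 'docker push %s\n' "$ref"
}

build_push_plan token-service ci
repo_packages_url token-service
```

Printing the plan instead of executing it keeps the sketch safe to run on a workstation; the `--provenance=false` flag is what keeps the pushed image a single manifest.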
## THIS session's work — the remaining operator ask
### ~~brix02 runner with failover~~ — DONE (#004)
HA-on-outage, full 8c/32Gi, `foundation-runner-03` on brix02 `.17`, stack `brix02`.
Proven live (crunchy stopped → job ran on runner-03). See `SESSION_2026-07-01_004.md`
and `runners/README.md §Second host: brix02`. Nothing left here except: push the commit
(operator key) and, at leisure, deregister the stale `crunchy-runner` (id 2) on the forge.
### 1. (PRIMARY, NEW ASK) Push CI-built docker images to the forge container registry, linked to the repo
**Goal:** the docker CI candidates (and future repos) should PUSH their built image to
Forgejo's built-in **container registry** so it's **visible on the repo's packages page**
(`https://forge.olsitec.net/olsitec/<repo>/packages`), not just the org page
(`/olsitec/-/packages`). Currently `reusable-docker-build` builds with `push:false`, so the
images (`olsitec/token-service:ci`, `olsitec/seaspots-homepage:ci`) exist ONLY on the forge
VM's local docker daemon — not in any registry.
**Facts (verified this session):**
- Container registry is ENABLED (`https://forge.olsitec.net/v2/` → 401). Image ref shape:
`forge.olsitec.net/olsitec/<name>:<tag>` (owner = the `olsitec` org).
- **Repo-linking = OCI label.** A container package shows on the *repo* packages tab when the
image carries `org.opencontainers.image.source=https://forge.olsitec.net/olsitec/<repo>`
(Forgejo auto-links it). Without it, the package only shows on the org page. (Repo
`/packages` pages are 404 now — they populate once a linked package is pushed.)
- `reusable-docker-build`'s Push step is a bare `docker push` with **NO `docker login`** — so
it can't authenticate to the registry yet. (An earlier misfire pushed to docker.io and 401'd.)
**Build checklist:**
1. **CI push identity — DECIDED (operator, 2026-07-01): `ci-bot`, provisioned as a TIER-1 step
(like the runners), NOT in bootstrap.** A dedicated `ci-bot` user in the `olsitec` org with a
`write:package` token (NOT platform-admin). **Layering rationale** (operator raised it, it's
correct): creating the user/token needs the forge up, so it's a "step-0 after the foundation
stands" concern — same tier as `runners/`. Rolling it into `bootstrap` would couple every
`bootstrap up/refresh` to Vault being **unsealed** (Vault re-seals on every VM reboot →
`vault-unseal.sh`) + Forgejo admin reachable — the exact coupling trap we isolated the runners
to avoid. So: an isolated Tier-1 step/project (peer to `runners/`).
- **Mint-fresh, admin-driven (like the runners — NO hard Vault dependency).** SSH to the forge
and `docker exec -u git foundation-forgejo forgejo admin user create --username ci-bot …`
(idempotent) + `… generate-access-token --username ci-bot --scopes write:package`. No bot
password needed (admin mints on its behalf, same mechanism as the runner tokens).
- **Runtime = an org-level Forgejo Actions secret** (`FORGE_REGISTRY_USER` +
`FORGE_REGISTRY_TOKEN`), set via `POST /api/v1/orgs/olsitec/actions/secrets/<NAME>`. CI reads
THIS (never Vault) for `docker login`. Org-level (not per-repo) is preferred: org secrets
inject into every workflow, which likely **sidesteps the Forgejo-11 reusable-workflow
secret-inheritance limit** (verify).
- **Vault = OPTIONAL backup only** (e.g. `foundation/forgejo/ci-bot`). The idempotent
provisioning step IS the source of truth (re-run → re-mint), so **Vault going away breaks
nothing** — CI keeps working off the Actions secret, and re-provisioning regenerates. Don't
make Vault a hard dependency (that was the operator's concern; this design dissolves it).
This same `ci-bot` (given repo write) likely also becomes the **git-push identity** for the
forge (the standing "git push needs a token" gap). NEXT AGENT's first concrete step = the
Tier-1 `ci-bot` provisioning step, then wire step 2.
2. **reusable-docker-build changes:** before push, `docker login forge.olsitec.net -u $USER
--password-stdin` from the secret; add `--label org.opencontainers.image.source=<repo URL>`
to the build (new input, e.g. `source-repo`); keep `push`. **WATCH the Forgejo 11 reusable-
workflow secret-passing limitation** — verify whether `${{ secrets.X }}` is visible inside
the called workflow or must be passed via a `secrets:` block / `secrets: inherit` (unknown;
the `runs-on`/inputs quirks suggest testing this explicitly). Also note the caller must pass
inputs EXPLICITLY (booleans have no default applied — see `.forgejo/workflows/README.md`).
3. **Candidate ci.yml:** `push: true`, `image: forge.olsitec.net/olsitec/<repo>:ci`, source-repo label.
4. **Verify:** the image appears at `https://forge.olsitec.net/olsitec/<repo>/packages`.
Test repos ready to iterate on: `olsitec/token-service`, `olsitec/seaspots-homepage` (local
copies in `~/work/foundation-ci-candidates/`; both currently green with `push:false`).
### 2. k8s toolchain runners on crunchy k3s — s57 DONE (Pulumi-codified); postgis/osm left
**`seaspots-s57-utils` runner is LIVE + Pulumi-managed** on crunchy k3s (host-mode,
`foundation-runner-k8s-s57` id 6, label `seaspots-s57-utils`), proven via citest-fenced run #5
task 60. Codified as the isolated `@pulumi/kubernetes` project [`runners-k8s/`](../../runners-k8s/)
(stack `crunchy`; Dockerfile + index.ts + README). Host-mode because act_runner has no k8s
executor; the seaspots images are Debian-slim/uid 10001/**no git+node**, so the runner image is a
combined build (toolchain + git + node20 + forgejo-runner) `ctr images import`ed into crunchy.
**REMAINING under this ask:**
- **postgis + osm runners** — add a `toolchains` config entry + build/import that combined image
(see `runners-k8s/README.md §Adding a toolchain`), then `pulumi up`. postgis = 4/8/50; osm has
no active pipeline yet.
- Watch node disk (`/` 75% full; s57 PVC 120 Gi, local-path doesn't hard-enforce).
**crunchy k3s cluster caveats (half-removed Rancher/kubevirt) — see `runners-k8s/README.md`:**
- Namespace CREATE was failing (dead `rancher.cattle.io` `Fail` webhooks) — **fixed** (removed
the Fail webhooks; operator-approved).
- Namespace DELETE hangs `Terminating` (dead kubevirt APIServices `*.subresources.kubevirt.io`
finalization hangs cluster-wide; 7 ns stuck 600+ days). **Not fixed** — clean fix is
`kubectl delete apiservice v1.subresources.kubevirt.io v1alpha3.subresources.kubevirt.io`.
`pulumi destroy` hangs on the ns until then → force-finalize (README has the one-liner).
Authoritative build/deploy recipe + full design rationale: **`runners-k8s/README.md`** and
`SESSION_2026-07-01_004.md §Ask #2` (host-mode, per-toolchain combined image, local-path PVC
scratch, non-privileged; `fenced` stays the untrusted zone). To add postgis/osm: build+import
the combined image, add a `toolchains` config entry, `pulumi up`.
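
The "build + import the combined image" step can be sketched as a single streamed pipeline. This is a hedged illustration, not the recipe from `runners-k8s/README.md`: it assumes the image is built on the forge VM, that the command runs from a workstation able to SSH to both hosts, and that crunchy01 listens on the default SSH port. `relay_image_cmd` and the image name are hypothetical, and the function only prints the pipeline.

```shell
#!/usr/bin/env bash
# Sketch only: relay_image_cmd is a hypothetical helper; it prints the pipeline, it does not run it.
FORGE_SSH="root@204.168.234.72"   # forge VM, SSH on port 222
FORGE_PORT=222
CRUNCHY_SSH="root@192.168.1.2"    # crunchy01 (default SSH port is an assumption)
KEY="${RUNNER_SSH_KEY_PATH:-$HOME/.ssh/foundation-test_ed25519}"

relay_image_cmd() {
  local image="$1"
  # docker save on the forge -> gzip over the wire -> gunzip -> containerd import on the k3s node
  printf "ssh -p %s -i %s %s 'docker save %s | gzip' | ssh -i %s %s 'gunzip | k3s ctr -n k8s.io images import -'\n" \
    "$FORGE_PORT" "$KEY" "$FORGE_SSH" "$image" "$KEY" "$CRUNCHY_SSH"
}

# Hypothetical image name, for illustration only.
relay_image_cmd "example/toolchain-runner:latest"
```

Decompressing explicitly on the receiving side avoids depending on whether the installed `ctr` accepts a compressed archive.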
## Next work
### A. (NEW, PRIMARY) Spike — Forgejo 15 + OpenBao (upgrade + DR drill) → `PLAN-004`
Back up the foundation → redeploy on **Forgejo 15** + **OpenBao** (on a **throwaway VM/stack, NOT the
live forge**) → restore → re-run the whole test matrix. Research is done in **`PLAN-004`** — read it.
Headlines: Forgejo 11→15 is a supported direct LTS upgrade (v11 LTS **EOL 2026-07-16** → timely; review
v12–15 breaking changes; v15 drops reusable-workflow quirks 1–2 and adds OIDC + ephemeral runners).
**OpenBao gives OSS namespaces** (solves PLAN-003's tenant limitation) **but is NOT a drop-in from Vault
1.18** — no in-place raft migration past 1.15 → **re-seed** OpenBao (export/import KV; leverage the
Pulumi-owns-credentials model). The re-seed is the spike's riskiest step.
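
To make the export/import re-seed concrete, here is a minimal sketch that prints one copy pipeline per secret path. It assumes the `foundation` mount is KV v2 on both sides (the version is not stated in this handover), that `jq` is available, and that each CLI is already pointed at and authenticated against its own server. `reseed_cmd` is a hypothetical helper, and the two paths are the ones named elsewhere in this document.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a KV v2 mount named "foundation" on both Vault and OpenBao.
# reseed_cmd is hypothetical; it prints the pipeline rather than moving secrets.
MOUNT="foundation"

reseed_cmd() {
  local path="$1"
  # "vault kv get -format=json" wraps a KV v2 payload as .data.data;
  # "bao kv put ... -" reads the key/value JSON object from stdin.
  printf 'vault kv get -format=json -mount=%s %s | jq .data.data | bao kv put -mount=%s %s -\n' \
    "$MOUNT" "$path" "$MOUNT" "$path"
}

for p in seaspots/minio forgejo/service-credentials; do
  reseed_cmd "$p"
done
```

Reviewing the printed plan before running any line fits the spike's risk profile: the secret values never appear in the plan itself, only the paths.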
### B. Stage-2 org & CI-config as code → `PLAN-003`
Isolated Pulumi `orgs/` project (Gitea + Vault/OpenBao providers) managing the `seaspots` org + repos +
teams + org secrets/variables (wire `foundation/seaspots/minio` → org secrets). **Verify the Gitea TF
provider covers Forgejo Actions secrets/variables first** (else a `@pulumi/command` shim). Ratify §7.
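
If the provider turns out not to cover Actions secrets, the shim could look roughly like the sketch below. It assumes the Gitea-compatible endpoint `PUT /api/v1/orgs/{org}/actions/secrets/{name}` with a `{"data": "<value>"}` body; verify that against the instance's API docs first. `org_secret_url`, `put_org_secret`, and the `FORGE_ADMIN_TOKEN` variable are hypothetical names, and the function defaults to a dry run that prints only the target URL.

```shell
#!/usr/bin/env bash
# Sketch only: endpoint shape is an assumption (Gitea-compatible API); names are hypothetical.
FORGE_URL="https://forge.olsitec.net"

org_secret_url() {
  printf '%s/api/v1/orgs/%s/actions/secrets/%s\n' "$FORGE_URL" "$1" "$2"
}

put_org_secret() {
  local org="$1" name="$2" value="$3" url
  url="$(org_secret_url "$org" "$name")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'PUT %s\n' "$url"   # never echo the secret value
    return 0
  fi
  # jq builds the JSON body so quoting and escaping of the value are handled.
  curl -fsS -X PUT "$url" \
    -H "Authorization: token ${FORGE_ADMIN_TOKEN:?set a short-lived admin token}" \
    -H 'Content-Type: application/json' \
    -d "$(jq -cn --arg v "$value" '{data: $v}')"
}

# Dry run over the three MinIO keys held in Vault (placeholder values only).
for key in AWS_ENDPOINT_URL AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
  put_org_secret seaspots "$key" "placeholder"
done
```

Set `DRY_RUN=0` only after the endpoint has been confirmed; in real use the value would come from Vault rather than from a literal.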
### C. osm k8s toolchain runner (`runners-k8s/`) — no active pipeline yet; recipe in its README.
## Operating essentials
- **Forge VM**: `204.168.234.72`, SSH **:222**, key `~/.ssh/foundation-test_ed25519`.
- **Forge VM**: `204.168.234.72`, SSH **:222**, key `~/.ssh/foundation-test_ed25519`, user `root`.
Deploy: `cd bootstrap && ./run.sh up`. Passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
Forge admin: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
- **crunchy01**: `root@192.168.1.2` (operator key in root's authorized_keys) OR `andiolsi`+sudo.
libvirt installed; pool `images`; `libvirt-bridge-forward.timer` active (kube-router-proof).
Runner fleet: `cd runners; export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519;
Forgejo DB: `docker exec foundation-postgres psql -U postgres -d forgejo`. Action logs live in RustFS
(`forgejo-packages` bucket, `actions_log/<owner>/<repo>/<n>/<m>.log.zst`) — fetch via an `mc` container
on `foundation-net` (creds in app.ini `[storage]`; there are multiple `MINIO_*` blocks → take the first).
- **ci-bot git push / API**: `GITTOK=$(ssh -p222 -i <key> root@204.168.234.72 cat /root/ci-bot-git-push.token)`
then `git -c http.extraHeader="Authorization: token $GITTOK" push https://forge.olsitec.net/olsitec/<repo>.git <branch>`.
Registry token at `/root/ci-bot-registry.token`. Admin API for org secrets etc: mint a **short-lived**
platform-admin token (`docker exec -u git foundation-forgejo forgejo admin user generate-access-token
--username platform-admin --scopes all --token-name tmp-… --raw`) and revoke it from the DB after
(`DELETE FROM access_token WHERE name LIKE 'tmp-%'`). Re-provision ci-bot: `cd ci-bot && ./provision.sh`.
- **crunchy01**: `root@192.168.1.2` (foundation-test key works) OR `andiolsi`+sudo. k3s v1.36.2.
Runner fleet (`runners/`): `export RUNNER_SSH_KEY_PATH=~/.ssh/foundation-test_ed25519;
export PULUMI_BACKEND_URL=file://$(pwd)/state; export PULUMI_CONFIG_PASSPHRASE=$(pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE);
pulumi stack select crunchy` (or `brix02`).
- **brix02**: `root@192.168.1.3` (operator key AND `foundation-test` pubkey now in root's
authorized_keys). Runs prod Graylog (docker) + VMs (logging01/dev01/univention01) +
`foundation-runner-03`. NOT a k3s node → runner stack sets `host.bridgeForwardTimer false`,
pool `images` (`/kvm/images`), bridge `br0`. Legacy `logging01` (16Gi) pending decommission.
- **k8s toolchain runners** (`runners-k8s/`, stack `crunchy`): `export CRUNCHY_KUBECONFIG=<crunchy
k3s kubeconfig, server https://192.168.1.2:6443>` (fetch `/etc/rancher/k3s/k3s.yaml`, sed the
server) + `RUNNER_SSH_KEY_PATH` + `PULUMI_BACKEND_URL=file://$(pwd)/state` + passphrase;
`pulumi stack select crunchy`. Local kubectl works against that kubeconfig. See `runners-k8s/README.md`.
- **CI image**: `foundation-ci:latest`, built on the forge VM (`/tmp/ci-image`); rebuild on toolchain change.
- **Reuse mechanism**: Forgejo 11 reusable workflows work but the CALLING job needs `runs-on`
+ SHORT cross-repo ref (`.forgejo/workflows/README.md`). Composite actions need FULL-URL.
- **k8s toolchain runners** (`runners-k8s/`, stack `crunchy`): fetch crunchy `/etc/rancher/k3s/k3s.yaml`,
`sed` the server to `https://192.168.1.2:6443`, `export CRUNCHY_KUBECONFIG=<that>` + `RUNNER_SSH_KEY_PATH`
+ `PULUMI_BACKEND_URL=file://$(pwd)/state` + passphrase; `pulumi stack select crunchy`. Node `/` had
**408 GB free** (52%). To add a toolchain: build the combined image on the forge VM (docker+buildx,
gitlab creds from the crunchy `gitlab`-ns `registry.gitlab.com` secret), stream `docker save | gzip` into
`k3s ctr -n k8s.io images import -` (forge can't reach crunchy's private IP → relay via the workstation),
add a `toolchains` entry, `pulumi up`.
- **brix02**: `root@192.168.1.3` (foundation-test key in root authorized_keys). Prod Graylog + VMs +
`foundation-runner-03`. NOT a k3s node → runner stack sets `host.bridgeForwardTimer false`, pool `images`
(`/kvm/images`), bridge `br0`.
- **Backup/DR** (for the spike): `backup/{backup,restore}.sh`, `dr/RUNBOOK.md`,
`dr/restore-to-fresh-vm-remote.sh` (closest template for the spike's fresh-VM restore). Vault root token
is in `bootstrap` Pulumi config `vaultCredentials:rootToken` (stack `foundation`; KV mount `foundation/`).
- **Reuse mechanism**: Forgejo 11 reusable workflows need `runs-on` + SHORT cross-repo ref + **SHA pin** +
`secrets: inherit` (`.forgejo/workflows/README.md`). Composite actions need FULL-URL.
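
The "fetch the kubeconfig, then `sed` the server" step for the k8s toolchain runners can be sketched as a small helper. This assumes the fetched file still carries the k3s default of `server: https://127.0.0.1:6443`; `rewrite_kubeconfig_server` and the file paths in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: rewrite_kubeconfig_server is a hypothetical helper.
# It rewrites every "server:" line to the node's reachable address and
# restricts the result to the owner, since the file contains cluster credentials.
rewrite_kubeconfig_server() {
  local src="$1" dst="$2" server="${3:-https://192.168.1.2:6443}"
  sed -E "s#^([[:space:]]*server:[[:space:]]*).*#\1${server}#" "$src" > "$dst"
  chmod 600 "$dst"
}

# Usage (paths are illustrative):
#   ssh root@192.168.1.2 cat /etc/rancher/k3s/k3s.yaml > /tmp/k3s.yaml
#   rewrite_kubeconfig_server /tmp/k3s.yaml "$HOME/.kube/crunchy.yaml"
#   export CRUNCHY_KUBECONFIG="$HOME/.kube/crunchy.yaml"
```

Matching on the `server:` key rather than on the literal loopback address keeps the rewrite working if the file was already edited once.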
## Watchouts (HIGH-RISK)
- crunchy01 is a k3s node — the `physdev-is-bridged` FORWARD accept is what lets VMs reach the
LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s (`gitlab`
namespace runners, `nominatim`, flannel/cni0).
- Never commit the passphrase / Vault root token / unseal keys. The `runners` AND `runners-k8s`
stack state (+ their gitignored `Pulumi.<stack>.yaml`) live only on the workstation (not backed
up — a DR gap to address).
- crunchy k3s is a half-removed Rancher/kubevirt cluster: ns CREATE was fixed (Fail-webhook
removal); ns DELETE still hangs (dead kubevirt APIServices) — see the PRIMARY task + `runners-k8s/README.md`.
- Stale offline `crunchy-runner` registration on the forge (from the retired hand-built VM) —
harmless; deregister at leisure. Don't `pulumi up` the prod `olsicloud4-*` stacks.
- crunchy01 is a k3s node — the `physdev-is-bridged` FORWARD accept (libvirt-bridge-forward timer) is what
lets VMs reach the LAN; if a runner goes dark, check that rule / the timer first. Don't disturb k3s
(`gitlab` ns runners, `nominatim`, flannel/cni0). Don't `pulumi up` the prod `olsicloud4-*` stacks.
- crunchy k3s is a half-removed Rancher/kubevirt cluster: ns CREATE fixed (removed dead `Fail` webhooks);
ns DELETE still hangs (dead kubevirt APIServices) — `pulumi destroy` of `runners-k8s` needs the
force-finalize one-liner in `runners-k8s/README.md`.
- Never commit the passphrase / Vault root token / unseal keys / ci-bot tokens. The `runners`,
`runners-k8s` stack state (+ gitignored `Pulumi.<stack>.yaml`) live only on the workstation (DR gap).
- **The spike must run on a throwaway VM/stack — do NOT upgrade the live forge in place.**
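
One way to honor the "never commit ci-bot tokens" rule while pushing over HTTPS is to pass the token per invocation, as in the push recipe from this handover. The sketch below wraps that recipe; `forge_push` and `forge_remote_url` are hypothetical names, and `GITTOK` is expected to be exported beforehand from the token file on the forge.

```shell
#!/usr/bin/env bash
# Sketch only: forge_push is a hypothetical wrapper around the handover's push recipe.
# The token travels via "-c http.extraHeader", so it is never written to .git/config
# and never embedded in a remote URL.
forge_remote_url() {
  printf 'https://forge.olsitec.net/olsitec/%s.git\n' "$1"
}

forge_push() {
  local repo="$1" branch="$2"
  if [ -z "${GITTOK:-}" ]; then
    echo "GITTOK is not set; read it from the ci-bot git-push token file first" >&2
    return 1
  fi
  git -c http.extraHeader="Authorization: token ${GITTOK}" \
    push "$(forge_remote_url "$repo")" "$branch"
}

# Usage: GITTOK=... forge_push token-service master
```

Failing early when `GITTOK` is empty avoids an unauthenticated push attempt that would otherwise prompt for credentials.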
## Standing backlog (after the two asks above)
- **~~Package registry (Stage-2)~~ DONE** — `@olsitec` pkgs live in the Forgejo built-in npm
registry (`olsitec` org, public/anonymous read): `olsicrypto@2.0.1`, `svelte-common@13.1.6`.
Publish: `.npmrc` `@olsitec:registry=https://forge.olsitec.net/api/packages/olsitec/npm/`
+ `//…:_authToken=<token>`, then `npm publish`. **All 5 `999_testing` candidates now build
green** as `olsitec/{olsicrypto,document-engine,olsitrack-api,seaspots-homepage,token-service}`
(local copies in `~/work/foundation-ci-candidates/`). foundation-ci gained `docker buildx`
(Docker 29 needs it). More `@olsitec` svelte pkgs to publish as the repo migration proceeds
(`olsitec-nci/svelte/components/`). NB: git push to the forge needs a token — my
`foundation-test` SSH key isn't authorized; I pushed via HTTPS+admin-token (revoked after).
- **T15** — index.ts phase marker + Gate A/B comments + DAY-ZERO-TIMELINE.
- **Hardening** — pin floating image refs; pre-bake pulumi plugins; MCP (D6); Forgejo v15 upgrade;
back up the `runners` stack state.
Validate each task live and commit atomically per task.
## Standing backlog
- **osm** k8s toolchain runner (no pipeline yet). **PLAN-003** org-as-code (verify Gitea provider first).
- **PLAN-004** spike (Forgejo 15 + OpenBao) — the primary next ask.
- **DR/backup**: back up the `runners`/`runners-k8s` stack state; T15; hardening (pre-bake pulumi plugins,
MCP D6). Rotate the MinIO creds (transited chat).
- Validate each task live and commit atomically per task (conventional commits; end with the
`Co-Authored-By: Claude …` trailer to match history).
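
The commit convention in the last backlog item can be checked mechanically. The following is a rough sketch under stated assumptions: the accepted type list is a guess at common conventional-commit types, the scope pattern is deliberately strict, and `check_commit_msg` is a hypothetical helper that could be called from a `commit-msg` hook.

```shell
#!/usr/bin/env bash
# Sketch only: check_commit_msg is hypothetical; the type list is an assumption.
check_commit_msg() {
  local msg="$1" subject last
  # Conventional-commit subject: type, optional (scope), colon, space, summary.
  local re='^(feat|fix|docs|chore|refactor|test|ci|build)(\([a-z0-9-]+\))?: .+'
  subject="$(printf '%s\n' "$msg" | head -n 1)"
  # Last non-blank line must be the co-author trailer.
  last="$(printf '%s\n' "$msg" | sed -e '/^[[:space:]]*$/d' | tail -n 1)"
  [[ "$subject" =~ $re ]] || return 1
  [[ "$last" == "Co-Authored-By: "* ]] || return 1
}

# Usage in a commit-msg hook: check_commit_msg "$(cat "$1")" || exit 1
```

Keeping the regex in a variable sidesteps bash's quoting rules for `=~`, which treat a quoted right-hand side as a literal string.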