diff --git a/documentation/sessions/HANDOVER.md b/documentation/sessions/HANDOVER.md
index f4a49a9..0aaca37 100644
--- a/documentation/sessions/HANDOVER.md
+++ b/documentation/sessions/HANDOVER.md
@@ -1,92 +1,74 @@
 # HANDOVER — next-session prompt (paste into a fresh context)
 
 > Living doc: overwritten each handover. The durable record is the dated
-> `SESSION_*` files. Latest state = `SESSION_2026-07-01_001.md`.
+> `SESSION_*` files. Latest state = `SESSION_2026-07-01_002.md`.
 
 ---
 
 Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**.
 
 ## Required reads (in `~/work/olsitec-foundation/foundation/`)
-1. `documentation/sessions/SESSION_2026-07-01_001.md` ← current state + known gaps + next steps
-2. `documentation/000_baseline.md` + `000_TOPOLOGY.md`
+1. `documentation/sessions/SESSION_2026-07-01_002.md` ← current state + known gaps + next steps
+2. `documentation/sessions/SESSION_2026-07-01_001.md` ← the prior session (gaps closed, T11/T13/T14-core)
 3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (**ADR-007** is the control-plane mechanism the whole egg runs on — read it first)
-4. `documentation/planning/PLAN-002-foundation-implementation.md` §10
-5. `documentation/999_testing.md` ← the operator's acceptance-test plan for the ecosystem CI
+4. `actions/README.md` ← the ecosystem-CI composite-action contract + the Forgejo-11 finding
+5. `documentation/999_testing.md` ← the operator's acceptance-test plan (now implemented)
 
 ## Where things stand
-**The egg is LIVE, all three known gaps are CLOSED, and T11/T13/T14-core are done.** Six containers
-on `foundation-net` (postgres/rustfs/vault/caddy/forgejo/runner), all healthy. `https://forge.olsitec.net`
-=200; `git clone git@git.olsitec.net:olsitec/foundation.git` works; the foundation repo's **origin is now
-Forgejo** (master default); `ai-baseline` is mirrored. **Backups are age-encrypted** (restore-verified from
-RustFS + offsite). **DR to a fresh VM is rehearsed + scripted** (`dr/`). The forge's **own CI runs green**
-on its runner (`.forgejo/workflows/ci.yml`: preflight + typecheck, in the baked `foundation-ci` image).
-`cd bootstrap && ./run.sh up` is idempotent. Working tree clean on `master` (except the operator's untracked
-`documentation/999_testing.md`).
+**The egg is LIVE; T11/T13/T14 are DONE; the ecosystem CI (999_testing) is built and validated.**
+Six containers on `foundation-net` (postgres/rustfs/vault/caddy/forgejo/runner), all healthy.
+`https://forge.olsitec.net`=200; `git clone git@git.olsitec.net:olsitec/foundation.git` works; origin is
+Forgejo (master default). Backups age-encrypted + restore-verified (RustFS + offsite); DR scripted (`dr/`).
+Working tree clean on `master`.
+
+**CI on the runner, all green:**
+- `ci.yml` (preflight + typecheck), `pulumi-preview.yml` (read-only drift/PR check),
+  `backup-verify.yml` (weekly + dispatch; RESTORE VERIFY PASS from offsite).
+- `ecosystem-selftest.yml` — semantic-release bump sequence (1.0.0→1.1.0→1.1.1→2.0.0→3.0.0)
+  + eslint/yamllint non-zero-exit gates.
+- `actions/` composite actions (node-build, docker-build, lint, semantic-release-version) — the
+  ecosystem-CI reuse layer. **Forgejo 11 has NO reusable workflows**; downstream repos call composite
+  actions by FULL URL: `uses: https://forge.olsitec.net/olsitec/foundation/actions/@master`.
+
+`cd bootstrap && ./run.sh up` is idempotent and now also publishes `pulumi stack export` to RustFS
+(`bootstrap/state-publish.sh`) so the state-dependent CI has Pulumi state.
 
 ## Operating essentials
 - **VM**: `204.168.234.72`, admin SSH **:222**, key `~/.ssh/foundation-test_ed25519` (also the Forgejo operator key). Git endpoint :22 (scp-form) + :2222.
 - **Deploy**: `cd bootstrap && ./run.sh up`. Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
 - **Vault reboot**: `bootstrap/vault-unseal.sh`. **Backup**: `backup/backup.sh [ts]`; **restore-verify**:
-  `backup/restore.sh [rfs|off]`. **DR to fresh VM**: `dr/restore-to-fresh-vm.sh` (+ `dr/RUNBOOK.md`).
+  `backup/restore.sh [rfs|off]`. **DR**: `dr/restore-to-fresh-vm.sh` (+ `dr/RUNBOOK.md`).
 - **Forge admin**: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
-- **CI image**: built on the VM (`/tmp/ci-image`, from `containers/ci-image/Dockerfile`), tag `foundation-ci:latest`,
-  used locally by the runner (`force_pull:false`). Rebuild on toolchain change.
-- **Mechanism (ADR-007)**: in-VM control-plane ops = `@pulumi/command` `remote.Command` (docker-exec over
-  SSH); idempotent, readiness-gated, secrets on stdin. Images digest-pinned in `VERSIONS`.
+  (If you change the admin password in the UI, the API steps that set CI secrets need the new value.)
+- **CI image**: built on the VM (`/tmp/ci-image`, from `containers/ci-image/Dockerfile`), tag
+  `foundation-ci:latest`, used locally by the runner (`force_pull:false`). Rebuild on toolchain change:
+  `scp` the Dockerfile + `docker build -t foundation-ci:latest .` on the VM.
+- **CI secrets** (repo-scoped on `olsitec/foundation`, set via the admin API): `PULUMI_CONFIG_PASSPHRASE`,
+  `SSH_PRIVATE_KEY`, `RUSTFS_ACCESS_KEY`, `RUSTFS_SECRET_KEY`.
 
 ## Watchouts (HIGH-RISK)
-- `up --refresh` no longer recreates the network (ipam `ignoreChanges`), but still shows pessimistic
-  `~triggers` replaces on the vault command chain in *preview* (refreshed `container.id`=`[unknown]`) — a
-  Pulumi preview artifact, idempotent if applied. Don't panic at it.
-- The VM sshd throttles bursts of docker-over-SSH (e.g. parallel refresh) → "Connection closed". Use
-  `--parallel 1` for refresh, or raise sshd MaxStartups before wiring refresh into CI.
-- Never print/commit the passphrase, Vault root token, or unseal keys (D2) — only the already-encrypted
-  `secure:` values. Don't `pulumi up` the prod `olsicloud4-*` stacks. Commit **atomically per task**.
-- Don't `pulumi up` the `provision` stack against the LIVE VM (it would recreate the server — cloud-init
-  changes only affect fresh provisions).
+- `pulumi-preview` shows a benign perpetual `~sshOpts` diff (the operator vs CI key path differ in the
+  docker provider) — informational; preview exits 0 on diffs by design. Don't add `--expect-no-changes`.
+- `up --refresh` shows pessimistic `~triggers` replaces on the vault command chain (a preview artifact,
+  idempotent if applied). The VM sshd throttles bursts of docker-over-SSH → use `--parallel 1` for refresh,
+  or raise MaxStartups before wiring refresh into CI.
+- Never print/commit the passphrase, Vault root token, or unseal keys (D2). Don't `pulumi up` the prod
+  `olsicloud4-*` stacks, and don't `up` the `provision` stack against the LIVE VM (it would recreate it).
+- The runner holds the host Docker socket (root-equivalent). **R5 is deferred** (operator OK'd trusted
+  first-party CI on it) — fence it to a separate VM before any UNTRUSTED workflow. Commit atomically per task.
 
-## Next work — THIS session: (1) finish T14, then (2) the 999_testing ecosystem CI
-
-T14-core already shipped: the baked `foundation-ci` image, the runner `config.yaml`
-(`container.network=foundation-net`, `force_pull=false`), and `.forgejo/workflows/ci.yml`
-(preflight + typecheck, **green**). Build on exactly that.
-
-### 1. T14 remainder — the stack-state-dependent pipelines
-Author `pulumi-preview` (on push/PR) and `backup-verify` (weekly `schedule`) workflows.
-**Blocker to solve first:** `bootstrap/state/` is gitignored, so a CI checkout has NO Pulumi
-stack state — `pulumi`/`backup` scripts can't `pulumi config get` or `stack select`.
-- **Recommended fix:** in `bootstrap/run.sh`, after a successful `up`, also `pulumi stack export`
-  and `mc cp` it to a dedicated RustFS object (secrets stay passphrase-encrypted within). The CI
-  job pulls it → `pulumi stack import` → `pulumi preview`. (Alternative: import the latest backup
-  bundle's `pulumi-state.json`, but that needs the age identity in CI — avoid.)
-- **Forgejo Actions secrets** (set via the admin API, repo or org scope): `PULUMI_CONFIG_PASSPHRASE`,
-  the operator SSH key (write to a file + `SSH_PRIVATE_KEY_PATH`), and RustFS/offsite creds. The
-  scripts already read the passphrase from env and the key from `SSH_PRIVATE_KEY_PATH`.
-- Jobs: `runs-on: docker` + `container: foundation-ci:latest`. preview should be read-only; gate any
-  `up` behind `workflow_dispatch` (never auto-`up` live infra from CI).
-- Validate: push → both jobs green on the runner. `backup-verify` = `backup.sh` then `restore.sh off`.
-
-### 2. Ecosystem CI — the `999_testing.md` acceptance plan (architecture: REUSABLE workflows)
-Reusable Forgejo workflows in THIS repo (`uses: olsitec/foundation/.forgejo/workflows/.yml@master`,
-`on: workflow_call`) that each project references. Cover, per `999_testing.md`:
-- **Build matrix** (5 named candidate repos — paths in the doc): docker-no-npm
-  (`seaspots/services/seaspots-homepage`), npm pkg (`olsitec-nci/lib/olsicrypto`), bun pkg
-  (`olsitec-nci/lib/document-engine`), non-artifact versioned (`olsitrack2/api`), docker+npm
-  (`olsitrack2/services/token-service`, depends on olsicrypto).
-- **semantic-release** bump tests: init→`1.0.0`, `feat`→minor, `fix`/`chore`→patch, `feat!`→major,
-  `BREAKING CHANGE`→major. (Olsitec uses Conventional Commits + semantic-release-monorepo.)
-- **Linters**: an eslint error and a yamllint error must each fail the job (non-zero exit).
-- **Toolchain**: extend `containers/ci-image/Dockerfile` (or add a sibling `ci-node` image) with
-  `shellcheck`, `eslint`, `yamllint`, `semantic-release`; re-pin in `VERSIONS`.
-- **DO THIS FIRST (R5):** the runner still holds the host Docker socket (root-equivalent). **Fence it
-  to a separate privileged VM before running any untrusted/ecosystem candidate**, or scope what runs.
-
-### Later (after the above)
-- **T15** — `index.ts` orchestration polish + Gate A/B comments + `docs/DAY-ZERO-TIMELINE.md`.
-- **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` `latest`, `IMAGE_CI` tag);
-  register in Olsitec MCP (D6); Stage-2 publish `packages/pulumi-*`.
+## Next work (pick up here)
+1. **Package registry (Stage-2)** — populate the Forgejo package registry so cross-repo `@olsitec` deps
+   resolve: publish `olsicrypto`, `svelte-common`, … Then validate `docker-build` end-to-end for the two
+   registry-blocked candidates (**C1 seaspots-homepage**, **C5 token-service**) — pass an npmrc via the
+   action's `build-args`. (C2/C3/C4 already validated.)
+2. **R5 fence** — separate privileged runner VM (or socket-less DinD), labelled, before untrusted repos.
+3. **T15** — `index.ts` orchestration polish (phase marker still `T10-runner`) + Gate A/B comments +
+   `docs/DAY-ZERO-TIMELINE.md`.
+4. **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` `latest`, `IMAGE_CI` tag);
+   pre-bake pulumi plugins into `foundation-ci` (drop preview's per-run auto-install); register in Olsitec
+   MCP (D6); consider a Forgejo upgrade to regain reusable workflows.
 
 Validate each task live (VM `./run.sh up` + the runner for CI) and commit per task.
diff --git a/documentation/sessions/SESSION_2026-07-01_002.md b/documentation/sessions/SESSION_2026-07-01_002.md
new file mode 100644
index 0000000..412f218
--- /dev/null
+++ b/documentation/sessions/SESSION_2026-07-01_002.md
@@ -0,0 +1,96 @@
+# Session 2026-07-01 #002 — finish T14 + the 999_testing ecosystem CI
+
+## What was done
+Picked up from SESSION_2026-07-01_001 (egg live, T14-core done). Finished the
+**T14 remainder** (the stack-state-dependent pipelines) and built the **ecosystem
+CI** (the 999_testing acceptance plan). Every task an atomic, conventional commit,
+validated live on the runner. Egg stayed healthy throughout (6 containers).
+
+### T14 remainder — state-dependent pipelines (DONE, green on the runner)
+- **State blocker solved.** `bootstrap/state/` is gitignored, so CI had no Pulumi
+  state. `bootstrap/state-publish.sh` ships a fresh `pulumi stack export` to
+  `rfs/foundation-ci-state/foundation-stack.json` via a throwaway `mc` container on
+  foundation-net (ADR-007, like backup.sh); `run.sh` calls it best-effort after every
+  `up`. Secrets inside the export stay passphrase-encrypted; **config** comes from the
+  committed (encrypted) `Pulumi.foundation.yaml` via the CI checkout. Declared the
+  `foundation-ci-state` bucket in `components/rustfs.ts` + the config array.
+- **CI image: pulumi 3.145 → 3.243.** 3.145 rejects the `packagemanager: bun` project
+  option (`bootstrap/Pulumi.yaml`) so `preview` couldn't load the program; 3.149 is the
+  bun floor, pinned 3.243 for operator parity. `TOOL_PULUMI_MIN` bumped. Image rebuilt
+  on the VM.
+- **Forgejo Actions secrets** (repo-scoped on `olsitec/foundation`, set via the admin
+  API, values via temp-file `curl -d @-`, never argv): `PULUMI_CONFIG_PASSPHRASE`,
+  `SSH_PRIVATE_KEY` (operator ed25519), `RUSTFS_ACCESS_KEY`/`RUSTFS_SECRET_KEY` (the
+  scoped service account, from Vault `foundation/rustfs/service-credentials`).
+- **`.forgejo/workflows/pulumi-preview.yml`** (push/PR/dispatch): pulls + imports the
+  state object, materializes the operator key from the secret (the docker provider AND
+  `index.ts` read it — `index.ts` reads `.pub`, derived via `ssh-keygen -y`),
+  `mkdir -p state`, `pulumi preview` — **read-only, never up**. A diff is informational
+  (the job fails only on a program/preview error). The provider dials the VM over SSH
+  at the public IP:222, reachable from a foundation-net container (verified). **GREEN.**
+- **`.forgejo/workflows/backup-verify.yml`** (weekly cron + dispatch): reuses
+  `backup.sh`/`restore.sh` UNCHANGED — they read everything from `pulumi config get`
+  and orchestrate on the VM over SSH. Imports real state so the bundle's
+  `pulumi-state.json` is real, not an empty deployment. **GREEN** (RESTORE VERIFY PASS
+  from offsite: postgres rows=2, repo present, 9 blobs, vault snapshot OK).
+
+### R5 — runner fence: DEFERRED (operator decision)
+The runner still holds the host Docker socket (root-equivalent on the forge VM). The
+operator chose to run the 5 **first-party/trusted** candidate repos on the existing
+runner as-is, deferring the separate-VM fence to later hardening. The fence remains
+real hardening for when UNTRUSTED workflows run.
+
+### Ecosystem CI — the 999_testing plan (DONE, validated on the runner)
+- **CI image toolchain extended:** shellcheck + yamllint (apt), eslint@9.18.0 +
+  semantic-release@24.2.3 with the **conventionalcommits preset** + `@semantic-release/
+  git`+`changelog` (the plugin set Olsitec's GitLab release template uses). Pinned in
+  VERSIONS (NOT in preflight's up-gating set — job tools, not deploy tools).
+- **ARCHITECTURE PIVOT — Forgejo 11.0.15 does NOT support reusable workflows.** A
+  job-level `uses:`/`workflow_call` is silently dropped → **zero runs** (verified live,
+  both same-repo and cross-repo; an equivalent inline job ran green). The working
+  cross-repo reuse primitive is the **COMPOSITE ACTION referenced by FULL URL**:
+  `uses: https://forge.olsitec.net/olsitec/foundation/actions/@master` (short-form
+  resolves against the runner's `DEFAULT_ACTIONS_URL`=data.forgejo.org and 404s).
+  Replaced the (dead) `reusable-*.yml` with composite actions.
+- **`actions/`** (composite, + README): `node-build` (npm/bun/none install+build),
+  `docker-build` (host-socket build; caller mounts the socket), `lint` (eslint+yamllint
+  gate), `semantic-release-version` (conventionalcommits dry-run version probe).
+- **`.forgejo/workflows/ecosystem-selftest.yml`** + `ci/semantic-release-bumptest.sh`:
+  self-contained proof on the runner of the 999 criteria that need no external repo —
+  the **semantic-release bump sequence** `1.0.0→1.1.0→1.1.1→2.0.0→3.0.0` (Olsitec's exact
+  releaseRules; `--dry-run --no-ci --tag-format '${version}'` + grep, like the GitLab
+  `generate-release-version` job) and the **eslint/yamllint non-zero-exit gates**. **All GREEN.**
+- **Candidate validation:** `node-build` ran **green on the runner** against a real bun
+  build (throwaway `citest-node`, since deleted). Real candidate code built in the
+  foundation-ci image: **C2 olsicrypto** (npm/tsc → dist) and **C3 document-engine**
+  (bun/tsc → dist). **C4 olsitrack/api** is no-build (install-only path). **C1
+  seaspots-homepage** and **C5 token-service** are blocked on the not-yet-published
+  `@olsitec` package registry (svelte-common / olsicrypto) — Stage-2; documented.
+
+## Current state
+- Repo `~/work/olsitec-foundation/foundation`, branch `master`, origin = Forgejo,
+  working tree clean. Commits this session (pushed): `fix(ci-image): pulumi 3.243`,
+  `feat(ci): T14 pipelines`, `feat(ci-image): ecosystem toolchain`, `feat(ci): reusable
+  workflows + selftest`, `refactor(ci): composite actions (Forgejo 11)` (+ a probe commit).
+- Foundation's own CI green on master (preflight, typecheck, preview, semantic-release-
+  bumptest, eslint-gate, yamllint-gate). `pulumi-preview` + `backup-verify` green.
+- `cd bootstrap && ./run.sh up` idempotent; it now also publishes state to RustFS.
+- Master passphrase `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`; VM key
+  `~/.ssh/foundation-test_ed25519`; forge admin `platform-admin` / Vault
+  `foundation/forgejo/service-credentials:forgejoAdminPassword`.
+
+## Known gaps / next steps
+- **R5 fence** — still pending (operator-deferred). Do before any UNTRUSTED workflow.
+- **Package registry (Stage-2)** — C1/C5 + any cross-repo `@olsitec` dep need the
+  Forgejo package registry populated (publish `olsicrypto`, `svelte-common`, …). Then
+  `docker-build` for seaspots-homepage / token-service can be validated end-to-end
+  (npmrc via `build-args`).
+- **Forgejo upgrade** — reusable workflows need a newer Forgejo; until then composite
+  actions are the contract (`actions/README.md`).
+- **T15** — `index.ts` phase marker still `T10-runner`; Gate A/B comments;
+  `docs/DAY-ZERO-TIMELINE.md`.
+- **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` latest,
+  `IMAGE_CI` tag); pre-bake pulumi plugins into foundation-ci to drop preview's per-run
+  auto-install; register in Olsitec MCP (D6). VM sshd MaxStartups before refresh-in-CI.
+
+## Operating mode for next session: HIGH-RISK / INFRA (remote VM, Docker, secrets).
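
Note on the dry-run version probe described for `ecosystem-selftest.yml` (`--dry-run --no-ci --tag-format '${version}'` + grep): the diff does not include `ci/semantic-release-bumptest.sh`, so the following is only a minimal sketch of how such a probe could look. The function name `next_version` is hypothetical, and the matched log wording (`next release version is X.Y.Z`) is an assumption about semantic-release's dry-run output; the real script may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the repo): read semantic-release dry-run
# output on stdin and print the predicted version, e.g. "1.1.0".
next_version() {
  # Assumed log wording; case-insensitive so it also matches the first-release
  # message ("... the next release version is 1.0.0").
  grep -ioE 'next release version is [0-9]+\.[0-9]+\.[0-9]+' \
    | head -n 1 \
    | grep -oE '[0-9]+\.[0-9]+\.[0-9]+'
}

# Possible use inside a job (assumes semantic-release is on PATH, as in the
# foundation-ci image):
#   semantic-release --dry-run --no-ci --tag-format '${version}' | next_version
```

Keeping the parser separate from the `semantic-release` call lets the bump sequence be asserted step by step (compare the printed value against the expected `1.0.0`, `1.1.0`, and so on) and fails closed: if the log wording changes, the function prints nothing and returns non-zero.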