foundation/documentation/sessions/HANDOVER.md
Andreas Niemann 430c55cdf6
docs(session): focus HANDOVER on T14-remainder then 999_testing ecosystem CI
Sharpen the living handover for the next context: concrete starting points +
pre-surfaced blockers/decisions for (1) the stack-state-dependent CI pipelines
(state-fetch-from-RustFS + Forgejo Actions secrets) and (2) the 999_testing
ecosystem CI (reusable workflows, build matrix over the 5 candidates,
semantic-release bump tests, eslint/yamllint, R5 runner-fencing first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 00:28:57 +02:00


HANDOVER — next-session prompt (paste into a fresh context)

Living doc: overwritten each handover. The durable record is the dated SESSION_* files. Latest state = SESSION_2026-07-01_001.md.


Continue the olsitec-foundation build. You are the Lead Agent, HIGH-RISK / INFRA mode.

Required reads (in ~/work/olsitec-foundation/foundation/)

  1. documentation/sessions/SESSION_2026-07-01_001.md ← current state + known gaps + next steps
  2. documentation/000_baseline.md + 000_TOPOLOGY.md
  3. documentation/contracts/CONTRACT_001004 + decisions/ADR_004,005,006,007 (ADR-007 is the control-plane mechanism the whole egg runs on — read it first)
  4. documentation/planning/PLAN-002-foundation-implementation.md §10
  5. documentation/999_testing.md ← the operator's acceptance-test plan for the ecosystem CI

Where things stand

The egg is LIVE, all three known gaps are CLOSED, and T11/T13/T14-core are done. Six containers run on foundation-net (postgres/rustfs/vault/caddy/forgejo/runner), all healthy. https://forge.olsitec.net returns HTTP 200; git clone git@git.olsitec.net:olsitec/foundation.git works; the foundation repo's origin is now Forgejo (master default); ai-baseline is mirrored. Backups are age-encrypted (restore-verified from RustFS + offsite). DR to a fresh VM is rehearsed + scripted (dr/). The forge's own CI runs green on its runner (.forgejo/workflows/ci.yml: preflight + typecheck, in the baked foundation-ci image). cd bootstrap && ./run.sh up is idempotent. Working tree is clean on master (except the operator's untracked documentation/999_testing.md).

Operating essentials

  • VM: 204.168.234.72, admin SSH :222, key ~/.ssh/foundation-test_ed25519 (also the Forgejo operator key). Git endpoint :22 (scp-form) + :2222.
  • Deploy: cd bootstrap && ./run.sh up. Master passphrase: pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE.
  • Vault reboot: bootstrap/vault-unseal.sh. Backup: backup/backup.sh [ts]; restore-verify: backup/restore.sh <ts> [rfs|off]. DR to fresh VM: dr/restore-to-fresh-vm.sh (+ dr/RUNBOOK.md).
  • Forge admin: platform-admin / Vault foundation/forgejo/service-credentials:forgejoAdminPassword.
  • CI image: built on the VM (/tmp/ci-image, from containers/ci-image/Dockerfile), tag foundation-ci:latest, used locally by the runner (force_pull:false). Rebuild on toolchain change.
  • Mechanism (ADR-007): in-VM control-plane ops = @pulumi/command remote.Command (docker-exec over SSH); idempotent, readiness-gated, secrets on stdin. Images digest-pinned in VERSIONS.

Watchouts (HIGH-RISK)

  • up --refresh no longer recreates the network (ipam ignoreChanges), but still shows pessimistic ~triggers replaces on the vault command chain in preview (refreshed container.id=[unknown]) — a Pulumi preview artifact, idempotent if applied. Don't panic at it.
  • The VM sshd throttles bursts of docker-over-SSH (e.g. parallel refresh) → "Connection closed". Use --parallel 1 for refresh, or raise sshd MaxStartups before wiring refresh into CI.
  • Never print/commit the passphrase, Vault root token, or unseal keys (D2) — only the already-encrypted secure: values. Don't pulumi up the prod olsicloud4-* stacks. Commit atomically per task.
  • Don't pulumi up the provision stack against the LIVE VM (it would recreate the server — cloud-init changes only affect fresh provisions).
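
The sshd throttling watchout above can be worked around with a serialized refresh. A minimal sketch, assuming the target stack is already selected and PULUMI_CONFIG_PASSPHRASE is exported; the helper name serial_refresh is hypothetical, while --parallel and --yes are existing pulumi refresh flags:

```shell
#!/usr/bin/env bash
set -euo pipefail

# serial_refresh (hypothetical helper): run the refresh with one resource
# operation at a time, so the docker-over-SSH calls do not reach sshd as a burst.
# Extra arguments (for example a stack selector) are passed through unchanged.
serial_refresh() {
  pulumi refresh --parallel 1 --yes "$@"
}
```

If raising the limit is preferred instead, running sshd -T as root on the VM prints the effective configuration, including the current maxstartups value (OpenSSH's default is 10:30:100).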

Next work — THIS session: (1) finish T14, then (2) the 999_testing ecosystem CI

T14-core already shipped: the baked foundation-ci image, the runner config.yaml (container.network=foundation-net, force_pull=false), and .forgejo/workflows/ci.yml (preflight + typecheck, green). Build on exactly that.

1. T14 remainder — the stack-state-dependent pipelines

Author pulumi-preview (on push/PR) and backup-verify (weekly schedule) workflows. Blocker to solve first: bootstrap/state/ is gitignored, so a CI checkout has NO Pulumi stack state, and the Pulumi and backup scripts cannot run pulumi config get or pulumi stack select.

  • Recommended fix: in bootstrap/run.sh, after a successful up, also pulumi stack export and mc cp it to a dedicated RustFS object (secrets stay passphrase-encrypted within). The CI job pulls it → pulumi stack import → pulumi preview. (Alternative: import the latest backup bundle's pulumi-state.json, but that needs the age identity in CI — avoid.)
  • Forgejo Actions secrets (set via the admin API, repo or org scope): PULUMI_CONFIG_PASSPHRASE, the operator SSH key (write to a file + SSH_PRIVATE_KEY_PATH), and RustFS/offsite creds. The scripts already read the passphrase from env and the key from SSH_PRIVATE_KEY_PATH.
  • Jobs: runs-on: docker + container: foundation-ci:latest. preview should be read-only; gate any up behind workflow_dispatch (never auto-up live infra from CI).
  • Validate: push → both jobs green on the runner. backup-verify = backup.sh then restore.sh <ts> off.
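
The recommended state hand-off could look roughly like the sketch below. It is an illustration under assumptions, not the repo's actual code: the mc alias rustfs, the object path foundation-ci/pulumi-state/bootstrap.json, the stack name bootstrap, and both function names are hypothetical. It also assumes the mc alias is already configured, the CI job has logged in to a Pulumi backend, and PULUMI_CONFIG_PASSPHRASE is present in the environment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical defaults: replace with the real mc alias, bucket, object, and stack.
STATE_OBJECT="${STATE_OBJECT:-rustfs/foundation-ci/pulumi-state/bootstrap.json}"
STACK="${STACK:-bootstrap}"

# publish_state: intended for the end of a successful `./run.sh up`.
# Exports the stack checkpoint and copies it to the RustFS object.
publish_state() {
  local tmp
  tmp="$(mktemp)"
  pulumi stack export --stack "$STACK" --file "$tmp"
  mc cp "$tmp" "$STATE_OBJECT"
  rm -f "$tmp"
}

# fetch_state: intended for the CI job, before `pulumi preview`.
# Downloads the checkpoint, creates/selects the stack, then imports the state.
fetch_state() {
  local tmp
  tmp="$(mktemp)"
  mc cp "$STATE_OBJECT" "$tmp"
  pulumi stack select --create "$STACK"
  pulumi stack import --stack "$STACK" --file "$tmp"
  rm -f "$tmp"
}
```

In this shape the preview job would call fetch_state and then pulumi preview, and nothing in CI writes state back, which keeps the job read-only as required above.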

2. Ecosystem CI — the 999_testing.md acceptance plan (architecture: REUSABLE workflows)

Reusable Forgejo workflows in THIS repo (uses: olsitec/foundation/.forgejo/workflows/<x>.yml@master, on: workflow_call) that each project references. Cover, per 999_testing.md:

  • Build matrix (5 named candidate repos — paths in the doc): docker-no-npm (seaspots/services/seaspots-homepage), npm pkg (olsitec-nci/lib/olsicrypto), bun pkg (olsitec-nci/lib/document-engine), non-artifact versioned (olsitrack2/api), docker+npm (olsitrack2/services/token-service, depends on olsicrypto).
  • semantic-release bump tests: init→1.0.0, feat→minor, fix/chore→patch, feat!→major, BREAKING CHANGE→major. (Olsitec uses Conventional Commits + semantic-release-monorepo.)
  • Linters: an eslint error and a yamllint error must each fail the job (non-zero exit).
  • Toolchain: extend containers/ci-image/Dockerfile (or add a sibling ci-node image) with shellcheck, eslint, yamllint, semantic-release; re-pin in VERSIONS.
  • DO THIS FIRST (R5): the runner still holds the host Docker socket (root-equivalent). Fence it to a separate privileged VM before running any untrusted/ecosystem candidate, or scope what runs.
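
The semantic-release bump expectations listed above can be pinned down as a small table-driven oracle that the bump tests compare against. This is a sketch only: expected_bump and next_version are hypothetical helpers, not part of semantic-release, and they encode the rules exactly as this handover states them. Note that chore→patch is taken from the list above; the stock commit-analyzer rules do not release on chore, so this presumably depends on custom releaseRules in the projects' configuration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# expected_bump MESSAGE -> prints major|minor|patch|none (hypothetical helper).
expected_bump() {
  local msg="$1" header
  local re_bang='^[a-z]+(\([^)]*\))?!:'        # e.g. "feat!:" or "fix(api)!:"
  local re_feat='^feat(\([^)]*\))?:'
  local re_patch='^(fix|chore)(\([^)]*\))?:'
  header="${msg%%$'\n'*}"                       # first line of the commit message
  if [[ "$header" =~ $re_bang ]] || grep -q '^BREAKING CHANGE: ' <<<"$msg"; then
    echo major
  elif [[ "$header" =~ $re_feat ]]; then
    echo minor
  elif [[ "$header" =~ $re_patch ]]; then
    echo patch
  else
    echo none
  fi
}

# next_version CURRENT BUMP -> prints the expected version (hypothetical helper).
# An empty CURRENT means no release exists yet, so the first release is 1.0.0.
next_version() {
  local current="$1" bump="$2" major minor patch
  if [[ "$bump" == none ]]; then echo "$current"; return; fi
  if [[ -z "$current" ]]; then echo "1.0.0"; return; fi
  IFS=. read -r major minor patch <<<"$current"
  case "$bump" in
    major) echo "$((major + 1)).0.0" ;;
    minor) echo "${major}.$((minor + 1)).0" ;;
    patch) echo "${major}.${minor}.$((patch + 1))" ;;
  esac
}
```

For example, next_version 1.4.2 "$(expected_bump 'feat!: drop v1 API')" prints 2.0.0; a bump test can then assert that the tag semantic-release actually produced matches this value.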

Later (after the above)

  • T15: index.ts orchestration polish + Gate A/B comments + docs/DAY-ZERO-TIMELINE.md.
  • Hardening — pin floating refs (IMAGE_REGISTRY PIN_DIGEST, IMAGE_RUSTFS latest, IMAGE_CI tag); register in Olsitec MCP (D6); Stage-2 publish packages/pulumi-*.

Validate each task live (VM ./run.sh up + the runner for CI) and commit per task.