foundation/documentation/sessions/HANDOVER.md
Andreas Niemann 430c55cdf6
docs(session): focus HANDOVER on T14-remainder then 999_testing ecosystem CI
Sharpen the living handover for the next context: concrete starting points +
pre-surfaced blockers/decisions for (1) the stack-state-dependent CI pipelines
(state-fetch-from-RustFS + Forgejo Actions secrets) and (2) the 999_testing
ecosystem CI (reusable workflows, build matrix over the 5 candidates,
semantic-release bump tests, eslint/yamllint, R5 runner-fencing first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 00:28:57 +02:00


# HANDOVER — next-session prompt (paste into a fresh context)
> Living doc: overwritten each handover. The durable record is the dated
> `SESSION_*` files. Latest state = `SESSION_2026-07-01_001.md`.
---
Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**.
## Required reads (in `~/work/olsitec-foundation/foundation/`)
1. `documentation/sessions/SESSION_2026-07-01_001.md` ← current state + known gaps + next steps
2. `documentation/000_baseline.md` + `000_TOPOLOGY.md`
3. `documentation/contracts/CONTRACT_001004` + `decisions/ADR_004,005,006,007`
(**ADR-007** is the control-plane mechanism the whole egg runs on — read it first)
4. `documentation/planning/PLAN-002-foundation-implementation.md` §10
5. `documentation/999_testing.md` ← the operator's acceptance-test plan for the ecosystem CI
## Where things stand
**The egg is LIVE, all three known gaps are CLOSED, and T11/T13/T14-core are done.** Six containers
on `foundation-net` (postgres/rustfs/vault/caddy/forgejo/runner), all healthy. `https://forge.olsitec.net`
returns HTTP 200; `git clone git@git.olsitec.net:olsitec/foundation.git` works; the foundation repo's **origin is now
Forgejo** (master default); `ai-baseline` is mirrored. **Backups are age-encrypted** (restore-verified from
RustFS + offsite). **DR to a fresh VM is rehearsed + scripted** (`dr/`). The forge's **own CI runs green**
on its runner (`.forgejo/workflows/ci.yml`: preflight + typecheck, in the baked `foundation-ci` image).
`cd bootstrap && ./run.sh up` is idempotent. Working tree clean on `master` (except the operator's untracked
`documentation/999_testing.md`).
## Operating essentials
- **VM**: `204.168.234.72`, admin SSH **:222**, key `~/.ssh/foundation-test_ed25519` (also the Forgejo
operator key). Git endpoint :22 (scp-form) + :2222.
- **Deploy**: `cd bootstrap && ./run.sh up`. Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
- **Vault reboot**: `bootstrap/vault-unseal.sh`. **Backup**: `backup/backup.sh [ts]`; **restore-verify**:
`backup/restore.sh <ts> [rfs|off]`. **DR to fresh VM**: `dr/restore-to-fresh-vm.sh` (+ `dr/RUNBOOK.md`).
- **Forge admin**: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
- **CI image**: built on the VM (`/tmp/ci-image`, from `containers/ci-image/Dockerfile`), tag `foundation-ci:latest`,
used locally by the runner (`force_pull:false`). Rebuild on toolchain change.
- **Mechanism (ADR-007)**: in-VM control-plane ops = `@pulumi/command` `remote.Command` (docker-exec over
SSH); idempotent, readiness-gated, secrets on stdin. Images digest-pinned in `VERSIONS`.
## Watchouts (HIGH-RISK)
- `up --refresh` no longer recreates the network (ipam `ignoreChanges`), but still shows pessimistic
`~triggers` replaces on the vault command chain in *preview* (refreshed `container.id`=`[unknown]`) — a
Pulumi preview artifact, idempotent if applied. Don't panic at it.
- The VM sshd throttles bursts of docker-over-SSH (e.g. parallel refresh) → "Connection closed". Use
`--parallel 1` for refresh, or raise sshd MaxStartups before wiring refresh into CI.
- Never print/commit the passphrase, Vault root token, or unseal keys (D2) — only the already-encrypted
`secure:` values. Don't `pulumi up` the prod `olsicloud4-*` stacks. Commit **atomically per task**.
- Don't `pulumi up` the `provision` stack against the LIVE VM (it would recreate the server — cloud-init
changes only affect fresh provisions).
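
The sshd-throttling watchout above can be handled on the client side by serialising the refresh. A minimal sketch, assuming a stack named `foundation` and a `DRY_RUN` switch; both are hypothetical and not taken from `bootstrap/run.sh`:

```shell
#!/bin/sh
# Sketch only. The STACK default and the DRY_RUN switch are assumptions made
# for illustration; they are not part of the existing scripts.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Refresh with a single worker so only one docker-over-SSH session is opened
# at a time, which stays under the VM's sshd burst throttle.
# --yes skips the confirmation prompt; drop it for interactive use.
refresh_serial() {
  run pulumi refresh --stack "${STACK:-foundation}" --parallel 1 --yes
}
```

With `DRY_RUN=1` set, `refresh_serial` only prints the command, which is a cheap way to check the flags before pointing it at the live stack. The server-side alternative is raising `MaxStartups` in `sshd_config` (OpenSSH's default is `10:30:100`).
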
## Next work — THIS session: (1) finish T14, then (2) the 999_testing ecosystem CI
T14-core already shipped: the baked `foundation-ci` image, the runner `config.yaml`
(`container.network=foundation-net`, `force_pull=false`), and `.forgejo/workflows/ci.yml`
(preflight + typecheck, **green**). Build on exactly that.
### 1. T14 remainder — the stack-state-dependent pipelines
Author `pulumi-preview` (on push/PR) and `backup-verify` (weekly `schedule`) workflows.
**Blocker to solve first:** `bootstrap/state/` is gitignored, so a CI checkout has NO Pulumi
stack state — `pulumi`/`backup` scripts can't `pulumi config get` or `stack select`.
- **Recommended fix:** in `bootstrap/run.sh`, after a successful `up`, also `pulumi stack export`
and `mc cp` it to a dedicated RustFS object (secrets stay passphrase-encrypted within). The CI
job pulls it → `pulumi stack import` → `pulumi preview`. (Alternative: import the latest backup
bundle's `pulumi-state.json`, but that needs the age identity in CI — avoid.)
- **Forgejo Actions secrets** (set via the admin API, repo or org scope): `PULUMI_CONFIG_PASSPHRASE`,
the operator SSH key (write to a file + `SSH_PRIVATE_KEY_PATH`), and RustFS/offsite creds. The
scripts already read the passphrase from env and the key from `SSH_PRIVATE_KEY_PATH`.
- Jobs: `runs-on: docker` + `container: foundation-ci:latest`. The preview job should be read-only; gate any
`up` behind `workflow_dispatch` (never auto-`up` live infra from CI).
- Validate: push → both jobs green on the runner. `backup-verify` = `backup.sh` then `restore.sh <ts> off`.
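
The recommended fix above amounts to two small shell steps: one at the end of `bootstrap/run.sh` that publishes the state, and one in the CI job that fetches it before previewing. A minimal sketch, assuming an `mc` alias `rfs`, an object path `ci-state/foundation.json`, and a stack named `foundation` (all hypothetical), and assuming the stack already exists in the job's Pulumi backend so that `pulumi stack import` has a target:

```shell
#!/bin/sh
# Sketch only. STACK, STATE_ALIAS, STATE_OBJECT and DRY_RUN are hypothetical
# names chosen for illustration; adapt them to the real stack and bucket.
STACK="${STACK:-foundation}"
STATE_ALIAS="${STATE_ALIAS:-rfs}"                       # `mc` alias for RustFS
STATE_OBJECT="${STATE_OBJECT:-ci-state/${STACK}.json}"  # bucket/key of the state copy

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# For the end of bootstrap/run.sh, after a successful `up`.
# Secrets inside the export stay passphrase-encrypted.
publish_state() {
  tmp="$(mktemp)" || return 1
  run pulumi stack export --stack "$STACK" --file "$tmp" &&
    run mc cp "$tmp" "${STATE_ALIAS}/${STATE_OBJECT}"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# For the pulumi-preview job: fetch the state, import it, then preview.
fetch_state_and_preview() {
  tmp="$(mktemp)" || return 1
  run mc cp "${STATE_ALIAS}/${STATE_OBJECT}" "$tmp" &&
    run pulumi stack import --stack "$STACK" --file "$tmp" &&
    run pulumi preview --stack "$STACK"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# For the backup-verify job: back up, then restore-verify from offsite.
# The caller supplies the timestamp in whatever form backup.sh expects.
backup_verify() {
  ts="$1"
  run backup/backup.sh "$ts" && run backup/restore.sh "$ts" off
}
```

In the CI job, the `mc` alias would be created from the RustFS credentials held as Forgejo Actions secrets, and `PULUMI_CONFIG_PASSPHRASE` would come from the environment as the scripts already expect. Running with `DRY_RUN=1` first prints each command without touching RustFS or the stack.
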
### 2. Ecosystem CI — the `999_testing.md` acceptance plan (architecture: REUSABLE workflows)
Reusable Forgejo workflows in THIS repo (`uses: olsitec/foundation/.forgejo/workflows/<x>.yml@master`,
`on: workflow_call`) that each project references. Cover, per `999_testing.md`:
- **Build matrix** (5 named candidate repos — paths in the doc): docker-no-npm
(`seaspots/services/seaspots-homepage`), npm pkg (`olsitec-nci/lib/olsicrypto`), bun pkg
(`olsitec-nci/lib/document-engine`), non-artifact versioned (`olsitrack2/api`), docker+npm
(`olsitrack2/services/token-service`, depends on olsicrypto).
- **semantic-release** bump tests: init→`1.0.0`, `feat`→minor, `fix`/`chore`→patch, `feat!`→major,
`BREAKING CHANGE`→major. (Olsitec uses Conventional Commits + semantic-release-monorepo.)
- **Linters**: an eslint error and a yamllint error must each fail the job (non-zero exit).
- **Toolchain**: extend `containers/ci-image/Dockerfile` (or add a sibling `ci-node` image) with
`shellcheck`, `eslint`, `yamllint`, `semantic-release`; re-pin in `VERSIONS`.
- **DO THIS FIRST (R5):** the runner still holds the host Docker socket (root-equivalent). **Fence it
to a separate privileged VM before running any untrusted/ecosystem candidate**, or scope what runs.
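
The bump expectations above can be pinned down as a small table-driven check before wiring them into a real semantic-release dry run. A minimal sketch with two hypothetical helpers, `expected_bump` and `next_version`, that encode the plan's mapping; it does not invoke semantic-release itself. Note that `chore`→patch follows the plan: the stock commit-analyzer rules do not release on `chore`, so this presumably relies on a custom `releaseRules` entry.

```shell
#!/bin/sh
# Sketch only: expected_bump and next_version are hypothetical helpers that
# encode the plan's table; they do not call semantic-release.

# expected_bump MESSAGE -> major | minor | patch | none
expected_bump() {
  msg="$1"
  subject="$(printf '%s\n' "$msg" | head -n 1)"
  prefix=${subject%%:*}   # e.g. "feat", "fix(api)", "feat!"
  type=${prefix%!}        # drop a trailing "!"
  type=${type%%(*}        # drop an optional "(scope)"
  # A BREAKING CHANGE footer anywhere in the message forces a major bump.
  case "$msg" in
    *"BREAKING CHANGE"*) echo major; return 0 ;;
  esac
  # So does a "!" directly before the colon.
  case "$prefix" in
    *'!') echo major; return 0 ;;
  esac
  case "$type" in
    feat) echo minor ;;
    fix|chore) echo patch ;;
    *) echo none ;;
  esac
}

# next_version CURRENT BUMP -> next version.
# An empty CURRENT means no release exists yet, so the result is 1.0.0.
next_version() {
  current="$1"
  bump="$2"
  if [ -z "$current" ]; then
    echo "1.0.0"
    return 0
  fi
  major=${current%%.*}
  rest=${current#*.}
  minor=${rest%%.*}
  patch=${rest#*.}
  case "$bump" in
    major) echo "$((major + 1)).0.0" ;;
    minor) echo "${major}.$((minor + 1)).0" ;;
    patch) echo "${major}.${minor}.$((patch + 1))" ;;
    *) echo "$current" ;;
  esac
}
```

For example, `next_version 1.2.3 "$(expected_bump 'feat: add export')"` prints `1.3.0`. The acceptance test can then compare these expected values against the tag that semantic-release actually produces in each candidate repo.
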
### Later (after the above)
- **T15** — `index.ts` orchestration polish + Gate A/B comments + `docs/DAY-ZERO-TIMELINE.md`.
- **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` `latest`, `IMAGE_CI` tag);
register in Olsitec MCP (D6); Stage-2 publish `packages/pulumi-*`.
Validate each task live (VM `./run.sh up` + the runner for CI) and commit per task.