From eb005d5ca61119b714046b1215a0877ff11632b1 Mon Sep 17 00:00:00 2001
From: Andreas Niemann
Date: Wed, 1 Jul 2026 00:18:24 +0200
Subject: [PATCH] =?UTF-8?q?docs(session):=20SESSION=5F2026-07-01=5F001=20?=
 =?UTF-8?q?=E2=80=94=20gaps=20closed=20+=20T11=20+=20T13=20+=20T14-core?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Record the session: all three known gaps closed (age encryption, Forgejo
crypto mirror + empty-SECRET_KEY fix, ipam ignoreChanges), T11 (repos →
Forgejo, origin switched), T13 (DR rehearsed on a throwaway VM + scripts +
runbook), and T14 core (baked CI image + runner config + green
preflight/typecheck workflow). Refresh HANDOVER to point at it; next:
state-dependent CI + ecosystem CI (999_testing.md) + T15 + hardening.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 documentation/sessions/HANDOVER.md     | 80 +++++++++---------
 .../sessions/SESSION_2026-07-01_001.md | 81 +++++++++++++++++++
 2 files changed, 122 insertions(+), 39 deletions(-)
 create mode 100644 documentation/sessions/SESSION_2026-07-01_001.md

diff --git a/documentation/sessions/HANDOVER.md b/documentation/sessions/HANDOVER.md
index 0638866..d5ac4a5 100644
--- a/documentation/sessions/HANDOVER.md
+++ b/documentation/sessions/HANDOVER.md
@@ -1,62 +1,64 @@
 # HANDOVER — next-session prompt (paste into a fresh context)
 
 > Living doc: overwritten each handover. The durable record is the dated
-> `SESSION_*` files. Latest state = `SESSION_2026-06-30_002.md`.
+> `SESSION_*` files. Latest state = `SESSION_2026-07-01_001.md`.
 
 ---
 
 Continue the **olsitec-foundation** build. You are the **Lead Agent, HIGH-RISK / INFRA mode**.
 
 ## Required reads (in `~/work/olsitec-foundation/foundation/`)
-1. `documentation/sessions/SESSION_2026-06-30_002.md` ← current state + known gaps + next steps
+1. `documentation/sessions/SESSION_2026-07-01_001.md` ← current state + known gaps + next steps
 2. `documentation/000_baseline.md` + `000_TOPOLOGY.md`
 3. `documentation/contracts/CONTRACT_001–004` + `decisions/ADR_004,005,006,007` (**ADR-007** is the
    control-plane mechanism the whole egg runs on — read it first)
 4. `documentation/planning/PLAN-002-foundation-implementation.md` §10
+5. `documentation/999_testing.md` ← the operator's acceptance-test plan for the ecosystem CI
 
 ## Where things stand
-**The egg is LIVE and the goal is met.** Wave 2 (T03–T10, T12) is deployed to the Helsinki VM and
-committed. `git clone git@git.olsitec.net:olsitec/foundation.git` works (:22 and :2222). Six containers
-on `foundation-net`: postgres, rustfs, vault, caddy, forgejo, runner — all healthy. `https://forge.olsitec.net`
-= 200 (LE DNS-01). CI green. Backups → RustFS + offsite, restore-verified from both. `cd bootstrap &&
-./run.sh up` is idempotent (**41 unchanged**). Working tree clean on `master`.
+**The egg is LIVE, all three known gaps are CLOSED, and T11/T13/T14-core are done.** Six containers
+on `foundation-net` (postgres/rustfs/vault/caddy/forgejo/runner), all healthy. `https://forge.olsitec.net`
+=200; `git clone git@git.olsitec.net:olsitec/foundation.git` works; the foundation repo's **origin is now
+Forgejo** (master default); `ai-baseline` is mirrored. **Backups are age-encrypted** (restore-verified from
+RustFS + offsite). **DR to a fresh VM is rehearsed + scripted** (`dr/`). The forge's **own CI runs green**
+on its runner (`.forgejo/workflows/ci.yml`: preflight + typecheck, in the baked `foundation-ci` image).
+`cd bootstrap && ./run.sh up` is idempotent. Working tree clean on `master` (except the operator's untracked
+`documentation/999_testing.md`).
 
 ## Operating essentials
-- **VM**: `204.168.234.72`, admin SSH **:222**, key `~/.ssh/foundation-test_ed25519` (also the registered
-  Forgejo operator key). Git endpoint is :22 (scp-form) + :2222.
-- **Deploy**: `cd bootstrap && ./run.sh up` (sets passphrase + key + per-process backend; captures Vault
-  keys to config after `up`). Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
-- **Vault reboot**: `bootstrap/vault-unseal.sh`. **Backup**: `backup/backup.sh [ts]`;
-  **restore-verify**: `backup/restore.sh [rfs|off]`.
+- **VM**: `204.168.234.72`, admin SSH **:222**, key `~/.ssh/foundation-test_ed25519` (also the Forgejo
+  operator key). Git endpoint :22 (scp-form) + :2222.
+- **Deploy**: `cd bootstrap && ./run.sh up`. Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`.
+- **Vault reboot**: `bootstrap/vault-unseal.sh`. **Backup**: `backup/backup.sh [ts]`; **restore-verify**:
+  `backup/restore.sh [rfs|off]`. **DR to fresh VM**: `dr/restore-to-fresh-vm.sh` (+ `dr/RUNBOOK.md`).
+- **Forge admin**: `platform-admin` / Vault `foundation/forgejo/service-credentials:forgejoAdminPassword`.
+- **CI image**: built on the VM (`/tmp/ci-image`, from `containers/ci-image/Dockerfile`), tag `foundation-ci:latest`,
+  used locally by the runner (`force_pull:false`). Rebuild on toolchain change.
 - **Mechanism (ADR-007)**: in-VM control-plane ops = `@pulumi/command` `remote.Command` (docker-exec over
-  SSH); idempotent, readiness-gated, **secrets on stdin** (never inline — the provider echoes the command
-  on error). Images are digest-pinned in `VERSIONS`.
+  SSH); idempotent, readiness-gated, secrets on stdin. Images digest-pinned in `VERSIONS`.
 
 ## Watchouts (HIGH-RISK)
-- Do **NOT** `pulumi up --refresh` blindly — it surfaces a spurious `foundation-net` ipamConfigs diff;
-  applying it recreates the network and disconnects every container. Plain `up` ignores it. (Investigate +
-  fix the drift before enabling refresh in CI.)
+- `up --refresh` no longer recreates the network (ipam `ignoreChanges`), but still shows pessimistic
+  `~triggers` replaces on the vault command chain in *preview* (refreshed `container.id`=`[unknown]`) — a
+  Pulumi preview artifact, idempotent if applied. Don't panic at it.
+- The VM sshd throttles bursts of docker-over-SSH (e.g. parallel refresh) → "Connection closed". Use
+  `--parallel 1` for refresh, or raise sshd MaxStartups before wiring refresh into CI.
 - Never print/commit the passphrase, Vault root token, or unseal keys (D2) — only the already-encrypted
-  `secure: v1:…` values in `Pulumi.foundation.yaml`.
-- Don't `pulumi up` against the production `olsicloud4-*` stacks. The `provision`/`offsite-backup` stacks
-  use the throwaway passphrase `dev-validation-throwaway` + `HCLOUD_TOKEN`/`MINIO_BACKUP_*` from `pass`.
-- Commit **atomically per task** (conventional commits; group by concern; don't `git add .`).
+  `secure:` values. Don't `pulumi up` the prod `olsicloud4-*` stacks. Commit **atomically per task**.
+- Don't `pulumi up` the `provision` stack against the LIVE VM (it would recreate the server — cloud-init
+  changes only affect fresh provisions).
 
-## Next work — remaining PLAN-002 tasks + the known gaps
-Pick up where the plan left off (parallelization map §10.2 Wave 5–6). Suggested order:
-1. **Close the gaps from SESSION_2026-06-30_002 "Known gaps"** — they're small and de-risk the rest:
-   - age at-rest encryption of backups (CONTRACT_004 §4.3): generate the age key, store recipient/identity
-     (Vault `foundation/backup/backup-credentials` + passphrase config), encrypt artifacts before upload.
-   - Mirror Forgejo crypto secrets (SECRET_KEY/INTERNAL_TOKEN/JWT from app.ini) into
-     `foundation/forgejo/service-credentials`.
-   - Investigate + fix the `foundation-net` ipam refresh diff so `up --refresh` is safe.
-2. **T11 handover** — push the foundation repo into Forgejo (`olsitec/foundation`) and switch origin;
-   mirror `ai-baseline`. (The repo already exists in Forgejo from T09 with a README — reconcile.)
-3. **T13 DR** — `dr/RUNBOOK.md` + `dr/restore-to-fresh-vm.sh`; rehearse a full rebuild on a clean VM from
-   the offsite bundle (the destructive sibling of `backup/restore.sh`, restore order Vault→PG→RustFS→Forgejo).
-4. **T14 CI** — `.forgejo/workflows/` (preflight, pulumi preview/up, backup-verify weekly).
-5. **T15** — `index.ts` orchestration polish + Gate A/B comments + `docs/DAY-ZERO-TIMELINE.md` checklist.
-6. **Then hardening**: pin remaining floating refs, fence the runner to a separate privileged VM (R5),
-   register the project in Olsitec MCP (D6 / PLAN-002 §8), and the Stage-2 publish of `packages/pulumi-*`.
+## Next work — pick up from SESSION_2026-07-01_001 "Known gaps"
+1. **T14 remainder (state-dependent CI)** — `pulumi preview` + weekly `backup-verify` workflows. Resolve the
+   blocker first: `bootstrap/state/` is gitignored, so CI has no stack state. Either fetch state from RustFS
+   in-job (the bundle carries `pulumi-state.json`; or push a dedicated `pulumi stack export` to RustFS each
+   `up`), then set Forgejo Actions secrets (`PULUMI_CONFIG_PASSPHRASE`, the SSH key, RustFS/offsite creds).
+2. **Ecosystem CI (999_testing.md)** — reusable Forgejo workflows (chosen architecture) for docker/npm/bun
+   builds, semantic-release bump tests, eslint + yamllint, exercised against the 5 candidate repos. Extend
+   the CI image (shellcheck/eslint/yamllint/semantic-release) or add a sibling image.
+3. **T15** — `index.ts` orchestration polish + Gate A/B comments + `docs/DAY-ZERO-TIMELINE.md`.
+4. **Hardening** — pin floating refs (`IMAGE_REGISTRY` PIN_DIGEST, `IMAGE_RUSTFS` `latest`, `IMAGE_CI` tag);
+   fence the runner to a separate privileged VM (R5); register in Olsitec MCP (D6); Stage-2 publish
+   `packages/pulumi-*`.
 
-Validate each task live on the VM via `./run.sh up` and commit per task.
+Validate each task live on the VM via `./run.sh up` (and the runner for CI), and commit per task.
diff --git a/documentation/sessions/SESSION_2026-07-01_001.md b/documentation/sessions/SESSION_2026-07-01_001.md
new file mode 100644
index 0000000..36f6d9f
--- /dev/null
+++ b/documentation/sessions/SESSION_2026-07-01_001.md
@@ -0,0 +1,81 @@
+# Session 2026-07-01 #001 — close the gaps + T11 + T13 (DR) + T14 (CI core)
+
+## What was done
+Picked up from SESSION_2026-06-30_002 (egg live). Closed all three known gaps, did
+T11 + T13, and stood up the foundation's own CI (T14 core). Each task an atomic,
+conventional commit, validated live. Egg stayed healthy throughout.
+
+### Gaps closed
+- **age at-rest encryption** (CONTRACT_004 §4.3) — every backup artifact is now
+  age-encrypted on the VM before upload (`*.age`); only `MANIFEST.json` is cleartext
+  (inventory + integrity gate; PLAINTEXT shas verified after decrypt). Seeded the age
+  key: recipient is non-secret config, identity is in passphrase-encrypted config
+  **and** Vault (`foundation/backup/backup-credentials`, also added — it was empty),
+  so `{repo + passphrase}` decrypts after total Vault loss. `age`+`zstd` added to the
+  provision cloud-init for DR. Validated: encrypted backup + restore-verify PASS from
+  RustFS **and** offsite.
+- **Forgejo crypto secrets → Vault** — `foundation/forgejo/service-credentials` is now
+  single-owned at GATE B and holds admin + `SECRET_KEY`/`INTERNAL_TOKEN`/JWT secrets,
+  read off the live `app.ini`. **FINDING + FIX**: `SECRET_KEY` was EMPTY (skipping the
+  web installer under `INSTALL_LOCK` left it unset → weak at-rest crypto for 2FA/mirror/
+  oauth). Generated it (`@pulumi/random`) and injected via `FORGEJO__security__SECRET_KEY`
+  while the egg is fresh (no re-encryption). Now 40 chars in app.ini + Vault.
+- **foundation-net ipam refresh diff** — Docker auto-assigns gateway `.1`, which a
+  `pulumi up --refresh` surfaced as drift; `gateway` is ForceNew, so reconciling it
+  (declaring it OR applying the diff) would REPLACE the net + disconnect everything
+  (verified). Fix: `ignoreChanges:["ipamConfigs"]` on the immutable IPAM. Plain `up`
+  clean; `up --refresh` no longer recreates the net. (Residual, non-destructive:
+  `preview --refresh` shows pessimistic `~triggers` replaces on the vault command chain
+  because a refreshed `container.id` is `[unknown]` in preview — a Pulumi artifact,
+  idempotent if applied.)
+
+### Tasks
+- **T11 handover** — pushed `olsitec/foundation` (28 commits incl. the above) into
+  Forgejo and switched `origin` to `git@git.olsitec.net`; made `master` the default,
+  dropped the T09 placeholder `main`. Created + pushed `olsitec/ai-baseline`. Both clone
+  from the canonical endpoint. (origin/sshCommand live in `.git/config`, nothing in-tree.)
+- **T13 DR** — `dr/restore-to-fresh-vm.sh` + `-remote.sh` + `dr/RUNBOOK.md`. **Rehearsed
+  on a throwaway cx33 from the OFFSITE bundle, then destroyed it.** Restore order
+  Vault→Postgres→RustFS→Forgejo: `DR RESTORE OK` — Vault unsealed with OLD keys, pg
+  rows=2, forge healthy against restored DB+S3, `git clone ssh://git@:2222/...`
+  returns all 28 commits, ai-baseline present. **Findings fixed during the rehearsal**:
+  (a) backup only tarred `/data/git` — now tars the whole `/data` (app.ini + ssh host
+  keys, CONTRACT_004 §4.2); (b) `raft snapshot restore -force` re-seals asynchronously
+  → added a settle+retry unseal loop; (c) publish Forgejo git :22 only when free.
+- **T14 CI core** — baked `foundation-ci` image (`containers/ci-image/Dockerfile`,
+  VERSIONS `IMAGE_CI`) with the full toolchain; built on the VM, used locally by the
+  runner. `runner.ts` now writes an act_runner `config.yaml`
+  (`container.network=foundation-net`, `force_pull=false`). `.forgejo/workflows/ci.yml`
+  (preflight tools+versions, typecheck `tsc --noEmit`) **runs GREEN on the runner**.
+  Scripts take `PULUMI_CONFIG_PASSPHRASE` from env (CI) falling back to `pass`.
+
+## Current state
+- Repo `~/work/olsitec-foundation/foundation`, branch `master`, origin = Forgejo. Working
+  tree clean except the operator's untracked `documentation/999_testing.md` (the
+  acceptance-test plan for the ecosystem CI — see Next steps).
+- `cd bootstrap && ./run.sh up` idempotent. 7 services (added: nothing new container-wise;
+  runner reconfigured). `https://forge.olsitec.net`=200, clone works, CI green.
+- Master passphrase: `pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE`. VM key
+  `~/.ssh/foundation-test_ed25519`. Forge admin: `platform-admin` / Vault
+  `foundation/forgejo/service-credentials:forgejoAdminPassword`.
+
+## Known gaps / next steps
+- **T14 remainder (state-dependent CI)** — `pulumi preview` + `backup-verify` (weekly)
+  workflows. BLOCKER: `bootstrap/state/` is gitignored, so a CI checkout has no stack
+  state. Needs (a) a state fetch from RustFS in-job (the bundle already carries
+  `pulumi-state.json`; or push a dedicated `pulumi stack export` to RustFS on each up),
+  and (b) Forgejo Actions secrets: `PULUMI_CONFIG_PASSPHRASE`, the SSH key, RustFS/offsite
+  creds. Then `runs-on: docker` + `container: foundation-ci:latest`.
+- **Ecosystem CI (the 999_testing.md plan)** — reusable Forgejo workflows (chosen
+  architecture) for: docker build (±npm deps), npm + bun package builds, semantic-release
+  bump tests (1.0.0→feat→fix→`!`→BREAKING CHANGE), eslint + yamllint gating. Candidates:
+  seaspots-homepage, olsicrypto, document-engine, olsitrack2/api, token-service. Add
+  `shellcheck`/`eslint`/`yamllint`/`semantic-release` to the CI image or a sibling image.
+- **T15** — `index.ts` orchestration polish + Gate A/B comments + `docs/DAY-ZERO-TIMELINE.md`.
+- **Hardening** — pin floating refs (`IMAGE_REGISTRY=…PIN_DIGEST`, `IMAGE_RUSTFS` tag
+  `latest`, `IMAGE_CI` tag-only); fence the runner to a separate privileged VM (R5; it
+  still has the host docker socket); register in Olsitec MCP (D6); Stage-2 publish
+  `packages/pulumi-*`. Also: VM sshd throttles bursts of docker-over-SSH (refresh) —
+  serialize (`--parallel`) or raise MaxStartups before refresh-in-CI.
+
+## Operating mode for next session: HIGH-RISK / INFRA (remote VM, Docker, secrets).