foundation/documentation/sessions/SESSION_2026-07-01_001.md
Andreas Niemann eb005d5ca6
docs(session): SESSION_2026-07-01_001 — gaps closed + T11 + T13 + T14-core
Record the session: all three known gaps closed (age encryption, Forgejo
crypto mirror + empty-SECRET_KEY fix, ipam ignoreChanges), T11 (repos → Forgejo,
origin switched), T13 (DR rehearsed on a throwaway VM + scripts + runbook), and
T14 core (baked CI image + runner config + green preflight/typecheck workflow).
Refresh HANDOVER to point at it; next: state-dependent CI + ecosystem CI
(999_testing.md) + T15 + hardening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 00:18:24 +02:00

Session 2026-07-01 #001 — close the gaps + T11 + T13 (DR) + T14 (CI core)

What was done

Picked up from SESSION_2026-06-30_002 (egg live). Closed all three known gaps, did T11 + T13, and stood up the foundation's own CI (T14 core). Each task landed as an atomic, conventional commit and was validated live. Egg stayed healthy throughout.

Gaps closed

  • age at-rest encryption (CONTRACT_004 §4.3) — every backup artifact is now age-encrypted on the VM before upload (*.age); only MANIFEST.json is cleartext (inventory + integrity gate; PLAINTEXT shas verified after decrypt). Seeded the age key: recipient is non-secret config, identity is in passphrase-encrypted config and Vault (foundation/backup/backup-credentials, also added — it was empty), so {repo + passphrase} decrypts after total Vault loss. age+zstd added to the provision cloud-init for DR. Validated: encrypted backup + restore-verify PASS from RustFS and offsite.
  • Forgejo crypto secrets → Vault: foundation/forgejo/service-credentials is now single-owned at GATE B and holds admin + SECRET_KEY/INTERNAL_TOKEN/JWT secrets, read off the live app.ini. FINDING + FIX: SECRET_KEY was EMPTY (skipping the web installer under INSTALL_LOCK left it unset → weak at-rest crypto for 2FA/mirror/oauth). Generated it (@pulumi/random) and injected via FORGEJO__security__SECRET_KEY while the egg is fresh (no re-encryption). Now 40 chars in app.ini + Vault.
  • foundation-net ipam refresh diff — Docker auto-assigns gateway .1, which a pulumi up --refresh surfaced as drift; gateway is ForceNew, so reconciling it (declaring it OR applying the diff) would REPLACE the net + disconnect everything (verified). Fix: ignoreChanges:["ipamConfigs"] on the immutable IPAM. Plain up clean; up --refresh no longer recreates the net. (Residual, non-destructive: preview --refresh shows pessimistic ~triggers replaces on the vault command chain because a refreshed container.id is [unknown] in preview — a Pulumi artifact, idempotent if applied.)
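
The integrity gate from the first bullet (cleartext MANIFEST.json, plaintext SHAs checked after decryption) could be sketched roughly as follows. The manifest layout (`artifacts` entries with `name` and `sha256`) and the function names are assumptions for illustration; the real MANIFEST.json format is not shown in this log. The age decryption step itself is left out: the function only receives already-decrypted bytes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape; the real MANIFEST.json layout may differ.
interface ManifestEntry {
  name: string;   // artifact name without the .age suffix (assumption)
  sha256: string; // hex digest of the PLAINTEXT artifact
}

interface Manifest {
  artifacts: ManifestEntry[];
}

// Hex-encoded SHA-256 of a byte buffer.
function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Returns the names of artifacts that are missing from the decrypted set
// or whose decrypted bytes do not hash to the value in the manifest.
function verifyDecrypted(
  manifest: Manifest,
  decrypted: Map<string, Uint8Array>,
): string[] {
  const failures: string[] = [];
  for (const entry of manifest.artifacts) {
    const bytes = decrypted.get(entry.name);
    if (bytes === undefined || sha256Hex(bytes) !== entry.sha256.toLowerCase()) {
      failures.push(entry.name);
    }
  }
  return failures;
}
```

An empty result would correspond to a restore-verify PASS; any returned name points at a corrupted, tampered, or missing artifact.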

Tasks

  • T11 handover — pushed olsitec/foundation (28 commits incl. the above) into Forgejo and switched origin to git@git.olsitec.net; made master the default, dropped the T09 placeholder main. Created + pushed olsitec/ai-baseline. Both clone from the canonical endpoint. (origin/sshCommand live in .git/config, nothing in-tree.)
  • T13 DR: dr/restore-to-fresh-vm.sh + -remote.sh + dr/RUNBOOK.md. Rehearsed on a throwaway cx33 from the OFFSITE bundle, then destroyed it. Restore order Vault→Postgres→RustFS→Forgejo. Result: DR RESTORE OK. Vault unsealed with OLD keys, pg rows=2, forge healthy against restored DB+S3, git clone ssh://git@<vm>:2222/... returns all 28 commits, ai-baseline present. Findings fixed during the rehearsal: (a) backup only tarred /data/git; it now tars the whole /data (app.ini + ssh host keys, CONTRACT_004 §4.2); (b) raft snapshot restore -force re-seals asynchronously → added a settle+retry unseal loop; (c) publish Forgejo git :22 only when free.
  • T14 CI core — baked foundation-ci image (containers/ci-image/Dockerfile, VERSIONS IMAGE_CI) with the full toolchain; built on the VM, used locally by the runner. runner.ts now writes an act_runner config.yaml (container.network=foundation-net, force_pull=false). .forgejo/workflows/ci.yml (preflight tools+versions, typecheck tsc --noEmit) runs GREEN on the runner. Scripts take PULUMI_CONFIG_PASSPHRASE from env (CI) falling back to pass.
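
The settle+retry unseal loop from T13 finding (b) follows a common pattern: wait once for the asynchronous re-seal to finish, then retry the unseal until it sticks. The sketch below illustrates that pattern only. The actual loop lives in the DR scripts and is not reproduced here; `unsealWithRetry`, `RetryOptions`, and the `tryUnseal` callback are hypothetical names, and the timings are left to the caller.

```typescript
// Outcome of one unseal attempt. "sealed" means Vault is still (or again) sealed.
type UnsealResult = "unsealed" | "sealed";

interface RetryOptions {
  settleMs: number;    // initial wait so the asynchronous re-seal can complete
  retryMs: number;     // pause between attempts
  maxAttempts: number; // upper bound on attempts before giving up
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Waits for the settle period, then calls tryUnseal until it reports "unsealed"
// or maxAttempts is exhausted. A thrown error counts as a failed attempt.
// Resolves with the number of attempts used.
async function unsealWithRetry(
  tryUnseal: () => Promise<UnsealResult>,
  opts: RetryOptions,
): Promise<number> {
  await sleep(opts.settleMs);
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const result = await tryUnseal().catch((): UnsealResult => "sealed");
    if (result === "unsealed") return attempt;
    if (attempt < opts.maxAttempts) await sleep(opts.retryMs);
  }
  throw new Error(`Vault still sealed after ${opts.maxAttempts} attempts`);
}
```

Passing the unseal step in as a callback keeps the loop testable without a live Vault; in a real script the callback would wrap whatever command performs the unseal and checks the seal status.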

Current state

  • Repo ~/work/olsitec-foundation/foundation, branch master, origin = Forgejo. Working tree clean except the operator's untracked documentation/999_testing.md (the acceptance-test plan for the ecosystem CI — see Next steps).
  • cd bootstrap && ./run.sh up is idempotent. 7 services (no new containers; the runner was reconfigured). https://forge.olsitec.net returns HTTP 200, clone works, CI green.
  • Master passphrase: pass olsitec-foundation/PULUMI_CONFIG_PASSPHRASE. VM key ~/.ssh/foundation-test_ed25519. Forge admin: platform-admin / Vault foundation/forgejo/service-credentials:forgejoAdminPassword.
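
The passphrase precedence noted under T14 (environment variable in CI, falling back to pass locally) can be illustrated as below. This is a TypeScript sketch of the lookup order only, not the repo's scripts; `resolvePassphrase` and `readFromPass` are hypothetical names, and treating an empty environment value as unset is an assumption. The pass entry name is the one listed above.

```typescript
import { execFileSync } from "node:child_process";

// pass entry name as recorded in this session log.
const PASS_ENTRY = "olsitec-foundation/PULUMI_CONFIG_PASSPHRASE";

// Reads the first line of a pass(1) entry via `pass show <entry>`.
function readFromPass(entry: string): string {
  const out = execFileSync("pass", ["show", entry], { encoding: "utf8" });
  return out.split("\n")[0] ?? "";
}

// Prefers PULUMI_CONFIG_PASSPHRASE from the environment (the CI case) and
// only consults the fallback (pass, by default) when it is unset or empty.
function resolvePassphrase(
  env: Record<string, string | undefined> = process.env,
  fallback: (entry: string) => string = readFromPass,
): string {
  const fromEnv = env.PULUMI_CONFIG_PASSPHRASE;
  if (fromEnv !== undefined && fromEnv !== "") return fromEnv;
  const fromPass = fallback(PASS_ENTRY);
  if (fromPass === "") throw new Error("no passphrase in environment or pass");
  return fromPass;
}
```

Injecting both the environment and the fallback keeps the function usable in CI, where pass is not expected to be installed, and makes the precedence easy to test.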

Known gaps / next steps

  • T14 remainder (state-dependent CI): pulumi preview + backup-verify (weekly) workflows. BLOCKER: bootstrap/state/ is gitignored, so a CI checkout has no stack state. Needs (a) a state fetch from RustFS in-job (the bundle already carries pulumi-state.json; or push a dedicated pulumi stack export to RustFS on each up), and (b) Forgejo Actions secrets: PULUMI_CONFIG_PASSPHRASE, the SSH key, RustFS/offsite creds. Then runs-on: docker + container: foundation-ci:latest.
  • Ecosystem CI (the 999_testing.md plan) — reusable Forgejo workflows (chosen architecture) for: docker build (±npm deps), npm + bun package builds, semantic-release bump tests (1.0.0→feat→fix→!→BREAKING CHANGE), eslint + yamllint gating. Candidates: seaspots-homepage, olsicrypto, document-engine, olsitrack2/api, token-service. Add shellcheck/eslint/yamllint/semantic-release to the CI image or a sibling image.
  • T15: index.ts orchestration polish + Gate A/B comments + docs/DAY-ZERO-TIMELINE.md.
  • Hardening — pin floating refs (IMAGE_REGISTRY=…PIN_DIGEST, IMAGE_RUSTFS tag latest, IMAGE_CI tag-only); fence the runner to a separate privileged VM (R5; it still has the host docker socket); register in Olsitec MCP (D6); Stage-2 publish packages/pulumi-*. Also: VM sshd throttles bursts of docker-over-SSH (refresh) — serialize (--parallel) or raise MaxStartups before refresh-in-CI.
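
For the semantic-release bump tests in the ecosystem CI item (1.0.0→feat→fix→!→BREAKING CHANGE), the expected version sequence can be worked out with a small model of the Conventional Commits bump rules. This is a sketch for deriving expected values in a test plan, not semantic-release's own analyzer, whose behaviour depends on the configured preset; `classify` and `nextVersion` are hypothetical names.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Classifies one commit message using the Conventional Commits rules:
// "!" before the colon or a BREAKING CHANGE footer → major, feat → minor, fix → patch.
function classify(message: string): Bump {
  const header = message.split("\n")[0] ?? "";
  const match = /^(\w+)(\([^)]*\))?(!)?: /.exec(header);
  const breakingFooter = /^BREAKING[ -]CHANGE: /m.test(message);
  if (breakingFooter || (match !== null && match[3] === "!")) return "major";
  if (match === null) return "none";
  if (match[1] === "feat") return "minor";
  if (match[1] === "fix") return "patch";
  return "none";
}

// Applies the bump implied by one commit message to a plain MAJOR.MINOR.PATCH version.
function nextVersion(version: string, message: string): string {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (m === null) throw new Error(`not a plain MAJOR.MINOR.PATCH version: ${version}`);
  const major = Number(m[1]);
  const minor = Number(m[2]);
  const patch = Number(m[3]);
  switch (classify(message)) {
    case "major": return `${major + 1}.0.0`;
    case "minor": return `${major}.${minor + 1}.0`;
    case "patch": return `${major}.${minor}.${patch + 1}`;
    default: return version;
  }
}

// The sequence from the plan, one release per commit (commit subjects are made up).
const commits = [
  "feat: add widget",
  "fix: handle empty input",
  "feat!: drop legacy endpoint",
  "fix: rename flag\n\nBREAKING CHANGE: --old-flag was removed",
];
const versions: string[] = [];
let current = "1.0.0";
for (const message of commits) {
  current = nextVersion(current, message);
  versions.push(current);
}
console.log(versions.join(" → ")); // 1.1.0 → 1.1.1 → 2.0.0 → 3.0.0
```

Under these rules the plan's sequence yields 1.1.0, 1.1.1, 2.0.0, and 3.0.0; a workflow test would compare the tags that semantic-release actually produces against such a list.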

Operating mode for next session: HIGH-RISK / INFRA (remote VM, Docker, secrets).