# PLAN-002: `olsitec-foundation` Implementation Strategy (Master Roadmap)

> **Companion to** [PLAN-001-forgejo.md](PLAN-001-forgejo.md) (the vision) and
> [002_platform_architecture.md](002_platform_architecture.md) (the existing olsicloud4 K8s platform).
> **Status:** Draft for human ratification. **Mode at authoring:** EXPLORE (design only, no code).
> **Author role:** Lead platform architect. **Date:** 2026-06-30.
>
> This document is **not** an implementation. It is the strategy that AI agents execute.
> Confidence markers (High/Medium/Low) follow baseline PD-5.

---

## 0. The Pivotal Decision (read this first)

**PLAN-001 deploys Forgejo *onto Kubernetes* via ArgoCD + Helm. The foundation must NOT.**

The foundation is the **egg**: the thing every other platform is hatched from. Kubernetes, ArgoCD, Helm, cert-manager and ESO are themselves *hatched* by the platform, so the foundation cannot depend on them without creating an unrecoverable circular dependency (DR-from-nothing would require rebuilding K8s, which needs git+registry+secrets, which *are* the foundation).

### Recommendation: a layered platform (High confidence)

| Layer | What | Substrate | Lifecycle |
| ----- | ---- | --------- | --------- |
| **Layer 0: `olsitec-foundation` (the egg)** | Forgejo (+ Actions + OCI/npm registry), PostgreSQL, Vault, RustFS, reverse proxy, 1 runner | **Plain OCI containers on ONE VM**, orchestrated by Pulumi `@pulumi/docker` over SSH. **No K8s/ArgoCD/Helm.** | `pulumi up` (manual day-zero → CI later) |
| **Layer 1+: the olsicloud4 K8s platform & everything else** | K8s, ArgoCD, cert-manager, ESO, Authentik, Grafana/Prometheus, Longhorn, Renovate, additional registries | Kubernetes | **Consumes** Layer 0: repos in foundation-Forgejo, CI in foundation-Actions, images/charts in foundation-registry, secrets in foundation-Vault |

**Why this is correct and not a downgrade:**

- The existing repo *already* contains `pulumi/modules/docker/` (a `@pulumi/docker` SSH-to-host wrapper) and `pulumi/olsitec-core/run.sh` (Pulumi-initializes-Vault-then-captures-unseal-keys-back-into-passphrase-encrypted-config). The tooling is already pointed at this model. (High confidence; verified in source.)
- PLAN-001's K8s topology remains valid as a **future, optional HA path** for Forgejo (its "True HA is a step change" note). It is not thrown away; it is deferred to §8.

**Consequence:** Everywhere PLAN-001 says "StatefulSet / Helm / ArgoCD Application," Layer 0 reads "container + named volume / Pulumi `docker.Container` / Pulumi resource." The *data & state model* of PLAN-001 (git repos on a POSIX FS, Postgres, S3 for blobs) is unchanged and fully reused.

---

## 1. Architecture Review

### 1.1 Validated strengths of the vision

- **Forgejo as one binary** (forge + CI + OCI + npm + 20 registries) genuinely collapses GitLab's 4–5 services into one. (High; confirmed in PLAN-001.)
- **Single master passphrase as the only external secret** is achievable and already proven by `olsitec-core` (`PULUMI_CONFIG_PASSPHRASE` passphrase provider). (High)
- **Pulumi-owns-credentials / Vault-distributes** (ADR-002) is the right steady-state. (High)
- **Boring tech**: Postgres, Vault, S3, a reverse proxy, Docker containers. All well-understood.
  (High)

### 1.2 Weaknesses / risks identified

| # | Risk | Severity | Mitigation (see section) |
|---|------|----------|--------------------------|
| R1 | **Single VM = single point of failure.** Forgejo is irreducibly stateful (git repos on FS). | High | Frequent backups to RustFS + **offsite**; DR rebuild ≤ 1h, tested (§6). HA is Layer-1 future (§8). |
| R2 | **Vault auto-unseal paradox**: an unattended reboot leaves Vault sealed; auto-unseal normally needs an external KMS (a SaaS or a second Vault). | High | Shamir unseal; keys held in passphrase-encrypted Pulumi config; passphrase-gated unseal helper (§4, §9). |
| R3 | **RustFS maturity.** RustFS is a young MinIO-compatible S3. Foundation backups depend on it. | Medium | Keep S3 usage to the documented S3 API surface; **never** make RustFS the *only* copy of backups (the offsite replica must not be RustFS). Treat RustFS as replaceable behind the S3 boundary. (Medium confidence on RustFS stability; flag for second-opinion.) |
| R4 | **Pulumi state location before infra exists** (chicken-egg). | Medium | Local file backend during bootstrap → migrate to RustFS S3 backend after; state backed up offsite (§5, §9). |
| R5 | **Privileged runner.** Forgejo Actions docker backend needs a privileged daemon. | Medium | Runner on a **throwaway sidecar VM** (or same VM, contained), never sharing the forge's trust boundary (§4a of PLAN-001 reused). |
| R6 | **DinD/runner pulls from Docker Hub** → rate limits + SaaS dependency for CI base images. | Medium | Pull-through cache → mirror critical images into Forgejo's own OCI registry; pin by digest (§7, §8). |
| R7 | **TLS day-zero**: ACME needs DNS resolving + reachability before the service is public. | Medium | DNS-01 via existing Cloudflare token (already in platform) OR reverse-proxy internal CA for day-zero, swap to real certs once DNS resolves (§9.4). |
| R8 | **Backup encryption keys / offsite creds** become a *second* must-survive secret. | Medium | Fold offsite + backup credentials into the same passphrase-encrypted config / Vault; never a bare file (§4, §6). |
| R9 | **Forgejo Actions feature-completeness vs GitLab CI** for existing pipelines (Kaniko, semantic-release, helm push). | Low | PLAN-001 already mapped every job → `runs-on: docker`. Reuse that mapping. (High) |

### 1.3 Hidden dependencies to make explicit

- **DNS** must resolve `forge.olsitec.de` (and friends) to the VM **before** TLS and **before** self-hosting handover. Who owns the zone? (Cloudflare, per existing platform.) → §9.4 Networking.
- **An SSH key** trusted by the VM is needed for Pulumi's Docker-over-SSH provider. That key's trust is a day-zero identity question (§9.3 Identity).
- **Container images** are an external dependency until mirrored. Pin by **digest** for determinism (§3, `VERSIONS`).
- **The operator workstation** is an implicit trusted host for the very first `pulumi up`. Its toolchain must be validated (preflight, §2).

### 1.4 Suggested additions / changes to the component list

- **Add a reverse proxy with automatic TLS** → recommend **Caddy** (auto-ACME, ~10-line Caddyfile, internal-CA fallback). Alternatives: Traefik, or nginx if maximum-boring is required, at the cost of auto-TLS ergonomics. (Medium; Caddy recommended.)
- **Add a Docker Hub pull-through cache** (`registry:2`) at Layer 0 from day one (PLAN-001 component #6); it removes a SaaS rate-limit dependency for CI. (Medium)
- **Defer Valkey/Redis**: single-replica Forgejo needs no external queue/cache (PLAN-001 confirms). Add only with HA. (High)
- **Defer Meilisearch**: search is not foundational. (High)
- **Keep `@pulumi/random` for all credential generation** (reuse existing pattern). (High)
- **Vault PKI engine** becomes the internal CA in §8 (replacing Caddy's bootstrap internal CA).

---

## 2. Bootstrapping Strategy (empty VM → operational)

Phases are deployed by **one Pulumi project** with explicit ordering (component dependencies + a small number of phase gates). See §5 for the dependency graph and §9 for the full timeline.

```
Phase 0  PROVISION    Bare VM (Hetzner) + cloud-init: docker engine, ssh key, firewall.
Phase 1  PREFLIGHT    Cloned repo validates host+toolchain (pulumi, node/bun, docker, ssh, dns, age/gpg).
Phase 2  STATE+TRUST  Pulumi local file backend; master passphrase set (PULUMI_CONFIG_PASSPHRASE via `pass`).
Phase 3  DATA PLANE   Docker network + PostgreSQL + RustFS (sealed/empty Vault container also started).
Phase 4  VAULT INIT   `vault operator init` → capture root token + unseal keys → write back to
                      passphrase-encrypted Pulumi config (PROVEN pattern, olsitec-core/run.sh) → unseal.
Phase 5  CREDENTIALS  @pulumi/random generates all service creds → written to Vault KV v2 →
                      RustFS buckets created → Postgres roles/DBs created.
Phase 6  FORGE        Reverse proxy + Forgejo (app.ini rendered with secrets from Vault/Pulumi) come up;
                      Forgejo install-lock + first admin created deterministically.
Phase 7  HANDOVER     Push the foundation repo INTO Forgejo; switch git origin; create org + mirror
                      infra repos; register first Actions runner (token from Vault).
Phase 8  CI HANDOFF   A `.forgejo/workflows/` pipeline runs `pulumi preview` (then `up` on approval).
Phase 9  BACKUP+DR    First backup taken (forgejo dump + pg_dump + vault snapshot + pulumi state)
                      → RustFS → offsite. DR rebuild rehearsed on a fresh VM.
```

**Phase gates (only where strictly required):**

- Gate A after Phase 4: Vault must be initialized+unsealed before Phase 5 writes secrets.
- Gate B after Phase 6: Forgejo must be healthy before Phase 7 handover.

Everything else flows through ordinary Pulumi resource dependencies; no extra gates.

---

## 3. Repository Structure

A **single repo** = the DR unit. `git clone` + master passphrase ⇒ rebuild.
```
olsitec-foundation/
├── README.md                    # 5-line quickstart + DR pointer
├── VERSIONS                     # pinned versions+digests for every image & tool (determinism)
├── preflight/
│   ├── preflight.sh             # validates tools, versions, ssh, dns, docker reachability
│   └── checks/                  # individual check scripts (composable, testable)
├── pulumi/
│   ├── Pulumi.yaml              # single project
│   ├── Pulumi.foundation.yaml   # stack: passphrase-encrypted config + secrets (committable)
│   ├── index.ts                 # phase orchestration entrypoint
│   ├── config.ts                # typed config schema (CONTRACT_001)
│   ├── components/              # one ComponentResource per concern
│   │   ├── network.ts           # docker network, firewall expectations
│   │   ├── postgres.ts
│   │   ├── rustfs.ts            # + bucket provisioning
│   │   ├── vault.ts             # container + init/unseal capture lib
│   │   ├── credentials.ts       # @pulumi/random → Vault writer (CONTRACT_002 paths)
│   │   ├── proxy.ts             # Caddy + TLS strategy
│   │   ├── forgejo.ts           # app.ini render, install-lock, first admin
│   │   └── runner.ts            # act_runner + registration-token flow
│   ├── phases/                  # thin orchestrators: dataPlane(), vaultInit(), forge(), handover()
│   └── lib/                     # vaultInitCapture(), renderTemplate(), digest pinning helpers
├── containers/                  # Dockerfiles for anything we build/mirror ourselves
├── config/                      # rendered template SOURCES: app.ini.tmpl, Caddyfile.tmpl, pg-init.sql
├── backup/
│   ├── backup.sh                # forgejo dump + pg_dump + vault snapshot + pulumi state → RustFS → offsite
│   └── restore.sh               # inverse, parametrized by target host
├── dr/
│   ├── RUNBOOK.md               # human-readable DR procedure
│   └── restore-to-fresh-vm.sh   # automated rebuild used by the DR rehearsal test
├── docs/
│   ├── decisions/               # ADRs (ADR_F001 layered-platform, etc.)
│   ├── DAY-ZERO-TIMELINE.md     # §9 timeline as an executable checklist
│   └── contracts/               # CONTRACT_001..004 (§10)
├── .forgejo/workflows/          # CI: preflight.yml, pulumi-preview.yml, pulumi-up.yml, backup-verify.yml
└── .gitignore                   # state/ (local backend), node_modules, *.local
```

**Why this layout (High confidence):**

- **One repo = one DR unit.** Vision requirement: "freshly cloned repo capable of pre-flight validation."
- **`components/` mirror the deployment order** so an agent can own one file with a clear contract.
- **`config/` holds template *sources*, never rendered secrets**: rendered output carries secrets and stays in container/Vault only (PD-2: don't version secrets).
- **`VERSIONS` centralizes determinism**: preflight and CI both read it; upgrades are a one-line diff.
- **`.forgejo/workflows/` co-located** so the repo that defines CI is the repo CI deploys (self-hosting).

---

## 4. Secret Management

### 4.1 Root of trust

**The master passphrase** (`PULUMI_CONFIG_PASSPHRASE`) is the single root. It selects Pulumi's `passphrase` secrets provider (already in use: `encryptionsalt` in `Pulumi.olsitec-core.yaml`). Chain of trust:

```
Master passphrase
 └─ decrypts Pulumi stack config secrets (committable, `secure: v1:…`)
     └─ which hold Vault unseal keys + root token (captured at init)
         └─ Vault becomes the runtime distribution layer for ALL other secrets (ADR-002)
```

The passphrase is the **only** thing a human must carry out-of-band. Store it in `pass` (operator side), and/or split it among operators with Shamir, and/or keep it on a hardware token. It is never written to the platform.
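
The chain of trust above rests on one property: the passphrase alone is enough to recover the captured Vault material, and nothing recovers it without the passphrase. The following is a minimal sketch of that property using Node's built-in `node:crypto` (PBKDF2 key derivation plus AES-256-GCM). The `SealedSecret` shape, the function names and the iteration count are hypothetical choices for illustration; this is deliberately **not** Pulumi's actual `secure: v1:…` encoding.

```typescript
import { createCipheriv, createDecipheriv, pbkdf2Sync, randomBytes } from "node:crypto";

// Hypothetical on-disk shape of one encrypted config value (all fields base64).
interface SealedSecret {
  salt: string;
  iv: string;
  tag: string;
  data: string;
}

// Derive a 256-bit key from the passphrase. The salt is public and stored
// next to the ciphertext; only the passphrase stays out-of-band.
function deriveKey(passphrase: string, salt: Buffer): Buffer {
  return pbkdf2Sync(passphrase, salt, 100_000, 32, "sha256");
}

// Encrypt a plaintext value so that it is safe to commit.
function encryptSecret(plaintext: string, passphrase: string): SealedSecret {
  const salt = randomBytes(16);
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", deriveKey(passphrase, salt), iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    salt: salt.toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    data: data.toString("base64"),
  };
}

// Decrypt a committed value. A wrong passphrase fails the GCM authentication
// check, so final() throws instead of returning garbage.
function decryptSecret(sealed: SealedSecret, passphrase: string): string {
  const key = deriveKey(passphrase, Buffer.from(sealed.salt, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(sealed.iv, "base64"));
  decipher.setAuthTag(Buffer.from(sealed.tag, "base64"));
  const plain = Buffer.concat([
    decipher.update(Buffer.from(sealed.data, "base64")),
    decipher.final(),
  ]);
  return plain.toString("utf8");
}

// Stand-in values only: a fake passphrase and fake Vault init output.
const examplePassphrase = "correct horse battery staple";
const vaultInitOutput = JSON.stringify({ unsealKeys: ["k1", "k2", "k3"], rootToken: "example" });
const committed = encryptSecret(vaultInitOutput, examplePassphrase);
console.log(decryptSecret(committed, examplePassphrase) === vaultInitOutput);
```

Because the salt, IV and tag are not secret, the whole `SealedSecret` can live in the repo, which is what makes {repo + passphrase} a sufficient recovery set in this model.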
### 4.2 Credential generation: deterministic vs random

| Class | Examples | Source | Rationale |
|-------|----------|--------|-----------|
| **Random / high-entropy** | all service passwords, Postgres pw, RustFS access/secret keys, Forgejo `SECRET_KEY` + `INTERNAL_TOKEN` + JWT secrets, OCI/npm registry tokens, runner registration token, Forgejo admin password | `@pulumi/random` → Vault KV v2 | secrets must be unguessable; rotation = `--replace` |
| **Derived / deterministic** | usernames, DB names, bucket names, container/DNS names, Vault mount layout, hostnames | computed from typed config | reproducible, non-secret, no entropy needed |
| **External (the ONLY one)** | master passphrase | human | root of trust |

This satisfies the vision: *"everything else should derive from that."*

### 4.3 Vault initialization & unseal (the hard part; High attention)

- **Init:** Pulumi runs `vault operator init` (Shamir, e.g. 5 keys / threshold 3) inside the Vault container, captures `unsealKeys` + `rootToken` as **stack outputs**, then `run.sh` (or a Pulumi `local.Command`) writes them back as passphrase-encrypted Pulumi config secrets. **This exact pattern already exists in `olsitec-core/run.sh`**; reuse it verbatim. (High confidence.)
- **Unseal on reboot (R2):** Vault seals on every restart. Options:
  1. **Passphrase-gated unseal helper** *(recommended)*: a small script reads the unseal keys from Pulumi config (decrypted by the passphrase the operator provides) and unseals. Deterministic, reproducible, **no external KMS, no SaaS**. Cost: VM reboots need an operator (or a passphrase made available to a boot service, a trade-off to decide).
  2. **Transit auto-unseal**: rejected at Layer 0 (needs a *second* Vault → circular).
  3. **KMS auto-unseal**: rejected (SaaS dependency, violates design goal).

  → Recommend (1) for Layer 0; revisit auto-unseal when a second trust anchor exists at Layer 1.
  (Medium confidence: this is the main open operational question; flag for second-opinion.)

### 4.4 Rotation

Per ADR-002: `pulumi up --replace` on the `RandomPassword` → new value in Vault → consumers reload. At Layer 0, consumers are containers, so rotation triggers a container recreate (Pulumi handles the dependency). Vault root token: rotate via `vault operator generate-root` after bootstrap; store new token in config. Unseal-key rotation: `vault operator rekey`.

### 4.5 Recovery & backup of secrets

- **Vault data** backed up via `vault operator raft snapshot save` → RustFS → offsite.
- **Unseal keys + root token** survive inside the passphrase-encrypted Pulumi stack config, which is in the repo (and the repo is backed up). So {repo + passphrase} reconstitutes Vault access.
- **Backup/offsite credentials (R8)** live in Vault *and* are mirrored into the passphrase-encrypted config, so they survive even total Vault loss.

---

## 5. Deployment Order & Dependency Graph

```
                 ┌─────────────────────────┐
                 │ master passphrase (ext) │
                 └───────────┬─────────────┘
                             │ selects secrets provider
                      ┌──────▼───────┐
                      │ Pulumi state │ (local file backend → later RustFS S3)
                      └──────┬───────┘
                             │
      ┌──────────────┬───────┴──┬───────────┬──────────────┐
      ▼              ▼          ▼           ▼              ▼
docker network   PostgreSQL   RustFS   Vault(sealed)  Caddy(proxy)
      │              │          │           │              │
      │              │          │  [Gate A: init+unseal]   │
      │              │          │           │              │
      │              │          │    Vault(unsealed)       │
      │              │          │           │              │
      │              └──────────┴─────┐ credentials.ts     │
      │                               ▼ (@pulumi/random→Vault)
      │              Postgres roles/DBs, RustFS buckets created
      │                               │
      └───────────────────────────────┼──────────────────────┐
                                      ▼                      │
                       Forgejo (app.ini ← Vault) ◄───────────┘ (proxy routes to it)
                                      │
                              [Gate B: healthy]
                                      ▼
                          first admin + org + repos
                                      │
                                      ▼
                          act_runner (token ← Vault)
                                      │
                                      ▼
                            CI assumes deploy duty
```

### 5.1 What depends on what

- **Everything** depends on Pulumi state + passphrase.
- **credentials.ts** depends on Vault being unsealed (Gate A) and on Postgres/RustFS existing (to create roles/buckets with the generated creds).
- **Forgejo** depends on Postgres (DB), RustFS (blob storage), Vault (secrets), proxy (TLS/ingress).
- **Runner** depends on Forgejo (registration token) and on the proxy (to reach Forgejo).
- **CI** depends on the runner.

### 5.2 Circular dependencies & resolutions (summary; full list §9)

| Cycle | Resolution |
|-------|-----------|
| Pulumi needs a secret store; Vault is that store; Vault is deployed by Pulumi | Passphrase-encrypted config holds unseal keys at bootstrap; Vault holds the rest in steady state. |
| Forgejo hosts the repo that deploys Forgejo | Deploy Forgejo from the **local clone** first; then push repo in + switch origin (handover). |
| CI deploys the platform; CI runs on the platform | First `pulumi up` is **manual**; CI takes over only after the runner exists and a self-rebuild is proven. |
| Registry hosts CI base images; CI fills the registry | Pull from upstream via pull-through cache day-zero; mirror into Forgejo registry afterward. |
| TLS needs DNS+ACME; ACME account must be created | DNS-01 via existing Cloudflare token, or internal CA day-zero; real certs once DNS resolves. |

---

## 6. Disaster Recovery (total VM loss)

**Premise:** survive on {a VM, the repo, the master passphrase}.

### 6.1 What must exist to recover

1. **The repo** (git clone; mirrored offsite, see below).
2. **The master passphrase** (operator's head / `pass` / Shamir split).
3. **The latest backup bundle** in the **offsite** location: `forgejo-dump.zip`, `pg_dump.sql`, `vault-raft.snap`, `pulumi-state` (if not reconstructible), `rustfs-data` (unless the offsite location is itself the replica of that data).

### 6.2 Procedure (target ≤ 1 hour; `dr/restore-to-fresh-vm.sh` automates most)

1. Provision a fresh VM (Phase 0 cloud-init).
2. `git clone` foundation repo; run `preflight/`.
3. Set `PULUMI_CONFIG_PASSPHRASE`; `pulumi login` (local backend) or restore state from offsite.
4. `pulumi up` Phases 3–4: data plane + Vault container.
   **Restore Vault** from raft snapshot (`vault operator raft snapshot restore`); unseal with keys from config.
5. **Restore Postgres** (`psql` for the plain-SQL `pg_dump.sql` in the bundle; `pg_restore` applies only to custom-format dumps) and **RustFS data** (sync from offsite) before starting Forgejo.
6. `pulumi up` Phase 6: Forgejo against restored DB + restored data dir (git repos).
7. Re-register the runner (new token); runners are stateless, never restored.
8. Validate: clone a repo, run a pipeline, push an image, read a Vault secret.

### 6.3 What is recreatable and **not** backed up

- Container images (re-pullable / rebuildable from pinned digests).
- Search indexes (Forgejo rebuilds).
- Caches, runner ephemeral state, pull-through cache contents.
- Pulumi state *if* the local backend is reconstructible, but back it up anyway (cheap insurance).

### 6.4 Offsite requirement (critical)

RustFS lives on the same VM → it cannot be the only backup copy (R3). Replicate the backup bundle to a **second location with a different failure domain** that is **not SaaS by hard dependency**: recommend a second small Hetzner VM / Storage Box in another DC, or a second self-hosted RustFS. (If a SaaS S3 is used, it must be *additive*, never the sole copy, which preserves the no-SaaS guarantee.)

---

## 7. Operational Lifecycle

### 7.1 Upgrades

- Bump the pinned **digest** in `VERSIONS` → PR → CI `pulumi preview` posts the plan → human approves → `pulumi up` run by CI (or manually for Vault/Postgres major versions).
- **Snapshot before** every Forgejo/Postgres/Vault upgrade (PD-4): take a backup bundle first.
- Sequence: never upgrade Postgres and Forgejo in the same change; Vault upgrades are isolated.

### 7.2 Backups

- **`backup/backup.sh` on a timer** (systemd timer or Forgejo Actions scheduled workflow): `forgejo dump` (repos+metadata) + `pg_dump` + `vault operator raft snapshot save` + `pulumi stack export` → RustFS bucket `foundation-backups` → replicate offsite.
- **Verify** restorability weekly (`.forgejo/workflows/backup-verify.yml` restores into a scratch container and asserts row counts / repo presence). A backup that has never been restored is a guess.
- First backup is part of bootstrap (Phase 9), **before** declaring the platform operational.

### 7.3 Monitoring & alerting

- **Bootstrap → minimal:** container healthchecks + an external uptime probe (offsite).
- **Layer 1:** Prometheus/Grafana (on K8s) scrape the foundation node-exporter + Forgejo `/metrics`.
- **Alerting trust rule:** the alerter must **not** run on the only host it watches. Put uptime/alert offsite so a dead VM can still page. (High confidence; common self-hosting footgun.)

### 7.4 Maintenance & the self-hosting milestone

- **Self-hosting is reached when** (all true): foundation repo lives in Forgejo; CI can `pulumi up` the foundation; a DR rebuild has succeeded end-to-end from offsite backup.
- After that, **all changes flow through Git + CI**, with manual `pulumi up` reserved as the documented break-glass for Layer-0-breaking changes (e.g., Vault/Postgres major upgrades).

---

## 8. Future Expansion (how Layer 1+ integrates)

Every future service integrates through the **same four foundation interfaces**, never bypassing them: **(1) source repo in Forgejo, (2) images/charts in Forgejo's OCI registry, (3) secrets in Vault, (4) CI in Forgejo Actions.** This keeps the egg the single root for everything.

| Service | Integration path | Notes |
|---------|------------------|-------|
| **Kubernetes** | Provisioned by a *new* Pulumi project whose repo lives in foundation-Forgejo; pulls images from foundation-registry; secrets from foundation-Vault. | This is where the **existing olsicloud4 K8s platform** reconnects, as a Layer-1 consumer. |
| **ArgoCD** | Deployed on K8s; its app repos are Forgejo repos; bootstrap secret (git token) from Vault. | Replaces gitlab.com source in `002_platform_architecture.md` with Forgejo. |
| **Internal PKI** | **Vault PKI secrets engine** becomes the org CA, replacing Caddy's bootstrap internal CA. cert-manager (Layer 1) uses the Vault issuer. | Promotes day-zero self-signed → real internal trust. |
| **Authentik (SSO/OIDC)** | Deployed at Layer 1; Forgejo, Grafana, ArgoCD become OIDC clients. Introduce SSO **after** the platform is stable, not day-zero (avoid an identity dependency in the egg). | Forgejo can also *be* a temporary OIDC provider before Authentik exists. |
| **Grafana / Prometheus** | Layer 1; scrape foundation + cluster; dashboards-as-code in Forgejo. | §7.3. |
| **Longhorn** | Layer-1 storage for stateful K8s workloads; **not** used by Layer 0 (Layer 0 uses host volumes). | Keeps the egg storage-simple. |
| **Renovate** | Self-hosted runner job in Forgejo Actions; opens PRs against `VERSIONS` and chart repos. | Automates §7.1 digest bumps. |
| **Additional registries** | Forgejo's bundled registries cover OCI/npm/Helm/+20; add Harbor only if policy/scanning demands it. | Prefer not adding parts. |

**Migration note:** the existing platform's gitlab.com dependency (git + OCI registry at `registry.gitlab.com/olsitec-nci/charts`, ADR-002 paths under `olsicloud4/...`) is **retired** by pointing those repos/registries at foundation-Forgejo. That migration is its own plan, gated on the foundation being proven.

---

## 9. Bootstrap Paradoxes & Day-Zero Analysis

For each: *why it exists · what depends on what · automatable? · solution · deterministic?*

### 9.1 Infrastructure

- **First VM provisioning.** Paradox: Pulumi provisions infra, but the VM hosts Pulumi's target. → A **thin separate Hetzner Pulumi project** (already exists: `pulumi/hetzner-cloud`) or one cloud-init creates the VM + installs Docker + plants the operator SSH key. Automatable: **yes**. Deterministic: yes (image + cloud-init pinned). The VM is the one piece provisioned *before* the foundation Pulumi runs.
- **Pulumi's first credentials.** It needs (a) SSH to the VM, (b) the master passphrase. SSH key is the day-zero identity (§9.3); passphrase is the root of trust (§4). No other credential needed; everything else is generated. Deterministic: yes.
- **Pulumi state before infra exists (R4).** → **Local file backend** on the operator machine during bootstrap; migrate to RustFS S3 backend after Phase 5; back up state offsite. Automatable: yes. Deterministic: yes (state is data, not derived, so it is *backed up*, not regenerated).
- **First clone of the repo.** Before Forgejo exists the repo lives… somewhere external (operator workstation + an offsite git mirror, e.g. a bare repo on the backup host, or temporarily gitlab.com during migration). After handover, Forgejo is canonical. Automatable: partially (the very first clone is operator action). Deterministic: yes (content-addressed git).
- **Binary installation.** `preflight/` checks; a pinned installer script fetches exact versions from `VERSIONS`. Automatable: yes. Deterministic: yes (pinned).
- **Host validation.** `preflight/preflight.sh` asserts tool versions, docker reachability, ssh, dns, disk, clock. Fails closed before any deploy. Automatable: yes.

### 9.2 Secrets & Trust

- **Root of trust:** master passphrase (§4.1). **Minimal external secret:** that passphrase, nothing else.
- **Vault init / unseal keys / initial creds:** §4.3, the proven `olsitec-core` capture pattern.
- **Deterministic vs random creds:** §4.2.
- **Rotation / recovery after total loss:** §4.4 / §4.5 + §6.

### 9.3 Identity

- **First administrator:** created **non-interactively** by Pulumi via `forgejo admin user create` (container exec) or `FORGEJO__security__INSTALL_LOCK=true` + env, with an admin password from `@pulumi/random` → Vault. No human types a password into a web form. Automatable: **yes**. Deterministic: the *flow* is; the password is random-but-stored. (High confidence; Forgejo supports headless admin creation.)
- **First admin authentication:** operator reads the generated admin password from Vault (passphrase → Vault). No default/weak credential ever exists.
- **First SSH key trusted:** the operator key is planted by cloud-init (Phase 0); this is the irreducible day-zero trust seed. Subsequent keys are managed in Forgejo.
- **Service identities:** each service gets its own Vault path + (later) AppRole, mirroring ADR-002.
- **OIDC/SSO:** introduce at Layer 1 (§8), **not** day-zero; this avoids an identity dependency inside the egg.

### 9.4 Certificates & Networking

- **Initial TLS:** DNS-01 ACME via the **existing Cloudflare token** (already in the platform per `002_platform_architecture.md`), issued by Caddy; this works even before the host is publicly reachable. Fallback: Caddy internal CA for day-zero, swap to real certs once DNS resolves.
- **Internal PKI:** not required day-zero; Vault PKI adopts it at Layer 1 (§8).
- **Cert rotation:** Caddy auto-renews ACME; Vault PKI handles internal rotation later.
- **DNS assumptions:** `forge.olsitec.de` (+ registry/host) **must resolve to the VM before handover**. Owner: Cloudflare zone. This is a hard prerequisite; list it in preflight.
- **Reverse proxy bootstrap:** `Caddyfile` rendered from template by Pulumi; routes web/API/registry on one host; Git-over-SSH exposed directly (port 22/2222) not via the HTTP proxy.

### 9.5 Forgejo

- **First repository / first commit / repo arrival:** the foundation repo is pushed from the local clone into Forgejo at **Phase 7 handover**; origin is switched to Forgejo; this is the self-hosting moment. Automatable: yes (scripted `git remote` + `git push`).
- **First CI runner & registration token:** token generated via `forgejo actions generate-runner-token` (or admin API) → stored in Vault → consumed by `runner.ts`. Automatable: **yes**. Deterministic flow.
- **When CI owns deployments:** only after handover + runner registration + a proven self-`pulumi up`.
Until then, manual `pulumi up` (§5.2, §7.4). ### 9.6 Storage - **Postgres init:** container with generated superuser pw; `pg-init.sql` creates Forgejo role+DB. Automatable: yes. - **RustFS init:** container with generated admin keys; `credentials.ts` creates service keys + buckets (`forgejo-packages`, `forgejo-artifacts`, `forgejo-lfs`, `foundation-backups`). Automatable: yes. - **Bucket creation:** Pulumi (S3 provider against RustFS) — deterministic names. - **Restore order after DR:** Vault → Postgres → RustFS data → **then** Forgejo (§6.2). Git repos (Forgejo data dir) are the irreplaceable core; restore before starting Forgejo. - **Recreatable data:** images, indexes, caches (§6.3). ### 9.7 Backups & Recovery - **First backup:** Phase 9, before "operational" is declared. - **Where stored:** RustFS `foundation-backups` + offsite replica (§6.4). - **Backup credential protection:** in Vault + mirrored to passphrase-encrypted config (R8/§4.5). - **Required to recover everything:** repo + passphrase + {forgejo dump, pg_dump, vault snapshot, pulumi state}. **Disposable:** images, indexes, caches, runner state (§6.3). ### 9.8 Operations - **Monitoring enabled:** minimal at bootstrap, full at Layer 1 (§7.3). - **Alerting trusted:** only when it runs offsite (§7.3). - **Upgrades before CI exists:** manual `pulumi up` with a pre-snapshot (§7.1). - **Becomes self-hosting / all-changes-through-CI:** §7.4 milestone. ### 9.9 Chronological Day-Zero Timeline ``` T0 Fresh OS Hetzner VM created (cloud-init: docker, ssh key, firewall, clock sync). T1 First command operator: git clone olsitec-foundation && ./preflight/preflight.sh T2 Trust set export PULUMI_CONFIG_PASSPHRASE (via pass); pulumi login (local file backend). T3 Infra deploy pulumi up → docker network + Postgres + RustFS + Vault(sealed) + Caddy. T4 Secret init vault operator init → capture keys → write to passphrase-encrypted config → unseal. 
T5   Credentials      @pulumi/random → Vault; Postgres roles/DBs; RustFS keys+buckets.
T6   Services init    Forgejo up (app.ini ← secrets); headless first admin created.
T7   Operational      Web/API/registry reachable over TLS; admin password readable from Vault.
T8   Self-hosting     push foundation repo → Forgejo; switch origin; create org; register runner.
T9   First CI deploy  .forgejo/workflows runs pulumi preview → (approve) → up. CI now owns changes.
T10  Backup           backup.sh → RustFS → offsite. (first bundle)
T11  DR validated     restore-to-fresh-VM.sh rebuilds on a clean VM from offsite backup; smoke tests pass.
```

Goal achieved: **every step T1–T11 is scripted**; the only human actions are providing the passphrase and approving the first CI deploy. No undocumented manual step remains.

---

## 10. AI Execution Plan

Work is split into low-coupling tasks. **Contracts are written first** (baseline §9) so tasks parallelize without inventing incompatible interfaces. Each task produces a reviewable commit with explicit acceptance criteria and a conventional-commit subject.

### 10.0 Contracts (write before implementation tasks)

| Contract | Defines | Consumed by |
|----------|---------|-------------|
| **CONTRACT_001 — Config schema** | typed Pulumi config keys (hostnames, versions, sizes, feature flags) | every component |
| **CONTRACT_002 — Vault path layout** | `foundation//-credentials` keys (camelCase, ADR-002 style) | credentials, forgejo, runner, backup |
| **CONTRACT_003 — Container network & DNS names** | network name, container names, internal ports | network, all services, proxy |
| **CONTRACT_004 — Backup artifact format** | bundle filenames, layout, restore order | backup, dr, backup-verify |

### 10.1 Tasks

| ID | Task | Depends on | Parallel? | Acceptance criteria |
|----|------|-----------|-----------|---------------------|
| **T00** | Contracts CONTRACT_001–004 + ADR_F001 (layered platform) | — | — | 4 contract docs + ADR committed; reviewed by human. |
| **T01** | Repo scaffold + `preflight/` + `VERSIONS` | T00 | yes | `preflight.sh` exits non-zero on any missing/mismatched tool; passes on a prepared host. |
| **T02** | Pulumi project skeleton + passphrase backend + `config.ts` (CONTRACT_001) | T00 | yes | `pulumi preview` runs with empty stack; config schema typed; secrets provider = passphrase. |
| **T03** | `network.ts` + `postgres.ts` | T02, CONTRACT_003 | yes | Postgres container up via `@pulumi/docker`; role+DB created; healthcheck green. |
| **T04** | `rustfs.ts` + bucket provisioning | T02, CONTRACT_002, CONTRACT_003 | yes | RustFS up; 4 buckets created; service key can put/get an object. |
| **T05** | `vault.ts` + `lib/vaultInitCapture` (reuse olsitec-core pattern) | T02 | yes | Vault inits; keys+root captured into encrypted config; unseal helper unseals after restart. |
| **T06** | `credentials.ts` (@pulumi/random → Vault, CONTRACT_002) | T05 | no (needs Vault) | All credential keys present in Vault at correct paths; idempotent on re-run. |
| **T07** | `proxy.ts` (Caddy) + TLS strategy (DNS-01 + internal-CA fallback) | T02, CONTRACT_003 | yes | HTTPS terminates for `forge.*`; cert from Let's Encrypt (or internal CA in dev). |
| **T08** | `forgejo.ts` — app.ini render, install-lock, S3+DB+Vault wiring | T03, T04, T06, T07 | no | Forgejo healthy; uses external Postgres + RustFS; web/API reachable via proxy. |
| **T09** | Forgejo headless first-admin + org + repo bootstrap | T08 | no | Admin created non-interactively; password in Vault; org exists; no default creds. |
| **T10** | `runner.ts` — registration-token flow + act_runner | T08, T09 | no | Runner registers via Vault token; a hello-world workflow runs to success. |
| **T11** | Self-hosting handover script (push repo, switch origin, mirror infra repos) | T09 | no | Foundation repo present in Forgejo; origin switched; `git push` works over SSH. |
| **T12** | `backup/` (backup.sh + restore.sh, CONTRACT_004) | T08 | yes | Bundle written to RustFS + offsite; restore.sh reconstructs into a scratch env. |
| **T13** | `dr/` runbook + `restore-to-fresh-vm.sh` | T12 | no | Automated rebuild on a clean VM passes smoke tests (clone, pipeline, registry push, vault read). |
| **T14** | `.forgejo/workflows/` (preflight, pulumi preview, pulumi up, backup-verify) | T10, T11 | yes | preview workflow posts plan; up workflow gated on approval; backup-verify restores+asserts. |
| **T15** | `index.ts` phase orchestration + Gate A/B + DAY-ZERO checklist | T03–T08 | no | `pulumi up` from empty → operational in one command (modulo passphrase + approval). |

### 10.2 Parallelization map

- **Wave 1 (parallel):** T01, T02 (after T00 contracts).
- **Wave 2 (parallel):** T03, T04, T05, T07 (all depend only on T02 + contracts).
- **Wave 3:** T06 (needs T05) ∥ start T12 design.
- **Wave 4:** T08 (integrates T03/04/06/07).
- **Wave 5:** T09 → T10 → T11 (sequential handover chain) ∥ T12 impl.
- **Wave 6:** T13, T14, T15.

### 10.3 Per-task prompt skeleton (baseline §7.1)

Each agent prompt must carry: Mission · Mode (BUILD or HIGH-RISK/INFRA) · the relevant **CONTRACT_00x** · the component file it owns · Non-goals (don't touch other components, don't edit generated/rendered secrets, don't run `pulumi up` against the real VM without approval) · Acceptance criteria (above) · Escalation (stop if Vault/state/secret behavior diverges from this plan).

---

## Ratified Decisions (2026-06-30)

These four were decided by the human and are now binding (see ADR_004):

1. **Layered platform — RATIFIED.** Layer 0 = bare Docker on one VM via Pulumi; K8s/ArgoCD demoted to a Layer-1 consumer (§0). The whole plan stands on this.
2. **Vault unseal — passphrase-gated helper (§4.3 option 1).** No external KMS, no SaaS. Reboots require the master passphrase to be made available to the unseal step. Auto-unseal stays off until a Layer-1 trust anchor exists.
3. **Object storage — RustFS primary (§4 R3).** RustFS is the Layer-0 S3, matching the existing `rustfs` credential flag. **Hard rule:** the offsite replica is **non-RustFS**, so RustFS is never the only copy of a backup.
4. **Offsite backup — second self-hosted location (§6.4).** Different DC/failure domain, **no SaaS** dependency. Preferred seed: reuse `pulumi/hetzner-cloud` for both the Phase-0 VM and the offsite host.

### Remaining minor (reversible defaults — proceeding unless you object)

- **Reverse proxy:** defaulting to **Caddy** (auto-TLS, internal-CA fallback). Cheap to swap later.
- **Phase-0 VM seed:** defaulting to **`pulumi/hetzner-cloud`** for the foundation VM + the offsite host.

---

## Appendix — Mapping PLAN-001 → this plan

- PLAN-001 "StatefulSet/Helm/ArgoCD" → Layer-0 "container/named-volume/Pulumi resource."
- PLAN-001 data/state model (git on FS, Postgres, S3-for-blobs) → **reused unchanged.**
- PLAN-001 runner mapping (every job `runs-on: docker`, code_quality `dind`) → **reused for task T10 (§10.1).**
- PLAN-001 K8s HA topology → **§8 future HA path**, not bootstrap.
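
As a cross-check on the §10.2 parallelization map, the "Depends on" column of §10.1 can be turned into dependency levels mechanically. The sketch below is illustrative only and is not part of the plan's deliverables: the names `dependsOn` and `earliestWaves` are hypothetical, contract dependencies are left out on the assumption that T00 delivers all four contracts, and T15's "T03–T08" is expanded to the six tasks in that range. It computes the earliest level at which each task could start, which is finer-grained than the six waves above.

```typescript
// Task-level dependencies transcribed from the "Depends on" column of §10.1.
// Assumption: contract dependencies are dropped because T00 delivers every contract.
const dependsOn: Record<string, string[]> = {
  T00: [],
  T01: ["T00"],
  T02: ["T00"],
  T03: ["T02"],
  T04: ["T02"],
  T05: ["T02"],
  T06: ["T05"],
  T07: ["T02"],
  T08: ["T03", "T04", "T06", "T07"],
  T09: ["T08"],
  T10: ["T08", "T09"],
  T11: ["T09"],
  T12: ["T08"],
  T13: ["T12"],
  T14: ["T10", "T11"],
  T15: ["T03", "T04", "T05", "T06", "T07", "T08"],
};

// Places each task one level after its deepest dependency (level 0 = no dependencies).
// Throws on unknown task IDs and on dependency cycles.
function earliestWaves(deps: Record<string, string[]>): string[][] {
  const level = new Map<string, number>();
  const visiting = new Set<string>();

  const resolve = (id: string): number => {
    const known = level.get(id);
    if (known !== undefined) return known;
    if (!(id in deps)) throw new Error(`unknown task ${id}`);
    if (visiting.has(id)) throw new Error(`dependency cycle at ${id}`);
    visiting.add(id);
    let depth = 0;
    for (const dep of deps[id]) {
      depth = Math.max(depth, resolve(dep) + 1);
    }
    visiting.delete(id);
    level.set(id, depth);
    return depth;
  };

  const waves: string[][] = [];
  for (const id of Object.keys(deps)) {
    const depth = resolve(id);
    if (!waves[depth]) waves[depth] = [];
    waves[depth].push(id);
  }
  return waves;
}

const waves = earliestWaves(dependsOn);
waves.forEach((tasks, i) => console.log(`level ${i}: ${tasks.join(", ")}`));
```

Run against the table as written, levels 1 to 4 line up with Waves 1 to 4 (T01, T02; then T03, T04, T05, T07; then T06; then T08). From there the purely dependency-driven view is less conservative than the plan: T09, T12 and T15 all become startable once T08 is done, and T11 only needs T09, whereas §10.2 deliberately serializes T09 → T10 → T11 and holds T15 until Wave 6. The plan's waves therefore encode scheduling choices on top of the hard dependencies, which a check like this makes visible.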