foundation/documentation/planning/PLAN-002-foundation-implementation.md
Andreas Niemann f18676e6b3 chore: scaffold olsitec-foundation mono-repo
Repo topology, baseline overlay, planning docs (PLAN-001/002), ADR-004/005,
and the bootstrap/packages/documentation skeleton. Implementation (T00+) not started.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 17:10:46 +02:00


# PLAN-002 — `olsitec-foundation` Implementation Strategy (Master Roadmap)
> **Companion to** [PLAN-001-forgejo.md](PLAN-001-forgejo.md) (the vision) and
> [002_platform_architecture.md](002_platform_architecture.md) (the existing olsicloud4 K8s platform).
> **Status:** Draft for human ratification. **Mode at authoring:** EXPLORE (design only, no code).
> **Author role:** Lead platform architect. **Date:** 2026-06-30.
>
> This document is **not** an implementation. It is the strategy that AI agents execute.
> Confidence markers (High/Medium/Low) follow baseline PD-5.
---
## 0. The Pivotal Decision (read this first)
**PLAN-001 deploys Forgejo *onto Kubernetes* via ArgoCD + Helm. The foundation must NOT.**
The foundation is the **egg**: the thing every other platform is hatched from. Kubernetes,
ArgoCD, Helm, cert-manager and ESO are themselves *hatched* by the platform — so the foundation
cannot depend on them without creating an unrecoverable circular dependency
(DR-from-nothing would require rebuilding K8s, which needs git+registry+secrets, which *are* the
foundation).
### Recommendation — a layered platform (High confidence)
| Layer | What | Substrate | Lifecycle |
| ----- | ---- | --------- | --------- |
| **Layer 0 — `olsitec-foundation` (the egg)** | Forgejo (+ Actions + OCI/npm registry), PostgreSQL, Vault, RustFS, reverse proxy, 1 runner | **Plain OCI containers on ONE VM**, orchestrated by Pulumi `@pulumi/docker` over SSH. **No K8s/ArgoCD/Helm.** | `pulumi up` (manual day-zero → CI later) |
| **Layer 1+ — the olsicloud4 K8s platform & everything else** | K8s, ArgoCD, cert-manager, ESO, Authentik, Grafana/Prometheus, Longhorn, Renovate, additional registries | Kubernetes | **Consumes** Layer 0: repos in foundation-Forgejo, CI in foundation-Actions, images/charts in foundation-registry, secrets in foundation-Vault |
**Why this is correct and not a downgrade:**
- The existing repo *already* contains `pulumi/modules/docker/` (a `@pulumi/docker` SSH-to-host wrapper) and `pulumi/olsitec-core/run.sh` (Pulumi-initializes-Vault-then-captures-unseal-keys-back-into-passphrase-encrypted-config). The tooling is already pointed at this model. (High confidence — verified in source.)
- PLAN-001's K8s topology remains valid as a **future, optional HA path** for Forgejo (its "True HA is a step change" note). It is not thrown away — it is deferred to §8.
**Consequence:** Everywhere PLAN-001 says "StatefulSet / Helm / ArgoCD Application," Layer 0 reads "container + named volume / Pulumi `docker.Container` / Pulumi resource." The *data & state model* of PLAN-001 (git repos on a POSIX FS, Postgres, S3 for blobs) is unchanged and fully reused.
---
## 1. Architecture Review
### 1.1 Validated strengths of the vision
- **Forgejo as one binary** (forge + CI + OCI + npm + 20 registries) genuinely collapses GitLab's 4-5 services into one. (High) — confirmed in PLAN-001.
- **Single master passphrase as the only external secret** is achievable and already proven by `olsitec-core` (`PULUMI_CONFIG_PASSPHRASE` passphrase provider). (High)
- **Pulumi-owns-credentials / Vault-distributes** (ADR-002) is the right steady-state. (High)
- **Boring tech**: Postgres, Vault, S3, a reverse proxy, Docker containers. All well-understood. (High)
### 1.2 Weaknesses / risks identified
| # | Risk | Severity | Mitigation (see section) |
|---|------|----------|--------------------------|
| R1 | **Single VM = single point of failure.** Forgejo is irreducibly stateful (git repos on FS). | High | Frequent backups to RustFS + **offsite**; DR rebuild ≤ 1h, tested (§6). HA is Layer-1 future (§8). |
| R2 | **Vault auto-unseal paradox** — unattended reboot leaves Vault sealed; auto-unseal normally needs an external KMS (a SaaS or a second Vault). | High | Shamir unseal; keys held in passphrase-encrypted Pulumi config; passphrase-gated unseal helper (§4, §9). |
| R3 | **RustFS maturity.** RustFS is a young MinIO-compatible S3. Foundation backups depend on it. | Medium | Keep S3 usage to the documented S3 API surface; **never** make RustFS the *only* copy of backups (offsite replica is non-S3-only). Treat RustFS as replaceable behind the S3 boundary. (Medium confidence on RustFS stability — flag for second-opinion.) |
| R4 | **Pulumi state location before infra exists** (chicken-egg). | Medium | Local file backend during bootstrap → migrate to RustFS S3 backend after; state backed up offsite (§5, §9). |
| R5 | **Privileged runner.** Forgejo Actions docker backend needs a privileged daemon. | Medium | Runner on a **throwaway sidecar VM** (or same VM, contained), never sharing the forge's trust boundary (§4a of PLAN-001 reused). |
| R6 | **DinD/runner pulls from Docker Hub** → rate limits + SaaS dependency for CI base images. | Medium | Pull-through cache → mirror critical images into Forgejo's own OCI registry; pin by digest (§7, §8). |
| R7 | **TLS day-zero**: ACME needs DNS resolving + reachability before the service is public. | Medium | DNS-01 via existing Cloudflare token (already in platform) OR reverse-proxy internal CA for day-zero, swap to real certs once DNS resolves (§4 certs). |
| R8 | **Backup encryption keys / offsite creds** become a *second* must-survive secret. | Medium | Fold offsite + backup credentials into the same passphrase-encrypted config / Vault; never a bare file (§4, §6). |
| R9 | **Forgejo Actions feature-completeness vs GitLab CI** for existing pipelines (Kaniko, semantic-release, helm push). | Low | PLAN-001 already mapped every job → `runs-on: docker`. Reuse that mapping. (High) |
### 1.3 Hidden dependencies to make explicit
- **DNS** must resolve `forge.olsitec.de` (and friends) to the VM **before** TLS and **before** self-hosting handover. Who owns the zone? (Cloudflare, per existing platform.) → §9 Networking.
- **An SSH key** trusted by the VM is needed for Pulumi's Docker-over-SSH provider. That key's trust is a day-zero identity question (§9 Identity).
- **Container images** are an external dependency until mirrored. Pin by **digest** for determinism (§Determinism).
- **The operator workstation** is an implicit trusted host for the very first `pulumi up`. Its toolchain must be validated (preflight, §2).
### 1.4 Suggested additions / changes to the component list
- **Add a reverse proxy with automatic TLS** → recommend **Caddy** (auto-ACME, ~10-line Caddyfile, internal-CA fallback). Alternative: Traefik. nginx if maximum-boring is required but loses auto-TLS ergonomics. (Medium — Caddy recommended.)
- **Add a Docker Hub pull-through cache** (`registry:2`) at Layer 0 from day one (PLAN-001 component #6) — removes a SaaS rate-limit dependency for CI. (Medium)
- **Defer Valkey/Redis** — single-replica Forgejo needs no external queue/cache (PLAN-001 confirms). Add only with HA. (High)
- **Defer Meilisearch** — search is not foundational. (High)
- **Keep `@pulumi/random` for all credential generation** (reuse existing pattern). (High)
- **Vault PKI engine** becomes the internal CA in §8 (replacing Caddy's bootstrap internal CA).
---
## 2. Bootstrapping Strategy (empty VM → operational)
Phases are deployed by **one Pulumi project** with explicit ordering (component dependencies + a small number of phase gates). See §5 for the dependency graph and §9 for the full timeline.
```
Phase 0  PROVISION    Bare VM (Hetzner) + cloud-init: docker engine, ssh key, firewall.
Phase 1  PREFLIGHT    Cloned repo validates host+toolchain (pulumi, node/bun, docker, ssh, dns, age/gpg).
Phase 2  STATE+TRUST  Pulumi local file backend; master passphrase set (PULUMI_CONFIG_PASSPHRASE via `pass`).
Phase 3  DATA PLANE   Docker network + PostgreSQL + RustFS (sealed/empty Vault container also started).
Phase 4  VAULT INIT   `vault operator init` → capture root token + unseal keys → write back to
                      passphrase-encrypted Pulumi config (PROVEN pattern, olsitec-core/run.sh) → unseal.
Phase 5  CREDENTIALS  @pulumi/random generates all service creds → written to Vault KV v2 → RustFS buckets
                      created → Postgres roles/DBs created.
Phase 6  FORGE        Reverse proxy + Forgejo (app.ini rendered with secrets from Vault/Pulumi) come up;
                      Forgejo install-lock + first admin created deterministically.
Phase 7  HANDOVER     Push the foundation repo INTO Forgejo; switch git origin; create org + mirror infra
                      repos; register first Actions runner (token from Vault).
Phase 8  CI HANDOFF   A `.forgejo/workflows/` pipeline runs `pulumi preview` (then `up` on approval).
Phase 9  BACKUP+DR    First backup taken (forgejo dump + pg_dump + vault snapshot + pulumi state) → RustFS
                      → offsite. DR rebuild rehearsed on a fresh VM.
```
**Phase gates (only where strictly required):**
- Gate A after Phase 4: Vault must be initialized+unsealed before Phase 5 writes secrets.
- Gate B after Phase 6: Forgejo must be healthy before Phase 7 handover.
Everything else flows through ordinary Pulumi resource dependencies — no extra gates.
---
## 3. Repository Structure
A **single repo** = the DR unit. `git clone` + master passphrase ⇒ rebuild.
```
olsitec-foundation/
├── README.md # 5-line quickstart + DR pointer
├── VERSIONS # pinned versions+digests for every image & tool (determinism)
├── preflight/
│ ├── preflight.sh # validates tools, versions, ssh, dns, docker reachability
│ └── checks/ # individual check scripts (composable, testable)
├── pulumi/
│ ├── Pulumi.yaml # single project
│ ├── Pulumi.foundation.yaml # stack: passphrase-encrypted config + secrets (committable)
│ ├── index.ts # phase orchestration entrypoint
│ ├── config.ts # typed config schema (CONTRACT_001)
│ ├── components/ # one ComponentResource per concern
│ │ ├── network.ts # docker network, firewall expectations
│ │ ├── postgres.ts
│ │ ├── rustfs.ts # + bucket provisioning
│ │ ├── vault.ts # container + init/unseal capture lib
│ │ ├── credentials.ts # @pulumi/random → Vault writer (CONTRACT_002 paths)
│ │ ├── proxy.ts # Caddy + TLS strategy
│ │ ├── forgejo.ts # app.ini render, install-lock, first admin
│ │ └── runner.ts # act_runner + registration-token flow
│ ├── phases/ # thin orchestrators: dataPlane(), vaultInit(), forge(), handover()
│ └── lib/ # vaultInitCapture(), renderTemplate(), digest pinning helpers
├── containers/ # Dockerfiles for anything we build/mirror ourselves
├── config/ # rendered template SOURCES: app.ini.tmpl, Caddyfile.tmpl, pg-init.sql
├── backup/
│ ├── backup.sh # forgejo dump + pg_dump + vault snapshot + pulumi state → RustFS → offsite
│ └── restore.sh # inverse, parametrized by target host
├── dr/
│ ├── RUNBOOK.md # human-readable DR procedure
│ └── restore-to-fresh-vm.sh # automated rebuild used by the DR rehearsal test
├── docs/
│ ├── decisions/ # ADRs (ADR_F001 layered-platform, etc.)
│ ├── DAY-ZERO-TIMELINE.md # §9 timeline as an executable checklist
│ └── contracts/ # CONTRACT_001..004 (§10)
├── .forgejo/workflows/ # CI: preflight.yml, pulumi-preview.yml, pulumi-up.yml, backup-verify.yml
└── .gitignore # state/ (local backend), node_modules, *.local
```
**Why this layout (High confidence):**
- **One repo = one DR unit.** Vision requirement: "freshly cloned repo capable of pre-flight validation."
- **`components/` mirror the deployment order** so an agent can own one file with a clear contract.
- **`config/` holds template *sources*, never rendered secrets** — rendered output carries secrets and stays in container/Vault only (PD-2: don't version secrets).
- **`VERSIONS` centralizes determinism** — preflight and CI both read it; upgrades are a one-line diff.
- **`.forgejo/workflows/` co-located** so the repo that defines CI is the repo CI deploys (self-hosting).
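
A minimal sketch of how preflight and CI could both read `VERSIONS` and refuse anything not pinned by digest. The line format (`name=image@sha256:<digest>`), the function name `parseVersions`, and the `example.invalid` image names are assumptions for illustration; this plan does not yet define the file's syntax.

```typescript
// Hypothetical VERSIONS line format (an assumption, not defined by this plan):
//   forgejo=example.invalid/forgejo@sha256:<64 hex chars>
// Blank lines and lines starting with '#' are ignored.

interface PinnedImage {
  name: string;
  image: string;
  digest: string;
}

const DIGEST_RE = /^sha256:[0-9a-f]{64}$/;

// Parse VERSIONS text; throw on any entry that is not pinned by digest.
function parseVersions(text: string): PinnedImage[] {
  const pins: PinnedImage[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    const at = line.lastIndexOf("@");
    if (eq < 1 || at < eq) {
      throw new Error(`VERSIONS: entry is not pinned by digest: ${line}`);
    }
    const digest = line.slice(at + 1);
    if (!DIGEST_RE.test(digest)) {
      throw new Error(`VERSIONS: malformed digest in: ${line}`);
    }
    pins.push({ name: line.slice(0, eq), image: line.slice(eq + 1, at), digest });
  }
  return pins;
}
```

Failing on a tag-only entry (rather than warning) keeps the "upgrades are a one-line diff" property: a floating tag can never slip into a deploy unnoticed.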
---
## 4. Secret Management
### 4.1 Root of trust
**The master passphrase** (`PULUMI_CONFIG_PASSPHRASE`) is the single root. It selects Pulumi's
`passphrase` secrets provider (already in use: `encryptionsalt` in `Pulumi.olsitec-core.yaml`).
Chain of trust:
```
Master passphrase
└─ decrypts Pulumi stack config secrets (committable, `secure: v1:…`)
└─ which hold Vault unseal keys + root token (captured at init)
└─ Vault becomes the runtime distribution layer for ALL other secrets (ADR-002)
```
The passphrase is the **only** thing a human must carry out-of-band. Store it in `pass`
(operator side), and/or split it among operators with Shamir, and/or a hardware token. It is
never written to the platform.
### 4.2 Credential generation — deterministic vs random
| Class | Examples | Source | Rationale |
|-------|----------|--------|-----------|
| **Random / high-entropy** | all service passwords, Postgres pw, RustFS access/secret keys, Forgejo `SECRET_KEY` + `INTERNAL_TOKEN` + JWT secrets, OCI/npm registry tokens, runner registration token, Forgejo admin password | `@pulumi/random` → Vault KV v2 | secrets must be unguessable; rotation = `--replace` |
| **Derived / deterministic** | usernames, DB names, bucket names, container/DNS names, Vault mount layout, hostnames | computed from typed config | reproducible, non-secret, no entropy needed |
| **External (the ONLY one)** | master passphrase | human | root of trust |
This satisfies the vision: *"everything else should derive from that."*
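
The "derived / deterministic" row can be sketched as a pure function over typed config. The config fields, the function name `deriveNames`, and the `forgejo` DB name/user are assumptions (CONTRACT_001 is not written yet); the hostname and bucket names are the ones this plan uses elsewhere.

```typescript
// Sketch only: the config fields and naming rules below are assumptions,
// not the CONTRACT_001 schema (which T00 still has to write).
interface FoundationConfig {
  domain: string; // e.g. "olsitec.de"
  stack: string; // e.g. "foundation"
}

interface DerivedNames {
  forgeHost: string;
  dbName: string;
  dbUser: string;
  buckets: string[];
}

// Pure function: same config in, same names out. No randomness, no secrets.
function deriveNames(cfg: FoundationConfig): DerivedNames {
  const label = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/;
  if (!label.test(cfg.stack)) {
    throw new Error(`stack name is not a valid lowercase label: ${cfg.stack}`);
  }
  return {
    forgeHost: `forge.${cfg.domain}`,
    dbName: "forgejo",
    dbUser: "forgejo",
    buckets: [
      "forgejo-packages",
      "forgejo-artifacts",
      "forgejo-lfs",
      `${cfg.stack}-backups`,
    ],
  };
}
```

Because nothing here is random, a DR rebuild from the same config reproduces the same names, which is what lets restored data line up with freshly created containers.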
### 4.3 Vault initialization & unseal (the hard part — High attention)
- **Init:** Pulumi runs `vault operator init` (Shamir, e.g. 5 keys / threshold 3) inside the Vault
container, captures `unsealKeys` + `rootToken` as **stack outputs**, then `run.sh` (or a Pulumi
`local.Command`) writes them back as passphrase-encrypted Pulumi config secrets. **This exact
pattern already exists in `olsitec-core/run.sh`** — reuse it verbatim. (High confidence.)
- **Unseal on reboot (R2):** Vault seals on every restart. Options:
1. **Passphrase-gated unseal helper** *(recommended)* — a small script reads the unseal keys from
Pulumi config (decrypted by the passphrase the operator provides) and unseals. Deterministic,
reproducible, **no external KMS, no SaaS**. Cost: VM reboots need an operator (or a
passphrase made available to a boot service — a trade-off to decide).
2. **Transit auto-unseal** — rejected at Layer 0 (needs a *second* Vault → circular).
3. **KMS auto-unseal** — rejected (SaaS dependency, violates design goal).
→ Recommend (1) for Layer 0; revisit auto-unseal when a second trust anchor exists at Layer 1.
(Medium confidence — this is the main open operational question; flag for second-opinion.)
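
A hedged sketch of the request-planning half of option (1). It assumes the unseal keys have already been decrypted from Pulumi config and only builds the calls to Vault's `sys/unseal` HTTP endpoint; the function name `planUnseal` and the address are illustrative, and sending the requests (plus checking the `sealed` field of each response) is left out.

```typescript
// One planned call to Vault's unseal endpoint (PUT /v1/sys/unseal).
interface UnsealRequest {
  method: "PUT";
  url: string;
  body: string; // JSON payload: {"key": "<one unseal key share>"}
}

// Build exactly `threshold` unseal requests from the available key shares.
// Submitting more shares than the threshold is unnecessary, so the rest are
// never put on the wire.
function planUnseal(
  vaultAddr: string,
  unsealKeys: string[],
  threshold: number,
): UnsealRequest[] {
  if (threshold < 1) {
    throw new Error("threshold must be at least 1");
  }
  if (unsealKeys.length < threshold) {
    throw new Error(
      `need ${threshold} unseal keys, only ${unsealKeys.length} available`,
    );
  }
  const base = vaultAddr.replace(/\/+$/, ""); // tolerate a trailing slash
  return unsealKeys.slice(0, threshold).map(
    (key): UnsealRequest => ({
      method: "PUT",
      url: `${base}/v1/sys/unseal`,
      body: JSON.stringify({ key }),
    }),
  );
}
```

With the 5-key / threshold-3 example above, `planUnseal(addr, keys, 3)` yields three requests. Keeping planning separate from sending makes the helper testable without a running Vault.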
### 4.4 Rotation
Per ADR-002: `pulumi up --replace` on the `RandomPassword` → new value in Vault → consumers reload.
At Layer 0, consumers are containers, so rotation triggers a container recreate (Pulumi handles the
dependency). Vault root token: rotate via `vault operator generate-root` after bootstrap; store new
token in config. Unseal-key rotation: `vault operator rekey`.
### 4.5 Recovery & backup of secrets
- **Vault data** backed up via `vault operator raft snapshot save` → RustFS → offsite.
- **Unseal keys + root token** survive inside the passphrase-encrypted Pulumi stack config, which is
in the repo (and the repo is backed up). So {repo + passphrase} reconstitutes Vault access.
- **Backup/offsite credentials (R8)** live in Vault *and* are mirrored into the passphrase-encrypted
config, so they survive even total Vault loss.
---
## 5. Deployment Order & Dependency Graph
```
                      ┌─────────────────────────┐
                      │ master passphrase (ext) │
                      └───────────┬─────────────┘
                                  │ selects secrets provider
                           ┌──────▼───────┐
                           │ Pulumi state │ (local file backend → later RustFS S3)
                           └──────┬───────┘
        ┌──────────────┬──────────┼───────────┬──────────────┐
        ▼              ▼          ▼           ▼              ▼
 docker network   PostgreSQL   RustFS   Vault(sealed)  Caddy(proxy)
        │              │          │           │              │
        │              │          │ [Gate A: init+unseal]    │
        │              │          │           │              │
        │              │          │    Vault(unsealed)       │
        │              │          │           │              │
        │              └──────────┴─────┐ credentials.ts     │
        │                               ▼ (@pulumi/random→Vault)
        │          Postgres roles/DBs, RustFS buckets created
        │                               │
        └───────────────────────────────┼──────────────────────┐
                                        ▼                      │
                                  Forgejo (app.ini ← Vault) ◄──┘ (proxy routes to it)
                                        │ [Gate B: healthy]
                            first admin + org + repos
                           act_runner (token ← Vault)
                             CI assumes deploy duty
```
### 5.1 What depends on what
- **Everything** depends on Pulumi state + passphrase.
- **credentials.ts** depends on Vault being unsealed (Gate A) and on Postgres/RustFS existing (to create roles/buckets with the generated creds).
- **Forgejo** depends on Postgres (DB), RustFS (blob storage), Vault (secrets), proxy (TLS/ingress).
- **Runner** depends on Forgejo (registration token) and on the proxy (to reach Forgejo).
- **CI** depends on the runner.
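
The dependencies listed in §5.1 can be encoded as data and checked for a valid deploy order. This is a sketch of a plain depth-first topological sort, not how Pulumi resolves resource dependencies internally; the node names and `deployOrder` are illustrative.

```typescript
// Each key depends on every node in its array (taken from §5.1; Pulumi state
// and the passphrase are implicit prerequisites of everything and omitted).
type Graph = Record<string, string[]>;

const FOUNDATION_DEPS: Graph = {
  postgres: [],
  rustfs: [],
  vault: [],
  proxy: [],
  credentials: ["vault", "postgres", "rustfs"],
  forgejo: ["postgres", "rustfs", "vault", "proxy"],
  runner: ["forgejo", "proxy"],
  ci: ["runner"],
};

// Depth-first topological sort: dependencies always appear before dependents.
// Throws if the graph contains a cycle.
function deployOrder(deps: Graph): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (node: string, path: string[]): void => {
    const seen = state.get(node);
    if (seen === "done") return;
    if (seen === "visiting") {
      throw new Error(`dependency cycle: ${[...path, node].join(" -> ")}`);
    }
    state.set(node, "visiting");
    for (const dep of deps[node] ?? []) visit(dep, [...path, node]);
    state.set(node, "done");
    order.push(node);
  };
  for (const node of Object.keys(deps)) visit(node, []);
  return order;
}
```

A check like this could run in CI against the contract docs, so a newly added component that introduces a cycle is caught before anyone runs `pulumi up`.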
### 5.2 Circular dependencies & resolutions (summary; full list §9)
| Cycle | Resolution |
|-------|-----------|
| Pulumi needs a secret store; Vault is that store; Vault is deployed by Pulumi | Passphrase-encrypted config holds unseal keys at bootstrap; Vault holds the rest in steady state. |
| Forgejo hosts the repo that deploys Forgejo | Deploy Forgejo from the **local clone** first; then push repo in + switch origin (handover). |
| CI deploys the platform; CI runs on the platform | First `pulumi up` is **manual**; CI takes over only after the runner exists and a self-rebuild is proven. |
| Registry hosts CI base images; CI fills the registry | Pull from upstream via pull-through cache day-zero; mirror into Forgejo registry afterward. |
| TLS needs DNS+ACME; ACME account must be created | DNS-01 via existing Cloudflare token, or internal CA day-zero; real certs once DNS resolves. |
---
## 6. Disaster Recovery (total VM loss)
**Premise:** survive on {a VM, the repo, the master passphrase}.
### 6.1 What must exist to recover
1. **The repo** (git clone — mirrored offsite, see below).
2. **The master passphrase** (operator's head / `pass` / Shamir split).
3. **The latest backup bundle** in the **offsite** location: `forgejo-dump.zip`, `pg_dump.sql`,
`vault-raft.snap`, `pulumi-state` (if not reconstructible), `rustfs-data` (or it is the offsite).
### 6.2 Procedure (target ≤ 1 hour; `dr/restore-to-fresh-vm.sh` automates most)
1. Provision a fresh VM (Phase 0 cloud-init).
2. `git clone` foundation repo; run `preflight/`.
3. Set `PULUMI_CONFIG_PASSPHRASE`; `pulumi login` (local backend) or restore state from offsite.
4. `pulumi up` Phases 3-4: data plane + Vault container. **Restore Vault** from raft snapshot
(`vault operator raft snapshot restore`); unseal with keys from config.
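5. **Restore Postgres** (`psql` for the plain-SQL `pg_dump.sql`; `pg_restore` only applies if the dump is taken in custom or directory format) and **RustFS data** (sync from offsite) before starting Forgejo.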
6. `pulumi up` Phase 6: Forgejo against restored DB + restored data dir (git repos).
7. Re-register the runner (new token) — runners are stateless, never restored.
8. Validate: clone a repo, run a pipeline, push an image, read a Vault secret.
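
The restore sequence is order-sensitive (Vault, then Postgres, then RustFS data, then Forgejo), so `dr/restore-to-fresh-vm.sh` or its wrapper could guard against a reordered plan. A small sketch; the step identifiers and `assertRestoreOrder` are hypothetical names.

```typescript
// Required restore order, per the procedure above: state stores first,
// Forgejo last so it never starts against an empty DB or data dir.
const RESTORE_ORDER = ["vault", "postgres", "rustfs", "forgejo"] as const;
type RestoreStep = (typeof RESTORE_ORDER)[number];

// Throw unless `steps` is exactly the required sequence.
function assertRestoreOrder(steps: RestoreStep[]): void {
  if (steps.length !== RESTORE_ORDER.length) {
    throw new Error("restore plan must contain each step exactly once");
  }
  steps.forEach((step, i) => {
    const expected = RESTORE_ORDER[i];
    if (step !== expected) {
      throw new Error(`step ${i + 1} should be "${expected}", got "${step}"`);
    }
  });
}
```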
### 6.3 What is recreatable and **not** backed up
- Container images (re-pullable / rebuildable from pinned digests).
- Search indexes (Forgejo rebuilds).
- Caches, runner ephemeral state, pull-through cache contents.
- Pulumi state *if* the local backend is reconstructible — but back it up anyway (cheap insurance).
### 6.4 Offsite requirement (critical)
RustFS lives on the same VM → it cannot be the only backup copy (R3). Replicate the backup bundle to
a **second location with a different failure domain** that is **not SaaS by hard dependency**:
recommend a second small Hetzner VM / Storage Box in another DC, or a second self-hosted RustFS.
(If a SaaS S3 is used, it must be *additive*, never the sole copy — preserving the no-SaaS guarantee.)
---
## 7. Operational Lifecycle
### 7.1 Upgrades
- Bump the pinned **digest** in `VERSIONS` → PR → CI `pulumi preview` posts the plan → human approves
→ CI (or manual for Vault/Postgres major versions) `pulumi up`.
- **Snapshot before** every Forgejo/Postgres/Vault upgrade (PD-4): take a backup bundle first.
- Sequence: never upgrade Postgres and Forgejo in the same change; Vault upgrades are isolated.
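
The sequencing rule lends itself to a mechanical PR check. A sketch, assuming the CI job can list which components' pins changed in `VERSIONS`; the component keys and the function name `upgradeViolations` are illustrative.

```typescript
// Return human-readable violations of the upgrade sequencing rule for the
// set of components whose pinned versions changed in one PR.
function upgradeViolations(changed: string[]): string[] {
  const set = new Set(changed);
  const violations: string[] = [];
  if (set.has("postgres") && set.has("forgejo")) {
    violations.push("postgres and forgejo must not be upgraded in the same change");
  }
  if (set.has("vault") && set.size > 1) {
    violations.push("vault upgrades must be isolated in their own change");
  }
  return violations;
}
```

An empty result means the PR may proceed to `pulumi preview`; anything else fails the job before a plan is even posted.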
### 7.2 Backups
- **`backup/backup.sh` on a timer** (systemd timer or Forgejo Actions scheduled workflow):
`forgejo dump` (repos+metadata) + `pg_dump` + `vault operator raft snapshot save` + `pulumi stack export`
→ RustFS bucket `foundation-backups` → replicate offsite.
- **Verify** restorability weekly (`.forgejo/workflows/backup-verify.yml` restores into a scratch
container and asserts row counts / repo presence). A backup that has never been restored is a guess.
- First backup is part of bootstrap (Phase 9) — **before** declaring the platform operational.
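
Before any restore test, a verify job could first confirm that the bundle is complete. A sketch using the artifact names from §6.1; the final names and layout belong to CONTRACT_004, so treat this list and `missingArtifacts` as placeholders.

```typescript
// Artifact names as listed in §6.1; CONTRACT_004 is the eventual source of truth.
const REQUIRED_ARTIFACTS = [
  "forgejo-dump.zip",
  "pg_dump.sql",
  "vault-raft.snap",
  "pulumi-state",
];

// Return the required artifacts that are absent from a bundle listing.
function missingArtifacts(bundleFiles: string[]): string[] {
  const present = new Set(bundleFiles);
  return REQUIRED_ARTIFACTS.filter((name) => !present.has(name));
}
```

Completeness is only the cheap first gate; it does not replace the scratch-container restore, which is what actually proves restorability.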
### 7.3 Monitoring & alerting
- **Bootstrap → minimal:** container healthchecks + an external uptime probe (offsite).
- **Layer 1:** Prometheus/Grafana (on K8s) scrape the foundation node-exporter + Forgejo `/metrics`.
- **Alerting trust rule:** the alerter must **not** run on the only host it watches. Put uptime/alert
offsite so a dead VM can still page. (High confidence — common self-hosting footgun.)
### 7.4 Maintenance & the self-hosting milestone
- **Self-hosting is reached when** (all true): foundation repo lives in Forgejo; CI can `pulumi up`
the foundation; a DR rebuild has succeeded end-to-end from offsite backup.
- After that, **all changes flow through Git + CI**, with manual `pulumi up` reserved as the documented
break-glass for Layer-0-breaking changes (e.g., Vault/Postgres major upgrades).
---
## 8. Future Expansion (how Layer 1+ integrates)
Every future service integrates through the **same four foundation interfaces**, never bypassing them:
**(1) source repo in Forgejo, (2) images/charts in Forgejo's OCI registry, (3) secrets in Vault,
(4) CI in Forgejo Actions.** This keeps the egg the single root for everything.
| Service | Integration path | Notes |
|---------|------------------|-------|
| **Kubernetes** | Provisioned by a *new* Pulumi project whose repo lives in foundation-Forgejo; pulls images from foundation-registry; secrets from foundation-Vault. | This is where the **existing olsicloud4 K8s platform** reconnects — as a Layer-1 consumer. |
| **ArgoCD** | Deployed on K8s; its app repos are Forgejo repos; bootstrap secret (git token) from Vault. | Replaces gitlab.com source in `002_platform_architecture.md` with Forgejo. |
| **Internal PKI** | **Vault PKI secrets engine** becomes the org CA, replacing Caddy's bootstrap internal CA. cert-manager (Layer 1) uses the Vault issuer. | Promotes day-zero self-signed → real internal trust. |
| **Authentik (SSO/OIDC)** | Deployed at Layer 1; Forgejo, Grafana, ArgoCD become OIDC clients. Introduce SSO **after** the platform is stable — not day-zero (avoid an identity dependency in the egg). | Forgejo can also *be* a temporary OIDC provider before Authentik exists. |
| **Grafana / Prometheus** | Layer 1; scrape foundation + cluster; dashboards-as-code in Forgejo. | §7.3. |
| **Longhorn** | Layer-1 storage for stateful K8s workloads — **not** used by Layer 0 (Layer 0 uses host volumes). | Keeps the egg storage-simple. |
| **Renovate** | Self-hosted runner job in Forgejo Actions; opens PRs against `VERSIONS` and chart repos. | Automates §7.1 digest bumps. |
| **Additional registries** | Forgejo's bundled registries cover OCI/npm/Helm/+20; add Harbor only if policy/scanning demands it. | Prefer not adding parts. |
**Migration note:** the existing platform's gitlab.com dependency (git + OCI registry at
`registry.gitlab.com/olsitec-nci/charts`, ADR-002 paths under `olsicloud4/...`) is **retired** by
pointing those repos/registries at foundation-Forgejo. That migration is its own plan, gated on the
foundation being proven.
---
## 9. Bootstrap Paradoxes & Day-Zero Analysis
For each: *why it exists · what depends on what · automatable? · solution · deterministic?*
### 9.1 Infrastructure
- **First VM provisioning.** Paradox: Pulumi provisions infra, but the VM hosts Pulumi's target.
→ A **thin separate Hetzner Pulumi project** (already exists: `pulumi/hetzner-cloud`) or one
cloud-init creates the VM + installs Docker + plants the operator SSH key. Automatable: **yes**.
Deterministic: yes (image + cloud-init pinned). The VM is the one piece provisioned *before* the
foundation Pulumi runs.
- **Pulumi's first credentials.** It needs (a) SSH to the VM, (b) the master passphrase. SSH key is
the day-zero identity (§9.3); passphrase is the root of trust (§4). No other credential needed —
everything else is generated. Deterministic: yes.
- **Pulumi state before infra exists (R4).** → **Local file backend** on the operator machine during
bootstrap; migrate to RustFS S3 backend after Phase 5; back up state offsite. Automatable: yes.
Deterministic: yes (state is data, not derived, so it is *backed up*, not regenerated).
- **First clone of the repo.** Before Forgejo exists the repo lives… somewhere external (operator
workstation + an offsite git mirror — e.g. a bare repo on the backup host, or temporarily
gitlab.com during migration). After handover, Forgejo is canonical. Automatable: partially (the
very first clone is operator action). Deterministic: yes (content-addressed git).
- **Binary installation.** `preflight/` checks; a pinned installer script fetches exact versions from
`VERSIONS`. Automatable: yes. Deterministic: yes (pinned).
- **Host validation.** `preflight/preflight.sh` asserts tool versions, docker reachability, ssh, dns,
disk, clock. Fails closed before any deploy. Automatable: yes.
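
"Fails closed" can be made concrete: a check that errors for any reason, including a crashed or missing probe, counts as a failure. The plan calls for shell scripts under `preflight/checks/`; the TypeScript below only illustrates the control flow, and every name in it is hypothetical.

```typescript
// A check passes by returning normally and fails by throwing.
interface Check {
  name: string;
  run: () => void;
}

interface PreflightReport {
  failures: string[];
  exitCode: number; // 0 only if every check passed
}

// Build a check that compares a probed tool version against its pin.
function pinnedVersionCheck(
  tool: string,
  pinned: string,
  probe: () => string,
): Check {
  return {
    name: `${tool} == ${pinned}`,
    run: () => {
      const actual = probe().trim();
      if (actual !== pinned) {
        throw new Error(`found ${actual}, expected ${pinned}`);
      }
    },
  };
}

// Run all checks; an exception from any check (including a broken probe)
// is recorded as a failure rather than aborting or being ignored.
function runPreflight(checks: Check[]): PreflightReport {
  const failures: string[] = [];
  for (const check of checks) {
    try {
      check.run();
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${check.name}: ${reason}`);
    }
  }
  return { failures, exitCode: failures.length === 0 ? 0 : 1 };
}
```

Running every check instead of stopping at the first failure gives the operator the full list of problems in one pass, while the non-zero exit code still blocks the deploy.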
### 9.2 Secrets & Trust
- **Root of trust:** master passphrase (§4.1). **Minimal external secret:** that passphrase, nothing
else.
- **Vault init / unseal keys / initial creds:** §4.3 — proven `olsitec-core` capture pattern.
- **Deterministic vs random creds:** §4.2.
- **Rotation / recovery after total loss:** §4.4 / §4.5 + §6.
### 9.3 Identity
- **First administrator:** created **non-interactively** by Pulumi via `forgejo admin user create`
(container exec) or `FORGEJO__security__INSTALL_LOCK=true` + env, with an admin password from
`@pulumi/random` → Vault. No human types a password into a web form. Automatable: **yes**.
Deterministic: the *flow* is; the password is random-but-stored. (High confidence — Forgejo
supports headless admin creation.)
- **First admin authentication:** operator reads the generated admin password from Vault (passphrase
→ Vault). No default/weak credential ever exists.
- **First SSH key trusted:** the operator key is planted by cloud-init (Phase 0) — this is the
irreducible day-zero trust seed. Subsequent keys are managed in Forgejo.
- **Service identities:** each service gets its own Vault path + (later) AppRole, mirroring ADR-002.
- **OIDC/SSO:** introduce at Layer 1 (§8), **not** day-zero — avoids an identity dependency inside the
egg.
### 9.4 Certificates & Networking
- **Initial TLS:** DNS-01 ACME via the **existing Cloudflare token** (already in the platform per
`002_platform_architecture.md`), issued by Caddy — works even before the host is publicly
reachable. Fallback: Caddy internal CA for day-zero, swap to real certs once DNS resolves.
- **Internal PKI:** not required day-zero; Vault PKI adopts it at Layer 1 (§8).
- **Cert rotation:** Caddy auto-renews ACME; Vault PKI handles internal rotation later.
- **DNS assumptions:** `forge.olsitec.de` (+ registry/host) **must resolve to the VM before handover**.
Owner: Cloudflare zone. This is a hard prerequisite — list it in preflight.
- **Reverse proxy bootstrap:** `Caddyfile` rendered from template by Pulumi; routes web/API/registry on
one host; Git-over-SSH exposed directly (port 22/2222) not via the HTTP proxy.
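
A sketch of what `lib/renderTemplate()` might look like for the `Caddyfile` case. The `{{name}}` placeholder syntax, the upstream `forgejo:3000`, and the template text are assumptions; the one design point worth keeping is that a missing variable is an error, never an empty string.

```typescript
// Hypothetical template source, as it might live in config/Caddyfile.tmpl.
const CADDYFILE_TMPL = `{{ host }} {
    reverse_proxy {{ upstream }}
}
`;

// Replace every {{name}} placeholder; throw if a value is missing so a
// half-rendered config can never reach a container.
function renderTemplate(
  template: string,
  values: Record<string, string>,
): string {
  return template.replace(
    /\{\{\s*([A-Za-z0-9_]+)\s*\}\}/g,
    (_match: string, key: string): string => {
      const value = values[key];
      if (value === undefined) {
        throw new Error(`template variable not provided: ${key}`);
      }
      return value;
    },
  );
}
```

For example, `renderTemplate(CADDYFILE_TMPL, { host: "forge.olsitec.de", upstream: "forgejo:3000" })` produces a site block for the forge host; the same function would serve `app.ini.tmpl`.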
### 9.5 Forgejo
- **First repository / first commit / repo arrival:** the foundation repo is pushed from the local
clone into Forgejo at **Phase 7 handover**; origin is switched to Forgejo; this is the
self-hosting moment. Automatable: yes (scripted `git remote` + `git push`).
- **First CI runner & registration token:** token generated via
`forgejo actions generate-runner-token` (or admin API) → stored in Vault → consumed by `runner.ts`.
Automatable: **yes**. Deterministic flow.
- **When CI owns deployments:** only after handover + runner registration + a proven self-`pulumi up`.
Until then, manual `pulumi up` (§5.2, §7.4).
### 9.6 Storage
- **Postgres init:** container with generated superuser pw; `pg-init.sql` creates Forgejo role+DB.
Automatable: yes.
- **RustFS init:** container with generated admin keys; `credentials.ts` creates service keys +
buckets (`forgejo-packages`, `forgejo-artifacts`, `forgejo-lfs`, `foundation-backups`).
Automatable: yes.
- **Bucket creation:** Pulumi (S3 provider against RustFS) — deterministic names.
- **Restore order after DR:** Vault → Postgres → RustFS data → **then** Forgejo (§6.2). Git repos
(Forgejo data dir) are the irreplaceable core; restore before starting Forgejo.
- **Recreatable data:** images, indexes, caches (§6.3).
### 9.7 Backups & Recovery
- **First backup:** Phase 9, before "operational" is declared.
- **Where stored:** RustFS `foundation-backups` + offsite replica (§6.4).
- **Backup credential protection:** in Vault + mirrored to passphrase-encrypted config (R8/§4.5).
- **Required to recover everything:** repo + passphrase + {forgejo dump, pg_dump, vault snapshot,
pulumi state}. **Disposable:** images, indexes, caches, runner state (§6.3).
### 9.8 Operations
- **Monitoring enabled:** minimal at bootstrap, full at Layer 1 (§7.3).
- **Alerting trusted:** only when it runs offsite (§7.3).
- **Upgrades before CI exists:** manual `pulumi up` with a pre-snapshot (§7.1).
- **Becomes self-hosting / all-changes-through-CI:** §7.4 milestone.
### 9.9 Chronological Day-Zero Timeline
```
T0   Fresh OS         Hetzner VM created (cloud-init: docker, ssh key, firewall, clock sync).
T1   First command    operator: git clone olsitec-foundation && ./preflight/preflight.sh
T2   Trust set        export PULUMI_CONFIG_PASSPHRASE (via pass); pulumi login (local file backend).
T3   Infra deploy     pulumi up → docker network + Postgres + RustFS + Vault(sealed) + Caddy.
T4   Secret init      vault operator init → capture keys → write to passphrase-encrypted config → unseal.
T5   Credentials      @pulumi/random → Vault; Postgres roles/DBs; RustFS keys+buckets.
T6   Services init    Forgejo up (app.ini ← secrets); headless first admin created.
T7   Operational      Web/API/registry reachable over TLS; admin password readable from Vault.
T8   Self-hosting     push foundation repo → Forgejo; switch origin; create org; register runner.
T9   First CI deploy  .forgejo/workflows runs pulumi preview → (approve) → up. CI now owns changes.
T10  Backup           backup.sh → RustFS → offsite. (first bundle)
T11  DR validated     restore-to-fresh-vm.sh rebuilds on a clean VM from offsite backup; smoke tests pass.
```
Goal achieved: **every step T1-T11 is scripted**; the only human actions are providing the passphrase
and approving the first CI deploy. No undocumented manual step remains.
---
## 10. AI Execution Plan
Work is split into low-coupling tasks. **Contracts are written first** (baseline §9) so tasks
parallelize without inventing incompatible interfaces. Each task: reviewable commit, explicit
acceptance criteria, conventional-commit subject.
### 10.0 Contracts (write before implementation tasks)
| Contract | Defines | Consumed by |
|----------|---------|-------------|
| **CONTRACT_001 — Config schema** | typed Pulumi config keys (hostnames, versions, sizes, feature flags) | every component |
| **CONTRACT_002 — Vault path layout** | `foundation/<service>/<type>-credentials` keys (camelCase, ADR-002 style) | credentials, forgejo, runner, backup |
| **CONTRACT_003 — Container network & DNS names** | network name, container names, internal ports | network, all services, proxy |
| **CONTRACT_004 — Backup artifact format** | bundle filenames, layout, restore order | backup, dr, backup-verify |
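
CONTRACT_002's path shape can be pinned down in one helper so that no component hand-builds a Vault path. A sketch: the service/type values and the KV mount name `secret` are placeholders, and the `<mount>/data/<path>` form is the KV v2 HTTP read path.

```typescript
// Build a logical secret path of the form foundation/<service>/<type>-credentials.
function vaultPath(service: string, type: string): string {
  for (const segment of [service, type]) {
    if (segment === "" || segment.includes("/")) {
      throw new Error(`invalid path segment: "${segment}"`);
    }
  }
  return `foundation/${service}/${type}-credentials`;
}

// KV v2 reads go through <mount>/data/<path> on the HTTP API.
// The mount name is an assumption; this plan does not fix it.
function kvV2ReadUrl(vaultAddr: string, mount: string, path: string): string {
  const base = vaultAddr.replace(/\/+$/, "");
  return `${base}/v1/${mount}/data/${path}`;
}
```

For instance, `vaultPath("forgejo", "admin")` gives `foundation/forgejo/admin-credentials`, which `credentials.ts` would write and `forgejo.ts` would read through the same helper.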
### 10.1 Tasks
| ID | Task | Depends on | Parallel? | Acceptance criteria |
|----|------|-----------|-----------|---------------------|
| **T00** | Contracts CONTRACT_001–004 + ADR_F001 (layered platform) | — | — | 4 contract docs + ADR committed; reviewed by human. |
| **T01** | Repo scaffold + `preflight/` + `VERSIONS` | T00 | yes | `preflight.sh` exits non-zero on any missing/mismatched tool; passes on a prepared host. |
| **T02** | Pulumi project skeleton + passphrase backend + `config.ts` (CONTRACT_001) | T00 | yes | `pulumi preview` runs with empty stack; config schema typed; secrets provider = passphrase. |
| **T03** | `network.ts` + `postgres.ts` | T02, CONTRACT_003 | yes | Postgres container up via `@pulumi/docker`; role+DB created; healthcheck green. |
| **T04** | `rustfs.ts` + bucket provisioning | T02, CONTRACT_002/CONTRACT_003 | yes | RustFS up; 4 buckets created; service key can put/get an object. |
| **T05** | `vault.ts` + `lib/vaultInitCapture` (reuse olsitec-core pattern) | T02 | yes | Vault inits; keys+root captured into encrypted config; unseal helper unseals after restart. |
| **T06** | `credentials.ts` (@pulumi/random → Vault, CONTRACT_002) | T05 | no (needs Vault) | All credential keys present in Vault at correct paths; idempotent on re-run. |
| **T07** | `proxy.ts` (Caddy) + TLS strategy (DNS-01 + internal-CA fallback) | T02, CONTRACT_003 | yes | HTTPS terminates for `forge.*`; cert from Let's Encrypt (or internal CA in dev). |
| **T08** | `forgejo.ts` — app.ini render, install-lock, S3+DB+Vault wiring | T03,T04,T06,T07 | no | Forgejo healthy; uses external Postgres + RustFS; web/API reachable via proxy. |
| **T09** | Forgejo headless first-admin + org + repo bootstrap | T08 | no | Admin created non-interactively; password in Vault; org exists; no default creds. |
| **T10** | `runner.ts` — registration-token flow + act_runner | T08,T09 | no | Runner registers via Vault token; a hello-world workflow runs to success. |
| **T11** | Self-hosting handover script (push repo, switch origin, mirror infra repos) | T09 | no | Foundation repo present in Forgejo; origin switched; `git push` works over SSH. |
| **T12** | `backup/` (backup.sh + restore.sh, CONTRACT_004) | T08 | yes | Bundle written to RustFS + offsite; restore.sh reconstructs into a scratch env. |
| **T13** | `dr/` runbook + `restore-to-fresh-vm.sh` | T12 | no | Automated rebuild on a clean VM passes smoke tests (clone, pipeline, registry push, vault read). |
| **T14** | `.forgejo/workflows/` (preflight, pulumi preview, pulumi up, backup-verify) | T10,T11 | yes | preview workflow posts plan; up workflow gated on approval; backup-verify restores+asserts. |
| **T15** | `index.ts` phase orchestration + Gate A/B + DAY-ZERO checklist | T03–T08 | no | `pulumi up` from empty → operational in one command (modulo passphrase + approval). |
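The "idempotent on re-run" criterion for T06 can be illustrated with a write-if-absent helper. This is a minimal sketch, not the planned `credentials.ts`: `FakeKv` is an in-memory stand-in for Vault, `ensureCredential` and the `adminPassword` key are hypothetical names, and the counter-based generator replaces `@pulumi/random` purely to keep the example self-contained.

```typescript
type SecretData = Record<string, string>;

// In-memory stand-in for a Vault KV mount; a real implementation would call Vault instead.
class FakeKv {
  private readonly data = new Map<string, SecretData>();
  read(path: string): SecretData | undefined {
    return this.data.get(path);
  }
  write(path: string, value: SecretData): void {
    this.data.set(path, value);
  }
}

// Writes a credential only when the path is empty, so a re-run never rotates a secret by accident.
function ensureCredential(
  kv: FakeKv,
  path: string,
  generate: () => SecretData,
): { created: boolean; value: SecretData } {
  const existing = kv.read(path);
  if (existing !== undefined) {
    return { created: false, value: existing };
  }
  const value = generate();
  kv.write(path, value);
  return { created: true, value };
}

// Deterministic generator for the demo only (assumption; not a real secret source).
let counter = 0;
const fakeGenerate = (): SecretData => ({ adminPassword: `generated-${++counter}` });

const kv = new FakeKv();
const path = "foundation/forgejo/admin-credentials";
const first = ensureCredential(kv, path, fakeGenerate);
const second = ensureCredential(kv, path, fakeGenerate);

// The second call finds the existing value and leaves it unchanged.
console.log(first.created, second.created, first.value["adminPassword"] === second.value["adminPassword"]);
```

The same read-before-write shape applies whether the generator is a random provider or a fixed test value, which makes the acceptance criterion easy to assert in CI.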
### 10.2 Parallelization map
- **Wave 1 (parallel):** T01, T02 (after T00 contracts).
- **Wave 2 (parallel):** T03, T04, T05, T07 (all depend only on T02 + contracts).
- **Wave 3:** T06 (needs T05) ∥ start T12 design.
- **Wave 4:** T08 (integrates T03/04/06/07).
- **Wave 5:** T09 → T10 → T11 (sequential handover chain) ∥ T12 impl.
- **Wave 6:** T13, T14, T15.
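The wave map can be cross-checked mechanically against the "Depends on" column of §10.1. The sketch below is one way to do that, assuming the dependencies are transcribed by hand (contract references are omitted because T00 delivers them all); `taskDeps` and `computeWaves` are illustrative names. It computes the earliest wave in which each task could start, so it may place a task earlier than the hand-written map does. For example, T12 and T15 become eligible as soon as T08 is done.

```typescript
type TaskId = string;

// Task-to-task dependencies transcribed from the §10.1 table.
const taskDeps: Record<TaskId, TaskId[]> = {
  T00: [],
  T01: ["T00"],
  T02: ["T00"],
  T03: ["T02"],
  T04: ["T02"],
  T05: ["T02"],
  T06: ["T05"],
  T07: ["T02"],
  T08: ["T03", "T04", "T06", "T07"],
  T09: ["T08"],
  T10: ["T08", "T09"],
  T11: ["T09"],
  T12: ["T08"],
  T13: ["T12"],
  T14: ["T10", "T11"],
  T15: ["T03", "T04", "T05", "T06", "T07", "T08"],
};

// Groups tasks into layers: a task is ready once all of its dependencies are done.
function computeWaves(graph: Record<TaskId, TaskId[]>): TaskId[][] {
  const done = new Set<TaskId>();
  let remaining = Object.keys(graph).sort();
  const waves: TaskId[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((t) => (graph[t] ?? []).every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("dependency cycle or reference to an unknown task");
    }
    ready.forEach((t) => done.add(t));
    remaining = remaining.filter((t) => !done.has(t));
    waves.push(ready);
  }
  return waves;
}

const waves = computeWaves(taskDeps);
waves.forEach((wave, i) => console.log(`layer ${i}: ${wave.join(", ")}`));
```

Where the computed layers and the map disagree, the map is simply the more conservative schedule; a real conflict would be a task scheduled in the map before one of its dependencies.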
### 10.3 Per-task prompt skeleton (baseline §7.1)
Each agent prompt must carry: Mission · Mode (BUILD or HIGH-RISK/INFRA) · the relevant **CONTRACT_00x** ·
the component file it owns · Non-goals (don't touch other components, don't edit generated/rendered
secrets, don't run `pulumi up` against the real VM without approval) · Acceptance criteria (above) ·
Escalation (stop if Vault/state/secret behavior diverges from this plan).
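One possible way to keep agent prompts uniform is to type the skeleton and render it from data. The sketch below assumes a `TaskPrompt` interface and a `renderPrompt` helper that are not part of the plan; the T03 example takes its acceptance criteria from §10.1, while its mission wording and mode are placeholders.

```typescript
type Mode = "BUILD" | "HIGH-RISK/INFRA";

// Hypothetical structure mirroring the fields listed in the skeleton.
interface TaskPrompt {
  mission: string;
  mode: Mode;
  contracts: string[];
  ownedFiles: string[];
  nonGoals: string[];
  acceptanceCriteria: string[];
  escalation: string;
}

// Renders the prompt as plain text with one labelled section per field.
function renderPrompt(p: TaskPrompt): string {
  const bullets = (items: string[]): string => items.map((i) => `- ${i}`).join("\n");
  return [
    `Mission: ${p.mission}`,
    `Mode: ${p.mode}`,
    `Contracts: ${p.contracts.join(", ")}`,
    `Owned files: ${p.ownedFiles.join(", ")}`,
    `Non-goals:\n${bullets(p.nonGoals)}`,
    `Acceptance criteria:\n${bullets(p.acceptanceCriteria)}`,
    `Escalation: ${p.escalation}`,
  ].join("\n\n");
}

// Example for T03; mission text and mode are assumptions for illustration.
const t03Prompt: TaskPrompt = {
  mission: "Implement the container network and the Postgres component.",
  mode: "BUILD",
  contracts: ["CONTRACT_003"],
  ownedFiles: ["network.ts", "postgres.ts"],
  nonGoals: [
    "Do not touch other components.",
    "Do not edit generated or rendered secrets.",
    "Do not run pulumi up against the real VM without approval.",
  ],
  acceptanceCriteria: [
    "Postgres container up via @pulumi/docker.",
    "Role and DB created.",
    "Healthcheck green.",
  ],
  escalation: "Stop if Vault, state, or secret behavior diverges from this plan.",
};

const rendered = renderPrompt(t03Prompt);
console.log(rendered);
```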
---
## Ratified Decisions (2026-06-30)
These four were decided by the human and are now binding (see ADR_004):
1. **Layered platform — RATIFIED.** Layer 0 = bare Docker on one VM via Pulumi; K8s/ArgoCD demoted
to a Layer-1 consumer (§0). The whole plan stands on this.
2. **Vault unseal — passphrase-gated helper (§4.3 option 1).** No external KMS, no SaaS. Reboots
require the master passphrase to be made available to the unseal step. Auto-unseal stays off until
a Layer-1 trust anchor exists.
3. **Object storage — RustFS primary (§4 R3).** RustFS is the Layer-0 S3, matching the existing
`rustfs` credential flag. **Hard rule:** the offsite replica is **non-RustFS**, so RustFS is never
the only copy of a backup.
4. **Offsite backup — second self-hosted location (§6.4).** Different DC/failure domain, **no SaaS**
dependency. Preferred seed: reuse `pulumi/hetzner-cloud` for both the Phase-0 VM and the offsite
host.
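Decisions 3 and 4 translate into a small invariant that could be checked before any backup configuration is accepted. The following is a sketch under assumptions: the `BackupTarget` shape, its field names, and `checkBackupTargets` are hypothetical and only encode the two rules stated above (at least one offsite replica that is not RustFS, and no SaaS dependency).

```typescript
// Hypothetical description of one place a backup bundle is written to.
interface BackupTarget {
  name: string;
  kind: string; // storage technology, e.g. "rustfs"
  offsite: boolean; // true if in a different DC/failure domain
  saas: boolean; // true if operated by a third-party service
}

// Returns a list of rule violations; an empty list means the targets satisfy both decisions.
function checkBackupTargets(targets: BackupTarget[]): string[] {
  const problems: string[] = [];
  if (!targets.some((t) => t.offsite && t.kind !== "rustfs")) {
    problems.push("no offsite replica on non-RustFS storage");
  }
  for (const t of targets) {
    if (t.saas) {
      problems.push(`${t.name}: SaaS dependency is not allowed`);
    }
  }
  return problems;
}

// Placeholder targets for illustration only.
const compliant: BackupTarget[] = [
  { name: "layer0-rustfs", kind: "rustfs", offsite: false, saas: false },
  { name: "offsite-host", kind: "filesystem", offsite: true, saas: false },
];
const rustfsOnly: BackupTarget[] = [
  { name: "layer0-rustfs", kind: "rustfs", offsite: false, saas: false },
];

console.log(checkBackupTargets(compliant).length); // 0
console.log(checkBackupTargets(rustfsOnly).length); // 1
```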
### Remaining minor decisions (reversible defaults; proceeding unless you object)
- **Reverse proxy:** defaulting to **Caddy** (auto-TLS, internal-CA fallback). Cheap to swap later.
- **Phase-0 VM seed:** defaulting to **`pulumi/hetzner-cloud`** for the foundation VM + the offsite host.
---
## Appendix — Mapping PLAN-001 → this plan
- PLAN-001 "StatefulSet/Helm/ArgoCD" → Layer-0 "container/named-volume/Pulumi resource."
- PLAN-001 data/state model (git on FS, Postgres, S3-for-blobs) → **reused unchanged.**
- PLAN-001 runner mapping (every job `runs-on: docker`, code_quality `dind`) → **reused for task T10 (§10.1).**
- PLAN-001 K8s HA topology → **§8 future HA path**, not bootstrap.
```