foundation/documentation/planning/PLAN-002-foundation-implementation.md
Andreas Niemann f18676e6b3 chore: scaffold olsitec-foundation mono-repo
Repo topology, baseline overlay, planning docs (PLAN-001/002), ADR-004/005,
and the bootstrap/packages/documentation skeleton. Implementation (T00+) not started.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 17:10:46 +02:00


PLAN-002 — olsitec-foundation Implementation Strategy (Master Roadmap)

Companion to PLAN-001-forgejo.md (the vision) and 002_platform_architecture.md (the existing olsicloud4 K8s platform).
Status: Draft for human ratification.
Mode at authoring: EXPLORE (design only, no code).
Author role: Lead platform architect.
Date: 2026-06-30.

This document is not an implementation. It is the strategy that AI agents execute. Confidence markers (High/Medium/Low) follow baseline PD-5.


0. The Pivotal Decision (read this first)

PLAN-001 deploys Forgejo onto Kubernetes via ArgoCD + Helm. The foundation must NOT.

The foundation is the egg: the thing every other platform is hatched from. Kubernetes, ArgoCD, Helm, cert-manager and ESO are themselves hatched by the platform — so the foundation cannot depend on them without creating an unrecoverable circular dependency (DR-from-nothing would require rebuilding K8s, which needs git+registry+secrets, which are the foundation).

Recommendation — a layered platform (High confidence)

| Layer | What | Substrate | Lifecycle |
| --- | --- | --- | --- |
| Layer 0: olsitec-foundation (the egg) | Forgejo (+ Actions + OCI/npm registry), PostgreSQL, Vault, RustFS, reverse proxy, 1 runner | Plain OCI containers on ONE VM, orchestrated by Pulumi @pulumi/docker over SSH. No K8s/ArgoCD/Helm. | pulumi up (manual day-zero → CI later) |
| Layer 1+: the olsicloud4 K8s platform & everything else | K8s, ArgoCD, cert-manager, ESO, Authentik, Grafana/Prometheus, Longhorn, Renovate, additional registries | Kubernetes | Consumes Layer 0: repos in foundation-Forgejo, CI in foundation-Actions, images/charts in foundation-registry, secrets in foundation-Vault |

Why this is correct and not a downgrade:

  • The existing repo already contains pulumi/modules/docker/ (a @pulumi/docker SSH-to-host wrapper) and pulumi/olsitec-core/run.sh (Pulumi-initializes-Vault-then-captures-unseal-keys-back-into-passphrase-encrypted-config). The tooling is already pointed at this model. (High confidence — verified in source.)
  • PLAN-001's K8s topology remains valid as a future, optional HA path for Forgejo (its "True HA is a step change" note). It is not thrown away — it is deferred to §8.

Consequence: Everywhere PLAN-001 says "StatefulSet / Helm / ArgoCD Application," Layer 0 reads "container + named volume / Pulumi docker.Container / Pulumi resource." The data & state model of PLAN-001 (git repos on a POSIX FS, Postgres, S3 for blobs) is unchanged and fully reused.


1. Architecture Review

1.1 Validated strengths of the vision

  • Forgejo as one binary (forge + CI + OCI + npm + 20 registries) genuinely collapses GitLab's 45 services into one. (High) — confirmed in PLAN-001.
  • Single master passphrase as the only external secret is achievable and already proven by olsitec-core (PULUMI_CONFIG_PASSPHRASE passphrase provider). (High)
  • Pulumi-owns-credentials / Vault-distributes (ADR-002) is the right steady-state. (High)
  • Boring tech: Postgres, Vault, S3, a reverse proxy, Docker containers. All well-understood. (High)

1.2 Weaknesses / risks identified

| # | Risk | Severity | Mitigation (see section) |
| --- | --- | --- | --- |
| R1 | Single VM = single point of failure. Forgejo is irreducibly stateful (git repos on FS). | High | Frequent backups to RustFS + offsite; DR rebuild ≤ 1h, tested (§6). HA is Layer-1 future (§8). |
| R2 | Vault auto-unseal paradox: unattended reboot leaves Vault sealed; auto-unseal normally needs an external KMS (a SaaS or a second Vault). | High | Shamir unseal; keys held in passphrase-encrypted Pulumi config; passphrase-gated unseal helper (§4, §9). |
| R3 | RustFS maturity. RustFS is a young MinIO-compatible S3. Foundation backups depend on it. | Medium | Keep S3 usage to the documented S3 API surface; never make RustFS the only copy of backups (offsite replica is non-S3-only). Treat RustFS as replaceable behind the S3 boundary. (Medium confidence on RustFS stability; flag for second-opinion.) |
| R4 | Pulumi state location before infra exists (chicken-egg). | Medium | Local file backend during bootstrap → migrate to RustFS S3 backend after; state backed up offsite (§5, §9). |
| R5 | Privileged runner. Forgejo Actions docker backend needs a privileged daemon. | Medium | Runner on a throwaway sidecar VM (or same VM, contained), never sharing the forge's trust boundary (§4a of PLAN-001 reused). |
| R6 | DinD/runner pulls from Docker Hub → rate limits + SaaS dependency for CI base images. | Medium | Pull-through cache → mirror critical images into Forgejo's own OCI registry; pin by digest (§7, §8). |
| R7 | TLS day-zero: ACME needs DNS resolving + reachability before the service is public. | Medium | DNS-01 via existing Cloudflare token (already in platform) OR reverse-proxy internal CA for day-zero, swap to real certs once DNS resolves (§9.4). |
| R8 | Backup encryption keys / offsite creds become a second must-survive secret. | Medium | Fold offsite + backup credentials into the same passphrase-encrypted config / Vault; never a bare file (§4, §6). |
| R9 | Forgejo Actions feature-completeness vs GitLab CI for existing pipelines (Kaniko, semantic-release, helm push). | Low | PLAN-001 already mapped every job → runs-on: docker. Reuse that mapping. (High) |

1.3 Hidden dependencies to make explicit

  • DNS must resolve forge.olsitec.de (and friends) to the VM before TLS and before self-hosting handover. Who owns the zone? (Cloudflare, per existing platform.) → §9 Networking.
  • An SSH key trusted by the VM is needed for Pulumi's Docker-over-SSH provider. That key's trust is a day-zero identity question (§9 Identity).
  • Container images are an external dependency until mirrored. Pin by digest for determinism (§Determinism).
  • The operator workstation is an implicit trusted host for the very first pulumi up. Its toolchain must be validated (preflight, §2).

1.4 Suggested additions / changes to the component list

  • Add a reverse proxy with automatic TLS → recommend Caddy (auto-ACME, ~10-line Caddyfile, internal-CA fallback). Alternative: Traefik. nginx if maximum-boring is required but loses auto-TLS ergonomics. (Medium — Caddy recommended.)
  • Add a Docker Hub pull-through cache (registry:2) at Layer 0 from day one (PLAN-001 component #6) — removes a SaaS rate-limit dependency for CI. (Medium)
  • Defer Valkey/Redis — single-replica Forgejo needs no external queue/cache (PLAN-001 confirms). Add only with HA. (High)
  • Defer Meilisearch — search is not foundational. (High)
  • Keep @pulumi/random for all credential generation (reuse existing pattern). (High)
  • Vault PKI engine becomes the internal CA in §8 (replacing Caddy's bootstrap internal CA).

2. Bootstrapping Strategy (empty VM → operational)

Phases are deployed by one Pulumi project with explicit ordering (component dependencies + a small number of phase gates). See §5 for the dependency graph and §9 for the full timeline.

Phase 0  PROVISION   Bare VM (Hetzner) + cloud-init: docker engine, ssh key, firewall.
Phase 1  PREFLIGHT   Cloned repo validates host+toolchain (pulumi, node/bun, docker, ssh, dns, age/gpg).
Phase 2  STATE+TRUST Pulumi local file backend; master passphrase set (PULUMI_CONFIG_PASSPHRASE via `pass`).
Phase 3  DATA PLANE  Docker network + PostgreSQL + RustFS (sealed/empty Vault container also started).
Phase 4  VAULT INIT  `vault operator init` → capture root token + unseal keys → write back to passphrase-
                     encrypted Pulumi config (PROVEN pattern, olsitec-core/run.sh) → unseal.
Phase 5  CREDENTIALS @pulumi/random generates all service creds → written to Vault KV v2 → RustFS buckets
                     created → Postgres roles/DBs created.
Phase 6  FORGE       Reverse proxy + Forgejo (app.ini rendered with secrets from Vault/Pulumi) come up;
                     Forgejo install-lock + first admin created deterministically.
Phase 7  HANDOVER    Push the foundation repo INTO Forgejo; switch git origin; create org + mirror infra
                     repos; register first Actions runner (token from Vault).
Phase 8  CI HANDOFF  A `.forgejo/workflows/` pipeline runs `pulumi preview` (then `up` on approval).
Phase 9  BACKUP+DR   First backup taken (forgejo dump + pg_dump + vault snapshot + pulumi state) → RustFS
                     → offsite. DR rebuild rehearsed on a fresh VM.

Phase gates (only where strictly required):

  • Gate A after Phase 4: Vault must be initialized+unsealed before Phase 5 writes secrets.
  • Gate B after Phase 6: Forgejo must be healthy before Phase 7 handover.

Everything else flows through ordinary Pulumi resource dependencies; no extra gates are needed.
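The two gates can be read as a small guard in front of each phase. The following is a minimal sketch, assuming a hypothetical `GateState` that an orchestrator would fill from health probes; neither the type nor the function name comes from the plan.

```typescript
// Hypothetical model of the two phase gates; phase numbers follow the phase list above.
type GateState = { vaultUnsealed: boolean; forgejoHealthy: boolean };

// Returns the reason a phase may not start yet, or null when it may proceed.
function gateBlocking(phase: number, state: GateState): string | null {
  // Gate A sits after Phase 4: Phase 5 and later write secrets to Vault.
  if (phase >= 5 && !state.vaultUnsealed) {
    return "Gate A: Vault is not initialized and unsealed";
  }
  // Gate B sits after Phase 6: the Phase 7 handover needs a healthy Forgejo.
  if (phase >= 7 && !state.forgejoHealthy) {
    return "Gate B: Forgejo is not healthy";
  }
  return null;
}
```

Phases 0 through 4 are never blocked by this guard, which matches the intent that only two explicit gates exist.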

3. Repository Structure

A single repo = the DR unit. git clone + master passphrase ⇒ rebuild.

olsitec-foundation/
├── README.md                      # 5-line quickstart + DR pointer
├── VERSIONS                       # pinned versions+digests for every image & tool (determinism)
├── preflight/
│   ├── preflight.sh               # validates tools, versions, ssh, dns, docker reachability
│   └── checks/                    # individual check scripts (composable, testable)
├── pulumi/
│   ├── Pulumi.yaml                # single project
│   ├── Pulumi.foundation.yaml     # stack: passphrase-encrypted config + secrets (committable)
│   ├── index.ts                   # phase orchestration entrypoint
│   ├── config.ts                  # typed config schema (CONTRACT_001)
│   ├── components/                # one ComponentResource per concern
│   │   ├── network.ts             # docker network, firewall expectations
│   │   ├── postgres.ts
│   │   ├── rustfs.ts              # + bucket provisioning
│   │   ├── vault.ts               # container + init/unseal capture lib
│   │   ├── credentials.ts         # @pulumi/random → Vault writer (CONTRACT_002 paths)
│   │   ├── proxy.ts               # Caddy + TLS strategy
│   │   ├── forgejo.ts             # app.ini render, install-lock, first admin
│   │   └── runner.ts              # act_runner + registration-token flow
│   ├── phases/                    # thin orchestrators: dataPlane(), vaultInit(), forge(), handover()
│   └── lib/                       # vaultInitCapture(), renderTemplate(), digest pinning helpers
├── containers/                    # Dockerfiles for anything we build/mirror ourselves
├── config/                        # rendered template SOURCES: app.ini.tmpl, Caddyfile.tmpl, pg-init.sql
├── backup/
│   ├── backup.sh                  # forgejo dump + pg_dump + vault snapshot + pulumi state → RustFS → offsite
│   └── restore.sh                 # inverse, parametrized by target host
├── dr/
│   ├── RUNBOOK.md                 # human-readable DR procedure
│   └── restore-to-fresh-vm.sh     # automated rebuild used by the DR rehearsal test
├── docs/
│   ├── decisions/                 # ADRs (ADR_F001 layered-platform, etc.)
│   ├── DAY-ZERO-TIMELINE.md       # §9 timeline as an executable checklist
│   └── contracts/                 # CONTRACT_001..004 (§10)
├── .forgejo/workflows/            # CI: preflight.yml, pulumi-preview.yml, pulumi-up.yml, backup-verify.yml
└── .gitignore                     # state/ (local backend), node_modules, *.local

Why this layout (High confidence):

  • One repo = one DR unit. Vision requirement: "freshly cloned repo capable of pre-flight validation."
  • components/ mirror the deployment order so an agent can own one file with a clear contract.
  • config/ holds template sources, never rendered secrets — rendered output carries secrets and stays in container/Vault only (PD-2: don't version secrets).
  • VERSIONS centralizes determinism — preflight and CI both read it; upgrades are a one-line diff.
  • .forgejo/workflows/ co-located so the repo that defines CI is the repo CI deploys (self-hosting).
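To illustrate how preflight and CI could both read VERSIONS, here is a minimal parsing sketch. The line format (`<name> <image>@sha256:<digest>`, with `#` comments) is an assumption made for this example; the plan does not define the file format, and `parseVersions` is a hypothetical helper.

```typescript
// Assumed VERSIONS line format (not defined by the plan):
//   <name> <image>@sha256:<64 lowercase hex chars>
type PinnedImage = { name: string; image: string; digest: string };

function parseVersions(text: string): PinnedImage[] {
  const pinned: PinnedImage[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const match = /^(\S+)\s+(\S+)@(sha256:[0-9a-f]{64})$/.exec(line);
    if (match === null) {
      // Fail closed: a tag-only reference is not reproducible.
      throw new Error(`not pinned by digest: ${line}`);
    }
    pinned.push({ name: match[1], image: match[2], digest: match[3] });
  }
  return pinned;
}
```

Rejecting tag-only references at parse time keeps the "upgrades are a one-line diff" property honest: a bump always changes a digest, never just a mutable tag.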

4. Secret Management

4.1 Root of trust

The master passphrase (PULUMI_CONFIG_PASSPHRASE) is the single root. It selects Pulumi's passphrase secrets provider (already in use: encryptionsalt in Pulumi.olsitec-core.yaml). Chain of trust:

Master passphrase
  └─ decrypts Pulumi stack config secrets (committable, `secure: v1:…`)
       └─ which hold Vault unseal keys + root token (captured at init)
            └─ Vault becomes the runtime distribution layer for ALL other secrets (ADR-002)

The passphrase is the only thing a human must carry out-of-band. Store it in pass (operator side), and/or split it among operators with Shamir, and/or a hardware token. It is never written to the platform.

4.2 Credential generation — deterministic vs random

| Class | Examples | Source | Rationale |
| --- | --- | --- | --- |
| Random / high-entropy | all service passwords, Postgres pw, RustFS access/secret keys, Forgejo SECRET_KEY + INTERNAL_TOKEN + JWT secrets, OCI/npm registry tokens, runner registration token, Forgejo admin password | @pulumi/random → Vault KV v2 | secrets must be unguessable; rotation = --replace |
| Derived / deterministic | usernames, DB names, bucket names, container/DNS names, Vault mount layout, hostnames | computed from typed config | reproducible, non-secret, no entropy needed |
| External (the ONLY one) | master passphrase | human | root of trust |

This satisfies the vision: "everything else should derive from that."
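The "derived / deterministic" class can be pictured as a pure function of the typed config. This is a minimal sketch; the `FoundationConfig` shape and the `_app` user suffix are assumptions for illustration (the real keys belong to CONTRACT_001), while the bucket names follow the ones listed in §9.6.

```typescript
// Hypothetical config shape; the real keys are defined by CONTRACT_001.
type FoundationConfig = { domain: string; forgeHost: string; service: string };

// Pure function: the same config always yields the same names.
// Nothing here is secret, so no entropy is involved.
function deriveNames(cfg: FoundationConfig) {
  return {
    fqdn: `${cfg.forgeHost}.${cfg.domain}`,
    dbName: cfg.service,
    dbUser: `${cfg.service}_app`, // suffix is an assumption, not from the plan
    buckets: ["packages", "artifacts", "lfs"].map((b) => `${cfg.service}-${b}`),
  };
}
```

With `{ domain: "olsitec.de", forgeHost: "forge", service: "forgejo" }` this yields `forge.olsitec.de` and the three `forgejo-*` buckets, so a DR rebuild recomputes identical names without storing them anywhere.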

4.3 Vault initialization & unseal (the hard part — High attention)

  • Init: Pulumi runs vault operator init (Shamir, e.g. 5 keys / threshold 3) inside the Vault container, captures unsealKeys + rootToken as stack outputs, then run.sh (or a Pulumi local.Command) writes them back as passphrase-encrypted Pulumi config secrets. This exact pattern already exists in olsitec-core/run.sh — reuse it verbatim. (High confidence.)
  • Unseal on reboot (R2): Vault seals on every restart. Options:
    1. Passphrase-gated unseal helper (recommended): a small script reads the unseal keys from Pulumi config (decrypted by the passphrase the operator provides) and unseals. Deterministic, reproducible, no external KMS, no SaaS. Cost: VM reboots need an operator (or a passphrase made available to a boot service, a trade-off to decide).
    2. Transit auto-unseal: rejected at Layer 0 (needs a second Vault → circular).
    3. KMS auto-unseal: rejected (SaaS dependency, violates design goal).

    → Recommend (1) for Layer 0; revisit auto-unseal when a second trust anchor exists at Layer 1. (Medium confidence. This is the main open operational question; flag for second-opinion.)
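The capture step can be sketched as a pure validation of the init output before anything is written back to config. This assumes the JSON shape printed by `vault operator init -format=json` (fields `unseal_keys_b64`, `unseal_threshold`, `root_token`); `captureVaultInit` and `keysForUnseal` are hypothetical names, not the existing run.sh logic.

```typescript
// Subset of the JSON assumed to be printed by `vault operator init -format=json`.
type VaultInit = { unseal_keys_b64: string[]; unseal_threshold: number; root_token: string };

// What would be written back as passphrase-encrypted config (names are illustrative).
type CapturedSecrets = { unsealKeys: string[]; rootToken: string };

function captureVaultInit(json: string, shares = 5, threshold = 3): CapturedSecrets {
  const init = JSON.parse(json) as VaultInit;
  if (!Array.isArray(init.unseal_keys_b64) || init.unseal_keys_b64.length !== shares) {
    throw new Error(`expected ${shares} unseal keys`);
  }
  if (init.unseal_threshold !== threshold) {
    throw new Error(`expected unseal threshold ${threshold}`);
  }
  if (typeof init.root_token !== "string" || init.root_token === "") {
    throw new Error("missing root token");
  }
  return { unsealKeys: init.unseal_keys_b64, rootToken: init.root_token };
}

// The unseal helper needs only `threshold` keys out of the captured set.
function keysForUnseal(captured: CapturedSecrets, threshold = 3): string[] {
  return captured.unsealKeys.slice(0, threshold);
}
```

Validating the share count and threshold before persisting avoids storing a partial capture that would only fail later, at the first reboot.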

4.4 Rotation

Per ADR-002: pulumi up --replace on the RandomPassword → new value in Vault → consumers reload. At Layer 0, consumers are containers, so rotation triggers a container recreate (Pulumi handles the dependency). Vault root token: rotate via vault operator generate-root after bootstrap; store new token in config. Unseal-key rotation: vault operator rekey.

4.5 Recovery & backup of secrets

  • Vault data backed up via vault operator raft snapshot save → RustFS → offsite.
  • Unseal keys + root token survive inside the passphrase-encrypted Pulumi stack config, which is in the repo (and the repo is backed up). So {repo + passphrase} reconstitutes Vault access.
  • Backup/offsite credentials (R8) live in Vault and are mirrored into the passphrase-encrypted config, so they survive even total Vault loss.

5. Deployment Order & Dependency Graph

                       ┌─────────────────────────┐
                       │  master passphrase (ext) │
                       └───────────┬─────────────┘
                                   │ selects secrets provider
                            ┌──────▼───────┐
                            │ Pulumi state │  (local file backend → later RustFS S3)
                            └──────┬───────┘
                                   │
        ┌──────────────┬──────────┼───────────┬──────────────┐
        ▼              ▼          ▼            ▼              ▼
   docker network  PostgreSQL  RustFS    Vault(sealed)   Caddy(proxy)
        │              │          │            │              │
        │              │          │     [Gate A: init+unseal] │
        │              │          │            │              │
        │              │          │      Vault(unsealed)      │
        │              │          │            │              │
        │              └──────────┴─────┐ credentials.ts      │
        │                               ▼ (@pulumi/random→Vault)
        │                          Postgres roles/DBs, RustFS buckets created
        │                               │
        └───────────────────────────────┼──────────────────────┐
                                         ▼                       │
                                    Forgejo (app.ini ← Vault) ◄──┘ (proxy routes to it)
                                         │  [Gate B: healthy]
                                         ▼
                                  first admin + org + repos
                                         │
                                         ▼
                                  act_runner (token ← Vault)
                                         │
                                         ▼
                                  CI assumes deploy duty

5.1 What depends on what

  • Everything depends on Pulumi state + passphrase.
  • credentials.ts depends on Vault being unsealed (Gate A) and on Postgres/RustFS existing (to create roles/buckets with the generated creds).
  • Forgejo depends on Postgres (DB), RustFS (blob storage), Vault (secrets), proxy (TLS/ingress).
  • Runner depends on Forgejo (registration token) and on the proxy (to reach Forgejo).
  • CI depends on the runner.
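The dependency list above is small enough to check mechanically. Below is a minimal sketch of a depth-first topological sort over those edges; the node names are shorthand chosen for this example, and the `forgejo → credentials` edge is taken from the diagram rather than from the list.

```typescript
// Edges from the "what depends on what" list; node names are shorthand.
const dependsOn: Record<string, string[]> = {
  postgres: [],
  rustfs: [],
  vault: [],
  proxy: [],
  credentials: ["vault", "postgres", "rustfs"],
  // "credentials" here comes from the diagram (app.ini is rendered from Vault).
  forgejo: ["postgres", "rustfs", "vault", "proxy", "credentials"],
  runner: ["forgejo", "proxy"],
  ci: ["runner"],
};

// Depth-first topological sort; throws if the graph contains a cycle.
function deployOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (node: string): void => {
    if (state.get(node) === "done") return;
    if (state.get(node) === "visiting") throw new Error(`circular dependency at ${node}`);
    state.set(node, "visiting");
    for (const dep of graph[node] ?? []) visit(dep);
    state.set(node, "done");
    order.push(node);
  };
  for (const node of Object.keys(graph)) visit(node);
  return order;
}
```

Pulumi derives this ordering itself from resource inputs; the sketch is only a way to confirm that the documented graph is acyclic once the bootstrap cycles of §5.2 have been broken.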

5.2 Circular dependencies & resolutions (summary; full list §9)

| Cycle | Resolution |
| --- | --- |
| Pulumi needs a secret store; Vault is that store; Vault is deployed by Pulumi | Passphrase-encrypted config holds unseal keys at bootstrap; Vault holds the rest in steady state. |
| Forgejo hosts the repo that deploys Forgejo | Deploy Forgejo from the local clone first; then push repo in + switch origin (handover). |
| CI deploys the platform; CI runs on the platform | First pulumi up is manual; CI takes over only after the runner exists and a self-rebuild is proven. |
| Registry hosts CI base images; CI fills the registry | Pull from upstream via pull-through cache day-zero; mirror into Forgejo registry afterward. |
| TLS needs DNS+ACME; ACME account must be created | DNS-01 via existing Cloudflare token, or internal CA day-zero; real certs once DNS resolves. |

6. Disaster Recovery (total VM loss)

Premise: survive on {a VM, the repo, the master passphrase}.

6.1 What must exist to recover

  1. The repo (git clone — mirrored offsite, see below).
  2. The master passphrase (operator's head / pass / Shamir split).
  3. The latest backup bundle in the offsite location: forgejo-dump.zip, pg_dump.sql, vault-raft.snap, pulumi-state (if not reconstructible), rustfs-data (unless the offsite location already holds the replicated RustFS data).

6.2 Procedure (target ≤ 1 hour; dr/restore-to-fresh-vm.sh automates most)

  1. Provision a fresh VM (Phase 0 cloud-init).
  2. git clone foundation repo; run preflight/.
  3. Set PULUMI_CONFIG_PASSPHRASE; pulumi login (local backend) or restore state from offsite.
  4. pulumi up Phases 3-4: data plane + Vault container. Restore Vault from raft snapshot (vault operator raft snapshot restore); unseal with keys from config.
  5. Restore Postgres (pg_restore) and RustFS data (sync from offsite) before starting Forgejo.
  6. pulumi up Phase 6: Forgejo against restored DB + restored data dir (git repos).
  7. Re-register the runner (new token) — runners are stateless, never restored.
  8. Validate: clone a repo, run a pipeline, push an image, read a Vault secret.
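The ordering constraint in this procedure (Vault, then Postgres, then RustFS data, then Forgejo, as also stated in §9.6) can be asserted by the DR rehearsal. A minimal sketch follows, with step names and the `restorePlanProblems` helper invented for illustration.

```typescript
// Restore order per the procedure: Vault, Postgres, RustFS data, then Forgejo.
const RESTORE_ORDER = ["vault", "postgres", "rustfs", "forgejo"];

// Returns a list of problems; an empty list means the plan respects the order.
function restorePlanProblems(plan: string[]): string[] {
  const problems: string[] = [];
  let lastIndex = -1;
  for (const step of plan) {
    const index = RESTORE_ORDER.indexOf(step);
    if (index === -1) {
      problems.push(`unknown step: ${step}`);
    } else if (index < lastIndex) {
      problems.push(`${step} ran after ${RESTORE_ORDER[lastIndex]}`);
    } else {
      lastIndex = index;
    }
  }
  // Every step must be present: starting Forgejo on unrestored data is the failure to avoid.
  for (const step of RESTORE_ORDER) {
    if (plan.indexOf(step) === -1) problems.push(`missing step: ${step}`);
  }
  return problems;
}
```

A rehearsal script could log each completed step and feed the log to this check, so an out-of-order restore fails the rehearsal rather than surfacing as silent data loss.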

6.3 What is recreatable and not backed up

  • Container images (re-pullable / rebuildable from pinned digests).
  • Search indexes (Forgejo rebuilds).
  • Caches, runner ephemeral state, pull-through cache contents.
  • Pulumi state if the local backend is reconstructible — but back it up anyway (cheap insurance).

6.4 Offsite requirement (critical)

RustFS lives on the same VM → it cannot be the only backup copy (R3). Replicate the backup bundle to a second location in a different failure domain that does not make SaaS a hard dependency: recommend a second small Hetzner VM / Storage Box in another DC, or a second self-hosted RustFS. (If a SaaS S3 is used, it must be additive, never the sole copy, which preserves the no-SaaS guarantee.)


7. Operational Lifecycle

7.1 Upgrades

  • Bump the pinned digest in VERSIONS → PR → CI pulumi preview posts the plan → human approves → CI (or manual for Vault/Postgres major versions) pulumi up.
  • Snapshot before every Forgejo/Postgres/Vault upgrade (PD-4): take a backup bundle first.
  • Sequence: never upgrade Postgres and Forgejo in the same change; Vault upgrades are isolated.

7.2 Backups

  • backup/backup.sh on a timer (systemd timer or Forgejo Actions scheduled workflow): forgejo dump (repos+metadata) + pg_dump + vault operator raft snapshot save + pulumi state export → RustFS bucket foundation-backups → replicate offsite.
  • Verify restorability weekly (.forgejo/workflows/backup-verify.yml restores into a scratch container and asserts row counts / repo presence). A backup that has never been restored is a guess.
  • First backup is part of bootstrap (Phase 9) — before declaring the platform operational.
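Before a restore test even starts, the verify job can check that the bundle is complete. This is a minimal sketch using the artifact names from §6.1; the final names are owned by CONTRACT_004, and `missingArtifacts` is a hypothetical helper.

```typescript
// Artifact names as listed in §6.1; the binding definition is CONTRACT_004.
const REQUIRED_ARTIFACTS = ["forgejo-dump.zip", "pg_dump.sql", "vault-raft.snap", "pulumi-state"];

// Returns the required artifacts that are absent from a bundle listing.
function missingArtifacts(bundleFiles: string[]): string[] {
  const present = new Set(bundleFiles);
  return REQUIRED_ARTIFACTS.filter((name) => !present.has(name));
}
```

An empty result is a precondition for the restore test, not a substitute for it: completeness says nothing about whether the dump actually restores.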

7.3 Monitoring & alerting

  • Bootstrap → minimal: container healthchecks + an external uptime probe (offsite).
  • Layer 1: Prometheus/Grafana (on K8s) scrape the foundation node-exporter + Forgejo /metrics.
  • Alerting trust rule: the alerter must not run on the only host it watches. Put uptime/alert offsite so a dead VM can still page. (High confidence — common self-hosting footgun.)

7.4 Maintenance & the self-hosting milestone

  • Self-hosting is reached when (all true): foundation repo lives in Forgejo; CI can pulumi up the foundation; a DR rebuild has succeeded end-to-end from offsite backup.
  • After that, all changes flow through Git + CI, with manual pulumi up reserved as the documented break-glass for Layer-0-breaking changes (e.g., Vault/Postgres major upgrades).
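The milestone is a conjunction of three facts, which makes it easy to report exactly what is still outstanding. A minimal sketch, with the `MilestoneEvidence` field names chosen for this example:

```typescript
// One flag per milestone condition; field names are illustrative.
type MilestoneEvidence = {
  repoInForgejo: boolean;
  ciCanPulumiUp: boolean;
  drRebuildSucceeded: boolean;
};

// Returns the conditions that are not yet met; empty means self-hosting is reached.
function unmetMilestoneConditions(e: MilestoneEvidence): string[] {
  const unmet: string[] = [];
  if (!e.repoInForgejo) unmet.push("foundation repo lives in Forgejo");
  if (!e.ciCanPulumiUp) unmet.push("CI can pulumi up the foundation");
  if (!e.drRebuildSucceeded) unmet.push("DR rebuild succeeded end-to-end from offsite backup");
  return unmet;
}
```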

8. Future Expansion (how Layer 1+ integrates)

Every future service integrates through the same four foundation interfaces, never bypassing them: (1) source repo in Forgejo, (2) images/charts in Forgejo's OCI registry, (3) secrets in Vault, (4) CI in Forgejo Actions. This keeps the egg the single root for everything.

| Service | Integration path | Notes |
| --- | --- | --- |
| Kubernetes | Provisioned by a new Pulumi project whose repo lives in foundation-Forgejo; pulls images from foundation-registry; secrets from foundation-Vault. | This is where the existing olsicloud4 K8s platform reconnects, as a Layer-1 consumer. |
| ArgoCD | Deployed on K8s; its app repos are Forgejo repos; bootstrap secret (git token) from Vault. | Replaces gitlab.com source in 002_platform_architecture.md with Forgejo. |
| Internal PKI | Vault PKI secrets engine becomes the org CA, replacing Caddy's bootstrap internal CA. cert-manager (Layer 1) uses the Vault issuer. | Promotes day-zero self-signed → real internal trust. |
| Authentik (SSO/OIDC) | Deployed at Layer 1; Forgejo, Grafana, ArgoCD become OIDC clients. | Introduce SSO after the platform is stable, not day-zero (avoid an identity dependency in the egg). Forgejo can also be a temporary OIDC provider before Authentik exists. |
| Grafana / Prometheus | Layer 1; scrape foundation + cluster; dashboards-as-code in Forgejo. | §7.3. |
| Longhorn | Layer-1 storage for stateful K8s workloads; not used by Layer 0 (Layer 0 uses host volumes). | Keeps the egg storage-simple. |
| Renovate | Self-hosted runner job in Forgejo Actions; opens PRs against VERSIONS and chart repos. | Automates §7.1 digest bumps. |
| Additional registries | Forgejo's bundled registries cover OCI/npm/Helm/+20; add Harbor only if policy/scanning demands it. | Prefer not adding parts. |

Migration note: the existing platform's gitlab.com dependency (git + OCI registry at registry.gitlab.com/olsitec-nci/charts, ADR-002 paths under olsicloud4/...) is retired by pointing those repos/registries at foundation-Forgejo. That migration is its own plan, gated on the foundation being proven.


9. Bootstrap Paradoxes & Day-Zero Analysis

For each: why it exists · what depends on what · automatable? · solution · deterministic?

9.1 Infrastructure

  • First VM provisioning. Paradox: Pulumi provisions infra, but the VM hosts Pulumi's target. → A thin separate Hetzner Pulumi project (already exists: pulumi/hetzner-cloud) or one cloud-init creates the VM + installs Docker + plants the operator SSH key. Automatable: yes. Deterministic: yes (image + cloud-init pinned). The VM is the one piece provisioned before the foundation Pulumi runs.
  • Pulumi's first credentials. It needs (a) SSH to the VM, (b) the master passphrase. SSH key is the day-zero identity (§9.3); passphrase is the root of trust (§4). No other credential needed — everything else is generated. Deterministic: yes.
  • Pulumi state before infra exists (R4). Local file backend on the operator machine during bootstrap; migrate to RustFS S3 backend after Phase 5; back up state offsite. Automatable: yes. Deterministic: yes (state is data, not derived, so it is backed up, not regenerated).
  • First clone of the repo. Before Forgejo exists the repo lives… somewhere external (operator workstation + an offsite git mirror — e.g. a bare repo on the backup host, or temporarily gitlab.com during migration). After handover, Forgejo is canonical. Automatable: partially (the very first clone is operator action). Deterministic: yes (content-addressed git).
  • Binary installation. preflight/ checks; a pinned installer script fetches exact versions from VERSIONS. Automatable: yes. Deterministic: yes (pinned).
  • Host validation. preflight/preflight.sh asserts tool versions, docker reachability, ssh, dns, disk, clock. Fails closed before any deploy. Automatable: yes.
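The fail-closed behavior of preflight can be sketched as an exact comparison between pinned and detected tool versions. The `ToolCheck` shape is an assumption for this example; how versions are detected on the host, and how VERSIONS is read, is left out.

```typescript
// One entry per tool; `found` is null when the tool is not installed.
type ToolCheck = { tool: string; pinned: string; found: string | null };

// Tolerate a leading "v" and surrounding whitespace, nothing else.
function normalizeVersion(v: string): string {
  return v.trim().replace(/^v/, "");
}

// Exact-match policy: any missing or mismatched tool is a failure.
function preflightFailures(checks: ToolCheck[]): string[] {
  const failures: string[] = [];
  for (const c of checks) {
    if (c.found === null) {
      failures.push(`${c.tool}: not installed`);
    } else if (normalizeVersion(c.found) !== normalizeVersion(c.pinned)) {
      failures.push(`${c.tool}: found ${c.found}, pinned ${c.pinned}`);
    }
  }
  return failures;
}

// Non-zero exit code on any failure, so no deploy starts on an unvalidated host.
function preflightExitCode(checks: ToolCheck[]): number {
  return preflightFailures(checks).length === 0 ? 0 : 1;
}
```

Exact matching is stricter than a minimum-version check, but it is what makes a rebuild months later use the same toolchain as the original deploy.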

9.2 Secrets & Trust

  • Root of trust: master passphrase (§4.1). Minimal external secret: that passphrase, nothing else.
  • Vault init / unseal keys / initial creds: §4.3 — proven olsitec-core capture pattern.
  • Deterministic vs random creds: §4.2.
  • Rotation / recovery after total loss: §4.4 / §4.5 + §6.

9.3 Identity

  • First administrator: created non-interactively by Pulumi via forgejo admin user create (container exec) or FORGEJO__security__INSTALL_LOCK=true + env, with an admin password from @pulumi/random → Vault. No human types a password into a web form. Automatable: yes. Deterministic: the flow is; the password is random-but-stored. (High confidence — Forgejo supports headless admin creation.)
  • First admin authentication: operator reads the generated admin password from Vault (passphrase → Vault). No default/weak credential ever exists.
  • First SSH key trusted: the operator key is planted by cloud-init (Phase 0) — this is the irreducible day-zero trust seed. Subsequent keys are managed in Forgejo.
  • Service identities: each service gets its own Vault path + (later) AppRole, mirroring ADR-002.
  • OIDC/SSO: introduce at Layer 1 (§8), not day-zero — avoids an identity dependency inside the egg.

9.4 Certificates & Networking

  • Initial TLS: DNS-01 ACME via the existing Cloudflare token (already in the platform per 002_platform_architecture.md), issued by Caddy — works even before the host is publicly reachable. Fallback: Caddy internal CA for day-zero, swap to real certs once DNS resolves.
  • Internal PKI: not required day-zero; Vault PKI takes over at Layer 1 (§8).
  • Cert rotation: Caddy auto-renews ACME; Vault PKI handles internal rotation later.
  • DNS assumptions: forge.olsitec.de (+ registry/host) must resolve to the VM before handover. Owner: Cloudflare zone. This is a hard prerequisite — list it in preflight.
  • Reverse proxy bootstrap: Caddyfile rendered from template by Pulumi; routes web/API/registry on one host; Git-over-SSH exposed directly (port 22/2222) not via the HTTP proxy.
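Rendering Caddyfile.tmpl (and app.ini.tmpl) can be as small as strict placeholder substitution. This is a minimal sketch of what lib/renderTemplate might do; the `{{name}}` placeholder syntax is an assumption, since the plan does not specify a template language.

```typescript
// Strict substitution: every {{name}} placeholder must have a value.
// The {{name}} syntax is assumed for illustration.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g, (_match: string, key: string) => {
    // Own-property check so inherited keys like "constructor" never resolve.
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`missing template value: ${key}`);
    }
    return values[key];
  });
}
```

Failing on a missing value, rather than rendering an empty string, keeps a half-configured proxy or app.ini from ever reaching a container.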

9.5 Forgejo

  • First repository / first commit / repo arrival: the foundation repo is pushed from the local clone into Forgejo at Phase 7 handover; origin is switched to Forgejo; this is the self-hosting moment. Automatable: yes (scripted git remote + git push).
  • First CI runner & registration token: token generated via forgejo actions generate-runner-token (or admin API) → stored in Vault → consumed by runner.ts. Automatable: yes. Deterministic flow.
  • When CI owns deployments: only after handover + runner registration + a proven self-pulumi up. Until then, manual pulumi up (§5.2, §7.4).

9.6 Storage

  • Postgres init: container with generated superuser pw; pg-init.sql creates Forgejo role+DB. Automatable: yes.
  • RustFS init: container with generated admin keys; credentials.ts creates service keys + buckets (forgejo-packages, forgejo-artifacts, forgejo-lfs, foundation-backups). Automatable: yes.
  • Bucket creation: Pulumi (S3 provider against RustFS) — deterministic names.
  • Restore order after DR: Vault → Postgres → RustFS data → then Forgejo (§6.2). Git repos (Forgejo data dir) are the irreplaceable core; restore before starting Forgejo.
  • Recreatable data: images, indexes, caches (§6.3).

9.7 Backups & Recovery

  • First backup: Phase 9, before "operational" is declared.
  • Where stored: RustFS foundation-backups + offsite replica (§6.4).
  • Backup credential protection: in Vault + mirrored to passphrase-encrypted config (R8/§4.5).
  • Required to recover everything: repo + passphrase + {forgejo dump, pg_dump, vault snapshot, pulumi state}. Disposable: images, indexes, caches, runner state (§6.3).

9.8 Operations

  • Monitoring enabled: minimal at bootstrap, full at Layer 1 (§7.3).
  • Alerting trusted: only when it runs offsite (§7.3).
  • Upgrades before CI exists: manual pulumi up with a pre-snapshot (§7.1).
  • Becomes self-hosting / all-changes-through-CI: §7.4 milestone.

9.9 Chronological Day-Zero Timeline

T0  Fresh OS         Hetzner VM created (cloud-init: docker, ssh key, firewall, clock sync).
T1  First command    operator: git clone olsitec-foundation && ./preflight/preflight.sh
T2  Trust set        export PULUMI_CONFIG_PASSPHRASE (via pass); pulumi login (local file backend).
T3  Infra deploy     pulumi up → docker network + Postgres + RustFS + Vault(sealed) + Caddy.
T4  Secret init      vault operator init → capture keys → write to passphrase-encrypted config → unseal.
T5  Credentials      @pulumi/random → Vault; Postgres roles/DBs; RustFS keys+buckets.
T6  Services init    Forgejo up (app.ini ← secrets); headless first admin created.
T7  Operational      Web/API/registry reachable over TLS; admin password readable from Vault.
T8  Self-hosting     push foundation repo → Forgejo; switch origin; create org; register runner.
T9  First CI deploy  .forgejo/workflows runs pulumi preview → (approve) → up. CI now owns changes.
T10 Backup           backup.sh → RustFS → offsite. (first bundle)
T11 DR validated     restore-to-fresh-vm.sh rebuilds on a clean VM from offsite backup; smoke tests pass.

Goal achieved: every step T1-T11 is scripted; the only human actions are providing the passphrase and approving the first CI deploy. No undocumented manual step remains.


10. AI Execution Plan

Work is split into low-coupling tasks. Contracts are written first (baseline §9) so tasks parallelize without inventing incompatible interfaces. Each task: reviewable commit, explicit acceptance criteria, conventional-commit subject.

10.0 Contracts (write before implementation tasks)

| Contract | Defines | Consumed by |
| --- | --- | --- |
| CONTRACT_001 — Config schema | typed Pulumi config keys (hostnames, versions, sizes, feature flags) | every component |
| CONTRACT_002 — Vault path layout | `foundation/<service>/<type>-credentials` keys (camelCase, ADR-002 style) | credentials, forgejo, runner, backup |
| CONTRACT_003 — Container network & DNS names | network name, container names, internal ports | network, all services, proxy |
| CONTRACT_004 — Backup artifact format | bundle filenames, layout, restore order | backup, dr, backup-verify |
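
To illustrate what CONTRACT_002 buys the consuming tasks, here is a minimal sketch of a shared path helper. The function names, the segment rule (lowercase letters, digits, hyphens) and the camelCase regular expression are assumptions for illustration; the contract document itself would be authoritative.

```typescript
// Sketch of CONTRACT_002 path handling. The segment rule and the camelCase
// pattern are assumptions; only the path shape comes from the plan.
const SEGMENT_PATTERN = /^[a-z][a-z0-9-]*$/;
const CAMEL_CASE_KEY = /^[a-z][a-zA-Z0-9]*$/;

// Builds foundation/<service>/<type>-credentials and rejects malformed segments.
function vaultCredentialPath(service: string, credentialType: string): string {
  for (const segment of [service, credentialType]) {
    if (!SEGMENT_PATTERN.test(segment)) {
      throw new Error("invalid Vault path segment: " + JSON.stringify(segment));
    }
  }
  return "foundation/" + service + "/" + credentialType + "-credentials";
}

// Returns the keys of a secret that do not follow the camelCase convention.
function nonCamelCaseKeys(secret: Record<string, string>): string[] {
  return Object.keys(secret).filter((key) => !CAMEL_CASE_KEY.test(key));
}
```

If every component builds its paths through one helper like this, a writer (credentials) and a reader (forgejo, runner, backup) cannot silently disagree on spelling.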

10.1 Tasks

| ID | Task | Depends on | Parallel? | Acceptance criteria |
| --- | --- | --- | --- | --- |
| T00 | Contracts CONTRACT_001–004 + ADR_F001 (layered platform) | – | – | 4 contract docs + ADR committed; reviewed by human. |
| T01 | Repo scaffold + preflight/ + VERSIONS | T00 | yes | preflight.sh exits non-zero on any missing/mismatched tool; passes on a prepared host. |
| T02 | Pulumi project skeleton + passphrase backend + config.ts (CONTRACT_001) | T00 | yes | pulumi preview runs with empty stack; config schema typed; secrets provider = passphrase. |
| T03 | network.ts + postgres.ts | T02, C003 | yes | Postgres container up via @pulumi/docker; role+DB created; healthcheck green. |
| T04 | rustfs.ts + bucket provisioning | T02, C002/C003 | yes | RustFS up; 4 buckets created; service key can put/get an object. |
| T05 | vault.ts + lib/vaultInitCapture (reuse olsitec-core pattern) | T02 | yes | Vault inits; keys+root captured into encrypted config; unseal helper unseals after restart. |
| T06 | credentials.ts (@pulumi/random → Vault, CONTRACT_002) | T05 | no (needs Vault) | All credential keys present in Vault at correct paths; idempotent on re-run. |
| T07 | proxy.ts (Caddy) + TLS strategy (DNS-01 + internal-CA fallback) | T02, C003 | yes | HTTPS terminates for forge.*; cert from Let's Encrypt (or internal CA in dev). |
| T08 | forgejo.ts — app.ini render, install-lock, S3+DB+Vault wiring | T03, T04, T06, T07 | no | Forgejo healthy; uses external Postgres + RustFS; web/API reachable via proxy. |
| T09 | Forgejo headless first-admin + org + repo bootstrap | T08 | no | Admin created non-interactively; password in Vault; org exists; no default creds. |
| T10 | runner.ts — registration-token flow + act_runner | T08, T09 | no | Runner registers via Vault token; a hello-world workflow runs to success. |
| T11 | Self-hosting handover script (push repo, switch origin, mirror infra repos) | T09 | no | Foundation repo present in Forgejo; origin switched; git push works over SSH. |
| T12 | backup/ (backup.sh + restore.sh, CONTRACT_004) | T08 | yes | Bundle written to RustFS + offsite; restore.sh reconstructs into a scratch env. |
| T13 | dr/ runbook + restore-to-fresh-vm.sh | T12 | no | Automated rebuild on a clean VM passes smoke tests (clone, pipeline, registry push, vault read). |
| T14 | .forgejo/workflows/ (preflight, pulumi preview, pulumi up, backup-verify) | T10, T11 | yes | preview workflow posts plan; up workflow gated on approval; backup-verify restores+asserts. |
| T15 | index.ts phase orchestration + Gate A/B + DAY-ZERO checklist | T03–T08 | no | pulumi up from empty → operational in one command (modulo passphrase + approval). |

10.2 Parallelization map

  • Wave 1 (parallel): T01, T02 (after T00 contracts).
  • Wave 2 (parallel): T03, T04, T05, T07 (all depend only on T02 + contracts).
  • Wave 3: T06 (needs T05) ∥ start T12 design.
  • Wave 4: T08 (integrates T03/04/06/07).
  • Wave 5: T09 → T10 → T11 (sequential handover chain) ∥ T12 impl.
  • Wave 6: T13, T14, T15.
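
The wave map can be cross-checked against the "Depends on" column by computing the earliest possible layering. This is a minimal sketch under stated assumptions: the `taskDeps` table is transcribed from §10.1, contract dependencies (C002/C003) are omitted because the contracts ship in T00, which every task reaches through T02, and `earliestWaves` is a hypothetical helper, not part of the plan.

```typescript
// Task dependencies transcribed from the task table in section 10.1.
// Contract dependencies are folded into T00 (assumption, see lead-in).
const taskDeps: Record<string, string[]> = {
  T00: [],
  T01: ["T00"],
  T02: ["T00"],
  T03: ["T02"],
  T04: ["T02"],
  T05: ["T02"],
  T06: ["T05"],
  T07: ["T02"],
  T08: ["T03", "T04", "T06", "T07"],
  T09: ["T08"],
  T10: ["T08", "T09"],
  T11: ["T09"],
  T12: ["T08"],
  T13: ["T12"],
  T14: ["T10", "T11"],
  T15: ["T03", "T04", "T05", "T06", "T07", "T08"],
};

// Groups tasks into layers: a task lands in the first layer in which all of
// its dependencies have already completed in an earlier layer.
function earliestWaves(deps: Record<string, string[]>): string[][] {
  const done: Record<string, boolean> = {};
  let remaining = Object.keys(deps).sort();
  const waves: string[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((id) => deps[id].every((d) => done[d] === true));
    if (ready.length === 0) {
      throw new Error("cycle or unknown dependency among: " + remaining.join(", "));
    }
    for (const id of ready) {
      done[id] = true;
    }
    remaining = remaining.filter((id) => ready.indexOf(id) === -1);
    waves.push(ready);
  }
  return waves;
}

const computedWaves = earliestWaves(taskDeps);
```

Under these dependencies the earliest layering places T12 and T15 directly after T08 and puts T10 and T11 in the same layer. The hand-written map schedules some of these later (for example T15 in Wave 6), which is more conservative but never runs a task before its dependencies.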

10.3 Per-task prompt skeleton (baseline §7.1)

Each agent prompt must carry: Mission · Mode (BUILD or HIGH-RISK/INFRA) · the relevant CONTRACT_00x · the component file it owns · Non-goals (don't touch other components, don't edit generated/rendered secrets, don't run pulumi up against the real VM without approval) · Acceptance criteria (above) · Escalation (stop if Vault/state/secret behavior diverges from this plan).
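
One way to keep prompts uniform is to treat the skeleton as a typed record and render it. The sketch below is illustrative only: the `AgentTaskPrompt` interface, the `renderAgentPrompt` helper, the output layout and the mode chosen for T06 are all assumptions, while the field list mirrors the skeleton above.

```typescript
// Hypothetical typed form of the per-task prompt skeleton.
type AgentMode = "BUILD" | "HIGH-RISK/INFRA";

interface AgentTaskPrompt {
  taskId: string;
  mission: string;
  mode: AgentMode;
  contracts: string[]; // e.g. ["CONTRACT_002"]
  ownedFile: string; // the single component file the agent may edit
  nonGoals: string[];
  acceptanceCriteria: string[];
  escalation: string;
}

// Renders the record as plain text, one field per line, lists as bullets.
function renderAgentPrompt(p: AgentTaskPrompt): string {
  const bullets = (items: string[]): string => items.map((i) => "  - " + i).join("\n");
  return [
    "Task: " + p.taskId,
    "Mission: " + p.mission,
    "Mode: " + p.mode,
    "Contracts: " + p.contracts.join(", "),
    "Owns: " + p.ownedFile,
    "Non-goals:\n" + bullets(p.nonGoals),
    "Acceptance criteria:\n" + bullets(p.acceptanceCriteria),
    "Escalation: " + p.escalation,
  ].join("\n");
}

// Example for T06; the mode is an assumption, the rest is taken from the task table.
const t06Prompt: AgentTaskPrompt = {
  taskId: "T06",
  mission: "Generate credentials with @pulumi/random and store them in Vault.",
  mode: "HIGH-RISK/INFRA",
  contracts: ["CONTRACT_002"],
  ownedFile: "credentials.ts",
  nonGoals: [
    "Do not touch other components",
    "Do not edit generated/rendered secrets",
    "Do not run pulumi up against the real VM without approval",
  ],
  acceptanceCriteria: [
    "All credential keys present in Vault at correct paths",
    "Idempotent on re-run",
  ],
  escalation: "Stop if Vault/state/secret behavior diverges from this plan.",
};

const t06PromptText = renderAgentPrompt(t06Prompt);
```

Because every field is required by the type, a prompt that omits Non-goals or Escalation fails to compile instead of reaching an agent incomplete.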


Ratified Decisions (2026-06-30)

These four were decided by the human and are now binding (see ADR_004):

  1. Layered platform — RATIFIED. Layer 0 = bare Docker on one VM via Pulumi; K8s/ArgoCD demoted to a Layer-1 consumer (§0). The whole plan stands on this.
  2. Vault unseal — passphrase-gated helper (§4.3 option 1). No external KMS, no SaaS. Reboots require the master passphrase to be made available to the unseal step. Auto-unseal stays off until a Layer-1 trust anchor exists.
  3. Object storage — RustFS primary (§4 R3). RustFS is the Layer-0 S3, matching the existing rustfs credential flag. Hard rule: the offsite replica is non-RustFS, so RustFS is never the only copy of a backup.
  4. Offsite backup — second self-hosted location (§6.4). Different DC/failure domain, no SaaS dependency. Preferred seed: reuse pulumi/hetzner-cloud for both the Phase-0 VM and the offsite host.
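
The hard rule in decision 3 lends itself to a configuration check. The following is a minimal sketch, assuming a hypothetical `BackupTarget` shape; the type, its fields and the function name are illustrative and not defined by the plan or by CONTRACT_004.

```typescript
// Hypothetical description of a backup destination.
type StorageKind = "rustfs" | "other";

interface BackupTarget {
  name: string;
  kind: StorageKind;
  offsite: boolean;
}

// Returns rule violations; an empty list means an offsite replica exists
// and no offsite replica is backed by RustFS.
function backupRuleViolations(targets: BackupTarget[]): string[] {
  const violations: string[] = [];
  const offsite = targets.filter((t) => t.offsite);
  if (offsite.length === 0) {
    violations.push("no offsite replica configured");
  }
  for (const t of offsite) {
    if (t.kind === "rustfs") {
      violations.push("offsite replica '" + t.name + "' must not be RustFS");
    }
  }
  return violations;
}
```

Run against the stack configuration before a deploy, a check of this kind would reject a setup in which RustFS ends up as the only copy of a backup.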

Remaining minor (reversible defaults — proceeding unless you object)

  • Reverse proxy: defaulting to Caddy (auto-TLS, internal-CA fallback). Cheap to swap later.
  • Phase-0 VM seed: defaulting to pulumi/hetzner-cloud for the foundation VM + the offsite host.

Appendix — Mapping PLAN-001 → this plan

  • PLAN-001 "StatefulSet/Helm/ArgoCD" → Layer-0 "container/named-volume/Pulumi resource."
  • PLAN-001 data/state model (git on FS, Postgres, S3-for-blobs) → reused unchanged.
  • PLAN-001 runner mapping (every job runs-on: docker, code_quality dind) → reused for task T10 (§10.1).
  • PLAN-001 K8s HA topology → §8 future HA path, not bootstrap.