Repo topology, baseline overlay, planning docs (PLAN-001/002), ADR-004/005, and the bootstrap/packages/documentation skeleton. Implementation (T00+) not started. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PLAN-002 — olsitec-foundation Implementation Strategy (Master Roadmap)
Companion to PLAN-001-forgejo.md (the vision) and 002_platform_architecture.md (the existing olsicloud4 K8s platform). Status: Draft for human ratification. Mode at authoring: EXPLORE (design only, no code). Author role: Lead platform architect. Date: 2026-06-30.
This document is not an implementation. It is the strategy that AI agents execute. Confidence markers (High/Medium/Low) follow baseline PD-5.
0. The Pivotal Decision (read this first)
PLAN-001 deploys Forgejo onto Kubernetes via ArgoCD + Helm. The foundation must NOT.
The foundation is the egg: the thing every other platform is hatched from. Kubernetes, ArgoCD, Helm, cert-manager and ESO are themselves hatched by the platform — so the foundation cannot depend on them without creating an unrecoverable circular dependency (DR-from-nothing would require rebuilding K8s, which needs git+registry+secrets, which are the foundation).
Recommendation — a layered platform (High confidence)
| Layer | What | Substrate | Lifecycle |
|---|---|---|---|
| Layer 0 — olsitec-foundation (the egg) | Forgejo (+ Actions + OCI/npm registry), PostgreSQL, Vault, RustFS, reverse proxy, 1 runner | Plain OCI containers on ONE VM, orchestrated by Pulumi `@pulumi/docker` over SSH. No K8s/ArgoCD/Helm. | `pulumi up` (manual day-zero → CI later) |
| Layer 1+ — the olsicloud4 K8s platform & everything else | K8s, ArgoCD, cert-manager, ESO, Authentik, Grafana/Prometheus, Longhorn, Renovate, additional registries | Kubernetes | Consumes Layer 0: repos in foundation-Forgejo, CI in foundation-Actions, images/charts in foundation-registry, secrets in foundation-Vault |
Why this is correct and not a downgrade:
- The existing repo already contains `pulumi/modules/docker/` (a `@pulumi/docker` SSH-to-host wrapper) and `pulumi/olsitec-core/run.sh` (Pulumi initializes Vault, then captures the unseal keys back into passphrase-encrypted config). The tooling is already pointed at this model. (High confidence — verified in source.)
- PLAN-001's K8s topology remains valid as a future, optional HA path for Forgejo (its "True HA is a step change" note). It is not thrown away — it is deferred to §8.
Consequence: Everywhere PLAN-001 says "StatefulSet / Helm / ArgoCD Application," Layer 0 reads "container + named volume / Pulumi docker.Container / Pulumi resource." The data & state model of PLAN-001 (git repos on a POSIX FS, Postgres, S3 for blobs) is unchanged and fully reused.
1. Architecture Review
1.1 Validated strengths of the vision
- Forgejo as one binary (forge + CI + OCI + npm + 20 registries) genuinely collapses GitLab's 4–5 services into one. (High) — confirmed in PLAN-001.
- Single master passphrase as the only external secret is achievable and already proven by `olsitec-core` (`PULUMI_CONFIG_PASSPHRASE` passphrase provider). (High)
- Pulumi-owns-credentials / Vault-distributes (ADR-002) is the right steady-state. (High)
- Boring tech: Postgres, Vault, S3, a reverse proxy, Docker containers. All well-understood. (High)
1.2 Weaknesses / risks identified
| # | Risk | Severity | Mitigation (see section) |
|---|---|---|---|
| R1 | Single VM = single point of failure. Forgejo is irreducibly stateful (git repos on FS). | High | Frequent backups to RustFS + offsite; DR rebuild ≤ 1h, tested (§6). HA is Layer-1 future (§8). |
| R2 | Vault auto-unseal paradox — unattended reboot leaves Vault sealed; auto-unseal normally needs an external KMS (a SaaS or a second Vault). | High | Shamir unseal; keys held in passphrase-encrypted Pulumi config; passphrase-gated unseal helper (§4, §9). |
| R3 | RustFS maturity. RustFS is a young MinIO-compatible S3. Foundation backups depend on it. | Medium | Keep S3 usage to the documented S3 API surface; never make RustFS the only copy of backups (offsite replica is non-S3-only). Treat RustFS as replaceable behind the S3 boundary. (Medium confidence on RustFS stability — flag for second-opinion.) |
| R4 | Pulumi state location before infra exists (chicken-egg). | Medium | Local file backend during bootstrap → migrate to RustFS S3 backend after; state backed up offsite (§5, §9). |
| R5 | Privileged runner. Forgejo Actions docker backend needs a privileged daemon. | Medium | Runner on a throwaway sidecar VM (or same VM, contained), never sharing the forge's trust boundary (§4a of PLAN-001 reused). |
| R6 | DinD/runner pulls from Docker Hub → rate limits + SaaS dependency for CI base images. | Medium | Pull-through cache → mirror critical images into Forgejo's own OCI registry; pin by digest (§7, §8). |
| R7 | TLS day-zero: ACME needs DNS resolving + reachability before the service is public. | Medium | DNS-01 via existing Cloudflare token (already in platform) OR reverse-proxy internal CA for day-zero, swap to real certs once DNS resolves (§9.4). |
| R8 | Backup encryption keys / offsite creds become a second must-survive secret. | Medium | Fold offsite + backup credentials into the same passphrase-encrypted config / Vault; never a bare file (§4, §6). |
| R9 | Forgejo Actions feature-completeness vs GitLab CI for existing pipelines (Kaniko, semantic-release, helm push). | Low | PLAN-001 already mapped every job → runs-on: docker. Reuse that mapping. (High) |
1.3 Hidden dependencies to make explicit
- DNS must resolve `forge.olsitec.de` (and friends) to the VM before TLS and before self-hosting handover. Who owns the zone? (Cloudflare, per existing platform.) → §9.4 Networking.
- An SSH key trusted by the VM is needed for Pulumi's Docker-over-SSH provider. That key's trust is a day-zero identity question (§9.3 Identity).
- Container images are an external dependency until mirrored. Pin by digest for determinism (§3, `VERSIONS`).
- The operator workstation is an implicit trusted host for the very first `pulumi up`. Its toolchain must be validated (preflight, §2).
1.4 Suggested additions / changes to the component list
- Add a reverse proxy with automatic TLS → recommend Caddy (auto-ACME, ~10-line Caddyfile, internal-CA fallback). Alternative: Traefik. nginx if maximum-boring is required but loses auto-TLS ergonomics. (Medium — Caddy recommended.)
- Add a Docker Hub pull-through cache (`registry:2`) at Layer 0 from day one (PLAN-001 component #6) — removes a SaaS rate-limit dependency for CI. (Medium)
- Defer Valkey/Redis — single-replica Forgejo needs no external queue/cache (PLAN-001 confirms). Add only with HA. (High)
- Defer Meilisearch — search is not foundational. (High)
- Keep `@pulumi/random` for all credential generation (reuse existing pattern). (High)
- Vault PKI engine becomes the internal CA in §8 (replacing Caddy's bootstrap internal CA).
2. Bootstrapping Strategy (empty VM → operational)
Phases are deployed by one Pulumi project with explicit ordering (component dependencies + a small number of phase gates). See §5 for the dependency graph and §9 for the full timeline.
Phase 0 PROVISION Bare VM (Hetzner) + cloud-init: docker engine, ssh key, firewall.
Phase 1 PREFLIGHT Cloned repo validates host+toolchain (pulumi, node/bun, docker, ssh, dns, age/gpg).
Phase 2 STATE+TRUST Pulumi local file backend; master passphrase set (PULUMI_CONFIG_PASSPHRASE via `pass`).
Phase 3 DATA PLANE Docker network + PostgreSQL + RustFS (sealed/empty Vault container also started).
Phase 4 VAULT INIT `vault operator init` → capture root token + unseal keys → write back to passphrase-
encrypted Pulumi config (PROVEN pattern, olsitec-core/run.sh) → unseal.
Phase 5 CREDENTIALS @pulumi/random generates all service creds → written to Vault KV v2 → RustFS buckets
created → Postgres roles/DBs created.
Phase 6 FORGE Reverse proxy + Forgejo (app.ini rendered with secrets from Vault/Pulumi) come up;
Forgejo install-lock + first admin created deterministically.
Phase 7 HANDOVER Push the foundation repo INTO Forgejo; switch git origin; create org + mirror infra
repos; register first Actions runner (token from Vault).
Phase 8 CI HANDOFF A `.forgejo/workflows/` pipeline runs `pulumi preview` (then `up` on approval).
Phase 9 BACKUP+DR First backup taken (forgejo dump + pg_dump + vault snapshot + pulumi state) → RustFS
→ offsite. DR rebuild rehearsed on a fresh VM.
Phase gates (only where strictly required):
- Gate A after Phase 4: Vault must be initialized+unsealed before Phase 5 writes secrets.
- Gate B after Phase 6: Forgejo must be healthy before Phase 7 handover. Everything else flows through ordinary Pulumi resource dependencies — no extra gates.
3. Repository Structure
A single repo = the DR unit. git clone + master passphrase ⇒ rebuild.
olsitec-foundation/
├── README.md # 5-line quickstart + DR pointer
├── VERSIONS # pinned versions+digests for every image & tool (determinism)
├── preflight/
│ ├── preflight.sh # validates tools, versions, ssh, dns, docker reachability
│ └── checks/ # individual check scripts (composable, testable)
├── pulumi/
│ ├── Pulumi.yaml # single project
│ ├── Pulumi.foundation.yaml # stack: passphrase-encrypted config + secrets (committable)
│ ├── index.ts # phase orchestration entrypoint
│ ├── config.ts # typed config schema (CONTRACT_001)
│ ├── components/ # one ComponentResource per concern
│ │ ├── network.ts # docker network, firewall expectations
│ │ ├── postgres.ts
│ │ ├── rustfs.ts # + bucket provisioning
│ │ ├── vault.ts # container + init/unseal capture lib
│ │ ├── credentials.ts # @pulumi/random → Vault writer (CONTRACT_002 paths)
│ │ ├── proxy.ts # Caddy + TLS strategy
│ │ ├── forgejo.ts # app.ini render, install-lock, first admin
│ │ └── runner.ts # act_runner + registration-token flow
│ ├── phases/ # thin orchestrators: dataPlane(), vaultInit(), forge(), handover()
│ └── lib/ # vaultInitCapture(), renderTemplate(), digest pinning helpers
├── containers/ # Dockerfiles for anything we build/mirror ourselves
├── config/ # rendered template SOURCES: app.ini.tmpl, Caddyfile.tmpl, pg-init.sql
├── backup/
│ ├── backup.sh # forgejo dump + pg_dump + vault snapshot + pulumi state → RustFS → offsite
│ └── restore.sh # inverse, parametrized by target host
├── dr/
│ ├── RUNBOOK.md # human-readable DR procedure
│ └── restore-to-fresh-vm.sh # automated rebuild used by the DR rehearsal test
├── docs/
│ ├── decisions/ # ADRs (ADR_F001 layered-platform, etc.)
│ ├── DAY-ZERO-TIMELINE.md # §9 timeline as an executable checklist
│ └── contracts/ # CONTRACT_001..004 (§10)
├── .forgejo/workflows/ # CI: preflight.yml, pulumi-preview.yml, pulumi-up.yml, backup-verify.yml
└── .gitignore # state/ (local backend), node_modules, *.local
Why this layout (High confidence):
- One repo = one DR unit. Vision requirement: "freshly cloned repo capable of pre-flight validation."
- `components/` mirror the deployment order so an agent can own one file with a clear contract.
- `config/` holds template sources, never rendered secrets — rendered output carries secrets and stays in container/Vault only (PD-2: don't version secrets).
- `VERSIONS` centralizes determinism — preflight and CI both read it; upgrades are a one-line diff.
- `.forgejo/workflows/` is co-located so the repo that defines CI is the repo CI deploys (self-hosting).
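As an illustration of how preflight and CI could both read `VERSIONS`, the sketch below parses a pin file and rejects any entry that is not pinned by digest. The line format (`<name> <image>@sha256:<digest>`), the `parseVersions` name, and the example image reference are assumptions; the plan does not define the file format.

```typescript
// Hypothetical VERSIONS line format (an assumption, not defined by the plan):
//   <name> <image>@sha256:<64 lowercase hex chars>
interface PinnedImage {
  name: string;
  image: string;
  digest: string;
}

const PIN_PATTERN = /^(\S+)\s+(\S+)@(sha256:[0-9a-f]{64})$/;

// Parse the file contents; throw on the first entry that lacks a digest pin.
function parseVersions(text: string): PinnedImage[] {
  const pins: PinnedImage[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const match = PIN_PATTERN.exec(line);
    if (match === null) {
      throw new Error(`VERSIONS entry is not pinned by digest: ${line}`);
    }
    pins.push({ name: match[1], image: match[2], digest: match[3] });
  }
  return pins;
}
```

Failing on a tag-only reference such as `image:latest` keeps the "upgrades are a one-line diff" property honest: a change to what runs always shows up as a changed digest.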
4. Secret Management
4.1 Root of trust
The master passphrase (PULUMI_CONFIG_PASSPHRASE) is the single root. It selects Pulumi's
passphrase secrets provider (already in use: encryptionsalt in Pulumi.olsitec-core.yaml).
Chain of trust:
Master passphrase
└─ decrypts Pulumi stack config secrets (committable, `secure: v1:…`)
└─ which hold Vault unseal keys + root token (captured at init)
└─ Vault becomes the runtime distribution layer for ALL other secrets (ADR-002)
The passphrase is the only thing a human must carry out-of-band. Store it in pass
(operator side), and/or split it among operators with Shamir, and/or a hardware token. It is
never written to the platform.
4.2 Credential generation — deterministic vs random
| Class | Examples | Source | Rationale |
|---|---|---|---|
| Random / high-entropy | all service passwords, Postgres pw, RustFS access/secret keys, Forgejo SECRET_KEY + INTERNAL_TOKEN + JWT secrets, OCI/npm registry tokens, runner registration token, Forgejo admin password | `@pulumi/random` → Vault KV v2 | secrets must be unguessable; rotation = `--replace` |
| Derived / deterministic | usernames, DB names, bucket names, container/DNS names, Vault mount layout, hostnames | computed from typed config | reproducible, non-secret, no entropy needed |
| External (the ONLY one) | master passphrase | human | root of trust |
This satisfies the vision: "everything else should derive from that."
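The "derived / deterministic" class can be pictured as a pure function of the typed config. The sketch below is illustrative only: the `FoundationConfig` fields, the `deriveNames` helper, and the `forgejo` database and role names are assumptions, while the bucket names are the ones listed in §9.6.

```typescript
// Hypothetical slice of the typed config (CONTRACT_001 is not yet written).
interface FoundationConfig {
  domain: string;    // e.g. "olsitec.de"
  forgeHost: string; // e.g. "forge"
}

interface DerivedNames {
  forgeFqdn: string;
  dbName: string;
  dbUser: string;
  buckets: string[];
}

// Pure function: same config in, same names out. No entropy, no secrets.
function deriveNames(cfg: FoundationConfig): DerivedNames {
  return {
    forgeFqdn: `${cfg.forgeHost}.${cfg.domain}`,
    dbName: "forgejo", // assumed name
    dbUser: "forgejo", // assumed name
    buckets: [
      "forgejo-packages",
      "forgejo-artifacts",
      "forgejo-lfs",
      "foundation-backups",
    ],
  };
}
```

Because nothing here is random, a DR rebuild from the same repo produces identical names, and only the random class has to be restored from Vault or the encrypted config.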
4.3 Vault initialization & unseal (the hard part — High attention)
- Init: Pulumi runs `vault operator init` (Shamir, e.g. 5 keys / threshold 3) inside the Vault container, captures `unsealKeys` + `rootToken` as stack outputs, then `run.sh` (or a Pulumi `local.Command`) writes them back as passphrase-encrypted Pulumi config secrets. This exact pattern already exists in `olsitec-core/run.sh` — reuse it verbatim. (High confidence.)
- Unseal on reboot (R2): Vault seals on every restart. Options:
  1. Passphrase-gated unseal helper (recommended) — a small script reads the unseal keys from Pulumi config (decrypted by the passphrase the operator provides) and unseals. Deterministic, reproducible, no external KMS, no SaaS. Cost: VM reboots need an operator (or a passphrase made available to a boot service — a trade-off to decide).
  2. Transit auto-unseal — rejected at Layer 0 (needs a second Vault → circular).
  3. KMS auto-unseal — rejected (SaaS dependency, violates design goal).

  → Recommend (1) for Layer 0; revisit auto-unseal when a second trust anchor exists at Layer 1. (Medium confidence — this is the main open operational question; flag for second-opinion.)
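The threshold behaviour that the unseal helper relies on can be modelled in a few lines. This is a bookkeeping sketch only, not Shamir cryptography and not the Vault API: `submitUnsealKeys` and `UnsealState` are hypothetical names, and counting a repeated share as no progress is an assumption of this model.

```typescript
interface UnsealState {
  sealed: boolean;
  progress: number;  // distinct shares accepted so far, capped at threshold
  threshold: number; // shares required, e.g. 3 of 5
}

// Model of threshold unsealing: Vault stays sealed until `threshold`
// distinct key shares have been supplied.
function submitUnsealKeys(keys: string[], threshold: number): UnsealState {
  if (threshold < 1) throw new Error("threshold must be at least 1");
  const distinct = new Set(keys);
  const progress = Math.min(distinct.size, threshold);
  return { sealed: progress < threshold, progress, threshold };
}
```

With the example 5 keys / threshold 3, the helper only needs to decrypt and submit any three shares, so losing up to two shares from the encrypted config would still leave Vault recoverable.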
4.4 Rotation
Per ADR-002: pulumi up --replace on the RandomPassword → new value in Vault → consumers reload.
At Layer 0, consumers are containers, so rotation triggers a container recreate (Pulumi handles the
dependency). Vault root token: rotate via vault operator generate-root after bootstrap; store new
token in config. Unseal-key rotation: vault operator rekey.
4.5 Recovery & backup of secrets
- Vault data backed up via `vault operator raft snapshot save` → RustFS → offsite.
- Unseal keys + root token survive inside the passphrase-encrypted Pulumi stack config, which is in the repo (and the repo is backed up). So {repo + passphrase} reconstitutes Vault access.
- Backup/offsite credentials (R8) live in Vault and are mirrored into the passphrase-encrypted config, so they survive even total Vault loss.
5. Deployment Order & Dependency Graph
┌─────────────────────────┐
│ master passphrase (ext) │
└───────────┬─────────────┘
│ selects secrets provider
┌──────▼───────┐
│ Pulumi state │ (local file backend → later RustFS S3)
└──────┬───────┘
│
┌──────────────┬──────────┼───────────┬──────────────┐
▼ ▼ ▼ ▼ ▼
docker network PostgreSQL RustFS Vault(sealed) Caddy(proxy)
│ │ │ │ │
│ │ │ [Gate A: init+unseal] │
│ │ │ │ │
│ │ │ Vault(unsealed) │
│ │ │ │ │
│ └──────────┴─────┐ credentials.ts │
│ ▼ (@pulumi/random→Vault)
│ Postgres roles/DBs, RustFS buckets created
│ │
└───────────────────────────────┼──────────────────────┐
▼ │
Forgejo (app.ini ← Vault) ◄──┘ (proxy routes to it)
│ [Gate B: healthy]
▼
first admin + org + repos
│
▼
act_runner (token ← Vault)
│
▼
CI assumes deploy duty
5.1 What depends on what
- Everything depends on Pulumi state + passphrase.
- credentials.ts depends on Vault being unsealed (Gate A) and on Postgres/RustFS existing (to create roles/buckets with the generated creds).
- Forgejo depends on Postgres (DB), RustFS (blob storage), Vault (secrets), proxy (TLS/ingress).
- Runner depends on Forgejo (registration token) and on the proxy (to reach Forgejo).
- CI depends on the runner.
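The dependency statements above can be checked mechanically. The following sketch encodes them as a graph and derives one valid deployment order with a depth-first topological sort, failing on any cycle. The node names and the `deployOrder` helper are hypothetical, and the `credentials` edge into `forgejo` is read from the diagram rather than from the bullet list.

```typescript
type DependencyGraph = Record<string, string[]>;

// Depth-first topological sort: dependencies are emitted before dependents.
function deployOrder(graph: DependencyGraph): string[] {
  const order: string[] = [];
  const marks = new Map<string, "visiting" | "done">();

  const visit = (node: string): void => {
    const mark = marks.get(node);
    if (mark === "done") return;
    if (mark === "visiting") throw new Error(`circular dependency at ${node}`);
    marks.set(node, "visiting");
    for (const dep of graph[node] ?? []) visit(dep);
    marks.set(node, "done");
    order.push(node);
  };

  for (const node of Object.keys(graph)) visit(node);
  return order;
}

// Edges transcribed from §5.1 ("state" stands for Pulumi state + passphrase).
const foundationGraph: DependencyGraph = {
  network: ["state"],
  postgres: ["state"],
  rustfs: ["state"],
  vault: ["state"],
  proxy: ["state"],
  credentials: ["vault", "postgres", "rustfs"],
  forgejo: ["postgres", "rustfs", "vault", "proxy", "credentials"],
  runner: ["forgejo", "proxy"],
  ci: ["runner"],
};

console.log(deployOrder(foundationGraph).join(" -> "));
```

In the real project Pulumi resolves this ordering from resource dependencies; a check like this is mainly useful as a test that a new component has not introduced a cycle.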
5.2 Circular dependencies & resolutions (summary; full list §9)
| Cycle | Resolution |
|---|---|
| Pulumi needs a secret store; Vault is that store; Vault is deployed by Pulumi | Passphrase-encrypted config holds unseal keys at bootstrap; Vault holds the rest in steady state. |
| Forgejo hosts the repo that deploys Forgejo | Deploy Forgejo from the local clone first; then push repo in + switch origin (handover). |
| CI deploys the platform; CI runs on the platform | First pulumi up is manual; CI takes over only after the runner exists and a self-rebuild is proven. |
| Registry hosts CI base images; CI fills the registry | Pull from upstream via pull-through cache day-zero; mirror into Forgejo registry afterward. |
| TLS needs DNS+ACME; ACME account must be created | DNS-01 via existing Cloudflare token, or internal CA day-zero; real certs once DNS resolves. |
6. Disaster Recovery (total VM loss)
Premise: survive on {a VM, the repo, the master passphrase}.
6.1 What must exist to recover
- The repo (git clone — mirrored offsite, see below).
- The master passphrase (operator's head / `pass` / Shamir split).
- The latest backup bundle in the offsite location: `forgejo-dump.zip`, `pg_dump.sql`, `vault-raft.snap`, `pulumi-state` (if not reconstructible), `rustfs-data` (or it is the offsite).
6.2 Procedure (target ≤ 1 hour; dr/restore-to-fresh-vm.sh automates most)
- Provision a fresh VM (Phase 0 cloud-init).
- `git clone` the foundation repo; run `preflight/`.
- Set `PULUMI_CONFIG_PASSPHRASE`; `pulumi login` (local backend) or restore state from offsite.
- `pulumi up` Phases 3–4: data plane + Vault container. Restore Vault from the raft snapshot (`vault operator raft snapshot restore`); unseal with keys from config.
- Restore Postgres (`psql` for the plain-SQL `pg_dump.sql`; `pg_restore` applies only if the dump is taken in custom or archive format) and RustFS data (sync from offsite) before starting Forgejo.
- `pulumi up` Phase 6: Forgejo against restored DB + restored data dir (git repos).
- Re-register the runner (new token) — runners are stateless, never restored.
- Validate: clone a repo, run a pipeline, push an image, read a Vault secret.
6.3 What is recreatable and not backed up
- Container images (re-pullable / rebuildable from pinned digests).
- Search indexes (Forgejo rebuilds).
- Caches, runner ephemeral state, pull-through cache contents.
- Pulumi state if the local backend is reconstructible — but back it up anyway (cheap insurance).
6.4 Offsite requirement (critical)
RustFS lives on the same VM → it cannot be the only backup copy (R3). Replicate the backup bundle to a second location with a different failure domain that is not SaaS by hard dependency: recommend a second small Hetzner VM / Storage Box in another DC, or a second self-hosted RustFS. (If a SaaS S3 is used, it must be additive, never the sole copy — preserving the no-SaaS guarantee.)
7. Operational Lifecycle
7.1 Upgrades
- Bump the pinned digest in `VERSIONS` → PR → CI `pulumi preview` posts the plan → human approves → CI (or manual for Vault/Postgres major versions) `pulumi up`.
- Snapshot before every Forgejo/Postgres/Vault upgrade (PD-4): take a backup bundle first.
- Sequence: never upgrade Postgres and Forgejo in the same change; Vault upgrades are isolated.
7.2 Backups
- `backup/backup.sh` on a timer (systemd timer or Forgejo Actions scheduled workflow): `forgejo dump` (repos+metadata) + `pg_dump` + `vault operator raft snapshot save` + `pulumi stack export` → RustFS bucket `foundation-backups` → replicate offsite.
- Verify restorability weekly (`.forgejo/workflows/backup-verify.yml` restores into a scratch container and asserts row counts / repo presence). A backup that has never been restored is a guess.
- First backup is part of bootstrap (Phase 9) — before declaring the platform operational.
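Before a restore is even attempted, a verify job can cheaply confirm that a bundle is complete. The sketch below assumes the artifact names listed in §6.1 and a hypothetical `missingArtifacts` helper; the real bundle layout is whatever CONTRACT_004 ends up defining.

```typescript
// Artifact names taken from §6.1; CONTRACT_004 may change them.
const REQUIRED_ARTIFACTS = [
  "forgejo-dump.zip",
  "pg_dump.sql",
  "vault-raft.snap",
  "pulumi-state",
] as const;

// Return the required artifacts that are absent from a bundle listing.
function missingArtifacts(bundleFiles: string[]): string[] {
  const present = new Set(bundleFiles);
  return REQUIRED_ARTIFACTS.filter((name) => !present.has(name));
}

// Example: a bundle that lacks the Vault snapshot.
const gaps = missingArtifacts(["forgejo-dump.zip", "pg_dump.sql", "pulumi-state"]);
if (gaps.length > 0) {
  console.error(`incomplete backup bundle, missing: ${gaps.join(", ")}`);
}
```

A completeness check is not a substitute for the weekly restore; it only fails fast on bundles that could never restore.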
7.3 Monitoring & alerting
- Bootstrap → minimal: container healthchecks + an external uptime probe (offsite).
- Layer 1: Prometheus/Grafana (on K8s) scrape the foundation node-exporter + Forgejo `/metrics`.
- Alerting trust rule: the alerter must not run on the only host it watches. Put uptime/alert offsite so a dead VM can still page. (High confidence — common self-hosting footgun.)
7.4 Maintenance & the self-hosting milestone
- Self-hosting is reached when (all true): foundation repo lives in Forgejo; CI can `pulumi up` the foundation; a DR rebuild has succeeded end-to-end from offsite backup.
- After that, all changes flow through Git + CI, with manual `pulumi up` reserved as the documented break-glass for Layer-0-breaking changes (e.g., Vault/Postgres major upgrades).
8. Future Expansion (how Layer 1+ integrates)
Every future service integrates through the same four foundation interfaces, never bypassing them: (1) source repo in Forgejo, (2) images/charts in Forgejo's OCI registry, (3) secrets in Vault, (4) CI in Forgejo Actions. This keeps the egg the single root for everything.
| Service | Integration path | Notes |
|---|---|---|
| Kubernetes | Provisioned by a new Pulumi project whose repo lives in foundation-Forgejo; pulls images from foundation-registry; secrets from foundation-Vault. | This is where the existing olsicloud4 K8s platform reconnects — as a Layer-1 consumer. |
| ArgoCD | Deployed on K8s; its app repos are Forgejo repos; bootstrap secret (git token) from Vault. | Replaces gitlab.com source in 002_platform_architecture.md with Forgejo. |
| Internal PKI | Vault PKI secrets engine becomes the org CA, replacing Caddy's bootstrap internal CA. cert-manager (Layer 1) uses the Vault issuer. | Promotes day-zero self-signed → real internal trust. |
| Authentik (SSO/OIDC) | Deployed at Layer 1; Forgejo, Grafana, ArgoCD become OIDC clients. Introduce SSO after the platform is stable — not day-zero (avoid an identity dependency in the egg). | Forgejo can also be a temporary OIDC provider before Authentik exists. |
| Grafana / Prometheus | Layer 1; scrape foundation + cluster; dashboards-as-code in Forgejo. | §7.3. |
| Longhorn | Layer-1 storage for stateful K8s workloads — not used by Layer 0 (Layer 0 uses host volumes). | Keeps the egg storage-simple. |
| Renovate | Self-hosted runner job in Forgejo Actions; opens PRs against `VERSIONS` and chart repos. | Automates §7.1 digest bumps. |
| Additional registries | Forgejo's bundled registries cover OCI/npm/Helm/+20; add Harbor only if policy/scanning demands it. | Prefer not adding parts. |
Migration note: the existing platform's gitlab.com dependency (git + OCI registry at
registry.gitlab.com/olsitec-nci/charts, ADR-002 paths under olsicloud4/...) is retired by
pointing those repos/registries at foundation-Forgejo. That migration is its own plan, gated on the
foundation being proven.
9. Bootstrap Paradoxes & Day-Zero Analysis
For each: why it exists · what depends on what · automatable? · solution · deterministic?
9.1 Infrastructure
- First VM provisioning. Paradox: Pulumi provisions infra, but the VM hosts Pulumi's target. → A thin separate Hetzner Pulumi project (already exists: `pulumi/hetzner-cloud`) or one cloud-init creates the VM + installs Docker + plants the operator SSH key. Automatable: yes. Deterministic: yes (image + cloud-init pinned). The VM is the one piece provisioned before the foundation Pulumi runs.
- Pulumi's first credentials. It needs (a) SSH to the VM, (b) the master passphrase. SSH key is the day-zero identity (§9.3); passphrase is the root of trust (§4). No other credential needed — everything else is generated. Deterministic: yes.
- Pulumi state before infra exists (R4). → Local file backend on the operator machine during bootstrap; migrate to RustFS S3 backend after Phase 5; back up state offsite. Automatable: yes. Deterministic: yes (state is data, not derived, so it is backed up, not regenerated).
- First clone of the repo. Before Forgejo exists the repo lives… somewhere external (operator workstation + an offsite git mirror — e.g. a bare repo on the backup host, or temporarily gitlab.com during migration). After handover, Forgejo is canonical. Automatable: partially (the very first clone is operator action). Deterministic: yes (content-addressed git).
- Binary installation. `preflight/` checks; a pinned installer script fetches exact versions from `VERSIONS`. Automatable: yes. Deterministic: yes (pinned).
- Host validation. `preflight/preflight.sh` asserts tool versions, docker reachability, ssh, dns, disk, clock. Fails closed before any deploy. Automatable: yes.
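The fail-closed rule for host validation reduces to "any mismatch or missing tool blocks the deploy". A minimal sketch of that comparison follows. The `ToolCheck` shape, the `preflightFailures` name, and the version strings are all hypothetical, and the plan implements preflight as shell scripts, so this only illustrates the logic.

```typescript
interface ToolCheck {
  tool: string;
  required: string;     // version pinned in VERSIONS
  found: string | null; // version detected on the host, null if absent
}

// Return one message per failed check; an empty array means the host passes.
function preflightFailures(checks: ToolCheck[]): string[] {
  return checks
    .filter((c) => c.found !== c.required)
    .map((c) =>
      c.found === null
        ? `${c.tool}: missing (need ${c.required})`
        : `${c.tool}: found ${c.found}, need ${c.required}`,
    );
}

// Illustrative values only; real pins come from VERSIONS.
const failures = preflightFailures([
  { tool: "pulumi", required: "1.0.0", found: "1.0.0" },
  { tool: "docker", required: "2.0.0", found: "1.9.0" },
  { tool: "pass", required: "3.0.0", found: null },
]);
if (failures.length > 0) {
  console.error(failures.join("\n"));
}
```

A caller would exit non-zero whenever the returned list is non-empty, which matches the T01 acceptance criterion.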
9.2 Secrets & Trust
- Root of trust: master passphrase (§4.1). Minimal external secret: that passphrase, nothing else.
- Vault init / unseal keys / initial creds: §4.3 — proven `olsitec-core` capture pattern.
- Deterministic vs random creds: §4.2.
- Rotation / recovery after total loss: §4.4 / §4.5 + §6.
9.3 Identity
- First administrator: created non-interactively by Pulumi via `forgejo admin user create` (container exec) or `FORGEJO__security__INSTALL_LOCK=true` + env, with an admin password from `@pulumi/random` → Vault. No human types a password into a web form. Automatable: yes. Deterministic: the flow is; the password is random-but-stored. (High confidence — Forgejo supports headless admin creation.)
- First admin authentication: operator reads the generated admin password from Vault (passphrase → Vault). No default/weak credential ever exists.
- First SSH key trusted: the operator key is planted by cloud-init (Phase 0) — this is the irreducible day-zero trust seed. Subsequent keys are managed in Forgejo.
- Service identities: each service gets its own Vault path + (later) AppRole, mirroring ADR-002.
- OIDC/SSO: introduce at Layer 1 (§8), not day-zero — avoids an identity dependency inside the egg.
9.4 Certificates & Networking
- Initial TLS: DNS-01 ACME via the existing Cloudflare token (already in the platform per `002_platform_architecture.md`), issued by Caddy — works even before the host is publicly reachable. Fallback: Caddy internal CA for day-zero, swap to real certs once DNS resolves.
- Internal PKI: not required day-zero; Vault PKI adopts it at Layer 1 (§8).
- Cert rotation: Caddy auto-renews ACME; Vault PKI handles internal rotation later.
- DNS assumptions: `forge.olsitec.de` (+ registry/host) must resolve to the VM before handover. Owner: Cloudflare zone. This is a hard prerequisite — list it in preflight.
- Reverse proxy bootstrap: `Caddyfile` rendered from template by Pulumi; routes web/API/registry on one host; Git-over-SSH exposed directly (port 22/2222), not via the HTTP proxy.
9.5 Forgejo
- First repository / first commit / repo arrival: the foundation repo is pushed from the local clone into Forgejo at Phase 7 handover; origin is switched to Forgejo; this is the self-hosting moment. Automatable: yes (scripted `git remote` + `git push`).
- First CI runner & registration token: token generated via `forgejo actions generate-runner-token` (or admin API) → stored in Vault → consumed by `runner.ts`. Automatable: yes. Deterministic flow.
- When CI owns deployments: only after handover + runner registration + a proven self-`pulumi up`. Until then, manual `pulumi up` (§5.2, §7.4).
9.6 Storage
- Postgres init: container with generated superuser pw; `pg-init.sql` creates Forgejo role+DB. Automatable: yes.
- RustFS init: container with generated admin keys; `credentials.ts` creates service keys + buckets (`forgejo-packages`, `forgejo-artifacts`, `forgejo-lfs`, `foundation-backups`). Automatable: yes.
- Bucket creation: Pulumi (S3 provider against RustFS) — deterministic names.
- Restore order after DR: Vault → Postgres → RustFS data → then Forgejo (§6.2). Git repos (Forgejo data dir) are the irreplaceable core; restore before starting Forgejo.
- Recreatable data: images, indexes, caches (§6.3).
9.7 Backups & Recovery
- First backup: Phase 9, before "operational" is declared.
- Where stored: RustFS `foundation-backups` + offsite replica (§6.4).
- Backup credential protection: in Vault + mirrored to passphrase-encrypted config (R8/§4.5).
- Required to recover everything: repo + passphrase + {forgejo dump, pg_dump, vault snapshot, pulumi state}. Disposable: images, indexes, caches, runner state (§6.3).
9.8 Operations
- Monitoring enabled: minimal at bootstrap, full at Layer 1 (§7.3).
- Alerting trusted: only when it runs offsite (§7.3).
- Upgrades before CI exists: manual `pulumi up` with a pre-snapshot (§7.1).
- Becomes self-hosting / all-changes-through-CI: §7.4 milestone.
9.9 Chronological Day-Zero Timeline
T0 Fresh OS Hetzner VM created (cloud-init: docker, ssh key, firewall, clock sync).
T1 First command operator: git clone olsitec-foundation && ./preflight/preflight.sh
T2 Trust set export PULUMI_CONFIG_PASSPHRASE (via pass); pulumi login (local file backend).
T3 Infra deploy pulumi up → docker network + Postgres + RustFS + Vault(sealed) + Caddy.
T4 Secret init vault operator init → capture keys → write to passphrase-encrypted config → unseal.
T5 Credentials @pulumi/random → Vault; Postgres roles/DBs; RustFS keys+buckets.
T6 Services init Forgejo up (app.ini ← secrets); headless first admin created.
T7 Operational Web/API/registry reachable over TLS; admin password readable from Vault.
T8 Self-hosting push foundation repo → Forgejo; switch origin; create org; register runner.
T9 First CI deploy .forgejo/workflows runs pulumi preview → (approve) → up. CI now owns changes.
T10 Backup backup.sh → RustFS → offsite. (first bundle)
T11 DR validated restore-to-fresh-vm.sh rebuilds on a clean VM from offsite backup; smoke tests pass.
Goal achieved: every step T1–T11 is scripted; the only human actions are running the first clone and preflight (T1), providing the passphrase, and approving the first CI deploy. No undocumented manual step remains.
10. AI Execution Plan
Work is split into low-coupling tasks. Contracts are written first (baseline §9) so tasks parallelize without inventing incompatible interfaces. Each task: reviewable commit, explicit acceptance criteria, conventional-commit subject.
10.0 Contracts (write before implementation tasks)
| Contract | Defines | Consumed by |
|---|---|---|
| CONTRACT_001 — Config schema | typed Pulumi config keys (hostnames, versions, sizes, feature flags) | every component |
| CONTRACT_002 — Vault path layout | `foundation/<service>/<type>-credentials` keys (camelCase, ADR-002 style) | credentials, forgejo, runner, backup |
| CONTRACT_003 — Container network & DNS names | network name, container names, internal ports | network, all services, proxy |
| CONTRACT_004 — Backup artifact format | bundle filenames, layout, restore order | backup, dr, backup-verify |
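To show what a contract buys the parallel tasks, here is a sketch of a single path builder that every component could import instead of concatenating strings. Only the `foundation/<service>/<type>-credentials` shape comes from the table; the `vaultPath` name and the segment validation rule are assumptions.

```typescript
// Assumed rule: lowercase segments, starting with a letter, no slashes or dots.
const SEGMENT_PATTERN = /^[a-z][a-z0-9-]*$/;

// Build a CONTRACT_002-style Vault KV path, rejecting malformed segments
// so that a typo cannot silently create a parallel secret tree.
function vaultPath(service: string, type: string): string {
  for (const segment of [service, type]) {
    if (!SEGMENT_PATTERN.test(segment)) {
      throw new Error(`invalid Vault path segment: ${segment}`);
    }
  }
  return `foundation/${service}/${type}-credentials`;
}

console.log(vaultPath("forgejo", "admin")); // foundation/forgejo/admin-credentials
```

Writers (`credentials.ts`) and readers (`forgejo.ts`, `runner.ts`, backup) agreeing on one function is what lets T06, T08, and T10 proceed without coordinating on string literals.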
10.1 Tasks
| ID | Task | Depends on | Parallel? | Acceptance criteria |
|---|---|---|---|---|
| T00 | Contracts CONTRACT_001–004 + ADR_F001 (layered platform) | — | — | 4 contract docs + ADR committed; reviewed by human. |
| T01 | Repo scaffold + `preflight/` + `VERSIONS` | T00 | yes | `preflight.sh` exits non-zero on any missing/mismatched tool; passes on a prepared host. |
| T02 | Pulumi project skeleton + passphrase backend + `config.ts` (CONTRACT_001) | T00 | yes | `pulumi preview` runs with empty stack; config schema typed; secrets provider = passphrase. |
| T03 | `network.ts` + `postgres.ts` | T02, C003 | yes | Postgres container up via `@pulumi/docker`; role+DB created; healthcheck green. |
| T04 | `rustfs.ts` + bucket provisioning | T02, C002/C003 | yes | RustFS up; 4 buckets created; service key can put/get an object. |
| T05 | `vault.ts` + `lib/vaultInitCapture` (reuse `olsitec-core` pattern) | T02 | yes | Vault inits; keys+root captured into encrypted config; unseal helper unseals after restart. |
| T06 | `credentials.ts` (`@pulumi/random` → Vault, CONTRACT_002) | T05 | no (needs Vault) | All credential keys present in Vault at correct paths; idempotent on re-run. |
| T07 | `proxy.ts` (Caddy) + TLS strategy (DNS-01 + internal-CA fallback) | T02, C003 | yes | HTTPS terminates for `forge.*`; cert from Let's Encrypt (or internal CA in dev). |
| T08 | `forgejo.ts` — `app.ini` render, install-lock, S3+DB+Vault wiring | T03,T04,T06,T07 | no | Forgejo healthy; uses external Postgres + RustFS; web/API reachable via proxy. |
| T09 | Forgejo headless first-admin + org + repo bootstrap | T08 | no | Admin created non-interactively; password in Vault; org exists; no default creds. |
| T10 | `runner.ts` — registration-token flow + `act_runner` | T08,T09 | no | Runner registers via Vault token; a hello-world workflow runs to success. |
| T11 | Self-hosting handover script (push repo, switch origin, mirror infra repos) | T09 | no | Foundation repo present in Forgejo; origin switched; `git push` works over SSH. |
| T12 | `backup/` (`backup.sh` + `restore.sh`, CONTRACT_004) | T08 | yes | Bundle written to RustFS + offsite; `restore.sh` reconstructs into a scratch env. |
| T13 | `dr/` runbook + `restore-to-fresh-vm.sh` | T12 | no | Automated rebuild on a clean VM passes smoke tests (clone, pipeline, registry push, vault read). |
| T14 | `.forgejo/workflows/` (preflight, pulumi preview, pulumi up, backup-verify) | T10,T11 | yes | preview workflow posts plan; up workflow gated on approval; backup-verify restores+asserts. |
| T15 | `index.ts` phase orchestration + Gate A/B + DAY-ZERO checklist | T03–T08 | no | `pulumi up` from empty → operational in one command (modulo passphrase + approval). |
10.2 Parallelization map
- Wave 1 (parallel): T01, T02 (after T00 contracts).
- Wave 2 (parallel): T03, T04, T05, T07 (all depend only on T02 + contracts).
- Wave 3: T06 (needs T05) ∥ start T12 design.
- Wave 4: T08 (integrates T03/04/06/07).
- Wave 5: T09 → T10 → T11 (sequential handover chain) ∥ T12 impl.
- Wave 6: T13, T14, T15.
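The wave map can be cross-checked against the "Depends on" column of §10.1. The sketch below computes the earliest wave each task could start in (longest dependency chain from T00). The task IDs and edges are transcribed from the table with contract references omitted, and the `earliestWaves` helper is hypothetical.

```typescript
type TaskDeps = Record<string, string[]>;

// Earliest wave = 1 + the latest wave among a task's dependencies (0 if none).
function earliestWaves(deps: TaskDeps): Map<string, number> {
  const wave = new Map<string, number>();
  const resolve = (task: string, trail: string[]): number => {
    const known = wave.get(task);
    if (known !== undefined) return known;
    if (trail.includes(task)) {
      throw new Error(`cycle: ${[...trail, task].join(" -> ")}`);
    }
    let level = 0;
    for (const parent of deps[task] ?? []) {
      level = Math.max(level, resolve(parent, [...trail, task]) + 1);
    }
    wave.set(task, level);
    return level;
  };
  for (const task of Object.keys(deps)) resolve(task, []);
  return wave;
}

const taskDeps: TaskDeps = {
  T00: [],
  T01: ["T00"], T02: ["T00"],
  T03: ["T02"], T04: ["T02"], T05: ["T02"], T07: ["T02"],
  T06: ["T05"],
  T08: ["T03", "T04", "T06", "T07"],
  T09: ["T08"], T10: ["T08", "T09"], T11: ["T09"],
  T12: ["T08"], T13: ["T12"],
  T14: ["T10", "T11"],
  T15: ["T03", "T04", "T05", "T06", "T07", "T08"],
};

const waves = earliestWaves(taskDeps);
console.log([...waves.entries()].map(([t, w]) => `${t}=${w}`).join(" "));
```

Under these edges the computed levels for T01 through T08 line up with Waves 1 to 4 above. From T09 onward the map groups tasks more coarsely than the strict dependency levels, which is a scheduling choice and not a conflict.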
10.3 Per-task prompt skeleton (baseline §7.1)
Each agent prompt must carry: Mission · Mode (BUILD or HIGH-RISK/INFRA) · the relevant CONTRACT_00x ·
the component file it owns · Non-goals (don't touch other components, don't edit generated/rendered
secrets, don't run pulumi up against the real VM without approval) · Acceptance criteria (above) ·
Escalation (stop if Vault/state/secret behavior diverges from this plan).
Ratified Decisions (2026-06-30)
These four were decided by the human and are now binding (see ADR_004):
- Layered platform — RATIFIED. Layer 0 = bare Docker on one VM via Pulumi; K8s/ArgoCD demoted to a Layer-1 consumer (§0). The whole plan stands on this.
- Vault unseal — passphrase-gated helper (§4.3 option 1). No external KMS, no SaaS. Reboots require the master passphrase to be made available to the unseal step. Auto-unseal stays off until a Layer-1 trust anchor exists.
- Object storage — RustFS primary (§1.2 R3). RustFS is the Layer-0 S3, matching the existing `rustfs` credential flag. Hard rule: the offsite replica is non-RustFS, so RustFS is never the only copy of a backup.
- Offsite backup — second self-hosted location (§6.4). Different DC/failure domain, no SaaS dependency. Preferred seed: reuse `pulumi/hetzner-cloud` for both the Phase-0 VM and the offsite host.
Remaining minor (reversible defaults — proceeding unless you object)
- Reverse proxy: defaulting to Caddy (auto-TLS, internal-CA fallback). Cheap to swap later.
- Phase-0 VM seed: defaulting to `pulumi/hetzner-cloud` for the foundation VM + the offsite host.
Appendix — Mapping PLAN-001 → this plan
- PLAN-001 "StatefulSet/Helm/ArgoCD" → Layer-0 "container/named-volume/Pulumi resource."
- PLAN-001 data/state model (git on FS, Postgres, S3-for-blobs) → reused unchanged.
- PLAN-001 runner mapping (every job `runs-on: docker`, code_quality `dind`) → reused for task T10 (§10.1).
- PLAN-001 K8s HA topology → §8 future HA path, not bootstrap.