foundation/documentation/planning/PLAN-001-forgejo.md
Andreas Niemann f18676e6b3 chore: scaffold olsitec-foundation mono-repo
Repo topology, baseline overlay, planning docs (PLAN-001/002), ADR-004/005,
and the bootstrap/packages/documentation skeleton. Implementation (T00+) not started.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 17:10:46 +02:00

# Forgejo CI/CD Platform — Kubernetes Infrastructure Plan
> Companion to [CICD-REQUIREMENTS-PROFILE.md](/Users/andiolsi/work/olsitec/gitlab/CICD-REQUIREMENTS-PROFILE.md) and
> [CICD-ALTERNATIVES-RESEARCH.md](/Users/andiolsi/work/olsitec/gitlab/CICD-ALTERNATIVES-RESEARCH.md).
> Target: deploy Forgejo as the GitLab CI replacement on Kubernetes.
---
## Mental model — why the part count is small
Forgejo is **one binary** that is simultaneously: the Git forge, the CI controller
(Forgejo Actions), **and** the bundled package registry (OCI container + Helm + npm + 20 more).
Everything GitLab splits into separate services (registry, package registry, CI coordinator)
is a single `forgejo` Pod here. That means the infra reduces to **three concerns**:
1. **Forgejo server** (forge + CI brain + registry) — stateful
2. **A datastore** (PostgreSQL; optionally Redis/Valkey + object storage)
3. **CI runners** (`act_runner`) — stateless pool, the part you scale
The single genuinely fiddly decision is **how runners execute job containers** (§4).
---
## Data & state architecture
**Forgejo is irreducibly stateful**: its core, the **git repositories**, consists of bare repos on a
POSIX **filesystem**, and that cannot be offloaded to S3 or a database. Even with everything else
externalized, a Forgejo deployment always has a filesystem volume. This is why it is a
**StatefulSet**, and why backups are `forgejo dump` (repos + DB) → object storage.
Conversely, it needs **no external message queue**, and the database can even be **embedded**
so a single pod with one PVC and zero dependencies is a complete deployment.
### Where each kind of state lives
| State | Where it lives | Default | Can offload to… | Needed? |
| ----- | -------------- | ------- | --------------- | ------- |
| **Git repositories** | **Filesystem** (bare repos) | local volume | ❌ nothing — git needs a real FS | **Always** |
| **Relational data** (users, orgs, repo/issue/PR metadata, CI run records, package metadata, perms, webhooks) | Database | **SQLite** (embedded) | PostgreSQL / MySQL | **Always** (embeddable) |
| **Async task queue** (webhooks, push processing, mirror sync, mailer, indexer updates) | Internal queue | **LevelDB on disk** (in-process) | Redis/Valkey | No external MQ |
| **Cache + sessions** | In-process | **memory** | Redis/Valkey | No |
| **Blobs** (LFS, attachments, avatars, **packages/registry**, **Actions artifacts & logs**) | Filesystem | local volume | ✅ **S3-compatible** | — |
| **Search indexes** (issue search; code search off by default) | Filesystem | **bleve on disk** | Meilisearch / Elasticsearch | Optional |
### The S3 boundary
S3 holds **blobs only** — LFS, attachments, packages, Actions artifacts/logs. S3 **cannot** hold:
- the **git repositories** (require a POSIX filesystem — the non-negotiable stateful core),
- the **database**,
- the **config** (`app.ini`, host SSH keys).
There is **no fully-stateless Forgejo**. Even with external Postgres + S3 for every blob, a PVC for
the git repos remains.
### What this means for sizing
- **Minimal / "all baked in":** 1 pod, 1 PVC — Forgejo + embedded SQLite + on-disk queue/cache/blobs/index. Zero external dependencies.
- **Recommended production:** Forgejo pod + PVC for **git repos** (mandatory) + external **Postgres** + **S3** for blobs. Valkey optional; Meilisearch only if code search is wanted.
- **HA (multi-replica):** the step change — requires **all** of: external Postgres, **Redis/Valkey** (queue+cache+session), **S3** for every blob, **RWX shared FS** (NFS/CephFS) for git repos, and an external search index. (Reason the plan stays single-replica.)
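
The recommended production sizing could be expressed as Helm values roughly as follows. This is a sketch, assuming the Forgejo Helm chart's `gitea.config` map (which renders into `app.ini` sections) and its `persistence` block; hostnames, the database name, and the bucket name are hypothetical, and credentials are left out on purpose because they belong in a Secret.

```yaml
# Sketch only: hostnames, names and the bucket are placeholders, not values from this plan.
gitea:
  config:
    database:                 # app.ini [database]
      DB_TYPE: postgres
      HOST: postgres.example.internal:5432   # hypothetical external Postgres
      NAME: forgejo
      USER: forgejo
      # PASSWD is supplied from a Secret, not from values
    storage:                  # app.ini [storage]: default backend for LFS, attachments,
                              # packages, and Actions artifacts/logs
      STORAGE_TYPE: minio     # "minio" is the S3-compatible storage type
      MINIO_ENDPOINT: s3.example.internal:9000   # hypothetical S3 endpoint
      MINIO_BUCKET: forgejo
      MINIO_USE_SSL: true
      # MINIO_ACCESS_KEY_ID / MINIO_SECRET_ACCESS_KEY are supplied from a Secret
persistence:
  enabled: true               # the PVC for git repos stays, whatever else is offloaded
```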
---
## The moving parts
| # | Component | Workload type | Replicas | Storage | Required? | Replaces (GitLab) |
|---|-----------|---------------|----------|---------|-----------|-------------------|
| 1 | **Forgejo server** | **StatefulSet** | 1 | PVC (RWO): repos, LFS, packages, Actions artifacts | **Required** | GitLab app + Container Registry + Package Registry + CI coordinator |
| 2 | **PostgreSQL** | **StatefulSet** | 1 (or external managed) | PVC (RWO) | **Required**¹ | GitLab's Postgres |
| 3 | **act_runner pool** | **Deployment** (+ DinD) | 1-N | ephemeral (+ cache PVC optional) | **Required** | GitLab Runners |
| 4 | **Valkey/Redis** | Deployment/StatefulSet | 1 | optional PVC | Recommended² | GitLab's Redis |
| 5 | **Object storage (S3/MinIO)** | StatefulSet (MinIO) or external | 1+ | PVC / external | Recommended³ | GitLab object storage |
| 6 | **Docker Hub pull-through cache** | Deployment | 1 | small PVC | Recommended⁴ | GitLab Dependency Proxy |
| 7 | **Meilisearch** (code/issue search) | StatefulSet | 1 | PVC | Optional⁵ | GitLab Elasticsearch |
¹ Forgejo *can* run on bundled SQLite (zero extra pods) for a pure PoC, but Postgres is the production choice.
² Without Redis, Forgejo uses its internal queue/cache, which is fine for a single replica; Redis/Valkey only becomes required for multi-replica HA.
³ Without S3, packages/LFS/artifacts live on the Forgejo PVC — simplest, but couples storage to the pod. S3 decouples them and is needed for HA.
⁴ Forgejo does **not** bundle a Docker Hub proxy. A `registry:2` mirror (or Harbor proxy project) replaces `CI_DEPENDENCY_PROXY_*` to dodge Docker Hub rate limits.
⁵ Only if you want fast code search; not needed for CI/CD itself.
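
For the pull-through cache in footnote ⁴, a `registry:2` instance is switched into mirror mode through the `proxy` section of its own `config.yml`. The fragment below is a minimal sketch; the storage path and listen address are the image's usual defaults, and it assumes anonymous pulls (upstream credentials, if used, would go under `proxy` and come from a Secret).

```yaml
# config.yml for a registry:2 pull-through cache (sketch)
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry   # back this with the small PVC from the table
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io   # Docker Hub upstream; cached layers are served locally
```
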
---
## Two sizings
### A. Proof-of-concept / staging — **3 workloads**
```
forgejo (StatefulSet, 1) ── PVC
postgresql (StatefulSet, 1) ── PVC [or SQLite → 2 workloads total]
act_runner (Deployment, 1) + DinD sidecar
```
Everything else (registry, packages, artifacts) is served by the Forgejo pod off its PVC.
This is enough to translate and run your existing pipelines end-to-end.
### B. Recommended small-team production — **~6 workloads**
```
forgejo (StatefulSet, 1) ── PVC (repos/LFS) + S3 for packages/artifacts
postgresql (StatefulSet, 1) ── PVC (or external managed Postgres → -1 in-cluster)
valkey (Deployment, 1) ── cache/queue
act_runner (Deployment, 2-3) + DinD ── the part you scale for throughput
registry:2 pull-through cache (Deployment, 1) ── Docker Hub mirror
minio (StatefulSet, 1) ── packages/artifacts/LFS [omit if using external S3]
```
Add Meilisearch only if you want search. Use an external managed Postgres/S3 and the
in-cluster count drops to **4** (forgejo, valkey, runner, registry-cache).
---
## §4 — The one real decision: runner execution model
`act_runner` itself is trivial (a stateless Deployment). The question is **what runs the job
containers** your pipelines declare (`runs-on:` / per-job images, Kaniko, etc.):
| Backend | How | Pros | Cons |
|---------|-----|------|------|
| **Docker (DinD)** ✅ default | runner pod + privileged `docker:dind` sidecar | Closest to GitLab's container executor; everything "just works"; caching, services, per-job images | **Privileged pod** (security review needed); DinD storage is ephemeral |
| **Host mode** | runner runs steps directly on the node | No privilege escalation for the daemon | No isolation between jobs; not recommended for shared CI |
| **Kubernetes-native** | runner schedules each job as a Pod | No privileged DinD; cloud-native | Less mature than GitLab's k8s executor; more config |
**Recommendation:** start with **DinD** (privileged) to get parity fast, isolate runners onto a
dedicated node pool / namespace with NetworkPolicies, then evaluate the k8s-native backend later.
Your **rootless image builds (Kaniko/Buildah)** run *inside* the job and don't require DinD for the
build itself — but the runner still needs a container backend to launch the job containers.
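
To make the DinD row of the table concrete, the runner-plus-sidecar pod could look roughly like this. It is a sketch under assumptions: the names, the `ci-runners` namespace, and the image tag placeholder are hypothetical, and runner registration and the runner's start command are omitted.

```yaml
# Sketch only: names, namespace and image tag are assumptions; registration is omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: forgejo-runner
  namespace: ci-runners
spec:
  replicas: 1
  selector:
    matchLabels:
      app: forgejo-runner
  template:
    metadata:
      labels:
        app: forgejo-runner
    spec:
      containers:
        - name: runner
          image: code.forgejo.org/forgejo/runner:PINNED_TAG   # placeholder tag, pin a real one
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375   # reach the sidecar over the pod's loopback
        - name: dind
          image: docker:dind
          securityContext:
            privileged: true                # the part that needs a security review
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""                     # disables TLS so the daemon listens on 2375
          volumeMounts:
            - name: docker-graph
              mountPath: /var/lib/docker
      volumes:
        - name: docker-graph
          emptyDir: {}                      # ephemeral: the image cache is lost on pod restart
```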
---
## §4a — Recommended runner topology: privileged VM(s) off-cluster
There is **no mature "clean unprivileged pod-per-job" backend** for Forgejo's `act_runner` yet —
native Kubernetes runners are an open design discussion
([forgejo/discussions #66](https://codeberg.org/forgejo/discussions/issues/66)); the standard
in-cluster path is **DinD (privileged sidecar)**. So you don't avoid privilege by moving execution
*into* k8s — you avoid it by moving execution **out** of k8s.
**Chosen topology: keep Kubernetes for the forge only; run all CI execution as docker-backed
`act_runner`s on dedicated VM(s).**
| Where | Workload | Runner label(s) | Privilege |
| ----- | -------- | --------------- | --------- |
| **Kubernetes** | Forgejo + Postgres (+ Valkey) | — | none — cluster stays clean |
| **Privileged VM(s)** | `act_runner` (docker backend), pooled | `docker`, `dind` | privileged, contained to throwaway VMs |
| *(optional)* **Kubernetes** | `act_runner` (host type) for cheap lint offload | `k8s` | none, but **no per-job image** |
Routing rules: same label on N runners → they **pool** and share the queue (scale by adding VMs).
A job listing multiple labels needs a runner with **all** of them. No auto-balancing across labels.
### Runner labels (`act_runner` config.yaml)
```yaml
# On each privileged VM:
runner:
  labels:
    - "docker:docker://catthehacker/ubuntu:act-22.04" # normal containerized jobs (per-job image honored)
    - "dind:docker://-" # jobs that need a real docker daemon ("-" = job sets its own image)
    # Optional in-cluster, host type (unprivileged, single shared image, no per-job image):
    # - "k8s:host"
```
### Mapping the current pipeline jobs → `runs-on`
Almost every existing job sets a **per-job image**, which requires the **docker** backend — this is
the core reason CI execution belongs on docker-backed runners, not `host`-type pods.
| Current GitLab job | Image used today | `runs-on` | Why |
| ------------------ | ---------------- | --------- | --- |
| `yamllint` | `pipelinecomponents/yamllint` | `docker` | per-job image |
| `eslint` | custom `utils` image | `docker` | per-job image |
| `hadolint` | `pipelinecomponents/hadolint` | `docker` | per-job image |
| `container-build` (Kaniko) | `kaniko:debug` | `docker` | rootless build in its own container |
| `container-scan` (Trivy) | `trivy` image | `docker` | per-job image |
| `container-sbom` (Syft) | `syft` image | `docker` | per-job image |
| `generate-release-version` / `release` | `semantic-release` image | `docker` | per-job image + git push |
| `helm-lint` | `alpine/helm` | `docker` | per-job image |
| `helm-publish` | `semantic-release-helm` image | `docker` | per-job image + `helm push oci://` |
| `npm-publish` / `bun-build` | `node` / `bun` image | `docker` | per-job image |
| `renovate` (scheduled) | renovate-runner image | `docker` | per-job image |
| `code_quality` | `docker:dind` service | **`dind`** | genuinely needs a real Docker daemon |
Net: route everything to **`docker`** except the CodeClimate `code_quality` job (and any future
"needs a real docker daemon" job), which goes to **`dind`**. The optional `k8s` host-type label is
only worth it if you later rewrite a few light jobs to share one runner image.
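
On the workflow side, the routing in the table comes down to the `runs-on` value plus a per-job `container.image`. The sketch below shows one job for each label; the file path, job names, commands, and the `docker:cli` image are illustrative assumptions, and how the `dind` runner exposes its Docker daemon to the job depends on runner configuration not covered here.

```yaml
# .forgejo/workflows/ci.yml (sketch; job names and commands are illustrative)
on: [push]
jobs:
  yamllint:
    runs-on: docker                 # any runner in the "docker" pool may pick this up
    container:
      image: pipelinecomponents/yamllint   # per-job image, as in the table above
    steps:
      # The checkout action is JavaScript and needs Node.js inside the job image;
      # if the image lacks it, clone with git in a run step instead.
      - uses: actions/checkout@v4
      - run: yamllint .
  code-quality:
    runs-on: dind                   # only runners that carry the "dind" label
    container:
      image: docker:cli             # assumption: the job supplies its own image on this label
    steps:
      - run: docker info            # assumes the runner makes its Docker daemon reachable from the job
```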
---
## Non-workload Kubernetes objects (the "rest of the iceberg")
These aren't Pods but are part of the deploy:
- **Services** (forgejo HTTP, forgejo SSH, postgres, valkey, runner, registry-cache)
- **Ingress** — Forgejo web + API + registry (push/pull over HTTPS) on one host; SSH via LoadBalancer/NodePort (Git over SSH)
- **PersistentVolumeClaims** — one per stateful component (see the Storage column in "The moving parts")
- **Secrets** — Forgejo `SECRET_KEY`/`INTERNAL_TOKEN`, DB creds, runner registration token, S3 creds, registry-cache upstream creds
- **ConfigMap** — `app.ini` (Forgejo config) if not fully via env/secret
- **CronJob** — DB + repo backups (`forgejo dump`) → object storage
- **NetworkPolicy** — fence the privileged runner namespace
- **(optional) ServiceMonitor** — Forgejo exposes Prometheus metrics
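
As an illustration of the NetworkPolicy item, a default-deny for inbound traffic in the runner namespace is the usual starting point. The namespace name `ci-runners` is an assumption; egress is left open here because job containers need to pull images and reach Forgejo, and tightening it would need rules specific to the cluster.

```yaml
# Sketch only: the namespace name is hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runners-default-deny-ingress
  namespace: ci-runners
spec:
  podSelector: {}        # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all inbound traffic is denied
```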
---
## High availability note
Single-replica Forgejo is the right call for a small team (Git + CI + registry on one pod is
fine at your scale; downtime = a pod restart). **True HA (multi-replica Forgejo) is a step
change** — it requires *all* of: external Postgres, external Redis/Valkey, S3 for all blob
storage, **RWX** shared volume for repos, and an external search index. Don't start there; it
roughly doubles the moving parts for marginal benefit at small-team scale.
---
## Deployment mechanism (fits your existing stack)
You already run **ArgoCD + Helm** (you publish Helm charts and have `argocd/projects/...`).
Deploy Forgejo the same way:
- **Forgejo** → official `code.forgejo.org/forgejo-helm/forgejo` chart, wrapped as an ArgoCD
`Application`. The chart can bundle Postgres/Redis subcharts (toggle `postgresql.enabled`,
`redis-cluster.enabled`) — disable the HA subcharts for the small-team sizing.
- **Runners** → the `act_runner` / forgejo-runner Helm chart as a second ArgoCD Application
(separate so you scale/upgrade runners independently of the forge).
- **Registry cache + MinIO** → their respective community charts, or your own.
So in ArgoCD terms: **2 core Applications** (forgejo, runners) + **1-3 supporting**
(registry-cache, minio, valkey if not via subchart).
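
The Forgejo Application could be declared roughly as below. This is a sketch, not a tested manifest: it assumes the chart registry is registered in Argo CD as an OCI Helm repository, an Argo CD version that supports `helm.valuesObject`, and hypothetical project and namespace names; `CHART_VERSION` is a placeholder to be pinned. The subchart toggles are the two named in the text above.

```yaml
# Sketch only: project, namespaces and chart version are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: forgejo
  namespace: argocd
spec:
  project: default
  source:
    repoURL: code.forgejo.org/forgejo-helm   # OCI Helm registry, written without the oci:// prefix
    chart: forgejo
    targetRevision: CHART_VERSION            # placeholder, pin a released chart version
    helm:
      valuesObject:
        postgresql:
          enabled: true                      # single-instance Postgres subchart
        redis-cluster:
          enabled: false                     # HA subchart off for the small-team sizing
  destination:
    server: https://kubernetes.default.svc
    namespace: forgejo
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```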
---
## Summary — "how many moving parts?"
- **Conceptually: 3** — Forgejo (forge+CI+registry), a database, runners.
- **PoC on k8s: 3 workloads** (forgejo + postgres + 1 runner).
- **Recommended small-team production: ~6 workloads** (forgejo, postgres, valkey, runner pool,
Docker Hub cache, object storage) — drops to **~4 in-cluster** if Postgres and S3 are external/managed.
- **The only non-trivial choice** is the runner execution backend (DinD vs k8s-native).
- Everything GitLab runs as separate registry/package services is **folded into the one Forgejo pod**.