foundation/documentation/planning/PLAN-001-forgejo.md

# Forgejo CI/CD Platform — Kubernetes Infrastructure Plan

> Companion to [CICD-REQUIREMENTS-PROFILE.md](/Users/andiolsi/work/olsitec/gitlab/CICD-REQUIREMENTS-PROFILE.md) and
> [CICD-ALTERNATIVES-RESEARCH.md](/Users/andiolsi/work/olsitec/gitlab/CICD-ALTERNATIVES-RESEARCH.md).
> Target: deploy Forgejo as the GitLab CI replacement on Kubernetes.

---

## Mental model — why the part count is small

Forgejo is **one binary** that is simultaneously the Git forge, the CI controller
(Forgejo Actions), **and** the bundled package registry (OCI container + Helm + npm + 20 more).
Everything GitLab splits into separate services (registry, package registry, CI coordinator)
is a single `forgejo` Pod here. That means the infra reduces to **three concerns**:

1. **Forgejo server** (forge + CI brain + registry) — stateful
2. **A datastore** (PostgreSQL; optionally Redis/Valkey + object storage)
3. **CI runners** (`act_runner`) — stateless pool, the part you scale

The single genuinely fiddly decision is **how runners execute job containers** (§4).

---

## Data & state architecture

**Forgejo is irreducibly stateful**: its core, the **git repositories**, is a set of bare repos on a
POSIX **filesystem**, and that cannot be offloaded to S3 or a database. Even with everything else
externalized, a Forgejo deployment always has a filesystem volume. This is why it is a
**StatefulSet**, and why backups are `forgejo dump` (repos + DB) → object storage.

Conversely, it needs **no external message queue**, and the database can even be **embedded** —
so a single pod with one PVC and zero dependencies is a complete deployment.

### Where each kind of state lives

| State | Where it lives | Default | Can offload to… | Needed? |
| ----- | -------------- | ------- | --------------- | ------- |
| **Git repositories** | **Filesystem** (bare repos) | local volume | ❌ nothing — git needs a real FS | **Always** |
| **Relational data** (users, orgs, repo/issue/PR metadata, CI run records, package metadata, perms, webhooks) | Database | **SQLite** (embedded) | PostgreSQL / MySQL | **Always** (embeddable) |
| **Async task queue** (webhooks, push processing, mirror sync, mailer, indexer updates) | Internal queue | **LevelDB on disk** (in-process) | Redis/Valkey | No external MQ |
| **Cache + sessions** | In-process | **memory** | Redis/Valkey | No |
| **Blobs** (LFS, attachments, avatars, **packages/registry**, **Actions artifacts & logs**) | Filesystem | local volume | ✅ **S3-compatible** | — |
| **Search indexes** (issue search; code search off by default) | Filesystem | **bleve on disk** | Meilisearch / Elasticsearch | Optional |

### The S3 boundary

S3 holds **blobs only** — LFS, attachments, packages, Actions artifacts/logs. S3 **cannot** hold:

- the **git repositories** (require a POSIX filesystem — the non-negotiable stateful core),
- the **database**,
- the **config** (`app.ini`, host SSH keys).

There is **no fully-stateless Forgejo**. Even with external Postgres + S3 for every blob, a PVC for
the git repos remains.
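
As a rough sketch of that boundary, the Helm values below move the database and all blob storage out of the pod while the git repos stay on the PVC. It assumes the Forgejo chart renders `gitea.config.<section>.<KEY>` into `app.ini`; every hostname, bucket name, and credential is a hypothetical placeholder, not a value from this plan.

```yaml
# values.yaml (sketch): externalize DB and blobs; git repos stay on the PVC.
# Assumption: the chart maps gitea.config.<section>.<KEY> onto app.ini.
# All hosts, names, and credentials below are hypothetical placeholders.
gitea:
  config:
    database:
      DB_TYPE: postgres
      HOST: postgres.example.internal:5432
      NAME: forgejo
      USER: forgejo
      PASSWD: change-me              # in practice, inject from a Secret
    storage:
      STORAGE_TYPE: minio            # "minio" is the S3-compatible storage type
      MINIO_ENDPOINT: s3.example.internal:9000
      MINIO_BUCKET: forgejo
      MINIO_ACCESS_KEY_ID: change-me
      MINIO_SECRET_ACCESS_KEY: change-me
      MINIO_USE_SSL: true
persistence:
  enabled: true                      # the PVC that still holds the git repositories
```

Setting the global `[storage]` section should route LFS, attachments, packages, and Actions artifacts/logs to the bucket together; per-subsystem overrides are possible but not shown here.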

### What this means for sizing

- **Minimal / "all baked in":** 1 pod, 1 PVC — Forgejo + embedded SQLite + on-disk queue/cache/blobs/index. Zero external dependencies.
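- **Recommended production:** Forgejo pod + PVC for **git repos** (mandatory) + external **Postgres** + **S3** for blobs. Valkey optional; an external search index only if faster search is wanted (Meilisearch covers issue search only; the code indexer supports bleve or Elasticsearch).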
- **HA (multi-replica):** a step change that requires **all** of: external Postgres, **Redis/Valkey** (queue+cache+session), **S3** for every blob, **RWX shared FS** (NFS/CephFS) for git repos, and an external search index. (This is why the plan stays single-replica.)
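
For the Valkey part of these sizings, a minimal sketch of pointing queue, cache, and sessions at one external instance could look like the following. The `TYPE`/`CONN_STR`, `ADAPTER`/`HOST`, and `PROVIDER`/`PROVIDER_CONFIG` keys are the standard `app.ini` settings; the hostname and the database numbers are hypothetical, and the `gitea.config` mapping is assumed to be how the chart exposes `app.ini`.

```yaml
# values.yaml (sketch): move queue, cache, and sessions from in-process defaults to Valkey.
# Valkey speaks the Redis protocol, so the "redis" adapters are used as-is.
# The hostname and DB indexes (/0, /1, /2) are hypothetical placeholders.
gitea:
  config:
    queue:
      TYPE: redis
      CONN_STR: redis://valkey.example.internal:6379/0
    cache:
      ADAPTER: redis
      HOST: redis://valkey.example.internal:6379/1
    session:
      PROVIDER: redis
      PROVIDER_CONFIG: redis://valkey.example.internal:6379/2
```

Leaving these sections out keeps the single-replica defaults (on-disk LevelDB queue, in-memory cache and sessions).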

---

## The moving parts

| # | Component | Workload type | Replicas | Storage | Required? | Replaces (GitLab) |
|---|-----------|---------------|----------|---------|-----------|-------------------|
| 1 | **Forgejo server** | **StatefulSet** | 1 | PVC (RWO): repos, LFS, packages, Actions artifacts | **Required** | GitLab app + Container Registry + Package Registry + CI coordinator |
| 2 | **PostgreSQL** | **StatefulSet** | 1 (or external managed) | PVC (RWO) | **Required**¹ | GitLab's Postgres |
| 3 | **act_runner pool** | **Deployment** (+ DinD) | 1–N | ephemeral (+ cache PVC optional) | **Required** | GitLab Runners |
| 4 | **Valkey/Redis** | Deployment/StatefulSet | 1 | optional PVC | Recommended² | GitLab's Redis |
| 5 | **Object storage (S3/MinIO)** | StatefulSet (MinIO) or external | 1+ | PVC / external | Recommended³ | GitLab object storage |
| 6 | **Docker Hub pull-through cache** | Deployment | 1 | small PVC | Recommended⁴ | GitLab Dependency Proxy |
| 7 | **Meilisearch** (issue search) | StatefulSet | 1 | PVC | Optional⁵ | GitLab Elasticsearch |

¹ Forgejo *can* run on bundled SQLite (zero extra pods) for a pure PoC, but Postgres is the production choice.
² Without Redis, Forgejo uses an internal queue/cache — fine for a single replica; required for multi-replica HA.
³ Without S3, packages/LFS/artifacts live on the Forgejo PVC — simplest, but couples storage to the pod. S3 decouples them and is needed for HA.
⁴ Forgejo does **not** bundle a Docker Hub proxy. A `registry:2` mirror (or Harbor proxy project) replaces `CI_DEPENDENCY_PROXY_*` to dodge Docker Hub rate limits.
⁵ Only if you want faster issue search (Forgejo's code indexer supports bleve or Elasticsearch, not Meilisearch); not needed for CI/CD itself.
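
For footnote ⁴, a pull-through cache is mostly a matter of registry configuration. The sketch below is a minimal `config.yml` for the `registry:2` image; the storage path and listen port are the image's usual defaults, and the commented-out credentials are placeholders for an optional Docker Hub account.

```yaml
# config.yml for registry:2 running as a Docker Hub pull-through cache (sketch).
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry   # back this path with the small PVC
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io
  # Optional upstream credentials (placeholders) to raise Docker Hub rate limits:
  # username: change-me
  # password: change-me
```

The runners' Docker daemon then needs this cache listed under `registry-mirrors` so that image pulls go through it.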

---

## Two sizings

### A. Proof-of-concept / staging — **3 workloads**
```
forgejo (StatefulSet, 1)  ── PVC
postgresql (StatefulSet, 1) ── PVC          [or SQLite → 2 workloads total]
act_runner (Deployment, 1) + DinD sidecar
```
Everything else (registry, packages, artifacts) is served by the Forgejo pod off its PVC.
This is enough to translate and run your existing pipelines end-to-end.

### B. Recommended small-team production — **~6 workloads**
```
forgejo (StatefulSet, 1)        ── PVC (repos) + S3 for LFS/packages/artifacts
postgresql (StatefulSet, 1)     ── PVC   (or external managed Postgres → -1 in-cluster)
valkey (Deployment, 1)          ── cache/queue
act_runner (Deployment, 2–3)    + DinD   ── the part you scale for throughput
registry:2 pull-through cache (Deployment, 1) ── Docker Hub mirror
minio (StatefulSet, 1)          ── packages/artifacts/LFS   [omit if using external S3]
```
Add Meilisearch only if you want search. Use an external managed Postgres/S3 and the
in-cluster count drops to **4** (forgejo, valkey, runner, registry-cache).

---

## §4 — The one real decision: runner execution model

`act_runner` itself is trivial (a stateless Deployment). The question is **what runs the job
containers** your pipelines declare (`runs-on:` / per-job images, Kaniko, etc.):

| Backend | How | Pros | Cons |
|---------|-----|------|------|
| **Docker (DinD)** ✅ default | runner pod + privileged `docker:dind` sidecar | Closest to GitLab's container executor; everything "just works"; caching, services, per-job images | **Privileged pod** (security review needed); DinD storage is ephemeral |
| **Host mode** | runner runs steps directly on the node | No privilege escalation for the daemon | No isolation between jobs; not recommended for shared CI |
| **Kubernetes-native** | runner schedules each job as a Pod | No privileged DinD; cloud-native | Less mature than GitLab's k8s executor; more config |

**Recommendation (if runners stay in-cluster):** start with **DinD** (privileged) to get parity fast, isolate runners onto a
dedicated node pool / namespace with NetworkPolicies, then evaluate the k8s-native backend later.
Your **rootless image builds (Kaniko/Buildah)** run *inside* the job and don't require DinD for the
build itself — but the runner still needs a container backend to launch the job containers.
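
To make the DinD row concrete, here is a minimal sketch of a runner pod with a privileged `docker:dind` sidecar. The namespace, resource names, and image tag are hypothetical placeholders, and runner registration (the `.runner` file) is assumed to be provisioned separately, for example by an init container.

```yaml
# Sketch only: runner + privileged DinD sidecar sharing the pod's network namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: forgejo-runner
  namespace: ci-runners                  # hypothetical dedicated namespace
spec:
  replicas: 1
  selector:
    matchLabels: {app: forgejo-runner}
  template:
    metadata:
      labels: {app: forgejo-runner}
    spec:
      containers:
        - name: runner
          image: code.forgejo.org/forgejo/runner:PINNED_TAG   # placeholder tag; pin a real one
          command: ["forgejo-runner", "daemon"]
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375   # the sidecar's daemon, reachable inside the pod
          volumeMounts:
            - {name: runner-data, mountPath: /data}   # assumed working dir holding .runner
        - name: dind
          image: docker:dind
          command: ["dockerd", "-H", "tcp://0.0.0.0:2375", "--tls=false"]
          securityContext:
            privileged: true                # the part that needs a security review
      volumes:
        - name: runner-data
          emptyDir: {}                      # ephemeral: image layers and caches vanish on restart
```

The unauthenticated TCP socket is only acceptable because it is confined to the pod; anything else that can reach port 2375 gets a root-equivalent Docker daemon.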

---

## §4a — Recommended runner topology: privileged VM(s) off-cluster

There is **no mature "clean unprivileged pod-per-job" backend** for Forgejo's `act_runner` yet —
native Kubernetes runners are an open design discussion
([forgejo/discussions #66](https://codeberg.org/forgejo/discussions/issues/66)); the standard
in-cluster path is **DinD (privileged sidecar)**. So you don't avoid privilege by moving execution
*into* k8s — you avoid it by moving execution **out** of k8s.

**Chosen topology: keep Kubernetes for the forge only; run all CI execution as docker-backed
`act_runner`s on dedicated VM(s).**

| Where | Workload | Runner label(s) | Privilege |
| ----- | -------- | --------------- | --------- |
| **Kubernetes** | Forgejo + Postgres (+ Valkey) | — | none — cluster stays clean |
| **Privileged VM(s)** | `act_runner` (docker backend), pooled | `docker`, `dind` | privileged, contained to throwaway VMs |
| *(optional)* **Kubernetes** | `act_runner` (host type) for cheap lint offload | `k8s` | none, but **no per-job image** |

Routing rules: runners that advertise the same label **pool** and share one queue, so you scale by adding VMs.
A job that lists multiple labels needs a runner carrying **all** of them. There is no auto-balancing across labels.

### Runner labels (`act_runner` config.yaml)

```yaml
# On each privileged VM:
runner:
  labels:
    - "docker:docker://catthehacker/ubuntu:act-22.04"  # normal containerized jobs (per-job image honored)
    - "dind:docker://-"                                 # jobs that need a real docker daemon ("-" = job sets its own image)
# Optional in-cluster, host type (unprivileged, single shared image, no per-job image):
#   - "k8s:host"
```

### Mapping the current pipeline jobs → `runs-on`

Almost every existing job sets a **per-job image**, which requires the **docker** backend — this is
the core reason CI execution belongs on docker-backed runners, not `host`-type pods.

| Current GitLab job | Image used today | `runs-on` | Why |
| ------------------ | ---------------- | --------- | --- |
| `yamllint` | `pipelinecomponents/yamllint` | `docker` | per-job image |
| `eslint` | custom `utils` image | `docker` | per-job image |
| `hadolint` | `pipelinecomponents/hadolint` | `docker` | per-job image |
| `container-build` (Kaniko) | `kaniko:debug` | `docker` | rootless build in its own container |
| `container-scan` (Trivy) | `trivy` image | `docker` | per-job image |
| `container-sbom` (Syft) | `syft` image | `docker` | per-job image |
| `generate-release-version` / `release` | `semantic-release` image | `docker` | per-job image + git push |
| `helm-lint` | `alpine/helm` | `docker` | per-job image |
| `helm-publish` | `semantic-release-helm` image | `docker` | per-job image + `helm push oci://` |
| `npm-publish` / `bun-build` | `node` / `bun` image | `docker` | per-job image |
| `renovate` (scheduled) | renovate-runner image | `docker` | per-job image |
| `code_quality` | `docker:dind` service | **`dind`** | genuinely needs a real Docker daemon |

Net: route everything to **`docker`** except the CodeClimate `code_quality` job (and any future
"needs a real docker daemon" job), which goes to **`dind`**. The optional `k8s` host-type label is
only worth it if you later rewrite a few light jobs to share one runner image.
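
A translated workflow for two of the rows above might look like this sketch. The file path, job names, the `docker:cli` image, and the `docker info` check are illustrative assumptions; whether the `dind` job can actually reach a daemon depends on how the runner on the VM is configured.

```yaml
# .forgejo/workflows/ci.yaml (sketch): routing jobs by runner label.
on: [push]
jobs:
  yamllint:
    runs-on: docker                        # pooled docker-backed runners
    container:
      image: pipelinecomponents/yamllint   # per-job image, as in the GitLab job
    steps:
      # Assumption: the job image ships Node.js and git, which the checkout action needs.
      - uses: actions/checkout@v4
      - run: yamllint .
        shell: sh                          # small images may not include bash
  code-quality:
    runs-on: dind                          # only runners carrying the "dind" label
    container:
      image: docker:cli                    # hypothetical image choice
    steps:
      - run: docker info                   # sanity check that a Docker daemon is reachable
        shell: sh
```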

---

## Non-workload Kubernetes objects (the "rest of the iceberg")

These aren't Pods but are part of the deploy:

- **Services** (forgejo HTTP, forgejo SSH, postgres, valkey, registry-cache; runners poll Forgejo outbound and need none)
- **Ingress** for Forgejo web + API + registry (HTTPS, including registry push) on one host; Git over SSH goes through a LoadBalancer/NodePort Service instead
- **PersistentVolumeClaims**, one per stateful component (see the Storage column in "The moving parts" table)
- **Secrets** — Forgejo `SECRET_KEY`/`INTERNAL_TOKEN`, DB creds, runner registration token, S3 creds, registry-cache upstream creds
- **ConfigMap** — `app.ini` (Forgejo config) if not fully via env/secret
- **CronJob** — DB + repo backups (`forgejo dump`) → object storage
- **NetworkPolicy** — fence the privileged runner namespace
- **(optional) ServiceMonitor** — Forgejo exposes Prometheus metrics
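
As one example from this list, the runner fence can start as a default-deny ingress policy, since runners only poll Forgejo and never need inbound connections. The namespace name is a hypothetical placeholder, and egress is deliberately left open in this sketch.

```yaml
# Sketch: deny all inbound traffic to pods in the runner namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runners-deny-ingress
  namespace: ci-runners          # hypothetical namespace name
spec:
  podSelector: {}                # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress                    # no ingress rules listed, so all inbound traffic is denied
```

Tightening egress (for example, allowing only Forgejo, the registry cache, and DNS) would be a follow-up once the jobs' real network needs are known.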

---

## High availability note

Single-replica Forgejo is the right call for a small team (Git + CI + registry on one pod is
fine at your scale; downtime = a pod restart). **True HA (multi-replica Forgejo) is a step
change** — it requires *all* of: external Postgres, external Redis/Valkey, S3 for all blob
storage, **RWX** shared volume for repos, and an external search index. Don't start there; it
roughly doubles the moving parts for marginal benefit at small-team scale.

---

## Deployment mechanism (fits your existing stack)

You already run **ArgoCD + Helm** (you publish Helm charts and have `argocd/projects/...`).
Deploy Forgejo the same way:

- **Forgejo** → official `code.forgejo.org/forgejo-helm/forgejo` chart, wrapped as an ArgoCD
  `Application`. The chart can bundle Postgres/Redis subcharts (toggle `postgresql.enabled`,
  `redis-cluster.enabled`) — disable the HA subcharts for the small-team sizing.
- **Runners** → the `act_runner` / forgejo-runner Helm chart as a second ArgoCD Application
  (separate so you scale/upgrade runners independently of the forge).
- **Registry cache + MinIO** → their respective community charts, or your own.

So in ArgoCD terms: **2 core Applications** (forgejo, runners) + **1–3 supporting**
(registry-cache, minio, valkey if not via subchart).
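
A minimal sketch of the first core Application follows. The project, namespaces, and chart version are placeholders; the `repoURL` form assumes the chart is consumed as an OCI Helm chart and that the repository is registered with OCI support in ArgoCD. Only the two subchart toggles named above are shown, so check the chart's `values.yaml` for any others.

```yaml
# Sketch: Forgejo chart wrapped as an ArgoCD Application.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: forgejo
  namespace: argocd
spec:
  project: default                           # hypothetical; use your existing project
  source:
    repoURL: code.forgejo.org/forgejo-helm   # assumes an OCI-enabled Helm repository entry
    chart: forgejo
    targetRevision: CHART_VERSION            # placeholder; pin a real chart version
    helm:
      values: |
        postgresql:
          enabled: true                      # single-instance Postgres subchart
        redis-cluster:
          enabled: false                     # HA subchart off for small-team sizing
  destination:
    server: https://kubernetes.default.svc
    namespace: forgejo                       # hypothetical target namespace
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```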

---

## Summary — "how many moving parts?"

- **Conceptually: 3** — Forgejo (forge+CI+registry), a database, runners.
- **PoC on k8s: 3 workloads** (forgejo + postgres + 1 runner).
- **Recommended small-team production: ~6 workloads** (forgejo, postgres, valkey, runner pool,
  Docker Hub cache, object storage) — drops to **~4 in-cluster** if Postgres and S3 are external/managed.
- **The only non-trivial choice** is the runner execution backend (DinD vs k8s-native).
- Everything GitLab runs as separate registry/package services is **folded into the one Forgejo pod**.