Progressive delivery (Flagger + Gateway API)
June 21, 2026
Flagger is the platform's standard progressive-delivery controller. Instead of
a plain RollingUpdate — where a bad release reaches 100% of traffic before
anyone notices — an onboarded app is rolled out as a canary: Flagger shifts a
small, increasing slice of traffic to the new version (or, for blue/green,
validates it out-of-band), checks request success-rate and latency at each step,
and automatically rolls back if the new version misbehaves.
This follows the upstream Flagger Gateway API tutorial and KEDA ScaledObject tutorial, adapted to this platform's Cilium Gateway API + Coroot stack.
How it works
Flagger watches a Canary, clones the target Deployment to <name>-primary,
creates <name>-primary / <name>-canary Services, and drives an analysis loop.
Promotion vs rollback is gated on:
- SLO metrics: there is no Istio/Envoy telemetry and no app instrumentation here, so the MetricTemplates query Coroot's bundled Prometheus (the same endpoint OpenCost uses, coroot-prometheus.observability.svc:9090). Coroot's eBPF node-agent exports the server-side container_http_inbound_requests_total{status} counter plus the container_http_inbound_requests_duration_seconds_total histogram, both per container. The latter is a standard Prometheus histogram, so its queryable bucket series is ..._total_bucket{le}. The templates measure the canary pods; see "Measuring the canary" below.
- Webhooks: flagger-loadtester runs an acceptance (smoke) test before traffic shifts and generates load during analysis (so Coroot has requests to measure). The webhooks hit the <name>-canary Service directly.
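As a rough sketch of the two webhook roles described above, the analysis.webhooks part of a Canary might look like the fragment below. The app name and namespace (myapp), port 8080, the /healthz path, the load-tester URL and the hey arguments are all illustrative assumptions, not values taken from this repository.

```yaml
# Hypothetical fragment of a Canary spec; every name, port and path is a placeholder.
analysis:
  webhooks:
    # Acceptance (smoke) test: must succeed before any traffic is shifted.
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.flagger-system/   # assumed load-tester address
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -sf http://myapp-canary.myapp:8080/healthz"
    # Load generation: runs on each analysis step so Coroot has requests to measure.
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.flagger-system/   # assumed load-tester address
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary.myapp:8080/"
```

Both commands target the <name>-canary Service, so they exercise the new version regardless of how live traffic is currently split.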
Two delivery modes
| Mode | Flagger provider | When | Traffic |
|---|---|---|---|
| Weighted canary | gatewayapi:v1 | App is routed directly by the Gateway | Flagger owns the HTTPRoute and shifts backendRef weights 10% → 50% |
| Blue/green | kubernetes | App is behind oauth2-proxy (no gateway-level split possible) | No live split; canary validated via the load-tester, then the apex Service is repointed |
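The difference between the two modes shows up mainly in spec.provider and in how the analysis is stepped. The sketch below is illustrative only: the names, hosts, gateway reference, port, intervals and thresholds are assumptions, stepWeight/maxWeight is just one possible reading of "10% → 50%", and metrics and webhooks are omitted for brevity.

```yaml
# Weighted canary (hypothetical values): Flagger owns the HTTPRoute and its weights.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: myapp
spec:
  provider: gatewayapi:v1
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
    hosts:
      - myapp.example.com
    gatewayRefs:
      - name: main-gateway          # assumed Gateway name
        namespace: gateway-system   # assumed Gateway namespace
  analysis:
    interval: 1m      # time between checks
    threshold: 3      # failed checks tolerated before rollback
    stepWeight: 10    # start at 10% and add 10% per successful step
    maxWeight: 50     # promote after passing at 50%
---
# Blue/green (hypothetical values): no live split, just repeated checks.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: dashboard
  namespace: dashboard
spec:
  provider: kubernetes
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dashboard
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 3
    iterations: 5     # number of passing checks required before promotion
```

With the kubernetes provider there are no weights to shift, so the analysis uses iterations instead of stepWeight and maxWeight.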
What's deployed
| Component | Layer | Path |
|---|---|---|
| flagger controller + flagger-loadtester | infra-controllers | controllers/flagger/ |
| coroot-request-success-rate / coroot-request-duration MetricTemplates | infrastructure | infrastructure/flagger/ |
| umami Canary (weighted) | apps | apps/umami/canary.yaml |
| homepage Canary (blue/green) | apps | apps/homepage/canary.yaml |
| opencost Canary (blue/green) | infrastructure | infrastructure/flagger/canary-opencost.yaml |
CRD-vs-CR layering. The flagger HelmRelease ships the Canary/MetricTemplate CRDs (infra-controllers). A CR of those kinds in the same Flux Kustomization fails the server-side dry-run (no matches for kind) and deadlocks the set. So app Canaries live in the apps layer, and the opencost Canary + the MetricTemplates live in the infrastructure layer (both layers depend on, and wait for, infra-controllers). This is the same split as infrastructure/coroot/coroot.yaml.
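The ordering between layers can be expressed with Flux's dependsOn. The following is a minimal sketch rather than this repository's actual Flux configuration: the Kustomization names mirror the layer names used above, while the paths, intervals and sourceRef are assumptions.

```yaml
# Layer that installs the flagger HelmRelease (and therefore the CRDs).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers
  namespace: flux-system
spec:
  interval: 10m
  path: ./controllers          # assumed path
  prune: true
  wait: true                   # report Ready only once the applied resources are healthy
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumed source name
---
# Layer that holds CRs of those kinds (MetricTemplates, the opencost Canary).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  dependsOn:
    - name: infra-controllers  # not applied until the CRD layer is Ready
  interval: 10m
  path: ./infrastructure       # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumed source name
```

Because the CRs are dry-run only after infra-controllers is Ready, the "no matches for kind" failure cannot occur within a single apply.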
Onboarded apps & status
- umami: weighted Gateway API canary. ⚠️ Prod-only (excluded from the docker overlay; Flagger and canaries are opt-in locally, and CI validates manifests statically rather than running them) and stateful (one shared CloudNativePG DB; do not land a schema-changing upgrade as a canary). Its old route's HSTS header is re-added via service.headers; the gethomepage.dev/* tile annotations are not reproducible on a Flagger route.
- homepage: blue/green (it's the root dashboard behind oauth2-proxy). High blast radius; replicas are owned by a KEDA ScaledObject via autoscalerRef (primary scales 2-3 on Coroot request rate; see "KEDA apps" below).
- opencost: blue/green infra workload. ⚠️ Headlamp's cost plugin uses the opencost:http-ui named port via the apiserver proxy; portDiscovery may not preserve that name, so watch the Headlamp cost panel after rollout.
Excluded (and why)
| Workload | Reason |
|---|---|
| whoami, headlamp, actual-budget, fleetdm (parked), hubble-ui | Single-replica by design (always-on; auto-vpa right-sizes them). headlamp = single-pod in-memory OIDC; actual-budget = single-writer file DB; hubble-ui/whoami = single-replica UIs. None can run concurrent canary pods. fleetdm has been disabled since 2026-06-03. |
| openbao | StatefulSet — Flagger only manages Deployments / DaemonSets. |
| coroot UI, hubble-ui | Operator-reconciled Deployments — the operator fights Flagger for ownership. |
| dex, oauth2-proxy, flux-operator | Critical SSO / GitOps — too risky to canary. |
Onboarding a new app
- Pick the mode (table above). Stateless, directly-routed apps → weighted; oauth2-proxy-fronted apps → blue/green.
- Free the route (weighted only): delete the app's httproute.yaml; Flagger generates the route from the Canary's gatewayRefs + hosts. Re-add any response-header filters via spec.service.headers. Blue/green keeps the existing route untouched.
- Let Flagger (or its KEDA autoscalerRef) own replicas: omit the chart's replicaCount value and do not pin /spec/replicas in a postRenderer, otherwise Flux re-applies the chart's replica count and fights Flagger's scale-to-zero of the canary (flapping). See any onboarded app's helm-release.yaml.
- Add the Canary: copy umami's (weighted) or homepage's (blue/green), referencing the two MetricTemplates and a loadtester acceptance + load-test webhook on <app>-canary.
- Open the netpols. App namespace: ingress from flagger-system on the app port. flagger-system: load-tester egress to the app namespace+port (in controllers/flagger/networkpolicy.yaml).
- Place the Canary in the apps layer (app) or the infrastructure layer (infra component), never in infrastructure/controllers.
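The netpol step above can be sketched with plain Kubernetes NetworkPolicy objects. This is an illustration under assumptions, not a copy of controllers/flagger/networkpolicy.yaml: the app namespace and port (myapp, 8080), the policy names and the load-tester pod label are hypothetical, and the platform may well use CiliumNetworkPolicy instead.

```yaml
# App namespace: allow traffic from flagger-system in on the app port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-flagger-ingress        # hypothetical name
  namespace: myapp
spec:
  podSelector: {}                    # all pods in the app namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: flagger-system
      ports:
        - protocol: TCP
          port: 8080                 # assumed app port
---
# flagger-system: allow the load-tester out to the app namespace and port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loadtester-egress-myapp   # hypothetical name
  namespace: flagger-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loadtester   # assumed label; check the chart's pod labels
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: myapp
      ports:
        - protocol: TCP
          port: 8080                 # assumed app port
```

An egress policy isolates the selected pods for egress, so DNS and any other destinations the load-tester needs must be allowed by this or another policy.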
KEDA apps (used by homepage + umami)
Flagger's KEDA integration uses the core keda.sh ScaledObject.
Homepage and Umami use it: each app's scaled-object.yaml scales 2-3 replicas
on Coroot's inbound request-rate series — a metric auto-vpa does not
control, so vertical (VPA, up to maxAllowed) and horizontal scaling never
fight. The Canary references the ScaledObject and Flagger manages the
-primary scaler:

    spec:
      autoscalerRef:
        apiVersion: keda.sh/v1alpha1
        kind: ScaledObject
        name: myapp-so            # Flagger clones it to myapp-so-primary and pauses
        primaryScalerQueries:     # the source's scaler at 0 between rollouts.
          requests: sum(rate(container_http_inbound_requests_total{...primary...}[1m]))

The trigger must be named (Flagger matches primaryScalerQueries by
trigger name), and the source query must exclude -primary- pods — reuse the
vowel-free hash regex described under "Measuring the canary".
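For orientation, a source ScaledObject with a named trigger might look roughly like the sketch below. The names myapp and myapp-so, the threshold, and especially the container_id layout (/k8s/<namespace>/<pod>/<container>) are assumptions that should be checked against live Coroot data; the trigger name requests is chosen to match the key under primaryScalerQueries.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-so
  namespace: myapp
spec:
  scaleTargetRef:
    name: myapp                 # the original Deployment (Flagger's canary target)
  minReplicaCount: 2
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      name: requests            # Flagger matches primaryScalerQueries by this name
      metadata:
        serverAddress: http://coroot-prometheus.observability.svc:9090
        threshold: "20"         # assumed target request rate per replica
        # Source query: the vowel-free character class cannot match "primary",
        # so <target>-primary-* pods are excluded. The container_id layout is assumed.
        query: |
          sum(rate(container_http_inbound_requests_total{container_id=~"/k8s/myapp/myapp-[bcdfghjklmnpqrstvwxz2-9]+-[bcdfghjklmnpqrstvwxz2-9]+/.*"}[1m]))
```

The primaryScalerQueries entry then supplies the complementary query that selects only the -primary- pods for the cloned scaler.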
Measuring the canary
Flagger gates on the canary ({{ target }} = targetRef.name, the original
deployment), not the primary. Coroot labels each series with a container_id
derived from the pod name, and Prometheus's RE2 regex engine lacks negative
lookahead, so the MetricTemplates select canary pods while excluding
<target>-primary-* by exploiting the fact that Kubernetes pod-template hashes
are vowel-free (alphabet bcdfghjklmnpqrstvwxz2456789): [bcdfghjklmnpqrstvwxz2-9]+
matches a hash but never "primary" (which contains i and a).
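To make the selection trick concrete, here is a sketch of what a success-rate MetricTemplate built on it could look like. It is not the template shipped in infrastructure/flagger/: the template name, its namespace, the container_id layout (/k8s/<namespace>/<pod>/<container>) and the status label format (three-digit HTTP codes) are all assumptions to verify against coroot-prometheus.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: example-success-rate      # hypothetical name
  namespace: flagger-system       # assumed namespace
spec:
  provider:
    type: prometheus
    address: http://coroot-prometheus.observability.svc:9090
  # Percentage of non-5xx inbound requests on canary pods only.
  # A pod named <target>-<hash>-<suffix> matches the two vowel-free groups;
  # <target>-primary-<hash>-<suffix> does not, because "primary" contains vowels.
  query: |
    100 * sum(rate(container_http_inbound_requests_total{
      container_id=~"/k8s/{{ namespace }}/{{ target }}-[bcdfghjklmnpqrstvwxz2-9]+-[bcdfghjklmnpqrstvwxz2-9]+/.*",
      status!~"5.."
    }[{{ interval }}]))
    /
    sum(rate(container_http_inbound_requests_total{
      container_id=~"/k8s/{{ namespace }}/{{ target }}-[bcdfghjklmnpqrstvwxz2-9]+-[bcdfghjklmnpqrstvwxz2-9]+/.*"
    }[{{ interval }}]))
```

A Canary would reference such a template under analysis.metrics with a templateRef and a thresholdRange, for example a min of 99 for a 99% success-rate floor.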
⚠️ The PromQL is written against Coroot's documented schema but not validated
against live data — before trusting auto-promotion, confirm in
coroot-prometheus the status label format, the container_id format, and the
latency bucket series, and tune
infrastructure/flagger/metric-template-*.yaml.