Progressive delivery (Flagger + Gateway API)

June 21, 2026

Flagger is the platform's standard progressive-delivery controller. Instead of a plain RollingUpdate — where a bad release reaches 100% of traffic before anyone notices — an onboarded app is rolled out as a canary: Flagger shifts a small, increasing slice of traffic to the new version (or, for blue/green, validates it out-of-band), checks request success-rate and latency at each step, and automatically rolls back if the new version misbehaves.

This follows the upstream Flagger Gateway API tutorial and KEDA ScaledObject tutorial, adapted to this platform's Cilium Gateway API + Coroot stack.

How it works

Flagger watches a Canary, clones the target Deployment to <name>-primary, creates <name>-primary / <name>-canary Services, and drives an analysis loop. Promotion vs rollback is gated on:

  • SLO metrics: there is no Istio/Envoy telemetry and no app instrumentation here, so the MetricTemplates query Coroot's bundled Prometheus (the same endpoint OpenCost uses, coroot-prometheus.observability.svc:9090). Coroot's eBPF node-agent exports, per container, the server-side counter container_http_inbound_requests_total{status} and the histogram container_http_inbound_requests_duration_seconds_total. Because this is a standard Prometheus histogram, its queryable bucket series is ..._total_bucket{le}. The templates measure the canary pods (see "Measuring the canary").
  • Webhooks: flagger-loadtester runs an acceptance (smoke) test before traffic shifts and generates load during analysis, so that Coroot has requests to measure. The webhooks hit the <name>-canary Service directly.
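The two gates above map onto the metrics and webhooks lists of a Canary's analysis block. The fragment below is a minimal sketch, not the manifest used in this repository: the app name myapp, the MetricTemplate namespace, the thresholds, the Service port and the test commands are all assumptions for illustration.

```yaml
# Sketch of spec.analysis for a hypothetical app "myapp" in namespace "myapp".
spec:
  analysis:
    interval: 1m              # how often the checks run (assumed)
    threshold: 5              # failed checks tolerated before rollback (assumed)
    metrics:
      - name: success-rate
        templateRef:
          name: coroot-request-success-rate
          namespace: flagger-system     # assumed location of the MetricTemplates
        thresholdRange:
          min: 99             # assumed: percentage of successful requests
        interval: 1m
      - name: latency
        templateRef:
          name: coroot-request-duration
          namespace: flagger-system     # assumed
        thresholdRange:
          max: 0.5            # assumed unit: seconds
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout     # runs before any traffic shifts
        url: http://flagger-loadtester.flagger-system/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sf http://myapp-canary.myapp/"   # hypothetical smoke test
      - name: load-test
        type: rollout         # runs on every analysis iteration
        url: http://flagger-loadtester.flagger-system/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary.myapp/"
```

Both webhooks target the myapp-canary Service, so the generated requests land on the pods that the metric templates measure.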

Two delivery modes

| Mode | Flagger provider | When | Traffic |
| --- | --- | --- | --- |
| Weighted canary | gatewayapi:v1 | App is routed directly by the Gateway | Flagger owns the HTTPRoute and shifts backendRef weights 10% → 50% |
| Blue/green | kubernetes | App is behind oauth2-proxy (no gateway-level split possible) | No live split; canary validated via the load-tester, then the apex Service is repointed |
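In Canary terms, the two modes differ mainly in spec.provider and in how the analysis advances. A hedged sketch of both follows; the interval, threshold and iteration count are assumed values, not the ones used by the onboarded apps.

```yaml
# Weighted canary: Flagger generates the HTTPRoute and shifts backendRef weights.
spec:
  provider: gatewayapi:v1
  analysis:
    interval: 1m             # assumed
    threshold: 5             # assumed
    stepWeights: [10, 50]    # the 10% -> 50% steps from the table above
---
# Blue/green: no live split; the canary is checked for a fixed number of
# iterations and the apex Service is repointed only if every check passes.
spec:
  provider: kubernetes
  analysis:
    interval: 1m             # assumed
    threshold: 5             # assumed
    iterations: 5            # assumed count
```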

What's deployed

| Component | Layer | Path |
| --- | --- | --- |
| flagger controller + flagger-loadtester | infra-controllers | controllers/flagger/ |
| coroot-request-success-rate / coroot-request-duration MetricTemplates | infrastructure | infrastructure/flagger/ |
| umami Canary (weighted) | apps | apps/umami/canary.yaml |
| homepage Canary (blue/green) | apps | apps/homepage/canary.yaml |
| opencost Canary (blue/green) | infrastructure | infrastructure/flagger/canary-opencost.yaml |

CRD-vs-CR layering. The flagger HelmRelease ships the Canary / MetricTemplate CRDs (infra-controllers). A CR of those kinds in the same Flux Kustomization fails the server-side dry-run (no matches for kind) and deadlocks the set. So app Canaries live in the apps layer and the opencost Canary + the MetricTemplates live in the infrastructure layer (both depend on, and wait for, infra-controllers) — the same split as infrastructure/coroot/coroot.yaml.
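The ordering described above can be expressed with Flux's dependsOn. The following is a sketch only: the Kustomization names mirror the layer names used in this document, but the paths, intervals and sourceRef are assumptions, and the real cluster may wire its layers differently.

```yaml
# Layer that installs the flagger HelmRelease, and therefore the CRDs.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers
  namespace: flux-system
spec:
  interval: 10m              # assumed
  path: ./controllers        # hypothetical path
  prune: true
  wait: true                 # Ready only once the applied resources are healthy
  sourceRef:
    kind: GitRepository
    name: flux-system        # assumed source name
---
# Layer that holds the MetricTemplates and the opencost Canary (the CRs).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  dependsOn:
    - name: infra-controllers   # not applied until the CRD layer is Ready
  interval: 10m              # assumed
  path: ./infrastructure     # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system        # assumed source name
```

Because the CRs are dry-run only after infra-controllers reports Ready, the "no matches for kind" failure cannot block the layer that installs the CRDs.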

Onboarded apps & status

  • umami — weighted Gateway API canary. ⚠️ Prod-only (excluded from the docker overlay — Flagger and canaries are opt-in locally, and CI validates manifests statically rather than running them) and stateful (one shared CloudNativePG DB; do not land a schema-changing upgrade as a canary). Its old route's HSTS header is re-added via service.headers; the gethomepage.dev/* tile annotations are not reproducible on a Flagger route.
  • homepage — blue/green (it's the root dashboard behind oauth2-proxy). High blast radius; replicas are owned by a KEDA ScaledObject via autoscalerRef (primary scales 2-3 on Coroot request rate — see "KEDA apps" below).
  • opencost — blue/green infra workload. ⚠️ Headlamp's cost plugin uses the opencost:http-ui named port via the apiserver proxy; portDiscovery may not preserve that name — watch the Headlamp cost panel after rollout.

Excluded (and why)

| Workload | Reason |
| --- | --- |
| whoami, headlamp, actual-budget, fleetdm (parked), hubble-ui | Single-replica by design (always-on; auto-vpa right-sizes them). headlamp = single-pod in-memory OIDC; actual-budget = single-writer file DB; hubble-ui/whoami = single-replica UIs. None can run concurrent canary pods. fleetdm has been disabled since 2026-06-03. |
| openbao | StatefulSet; Flagger only manages Deployments / DaemonSets. |
| coroot UI, hubble-ui | Operator-reconciled Deployments; the operator fights Flagger for ownership. |
| dex, oauth2-proxy, flux-operator | Critical SSO / GitOps; too risky to canary. |

Onboarding a new app

  1. Pick the mode (table above). Stateless, directly-routed apps → weighted; oauth2-proxy-fronted apps → blue/green.
  2. Free the route (weighted only) — delete the app's httproute.yaml; Flagger generates the route from the Canary's gatewayRefs + hosts. Re-add any response-header filters via spec.service.headers. Blue/green keeps the existing route untouched.
  3. Let Flagger (or its KEDA autoscalerRef) own replicas — omit the chart's replicaCount value and do not pin /spec/replicas in a postRenderer, otherwise Flux re-applies the chart's replica count and fights Flagger's scale-to-zero of the canary (flapping). See any onboarded app's helm-release.yaml.
  4. Add the Canary — copy umami's (weighted) or homepage's (blue/green), referencing the two MetricTemplates and a loadtester acceptance + load-test webhook on <app>-canary.
  5. Open the netpols — app namespace: ingress from flagger-system on the app port; flagger-system: load-tester egress to the app namespace+port (in controllers/flagger/networkpolicy.yaml).
  6. Place the Canary in the apps layer (app) or the infrastructure layer (infra component), never in the infra-controllers layer that ships the CRDs (see "CRD-vs-CR layering" above).
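Steps 2 and 5 are the ones that most often need a concrete shape. The sketch below shows a weighted Canary's service block and the app-side NetworkPolicy. Every specific in it is an assumption for illustration: the app name, port, hostname, Gateway name and namespace, header value and policy name are hypothetical, and the cluster may use CiliumNetworkPolicy rather than the upstream kind shown here.

```yaml
# Step 2 (weighted only): Flagger builds the HTTPRoute from gatewayRefs + hosts.
spec:
  service:
    port: 3000                       # assumed app port
    hosts:
      - myapp.example.com            # hypothetical hostname
    gatewayRefs:
      - name: main-gateway           # hypothetical Gateway
        namespace: gateway-system    # hypothetical namespace
    headers:
      response:
        set:
          Strict-Transport-Security: "max-age=31536000"   # example HSTS value
---
# Step 5 (app namespace side): allow the load-tester to reach the app port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-flagger-ingress        # hypothetical name
  namespace: myapp
spec:
  podSelector: {}                    # assumed: all pods in the app namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: flagger-system
      ports:
        - protocol: TCP
          port: 3000                 # assumed app port
```

The matching egress rule for the load-tester belongs in the flagger-system namespace, as step 5 describes.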

KEDA apps (used by homepage + umami)

Flagger's KEDA integration uses the core keda.sh ScaledObject. Homepage and Umami use it: each app's scaled-object.yaml scales 2-3 replicas on Coroot's inbound request-rate series — a metric auto-vpa does not control, so vertical (VPA, up to maxAllowed) and horizontal scaling never fight. The Canary references the ScaledObject and Flagger manages the -primary scaler:

spec:
  autoscalerRef:
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    name: myapp-so          # Flagger clones it to myapp-so-primary and pauses
    primaryScalerQueries:   # the source's scaler at 0 between rollouts.
      requests: sum(rate(container_http_inbound_requests_total{...primary...}[1m]))

The trigger must be named (Flagger matches primaryScalerQueries by trigger name), and the source query must exclude -primary- pods — reuse the vowel-free hash regex described under "Measuring the canary".
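A source ScaledObject that satisfies both requirements could look like the sketch below. It is not the repository's scaled-object.yaml: the threshold is arbitrary, and the container_id layout (/k8s/<namespace>/<pod>/<container>) is an assumption about Coroot's label format that should be confirmed against live data.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-so
  namespace: myapp                   # hypothetical namespace
spec:
  scaleTargetRef:
    name: myapp                      # the original Deployment (the canary)
  minReplicaCount: 2
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      name: requests                 # must equal the key under primaryScalerQueries
      metadata:
        serverAddress: http://coroot-prometheus.observability.svc:9090
        # The vowel-free character class matches myapp-<hash>-<suffix> pods
        # but can never match myapp-primary-<hash>-<suffix>.
        query: 'sum(rate(container_http_inbound_requests_total{container_id=~"/k8s/myapp/myapp-[bcdfghjklmnpqrstvwxz2-9]+-[bcdfghjklmnpqrstvwxz2-9]+/.*"}[1m]))'
        threshold: "10"              # assumed requests per second per replica
```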

Measuring the canary

Flagger gates on the canary ({{ target }} = targetRef.name, the original Deployment), not the primary. Coroot labels metrics with a container_id that embeds the pod name, and RE2 lacks negative lookahead, so the MetricTemplates cannot simply negate <target>-primary-*. Instead they exploit the fact that Kubernetes pod-template hashes are drawn from a vowel-free alphabet (bcdfghjklmnpqrstvwxz2456789): the character class [bcdfghjklmnpqrstvwxz2-9]+ matches any hash but never "primary", which contains i, a and y.

⚠️ The PromQL is written against Coroot's documented schema but not validated against live data — before trusting auto-promotion, confirm in coroot-prometheus the status label format, the container_id format, and the latency bucket series, and tune infrastructure/flagger/metric-template-*.yaml.
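For orientation, a success-rate template built on the vowel-free hash trick could be shaped like the sketch below. It is illustrative rather than a copy of infrastructure/flagger/metric-template-*.yaml: the namespace, the container_id layout (/k8s/<namespace>/<pod>/<container>) and the status matcher are the very assumptions the warning above asks you to verify.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: coroot-request-success-rate
  namespace: flagger-system          # assumed namespace
spec:
  provider:
    type: prometheus
    address: http://coroot-prometheus.observability.svc:9090
  # {{ namespace }}, {{ target }} and {{ interval }} are filled in by Flagger.
  # Prometheus anchors regex matchers, so a pod named <target>-primary-... fails
  # at "primary", which contains characters outside the class.
  query: |
    100 * sum(rate(container_http_inbound_requests_total{
      container_id=~"/k8s/{{ namespace }}/{{ target }}-[bcdfghjklmnpqrstvwxz2-9]+-[bcdfghjklmnpqrstvwxz2-9]+/.*",
      status!~"5.."
    }[{{ interval }}]))
    /
    sum(rate(container_http_inbound_requests_total{
      container_id=~"/k8s/{{ namespace }}/{{ target }}-[bcdfghjklmnpqrstvwxz2-9]+-[bcdfghjklmnpqrstvwxz2-9]+/.*"
    }[{{ interval }}]))
```

Running the rendered query by hand in coroot-prometheus during a rollout is a quick way to check that it returns a value for the canary and nothing for the primary.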

References