Klag [![license-badge][]][license]
June 8, 2026 Β· View on GitHub
Kafka Consumer Lag Exporter β Know when your consumers fall behind, before it becomes a problem.
Inspired by kafka-lag-exporter (archived 2024). Built with Vert.x and Micrometer.
π Documentation lives at klag.dev
Full guides β configuration, Kafka ACLs, Helm/Strimzi deployment, integrations, metrics reference, and development β are at klag.dev. AI agents: a machine-readable corpus is at klag.dev/llms.txt. The docs source lives in
website/(Astro + Starlight).
Scales to large clusters: Monitors thousands of consumer groups in ~50MB heap. Request batching with configurable delays prevents overwhelming brokers β fetch offsets for 500+ groups without spiking cluster CPU.
Why Klag?
Consumer lag is the gap between what Kafka has produced and what your consumers have processed. Left unmonitored, growing lag leads to stale downstream data, memory pressure as consumers struggle to catch up, and silent failures when groups die without alerts. Klag continuously monitors all consumer groups and exposes metrics to your observability stack.
Key Features
| Feature | Why It Matters |
|---|---|
| Lag velocity | Know if lag is growing or shrinking β catch problems before they escalate |
| Time-based lag estimation | See lag in seconds/minutes, not just message counts |
| Hot partition detection | Find partitions with uneven load causing bottlenecks |
| Consumer group state tracking | Alert on Rebalancing, Dead, or Empty states |
| Data loss prevention | Alert before lag exceeds retention and data is lost |
| Request batching | Safely monitor large clusters without overwhelming brokers |
| AI-native | Opt-in read-only MCP endpoint for SRE/dev agents |
Sinks: Prometheus, Datadog, OTLP (Grafana Cloud, New Relic, etc.). See integrations.
How Klag compares
Klag, Burrow, and KMinion all monitor Kafka consumer lag. They differ in what they measure and where they send it.
| Feature | Klag | Burrow | KMinion |
|---|---|---|---|
| Lag in messages | β | β | β |
| Consumer group state | β | β | β |
| Lag velocity (growing/shrinking) | β | β οΈ status only | β |
| Time-based lag + time-to-catch-up | β | β | β |
| Hot partition detection | β | β | β |
| Data loss / retention alerting | β | β | β |
| Lag status rules + notifiers | β | β | β |
| Prometheus / Datadog / OTLP native | β | β οΈ exporter / β / β | β / β / β |
| AI agent endpoint (MCP) | β | β | β |
| Read-only ACLs (DESCRIBE only) | β | β | β οΈ writes for e2e |
Quick Start
docker run -e KAFKA_BOOTSTRAP_SERVERS=kafka:9092 \
-e METRICS_REPORTER=prometheus \
-p 8888:8888 \
themoah/klag:latest
Metrics available at http://localhost:8888/metrics.
A GraalVM native image (themoah/klag:native) starts in ~70-100 ms using ~44 MB RSS,
versus ~500 ms / ~119 MB for the JVM image β same config, endpoints, and metrics. See
Native Image.
Helm
helm repo add klag https://themoah.github.io/klag
helm repo update
helm install klag klag/klag --set kafka.bootstrapServers="kafka-broker:9092"
Published to Artifact Hub. Full chart config, SASL, and Strimzi: Kubernetes deployment and charts/klag/README.md.
Metrics
| Metric | Description |
|---|---|
klag.consumer.lag | Current lag per partition (also .sum, .max, .min) |
klag.consumer.lag.velocity | Rate of change β positive means falling behind |
klag.consumer.lag.ms | Lag in ms from Kafka log timestamps |
klag.consumer.lag.time_to_close_seconds | Estimated seconds until lag reaches zero |
klag.consumer.lag.retention_percent | Lag as % of available messages (data loss alerting) |
klag.consumer.group.state | Group health: Stable, Rebalancing, Dead, Empty |
klag.hot_partition[.lag] | Partitions with statistically abnormal throughput |
Full metrics reference and the pre-built Grafana dashboard are documented at klag.dev/metrics.
Configuration
Configure via src/main/resources/application.properties or environment variables. The
most common:
| Variable | Default | Description |
|---|---|---|
KAFKA_BOOTSTRAP_SERVERS | localhost:9092 | Kafka broker addresses |
METRICS_REPORTER | none | prometheus, datadog, or otlp |
METRICS_INTERVAL_MS | 60000 | How often to collect metrics |
METRICS_GROUP_FILTER | * | Comma-separated glob patterns. A group is included if it matches any segment (e.g. ingest*,categorize*). |
METRICS_GROUP_EXCLUDE | (empty) | Comma-separated glob patterns to exclude even if included by the filter (e.g. debug-*,canary-*,*-shadow). |
See CLAUDE.md for the complete configuration reference.
Broker Compatibility
Klag works with Apache Kafka 2.x and 3.x brokers.
Running against Kafka 2.x
When klag first talks to a 2.x broker, you'll see a single WARN log line β emitted once per process, not per topic and not per scrape:
MAX_TIMESTAMP listOffsets unsupported by broker (likely pre-Kafka 3.0);
falling back to LATEST for logEndTimestamp. Logged once per process;
further occurrences at DEBUG. Cause: ...
This is expected, and safe to ignore. All klag metrics remain accurate. Subsequent per-topic fallback events are emitted at DEBUG; flip LOG_LEVEL_KLAG=DEBUG if you want to see them.
Why this happens, and why the fallback exists at all
Klag asks the broker for partition end-offsets using OffsetSpec.MAX_TIMESTAMP, an API added in Kafka 3.0 (KIP-734). 2.x brokers don't support it and respond with UnsupportedVersionException. Klag catches that and falls back to OffsetSpec.LATEST, which every Kafka version supports. The fallback returns the same partition end-offset that drives every lag metric, so accuracy is preserved.
Why the fallback is necessary. Without it, klag refuses to start on a 2.x cluster. The first metrics scrape runs synchronously during klag's startup, and a failure there propagates back up to the process launcher, which exits with code 1 β CrashLoopBackOff under Kubernetes. One consumer group on a 2.x cluster was enough to brick startup.
Note for future contributors. On 2.x brokers, both logEndTimestamp and maxTimestampOffset fall back to the LATEST offset's timestamp/offset, so time-based lag interpolation degrades gracefully (the anchor becomes the broker-side append time of the last record rather than the highest-timestamp record).
Kafka ACL Permissions
Klag requires read-only access to monitor consumer lag. It uses only the Kafka Admin Client API with DESCRIBE permissionsβno write or alter access needed.
Required Permissions
| Resource | Name | Permission | Operations |
|---|---|---|---|
| CLUSTER | kafka-cluster | DESCRIBE | Health check, list consumer groups |
| TOPIC | * or prefixed | DESCRIBE | Get partition info and offsets |
| GROUP | * or prefixed | DESCRIBE | Get group state and committed offsets |
Self-Managed Kafka
Monitor all groups and topics
# Cluster permissions (required)
kafka-acls --bootstrap-server <broker> \
--add --allow-principal User:<klag-user> \
--operation Describe --cluster
# All topics
kafka-acls --bootstrap-server <broker> \
--add --allow-principal User:<klag-user> \
--operation Describe --topic '*'
# All consumer groups
kafka-acls --bootstrap-server <broker> \
--add --allow-principal User:<klag-user> \
--operation Describe --group '*'
Klag needs read-only Kafka access (Admin Client DESCRIBE on cluster, topics, groups). Full configuration reference and ACL setup (self-managed + Confluent Cloud) are at klag.dev/configuration and klag.dev/kafka/acl-permissions.
Development
Requires Java 21.
./gradlew clean test # Run tests
./gradlew clean assemble # Build fat JAR
./gradlew clean run # Run with hot-reload
End-to-end tests (k3d + real Kafka, Strimzi matrix) live in scripts/. See
Build from Source and
Contributing.
Some parts of the code were written with Claude
