Troubleshooting
May 17, 2026
Symptom: HTTP 410 Gone from /api/v1/*
Cause: client is calling the pre-v1.0 wire surface. v1.0 removed /api/v1/* and serves a 410 catch-all with a stable JSON body:
{
"error": "api/v1 removed in v1.0; use /api/v2",
"code": "gone",
"removed_version": "v1",
"preferred": "v2",
"docs": "https://github.com/ingero-io/ingero-fleet/blob/main/docs/api-versioning.md"
}
Fix: switch the client to /api/v2/*. The Grafana plugin handles this automatically; custom dashboards / scripts need a URL update.
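To locate the custom scripts or dashboard definitions that still call the removed surface, a recursive search for the old path prefix is usually enough. The following is a minimal sketch; the helper name find_v1_callers and the directory you pass to it are illustrative and not part of the project.

```shell
# Hypothetical helper: list every file and line that still references /api/v1/.
# Usage: find_v1_callers ./dashboards
find_v1_callers() {
  # -r recurse, -n print line numbers, -F treat the pattern as a fixed string.
  # grep exits non-zero when nothing matches, so callers can test the result.
  grep -rnF '/api/v1/' "$1"
}
```

Each hit is printed as file:line:content, which gives you a checklist of URLs to update to /api/v2/*.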
Symptom: HTTP 401 Unauthorized from /api/v2/*
Causes:
- Missing Authorization: Bearer <token> header.
- Bearer-token mismatch (the token in the request does not match the running Echo's accept-set).
- Bearer was rotated and the client is still using the old token after the grace window (default 5 min) expired.
Diagnose:
# Check that the bearer hashes match.
echo -n "$BEARER" | sha256sum
# Compare with the audit log's live_hash on the most recent
# event=bearer_rotation_applied line.
kubectl -n ingero logs ingero-echo-0 | grep bearer_rotation_applied | tail -1
If the hashes match but you still get 401, check that the request actually reached Echo (not a reverse proxy returning 401 of its own).
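The hash comparison can be scripted so that it does not depend on eyeballing two long hex strings. This sketch assumes that live_hash is the full hex SHA-256 of the raw token, which the sha256sum step above implies but does not state; if your audit log truncates the hash, compare prefixes instead. The function names are hypothetical.

```shell
# Print only the hex SHA-256 digest of a token (drops the trailing " -"
# that sha256sum appends). On macOS, substitute `shasum -a 256`.
bearer_hash() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Succeed when the token's digest equals the expected hash (e.g. live_hash).
# Assumption: the expected hash is lowercase hex, untruncated.
hash_matches() {
  [ "$(bearer_hash "$1")" = "$2" ]
}
```

For example, hash_matches "$BEARER" "$LIVE_HASH" && echo "token matches" confirms the client and Echo agree on the token, where LIVE_HASH is the value you copied from the audit line.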
Symptom: HTTP 403 tenant_scoped_bearer_refused
Cause: the bearer is tenant-scoped (has an allowed_clusters list) AND the request's cluster_id is not in that list, OR the request is to a tool that refuses tenant-scoped bearers (currently only fleet.cluster.run_analysis).
Diagnose:
# What clusters is this bearer allowed?
curl -fsS -H "Authorization: Bearer $BEARER" https://echo.example.com/api/v2/whoami
Fix: either route the request to one of the allowed clusters, or escalate to a wildcard bearer (rare; only for platform operators).
Symptom: HTTP 429 Too Many Requests
Cause: the per-bearer rate-limit bucket (default 30 rps / 100 burst) is empty.
The response carries a Retry-After: <seconds> header indicating how long to wait. If the client polls aggressively, make it back off. If the bucket is too tight for legitimate use, raise the limit via the chart's rateLimit.requestsPerSecond / rateLimit.burst values.
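A client script can honor Retry-After rather than retrying blindly. The sketch below is one way to do it with curl; the function names, the five-attempt cap, and the one-second fallback are arbitrary choices for illustration, and it assumes $BEARER holds a valid token.

```shell
# Read raw response headers on stdin and print the Retry-After value.
# Matching is case-insensitive because HTTP/2 header names arrive lowercase.
retry_after_seconds() {
  awk 'tolower($1) == "retry-after:" { gsub(/\r/, "", $2); print $2; exit }'
}

# Hypothetical wrapper: GET a URL, sleeping for Retry-After on each 429.
# Prints the final HTTP status; returns 1 if still rate-limited after 5 tries.
get_with_backoff() {
  url="$1"
  hdrs=$(mktemp)
  attempt=0
  while [ "$attempt" -lt 5 ]; do
    # -D dumps response headers to a file; -w prints just the status code.
    code=$(curl -sS -D "$hdrs" -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $BEARER" "$url")
    if [ "$code" != "429" ]; then
      rm -f "$hdrs"
      echo "$code"
      return 0
    fi
    delay=$(retry_after_seconds < "$hdrs")
    sleep "${delay:-1}"   # fall back to 1s if the header is missing
    attempt=$((attempt + 1))
  done
  rm -f "$hdrs"
  return 1
}
```

If a well-behaved client still hits the cap regularly, that is the signal to raise the chart's rate-limit values rather than to retry harder.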
Symptom: HTTP 500 tool_backend_failure
Cause: DuckDB returned an error from a tool dispatch path. The wire body is scrubbed:
{"error":"tool backend failure","code":"tool_backend_failure","tool":"fleet.cluster.summary","req_id":"<16-hex>"}
The full underlying error is in the audit log, keyed by req_id:
kubectl -n ingero logs ingero-echo-0 | grep "$REQ_ID" | tail
Look for the event=tool_backend_failure line with the full err.Error() text.
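When a customer sends you the scrubbed error body, the req_id can be pulled out mechanically before grepping the audit log. This is a sketch that assumes the body has the shape shown above (a 16-character lowercase hex req_id); the helper name is hypothetical.

```shell
# Read a scrubbed error body on stdin and print its 16-hex req_id.
# Prints nothing if the field is absent or not 16 lowercase hex characters.
extract_req_id() {
  sed -n 's/.*"req_id"[[:space:]]*:[[:space:]]*"\([0-9a-f]\{16\}\)".*/\1/p'
}
```

A typical use is REQ_ID=$(printf '%s' "$BODY" | extract_req_id), followed by the kubectl logs command above.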
If the line has already rotated out (default Kubernetes journald rotation is ~50 MiB, or a few hours of traffic), forward the audit log to Loki/Splunk with 30+ days of retention. The privacy-sensitive parts (bearer hashes, source IPs, key names) sit alongside the error text, so retention should match your audit-log policy.
Symptom: HTTP 503 shutting_down
Cause: Echo is in graceful-drain mode after receiving SIGTERM. New requests get 503 + Retry-After: 5 until the drain completes.
This is intentional. The plugin's health check tolerates one 503 by design.
If 503 persists for more than shutdown_deadline (default 30s), check pod status; Echo may be stuck on an in-flight query. When the deadline expires, the parent context is cancelled, which propagates context.Canceled into in-flight DuckDB queries to unblock them.
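A script that needs to ride out a drain can poll until Echo answers normally, bounded by the shutdown deadline. The sketch below is illustrative only: the function names are hypothetical, and it assumes that $BEARER is set and that the probed URL returns 200 once Echo is serving again.

```shell
# Run a probe command repeatedly until it succeeds or the deadline passes.
# $1 = deadline in seconds, $2 = seconds between attempts (must be > 0),
# remaining arguments = the probe command and its arguments.
wait_for_drain() {
  deadline="$1"; interval="$2"; shift 2
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Hypothetical probe: succeeds once the given URL answers 200 again.
echo_ready() {
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $BEARER" "$1")
  [ "$code" = "200" ]
}
```

For example, wait_for_drain 30 5 echo_ready https://echo.example.com/api/v2/whoami || kubectl -n ingero get pods waits out the default deadline at the Retry-After interval, then falls through to the pod-status check.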
Symptom: Metrics scrape fails (Prometheus shows the target down)
Causes:
- /metrics is bearer-required. The ServiceMonitor's authorization block must point at the bearer Secret. Without it, scrape returns 401.
- NetworkPolicy is enabled but does NOT allow ingress from the Prometheus pods. The v1.0 chart's networkpolicy.yaml auto-adds an allow rule when serviceMonitor.enabled=true, BUT requires serviceMonitor.scraperSelector to match Prometheus's pod labels. Default is app.kubernetes.io/name=prometheus (kube-prometheus-stack convention).
- The metrics endpoint exists but no traffic is flowing yet. echo_http_requests_total is empty until the first request lands.
Diagnose:
# Direct test (bearer + plain HTTP):
TOKEN=$(kubectl -n ingero get secret ingero-echo-auth -o jsonpath='{.data.token}' | base64 -d)
kubectl -n ingero port-forward svc/ingero-echo 8081:8081 &
sleep 2  # give the port-forward a moment to bind before curling
curl -fsS -H "Authorization: Bearer $TOKEN" http://127.0.0.1:8081/metrics | head -20
If this works, the issue is between Prometheus and Echo (NetworkPolicy / ServiceMonitor / Secret reference). If it fails with 401, your token is wrong. If it fails with connection refused, port-forward failed.
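The three outcomes above can be told apart from the HTTP status alone, which is handy in a runbook script. This is a sketch with a hypothetical helper name; it relies on curl printing 000 for %{http_code} when no HTTP response was received.

```shell
# Map the HTTP status from the direct /metrics test to the likely cause.
# Obtain the status with, for example:
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: Bearer $TOKEN" http://127.0.0.1:8081/metrics)
diagnose_metrics_code() {
  case "$1" in
    200) echo "echo-ok: check NetworkPolicy / ServiceMonitor / Secret reference" ;;
    401) echo "bad-token: the bearer is wrong" ;;
    000) echo "no-connection: the port-forward is not up" ;;
    *)   echo "unexpected: HTTP $1" ;;
  esac
}
```

Any other status falls into the last branch and is worth reading against the other symptoms in this guide.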
Symptom: helm install fails with "no matches for kind ServiceMonitor"
Cause: serviceMonitor.enabled=true but the Prometheus Operator (which owns the monitoring.coreos.com/v1 API group) is not installed.
Fix: install kube-prometheus-stack, OR set serviceMonitor.enabled=false and use direct scrape config in your Prometheus config.
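Before running helm install, you can check whether the ServiceMonitor CRD is registered. A minimal sketch, with a hypothetical helper name; servicemonitors.monitoring.coreos.com is the CRD name the Prometheus Operator installs.

```shell
# Succeed only if the ServiceMonitor CRD is registered with the API server.
has_servicemonitor_crd() {
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}
```

For example, has_servicemonitor_crd || echo "set serviceMonitor.enabled=false or install the Prometheus Operator" makes the precondition explicit in an install script.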
Symptom: PVC stuck Pending after install
Causes:
- No default StorageClass + chart's persistence.storageClass="". Set explicitly.
- StorageClass requires a manual provisioner that isn't running.
- Quota / region capacity.
Diagnose:
kubectl -n ingero describe pvc data-ingero-echo-0
The Events section names the provisioner error.
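To rule out the first cause quickly, check whether the cluster has a default StorageClass at all. This sketch uses a hypothetical helper name and relies on kubectl marking the default class with "(default)" after its name in the NAME column.

```shell
# Print the name of each StorageClass marked as the cluster default.
# Prints nothing when no default exists.
default_storage_classes() {
  kubectl get storageclass --no-headers 2>/dev/null |
    awk '$2 == "(default)" { print $1 }'
}
```

Empty output means there is no default, so persistence.storageClass must name a class explicitly.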
Audit log retention guidance
Default Kubernetes journald rotation is ~50 MiB per container, which translates to hours of forensic visibility for a busy Echo. The audit log carries the FULL err.Error() text on event=tool_backend_failure lines (before the wire body is scrubbed), keyed by req_id so you can correlate a customer-reported 500 to the underlying DuckDB error.
For production, forward audit lines to Loki or Splunk with:
- Minimum 30 days retention.
- bearer_hash and req_id as indexed fields (for fast correlation).
- Access controls matching the privacy posture of the running PVC.
A reasonable Loki forwarder config:
# Promtail / Alloy / vector config snippet
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
regex: ingero-echo
action: keep
Where to ask
- GitHub issues: https://github.com/ingero-io/ingero-fleet/issues
- GitHub discussions: https://github.com/orgs/ingero-io/discussions