Troubleshooting

May 17, 2026

Symptom: HTTP 410 Gone from /api/v1/*

Cause: the client is calling the pre-v1.0 wire surface. v1.0 removed /api/v1/* and answers every request under that prefix with a 410 and a stable JSON body:

{
  "error": "api/v1 removed in v1.0; use /api/v2",
  "code": "gone",
  "removed_version": "v1",
  "preferred": "v2",
  "docs": "https://github.com/ingero-io/ingero-fleet/blob/main/docs/api-versioning.md"
}

Fix: switch the client to /api/v2/*. The Grafana plugin handles this automatically; custom dashboards / scripts need a URL update.
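To find the custom dashboards and scripts that still need the URL update, a recursive search for the old prefix is usually enough. This is a minimal sketch: the function name find_v1_refs and the directory you pass it are illustrative, not part of the product.

```shell
# List every file and line under a directory that still references the
# removed /api/v1/ prefix. The directory argument is wherever you keep
# dashboards or scripts (hypothetical layout; defaults to the current dir).
find_v1_refs() {
  local dir="${1:-.}"
  # -r: recurse, -n: print line numbers, -F: treat the pattern as a literal.
  grep -rnF '/api/v1/' "$dir"
}
```

Running find_v1_refs ./dashboards prints one file:line:text entry per remaining reference and exits non-zero when nothing is left to update.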

Symptom: HTTP 401 Unauthorized from /api/v2/*

Causes:

  1. Missing Authorization: Bearer <token> header.
  2. Bearer-token mismatch (the token in the request does not match the running Echo's accept-set).
  3. Bearer was rotated and the client is still using the old token after the grace window (default 5 min) expired.

Diagnose:

# Check that the bearer hashes match. printf avoids the trailing newline
# that would change the digest.
printf '%s' "$BEARER" | sha256sum
# Compare with the audit log's live_hash on the most recent
# event=bearer_rotation_applied line.
kubectl -n ingero logs ingero-echo-0 | grep bearer_rotation_applied | tail -1

If the hashes match but you still get 401, check that the request actually reached Echo (not a reverse proxy returning 401 of its own).
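The hash comparison can be scripted instead of eyeballed. The sketch below rests on two assumptions that this page does not confirm: the audit line is logfmt-style with the field written as live_hash=<hex>, and the value is the full SHA-256 hex digest of the token. Both function names are hypothetical.

```shell
# Extract the live_hash value from an audit line.
# ASSUMPTION: the field appears as live_hash=<lowercase hex>; adjust the
# pattern if your audit lines are JSON or quote the value.
live_hash_from_line() {
  printf '%s\n' "$1" | sed -n 's/.*live_hash=\([0-9a-f]*\).*/\1/p'
}

# Print "match" or "mismatch" for a token ($1) against an audit line ($2).
# ASSUMPTION: live_hash is the full SHA-256 hex digest of the token.
check_bearer_hash() {
  local token_hash live_hash
  token_hash=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
  live_hash=$(live_hash_from_line "$2")
  if [ -n "$live_hash" ] && [ "$token_hash" = "$live_hash" ]; then
    echo match
  else
    echo mismatch
  fi
}
```

Typical use: check_bearer_hash "$BEARER" "$(kubectl -n ingero logs ingero-echo-0 | grep bearer_rotation_applied | tail -1)". A line with no live_hash field reports mismatch rather than failing silently.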

Symptom: HTTP 403 tenant_scoped_bearer_refused

Cause: the bearer is tenant-scoped (has an allowed_clusters list) and one of the following is true: the request's cluster_id is not in that list, or the request targets a tool that refuses tenant-scoped bearers (currently only fleet.cluster.run_analysis).

Diagnose:

# What clusters is this bearer allowed?
curl -fsS -H "Authorization: Bearer $BEARER" https://echo.example.com/api/v2/whoami

Fix: either route the request to one of the allowed clusters, or escalate to a wildcard bearer (rare; only for platform operators).

Symptom: HTTP 429 Too Many Requests

Cause: the per-bearer rate-limit bucket (default 30 rps / 100 burst) is empty.

The response carries a Retry-After: <seconds> header that says how long to wait. If the client polls aggressively, make it back off. If the bucket is too tight for legitimate use, raise the limit via the chart's rateLimit.requestsPerSecond / rateLimit.burst values.
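A client-side backoff that honors Retry-After might look like the sketch below. The function names, the attempt limit, and the 1-second fallback are illustrative choices, and the wrapper discards the response body, so treat it as a starting point rather than a drop-in client.

```shell
# Read HTTP response headers on stdin and print the Retry-After value in
# seconds. Prints the fallback ($1, default 1) when the header is missing
# or is not a plain integer.
retry_after_seconds() {
  local fallback="${1:-1}" value
  value=$(tr -d '\r' | awk -F': *' 'tolower($1) == "retry-after" { print $2; exit }')
  case "$value" in
    ''|*[!0-9]*) echo "$fallback" ;;
    *)           echo "$value" ;;
  esac
}

# Request a URL, sleeping for Retry-After and retrying while Echo answers 429.
# Prints the final HTTP status code. $1 = URL, $2 = max attempts (default 5).
get_with_backoff() {
  local url="$1" attempts="${2:-5}" code hdrs
  hdrs=$(mktemp)
  while [ "$attempts" -gt 0 ]; do
    code=$(curl -sS -o /dev/null -D "$hdrs" -w '%{http_code}' \
      -H "Authorization: Bearer $BEARER" "$url")
    if [ "$code" != "429" ]; then
      break
    fi
    sleep "$(retry_after_seconds 1 < "$hdrs")"
    attempts=$((attempts - 1))
  done
  rm -f "$hdrs"
  echo "$code"
}
```

The header match is case-insensitive because HTTP/2 responses carry lowercase header names.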

Symptom: HTTP 500 tool_backend_failure

Cause: DuckDB returned an error from a tool dispatch path. The wire body is scrubbed:

{"error":"tool backend failure","code":"tool_backend_failure","tool":"fleet.cluster.summary","req_id":"<16-hex>"}

The full underlying error is in the audit log, keyed by req_id:

kubectl -n ingero logs ingero-echo-0 | grep "$REQ_ID" | tail

Look for the event=tool_backend_failure line with the full err.Error() text.
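When the 500 body arrives in a script or a support ticket, the req_id can be pulled out without jq. This is a small sketch: req_id_from_body is a hypothetical helper, and it assumes the body has the shape shown above with a lowercase-hex req_id.

```shell
# Print the req_id field from a scrubbed error body passed as $1.
# Tolerates optional spaces around the colon; prints nothing if absent.
req_id_from_body() {
  printf '%s\n' "$1" | sed -n 's/.*"req_id" *: *"\([0-9a-f]*\)".*/\1/p'
}
```

The result plugs straight into the log search: REQ_ID=$(req_id_from_body "$BODY"), then kubectl -n ingero logs ingero-echo-0 | grep -F "$REQ_ID" | grep tool_backend_failure.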

If the line has already rotated out of the pod log (default Kubernetes container log rotation keeps roughly 50 MiB per container, which is a few hours of traffic on a busy Echo), forward audit lines to Loki or Splunk and retain them for 30+ days. Privacy-sensitive fields (bearer hashes, source IPs, key names) sit alongside the error text, so retention should match your audit-log policy.

Symptom: HTTP 503 shutting_down

Cause: Echo is in graceful-drain mode after receiving SIGTERM. New requests get 503 + Retry-After: 5 until the drain completes.

This is intentional. The plugin's health check tolerates one 503 by design.

If 503 persists for longer than shutdown_deadline (default 30s), check pod status; Echo may be stuck on an in-flight query. When the deadline expires, the parent context is cancelled and context.Canceled propagates into in-flight DuckDB queries to unblock them.
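A script that needs to ride out a drain can poll until the 503 clears and give up once the deadline has passed. The sketch below is one way to do that; wait_out_drain and the probe command you hand it are hypothetical, and the deadline and interval must be whole seconds.

```shell
# Poll a probe command until it stops reporting 503 or the deadline passes.
# $1 = deadline in seconds, $2 = poll interval in seconds,
# remaining args = a command that prints an HTTP status code.
wait_out_drain() {
  local deadline="$1" interval="$2" waited=0 code
  shift 2
  while :; do
    code=$("$@")
    if [ "$code" != "503" ]; then
      echo "recovered: HTTP $code"
      return 0
    fi
    if [ "$waited" -ge "$deadline" ]; then
      echo "still 503 after ${waited}s: check pod status"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

With the defaults from this section, wait_out_drain 30 5 probe matches the 30s shutdown_deadline and the Retry-After: 5 hint, where probe is a function of your own that runs curl -s -o /dev/null -w '%{http_code}' against an /api/v2 URL.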

Symptom: Metrics scrape fails (Prometheus shows the target down)

Causes:

  1. /metrics is bearer-required. The ServiceMonitor's authorization block must point at the bearer Secret. Without it, scrape returns 401.
  2. NetworkPolicy is enabled but does not allow ingress from the Prometheus pods. The v1.0 chart's networkpolicy.yaml adds an allow rule automatically when serviceMonitor.enabled=true, but the rule only takes effect if serviceMonitor.scraperSelector matches the labels on the Prometheus pods. The default is app.kubernetes.io/name=prometheus (kube-prometheus-stack convention).
  3. The metrics endpoint exists but no traffic is flowing yet. echo_http_requests_total is empty until the first request lands.

Diagnose:

# Direct test (bearer + plain HTTP):
TOKEN=$(kubectl -n ingero get secret ingero-echo-auth -o jsonpath='{.data.token}' | base64 -d)
kubectl -n ingero port-forward svc/ingero-echo 8081:8081 &
PF_PID=$!
sleep 2  # give the port-forward a moment to bind before curl runs
curl -fsS -H "Authorization: Bearer $TOKEN" http://127.0.0.1:8081/metrics | head -20
kill "$PF_PID"  # stop the background port-forward

If this works, the issue is between Prometheus and Echo (NetworkPolicy / ServiceMonitor / Secret reference). If it fails with 401, your token is wrong. If it fails with connection refused, port-forward failed.
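The three outcomes can be told apart mechanically from curl's exit status and the HTTP status code. The helper below is a hedged sketch (classify_metrics_probe is a made-up name); it relies on curl exit code 7 meaning "failed to connect" and expects the probe to run without -f so the status code is still reported.

```shell
# Map the outcome of the direct /metrics test to the likely cause.
# $1 = curl exit status, $2 = HTTP status from curl -w '%{http_code}'.
classify_metrics_probe() {
  local curl_exit="$1" http_code="$2"
  if [ "$curl_exit" -eq 7 ]; then
    echo "connection refused: port-forward is not up"
  elif [ "$http_code" = "401" ]; then
    echo "401: token is wrong"
  elif [ "$http_code" = "200" ]; then
    echo "ok: check NetworkPolicy / ServiceMonitor / Secret reference"
  else
    echo "unexpected: curl exit $curl_exit, HTTP $http_code"
  fi
}

# Example (run while the port-forward is active):
#   code=$(curl -sS -o /dev/null -w '%{http_code}' \
#     -H "Authorization: Bearer $TOKEN" http://127.0.0.1:8081/metrics)
#   classify_metrics_probe "$?" "$code"
```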

Symptom: helm install fails with "no matches for kind ServiceMonitor"

Cause: serviceMonitor.enabled=true but the Prometheus Operator (which owns the monitoring.coreos.com/v1 API group) is not installed.

Fix: install kube-prometheus-stack, OR set serviceMonitor.enabled=false and use direct scrape config in your Prometheus config.
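An install script can check for the ServiceMonitor CRD up front and choose the chart value accordingly. This is a sketch under the assumption that the CRD carries its standard name, servicemonitors.monitoring.coreos.com; the function name is hypothetical.

```shell
# Print the serviceMonitor.enabled setting that is safe for this cluster,
# based on whether the Prometheus Operator's ServiceMonitor CRD exists.
servicemonitor_setting() {
  if kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1; then
    echo "serviceMonitor.enabled=true"
  else
    echo "serviceMonitor.enabled=false"
  fi
}
```

The output is shaped for helm's --set flag, for example --set "$(servicemonitor_setting)" on your usual install or upgrade command.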

Symptom: PVC stuck Pending after install

Causes:

  1. The cluster has no default StorageClass and the chart's persistence.storageClass is empty (""). Set it explicitly.
  2. The StorageClass depends on a provisioner that isn't running.
  3. Storage quota or regional capacity is exhausted.

Diagnose:

kubectl -n ingero describe pvc data-ingero-echo-0

The Events section names the provisioner error.
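To rule out the first cause quickly, check whether the cluster has a default StorageClass at all. The filter below is a small sketch that reads the table printed by kubectl get storageclass, which marks default classes with "(default)" after the name; default_storageclass is a hypothetical helper name.

```shell
# Read `kubectl get storageclass` output on stdin and print the name of
# each class marked "(default)". Prints nothing if there is no default.
default_storageclass() {
  awk 'NR > 1 && $2 == "(default)" { print $1 }'
}

# Example:
#   kubectl get storageclass | default_storageclass
```

Empty output means there is no default class, so persistence.storageClass has to be set explicitly.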

Audit log retention guidance

Default Kubernetes container log rotation keeps roughly 50 MiB per container, which translates to a few hours of forensic visibility for a busy Echo. The audit log carries the full err.Error() text on event=tool_backend_failure lines (before the wire body is scrubbed), keyed by req_id so you can correlate a customer-reported 500 with the underlying DuckDB error.

For production, forward audit lines to Loki or Splunk with:

  • Minimum 30 days retention.
  • bearer_hash and req_id as indexed fields (for fast correlation).
  • Access controls matching the privacy posture of the running PVC.

A reasonable Loki forwarder config:

# Promtail / Alloy / vector config snippet
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    regex: ingero-echo
    action: keep

Where to ask