Ingero Fleet
May 22, 2026
Version: 1.0.1
Cluster-side OpenTelemetry Collector distribution for GPU clusters.
Fleet receives OTLP from every Ingero agent in your fleet, runs three GPU-specific computations on the stream (MAD-based straggler threshold, NCCL collective skew, provider cost attribution), and forwards everything to your existing observability stack. One binary covers what would otherwise be three separate hops.
| Role | Component | Headline output |
|---|---|---|
| Straggler detection | ingero processor + extension | per-cluster threshold; pushed back in OTLP response headers so agents self-classify in real time |
| NCCL collective health | nccl processor | per-rank skew + slow-rank attribution from libnccl uprobes |
| Cost attribution | provider-lookup processor | enriches every metric with ingero.provider so cost dashboards just work |
| EE / health surface | healthee extension | /healthz/ee open probe + bearer-authed /internal/ee/state rich body |
| OTEL backbone | standard collector | OTLP receivers, TLS/mTLS, auth, batching, retry, backends |
Companion service: Ingero Echo, the cluster-wide event
store and MCP query layer that Fleet forwards to. HTTP+JSON surface at
/api/v2/* (bearer-authed; OpenAPI-described; per-tool input + output JSON
schemas) plus a bearer-authed /metrics Prometheus endpoint with strict
label cardinality. See docs/api-versioning.md.
No agent needs inbound network access; everything is outbound push and pull. Threshold is delivered to agents in the OTLP push response, so the straggler classification path needs no extra polling.
Quick Start
Looking for a worked end-to-end example? These multi-node quickstart guides take you from zero to a detected straggler on three GPU hosts in about 20 minutes. Pick the deployment style that matches your environment:
See docs/quickstart_fleet.md for a one-page comparison if you are not sure which to pick.
Option A: Use the pre-built Fleet distribution
# Docker (multi-arch manifest covers amd64 + arm64)
docker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:1.0.1
# Binary
# ingero-version:install-curl-version product=ingero-fleet channel=stable
VERSION=1.0.1
curl -fsSL "https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz" | tar xz
./ingero-fleet --config fleet-config.yaml
Verify the release (cosign keyless OIDC; signed in GitHub Actions):
curl -fsSLO https://github.com/ingero-io/ingero-fleet/releases/download/v1.0.1/checksums.txt
curl -fsSLO https://github.com/ingero-io/ingero-fleet/releases/download/v1.0.1/checksums.txt.sig
cosign verify-blob checksums.txt --signature checksums.txt.sig \
--certificate-identity-regexp '^https://github.com/ingero-io/ingero-fleet/' \
--certificate-oidc-issuer https://token.actions.githubusercontent.com
sha256sum -c checksums.txt --ignore-missing
CycloneDX SBOMs ship alongside each archive (hash anchored in checksums.txt).
See docs/operator/deploy.md for the full flow.
Option B: Kubernetes (Helm)
From the published chart repo:
helm repo add ingero https://ingero-io.github.io/ingero-fleet
helm repo update
helm install ingero-fleet ingero/ingero-fleet
Or from a checkout of this repo:
helm install ingero-fleet ./helm/ingero-fleet
Companion Ingero Echo (datastore + MCP server) install (one chart, one Secret for the bearer token):
kubectl create secret generic ingero-echo-auth \
--from-literal=token=$(openssl rand -hex 32) -n ingero
helm install ingero-echo ./helm/ingero-echo \
--set persistence.storageClass=<encrypted-storageclass> \
--set 'extraEnv[0].name=AUTH_TOKEN' \
--set 'extraEnv[0].valueFrom.secretKeyRef.name=ingero-echo-auth' \
--set 'extraEnv[0].valueFrom.secretKeyRef.key=token'
The Echo chart's PVC encryption-at-rest gate makes helm install fail closed if
neither persistence.storageClass nor persistence.encryptionAcknowledged=true
is set. See docs/operator/deploy.md for cloud-specific encrypted-StorageClass
examples.
Chart default is replicaCount: 1; see the High Availability section below for
multi-replica guidance.
Option C: Add Ingero to your existing OTEL Collector
Advanced path: drop the Ingero Go modules into your existing OCB manifest and rebuild your collector instead of running ours.
# builder-config.yaml
processors:
# ingero-version:builder-gomod-processor product=ingero-fleet channel=stable
- gomod: github.com/ingero-io/ingero-fleet/processor v1.0.1
extensions:
# ingero-version:builder-gomod-extension product=ingero-fleet channel=stable
- gomod: github.com/ingero-io/ingero-fleet/extension v1.0.1
ocb --config builder-config.yaml
Configure the agent
Point your Ingero agent at Fleet:
# ingero.yaml (on each GPU node)
fleet:
endpoint: https://fleet.example.com:4317
Architecture
The detailed Fleet-Agent component view is below.
Fleet-Agent Overview
graph TB
subgraph GPU Cluster
A1[Ingero Agent<br/>gpu-node-01<br/>score: 0.92]
A2[Ingero Agent<br/>gpu-node-02<br/>score: 0.91]
A3[Ingero Agent<br/>gpu-node-03<br/>score: 0.58]
A4[Ingero Agent<br/>gpu-node-04<br/>score: 0.93]
end
subgraph Fleet Service
R[OTLP Receiver<br/>gRPC :4317 / HTTP :4318]
P[Ingero Processor<br/>Score Map + MAD + EMA]
E[Ingero Extension<br/>Threshold API :8080<br/>Middleware Piggyback]
EX[Prometheus Exporter]
end
subgraph Observability
PR[Prometheus]
GR[Grafana]
end
A1 -->|OTLP push| R
A2 -->|OTLP push| R
A3 -->|OTLP push| R
A4 -->|OTLP push| R
R --> P
P -->|Set threshold| E
P --> EX
EX --> PR
PR --> GR
E -.->|threshold in<br/>push response| A1
E -.->|threshold in<br/>push response| A2
E -.->|threshold in<br/>push response| A3
E -.->|threshold in<br/>push response| A4
style A3 fill:#f66,stroke:#333,color:#fff
Node 03 (score 0.58) is below the fleet threshold (0.87), so it is detected as a straggler.
Detailed Communication Flow
sequenceDiagram
participant A as Ingero Agent<br/>(GPU Node)
participant R as OTLP Receiver<br/>:4317 gRPC / :4318 HTTP
participant M as Ingero Extension<br/>(Middleware)
participant P as Ingero Processor
participant S as ThresholdStore<br/>(in-memory)
participant T as Timer<br/>(every 10s)
participant API as Threshold API<br/>:8080
Note over A: Computes health score<br/>from 4 GPU signals
A->>R: OTLP push (HTTP :4318)<br/>metric: ingero.node.health_score = 0.92<br/>attrs: node.id, cluster.id, state<br/>header: ingero.cluster.id = cluster-prod
R->>M: HTTP request passes through middleware
M->>R: Injects response headers:<br/>X-Ingero-Threshold: 0.87<br/>X-Ingero-Quorum-Met: true
R->>P: ConsumeMetrics(OTLP payload)
P->>P: Extract health_score from payload<br/>Write to score map[cluster:node]
R-->>A: HTTP 200 + threshold headers
Note over A: Reads X-Ingero-Threshold<br/>0.92 > 0.87 = healthy
T->>P: Timer tick (every push_interval)
P->>P: Read score map (RLock)<br/>Compute MAD per cluster<br/>Apply EMA smoothing<br/>Check quorum, panic mode
P->>S: Set(cluster_id, ThresholdResult)
Note over M: Next push reads<br/>updated threshold from store
A->>API: GET /api/v1/threshold?cluster_id=cluster-prod<br/>(fallback if no piggyback)
API->>S: Get(cluster_id)
S-->>API: ThresholdResult
API-->>A: {"threshold": 0.87, "quorum_met": true}
Ports and Protocols
| Port | Protocol | Component | Direction | Purpose |
|---|---|---|---|---|
| 4317 | gRPC | OTLP Receiver | Agent -> Fleet | Health score push (binary protobuf) |
| 4318 | HTTP | OTLP Receiver | Agent -> Fleet | Health score push (JSON). Threshold returned in response headers. |
| 8080 | HTTP | Ingero Extension | Agent -> Fleet | GET /api/v1/threshold fallback endpoint |
| 8081 | HTTP | Ingero Extension | Admin -> Fleet | Diagnostics endpoint (loopback only) |
| 8088 | HTTP | Healthee Extension | K8s / EE -> Fleet | /healthz/ee open probe + bearer-authed /internal/ee/state rich body |
| 55679 | HTTP | zPages | Internal | Health/readiness probes |
| 8888 | HTTP | OTEL Telemetry | Prometheus -> Fleet | Fleet self-monitoring metrics |
Data Sent per Push
Agent -> Fleet (OTLP metric payload):
Resource attributes:
ingero.node.id: "gpu-node-01"
ingero.cluster.id: "cluster-prod"
Gauge: ingero.node.health_score = 0.92
ingero.node.state: "active"
ingero.workload_type: "training"
HTTP header:
ingero.cluster.id: cluster-prod (for middleware routing)
Fleet -> Agent (push response headers):
X-Ingero-Threshold: 0.870348
X-Ingero-Quorum-Met: true
Fleet -> Agent (GET fallback response):
{"threshold":0.870348,"quorum_met":true}
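The piggyback path above can be sketched from the agent's side. This is a minimal illustration, not the agent's actual code: the header names come from this README, while the function names and the choice to treat a missing X-Ingero-Quorum-Met header as false are assumptions.

```python
from typing import Mapping, Optional, Tuple


def parse_threshold_headers(headers: Mapping[str, str]) -> Optional[Tuple[float, bool]]:
    """Return (threshold, quorum_met), or None when no usable threshold was piggybacked."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-ingero-threshold")
    if raw is None:
        return None
    try:
        threshold = float(raw)
    except ValueError:
        return None
    # Assumption: an absent quorum header is treated as "quorum not met".
    quorum_met = lowered.get("x-ingero-quorum-met", "false").strip().lower() == "true"
    return threshold, quorum_met


def is_straggler(score: float, threshold: float) -> bool:
    """A node is a straggler when its score falls below the fleet threshold."""
    return score < threshold
```

When the parser returns None, the agent would fall back to the GET /api/v1/threshold endpoint, whose JSON body carries the same two fields.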
Fleet is built as a custom OpenTelemetry Collector distribution. Five custom components (three processors, two extensions); everything else is standard OTEL:
| Component | Type | What it does |
|---|---|---|
| Ingero Processor | OTEL processor | Accumulates health scores, computes MAD threshold with EMA smoothing |
| NCCL Processor | OTEL processor | Per-rank skew + slow-rank attribution from libnccl uprobe events |
| Provider-Lookup Processor | OTEL processor | Enriches metrics with ingero.provider for cost-attribution dashboards |
| Ingero Extension | OTEL extension | Threshold API for agent polling and diagnostics |
| Healthee Extension | OTEL extension | /healthz/ee open probe + bearer-authed /internal/ee/state rich body |
| Everything else | Standard OTEL | OTLP receiver, exporters, TLS, auth, batching |
Key Properties
- Stateless. No database, no disk. Health scores and threshold live in memory. Restart rebuilds state from incoming pushes in ~10 seconds.
- Fail-open. If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.
- Outbound-only. Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. Zero firewall changes for enterprise GPU clusters with restricted inbound access.
- Composable. Run Fleet as a drop-in distribution, OR add the Ingero processors and extensions to your existing OTEL Collector via OCB and skip the new operational footprint entirely.
- Tiny. ~50MB RAM, negligible CPU for typical clusters.
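The fail-open behavior can be pictured as a tier selection on the agent. The sketch below is an assumption-laden illustration: the tier names mirror the ingero_agent_detection_mode values listed under Observability, the 300-second grace mirrors the "~5 min" figure in the High Availability section, and the function and parameter names are hypothetical. Quorum handling is left out for brevity.

```python
from typing import Optional

# Assumption: the cached threshold stays usable for roughly five minutes.
CACHE_GRACE_SECONDS = 300.0


def select_detection_mode(
    fleet_reachable: bool,
    cached_age_seconds: Optional[float],
    has_local_baseline: bool,
) -> str:
    """Pick the detection tier without ever blocking the workload."""
    if fleet_reachable:
        return "fleet"
    # Fleet is down: keep using the last threshold while it is fresh enough.
    if cached_age_seconds is not None and cached_age_seconds <= CACHE_GRACE_SECONDS:
        return "cached"
    # Cache expired or never populated: fall back to the node's own baseline.
    if has_local_baseline:
        return "local"
    return "none"
```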
How It Works
Health Score
Each agent computes a health score (0.0 - 1.0) from four signals:
| Signal | Weight | What it measures |
|---|---|---|
| CUDA throughput | 0.40 | CUDA operations/sec relative to baseline |
| Compute efficiency | 0.25 | Kernel launch rate relative to baseline |
| Memory headroom | 0.20 | Available VRAM fraction |
| CPU availability | 0.15 | Inverse of scheduler contention |
The throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.
All four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches "close to zero" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.
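As a rough sketch of the weighted sum with a per-signal hard floor: the weights below are the ones in the table, but the floor value (0.05), the signal key names, and the rule that a floored signal caps the score are assumptions made for illustration, since this README does not specify them.

```python
from typing import Mapping

WEIGHTS = {
    "cuda_throughput": 0.40,
    "compute_efficiency": 0.25,
    "memory_headroom": 0.20,
    "cpu_availability": 0.15,
}

# Hypothetical floor: a signal below this counts as "close to zero".
HARD_FLOOR = 0.05


def health_score(signals: Mapping[str, float]) -> float:
    """Combine four normalized signals into a score in [0.0, 1.0]."""
    # Clamp each signal into [0.0, 1.0] before weighting.
    clamped = {name: min(1.0, max(0.0, signals[name])) for name in WEIGHTS}
    weighted = sum(WEIGHTS[name] * clamped[name] for name in WEIGHTS)
    # Assumed floor rule: a near-zero signal caps the score at that signal's
    # value, so three healthy signals cannot mask one deep stall.
    floored = [value for value in clamped.values() if value < HARD_FLOOR]
    if floored:
        return min(weighted, min(floored))
    return weighted
```

Without the floor, a node with 1% memory headroom and three perfect signals would still score about 0.80 and look healthy.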
Threshold
Fleet computes the straggler threshold using Median Absolute Deviation (MAD):
threshold = median(scores) - k * MAD * 1.4826
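MAD resists outliers (50% breakdown point vs 0% for mean/stddev). As long as fewer than half of the nodes are degraded, a single straggler, or even several, cannot drag the threshold arbitrarily far. The k parameter (default 2.0) controls sensitivity: a larger k lowers the threshold and flags fewer nodes.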
The threshold is delivered to agents via the OTLP push response headers, eliminating a separate polling round-trip.
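A minimal sketch of the formula above, assuming the EMA weights the newest value by alpha (the README gives ema_alpha: 0.2 but not the exact recurrence). Function names are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence

# Scale factor that makes MAD comparable to a standard deviation
# for normally distributed data.
MAD_TO_SIGMA = 1.4826


def mad_threshold(scores: Sequence[float], k: float = 2.0) -> float:
    """threshold = median(scores) - k * MAD * 1.4826"""
    if not scores:
        raise ValueError("need at least one score")
    med = median(scores)
    mad = median([abs(score - med) for score in scores])
    return med - k * mad * MAD_TO_SIGMA


def ema(previous: Optional[float], current: float, alpha: float = 0.2) -> float:
    """Smooth the threshold; assumed form: alpha * new + (1 - alpha) * old."""
    if previous is None:
        return current
    return alpha * current + (1.0 - alpha) * previous
```

With the four scores from the overview diagram (0.92, 0.91, 0.58, 0.93) the median is 0.915 and the MAD is 0.01, so the raw threshold stays far above node 03. Replacing 0.58 with 0.10 leaves the result unchanged, which is the outlier resistance described above.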
Straggler Classification
if my_score < threshold:
I am a straggler
That's it. The agent emits a straggler event via OTLP and, when run with the --remediate flag, through the remediation protocol.
High Availability
Single replica (recommended for most clusters)
replicaCount: 1 is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.
If a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.
Multi-replica HA (when you need it)
Each Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.
For multi-replica deployments, put an L7 load balancer with consistent-hash on the cluster_id query parameter (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.
Size statistical_min for the per-replica visible node count, not the cluster-wide count. Alert on sum_over_replicas(ingero_fleet_active_nodes) < expected_total_nodes for replica starvation.
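To illustrate why hashing on the cluster ID removes cross-replica drift, here is a small sketch using rendezvous (highest-random-weight) hashing. It is a stand-in for illustration only: Envoy, nginx, and service meshes ship their own consistent-hash implementations, and the replica names below are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_replica(cluster_id: str, replicas: Sequence[str]) -> str:
    """Map a cluster to one replica; every agent in the cluster gets the same answer."""
    if not replicas:
        raise ValueError("no replicas available")

    def weight(replica: str) -> int:
        # Hash the (replica, cluster) pair; the highest weight wins.
        digest = hashlib.sha256(f"{replica}|{cluster_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=weight)
```

Because the choice depends only on the cluster ID and the replica set, removing a replica that was not chosen does not move the cluster, so its score map stays on one replica.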
Larger-cluster topologies (gateway-based shared state) are out of scope. Talk to us if you're approaching the per-replica vertical-scale ceiling.
See docs/architecture_fleet.md for the full behavior model and rationale.
Fleet Configuration
# fleet-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
ingero:
threshold:
k: 2.0 # MAD sensitivity (default: 2.0)
ema_alpha: 0.2 # Threshold smoothing (default: 0.2)
quorum:
statistical_min: 5 # Min active nodes for valid threshold
coverage_fraction: 0.80 # Coverage alert threshold
push_interval: 10s # Expected agent push interval
ttl_multiplier: 5 # Node expiry = push_interval * ttl_multiplier
extensions:
ingero_threshold:
agent_endpoint: 0.0.0.0:8080 # Agent threshold poll (fallback)
admin_endpoint: 127.0.0.1:8081 # Diagnostics (management plane only)
exporters:
prometheus:
endpoint: 0.0.0.0:9090
service:
extensions: [ingero_threshold]
pipelines:
metrics:
receivers: [otlp]
processors: [ingero]
exporters: [prometheus]
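The quorum and expiry settings interact roughly as sketched below. The field names and defaults come from the config above; the exact semantics (in particular, that coverage is active nodes divided by an expected node count) are assumptions, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QuorumConfig:
    statistical_min: int = 5
    coverage_fraction: float = 0.80
    push_interval_s: float = 10.0
    ttl_multiplier: int = 5


def node_ttl_seconds(cfg: QuorumConfig) -> float:
    """Node expiry = push_interval * ttl_multiplier (50 s with the defaults)."""
    return cfg.push_interval_s * cfg.ttl_multiplier


def active_nodes(last_seen: Dict[str, float], now: float, cfg: QuorumConfig) -> List[str]:
    """Nodes whose last push is still within the TTL."""
    ttl = node_ttl_seconds(cfg)
    return sorted(node for node, ts in last_seen.items() if now - ts <= ttl)


def quorum_status(active: int, expected: int, cfg: QuorumConfig) -> Tuple[bool, bool]:
    """Return (quorum_met, coverage_low) under the assumed semantics."""
    quorum_met = active >= cfg.statistical_min
    coverage_low = expected > 0 and active / expected < cfg.coverage_fraction
    return quorum_met, coverage_low
```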
Observability
Fleet emits its own metrics:
| Metric | Type | Description |
|---|---|---|
| ingero_fleet_threshold | Gauge | Current straggler threshold per cluster |
| ingero_fleet_active_nodes | Gauge | Nodes actively reporting |
| ingero_fleet_idle_nodes | Gauge | Nodes in idle state |
| ingero_fleet_coverage_low | Gauge | 1 if coverage quorum not met |
| ingero_fleet_panic_mode | Gauge | 1 if panic mode active |
| ingero_fleet_median | Gauge | Fleet health score median |
| ingero_fleet_mad | Gauge | Fleet MAD value |
Agent-side metrics:
| Metric | Type | Description |
|---|---|---|
| ingero_agent_health_score | Gauge | This node's health score |
| ingero_agent_detection_mode | Gauge | Current detection tier (fleet/cached/local/none) |
| ingero_agent_fleet_reachable | Gauge | 1 if Fleet is reachable |
Security
Fleet runs on the standard OpenTelemetry Collector security surface: OTLP receivers support TLS, mTLS, and the collector auth extensions. The pieces specific to this distribution:
- Echo bearer auth on every transport. Echo's OTLP ingest, MCP server, and HTTP+JSON API all require a bearer token, compared in constant time. Tokens can be scoped to specific cluster_id values, so a multi-tenant deployment shows each tenant only its own data.
- TLS required by default. Echo refuses to start without a TLS keypair unless an explicit insecure flag is set for local trials. The healthee extension serves its routes over native TLS.
- Zero-restart bearer rotation. SIGHUP re-reads the token file; the previous token stays valid for a configurable grace window, so rotation never drops a request.
- Audit logging. Every authenticated Echo request is logged with the SHA-256 hash of the bearer (never the raw token), source IP, method, path, status, and latency.
- Encryption-at-rest gate. The Echo Helm chart fails helm install closed unless its DuckDB volume is on an encrypted StorageClass.
- Signed releases. Each release is cosign-signed (keyless OIDC) with a CycloneDX SBOM per archive; see Quick Start for the verification commands.
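The constant-time comparison and hashed-bearer audit log can be sketched with the Python standard library. This illustrates the two techniques only and is not Echo's implementation: the function names and the JSON field names are hypothetical.

```python
import hashlib
import hmac
import json


def token_matches(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking match length via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def audit_line(token: str, source_ip: str, method: str, path: str,
               status: int, latency_ms: float) -> str:
    """Build one audit record; only the SHA-256 of the bearer is stored."""
    record = {
        "bearer_sha256": hashlib.sha256(token.encode("utf-8")).hexdigest(),
        "source_ip": source_ip,
        "method": method,
        "path": path,
        "status": status,
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)
```

Hashing the bearer lets operators correlate requests made with the same token without the log ever containing a usable credential.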
Documentation
- SemVer policy - how v1.x versioning works + current scope boundaries
- API versioning - /api/v2 contract, version negotiation, removal policy
- Operator guide - deploy, upgrade, backup, multi-tenant onboarding, TLS, troubleshooting
- Architecture - components, data flow, threshold computation
- Deployment Guide - K8s, Slurm, bare metal
- Configuration Reference - all config parameters
- API Reference - threshold endpoints
- Integrations - add Fleet to vanilla OTel Collector, Grafana Alloy, Datadog, New Relic, Groundcover, Splunk, AWS ADOT
- End-to-end walkthrough on Lambda Cloud - A100 + GH200 (arm64) reference deploy
- Ingero Echo - companion cluster-wide event store + MCP query layer that Fleet forwards to
Requirements
- Ingero agent on each GPU node
- Go 1.22+ (for building from source)
- Kubernetes 1.24+ (for Helm deployment)
Ingero Echo
Per-node agents emit signal; operators investigate clusters. Ingero Echo is the companion service that gives the fleet a queryable place to land: a StatefulSet that ingests OTLP from Fleet, persists to embedded DuckDB, and exposes two surfaces over one bearer-authed listener.
- MCP server for AI agents. Query tools cover cluster summaries, outlier and straggler ranking, anomaly streams, NCCL and memcpy bandwidth rollups, memory-fragmentation hot spots, and cost. The /investigate prompt walks an LLM through the cluster-level WHERE, then hands off to the per-node agent for the WHY.
- HTTP+JSON API for clients that prefer curl: dashboards, CI scripts, Grafana datasources. URL-versioned at /api/v<N>/...; OpenAPI 3.1 described. See docs/api-versioning.md for the contract.
| Path | Auth | Purpose |
|---|---|---|
| GET /api/versions | none | capability negotiation |
| GET /api/v2/health | none | liveness probe |
| GET /api/v2/whoami | bearer | bearer identity introspection |
| GET /api/v2/tools/list | bearer | MCP tool catalog with input + output JSON schemas |
| POST /api/v2/tools/<name> | bearer | invoke a tool with server-side schema validation |
| POST /api/v2/sql | bearer (non-tenant-scoped) | read-only SQL against the event store |
| GET /api/v2/openapi.json | bearer | OpenAPI 3.1 spec |
| GET /metrics | bearer | Prometheus exposition (handler / status_class / api_version labels only) |
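For script clients, a request against the routes above might be assembled as follows. This sketch assumes the bearer token travels in a standard Authorization: Bearer header; the base URL and any tool payload are hypothetical, so check the OpenAPI spec for the real request shapes.

```python
import json
import urllib.request
from typing import Optional


def build_echo_request(base_url: str, path: str, token: str,
                       payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET (no payload) or JSON POST (with payload) request for Echo."""
    url = base_url.rstrip("/") + path
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=data,
                                     method="GET" if data is None else "POST")
    # Assumption: standard bearer scheme in the Authorization header.
    request.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request
```

Passing the result to urllib.request.urlopen sends it; for example, build_echo_request("https://echo.example.com", "/api/v2/tools/list", token) targets the tool catalog.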
License
Apache License 2.0. See LICENSE.