Ingero Fleet

May 22, 2026 · View on GitHub

Version: 1.0.1

Cluster-side OpenTelemetry Collector distribution for GPU clusters.

Fleet receives OTLP from every Ingero agent in your fleet, runs three GPU-specific computations on the stream (MAD-based straggler threshold, NCCL collective skew, provider cost attribution), and forwards everything to your existing observability stack. One binary covers what would otherwise be three separate hops.

Role	Component	Headline output
Straggler detection	ingero processor + extension	per-cluster threshold; pushed back in OTLP response headers so agents self-classify in real time
NCCL collective health	nccl processor	per-rank skew + slow-rank attribution from `libnccl` uprobes
Cost attribution	provider-lookup processor	enriches every metric with `ingero.provider` so cost dashboards just work
EE / health surface	healthee extension	`/healthz/ee` open probe + bearer-authed `/internal/ee/state` rich body
OTEL backbone	standard collector	OTLP receivers, TLS/mTLS, auth, batching, retry, backends

Companion service: Ingero Echo, the cluster-wide event store and MCP query layer that Fleet forwards to. HTTP+JSON surface at /api/v2/* (bearer-authed; OpenAPI-described; per-tool input + output JSON schemas) plus a bearer-authed /metrics Prometheus endpoint with strict label cardinality. See docs/api-versioning.md.

No agent needs inbound network access; everything is outbound push and pull. Threshold is delivered to agents in the OTLP push response, so the straggler classification path needs no extra polling.

Quick Start

Looking for a worked end-to-end example? These multi-node quickstart guides take you from zero to a detected straggler on three GPU hosts in about 20 minutes. Pick the deployment style that matches your environment:

Kubernetes (Helm)

Bare-metal binary

Docker

See docs/quickstart_fleet.md for a one-page comparison if you are not sure which to pick.

Option A: Use the pre-built Fleet distribution

# Docker (multi-arch manifest covers amd64 + arm64)
docker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:1.0.0

# Binary
# ingero-version:install-curl-version product=ingero-fleet channel=stable
VERSION=1.0.1
curl -fsSL "https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz" | tar xz
./ingero-fleet --config fleet-config.yaml

Verify the release (cosign keyless OIDC; signed in GitHub Actions):

curl -fsSLO https://github.com/ingero-io/ingero-fleet/releases/download/v1.0.0/checksums.txt
curl -fsSLO https://github.com/ingero-io/ingero-fleet/releases/download/v1.0.0/checksums.txt.sig
cosign verify-blob checksums.txt --signature checksums.txt.sig \
  --certificate-identity-regexp '^https://github.com/ingero-io/ingero-fleet/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
sha256sum -c checksums.txt --ignore-missing

CycloneDX SBOMs ship alongside each archive (hash anchored in checksums.txt). See docs/operator/deploy.md for the full flow.

Option B: Kubernetes (Helm)

From the published chart repo:

helm repo add ingero https://ingero-io.github.io/ingero-fleet
helm repo update
helm install ingero-fleet ingero/ingero-fleet

Or from a checkout of this repo:

helm install ingero-fleet ./helm/ingero-fleet

Companion Ingero Echo (datastore + MCP server) install (one chart, one Secret for the bearer token):

kubectl create secret generic ingero-echo-auth \
  --from-literal=token=$(openssl rand -hex 32) -n ingero
helm install ingero-echo ./helm/ingero-echo \
  --set persistence.storageClass=<encrypted-storageclass> \
  --set 'extraEnv[0].name=AUTH_TOKEN' \
  --set 'extraEnv[0].valueFrom.secretKeyRef.name=ingero-echo-auth' \
  --set 'extraEnv[0].valueFrom.secretKeyRef.key=token'

The Echo chart's PVC encryption-at-rest gate fails helm install closed if neither persistence.storageClass nor persistence.encryptionAcknowledged=true is set. See docs/operator/deploy.md for cloud- specific encrypted-StorageClass examples.

Chart default is replicaCount: 1; see High Availability section below for multi-replica guidance.

Option C: Add Ingero to your existing OTEL Collector

Advanced path: drop the Ingero Go modules into your existing OCB manifest and rebuild your collector instead of running ours.

# builder-config.yaml
processors:
  # ingero-version:builder-gomod-processor product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/processor v1.0.1

extensions:
  # ingero-version:builder-gomod-extension product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/extension v1.0.1

ocb --config builder-config.yaml

Configure the agent

Point your Ingero agent at Fleet:

# ingero.yaml (on each GPU node)
fleet:
  endpoint: https://fleet.example.com:4317

Architecture

The detailed Fleet-Agent component view is below.

Fleet-Agent Overview

graph TB
    subgraph GPU Cluster
        A1[Ingero Agent<br/>gpu-node-01<br/>score: 0.92]
        A2[Ingero Agent<br/>gpu-node-02<br/>score: 0.91]
        A3[Ingero Agent<br/>gpu-node-03<br/>score: 0.58]
        A4[Ingero Agent<br/>gpu-node-04<br/>score: 0.93]
    end

    subgraph Fleet Service
        R[OTLP Receiver<br/>gRPC :4317 / HTTP :4318]
        P[Ingero Processor<br/>Score Map + MAD + EMA]
        E[Ingero Extension<br/>Threshold API :8080<br/>Middleware Piggyback]
        EX[Prometheus Exporter]
    end

    subgraph Observability
        PR[Prometheus]
        GR[Grafana]
    end

    A1 -->|OTLP push| R
    A2 -->|OTLP push| R
    A3 -->|OTLP push| R
    A4 -->|OTLP push| R

    R --> P
    P -->|Set threshold| E
    P --> EX

    EX --> PR
    PR --> GR

    E -.->|threshold in<br/>push response| A1
    E -.->|threshold in<br/>push response| A2
    E -.->|threshold in<br/>push response| A3
    E -.->|threshold in<br/>push response| A4

    style A3 fill:#f66,stroke:#333,color:#fff

Node 03 (score 0.58) is below the fleet threshold (0.87) - detected as straggler.

Detailed Communication Flow

sequenceDiagram
    participant A as Ingero Agent<br/>(GPU Node)
    participant R as OTLP Receiver<br/>:4317 gRPC / :4318 HTTP
    participant M as Ingero Extension<br/>(Middleware)
    participant P as Ingero Processor
    participant S as ThresholdStore<br/>(in-memory)
    participant T as Timer<br/>(every 10s)
    participant API as Threshold API<br/>:8080

    Note over A: Computes health score<br/>from 4 GPU signals

    A->>R: OTLP push (HTTP :4318)<br/>metric: ingero.node.health_score = 0.92<br/>attrs: node.id, cluster.id, state<br/>header: ingero.cluster.id = cluster-prod

    R->>M: HTTP request passes through middleware
    M->>R: Injects response headers:<br/>X-Ingero-Threshold: 0.87<br/>X-Ingero-Quorum-Met: true

    R->>P: ConsumeMetrics(OTLP payload)
    P->>P: Extract health_score from payload<br/>Write to score map[cluster:node]

    R-->>A: HTTP 200 + threshold headers

    Note over A: Reads X-Ingero-Threshold<br/>0.92 > 0.87 = healthy

    T->>P: Timer tick (every push_interval)
    P->>P: Read score map (RLock)<br/>Compute MAD per cluster<br/>Apply EMA smoothing<br/>Check quorum, panic mode
    P->>S: Set(cluster_id, ThresholdResult)

    Note over M: Next push reads<br/>updated threshold from store

    A->>API: GET /api/v1/threshold?cluster_id=cluster-prod<br/>(fallback if no piggyback)
    API->>S: Get(cluster_id)
    S-->>API: ThresholdResult
    API-->>A: {"threshold": 0.87, "quorum_met": true}

Ports and Protocols

Port	Protocol	Component	Direction	Purpose
4317	gRPC	OTLP Receiver	Agent -> Fleet	Health score push (binary protobuf)
4318	HTTP	OTLP Receiver	Agent -> Fleet	Health score push (JSON). Threshold returned in response headers.
8080	HTTP	Ingero Extension	Agent -> Fleet	`GET /api/v1/threshold` fallback endpoint
8081	HTTP	Ingero Extension	Admin -> Fleet	Diagnostics endpoint (loopback only)
8088	HTTP	Healthee Extension	K8s / EE -> Fleet	`/healthz/ee` open probe + bearer-authed `/internal/ee/state` rich body
55679	HTTP	zPages	Internal	Health/readiness probes
8888	HTTP	OTEL Telemetry	Prometheus -> Fleet	Fleet self-monitoring metrics

Data Sent per Push

Agent -> Fleet (OTLP metric payload):

Resource attributes:
  ingero.node.id:      "gpu-node-01"
  ingero.cluster.id:   "cluster-prod"

Gauge: ingero.node.health_score = 0.92
  ingero.node.state:      "active"
  ingero.workload_type:   "training"

HTTP header:
  ingero.cluster.id: cluster-prod    (for middleware routing)

Fleet -> Agent (push response headers):

X-Ingero-Threshold:  0.870348
X-Ingero-Quorum-Met: true

Fleet -> Agent (GET fallback response):

{"threshold":0.870348,"quorum_met":true}

Fleet is built as a custom OpenTelemetry Collector distribution. Two custom components, everything else is standard OTEL:

Component	Type	What it does
Ingero Processor	OTEL processor	Accumulates health scores, computes MAD threshold with EMA smoothing
NCCL Processor	OTEL processor	Per-rank skew + slow-rank attribution from `libnccl` uprobe events
Provider-Lookup Processor	OTEL processor	Enriches metrics with `ingero.provider` for cost-attribution dashboards
Ingero Extension	OTEL extension	Threshold API for agent polling and diagnostics
Healthee Extension	OTEL extension	`/healthz/ee` open probe + bearer-authed `/internal/ee/state` rich body
Everything else	Standard OTEL	OTLP receiver, exporters, TLS, auth, batching

Key Properties

Stateless. No database, no disk. Health scores and threshold live in memory. Restart rebuilds state from incoming pushes in ~10 seconds.
Fail-open. If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.
Outbound-only. Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. Zero firewall changes for enterprise GPU clusters with restricted inbound access.
Composable. Run Fleet as a drop-in distribution, OR add the Ingero processors and extensions to your existing OTEL Collector via OCB and skip the new operational footprint entirely.
Tiny. ~50MB RAM, negligible CPU for typical clusters.

How It Works

Health Score

Each agent computes a health score (0.0 - 1.0) from four signals:

Signal	Weight	What it measures
CUDA throughput	0.40	CUDA operations/sec relative to baseline
Compute efficiency	0.25	Kernel launch rate relative to baseline
Memory headroom	0.20	Available VRAM fraction
CPU availability	0.15	Inverse of scheduler contention

The throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.

All four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches "close to zero" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.

Threshold

Fleet computes the straggler threshold using Median Absolute Deviation (MAD):

threshold = median(scores) - k * MAD * 1.4826

MAD resists outliers (50% breakdown point vs 0% for mean/stddev). A single straggler - or even several - cannot shift the threshold. The k parameter (default 2.0) controls sensitivity.

The threshold is delivered to agents via the OTLP push response headers, eliminating a separate polling round-trip.

Straggler Classification

if my_score < threshold:
    I am a straggler

That's it. The agent emits a straggler event via OTLP and the remediation protocol (--remediate flag).

High Availability

Single replica (recommended for most clusters)

replicaCount: 1 is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.

If a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.

Multi-replica HA (when you need it)

Each Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.

For multi-replica deployments, put an L7 load balancer with consistent-hash on the cluster_id query parameter (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.

Size statistical_min for the per-replica visible node count, not the cluster-wide count. Alert on sum_over_replicas(ingero_fleet_active_nodes) < expected_total_nodes for replica starvation.

Larger-cluster topologies (gateway-based shared state) are out of scope. Talk to us if you're approaching the per-replica vertical-scale ceiling.

See docs/architecture_fleet.md for the full behavior model and rationale.

Fleet Configuration

# fleet-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  ingero:
    threshold:
      k: 2.0                    # MAD sensitivity (default: 2.0)
      ema_alpha: 0.2             # Threshold smoothing (default: 0.2)
    quorum:
      statistical_min: 5         # Min active nodes for valid threshold
      coverage_fraction: 0.80    # Coverage alert threshold
    push_interval: 10s           # Expected agent push interval
    ttl_multiplier: 5            # Node expiry = push_interval * ttl_multiplier

extensions:
  ingero_threshold:
    agent_endpoint: 0.0.0.0:8080       # Agent threshold poll (fallback)
    admin_endpoint: 127.0.0.1:8081     # Diagnostics (management plane only)

exporters:
  prometheus:
    endpoint: 0.0.0.0:9090

service:
  extensions: [ingero_threshold]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [ingero]
      exporters: [prometheus]

Observability

Fleet emits its own metrics:

Metric	Type	Description
`ingero_fleet_threshold`	Gauge	Current straggler threshold per cluster
`ingero_fleet_active_nodes`	Gauge	Nodes actively reporting
`ingero_fleet_idle_nodes`	Gauge	Nodes in idle state
`ingero_fleet_coverage_low`	Gauge	1 if coverage quorum not met
`ingero_fleet_panic_mode`	Gauge	1 if panic mode active
`ingero_fleet_median`	Gauge	Fleet health score median
`ingero_fleet_mad`	Gauge	Fleet MAD value

Agent-side metrics:

Metric	Type	Description
`ingero_agent_health_score`	Gauge	This node's health score
`ingero_agent_detection_mode`	Gauge	Current detection tier (fleet/cached/local/none)
`ingero_agent_fleet_reachable`	Gauge	1 if Fleet is reachable

Security

Fleet runs on the standard OpenTelemetry Collector security surface: OTLP receivers support TLS, mTLS, and the collector auth extensions. The pieces specific to this distribution:

Echo bearer auth on every transport. Echo's OTLP ingest, MCP server, and HTTP+JSON API all require a bearer token, compared in constant time. Tokens can be scoped to specific cluster_id values, so a multi-tenant deployment shows each tenant only its own data.
TLS required by default. Echo refuses to start without a TLS keypair unless an explicit insecure flag is set for local trials. The healthee extension serves its routes over native TLS.
Zero-restart bearer rotation. SIGHUP re-reads the token file; the previous token stays valid for a configurable grace window, so rotation never drops a request.
Audit logging. Every authenticated Echo request is logged with the SHA-256 hash of the bearer (never the raw token), source IP, method, path, status, and latency.
Encryption-at-rest gate. The Echo Helm chart fails helm install closed unless its DuckDB volume is on an encrypted StorageClass.
Signed releases. Each release is cosign-signed (keyless OIDC) with a CycloneDX SBOM per archive; see Quick Start for the verification commands.

Documentation

SemVer policy - how v1.x versioning works + current scope boundaries
API versioning - /api/v2 contract, version negotiation, removal policy
Operator guide - deploy, upgrade, backup, multi-tenant onboarding, TLS, troubleshooting
Architecture - components, data flow, threshold computation
Deployment Guide - K8s, Slurm, bare metal
Configuration Reference - all config parameters
API Reference - threshold endpoints
Integrations - add Fleet to vanilla OTel Collector, Grafana Alloy, Datadog, New Relic, Groundcover, Splunk, AWS ADOT
End-to-end walkthrough on Lambda Cloud - A100 + GH200 (arm64) reference deploy
Ingero Echo - companion cluster-wide event store + MCP query layer that Fleet forwards to

Requirements

Ingero agent on each GPU node
Go 1.22+ (for building from source)
Kubernetes 1.24+ (for Helm deployment)

Ingero Echo

Per-node agents emit signal; operators investigate clusters. Ingero Echo is the companion service that gives the fleet a queryable place to land: a StatefulSet that ingests OTLP from Fleet, persists to embedded DuckDB, and exposes two surfaces over one bearer-authed listener.

MCP server for AI agents. Query tools cover cluster summaries, outlier and straggler ranking, anomaly streams, NCCL and memcpy bandwidth rollups, memory-fragmentation hot spots, and cost. The /investigate prompt walks an LLM through the cluster-level WHERE, then hands off to the per-node agent for the WHY.
HTTP+JSON API for clients that prefer curl: dashboards, CI scripts, Grafana datasources. URL-versioned at /api/v<N>/...; OpenAPI 3.1 described. See docs/api-versioning.md for the contract.

Path	Auth	Purpose
`GET /api/versions`	none	capability negotiation
`GET /api/v2/health`	none	liveness probe
`GET /api/v2/whoami`	bearer	bearer identity introspection
`GET /api/v2/tools/list`	bearer	MCP tool catalog with input + output JSON schemas
`POST /api/v2/tools/<name>`	bearer	invoke a tool with server-side schema validation
`POST /api/v2/sql`	bearer (non-tenant-scoped)	read-only SQL against the event store
`GET /api/v2/openapi.json`	bearer	OpenAPI 3.1 spec
`GET /metrics`	bearer	Prometheus exposition (handler / status_class / api_version labels only)

License

Apache License 2.0. See LICENSE.