Ingero Fleet

May 22, 2026


Version: 1.0.1

Cluster-side OpenTelemetry Collector distribution for GPU clusters.

Fleet receives OTLP from every Ingero agent in your fleet, runs three GPU-specific computations on the stream (MAD-based straggler threshold, NCCL collective skew, provider cost attribution), and forwards everything to your existing observability stack. One binary covers what would otherwise be three separate hops.

| Role | Component | Headline output |
| --- | --- | --- |
| Straggler detection | ingero processor + extension | per-cluster threshold; pushed back in OTLP response headers so agents self-classify in real time |
| NCCL collective health | nccl processor | per-rank skew + slow-rank attribution from libnccl uprobes |
| Cost attribution | provider-lookup processor | enriches every metric with ingero.provider so cost dashboards just work |
| EE / health surface | healthee extension | /healthz/ee open probe + bearer-authed /internal/ee/state rich body |
| OTEL backbone | standard collector | OTLP receivers, TLS/mTLS, auth, batching, retry, backends |

Companion service: Ingero Echo, the cluster-wide event store and MCP query layer that Fleet forwards to. HTTP+JSON surface at /api/v2/* (bearer-authed; OpenAPI-described; per-tool input + output JSON schemas) plus a bearer-authed /metrics Prometheus endpoint with strict label cardinality. See docs/api-versioning.md.

No agent needs inbound network access; every connection, whether push or poll, is opened outbound by the agent. The threshold is delivered to agents in the OTLP push response, so the straggler classification path needs no extra polling.

Quick Start

Looking for a worked end-to-end example? These multi-node quickstart guides take you from zero to a detected straggler on three GPU hosts in about 20 minutes. Pick the deployment style that matches your environment:

See docs/quickstart_fleet.md for a one-page comparison if you are not sure which to pick.

Option A: Use the pre-built Fleet distribution

# Docker (multi-arch manifest covers amd64 + arm64)
docker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:1.0.1

# Binary
# ingero-version:install-curl-version product=ingero-fleet channel=stable
VERSION=1.0.1
curl -fsSL "https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz" | tar xz
./ingero-fleet --config fleet-config.yaml

Verify the release (cosign keyless OIDC; signed in GitHub Actions):

curl -fsSLO https://github.com/ingero-io/ingero-fleet/releases/download/v1.0.1/checksums.txt
curl -fsSLO https://github.com/ingero-io/ingero-fleet/releases/download/v1.0.1/checksums.txt.sig
cosign verify-blob checksums.txt --signature checksums.txt.sig \
  --certificate-identity-regexp '^https://github.com/ingero-io/ingero-fleet/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
sha256sum -c checksums.txt --ignore-missing

CycloneDX SBOMs ship alongside each archive (hash anchored in checksums.txt). See docs/operator/deploy.md for the full flow.

Option B: Kubernetes (Helm)

From the published chart repo:

helm repo add ingero https://ingero-io.github.io/ingero-fleet
helm repo update
helm install ingero-fleet ingero/ingero-fleet

Or from a checkout of this repo:

helm install ingero-fleet ./helm/ingero-fleet

To install the companion Ingero Echo (datastore + MCP server), use one chart and one Secret for the bearer token:

# The Secret and the release must live in the same namespace for secretKeyRef to resolve.
kubectl create secret generic ingero-echo-auth \
  --from-literal=token="$(openssl rand -hex 32)" -n ingero
helm install ingero-echo ./helm/ingero-echo -n ingero \
  --set persistence.storageClass=<encrypted-storageclass> \
  --set 'extraEnv[0].name=AUTH_TOKEN' \
  --set 'extraEnv[0].valueFrom.secretKeyRef.name=ingero-echo-auth' \
  --set 'extraEnv[0].valueFrom.secretKeyRef.key=token'

The Echo chart's PVC encryption-at-rest gate fails closed: helm install aborts if neither persistence.storageClass nor persistence.encryptionAcknowledged=true is set. See docs/operator/deploy.md for cloud-specific encrypted-StorageClass examples.

The chart default is replicaCount: 1; see the High Availability section below for multi-replica guidance.

Option C: Add Ingero to your existing OTEL Collector

Advanced path: drop the Ingero Go modules into your existing OCB manifest and rebuild your collector instead of running ours.

# builder-config.yaml
processors:
  # ingero-version:builder-gomod-processor product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/processor v1.0.1

extensions:
  # ingero-version:builder-gomod-extension product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/extension v1.0.1

ocb --config builder-config.yaml

Configure the agent

Point your Ingero agent at Fleet:

# ingero.yaml (on each GPU node)
fleet:
  endpoint: https://fleet.example.com:4317

Architecture

Ingero architecture: per-node agent emits OTLP to Fleet collector; Fleet aggregates via ingeroprocessor, ncclprocessor, and providerlookupprocessor; healtheeextension serves the K8s probe + EE rich body; backends are Prometheus, Grafana, MCP clients, and UDS sinks

The detailed Fleet-Agent component view is below.

Fleet-Agent Overview

graph TB
    subgraph GPU Cluster
        A1[Ingero Agent<br/>gpu-node-01<br/>score: 0.92]
        A2[Ingero Agent<br/>gpu-node-02<br/>score: 0.91]
        A3[Ingero Agent<br/>gpu-node-03<br/>score: 0.58]
        A4[Ingero Agent<br/>gpu-node-04<br/>score: 0.93]
    end

    subgraph Fleet Service
        R[OTLP Receiver<br/>gRPC :4317 / HTTP :4318]
        P[Ingero Processor<br/>Score Map + MAD + EMA]
        E[Ingero Extension<br/>Threshold API :8080<br/>Middleware Piggyback]
        EX[Prometheus Exporter]
    end

    subgraph Observability
        PR[Prometheus]
        GR[Grafana]
    end

    A1 -->|OTLP push| R
    A2 -->|OTLP push| R
    A3 -->|OTLP push| R
    A4 -->|OTLP push| R

    R --> P
    P -->|Set threshold| E
    P --> EX

    EX --> PR
    PR --> GR

    E -.->|threshold in<br/>push response| A1
    E -.->|threshold in<br/>push response| A2
    E -.->|threshold in<br/>push response| A3
    E -.->|threshold in<br/>push response| A4

    style A3 fill:#f66,stroke:#333,color:#fff

Node 03 (score 0.58) is below the fleet threshold (0.87), so it is detected as a straggler.

Detailed Communication Flow

sequenceDiagram
    participant A as Ingero Agent<br/>(GPU Node)
    participant R as OTLP Receiver<br/>:4317 gRPC / :4318 HTTP
    participant M as Ingero Extension<br/>(Middleware)
    participant P as Ingero Processor
    participant S as ThresholdStore<br/>(in-memory)
    participant T as Timer<br/>(every 10s)
    participant API as Threshold API<br/>:8080

    Note over A: Computes health score<br/>from 4 GPU signals

    A->>R: OTLP push (HTTP :4318)<br/>metric: ingero.node.health_score = 0.92<br/>attrs: node.id, cluster.id, state<br/>header: ingero.cluster.id = cluster-prod

    R->>M: HTTP request passes through middleware
    M->>R: Injects response headers:<br/>X-Ingero-Threshold: 0.87<br/>X-Ingero-Quorum-Met: true

    R->>P: ConsumeMetrics(OTLP payload)
    P->>P: Extract health_score from payload<br/>Write to score map[cluster:node]

    R-->>A: HTTP 200 + threshold headers

    Note over A: Reads X-Ingero-Threshold<br/>0.92 > 0.87 = healthy

    T->>P: Timer tick (every push_interval)
    P->>P: Read score map (RLock)<br/>Compute MAD per cluster<br/>Apply EMA smoothing<br/>Check quorum, panic mode
    P->>S: Set(cluster_id, ThresholdResult)

    Note over M: Next push reads<br/>updated threshold from store

    A->>API: GET /api/v1/threshold?cluster_id=cluster-prod<br/>(fallback if no piggyback)
    API->>S: Get(cluster_id)
    S-->>API: ThresholdResult
    API-->>A: {"threshold": 0.87, "quorum_met": true}

Ports and Protocols

| Port | Protocol | Component | Direction | Purpose |
| --- | --- | --- | --- | --- |
| 4317 | gRPC | OTLP Receiver | Agent -> Fleet | Health score push (binary protobuf) |
| 4318 | HTTP | OTLP Receiver | Agent -> Fleet | Health score push (JSON). Threshold returned in response headers. |
| 8080 | HTTP | Ingero Extension | Agent -> Fleet | GET /api/v1/threshold fallback endpoint |
| 8081 | HTTP | Ingero Extension | Admin -> Fleet | Diagnostics endpoint (loopback only) |
| 8088 | HTTP | Healthee Extension | K8s / EE -> Fleet | /healthz/ee open probe + bearer-authed /internal/ee/state rich body |
| 55679 | HTTP | zPages | Internal | Health/readiness probes |
| 8888 | HTTP | OTEL Telemetry | Prometheus -> Fleet | Fleet self-monitoring metrics |

Data Sent per Push

Agent -> Fleet (OTLP metric payload):

Resource attributes:
  ingero.node.id:      "gpu-node-01"
  ingero.cluster.id:   "cluster-prod"

Gauge: ingero.node.health_score = 0.92
  ingero.node.state:      "active"
  ingero.workload_type:   "training"

HTTP header:
  ingero.cluster.id: cluster-prod    (for middleware routing)

Fleet -> Agent (push response headers):

X-Ingero-Threshold:  0.870348
X-Ingero-Quorum-Met: true

Fleet -> Agent (GET fallback response):

{"threshold":0.870348,"quorum_met":true}
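
The two response shapes above carry the same information, so an agent can treat them interchangeably. A minimal sketch of that parsing, assuming only the header and JSON field names shown above; the function name `read_threshold` and the "headers first, JSON body second" order are illustrative, not the agent's actual code:

```python
import json
from typing import Mapping, Optional, Tuple


def read_threshold(
    headers: Mapping[str, str], body: Optional[str] = None
) -> Optional[Tuple[float, bool]]:
    """Return (threshold, quorum_met), or None if neither source carries one."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-ingero-threshold")
    if raw is not None:
        quorum = lowered.get("x-ingero-quorum-met", "false").strip().lower() == "true"
        return float(raw), quorum
    # No piggybacked headers: fall back to the GET /api/v1/threshold JSON body.
    if body:
        doc = json.loads(body)
        return float(doc["threshold"]), bool(doc["quorum_met"])
    return None


piggyback = read_threshold(
    {"X-Ingero-Threshold": "0.870348", "X-Ingero-Quorum-Met": "true"}
)
fallback = read_threshold({}, '{"threshold":0.870348,"quorum_met":true}')
```

Both calls yield the same `(threshold, quorum_met)` pair, which is what lets the classification path ignore how the threshold arrived.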

Fleet is built as a custom OpenTelemetry Collector distribution. Five custom components (three processors and two extensions); everything else is standard OTEL:

| Component | Type | What it does |
| --- | --- | --- |
| Ingero Processor | OTEL processor | Accumulates health scores, computes MAD threshold with EMA smoothing |
| NCCL Processor | OTEL processor | Per-rank skew + slow-rank attribution from libnccl uprobe events |
| Provider-Lookup Processor | OTEL processor | Enriches metrics with ingero.provider for cost-attribution dashboards |
| Ingero Extension | OTEL extension | Threshold API for agent polling and diagnostics |
| Healthee Extension | OTEL extension | /healthz/ee open probe + bearer-authed /internal/ee/state rich body |
| Everything else | Standard OTEL | OTLP receiver, exporters, TLS, auth, batching |

Key Properties

  • Stateless. No database, no disk. Health scores and threshold live in memory. Restart rebuilds state from incoming pushes in ~10 seconds.
  • Fail-open. If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.
  • Outbound-only. Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. Zero firewall changes for enterprise GPU clusters with restricted inbound access.
  • Composable. Run Fleet as a drop-in distribution, OR add the Ingero processors and extensions to your existing OTEL Collector via OCB and skip the new operational footprint entirely.
  • Tiny. ~50MB RAM, negligible CPU for typical clusters.

How It Works

Health Score

Each agent computes a health score (0.0 - 1.0) from four signals:

| Signal | Weight | What it measures |
| --- | --- | --- |
| CUDA throughput | 0.40 | CUDA operations/sec relative to baseline |
| Compute efficiency | 0.25 | Kernel launch rate relative to baseline |
| Memory headroom | 0.20 | Available VRAM fraction |
| CPU availability | 0.15 | Inverse of scheduler contention |

The throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.

All four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches "close to zero" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.
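
A minimal sketch of the weighted sum with a per-signal hard floor. The weights come from the table above. The floor value (`HARD_FLOOR`), the signal key names, and the rule "a signal under the floor caps the whole score at that signal's value" are assumptions made for illustration; the agent's actual floor logic is not documented here.

```python
WEIGHTS = {
    "cuda_throughput": 0.40,
    "compute_efficiency": 0.25,
    "memory_headroom": 0.20,
    "cpu_availability": 0.15,
}
HARD_FLOOR = 0.05  # hypothetical value, chosen only for this example


def health_score(signals):
    """Combine four normalized signals into one score in [0.0, 1.0]."""
    # Clamp each signal into [0.0, 1.0] before combining.
    clamped = {name: min(1.0, max(0.0, signals[name])) for name in WEIGHTS}
    weighted = sum(WEIGHTS[name] * clamped[name] for name in WEIGHTS)
    # Assumed floor rule: one near-zero signal caps the score at that signal,
    # so a deep stall cannot hide behind three healthy signals.
    worst = min(clamped.values())
    if worst < HARD_FLOOR:
        return min(weighted, worst)
    return weighted


healthy = health_score({
    "cuda_throughput": 0.95,
    "compute_efficiency": 0.90,
    "memory_headroom": 0.85,
    "cpu_availability": 0.95,
})
stalled = health_score({
    "cuda_throughput": 0.0,
    "compute_efficiency": 1.0,
    "memory_headroom": 1.0,
    "cpu_availability": 1.0,
})
```

Without the floor, the stalled node would still score 0.60 from its three healthy signals; with it, the score collapses to 0.0.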

Threshold

Fleet computes the straggler threshold using Median Absolute Deviation (MAD):

threshold = median(scores) - k * MAD * 1.4826

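MAD resists outliers (50% breakdown point vs 0% for mean/stddev): as long as fewer than half the nodes are stragglers, they cannot drag the threshold arbitrarily far. The factor 1.4826 scales MAD so that it is comparable to a standard deviation for normally distributed scores. The k parameter (default 2.0) controls sensitivity.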
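
The formula can be sketched in a few lines. This computes only the raw per-tick threshold from the formula above, using the four scores from the Fleet-Agent Overview diagram; EMA smoothing, quorum checks, and panic mode are deliberately left out, so the result is not expected to match the 0.87 shown there. The function name `mad_threshold` is illustrative.

```python
from statistics import median

MAD_SCALE = 1.4826  # consistency constant from the formula above


def mad_threshold(scores, k=2.0):
    """threshold = median(scores) - k * MAD * 1.4826"""
    med = median(scores)
    # MAD is the median of the absolute deviations from the median.
    mad = median([abs(score - med) for score in scores])
    return med - k * mad * MAD_SCALE


scores = {
    "gpu-node-01": 0.92,
    "gpu-node-02": 0.91,
    "gpu-node-03": 0.58,
    "gpu-node-04": 0.93,
}
threshold = mad_threshold(list(scores.values()))
stragglers = sorted(node for node, score in scores.items() if score < threshold)
```

Here the median is 0.915 and the MAD is 0.01, so the raw threshold is about 0.885 and only gpu-node-03 falls below it. The 0.58 outlier barely moves either statistic, which is the point of using MAD.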

The threshold is delivered to agents via the OTLP push response headers, eliminating a separate polling round-trip.

Straggler Classification

if my_score < threshold:
    I am a straggler

That's it. The agent emits a straggler event via OTLP and, when started with the --remediate flag, via the remediation protocol.

High Availability

replicaCount: 1 is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.

If a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.
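
The fail-open order can be sketched as a simple tier selection. The tier names mirror the values of ingero_agent_detection_mode (fleet/cached/local/none); the function name, the argument shapes, and the exact 300-second grace are assumptions for illustration, with the grace derived from the "~5 min" figure above.

```python
CACHE_GRACE_SECONDS = 300.0  # "~5 min grace"; the exact value is an assumption


def pick_threshold(fleet_threshold, cached, local_baseline, now):
    """Return (mode, threshold). `cached` is (value, fetched_at) or None."""
    # Tier 1: a fresh threshold from Fleet always wins.
    if fleet_threshold is not None:
        return "fleet", fleet_threshold
    # Tier 2: Fleet unreachable, but the last threshold is still within grace.
    if cached is not None and now - cached[1] <= CACHE_GRACE_SECONDS:
        return "cached", cached[0]
    # Tier 3: cache expired, classify against the node's own baseline.
    if local_baseline is not None:
        return "local", local_baseline
    # Nothing usable yet (for example, still warming up).
    return "none", None


live = pick_threshold(0.87, (0.86, 1000.0), 0.80, now=1010.0)
during_outage = pick_threshold(None, (0.86, 1000.0), 0.80, now=1200.0)
after_grace = pick_threshold(None, (0.86, 1000.0), 0.80, now=1400.0)
```

None of the branches block or raise, which matches the fail-open property: losing Fleet changes which threshold is used, never whether the workload runs.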

Multi-replica HA (when you need it)

Each Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.

For multi-replica deployments, put an L7 load balancer with consistent-hash on the cluster_id query parameter (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.
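
To illustrate why consistent hashing on cluster_id removes drift, here is a small hash-ring sketch. It is not Envoy or nginx configuration; in practice the load balancer's built-in consistent-hash policy does this work. The replica names, the virtual-node count, and the `HashRing` class are hypothetical.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, replicas, vnodes=64):
        # Each replica owns several points so load spreads evenly.
        self._ring = sorted(
            (_point(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def replica_for(self, cluster_id: str) -> str:
        # The first ring point clockwise from the key owns it.
        idx = bisect.bisect_right(self._points, _point(cluster_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["fleet-0", "fleet-1", "fleet-2"])
owner = ring.replica_for("cluster-prod")
```

Every push that carries `cluster-prod` resolves to the same `owner`, so one replica sees all of that cluster's scores. Adding a replica only moves the clusters that the new replica takes over; the rest keep their existing score maps.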

Size statistical_min for the per-replica visible node count, not the cluster-wide count. Alert on sum_over_replicas(ingero_fleet_active_nodes) < expected_total_nodes for replica starvation.

Larger-cluster topologies (gateway-based shared state) are out of scope. Talk to us if you're approaching the per-replica vertical-scale ceiling.

See docs/architecture_fleet.md for the full behavior model and rationale.

Fleet Configuration

# fleet-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  ingero:
    threshold:
      k: 2.0                    # MAD sensitivity (default: 2.0)
      ema_alpha: 0.2             # Threshold smoothing (default: 0.2)
    quorum:
      statistical_min: 5         # Min active nodes for valid threshold
      coverage_fraction: 0.80    # Coverage alert threshold
    push_interval: 10s           # Expected agent push interval
    ttl_multiplier: 5            # Node expiry = push_interval * ttl_multiplier

extensions:
  ingero_threshold:
    agent_endpoint: 0.0.0.0:8080       # Agent threshold poll (fallback)
    admin_endpoint: 127.0.0.1:8081     # Diagnostics (management plane only)

exporters:
  prometheus:
    endpoint: 0.0.0.0:9090

service:
  extensions: [ingero_threshold]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [ingero]
      exporters: [prometheus]
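
A short worked example of what the processor settings above imply. The node-expiry product and the statistical_min comparison follow the config comments. The EMA update is written in its standard form, which is an assumption about how ema_alpha is applied; the function names are illustrative.

```python
def node_expiry_seconds(push_interval_s: float, ttl_multiplier: int) -> float:
    # Node expiry = push_interval * ttl_multiplier (from the config comment).
    return push_interval_s * ttl_multiplier


def quorum_met(active_nodes: int, statistical_min: int) -> bool:
    # A threshold is only considered valid with enough active nodes.
    return active_nodes >= statistical_min


def ema(previous: float, observed: float, alpha: float) -> float:
    # Assumed standard EMA: new = alpha * observed + (1 - alpha) * previous
    return alpha * observed + (1.0 - alpha) * previous


expiry = node_expiry_seconds(10.0, 5)
smoothed = ema(previous=0.87, observed=0.885348, alpha=0.2)
```

With the defaults, a node that stops pushing is dropped after 50 seconds, and a raw threshold of 0.885348 moves a previous threshold of 0.87 to roughly 0.873 on that tick. A larger ema_alpha tracks the raw value faster at the cost of more jitter.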

Observability

Fleet emits its own metrics:

| Metric | Type | Description |
| --- | --- | --- |
| ingero_fleet_threshold | Gauge | Current straggler threshold per cluster |
| ingero_fleet_active_nodes | Gauge | Nodes actively reporting |
| ingero_fleet_idle_nodes | Gauge | Nodes in idle state |
| ingero_fleet_coverage_low | Gauge | 1 if coverage quorum not met |
| ingero_fleet_panic_mode | Gauge | 1 if panic mode active |
| ingero_fleet_median | Gauge | Fleet health score median |
| ingero_fleet_mad | Gauge | Fleet MAD value |

Agent-side metrics:

| Metric | Type | Description |
| --- | --- | --- |
| ingero_agent_health_score | Gauge | This node's health score |
| ingero_agent_detection_mode | Gauge | Current detection tier (fleet/cached/local/none) |
| ingero_agent_fleet_reachable | Gauge | 1 if Fleet is reachable |

Security

Fleet runs on the standard OpenTelemetry Collector security surface: OTLP receivers support TLS, mTLS, and the collector auth extensions. The pieces specific to this distribution:

  • Echo bearer auth on every transport. Echo's OTLP ingest, MCP server, and HTTP+JSON API all require a bearer token, compared in constant time. Tokens can be scoped to specific cluster_id values, so a multi-tenant deployment shows each tenant only its own data.
  • TLS required by default. Echo refuses to start without a TLS keypair unless an explicit insecure flag is set for local trials. The healthee extension serves its routes over native TLS.
  • Zero-restart bearer rotation. SIGHUP re-reads the token file; the previous token stays valid for a configurable grace window, so rotation never drops a request.
  • Audit logging. Every authenticated Echo request is logged with the SHA-256 hash of the bearer (never the raw token), source IP, method, path, status, and latency.
  • Encryption-at-rest gate. The Echo Helm chart fails closed: helm install aborts unless its DuckDB volume is on an encrypted StorageClass or the operator explicitly sets persistence.encryptionAcknowledged=true.
  • Signed releases. Each release is cosign-signed (keyless OIDC) with a CycloneDX SBOM per archive; see Quick Start for the verification commands.
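
Three of the bearer-handling properties above (constant-time comparison, a rotation grace window, and hashed audit identifiers) can be sketched with the Python standard library. This is an illustration of the techniques, not Echo's implementation; the function names and the 60-second default grace are hypothetical.

```python
import hashlib
import hmac


def token_matches(presented: str, expected: str) -> bool:
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(presented.encode(), expected.encode())


def is_authorized(presented, current, previous=None,
                  rotated_at=0.0, now=0.0, grace_s=60.0):
    """Accept the current token, or the previous one inside the grace window."""
    if token_matches(presented, current):
        return True
    in_grace = previous is not None and now - rotated_at <= grace_s
    return in_grace and token_matches(presented, previous)


def audit_id(presented: str) -> str:
    # Log a SHA-256 digest so the raw token never reaches the audit log.
    return hashlib.sha256(presented.encode()).hexdigest()


old_ok = is_authorized("old", current="new", previous="old", rotated_at=100.0, now=130.0)
old_expired = is_authorized("old", current="new", previous="old", rotated_at=100.0, now=500.0)
log_key = audit_id("new")
```

Inside the grace window both tokens work, so clients that have not yet picked up the new token keep succeeding while rotation rolls out.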

Documentation

Requirements

  • Ingero agent on each GPU node
  • Go 1.22+ (for building from source)
  • Kubernetes 1.24+ (for Helm deployment)

Ingero Echo

Per-node agents emit signal; operators investigate clusters. Ingero Echo is the companion service that gives the fleet a queryable place to land: a StatefulSet that ingests OTLP from Fleet, persists to embedded DuckDB, and exposes two surfaces over one bearer-authed listener.

  • MCP server for AI agents. Query tools cover cluster summaries, outlier and straggler ranking, anomaly streams, NCCL and memcpy bandwidth rollups, memory-fragmentation hot spots, and cost. The /investigate prompt walks an LLM through the cluster-level WHERE, then hands off to the per-node agent for the WHY.
  • HTTP+JSON API for clients that prefer curl: dashboards, CI scripts, Grafana datasources. URL-versioned at /api/v<N>/...; OpenAPI 3.1 described. See docs/api-versioning.md for the contract.

| Path | Auth | Purpose |
| --- | --- | --- |
| `GET /api/versions` | none | capability negotiation |
| `GET /api/v2/health` | none | liveness probe |
| `GET /api/v2/whoami` | bearer | bearer identity introspection |
| `GET /api/v2/tools/list` | bearer | MCP tool catalog with input + output JSON schemas |
| `POST /api/v2/tools/<name>` | bearer | invoke a tool with server-side schema validation |
| `POST /api/v2/sql` | bearer (non-tenant-scoped) | read-only SQL against the event store |
| `GET /api/v2/openapi.json` | bearer | OpenAPI 3.1 spec |
| `GET /metrics` | bearer | Prometheus exposition (handler / status_class / api_version labels only) |
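
A minimal sketch of a client call against one of the bearer-authed routes, assuming the conventional `Authorization: Bearer <token>` header. The host `echo.example.com` and the token value are placeholders, and the request is only constructed here, not sent.

```python
import urllib.request

ECHO_BASE_URL = "https://echo.example.com"  # hypothetical host


def tools_list_request(token: str, base_url: str = ECHO_BASE_URL) -> urllib.request.Request:
    """Build a GET /api/v2/tools/list request carrying the bearer token."""
    return urllib.request.Request(
        f"{base_url}/api/v2/tools/list",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


req = tools_list_request("example-token")
# To send it against a reachable Echo instance: urllib.request.urlopen(req)
```

The same pattern applies to the other bearer-authed paths in the table; only the path and HTTP method change.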

License

Apache License 2.0. See LICENSE.