What Ingero Detects

May 24, 2026

Ingero addresses 25 documented GPU problems across training, inference, and AI agent workloads. The README's "What Ingero detects" section shows the highest-impact 8; the full list is below.

| # | GPU Problem | Severity | How Ingero Detects It |
|---|---|---|---|
| 1 | NCCL hangs & distributed training deadlocks | CRITICAL | Direct ncclAllReduce / ncclSend / ncclRecv enter/exit uprobes (v0.12.0+) measure per-collective wall time, with rank/comm_id_hash/nranks correlation. sched_switch + TCP-retransmit tracing remain as host-side and network-side cross-checks. |
| 2 | GPU underutilization / data pipeline starvation | CRITICAL | Host scheduler + cudaStreamSync + cudaMemcpy pipeline bubble diagnosis. Block I/O shows DataLoader disk bottleneck |
| 3 | CUDA OOM & memory fragmentation | CRITICAL | cudaMalloc/cuMemAlloc allocation pattern tracing. cudaMallocManaged adds managed-memory over-subscription detection |
| 4 | Silent data corruption (SDC) | CRITICAL | Anomalous kernel timing as indirect signal (limited) |
| 5 | Inference cost explosion (multi-step agents) | CRITICAL | CUDA API burst/idle patterns per agent session |
| 6 | KV cache pressure & preemption cascades | CRITICAL | cudaMalloc patterns + cudaStreamSync spikes during preemption. Managed-memory page fault detection |
| 6b | CUDA Graph re-capture latency spikes (vLLM, torch.compile) | HIGH | Graph lifecycle tracing: capture/instantiate/launch rates, pool exhaustion detection, OOM during capture, CPU contention during launch |
| 7 | GPU hardware failures at scale | HIGH | cudaMemcpy baseline drift, sched_switch frequency anomalies |
| 8 | CPU bottleneck in GPU serving | HIGH | sched_switch on inference process + cudaStreamSync idle gaps |
| 9 | GPU idle waste during agent tool execution | HIGH | CUDA API silence periods correlated with host process activity. TCP tracing shows "GPU idle during 2s HTTP tool call" |
| 10 | GPU memory leaks in long-running services | HIGH | cudaMalloc/cudaFree imbalance tracking over time, per-container via cgroup |
| 11 | Mixed precision (AMP) instability | HIGH | Anomalous kernel timing (skipped updates = fast sync) |
| 12 | Goodput loss (training efficiency gap) | HIGH | Scheduler preemption, memcpy latency, pipeline bubbles. Block I/O shows checkpoint write + data read overhead |
| 13 | GPU scheduling & orchestration failures | HIGH | Per-cgroup sched_switch latency + orchestrator metadata. v0.12.3 added multi-orchestrator detection: K8s (auto-discovers nvidia.com/gpu pods), Slurm (SLURM_JOB_ID), ECS (ECS_CONTAINER_METADATA_URI_V4/V3), Docker / containerd (cgroup hex match). |
| 14 | Model swapping latency (multi-model agents) | HIGH | cudaMalloc + cudaMemcpy patterns during model load. Block I/O shows disk→CPU transfer time |
| 15 | CUDA device-side asserts & illegal memory access | MEDIUM | CUDA API call sequence + stack traces before crash |
| 16 | NVIDIA driver / CUDA version incompatibility | MEDIUM | Uprobe attachment failure = library/driver mismatch signal |
| 17 | Thermal throttling & power limit throttling | MEDIUM | Kernel duration trending over time |
| 18 | Noisy neighbor / multi-tenant GPU interference | MEDIUM | Per-cgroup sched_switch latency + CUDA API latency correlation. Noisy neighbor detection via cgroup_schedstat |
| 19 | Cold start / model loading latency | MEDIUM | Full cold start sequence via CUDA API timing. Block I/O completes disk→CPU→GPU pipeline |
| 20 | Multi-GPU tensor parallel communication overhead | MEDIUM | Direct NCCL collective uprobes (ncclAllReduce / ncclAllGather / ncclReduceScatter, v0.12.0+) measure barrier-wait time per rank with comm_id_hash + nranks labels. Host-side sched_switch + TCP-retransmit on NCCL ports remain as cross-checks. |
| 21 | RAG pipeline GPU contention | MEDIUM | Per-process CUDA API breakdown (explain --per-process): shows which process is hogging GPU time |
| 22 | Checkpoint save/load failures | MEDIUM | Memory spike detection + I/O blocking in cudaStreamSync. Block I/O shows actual write latency + NFS timeouts |
| 23 | PCIe bottleneck (KV cache swap, model loading) | MEDIUM | cudaMemcpy per-operation tracing with direction/size/duration. cudaMallocManaged page migration + Block I/O shows NVMe-PCIe contention |
| 24 | Loss spikes (non-AMP) | LOW-MED | System event correlation with loss timing |
| 25 | Triton Inference Server multi-GPU bugs | LOW-MED | CUDA API tracing on Triton processes |
| 26 | Per-request inference latency tied to GPU events (v0.19) | HIGH | Optional emitter (vLLM at examples/integrations/vllm/) writes ONE span annotation per request to the agent socket; the agent joins by process incarnation + time window so explain --by-request and query --by-request show CUDA / memcpy / barrier-wait counts inside each request's [arrival, finished] span. Rows are ranked by agent-MEASURED span duration only. Continuous-batching honesty: a kernel inside a span is NOT exclusively that request's work and the same kernel appears in every overlapping request's group; the renderer prints this caveat verbatim every time. Single trust domain only: the agent peer-credentials the writer process (SO_PEERCRED) but cannot prove tenant honesty, so v0.19 is in-scope when one operator owns both the inference server and the trace, and out of scope when a tenant could submit a forged request_id through the emitter. Token-count labels (prompt_len, output_len) are advisory; they never feed ranking. |
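
To make row 10 (cudaMalloc/cudaFree imbalance tracking, per-container via cgroup) more concrete, here is a minimal sketch of the bookkeeping such a check implies. It is not Ingero's implementation: the event tuple layout, the `outstanding_bytes` helper, and the cgroup names are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def outstanding_bytes(events):
    """Sum bytes still allocated per cgroup after replaying a trace.

    `events` is a hypothetical iterable of (cgroup, op, ptr, size) tuples,
    where op is "malloc" or "free". Real trace records will differ.
    """
    live = defaultdict(dict)  # cgroup -> {device pointer: size in bytes}
    for cgroup, op, ptr, size in events:
        if op == "malloc":
            live[cgroup][ptr] = size
        elif op == "free":
            # A free for an unknown pointer is ignored rather than raising.
            live[cgroup].pop(ptr, None)
    return {cgroup: sum(ptrs.values()) for cgroup, ptrs in live.items()}


trace = [
    ("pod-a", "malloc", 0x1000, 4096),
    ("pod-a", "malloc", 0x2000, 8192),
    ("pod-a", "free", 0x1000, 0),
    ("pod-b", "malloc", 0x1000, 1024),
    ("pod-b", "free", 0x1000, 0),
]
result = outstanding_bytes(trace)
```

Run over successive time windows, a per-cgroup figure that keeps growing while the workload is steady is the kind of imbalance the table row describes; a single snapshot alone does not distinguish a leak from a legitimately large working set.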

Per-request inference correlation reading guide (v0.19)

--by-request answers "which CUDA / memcpy / barrier-wait events landed inside this request's lifecycle window?" It does NOT answer "which kernels did this request cause." Under continuous batching (vLLM v1, TRT-LLM, SGLang) a single kernel launch serves tokens from many in-flight requests, so the time-overlap slice is the only honest question the data can answer.

Read the rows this way:

  • duration is the agent-measured span (span_end - span_start of the request_id annotation). A forged prompt_len or output_len from a misbehaving emitter cannot reorder the table; the agent never ranks by emitter-supplied values.
  • events is the count of events from the analyzed set whose timestamps land inside that request's span. The SAME event will appear in EVERY request whose span contains it; do not sum these counts across rows to compute a total.
  • The two Note: lines above the table are NOT cosmetic. They are the contract: the renderer prints them verbatim, and any investigation that ignores them will mis-attribute kernel work.
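
The reading rules above can be illustrated with a small sketch of a time-overlap join. This is an assumption-laden illustration, not the agent's code: the `Span` type, the `by_request` function, nanosecond integer timestamps, and inclusive span bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    request_id: str
    start_ns: int  # measured span start (assumed nanoseconds)
    end_ns: int    # measured span end

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


def by_request(spans, event_timestamps_ns):
    """Return (request_id, duration_ns, events) rows, longest span first."""
    rows = []
    for span in spans:
        # Count every event whose timestamp lands inside the span.
        # Inclusive bounds are an assumption made for this sketch.
        events = sum(
            1 for ts in event_timestamps_ns
            if span.start_ns <= ts <= span.end_ns
        )
        rows.append((span.request_id, span.duration_ns, events))
    # Rank only by measured duration, never by emitter-supplied labels.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


spans = [Span("req-a", 0, 100), Span("req-b", 50, 120)]
events = [10, 60, 70, 110]
rows = by_request(spans, events)
total_across_rows = sum(row[2] for row in rows)
```

Here the events at 60 and 70 fall inside both spans, so each request reports 3 events and the row sum is 6 even though only 4 distinct events exist. That overcount is why the counts must not be summed across rows.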

Sensitivity: request_id and prompt_len / output_len are stored in the SQLite annotations table as plain text and survive trace rollover. When a raw request_id is sensitive (PII, prompt-hash proxy, customer ID), turn on the emitter's salted-hash mode so the on-disk value is a 16-char hex digest the operator can still group on but cannot reverse without the salt.
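
As a rough sketch of what a salted-hash mode could look like, the snippet below derives a 16-character hex digest from a request_id. The source does not specify the emitter's algorithm, so the use of HMAC-SHA256, the truncation to the first 16 hex characters, and the `hash_request_id` name are assumptions made for illustration.

```python
import hashlib
import hmac


def hash_request_id(request_id: str, salt: bytes) -> str:
    """Return a 16-char hex digest of request_id keyed by salt.

    HMAC-SHA256 truncated to 64 bits is an assumed construction; the
    real emitter may use a different hash or truncation.
    """
    digest = hmac.new(salt, request_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


salt = b"operator-held-secret"  # hypothetical salt, kept outside the trace
first = hash_request_id("customer-42/req-7", salt)
second = hash_request_id("customer-42/req-7", salt)
other_salt = hash_request_id("customer-42/req-7", b"different-salt")
```

Because the same input and salt always produce the same digest, rows can still be grouped by the stored value, while a reader without the salt cannot recompute the digest from a guessed request_id.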