What Ingero Detects

May 24, 2026

Ingero addresses 25 documented GPU problems across training, inference, and AI agent workloads. The README's "What Ingero detects" section shows the highest-impact 8; the full list is below.

#	GPU Problem	Severity	How Ingero Detects It
1	NCCL hangs & distributed training deadlocks	CRITICAL	Direct `ncclAllReduce` / `ncclSend` / `ncclRecv` enter/exit uprobes (v0.12.0+) measure per-collective wall time, with rank/`comm_id_hash`/`nranks` correlation. `sched_switch` + TCP-retransmit tracing remain as host-side and network-side cross-checks.
2	GPU underutilization / data pipeline starvation	CRITICAL	Host scheduler + `cudaStreamSync` + `cudaMemcpy` pipeline bubble diagnosis. Block I/O shows DataLoader disk bottleneck
3	CUDA OOM & memory fragmentation	CRITICAL	`cudaMalloc`/`cuMemAlloc` allocation pattern tracing. `cudaMallocManaged` adds managed-memory over-subscription detection
4	Silent data corruption (SDC)	CRITICAL	Anomalous kernel timing as indirect signal (limited)
5	Inference cost explosion (multi-step agents)	CRITICAL	CUDA API burst/idle patterns per agent session
6	KV cache pressure & preemption cascades	CRITICAL	`cudaMalloc` patterns + `cudaStreamSync` spikes during preemption. Managed-memory page fault detection
6b	CUDA Graph re-capture latency spikes (vLLM, torch.compile)	HIGH	Graph lifecycle tracing: capture/instantiate/launch rates, pool exhaustion detection, OOM during capture, CPU contention during launch
7	GPU hardware failures at scale	HIGH	`cudaMemcpy` baseline drift, `sched_switch` frequency anomalies
8	CPU bottleneck in GPU serving	HIGH	`sched_switch` on inference process + `cudaStreamSync` idle gaps
9	GPU idle waste during agent tool execution	HIGH	CUDA API silence periods correlated with host process activity. TCP tracing shows "GPU idle during 2s HTTP tool call"
10	GPU memory leaks in long-running services	HIGH	`cudaMalloc`/`cudaFree` imbalance tracking over time, per-container via cgroup
11	Mixed precision (AMP) instability	HIGH	Anomalous kernel timing (a skipped optimizer update shows up as an unusually fast sync)
12	Goodput loss (training efficiency gap)	HIGH	Scheduler preemption, memcpy latency, pipeline bubbles. Block I/O shows checkpoint write + data read overhead
13	GPU scheduling & orchestration failures	HIGH	Per-cgroup `sched_switch` latency + orchestrator metadata. v0.12.3 added multi-orchestrator detection: K8s (auto-discovers `nvidia.com/gpu` pods), Slurm (`SLURM_JOB_ID`), ECS (`ECS_CONTAINER_METADATA_URI_V4`/V3), Docker / containerd (cgroup hex match).
14	Model swapping latency (multi-model agents)	HIGH	`cudaMalloc` + `cudaMemcpy` patterns during model load. Block I/O shows disk→CPU transfer time
15	CUDA device-side asserts & illegal memory access	MEDIUM	CUDA API call sequence + stack traces before crash
16	NVIDIA driver / CUDA version incompatibility	MEDIUM	Uprobe attachment failure = library/driver mismatch signal
17	Thermal throttling & power limit throttling	MEDIUM	Kernel duration trending over time
18	Noisy neighbor / multi-tenant GPU interference	MEDIUM	Per-cgroup `sched_switch` latency correlated with CUDA API latency; noisy neighbors are flagged via `cgroup_schedstat`
19	Cold start / model loading latency	MEDIUM	Full cold start sequence via CUDA API timing. Block I/O completes disk→CPU→GPU pipeline
20	Multi-GPU tensor parallel communication overhead	MEDIUM	Direct NCCL collective uprobes (`ncclAllReduce` / `ncclAllGather` / `ncclReduceScatter`, v0.12.0+) measure barrier-wait time per rank with `comm_id_hash` + `nranks` labels. Host-side `sched_switch` + TCP-retransmit on NCCL ports remain as cross-checks.
21	RAG pipeline GPU contention	MEDIUM	Per-process CUDA API breakdown (`explain --per-process`): shows which process is hogging GPU time
22	Checkpoint save/load failures	MEDIUM	Memory spike detection + I/O blocking in `cudaStreamSync`. Block I/O shows actual write latency + NFS timeouts
23	PCIe bottleneck (KV cache swap, model loading)	MEDIUM	`cudaMemcpy` per-operation tracing with direction/size/duration. `cudaMallocManaged` page migration + Block I/O show NVMe-PCIe contention
24	Loss spikes (non-AMP)	LOW-MED	System event correlation with loss timing
25	Triton Inference Server multi-GPU bugs	LOW-MED	CUDA API tracing on Triton processes
26	Per-request inference latency tied to GPU events (v0.19)	HIGH	An optional emitter (vLLM at `examples/integrations/vllm/`) writes ONE span annotation per request to the agent socket. The agent joins by process incarnation + time window, so `explain --by-request` and `query --by-request` show CUDA / memcpy / barrier-wait counts inside each request's `[arrival, finished]` span. Rows are ranked by agent-MEASURED span duration only. Continuous-batching honesty: a kernel inside a span is NOT exclusively that request's work, and the same kernel appears in every overlapping request's group; the renderer prints this caveat verbatim every time. Single trust domain only: the agent checks the writer process's peer credentials (`SO_PEERCRED`) but cannot prove tenant honesty, so v0.19 is in scope when one operator owns both the inference server and the trace, and out of scope when a tenant could submit a forged `request_id` through the emitter. Token-count labels (`prompt_len`, `output_len`) are advisory; they never feed ranking.
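
The environment-variable and cgroup checks named in row 13 can be sketched roughly as follows. This is a minimal illustration, not Ingero's detection code: the function name `detect_orchestrator`, the return labels, and the precedence order are all assumptions, and K8s pod discovery is left out because it needs access to the cluster API. The V3 variable is assumed to be `ECS_CONTAINER_METADATA_URI`, which is the name Amazon ECS uses for the version 3 metadata endpoint.

```python
import re

# A 64-character lowercase hex string is the usual shape of a Docker /
# containerd container ID inside a cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def detect_orchestrator(env, cgroup_text=""):
    """Hypothetical helper: guess the orchestrator from env vars and cgroup text.

    `env` is a mapping such as os.environ; `cgroup_text` is the content of a
    process's /proc/<pid>/cgroup file. The check order is an assumption.
    """
    if "SLURM_JOB_ID" in env:
        return "slurm"
    if "ECS_CONTAINER_METADATA_URI_V4" in env or "ECS_CONTAINER_METADATA_URI" in env:
        return "ecs"
    if _CONTAINER_ID.search(cgroup_text):
        return "container"  # Docker / containerd (cgroup hex match)
    return "unknown"
```

On a Linux host, one way to try it is to pass `os.environ` together with the text of `/proc/self/cgroup`.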

Per-request inference correlation reading guide (v0.19)

`--by-request` answers "which CUDA / memcpy / barrier-wait events landed inside this request's lifecycle window?" It does NOT answer "which kernels did this request cause." Under continuous batching (vLLM v1, TRT-LLM, SGLang) a single kernel launch serves tokens from many in-flight requests, so the time-overlap slice is the only honest question the data can answer.

Read the rows this way:

- `duration` is the agent-measured span (`span_end - span_start` of the `request_id` annotation). A forged `prompt_len` or `output_len` from a misbehaving emitter cannot reorder the table; the agent never ranks by emitter-supplied values.
- `events` is the count of events from the analyzed set whose timestamps land inside that request's span. The SAME event will appear in EVERY request whose span contains it; do not sum these counts across rows to compute a total.
- The two `Note:` lines above the table are NOT cosmetic. They are the contract: the renderer prints them verbatim, and any investigation that ignores them will mis-attribute kernel work.
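
The reading rules above can be illustrated with a small time-window join. This is a sketch under stated assumptions, not the agent's implementation: the `Span` type, the `by_request` function, the nanosecond fields, and the descending sort order are hypothetical. It only demonstrates two properties from the text: ranking uses the measured span duration and ignores `prompt_len`, and an event inside two overlapping spans is counted in both rows.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical per-request annotation; field names are assumptions."""
    request_id: str
    start_ns: int
    end_ns: int
    prompt_len: int = 0  # advisory label supplied by the emitter, never ranked on


def by_request(spans, event_timestamps_ns):
    """Count events inside each span and rank rows by measured duration."""
    rows = []
    for s in spans:
        # Closed interval, matching the [arrival, finished] notation.
        count = sum(1 for t in event_timestamps_ns if s.start_ns <= t <= s.end_ns)
        rows.append({
            "request_id": s.request_id,
            "duration_ns": s.end_ns - s.start_ns,
            "events": count,
        })
    # Longest span first (assumed direction); prompt_len plays no part.
    rows.sort(key=lambda r: r["duration_ns"], reverse=True)
    return rows


# Two overlapping requests; req-a claims a huge prompt but has the shorter span.
spans = [Span("req-a", 0, 100, prompt_len=10_000), Span("req-b", 50, 250, prompt_len=1)]
events = [10, 60, 90, 200]
rows = by_request(spans, events)
```

Here the events at 60 and 90 fall inside both spans, so the per-row counts add up to more than the four events that actually occurred. That is why the counts must not be summed across rows.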

Sensitivity: `request_id` and `prompt_len` / `output_len` are stored in the SQLite annotations table as plain text and survive trace rollover. When a raw `request_id` is sensitive (PII, prompt-hash proxy, customer ID), turn on the emitter's salted-hash mode so the on-disk value is a 16-char hex digest the operator can still group on but cannot reverse without the salt.
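
To make the salted-hash idea concrete, here is one way a 16-character hex digest could be derived. The emitter's actual algorithm is not specified here, so HMAC-SHA256 truncated to 16 hex characters is purely an assumption, and `hash_request_id` is a hypothetical name.

```python
import hashlib
import hmac


def hash_request_id(request_id: str, salt: bytes) -> str:
    """Hypothetical sketch: keyed hash of a request ID, truncated to 16 hex chars.

    The same (request_id, salt) pair always maps to the same digest, so rows
    can still be grouped; without the salt the raw ID cannot be confirmed by
    guessing candidate values.
    """
    digest = hmac.new(salt, request_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]
```

Truncating to 16 hex characters keeps 64 bits of the digest, which should be enough to make accidental collisions between request IDs unlikely at typical trace sizes.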