OpenTelemetry Instrumentation for inference-perf

April 8, 2026 · View on GitHub

This document describes the OpenTelemetry (OTEL) instrumentation added to inference-perf for tracing LLM API calls.

Overview

The OTEL instrumentation provides distributed tracing capabilities for LLM inference requests, following the OpenTelemetry Semantic Conventions for GenAI operations.

Features

Automatic tracing of all LLM API calls (chat completions, completions) in inference-perf workloads
Standard GenAI semantic conventions for consistent observability
Rich span attributes including:
- Model name and operation type
- Request parameters (max_tokens, temperature, top_p, streaming)
- Input messages and output text
- Token usage (input/output tokens)
- Latency metrics (TTFT, TPOT, total latency)
- Finish reasons and error information
Support for all model servers: vLLM, SGlang, TGI, and any OpenAI-compatible API
Environment-based configuration: No code changes required
Graceful degradation: Works without OTEL packages installed (disabled mode)

Installation

pip install -e ".[otel]"

Configuration

OTEL instrumentation is controlled via environment variables. Tracing is disabled by default and must be explicitly enabled.

Environment Variables

OTEL_TRACES_ENABLED: Set to "true" to enable OTEL tracing (default: "false")
OTEL_EXPORTER_OTLP_ENDPOINT: OTLP endpoint for exporting traces (e.g., http://localhost:4317)
- If not set, traces are exported to console (stdout)
- If set, traces are exported via OTLP to the specified endpoint
OTEL_SERVICE_NAME: Service name for tracing (default: "inference-perf")
OTEL_TRACE_PER_STAGE: Set to "true" to create one trace per stage instead of per session (default: "false")
- When enabled, all sessions within a stage are grouped under a single stage-level trace
- Session spans are created as children of the stage span
- Useful for viewing all sessions in a stage together in trace visualization tools

Usage

Enable Tracing

To enable OTEL tracing, set the OTEL_TRACES_ENABLED environment variable:

export OTEL_TRACES_ENABLED="true"
python -m inference_perf.main --config config.yml

Console Output

When OTEL_EXPORTER_OTLP_ENDPOINT is not set, traces are printed to console in JSON format:

export OTEL_TRACES_ENABLED="true"
python -m inference_perf.main --config config.yml

Export to OTLP Endpoint

To export traces to an OTLP endpoint (e.g., Jaeger, Tempo, Grafana Cloud):

export OTEL_TRACES_ENABLED="true"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="my-inference-service"
python -m inference_perf.main --config config.yml

Using with Jaeger

1. Start Jaeger

Start Jaeger with OTLP support using Docker:

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Verify Jaeger is running by opening http://localhost:16686 in your browser.

2. Run with Jaeger

Option A: Using environment variables

export OTEL_TRACES_ENABLED="true"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
python -m inference_perf.main --config config.yml

Option B: Using the provided script

chmod +x examples/otel/run_with_jaeger.sh
./examples/otel/run_with_jaeger.sh

3. View Traces

Open Jaeger UI at http://localhost:16686 and:

Select "inference-perf" from the Service dropdown
Click "Find Traces"
Click on a trace to see detailed span information

Example Queries in Jaeger

Find slow requests:

In Jaeger UI, go to "Search" tab
Select service: "inference-perf"
Set "Min Duration" to filter slow traces
Click "Find Traces"

Analyze token usage:

Click on a trace
Expand the span details
Look for gen_ai.usage.* attributes

Compare models:

Search for traces with different gen_ai.request.model values
Compare latency and token usage

Span Attributes

The following GenAI semantic convention attributes are captured:

Request Attributes

gen_ai.system: System identifier (e.g., "openai_compatible")
gen_ai.request.model: Model name
gen_ai.request.max_tokens: Maximum tokens to generate
gen_ai.request.temperature: Sampling temperature
gen_ai.input.messages: Input messages as JSON string

Response Attributes

gen_ai.output.text: Generated text
gen_ai.usage.prompt_tokens: Number of input tokens
gen_ai.usage.completion_tokens: Number of output tokens
gen_ai.response.total_latency: Total request latency
gen_ai.response.time_to_first_token: Time to first token (TTFT)
gen_ai.response.time_per_output_token: Time per output token (TPOT)
gen_ai.response.finish_reason: Reason for completion

Additional Attributes

llm.request.type: Operation type (e.g., "chat.completions")
llm.is_streaming: Whether the request is streaming
llm.usage.total_tokens: Total tokens (input + output)

Architecture

The OTEL instrumentation is implemented in inference_perf/client/modelserver/otel_instrumentation.py and automatically integrated into all model server clients:

openai_client.py: Base OpenAI-compatible client
vllm_client.py: vLLM-specific client
sglang_client.py: SGlang-specific client
tgi_client.py: TGI-specific client

All clients automatically use the global OTEL instrumentation instance, which is configured via environment variables.

Advanced Configuration

Custom OTLP Endpoint

export OTEL_EXPORTER_OTLP_ENDPOINT="https://my-otel-collector:4317"
python -m inference_perf.main --config your_config.yml

Sampling

To reduce trace volume, configure sampling:

# Sample 10% of traces
export OTEL_TRACES_SAMPLER="traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1"

Integration with Other Tools

Grafana Tempo:

export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo:4317"

Honeycomb:

export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io:443"
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"

AWS X-Ray: Use the AWS Distro for OpenTelemetry (ADOT) Collector as an intermediary.

Troubleshooting

Traces not appearing in Jaeger

Check Jaeger is running:
```
curl http://localhost:16686
```
Check OTLP endpoint is accessible:
```
curl http://localhost:4317
```

Verify OTLP exporter is installed:

python -c "from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter"

Verify OTEL is enabled:

echo $OTEL_TRACES_ENABLED  # Should be "true"

Look for OTEL initialization messages in logs:

INFO - OTEL tracing enabled for service: inference-perf
INFO - Created OTEL tracer provider with OTLP exporter to http://localhost:4317