
Inference Perf

Inference Perf is a production-scale GenAI inference benchmarking tool for measuring and analyzing the performance of inference deployments. It is model-server agnostic, so you can measure performance and compare different serving systems apples-to-apples.

It was created as part of the inference benchmarking and metrics standardization effort in wg-serving, which aims to standardize benchmark tooling and the metrics used to measure inference performance across the Kubernetes and model server communities.


๐Ÿ—๏ธ Architecture

Architecture Diagram


🌟 Key Capabilities

📊 Rich Metrics & Analysis

  • Comprehensive Latency Metrics: TTFT, TPOT, ITL, and Normalized TPOT.
  • Throughput Tracking: Input, Output, and Total tokens per second.
  • Goodput Measurement: Measure the rate of requests that meet your SLO constraints (see goodput.md and the config sketch after this list).
  • Automatic Visualization: Generate charts for QPS vs Latency/Throughput/Goodput. See analysis.md.
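
A minimal sketch of how SLO constraints for goodput might be declared in config.yml. The `goodput` key and the threshold field names below are illustrative assumptions, not the documented schema; goodput.md has the authoritative details:

    # Hypothetical goodput section: a request only counts toward goodput
    # if it meets every threshold below. Key names are illustrative.
    goodput:
      slos:
        ttft_ms: 500   # time to first token must be under 500 ms
        tpot_ms: 50    # time per output token must be under 50 ms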

🧠 Smart Data Generation

  • Real-world Datasets: Support for ShareGPT, CNN/DailyMail, Infinity Instruct, and BillSum.
  • Synthetic & Random: Configure exact input/output token-length distributions (sketched after this list).
  • Advanced Scenarios: Shared prefix and multi-turn chat conversations.
  • Multimodal: Synthetic image, video, and audio payloads with per-modality reporting. Resolutions/profiles/durations are passed through as-is; pick values within your model's accepted range. See docs/config.md.
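
For instance, a synthetic data section could pin those distributions explicitly. In this sketch, `data.type` mirrors the `--data.type` CLI flag, while the distribution field names are assumptions for illustration; docs/config.md has the authoritative schema:

    # Sketch of a synthetic workload with fixed token-length distributions.
    # The field names under the distributions are illustrative assumptions.
    data:
      type: synthetic
      input_distribution:
        mean: 512      # mean prompt length in tokens
        std_dev: 128
      output_distribution:
        mean: 256      # mean completion length in tokens
        std_dev: 64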

โฑ๏ธ Flexible Load Generation

  • Load Patterns: Constant rate, Poisson arrival, and concurrent user simulation.
  • Multi-Stage Runs: Define stages with varying rates and durations to find saturation points (see the example after this list).
  • Trace Replay: Replay real-world traces (e.g., Azure dataset) or OpenTelemetry traces with agentic tree-of-thought simulation and visualization.
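
The stage shape matches the `--load.stages` flag shown in the Quick Start below; as YAML, a stepped sweep toward saturation might look like this sketch (the `rate`/`duration` fields come from that CLI example, while the surrounding nesting is assumed; see config.md):

    # Step a constant-rate load up in stages to locate the saturation point.
    load:
      type: constant   # or: poisson
      stages:
        - rate: 5      # requests per second
          duration: 60 # seconds
        - rate: 10
          duration: 60
        - rate: 20
          duration: 60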

🚀 High Scalability

  • 10k+ QPS: Scales to very high load thanks to an optimized multi-process architecture.
  • Automatic Saturation Detection: Find the limits of your system via sweeps.

🔌 Engine Agnostic

  • Verified support for vLLM, SGLang, and TGI, including server-side aggregate and time-series metrics.
  • Easily extensible to any OpenAI-compatible endpoint.

🚀 Quick Start

Run Locally

  1. Install inference-perf:

    pip install inference-perf
    
  2. Run a benchmark with a simple random workload:

    inference-perf --server.type vllm --server.base_url http://localhost:8000 --data.type random --load.type constant --load.stages '[{"rate": 10, "duration": 60}]' --api.streaming true
    

Alternatively, you can run using a configuration file:

inference-perf --config_file config.yml
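
As a sketch, a config.yml equivalent to the flag-based Quick Start command above would nest the dotted CLI flags as YAML keys (this flag-to-key mapping is an assumption here; config.md and cli_flags.md document the authoritative schema):

    # Assumed YAML equivalent of the CLI invocation above; each dotted
    # flag (e.g. --server.type) becomes a nested key.
    server:
      type: vllm
      base_url: http://localhost:8000
    data:
      type: random
    load:
      type: constant
      stages:
        - rate: 10     # requests per second
          duration: 60 # seconds
    api:
      streaming: true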

Sample Output

When you run inference-perf, it displays a rich summary table in the CLI:

Metrics Summary

Run in Docker

docker run -it --rm -v $(pwd)/config.yml:/workspace/config.yml quay.io/inference-perf/inference-perf

Run in Kubernetes

Refer to the guide in /deploy.


📚 Documentation Hub

Explore detailed documentation for specific topics:

| Topic | Description | Link |
| --- | --- | --- |
| Configuration | Full YAML configuration schema and options. | config.md |
| CLI Flags | Overriding configuration via command-line flags. | cli_flags.md |
| Load Generation | Detailed explanation of load patterns and multi-worker setup. | loadgen.md |
| Metrics | Definitions of TTFT, TPOT, ITL, etc. | metrics.md |
| Goodput | How to measure requests meeting SLOs. | goodput.md |
| Reports | Understanding generated JSON reports. | reports.md |
| OTel Observability | Instrument benchmark runs with OpenTelemetry tracing to export to Jaeger, Tempo, etc. | otel_instrumentation.md |
| OTel Trace Replay | Data/load type for replaying production traces with complex dependency graphs. | otel_trace_replay.md |
| Conversation Replay | Data/load type for benchmarking concurrent multi-turn agentic conversations with configurable distributions. | conversation_replay.md |
| Analysis | Visualizations and plots for performance metrics. | analysis.md |

๐Ÿค Contributing & Community

We welcome contributions! Join the community on Slack, and see CONTRIBUTING.md for details on how to get started.