Time-Based Benchmarking

April 17, 2026 · View on GitHub

Time-based benchmarking runs for a specific duration rather than a fixed number of requests. Use it for SLA validation, stability testing, capacity planning, and A/B comparisons where consistent time windows matter.

Quick Start

aiperf profile \
    --model your-model \
    --url localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --concurrency 10 \
    --benchmark-duration 60

Requests are sent continuously until the duration expires. AIPerf then waits for in-flight requests to complete (up to the grace period).

How It Works

│          BENCHMARK DURATION           │   GRACE PERIOD    │
│        (sending requests)             │   (drain only)    │
├───────────────────────────────────────┼───────────────────┤
│ New requests dispatched               │ No new requests   │
│ Responses collected                   │ Wait for in-flight│
└───────────────────────────────────────┴───────────────────┘
                    ▲                             ▲
           Duration expires              Grace period ends
  • Grace period default: 30 seconds (use inf to wait forever, 0 for immediate completion)
  • Responses received within grace period are included in metrics; responses still pending when grace expires are not

Warning

--benchmark-grace-period requires --benchmark-duration to be set.

Combining with Request Count

Duration can be combined with count-based stopping—first condition reached wins:

# Stop when EITHER 1000 requests sent OR 120 seconds pass
aiperf profile \
    --model your-model \
    --url localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --request-rate 20 \
    --benchmark-duration 120 \
    --request-count 1000

Examples

Stability Test (5 minutes)

aiperf profile \
    --model Qwen/Qwen2.5-7B-Instruct \
    --url localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --concurrency 50 \
    --benchmark-duration 300 \
    --benchmark-grace-period 60 \
    --warmup-duration 30

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     AIPerf System is WARMING UP

Warming Up: [00:30] - Running for 30 seconds...

INFO     Warmup completed, starting profiling phase
INFO     AIPerf System is PROFILING

Profiling: [05:00] - Running for 300 seconds...

INFO     Benchmark duration reached, draining in-flight requests
INFO     Grace period: waiting up to 60 seconds for responses

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen2.5-7B-Instruct-chat-concurrency50/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                     Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│       Request Latency (ms) │ 245.67 │ 178.90 │ 398.12 │ 367.89 │ 239.45 │
│   Time to First Token (ms) │  56.78 │  42.34 │  89.12 │  82.45 │  55.23 │
│   Inter Token Latency (ms) │  13.45 │  10.23 │  19.67 │  18.45 │  13.12 │
│ Request Throughput (req/s) │  89.23 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen2.5-7B-Instruct-chat-concurrency50/profile_export_aiperf.json

Soak Test (1 hour)

aiperf profile \
    --model Qwen/Qwen2.5-7B-Instruct \
    --url localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --concurrency 20 \
    --benchmark-duration 3600 \
    --benchmark-grace-period 120 \
    --warmup-duration 60

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     AIPerf System is WARMING UP

Warming Up: [01:00] - Running for 60 seconds...

INFO     Warmup completed, starting profiling phase
INFO     AIPerf System is PROFILING

Profiling: [60:00] - Running for 3600 seconds...

INFO     Benchmark duration reached, draining in-flight requests
INFO     Grace period: waiting up to 120 seconds for responses

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen2.5-7B-Instruct-chat-concurrency20/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                     Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│       Request Latency (ms) │ 198.34 │ 156.78 │ 312.45 │ 289.67 │ 194.23 │
│   Time to First Token (ms) │  48.90 │  38.45 │  76.34 │  71.23 │  47.89 │
│   Inter Token Latency (ms) │  12.01 │   9.56 │  17.89 │  16.78 │  11.78 │
│ Request Throughput (req/s) │  45.67 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen2.5-7B-Instruct-chat-concurrency20/profile_export_aiperf.json

CLI Reference

OptionTypeDefaultDescription
--benchmark-durationfloatNoneStop sending requests after this many seconds
--benchmark-grace-periodfloat30.0Seconds to wait for in-flight requests after duration. Use inf for unlimited. Requires --benchmark-duration.

Troubleshooting

IssueSolution
Requests cut off mid-responseIncrease --benchmark-grace-period or use inf
Grace period errorAdd --benchmark-duration (grace period requires it)