Custom Dataset Guide

May 21, 2026 · View on GitHub

Benchmark LLMs with your own data using single-turn requests, multi-turn conversations, or random sampling.

Overview

AIPerf supports three custom dataset types for benchmarking with your own data:

Dataset TypeBest ForMulti-TurnTiming ControlRandom Sampling
Single TurnIndependent single requestsNoYesNo
Multi TurnConversations with contextYesYes (per turn)No
Random PoolLoad testing with varietyNoNoYes

All three support:

  • Client-side batching
  • Automatic media handling: local files are converted to base64 format, while remote URLs are sent directly to the API

Server Setup

Start a vLLM server for testing:

docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen3-0.6B \
  --host 0.0.0.0 --port 8000 &

Verify the server is ready:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 10
  }' | jq

Single-Turn Datasets

Each line represents one independent single-turn request.

When to Use

Use single_turn when you need deterministic, sequential execution where requests always run in the exact order they appear in the file:

  • Debugging: Test specific prompts in a known sequence
  • Regression testing: Same input file → same output order every time
  • Timing control: Schedule requests with precise timestamps or delays
  • Predictable testing: Know exactly which request runs when

Execution: Sequential by default (request 1, then 2, then 3, etc.) Input: Single JSONL file only

Basic Text Example

cat > prompts.jsonl << 'EOF'
{"text": "What is machine learning?"}
{"text": "Explain neural networks."}
{"text": "How does backpropagation work?"}
{"text": "What are transformers?"}
{"text": "Define reinforcement learning."}
EOF

aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --input-file prompts.jsonl \
    --custom-dataset-type single_turn \
    --streaming \
    --url localhost:8000 \
    --concurrency 2 \
    --request-count 10

Output:



                                     NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃               Metric ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p50 ┃      std ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│  Time to First Token │    19.99 │    12.53 │    49.62 │    48.89 │    42.24 │    13.93 │    12.92 │
│                 (ms) │          │          │          │          │          │          │          │
│ Time to Second Token │     3.81 │     2.01 │     8.25 │     7.94 │     5.15 │     3.36 │     1.62 │
│                 (ms) │          │          │          │          │          │          │          │
│ Time to First Output │    19.99 │    12.53 │    49.62 │    48.89 │    42.24 │    13.93 │    12.92 │
│           Token (ms) │          │          │          │          │          │          │          │
│ Request Latency (ms) │ 2,940.39 │ 1,536.67 │ 7,319.35 │ 7,034.86 │ 4,474.42 │ 2,239.67 │ 1,611.04 │
│  Inter Token Latency │     3.52 │     3.47 │     3.64 │     3.63 │     3.56 │     3.50 │     0.05 │
│                 (ms) │          │          │          │          │          │          │          │
│         Output Token │   284.54 │   274.60 │   288.35 │   288.33 │   288.13 │   285.38 │     3.98 │
│  Throughput Per User │          │          │          │          │          │          │          │
│    (tokens/sec/user) │          │          │          │          │          │          │          │
│      Output Sequence │   833.40 │   438.00 │ 2,106.00 │ 2,022.21 │ 1,268.10 │   626.50 │   465.81 │
│      Length (tokens) │          │          │          │          │          │          │          │
│       Input Sequence │     5.00 │     4.00 │     7.00 │     7.00 │     7.00 │     5.00 │     1.10 │
│      Length (tokens) │          │          │          │          │          │          │          │
│         Output Token │   527.06 │      N/A │      N/A │      N/A │      N/A │      N/A │      N/A │
│           Throughput │          │          │          │          │          │          │          │
│         (tokens/sec) │          │          │          │          │          │          │          │
│   Request Throughput │     0.63 │      N/A │      N/A │      N/A │      N/A │      N/A │      N/A │
│       (requests/sec) │          │          │          │          │          │          │          │
│        Request Count │    10.00 │      N/A │      N/A │      N/A │      N/A │      N/A │      N/A │
│           (requests) │          │          │          │          │          │          │          │
└──────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-0.6B' --endpoint-type 'chat' --input-file
'prompts.jsonl' --custom-dataset-type 'single_turn' --streaming --url 'localhost:8000' --concurrency
2
Benchmark Duration: 15.81 sec
CSV Export:
artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency2/profile_export_aiperf.csv
JSON Export:
artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency2/profile_export_aiperf.json
Log File: artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency2/logs/aiperf.log

Inline alternative

Same content as prompts.jsonl, embedded in the AIPerf YAML config:

benchmark:
  model: Qwen/Qwen3-0.6B
  endpoint:
    url: http://localhost:8000
    type: chat
  dataset:
    type: file
    format: single_turn
    records:
      - {text: "What is machine learning?"}
      - {text: "Explain neural networks."}
      - {text: "How does backpropagation work?"}
      - {text: "What are transformers?"}
      - {text: "Define reinforcement learning."}
  phases:
    type: concurrency
    concurrency: 2
    requests: 100

See Inline Datasets for the full feature reference.

Per-Request Output Length

Control the maximum output tokens per request using the output_length field:

cat > prompts_with_osl.jsonl << 'EOF'
{"text": "Write a haiku about mountains.", "output_length": 50}
{"text": "Explain quantum computing in detail.", "output_length": 500}
{"text": "What is 2+2?", "output_length": 10}
{"text": "Summarize machine learning."}
EOF

aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --input-file prompts_with_osl.jsonl \
    --custom-dataset-type single_turn \
    --streaming \
    --url localhost:8000 \
    --osl 200 \
    --request-count 10

Precedence: Per-line output_length takes priority over the global --osl flag. Lines without output_length fall back to --osl if set (200 in this example), or let the server decide the output length.

The output_length field also works per-turn in multi_turn datasets.

Per-Request extra

Send vendor-specific or sampling parameters per request via the extra field. The dict is shallow-merged into the top of the request body at dispatch. Per-line keys win over --extra-inputs:

cat > prompts_with_extra.jsonl << 'EOF'
{"text": "Brainstorm a haiku.", "extra": {"temperature": 1.2, "top_p": 0.9}}
{"text": "Explain quantum computing.", "extra": {"temperature": 0.2, "seed": 42}}
{"text": "Summarize ML.", "extra": {"min_tokens": 50, "ignore_eos": true}}
EOF

The extra field also works per-turn in multi_turn datasets.


Multi-Turn Datasets

Each entry represents a complete conversation with multiple turns.

When to Use

Use multi_turn when you need conversations with context where each turn builds on previous turns in the conversation:

  • Chat testing: Test conversational AI that maintains context across turns
  • Realistic interactions: Simulate real user conversations with follow-up questions
  • Task completion: Test multi-step tasks that require conversation history

Execution: Sequential within each conversation (turn 1, then 2, then 3, etc.), but multiple conversations run concurrently Input: Single JSONL file only

Basic Conversation

cat > conversations.jsonl << 'EOF'
{"session_id": "chat_1", "turns": [{"text": "What is machine learning?"}, {"text": "Can you give me an example?"}]}
{"session_id": "chat_2", "turns": [{"text": "Explain neural networks."}, {"text": "How do they differ from traditional algorithms?"}, {"text": "Which architecture for image classification?"}]}
EOF

aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --input-file conversations.jsonl \
    --custom-dataset-type multi_turn \
    --streaming \
    --url localhost:8000 \
    --concurrency 2 \
    --request-count 10

Output:

                                     NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃                 Metric ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│    Time to First Token │    23.17 │    11.83 │    56.70 │    55.34 │    43.06 │    18.00 │  13.66 │
│                   (ms) │          │          │          │          │          │          │        │
│   Time to Second Token │     4.77 │     2.29 │    15.41 │    14.65 │     7.73 │     3.44 │   3.74 │
│                   (ms) │          │          │          │          │          │          │        │
│   Time to First Output │    23.17 │    11.83 │    56.70 │    55.34 │    43.06 │    18.00 │  13.66 │
│             Token (ms) │          │          │          │          │          │          │        │
│   Request Latency (ms) │ 2,008.84 │ 1,348.13 │ 3,045.04 │ 3,007.53 │ 2,669.92 │ 2,082.32 │ 572.34 │
│    Inter Token Latency │     3.50 │     3.13 │     3.67 │     3.67 │     3.62 │     3.52 │   0.14 │
│                   (ms) │          │          │          │          │          │          │        │
│           Output Token │   286.03 │   272.35 │   319.58 │   316.89 │   292.60 │   283.77 │  12.33 │
│    Throughput Per User │          │          │          │          │          │          │        │
│      (tokens/sec/user) │          │          │          │          │          │          │        │
│ Output Sequence Length │   565.60 │   380.00 │   838.00 │   826.57 │   723.70 │   581.50 │ 150.96 │
│               (tokens) │          │          │          │          │          │          │        │
│  Input Sequence Length │   379.80 │     5.00 │ 1,331.00 │ 1,287.80 │   899.00 │   203.00 │ 438.88 │
│               (tokens) │          │          │          │          │          │          │        │
│           Output Token │   533.83 │      N/A │      N/A │      N/A │      N/A │      N/A │    N/A │
│             Throughput │          │          │          │          │          │          │        │
│           (tokens/sec) │          │          │          │          │          │          │        │
│     Request Throughput │     0.94 │      N/A │      N/A │      N/A │      N/A │      N/A │    N/A │
│         (requests/sec) │          │          │          │          │          │          │        │
│          Request Count │    10.00 │      N/A │      N/A │      N/A │      N/A │      N/A │    N/A │
│             (requests) │          │          │          │          │          │          │        │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-0.6B' --endpoint-type 'chat' --input-file
'conversations.jsonl' --custom-dataset-type 'multi_turn' --streaming --url 'localhost:8000'
--concurrency 2
Benchmark Duration: 10.60 sec
CSV Export:
artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency2/profile_export_aiperf.csv
JSON Export:
artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency2/profile_export_aiperf.json
Log File: artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency2/logs/aiperf.log

Key Points:

  • Each turn includes full conversation history
  • Turns execute sequentially within each conversation
  • Multiple conversations run concurrently (up to --concurrency)
  • Each turn supports output_length and extra (same semantics as single_turn — vendor extras shallow-merged into the top of the wire body, latest turn wins for chat-style endpoints)

Inline alternative

benchmark:
  model: Qwen/Qwen3-0.6B
  endpoint:
    url: http://localhost:8000
    type: chat
  dataset:
    type: file
    format: multi_turn
    records:
      - session_id: chat_1
        turns:
          - {text: "What is machine learning?"}
          - {text: "Can you give me an example?"}
      - session_id: chat_2
        turns:
          - {text: "Explain neural networks."}
          - {text: "How do they differ from traditional algorithms?"}
          - {text: "Which architecture for image classification?"}
  phases:
    type: concurrency
    concurrency: 2
    requests: 100

Random Pool Datasets

Randomly sample from one or more data pools for varied request patterns.

When to Use

Use random_pool when you need random sampling with replacement for unpredictable, varied request patterns:

  • Load testing: Generate diverse request patterns with variety
  • Production simulation: Model real-world workloads where requests vary
  • Stress testing: Test system behavior under mixed input patterns
  • Multiple data sources: Combine files from a directory (each file becomes a pool)

Execution: Random sampling with replacement (same entry can be selected multiple times) Input: Single JSONL file OR directory of multiple JSONL files Note: Does NOT support timing control or multi-turn conversations

Basic Single-File Sampling

cat > pool.jsonl << 'EOF'
{"text": "What is machine learning?"}
{"text": "Explain neural networks."}
{"text": "How does backpropagation work?"}
{"text": "What are transformers?"}
{"text": "Define reinforcement learning."}
{"text": "What is transfer learning?"}
{"text": "Explain gradient descent."}
{"text": "What are GANs?"}
EOF

aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --input-file pool.jsonl \
    --custom-dataset-type random_pool \
    --num-conversations 50 \
    --streaming \
    --concurrency 4 \
    --random-seed 42 \
    --url localhost:8000

Output:

                                     NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Metric ┃      avg ┃      min ┃       max ┃      p99 ┃      p90 ┃      p50 ┃      std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Time to First Token │    17.73 │    12.25 │     53.21 │    53.17 │    19.85 │    14.63 │     9.90 │
│                (ms) │          │          │           │          │          │          │          │
│      Time to Second │     3.73 │     2.20 │     10.38 │     7.68 │     4.08 │     3.66 │     1.10 │
│          Token (ms) │          │          │           │          │          │          │          │
│       Time to First │    17.73 │    12.25 │     53.21 │    53.17 │    19.85 │    14.63 │     9.90 │
│   Output Token (ms) │          │          │           │          │          │          │          │
│     Request Latency │ 3,321.54 │ 1,356.57 │ 10,393.82 │ 9,063.81 │ 5,372.92 │ 2,917.73 │ 1,644.46 │
│                (ms) │          │          │           │          │          │          │          │
│ Inter Token Latency │     3.81 │     3.53 │      4.17 │     4.15 │     3.97 │     3.79 │     0.12 │
│                (ms) │          │          │           │          │          │          │          │
│        Output Token │   262.66 │   239.55 │    283.24 │   279.36 │   270.36 │   264.13 │     8.25 │
│ Throughput Per User │          │          │           │          │          │          │          │
│   (tokens/sec/user) │          │          │           │          │          │          │          │
│     Output Sequence │   861.02 │   369.00 │  2,615.00 │ 2,255.83 │ 1,306.40 │   766.00 │   404.28 │
│     Length (tokens) │          │          │           │          │          │          │          │
│      Input Sequence │     5.00 │     4.00 │      7.00 │     7.00 │     6.10 │     5.00 │     0.96 │
│     Length (tokens) │          │          │           │          │          │          │          │
│        Output Token │ 1,007.36 │      N/A │       N/A │      N/A │      N/A │      N/A │      N/A │
│          Throughput │          │          │           │          │          │          │          │
│        (tokens/sec) │          │          │           │          │          │          │          │
│  Request Throughput │     1.17 │      N/A │       N/A │      N/A │      N/A │      N/A │      N/A │
│      (requests/sec) │          │          │           │          │          │          │          │
│       Request Count │    50.00 │      N/A │       N/A │      N/A │      N/A │      N/A │      N/A │
│          (requests) │          │          │           │          │          │          │          │
└─────────────────────┴──────────┴──────────┴───────────┴──────────┴──────────┴──────────┴──────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-0.6B' --endpoint-type 'chat' --input-file
'pool.jsonl' --custom-dataset-type 'random_pool' --num-conversations 50 --streaming --concurrency 4
--random-seed 42 --url 'localhost:8000'
Benchmark Duration: 42.74 sec
CSV Export:
artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4/profile_export_aiperf.csv
JSON Export:
artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4/profile_export_aiperf.json
Log File: artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4/logs/aiperf.log

Behavior:

  • Randomly samples 50 requests from 8-entry pool
  • Sampling with replacement (entries can repeat)
  • Use --random-seed for reproducibility

Inline alternative (multi-pool)

benchmark:
  model: Qwen/Qwen3-0.6B
  endpoint:
    url: http://localhost:8000
    type: chat
  dataset:
    type: file
    format: random_pool
    sampling: random
    records:
      queries:
        - {text: "What is your refund policy?", type: random_pool}
        - {text: "How do I reset my password?", type: random_pool}
      passages:
        - {text: "Refunds are processed within 5 business days.", type: random_pool}
        - {text: "Click 'Forgot password' on the login page.", type: random_pool}
  phases:
    type: concurrency
    concurrency: 2
    requests: 50