benchmark-datasets.md

May 15, 2026

This document describes datasets that AIPerf can use to generate stimulus. Additional support is under development, so check back often.

Dataset Options

| Dataset | Support | Data Source |
| --- | --- | --- |
| Synthetic Text | | Synthetically generated text prompts pulled from Shakespeare |
| Synthetic Audio | | Synthetically generated audio samples |
| Synthetic Images | | Synthetically generated image samples |
| Custom Data | | `--input-file your_file.jsonl --custom-dataset-type single_turn` |
| Mooncake | | Mooncake trace file `--input-file your_trace_file.jsonl --custom-dataset-type mooncake_trace` |
| ShareGPT | | Conversations from `--public-dataset sharegpt` |
| Agentic Code | | Synthetic multi-turn coding-agent traces with shared prompt layers, repository context, and cache-aware turn growth. Generated via `aiperf synthesize agentic-code` and replayed as a Mooncake trace. |
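
As a rough illustration of the Custom Data row, the sketch below writes a small JSONL file and assembles the two flags listed for it. The `"text"` field name and both helper functions are assumptions made for this example and are not taken from this document; check the AIPerf custom dataset documentation for the exact `single_turn` schema before relying on it.

```python
import json
import shlex
from pathlib import Path


def write_single_turn_jsonl(prompts, path):
    """Write one JSON object per line (JSONL).

    ASSUMPTION: each record carries its prompt under a "text" key.
    The real single_turn schema may differ.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"text": prompt}) + "\n")
    return path


def dataset_flags(path, dataset_type="single_turn"):
    """Return the two dataset flags from the table as an argument list."""
    return ["--input-file", str(path), "--custom-dataset-type", dataset_type]


if __name__ == "__main__":
    out = write_single_turn_jsonl(
        ["What is deep learning?", "Summarize Hamlet in one sentence."],
        "your_file.jsonl",
    )
    # Print the flags in a shell-safe form, ready to append to an AIPerf command.
    print(shlex.join(dataset_flags(out)))
```

Passing `dataset_type="mooncake_trace"` produces the flag pair from the Mooncake row instead, although a trace file needs its own record layout, which this sketch does not attempt to reproduce.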