benchmark-datasets.md
May 15, 2026
This document describes the datasets that AIPerf can use to generate benchmark stimulus. Support for additional datasets is under development, so check back often.
Dataset Options
| Dataset | Support | Data Source |
|---|---|---|
| Synthetic Text | ✅ | Synthetically generated text prompts pulled from Shakespeare |
| Synthetic Audio | ✅ | Synthetically generated audio samples |
| Synthetic Images | ✅ | Synthetically generated image samples |
| Custom Data | ✅ | `--input-file your_file.jsonl --custom-dataset-type single_turn` |
| Mooncake | ✅ | Mooncake trace file: `--input-file your_trace_file.jsonl --custom-dataset-type mooncake_trace` |
| ShareGPT | ✅ | Conversations from `--public-dataset sharegpt` |
| Agentic Code | ✅ | Synthetic multi-turn coding-agent traces with shared prompt layers, repository context, and cache-aware turn growth. Generated via `aiperf synthesize agentic-code` and replayed as a Mooncake trace. |
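
The Custom Data and Mooncake rows both expect a JSONL input file, that is, one JSON object per line. The sketch below shows one way such files might be produced. The field names are assumptions for illustration and are not confirmed by this document: `text` for a single-turn prompt, and `timestamp`, `input_length`, `output_length`, and `hash_ids` for a Mooncake-style trace entry. The helper `write_jsonl` is hypothetical and not part of AIPerf. Check the AIPerf documentation for the exact schema before relying on these files.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


# ASSUMPTION: a single-turn entry carries its prompt under a "text" key.
single_turn_records = [
    {"text": "Summarize the plot of Hamlet in two sentences."},
    {"text": "Explain what a KV cache is."},
]

# ASSUMPTION: a Mooncake-style trace entry has an arrival timestamp,
# input/output token lengths, and a list of block hash IDs.
# The values here are made up for illustration.
mooncake_records = [
    {"timestamp": 0, "input_length": 512, "output_length": 64, "hash_ids": [0, 1]},
    {"timestamp": 1000, "input_length": 768, "output_length": 32, "hash_ids": [0, 1, 2]},
]

# File names match the placeholders used in the table above.
write_jsonl("your_file.jsonl", single_turn_records)
write_jsonl("your_trace_file.jsonl", mooncake_records)
```

The resulting files could then be supplied through the `--input-file` and `--custom-dataset-type` flags listed in the table.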