benchmark-datasets.md

May 15, 2026

This document describes the datasets that AIPerf can use to generate stimulus. Support for additional datasets is under development, so check back often.

Dataset Options

| Dataset | Support | Data Source |
| --- | --- | --- |
| Synthetic Text | ✅ | Synthetically generated text prompts pulled from Shakespeare |
| Synthetic Audio | ✅ | Synthetically generated audio samples |
| Synthetic Images | ✅ | Synthetically generated image samples |
| Custom Data | ✅ | `--input-file your_file.jsonl --custom-dataset-type single_turn` |
| Mooncake | ✅ | Mooncake trace file `--input-file your_trace_file.jsonl --custom-dataset-type mooncake_trace` |
| ShareGPT | ✅ | Conversations from `--public-dataset sharegpt` |
| Agentic Code | ✅ | Synthetic multi-turn coding-agent traces with shared prompt layers, repository context, and cache-aware turn growth. Generated via `aiperf synthesize agentic-code` and replayed as a Mooncake trace. |
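
The Custom Data and Mooncake rows both take a `.jsonl` input file, that is, one JSON object per line. The sketch below shows one way such files might be produced with the Python standard library. The record fields are assumptions for illustration only: `text` for a single-turn prompt, and `timestamp`, `input_length`, `output_length`, and `hash_ids` for a Mooncake-style trace entry. The helpers `write_jsonl` and `read_jsonl` are hypothetical names and are not part of AIPerf. Check the AIPerf documentation for the exact schema each `--custom-dataset-type` expects.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# ASSUMPTION: each single-turn record carries its prompt in a "text" field.
single_turn_records = [
    {"text": "Summarize the plot of Hamlet in two sentences."},
    {"text": "Explain what a KV cache is."},
]

# ASSUMPTION: Mooncake-style trace fields; the values are made up for illustration.
mooncake_records = [
    {"timestamp": 0, "input_length": 512, "output_length": 64, "hash_ids": [0, 1]},
    {"timestamp": 1000, "input_length": 768, "output_length": 32, "hash_ids": [0, 1, 2]},
]

# Write both files into a scratch directory so the sketch has no side effects.
out_dir = Path(tempfile.mkdtemp())
single_turn_path = out_dir / "your_file.jsonl"
trace_path = out_dir / "your_trace_file.jsonl"

write_jsonl(single_turn_path, single_turn_records)
write_jsonl(trace_path, mooncake_records)

print(single_turn_path)
print(trace_path)
```

A file written this way would then be passed with the flags listed in the table, for example `--input-file your_file.jsonl --custom-dataset-type single_turn`.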