Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
April 7, 2026
Tech Report | Qwen3-Omni-Captioner Live Demo | Qwen3.5-Omni Offline Mode Live Demo | Omni-Detective Pipeline | Omni-Cloze Benchmark
News
- [2026.03.30] Qwen3.5-Omni has been released. Check out the Demo Page for detailed audio-visual captioning, and try the HuggingFace Offline Demo and the ModelScope Offline Demo for a live demo.
- [2026.03.18] The manually verified Omni-Cloze Benchmark has been released.
- [2025.10.17] A minimal, extensible implementation of the Omni-Detective agentic data pipeline has been released.
- [2025.10.14] The Omni-Captioner technical report has been released on arXiv.
- [2025.09.22] Qwen3-Omni-Captioner has been released. Check out HuggingFace Demo and ModelScope Demo, and refer to the cookbook for usage.
Contents
- Overview
- Agentic Data Generation: Omni-Detective
- Omni Detailed Captioning Model: Omni-Captioner
- Omni Detailed Captioning Benchmark: Omni-Cloze
- Citation
Overview
Fine-grained perception of multimodal information is critical for advancing human-AI interaction. Omni-Captioner is a multimodal large language model that produces highly detailed, low-hallucination audio-visual captions. We have released the audio version, Qwen3-Omni-Captioner, and the audio-video version is available inside Qwen3.5-Omni. We also propose Omni-Detective for data generation and Omni-Cloze for evaluating omni detailed captioning.

Agentic Data Generation: Omni-Detective
Guides
Omni-Detective is an agentic data generation framework that uses iterative Query-Observation cycles to autonomously extract and synthesize richly detailed, minimally hallucinatory audio-visual annotations. It is composed of three key components:
- A Detective Agent that spontaneously orchestrates the perception process;
- A Tool Box containing multiple tools for extracting information from multimodal data;
- Independent Observers that interact with raw audio-video streams to probe targeted aspects.
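
The interplay of these three components can be pictured as a simple loop. The sketch below is illustrative only and is not the released implementation: `DetectiveAgent`, `investigate`, the dictionary used as a tool box, and the fixed-query stopping rule are all hypothetical names and simplifications chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: an observer takes (media_path, query) and returns text.
Observer = Callable[[str, str], str]


@dataclass
class DetectiveAgent:
    """Toy agent that works through a fixed list of queries (an assumption)."""
    pending: List[Tuple[str, str]]  # (tool_name, query) pairs still to ask
    evidence: List[str] = field(default_factory=list)

    def next_query(self) -> Optional[Tuple[str, str]]:
        # A real agent would let an LLM pick the next query from the evidence so far.
        return self.pending.pop(0) if self.pending else None

    def record(self, query: str, observation: str) -> None:
        self.evidence.append(f"Q: {query}\nA: {observation}")

    def summarize(self) -> str:
        # A real agent would synthesize a caption; here the evidence is simply joined.
        return "\n".join(self.evidence)


def investigate(media_path: str, agent: DetectiveAgent,
                tool_box: Dict[str, Observer], max_rounds: int = 8) -> str:
    """Run Query-Observation cycles until the agent stops or the budget runs out."""
    for _ in range(max_rounds):
        step = agent.next_query()
        if step is None:
            break
        tool_name, query = step
        # The selected observer probes the raw media for one targeted aspect.
        observation = tool_box[tool_name](media_path, query)
        agent.record(query, observation)
    return agent.summarize()


if __name__ == "__main__":
    # Stub observer standing in for an audio language model.
    def fake_audio_observer(path: str, query: str) -> str:
        return f"(stub answer about {path!r} for: {query})"

    agent = DetectiveAgent(pending=[("audio", "Who is speaking?"),
                                    ("audio", "What background sounds occur?")])
    print(investigate("clip.wav", agent, {"audio": fake_audio_observer}))
```

Keeping observers behind a plain callable interface is one way to let new tools be registered in the tool box without touching the loop itself.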

Quick Start
We provide a minimal, extensible implementation whose tool box contains only an audio language model (Qwen3-Omni-flash). You can extend it with additional tools and designs.
cd Omni-Detective
python main.py \
--input_path ./data/input.jsonl \
--output_path ./data/output.jsonl \
--num_workers 3
Omni Detailed Captioning Model: Omni-Captioner
Guides
Leveraging the high-fidelity multimodal detailed captioning data produced by Omni-Detective, we train Audio-Captioner and Omni-Captioner with a two-stage curriculum over the audio and audio-visual modalities. We have released the audio version, Qwen3-Omni-Captioner; the audio-video version is available inside Qwen3.5-Omni.
Quick Start
Qwen3-Omni-Captioner (Audio Version)
Refer to the cookbook for usage. Check out the HuggingFace Demo and the ModelScope Demo for a live demo.
Note: Qwen3-Omni-30B-A3B-Captioner is a single-turn model that accepts only one audio input per inference. It does not accept any text prompts and supports audio input only, with text output only. As Qwen3-Omni-30B-A3B-Captioner is designed for generating fine-grained descriptions of audio, excessively long audio clips may diminish detail perception. We recommend, as a best practice, limiting audio length to no more than 30 seconds.
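
One way to stay within the 30-second recommendation is to cut longer recordings into consecutive chunks before captioning each one separately. The sketch below is a minimal example that assumes uncompressed PCM WAV input and uses only Python's standard `wave` module; the function name `split_wav` and the output naming scheme are illustrative choices, not part of the released cookbook. Fixed-length cuts may fall in the middle of a word or sound event, so splitting at silences may work better in practice.

```python
import wave
from pathlib import Path
from typing import List

MAX_SECONDS = 30  # upper bound recommended in the note above


def split_wav(src: str, out_dir: str, max_seconds: float = MAX_SECONDS) -> List[Path]:
    """Split a PCM WAV file into consecutive chunks of at most `max_seconds` each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = max(1, int(reader.getframerate() * max_seconds))
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            chunk_path = out / f"{Path(src).stem}_{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                # Copy channel count, sample width, and rate; the frame count
                # in the header is corrected when the frames are written.
                writer.setparams(params)
                writer.writeframes(frames)
            chunks.append(chunk_path)
            index += 1
    return chunks
```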
Qwen3.5-Omni Offline Mode (Audio-Video Version)
Prompt Design
To make Qwen3.5-Omni produce more stable, structured, and reliable detailed captioning, we recommend using the following structured-description prompt template.
Recommended Audio-Video Structured-Description Prompt
Provide a detailed description of the video.
It should explicitly include three sections:
1. A structured chronological storyline of **every noticeable audio and visual details**
2. A structured list of all visible text. For each text element, include start timestamp, end timestamp, the exact text content, the appearance characteristics. If no text appears, explicitly state so.
3. A structured speech-to-text transcription, include speaker (Corresponding to the character or voice-over in Section 1, including their accent and tone), exact spoken content, start timestamp, end timestamp, and speaking state (prosody, emotion, and style). If no speech appears, explicitly state so.
Aside from these three required sections, you are free to organize any additional content in any way you find helpful. This additional content can include global information about the entire video or localized information about specific moments. You may choose the topic of this extra content freely.
Output Format:
## Storyline
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio and video details.>
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio and video details.>
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio and video details.>
...
## Visible Text
<xx:xx.xxx> - <xx:xx.xxx>
“<element>”: <appearance>
“<element>”: <appearance>
<xx:xx.xxx> - <xx:xx.xxx>
“<element>”: <appearance>
“<element>”: <appearance>
“<element>”: <appearance>
<xx:xx.xxx> - <xx:xx.xxx>
“<element>”: <appearance>
...
## Speakers and Transcript
Speaker profiles:
<speaker> - <profile>
<speaker> - <profile>
<speaker> - <profile>
...
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: “<content>”
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: “<content>”
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: “<content>”
...
## <another section>
<paragraphs>
## <another section>
<paragraphs>
...
Recommended Audio Structured-Description Prompt
Provide a detailed description of the audio.
It should explicitly include two sections:
1. A structured chronological storyline of **every noticeable audio details**
2. A structured speech-to-text transcription, include speaker (Corresponding to the character or voice-over in Section 1, including their accent and tone), exact spoken content, start timestamp, end timestamp, and speaking state (prosody, emotion, and style). If no speech appears, explicitly state so.
Aside from these two required components, you are free to organize any additional content in any way you find helpful. This additional content can include global information about the entire audio or localized information about specific moments. You may choose the topic of this extra content freely.
Output Format:
## Storyline
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio details.>
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio details.>
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio details.>
...
...
## Speakers and Transcript
Speaker profiles:
<speaker> - <profile>
<speaker> - <profile>
<speaker> - <profile>
...
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: “<content>”
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: “<content>”
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: “<content>”
...
## <another section>
<paragraphs>
## <another section>
<paragraphs>
...
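
Because both templates request `## ` section headings and timestamped blocks, the model's response can be post-processed with a small parser. The sketch below is a hedged illustration rather than an official utility: it assumes timestamps are emitted as `mm:ss.mmm - mm:ss.mmm` (with or without the angle brackets shown in the templates), and the helper names `split_sections` and `parse_timed_blocks` are hypothetical. Lines that precede the first timestamp in a section, such as the speaker profiles, are ignored by `parse_timed_blocks`.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches a line such as "00:03.250 - 00:07.900" or "<00:03.250> - <00:07.900>".
SPAN_RE = re.compile(r"^<?(\d+):(\d{2}\.\d{3})>?\s*-\s*<?(\d+):(\d{2}\.\d{3})>?$")


def to_seconds(minutes: str, seconds: str) -> float:
    """Convert the two parts of an mm:ss.mmm timestamp to seconds."""
    return int(minutes) * 60 + float(seconds)


def split_sections(caption: str) -> Dict[str, str]:
    """Map each '## <title>' heading to the text that follows it."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in caption.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def parse_timed_blocks(section_text: str) -> List[Tuple[float, float, str]]:
    """Return (start_s, end_s, text) for every timestamped block in a section."""
    blocks: List[Tuple[float, float, List[str]]] = []
    for line in section_text.splitlines():
        match = SPAN_RE.match(line.strip())
        if match:
            start = to_seconds(match.group(1), match.group(2))
            end = to_seconds(match.group(3), match.group(4))
            blocks.append((start, end, []))
        elif blocks:
            # Any other line belongs to the most recent timestamped block.
            blocks[-1][2].append(line)
    return [(s, e, "\n".join(body).strip()) for s, e, body in blocks]
```

A typical use would be `parse_timed_blocks(split_sections(response)["Storyline"])`, which yields one tuple per storyline segment.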
Performance
| Model | Omni-Cloze |
|---|---|
| Gemini-3.1 Pro | 57.2 |
| Qwen3.5-Omni-Flash | 63.0 |
| Qwen3.5-Omni-Plus | 64.8 |
Qwen3.5-Omni performs strongly on Omni-Cloze: the Flash variant scores 63.0 and the Plus variant 64.8, compared with 57.2 for Gemini-3.1 Pro. This highlights Qwen3.5-Omni's improved ability to produce fine-grained audio-visual captions that align with challenging cloze-style questions.
Omni Detailed Captioning Benchmark: Omni-Cloze
Guides
Omni-Cloze frames detailed captioning evaluation as a cloze-style multiple-choice proxy task. It is a unified benchmark for evaluating detailed captioning across audio-only, visual-only, and audio-visual settings. The dataset spans 9 main domains and 47 sub-categories covering diverse topics such as education, entertainment, sports, news, science, and lifestyle, totaling 2k video clips and 70k fine-grained cloze blanks.

Quick Start
1. Prepare Video Data
The video dataset is split into several tarball parts. Extract every part to the videos/ directory:
for f in videos.part*.tar; do tar -xvf "$f"; done
2. Prepare Inference Results
The core metadata file is omni_cloze.jsonl, which contains 2,320 audio-visual files and their corresponding cloze questions.
To evaluate your model, you must first run inference and save the generated descriptions into a new field named predicted_caption within the JSONL file. Each line in your input file should follow this structure:
{
"uuid": 1,
"video_path": "./videos/0000001.mp4",
"predicted_caption": "Your model's detailed description of the audio and visual content goes here...",
"...": [...]
}
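
Adding the `predicted_caption` field can be scripted in a few lines. The following is a minimal sketch, assuming each line of the metadata file is one JSON object with a `video_path` key as in the structure above; `add_predictions` and `caption_fn` are hypothetical names, and `caption_fn` stands in for whatever inference call your model exposes.

```python
import json
from typing import Callable


def add_predictions(src_path: str, dst_path: str,
                    caption_fn: Callable[[str], str]) -> int:
    """Copy a JSONL file, adding a `predicted_caption` field to every record."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # `caption_fn` is a placeholder for your own model inference call.
            record["predicted_caption"] = caption_fn(record["video_path"])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

All other fields are passed through unchanged, so the cloze questions stay attached to each record.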
3. Run Evaluation
The evaluation process uses an LLM to map your detailed captions to the specific cloze blanks.
# 1. Set API environment variables
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
# 2. Specify the input and output paths; the input file must store the model captions in the "predicted_caption" field
input_file="your-input-file-here.jsonl"
output_file="your-output-file-here.jsonl"
# 3. Run the evaluation script
python generate_prediction.py --input "$input_file" --output "$output_file" --workers 100
# 4. Run the statistics script
python compute_acc.py --input "$output_file" --show-subcategory
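
For intuition about the final statistics step, the sketch below shows how overall and per-sub-category accuracy could be computed from answered cloze blanks. It is an assumption-laden illustration, not the logic of `compute_acc.py`: the record keys `subcategory`, `answer`, and `prediction` are hypothetical, and the actual output schema of `generate_prediction.py` may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def cloze_accuracy(items: Iterable[Mapping[str, str]]) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and accuracy per sub-category, both in percent."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for item in items:
        # Hypothetical keys: "subcategory", "answer", "prediction".
        key = item["subcategory"]
        totals[key] += 1
        hits[key] += int(item["prediction"] == item["answer"])
    n = sum(totals.values())
    overall = 100.0 * sum(hits.values()) / n if n else 0.0
    per_sub = {k: 100.0 * hits[k] / totals[k] for k in totals}
    return overall, per_sub
```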
Citation
If you find our data pipeline, models, or the benchmark useful, please consider giving a star and a citation, thanks!
@article{omni-captioner,
title={Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception},
author={Ma, Ziyang and Xu, Ruiyang and Xing, Zhenghao and Chu, Yunfei and Wang, Yuxuan and He, Jinzheng and Xu, Jin and Heng, Pheng-Ann and Yu, Kai and Lin, Junyang and others},
journal={Proc. ICLR},
year={2026}
}