StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

May 16, 2025 · View on GitHub

StreamingBench Banner

StreamingBench evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks. 🌟


[NEW! 2025.05.15] 🔥: Seed1.5-VL achieved the all-model SOTA with a score of 82.80 on Proactive Output.

[NEW! 2025.03.17] ⭐: ViSpeeker achieved the open-source SOTA with a score of 61.60 on Omni-Source Understanding.

[NEW! 2025.01.14] 🚀: MiniCPM-o 2.6 achieved the streaming SOTA with a score of 66.01 on the overall benchmark.

[NEW! 2025.01.06] 🏆: Dispider achieved the streaming SOTA with a score of 53.12 on the overall benchmark.

[NEW! 2024.12.09] 🎉: InternLM-XComposer2.5-OmniLive achieved 73.79 on Real-Time Visual Understanding.


๐ŸŽž๏ธ Overview

As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before queries are made. This is far from the human ability to process and respond to video streams in real time, capturing the dynamic nature of multimedia content. To bridge this gap, StreamingBench introduces the first comprehensive benchmark for streaming video understanding in MLLMs.

Key Evaluation Aspects

  • 🎯 Real-time Visual Understanding: Can the model process and respond to visual changes in real time?
  • 🔊 Omni-source Understanding: Does the model integrate visual and audio inputs synchronously in real-time video streams?
  • 🎬 Contextual Understanding: Can the model comprehend the broader context within video streams?

Dataset Statistics

  • 📊 900 diverse videos
  • 📝 4,500 human-annotated QA pairs
  • ⏱️ Five questions per video at different timestamps

🎬 Video Categories

Video Categories

๐Ÿ” Task Taxonomy

Task Taxonomy

๐Ÿ“ Dataset Examples

https://github.com/user-attachments/assets/e6d1655d-ab3f-47a7-973a-8fd6c8962307

🔮 Evaluation Pipeline

Requirements

  • Python 3.x
  • ffmpeg-python
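The ffmpeg dependency is used for video preprocessing. As a minimal sketch of the kind of step involved (the actual preprocessing lives in scripts/preprocess.sh; the filter, paths, and frame rate here are illustrative assumptions), sampling frames from a clip with the ffmpeg CLI might look like:

```python
import subprocess

def build_frame_extraction_cmd(video: str, out_dir: str, fps: int = 1) -> list[str]:
    """Build an ffmpeg command that samples `fps` frames per second from `video`
    into numbered JPEG files under `out_dir`. (Illustrative, not the benchmark's
    actual preprocessing command.)"""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",          # sample at the given frame rate
        f"{out_dir}/frame_%04d.jpg",  # numbered output frames
    ]

# To actually run it (requires the ffmpeg binary on PATH):
# subprocess.run(build_frame_extraction_cmd("data/real/clip.mp4", "frames"), check=True)
```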

Data Preparation

  1. Download Dataset: Retrieve all necessary files from the StreamingBench Dataset.

  2. Decompress Files: Extract the downloaded files and organize them in the ./data directory as follows:

    StreamingBench/
    ├── data/
    │   ├── real/               # Unzip Real Time Visual Understanding_*.zip into this folder
    │   ├── omni/               # Unzip other .zip files into this folder
    │   ├── sqa/                # Unzip Sequential Question Answering_*.zip into this folder
    │   └── proactive/          # Unzip Proactive Output_*.zip into this folder
    
  3. Preprocess Data: Run the following command to preprocess the data:

    cd ./scripts
    bash preprocess.sh
    
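The decompression step above can be sketched in Python. The prefix-to-folder mapping follows the directory layout shown; treat the archive names and the catch-all routing to omni/ as assumptions about the downloaded files, not a definitive part of the pipeline:

```python
import zipfile
from pathlib import Path

# Archive-name prefixes mapped to target subfolders under ./data (per the layout above).
TARGETS = {
    "Real Time Visual Understanding": "real",
    "Sequential Question Answering": "sqa",
    "Proactive Output": "proactive",
}

def extract_archives(download_dir: str, data_dir: str = "data") -> list[str]:
    """Unzip each downloaded archive into its benchmark subfolder;
    archives that match no prefix go to omni/."""
    extracted = []
    for zip_path in sorted(Path(download_dir).glob("*.zip")):
        subdir = next(
            (v for k, v in TARGETS.items() if zip_path.name.startswith(k)), "omni"
        )
        dest = Path(data_dir) / subdir
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        extracted.append(f"{zip_path.name} -> {dest}")
    return extracted
```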

Model Preparation

Prepare your own model for evaluation by following the instructions provided here. This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.

Evaluation

Now you can run the benchmark:

bash eval.sh

This will run the benchmark and save the results to the specified output file. Then you can calculate the metrics using the following command:

bash stats.sh
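The metric computed by stats.sh is, at its core, per-task accuracy. A minimal sketch of that aggregation is below; the record fields ('task', 'prediction', 'answer') and the JSON-lines format are assumptions for illustration, and the schema actually emitted by eval.sh may differ:

```python
import json
from collections import defaultdict

def accuracy_by_task(results_path: str) -> dict[str, float]:
    """Compute per-task accuracy (%) from a results file with one JSON
    object per line. Field names are illustrative assumptions."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            total[rec["task"]] += 1
            # Case-insensitive match between predicted and gold option letters.
            if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
                correct[rec["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}
```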

🔬 Experimental Results

Performance of Various MLLMs on StreamingBench

  • 60 seconds of context preceding the query time (Main)
  • All Context (+ Long Context)
  • Comparison of Main Experiment vs. 60 Seconds of Video Context

Performance of Different MLLMs on the Proactive Output Task

"≤ x s" means that the answer is considered correct if the model's actual output time is within x seconds of the ground-truth time.

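The tolerance criterion above reduces to a simple check per example. A minimal sketch (function name and input format are illustrative, not the benchmark's actual scoring code):

```python
def proactive_accuracy(pred_times, gt_times, tolerance_s: float) -> float:
    """Percentage of responses whose output time falls within `tolerance_s`
    seconds of the ground-truth trigger time (the "<= x s" criterion)."""
    hits = sum(
        1 for p, g in zip(pred_times, gt_times) if abs(p - g) <= tolerance_s
    )
    return 100.0 * hits / len(gt_times)
```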

๐Ÿ“ Citation

@article{lin2024streaming,
  title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
  author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2411.03628},
  year={2024}
}