StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

May 16, 2025 · View on GitHub

StreamingBench Banner

StreamingBench evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks. 🌟


[NEW! 2025.05.15] 🔥: Seed1.5-VL achieved the all-model SOTA with a score of 82.80 on Proactive Output.

[NEW! 2025.03.17] ⭐: ViSpeeker achieved the open-source SOTA with a score of 61.60 on Omni-Source Understanding.

[NEW! 2025.01.14] 🚀: MiniCPM-o 2.6 achieved the streaming SOTA with a score of 66.01 on the overall benchmark.

[NEW! 2025.01.06] 🏆: Dispider achieved the streaming SOTA with a score of 53.12 on the overall benchmark.

[NEW! 2024.12.09] 🎉: InternLM-XComposer2.5-OmniLive achieved 73.79 on Real-Time Visual Understanding.


๐ŸŽž๏ธ Overview

As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before queries are made. This is far from the human ability to process and respond to video streams in real time, capturing the dynamic nature of multimedia content. To bridge this gap, StreamingBench introduces the first comprehensive benchmark for streaming video understanding in MLLMs.

Key Evaluation Aspects

  • 🎯 Real-time Visual Understanding: Can the model process and respond to visual changes in real time?
  • 🔊 Omni-source Understanding: Does the model integrate visual and audio inputs synchronously in real-time video streams?
  • 🎬 Contextual Understanding: Can the model comprehend the broader context within video streams?

Dataset Statistics

  • 📊 900 diverse videos
  • 📝 4,500 human-annotated QA pairs
  • ⏱️ Five questions per video at different timestamps

🎬 Video Categories

Video Categories

๐Ÿ” Task Taxonomy

Task Taxonomy

๐Ÿ“ Dataset Examples

https://github.com/user-attachments/assets/e6d1655d-ab3f-47a7-973a-8fd6c8962307

🔮 Evaluation Pipeline

Requirements

  • Python 3.x
  • ffmpeg-python
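The ffmpeg dependency is used for video preprocessing. As a minimal sketch of the kind of step involved (the actual preprocessing lives in scripts/preprocess.sh; the filter, paths, and frame rate here are illustrative assumptions), sampling frames from a clip with the ffmpeg CLI might look like:

```python
import subprocess

def build_frame_extraction_cmd(video: str, out_dir: str, fps: int = 1) -> list[str]:
    """Build an ffmpeg command that samples `fps` frames per second from `video`
    into numbered JPEG files under `out_dir`. (Illustrative, not the benchmark's
    actual preprocessing command.)"""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",          # sample at the given frame rate
        f"{out_dir}/frame_%04d.jpg",  # numbered output frames
    ]

# To actually run it (requires the ffmpeg binary on PATH):
# subprocess.run(build_frame_extraction_cmd("data/real/clip.mp4", "frames"), check=True)
```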

Data Preparation

  1. Download Dataset: Retrieve all necessary files from the StreamingBench Dataset.

  2. Decompress Files: Extract the downloaded files and organize them in the ./data directory as follows:

    StreamingBench/
    ├── data/
    │   ├── real/               # Unzip Real Time Visual Understanding_*.zip into this folder
    │   ├── omni/               # Unzip other .zip files into this folder
    │   ├── sqa/                # Unzip Sequential Question Answering_*.zip into this folder
    │   └── proactive/          # Unzip Proactive Output_*.zip into this folder
    
  3. Preprocess Data: Run the following command to preprocess the data:

    cd ./scripts
    bash preprocess.sh
    
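The decompression step above can be sketched in Python. The prefix-to-folder mapping follows the directory layout shown; treat the archive names and the catch-all routing to omni/ as assumptions about the downloaded files, not a definitive part of the pipeline:

```python
import zipfile
from pathlib import Path

# Archive-name prefixes mapped to target subfolders under ./data (per the layout above).
TARGETS = {
    "Real Time Visual Understanding": "real",
    "Sequential Question Answering": "sqa",
    "Proactive Output": "proactive",
}

def extract_archives(download_dir: str, data_dir: str = "data") -> list[str]:
    """Unzip each downloaded archive into its benchmark subfolder;
    archives that match no prefix go to omni/."""
    extracted = []
    for zip_path in sorted(Path(download_dir).glob("*.zip")):
        subdir = next(
            (v for k, v in TARGETS.items() if zip_path.name.startswith(k)), "omni"
        )
        dest = Path(data_dir) / subdir
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        extracted.append(f"{zip_path.name} -> {dest}")
    return extracted
```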

Model Preparation

Prepare your own model for evaluation by following the instructions provided here. This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.

Evaluation

Now you can run the benchmark:

bash eval.sh

This will run the benchmark and save the results to the specified output file. Then you can calculate the metrics using the following command:

bash stats.sh
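The metric computed by stats.sh is, at its core, per-task accuracy. A minimal sketch of that aggregation is below; the record fields ('task', 'prediction', 'answer') and the JSON-lines format are assumptions for illustration, and the schema actually emitted by eval.sh may differ:

```python
import json
from collections import defaultdict

def accuracy_by_task(results_path: str) -> dict[str, float]:
    """Compute per-task accuracy (%) from a results file with one JSON
    object per line. Field names are illustrative assumptions."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            total[rec["task"]] += 1
            # Case-insensitive match between predicted and gold option letters.
            if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
                correct[rec["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}
```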

🔬 Experimental Results

Performance of Various MLLMs on StreamingBench

  • 60 seconds of context preceding the query time (Main)
  • All Context (+ Long Context)
  • Comparison of Main Experiment vs. 60 Seconds of Video Context

Performance of Different MLLMs on the Proactive Output Task

"≤ x s" means that the answer is considered correct if the model's actual output time is within x seconds of the ground-truth time.

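The tolerance criterion above reduces to a simple check per example. A minimal sketch (function name and input format are illustrative, not the benchmark's actual scoring code):

```python
def proactive_accuracy(pred_times, gt_times, tolerance_s: float) -> float:
    """Percentage of responses whose output time falls within `tolerance_s`
    seconds of the ground-truth trigger time (the "<= x s" criterion)."""
    hits = sum(
        1 for p, g in zip(pred_times, gt_times) if abs(p - g) <= tolerance_s
    )
    return 100.0 * hits / len(gt_times)
```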

๐Ÿ“ Citation

@article{lin2024streaming,
  title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
  author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2411.03628},
  year={2024}
}