CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models ๐ŸŽฌ

February 12, 2026 ยท View on GitHub

Paper HuggingFace License

CrossVid Logo

๐ŸŒŸ Introduction

CrossVid is the first comprehensive benchmark for evaluating Cross-Video Reasoning (CVR) in Multimodal Large Language Models (MLLMs). Unlike existing benchmarks focusing on single-video analysis, CrossVid challenges models to simultaneously understand, aggregate, and compare information across multiple videos.

Key Highlights:

  • ๐ŸŽฏ First systematic CVR benchmark with hierarchical task design
  • ๐Ÿ“Š 9,015 QA pairs across 5,331 videos from 6 diverse datasets
  • ๐Ÿ—๏ธ 10 specific tasks spanning 4 dimensions (Comparative, Temporal, Multi-View, Free-Form)
  • ๐ŸŒ 32 genres covering real-world scenarios
  • โฑ๏ธ Long-context: Average 770 seconds per query
  • ๐Ÿ“ Multiple formats: Single-choice, multiple-choice, and open-ended questions
Genre Distribution Task Hierarchy

๐Ÿ“ข News

  • [2025-11] ๐ŸŽ‰ CrossVid accepted by AAAI 2026!
  • [2025-11] ๐Ÿ“Š Dataset available on HuggingFace.
  • [2025-11] ๐Ÿ”ง Evaluation code uploaded.

๐ŸŽฏ Benchmark Overview

Task Dimensions

๐Ÿ“Š Comparative Analysis - Behavioral Understanding (BU), Narrative Comprehension (NC), Culinary Comparison (CC), and Procedural Eror Analysis (PEA)

โฑ๏ธ Temporal Understanding - Plot Inference (PI), Functional Step Alignment (FSA), Procedural Step Sequencing (PSS)

๐Ÿ‘๏ธ Multi-View Reasoning - Multi-view Spatial Reasoning (MSR) and Multi-view Object Counting (MOC)

โœ๏ธ Free-Form QA - Comparative Culinary QA (CCQA)

Data Sources & Statistics

Videos from 6 public datasets: Animal Kingdom ๐Ÿฆ | MovieChat-1K ๐ŸŽฌ | YouCook2 ๐Ÿ‘จโ€๐Ÿณ | VisDrone ๐Ÿš | Charades ๐Ÿ  | Assembly101 ๐Ÿ”ง.

We thank the creators of these valuable datasets for providing the foundational video resources.

MetricValueMetricValue
๐Ÿ“น Videos5,331๐ŸŽญ Genres32
โ“ QA Pairs9,015๐ŸŽฏ Tasks10
โฑ๏ธ Avg Video Length215s๐Ÿ“Š Avg Query Duration770s

๐Ÿ“ธ Examples

CrossVid Examples

Representative examples showing different cross-video reasoning tasks


๐Ÿ—๏ธ Annotation Pipeline

Evaluation Pipeline

Process: Frame Extraction (Qwen2.5-VL-72B) โ†’ QA Generation (DeepSeek-R1) โ†’ Manual Filtration โ†’ Refinement โ†’ Quality Control


๐Ÿš€ Quick Start

We provide a evaluation script named by the task name that supports parallel inference using OpenAI-compatible APIs (e.g., vLLM, LMDeploy, or SGLang).

1. Preparation

Due to copyright restrictions, please download the Charades and Animal Kingdom datasets from their official repositories.
After downloading, merge all original videos and place them under:

videos/behavior

Before running the evaluation, download annotations and other videos from HuggingFace and clone this repository. Ensure your environment are set up correctly:

Directory Structure
Ensure your project directory looks like this:

CrossVid/
โ”‚โ”€โ”€ uav/                 # VisDrone
โ”‚   โ”œโ”€โ”€ bbox/
โ”‚   โ””โ”€โ”€ frames/
โ”œโ”€โ”€ videos/              # Folder containing video files
โ”‚   โ”œโ”€โ”€ assembly/        # Assembly101
โ”‚   โ”œโ”€โ”€ behavior/        # Charades & Animal Kingdom
โ”‚   โ”œโ”€โ”€ cook/            # YouCook2
โ”‚   โ””โ”€โ”€ movie/           # MovieChat-1K
โ”‚โ”€โ”€ QA/                  # Folder containing QA JSON files
โ”‚   โ”œโ”€โ”€ BU.json
โ”‚   โ”œโ”€โ”€ CC.json
โ”‚   โ”œโ”€โ”€ CCQA.json
โ”‚   โ”œโ”€โ”€ ...
|โ”€โ”€ eval/                # The evaluation scripts
โ”‚   โ”œโ”€โ”€ utils/
โ”‚   โ”œโ”€โ”€ BU.py
โ”‚   โ”œโ”€โ”€ CC.py
โ”‚   โ”œโ”€โ”€ ...
โ”‚   โ”œโ”€โ”€ score_CCQA.py
โ””โ”€โ”€ README.md

Python environment
Install the following required packages:

pip install openai opencv-python decord numpy

Enter root directory:

cd CrossVid

2. Run Evaluation

To evaluate a task, run the evaluation script with the following command. The script will process videos, perform inference via the API, and automatically calculate the accuracy. For example, you can evaluate task BU via:

python eval/BU.py \
    --model "your-model-name" \
    --video_root "videos" \
    --QA_path "QA/BU.json" \
    --save_path "results/BU_result.json" \
    --port 8000 \
    --threads 20

3. Arguments

ArgumentTypeDefaultDescription
--modelstrRequiredThe model name used for inference.
--QA_pathstrQA/BU.jsonPath to the input Question-Answer JSON file.
--video_rootstrvideosRoot directory containing the video files.
--save_pathstrRequiredPath where the inference results will be saved.
--portint8000The port number of your running API server.
--threadsint20Number of parallel threads for faster inference.
--framesint128Total number of frames to sample per inference.
--lengthint360The resolution length (long side) for frame resizing.

4. Output & Metrics

Upon completion, the script saves detailed results to the specified JSON file and prints the overall accuracy:

The performance of <model_name> on task BU is 0.654

5. Open-ended Evaluation

For open-ended tasks (e.g., CCQA), we employ an LLM-as-a-Judge approach to score responses based on Coverage and Correctness of key scoring points.

Remember to configure the API key/URL in eval/score_CCQA.py.

python eval/score_CCQA.py \
    --QA_path "QA/CCQA.json" \
    --answer_path "results/CCQA_result.json" \
    --save_path "results/CCQA_score.json"

๐Ÿ“Š Leaderboard

Benchmark Results

The following table shows the performance of 22 evaluated MLLMs on CrossVid dataset, ranked by Overall Average (O.Avg) score.

Model Leaderboard

RankModel#FramesO.AvgC.AvgT.AvgM.AvgCCQA
Closed-Source Models
๐Ÿฅ‡Gemini-2.5-Pro12850.454.756.028.759.8
๐ŸฅˆGPT-4.1<5045.247.646.738.444.6
๐Ÿฅ‰Doubao-1.5-VL-pro25644.353.836.134.750.1
4GPT-4o<5036.843.135.527.434.2
Open-Source Models
5GLM-4.1V-9B-Thinking25635.144.723.137.826.9
6Qwen2.5-VL-72B25634.442.129.223.541.2
7Qwen2.5-VL-32B25633.738.326.531.741.2
8MiMo-7B25628.331.223.033.622.0
9Kimi-VL-A3B-Thinking25628.233.417.932.729.2
10LLaVA-Video-72B12827.533.922.027.917.8
11LLaVA-OV-72B2427.527.929.330.514.6
12InternVL3-78B12825.833.115.628.123.2
13InternVL3-8B12825.626.120.340.79.7
14MiniCPM-O 2.612825.626.226.431.49.0
15ERNIE-4.5-VL-A3B44024.825.419.732.522.5
16Qwen2.5-Omni-7B6424.626.721.629.615.3
17InternVL3-38B12823.527.810.138.616.2
18Video-R1-7B25621.618.526.926.98.0
19Phi-3.5-vision6421.525.917.227.64.3
20Qwen2.5-VL-7B25618.319.320.016.812.0
21LongVA-7B-DPO25618.023.57.526.310.7
22VideoLLaMA3-7B18015.320.86.719.89.8

Note: An equal number of frames are sampled uniformly from each video and resized to 360px on the longer side.


Metrics Description

  • O.Avg: Overall average accuracy across all ten tasks
  • C.Avg: Average accuracy on Comparative Analysis tasks (BU, NC, CC, PEA)
  • T.Avg: Average accuracy on Temporal Understanding tasks (PI, FSA, PSS)
  • M.Avg: Average accuracy on Multi-view Reasoning tasks (MSR, MOC)
  • CCQA: Comparative Culinary QA accuracy
  • #Frames: Total number of input frames per query

Note: Bold numbers in each column indicate the best performance among models in that category.


๐Ÿ“„ License & Contact

License:

We do not own any copyrights of these videos. All video contents are from public datasets, and their copyrights belong to the original authors / dataset creators.

Contact:

Acknowledgements: Thanks to dataset authors and our expert annotators.


๐Ÿ“ Citation

If you find CrossVid useful for your research, please cite our paper:

@article{li2025crossvid,
  title={CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models},
  author={Li, Jingyao and Wang, Jingyun and Tan, Molin and Wang, Haochen and Yan, Cilin and Shi, Likun and Cai, Jiayin and Jiang, Xiaolong and Hu, Yao},
  journal={arXiv preprint arXiv:2511.12263},
  year={2025}
}

โญ Star us on GitHub! โญ

GitHub Stars