Evaluating VideoMind

September 21, 2025

🛠️ Environment Setup

Please refer to TRAIN.md for setting up the environment.

📚 Checkpoint Preparation

Download the base models and VideoMind checkpoints, and place them into the model_zoo folder.

VideoMind
└─ model_zoo
   ├─ Qwen2-VL-2B-Instruct
   ├─ Qwen2-VL-7B-Instruct
   ├─ VideoMind-2B
   ├─ VideoMind-7B
   └─ VideoMind-2B-FT-QVHighlights
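Before launching an evaluation, it can help to confirm that the folders listed above are actually in place. The following is a minimal sketch, not part of the VideoMind codebase: the function name `missing_checkpoints` and the default `model_zoo` path (relative to the repository root) are assumptions made for illustration.

```python
from pathlib import Path

# Folder names copied from the layout above.
EXPECTED_CHECKPOINTS = [
    "Qwen2-VL-2B-Instruct",
    "Qwen2-VL-7B-Instruct",
    "VideoMind-2B",
    "VideoMind-7B",
    "VideoMind-2B-FT-QVHighlights",
]


def missing_checkpoints(model_zoo="model_zoo", expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint folders that are absent from model_zoo.

    Hypothetical helper: the default path assumes the script is run from
    the repository root.
    """
    root = Path(model_zoo)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from model_zoo:", ", ".join(missing))
    else:
        print("All expected checkpoint folders are present.")
```

If you only plan to evaluate one model size, pass a shorter `expected` list so that the unused checkpoints are not reported as missing.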

📦 Dataset Preparation

Download the desired datasets / benchmarks from Hugging Face, extract the videos, and place them into the data folder. The processed files should be organized in the following structure (taking charades_sta as an example).

VideoMind
└─ data
   └─ charades_sta
      ├─ videos_3fps_480_noaudio
      ├─ durations.json
      ├─ charades_sta_train.txt
      └─ charades_sta_test.txt
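A quick sanity check of the extracted dataset can catch an incomplete download early. The sketch below is illustrative only: `summarize_dataset` is a hypothetical helper, the entry names are the ones shown for charades_sta above, and because the schema of durations.json is not described here, the code assumes only that its top level is a JSON object or array and reports the number of entries.

```python
import json
from pathlib import Path

# Entry names copied from the charades_sta layout above.
EXPECTED_ENTRIES = [
    "videos_3fps_480_noaudio",
    "durations.json",
    "charades_sta_train.txt",
    "charades_sta_test.txt",
]


def summarize_dataset(dataset_dir):
    """Report missing entries and basic counts for a dataset folder."""
    root = Path(dataset_dir)
    summary = {"missing": [n for n in EXPECTED_ENTRIES if not (root / n).exists()]}

    # Count regular files in the video folder, if it exists.
    video_dir = root / "videos_3fps_480_noaudio"
    if video_dir.is_dir():
        summary["num_video_files"] = sum(1 for p in video_dir.iterdir() if p.is_file())

    # Assumption: the top level of durations.json is an object or an array.
    durations = root / "durations.json"
    if durations.is_file():
        with durations.open() as f:
            summary["num_duration_entries"] = len(json.load(f))

    return summary
```

Calling `summarize_dataset("data/charades_sta")` from the repository root returns a dictionary whose `missing` list should be empty when the folder matches the structure above. Other datasets use different file names, so `EXPECTED_ENTRIES` would need to be adapted for them.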

🔮 Start Evaluation

Multi-Process Inference (one GPU / NPU per process)

Use the following commands to evaluate VideoMind on different benchmarks. By default, the samples are distributed across 8 processes (one device each) to speed up evaluation. This mode requires at least 32 GB of memory per device.

# Evaluate VideoMind (2B / 7B) on benchmarks other than QVHighlights
bash scripts/evaluation/eval_auto_2b.sh <dataset> [<split>]
bash scripts/evaluation/eval_auto_7b.sh <dataset> [<split>]

# Evaluate VideoMind (2B) on QVHighlights
bash scripts/evaluation/eval_qvhighlights.sh

Here, <dataset> can be replaced with one of the following dataset names:

  • Grounded VideoQA: cgbench, rextime, nextgqa, qa_ego4d
  • Video Temporal Grounding: charades_sta, activitynet_captions, tacos, ego4d_nlq, activitynet_rtl
  • General VideoQA: videomme, mlvu, lvbench, mvbench, longvideobench, star

The optional argument <split> can be either valid or test, and defaults to test.
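When running several benchmarks in a row, a small wrapper can reject a mistyped dataset name or split before any model is loaded. This is a sketch under stated assumptions: `build_eval_command` is a hypothetical helper, the dataset names and script paths are taken from this page, and the code does not know whether every dataset actually provides both splits.

```python
# Dataset names copied from the lists above, grouped by task.
DATASETS = {
    "Grounded VideoQA": ["cgbench", "rextime", "nextgqa", "qa_ego4d"],
    "Video Temporal Grounding": [
        "charades_sta", "activitynet_captions", "tacos",
        "ego4d_nlq", "activitynet_rtl",
    ],
    "General VideoQA": [
        "videomme", "mlvu", "lvbench", "mvbench", "longvideobench", "star",
    ],
}
SPLITS = ("valid", "test")
MODEL_SIZES = ("2b", "7b")


def build_eval_command(dataset, model_size="2b", split="test"):
    """Return the argument list for the multi-process evaluation script.

    Hypothetical helper: it only validates names against the lists on this
    page and does not check that the chosen split exists for the dataset.
    """
    known = {name for names in DATASETS.values() for name in names}
    if dataset not in known:
        raise ValueError(f"Unknown dataset: {dataset!r}")
    if split not in SPLITS:
        raise ValueError(f"Unknown split: {split!r}")
    if model_size not in MODEL_SIZES:
        raise ValueError(f"Unknown model size: {model_size!r}")
    return ["bash", f"scripts/evaluation/eval_auto_{model_size}.sh", dataset, split]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root. QVHighlights is intentionally rejected here because it is evaluated with its own script.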

By default, the inference outputs and evaluation metrics are saved to the outputs_2b or outputs_7b folder, depending on the model size.
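To make the multi-process mode more concrete, the sketch below shows one common way to split a list of samples across 8 workers. How the VideoMind scripts actually partition the samples is not described here, so the interleaved (round-robin) scheme and the name `shard_samples` are assumptions for illustration only.

```python
def shard_samples(samples, num_shards=8):
    """Split samples into num_shards interleaved chunks (round-robin).

    Illustrative only: shard i receives samples i, i + num_shards, ...
    so shard sizes differ by at most one.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [samples[i::num_shards] for i in range(num_shards)]


if __name__ == "__main__":
    shards = shard_samples(list(range(20)), num_shards=8)
    for rank, shard in enumerate(shards):
        # Each worker would process only its own shard on its own device.
        print(f"process {rank}: {shard}")
```

With 20 samples and 8 shards, the first four shards hold three samples each and the remaining four hold two, so no process sits idle while another handles a much larger share.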

Multi-Device Inference (multiple GPUs / NPUs in one process)

You can also distribute the model across multiple devices to save memory. In this mode, only one process is launched and the model is loaded across 8 devices.

bash scripts/evaluation/eval_dist_auto_2b.sh <dataset> [<split>]
bash scripts/evaluation/eval_dist_auto_7b.sh <dataset> [<split>]