
January 27, 2026

VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

Ye Liu¹†, Kevin Qinghong Lin²†, Chang Wen Chen¹, Mike Zheng Shou²

¹The Hong Kong Polytechnic University, ²Show Lab, National University of Singapore

TL;DR: Pioneer DeepSearch-like Video Understanding.

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes: breaking down tasks, localizing and verifying moments, and synthesizing answers. This progressive strategy addresses the unique challenges of temporal-grounded reasoning.
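The control flow described above can be pictured with a minimal Python sketch. Everything here is hypothetical and for illustration only: the `run_pipeline` function, the `Trace` container, the role name "answerer", and the toy callables are assumptions, not the VideoMind API. In the real system each role would be backed by a model rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class Trace:
    plan: List[str]
    candidates: List[Span]
    chosen: Span
    answer: str


def run_pipeline(
    question: str,
    duration: float,
    planner: Callable[[str], List[str]],
    grounder: Callable[[str, float], List[Span]],
    verifier: Callable[[str, Span], float],
    answerer: Callable[[str, Span], str],
) -> Trace:
    """Plan, localize candidate moments, verify them, then answer."""
    plan = planner(question)
    chosen: Span = (0.0, duration)  # fall back to the whole video
    candidates: List[Span] = []
    if "ground" in plan:
        candidates = grounder(question, duration)
        if candidates and "verify" in plan:
            # Keep the candidate moment with the highest verifier score.
            chosen = max(candidates, key=lambda s: verifier(question, s))
        elif candidates:
            chosen = candidates[0]
    return Trace(plan, candidates, chosen, answerer(question, chosen))


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_planner(question: str) -> List[str]:
    return ["ground", "verify", "answer"]


def toy_grounder(question: str, duration: float) -> List[Span]:
    return [(0.0, 10.0), (20.0, 30.0)]


def toy_verifier(question: str, span: Span) -> float:
    return 1.0 if span[0] >= 20.0 else 0.2


def toy_answerer(question: str, span: Span) -> str:
    return f"answer from {span[0]:.0f}s to {span[1]:.0f}s"


trace = run_pipeline(
    "What happens after the door opens?", 60.0,
    toy_planner, toy_grounder, toy_verifier, toy_answerer,
)
print(trace.chosen, trace.answer)
```

Passing the roles in as callables keeps the sketch independent of any particular model. When the plan skips grounding, the answer is produced from the full video.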

🔥 News

  • 2026.01.27 🌴 Our full paper has been accepted to ICLR 2026.
  • 2025.09.23 🧠 VideoMind has been accepted to the LAW Workshop @ NeurIPS 2025 as a Spotlight.
  • 2025.04.05 📊 See BENCHMARK.md for evaluation results of VideoMind on public benchmarks.
  • 2025.03.28 🚀 VideoMind-2B is ready on Hugging Face Spaces. Check it out!
  • 2025.03.21 ⭐️ Code, model, and dataset release.
  • 2025.03.17 🎉 Our tech report is available online.

🏆 VideoMind on Public Benchmarks

| Benchmark | Evaluation Results (2B/7B) |
| --- | --- |
| ZS CG-Bench (mini) | long-acc: 31.0/38.4, rec@IoU: 8.50/9.93, acc@IoU: 4.02/4.67 |
| ZS ReXTime (val) | mIoU: 24.83/27.61, Acc: 69.06/74.59, Acc@IoU: 17.26/20.20 |
| ZS NExT-GQA (test) | mIoU: 28.6/31.4, mIoP: 36.4/39.0, Acc@GQA: 25.2/28.2 |
| ZS DeVE-QA (val)* | mIoU: 26.3/30.1, mIoP: 49.9/51.9, Acc@GQA: 41.2/44.2 |
| ZS Charades-STA (test) | R@0.5: 51.1/59.1, R@0.7: 26.0/31.2, mIoU: 45.2/50.2 |
| ZS ActivityNet-Captions (val_2) | R@0.5: 26.5/30.3, R@0.7: 12.6/15.7, mIoU: 30.1/33.3 |
| FT QVHighlights (test) | R@0.5: 75.42/78.53, R@0.7: 59.35/61.09, mAP: 51.60/54.19 |
| FT TACoS (test) | R@0.5: 26.9/36.2, R@0.7: 15.5/21.4, mIoU: 27.4/34.4 |
| ZS Ego4D-NLQ (val) | R@0.5: 2.9/3.7, R@0.7: 1.2/1.7, mIoU: 4.7/5.4 |
| ZS ActivityNet-RTL (val) | P@0.5: 20.1/28.0, mIoU: 22.7/31.3 |
| ZS Video-MME (w/o subs) | All: 55.4/58.2, Long: 46.3/49.2 |
| ZS MLVU | M-Avg: 58.7/64.4 |
| ZS LVBench | Overall: 35.4/40.8 |
| ZS MVBench | Acc: 62.5/64.6 |
| ZS LongVideoBench | Acc: 48.8/56.3 |

ZS and FT refer to zero-shot and fine-tuned settings, respectively. * means third-party results.

See BENCHMARK.md for full evaluation results.
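For readers unfamiliar with the grounding metrics in the table, here is a minimal Python sketch of how temporal IoU, IoP, R@threshold, and mIoU are commonly computed for a single predicted span per query. The function names are illustrative assumptions, and individual benchmarks may apply their own protocols (for example, multiple ground-truth windows or top-k predictions), so treat BENCHMARK.md and EVAL.md as the reference.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def _intersection(pred: Span, gt: Span) -> float:
    # Length of the temporal overlap, clipped at zero for disjoint spans.
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two temporal spans."""
    inter = _intersection(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_iop(pred: Span, gt: Span) -> float:
    """Intersection over the predicted span's length."""
    length = pred[1] - pred[0]
    return _intersection(pred, gt) / length if length > 0 else 0.0


def recall_at(preds: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Percentage of queries whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average IoU over all queries, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(0.0, 10.0), (5.0, 15.0)]
print(recall_at(preds, gts, 0.5), recall_at(preds, gts, 0.7), mean_iou(preds, gts))
```

In this toy example the first prediction matches exactly (IoU 1.0) and the second overlaps by 5 of 15 seconds (IoU 1/3), so only one of the two queries counts as a hit at either threshold.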

🕹️ Gradio Demo

https://github.com/user-attachments/assets/a4d99c05-aa73-4ed9-a275-2362a201bfec

Play with our online demo or see DEMO.md for instructions on deploying it locally.

📦 VideoMind-SFT Dataset

We provide raw videos, compressed videos, and pre-processed annotations of 27 video grounding / QA datasets, including our VideoMind-SFT (481K) for training and multiple benchmarks for evaluation. We also release the datasets used during our early exploration (but not included in the final version) to facilitate future research.

The list of source datasets is shown below. See our dataset repo for more details.

Grounder (210K):

| Dataset | Source | Processed (Recommended) |
| --- | --- | --- |
| QVHighlights | Link | qvhighlights |
| DiDeMo | Link | didemo |
| TACoS | Link | tacos |
| QuerYD | Link | queryd |
| HiREST (Grounding) | Link | hirest |
| HiREST (Step Captioning) | Link | hirest |
| CosMo-Cap | Link | cosmo_cap |
| InternVid-VTime | Link | internvid_vtime |

Verifier (232K):

| Dataset | Source | Processed (Recommended) |
| --- | --- | --- |
| QVHighlights-Verify | Link | verifying, qvhighlights |
| DiDeMo-Verify | Link | verifying, didemo |
| TACoS-Verify | Link | verifying, tacos |

Planner (39K):

| Dataset | Source | Processed (Recommended) |
| --- | --- | --- |
| NExT-QA-Plan | Link | planning, nextqa |
| QVHighlights-Plan | Link | planning, qvhighlights |

Benchmarks:

| Dataset | Task | Source | Processed (Recommended) |
| --- | --- | --- | --- |
| CG-Bench | Grounded VideoQA | Link | cgbench |
| ReXTime | Grounded VideoQA | Link | rextime, activitynet, qvhighlights |
| NExT-GQA | Grounded VideoQA | Link | nextgqa |
| Charades-STA | VTG | Link | charades_sta |
| ActivityNet-Captions | VTG | Link | activitynet_captions, activitynet |
| QVHighlights | VTG | Link | qvhighlights |
| TACoS | VTG | Link | tacos |
| Ego4D-NLQ | VTG | Link | ego4d_nlq, ego4d |
| ActivityNet-RTL | VTG | Link | activitynet_rtl, activitynet |
| Video-MME | General VideoQA | Link | videomme |
| MLVU | General VideoQA | Link | mlvu |
| LVBench | General VideoQA | Link | lvbench |
| MVBench | General VideoQA | Link | mvbench |
| LongVideoBench | General VideoQA | Link | longvideobench |

The following datasets are not used in the final version of our project (some were used during early exploration), but we still share them to facilitate future research.

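| Dataset | Task | Training | Evaluation | Source | Processed (Recommended) |
| --- | --- | --- | --- | --- | --- |
| QaEgo4D | Grounded VideoQA | | | Link | qa_ego4d, ego4d |
| Ego4D-NaQ | VTG | | | Link | ego4d_naq, ego4d |
| Ego-TimeQA | VTG | | | Link | ego_timeqa, ego4d |
| Vid-Morp | VTG | | | Link | vid_morp |
| VideoXum | VTG (originally VS) | | | Link | videoxum |
| YouCook2 | VTG (originally DVC) | | | Link | youcook2 |
| STAR | VideoQA | | | Link | star, charades_sta |
| COIN | - | - | - | Link | coin |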

Notes:

  1. For some datasets (e.g., ReXTime), the annotations and videos are stored in different folders. Download every directory listed in the Processed column.
  2. Use the following commands to concatenate and extract video tar splits (e.g., videos.tar.gz.00, videos_3fps_480_noaudio.tar.gz.00):
# videos.tar.gz.00, videos.tar.gz.01
cat videos.tar.gz.* | tar -zxvf -

# videos_3fps_480_noaudio.tar.gz.00, videos_3fps_480_noaudio.tar.gz.01
cat videos_3fps_480_noaudio.tar.gz.* | tar -zxvf -
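Where `cat` and `tar` are not available (for example, on Windows), the same concatenate-and-extract step can be approximated in Python. This is a minimal sketch using only the standard library; the `ChainedReader` class and `extract_splits` function are hypothetical helpers, not part of the VideoMind codebase.

```python
import glob
import tarfile
from typing import List


class ChainedReader:
    """Read several files back to back as one byte stream (like `cat`)."""

    def __init__(self, paths: List[str]) -> None:
        self._paths = list(paths)
        self._index = 0
        self._current = None

    def read(self, size: int = -1) -> bytes:
        out = bytearray()
        while self._index < len(self._paths) and (size < 0 or len(out) < size):
            if self._current is None:
                self._current = open(self._paths[self._index], "rb")
            want = -1 if size < 0 else size - len(out)
            data = self._current.read(want)
            if data:
                out += data
            else:
                # Current part is exhausted: move on to the next one.
                self._current.close()
                self._current = None
                self._index += 1
        return bytes(out)

    def close(self) -> None:
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_splits(pattern: str, dest: str = ".") -> List[str]:
    """Extract a gzip-compressed tar archive stored as ordered split parts."""
    parts = sorted(glob.glob(pattern))  # .00, .01, ... in lexical order
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    reader = ChainedReader(parts)
    try:
        # "r|gz" reads the archive as a forward-only stream, so no seeking is needed.
        with tarfile.open(fileobj=reader, mode="r|gz") as tar:
            # Use the safer "data" extraction filter when this Python supports it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(path=dest, **kwargs)
    finally:
        reader.close()
    return parts
```

For instance, `extract_splits("videos_3fps_480_noaudio.tar.gz.*")` would unpack into the current directory, streaming the parts without first writing a merged archive to disk.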

🚀 Training

Our codebase supports training and evaluation on 27 video datasets and benchmarks, with the following features:

  • Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
  • Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
  • Customizing the base LLM and conversation templates
  • Monitoring the training process via TensorBoard / Wandb
  • Group sampling for mixed dataset training
  • Multi-process / multi-device evaluation on public benchmarks

See TRAIN.md for a quick start guide.
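As a rough illustration of the "group sampling for mixed dataset training" feature listed above, the Python sketch below builds batches in which every sample shares the same group label. This is only one plausible reading of the term and is an assumption; the `group_batches` helper is hypothetical and is not the sampler implemented in the codebase.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def group_batches(
    groups: Sequence[Hashable], batch_size: int, seed: int = 0
) -> List[List[int]]:
    """Return batches of sample indices where each batch has a single group."""
    rng = random.Random(seed)
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    batches: List[List[int]] = []
    for indices in by_group.values():
        rng.shuffle(indices)  # shuffle samples within a group
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    rng.shuffle(batches)  # interleave groups across training steps
    return batches


# Hypothetical group labels for a small mixed dataset.
labels = ["grounder"] * 5 + ["verifier"] * 3 + ["planner"] * 2
for batch in group_batches(labels, batch_size=2):
    print([labels[i] for i in batch])
```

Keeping each batch within one group means samples with similar prompts and targets are processed together, while shuffling the batch order still mixes the groups over the course of an epoch. The last batch of a group may be smaller than `batch_size`.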

🔮 Evaluation

See EVAL.md for details about evaluating VideoMind on public benchmarks.

📖 Citation

Please kindly cite our paper if you find this project helpful.

@inproceedings{liu2026videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
