October 20, 2025
VideoExplorer: Thinking with Video for Long-Form Understanding
Demo
Introduction
VideoExplorer is a novel framework for long-video understanding that moves beyond single-pass reasoning. Inspired by the "thinking with video" principle, it performs faithful, efficient, and interpretable reasoning by dynamically exploring video content.
News
2025.10.16 - We released VideoExplorer, the newest version of VideoDeepResearch. It is smaller and cheaper, yet just as effective at long-video understanding. See our updated paper for details.
2025.06.10 - We released the first version of VideoDeepResearch.
Overview
Long-video understanding is challenging. Existing methods often sacrifice detail by downsampling or rely on task-agnostic representations, limiting their perception.
VideoExplorer solves this by intertwining planning, temporal grounding, and scalable perception into a coherent, iterative loop:
- Formulates a sub-question.
- Locates the relevant moments.
- Performs task-oriented, fine-grained perception.
- Repeats until the final answer is reached.
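The loop above can be pictured with the following minimal Python sketch. It is not the repository's implementation: `Step`, `explore`, and the three callables (`plan`, `ground`, `perceive`) are hypothetical names standing in for the planner, the temporal grounder, and the perception model, and the `max_steps` cap is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One recorded iteration of the exploration loop."""
    sub_question: str
    window: Tuple[float, float]  # (start_sec, end_sec) of the grounded moment
    observation: str


def explore(
    question: str,
    plan: Callable[[str, List[Step]], Tuple[Optional[str], Optional[str]]],
    ground: Callable[[str], Tuple[float, float]],
    perceive: Callable[[str, Tuple[float, float]], str],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run plan -> ground -> perceive until the planner returns an answer.

    `plan` returns (sub_question, None) to keep exploring, or (None, answer)
    to stop. All three callables are hypothetical stand-ins.
    """
    trajectory: List[Step] = []
    for _ in range(max_steps):
        sub_question, answer = plan(question, trajectory)
        if answer is not None:
            return answer, trajectory
        window = ground(sub_question)                 # locate the relevant moment
        observation = perceive(sub_question, window)  # look closely at that moment only
        trajectory.append(Step(sub_question, window, observation))
    return None, trajectory  # step budget exhausted without an answer
```

Because every iteration is appended to `trajectory`, the returned list doubles as the traceable reasoning record described under Key Features.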
Key Features
- Iterative Reasoning: Dynamically explores video content instead of relying on a static context.
- Task-Oriented Perception: Focuses computational resources on relevant moments, enabling scalable analysis.
- Interpretable Trajectories: Each step of the reasoning process is transparent and traceable.
Framework & Training
To overcome the lack of training data for long-video understanding (LVU), we constructed a high-quality dataset using difficulty-adaptive sampling. Our training pipeline consists of:
- Supervised Trajectory Initialization
- Trajectory-level Preference Optimization
This two-stage approach encourages adaptive temporal grounding and iterative information integration guided by downstream rewards.
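For readers unfamiliar with preference optimization, the sketch below evaluates the standard DPO objective on whole-trajectory log-probabilities. It shows the generic DPO loss, which may differ from the exact objective used in this project; the function name, the default `beta=0.1`, and the choice to score a trajectory by the sum of its token log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    Each argument is the log-probability of a full trajectory (assumed here
    to be the sum of its token log-probabilities) under the policy or the
    frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the policy assigns relatively more probability to the preferred trajectory than the reference model does, and rises in the opposite case.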
Results
Extensive evaluations on popular long-video benchmarks show that VideoExplorer achieves significant performance advantages over existing baselines, demonstrating its robustness, adaptability, and efficiency.
Quick Start
1. Clone & Install
# Clone repository
git clone https://github.com/yhy-2000/VideoDeepResearch.git
cd VideoDeepResearch
# Install dependencies
pip install -r requirements.txt
Project Layout:
VideoDeepResearch/
├── requirements.txt     # Python dependencies
├── eval/                # Code for evaluating benchmarks
├── train/               # Code for supervised finetuning (SFT) and trajectory-based direct preference optimization (TDPO)
├── asset/               # Assets used in the demo
├── data/
│   ├── videos/          # Raw video files
│   ├── clips/           # Generated video clips
│   └── dense_frames/    # Extracted key frames
└── README.md            # This documentation
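As an optional convenience, the following sketch creates the `data/` subdirectories shown in the layout if they are missing. The helper name `ensure_data_dirs` is hypothetical, and this README does not say whether the evaluation scripts create these folders themselves, so treat it as a setup aid under that assumption.

```python
from pathlib import Path
from typing import List

# Subdirectory names taken from the project layout above
DATA_SUBDIRS = ("videos", "clips", "dense_frames")


def ensure_data_dirs(repo_root: str) -> List[Path]:
    """Create data/videos, data/clips and data/dense_frames under repo_root."""
    created = []
    for name in DATA_SUBDIRS:
        path = Path(repo_root) / "data" / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```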
Launch Demo
bash eval/demo.sh
Evaluation on Benchmarks
bash eval/eval.sh
Training
Our training dataset is available at https://huggingface.co/datasets/avery00/VideoExplorer-Dataset/tree/main. To set up:
- Place dpo_marathon.json in train/LLaMA-Factory-dpo/data.
- Place the remaining two files in train/LLaMA-Factory-sft/data.
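The placement rule above can also be scripted. This is a minimal sketch, not part of the repository: `place_dataset_files` is a hypothetical helper, and it assumes `download_dir` contains only the three dataset files (anything else found there would also be copied into the SFT folder).

```python
import shutil
from pathlib import Path
from typing import Dict

DPO_FILE = "dpo_marathon.json"


def place_dataset_files(download_dir: str, repo_root: str) -> Dict[str, Path]:
    """Copy downloaded dataset files into the two LLaMA-Factory data folders.

    `dpo_marathon.json` goes to the DPO tree; every other regular file in
    `download_dir` goes to the SFT tree.
    """
    train = Path(repo_root) / "train"
    dpo_dir = train / "LLaMA-Factory-dpo" / "data"
    sft_dir = train / "LLaMA-Factory-sft" / "data"
    placed = {}
    for src in sorted(Path(download_dir).iterdir()):
        if not src.is_file():
            continue  # skip subdirectories
        dest_dir = dpo_dir if src.name == DPO_FILE else sft_dir
        dest_dir.mkdir(parents=True, exist_ok=True)
        placed[src.name] = Path(shutil.copy2(src, dest_dir / src.name))
    return placed
```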
Environment Setting
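# Run from the repository root
mv train/LLaMA-Factory-sft train/LLaMA-Factory-main
cd train/LLaMA-Factory-main
pip install -e ".[torch,metrics]" --no-build-isolation
# Return to the repository root before restoring the directory name
cd ../..
mv train/LLaMA-Factory-main train/LLaMA-Factory-sft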
Supervised Finetuning
# Run from the repository root
cd train
# Swap in the SFT copy of LLaMA-Factory
mv LLaMA-Factory-sft LLaMA-Factory-main
# Fine-tune the planner
bash sft_planner.sh
# Fine-tune the temporal grounder
bash sft_temporal_grounding_agent.sh
# Restore the original directory name and return to the repository root
mv LLaMA-Factory-main LLaMA-Factory-sft
cd ..
Trajectory-based Direct Preference Optimization
# Run from the repository root; swap in the DPO copy of LLaMA-Factory
mv train/LLaMA-Factory-dpo train/LLaMA-Factory-main
# Trajectory-based DPO
bash train/dpo_planner.sh
# Restore the original directory name
mv train/LLaMA-Factory-main train/LLaMA-Factory-dpo
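The training steps rename a LLaMA-Factory copy to `LLaMA-Factory-main`, run a script, and rename it back. If a script fails midway, the directory can be left under the wrong name. The sketch below wraps that rename-run-restore pattern in a context manager so the original name is restored even on error. `activated` and `run_dpo` are hypothetical helpers, not part of the repository; only the directory names and the `bash train/dpo_planner.sh` command come from the steps above.

```python
import contextlib
import subprocess
from pathlib import Path
from typing import Iterator


@contextlib.contextmanager
def activated(variant: Path, active: Path) -> Iterator[Path]:
    """Temporarily rename `variant` to `active`, restoring it on exit."""
    if active.exists():
        raise FileExistsError(f"{active} already exists; restore it first")
    variant.rename(active)
    try:
        yield active
    finally:
        active.rename(variant)  # runs even if the body raises


def run_dpo(repo_root: Path) -> None:
    """Hypothetical wrapper around the DPO step, run from the repository root."""
    train = repo_root / "train"
    with activated(train / "LLaMA-Factory-dpo", train / "LLaMA-Factory-main"):
        subprocess.run(["bash", "train/dpo_planner.sh"], cwd=repo_root, check=True)
```

The up-front `FileExistsError` check is a deliberate choice: it refuses to start when a previous run has left `LLaMA-Factory-main` in place, instead of silently training with the wrong copy.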
Contact
Encounter issues or have questions? Reach out to:
H.Y. Yuan Email: hyyuan@ruc.edu.cn
Citation
If you find this work helpful, please cite our paper:
@misc{yuan2025thinkvideosagenticlongvideo,
title={Think With Videos For Agentic Long-Video Understanding},
author={Huaying Yuan and Zheng Liu and Junjie Zhou and Hongjin Qian and Yan Shu and Nicu Sebe and Ji-Rong Wen and Zhicheng Dou},
year={2025},
eprint={2506.10821},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.10821},
}