Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search
July 28, 2025
Overview
AutoCaption is a novel framework that employs Monte Carlo Tree Search (MCTS) to generate rich, diverse, and detailed video captions. The framework iteratively constructs high-quality video descriptions that thoroughly cover objects, actions, environments, and temporal dynamics.
MCTS-VCB is a fine-grained video captioning benchmark automatically constructed using AutoCaption, enabling comprehensive evaluation of Multimodal Large Language Models (MLLMs) on video understanding tasks.
Highlights
- AutoCaption Framework: Iteratively constructs high-quality video descriptions using MCTS, covering objects, actions, environments, and temporal dynamics
- MCTS-VCB Benchmark: Contains diverse, multi-faceted video captions for robust MLLM evaluation
- Comprehensive Evaluation: Benchmarks more than 20 MLLMs; Gemini-1.5-Pro achieves the top F1 score of 71.2%
- Fine-tuning Results: InternVL2.5-8B fine-tuned on AutoCaption data achieved:
  - +25.0% improvement on MCTS-VCB
  - +16.3% improvement on DREAM-1K
Installation
Prerequisites
- Python 3.8+
- CUDA-compatible GPU (recommended)
- 16GB+ GPU memory for Qwen2-VL-7B
Quick Install
git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -r requirements.txt
Development Install
git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -e .
Optional Dependencies
# For distributed processing
pip install mpi4py
# For experiment tracking
pip install wandb
Quick Start
1. Prepare Data
Create your input file in JSONL format:
{"video_name": "video1.mp4", "video_path": "/path/to/video1.mp4", "index": 0}
{"video_name": "video2.mp4", "video_path": "/path/to/video2.mp4", "index": 1}
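If your videos sit in a single directory, the input file can be generated rather than written by hand. The following is a minimal sketch, not part of the repository: `build_manifest` and `VIDEO_EXTENSIONS` are hypothetical names, and the extension list is an assumption you should adjust to your data. It emits only the three fields shown above.

```python
import json
from pathlib import Path

# Assumption: which file extensions count as videos; adjust as needed.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".webm"}


def build_manifest(video_dir, output_path):
    """Write one JSON object per video with video_name, video_path, index."""
    # Sort so that the index assigned to each video is reproducible.
    videos = sorted(
        p for p in Path(video_dir).iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        for index, path in enumerate(videos):
            record = {
                "video_name": path.name,
                "video_path": str(path.resolve()),
                "index": index,
            }
            f.write(json.dumps(record) + "\n")
    return len(videos)
```

For example, `build_manifest("videos", "data/videos.jsonl")` would write one line per video found in `videos/` and return the number of records written.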
2. Configure Settings
# Copy and modify configuration
cp config/config.yaml config/my_config.yaml
# Edit config/my_config.yaml as needed
3. Run AutoCaption
# Multi-GPU processing
python main.py \
--input_path data/videos.jsonl \
--output_path results/captions.jsonl \
--process_num 4 \
--gpu_nums_one_process 2 \
--max_rollout_times 25 \
--log_level INFO
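To launch the same run from Python (for instance, to sweep over `--max_rollout_times`), the command above can be assembled from a dictionary. This is a small sketch that uses only the flags shown in this README; `build_command`, `run`, and `OPTIONS` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys

# The same options as the shell command above.
OPTIONS = {
    "input_path": "data/videos.jsonl",
    "output_path": "results/captions.jsonl",
    "process_num": 4,
    "gpu_nums_one_process": 2,
    "max_rollout_times": 25,
    "log_level": "INFO",
}


def build_command(options, script="main.py"):
    """Turn a dict of options into an argument list for main.py."""
    command = [sys.executable, script]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


def run(options):
    """Run main.py from the repository root; raises if it exits non-zero."""
    subprocess.run(build_command(options), check=True)
```

Calling `run({**OPTIONS, "max_rollout_times": 10})` would start the same job with a smaller rollout budget.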
Repository Structure
autocaption/
├── generator/                # Model generators
│   └── qwen2vl_7b.py         # Qwen2-VL-7B wrapper
├── scripts/                  # Utility scripts
│   └── run_autocaption.sh    # Main execution script
├── main.py                   # Main entry point
├── mcts.py                   # MCTS algorithm implementation
├── util.py                   # Utility functions
├── requirements.txt          # Python dependencies
├── setup.py                  # Package setup
└── README.md                 # This file
MCTS Action Types
AutoCaption uses 6 different action types for comprehensive video analysis:
- ACTION1: Overall video description
- ACTION2: Detail-focused observation (weighted selection)
- ACTION3: Temporal perspective analysis
- ACTION4: Spatial perspective analysis
- ACTION5: Background description
- ACTION6: Camera movement analysis
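To make the role of the action types concrete, here is a generic MCTS loop (selection by UCT, expansion, evaluation, backpropagation) over the six actions listed above. It is an illustrative sketch under stated assumptions, not the code in `mcts.py`: the reward is a placeholder where a real system would score a generated caption, the heavier sampling weight for ACTION2 is only one possible reading of "weighted selection", and treating `max_rollout_times` as the number of search iterations is likewise an assumption.

```python
import math
import random

ACTIONS = ["ACTION1", "ACTION2", "ACTION3", "ACTION4", "ACTION5", "ACTION6"]
# Hypothetical weights: ACTION2 is sampled twice as often during expansion.
ACTION_WEIGHTS = {action: 1.0 for action in ACTIONS}
ACTION_WEIGHTS["ACTION2"] = 2.0


class Node:
    def __init__(self, action=None, parent=None):
        self.action = action
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def untried_actions(self):
        tried = {child.action for child in self.children}
        return [a for a in ACTIONS if a not in tried]

    def path(self):
        """Actions taken from the root down to this node."""
        node, actions = self, []
        while node.parent is not None:
            actions.append(node.action)
            node = node.parent
        return actions[::-1]


def uct(child, c=1.4):
    """Upper confidence bound: mean reward plus an exploration bonus."""
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore


def placeholder_reward(actions):
    """Stand-in for caption scoring: fraction of distinct action types used."""
    return len(set(actions)) / len(ACTIONS)


def search(max_rollout_times=25, max_depth=3, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(max_rollout_times):
        node, depth = root, 0
        # Selection: descend through fully expanded nodes by UCT.
        while depth < max_depth and not node.untried_actions():
            node = max(node.children, key=uct)
            depth += 1
        # Expansion: add one child, sampling an untried action by weight.
        if depth < max_depth:
            options = node.untried_actions()
            weights = [ACTION_WEIGHTS[a] for a in options]
            action = rng.choices(options, weights=weights, k=1)[0]
            child = Node(action, node)
            node.children.append(child)
            node = child
        # Evaluation: score the action sequence (placeholder for a caption score).
        reward = placeholder_reward(node.path())
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root
```

After `root = search(max_rollout_times=25)`, each child of `root` corresponds to one action type, and `child.visits` shows how the rollout budget was distributed among them.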
Citation
If you use AutoCaption or MCTS-VCB in your research, please cite our paper:
@misc{yu2025evaluatingmultimodallargelanguage,
title={Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search},
author={Linhao Yu and Xinguang Ji and Yahui Liu and Fanheng Kong and Chenxi Sun and Jingyuan Zhang and Hongzhi Zhang and V. W. and Fuzheng Zhang and Deyi Xiong},
year={2025},
eprint={2506.11155},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.11155},
}