Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search

July 28, 2025

📌 Overview

AutoCaption is a novel framework that employs Monte Carlo Tree Search (MCTS) to generate rich, diverse, and detailed video captions. The framework iteratively constructs high-quality video descriptions that thoroughly cover objects, actions, environments, and temporal dynamics.

MCTS-VCB is a fine-grained video captioning benchmark automatically constructed using AutoCaption, enabling comprehensive evaluation of Multimodal Large Language Models (MLLMs) on video understanding tasks.

🚀 Highlights

  • 🧠 AutoCaption Framework: Iteratively constructs high-quality video descriptions using MCTS, covering objects, actions, environments, and more
  • 📊 MCTS-VCB Benchmark: Contains diverse, multi-faceted video captions for robust MLLM evaluation
  • 🔍 Comprehensive Evaluation: Benchmarks more than 20 MLLMs; Gemini-1.5-Pro achieves the top F1 score of 71.2%
  • 📈 Fine-tuning Results: InternVL2.5-8B fine-tuned on AutoCaption data achieved:
    • +25.0% improvement on MCTS-VCB
    • +16.3% improvement on DREAM-1K

🛠️ Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended)
  • 16GB+ GPU memory for Qwen2-VL-7B

Quick Install

git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -r requirements.txt

Development Install

git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -e .

Optional Dependencies

# For distributed processing
pip install mpi4py

# For experiment tracking
pip install wandb

🚀 Quick Start

1. Prepare Data

Create your input file in JSONL format:

{"video_name": "video1.mp4", "video_path": "/path/to/video1.mp4", "index": 0}
{"video_name": "video2.mp4", "video_path": "/path/to/video2.mp4", "index": 1}
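For a folder containing many videos, the input file can be generated rather than written by hand. The following is a minimal sketch, not part of the repository: the helper name `build_manifest` and the `VIDEO_EXTENSIONS` set are assumptions introduced for illustration, and only the three fields shown above (`video_name`, `video_path`, `index`) are written.

```python
import json
from pathlib import Path

# Assumed set of video extensions; adjust to match your data.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".webm"}


def build_manifest(video_dir, manifest_path):
    """Write one JSON object per video file and return the number of records."""
    # Sort by path so that the index assignment is reproducible.
    videos = sorted(
        p for p in Path(video_dir).iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )
    with open(manifest_path, "w", encoding="utf-8") as f:
        for index, path in enumerate(videos):
            record = {
                "video_name": path.name,
                "video_path": str(path.resolve()),
                "index": index,
            }
            f.write(json.dumps(record) + "\n")
    return len(videos)
```

Calling `build_manifest("/path/to/videos", "data/videos.jsonl")` would produce a file that can be passed to `--input_path`.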

2. Configure Settings

# Copy and modify configuration
cp config/config.yaml config/my_config.yaml
# Edit config/my_config.yaml as needed

3. Run AutoCaption

# Multi-GPU processing
python main.py \
    --input_path data/videos.jsonl \
    --output_path results/captions.jsonl \
    --process_num 4 \
    --gpu_nums_one_process 2 \
    --max_rollout_times 25 \
    --log_level INFO
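With `--process_num 4` and `--gpu_nums_one_process 2`, the command above asks for 4 × 2 = 8 GPUs in total. How `main.py` actually assigns devices to workers is not described here, so the sketch below only illustrates one plausible scheme (contiguous blocks of GPU ids per process). The function names `partition_gpus` and `cuda_visible_devices` are hypothetical; `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable.

```python
def partition_gpus(process_num, gpu_nums_one_process):
    """Return one list of GPU ids per worker process (assumed contiguous blocks)."""
    if process_num < 1 or gpu_nums_one_process < 1:
        raise ValueError("both arguments must be positive")
    return [
        list(range(i * gpu_nums_one_process, (i + 1) * gpu_nums_one_process))
        for i in range(process_num)
    ]


def cuda_visible_devices(gpu_ids):
    """Format a list of GPU ids as a CUDA_VISIBLE_DEVICES value."""
    return ",".join(str(g) for g in gpu_ids)


groups = partition_gpus(4, 2)
print(groups)                           # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(cuda_visible_devices(groups[1]))  # 2,3
```

If fewer GPUs are available, lowering `--process_num` or `--gpu_nums_one_process` so that their product fits the machine is the obvious adjustment.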

📂 Repository Structure

autocaption/
├── 📁 generator/              # Model generators
│   └── qwen2vl_7b.py         # Qwen2-VL-7B wrapper
├── 📁 scripts/                # Utility scripts
│   └── run_autocaption.sh    # Main execution script
├── 🐍 main.py                 # Main entry point
├── 🐍 mcts.py                 # MCTS algorithm implementation
├── 🐍 util.py                 # Utility functions
├── 📋 requirements.txt        # Python dependencies
├── ⚙️ setup.py                # Package setup
└── 📄 README.md               # This file

🎯 MCTS Action Types

AutoCaption uses 6 different action types for comprehensive video analysis:

  1. ACTION1: Overall video description
  2. ACTION2: Detail-focused observation (weighted selection)
  3. ACTION3: Temporal perspective analysis
  4. ACTION4: Spatial perspective analysis
  5. ACTION5: Background description
  6. ACTION6: Camera movement analysis
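To show how a tree search over these six action types can be organized, here is a generic UCB1-based MCTS sketch. It is an illustration under stated assumptions, not the code in `mcts.py`: the `Node` class, `ucb1`, `search`, the exploration constant, `max_depth`, and the `reward_fn` callback (a stand-in for generating and scoring a caption with the MLLM) are all hypothetical, and the weighted selection noted for ACTION2 is not modeled. Only `max_rollout_times` mirrors a flag from the command line above.

```python
import math
import random

ACTIONS = ["ACTION1", "ACTION2", "ACTION3", "ACTION4", "ACTION5", "ACTION6"]


class Node:
    """One tree node; `action` is the action type that led to this node."""

    def __init__(self, action=None, parent=None):
        self.action = action
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of rewards backed up through this node

    def untried_actions(self):
        tried = {child.action for child in self.children}
        return [a for a in ACTIONS if a not in tried]


def ucb1(child, parent_visits, c=1.4):
    """Mean reward plus an exploration bonus (standard UCB1)."""
    if child.visits == 0:
        return float("inf")
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value / child.visits + bonus


def search(reward_fn, max_rollout_times=25, max_depth=3, seed=0):
    """Run MCTS; reward_fn maps a list of action names to a float reward."""
    rng = random.Random(seed)
    root = Node()
    for _ in range(max_rollout_times):
        node, depth = root, 0
        # Selection: descend while every action at this node has been tried.
        while not node.untried_actions() and depth < max_depth:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
            depth += 1
        # Expansion: add one untried action unless the depth limit is reached.
        if depth < max_depth:
            child = Node(rng.choice(node.untried_actions()), node)
            node.children.append(child)
            node = child
        # Evaluation: score the action sequence from the root to this node.
        path, cur = [], node
        while cur.parent is not None:
            path.append(cur.action)
            cur = cur.parent
        reward = reward_fn(path[::-1])
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


# Toy reward that favors sequences starting with a temporal analysis step.
root = search(lambda path: 1.0 if path[0] == "ACTION3" else 0.0)
best = max(root.children, key=lambda ch: ch.visits)
print(best.action, best.visits)
```

In a real pipeline, `reward_fn` would be replaced by a call that prompts the model with the chosen action sequence and scores the resulting description.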

📌 Citation

If you use AutoCaption or MCTS-VCB in your research, please cite our paper:

@misc{yu2025evaluatingmultimodallargelanguage,
    title={Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search}, 
    author={Linhao Yu and Xinguang Ji and Yahui Liu and Fanheng Kong and Chenxi Sun and Jingyuan Zhang and Hongzhi Zhang and V. W. and Fuzheng Zhang and Deyi Xiong},
    year={2025},
    eprint={2506.11155},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.11155},
}