ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models

July 15, 2025


📷 This is the code repository for the paper: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models. ACM MM 2025.

Overview: (a) Vanilla Text CoT Reasoning; (b) Video-Text Interleaved CoT Reasoning; (c) Video-Text Interleaved Data Construction; (d) Performance Comparison: Vanilla Reasoning Paradigm (Vanilla CoT, Vanilla Desp-CoT, and Vanilla Plan-and-Solve) vs. Video-Text Interleaved Reasoning Paradigm (ViT CoT, ViT Desp-CoT and ViT Plan-and-Solve) on Qwen2.5-VL-7B.

Preparation steps: environment installation

(1) Environment installation command:

pip install -r requirements.txt

(2) Fill in your API information in the files src/ViTCoT_stage1 and src/ViTCoT_stage2:

API_KEYS = []

(3) Download the datasets 🤗 all_video.zip and 🤗 key_video.zip and unzip them into the src folder.
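
Steps (2) and (3) can also be scripted. The following is a minimal sketch, not part of the repository: the environment variable name VITCOT_API_KEYS and the helper functions load_api_keys and extract_archives are hypothetical, and the sketch assumes both archives have already been downloaded into the src folder. How the stage scripts actually consume API_KEYS is not shown in this README, so the sketch only builds the list.

```python
import os
import zipfile
from pathlib import Path


def load_api_keys(env_var="VITCOT_API_KEYS"):
    """Read API keys from a comma-separated environment variable.

    The variable name is an assumption made for this sketch; the
    repository itself only shows `API_KEYS = []`.
    """
    raw = os.environ.get(env_var, "")
    # Drop surrounding whitespace and skip empty entries.
    return [key.strip() for key in raw.split(",") if key.strip()]


def extract_archives(src_dir="src", names=("all_video.zip", "key_video.zip")):
    """Unzip the dataset archives in place inside `src_dir`.

    Assumes the archives were downloaded into `src_dir` beforehand.
    Returns the names of the archives that were extracted.
    """
    src = Path(src_dir)
    extracted = []
    for name in names:
        archive = src / name
        if not archive.is_file():
            raise FileNotFoundError(f"{archive} not found; download it first")
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(src)
        extracted.append(archive.name)
    return extracted


# Value that could be pasted into, or assigned to, API_KEYS in the stage files.
API_KEYS = load_api_keys()
```

Reading the keys from the environment keeps them out of version control; pasting them directly into the list, as the step above describes, works just as well.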

💻 To get the performance results for Gemini-2.0-Flash, run the following commands:

cd src
bash run.sh

💯 Model Performance

💬 Contact

Please open a GitHub issue or email Yongheng Zhang or Libo Qin if you have any questions or suggestions.