ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
July 15, 2025
This is the code repository for the paper: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models. ACM MM 2025.
Overview: (a) Vanilla Text CoT Reasoning; (b) Video-Text Interleaved CoT Reasoning; (c) Video-Text Interleaved Data Construction; (d) Performance Comparison: Vanilla Reasoning Paradigm (Vanilla CoT, Vanilla Desp-CoT, and Vanilla Plan-and-Solve) vs. Video-Text Interleaved Reasoning Paradigm (ViT CoT, ViT Desp-CoT and ViT Plan-and-Solve) on Qwen2.5-VL-7B.
Preparation steps
(1) Environment installation command:
pip install -r requirements.txt
(2) Fill in your API information in the files src/ViTCoT_stage1 and src/ViTCoT_stage2:
API_KEYS = []
(3) Download the datasets all_video.zip and key_video.zip and unzip them into the src folder.
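Steps (2) and (3) can also be scripted. The following is a minimal sketch, not part of the repository: it assumes the two archives have already been downloaded into the src folder, and it reads keys from a hypothetical environment variable named VITCOT_API_KEYS (comma-separated) instead of hard-coding them. The function names load_api_keys and extract_archives are illustrative only.

```python
import os
import zipfile
from pathlib import Path
from typing import List, Sequence


def load_api_keys(env_var: str = "VITCOT_API_KEYS") -> List[str]:
    """Read comma-separated keys from an environment variable.

    The variable name is an assumption for this sketch; the repository
    itself only asks you to fill in the API_KEYS list by hand.
    """
    raw = os.environ.get(env_var, "")
    # Drop surrounding whitespace and ignore empty entries.
    return [key.strip() for key in raw.split(",") if key.strip()]


def extract_archives(
    src_dir: Path,
    names: Sequence[str] = ("all_video.zip", "key_video.zip"),
) -> List[str]:
    """Unzip each archive found in src_dir into src_dir.

    Returns the names of archives that were not found, so the caller
    can report what still needs to be downloaded.
    """
    missing = []
    for name in names:
        archive = src_dir / name
        if not archive.is_file():
            missing.append(name)
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(src_dir)
    return missing


if __name__ == "__main__":
    keys = load_api_keys()
    print(f"{len(keys)} API key(s) found")
    missing = extract_archives(Path("src"))
    if missing:
        print("Missing archives:", ", ".join(missing))
```

The list returned by load_api_keys() could then be pasted into, or assigned to, API_KEYS in the two stage files; how those files consume the list is not described here, so check them before wiring this in.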
To get the performance results for Gemini-2.0-Flash, run the following commands:
cd src
bash run.sh
Model Performance
Contact
Please open a GitHub issue or email Yongheng Zhang or Libo Qin if you have any questions or suggestions.