[ICCV2025] DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
October 14, 2025
We propose DisCo, a visual encapsulation method designed to yield semantically distinct and temporally coherent visual representations for Video Multi-modal Large Language Models (MLLMs). [Paper]
Environment Setup
We use Python 3.10 and PyTorch 2.0.1 for our code.
conda create -n disco python=3.10
conda activate disco
cd disco && pip install -r requirements.txt
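After installation, a quick check can confirm that the interpreter and PyTorch versions match the ones stated above. The snippet below is a minimal sketch; the helper names `version_matches` and `check_environment` are our own and are not part of the DisCo repository.

```python
import sys

# Versions stated in this README.
EXPECTED_PYTHON = "3.10"
EXPECTED_TORCH = "2.0.1"


def version_matches(found: str, expected: str) -> bool:
    """Return True if `found` equals `expected` or extends it (e.g. '2.0.1+cu117')."""
    # Drop any local build suffix such as '+cu117' before comparing components.
    found_parts = found.split("+")[0].split(".")
    expected_parts = expected.split(".")
    return found_parts[: len(expected_parts)] == expected_parts


def check_environment() -> dict:
    """Report whether the running Python and the installed PyTorch match expectations."""
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    report = {"python": version_matches(python_version, EXPECTED_PYTHON)}
    try:
        import torch  # imported lazily so the check still runs without PyTorch

        report["torch"] = version_matches(torch.__version__, EXPECTED_TORCH)
    except ImportError:
        report["torch"] = False
    return report


if __name__ == "__main__":
    print(check_environment())
```

Comparing version components rather than raw string prefixes avoids treating Python 3.1 as a match for 3.10.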
Data Preparation
Please follow the steps below to prepare the data for training and evaluation.
Training Data
[Stage 1] vision-text alignment: We use 900K video dense captions from ShareGPTVideo and 23K image dense captions from LLaVA. We also extracted instances for the VCD module from each caption.
- Captions and extracted instances can be downloaded from here (under caption_data).
- Images/videos can be fetched from the data pages of ShareGPTVideo and LLaVA.
After downloading the data, set each of the following paths in available_corpus.py:
anno_path -> # the path of the caption for this dataset
data_root -> # the directory of the videos for this dataset
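Before launching training, it can help to verify that every configured path actually exists. The following is a minimal sketch that assumes each corpus entry is a dictionary with `anno_path` and `data_root` keys; the real layout of available_corpus.py may differ, and `missing_paths` is a hypothetical helper, not a function from the repository.

```python
import os
import tempfile


def missing_paths(corpus: dict) -> list:
    """Return (dataset, field) pairs whose anno_path or data_root does not exist."""
    problems = []
    for name, entry in corpus.items():
        # anno_path is expected to be a caption file, data_root a media directory.
        if not os.path.isfile(entry["anno_path"]):
            problems.append((name, "anno_path"))
        if not os.path.isdir(entry["data_root"]):
            problems.append((name, "data_root"))
    return problems


if __name__ == "__main__":
    # Hypothetical entries created in a temporary directory purely for demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        anno = os.path.join(tmp, "captions.json")
        open(anno, "w").close()
        demo_corpus = {
            "ok_dataset": {"anno_path": anno, "data_root": tmp},
            "broken_dataset": {
                "anno_path": os.path.join(tmp, "missing.json"),
                "data_root": tmp,
            },
        }
        print(missing_paths(demo_corpus))  # [('broken_dataset', 'anno_path')]
```

The same check could be applied to the Stage 2 and Stage 3 corpus entries once their paths are set.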
[Stage 2] instruction tuning: We follow the data recipe used in InternVideo2. Please refer to this page for details of how to download the datasets. Specifically, you only need to get these datasets:
- LLaVA reasoning & caption
- VideoChat caption
- ShareGPTVideo
- CLEVRER
- PerceptionTest
- STAR
- EgoQA
- NextQA
After downloading the data, set the anno_path and data_root for each dataset included in available_corpus["videochat2_stage2_sh"] in it.py.
[Stage 3] post-training (for InternVideo2-HD only): We follow the post-training stage of InternVideo2-HD. Please refer to this page for details. Specifically, you only need to get these datasets:
- LLaVA reasoning
- MiniGPT-4 caption
- ShareGPTVideo
- CLEVRER
- PerceptionTest
- STAR
- EgoQA
- NextQA
After downloading the data, set the anno_path and data_root for each dataset included in available_corpus["videochat2_instruction_2024_0723_f16_post"] in it.py.
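Since the Stage 2 and Stage 3 recipes overlap heavily, a set difference shows what Stage 3 adds on top of Stage 2. This is a small sketch built from the two lists above; it treats "LLaVA reasoning & caption" as two separate items, which is our reading of the list rather than something the repository states.

```python
# Dataset names copied from the Stage 2 and Stage 3 lists in this README.
STAGE2_DATASETS = {
    "LLaVA reasoning", "LLaVA caption", "VideoChat caption", "ShareGPTVideo",
    "CLEVRER", "PerceptionTest", "STAR", "EgoQA", "NextQA",
}
STAGE3_DATASETS = {
    "LLaVA reasoning", "MiniGPT-4 caption", "ShareGPTVideo",
    "CLEVRER", "PerceptionTest", "STAR", "EgoQA", "NextQA",
}


def extra_for_stage3() -> list:
    """Datasets needed for Stage 3 that were not already needed for Stage 2."""
    return sorted(STAGE3_DATASETS - STAGE2_DATASETS)


def stage2_only() -> list:
    """Datasets used in Stage 2 but not in Stage 3."""
    return sorted(STAGE2_DATASETS - STAGE3_DATASETS)


if __name__ == "__main__":
    print(extra_for_stage3())  # ['MiniGPT-4 caption']
    print(stage2_only())       # ['LLaVA caption', 'VideoChat caption']
```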
Evaluation Data
We evaluate on the benchmarks listed in the Evaluation section below. You can download the corresponding video and QA data from their data repositories.
Model Preparation
If you want to carry out the training of DisCo, you need to prepare these pretrained models:
- Download Mistral-7B-Instruct-v0.3 and set LLM_PATH in the training scripts under scripts/pt to your own Mistral path.
- Download pretrained models for InternVideo2 or InternVideo2-HD, and set PRETRAINED_MODEL in the training scripts under scripts/pt to your own model path.
If you want to evaluate the DisCo model directly, you can download the trained checkpoints from here. You also need Mistral-7B-Instruct-v0.3.
Training
We provide the training scripts for InternVideo2 and InternVideo2-HD.
Stage 1: vision-text alignment
- First, go to the config file for stage 1. Set llm.pretrained_llm_path to the path of your Mistral model.
- Then, run the following script:
# InternVideo2
bash scripts/pt/internvideo2_stage1.sh
# InternVideo2-HD
bash scripts/pt/internvideo2_hd_stage1.sh
Stage 2: instruction tuning
- First, go to the config file for stage 2. Set llm.pretrained_llm_path to the path of your Mistral model.
- Then, set PRETRAINED_MODEL to the checkpoint path of stage 1 and run the following script:
# InternVideo2
bash scripts/pt/internvideo2_stage2.sh
# InternVideo2-HD
bash scripts/pt/internvideo2_hd_stage2_sft.sh
Stage 3: HD tuning (for InternVideo2-HD only)
- Set PRETRAINED_MODEL to the checkpoint path of stage 2, then run the following script:
# InternVideo2-HD
bash scripts/pt/internvideo2_hd_stage2_post.sh
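The stage order for each model variant can be summarized programmatically. The sketch below only prints the commands in order with a reminder about PRETRAINED_MODEL; it does not run anything. The script paths come from this README, while the variant keys `internvideo2` and `internvideo2_hd` and the function `training_plan` are hypothetical names chosen for illustration.

```python
# Script paths as listed in this README; the dictionary keys are illustrative labels.
TRAINING_SCRIPTS = {
    "internvideo2": [
        ("Stage 1: vision-text alignment", "scripts/pt/internvideo2_stage1.sh"),
        ("Stage 2: instruction tuning", "scripts/pt/internvideo2_stage2.sh"),
    ],
    "internvideo2_hd": [
        ("Stage 1: vision-text alignment", "scripts/pt/internvideo2_hd_stage1.sh"),
        ("Stage 2: instruction tuning", "scripts/pt/internvideo2_hd_stage2_sft.sh"),
        ("Stage 3: HD tuning", "scripts/pt/internvideo2_hd_stage2_post.sh"),
    ],
}


def training_plan(variant: str) -> list:
    """Return ordered steps, each with the command and any prerequisite note."""
    steps = []
    for index, (stage, script) in enumerate(TRAINING_SCRIPTS[variant]):
        # Every stage after the first starts from the previous stage's checkpoint.
        prerequisite = (
            f"set PRETRAINED_MODEL to the stage {index} checkpoint" if index > 0 else None
        )
        steps.append({"stage": stage, "command": f"bash {script}", "prerequisite": prerequisite})
    return steps


if __name__ == "__main__":
    for step in training_plan("internvideo2_hd"):
        note = f" ({step['prerequisite']})" if step["prerequisite"] else ""
        print(f"{step['stage']}: {step['command']}{note}")
```

Keeping the plan as data rather than launching the scripts directly reflects that PRETRAINED_MODEL has to be edited by hand between stages.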
Evaluation
We provide evaluation scripts for the following benchmarks:
- MVBench, VideoMME, STAR, PerceptionTest, EgoSchema, MLVU
Here is an example of evaluation on MVBench:
- Go to the MVBench evaluation script, set CKPT_PATH to the path of the DisCo checkpoint and LLM_PATH to the path of your Mistral model.
- Go to evaluation/eval_mvbench.py, set data_list and data_dir to the path of your own MVBench data (downloaded following Evaluation Data).
- Run bash scripts/eval/eval_mvbench.sh.
You can evaluate on other benchmarks in this way as well.
Citation
@misc{zhao2025discodistinctcoherentvisual,
title={DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs},
author={Jiahe Zhao and Rongkun Zheng and Yi Wang and Helin Wang and Hengshuang Zhao},
year={2025},
eprint={2507.10302},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.10302},
}
Acknowledgement
The code is adapted from the video MLLM codebase developed by Chenting Wang. Thanks for his contributions!