[ICCV2025] DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

October 14, 2025

We propose DisCo, a visual encapsulation method designed to yield semantically distinct and temporally coherent visual representations for Video Multi-modal Large Language Models (MLLMs). [Paper]

Environment Setup

We use Python 3.10 and PyTorch 2.0.1 for our code.

conda create -n disco python=3.10
conda activate disco
cd disco && pip install -r requirements.txt
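
After installation, a quick check can confirm that the active environment matches the versions stated above. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `environment_report` are hypothetical, and the check only compares version strings without importing PyTorch.

```python
import sys
from importlib import metadata

# Versions stated in this README.
EXPECTED_PYTHON = (3, 10)
EXPECTED_TORCH = "2.0.1"


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Compare the running interpreter and installed torch with the expected versions."""
    torch_version = installed_version("torch")
    # Strip a local build suffix such as "+cu118" before comparing.
    torch_ok = torch_version is not None and torch_version.split("+")[0] == EXPECTED_TORCH
    return {
        "python_ok": sys.version_info[:2] == EXPECTED_PYTHON,
        "torch_version": torch_version,
        "torch_ok": torch_ok,
    }


print(environment_report())
```

A `False` entry does not necessarily mean the code will fail; it only flags a difference from the versions the authors report using.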

Data Preparation

Please follow the steps below to prepare the data for training and evaluation.

Training Data

[Stage 1] vision-text alignment: We use 900K video dense captions from ShareGPTVideo and 23K image dense captions from LLaVA. We also extracted instances from each caption for the VCD module.

Captions and extracted instances can be downloaded from here (under caption_data).
Images and videos can be fetched from the data pages of ShareGPTVideo and LLaVA.

After downloading the data, set each of the following paths in available_corpus.py:

anno_path -> # the path of the caption for this dataset
data_root -> # the directory of the videos for this dataset
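
As an illustration of the two fields, the sketch below shows what a filled-in entry might look like, together with a small helper that reports paths that are unset or missing on disk. The dictionary layout, the key `stage1_video_caption`, both paths, and the helper `find_missing_paths` are assumptions made for this example; check available_corpus.py for the actual structure and dataset names.

```python
import os

# Hypothetical entry: the key and both paths are placeholders,
# not names taken from the repository.
available_corpus = {
    "stage1_video_caption": {
        "anno_path": "/path/to/caption_data/video_caption.json",  # caption file
        "data_root": "/path/to/sharegptvideo/videos",  # video directory
    },
}


def find_missing_paths(corpus):
    """Return (dataset, field, path) triples for paths that are unset or absent."""
    missing = []
    for name, entry in corpus.items():
        for field in ("anno_path", "data_root"):
            path = entry.get(field)
            if not path or not os.path.exists(path):
                missing.append((name, field, path))
    return missing


for name, field, path in find_missing_paths(available_corpus):
    print(f"{name}: {field} is not set to an existing path ({path!r})")
```

Running a check like this before training can surface a mistyped path earlier than a failed data-loading step would.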

[Stage 2] instruction tuning: We follow the data recipe used in InternVideo2. Please refer to this page for details on how to download the datasets. You only need the following datasets:

- LLaVA reasoning & caption
- VideoChat caption
- ShareGPTVideo
- CLEVRER
- PerceptionTest
- STAR
- EgoQA
- NextQA

After downloading the data, set the anno_path and data_root for each dataset included in available_corpus["videochat2_stage2_sh"] in it.py.

[Stage 3] post-training (for InternVideo2-HD only): We follow the post-training stage of InternVideo2-HD. Please refer to this page for details. You only need the following datasets:

- LLaVA reasoning
- MiniGPT-4 caption
- ShareGPTVideo
- CLEVRER
- PerceptionTest
- STAR
- EgoQA
- NextQA

After downloading the data, set the anno_path and data_root for each dataset included in available_corpus["videochat2_instruction_2024_0723_f16_post"] in it.py.

Evaluation Data

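We evaluate on the benchmarks covered by the provided evaluation scripts: MVBench, VideoMME, STAR, PerceptionTest, EgoSchema, and MLVU.

You can download the corresponding video and QA data from their data repositories.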

Model Preparation

To train DisCo, you need to prepare these pretrained models:

- Download Mistral-7B-Instruct-v0.3 and set LLM_PATH in the training scripts under scripts/pt to your own Mistral path.
- Download the pretrained models for InternVideo2 or InternVideo2-HD, and set PRETRAINED_MODEL in the training scripts under scripts/pt to your own model path.

To evaluate the DisCo model directly, download the trained checkpoints from here. You will also need Mistral-7B-Instruct-v0.3.

Training

We provide training scripts for InternVideo2 and InternVideo2-HD.

Stage 1: vision-text alignment
- First, open the config file for stage 1 and set llm.pretrained_llm_path to the path of your Mistral model.
- Then, run the script for your model variant:
```
# InternVideo2
bash scripts/pt/internvideo2_stage1.sh
# InternVideo2-HD
bash scripts/pt/internvideo2_hd_stage1.sh
```
Stage 2: instruction tuning
- First, open the config file for stage 2 and set llm.pretrained_llm_path to the path of your Mistral model.
- Then, set PRETRAINED_MODEL to the checkpoint path of stage 1 and run the script for your model variant:
```
# InternVideo2
bash scripts/pt/internvideo2_stage2.sh
# InternVideo2-HD
bash scripts/pt/internvideo2_hd_stage2_sft.sh
```
Stage 3: HD tuning (for InternVideo2-HD only)
- Set PRETRAINED_MODEL to the checkpoint path of stage 2, then run the following script:
```
# InternVideo2-HD
bash scripts/pt/internvideo2_hd_stage2_post.sh
```
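
The stage order for both variants can be summarized in a small lookup. The sketch below is a hypothetical helper, not part of the repository: the script paths are the ones listed above, while the variant keys (`internvideo2`, `internvideo2_hd`) and the function names are made up for this example. It defaults to a dry run because PRETRAINED_MODEL still has to be updated by hand between stages.

```python
import subprocess

# Script paths as listed in the Training section, in the order they are run.
STAGE_SCRIPTS = {
    "internvideo2": [
        "scripts/pt/internvideo2_stage1.sh",
        "scripts/pt/internvideo2_stage2.sh",
    ],
    "internvideo2_hd": [
        "scripts/pt/internvideo2_hd_stage1.sh",
        "scripts/pt/internvideo2_hd_stage2_sft.sh",
        "scripts/pt/internvideo2_hd_stage2_post.sh",
    ],
}


def stage_command(variant, stage):
    """Return the command for a 1-based stage of the given variant."""
    scripts = STAGE_SCRIPTS[variant]
    if not 1 <= stage <= len(scripts):
        raise ValueError(f"{variant} has stages 1..{len(scripts)}, got {stage}")
    return ["bash", scripts[stage - 1]]


def run_stage(variant, stage, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    command = stage_command(variant, stage)
    print(" ".join(command))
    if not dry_run:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
    return command
```

For example, `run_stage("internvideo2_hd", 3)` prints the stage 3 command without executing it; passing `dry_run=False` from the repository root would launch the script.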

Evaluation

We provide evaluation scripts for the following benchmarks:

MVBench, VideoMME, STAR, PerceptionTest, EgoSchema, MLVU

Here is an example of evaluation on MVBench:

- Open the MVBench evaluation script and set CKPT_PATH to the path of the DisCo checkpoint and LLM_PATH to the path of your Mistral model.
- Open evaluation/eval_mvbench.py and set data_list and data_dir to the paths of your own MVBench data (downloaded as described in Evaluation Data).
- Run bash scripts/eval/eval_mvbench.sh.

You can evaluate on other benchmarks in this way as well.

Citation

@misc{zhao2025discodistinctcoherentvisual,
      title={DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs}, 
      author={Jiahe Zhao and Rongkun Zheng and Yi Wang and Helin Wang and Hengshuang Zhao},
      year={2025},
      eprint={2507.10302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.10302}, 
}

Acknowledgement

The code is adapted from the video MLLM codebase developed by Chenting Wang. Thanks for his contributions!