README.md
July 22, 2025 ยท View on GitHub
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Paper | Project Page | Model
News
- 2025.7.16 Release model weights and inference code. Have a try!
- 2025.6.10 Release training and evaluation code.
๐ Contents
Framework of Temporal Dynamic Context Compression
![]() |
|---|
| Architecture of Our Multimodal Video Encoder. We first extract features for each second of the video, including both visual and corresponding audio tokens. The first frame is selected as the static frame, and a Q-Former is used to perform Temporal Dynamic Context compression based on its relationship with subsequent frames, resulting in K compressed tokens per frame. The final video representation consists of all static frame tokens and multimodal video context. |
Install
- Clone the repo into a local folder.
git clone https://github.com/Hoar012/TDC-Video.git
cd TDC-Video
- Install packages.
conda create -n tdc python=3.10 -y
conda activate tdc
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Models
Pretrained model weights are available on Hugging Face.
TDC-Qwen2-7B: TDC-Qwen2-7B; TDC-Llama3_2-3B: TDC-Llama3_2-3B
Demo
python main.py
Training
- Prepare training data
- Stage 1: Image-Text Alignment: LLaVA-OneVision-Single
- Stage 2: Video Instruction Tuning: Stage2 data
- Stage 3: Audio-Video Instruction Tuning: Stage3 data
We also provide the processed videos and audios for stage 3 training: Processed data.
- Start training
Modify the PATH_TO_JSON and PATH_TO_FOLDER arguments in the training scripts to your save folder.
PATH_TO_JSON=""
PATH_TO_FOLDER=""
Train your own model
- Stage 1: Image-Text Alignment
sh scripts/stage1/train_image_qwen.sh
Modify PREV_STAGE_CHECKPOINT in the training scripts to your first stage model path
Change image_token_len and query_num_list in config.json to 144
- Stage 2: Video Instruction Tuning
sh scripts/stage2/train_video_qwen.sh
- Stage 3: Audio-Video Instruction Tuning
# Lora training
sh scripts/stage3/train_video_audio_qwen_lora.sh
Evaluation
Evaluation on General Video Understanding
torchrun --nproc_per_node=8 ./eval/eval_mlvu.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/MLVU
Evaluation on Audio-Visual Comprehension
torchrun --nproc_per_node=8 ./eval/eval_musicQA.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/Music-AVQA --test_file eval/Music-AVQA/avqa-test.json
For more detailed instructions on evaluation, please refer to the evaluation guide.
BibTeX
@misc{hao2025multimodallongvideomodeling,
title={Multimodal Long Video Modeling Based on Temporal Dynamic Context},
author={Haoran Hao and Jiaming Han and Yiyuan Zhang and Xiangyu Yue},
year={2025},
eprint={2504.10443},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.10443},
}
Acknowledgement
This repository is built upon: LLaVA, LongVU and StoryTeller.
