July 22, 2025

Multimodal Long Video Modeling Based on Temporal Dynamic Context

News

2025.7.16 Released model weights and inference code. Give it a try!
2025.6.10 Released training and evaluation code.

Framework of Temporal Dynamic Context Compression


Architecture of Our Multimodal Video Encoder. We first extract features for each second of the video, including both visual and corresponding audio tokens. The first frame is selected as the static frame, and a Q-Former is used to perform Temporal Dynamic Context compression based on its relationship with subsequent frames, resulting in K compressed tokens per frame. The final video representation consists of all static frame tokens and multimodal video context.
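To make the effect of the compression concrete, here is a minimal token-count sketch. It assumes a single segment in which the static frame keeps all of its tokens and every subsequent frame is reduced to K tokens; the function name and the numbers used (8 frames, 144 tokens per frame, K = 16) are illustrative and not taken from the repository.

```python
def video_token_count(num_frames: int, tokens_per_frame: int, k: int) -> int:
    """Tokens for one segment: a full static frame plus K tokens per later frame."""
    if num_frames < 1:
        raise ValueError("a segment needs at least one frame")
    static_tokens = tokens_per_frame       # the first (static) frame is kept in full
    context_tokens = k * (num_frames - 1)  # each subsequent frame contributes K tokens
    return static_tokens + context_tokens


# Hypothetical values, for illustration only.
uncompressed = 8 * 144
compressed = video_token_count(num_frames=8, tokens_per_frame=144, k=16)
print(uncompressed, compressed)  # 1152 256
```

Under these assumptions the sequence length grows by K per extra second of video instead of by the full per-frame token count.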


Install

Clone the repo into a local folder.

git clone https://github.com/Hoar012/TDC-Video.git
cd TDC-Video

Install packages.

conda create -n tdc python=3.10 -y
conda activate tdc
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
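After installation, a quick check like the following can confirm that the key packages are importable inside the `tdc` environment. This is a sketch: it assumes `torch` is installed through requirements.txt and that flash-attn is importable as `flash_attn`; the helper name is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names: `torch` (expected from requirements.txt) and
# `flash_attn` (the import name of the flash-attn package).
missing = missing_packages(["torch", "flash_attn"])
print("missing packages:", missing or "none")
```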

Models

Pretrained model weights are available on Hugging Face.

TDC-Qwen2-7B
TDC-Llama3_2-3B
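If you prefer to fetch the weights ahead of time, the sketch below uses `snapshot_download` from the `huggingface_hub` package. It assumes the Hub repository ID matches the `--model_path` value used in the evaluation commands (`Hoar012/TDC-Qwen2-7B`); the wrapper function is hypothetical and not part of this repository.

```python
def download_weights(repo_id="Hoar012/TDC-Qwen2-7B", local_dir=None):
    """Download a model snapshot from the Hugging Face Hub and return its local path."""
    # Imported lazily so this helper can be defined without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # With local_dir=None the files go to the default Hugging Face cache.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example (downloads several GB, so it is left commented out):
# path = download_weights(local_dir="checkpoints/TDC-Qwen2-7B")
```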

Demo

python main.py

Training

Prepare training data

Stage 1: Image-Text Alignment: LLaVA-OneVision-Single
Stage 2: Video Instruction Tuning: Stage2 data
Stage 3: Audio-Video Instruction Tuning: Stage3 data

We also provide the processed video and audio files for Stage 3 training: Processed data.

Start training

Set the PATH_TO_JSON and PATH_TO_FOLDER variables in the training scripts to the locations where you saved the training data.

PATH_TO_JSON=""
PATH_TO_FOLDER=""
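Before launching a long training run, it can help to sanity-check both values. The sketch below assumes PATH_TO_JSON points to a JSON annotation file and PATH_TO_FOLDER to the directory holding the media; both interpretations and the helper name are assumptions, not something the training scripts enforce.

```python
import json
import os


def check_data_paths(path_to_json, path_to_folder):
    """Return a list of human-readable problems; an empty list means both paths look usable."""
    problems = []
    if not os.path.isfile(path_to_json):
        problems.append(f"PATH_TO_JSON is not a file: {path_to_json!r}")
    else:
        try:
            with open(path_to_json, encoding="utf-8") as f:
                json.load(f)  # only checks that the file parses as JSON
        except json.JSONDecodeError as exc:
            problems.append(f"PATH_TO_JSON is not valid JSON: {exc}")
    if not os.path.isdir(path_to_folder):
        problems.append(f"PATH_TO_FOLDER is not a directory: {path_to_folder!r}")
    return problems


# The empty defaults from the scripts fail both checks.
for problem in check_data_paths("", ""):
    print(problem)
```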

Train your own model

Stage 1: Image-Text Alignment

sh scripts/stage1/train_image_qwen.sh

Before starting Stage 2, set PREV_STAGE_CHECKPOINT in the training scripts to the path of your Stage 1 model.

Then set image_token_len and query_num_list in config.json to 144.
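The config.json change can also be scripted. The sketch below is one possible way to do it and rests on an assumption: if `query_num_list` is already stored as a list, every entry is replaced with 144, otherwise the field is set to the plain value. Check the result against your own config.json before training; both function names are hypothetical.

```python
import json


def set_token_lengths(config, value=144):
    """Return a copy of a config dict with both fields set to `value`."""
    updated = dict(config)
    updated["image_token_len"] = value
    old = config.get("query_num_list")
    # Assumption: keep the list shape if the field is already a list.
    updated["query_num_list"] = [value] * len(old) if isinstance(old, list) else value
    return updated


def patch_config_file(path, value=144):
    """Rewrite the JSON file at `path` in place with the updated fields."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(set_token_lengths(config, value), f, indent=2)


print(set_token_lengths({"image_token_len": 576, "query_num_list": [576]}))
# {'image_token_len': 144, 'query_num_list': [144]}
```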

Stage 2: Video Instruction Tuning

sh scripts/stage2/train_video_qwen.sh

Stage 3: Audio-Video Instruction Tuning

# LoRA training
sh scripts/stage3/train_video_audio_qwen_lora.sh

Evaluation

Evaluation on General Video Understanding

torchrun --nproc_per_node=8 ./eval/eval_mlvu.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/MLVU

Evaluation on Audio-Visual Comprehension

torchrun --nproc_per_node=8 ./eval/eval_musicQA.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/Music-AVQA --test_file eval/Music-AVQA/avqa-test.json

For more detailed instructions on evaluation, please refer to the evaluation guide.
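The two commands above share most of their arguments, so a small helper can assemble them for a different GPU count or benchmark. This is a sketch: `build_eval_command` is a hypothetical helper, and whether the evaluation scripts behave correctly with fewer than 8 processes is an assumption you should verify.

```python
import shlex


def build_eval_command(script, data_path, model_path="Hoar012/TDC-Qwen2-7B",
                       nproc_per_node=8, extra_args=()):
    """Assemble a torchrun command as an argument list, mirroring the commands above."""
    cmd = [
        "torchrun", f"--nproc_per_node={nproc_per_node}", script,
        "--model_path", model_path,
        "--model_name", "cambrian_qwen",
        "--version", "qwen",
        "--data_path", data_path,
    ]
    cmd.extend(extra_args)  # e.g. the --test_file argument used for Music-AVQA
    return cmd


mlvu = build_eval_command("./eval/eval_mlvu.py", "eval/MLVU", nproc_per_node=2)
print(shlex.join(mlvu))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once the printed command looks right.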

BibTeX

@misc{hao2025multimodallongvideomodeling,
  title={Multimodal Long Video Modeling Based on Temporal Dynamic Context},
  author={Haoran Hao and Jiaming Han and Yiyuan Zhang and Xiangyu Yue},
  year={2025},
  eprint={2504.10443},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.10443}
}

Acknowledgement

This repository builds on LLaVA, LongVU, and StoryTeller.