Temporal Recipe
July 9, 2025
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Thong Nguyen, Zhiyuan Hu, Xu Lin, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Training and Inference
Step 1: Temporal-oriented post-training
Usage:
bash run_scripts/POST_TRAINING_DATASET/train.sh
POST_TRAINING_DATASET: the dataset used to incorporate temporal knowledge into the large vision-language model. It can be internvid, vidal, or internvid_vidal.
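As a minimal sketch of how the placeholder might be filled in, the snippet below maps a dataset name to the script path from the usage line. The function name post_training_script is hypothetical and not part of the repository; the sketch assumes the dataset name is substituted verbatim as the directory name under run_scripts/ and that commands are run from the repository root. It only prints the command and does not launch training.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): resolve the
# post-training script path for one of the three dataset names in the README.
post_training_script() {
  local dataset="$1"
  case "$dataset" in
    internvid|vidal|internvid_vidal)
      # Assumption: the dataset name is used verbatim as the directory name.
      echo "run_scripts/${dataset}/train.sh"
      ;;
    *)
      echo "unknown POST_TRAINING_DATASET: ${dataset}" >&2
      return 1
      ;;
  esac
}

# Print the command for the combined dataset without executing it.
echo "bash $(post_training_script internvid_vidal)"
```

Rejecting unknown names up front avoids a less informative "No such file or directory" error from bash.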
Step 2: Task-specific fine-tuning
Usage:
bash run_scripts/FINETUNING_DATASET/train_TASK.sh
- FINETUNING_DATASET: the dataset used to evaluate the model’s video understanding ability. Choices include msrvtt and msvd.
- TASK: the video understanding task, either qa or cap.
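The two placeholders combine into one script path per dataset and task pair. The sketch below lists the resulting fine-tuning commands for the choices named above. The function name list_finetune_commands is hypothetical, and the sketch assumes both placeholders are substituted verbatim into the path; whether every combination has a script in the repository is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print one fine-tuning
# command per dataset/task combination listed in the README.
list_finetune_commands() {
  local dataset task
  for dataset in msrvtt msvd; do
    for task in qa cap; do
      # Assumption: placeholders map directly onto the script path.
      echo "bash run_scripts/${dataset}/train_${task}.sh"
    done
  done
}

# Show the commands without running them.
list_finetune_commands
```

Pick the line that matches the benchmark and task you want and run it from the repository root.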
Step 3: Evaluate
Usage:
bash run_scripts/FINETUNING_DATASET/test_TASK.sh
Choices for FINETUNING_DATASET and TASK are the same as those in Step 2.
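Putting the three steps together, a run for one configuration might look like the sketch below. The variable values are only one example choice, the function name pipeline_commands is hypothetical, and the sketch assumes the steps are run in order from the repository root. It prints the commands rather than executing them, and it does not cover how checkpoints are passed between steps, since that depends on the scripts themselves.

```shell
#!/usr/bin/env bash
# Example configuration; any of the choices listed in the README could be used.
POST_TRAINING_DATASET="internvid"
FINETUNING_DATASET="msrvtt"
TASK="qa"

# Hypothetical helper (not part of the repository): print the three
# commands in the order the README describes.
pipeline_commands() {
  echo "bash run_scripts/${POST_TRAINING_DATASET}/train.sh"        # Step 1
  echo "bash run_scripts/${FINETUNING_DATASET}/train_${TASK}.sh"   # Step 2
  echo "bash run_scripts/${FINETUNING_DATASET}/test_${TASK}.sh"    # Step 3
}

pipeline_commands
```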