Temporal Recipe

July 9, 2025

Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding

Training and Inference

Usage:

bash run_scripts/POST_TRAINING_DATASET/train.sh

POST_TRAINING_DATASET: the dataset used to incorporate temporal knowledge into the large vision-language model. It can be one of internvid, vidal, or internvid_vidal.
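
As a rough sketch of how the placeholder is meant to be filled in, the helper below substitutes a dataset name into the script path and rejects anything outside the three listed choices. The function name post_training_script is hypothetical, and the sketch assumes the directory names under run_scripts/ match the choice names exactly.

```shell
# Hypothetical helper (not part of the repository): resolve the
# post-training script path for one of the documented dataset choices.
post_training_script() {
  case "$1" in
    internvid|vidal|internvid_vidal)
      # Assumes run_scripts/<dataset>/train.sh mirrors the choice name.
      printf 'run_scripts/%s/train.sh\n' "$1"
      ;;
    *)
      echo "unknown POST_TRAINING_DATASET: $1" >&2
      return 1
      ;;
  esac
}
```

With this in place, post-training on InternVid would presumably be launched as bash "$(post_training_script internvid)" from the repository root.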

Usage:

bash run_scripts/FINETUNING_DATASET/train_TASK.sh

FINETUNING_DATASET: dataset to evaluate the model’s video understanding ability. Choices include msrvtt and msvd.
TASK: the video understanding task, either qa (question answering) or cap (captioning).
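
The two placeholders combine into a single script path. The sketch below shows that substitution under the assumption that directory and file names follow the choice names literally; finetune_script is a hypothetical name used only for illustration.

```shell
# Hypothetical helper (not part of the repository): build the
# fine-tuning script path from a dataset and a task.
finetune_script() {
  ft_dataset="$1"
  ft_task="$2"
  case "$ft_dataset" in
    msrvtt|msvd) ;;
    *) echo "unknown FINETUNING_DATASET: $ft_dataset" >&2; return 1 ;;
  esac
  case "$ft_task" in
    qa|cap) ;;
    *) echo "unknown TASK: $ft_task" >&2; return 1 ;;
  esac
  # Assumes run_scripts/<dataset>/train_<task>.sh mirrors the choice names.
  printf 'run_scripts/%s/train_%s.sh\n' "$ft_dataset" "$ft_task"
}
```

For example, finetune_script msrvtt qa resolves to run_scripts/msrvtt/train_qa.sh, which would then be passed to bash.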

Usage:

bash run_scripts/FINETUNING_DATASET/test_TASK.sh

Choices for FINETUNING_DATASET and TASK are the same as for the fine-tuning script above.
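
To evaluate every dataset and task combination, the test script paths can be enumerated in a loop. This is only a sketch: list_test_scripts is a hypothetical helper, and it prints the paths rather than running them, on the assumption that each run_scripts/<dataset>/test_<task>.sh exists under that exact name.

```shell
# Hypothetical helper (not part of the repository): print the test
# script path for every documented dataset/task combination.
list_test_scripts() {
  for ts_dataset in msrvtt msvd; do
    for ts_task in qa cap; do
      printf 'run_scripts/%s/test_%s.sh\n' "$ts_dataset" "$ts_task"
    done
  done
}
```

Each printed path can then be run with bash in turn, for instance by piping the output into a while read loop.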