README.md
January 1, 2026 · View on GitHub
Factorized Learning for Temporally Grounded Video-Language Models
Wenzheng Zeng1, Difei Gao1, Mike Zheng Shou1†, Hwee Tou Ng1†,
1National University of Singapore
†Corresponding authors
ICCV 2025
📄 Paper with supp. | 🤗 Daily paper | 🤗 Model | 🤗 Dataset | 🖼️ Poster | ▶️ Video | X (Twitter) | 知乎
🔆 Highlights
- Model: We propose a new framework , where we decompose the generation objective into a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens to emphasize explicit event-level visual semantic capture.
- Training Algorithm: We introduce Factorized Preference Optimization (FPO) that explicitly addresses both temporal grounding and textual response. A factorized data synthesis approach is also designed to support FPO.
- Performance: Our method consistently outperforms SOTA methods across various tasks.
- Open Source: We release the source code and model weights to the community.
🔥News
- [2025-10] Code and model weights are released!
- [2025-06] Our work is accepted to ICCV 2025!
🛠️ Installation
Please refer to the following environmental settings that we use. You may install these packages by yourself if you meet any problem during automatic installation.
- CUDA 11.8
- Python 3.12.2
- PyTorch 2.4.0
- Transformers 4.44.2
- DeepSpeed 0.14.5
- NNCore 0.4.5
Install from source
- Clone the repository from GitHub.
git clone https://github.com/nusnlp/d2vlm.git
cd d2vlm
- Initialize conda environment.
conda create -n d2vlm python=3.12 -y
conda activate d2vlm
- Install dependencies.
pip install -r requirements.txt
📦 Dataset
Please refer to the Dataset page.
🤖 Inference and Evaluation
- You can download the pre-trained model at here.
- Run the following commands. Remember to change the absolute path within each .sh file.
E.T. Bench
bash scripts/inference.sh
# or refer to the inference and Evaluation part of scripts/train_inference_eval.sh
Charades-STA
bash all_benchmark_eval/charades/inference.sh
all_benchmark_eval/charades/inference.sh
Youcook2
bash all_benchmark_eval/youcook2/inference.sh
💪 Training
- Download the pretrained model from here (stage-2 model of E.T. Chat).
- Check and run the following command (modify relevant path).
bash scripts/train_inference_eval.sh
🎓 Citation
If you find our work useful in your research, please consider to cite our paper:
@inproceedings{d2vlm,
title={Factorized Learning for Temporally Grounded Video-Language Models},
author={Zeng, Wenzheng and Gao, Difei and Shou, Mike Zheng and Ng, Hwee Tou},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025},
pages={20683-20693}
}
🙏 Acknowledgments
This project was built upon E.T. Bench, TimeChat, and AMP. We thank their solid contribution to the community!