[ICCV2025] DisTime: Distribution-based Time Representation for Video Large Language Models
July 10, 2025
This repository contains the implementation of the paper DisTime: Distribution-based Time Representation for Video Large Language Models (ICCV 2025).
Installation
Please refer to INSTALLATION.
Models and Data
Models
InternVid-TG
In the paper, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. Using this paradigm, we construct the InternVid-TG dataset, which is released at https://huggingface.co/datasets/yingsen/internvid-tg.
Data construction
{"video": "xxx.mp4", "tgt": [11.39, 31.65], "conversations": [{"from": "human", "value": "<video>\nGive you a textual query: 'They subsequently apply wax to a ski in the kitchen, all the while remaining active and on the move.'. When does the described content occur in the video? Please return the timestamp."}, {"from": "gpt", "value": "The event is depicted at <TIME_STAMP>."}]}
{"video": "xxx.mp4", "tgt": [0, 44, 45, 57, ...], "conversations": [{"from": "human", "value": "<video>\nIdentify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions."}, {"from": "gpt", "value": "<TIME_STAMP>, clean the bananas. <TIME_STAMP>, take the skin off. <TIME_STAMP>,..."}]}
Training
# InternVL2.5-1B
sh internvl_chat/shell/distime/internvl2_5_1b_dynamic_res_merged_stage_finetune_lora.sh
# InternVL2.5-8B
sh internvl_chat/shell/distime/internvl2_5_8b_dynamic_res_merged_stage_finetune_lora.sh
Evaluation
Moment Retrieval
Charades-STA
# evaluate
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B charades
# metric
python internvl_chat/eval/charades-sta/charades_sta_eval_utils.py
ANet
# evaluate
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B anet
# metric
python internvl_chat/eval/anet/anet_eval_utils.py
QVHighlight
# evaluate
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B qvh
# metric
python internvl_chat/eval/qvhighlight/qvhighlight_eval_utils.py
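Moment retrieval results are commonly summarized with temporal IoU between a predicted span and the ground-truth span, reported as recall at fixed IoU thresholds. The sketch below illustrates that general metric only; it is not taken from the `*_eval_utils.py` scripts above, which may apply additional rules, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) spans in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is clamped at zero for disjoint spans.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose single prediction reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Illustrative spans, not real model outputs.
preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(5.0, 15.0), (0.0, 10.0)]
print(recall_at_iou(preds, gts, 0.5))
```

Here the first pair overlaps for 5 s out of a 15 s union, so it misses the 0.5 threshold, while the second pair matches exactly.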
Dense Video Captioning
YouCook2_dvc
# evaluate
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B youcook2_dvc
# metric
python internvl_chat/eval/youcook2_dvc/dvc/eval_dvc.py --pred_file internvl_chat/results/YouCook2-DVC/1B/results.json --gt_file internvl_chat/data_example/youcook2_dvc/val.caption_coco_format.json
ANet_dvc
# evaluate
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B anet_dvc
# metric
python internvl_chat/eval/anet_dvc/metric/anet_dvc_eval_utils.py --data_path internvl_chat/data_example/anet/val_2.json --log_path internvl_chat/results/ANet-Caption-DVC/1B/results.txt --task captioning
Grounded Video Question Answering
NExT-GQA
# evaluate stage1
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B nextgqa 1
# evaluate stage2
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B nextgqa 2
# process result
python internvl_chat/eval/nextgqa/process_and_split.py
# evaluate GQA metric
python internvl_chat/eval/nextgqa/nextgqa_eval_utils.py
# evaluate QA metric
python internvl_chat/eval/nextgqa/evaluate_nextgqa.py
General Video Understanding
MVBench
# evaluate and metric
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B mvbench
LongVideoBench
# evaluate and metric
GPUS=8 sh internvl_chat/evaluate.sh checkpoint/DisTime/DisTime-InternVL2_5-1B longvideobench
Citation
@article{zeng2025distime,
title={DisTime: Distribution-based Time Representation for Video Large Language Models},
author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
journal={arXiv preprint arXiv:2505.24329},
year={2025}
}
Acknowledgement
DisTime builds on the codebases of InternVL and LLaVA-NeXT. We sincerely thank the authors of these open-source projects, which have greatly facilitated our research on time representation for video large language models.