Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

June 20, 2025

Note: TimeZero is the original version of Time-R1.

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang*, Ziheng Wang*, Boshen Xu*‡, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin†

TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
Ye Wang*, Boshen Xu*, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin†

Updates

🚀2025-06-20: The code for Time-R1 is now open-sourced at https://github.com/xiaomi-research/time-r1! It supports training and testing on Charades, ActivityNet, and TimeRFT, along with vLLM-accelerated inference for Charades, ActivityNet, TVGBench, VideoMME, MVBench, TempCompass, and EgoSchema.
2025-05-27: Expanded Time-R1 arXiv paper released! Read it on arXiv.
2025-03-17: TimeZero initial release! Code and evaluation scripts are now available.
2025-03-17: TimeZero achieves SOTA performance on Charades-STA!

TimeZero is a reasoning-guided Large Vision-Language Model (LVLM) for Temporal Video Grounding (TVG). It excels at identifying temporal segments within videos that correspond to a given natural language query. TimeZero achieves this entirely through a reinforcement learning approach that allows the model to reason about video-language relationships during inference.

Key Features:

Reinforcement Learning Training: TimeZero is trained entirely using reinforcement learning, enhancing its ability to generate accurate temporal boundaries.
Test-Time Reasoning: The model exhibits emergent reasoning capabilities during inference, generating a chain of thought to justify its segment predictions.
SOTA Performance: TimeZero sets a new SOTA on the Charades-STA benchmark.

This README provides an overview of TimeZero, including setup instructions, the training process, and evaluation guidelines.

Example:

Training Visualization:


Setup

conda create -n timezero python=3.11
conda env create -f environment.yml
conda activate timezero

Training

TimeZero training involves the following steps:

Data Preprocessing:

Download the datasets: Charades-STA, Charades-v1, and ActivityNet.

Before training, you need to preprocess the video data.
```
bash preprocess_video.sh
```
Before running it, specify the path to the Charades-STA dataset (video files, annotations, etc.).

GRPO Training:

cd scripts
bash run_grpo_video.sh

run_grpo_video.sh

#!/bin/bash

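export DEBUG_MODE="false"  # Set to "true" for verbose logging during training.
export LOG_PATH="./debug_log.txt"
# OUTDIR (checkpoint directory) and WANDB_NAME (run name) are used below
# and must be set in the environment before running this script.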

torchrun --nproc_per_node="4" \
--nnodes="1" \
--node_rank="0" \
--master_addr="127.0.0.1" \
--master_port="12361" \
src/open_r1/grpo_video.py \
--deepspeed scripts/zero3_offload.json \
--output_dir $OUTDIR \
--model_name_or_path mllm/Qwen2.5-VL-7B-Instruct \
--preprocessed_data_path ./Charades_preprocessed_data_maxpix_3584 \
--train_data_path ./Charades/charades_annotation/train.json \
--eval_data_path ./Charades/charades_annotation/val.json \
--video_folder ./Charades/Charades_v1 \
--dataset_name xxx \
--max_prompt_length 8192 \
--max_completion_length 1024 \
--num_generations 8 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--logging_steps 1 \
--bf16 \
--torch_dtype bfloat16 \
--data_seed 42 \
--gradient_checkpointing true \
--attn_implementation flash_attention_2 \
--num_train_epochs 2 \
--run_name $WANDB_NAME \
--report_to wandb \
--save_steps 50 \
--save_only_model true
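
The `--num_generations 8` flag above requests eight sampled completions per prompt. In GRPO as it is commonly described, each completion's reward is normalized against the other completions sampled for the same prompt to obtain its advantage. The following is a minimal sketch of that normalization, not the code in `grpo_video.py`; the function name, the `eps` value, the choice of population standard deviation, and the reward values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation and eps=1e-4 are illustrative
    choices; the actual trainer may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions better than the group average get a positive advantage,
    # worse ones a negative advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the 8 completions sampled for one video query.
rewards = [0.9, 0.1, 0.4, 0.4, 0.7, 0.0, 0.2, 0.5]
advantages = group_relative_advantages(rewards)
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason to sample several completions per prompt.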

Evaluation

After training, evaluate your model's performance:

bash scripts/evaluate.sh # Use evaluate.sh for evaluation.

evaluate.sh

python evaluate.py --model_base <path_to_your_trained_model> --dataset <charades or activitynet>

The evaluation script (evaluate.py) needs to be implemented to load your model, process the test data, and calculate the relevant metrics (R1@0.3, R1@0.5, R1@0.7, etc.).
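
As a reference for the metric step, R1@m is the percentage of queries whose top-1 predicted segment has a temporal IoU of at least m with the ground-truth segment. The sketch below is one way to compute it and is not taken from `evaluate.py`; the function names, the `(start, end)` tuple format in seconds, the `>=` comparison at the threshold, and the sample segments are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Percentage of queries whose single prediction reaches each IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return {m: 100.0 * sum(iou >= m for iou in ious) / len(ious) for m in thresholds}


# Hypothetical predictions and ground-truth segments for four queries.
preds = [(2.0, 8.0), (0.0, 6.0), (10.0, 12.0), (0.0, 5.0)]
gts = [(4.0, 10.0), (3.0, 9.0), (10.0, 12.0), (3.0, 10.0)]
scores = recall_at_1(preds, gts)
print(scores)  # {0.3: 75.0, 0.5: 50.0, 0.7: 25.0}
```

The four sample IoUs are 0.5, 0.33, 1.0, and 0.2, so three of four queries pass the 0.3 threshold, two pass 0.5, and one passes 0.7.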

Results

Charades-STA (Finetuned)

TimeZero outperforms the previous state-of-the-art methods on every reported metric: at R1@0.5 it gains 4.1 points over EaTR and 5.4 points over TimeSuite.

| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
| --- | --- | --- | --- | --- |
| EaTR (VLP SOTA) | VLP | - | 68.4 | 44.9 |
| TimeSuite (LVLM SOTA) | SFT | 79.4 | 67.1 | 43.0 |
| TimeZero (ours) | RL | 83.3 | 72.5 | 47.9 |

ActivityNet (Finetuned)

TimeZero surpasses the previous state-of-the-art LVLM (TRACE) on all three metrics, although the VLP method EaTR remains higher at R1@0.5 and R1@0.7.

| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
| --- | --- | --- | --- | --- |
| EaTR (VLP SOTA) | VLP | - | 58.18 | 37.64 |
| TRACE (LVLM SOTA) | SFT | 54.0 | 37.7 | 24.0 |
| TimeZero (ours) | RL | 68.6 | 47.3 | 26.9 |

Acknowledgements

We thank the authors of the following projects for their contributions:

Citation

If you find our work useful, please consider citing our papers.

@article{wang2025timer1,
      title={Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding}, 
      author={Wang, Ye and Wang, Ziheng and Xu, Boshen and Du, Yang and Lin, Kejun and Xiao, Zihan and Yue, Zihao and Ju, Jianzhong and Zhang, Liang and Yang, Dingyi and Fang, Xiangnan and He, Zewen and Luo, Zhenbo and Wang, Wenxuan and Lin, Junqi and Luan, Jian and Jin, Qin},
      journal={arXiv preprint arXiv:2503.13377},
      year={2025},
}

@article{wang2025timezero,
      title={TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM},
      author={Wang, Ye and Xu, Boshen and Yue, Zihao and Xiao, Zihan and Wang, Ziheng and Zhang, Liang and Yang, Dingyi and Wang, Wenxuan and Jin, Qin},
      journal={arXiv preprint arXiv:2503.13377},
      year={2025}
}