Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild

March 21, 2025

Contribution 1) Dataset: A crucial challenge in temporal video grounding (TVG) is its reliance on massive datasets with labor-intensive annotations. To overcome this, we introduce a large-scale dataset for Temporal Video Grounding Pretraining (Vid-Group), collected in a scalable way with minimal human involvement.

Contribution 2) Pretraining Algorithm: To handle error-prone pseudo-annotations, we propose the Refinement and Correction (ReCorrect) algorithm, which uses a self-correction mechanism during pretraining.

Dataset Download

Our full dataset can be accessed in vid_group_dataset_release.txt. It comprises 52.7K videos and a total of 200.3K annotations.

Below is an example annotation:

v_b0rFmH4XXg8 22.1 110.4 176.7##Person kayaking through water between rocky cliffs.

Each line has the form "<Video ID> <Start Time> <End Time> <Video Duration>##<Sentence Query>". For the example above:

Video ID: v_b0rFmH4XXg8
Start Time: 22.1 seconds
End Time: 110.4 seconds
Video Duration: 176.7 seconds
Sentence Query: "Person kayaking through water between rocky cliffs."
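
For readers who want to load the annotations programmatically, the following is a minimal parsing sketch. It assumes every line in vid_group_dataset_release.txt follows the same layout as the example (space-separated video ID, start, end and duration, then "##" and the query). The Annotation class and parse_annotation function are hypothetical names introduced here and are not part of the repository.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    video_id: str
    start: float     # segment start, in seconds
    end: float       # segment end, in seconds
    duration: float  # full video length, in seconds
    query: str


def parse_annotation(line: str) -> Annotation:
    """Parse one line shaped like '<id> <start> <end> <duration>##<query>'."""
    # Split on the first '##' only, in case a query itself contains '##'.
    meta, query = line.rstrip("\n").split("##", 1)
    video_id, start, end, duration = meta.split()
    return Annotation(video_id, float(start), float(end), float(duration), query.strip())


example = parse_annotation(
    "v_b0rFmH4XXg8 22.1 110.4 176.7##Person kayaking through water between rocky cliffs."
)
print(example.video_id, example.start, example.end, example.duration)
```

To read the whole file, the same function could be applied to each non-empty line.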

To access a video, remove the "v_" prefix from the Video ID to obtain its YouTube ID. For example, for the Video ID v_b0rFmH4XXg8, the corresponding YouTube link is: https://www.youtube.com/watch?v=b0rFmH4XXg8
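
The ID-to-link rule can be sketched as a small helper. The function name youtube_url is hypothetical; the URL pattern is the one shown in the example link.

```python
def youtube_url(video_id: str) -> str:
    """Build the YouTube watch link for a Vid-Group video ID."""
    # Strip only the leading "v_"; a YouTube ID may itself contain "v_" later on.
    yt_id = video_id[2:] if video_id.startswith("v_") else video_id
    return f"https://www.youtube.com/watch?v={yt_id}"


print(youtube_url("v_b0rFmH4XXg8"))
# -> https://www.youtube.com/watch?v=b0rFmH4XXg8
```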

Dataset Copyright

This dataset is released under the CC BY 4.0 license. To comply with legal requirements, it provides YouTube links to the videos instead of distributing the video files. Users should download the videos independently and must adhere to YouTube's Terms of Service, all applicable copyright policies, and the license terms governing the annotations when accessing or using the video content.

Comparison to Existing Datasets

[Figure: Dataset comparison]

Code for Zero-Shot, Unsupervised, and Fully-Supervised Settings

To run the code, use the following command, which runs evaluation for the 1) zero-shot, 2) unsupervised, and 3) fully-supervised settings in a single pass.

python main.py --cfg ./experiment/charades/recorrect_eval_configs_on_ZeroShot+Unsup+Full.json --eval

The evaluation results highlight that:

Compared to the fully supervised approach SimBase, ReCorrect achieves 81.3% and 86.7% of its mIoU in the zero-shot and unsupervised settings, respectively.
This narrow performance gap underscores the potential of our Vid-Group dataset to address the critical challenge of TVG's heavy reliance on manual annotations.
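
The two percentages are the ratio of ReCorrect's mIoU to SimBase's mIoU. As a quick check, the sketch below recomputes them from the mIoU values listed under Expected Results; the variable and function names are illustrative only.

```python
# mIoU values copied from the Expected Results tables.
simbase_miou = 56.15
recorrect_miou = {"zero-shot": 45.63, "unsupervised": 48.66}


def percent_of_baseline(score: float, baseline: float) -> float:
    """Express `score` as a percentage of `baseline`, rounded to one decimal."""
    return round(100.0 * score / baseline, 1)


relative = {
    setting: percent_of_baseline(miou, simbase_miou)
    for setting, miou in recorrect_miou.items()
}
print(relative)
# -> {'zero-shot': 81.3, 'unsupervised': 86.7}
```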

Checkpoints and Features

No additional downloads are needed to run the code: the repository already contains the necessary features and checkpoints.

CLIP features are available in the data/charades/feat directory.
Pre-trained checkpoints are located in ckpt/charades:
- zero_shot.ckpt: zero-shot model.
- unsup.ckpt: unsupervised model.
- full_sup.ckpt: fully supervised model.
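
Before running the evaluation, it can help to confirm that these files are present in a local clone. The following is a small sanity-check sketch; it assumes the three checkpoint files sit directly inside ckpt/charades and that the script is run from the repository root. The missing_paths helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths taken from the list above; the exact layout is an assumption.
EXPECTED_PATHS = [
    "data/charades/feat",
    "ckpt/charades/zero_shot.ckpt",
    "ckpt/charades/unsup.ckpt",
    "ckpt/charades/full_sup.ckpt",
]


def missing_paths(repo_root: str = ".") -> list:
    """Return the expected paths that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


missing = missing_paths(".")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All features and checkpoints found.")
```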

Expected Results

Zero-Shot Setting

Method	R@0.1	R@0.2	R@0.3	mIoU
ReCorrect	66.54	51.15	28.54	45.63
% of SimBase	85.6%	76.9%	64.8%	81.3%

Unsupervised Setting

Method	R@0.1	R@0.2	R@0.3	mIoU
ReCorrect	70.96	54.42	31.10	48.66
% of SimBase	91.2%	81.9%	70.7%	86.7%

Fully Supervised Setting

Method	R@0.1	R@0.2	R@0.3	mIoU
SimBase	77.77	66.48	44.01	56.15
ReCorrect (Ours)	78.55	68.39	45.78	57.42
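
The tables report R@m and mIoU without defining them. Under the definitions commonly used in temporal video grounding (an assumption here, since the repository's own evaluation code is not shown), R@m is the percentage of queries whose predicted segment has a temporal IoU of at least m with the ground truth, and mIoU is the mean IoU over all queries. A minimal sketch with hypothetical function names:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ious, threshold):
    """Percentage of queries whose IoU reaches the threshold."""
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(ious):
    """Mean IoU over all queries, as a percentage."""
    return 100.0 * sum(ious) / len(ious)


# Toy predictions and ground truths, for illustration only (not real results).
pairs = [((0.0, 10.0), (5.0, 15.0)), ((0.0, 1.0), (2.0, 3.0)), ((2.0, 8.0), (2.0, 8.0))]
ious = [temporal_iou(p, g) for p, g in pairs]
print(ious, recall_at(ious, 0.3), mean_iou(ious))
```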