Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild
March 21, 2025
Contribution 1) Dataset: A crucial challenge in temporal video grounding is its reliance on massive datasets with labor-intensive annotations. To overcome this, we introduce a large-scale dataset for Temporal Video Grounding Pretraining (Vid-Group), collected in a scalable way with minimal human involvement.
Contribution 2) Pretraining Algorithm: To address error-prone pseudo-annotations, we propose the Refinement and Correction (ReCorrect) algorithm, which uses a self-correction mechanism during pretraining.
Dataset Download
Our full dataset can be accessed in vid_group_dataset_release.txt. It comprises 52.7K videos and a total of 200.3K annotations.
Below is an example annotation:
v_b0rFmH4XXg8 22.1 110.4 176.7##Person kayaking through water between rocky cliffs.
It represents the following information:
- Video ID: v_b0rFmH4XXg8
- Video Duration: 176.7 seconds
- Sentence Query: "Person kayaking through water between rocky cliffs."
- Start Time: 22.1 seconds
- End Time: 110.4 seconds
To access the video, remove the "v_" prefix from the Video ID; the remainder is the YouTube video ID. For example, for the Video ID v_b0rFmH4XXg8, the corresponding YouTube link is: https://www.youtube.com/watch?v=b0rFmH4XXg8
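The annotation format described above can be parsed with a few lines of Python. This is a minimal sketch, not part of the released code: the `Annotation` class and `parse_annotation` function are hypothetical names, and the field order (video ID, start, end, duration, then `##` and the query) is inferred from the single example line shown above.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    video_id: str    # e.g. "v_b0rFmH4XXg8"
    start: float     # start time in seconds
    end: float       # end time in seconds
    duration: float  # video duration in seconds
    query: str       # sentence query

    @property
    def youtube_url(self) -> str:
        # Strip the leading "v_" to recover the YouTube video ID.
        yt_id = self.video_id[2:] if self.video_id.startswith("v_") else self.video_id
        return f"https://www.youtube.com/watch?v={yt_id}"


def parse_annotation(line: str) -> Annotation:
    # Metadata and the sentence query are separated by "##" (assumed format).
    meta, query = line.rstrip("\n").split("##", 1)
    video_id, start, end, duration = meta.split()
    return Annotation(video_id, float(start), float(end), float(duration), query.strip())


example = "v_b0rFmH4XXg8 22.1 110.4 176.7##Person kayaking through water between rocky cliffs."
ann = parse_annotation(example)
print(ann.youtube_url)
```

Applying `parse_annotation` to each line of vid_group_dataset_release.txt should yield one record per annotation, assuming every line follows the same layout as the example.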
Dataset Copyright
This dataset is released under the CC BY 4.0 license. To comply with legal requirements, it provides YouTube links to the videos instead of distributing the video files. Users are advised to download the videos independently, to strictly adhere to YouTube's Terms of Service and all applicable copyright policies when accessing or using the video content, and to follow the license terms governing the annotations.
Comparison to Existing Dataset

Code for Zero-Shot, Unsupervised, and Fully-Supervised Settings
To run the code, use the following command, which runs the evaluation for the 1) zero-shot, 2) unsupervised, and 3) fully-supervised settings.
python main.py --cfg ./experiment/charades/recorrect_eval_configs_on_ZeroShot+Unsup+Full.json --eval
The evaluation results highlight that:
- Compared to the fully supervised approach SimBase, our ReCorrect achieves 81.3% and 86.7% of its mIoU in the zero-shot and unsupervised settings, respectively.
- This narrow performance gap underscores the potential of our Vid-Group dataset to address the critical challenge of TVG's heavy reliance on manual annotations.
Checkpoints and Features
No extra downloads are needed to run the code: the repository is self-contained and includes the necessary features and checkpoints.
- CLIP features are available in the data/charades/feat directory.
- Pre-trained checkpoints are located in ckpt/charades:
  - zero_shot.ckpt: zero-shot model.
  - unsup.ckpt: unsupervised model.
  - full_sup.ckpt: fully supervised model.
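Before launching the evaluation, it can help to confirm that the features and checkpoints are where the code expects them. The sketch below is not part of the repository: `missing_paths` is a hypothetical helper, and the exact paths are an assumption reconstructed from the directory and file names listed above.

```python
from pathlib import Path
from typing import List

# Assumed layout, reconstructed from the names listed in this README.
EXPECTED = [
    "data/charades/feat",
    "ckpt/charades/zero_shot.ckpt",
    "ckpt/charades/unsup.ckpt",
    "ckpt/charades/full_sup.ckpt",
]


def missing_paths(repo_root: str = ".") -> List[str]:
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    print("all files present" if not missing else f"missing: {missing}")
```

Run it from the repository root; an empty result suggests the clone is complete under the assumed layout.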
Expected Results
Zero-Shot Setting
| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| ReCorrect | 66.54 | 51.15 | 28.54 | 45.63 |
| % of SimBase | 85.6% | 76.9% | 64.8% | 81.3% |
Unsupervised Setting
| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| ReCorrect | 70.96 | 54.42 | 31.10 | 48.66 |
| % of SimBase | 91.2% | 81.9% | 70.7% | 86.7% |
Fully Supervised Setting
| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| SimBase | 77.77 | 66.48 | 44.01 | 56.15 |
| ReCorrect (Ours) | 78.55 | 68.39 | 45.78 | 57.42 |
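The "% of SimBase" rows in the zero-shot and unsupervised tables are the ReCorrect scores divided by the fully supervised SimBase scores. As a quick sanity check, the following sketch recomputes them from the numbers reported above; the helper name `percent_of_baseline` is hypothetical and not part of the repository.

```python
# Scores copied from the tables above, in column order.
METRICS = ("R@0.1", "R@0.2", "R@0.3", "mIoU")
simbase = (77.77, 66.48, 44.01, 56.15)  # fully supervised SimBase
recorrect = {
    "zero-shot": (66.54, 51.15, 28.54, 45.63),
    "unsupervised": (70.96, 54.42, 31.10, 48.66),
}


def percent_of_baseline(scores, baseline):
    """Express each score as a percentage of the baseline, rounded to one decimal."""
    return [round(100.0 * s / b, 1) for s, b in zip(scores, baseline)]


for setting, scores in recorrect.items():
    ratios = percent_of_baseline(scores, simbase)
    print(setting, dict(zip(METRICS, ratios)))
```

The mIoU ratios (81.3% zero-shot, 86.7% unsupervised) are the figures quoted in the evaluation summary.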