Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

December 6, 2024

Paper Link: https://arxiv.org/pdf/2412.00811

In this paper, we propose a new dataset and algorithm for video moment retrieval (VMR) that together reduce the high cost of human annotation. Our experiments highlight that:

  • Compared to the fully supervised approach SimBase, our ReCorrect model achieves 81.3% and 86.7% of its mIoU in the zero-shot and unsupervised settings, respectively.
  • This narrow performance gap underscores the potential of our Vid-Morp dataset to address the critical challenge of VMR's heavy reliance on manual annotations.

Quick Start

To run the code, use the following command, which runs evaluation for the 1) zero-shot, 2) unsupervised, and 3) fully supervised settings.

python main.py --cfg ./experiment/charades/recorrect_eval_configs_on_ZeroShot+Unsup+Full.json --eval

No additional downloads are needed to run the code: the repository is self-contained and includes the necessary features and checkpoints.

  1. CLIP features are available in the data/charades/feat directory.
  2. Pre-trained checkpoints are located in ckpt/charades:
    • zero_shot.ckpt: zero-shot model.
    • unsup.ckpt: unsupervised model.
    • full_sup.ckpt: fully supervised model.
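Before launching the evaluation, it can help to confirm that the bundled files are where this README says they are. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the `EXPECTED_PATHS` list are illustrative assumptions, and only the paths themselves are taken from the text above.

```python
from pathlib import Path

# Paths as listed in this README, relative to the repository root.
# The list and the helper below are illustrative, not part of the repository.
EXPECTED_PATHS = [
    "data/charades/feat",
    "ckpt/charades/zero_shot.ckpt",
    "ckpt/charades/unsup.ckpt",
    "ckpt/charades/full_sup.ckpt",
    "experiment/charades/recorrect_eval_configs_on_ZeroShot+Unsup+Full.json",
]


def find_missing(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing files or directories:")
        for path in missing:
            print("  " + path)
    else:
        print("All bundled features, checkpoints, and the config were found.")
```

Run it from the repository root. An empty result suggests the clone is complete and the evaluation command above should find its inputs.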

Fully Supervised Setting

| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| SimBase | 77.77 | 66.48 | 44.01 | 56.15 |
| ReCorrect (Ours) | 78.55 | 68.39 | 45.78 | 57.42 |
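For readers unfamiliar with the metrics, the sketch below shows how R@m and mIoU are commonly computed in video moment retrieval: R@m is the percentage of queries whose top-1 predicted segment has temporal IoU of at least m with the ground truth, and mIoU is the mean IoU over all queries. This assumes the standard top-1 definitions; the function names are illustrative, and the repository's evaluation code may differ in details such as whether the threshold comparison is strict.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds, gts, thresholds=(0.1, 0.2, 0.3)):
    """Return R@m for each threshold and mIoU, all as percentages.

    preds and gts are equal-length lists of (start, end) segments,
    one top-1 prediction and one ground-truth segment per query.
    """
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    results = {
        f"R@{m}": 100.0 * sum(iou >= m for iou in ious) / n
        for m in thresholds
    }
    results["mIoU"] = 100.0 * sum(ious) / n
    return results


if __name__ == "__main__":
    # Toy data: one perfect prediction and one complete miss.
    preds = [(0.0, 10.0), (5.0, 15.0)]
    gts = [(0.0, 10.0), (20.0, 30.0)]
    print(evaluate(preds, gts))
```

On the toy data, one of the two queries is localized perfectly and the other not at all, so every metric comes out at 50.0.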

 

Zero-Shot Setting

| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| ReCorrect | 66.54 | 51.15 | 28.54 | 45.63 |
| % of SimBase | 85.6% | 76.9% | 64.8% | 81.3% |

 

Unsupervised Setting

| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| ReCorrect | 70.96 | 54.42 | 31.10 | 48.66 |
| % of SimBase | 91.2% | 81.9% | 70.7% | 86.7% |
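The "% of SimBase" rows divide each ReCorrect score by the corresponding fully supervised SimBase score. The short sketch below reproduces that arithmetic from the numbers reported in the tables above; the variable and function names are illustrative only.

```python
# Scores copied from the tables in this README.
SIMBASE_FULL = {"R@0.1": 77.77, "R@0.2": 66.48, "R@0.3": 44.01, "mIoU": 56.15}
RECORRECT_ZERO_SHOT = {"R@0.1": 66.54, "R@0.2": 51.15, "R@0.3": 28.54, "mIoU": 45.63}
RECORRECT_UNSUP = {"R@0.1": 70.96, "R@0.2": 54.42, "R@0.3": 31.10, "mIoU": 48.66}


def percent_of_baseline(scores, baseline):
    """Express each score as a percentage of the baseline, to one decimal."""
    return {k: round(100.0 * scores[k] / baseline[k], 1) for k in baseline}


if __name__ == "__main__":
    print("zero-shot:   ", percent_of_baseline(RECORRECT_ZERO_SHOT, SIMBASE_FULL))
    print("unsupervised:", percent_of_baseline(RECORRECT_UNSUP, SIMBASE_FULL))
```

The mIoU entries, 81.3 for zero-shot and 86.7 for unsupervised, are the headline figures quoted at the top of this README.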

 

Motivation


A crucial challenge in video moment retrieval is its heavy reliance on extensive manual annotations for training. To overcome this, we introduce a large-scale dataset for Video Moment Retrieval Pretraining (Vid-Morp), collected with minimal human involvement. Vid-Morp comprises over 50K in-the-wild videos and 200K pseudo training samples. Models pretrained on Vid-Morp significantly reduce annotation costs and generalize well across diverse downstream settings.

Dataset

Dataset Download

To access the dataset download link, please send an email to peijun001@e.ntu.edu.sg. Note that the dataset is for academic use only.

Comparison to Existing Datasets

[Figure: Dataset comparison]

Citation

If you use our code or dataset in your research, please cite with:

@article{bao2024vid,
  title={Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild},
  author={Bao, Peijun and Kong, Chenqi and Shao, Zihao and Ng, Boon Poh and Er, Meng Hwa and Kot, Alex C},
  journal={arXiv preprint arXiv:2412.00811},
  year={2024}
}