CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

November 20, 2023

The implementation of the paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.

CLIP4Clip is a video-text retrieval model based on CLIP (ViT-B). In this work, we investigate three similarity calculation approaches: parameter-free type, sequential type, and tight type. The model achieves state-of-the-art (SOTA) results on MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
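To illustrate the parameter-free type, the sketch below mean-pools per-frame embeddings into a single video embedding and scores it against a text embedding with cosine similarity. It is a minimal pure-Python sketch under stated assumptions, not the code in this repository: the function names (`l2_normalize`, `mean_pool`, `parameter_free_similarity`) and the toy 2-D embeddings are made up for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]


def parameter_free_similarity(text_embedding, frame_embeddings):
    """Cosine similarity between a text embedding and the mean-pooled frames.

    No learnable parameters are involved, hence "parameter-free".
    """
    video = l2_normalize(mean_pool(frame_embeddings))
    text = l2_normalize(text_embedding)
    return sum(t * v for t, v in zip(text, video))
```

For example, `parameter_free_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` gives approximately 0.707, since the pooled video embedding points halfway between the two frames.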


Requirements

  1. Run pip install -r requirement.text to install exactly the same dependencies.

  2. Alternatively, use conda-pack to set up the environment downloaded from here with [0dhw]:

    pip install conda-pack
    mkdir -p [path_to_conda_env]    # (e.g., ~/anaconda/envs/ENV_NAME)
    tar -zxvf [ENV_NAME].tar.gz -C [path_to_conda_env]
    

Data Preparation

1. For MSRVTT

The official data and video links can be found at link.

For convenience, you can also download the splits and captions with:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

The raw videos are also shared by Frozen in Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

2. For MSVD

Raw videos can be downloaded from link.

The splits and raw_captions can be found in the excellent collaborative-experts project. For convenience, you can also download them with:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip

Compress Video (optional)

Our UniPT adopts this operation for speed-up.

python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]

This script compresses each video to 3 fps with a width of 224 (or a height of 224). Modify the variables in the script to customize it.
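To make the resizing concrete, here is a minimal sketch of how such a compression command could be assembled. It is not the content of preprocess/compress_video.py. It assumes ffmpeg is the underlying tool, that the shorter side is the one set to 224, and that the input resolution is already known to the caller; `target_size` and `build_ffmpeg_command` are hypothetical helper names.

```python
def target_size(width, height, short_side=224):
    """Resize so the shorter side equals short_side, keeping the aspect ratio.

    The longer side is rounded to an even number, which many video
    encoders require.
    """
    if width >= height:
        new_h = short_side
        new_w = int(round(width * short_side / height / 2)) * 2
    else:
        new_w = short_side
        new_h = int(round(height * short_side / width / 2)) * 2
    return new_w, new_h


def build_ffmpeg_command(src, dst, width, height, fps=3, short_side=224):
    """Return an ffmpeg argument list that rescales and resamples one video."""
    new_w, new_h = target_size(width, height, short_side)
    return [
        "ffmpeg",
        "-y",                              # overwrite the output if it exists
        "-i", src,                         # input video
        "-vf", f"scale={new_w}:{new_h}",   # resize filter
        "-r", str(fps),                    # output frame rate
        dst,
    ]
```

The returned list can be passed to `subprocess.run(...)`. Building the command separately from running it keeps the sizing logic easy to check without ffmpeg installed.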

Training and Testing

  1. Download the CLIP (ViT-B/32) weights to CLIP4Clip/modules/ViT-B-32.pt.

    wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt

  2. Then run ./train_xxxx_tuning.sh to obtain the corresponding model in ckpts/.

  3. One can download our best checkpoints for MSR-VTT and MSVD with [0dhw].
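Because the CLIP weight file is large, it can be worth checking that the download completed intact before training. The sketch below treats the long hexadecimal segment of the download URL as the file's SHA-256 digest, which is how the official CLIP loader checks its own downloads; treat that as an assumption if your copy comes from another source. The helper names (`expected_sha256`, `file_sha256`, `verify_download`) are hypothetical and not part of this repository.

```python
import hashlib

CLIP_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/"
    "ViT-B-32.pt"
)


def expected_sha256(url):
    """Take the second-to-last URL path segment as the expected digest."""
    return url.rstrip("/").split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, url=CLIP_URL):
    """Return True if the file's SHA-256 matches the digest in the URL."""
    return file_sha256(path) == expected_sha256(url)
```

For example, `verify_download("modules/ViT-B-32.pt")` returns `False` for a truncated or corrupted file, in which case the weights should be downloaded again.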

Acknowledgments

Our code is based on CLIP and UniVL.