ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
March 10, 2023
This is the PyTorch implementation of our paper:
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin, Jie Lei, Mohit Bansal, and Gedas Bertasius
In European Conference on Computer Vision, 2022.
Preparation
- Install dependencies: pip3 install -r requirements.txt
- Dataset: ActivityNet, QVHighlights, YouCook2, DiDeMo and Charades.
- Extract video frames at 3 fps.
- Extract audio features.
- Download the pretrained CLIP weights.
The download links are taken from the official CLIP4Clip repository. Download the CLIP (ViT-B/32) weights:
wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
Or download the CLIP (ViT-B/16) weights:
wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
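A large download can be silently truncated, so it may be worth checking the file before training. The sketch below assumes that the long hexadecimal path component in each download URL is the SHA-256 digest of the checkpoint, which is how OpenAI's CLIP download helper treats it. The function names are hypothetical and not part of this repository.

```python
import hashlib
from pathlib import Path

# URL copied from the wget command above.
VIT_B32_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)


def expected_sha256_from_url(url):
    """Return the second-to-last URL path component (assumed to be the SHA-256)."""
    return url.rstrip("/").split("/")[-2]


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path, url):
    """True if the file's digest matches the digest embedded in the URL."""
    return sha256_of_file(path) == expected_sha256_from_url(url)


if __name__ == "__main__":
    ckpt = Path("./modules/ViT-B-32.pt")  # where `wget -P ./modules` saves the file
    if ckpt.exists():
        print("checksum OK" if checkpoint_is_intact(ckpt, VIT_B32_URL) else "checksum mismatch")
    else:
        print(f"{ckpt} not found")
```

The same check should work for the ViT-B/16 file by passing its URL and path instead.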
Extract images and audio features
ActivityNet/
├── raw_frames/
│   └── VIDEO_NAME/
│       ├── 0001.jpg
│       ├── ...
│       └── 00...jpg
│
└── VGGSound_Audio_features_10s_aligned/
    └── VIDEO_NAME/
        ├── 0000.pt
        ├── ...
        └── 00...pt
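The README does not say which tool was used to extract frames. As one possible approach, the sketch below builds an ffmpeg command that samples a video at 3 fps and writes four-digit JPEG names into raw_frames/VIDEO_NAME/. It assumes ffmpeg is installed and that VIDEO_NAME is the video file name without its extension. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_root, fps=3):
    """Build an ffmpeg command that writes frames to frames_root/VIDEO_NAME/%04d.jpg.

    Assumption: VIDEO_NAME is the video file's stem (name without extension).
    """
    video_path = Path(video_path)
    out_dir = Path(frames_root) / video_path.stem
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={fps}",        # resample the video to `fps` frames per second
        str(out_dir / "%04d.jpg"),  # ffmpeg numbers image outputs from 1 by default
    ]


def extract_frames(video_path, frames_root, fps=3):
    """Create the output folder and run ffmpeg; raises if ffmpeg exits with an error."""
    cmd = build_ffmpeg_command(video_path, frames_root, fps)
    Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
```

Calling extract_frames("videos/VIDEO_NAME.mp4", "ActivityNet/raw_frames") would then fill ActivityNet/raw_frames/VIDEO_NAME/ with 0001.jpg, 0002.jpg, and so on. Here "videos/" is a placeholder input folder.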
Extracted audio features
VGGSound features on ActivityNet Captions: Google Drive
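Once frames and audio features are in place, a quick consistency check can catch videos that appear in one folder but not the other. The sketch below assumes the ActivityNet layout shown above, with one VIDEO_NAME subfolder per video under both raw_frames/ and VGGSound_Audio_features_10s_aligned/. The function name is hypothetical. A video listed as missing audio may simply have no audio track, so this is a diagnostic rather than a requirement of the code.

```python
from pathlib import Path


def find_unpaired_videos(dataset_root):
    """Compare VIDEO_NAME folders under the frame and audio-feature directories."""
    root = Path(dataset_root)

    def video_names(folder):
        # A missing folder is treated as containing no videos.
        if not folder.is_dir():
            return set()
        return {entry.name for entry in folder.iterdir() if entry.is_dir()}

    frames = video_names(root / "raw_frames")
    audio = video_names(root / "VGGSound_Audio_features_10s_aligned")
    return {
        "missing_audio": sorted(frames - audio),   # frames exist, audio features do not
        "missing_frames": sorted(audio - frames),  # audio features exist, frames do not
    }
```

Running find_unpaired_videos("ActivityNet") returns two sorted lists of video names that can be inspected before training.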
Train and evaluate
ActivityNet Captions: bash run_act.sh
DiDeMo: bash run_didemo.sh
Charades: bash run_cha.sh
QVHighlights: bash run_qvh.sh
YouCook2: bash run_yc2.sh
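For running several datasets from one entry point, the commands above can be wrapped in a small launcher. This is a convenience sketch only: the script names come from the list above, while the dictionary keys and function names are hypothetical. It assumes it is run from the repository root.

```python
import subprocess

# Script names as listed above; the lowercase keys are illustrative.
RUN_SCRIPTS = {
    "activitynet": "run_act.sh",
    "didemo": "run_didemo.sh",
    "charades": "run_cha.sh",
    "qvhighlights": "run_qvh.sh",
    "youcook2": "run_yc2.sh",
}


def command_for(dataset):
    """Return the bash command for a dataset name (case-insensitive)."""
    key = dataset.lower()
    if key not in RUN_SCRIPTS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(RUN_SCRIPTS)}")
    return ["bash", RUN_SCRIPTS[key]]


def launch(dataset):
    """Run the training and evaluation script; raises if the script fails."""
    subprocess.run(command_for(dataset), check=True)
```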
Cite
If you use this code in your research, please cite:
@InProceedings{ECLIPSE_ECCV22,
author = {Yan-Bo Lin and Jie Lei and Mohit Bansal and Gedas Bertasius},
title = {ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {October},
year = {2022}
}
Acknowledgments
Our code is based on CLIP4Clip and VGGSound.
Future work
- Preprocessed video frames and audio features
License
This project is licensed under the MIT License, as found in the LICENSE file.
