MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval (ECCV 2022)
July 5, 2022
Main Results on Downstream Tasks
Text-to-video Retrieval on MSR-VTT

Text-to-video Retrieval on MSVD, LSMDC and DiDeMo

Visualization
Local Visual Semantics Capture
We visualize the self-attention map of the video encoder by computing the self-attention of the [CLS] token in the last block. Our pre-trained model attends strongly to the significant local regions in the video.
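As a rough illustration of this kind of visualization, the sketch below computes the attention weights of a [CLS] query over patch keys and reshapes them into a spatial grid. It is a minimal, dependency-free sketch, not the code used in this repository: the function names, the toy values, and the choice to normalize over patch tokens only (ignoring the [CLS] key itself and multiple heads) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention_map(q_cls, keys, grid_h, grid_w):
    """Attention of the [CLS] query over patch keys, reshaped to a grid.

    q_cls: query vector of the [CLS] token (length d).
    keys:  one key vector per patch token, in row-major order.
    Assumption: weights are normalized over patch tokens only.
    """
    d = len(q_cls)
    # Scaled dot-product score between the [CLS] query and each patch key.
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Reshape the flat weights into grid_h rows of grid_w values.
    return [weights[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: a 2x2 patch grid with 2-dimensional keys (hypothetical values).
q = [1.0, 0.0]
k = [[4.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]]
attn = cls_attention_map(q, k, grid_h=2, grid_w=2)
```

In practice the resulting grid would be upsampled to the frame resolution and overlaid on each frame as a heat map.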

Fine-grained Video-text Alignment
We visualize the cross-modality alignment between text and video tokens by calculating the similarity map between the features produced by the text encoder and the video encoder. Our pre-trained model accurately aligns words with their corresponding visual regions.
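A token-level similarity map of this kind can be sketched as follows. The example assumes cosine similarity between L2-normalized features, which the text above does not specify. The function names and toy features are hypothetical and are not taken from this repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def similarity_map(text_tokens, video_tokens):
    """Cosine similarity between every text token and every video token.

    Returns one row per word and one column per visual token.
    """
    t_norm = [l2_normalize(t) for t in text_tokens]
    v_norm = [l2_normalize(v) for v in video_tokens]
    return [[sum(a * b for a, b in zip(t, v)) for v in v_norm] for t in t_norm]


# Toy features (hypothetical): two words and three visual tokens.
words = {"dog": [1.0, 0.0], "grass": [0.0, 2.0]}
patches = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sim = similarity_map(list(words.values()), patches)

# Index of the best-matching visual token for each word.
best_patch = {
    word: max(range(len(patches)), key=lambda j: row[j])
    for word, row in zip(words, sim)
}
```

Each row of `sim` can then be reshaped to the patch grid of a frame to show which regions a given word aligns with.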

Pre-trained Model
Our pre-trained model can be downloaded from Pre-trained Model; it contains the weights of the Video Encoder and the Text Encoder.
Video Encoder
Our video encoder is identical to that of Frozen and consists of a stack of divided space-time self-attention blocks. Compared with the video encoder of MCQ, the video encoder of MILES adds temporal attention, which enables reasoning among the visible regions along the temporal dimension for masked video modeling.
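The idea of divided space-time attention can be sketched in a few lines: each patch location first attends across frames (temporal attention), then each frame attends across its own patches (spatial attention). The sketch below is a simplification under stated assumptions: it uses a single head with identity Q/K/V projections and omits the [CLS] token, layer normalization, and the MLP. All names are hypothetical, and it is not the Frozen or MILES implementation.

```python
import math


def self_attention(seq):
    """Single-head self-attention over a list of vectors.

    Assumption: identity Q/K/V projections (real blocks learn them).
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Weighted sum of the value vectors (here the inputs themselves).
        out.append([sum(e / z * v[i] for e, v in zip(exps, seq)) for i in range(d)])
    return out


def add(x, y):
    """Element-wise sum, used for residual connections."""
    return [a + b for a, b in zip(x, y)]


def divided_space_time_block(x):
    """x[t][n] is the feature vector of patch n in frame t."""
    T, N = len(x), len(x[0])

    # Temporal attention: each patch location attends across frames.
    temporal = [[None] * N for _ in range(T)]
    for n in range(N):
        track = [x[t][n] for t in range(T)]
        for t, attended in enumerate(self_attention(track)):
            temporal[t][n] = add(track[t], attended)  # residual connection

    # Spatial attention: each frame attends across its own patches.
    result = []
    for frame in temporal:
        attended = self_attention(frame)
        result.append([add(tok, att) for tok, att in zip(frame, attended)])
    return result


# Toy video (hypothetical values): T=3 frames, N=2 patches, D=2 channels.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
out = divided_space_time_block(video)
```

Splitting attention this way costs on the order of T + N comparisons per token instead of T × N for joint space-time attention, which is the usual motivation for the divided design.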
Downstream Retrieval (Zero-shot on MSR-VTT)
- Download our pre-trained model from Pre-trained Model.
- Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_MILES.json".
bash sctripts/test_retrieval_MILES.sh
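Text-to-video retrieval benchmarks such as MSR-VTT are commonly scored with Recall@K. The sketch below shows how that metric can be computed from a text-to-video similarity matrix. It assumes that query i is paired with video i. The function name and toy scores are hypothetical and are not output of the script above.

```python
def recall_at_k(sim, k):
    """Fraction of text queries whose correct video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: video i is the correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix (hypothetical values): 3 queries x 3 videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

Here the second query ranks its correct video second, so it counts as a hit for R@2 but not for R@1.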
Acknowledgement
Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" (https://github.com/m-bain/frozen-in-time.git).
Citation
If our code is helpful to your work, please cite:
@article{ge2022miles,
title={MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval},
author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Wang, Alex Jinpeng and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Luo, Ping},
journal={arXiv preprint arXiv:2204.12408},
year={2022}
}
