Preserving Modality Structure Improves Multi-Modal Learning
February 9, 2024
Swetha, Sirnam, Rizve, Mamshad Nayeem, Shvetsova, Nina, Kuehne, Hilde and Shah, Mubarak
Accepted at ICCV 2023!
This repo is the official implementation of Preserving Modality Structure Improves Multi-Modal Learning.

Repository contains:
- Training Code
- Model Weights [Todo]
- Fine-tuning and evaluation datasets: MSR-VTT and YouCook2
Get started
- Create an environment:
conda create python=3.8 -y -n multisk
conda activate multisk
conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
- If needed, download data.tar with features and spectrograms to fine-tune and evaluate on YouCook2 and MSR-VTT here. Extract the archive:
tar -xvf data.tar
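After setting up the environment, it can help to confirm that the installed packages match the pinned versions. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes the conda `pytorch` package is registered under the distribution name `torch`.

```python
from importlib import metadata

# Versions pinned in the install commands of this README.
# Assumption: the conda "pytorch" package is visible as the distribution "torch".
PINNED = {
    "torch": "1.7.0",
    "torchvision": "0.8.0",
    "torchaudio": "0.7.0",
    "gensim": "3.8.0",
    "sacred": "0.8.2",
    "humanize": "3.14.0",
    "transformers": "4.10.2",
    "librosa": "0.8.1",
    "timm": "0.4.12",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Some builds append a local tag such as "+cu102"; ignore it when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run inside the activated multisk environment, the script prints one line per package that is missing or has a different version, and prints nothing when everything matches.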
Pretraining
- Downloading HowTo100M and feature extraction. Please note that the HowTo100M videos require a huge amount of storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail in https://github.com/roudimit/AVLnet/blob/main/training.md.
- Review configs/pretraining/resnet_tva.yaml and make sure csv, features_path, features_path_audio, and caption_path point to the correct paths. The CSV file should contain one column named 'path' with a list of videos. An example of the CSV file that we used in training can be found here (HowTo100M_1166_videopaths.txt).
- Train:
python train.py --config configs/pretraining/resnet_tva.yaml
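Before starting a long pretraining run, the video list can be checked against the format described above (a CSV with a column named 'path'). This is a minimal sketch using only the standard library; the function name `read_video_paths` is hypothetical and not part of the repository.

```python
import csv


def read_video_paths(csv_file):
    """Return the entries of the 'path' column, or raise if the column is missing."""
    with open(csv_file, newline="") as f:
        reader = csv.DictReader(f)
        # The first row is treated as the header and must name a 'path' column.
        if reader.fieldnames is None or "path" not in reader.fieldnames:
            raise ValueError(
                f"{csv_file} has no 'path' column; found {reader.fieldnames}"
            )
        # Skip rows where the path entry is empty.
        return [row["path"] for row in reader if row["path"]]
```

For example, `len(read_video_paths("my_videos.csv"))` (with `my_videos.csv` standing in for your own file) gives the number of listed videos, and a missing header fails early with a `ValueError` rather than partway through training.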
Using the model on your own data
If you want to use the model on your own data, please follow the steps described in https://github.com/roudimit/AVLnet for feature extraction and audio spectrogram extraction.
Cite
If you use this code in your research, please cite:
@InProceedings{Swetha_2023_ICCV,
author = {Swetha, Sirnam and Rizve, Mamshad Nayeem and Shvetsova, Nina and Kuehne, Hilde and Shah, Mubarak},
title = {Preserving Modality Structure Improves Multi-Modal Learning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {21993-22003}
}
Contact
If you have any problems with the code or have a question, please open an issue or email swetha(dot)sirnam at ucf.edu. I'll try to answer as soon as possible.
Acknowledgments and Licenses
The main structure of the code is based on everything-at-once, which is built upon frozen-in-time.
The code in davenet.py, layers.py, avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).