Preserving Modality Structure Improves Multi-Modal Learning

February 9, 2024

Swetha, Sirnam, Rizve, Mamshad Nayeem, Shvetsova, Nina, Kuehne, Hilde and Shah, Mubarak

Accepted at ICCV 2023!

This repo is the official implementation of Preserving Modality Structure Improves Multi-Modal Learning.

Project Page

Hugging Face


Repository contains:

Training Code
Model Weights [Todo]
Fine-tuning and evaluation datasets: MSR-VTT and YouCook2

Get started

Create an environment:

conda create python=3.8 -y -n multisk
conda activate multisk 
conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
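
Once the environment is created, a quick check that the pinned versions were actually installed can save debugging time later. The snippet below is a minimal sketch, not part of this repository: `check_pins` is a hypothetical helper, and it assumes the conda `pytorch` package is visible to Python under the distribution name `torch`. Some builds report a version with a local suffix (for example a CUDA tag), which this strict comparison would flag as a mismatch.

```python
from importlib import metadata

# Pins copied from the install commands above (distribution name -> version).
# Assumption: the conda "pytorch" package is registered as "torch".
PINNED = {
    "torch": "1.7.0",
    "torchvision": "0.8.0",
    "torchaudio": "0.7.0",
    "gensim": "3.8.0",
    "sacred": "0.8.2",
    "humanize": "3.14.0",
    "transformers": "4.10.2",
    "librosa": "0.8.1",
    "timm": "0.4.12",
}


def check_pins(pins):
    """Return {name: (expected, found)} for packages that are missing
    (found is None) or installed with a different version."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script inside the activated `multisk` environment prints one line per mismatch and nothing if every pin matches.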

If needed, download data.tar with features and spectrograms to fine-tune and evaluate on YouCook2 and MSR-VTT here. Extract the archive: tar -xvf data.tar

Pretraining

Download HowTo100M and extract features. Please note that HowTo100M videos require a huge amount of storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail in https://github.com/roudimit/AVLnet/blob/main/training.md.
Review configs/pretraining/resnet_tva.yaml and make sure csv, features_path, features_path_audio, and caption_path point to the correct paths. The CSV file should contain one column named 'path' with a list of videos. An example of the CSV file that we used in training can be found here (HowTo100M_1166_videopaths.txt).
Train: python train.py --config configs/pretraining/resnet_tva.yaml
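
The CSV layout and the path check from the steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not code from this repository: `write_video_csv`, `read_video_csv`, `find_path_entries`, and `missing_paths` are hypothetical helpers. The exact form of each entry in the 'path' column (video ID or file path) should follow the example file mentioned above, and the nesting of the four keys inside the YAML config is not assumed, so the search walks the whole config.

```python
import csv
from pathlib import Path

# Config keys named in the step above; their nesting in the YAML is not assumed.
PATH_KEYS = ("csv", "features_path", "features_path_audio", "caption_path")


def write_video_csv(video_paths, csv_path):
    """Write a CSV with a single column named 'path', one video per row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])
        for video in video_paths:
            writer.writerow([video])


def read_video_csv(csv_path):
    """Read the CSV back and verify it has exactly one column named 'path'."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["path"]:
            raise ValueError(f"expected a single 'path' column, got {reader.fieldnames}")
        return [row["path"] for row in reader]


def find_path_entries(config, keys=PATH_KEYS):
    """Recursively collect (key, value) pairs for the given keys
    from a config made of nested dicts and lists."""
    found = []
    if isinstance(config, dict):
        for key, value in config.items():
            if key in keys and isinstance(value, str):
                found.append((key, value))
            else:
                found.extend(find_path_entries(value, keys))
    elif isinstance(config, list):
        for item in config:
            found.extend(find_path_entries(item, keys))
    return found


def missing_paths(config):
    """Return the (key, value) pairs whose value does not exist on disk."""
    return [(k, v) for k, v in find_path_entries(config) if not Path(v).exists()]
```

To use it on the real config, load configs/pretraining/resnet_tva.yaml into a dict first (for example with `yaml.safe_load`, assuming PyYAML is available in the environment) and pass the result to `missing_paths`. An empty list means every configured path exists.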

Using the model on your own data

If you want to use the model on your own data, please follow the steps described in https://github.com/roudimit/AVLnet for feature extraction and audio spectrogram extraction.

Cite

If you use this code in your research, please cite:

@InProceedings{Swetha_2023_ICCV,
    author    = {Swetha, Sirnam and Rizve, Mamshad Nayeem and Shvetsova, Nina and Kuehne, Hilde and Shah, Mubarak},
    title     = {Preserving Modality Structure Improves Multi-Modal Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21993-22003}
}

Contact

If you have any problems with the code or have a question, please open an issue or email swetha(dot)sirnam at ucf.edu. I'll try to answer as soon as possible.

Acknowledgments and Licenses

The main structure of the code is based on everything-at-once, which is built upon frozen-in-time.

The code in davenet.py, layers.py, avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).