[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

January 22, 2025

Authors: Shoubin Yu, Jaehong Yoon, Mohit Bansal

University of North Carolina at Chapel Hill

🔥 News

Jan 23, 2025. CREMA has been accepted to ICLR 2025!
Jun 14, 2024. Check out our new arXiv version (v2) for exciting additions to CREMA:
- New modality-sequential modular training and a modality-adaptive early exit strategy to handle learning with many modalities.
- More unique/rare multimodal reasoning tasks (video-touch and video-thermal QA) to further demonstrate the generalizability of CREMA.

Code structure


# CREMA code
./lavis/

# running scripts for CREMA training/inference
./run_scripts

Setup

Install Dependencies

(Optional) Create a conda environment

conda create -n crema python=3.8
conda activate crema

Build from source

pip install -e .
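
After installing, it can help to confirm that the editable install is visible to the active interpreter before launching any scripts. The following is a minimal sketch, assuming the package installed by `pip install -e .` is importable as `lavis` (inferred from the `./lavis/` directory listed above); the helper name `is_importable` is hypothetical and not part of the codebase.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the editable install exposes a top-level package named `lavis`.
    if is_importable("lavis"):
        print("lavis: found")
    else:
        print("lavis: NOT found; check that the crema environment is active "
              "and re-run `pip install -e .`")
```

If the package is reported as missing, the most likely cause is that the command was run outside the `crema` conda environment.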

Download Models

Pre-trained Models

Visual Encoder: we adopt the pre-trained ViT-G (1B); the codebase downloads the model automatically.

Audio Encoder: we use the pre-trained BEATs (iter3+). Please download the model here and update the path in the code.

3D Encoder: we conduct offline feature extraction following 3D-LLM. Please refer to this page for pre-extracted features, and change the storage path in the dataset config.

Multimodal Q-Former: we initialize the query tokens and FC layer for each MMQA in the Multimodal Q-Former from pre-trained BLIP-2 model checkpoints. We host the Multimodal Q-Former with pre-trained MMQA-audio and MMQA-3D on HuggingFace, and the Multimodal Q-Former initialized from BLIP-2 can be found here.

Fine-tuned Models

Dataset	Modalities
SQA3D	Video+3D+Depth+Normal
MUSIC-AVQA	Video+Audio+Flow+Normal+Depth
NExT-QA	Video+Flow+Depth+Normal
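
The modality combinations in the table above are written as `+`-joined names. When scripting over several checkpoints, it may be convenient to split them into normalized lists. The sketch below is illustrative only: the dictionary `FINETUNED_MODALITIES`, the alias map, and `parse_modalities` are hypothetical helpers, and the identifiers used in the actual configs may differ.

```python
# Modality combinations copied from the fine-tuned models table.
FINETUNED_MODALITIES = {
    "SQA3D": "Video+3D+Depth+Normal",
    "MUSIC-AVQA": "Video+Audio+Flow+Normal+Depth",
    "NExT-QA": "Video+Flow+Depth+Normal",
}

# Assumption: "Norm" is shorthand for "Normal" (surface normals).
ALIASES = {"norm": "normal"}


def parse_modalities(spec: str) -> list:
    """Split a '+'-joined modality string into lowercase, alias-resolved names."""
    names = []
    for part in spec.split("+"):
        key = part.strip().lower()
        names.append(ALIASES.get(key, key))
    return names


if __name__ == "__main__":
    for dataset, spec in FINETUNED_MODALITIES.items():
        print(dataset, parse_modalities(spec))
```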

Dataset Preparation & Feature Extraction

We test our model on:

SQA3D: we follow 3D-LLM data format.
MUSIC-AVQA: we follow the original MUSIC-AVQA data format.
NExT-QA: we follow SeViLA data format.
Touch-QA (reformulated from Touch&Go): we follow SeViLA data format, and release our data here.
Thermal-QA (reformulated from Thermal-IM): we follow SeViLA data format, and release our data here.

To get trimmed Touch-QA and Thermal-QA video frames, first download the raw videos from each original data project. Then set your custom data path in our scripts and run them in order:

python trim_video.py

python decode_frames.py
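
The two preprocessing steps must run in order, since frames are decoded from the trimmed videos. A small wrapper like the sketch below can enforce that ordering and stop on the first failure. It assumes both scripts take no command-line arguments (as in the commands shown) and that data paths are already set inside the scripts; `build_commands` and `run_preprocessing` are hypothetical names, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Order matters: trim the raw videos first, then decode frames from the trimmed clips.
PREPROCESS_STEPS = ["trim_video.py", "decode_frames.py"]


def build_commands(script_dir: str = ".") -> list:
    """Build one command per preprocessing script, using the current interpreter."""
    return [[sys.executable, str(Path(script_dir) / name)] for name in PREPROCESS_STEPS]


def run_preprocessing(script_dir: str = ".", dry_run: bool = True) -> None:
    """Print each command; execute it unless dry_run is True."""
    for cmd in build_commands(script_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails,
            # so decoding never runs on a failed trim.
            subprocess.run(cmd, check=True)
```

Calling `run_preprocessing(dry_run=False)` from the directory containing the scripts executes both steps; the default dry run only prints the commands.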

We extract various extra modalities from raw video with pre-trained models. Please refer to each model's repository and the paper appendix for more details.

We will share the extracted features.

Dataset	Multimodal Features
SQA3D	Video Frames, Depth Map, Surface Normals
MUSIC-AVQA	Video Frames, Optical Flow, Depth Map, Surface Normals
NExT-QA	Video Frames, Depth Map, Optical Flow, Surface Normals
Touch-QA	Video Frames, Surface Normals
Thermal-QA	Video Frames, Depth Map

We pre-train the MMQA modules in our CREMA framework with public modality-specific datasets:

AudioCaps for MMQA-Audio
3D-LLM for MMQA-3D

Training and Inference

We provide CREMA training and inference script examples as follows.

1) Training

sh run_scripts/crema/finetune/sqa3d.sh

2) Inference

sh run_scripts/crema/inference/sqa3d.sh
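
Both commands follow the same path pattern, `run_scripts/crema/<mode>/<dataset>.sh`. The sketch below wraps that pattern in a small launcher. Only the `sqa3d.sh` scripts are confirmed by the examples above; whether scripts with other dataset names exist is an assumption, so the launcher checks for the file before running. The function names `script_path` and `run` are hypothetical.

```python
import subprocess
from pathlib import Path

VALID_MODES = ("finetune", "inference")


def script_path(mode: str, dataset: str = "sqa3d", root: str = "run_scripts/crema") -> Path:
    """Return the expected location of a training or inference script."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return Path(root) / mode / f"{dataset}.sh"


def run(mode: str, dataset: str = "sqa3d") -> None:
    """Run the script with `sh`, failing early if it does not exist."""
    path = script_path(mode, dataset)
    if not path.is_file():
        raise FileNotFoundError(f"no script at {path}; run from the repository root")
    subprocess.run(["sh", str(path)], check=True)
```

For example, `run("finetune")` followed by `run("inference")` reproduces the two commands above for SQA3D.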

Acknowledgments

We thank the developers of LAVIS, BLIP-2, CLIP, and X-InstructBLIP for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@article{yu2024crema,
  title={CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion},
  author={Yu, Shoubin and Yoon, Jaehong and Bansal, Mohit},
  journal={ICLR},
  year={2025}
}
