MultiMAE: Multi-modal Multi-task Masked Autoencoders
October 23, 2022
Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir
Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.
We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than the baselines.
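The multi-modal masking step above can be sketched as follows. This is an illustrative sketch, not the repository's actual implementation: per the paper, the share of visible tokens given to each modality is drawn from a symmetric Dirichlet distribution, and the function and argument names here are hypothetical.

```python
import torch

def sample_visible_tokens(num_tokens_per_modality, num_visible, alpha=1.0):
    """Sketch of MultiMAE-style multi-modal masking: draw per-modality
    shares from a symmetric Dirichlet distribution, then pick that many
    random visible token indices per modality. Names are illustrative."""
    modalities = list(num_tokens_per_modality.keys())
    # The Dirichlet sample decides how many visible tokens each modality gets.
    shares = torch.distributions.Dirichlet(
        torch.full((len(modalities),), alpha)).sample()
    counts = (shares * num_visible).floor().long()
    # Give any remainder left over from flooring to the first modality.
    counts[0] += num_visible - counts.sum()
    visible = {}
    for m, k in zip(modalities, counts.tolist()):
        k = min(k, num_tokens_per_modality[m])
        # Uniformly sample k visible token indices for this modality.
        perm = torch.randperm(num_tokens_per_modality[m])
        visible[m] = perm[:k]
    return visible

# Example: 196 patches each for RGB, depth, and semantic maps; keep 98 visible.
vis = sample_visible_tokens({"rgb": 196, "depth": 196, "semseg": 196}, 98)
```

The encoder then processes only the selected visible tokens, which is what makes the pre-training efficient; the decoders reconstruct everything else.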
Catalog
- Pre-trained models
- MultiMAE pre-training code
- ImageNet-1K classification fine-tuning code
- Semantic segmentation fine-tuning code (single-modal & multi-modal)
- Depth estimation fine-tuning code
- Taskonomy fine-tuning code
- Colab & Hugging Face demos
- Download links for ImageNet-1K depth and semantic segmentation pseudo labels
Pre-trained models
We provide the weights of our pre-trained MultiMAE ViT-B model in both the MultiViT (multi-modal) format and the timm (RGB-only) format.
For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.
| Method | Arch. | Pre-training modalities | Pre-training epochs | Weights (MultiViT) | Weights (timm) | Config |
|---|---|---|---|---|---|---|
| MAE | ViT-B | RGB | 1600 | download | download | See MAE |
| MultiMAE | ViT-B | RGB+D+S | 1600 | download | download | link |
These pre-trained models can then be fine-tuned using this codebase to reach the following performance:
| Method | ImageNet-1K Classif. @1 (RGB) | ADE20K Sem. Seg. mIoU (RGB) | Hypersim Sem. Seg. mIoU (RGB) | Hypersim (D) | Hypersim (RGB+D) | NYUv2 Sem. Seg. mIoU (RGB) | NYUv2 (D) | NYUv2 (RGB+D) | NYUv2 Depth δ1 (RGB) |
|---|---|---|---|---|---|---|---|---|---|
| Sup. (DeiT) | 81.8 | 45.8 | 33.9 | - | - | 50.1 | - | - | 80.7 |
| MAE | 83.3 | 46.2 | 36.5 | - | - | 50.8 | - | - | 85.1 |
| MultiMAE | 83.3 | 46.2 | 37.0 | 38.5 | 47.6 | 52.0 | 41.4 | 56.0 | 86.4 |
Model formats
We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See multimae/multimae.py for the documentation and implementation of MultiMAE / MultiViT.
You can convert between these formats using the provided vit2multimae_converter.py and multimae2vit_converter.py scripts.
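Conceptually, converting from the multi-modal format to the timm format amounts to extracting the shared ViT trunk from the multi-modal wrapper. The sketch below is a hypothetical illustration of that kind of key remapping, assuming an `encoder.` prefix on trunk weights; the actual checkpoint keys and conversion logic live in the provided converter scripts.

```python
import torch

def multivit_to_vit(state_dict, prefix="encoder."):
    """Illustrative sketch: keep only the shared ViT trunk weights and
    strip the wrapper prefix so a plain timm ViT can load them. The
    'encoder.' prefix and adapter key names here are assumptions, not
    the actual keys used in the MultiMAE checkpoints."""
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

# Dummy state dict standing in for a multi-modal checkpoint.
dummy = {
    "encoder.blocks.0.attn.qkv.weight": torch.zeros(3),
    "input_adapters.depth.proj.weight": torch.zeros(3),  # dropped: not trunk
}
vit_sd = multivit_to_vit(dummy)
```

Going the other direction (ViT to MultiViT) would re-add the prefix and leave the modality-specific adapters to be freshly initialized.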
Usage
Set-up
See SETUP.md for set-up instructions.
Pre-training
See PRETRAINING.md for pre-training instructions.
Fine-tuning
See FINETUNING.md for fine-tuning instructions.
Demo & visualizations
For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.
Acknowledgement
This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.
License
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
Citation
If you find this repository helpful, please consider citing our work:
@inproceedings{bachmann2022multimae,
author = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
title = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
booktitle = {European Conference on Computer Vision},
year = {2022},
}