MultiMAE: Multi-modal Multi-task Masked Autoencoders
October 23, 2022
Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir
Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.
We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than the baselines.
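The multi-modal masking step above can be sketched as follows. This is an illustrative sketch, not the repository's actual implementation: per the paper, the share of visible tokens given to each modality is drawn from a symmetric Dirichlet distribution, and the function and argument names here are hypothetical.

```python
import torch

def sample_visible_tokens(num_tokens_per_modality, num_visible, alpha=1.0):
    """Sketch of MultiMAE-style multi-modal masking: draw per-modality
    shares from a symmetric Dirichlet distribution, then pick that many
    random visible token indices per modality. Names are illustrative."""
    modalities = list(num_tokens_per_modality.keys())
    # The Dirichlet sample decides how many visible tokens each modality gets.
    shares = torch.distributions.Dirichlet(
        torch.full((len(modalities),), alpha)).sample()
    counts = (shares * num_visible).floor().long()
    # Give any remainder left over from flooring to the first modality.
    counts[0] += num_visible - counts.sum()
    visible = {}
    for m, k in zip(modalities, counts.tolist()):
        k = min(k, num_tokens_per_modality[m])
        # Uniformly sample k visible token indices for this modality.
        perm = torch.randperm(num_tokens_per_modality[m])
        visible[m] = perm[:k]
    return visible

# Example: 196 patches each for RGB, depth, and semantic maps; keep 98 visible.
vis = sample_visible_tokens({"rgb": 196, "depth": 196, "semseg": 196}, 98)
```

The encoder then processes only the selected visible tokens, which is what makes the pre-training efficient; the decoders reconstruct everything else.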
Catalog
- Pre-trained models
- MultiMAE pre-training code
- ImageNet-1K classification fine-tuning code
- Semantic segmentation fine-tuning code (single-modal & multi-modal)
- Depth estimation fine-tuning code
- Taskonomy fine-tuning code
- Colab & Hugging Face demos
- Download links for ImageNet-1K depth and semantic segmentation pseudo labels
Pre-trained models
We provide the weights of our pre-trained MultiMAE ViT-B model in both the MultiViT (multi-modal) format and the timm (RGB-only) format.
For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.
| Method | Arch. | Pre-training modalities | Pre-training epochs | Weights (MultiViT) | Weights (timm) | Config |
|---|---|---|---|---|---|---|
| MAE | ViT-B | RGB | 1600 | download | download | See MAE |
| MultiMAE | ViT-B | RGB+D+S | 1600 | download | download | link |
These pre-trained models can then be fine-tuned using this codebase to reach the following performance:
| Method | ImageNet-1K Classif. @1 (RGB) | ADE20K Sem. Seg. mIoU (RGB) | Hypersim Sem. Seg. mIoU (RGB) | Hypersim (D) | Hypersim (RGB+D) | NYUv2 Sem. Seg. mIoU (RGB) | NYUv2 (D) | NYUv2 (RGB+D) | NYUv2 Depth δ1 (RGB) |
|---|---|---|---|---|---|---|---|---|---|
| Sup. (DeiT) | 81.8 | 45.8 | 33.9 | - | - | 50.1 | - | - | 80.7 |
| MAE | 83.3 | 46.2 | 36.5 | - | - | 50.8 | - | - | 85.1 |
| MultiMAE | 83.3 | 46.2 | 37.0 | 38.5 | 47.6 | 52.0 | 41.4 | 56.0 | 86.4 |
Model formats
We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See multimae/multimae.py for the documentation and implementation of MultiMAE / MultiViT.
You can convert between these formats using the provided vit2multimae_converter.py and multimae2vit_converter.py scripts.
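Conceptually, converting from the multi-modal format to the timm format amounts to extracting the shared ViT trunk from the multi-modal wrapper. The sketch below is a hypothetical illustration of that kind of key remapping, assuming an `encoder.` prefix on trunk weights; the actual checkpoint keys and conversion logic live in the provided converter scripts.

```python
import torch

def multivit_to_vit(state_dict, prefix="encoder."):
    """Illustrative sketch: keep only the shared ViT trunk weights and
    strip the wrapper prefix so a plain timm ViT can load them. The
    'encoder.' prefix and adapter key names here are assumptions, not
    the actual keys used in the MultiMAE checkpoints."""
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

# Dummy state dict standing in for a multi-modal checkpoint.
dummy = {
    "encoder.blocks.0.attn.qkv.weight": torch.zeros(3),
    "input_adapters.depth.proj.weight": torch.zeros(3),  # dropped: not trunk
}
vit_sd = multivit_to_vit(dummy)
```

Going the other direction (ViT to MultiViT) would re-add the prefix and leave the modality-specific adapters to be freshly initialized.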
Usage
Set-up
See SETUP.md for set-up instructions.
Pre-training
See PRETRAINING.md for pre-training instructions.
Fine-tuning
See FINETUNING.md for fine-tuning instructions.
Demo & visualizations
For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.
Acknowledgement
This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.
License
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
Citation
If you find this repository helpful, please consider citing our work:
@inproceedings{bachmann2022multimae,
author = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
title = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
booktitle = {European Conference on Computer Vision},
year = {2022},
}