MAE-Lite (IJCV 2025)
March 1, 2025
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
Jin Gao, Shubo Lin, Shaoru Wang*, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
IJCV 2025
A Closer Look at Self-Supervised Lightweight Vision Transformers
Shaoru Wang, Jin Gao*, Zeming Li, Xiaoqin Zhang, Weiming Hu
ICML 2023
News
- 2024.12: Our extended version is accepted by IJCV 2025!
- 2023.5: Code & models are released!
- 2023.4: Our paper is accepted by ICML 2023!
- 2022.5: The initial version of the paper was published on arXiv.
Introduction
MAE-Lite focuses on exploring the pre-training of lightweight Vision Transformers (ViTs). This repo provides the code and models for the studies in the papers above.
- We provide advanced pre-training (based on MAE) and fine-tuning recipes for lightweight ViTs and demonstrate that even a vanilla lightweight ViT (e.g., ViT-Tiny) beats most previous SOTA ConvNets and ViT derivatives with delicately designed architectures. We achieve 79.0% top-1 accuracy on ImageNet with vanilla ViT-Tiny (5.7M parameters).
- We provide code for the transfer evaluation of pre-trained models on several classification tasks (e.g., Oxford 102 Flower, Oxford-IIIT Pet, FGVC Aircraft, CIFAR, etc.) and COCO detection tasks (based on ViTDet). We find that self-supervised pre-trained ViTs perform worse than supervised pre-trained ones on data-insufficient downstream tasks.
- We provide code for the analysis tools used in the paper to examine the layer representations and the attention distance & entropy of ViTs.
- We provide code and models for our proposed knowledge distillation method for the pre-trained lightweight ViTs based on MAE, which shows superior transfer performance on data-insufficient classification tasks and dense prediction tasks.
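As an illustration of what the attention analysis measures, here is a minimal NumPy sketch of per-head mean attention distance and attention entropy over a square patch grid. This is our own illustrative code with hypothetical function names, not the repo's implementation:

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """attn: (heads, N, N) attention probabilities over N = grid_size**2 patch tokens."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)  # (N, 2) patch positions
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)   # (N, N) pairwise distances
    # Expected distance under each query's attention distribution, averaged over queries.
    return (attn * dist).sum(-1).mean(-1)                               # (heads,)

def mean_attention_entropy(attn):
    """Average entropy of each query's attention distribution, per head."""
    p = np.clip(attn, 1e-12, None)
    return -(p * np.log(p)).sum(-1).mean(-1)                            # (heads,)
```

For uniform attention over a 2x2 grid, the mean distance is (2 + √2)/4 patches and the entropy is ln 4; lower-layer heads with small distance and entropy indicate local, concentrated attention.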
Update (2025.02.28)
- We provide benchmarks for more masked image modeling (MIM) pre-training methods (BEiT, BootMAE, MaskFeat) on lightweight ViTs and evaluate their transferability to downstream tasks.
- We provide code and models for our decoupled distillation method during pre-training and transfer it to more dense prediction tasks including detection, tracking, and semantic segmentation, achieving SOTA performance in the lightweight regime on ADE20K segmentation (42.8% mIoU) and LaSOT tracking (66.1% AUC); the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
- We extend our distillation method to hierarchical ViTs (Swin and Hiera), which validates its generalizability and effectiveness following our observation-analysis-solution flow.
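The core idea of feature-space distillation during pre-training can be sketched as follows. This is a generic illustration (MSE between linearly projected student tokens and frozen teacher tokens), not the exact decoupled loss from the paper; all names are ours:

```python
import numpy as np

def feature_distill_loss(student_tokens, teacher_tokens, proj):
    """student_tokens: (N, d_s); teacher_tokens: (N, d_t); proj: (d_s, d_t) learnable projection."""
    aligned = student_tokens @ proj  # map student tokens into the teacher's feature dimension
    return float(np.mean((aligned - teacher_tokens) ** 2))
```

In practice the teacher is a large pre-trained ViT kept frozen, the projection is trained jointly with the student, and this loss is added to (or decoupled from) the MIM reconstruction objective.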
Getting Started
Installation
Set up the conda environment:

```bash
# Create environment
conda create -n mae-lite python=3.7 -y
conda activate mae-lite

# Install requirements
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch -y

# Clone MAE-Lite
git clone https://github.com/wangsr126/mae-lite.git
cd mae-lite

# Install other requirements
pip3 install -r requirements.txt
python3 setup.py build develop --user
```
Data Preparation
Prepare the ImageNet data in <BASE_FOLDER>/data/imagenet/imagenet_train and <BASE_FOLDER>/data/imagenet/imagenet_val.
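Assuming the standard folder-per-class ImageNet format, the expected layout looks roughly like:

```
<BASE_FOLDER>/data/imagenet/
├── imagenet_train/
│   ├── n01440764/
│   │   ├── xxx.JPEG
│   │   └── ...
│   └── ...
└── imagenet_val/
    └── ...
```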
Pre-Training
To pre-train ViT-Tiny with our recommended MAE recipe:
```bash
# batch size 4096 on 8 GPUs
cd projects/mae_lite
ssl_train -b 4096 -d 0-7 -e 400 -f mae_lite_exp.py --amp \
--exp-options exp_name=mae_lite/mae_tiny_400e
```
Fine-Tuning on ImageNet
Please download the pre-trained models, e.g.,
download MAE-Tiny to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar
To fine-tune with the improved recipe:
```bash
# batch size 1024 on 8 GPUs
cd projects/eval_tools
ssl_train -b 1024 -d 0-7 -e 300 -f finetuning_exp.py --amp \
[--ckpt <checkpoint-path>] --exp-options pretrain_exp_name=mae_lite/mae_tiny_400e
```
<checkpoint-path>: if set to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar, it will be loaded as the initialization; if not set, the checkpoint at <BASE_FOLDER>/outputs/mae_lite/mae_tiny_400e/last_epoch_ckpt.pth.tar will be loaded automatically.
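The checkpoint resolution described above amounts to a simple fallback, which can be sketched as follows (a hypothetical helper, not the repo's actual code):

```python
import os

def resolve_ckpt(ckpt_arg, base_folder, exp_name):
    # Prefer an explicitly passed --ckpt path; otherwise fall back to the
    # last checkpoint saved under the experiment's output directory.
    if ckpt_arg:
        return ckpt_arg
    return os.path.join(base_folder, "outputs", exp_name, "last_epoch_ckpt.pth.tar")
```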
Evaluation of fine-tuned models
download MAE-Tiny-FT to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar
```bash
# batch size 1024 on 1 GPU
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_eval
```
You should get "Top1: 77.978" if everything is set up correctly.
download MAE-Tiny-FT-RPE to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar
```bash
# batch size 1024 on 1 GPU
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval
```
You should get "Top1: 79.002" if everything is set up correctly.
download MAE-Tiny-Distill-D²-FT-RPE to <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar
```bash
# batch size 1024 on 1 GPU
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval qv_bias=False
```
You should get "Top1: 79.444" if everything is set up correctly.
Pre-Training with Distillation
Please refer to DISTILL.md.
Transfer to Other Datasets
Please refer to TRANSFER.md.
Transfer to Detection Tasks
Please refer to DETECTION.md.
Transfer to Tracking Tasks
Please refer to TRACKING.md.
Transfer to Semantic Segmentation Tasks
Please refer to SEGMENTATION.md.
Experiments of MoCo-v3
Please refer to MOCOV3.md.
Models Analysis Tools
Please refer to VISUAL.md.
Main Results
| pre-train code | pre-train epochs | fine-tune recipe | fine-tune epochs | top-1 accuracy (%) | ckpt |
|---|---|---|---|---|---|
| - | - | impr. | 300 | 75.8 | link |
| mae_lite | 400 | - | - | - | link |
| | | impr. | 300 | 78.0 | link |
| | | impr.+RPE | 1000 | 79.0 | link |
| mae_lite_distill | 400 | - | - | - | link |
| | | impr. | 300 | 78.4 | link |
| mae_lite_d2_distill | 400 | - | - | - | link |
| | | impr. | 300 | 78.7 | link |
| | | impr.+RPE | 1000 | 79.4 | link |
Citation
Please cite the following papers if this repo helps your research:
@article{wang2023closer,
title={A Closer Look at Self-Supervised Lightweight Vision Transformers},
author={Shaoru Wang and Jin Gao and Zeming Li and Xiaoqin Zhang and Weiming Hu},
journal={arXiv preprint arXiv:2205.14443},
year={2023},
}
@article{gao2025experimental,
title={An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training},
author={Jin Gao and Shubo Lin and Shaoru Wang and Yutong Kou and Zeming Li and Liang Li and Congxuan Zhang and Xiaoqin Zhang and Yizheng Wang and Weiming Hu},
journal={International Journal of Computer Vision},
year={2025},
doi={10.1007/s11263-024-02327-w},
publisher={Springer}
}
Acknowledge
We thank the authors of timm, MAE, and MoCo-v3 for their code implementations.
License
This repo is released under the Apache 2.0 license. Please see the LICENSE file for more information.