MAE-Lite (IJCV 2025)
March 1, 2025
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
Jin Gao, Shubo Lin, Shaoru Wang*, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
IJCV 2025
A Closer Look at Self-Supervised Lightweight Vision Transformers
Shaoru Wang, Jin Gao*, Zeming Li, Xiaoqin Zhang, Weiming Hu
ICML 2023
News
- 2024.12: Our extended version is accepted by IJCV 2025!
- 2023.5: Code & models are released!
- 2023.4: Our paper is accepted by ICML 2023!
- 2022.5: The initial version of the paper was published on arXiv.
Introduction
MAE-Lite focuses on exploring the pre-training of lightweight Vision Transformers (ViTs). This repo provides the code and models for the studies in the papers above.
- We provide advanced pre-training (based on MAE) and fine-tuning recipes for lightweight ViTs and demonstrate that even a vanilla lightweight ViT (e.g., ViT-Tiny) beats most previous SOTA ConvNets and ViT derivatives with delicately designed architectures. We achieve 79.0% top-1 accuracy on ImageNet with vanilla ViT-Tiny (5.7M parameters).
- We provide code for the transfer evaluation of pre-trained models on several classification tasks (e.g., Oxford 102 Flower, Oxford-IIIT Pet, FGVC Aircraft, CIFAR, etc.) and COCO detection tasks (based on ViTDet). We find that self-supervised pre-trained ViTs perform worse than supervised pre-trained ones on data-insufficient downstream tasks.
- We provide code for the analysis tools used in the paper to examine the layer representations and the attention distance & entropy of ViTs.
- We provide code and models for our proposed knowledge distillation method for the pre-trained lightweight ViTs based on MAE, which shows superior transfer performance on data-insufficient classification tasks and dense prediction tasks.
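As an illustration of what the attention analysis measures, here is a minimal NumPy sketch of per-head mean attention distance and attention entropy over a square patch grid. This is our own illustrative code with hypothetical function names, not the repo's implementation:

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """attn: (heads, N, N) attention probabilities over N = grid_size**2 patch tokens."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)  # (N, 2) patch positions
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)   # (N, N) pairwise distances
    # Expected distance under each query's attention distribution, averaged over queries.
    return (attn * dist).sum(-1).mean(-1)                               # (heads,)

def mean_attention_entropy(attn):
    """Average entropy of each query's attention distribution, per head."""
    p = np.clip(attn, 1e-12, None)
    return -(p * np.log(p)).sum(-1).mean(-1)                            # (heads,)
```

For uniform attention over a 2x2 grid, the mean distance is (2 + √2)/4 patches and the entropy is ln 4; lower-layer heads with small distance and entropy indicate local, concentrated attention.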
Update (2025.02.28)
- We provide benchmarks for more masked image modeling (MIM) pre-training methods (BEiT, BootMAE, MaskFeat) on lightweight ViTs and evaluate their transferability to downstream tasks.
- We provide code and models for our decoupled distillation method during pre-training and transfer it to more dense prediction tasks including detection, tracking, and semantic segmentation, achieving SOTA performance in the lightweight regime on ADE20K segmentation (42.8% mIoU) and LaSOT tracking (66.1% AUC); the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
- We extend our distillation method to hierarchical ViTs (Swin and Hiera), which validates its generalizability and effectiveness following our observation-analysis-solution flow.
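The core idea of feature-space distillation during pre-training can be sketched as follows. This is a generic illustration (MSE between linearly projected student tokens and frozen teacher tokens), not the exact decoupled loss from the paper; all names are ours:

```python
import numpy as np

def feature_distill_loss(student_tokens, teacher_tokens, proj):
    """student_tokens: (N, d_s); teacher_tokens: (N, d_t); proj: (d_s, d_t) learnable projection."""
    aligned = student_tokens @ proj  # map student tokens into the teacher's feature dimension
    return float(np.mean((aligned - teacher_tokens) ** 2))
```

In practice the teacher is a large pre-trained ViT kept frozen, the projection is trained jointly with the student, and this loss is added to (or decoupled from) the MIM reconstruction objective.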
Getting Started
Installation
Set up the conda environment:

```bash
# Create environment
conda create -n mae-lite python=3.7 -y
conda activate mae-lite

# Install requirements
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch -y

# Clone MAE-Lite
git clone https://github.com/wangsr126/mae-lite.git
cd mae-lite

# Install other requirements
pip3 install -r requirements.txt
python3 setup.py build develop --user
```
Data Preparation
Prepare the ImageNet data in <BASE_FOLDER>/data/imagenet/imagenet_train and <BASE_FOLDER>/data/imagenet/imagenet_val.
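Assuming the standard folder-per-class ImageNet format, the expected layout looks roughly like:

```
<BASE_FOLDER>/data/imagenet/
├── imagenet_train/
│   ├── n01440764/
│   │   ├── xxx.JPEG
│   │   └── ...
│   └── ...
└── imagenet_val/
    └── ...
```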
Pre-Training
To pre-train ViT-Tiny with our recommended MAE recipe:
```bash
# batch size 4096 on 8 GPUs
cd projects/mae_lite
ssl_train -b 4096 -d 0-7 -e 400 -f mae_lite_exp.py --amp \
--exp-options exp_name=mae_lite/mae_tiny_400e
```
Fine-Tuning on ImageNet
Please download the pre-trained models, e.g.,
download MAE-Tiny to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar
To fine-tune with the improved recipe:
```bash
# batch size 1024 on 8 GPUs
cd projects/eval_tools
ssl_train -b 1024 -d 0-7 -e 300 -f finetuning_exp.py --amp \
[--ckpt <checkpoint-path>] --exp-options pretrain_exp_name=mae_lite/mae_tiny_400e
```
<checkpoint-path>: if set to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar, it will be loaded as the initialization; if not set, the checkpoint at <BASE_FOLDER>/outputs/mae_lite/mae_tiny_400e/last_epoch_ckpt.pth.tar will be loaded automatically.
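The checkpoint resolution described above amounts to a simple fallback, which can be sketched as follows (a hypothetical helper, not the repo's actual code):

```python
import os

def resolve_ckpt(ckpt_arg, base_folder, exp_name):
    # Prefer an explicitly passed --ckpt path; otherwise fall back to the
    # last checkpoint saved under the experiment's output directory.
    if ckpt_arg:
        return ckpt_arg
    return os.path.join(base_folder, "outputs", exp_name, "last_epoch_ckpt.pth.tar")
```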
Evaluation of fine-tuned models
download MAE-Tiny-FT to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar
```bash
# batch size 1024 on 1 GPU
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_eval
```
You should get "Top1: 77.978" if everything is set up correctly.
download MAE-Tiny-FT-RPE to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar
```bash
# batch size 1024 on 1 GPU
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval
```
You should get "Top1: 79.002" if everything is set up correctly.
download MAE-Tiny-Distill-D²-FT-RPE to <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar
```bash
# batch size 1024 on 1 GPU
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval qv_bias=False
```
You should get "Top1: 79.444" if everything is set up correctly.
Pre-Training with Distillation
Please refer to DISTILL.md.
Transfer to Other Datasets
Please refer to TRANSFER.md.
Transfer to Detection Tasks
Please refer to DETECTION.md.
Transfer to Tracking Tasks
Please refer to TRACKING.md.
Transfer to Semantic Segmentation Tasks
Please refer to SEGMENTATION.md.
Experiments of MoCo-v3
Please refer to MOCOV3.md.
Models Analysis Tools
Please refer to VISUAL.md.
Main Results
| pre-train code | pre-train epochs | fine-tune recipe | fine-tune epochs | top-1 accuracy (%) | ckpt |
|---|---|---|---|---|---|
| - | - | impr. | 300 | 75.8 | link |
| mae_lite | 400 | - | - | - | link |
| | | impr. | 300 | 78.0 | link |
| | | impr.+RPE | 1000 | 79.0 | link |
| mae_lite_distill | 400 | - | - | - | link |
| | | impr. | 300 | 78.4 | link |
| mae_lite_d2_distill | 400 | - | - | - | link |
| | | impr. | 300 | 78.7 | link |
| | | impr.+RPE | 1000 | 79.4 | link |
Citation
Please cite the following papers if this repo helps your research:
@article{wang2023closer,
title={A Closer Look at Self-Supervised Lightweight Vision Transformers},
author={Shaoru Wang and Jin Gao and Zeming Li and Xiaoqin Zhang and Weiming Hu},
journal={arXiv preprint arXiv:2205.14443},
year={2023},
}
@article{gao2025experimental,
title={An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training},
author={Jin Gao and Shubo Lin and Shaoru Wang and Yutong Kou and Zeming Li and Liang Li and Congxuan Zhang and Xiaoqin Zhang and Yizheng Wang and Weiming Hu},
journal={International Journal of Computer Vision},
year={2025},
doi={10.1007/s11263-024-02327-w},
publisher={Springer}
}
Acknowledge
We thank the authors of timm, MAE, and MoCo-v3 for their code implementations.
License
This repo is released under the Apache 2.0 license. Please see the LICENSE file for more information.