Autoregressive Pre-training of Large Vision Encoders
April 23, 2025
This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of visual and multimodal learning:
- AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders [BibTeX] [CVPR 2025 (Highlight)]
  Enrico Fini*, Mustafa Shukor*, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby*
- AIMv1: Scalable Pre-training of Large Autoregressive Image Models [BibTeX] [ICML 2024]
  Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin
*: Equal technical contribution
If you're looking for the original AIM model (AIMv1), please refer to the README here.
Overview of AIMv2
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Some AIMv2 highlights include:
- Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
- Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
- Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% top-1 accuracy on ImageNet-1k using a frozen trunk.

AIMv2 Model Gallery
We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions:
- AIMv2 with 224px
- AIMv2 with 336px
- AIMv2 with 448px
- AIMv2 with Native Resolution
- AIMv2 distilled ViT-Large (recommended for multimodal applications)
- Zero-shot Adapted AIMv2
Installation
Please install PyTorch using the official installation instructions. Afterward, install the AIM packages as:
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
We also offer MLX backend support for research and experimentation on Apple silicon. To enable MLX support, simply run:
pip install mlx
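To confirm the installation, a quick check (importing the same modules used in the examples below):

# Quick import check for the two packages installed above.
from aim.v1.torch.data import val_transforms
from aim.v2.utils import load_pretrained

print("aim-v1 and aim-v2 are available")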
Examples
Using PyTorch
from PIL import Image
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms
img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)
inp = transform(img).unsqueeze(0)  # add a batch dimension
features = model(inp)
Using MLX
from PIL import Image
import mlx.core as mx
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms
img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)
inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())  # convert the torch tensor to an MLX array
features = model(inp)
Using JAX
from PIL import Image
import jax.numpy as jnp
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms
img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)
inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)  # convert the torch tensor to a JAX array
features = model.apply({"params": params}, inp)
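In all three cases, features holds the encoder output for the single input image. As a small follow-up sketch (assuming the encoder returns one token per 14x14 patch, i.e. (336 / 14)^2 = 576 tokens for a 336px input):

# Sketch: inspect the output, assuming patch-level features of shape
# (batch, num_patches, dim).
print(features.shape)  # expected: (1, 576, dim) for a 336px input with patch_size=14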
Pre-trained Checkpoints
The pre-trained models can be accessed via the Hugging Face Hub as:
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
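The returned outputs follow the usual transformers convention. As a small follow-up sketch (assuming the remote code exposes patch-level features via a last_hidden_state field):

# Sketch: assumes the remote AIMv2 model returns a standard transformers output
# object with patch-level features in `last_hidden_state`.
patch_features = outputs.last_hidden_state
print(patch_features.shape)  # (1, num_patches, hidden_dim)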
AIMv2 with 224px
| model_id | #params | IN-1k top-1 (%) | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗link | link |
AIMv2 with 336px
| model_id | #params | IN-1k top-1 (%) | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗link | link |
AIMv2 with 448px
| model_id | #params | IN-1k top-1 (%) | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗link | link |
AIMv2 with Native Resolution
We additionally provide an AIMv2-L checkpoint that is fine-tuned to process a wide range of image resolutions and aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and a 2D sinusoidal positional embedding is added to the linearly projected input patches. This checkpoint supports a number of patches in the range [112, 4096] (see the sketch after the table below).
| model_id | #params | IN-1k top-1 (%) | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗link | link |
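For illustration, here is a minimal sketch (not repository code; it assumes the simple non-overlapping grid patchification described above) of how to check that an image's resolution stays within the supported patch budget:

# Illustrative sketch (not repository code): check that an image's patch count
# falls in the supported range for the native-resolution checkpoint.
PATCH_SIZE = 14
MIN_PATCHES, MAX_PATCHES = 112, 4096

def num_patches(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    # One token per non-overlapping patch_size x patch_size patch.
    return (height // patch_size) * (width // patch_size)

n = num_patches(448, 672)  # e.g. a 2:3 aspect-ratio image
assert MIN_PATCHES <= n <= MAX_PATCHES, f"{n} patches is outside [112, 4096]"
print(n)  # 32 * 48 = 1536 patches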
AIMv2 distilled ViT-Large
We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers strong performance on multimodal understanding benchmarks.
| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MMEp |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L-distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |
| model_id | #params | Res. | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗link | link |
Zero-shot Adapted AIMv2
We provide the AIMv2-L vision and text encoders after LiT (Locked-image Tuning) to enable zero-shot recognition; a usage sketch follows the table below.
| model | #params | zero-shot IN-1k top-1 (%) | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
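As an illustration of how such a vision/text encoder pair is typically used for zero-shot classification, here is a hedged, self-contained sketch; the two encoder calls below are random-projection stand-ins (not the repository API) and should be replaced with the real AIMv2-L encoders loaded from the checkpoint linked above:

import torch
import torch.nn.functional as F

# Placeholder encoders so the sketch runs on its own; replace with the real
# AIMv2-L vision and text encoders from the adapted checkpoint.
dim = 768
image_encoder = lambda images: torch.randn(images.shape[0], dim)
text_encoder = lambda prompts: torch.randn(len(prompts), dim)

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]
images = torch.randn(1, 3, 336, 336)  # placeholder image batch

with torch.no_grad():
    image_emb = F.normalize(image_encoder(images), dim=-1)  # (1, dim)
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (3, dim)

# LiT/CLIP-style zero-shot scoring: cosine similarity between the image
# embedding and each class prompt embedding; the highest-scoring class wins.
scores = image_emb @ text_emb.T
print(class_names[scores.argmax(dim=-1).item()])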
Citation
If you find our work useful, please consider citing us as:
AIMv2 bibtex
@misc{fini2024multimodal,
title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
author={Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis BΓ©thune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
year={2024},
eprint={2411.14402},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
AIMv1 bibtex
@InProceedings{pmlr-v235-el-nouby24a,
title = {Scalable Pre-training of Large Autoregressive Image Models},
author = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {12371--12384},
year = {2024},
}
License
Please check out the repository LICENSE before using the provided code and models.