ViTamin: Designing Scalable Vision Models in the Vision-language Era

June 9, 2024

🔥 Officially supported by timm and OpenCLIP. Thanks @rwightman!

One line of code to call ViTamin:

model = timm.create_model('vitamin_xlarge_384')

ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% zero-shot ImageNet accuracy.

ViTamin-L sets a new state of the art across seven open-vocabulary segmentation benchmarks, and also significantly improves the capabilities of large multi-modal models (e.g., LLaVA).

🤗 The HuggingFace collection of ViTamin model cards has been released! Check out the model cards!


Get Started

This repository currently includes code and models for the following tasks:

ViTamin Pre-training: See ./ViTamin/README.md for a quick start, which includes CLIP pre-training / fine-tuning pipelines and zero-shot evaluation pipelines.

Open-vocabulary Detection and Segmentation: See ViTamin for Open-vocab Detection and ViTamin for Open-vocab Segmentation.

Large Multi-Modal Models: See ViTamin for Large Multi-Modal Models.

ViTamin is also available through the Hugging Face model jienengchen/ViTamin-XL-384px:

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Use the GPU when available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the ViTamin-XL CLIP model; its custom model code is fetched from the Hub.
model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

# Load and preprocess the input image.
image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
# Move to the selected device rather than hard-coding .cuda().
pixel_values = pixel_values.to(device, dtype=torch.bfloat16)

# Tokenize the candidate captions.
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

# Run inference without gradients, under autocast on the selected device.
with torch.no_grad(), torch.autocast(device_type=device):
    image_features, text_features, logit_scale = model(pixel_values, text)
    # Scaled image-text similarities, turned into per-caption probabilities.
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs)
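The last step of the snippet above turns image-text similarities into probabilities. The following is a minimal, dependency-free sketch of that computation, assuming the features are L2-normalized so that the dot product is a cosine similarity (the sketch normalizes explicitly). The helper names `l2_normalize` and `zero_shot_probs` and the toy feature vectors are hypothetical and only illustrate the arithmetic.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_feat, text_feats, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N captions."""
    image_feat = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        text_feat = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(image_feat, text_feat))
        logits.append(scale * cosine)
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional features (made up for illustration, not model outputs).
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image_feat, text_feats)
print("Label probs:", probs)
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity yields a sharply peaked distribution over the captions.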

Main Results with CLIP Pre-training on DataComp-1B

We will provide 61 trained VLMs (48 benchmarked + 13 best performing) on Hugging Face for community use. Stay tuned!

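| image encoder | 🤗 HuggingFace | image size | num patches | text encoder depth/width | seen samples (B) | trainable params Image+Text (M) | MACs Image+Text (G) | ImageNet Acc. | avg. 38 datasets | ImageNet dist. shift. | VTAB | retrieval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViTamin-L | Link | 224 | 196 | 12/768 | 12.8 | 333.3+123.7 | 72.6+6.6 | 80.8 | 66.7 | 69.8 | 65.3 | 60.3 |
| ViTamin-L | Link | 256 | 256 | 12/768 | 12.8+0.2 | 333.4+123.7 | 94.8+6.6 | 81.2 | 67.0 | 71.1 | 65.3 | 61.2 |
| ViTamin-L | Link | 336 | 441 | 12/768 | 12.8+0.2 | 333.6+123.7 | 163.4+6.6 | 81.6 | 67.0 | 72.1 | 64.4 | 61.6 |
| ViTamin-L | Link | 384 | 576 | 12/768 | 12.8+0.2 | 333.7+123.7 | 213.4+6.6 | 81.8 | 67.2 | 72.4 | 64.7 | 61.8 |
| ViTamin-L2 | Link | 224 | 196 | 24/1024 | 12.8 | 333.6+354.0 | 72.6+23.3 | 80.9 | 66.4 | 70.6 | 63.4 | 61.5 |
| ViTamin-L2 | Link | 256 | 256 | 24/1024 | 12.8+0.5 | 333.6+354.0 | 94.8+23.3 | 81.5 | 67.4 | 71.9 | 64.1 | 63.1 |
| ViTamin-L2 | Link | 336 | 441 | 24/1024 | 12.8+0.5 | 333.8+354.0 | 163.4+23.3 | 81.8 | 67.8 | 73.0 | 64.5 | 63.6 |
| ViTamin-L2 | Link | 384 | 576 | 24/1024 | 12.8+0.5 | 334.0+354.0 | 213.4+23.3 | 82.1 | 68.1 | 73.4 | 64.8 | 63.7 |
| ViTamin-XL | Link | 256 | 256 | 27/1152 | 12.8+0.5 | 436.1+488.7 | 125.3+33.1 | 82.1 | 67.6 | 72.3 | 65.4 | 62.7 |
| ViTamin-XL | Link | 384 | 576 | 27/1152 | 12.8+0.5 | 436.1+488.7 | 281.9+33.1 | 82.6 | 68.1 | 73.6 | 65.6 | 63.8 |
| ViTamin-XL | Link | 256 | 256 | 27/1152 | 40 | 436.1+488.7 | 125.3+33.1 | 82.3 | 67.5 | 72.8 | 64.0 | 62.1 |
| ViTamin-XL | Link | 336 | 441 | 27/1152 | 40+1 | 436.1+488.7 | 215.9+33.1 | 82.7 | 68.0 | 73.9 | 64.1 | 62.6 |
| ViTamin-XL | Link | 384 | 576 | 27/1152 | 40+1 | 436.1+488.7 | 281.9+33.1 | 82.9 | 68.1 | 74.1 | 64.0 | 62.5 |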
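Two columns of this table can be checked with simple arithmetic. The sketch below is illustrative only: the overall stride of 16 is an assumption inferred from the listed image sizes and patch counts (for example, 224 / 16 = 14 and 14 × 14 = 196), and the helper names `num_patches` and `total` are hypothetical.

```python
def num_patches(image_size, stride=16):
    """Number of image tokens for a square input.

    stride=16 is an assumption inferred from the table, not a documented setting.
    """
    side = image_size // stride
    return side * side


def total(cell):
    """Sum an 'a+b' table cell, e.g. image-tower plus text-tower parameters."""
    return sum(float(part) for part in cell.split("+"))


for size in (224, 256, 336, 384):
    print(size, num_patches(size))

# Combined image+text trainable parameters (M) for ViTamin-XL.
print(round(total("436.1+488.7"), 1))
```

The patch counts match the "num patches" column for all four resolutions, and summing an "Image+Text" cell gives the combined size of both towers.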

Main Results on Downstream tasks

Open-Vocab Detection

| image encoder | detector | OV-COCO (AP50 novel) | OV-LVIS (APr) |
|---|---|---|---|
| ViT-L/14 | Sliding F-ViT | 36.1 | 32.5 |
| ViTamin-L | Sliding F-ViT | 37.5 | 35.6 |

Open-Vocab Segmentation

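| image encoder | segmentor | ADE | Cityscapes | MV | A-150 | A-847 | PC-459 | PC-59 | PAS-21 |
|---|---|---|---|---|---|---|---|---|---|
| ViT-L/14 | Sliding FC-CLIP | 24.6 | 40.7 | 16.5 | 31.8 | 14.3 | 18.3 | 55.1 | 81.5 |
| ViTamin-L | Sliding FC-CLIP | 27.3 | 44.0 | 18.2 | 35.6 | 16.1 | 20.4 | 58.4 | 83.4 |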

Note: The panoptic datasets (ADE, Cityscapes, MV) are evaluated with PQ. The semantic datasets (A-150, A-847, PC-459, PC-59, PAS-21) are evaluated with mIoU.

Large Multi-modal Models

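| image encoder | image size | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-B-CN | SEED | LLaVA-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViTamin-L | 336 | 78.4 | 61.6 | 51.1 | 66.9 | 58.7 | 84.6 | 1421 | 65.4 | 58.4 | 57.7 | 64.5 | 33.6 |
| ViTamin-L | 384 | 78.9 | 61.6 | 55.4 | 67.6 | 59.8 | 85.5 | 1447 | 64.5 | 58.3 | 57.9 | 66.1 | 33.6 |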

Citing ViTamin

@inproceedings{chen2024vitamin,
  title={ViTamin: Designing Scalable Vision Models in the Vision-language Era},
  author={Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}