CLIP-KD

April 10, 2024

This repository contains the source code for the paper "CLIP-KD: An Empirical Study of CLIP Model Distillation".

Install

pip install -r requirements-training.txt
pip install -r requirements-test.txt

Dataset preparation

Conceptual Captions 3M

OpenCLIP reads a CSV file with two columns: an image path and a text caption. The column names are passed as arguments to main.py.
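As a minimal sketch of that two-column layout, the snippet below reads (image path, caption) pairs by column name. The default keys `filepath` and `title` match the generated CSVs shown in this section; the tab delimiter is an assumption. Upstream open_clip exposes the column names as `--csv-img-key` and `--csv-caption-key` and uses a tab separator by default, but that this repository keeps the same flags is also an assumption. The helper name `read_pairs` and the sample rows are hypothetical.

```python
import csv
import io


def read_pairs(handle, img_key="filepath", caption_key="title", delimiter="\t"):
    """Yield (image_path, caption) tuples from a delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        yield row[img_key], row[caption_key]


# Tiny in-memory stand-in for a generated CSV (captions and paths are made up).
sample = io.StringIO(
    "title\tfilepath\n"
    "a dog on a beach\ttrain/0/0.jpg\n"
    "a red bicycle\ttrain/0/1.jpg\n"
)
pairs = list(read_pairs(sample))
print(pairs[0])
```

Passing the keys as parameters mirrors the idea that the column names are configurable rather than fixed.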

The script src/data/gather_cc.py collects the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs. For brevity, we rename Train_GCC-training to cc3m_train and Validation_GCC-1.1.0-Validation to cc3m_val. Then run the script from our repository:

python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]

Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.

The generated cc3m_train.csv is:

title   filepath
XXXXXX  train/X/X.jpg
...     ...

The generated cc3m_val.csv is:

title   filepath
XXXXXX  val/X/X.jpg
...     ...
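Because some URLs are no longer reachable, the downloaded set is smaller than the URL list, so a quick consistency check on the generated CSV can be useful. The sketch below counts rows and lists entries whose image file is missing. It assumes a tab-delimited file and that `filepath` is relative to the image directory passed to the gather script; the function name `check_csv` is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def check_csv(csv_path, image_root, img_key="filepath", delimiter="\t"):
    """Return (row_count, missing_paths) for a generated CSV.

    Assumes each value in the image column is a path relative to image_root.
    """
    rows = 0
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            rows += 1
            if not (Path(image_root) / row[img_key]).is_file():
                missing.append(row[img_key])
    return rows, missing
```

For example, `rows, missing = check_csv("cc3m_val.csv", "path/to/cc3m/images/")` would report how many listed validation images are absent on disk.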

Conceptual 12M

The script src/data/gather_cc12m.py will collect the Conceptual 12M images. First, download the Conceptual 12M URLs and then run the script from our repository:

python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]

The generated cc12m.csv is:

title   filepath
XXXXXX  train/X/X.jpg
...     ...

Our downloaded CC12M training set contains 9.97M images.
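The experiments in this repository use CC3M+12M, but how the two CSV files are combined is not stated here. One possible approach, purely as an assumption, is to concatenate them into a single file while prefixing each `filepath` with its own image directory, since both files use paths of the form train/X/X.jpg. The function name `merge_csvs` and the tab delimiter are hypothetical choices for this sketch.

```python
import csv


def merge_csvs(sources, out_path, delimiter="\t"):
    """Concatenate (csv_path, path_prefix) sources into one title/filepath CSV.

    Each filepath is prefixed with its source's image directory so that
    identical relative paths from different datasets do not collide.
    Returns the number of data rows written.
    """
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=delimiter)
        writer.writerow(["title", "filepath"])
        for csv_path, prefix in sources:
            with open(csv_path, newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle, delimiter=delimiter):
                    joined = prefix.rstrip("/") + "/" + row["filepath"]
                    writer.writerow([row["title"], joined])
                    total += 1
    return total
```

A call such as `merge_csvs([("cc3m_train.csv", "path/to/cc3m/images"), ("cc12m.csv", "path/to/cc12m/images")], "cc3m_12m.csv")` would produce one combined file; the output name is illustrative only.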

Distill CLIP models

Distillation with different strategies

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

Role     Network   Method    ImageNet Acc  Train script
Teacher  ViT-B/16  -         36.99         sh
Student  ViT-T/16  Baseline  30.55         sh
Student  ViT-T/16  +CRD      31.94         sh
Student  ViT-T/16  +FD       34.23         sh
Student  ViT-T/16  +MFD      34.09         sh
Student  ViT-T/16  +GD       31.54         sh
Student  ViT-T/16  +ICL      33.11         sh
Student  ViT-T/16  +AFD      31.42         sh
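To make the comparison between strategies easier to read, the short sketch below computes each method's gain over the ViT-T/16 baseline using only the accuracies listed in the table above. The variable names are illustrative.

```python
# ImageNet accuracy (%) of the ViT-T/16 student, copied from the table above.
baseline = 30.55
with_method = {
    "+CRD": 31.94,
    "+FD": 34.23,
    "+MFD": 34.09,
    "+GD": 31.54,
    "+ICL": 33.11,
    "+AFD": 31.42,
}

# Gain over the baseline, rounded to two decimals to avoid float noise.
gains = {name: round(acc - baseline, 2) for name, acc in with_method.items()}

# Print the methods from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:5s} {gain:+.2f}")
```

With these numbers, +FD gives the largest single-method gain (+3.68 points) and +AFD the smallest (+0.87 points).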

Supervised by ViT-B/16 as the teacher

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

Role     Network          Method    ImageNet Acc  Train script  Download
Teacher  ViT-B/16         -         36.99         sh            model | log
Student  ViT-T/16         Baseline  30.55         sh            model | log
Student  ViT-T/16         CLIP-KD   34.90         sh            model | log
Student  MobileViT-S      Baseline  32.60         sh            model | log
Student  MobileViT-S      CLIP-KD   35.96         sh            model | log
Student  Swin-T           Baseline  36.38         sh            model | log
Student  Swin-T           CLIP-KD   40.18         sh            model | log
Student  MobileNetV3      Baseline  25.11         sh            model | log
Student  MobileNetV3      CLIP-KD   26.95         sh            model | log
Student  EfficientNet-B0  Baseline  32.55         sh            model | log
Student  EfficientNet-B0  CLIP-KD   35.44         sh            model | log
Student  ResNet-18        Baseline  28.55         sh            model | log
Student  ResNet-18        CLIP-KD   31.36         sh            model | log

Supervised by ResNet-101 as the teacher

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

Role     Network          Method    ImageNet Acc  Train script  Download
Teacher  ResNet-101       -         36.76         sh            model | log
Student  MobileViT-S      Baseline  32.60         sh            model | log
Student  MobileViT-S      CLIP-KD   34.97         sh            model | log
Student  Swin-T           Baseline  36.38         sh            model | log
Student  Swin-T           CLIP-KD   39.51         sh            model | log
Student  MobileNetV3      Baseline  25.11         sh            model | log
Student  MobileNetV3      CLIP-KD   26.15         sh            model | log
Student  EfficientNet-B0  Baseline  32.55         sh            model | log
Student  EfficientNet-B0  CLIP-KD   34.64         sh            model | log
Student  ResNet-18        Baseline  28.55         sh            model | log
Student  ResNet-18        CLIP-KD   30.88         sh            model | log
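The five students that appear under both teachers can be compared directly. The sketch below takes the CLIP-KD accuracies from the ViT-B/16 and ResNet-101 tables above and computes the per-student difference; the dictionary layout is only an illustration.

```python
# CLIP-KD student accuracy (%) on ImageNet, copied from the two tables above.
clip_kd_acc = {
    "MobileViT-S": {"ViT-B/16": 35.96, "ResNet-101": 34.97},
    "Swin-T": {"ViT-B/16": 40.18, "ResNet-101": 39.51},
    "MobileNetV3": {"ViT-B/16": 26.95, "ResNet-101": 26.15},
    "EfficientNet-B0": {"ViT-B/16": 35.44, "ResNet-101": 34.64},
    "ResNet-18": {"ViT-B/16": 31.36, "ResNet-101": 30.88},
}

# Positive values mean the ViT-B/16 teacher yields the stronger student.
teacher_gap = {
    student: round(acc["ViT-B/16"] - acc["ResNet-101"], 2)
    for student, acc in clip_kd_acc.items()
}

for student, gap in teacher_gap.items():
    print(f"{student:16s} {gap:+.2f}")
```

In these results the ViT-B/16 teacher gives the higher student accuracy in all five cases, with gaps between 0.48 and 0.99 points.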

Transferred from Laion-400M

The teacher is pretrained on Laion-400M. Students are distilled on CC3M+12M.

Role     Network   Method    ImageNet Acc  Train script  Download
Teacher  ViT-L/14  -         72.8          -             model
Student  ViT-B/16  Baseline  37.0          sh            model | log
Student  ViT-B/16  CLIP-KD   57.5          sh            model | log
Student  ViT-T/16  Baseline  30.6          sh            model | log
Student  ViT-T/16  CLIP-KD   40.9          sh            model | log

Role     Network    Method    ImageNet Acc  Train script  Download
Teacher  ViT-B/16   -         67.1          -             model
Student  ViT-T/16   Baseline  30.6          sh            model | log
Student  ViT-T/16   CLIP-KD   42.6          sh            model | log
Student  ResNet-50  Baseline  35.3          sh            model | log
Student  ResNet-50  CLIP-KD   55.4          sh            model | log

Evaluate pretrained models on more downstream tasks

Evaluate a pretrained model on MSCOCO and Flickr cross-modal retrieval and on classification with ImageNet variants (ImageNet-V2, ImageNet-Rendition, and ImageNet-Sketch). Please refer to eval_coco.sh and eval_flickr.sh.
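Retrieval benchmarks such as MSCOCO and Flickr are commonly scored with Recall@K. Whether eval_coco.sh and eval_flickr.sh report exactly this metric is an assumption; the sketch below only illustrates the standard definition on a toy similarity matrix, simplified to one matching caption per image (index i matches index i). The function name `recall_at_k` and the matrix values are made up for illustration.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Indices of the k highest-scoring candidates for this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy image-to-text similarity matrix for three pairs (values are made up).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

Text-to-image retrieval can be scored with the same function by transposing the matrix so that captions become the queries.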

Acknowledgement

Our codebase is built on open_clip, an open-source codebase for running CLIP models.

We hope our paper and repository are helpful to you. The paper can be cited as:

@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}