CLIP-KD

April 10, 2024

This repository contains the source code for the paper "CLIP-KD: An Empirical Study of CLIP Model Distillation".

Install

pip install -r requirements-training.txt
pip install -r requirements-test.txt

Dataset preparation

Conceptual Captions 3M

OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as an argument to main.py.

The script src/data/gather_cc.py collects the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs. For brevity, we rename Train_GCC-training as cc3m_train and Validation_GCC-1.1.0-Validation as cc3m_val. Then run the script from our repository:

python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]

Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.

The generated cc3m_train.csv is:

title   filepath
XXXXXX  train/X/X.jpg
...     ...

The generated cc3m_val.csv is:

title   filepath
XXXXXX  val/X/X.jpg
...     ...
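Before training, it can be useful to confirm that a generated file matches the layout above. The following is a minimal sketch, not code from this repository: it assumes the file is tab-separated with a header row naming the title and filepath columns, and the helper name load_caption_csv and the sample rows are hypothetical.

```python
import csv
import io


def load_caption_csv(handle, img_key="filepath", caption_key="title", sep="\t"):
    """Return (image path, caption) pairs from a caption file.

    Assumes a header row naming both columns and a tab separator.
    """
    reader = csv.DictReader(handle, delimiter=sep)
    missing = {img_key, caption_key} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [(row[img_key], row[caption_key]) for row in reader]


# Hypothetical two-row sample in the layout of cc3m_train.csv.
sample = io.StringIO(
    "title\tfilepath\n"
    "a dog on a beach\ttrain/0/0.jpg\n"
    "a red bicycle\ttrain/0/1.jpg\n"
)
pairs = load_caption_csv(sample)
print(pairs[0])  # ('train/0/0.jpg', 'a dog on a beach')
```

For a real file, pass a handle opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample.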

Conceptual 12M

The script src/data/gather_cc12m.py will collect the Conceptual 12M images. First, download the Conceptual 12M URLs and then run the script from our repository:

python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]

The generated cc12m.csv is:

title   filepath
XXXXXX  train/X/X.jpg
...     ...

Our downloaded CC12M training set contains 9.97M images.

Distill CLIP models

Distillation with different strategies

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

Role	Network	Method	ImageNet Acc	Train script
Teacher	ViT-B/16	-	36.99	sh
Student	ViT-T/16	Baseline	30.55	sh
Student	ViT-T/16	+CRD	31.94	sh
Student	ViT-T/16	+FD	34.23	sh
Student	ViT-T/16	+MFD	34.09	sh
Student	ViT-T/16	+GD	31.54	sh
Student	ViT-T/16	+ICL	33.11	sh
Student	ViT-T/16	+AFD	31.42	sh
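As a rough illustration of one row in the table, the sketch below assumes that +FD stands for feature distillation, which the paper describes as aligning student embeddings with teacher embeddings through a mean squared error loss. This is a simplified pure-Python stand-in rather than the training code in this repository; the function name and the toy embeddings are hypothetical, and the student and teacher embeddings are assumed to share one dimension.

```python
def feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher embeddings.

    Both arguments are lists of equal-length embedding vectors,
    one vector per sample (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("embeddings must share one dimension")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy embeddings (made-up numbers): only one coordinate differs, by 1.0.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [0.0, 0.0]]
fd_loss = feature_distillation_loss(student, teacher)
print(fd_loss)  # 0.25
```

In a CLIP setting such a term would be computed for both the image and the text embeddings and added to the usual contrastive loss; the exact weighting is set in the training scripts.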

Supervised by ViT-B/16 as the teacher

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

Role	Network	Method	ImageNet Acc	Train script	Download
Teacher	ViT-B/16	-	36.99	sh	model | log
Student	ViT-T/16	Baseline	30.55	sh	model | log
Student	ViT-T/16	CLIP-KD	34.90	sh	model | log
Student	MobileViT-S	Baseline	32.60	sh	model | log
Student	MobileViT-S	CLIP-KD	35.96	sh	model | log
Student	Swin-T	Baseline	36.38	sh	model | log
Student	Swin-T	CLIP-KD	40.18	sh	model | log
Student	MobileNetV3	Baseline	25.11	sh	model | log
Student	MobileNetV3	CLIP-KD	26.95	sh	model | log
Student	EfficientNet-B0	Baseline	32.55	sh	model | log
Student	EfficientNet-B0	CLIP-KD	35.44	sh	model | log
Student	ResNet-18	Baseline	28.55	sh	model | log
Student	ResNet-18	CLIP-KD	31.36	sh	model | log
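To read the table as per-student improvements, the short sketch below subtracts each baseline from its CLIP-KD counterpart. The numbers are copied from the table above; the variable names are illustrative only.

```python
# ImageNet accuracy (%) from the ViT-B/16 teacher table:
# (baseline, CLIP-KD) per student network.
results = {
    "ViT-T/16": (30.55, 34.90),
    "MobileViT-S": (32.60, 35.96),
    "Swin-T": (36.38, 40.18),
    "MobileNetV3": (25.11, 26.95),
    "EfficientNet-B0": (32.55, 35.44),
    "ResNet-18": (28.55, 31.36),
}

# Absolute gain of CLIP-KD over the baseline, in accuracy points.
gains = {name: round(kd - base, 2) for name, (base, kd) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The gains range from +1.84 (MobileNetV3) to +4.35 (ViT-T/16) accuracy points.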

Supervised by ResNet-101 as the teacher

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

Role	Network	Method	ImageNet Acc	Train script	Download
Teacher	ResNet-101	-	36.76	sh	model | log
Student	MobileViT-S	Baseline	32.60	sh	model | log
Student	MobileViT-S	CLIP-KD	34.97	sh	model | log
Student	Swin-T	Baseline	36.38	sh	model | log
Student	Swin-T	CLIP-KD	39.51	sh	model | log
Student	MobileNetV3	Baseline	25.11	sh	model | log
Student	MobileNetV3	CLIP-KD	26.15	sh	model | log
Student	EfficientNet-B0	Baseline	32.55	sh	model | log
Student	EfficientNet-B0	CLIP-KD	34.64	sh	model | log
Student	ResNet-18	Baseline	28.55	sh	model | log
Student	ResNet-18	CLIP-KD	30.88	sh	model | log

Transferred from Laion-400M

The teacher is pretrained on Laion-400M. Students are distilled on CC3M+12M.

Role	Network	Method	ImageNet Acc	Train script	Download
Teacher	ViT-L/14	-	72.8	-	model
Student	ViT-B/16	Baseline	37.0	sh	model | log
Student	ViT-B/16	CLIP-KD	57.5	sh	model | log
Student	ViT-T/16	Baseline	30.6	sh	model | log
Student	ViT-T/16	CLIP-KD	40.9	sh	model | log

Role	Network	Method	ImageNet Acc	Train script	Download
Teacher	ViT-B/16	-	67.1	-	model
Student	ViT-T/16	Baseline	30.6	sh	model | log
Student	ViT-T/16	CLIP-KD	42.6	sh	model | log
Student	ResNet-50	Baseline	35.3	sh	model | log
Student	ResNet-50	CLIP-KD	55.4	sh	model | log

Evaluate pretrained models on more downstream tasks

Evaluate a pretrained model on MSCOCO and Flickr cross-modal retrieval and on classification with ImageNet variants (ImageNet-V2, ImageNet-Rendition and ImageNet-Sketch). Please refer to eval_coco.sh and eval_flickr.sh.
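Retrieval benchmarks of this kind are commonly reported as Recall@K. The sketch below shows how that metric can be computed from an image-to-text similarity matrix. It is an illustration rather than the logic of the evaluation scripts: the function name is hypothetical, the scores are made up, and it assumes exactly one correct caption per image (MSCOCO and Flickr provide several captions per image, so real evaluation code has to handle multiple matches).

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item is among the top-k scores.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 image-to-text similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # image 0 ranks caption 0 first
    [0.2, 0.4, 0.8],  # image 1 ranks caption 1 second
    [0.5, 0.6, 0.1],  # image 2 ranks caption 2 last
]
print(recall_at_k(sim, 1))  # 1 of 3 queries hit
print(recall_at_k(sim, 2))  # 2 of 3 queries hit
```

Text-to-image retrieval uses the same function on the transposed matrix.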

Acknowledgement

Our codebase is built on open_clip, an open-source codebase for running CLIP models.

We hope our paper and repo are helpful to you!

@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}