Transductive zero-shot and few-shot CLIP

March 6, 2025 · View on GitHub

This GitHub repository features code from our CVPR 2024 paper where we tackle zero-shot and few-shot classification using the vision-language model CLIP. Our approach handles groups of unlabeled images together, enhancing accuracy over traditional methods that consider each image separately. We build a new classification framework based on classification of probability features and an optimization technique that mimics the Expectation-Maximization algorithm. On zero-shot tasks with test batches of 75 samples, our approaches EM-Dirichlet and Hard EM-Dirichlet yield near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.

1. Getting started

1.1. Requirements

torch 1.13.1 (or later)
torchvision
tqdm
numpy
pillow
pyyaml
scipy
clip

1.2 Download datasets and splits

For downloading the datasets and splits, we follow the instructions given in the Github repository of TIP-Adapter. We use train/val/test splits from CoOp's Github for all datasets except ImageNet where the validation set is used as test set. The splits for imagenet can be downloaded at https://drive.google.com/drive/fol .

The downloaded datasets should be placed in the folder data/ the following way:

.
├── ...
├── data           
│   ├── food101       
│   ├── eurosat       
│   ├── dtd       
│   ├── oxfordpets       
│   ├── flowers101     
│   ├── caltech101      
│   ├── ucf101       
│   ├── fgvcaircraft                
│   ├── stanfordcars      
│   ├── sun397        
│   └── imagenet               
└── ...

1.3 Extracting and saving the features

For a fixed temperature ( $T=30$ recommended), we extract and save the softmax features defined as

z_n = \text{softmax}(T \cos(f_{\text{im}}(x_n), f_{\text{text}}(t_k) )).

To do so, run bash scripts/extract_softmax_features.sh For instance, for the dataset eurosat, the temperature T=30 and the backbone RN50, the features will be saved under

eurosat
├── saved_features                    
│   ├── test_softmax_RN50_T30.plk
│   ├── val_softmax_RN50_T30.plk           
│   ├── train_softmax_RN50_T30.plk           
└── ...

Alternatively, to reproduce the comparisons in the paper, you can also compute directly the visual embeddings running bash scripts/extract_visual_features.sh.

The process of extracting features might be time-consuming, but once completed, the methods operates quite efficiently.

2. Reproducing the zero-shot results

You can reproduce the results displayed in Table 1 in the paper by using the config/main_config.yaml file. Small variations in the results may be observed due to the randomization of the tasks.

The zero-shot methods are EM-Dirichlet (em_dirichlet), Hard EM-Dirichlet (hard_em_dirichlet), Hard K-means (hard_kmeans), Soft K-means (soft_kmeans), EM-Gaussian (Id cov) (em_gaussian), EM-Gaussian (diagonal con) (em_dirichlet_cov), KL K-means (kl_kmeans).

The methods can be tested on the softmax features by setting use_softmax_features=True or the visual features use_softmax_features=False.

For example, to run the method EM-Dirichlet on Caltech101 on 1000 realistic tranductive zero-shot tasks:

python main.py --opts shots 0 dataset caltech101 batch_size 100 number_tasks 1000 use_softmax_feature True

3. Reproducing the few-shot results

You can reproduce the results displayed in Table 2 in the paper by using the config/main_config.yaml file. Small variations in the results may be observed due to the randomization of the tasks.

The zero-shot methods are EM-Dirichlet (em_dirichlet), Hard EM-Dirichlet (hard_em_dirichlet), $\alpha$ -TIM (alpha_tim), PADDLE (paddle), Laplacian Shot (laplacian_shot), BDSCPN (bdcpsn).

Methods ( $\alpha$ -TIM, PADDLE, Laplacian Shot, BDCPSN) having a hyper-parameter have be previously tuned on the validation set (results stored in results_few_shot/val/).

For example, to run the method EM-Dirichlet on Caltech101 on 1000 realistic tranductive 4-shot tasks:

python main.py --opts shots 4 dataset caltech101 batch_size 100 number_tasks 1000 use_softmax_feature True

Aknowlegments

This repository was inspired by the publicly available code from the paper Realistic evaluation of transductive few-shot learning and TIP-Adapter.