
July 3, 2025

$\texttt{BATCLIP}$ : Bimodal Online Test-Time Adaptation for CLIP

Sarthak Kumar Maharana¹, Baoming Zhang¹, Leonid Karlinsky², Rogerio Feris², and Yunhui Guo¹
¹The University of Texas at Dallas, ²MIT-IBM Watson AI Lab
ICCV 2025

✍🏻 Paper 🔗 Project

Abstract

Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we find that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose $\texttt{BATCLIP}$, a bimodal $\textbf{online}$ TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders to improve image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.
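The prototype-to-text association described in the abstract can be illustrated with a small sketch. This is not the loss used in the paper or in this repository; it only shows, under simplifying assumptions, how pseudo-labels, per-class image prototypes, and their cosine similarity to the matching text features relate. The function names (`l2_normalize`, `prototype_text_alignment`) and the toy feature values are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def prototype_text_alignment(image_feats, text_feats):
    """Illustrative only: pseudo-label a batch, build class prototypes,
    and measure how well each prototype matches its class text feature."""
    image_feats = l2_normalize(image_feats)
    text_feats = l2_normalize(text_feats)

    # Zero-shot style prediction: cosine similarity to every class text feature.
    logits = image_feats @ text_feats.T              # (batch, num_classes)
    pseudo_labels = logits.argmax(axis=1)

    # One prototype per class that actually appears in the batch.
    present = np.unique(pseudo_labels)
    prototypes = np.stack(
        [image_feats[pseudo_labels == c].mean(axis=0) for c in present]
    )
    prototypes = l2_normalize(prototypes)

    # Cosine similarity between each prototype and the matching text feature.
    sims = np.sum(prototypes * text_feats[present], axis=1)
    return pseudo_labels, prototypes, float(sims.mean())


# Toy features (assumed values): 3 classes, 4-dimensional embeddings.
text_feats = np.eye(3, 4)
image_feats = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.1, 1.0, 0.0, 0.0],
    [0.0, 0.2, 1.0, 0.1],
    [0.0, 0.0, 0.8, 0.1],
])
pseudo_labels, prototypes, alignment = prototype_text_alignment(image_feats, text_feats)
```

In a setup like this, a quantity such as `1 - alignment` could serve as a term to minimize during adaptation; the objective actually used by $\texttt{BATCLIP}$ is defined in the paper and the code.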

Prerequisites

We provide a conda environment for running the code in this repository.

conda update conda
conda env create -f environment.yml
conda activate tta

Usage

$\texttt{BATCLIP}$ builds heavily on Mario Doebler's test-time adaptation codebase. Thanks, Mario!

Features

Datasets
- cifar10_c: CIFAR10-C
- cifar100_c: CIFAR100-C
- imagenet_c: ImageNet-C
Models
- Models provided by OpenCLIP can be used.
Settings
- reset_each_shift: Reset the model state after adapting to each domain. We follow this setting.
Mixed Precision Training
- Almost all supported methods (except SAR and GTTA) can be run with mixed precision. This speeds up experiments considerably and reduces memory use. However, all benchmark results were generated with fp32.
Modular Design
- Adding new methods should be rather simple, thanks to the modular design.
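
As a rough illustration of the reset_each_shift setting, the sketch below adapts a toy model on one domain at a time and restores its initial state after each domain. `ToyModel`, its `adapt` method, and `run_reset_each_shift` are hypothetical stand-ins and are not taken from this repository; only the reset behaviour is the point.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for an adaptable model."""

    def __init__(self):
        self.state = {"scale": 1.0}

    def adapt(self, batch):
        # Pretend that every test batch changes the model a little.
        self.state["scale"] += 0.1 * len(batch)


def run_reset_each_shift(model, domains):
    """Adapt online within a domain, then reset before the next domain."""
    initial_state = copy.deepcopy(model.state)
    start_scales = []
    for name, batches in domains.items():
        # Record the state the model has when it enters this domain.
        start_scales.append((name, model.state["scale"]))
        for batch in batches:
            model.adapt(batch)
        # Restore the source state once the domain is finished.
        model.state = copy.deepcopy(initial_state)
    return start_scales


model = ToyModel()
domains = {"gaussian_noise": [[1, 2], [3, 4]], "fog": [[5, 6]]}
start_scales = run_reset_each_shift(model, domains)
```

Because the state is restored after every domain, each domain starts from the same source model, so adaptation on one corruption does not carry over to the next.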

Get Started

Once you’ve obtained any missing datasets, update the root data directory in conf.py by setting _C.DATA_DIR = "./data". If your individual dataset folders use names other than those defined in the complete_data_dir_path mapping (also in conf.py), simply edit that dictionary to match your directory names.
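
The relationship between the root data directory and the per-dataset folder names can be sketched as follows. This is an assumption-laden illustration rather than the contents of conf.py: the dictionary `DATASET_FOLDERS`, the helper `resolve_dataset_dir`, and the folder names are hypothetical, so check conf.py for the real mapping.

```python
import os

# Mirrors the role of _C.DATA_DIR in conf.py.
DATA_DIR = "./data"

# Hypothetical folder names; edit them to match the directories on disk.
DATASET_FOLDERS = {
    "cifar10_c": "CIFAR-10-C",
    "cifar100_c": "CIFAR-100-C",
    "imagenet_c": "ImageNet-C",
}


def resolve_dataset_dir(data_root_dir, dataset_name):
    """Join the root data directory with the folder for one dataset."""
    try:
        folder = DATASET_FOLDERS[dataset_name]
    except KeyError:
        raise KeyError(
            f"No folder configured for dataset '{dataset_name}'"
        ) from None
    return os.path.join(data_root_dir, folder)


imagenet_c_dir = resolve_dataset_dir(DATA_DIR, "imagenet_c")
```

If a dataset lives in a differently named folder, only the corresponding dictionary value needs to change; the dataset keys stay the same.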

Run Experiments

Example run:

python test_time.py --cfg cfgs/imagenet_c/ours.yaml MODEL.ARCH VIT-B-16 MODEL.WEIGHTS openai MODEL.USE_CLIP True SETTING reset_each_shift

To change other parameters, edit the config files under cfgs/.
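
The trailing `KEY VALUE` pairs in the example command override individual config entries. The sketch below shows one way such pairs could map onto a nested config; it is an assumption about the semantics, not the parser this repository uses, and `apply_overrides` is a hypothetical helper.

```python
import ast


def apply_overrides(config, opts):
    """Apply a flat [KEY, VALUE, KEY, VALUE, ...] list to a nested dict.

    Dotted keys such as MODEL.ARCH address nested entries.
    """
    if len(opts) % 2 != 0:
        raise ValueError("Overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        # Interpret literals such as True or 16; keep everything else as text.
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# The override pairs from the example command above.
opts = [
    "MODEL.ARCH", "VIT-B-16",
    "MODEL.WEIGHTS", "openai",
    "MODEL.USE_CLIP", "True",
    "SETTING", "reset_each_shift",
]
config = apply_overrides({}, opts)
```

Under these assumptions, `MODEL.USE_CLIP True` becomes a boolean, while `VIT-B-16` and `openai` stay strings.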

TODO

- Key results and visualizations
- Framework pending

Citation

@inproceedings{maharana2025batclip,
  title={BATCLIP: Bimodal Online Test-Time Adaptation for CLIP},
  author={Maharana, Sarthak Kumar and Zhang, Baoming and Karlinsky, Leonid and Feris, Rogerio and Guo, Yunhui},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}