July 3, 2025
BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
Sarthak Kumar Maharana¹, Baoming Zhang¹, Leonid Karlinsky², Rogerio Feris², and Yunhui Guo¹
¹ The University of Texas at Dallas, ² MIT-IBM Watson AI Lab
ICCV 2025
Abstract
Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BATCLIP, a bimodal TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for improving image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.
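To make the prototype idea from the abstract concrete, here is a minimal, dependency-free sketch. It is not the code from this repository, and the actual objective used in the paper may differ. The function names (`pseudo_labels`, `class_prototypes`, `prototype_text_alignment`) and the toy 2-D features are assumptions made purely for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))

def pseudo_labels(image_feats, text_feats):
    """Assign each image the class whose text feature is most similar to it."""
    return [max(range(len(text_feats)), key=lambda c: dot(img, text_feats[c]))
            for img in image_feats]

def class_prototypes(image_feats, labels):
    """Mean image feature per pseudo-labeled class, re-normalized to unit length."""
    groups = {}
    for feat, label in zip(image_feats, labels):
        groups.setdefault(label, []).append(feat)
    return {c: normalize([sum(col) / len(feats) for col in zip(*feats)])
            for c, feats in groups.items()}

def prototype_text_alignment(prototypes, text_feats):
    """Average cosine similarity between each prototype and its class text feature."""
    return sum(dot(p, text_feats[c]) for c, p in prototypes.items()) / len(prototypes)

# Toy 2-D features standing in for CLIP embeddings (illustrative values only).
text_feats = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
image_feats = [normalize(v) for v in ([0.9, 0.1], [0.8, 0.3], [0.2, 0.7])]

labels = pseudo_labels(image_feats, text_feats)
prototypes = class_prototypes(image_feats, labels)
score = prototype_text_alignment(prototypes, text_feats)
```

In this sketch, a higher `score` means the pseudo-labeled image prototypes sit closer to their class text features, which is the kind of image-text association the abstract describes strengthening during adaptation.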
Prerequisites
We provide a conda environment for running the code in this repository:
conda update conda
conda env create -f environment.yml
conda activate tta
Usage
BATCLIP is heavily built upon Mario Doebler's test-time adaptation codebase. Thanks, Mario Doebler!
Features
- Datasets
  - cifar10_c: CIFAR10-C
  - cifar100_c: CIFAR100-C
  - imagenet_c: ImageNet-C
- Models
  - It is also possible to use the models provided by OpenCLIP.
- Settings
  - reset_each_shift: Reset the model state after the adaptation to a domain. We follow this setting.
- Mixed Precision Training
  - Almost all of the aforementioned methods (except SAR and GTTA) can be trained with mixed precision. This greatly speeds up your experiments and requires less memory. However, all benchmark results are generated with fp32.
- Modular Design
  - Adding new methods should be rather simple, thanks to the modular design.
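As a rough illustration of the reset_each_shift setting, the sketch below adapts a toy model online within each domain and restores the source state before moving on to the next. `ToyModel`, `run_reset_each_shift`, and the batch values are hypothetical stand-ins, not code from this repository.

```python
import copy

class ToyModel:
    """Stand-in for an adaptable model: one weight that drifts during adaptation."""
    def __init__(self):
        self.state = {"weight": 0.0}

    def adapt(self, batch):
        # Placeholder for a real test-time update: nudge the weight by the batch mean.
        self.state["weight"] += sum(batch) / len(batch)

def run_reset_each_shift(model, domains):
    """Adapt within each domain, then restore the source state before the next one."""
    source_state = copy.deepcopy(model.state)
    weight_after_domain = {}
    for name, batches in domains.items():
        for batch in batches:
            model.adapt(batch)
        weight_after_domain[name] = model.state["weight"]
        model.state = copy.deepcopy(source_state)  # reset after this domain
    return weight_after_domain

# Two corruption domains with made-up batches (illustrative values only).
domains = {
    "gaussian_noise": [[1.0, 1.0], [2.0, 2.0]],
    "fog": [[0.5, 0.5]],
}
model = ToyModel()
result = run_reset_each_shift(model, domains)
```

Because the state is restored after every domain, the update accumulated on `gaussian_noise` does not carry over into `fog`.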
Get Started
Once you’ve obtained any missing datasets, update the root data directory in conf.py by setting _C.DATA_DIR = "./data". If your individual dataset folders use names other than those defined in the complete_data_dir_path mapping (also in conf.py), simply edit that dictionary to match your directory names.
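The relationship between the root directory and the per-dataset folder names can be pictured as below. This is only a sketch: `DATASET_FOLDERS`, `resolve_data_dir`, and the folder names are assumptions for illustration, so check conf.py for the actual mapping and edit it there.

```python
import os

DATA_DIR = "./data"  # root data directory, mirroring _C.DATA_DIR

# Hypothetical mapping from dataset key to on-disk folder name.
DATASET_FOLDERS = {
    "cifar10_c": "CIFAR-10-C",
    "cifar100_c": "CIFAR-100-C",
    "imagenet_c": "ImageNet-C",
}

def resolve_data_dir(root, dataset_name, mapping=DATASET_FOLDERS):
    """Join the root directory with the folder registered for a dataset key."""
    if dataset_name not in mapping:
        raise KeyError(f"no folder registered for dataset '{dataset_name}'")
    return os.path.join(root, mapping[dataset_name])

imagenet_c_dir = resolve_data_dir(DATA_DIR, "imagenet_c")
```

If your ImageNet-C copy lives in a folder with a different name, changing the corresponding value in the mapping is all that this lookup needs.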
Run Experiments
Example run:
python test_time.py --cfg cfgs/imagenet_c/ours.yaml MODEL.ARCH VIT-B-16 MODEL.WEIGHTS openai MODEL.USE_CLIP True SETTING reset_each_shift
To change other parameters, edit the config files under cfgs/.
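The trailing arguments in the example run read as KEY VALUE pairs that override entries of the YAML config. The sketch below shows one way such pairs could be applied to a nested config. `apply_overrides` and the placeholder defaults are hypothetical, and the repository's own config handling may differ.

```python
def apply_overrides(cfg, opts):
    """Apply a flat [KEY, VALUE, KEY, VALUE, ...] list to a nested dict config."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        # Read the strings "True"/"False" as booleans; keep everything else as text.
        node[leaf] = {"True": True, "False": False}.get(raw, raw)
    return cfg

# Placeholder defaults standing in for values loaded from a YAML file.
cfg = {
    "MODEL": {"ARCH": "placeholder-arch", "WEIGHTS": "placeholder-weights", "USE_CLIP": False},
    "SETTING": "placeholder-setting",
}
opts = "MODEL.ARCH VIT-B-16 MODEL.WEIGHTS openai MODEL.USE_CLIP True SETTING reset_each_shift".split()
cfg = apply_overrides(cfg, opts)
```

Dotted keys such as `MODEL.ARCH` address nested entries, so values passed on the command line take precedence over the file defaults in this sketch.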
TODO
- Key results and viz.
- Framework pending
Citation
@inproceedings{maharana2025batclip,
title={BATCLIP: Bimodal Online Test-Time Adaptation for CLIP},
author={Maharana, Sarthak Kumar and Zhang, Baoming and Karlinsky, Leonid and Feris, Rogerio and Guo, Yunhui},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}