FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation (CVPR 2026)
March 6, 2026 · View on GitHub
This repository contains the official implementation of the CVPR 2026 paper "FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation".
🚀 Introduction
FOZO proposes a novel backpropagation-free paradigm for Test-Time Adaptation (TTA).
Traditional TTA methods typically rely on backpropagation to update model parameters, which is hard to support on edge devices or quantized models. FOZO instead optimizes a small number of visual prompts inserted into the model via zeroth-order optimization. To handle the instability of TTA data streams, we introduce a dynamic decay perturbation mechanism, combined with an unsupervised loss that couples deep and shallow feature-statistics alignment with prediction-entropy minimization.
Key Highlights:
- Pure Forward-Only Inference: Eliminates gradient computation and intermediate-activation storage entirely, resulting in very low memory overhead.
- Dynamic Perturbation Strategy: Automatically adjusts the zeroth-order perturbation scale and learning rate based on loss fluctuations.
- Strong Robustness: Achieves SOTA performance on ImageNet-C (5K), ImageNet-R, and ImageNet-Sketch.
- Quantization-Friendly: Natively supports INT8 quantized models (e.g., PTQ4ViT), sidestepping the difficulty of updating quantized weights.
- Efficient and Practical: Completes each adaptation step with only two forward passes, making it suitable for edge deployment.
Application Scenarios
FOZO is particularly suitable for the following scenarios:
- Edge Device Deployment: Test-time adaptation on devices with limited computational resources
- Quantized Models: Adaptation for low-precision models (INT8/INT4)
- Real-time Applications: Online learning scenarios requiring fast response
- Cross-Domain Generalization: Rapid adaptation of models to new data domains
- Privacy Protection: No need to store intermediate activations, reducing privacy leakage risks
Core Algorithm
The core idea of FOZO is to estimate gradients with a zeroth-order method, Simultaneous Perturbation Stochastic Approximation (SPSA), and use the estimates to update the learnable visual prompt parameters. The algorithm flow is as follows (a minimal code sketch follows the list):
- Initialization: Insert a small number of learnable prompts into the input layer of Vision Transformer
- Zeroth-Order Gradient Estimation: Estimate gradients through two forward passes (positive perturbation and negative perturbation)
- Dynamic Adjustment: Dynamically adjust perturbation scale and learning rate based on loss changes
- Parameter Update: Update prompt parameters using the estimated gradient
- Feature Alignment: Optimize the objective function through deep and shallow feature statistics alignment and entropy minimization
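Concretely, for prompts p, a random direction z, and perturbation scale ε, the two-sided SPSA estimate is ĝ = (L(p + εz) − L(p − εz)) / (2ε) · z. Below is a minimal PyTorch sketch of one FOZO-style step based on the description above; all names (`forward_fn`, `src_mean`, `decay`, ...) are illustrative, not the repository's actual API:

```python
import torch

def unsupervised_loss(logits, feats, src_mean, src_var):
    """Sketch of the objective: prediction-entropy minimization plus
    alignment of batch feature statistics to source-domain statistics."""
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    align = (feats.mean(dim=0) - src_mean).abs().mean() \
          + (feats.var(dim=0) - src_var).abs().mean()
    return entropy + align

@torch.no_grad()  # forward-only: no graph is built, no activations are kept
def fozo_step(prompts, forward_fn, lr, eps, prev_loss=None, decay=0.99):
    """One SPSA update: exactly two forward passes, no backpropagation."""
    z = torch.randn_like(prompts)             # random perturbation direction
    loss_pos = forward_fn(prompts + eps * z)  # forward pass 1
    loss_neg = forward_fn(prompts - eps * z)  # forward pass 2
    grad_est = (loss_pos - loss_neg) / (2 * eps) * z  # SPSA gradient estimate
    prompts = prompts - lr * grad_est
    loss = 0.5 * (loss_pos + loss_neg)
    # Dynamic decay (sketch): shrink eps and lr when the loss stops
    # improving, stabilizing updates on a non-stationary test stream.
    if prev_loss is not None and loss >= prev_loss:
        eps, lr = eps * decay, lr * decay
    return prompts, lr, eps, loss
```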
🛠️ Environment Setup
We recommend using Python 3.9+ and PyTorch 2.0+ environment.
# Create and activate conda environment
conda env create -f environment.yml
conda activate fozo
📊 Data Preparation
Prepare the datasets in the structure below, then pass their paths to main.py via command-line arguments (e.g., --data_corruption):
ImageNet (Original Validation Set)
Used for source domain statistics calculation and baseline testing:
# Download ImageNet validation set (50,000 images)
# Get from https://www.image-net.org/download.php
# Extract to the following directory structure:
ILSVRC2012_img_val/
└── val/
├── n01440764/
├── n01443537/
└── ...
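You can sanity-check this layout with torchvision's ImageFolder (a quick illustrative check, not part of this repo):

```python
from torchvision import datasets

# ImageFolder expects exactly the one-subfolder-per-class layout above.
val = datasets.ImageFolder('/path/to/ILSVRC2012_img_val/val')
print(f'{len(val)} images across {len(val.classes)} classes')  # expect 50000 / 1000
```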
ImageNet-C
Contains 15 types of image corruptions (noise, blur, weather, etc.), each with 5 severity levels:
- Step 1: Download from ImageNet-C: zenodo link
- Step 2: Extract and organize as follows:
imagenet-c/
├── gaussian_noise/
│ ├── 1/
│ ├── 2/
│ ├── 3/
│ ├── 4/
│ └── 5/
├── shot_noise/
├── impulse_noise/
├── defocus_blur/
├── glass_blur/
├── motion_blur/
├── zoom_blur/
├── snow/
├── frost/
├── fog/
├── brightness/
├── contrast/
├── elastic_transform/
├── pixelate/
└── jpeg_compression/
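Evaluation code typically walks this layout as root/&lt;corruption&gt;/&lt;severity&gt;; a small illustrative check (not the repo's loader):

```python
import os

# The 15 corruption folders listed above, each holding severities 1-5.
CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise",
    "defocus_blur", "glass_blur", "motion_blur", "zoom_blur",
    "snow", "frost", "fog", "brightness",
    "contrast", "elastic_transform", "pixelate", "jpeg_compression",
]

root = "/path/to/imagenet-c"
for corruption in CORRUPTIONS:
    for severity in range(1, 6):
        path = os.path.join(root, corruption, str(severity))
        assert os.path.isdir(path), f"missing {path}"
```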
ImageNet-V2
Used to test model generalization on resampled ImageNet data:
- Step 1: Download from ImageNet-V2: HuggingFace link
- Step 2: Extract imagenetv2-matched-frequency.tar.gz and organize:
imagenet-v2/
└── imagenetv2-matched-frequency-format-val/
├── 1/
├── 2/
├── 3/
├── 4/
├── 5/
└── ...
ImageNet-R
Contains 30,000 renditions (art, cartoons, sketches, etc.) spanning 200 ImageNet classes:
- Step 1: Download from ImageNet-R: download link
- Step 2: Extract the tar file
ImageNet-Sketch
Contains roughly 50,000 hand-drawn sketches covering the 1,000 ImageNet classes:
- Step 1: Download from ImageNet-Sketch: Google Drive link
- Step 2: Extract the zip file
Dataset Path Configuration
Before running experiments, ensure that dataset paths are correctly set in main.py or command line arguments:
--data /path/to/imagenet/val # ImageNet original validation set
--data_corruption /path/to/imagenet-c # ImageNet-C
--data_rendition /path/to/imagenet-r # ImageNet-R
--data_sketch /path/to/imagenet-sketch # ImageNet-Sketch
--data_v2 /path/to/imagenet-v2 # ImageNet-V2
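For reference, flags like these would typically be declared with argparse; a hypothetical sketch (check main.py for the actual definitions and defaults):

```python
import argparse

# Hypothetical declarations matching the flags above; defaults omitted.
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str, help='ImageNet validation set root')
parser.add_argument('--data_corruption', type=str, help='ImageNet-C root')
parser.add_argument('--data_rendition', type=str, help='ImageNet-R root')
parser.add_argument('--data_sketch', type=str, help='ImageNet-Sketch root')
parser.add_argument('--data_v2', type=str, help='ImageNet-V2 root')
args = parser.parse_args()
```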
🏃 Quick Start
Basic Usage
1. Run FOZO for continual adaptation (full-precision model)
Run FOZO on ImageNet-C (5K) with default parameters:
python main.py \
--algorithm fozo \
--data /path/to/imagenet/val \
--data_corruption /path/to/imagenet-c \
--num_prompts 3 \
--fitness_lambda 0.4 \
--lr 0.08 \
--zo_eps 0.5 \
--batch_size 64 \
--continual
2. Run no-adaptation baseline
python main.py \
--algorithm no_adapt \
--data /path/to/imagenet/val \
--data_corruption /path/to/imagenet-c
3. Run TTA on quantized model (INT8)
To test performance on quantized models, add the --quant flag:
python main.py \
--algorithm fozo \
--quant \
--data /path/to/imagenet/val \
--data_corruption /path/to/imagenet-c \
--tag _quant_experiment
4. Run using provided script
We provide an example script run.sh that can be run directly:
bash run.sh
📈 Experimental Results
ImageNet-C (5K, Level 5) Performance Comparison
Results on the ImageNet-C 5K subset (severity level 5) with a ViT-Base backbone:
| Method | Top-1 Acc (%) | Memory (MiB) | FP Count | Runtime |
|---|---|---|---|---|
| NoAdapt | 55.57 | 819 | 1 | 94 |
| FOA | 58.13 | 831 | 2 | 224 |
| ZOA | 58.56 | 859 | 2 | 198 |
| FOZO (Ours) | 59.52 | 831 | 2 | 179 |
Note: FP represents forward pass count. FOZO achieves faster convergence while maintaining low memory.
Convergence Curves for Forward-Only TTA Algorithms

Faster convergence: on ImageNet-C, FOZO needs only 66% of the test time of previous forward-only methods (FOA/ZOA) to reach the same 65% accuracy.
📝 Citation
If you use this code or reference the paper in your research, please cite:
@inproceedings{fozo2026,
  title={FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation},
  author={Anonymous},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
🤝 Acknowledgments
Our code builds in part on the following excellent works:
- FOA - Forward-Only Adaptation method
- RobustBench - Standardized robustness evaluation benchmark
- PTQ4ViT - Vision Transformer quantization tool
- VPT - Visual Prompt Tuning method