🚀 Towards Calibrating Prompt Tuning of Vision-Language Models (CVPR 2026)

March 14, 2026

Code and resources for our CVPR 2026 paper on calibrating prompt tuning in vision-language models.

Paper · arXiv · 🔗 Project Page

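✍️ Authors

Ashshak Sharifdeen, Fahad Shamshad, Muhammad Akhtar Munir, Abhishek Basu, Mohamed Insaf Ismithdeen, Jeyapriyan Jeyamohan, Chathurika Sewwandi Silva, Karthik Nandakumar, Muhammad Haris Khan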

📌 Overview

Prompt tuning is a powerful adaptation strategy for vision-language models, but it often leads to suboptimal confidence calibration, especially when transferring from base classes to novel classes. In this work, we study the calibration behavior of prompt-tuned models and propose a principled approach that improves Expected Calibration Error (ECE) while preserving strong classification performance.

Our contributions are summarized as follows:

  • We analyze the calibration behavior of prompt-tuned vision-language models and identify key factors behind miscalibration on both base and novel classes.
  • We propose a new calibration framework for prompt tuning that explicitly regularizes the learned prompt/text space for improved confidence reliability.
  • We show that our method consistently improves calibration performance across multiple prompt-tuning baselines and 11 fine-grained classification benchmarks.
  • We provide extensive empirical analysis demonstrating improved ECE with competitive or better accuracy across both base and novel splits.

Figure: Overall calibration comparison across prompt-tuning baselines and datasets.

Figure: Motivation and analysis of calibration behavior in prompt-tuned vision-language models.


📥 Installation

We follow the official MaPLe repository for environment setup and dataset preparation. Please use the same environment configuration and dataset preparation pipeline as MaPLe.


📂 Datasets

We evaluate on 11 fine-grained classification benchmarks commonly used in prompt-tuning literature:

  1. ImageNet
  2. Caltech101
  3. OxfordPets
  4. StanfordCars
  5. Flowers102
  6. Food101
  7. FGVCAircraft
  8. SUN397
  9. DTD
  10. EuroSAT
  11. UCF101

🔧 Modify Dassl.pytorch

Open Dassl.pytorch/dassl/evaluation/evaluator.py and replace its contents with the code below:

import numpy as np
import os.path as osp
from collections import OrderedDict, defaultdict

import torch
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd
import matplotlib
matplotlib.use("Agg")           # headless save
import matplotlib.pyplot as plt

from .build import EVALUATOR_REGISTRY


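def ECE_Loss(num_bins, predictions, confidences, correct):
    """
    Expected Calibration Error with equal-width confidence bins (low, up].
    Returns a fraction in [0, 1]; the caller scales it to a percentage.
    """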
    bin_boundaries = torch.linspace(0, 1, num_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    bin_accuracy   = [0.0] * num_bins
    bin_confidence = [0.0] * num_bins
    bin_count      = [0]   * num_bins

    # assign each sample to its confidence bin
    for i, conf in enumerate(confidences):
        for j, (low, up) in enumerate(zip(bin_lowers, bin_uppers)):
            if low.item() < conf <= up.item():
                bin_count[j]      += 1
                bin_accuracy[j]   += correct[i]
                bin_confidence[j] += conf
                break

    # average out per-bin accuracy and confidence
    for j in range(num_bins):
        if bin_count[j] > 0:
            bin_accuracy[j]   /= bin_count[j]
            bin_confidence[j] /= bin_count[j]

    # weighted absolute differences
    total = len(predictions)
    ece = 0.0
    for j in range(num_bins):
        ece += abs(bin_accuracy[j] - bin_confidence[j]) * (bin_count[j] / total)

    return ece


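def MCE(conf, pred, gt, conf_bin_num=10):
    """
    Maximal Calibration Error.
    Note: each bin's |accuracy - confidence| gap is weighted by the bin's
    share of samples before the maximum is taken, so this differs from the
    unweighted textbook definition.
    """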
    df = pd.DataFrame({'true': gt, 'pred': pred, 'conf': conf})
    df['correct'] = (df.pred == df.true).astype(int)

    # digitize into bins
    bin_bounds = np.linspace(0, 1, conf_bin_num + 1)[1:-1]
    df['conf_bin'] = df['conf'].apply(lambda x: np.digitize(x, bin_bounds))

    # compute per-bin accuracy, confidence, counts
    group_acc   = df.groupby('conf_bin')['correct'].mean()
    group_conf  = df.groupby('conf_bin')['conf'].mean()
    counts      = df.groupby('conf_bin')['conf'].count()

    # maximal weighted deviation
    mce = (abs(group_acc - group_conf) * (counts / len(df))).max()
    return mce


def AdaptiveECE(conf, pred, gt, conf_bin_num=10):
    """
    Adaptive (quantile) Expected Calibration Error
    """
    df = pd.DataFrame({'true': gt, 'pred': pred, 'conf': conf})
    df['correct'] = (df.pred == df.true).astype(int)

    # quantile-based binning (fit_transform returns an (n, 1) array, so flatten it)
    df['conf_bin'] = KBinsDiscretizer(
        n_bins=conf_bin_num,
        encode='ordinal',
        strategy='quantile'
    ).fit_transform(np.asarray(conf).reshape(-1, 1)).astype(int).ravel()

    group_acc  = df.groupby('conf_bin')['correct'].mean()
    group_conf = df.groupby('conf_bin')['conf'].mean()
    counts     = df.groupby('conf_bin')['conf'].count()

    ace = (abs(group_acc - group_conf) * (counts / len(df))).sum()
    return ace


def PIECE(conf, knndist, pred, gt,
          dist_bin_num=10, conf_bin_num=10,
          knn_strategy='quantile'):
    """
    Proximity-Informed Expected Calibration Error
    """
    df = pd.DataFrame({
        'true':    gt,
        'pred':    pred,
        'conf':    conf,
        'knndist': knndist
    })
    df['correct'] = (df.pred == df.true).astype(int)

    # bin by knn distance (fit_transform returns an (n, 1) array, so flatten it)
    df['knn_bin'] = KBinsDiscretizer(
        n_bins=dist_bin_num,
        encode='ordinal',
        strategy=knn_strategy
    ).fit_transform(df[['knndist']]).astype(int).ravel()

    # uniform bins for confidence
    bin_bounds = np.linspace(0, 1, conf_bin_num + 1)[1:-1]
    df['conf_bin'] = df['conf'].apply(lambda x: np.digitize(x, bin_bounds))

    # compute per-(knn,conf) stats
    grp_acc   = df.groupby(['knn_bin', 'conf_bin'])['correct'].mean()
    grp_conf  = df.groupby(['knn_bin', 'conf_bin'])['conf'].mean()
    counts    = df.groupby(['knn_bin', 'conf_bin'])['conf'].count()

    piece = (abs(grp_acc - grp_conf) * (counts / len(df))).sum()
    return piece


class EvaluatorBase:
    def __init__(self, cfg):
        self.cfg = cfg

    def reset(self):
        raise NotImplementedError

    def process(self, *args, **kwargs):
        raise NotImplementedError

    def evaluate(self):
        raise NotImplementedError


@EVALUATOR_REGISTRY.register()
class Classification(EvaluatorBase):
    def __init__(self, cfg, lab2cname=None, **kwargs):
        super().__init__(cfg)
        self._lab2cname     = lab2cname
        self._correct       = 0
        self._total         = 0
        self._y_true        = []
        self._y_pred        = []
        self._confidences   = []
        self._knn_dists     = []  # for PIECE

        if cfg.TEST.PER_CLASS_RESULT:
            assert lab2cname is not None, "lab2cname is required for per-class results"
            self._per_class_res = defaultdict(list)
        else:
            self._per_class_res = None

    def reset(self):
        self._correct       = 0
        self._total         = 0
        self._y_true        = []
        self._y_pred        = []
        self._confidences   = []
        self._knn_dists     = []
        if self._per_class_res is not None:
            self._per_class_res = defaultdict(list)

    def process(self, model_output, ground_truth, knn_dist=None):
        # predictions and confidences
        preds      = model_output.argmax(dim=1)
        confs      = model_output.softmax(dim=1).max(dim=1)[0]
        matches    = preds.eq(ground_truth).float()

        # update overall counters
        self._correct += int(matches.sum().item())
        self._total   += ground_truth.size(0)

        # store for final metrics
        self._y_true.extend(ground_truth.cpu().tolist())
        self._y_pred.extend(preds.cpu().tolist())
        self._confidences.extend(confs.cpu().tolist())
        if knn_dist is not None:
            # assume knn_dist is a numpy array aligned with batch
            self._knn_dists.extend(knn_dist.tolist())

        if self._per_class_res is not None:
            for i, label in enumerate(ground_truth):
                self._per_class_res[label.item()].append(int(matches[i].item()))

    def evaluate(self):
        results = OrderedDict()

        # convert to numpy arrays
        y_true = np.array(self._y_true)
        y_pred = np.array(self._y_pred)
        confs  = np.array(self._confidences)

        # overall accuracy & error
        acc = 100.0 * self._correct / self._total
        err = 100.0 - acc

        # macro-F1
        macro_f1 = 100.0 * f1_score(
            y_true, y_pred,
            average="macro",
            labels=np.unique(y_true)
        )

        # calibration metrics
        ece_value       = ECE_Loss(
            num_bins=10,
            predictions=y_pred,
            confidences=confs,
            correct=(y_pred == y_true).astype(int)
        ) * 100.0

        mce_value       = MCE(confs, y_pred, y_true) * 100.0
        adaptive_ece    = AdaptiveECE(confs, y_pred, y_true) * 100.0

        # PIECE only if we have knn distances
        if len(self._knn_dists) == len(confs):
            knn_arr    = np.array(self._knn_dists)
            piece_value = PIECE(confs, knn_arr, y_pred, y_true) * 100.0
        else:
            piece_value = None

        # build result dict
        results["accuracy"]       = acc
        results["error_rate"]     = err
        results["macro_f1"]       = macro_f1
        results["ece"]            = ece_value
        results["mce"]            = mce_value
        results["adaptive_ece"]   = adaptive_ece
        if piece_value is not None:
            results["piece"]      = piece_value

        # print summary
        print(f"=> Total samples: {self._total:,}")
        print(f"=> Accuracy: {acc:.2f}%  Error rate: {err:.2f}%")
        print(f"=> Macro-F1: {macro_f1:.2f}%")
        print(f"=> ECE: {ece_value:.2f}%  MCE: {mce_value:.2f}%  Adaptive ECE: {adaptive_ece:.2f}%")
        if piece_value is not None:
            print(f"=> PIECE: {piece_value:.2f}%")

        # per-class results
        if self._per_class_res is not None:
            accs = []
            print("=> Per-class accuracies:")
            for lbl in sorted(self._per_class_res.keys()):
                corrects = self._per_class_res[lbl]
                cls_acc  = 100.0 * sum(corrects) / len(corrects)
                cname    = self._lab2cname[lbl]
                accs.append(cls_acc)
                print(f"* Class {lbl} ({cname}): {cls_acc:.2f}% [{len(corrects)} samples]")
            mean_pc = float(np.mean(accs))
            results["perclass_accuracy"] = mean_pc
            print(f"=> Average per-class accuracy: {mean_pc:.2f}%")

        # optionally save confusion matrix
        if self.cfg.TEST.COMPUTE_CMAT:
            cmat = confusion_matrix(y_true, y_pred, normalize="true")
            save_path = osp.join(self.cfg.OUTPUT_DIR, "cmat.pt")
            torch.save(cmat, save_path)
            print(f"Confusion matrix saved to {save_path}")
        return results
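
As a quick sanity check of the equal-width binning used by ECE_Loss above, the same quantity can be sketched in vectorized NumPy on toy data. The helper name ece_equal_width and the toy inputs are illustrative assumptions and are not part of Dassl or of this repository. In the toy data both occupied bins have a gap of 0.45 between accuracy and mean confidence, so the result is 0.45 (45%).

```python
import numpy as np


def ece_equal_width(confidences, correct, num_bins=10):
    """Equal-width-bin ECE (hypothetical helper, not part of Dassl)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Interior edges 0.1, ..., 0.9 define bins of the form (low, up].
    edges = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    bin_ids = np.digitize(confidences, edges, right=True)
    # Per-bin sums of correctness and of confidence.
    sum_correct = np.bincount(bin_ids, weights=correct, minlength=num_bins)
    sum_conf = np.bincount(bin_ids, weights=confidences, minlength=num_bins)
    # |acc_b - conf_b| * n_b / N simplifies to |sum_correct_b - sum_conf_b| / N.
    return float(np.abs(sum_correct - sum_conf).sum() / len(confidences))


# Toy data: two samples at 0.95 confidence (one correct),
# two samples at 0.55 confidence (both correct).
toy_conf = [0.95, 0.95, 0.55, 0.55]
toy_correct = [1, 0, 1, 1]
toy_ece = ece_equal_width(toy_conf, toy_correct)
print(f"toy ECE: {100.0 * toy_ece:.2f}%")
```

One difference from ECE_Loss: a confidence of exactly 0 lands in the first bin here, whereas ECE_Loss leaves it unbinned. This does not matter for softmax maxima, which are always positive.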


🔧 Run Experiments

🔥 TCPT Experiment

Move to the scripts folder of the respective method and run the commands below (the example uses MaPLe):

# Fine-grained classification
bash base2new_train_maple_datasets.sh && bash base2new_test_maple.sh && bash parse_all_results.sh


📊 Main Results

We report Top-1 accuracy (Acc., in %) and Expected Calibration Error (ECE, in %) on both base and novel classes. Higher accuracy is better, and lower ECE indicates better calibration.
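
For reference, the ECE computed by the ECE_Loss function in evaluator.py can be written as follows. The symbols are notation chosen here for illustration: B is the number of equal-width bins (10 in the evaluator), S_b is the set of test samples whose top-1 confidence falls in the b-th bin, and N is the total number of test samples.

```latex
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{|S_b|}{N}\,
\bigl|\,\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\,\bigr|,
\qquad
S_b = \Bigl\{\, i \;:\; \tfrac{b-1}{B} < \hat{p}_i \le \tfrac{b}{B} \Bigr\}
```

Here acc(S_b) is the fraction of correctly classified samples in the bin, conf(S_b) is their mean top-1 confidence \hat{p}_i, and the reported value is this quantity multiplied by 100.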

Dataset abbreviations:
INet = ImageNet, Cal = Caltech101, Pets = OxfordPets, Cars = StanfordCars, Flow = Flowers102, Food = Food101, Air = FGVCAircraft, SUN = SUN397, DTD = DTD, Euro = EuroSAT, UCF = UCF101.


Table 1. Base-Class Results

Zero-Shot Reference

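| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Zero Shot | Acc. | 72.40 | 97.20 | 91.30 | 63.60 | 71.80 | 90.10 | 27.70 | 69.40 | 53.00 | 57.00 | 71.00 | 69.50 |
| Zero Shot | ECE | 1.51 | 6.49 | 2.25 | 3.74 | 3.11 | 1.57 | 3.03 | 1.59 | 4.53 | 8.35 | 3.24 | 3.58 |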
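
The Avg column appears to be the unweighted mean over the 11 datasets. This is an inference from the numbers rather than something the README states, and it can be checked against the Zero Shot base-class rows of Table 1. The dictionary and function names below are illustrative only.

```python
# Zero Shot base-class rows copied from Table 1, keyed by dataset abbreviation.
zero_shot_base_acc = {
    "INet": 72.40, "Cal": 97.20, "Pets": 91.30, "Cars": 63.60,
    "Flow": 71.80, "Food": 90.10, "Air": 27.70, "SUN": 69.40,
    "DTD": 53.00, "Euro": 57.00, "UCF": 71.00,
}
zero_shot_base_ece = {
    "INet": 1.51, "Cal": 6.49, "Pets": 2.25, "Cars": 3.74,
    "Flow": 3.11, "Food": 1.57, "Air": 3.03, "SUN": 1.59,
    "DTD": 4.53, "Euro": 8.35, "UCF": 3.24,
}


def dataset_average(row):
    """Unweighted mean over datasets (assumed definition of the Avg column)."""
    return sum(row.values()) / len(row)


acc_avg = dataset_average(zero_shot_base_acc)
ece_avg = dataset_average(zero_shot_base_ece)
print(f"Acc. avg: {acc_avg:.2f}  ECE avg: {ece_avg:.2f}")
```

Both means round to the tabulated Avg entries (69.50 and 3.58).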

CoOp-based Methods

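| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| CoOp | Acc. | 75.60 | 97.98 | 94.77 | 76.22 | 90.00 | 90.20 | 35.23 | 81.14 | 76.27 | 90.24 | 83.32 | 81.00 |
| CoOp | ECE | 1.65 | 0.66 | 1.00 | 3.73 | 4.93 | 3.66 | 25.70 | 8.11 | 12.17 | 1.75 | 6.44 | 6.35 |
| MBLS | Acc. | 75.12 | 97.89 | 91.11 | 76.21 | 89.34 | 89.78 | 34.32 | 81.32 | 76.34 | 90.12 | 82.78 | 80.39 |
| MBLS | ECE | 2.98 | 9.60 | 7.70 | 12.20 | 5.69 | 12.34 | 10.48 | 16.80 | 4.25 | 8.02 | 9.39 | 9.04 |
| Temp. Scal. | Acc. | 75.60 | 98.19 | 94.15 | 78.65 | 97.72 | 90.10 | 42.00 | 81.32 | 80.67 | 90.70 | 84.56 | 83.06 |
| Temp. Scal. | ECE | 1.50 | 1.20 | 2.54 | 6.65 | 4.60 | 0.50 | 3.43 | 2.01 | 3.86 | 4.76 | 1.57 | 2.96 |
| DAC | Acc. | - | - | - | - | - | - | - | - | - | - | - | - |
| DAC | ECE | - | - | - | - | - | - | - | - | - | - | - | - |
| ZS-Norm | Acc. | 76.10 | 97.85 | 94.38 | 77.78 | 95.76 | 89.52 | 39.74 | 81.37 | 81.02 | 90.45 | 84.01 | 82.54 |
| ZS-Norm | ECE | 3.15 | 4.35 | 7.75 | 11.30 | 11.29 | 3.14 | 13.05 | 4.22 | 49.53 | 37.04 | 3.47 | 13.48 |
| Penalty | Acc. | 76.44 | 97.72 | 95.11 | 77.05 | 96.30 | 87.92 | 38.07 | 81.04 | 77.32 | 47.09 | 80.47 | 77.68 |
| Penalty | ECE | 2.43 | 4.79 | 6.47 | 10.01 | 9.38 | 5.98 | 8.59 | 4.59 | 21.84 | 20.47 | 7.42 | 9.27 |
| Ours | Acc. | 76.53 | 98.06 | 94.95 | 77.32 | 97.21 | 90.38 | 38.62 | 81.68 | 80.44 | 88.56 | 84.68 | 82.58 |
| Ours | ECE | 2.47 | 1.01 | 1.94 | 7.10 | 4.80 | 0.30 | 4.96 | 1.22 | 2.42 | 4.90 | 1.11 | 2.93 |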

MaPLe-based Methods

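| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| MaPLe | Acc. | 76.71 | 97.97 | 95.53 | 72.93 | 95.00 | 90.80 | 36.33 | 80.55 | 79.63 | 91.13 | 83.20 | 82.41 |
| MaPLe | ECE | 2.27 | 1.54 | 2.68 | 7.25 | 4.28 | 0.78 | 3.86 | 1.27 | 4.18 | 3.42 | 2.68 | 3.19 |
| MBLS | Acc. | 75.59 | 98.23 | 95.23 | 72.77 | 95.93 | 90.80 | 36.20 | 80.73 | 80.03 | 90.93 | 84.13 | 82.50 |
| MBLS | ECE | 29.06 | 5.03 | 6.64 | 19.06 | 12.74 | 6.55 | 5.60 | 11.01 | 4.79 | 3.73 | 8.46 | 8.36 |
| Temp. Scal. | Acc. | 76.66 | 97.97 | 94.93 | 72.70 | 95.93 | 90.63 | 36.37 | 80.73 | 78.60 | 93.60 | 84.00 | 82.55 |
| Temp. Scal. | ECE | 2.37 | 1.26 | 2.28 | 4.96 | 3.44 | 0.71 | 3.04 | 2.84 | 5.98 | 1.31 | 3.07 | 2.89 |
| DAC | Acc. | - | - | - | - | - | - | - | - | - | - | - | - |
| DAC | ECE | - | - | - | - | - | - | - | - | - | - | - | - |
| ZS-Norm | Acc. | 76.63 | 97.57 | 95.70 | 73.07 | 95.63 | 90.57 | 36.00 | 80.97 | 80.43 | 91.30 | 83.87 | 82.51 |
| ZS-Norm | ECE | 1.64 | 23.30 | 5.91 | 8.66 | 11.49 | 1.13 | 7.87 | 2.33 | 7.02 | 19.38 | 3.86 | 9.10 |
| Penalty | Acc. | 76.72 | 98.07 | 95.30 | 72.43 | 95.77 | 90.73 | 34.33 | 80.93 | 64.60 | 36.77 | 83.03 | 75.20 |
| Penalty | ECE | 3.87 | 5.41 | 6.37 | 13.53 | 12.67 | 3.87 | 8.42 | 7.28 | 19.97 | 13.43 | 8.50 | 9.95 |
| Ours | Acc. | 76.72 | 97.97 | 94.93 | 72.80 | 96.20 | 90.43 | 36.80 | 81.10 | 80.73 | 92.00 | 84.50 | 82.75 |
| Ours | ECE | 2.39 | 1.19 | 1.54 | 7.92 | 3.45 | 0.65 | 4.50 | 1.55 | 3.56 | 1.33 | 2.12 | 2.78 |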

KGCoOp-based Methods

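| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| KGCoOp | Acc. | 75.75 | 97.70 | 94.68 | 72.70 | 95.16 | 90.57 | 36.77 | 80.59 | 79.40 | 86.14 | 83.51 | 81.18 |
| KGCoOp | ECE | 2.52 | 2.92 | 3.27 | 10.16 | 12.12 | 1.68 | 3.27 | 4.92 | 8.39 | 11.90 | 5.03 | 6.02 |
| MBLS | Acc. | 76.23 | 97.81 | 95.00 | 75.34 | 96.24 | 90.49 | 38.28 | 80.86 | 79.94 | 87.96 | 83.45 | 81.96 |
| MBLS | ECE | 6.19 | 4.30 | 5.26 | 13.43 | 12.48 | 4.08 | 8.01 | 8.16 | 9.03 | 11.97 | 5.86 | 8.07 |
| Temp. Scal. | Acc. | 75.77 | 97.66 | 94.67 | 70.08 | 94.65 | 90.50 | 35.81 | 80.51 | 78.74 | 86.44 | 83.32 | 80.74 |
| Temp. Scal. | ECE | 6.47 | 4.16 | 5.13 | 11.70 | 15.35 | 3.64 | 7.41 | 8.50 | 11.12 | 15.79 | 7.39 | 8.79 |
| DAC | Acc. | - | - | - | - | - | - | - | - | - | - | - | - |
| DAC | ECE | - | - | - | - | - | - | - | - | - | - | - | - |
| ZS-Norm | Acc. | 75.78 | 94.14 | 97.65 | 74.55 | 73.90 | 91.71 | 30.79 | 76.50 | 51.49 | 65.39 | 76.44 | 73.49 |
| ZS-Norm | ECE | 2.70 | 1.65 | 3.51 | 3.85 | 4.72 | 2.20 | 8.42 | 3.23 | 6.37 | 6.16 | 3.83 | 4.24 |
| Penalty | Acc. | 75.65 | 97.70 | 94.68 | 72.45 | 93.86 | 90.59 | 37.76 | 80.63 | 78.40 | 83.09 | 82.97 | 80.71 |
| Penalty | ECE | 2.73 | 3.27 | 3.22 | 10.58 | 13.01 | 1.73 | 9.59 | 6.51 | 20.40 | 6.51 | 6.07 | 7.57 |
| Ours | Acc. | 75.84 | 97.68 | 94.84 | 71.65 | 95.22 | 90.52 | 36.03 | 80.70 | 78.47 | 85.10 | 83.16 | 80.34 |
| Ours | ECE | 2.14 | 1.88 | 2.96 | 8.10 | 11.21 | 1.12 | 4.81 | 4.12 | 7.01 | 12.64 | 4.14 | 5.47 |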

Table 2. Novel-Class Results

Zero-Shot Reference

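| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Zero Shot | Acc. | 72.40 | 94.10 | 97.10 | 75.00 | 77.50 | 91.10 | 35.90 | 75.50 | 60.60 | 63.80 | 78.60 | 74.30 |
| Zero Shot | ECE | 2.09 | 1.55 | 3.42 | 3.31 | 4.91 | 1.83 | 6.55 | 3.48 | 6.86 | 9.12 | 5.52 | 4.43 |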

CoOp-based Methods

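| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| CoOp | Acc. | 59.07 | 94.18 | 96.49 | 65.29 | 69.90 | 90.57 | 24.79 | 70.77 | 52.98 | 64.68 | 62.83 | 68.32 |
| CoOp | ECE | 10.69 | 2.16 | 1.67 | 11.73 | 12.13 | 3.03 | 30.44 | 13.70 | 20.82 | 11.88 | 18.74 | 12.45 |
| MBLS | Acc. | 59.11 | 95.10 | 96.23 | 65.28 | 69.89 | 90.23 | 24.80 | 70.12 | 53.12 | 64.65 | 62.97 | 68.31 |
| MBLS | ECE | 4.09 | 2.21 | 3.45 | 9.70 | 18.90 | 13.80 | 10.20 | 9.70 | 8.90 | 12.10 | 13.21 | 9.66 |
| Temp. Scaling | Acc. | 59.07 | 93.45 | 96.03 | 66.70 | 65.86 | 96.60 | 27.37 | 70.67 | 48.19 | 54.70 | 57.51 | 66.92 |
| Temp. Scaling | ECE | 7.33 | 3.17 | 3.65 | 5.01 | 8.06 | 1.06 | 18.80 | 6.93 | 20.21 | 15.13 | 14.55 | 9.45 |
| DAC | Acc. | - | - | - | - | - | - | - | - | - | - | - | - |
| DAC | ECE | 5.67 | 3.17 | 1.82 | 5.16 | 10.19 | 1.78 | 17.38 | 4.05 | 10.48 | 8.62 | 8.67 | 7.00 |
| ZS-Norm | Acc. | 66.26 | 93.30 | 93.98 | 66.62 | 67.21 | 88.91 | 25.76 | 70.51 | 44.08 | 50.42 | 62.52 | 66.32 |
| ZS-Norm | ECE | 2.46 | 2.89 | 7.94 | 2.87 | 4.41 | 3.32 | 10.18 | 2.47 | 21.80 | 15.93 | 4.28 | 7.14 |
| Penalty | Acc. | 66.71 | 92.87 | 96.14 | 68.11 | 68.65 | 78.34 | 29.29 | 71.65 | 40.78 | 41.44 | 67.53 | 65.59 |
| Penalty | ECE | 2.36 | 2.52 | 7.42 | 2.73 | 4.93 | 4.70 | 7.81 | 2.79 | 4.20 | 13.11 | 4.66 | 5.20 |
| Ours | Acc. | 67.03 | 93.56 | 97.36 | 69.49 | 71.63 | 90.84 | 30.83 | 70.03 | 48.07 | 56.70 | 66.49 | 69.28 |
| Ours | ECE | 2.02 | 2.21 | 3.03 | 2.10 | 3.51 | 0.87 | 10.64 | 3.08 | 9.31 | 11.15 | 4.75 | 4.79 |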

MaPLe-based Methods

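| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| MaPLe | Acc. | 70.50 | 95.10 | 97.85 | 73.57 | 72.80 | 92.10 | 34.53 | 78.20 | 58.47 | 75.90 | 77.85 | 75.17 |
| MaPLe | ECE | 1.93 | 1.62 | 2.63 | 3.09 | 11.67 | 1.19 | 11.24 | 2.21 | 12.16 | 11.68 | 3.98 | 5.76 |
| MBLS | Acc. | 68.47 | 94.17 | 96.97 | 71.93 | 68.93 | 91.46 | 33.77 | 78.10 | 54.70 | 75.97 | 78.23 | 73.88 |
| MBLS | ECE | 22.82 | 4.06 | 7.41 | 11.41 | 4.84 | 7.06 | 6.06 | 10.41 | 10.31 | 11.25 | 6.63 | 9.30 |
| Temp. Scaling | Acc. | 70.46 | 94.83 | 97.30 | 73.47 | 72.77 | 91.77 | 34.07 | 78.13 | 57.97 | 73.77 | 75.33 | 74.53 |
| Temp. Scaling | ECE | 1.95 | 2.56 | 2.13 | 4.08 | 12.76 | 0.72 | 19.11 | 5.09 | 16.47 | 8.05 | 7.13 | 7.28 |
| DAC | Acc. | - | - | - | - | - | - | - | - | - | - | - | - |
| DAC | ECE | 2.11 | 1.26 | 2.51 | 2.75 | 11.28 | 1.50 | 9.06 | 1.22 | 8.16 | 8.55 | 2.30 | 4.61 |
| ZS-Norm | Acc. | 70.63 | 90.30 | 97.23 | 73.30 | 70.03 | 91.83 | 34.07 | 78.47 | 60.70 | 68.13 | 77.80 | 73.86 |
| ZS-Norm | ECE | 3.67 | 23.02 | 5.00 | 3.26 | 6.05 | 1.62 | 7.82 | 2.65 | 5.23 | 14.53 | 3.33 | 6.93 |
| Penalty | Acc. | 70.66 | 93.60 | 97.33 | 73.90 | 70.87 | 91.90 | 34.70 | 78.67 | 45.47 | 36.77 | 76.83 | 70.06 |
| Penalty | ECE | 1.49 | 3.25 | 6.23 | 5.94 | 5.76 | 4.27 | 4.92 | 5.70 | 8.47 | 13.35 | 6.07 | 5.95 |
| Ours | Acc. | 70.28 | 94.87 | 97.57 | 75.27 | 73.87 | 91.77 | 36.03 | 78.13 | 61.27 | 67.60 | 79.87 | 75.14 |
| Ours | ECE | 1.74 | 1.42 | 2.29 | 2.60 | 10.07 | 0.86 | 8.33 | 1.11 | 7.37 | 7.45 | 3.32 | 4.23 |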

KGCoOp-based Methods

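| Method | Metric | INet | Cal | Pets | Cars | Flow | Food | Air | SUN | DTD | Euro | UCF | Avg |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| KGCoOp | Acc. | 69.70 | 94.43 | 97.67 | 74.25 | 75.10 | 91.65 | 36.77 | 76.33 | 54.23 | 64.68 | 75.59 | 73.67 |
| KGCoOp | ECE | 1.84 | 1.71 | 3.42 | 3.36 | 5.03 | 2.04 | 6.06 | 1.66 | 4.38 | 8.67 | 2.65 | 3.71 |
| MBLS | Acc. | 69.14 | 94.32 | 94.24 | 73.01 | 73.90 | 90.49 | 28.87 | 75.75 | 56.28 | 64.27 | 73.84 | 72.19 |
| MBLS | ECE | 4.60 | 1.62 | 3.16 | 3.95 | 4.00 | 4.00 | 11.39 | 5.56 | 3.23 | 5.30 | 4.10 | 4.63 |
| Temp. Scaling | Acc. | 69.79 | 94.54 | 97.56 | 74.94 | 75.37 | 91.66 | 32.35 | 76.79 | 53.83 | 62.17 | 76.91 | 73.27 |
| Temp. Scaling | ECE | 5.81 | 1.89 | 4.91 | 6.35 | 4.63 | 4.02 | 5.40 | 6.18 | 3.83 | 7.60 | 6.43 | 5.18 |
| DAC | Acc. | - | - | - | - | - | - | - | - | - | - | - | - |
| DAC | ECE | 4.32 | 1.84 | 3.11 | 3.12 | 5.90 | 1.94 | 11.78 | 1.67 | 7.09 | 6.59 | 2.69 | 4.47 |
| ZS-Norm | Acc. | 69.68 | 94.14 | 97.65 | 74.55 | 73.90 | 91.71 | 30.79 | 76.50 | 51.49 | 65.39 | 76.44 | 72.19 |
| ZS-Norm | ECE | 1.80 | 1.65 | 3.51 | 3.85 | 4.72 | 2.20 | 8.42 | 3.23 | 6.37 | 6.16 | 3.83 | 4.16 |
| Penalty | Acc. | 69.58 | 94.34 | 96.35 | 74.75 | 73.21 | 91.31 | 30.58 | 76.69 | 51.19 | 65.43 | 76.52 | 72.99 |
| Penalty | ECE | 1.82 | 1.71 | 4.21 | 3.05 | 5.12 | 2.99 | 8.12 | 4.13 | 5.87 | 6.56 | 3.93 | 4.75 |
| Ours | Acc. | 69.50 | 94.21 | 97.72 | 74.39 | 73.80 | 91.64 | 31.63 | 76.30 | 55.92 | 65.76 | 76.62 | 73.41 |
| Ours | ECE | 1.84 | 1.22 | 3.50 | 3.60 | 4.78 | 1.61 | 7.67 | 1.91 | 3.37 | 4.15 | 3.01 | 3.33 |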

🔍 Summary of Results

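  • On base classes, our method lowers average ECE relative to each underlying prompt-tuning method: CoOp 6.35 → 2.93, MaPLe 3.19 → 2.78, and KGCoOp 6.02 → 5.47.
  • On novel classes, our method lowers average ECE for all three methods (CoOp 12.45 → 4.79, MaPLe 5.76 → 4.23, KGCoOp 3.71 → 3.33) while keeping average accuracy within about one point of the underlying method (68.32 → 69.28, 75.17 → 75.14, and 73.67 → 73.41, respectively).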
  • In several settings, our method improves average ECE substantially over prior calibration baselines and remains competitive with zero-shot CLIP calibration.

🙏 Acknowledgement

We thank the authors of the following repositories for making their code publicly available:


📖 Citation

If you find our work useful for your research, please consider citing:

@misc{sharifdeen2026calibratingprompttuningvisionlanguage,
      title={Towards Calibrating Prompt Tuning of Vision-Language Models}, 
      author={Ashshak Sharifdeen and Fahad Shamshad and Muhammad Akhtar Munir and Abhishek Basu and Mohamed Insaf Ismithdeen and Jeyapriyan Jeyamohan and Chathurika Sewwandi Silva and Karthik Nandakumar and Muhammad Haris Khan},
      year={2026},
      eprint={2602.19024},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.19024}, 
}

📧 Contact

If you need any further clarification, please feel free to contact me at ashshaks@gmail.com.