CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

March 12, 2026

This repository contains the implementation of the CVPR 2026 paper: CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment [Paper].

Abstract

Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misalignment among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model’s intrinsic bias and limited fine-grained discriminative ability.
To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment.
Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and their misaligned samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Discrepancy Expert (MGDE) module is designed to jointly leverage semantic-level and sample-level experts for more robust confusion-aware reasoning.

Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72% of confusable sample pairs.

Framework

Overview of CAPT. By matching feature representations, we first employ a Semantic Confusion Miner (SEM) that, together with statistics from the Confusion Bank, identifies Semantic Confusion Pairs and generates both commonality and difference prompts. Subsequently, the Sample Confusion Miner (SAM) locates the most representative confusing samples based on these pairs and extracts their Sample Confusion Feature via the Diff-Manner Adapter. Finally, the Multi-Granularity Discrepancy Expert (MGDE) module integrates semantic-level and sample-level confusion information for unified representation refinement.
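As a rough illustration of the bookkeeping behind a Confusion Bank, the sketch below counts how often each true class is predicted as another class and remembers which samples were misclassified. The function name `build_confusion_bank`, the `min_count` threshold, and the dictionary layout are assumptions made for this example; they are not taken from the repository's code.

```python
from collections import defaultdict


def build_confusion_bank(labels, preds, min_count=2):
    """Toy confusion bank: (true, predicted) -> indices of misclassified samples.

    Hypothetical helper for illustration; not the repository's implementation.
    Only pairs seen at least `min_count` times are kept as "stable" confusions.
    """
    bank = defaultdict(list)
    for idx, (y, p) in enumerate(zip(labels, preds)):
        if y != p:  # only misaligned samples enter the bank
            bank[(y, p)].append(idx)
    return {pair: idxs for pair, idxs in bank.items() if len(idxs) >= min_count}


labels = ["cat", "cat", "cat", "dog", "dog", "fox"]
preds = ["lynx", "lynx", "cat", "dog", "wolf", "fox"]
bank = build_confusion_bank(labels, preds)
print(bank)  # {('cat', 'lynx'): [0, 1]}
```

Here the single dog/wolf error is dropped because it falls below the threshold, which mirrors the idea that only persistent confusion pairs are of interest.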

Experimental Results

Base-to-New

Results reported below show accuracy for base and new classes on 11 recognition-based datasets.

[Figure: Base-to-New results table]

How to Run

Prepare

(Acknowledgement: This part is modified from PromptKD's official repository.)

  1. Create the environment and install the Dassl.pytorch library. Please follow the instructions detailed in INSTALL.md.
  2. Download the publicly released pre-trained teacher ViT-L/14 CLIP models from PromptKD.
    Files are publicly available at [Baidu Yun] [TeraBox] [Google Drive]
    (Note that due to cloud space limitations, only a limited number of models are provided on Google Drive.)
    After obtaining the teacher model, unzip these files and place the model in the ./teacher_model folder.
  3. Download the original ViT-B/16 and ViT-L/14 CLIP model weights from the official OpenAI website. Then place these models in the ./clip folder.
    [ViT-B/16 CLIP] [ViT-L/14 CLIP]
  4. Download the zip file of DPC-specific annotation files: [Google Drive] [Baidu Yun]
    Then unzip and place these SPLE_XXX.json files in the ./DATA/SPLE_database folder.
  5. Prepare the dataset. Please follow the instructions detailed in DATASETS.md.
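Before launching training, it can help to confirm that the folders named in the steps above exist. The following is a minimal sketch, assuming it is run from the repository root; the helper `missing_dirs` and the default folder list are illustrative and not part of the repository.

```python
from pathlib import Path

# Top-level folders named in the steps above. Append the SPLE annotation
# folder using the exact name you created on disk.
EXPECTED_DIRS = ["teacher_model", "clip", "DATA"]


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```

The check only looks at directory names; it does not verify that the model weights or annotation files inside them are complete.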

CAPT Training

CAPT keeps the original backbone frozen; only the Diff-Manner Adapter in SAM and the MGDE module need to be trained.
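A common way to keep a backbone frozen is to disable gradients for every parameter whose name does not match the trainable modules. The sketch below shows that pattern under stated assumptions: the keyword strings and parameter names are hypothetical, and a small stand-in class replaces real tensors so the example runs without PyTorch. With PyTorch, `model.named_parameters()` could be passed in directly, since tensors provide `requires_grad_`.

```python
def freeze_except(named_params, trainable_keywords):
    """Enable gradients only for parameters whose name contains a keyword.

    `named_params` is any iterable of (name, param) pairs where `param`
    has a `requires_grad_(bool)` method, e.g. `model.named_parameters()`
    in PyTorch. Returns the names left trainable.
    """
    trainable = []
    for name, param in named_params:
        keep = any(key in name for key in trainable_keywords)
        param.requires_grad_(keep)
        if keep:
            trainable.append(name)
    return trainable


class _FakeParam:
    """Stand-in for a tensor so the sketch runs without PyTorch installed."""

    def __init__(self):
        self.requires_grad = True

    def requires_grad_(self, flag=True):
        self.requires_grad = flag
        return self


# Hypothetical parameter names; the real module names may differ.
params = {
    "image_encoder.conv1.weight": _FakeParam(),
    "text_encoder.ln_final.weight": _FakeParam(),
    "diff_manner_adapter.fc.weight": _FakeParam(),
    "mgde.gate.weight": _FakeParam(),
}
print(freeze_except(params.items(), ["diff_manner_adapter", "mgde"]))
# ['diff_manner_adapter.fc.weight', 'mgde.gate.weight']
```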

First, construct the Confusion Bank:

python train.py \
  --root DATA/caltech-101 \
  --seed 1 \
  --trainer PromptKD \
  --dataset-config-file configs/datasets/caltech101.yaml \
  --config-file configs/trainers/PromptKD/vit_b16_c2_ep20_batch32_4+4ctx.yaml \
  --output-dir output/PromptKD/base2new/train_base/caltech101/1_PromptKD_baseline/vit_b16_c2_ep20_batch32_4+4ctx/seed1 \
  DATASET.NUM_SHOTS 0 \
  TRAINER.MODAL base2novel \
  TRAINER.PROMPTKD.TEMPERATURE 1.0 \
  TRAINER.PROMPTKD.KD_WEIGHT 1000.0 \
  TEST.SPLIT val

Then train the Diff-Manner Adapter and the MGDE module:

python train.py \
  --root DATA/caltech-101 \
  --seed 1 \
  --trainer StackSPLE_PromptKD \
  --dataset-config-file configs/datasets/caltech101.yaml \
  --config-file configs/trainers/SPLE/PromptKD/vit_b16_c2_ep20_batch4_4+4ctx.yaml \
  --output-dir output/PromptKD/base2new/train_base/caltech101/3_SPLE_converse/vit_b16_c2_ep20_batch4_4+4ctx_con20/seed1 \
  DATASET.NUM_SHOTS 16 \
  SPLE.BACK_CKPT_PATH output/PromptKD/base2new/train_base/caltech101/1_PromptKD_baseline/vit_b16_c2_ep20_batch32_4+4ctx/seed1 \
  SPLE.BACK_CKPT_EPOCH 20 \
  SPLE.PIC_LIB DATA/SPLE_database/SPLE_Caltech101.json \
  SPLE.STACK.MODE converse \
  SPLE.STACK.WEIGHT 0.2 \
  DATASET.SUBSAMPLE_CLASSES base \
  SPLE.STACK.WEIGHT_FOR_NEW 0.0 \
  TRAINER.MODAL base2novel \
  TRAINER.PROMPTKD.TEMPERATURE 1.0 \
  TRAINER.PROMPTKD.KD_WEIGHT 1000.0 \
  TEST.SPLIT val
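In the two commands above, the SPLE.BACK_CKPT_PATH value of the second command is identical to the --output-dir of the first, which suggests the second stage loads the checkpoint written by the first. When adapting the commands to another dataset or seed, a small check like the one below can catch a mismatch early. The helper names are illustrative, and the command strings are abbreviated versions of the ones above.

```python
import shlex


def value_after(tokens, key):
    """Return the token that follows `key` in a tokenized command line."""
    return tokens[tokens.index(key) + 1]


def stages_are_linked(stage1_cmd, stage2_cmd):
    """True if stage two loads its checkpoint from stage one's output dir."""
    out_dir = value_after(shlex.split(stage1_cmd), "--output-dir")
    ckpt_dir = value_after(shlex.split(stage2_cmd), "SPLE.BACK_CKPT_PATH")
    return out_dir == ckpt_dir


# Abbreviated versions of the two commands above; paste the full ones in practice.
out = (
    "output/PromptKD/base2new/train_base/caltech101/"
    "1_PromptKD_baseline/vit_b16_c2_ep20_batch32_4+4ctx/seed1"
)
stage1 = f"python train.py --trainer PromptKD --output-dir {out} DATASET.NUM_SHOTS 0"
stage2 = (
    "python train.py --trainer StackSPLE_PromptKD "
    f"SPLE.BACK_CKPT_PATH {out} SPLE.BACK_CKPT_EPOCH 20"
)
print(stages_are_linked(stage1, stage2))  # True
```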

Contact

If you have any questions about our CAPT model, you can submit an issue on GitHub or contact me by email (23012112@muc.edu.cn).

In addition, I am currently looking for a graduate supervisor. If you are interested, please feel free to contact me; I can send my resume and am available for internships at any time.

Acknowledgements

Our code and README are based on the CoOp, PromptKD, and DPC repositories. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.