LightRA: Lightweight Residual Attention for Adaptation of Vision-Language Models
March 25, 2026
Official implementation of the paper "LightRA: Lightweight Residual Attention for Adaptation of Vision-Language Models".
Highlights

Abstract: Vision-language models (VLMs) have demonstrated strong performance across a wide range of downstream tasks. Among them, CLIP is particularly effective in zero-shot and few-shot learning, offering a promising solution for scenarios with limited labeled data. Yet, fine-tuning CLIP remains a significant challenge for downstream few-shot generalization tasks, as excessive learnable parameters often lead to overfitting on seen classes, thereby limiting generalization to unseen classes. To tackle this issue, various lightweight few-shot tuning methods have been introduced to adapt CLIP to downstream tasks. However, these methods still face several limitations. First, the initial outputs of newly introduced modules may interfere with the pre-trained representation space, ultimately affecting the model’s final fine-tuning performance. Second, many existing approaches primarily rely on task-specific supervision, with limited mechanisms to explicitly leverage the guidance from the pre-trained model itself. In this work, we propose a Lightweight Residual Attention (LightRA) framework for few-shot adaptation. LightRA incorporates a lightweight residual module into the multi-head attention of the Transformer-based CLIP model. The residual module is initialized in a non-intrusive manner and progressively optimized to ensure minimal deviation from the original representations. By employing a self-distillation strategy, the pre-trained model itself serves as the teacher to guide the learnable parameters in LightRA, enabling them to acquire both generalizable and task-relevant knowledge, thereby improving adaptability to downstream tasks while effectively mitigating overfitting. We conduct extensive experiments on three widely used and challenging few-shot generalization tasks, and the results demonstrate that LightRA consistently outperforms existing state-of-the-art methods.
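To make the idea of a non-intrusive initialization concrete, here is a minimal pure-Python sketch. It is not the code in this repository: the abstract does not specify the structure of the residual module, so the low-rank `down`/`up` form, the class name `ZeroInitResidual`, and the helper `attention_with_residual` are all assumptions chosen for illustration. The sketch only shows one common way to make a newly added branch contribute nothing at the start of training, by zero-initializing its last projection.

```python
import random


def matvec(weight, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


class ZeroInitResidual:
    """Hypothetical low-rank residual branch computing up(down(x)).

    Assumption for illustration: `up` starts at zero, so the branch
    outputs a zero vector until training updates it.
    """

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Small random down-projection of shape (rank, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Zero-initialized up-projection of shape (dim, rank).
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


def attention_with_residual(frozen_out, x, residual):
    """Add the residual branch to the output of a frozen attention layer."""
    return [a + r for a, r in zip(frozen_out, residual(x))]


x = [0.5, -1.0, 2.0, 0.1]
frozen_out = [1.0, 2.0, 3.0, 4.0]  # stand-in for the frozen attention output

residual = ZeroInitResidual(dim=4, rank=2)
adapted = attention_with_residual(frozen_out, x, residual)
print(adapted == frozen_out)  # the branch is a no-op at initialization

# Simulate a training update to `up`; the output now departs from the frozen one.
residual.up[0][0] = 1.0
updated = attention_with_residual(frozen_out, x, residual)
print(updated != frozen_out)
```

Under this assumption the adapted model reproduces the pre-trained representations exactly at step zero, and any deviation is introduced only gradually by optimization.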
Installation
# Clone LightRA code
git clone https://github.com/longinhit/LightRA.git
cd LightRA
# Create a conda environment from the YAML file
conda env create -f LightRA_env.yml
Datasets
Follow DATASETS.md to install the datasets.
Training and Evaluation
# Base-to-Novel Generalization
bash run_base2novel.sh
# Few-Shot Evaluation
bash run_few-shot.sh
# Cross-Dataset Evaluation and Domain Generalization
bash run_xd.sh
Experimental Results
The results below report base-class accuracy, novel-class accuracy, and their harmonic mean (HM) across 11 recognition datasets, averaged over 3 seeds.
| Name | Base Acc. | Novel Acc. | HM |
|---|---|---|---|
| CLIP | 69.34 | 74.22 | 71.70 |
| CoOp | 82.69 | 63.22 | 71.66 |
| CoCoOp | 80.47 | 71.69 | 75.83 |
| MaPLe | 82.28 | 75.14 | 78.55 |
| PromptSRC | 84.26 | 76.10 | 79.97 |
| CoPrompt | 84.00 | 77.23 | 80.48 |
| HPT | 84.32 | 76.86 | 80.42 |
| ProVP-Ref | 85.20 | 73.22 | 78.76 |
| LightRA | 84.94 | 78.51 | 81.60 |
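The HM column in the table is consistent with the harmonic mean of the averaged base and novel accuracies. The short sketch below recomputes it for a few rows; the function name `harmonic_mean` and the `rows` dictionary are illustrative and are not part of the repository's code.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base- and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Averaged base/novel accuracies copied from the table above.
rows = {
    "CLIP": (69.34, 74.22),
    "CoOp": (82.69, 63.22),
    "LightRA": (84.94, 78.51),
}

for name, (base, novel) in rows.items():
    print(f"{name}: HM = {harmonic_mean(base, novel):.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method that gains base accuracy at the cost of novel accuracy (CoOp in the table) can end up with an HM no better than zero-shot CLIP.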
We provide all trained model checkpoints and logs needed to reproduce our results at this Google Drive link.
Citation
If you find our work or this repo helpful for your research, please kindly cite the following paper:
@ARTICLE{LightRA,
title={LightRA: Lightweight Residual Attention for Adaptation of Vision-Language Models},
journal={IEEE Transactions on Multimedia},
author={Jiao, Jiulong and Zhou, Yizhi and Wu, Li and Song, Zhipeng and Kong, Xiangyu and Cao, Yuan and Qi, Heng},
year={2026},
pages={1-12},
doi={10.1109/TMM.2026.3676654}}
Acknowledgements
Our code is based on the CoOp, PromptSRC, and CoPrompt repositories. We thank the authors for releasing their code.