[ICME 2025] MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models
Haoyang Li, Siyu Zhou, Liang Wang and Guodong Long.
Shanghai University, University of Technology Sydney

📦 Supplementary Material: https://github.com/JREion/M.A.O/releases/tag/docs

📰 Full Text: https://arxiv.org/abs/2503.18160

🔥 News

NOTE: We are preparing our code repository (mainly rewriting comments to improve readability). We hope to release code in April.
(24 Jun. 2025) The poster of M.A.O has been uploaded.
(15 Apr. 2025) The code of PromptSRC+M.A.O is released.
(25 Mar. 2025) Full text and supplementary material are available on arXiv.
(21 Mar. 2025) Our paper has been accepted by ICME 2025!

Abstract

Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., additional loss calculation and meta-networks. These approaches generally lead to increased complexity and extended training cost. To maintain the efficiency of the tuning process, we propose plug-and-play Model-Agnostic Optimization (M.A.O) for prompt tuning.
Without altering any components of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments on MAO demonstrate its outstanding performance and efficiency.

Figure 1. Framework of the proposed MAO. MAO builds a two-step fine-tuning structure without altering any component of the prompt tuning backbone. In (a) base tasks, MAO introduces a hard negative sampler as Data-Driven Enhancement (DDE), together with an Alterable Regularization (reg-B) that guides the model to learn the feature distribution of hard negatives while preserving generalization. Then, in (b) new tasks, rapid pseudo-labeling is performed on unlabeled images as DDE using a shared-parameter CLIP, followed by reg-N to constrain fine-tuning on new classes. Inference follows the settings of the original backbones.
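
To make the pseudo-labeling step in (b) more concrete, here is a minimal sketch of zero-shot pseudo-label assignment in the CLIP style: each unlabeled image feature is compared with the text features of the candidate classes, and the most similar class is kept if the softmax confidence is high enough. This is an illustration only, not the released MAO code. The function names, the `temperature` and `min_conf` values, and the toy 3-dimensional features are all assumptions; real features would come from the CLIP image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(image_feat, text_feats, class_names,
                 temperature=0.01, min_conf=0.6):
    """Return (label, confidence); label is None if confidence is too low.

    temperature and min_conf are hypothetical values chosen for this sketch.
    """
    logits = [cosine(image_feat, t) / temperature for t in text_feats]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_conf:
        return None, probs[best]
    return class_names[best], probs[best]


# Toy stand-ins for encoder outputs (assumed, for illustration only).
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
class_names = ["cat", "dog", "car"]

label, conf = pseudo_label([0.9, 0.1, 0.0], text_feats, class_names)
ambiguous_label, ambiguous_conf = pseudo_label(
    [1.0, 1.0, 0.0], text_feats, class_names
)
```

In this toy run, the first image is close to a single class and receives a label, while the second is equally close to two classes and is left unlabeled. Filtering by confidence is one common way to limit noisy pseudo-labels; whether and how MAO filters them is described in the paper, not here.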

Main Contributions

(1) MAO efficiently optimizes prompt tuning backbones at data and feature level in a plug-and-play manner, consuming almost no additional computational resources.

(2) We introduce task-related Data-Driven Enhancement to MAO, improving the data distribution of base and new classes through hard negative sampling and rapid pseudo-label allocation, respectively.

(3) We incorporate Alterable Regularization into the procedure of feature processing, constraining the model to dynamically focus more on the features of updated data to enhance performance and generalization.
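
As a rough illustration of the hard negative sampling mentioned in (2), the sketch below ranks the wrong classes for a labeled image by feature similarity and keeps the top `k` as hard negatives. It shows one common way to mine hard negatives and makes no claim about the exact sampler used in MAO. The function name `hard_negatives`, the parameter `k`, and the toy class features are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negatives(image_feat, true_class, class_feats, k=2):
    """Return the k wrong classes whose features are closest to the image.

    class_feats maps a class name to its (hypothetical) text feature.
    """
    scored = [
        (cosine(image_feat, feat), name)
        for name, feat in class_feats.items()
        if name != true_class  # the ground-truth class is never a negative
    ]
    scored.sort(reverse=True)  # most similar wrong classes first
    return [name for _, name in scored[:k]]


# Toy class features (assumed, for illustration only).
class_feats = {
    "husky": [1.0, 0.2, 0.0],
    "wolf": [0.9, 0.4, 0.0],
    "fox": [0.6, 0.8, 0.0],
    "truck": [0.0, 0.0, 1.0],
}

negatives = hard_negatives([1.0, 0.25, 0.0], "husky", class_feats, k=2)
```

Here an image labeled "husky" gets "wolf" and "fox" as its hardest negatives, while the unrelated "truck" class is ignored. Training on such confusable classes is the usual motivation for hard negative mining.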

Citation

@INPROCEEDINGS{li2025mao,
  author={Li, Haoyang and Zhou, Siyu and Wang, Liang and Long, Guodong},
  booktitle={2025 IEEE International Conference on Multimedia and Expo (ICME)}, 
  title={MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models}, 
  year={2025},
  pages={1-6},
  doi={10.1109/ICME59968.2025.11209968}}
