HiCroPL: Hierarchical Cross-modal Prompt Learning for Vision-Language Models [ICCV 2025]

April 12, 2026

This is the official implementation of the paper "Hierarchical Cross-modal Prompt Learning for Vision-Language Models".

Authors: Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang


Overview

[Figure: motivation]

HiCroPL is a hierarchical cross-modal prompt learning framework for adapting frozen vision-language models such as CLIP.

Unlike uni-modal prompting methods or one-way coupling designs, HiCroPL builds bidirectional knowledge flow between the textual and visual branches. The core idea is simple:

  • In early layers, textual prompts transfer relatively clear semantic priors to the visual branch.
  • In later layers, visually grounded prompts refine the textual branch and improve cross-modal alignment.
  • A hierarchical knowledge mapper and lightweight layer-specific proxy tokens enable prompt interaction across layers while preserving transferable shallow semantics.

Highlights

  • Bidirectional cross-modal prompting. HiCroPL enables prompt interaction in both text-to-vision and vision-to-text directions instead of relying on isolated or one-way adaptation.
  • Hierarchical knowledge flow across layers. Prompt interaction is distributed through the encoder, allowing shallow transferable semantics and deeper task-relevant cues to cooperate.
  • Layer-specific proxy tokens. Lightweight proxy tokens make cross-modal interaction efficient without introducing heavy additional modules.
  • Strong downstream performance. HiCroPL delivers competitive generalization and is especially strong in low-shot adaptation settings.

Method

HiCroPL is built on three key ingredients:

| Component | Role |
| --- | --- |
| Cross-modal prompt learner | Maintains textual and visual prompt tokens across layers. |
| Layer-specific knowledge proxy | Summarizes prompt information at each layer for efficient interaction. |
| Hierarchical knowledge mapper | Transfers prompt information across modalities in a bidirectional and layer-aware way. |

A high-level forward pipeline is:

  1. Initialize textual and visual prompts across multiple layers.
  2. Refine visual prompts in early layers with text-to-vision knowledge flow.
  3. Refine textual prompts in later layers with vision-to-text knowledge flow.
  4. Inject the resulting prompts into the CLIP encoders for prediction.
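The layer-wise schedule in steps 2 and 3 can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `plan_knowledge_flow`, the `split_layer` parameter, and the example values (12 prompted layers, split at layer 6) are hypothetical and chosen only to show how early layers could be assigned text-to-vision flow and later layers vision-to-text flow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerPlan:
    layer: int      # index of the prompted encoder layer (0-based)
    direction: str  # "text_to_vision" or "vision_to_text"


def plan_knowledge_flow(num_prompt_layers: int, split_layer: int) -> List[LayerPlan]:
    """Assign a knowledge-flow direction to each prompted layer.

    Layers below `split_layer` use text-to-vision flow (early layers);
    the remaining layers use vision-to-text flow (later layers).
    Both arguments are hypothetical knobs used for illustration only.
    """
    if not 0 <= split_layer <= num_prompt_layers:
        raise ValueError("split_layer must lie in [0, num_prompt_layers]")
    plans = []
    for layer in range(num_prompt_layers):
        direction = "text_to_vision" if layer < split_layer else "vision_to_text"
        plans.append(LayerPlan(layer=layer, direction=direction))
    return plans


# Example values are assumptions, not settings taken from the paper or code.
for plan in plan_knowledge_flow(num_prompt_layers=12, split_layer=6):
    print(f"layer {plan.layer:2d}: {plan.direction}")
```

In an actual model, each planned direction would correspond to a call into the hierarchical knowledge mapper with the source and target prompts swapped accordingly.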

Running HiCroPL

Environment

conda create -n hicropl python=3.10 -y
conda activate hicropl
pip install -r requirements.txt

Recommended PyTorch versions: 1.13.0 or 2.2.0.

Data

Prepare datasets following the standard CoOp dataset setup and update the dataset root in the shell scripts under scripts/hicropl/:

DATA="/path/to/dataset/folder"

Quick Start

Base-to-Novel training:

sh scripts/hicropl/base2new_train_hicropl.sh imagenet 1

Base-to-Novel evaluation:

sh scripts/hicropl/base2new_test_hicropl.sh imagenet 1

Few-shot training:

sh scripts/hicropl/few_shot.sh oxford_pets 16

Cross-dataset training:

sh scripts/hicropl/xd_train.sh caltech101 1

Cross-dataset evaluation:

sh scripts/hicropl/xd_test.sh caltech101 1
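To repeat a run several times, the commands above can be generated programmatically. The sketch below assumes that the second positional argument of the base-to-novel scripts is a random seed, as is common in CoOp-style scripts; this README does not state that, so check the script before relying on it. The helper `build_commands` and the seed list are hypothetical.

```python
import shlex
from typing import List

SCRIPT_DIR = "scripts/hicropl"  # script folder named in this README


def build_commands(script: str, dataset: str, seeds: List[int]) -> List[List[str]]:
    """Build one `sh <script> <dataset> <seed>` command per seed.

    Assumption: the second positional argument is a seed.
    """
    return [["sh", f"{SCRIPT_DIR}/{script}", dataset, str(seed)] for seed in seeds]


# Dry run: print the commands instead of executing them.
for cmd in build_commands("base2new_train_hicropl.sh", "imagenet", [1, 2, 3]):
    print(shlex.join(cmd))
    # To actually launch a run: subprocess.run(cmd, check=True)
```

Printing first makes it easy to confirm the arguments before committing GPU time to each run.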

Results

Average base-to-novel results across 11 datasets:

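| Method | Base | Novel | HM |
| --- | --- | --- | --- |
| CLIP | 69.34 | 74.22 | 71.70 |
| CoOp | 82.69 | 63.22 | 71.66 |
| CoCoOp | 80.47 | 71.69 | 75.83 |
| KgCoOp | 80.73 | 73.60 | 77.00 |
| MaPLe | 82.28 | 75.14 | 78.55 |
| PromptSRC | 84.26 | 76.10 | 79.97 |
| TCP | 84.13 | 75.36 | 79.50 |
| MMA | 83.20 | 76.80 | 79.87 |
| CoPrompt | 84.00 | 77.23 | 80.47 |
| HiCroPL | 85.89 | 77.99 | 81.75 |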

HiCroPL achieves the highest base accuracy (85.89), novel accuracy (77.99), and harmonic mean (81.75) among the compared methods, indicating a strong balance between adapting to base classes and preserving transferability to novel classes.
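The HM column is consistent with the harmonic mean of the averaged Base and Novel accuracies, HM = 2 · Base · Novel / (Base + Novel). The short check below recomputes it for three rows of the table; the function name `harmonic_mean` is illustrative and not taken from the repository.

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (in percent)."""
    if base + novel == 0:
        return 0.0
    return 2 * base * novel / (base + novel)


# (Base, Novel) pairs copied from the results table.
rows = {
    "CLIP": (69.34, 74.22),
    "CoPrompt": (84.00, 77.23),
    "HiCroPL": (85.89, 77.99),
}

for name, (base, novel) in rows.items():
    print(f"{name}: {harmonic_mean(base, novel):.2f}")
# CLIP: 71.70
# CoPrompt: 80.47
# HiCroPL: 81.75
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on HM by overfitting base classes at the cost of novel-class accuracy, as the CoOp row (82.69 base, 63.22 novel, 71.66 HM) shows.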

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@inproceedings{zheng2025hierarchical,
  title={Hierarchical cross-modal prompt learning for vision-language models},
  author={Zheng, Hao and Yang, Shunzhi and He, Zhuoxin and Yang, Jinfeng and Huang, Zhenhua},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1891--1901},
  year={2025}
}

Acknowledgements

Our code is based on CoCoOp, CoOp, and MaPLe. We thank the authors for releasing their code. If you use our code, please consider citing these works as well.