HiCroPL: Hierarchical Cross-modal Prompt Learning for Vision-Language Models [ICCV 2025]

April 12, 2026 · View on GitHub

This is the official implementation of the paper " Hierarchical Cross-modal Prompt Learning for Vision-Language Models".

Authors: Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang

Overview

motivation

HiCroPL is a hierarchical cross-modal prompt learning framework for adapting frozen vision-language models such as CLIP.

Unlike uni-modal prompting methods or one-way coupling designs, HiCroPL builds bidirectional knowledge flow between the textual and visual branches. The core idea is simple:

In early layers, textual prompts transfer relatively clear semantic priors to the visual branch.
In later layers, visually grounded prompts refine the textual branch and improve cross-modal alignment.
A hierarchical knowledge mapper and lightweight layer-specific proxy tokens enable prompt interaction across layers while preserving transferable shallow semantics.

Highlights

Bidirectional cross-modal prompting. HiCroPL enables prompt interaction in both text-to-vision and vision-to-text directions instead of relying on isolated or one-way adaptation.
Hierarchical knowledge flow across layers. Prompt interaction is distributed through the encoder, allowing shallow transferable semantics and deeper task-relevant cues to cooperate.
Layer-specific proxy tokens. Lightweight proxy tokens make cross-modal interaction efficient without introducing heavy additional modules.
Strong downstream performance. HiCroPL delivers competitive generalization and is especially strong in low-shot adaptation settings.

Method

HiCroPL is built on three key ingredients:

Component	Role
Cross-modal prompt learner	Maintains textual and visual prompt tokens across layers.
Layer-specific knowledge proxy	Summarizes prompt information at each layer for efficient interaction.
Hierarchical knowledge mapper	Transfers prompt information across modalities in a bidirectional and layer-aware way.

A high-level forward pipeline is:

Initialize textual and visual prompts across multiple layers.
Refine visual prompts in early layers with text-to-vision knowledge flow.
Refine textual prompts in later layers with vision-to-text knowledge flow.
Inject the resulting prompts into the CLIP encoders for prediction.

Running HiCroPL

Environment

conda create -n hicropl python=3.10 -y
conda activate hicropl
pip install -r requirements.txt

Recommended PyTorch versions: 1.13.0 or 2.2.0.

Data

Prepare datasets following the standard CoOp dataset setup setup and update the dataset root in the shell scripts under scripts/hicropl/:

DATA="/path/to/dataset/folder"

Quick Start

Base-to-Novel training:

sh scripts/hicropl/base2new_train_hicropl.sh imagenet 1

Base-to-Novel evaluation:

sh scripts/hicropl/base2new_test_hicropl.sh imagenet 1

Few-shot training:

sh scripts/hicropl/few_shot.sh oxford_pets 16

Cross-dataset training:

sh scripts/hicropl/xd_train.sh caltech101 1

Cross-dataset evaluation:

sh scripts/hicropl/xd_test.sh caltech101 1

Results

Average base-to-novel results across 11 datasets:

Method	Base	Novel	HM
CLIP	69.34	74.22	71.70
CoOp	82.69	63.22	71.66
CoCoOp	80.47	71.69	75.83
KgCoOp	80.73	73.60	77.00
MaPLe	82.28	75.14	78.55
PromptSRC	84.26	76.10	79.97
TCP	84.13	75.36	79.50
MMA	83.20	76.80	79.87
CoPrompt	84.00	77.23	80.47
HiCroPL	85.89	77.99	81.75

HiCroPL achieves the best harmonic mean among these compared methods, showing a strong balance between adapting to base classes and preserving transferability to novel classes.

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@inproceedings{zheng2025hierarchical,
  title={Hierarchical cross-modal prompt learning for vision-language models},
  author={Zheng, Hao and Yang, Shunzhi and He, Zhuoxin and Yang, Jinfeng and Huang, Zhenhua},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1891--1901},
  year={2025}
}

Acknowledgements

Our code is based on Co-CoOp, CoOp and MaPLe. We thank the authors for releasing their code. If you use our code, please consider citing these works as well.