Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation (ICCV 2025)
February 18, 2026 ยท View on GitHub
Jiahua Dong*, Hui Yin*, Wenqi Liang, Hanbin Zhao, Henghui Ding, Nicu Sebe, Salman Khan, Fahad Shahbaz Khan
* Equal contribution
Introduction
Video instance segmentation (VIS) has gained significant attention for its capability in segmenting and tracking object instances across video frames. However, most of the existing VIS methods unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new classes. To address the above challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model, which alleviates catastrophic forgetting of old classes from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. The effectiveness of our HVPL model is demonstrated through extensive experiments, in which it outperforms baseline methods.
Updates
Aug 21, 2025: ๐๐๐ Code and pretrained weights are now available! Thanks for your patience :)Jun 26, 2025: ๐๐๐ HVPL is accepted to ICCV 2025!
Installation
See installation instructions.
Getting Started
We provide a script train_net_hvpl.py, that is made to train all the configs provided in HVPL.
To train a model with "train_net_hvpl.py" on VIS, first setup the corresponding datasets following Preparing Datasets for HVPL.
Train
All training scripts are located in the scripts/ directory.
| Scenario | Script |
|---|---|
| YouTubeVIS 2019 20-2 | bash youtube_2019_20_2.sh |
| YouTubeVIS 2019 20-5 | bash youtube_2019_20_5.sh |
| YouTubeVIS 2021 20-4 | bash yvis_2021_20_4.sh |
| YouTubeVIS 2021 30-10 | bash yvis_2021_30_10.sh |
| OVIS 15-5 | bash OVIS_15_5.sh |
| OVIS 15-10 | bash OVIS_15_10.sh |
Eval
Note that during training, testing will be performed, generating a corresponding result.json file as well as a .txt file for saving the evaluation metrics (AP, AP50, AP75, AR1).
To compute the forgetting (F) metric (FAP, FAR1):
- Organize results: Rename each
result.jsontoresult_X.json(X = step index, 0 to N) - Place files: Move renamed files to the directory specified in
dataset_eval.py(resFile_dir) - Update ground truth: Set
annFileindataset_eval.pyto your test split path - Run evaluation:
python dataset_eval.py
Citing HVPL
If you use HVPL in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@InProceedings{Dong2025HVPL,
author = {Dong, Jiahua and Yin, Hui and Liang, Wenqi and Zhao, Hanbin and Ding, Henghui and Sebe, Nicu and Khan, Salman and Khan, Fahad Shahbaz},
title = {Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025}
}
Acknowledgement
Our code is largely based on VITA, ECLIPSE, Detectron2, Mask2Former, and Deformable DETR. We are truly grateful for their excellent work.