✨ [CVPR26-highlight] Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
April 28, 2026
🎉🎉🎉 (2026-04-09) Our work has been selected as a Highlight.
🎉🎉🎉 (2026-02-21) Our work has been accepted to CVPR 2026. Final reviewer scores: 6 (Accept) / 5 (Weak Accept) / 5 (Weak Accept).
S²-Corr introduces a state-space-powered correlation refinement module that stabilizes text–image alignment under domain shift, achieving state-of-the-art performance on both Real-to-Real and Synthetic-to-Real OVDG-SS settings.
🚀 Features
- 🧩 State-Space Correlation Aggregation: robust long-range correlation modeling via scan-based state passing.
- 🔍 Open-Vocabulary Semantic Segmentation: compatible with EVA-CLIP text/image encoders.
- 🌍 Domain Generalization: train on CS-7 / GTA-7 → test on ACDC / BDD / Mapillary / ROADWork.
- 🎯 Multiple Category Spaces: 7 / 19 / 30 / 41 / 58 classes.
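The scan-based state passing behind the correlation aggregation can be illustrated with a minimal linear state-space recurrence over a flattened correlation sequence. This is an illustrative sketch only, not the S²-Corr implementation; the function name, decay parameterization, and toy input are all hypothetical.

```python
import numpy as np

def linear_scan(x, decay=0.9):
    """Aggregate a sequence of correlation features with a linear
    state-space recurrence: h[t] = decay * h[t-1] + (1 - decay) * x[t].

    x: (T, C) sequence of per-position correlation vectors.
    Returns the per-step hidden states, shape (T, C).
    """
    h = np.zeros_like(x, dtype=float)
    state = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        # Each position mixes its own correlation with the carried state,
        # so information propagates along the scan order.
        state = decay * state + (1.0 - decay) * x[t]
        h[t] = state
    return h

# Toy correlation map flattened into a scan order: T=4 positions, C=2 channels.
corr = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])
out = linear_scan(corr, decay=0.5)
```

A bidirectional or multi-directional scan (as is common in state-space segmentation models) would run this recurrence along several orderings and fuse the results.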
📦 Installation
1. Clone the Repository
```bash
git clone https://github.com/DZhaoXd/s2_corr.git
cd s2_corr
```
2. Create the Conda Environment
```bash
conda create -n S2_Corr python=3.10
conda activate S2_Corr
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
pip install -r requirements.txt
pip install -e .
```
📁 Dataset Preparation
📥 Dataset Download
Cityscapes
Please download leftImg8bit_trainvaltest.zip and gt_trainvaltest.zip from here and extract them to data/cityscape.
GTA5
Please download all GTA5 image and label packages from here and extract them to data/GTA5/GTAV.
ACDC
Please download rgb_anon_trainvaltest.zip and gt_trainval.zip from here and extract them to data/ACDC.
Then restructure the folders from the original condition/split/sequence/ layout into a flat split/ layout (e.g., rgb_anon/train/, gt/train/).
Please download the ACDC Inpaint data (1,000 inpainted image pairs generated with a Stable Diffusion model) from here and extract them to data/ACDC/ACDC_inpaint41.
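The restructuring step above (condition/split/sequence/ → flat split/) can be scripted with the standard library. This is a hypothetical sketch, not a script shipped with the repository; the `flatten_acdc` name and the assumption that every file sits exactly three directories deep come from the layout described above, so verify against your extracted archive before moving anything.

```python
import shutil
from pathlib import Path

def flatten_acdc(root, modality="rgb_anon"):
    """Move files from <root>/<modality>/<condition>/<split>/<sequence>/
    into the flat layout <root>/<modality>/<split>/."""
    base = Path(root) / modality
    for f in sorted(base.glob("*/*/*/*")):   # condition/split/sequence/file
        if not f.is_file():
            continue
        split = f.parents[1].name            # 'train', 'val', ...
        dest = base / split / f.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), dest)

# Usage (after extracting the archives):
#   flatten_acdc("data/ACDC", "rgb_anon")
#   flatten_acdc("data/ACDC", "gt")
```

ACDC file names carry their sequence prefix, which is what makes a flat per-split folder safe from name collisions.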
BDD100K
Please download 10K Images and Segmentation from here and extract them to data/BDD/bdd100k.
Please download the BDD100K Inpaint data (1,000 inpainted image pairs generated with a Stable Diffusion model) from here and extract them to data/BDD/bdd_inpaint41.
Mapillary
Please download mapillary-vistas-dataset_public_v1.2.zip from here and extract it to data/mapillary.
ROADWork_Data
Please download the two files images.zip (≈ 9.87 GB) and sem_seg_labels.zip (≈ 187 MB) from here and extract them to data/ROADWork_Data.
🔧 Conversion Scripts
7- & 19-Class Format
```bash
python tools/convert_datasets_to19/gta.py data/GTA5/GTAV
python tools/convert_datasets_to19/cityscapes.py data/cityscape --splits train
python tools/convert_datasets_to19/cityscapes19_val.py data/cityscape
python tools/convert_datasets_to19/mapillary.py data/mapillary
python tools/convert_datasets_ovss/prepare_cityscapes_seen_7.py data/cityscape
python tools/convert_datasets_ovss/process_GTA_19_to_7.py data/GTA5/GTAV
```
Note: cityscapes19_val evaluation uses leftImg8bit/val and gtFine_19/val, so make sure the validation split is converted before running CS-7 / CS-19 evaluation.
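The 19-to-7 conversion collapses the 19 Cityscapes train IDs into the 7 category-level IDs (flat, construction, object, nature, sky, human, vehicle). A lookup-table remapping can be sketched as below; the exact mapping and ignore handling in process_GTA_19_to_7.py may differ, so treat this as illustrative.

```python
import numpy as np

# Cityscapes train IDs (0-18) -> 7 category-level IDs; 255 stays ignore.
# Categories: 0 flat, 1 construction, 2 object, 3 nature, 4 sky, 5 human, 6 vehicle.
TRAIN19_TO_7 = np.full(256, 255, dtype=np.uint8)
TRAIN19_TO_7[:19] = [0, 0,               # road, sidewalk          -> flat
                     1, 1, 1,            # building, wall, fence   -> construction
                     2, 2, 2,            # pole, light, sign       -> object
                     3, 3,               # vegetation, terrain     -> nature
                     4,                  # sky                     -> sky
                     5, 5,               # person, rider           -> human
                     6, 6, 6, 6, 6, 6]   # car .. bicycle          -> vehicle

def remap_to_7(label19):
    """Vectorized lookup: every pixel's 19-class ID becomes its 7-class ID."""
    return TRAIN19_TO_7[label19]

label = np.array([[0, 10, 13], [255, 11, 8]], dtype=np.uint8)
seven = remap_to_7(label)
```

The same table trick extends to the 30/41/58-class spaces; only the table contents change.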
Large-Vocabulary Formats (30 / 41 / 58 / 10 Classes)
```bash
python tools/process_Mapi_65.py data/mapillary
python tools/cp_Mapi_training.py data/mapillary
python tools/process_RW_10.py
```
Folder structure under data/ should look like:
```
data/
├── GTA5/
│   └── GTAV/
│       ├── images/                     # 24,966
│       ├── labels_7/                   # c-7
│       └── labels_19/                  # c-19
├── cityscape/
│   ├── leftImg8bit/
│   │   ├── train/                      # 2,975
│   │   └── val/                        # 500
│   ├── gtFine_7/
│   │   └── train/                      # c-7
│   └── gtFine_19/
│       ├── train/                      # c-19
│       └── val/                        # c-19, used by cityscapes19_val
├── BDD/
│   ├── bdd100k/
│   │   ├── images/10k/val/             # 1,000
│   │   └── labels/sem_seg/masks/val/   # c-19
│   └── bdd_inpaint41/
│       ├── images/                     # 1,000
│       └── labels/                     # c-41
├── ACDC/
│   ├── rgb_anon/train/                 # 1,600 (c-19)
│   ├── gt/train/                       # c-19
│   └── ACDC_inpaint41/
│       ├── images/                     # 1,000
│       └── labels/                     # c-41
├── mapillary/
│   ├── training/
│   │   ├── images/
│   │   ├── labels/
│   │   └── labels_TrainID30/           # c-30
│   ├── val/
│   │   ├── images/                     # 2,000
│   │   ├── labels_TrainIds/            # c-19
│   │   └── labels_TrainID30/           # c-30
│   └── OV_30/
│       ├── images/                     # 3,943
│       └── labels/                     # c-30
└── ROADWork_Data/
    ├── images/                         # 2,098
    └── gtFine_10/                      # c-10
```
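After preparation, a quick sanity check against the file counts listed in the tree above can catch extraction or conversion mistakes early. This is a hypothetical helper, not part of the repository; the expected counts are taken from the tree, and paths are relative to the repository root.

```python
from pathlib import Path

# Expected file counts taken from the directory tree above (assumption:
# one file per listed image). Extend with any split you care about.
EXPECTED = {
    "cityscape/leftImg8bit/train": 2975,
    "cityscape/leftImg8bit/val": 500,
    "GTA5/GTAV/images": 24966,
    "ROADWork_Data/images": 2098,
}

def count_files(root, rel):
    """Count regular files under <root>/<rel>, recursing into subfolders
    (Cityscapes keeps per-city subdirectories inside each split)."""
    return sum(1 for p in (Path(root) / rel).rglob("*") if p.is_file())

def check(root="data"):
    for rel, want in EXPECTED.items():
        got = count_files(root, rel)
        status = "ok" if got == want else f"MISMATCH (expected {want})"
        print(f"{rel}: {got} files - {status}")

# Usage, once the data/ tree is in place:
#   check("data")
```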
🔧 Pretrained EVA-CLIP Models
Download EVA-CLIP weights from:
👉 https://github.com/baaivision/EVA/tree/master/EVA-CLIP
Place them under:
```
Pretrain/
├── EVA02_CLIP_B_psz16_s8B.pt
└── EVA02_CLIP_L_336_psz14_s6B.pt
```
Training
Training script format:
```bash
bash run.sh <CONFIG_YAML> <NUM_GPUS> <OUTPUT_DIR>
```
🔹 CS-7 (Real-to-Real)
ViT-B/16
```bash
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
    > logs/cs7_eva_b16_r512.log 2>&1 &
```
ViT-L/14
```bash
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg_vitl.yaml 1 outputs/cs7_eva_L14_r448 \
    > logs/cs7_eva_L14_r448.log 2>&1 &
```
🔹 GTA-7 (Synthetic-to-Real)
ViT-B/16
```bash
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg.yaml 1 outputs/gta_seen7_eva_b16_r512 \
    > logs/gta_seen7_eva_b16_r512.log 2>&1 &
```
ViT-L/14
```bash
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg_vitl.yaml 1 outputs/gta_seen7_eva_L14_r448 \
    > logs/gta_seen7_eva_L14_r448.log 2>&1 &
```
🔹 CS-19 (19 Classes)
```bash
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs19_catseg_vitl.yaml 1 outputs/cs19_eva_L14_r448 \
    > logs/cs19_eva_L14_r448.log 2>&1 &
```
📊 Visualization
🎨 Segmentation Masks
```bash
CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
    > logs/viz_cs7_eva_b16_r512.log 2>&1 &
```
Correlation / Attention Maps
```bash
CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis_atten.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
    > logs/viz_attention_cs7_eva_b16_r512.log 2>&1 &
```
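All of the nohup invocations above redirect stdout/stderr into logs/, and the shell fails the `>` redirection if that directory does not exist. A small pre-flight step avoids losing a run to that:

```shell
# Create the log and output directories before launching background runs;
# nohup/redirection will not create missing parent directories for you.
mkdir -p logs outputs
```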
📖 Citation
```bibtex
@misc{zhao2026OVDG,
      title={Open-Vocabulary Domain Generalization in Urban-Scene Segmentation},
      author={Dong Zhao and Qi Zang and Nan Pu and Wenjing Li and Nicu Sebe and Zhun Zhong},
      year={2026},
      eprint={2602.18853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.18853},
}
```
❤️ Acknowledgements
This project builds upon CAT-Seg.