✨ [CVPR26-highlight] Open-Vocabulary Domain Generalization in Urban-Scene Segmentation


🎉🎉🎉 (2026-04-09) Our work has been selected as a CVPR 2026 Highlight.

🎉🎉🎉 (2026-02-21) Our work has been accepted to CVPR 2026. Final reviewer scores: 6 (Accept) / 5 (Weak Accept) / 5 (Weak Accept).

S²-Corr introduces a state-space-powered correlation refinement module that stabilizes text–image alignment under domain shift, achieving state-of-the-art performance in both the Real-to-Real and Synthetic-to-Real OVDG-SS settings.

🚀 Features

  • 🧩 State-Space Correlation Aggregation: Robust long-range correlation modeling via scan-based state passing.

  • 🔍 Open-Vocabulary Semantic Segmentation: Compatible with EVA-CLIP text/image encoders.

  • 🌍 Domain Generalization: Train on CS-7 / GTA-7 → test on ACDC / BDD / Mapillary / ROADWork.

  • 🎯 Supports Multiple Category Spaces (7 / 19 / 30 / 41 / 58 classes)


⭐ If you find this repository helpful, please consider giving it a star. Thanks!

📦 Installation

1. Clone the Repository

git clone https://github.com/DZhaoXd/s2_corr.git
cd s2_corr

2. Create Conda Environment

conda create -n S2_Corr python=3.10
conda activate S2_Corr
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
pip install -r requirements.txt
pip install -e .

📁 Dataset Preparation

📥 Dataset Download

Cityscapes
Please download leftImg8bit_trainvaltest.zip and gt_trainvaltest.zip from here and extract them to data/cityscape.

GTA5
Please download all GTA5 image and label packages from here and extract them to data/GTA5/GTAV.

ACDC
Please download rgb_anon_trainvaltest.zip and gt_trainval.zip from here and extract them to data/ACDC.
Then restructure the folders from the original condition/split/sequence/ layout into a flat split/ layout (e.g., rgb_anon/train/, gt/train/).
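
The restructuring step can be scripted. The following is a minimal sketch and not part of the repository: the function name `flatten_acdc` is hypothetical, and it assumes the extracted archives follow the `<root>/<modality>/<condition>/<split>/<sequence>/<file>` layout described above, with file names that are unique across conditions. It copies rather than moves, so the original folders stay intact.

```python
import shutil
from pathlib import Path


def flatten_acdc(root, modality, split="train"):
    """Copy <root>/<modality>/<condition>/<split>/<sequence>/* into <root>/<modality>/<split>/.

    Hypothetical helper: it assumes file names are unique across conditions
    and raises instead of overwriting when they are not.
    """
    base = Path(root) / modality
    target = base / split
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    # Four path components below base: condition / split / sequence / file.
    for src in sorted(base.glob(f"*/{split}/*/*")):
        if not src.is_file():
            continue
        dst = target / src.name
        if dst.exists():
            raise FileExistsError(f"name collision while flattening: {dst}")
        shutil.copy2(src, dst)
        copied += 1
    return copied
```

Under these assumptions, calling `flatten_acdc("data/ACDC", "rgb_anon")` and then `flatten_acdc("data/ACDC", "gt")` would populate `rgb_anon/train/` and `gt/train/`. Because the function refuses to overwrite, a second run on the same folder raises `FileExistsError`.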

Please download the ACDC Inpaint data (1,000 image pairs inpainted with the Stable Diffusion model) from here and extract them to data/ACDC/ACDC_inpaint41.

BDD100K
Please download 10K Images and Segmentation from here and extract them to data/BDD/bdd100k.

Please download the BDD100K Inpaint data (1,000 image pairs inpainted with the Stable Diffusion model) from here and extract them to data/BDD/bdd_inpaint41.

Mapillary
Please download mapillary-vistas-dataset_public_v1.2.zip from here and extract it to data/mapillary.

ROADWork_Data
Please download the two files images.zip (≈ 9.87 GB) and sem_seg_labels.zip (≈ 187 MB) from here and extract them to data/ROADWork_Data.


🔧 Conversion Scripts

7- & 19-Class Format

python tools/convert_datasets_to19/gta.py data/GTA5/GTAV
python tools/convert_datasets_to19/cityscapes.py data/cityscape --splits train
python tools/convert_datasets_to19/cityscapes19_val.py data/cityscape
python tools/convert_datasets_to19/mapillary.py data/mapillary
python tools/convert_datasets_ovss/prepare_cityscapes_seen_7.py data/cityscape
python tools/convert_datasets_ovss/process_GTA_19_to_7.py data/GTA5/GTAV

cityscapes19_val evaluation uses leftImg8bit/val and gtFine_19/val, so make sure the validation split is converted before running CS-7 / CS-19 evaluation.

Large-Vocabulary Format (30 / 41 / 58 / 10)

python tools/process_Mapi_65.py data/mapillary
python tools/cp_Mapi_training.py data/mapillary
python tools/process_RW_10.py

Folder structure under data/ should look like:

data/
├── GTA5/
│   └── GTAV/
│       ├── images/                 # 24,966
│       ├── labels_7/               # c-7
│       └── labels_19/              # c-19
│
├── cityscape/
│   ├── leftImg8bit/
│   │   ├── train/                  # 2,975
│   │   └── val/                    # 500
│   ├── gtFine_7/
│   │   └── train/                  # c-7
│   └── gtFine_19/
│       ├── train/                  # c-19
│       └── val/                    # c-19, used by cityscapes19_val
│
├── BDD/
│   ├── bdd100k/
│   │   ├── images/10k/val/         # 1,000
│   │   └── labels/sem_seg/masks/val/   # c-19
│   │
│   └── bdd_inpaint41/
│       ├── images/                 # 1,000
│       └── labels/                 # c-41
│
├── ACDC/
│   ├── rgb_anon/train/             # 1,600 (c-19)
│   ├── gt/train/                   # c-19
│   │
│   └── ACDC_inpaint41/
│       ├── images/                 # 1,000
│       └── labels/                 # c-41
│
├── mapillary/
│   ├── training/
│   │   ├── images/
│   │   ├── labels/
│   │   └── labels_TrainID30/       # c-30
│   │
│   ├── val/
│   │   ├── images/                 # 2,000
│   │   ├── labels_TrainIds/        # c-19
│   │   └── labels_TrainID30/       # c-30
│   │
│   └── OV_30/
│       ├── images/                 # 3,943
│       └── labels/                 # c-30
│
└── ROADWork_Data/
    ├── images/                     # 2,098
    └── gtFine_10/                  # c-10
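
Before training, it can help to confirm that the download and conversion steps produced the folders listed above. The following is a minimal sketch and not part of the repository: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the list covers only a subset of the tree, so extend it for the datasets you actually use.

```python
from pathlib import Path

# Subset of the layout listed above (assumption: extend as needed).
EXPECTED_DIRS = [
    "GTA5/GTAV/images",
    "GTA5/GTAV/labels_7",
    "GTA5/GTAV/labels_19",
    "cityscape/leftImg8bit/train",
    "cityscape/leftImg8bit/val",
    "cityscape/gtFine_7/train",
    "cityscape/gtFine_19/train",
    "cityscape/gtFine_19/val",
    "BDD/bdd100k/images/10k/val",
    "BDD/bdd100k/labels/sem_seg/masks/val",
    "ACDC/rgb_anon/train",
    "ACDC/gt/train",
    "mapillary/val/images",
    "mapillary/val/labels_TrainIds",
    "ROADWork_Data/images",
    "ROADWork_Data/gtFine_10",
]


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Calling `missing_dirs("data")` returns an empty list when every listed folder is present; any entries it returns point to a download or conversion step that still needs to be run.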


🧠 Pretrained EVA-CLIP Models

Download EVA-CLIP weights from:

👉 https://github.com/baaivision/EVA/tree/master/EVA-CLIP

Place under:

Pretrain/
  EVA02_CLIP_B_psz16_s8B.pt
  EVA02_CLIP_L_336_psz14_s6B.pt

Training

Training script format:

bash run.sh <CONFIG_YAML> <NUM_GPUS> <OUTPUT_DIR>
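
When launching several runs, the three arguments and the matching log path can be derived from a single run name. The following is a sketch only: the helper `build_train_command` and its `outputs/<run_name>` and `logs/<run_name>.log` naming convention are assumptions for illustration, not part of the repository, and the helper assembles arguments without launching anything.

```python
from pathlib import Path


def build_train_command(config, num_gpus, run_name,
                        output_root="outputs", log_root="logs"):
    """Return (argv, log_path) for `bash run.sh <CONFIG_YAML> <NUM_GPUS> <OUTPUT_DIR>`.

    Hypothetical helper: it only builds the argument list and the log path.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    output_dir = Path(output_root) / run_name
    log_path = Path(log_root) / f"{run_name}.log"
    argv = ["bash", "run.sh", str(config), str(num_gpus), output_dir.as_posix()]
    return argv, log_path
```

The returned list can then be passed to `subprocess.run` with the log file opened as `stdout` and `CUDA_VISIBLE_DEVICES` set in the environment.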

🔹 CS-7 (Real-to-Real)

ViT-B/16

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
  > logs/cs7_eva_b16_r512.log 2>&1 &

ViT-L/14

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg_vitl.yaml 1 outputs/cs7_eva_L14_r448 \
  > logs/cs7_eva_L14_r448.log 2>&1 &

🔹 GTA-7 (Synthetic-to-Real)

ViT-B/16

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg.yaml 1 outputs/gta_seen7_eva_b16_r512 \
  > logs/gta_seen7_eva_b16_r512.log 2>&1 &

ViT-L/14

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg_vitl.yaml 1 outputs/gta_seen7_eva_L14_r448 \
  > logs/gta_seen7_eva_L14_r448.log 2>&1 &

🔹 CS-19 (19 Classes)

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs19_catseg_vitl.yaml 1 outputs/cs19_eva_L14_r448 \
  > logs/cs19_eva_L14_r448.log 2>&1 &

🔹 RW-10 (evaluation)

For RW10 multi-vocabulary evaluation, replace datasets/roadwork_10.json with datasets/roadwork_test.json in MODEL.SEM_SEG_HEAD.TEST_CLASS_JSON.
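
One way to apply this change without editing the training config in place is to write a modified copy. The following is a minimal sketch and not part of the repository: the function name `write_eval_config` is hypothetical, and it assumes the config file contains the path `datasets/roadwork_10.json` as a literal string.

```python
from pathlib import Path


def write_eval_config(src, dst,
                      old="datasets/roadwork_10.json",
                      new="datasets/roadwork_test.json"):
    """Copy the config at src to dst with the test-class JSON path swapped.

    Plain text substitution, so it assumes the old path appears literally
    in the file (for example as the value of TEST_CLASS_JSON).
    """
    text = Path(src).read_text(encoding="utf-8")
    if old not in text:
        raise ValueError(f"{old!r} not found in {src}")
    Path(dst).write_text(text.replace(old, new), encoding="utf-8")
    return Path(dst)
```

The source config is left unchanged, so the original training setup remains reproducible; the copy can be passed to the evaluation run in place of the original YAML.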

🔹 GTA-19 (19 Classes)

As discussed in the CLIPSelf paper (CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction, ICLR'24), the spatial representations learned by VLMs are generally less suitable for dense prediction than those learned by models such as DINOv2. Therefore, to achieve better performance on the 19 seen categories in Table 9, it is helpful to first warm up the VLM representations and then fine-tune the model with our training scripts.

We provide a warm-up checkpoint for ViT-L, trained with the official CLIPSelf code on the GTA dataset: https://drive.google.com/file/d/1PQzV4d_OuLPt46z96bU-kYQAK68X8vEY/view?usp=drive_link

You can use this checkpoint, warmup_GTA_eva_l14_336_40460594_epoch_10.pt, to replace the original pretrained checkpoint specified by:

CACHE_DIR: "./pretrain/warmup_GTA_eva_l14_336_40460594_epoch_10.pt"


Trained Checkpoints

We provide the trained checkpoints for the EVA02-L/14 backbone:

👀 Visualization

🎨 Segmentation Masks

CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
  > logs/viz_cs7_eva_b16_r512.log 2>&1 &

Correlation / Attention Maps

CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis_atten.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
  > logs/viz_attention_cs7_eva_b16_r512.log 2>&1 &

📚 Citation

@InProceedings{Zhao_2026_CVPR,
    author    = {Zhao, Dong and Zang, Qi and Pu, Nan and Li, Wenjing and Sebe, Nicu and Zhong, Zhun},
    title     = {Open-Vocabulary Domain Generalization in Urban-Scene Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {20388-20398}
}

❀️ Acknowledgements

This project builds upon CAT-Seg.