✨ [CVPR26-Highlight] Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

April 28, 2026 · View on GitHub

🎉🎉🎉 (2026-04-09) Our work has been selected as a CVPR 2026 Highlight.

🎉🎉🎉 (2026-02-21) Our work has been accepted to CVPR 2026. Final reviewer scores: 6 (Acc) / 5 (Weak Acc) / 5 (Weak Acc).

S²-Corr introduces a state-space-powered correlation refinement module that stabilizes text–image alignment under domain shift, achieving state-of-the-art performance on both the Real-to-Real and Synthetic-to-Real OVDG-SS settings.

🚀 Features

  • 🧩 State-Space Correlation Aggregation Robust long-range correlation modeling via scan-based state passing.

  • πŸ” Open-Vocabulary Semantic Segmentation Compatible with EVA-CLIP text/image encoders.

  • 🌍 Domain Generalization Train on CS-7 / GTA-7 β†’ test on ACDC / BDD / Mapillary / ROADWork.

  • 🎯 Supports Multiple Category Spaces (7 / 19 / 30 / 41 / 58 classes)


📦 Installation

1. Clone the Repository

git clone https://github.com/DZhaoXd/s2_corr.git
cd s2_corr

2. Create Conda Environment

conda create -n S2_Corr python=3.10
conda activate S2_Corr
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
pip install -r requirements.txt
pip install -e .

πŸ“ Dataset Preparation

📥 Dataset Download

Cityscapes
Please download leftImg8bit_trainvaltest.zip and gt_trainvaltest.zip from here and extract them to data/cityscape.

GTA5
Please download all GTA5 image and label packages from here and extract them to data/GTA5/GTAV.

ACDC
Please download rgb_anon_trainvaltest.zip and gt_trainval.zip from here and extract them to data/ACDC.
Then restructure the folders from the original condition/split/sequence/ layout into a flat split/ layout (e.g., rgb_anon/train/, gt/train/).
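The restructuring step above can be scripted. Below is a minimal sketch; the condition names (fog/night/rain/snow) and the rgb_anon/gt modality folders follow the stock ACDC release, but double-check them against your download before moving anything.

```shell
# Sketch: flatten ACDC from <modality>/<condition>/<split>/<sequence>/
# into <modality>/<split>/. Condition names assume the stock ACDC release.
flatten_acdc () {
  root=${1:-data/ACDC}
  for modality in rgb_anon gt; do
    for condition in fog night rain snow; do
      for split in train val; do
        src="$root/$modality/$condition/$split"
        [ -d "$src" ] || continue
        mkdir -p "$root/$modality/$split"
        # ACDC filenames embed the sequence ID, so a flat move does not collide.
        find "$src" -type f -exec mv {} "$root/$modality/$split/" \;
      done
    done
  done
}
```

Call flatten_acdc from the repository root after extracting the zips; it leaves the now-empty condition folders behind, which can be removed manually.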

Please download the ACDC inpaint data (1,000 image pairs inpainted with a Stable Diffusion model) from here and extract it to data/ACDC/ACDC_inpaint41.

BDD100K
Please download 10K Images and Segmentation from here and extract them to data/BDD/bdd100k.

Please download the BDD100K inpaint data (1,000 image pairs inpainted with a Stable Diffusion model) from here and extract it to data/BDD/bdd_inpaint41.

Mapillary
Please download mapillary-vistas-dataset_public_v1.2.zip from here and extract it to data/mapillary.

ROADWork_Data
Please download the two files images.zip (≈ 9.87 GB) and sem_seg_labels.zip (≈ 187 MB) from here and extract them to data/ROADWork_Data.


🔧 Conversion Scripts

7- & 19-Class Format

python tools/convert_datasets_to19/gta.py data/GTA5/GTAV
python tools/convert_datasets_to19/cityscapes.py data/cityscape --splits train
python tools/convert_datasets_to19/cityscapes19_val.py data/cityscape
python tools/convert_datasets_to19/mapillary.py data/mapillary
python tools/convert_datasets_ovss/prepare_cityscapes_seen_7.py data/cityscape
python tools/convert_datasets_ovss/process_GTA_19_to_7.py data/GTA5/GTAV

cityscapes19_val evaluation uses leftImg8bit/val and gtFine_19/val, so make sure the validation split is converted before running CS-7 / CS-19 evaluation.
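As a quick pre-flight check before evaluation, something like the following confirms the converted validation split exists (paths as in the note above):

```shell
# Pre-flight sketch: verify the converted Cityscapes validation folders exist.
for d in data/cityscape/leftImg8bit/val data/cityscape/gtFine_19/val; do
  if [ -d "$d" ]; then
    echo "ok: $d"
  else
    echo "missing: $d (run the conversion scripts first)"
  fi
done
```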

Large-Vocabulary Format (30 / 41 / 58 / 10 Classes)

python tools/process_Mapi_65.py data/mapillary
python tools/cp_Mapi_training.py data/mapillary
python tools/process_RW_10.py

After conversion, the folder structure under data/ should look like this:

data/
├── GTA5/
│   └── GTAV/
│       ├── images/                 # 24,966
│       ├── labels_7/               # c-7
│       └── labels_19/              # c-19
│
├── cityscape/
│   ├── leftImg8bit/
│   │   ├── train/                  # 2,975
│   │   └── val/                    # 500
│   ├── gtFine_7/
│   │   └── train/                  # c-7
│   └── gtFine_19/
│       ├── train/                  # c-19
│       └── val/                    # c-19, used by cityscapes19_val
│
├── BDD/
│   ├── bdd100k/
│   │   ├── images/10k/val/         # 1,000
│   │   └── labels/sem_seg/masks/val/   # c-19
│   └── bdd_inpaint41/
│       ├── images/                 # 1,000
│       └── labels/                 # c-41
│
├── ACDC/
│   ├── rgb_anon/train/             # 1,600 (c-19)
│   ├── gt/train/                   # c-19
│   └── ACDC_inpaint41/
│       ├── images/                 # 1,000
│       └── labels/                 # c-41
│
├── mapillary/
│   ├── training/
│   │   ├── images/
│   │   ├── labels/
│   │   └── labels_TrainID30/       # c-30
│   ├── val/
│   │   ├── images/                 # 2,000
│   │   ├── labels_TrainIds/        # c-19
│   │   └── labels_TrainID30/       # c-30
│   └── OV_30/
│       ├── images/                 # 3,943
│       └── labels/                 # c-30
│
└── ROADWork_Data/
    ├── images/                     # 2,098
    └── gtFine_10/                  # c-10
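Given this tree, a small count check can catch extraction mistakes early. The expected numbers come from the comments in the tree; treat this as a sketch and adjust the paths if your layout differs.

```shell
# Sanity-check sketch: compare file counts against the numbers in the tree above.
check_count () {
  dir=$1; expected=$2
  [ -d "$dir" ] || { echo "MISSING $dir"; return 0; }
  actual=$(find "$dir" -type f | wc -l | tr -d ' ')
  if [ "$actual" -eq "$expected" ]; then
    echo "OK   $dir ($actual files)"
  else
    echo "FAIL $dir (got $actual, expected $expected)"
  fi
}
check_count data/cityscape/leftImg8bit/train 2975
check_count data/cityscape/leftImg8bit/val   500
check_count data/GTA5/GTAV/images            24966
check_count data/ROADWork_Data/images        2098
```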


🧠 Pretrained EVA-CLIP Models

Download EVA-CLIP weights from:

👉 https://github.com/baaivision/EVA/tree/master/EVA-CLIP

Place under:

Pretrain/
  EVA02_CLIP_B_psz16_s8B.pt
  EVA02_CLIP_L_336_psz14_s6B.pt
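A quick existence check (sketch; filenames as listed above, assuming the configs read the weights from Pretrain/) avoids a failed launch later:

```shell
# Sketch: confirm the EVA-CLIP checkpoints are in place before training.
for ckpt in Pretrain/EVA02_CLIP_B_psz16_s8B.pt Pretrain/EVA02_CLIP_L_336_psz14_s6B.pt; do
  if [ -f "$ckpt" ]; then
    echo "found: $ckpt"
  else
    echo "missing: $ckpt"
  fi
done
```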

Training

Training script format:

bash run.sh <CONFIG_YAML> <NUM_GPUS> <OUTPUT_DIR>
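Note that the nohup launches below redirect output into logs/, which nohup will not create for you. A minimal pre-launch sketch (the echo stands in for the real command):

```shell
# Sketch: assemble one launch. run.sh takes the three positional arguments
# shown above; logs/ and the output dir are created up front because the
# nohup redirection fails if logs/ does not exist.
CONFIG=configs/cs7_catseg.yaml
NUM_GPUS=1
OUTPUT_DIR=outputs/cs7_eva_b16_r512
mkdir -p logs "$OUTPUT_DIR"
echo "bash run.sh $CONFIG $NUM_GPUS $OUTPUT_DIR"  # swap echo for the real nohup launch
```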

🔹 CS-7 (Real-to-Real)

ViT-B/16

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
  > logs/cs7_eva_b16_r512.log 2>&1 &

ViT-L/14

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg_vitl.yaml 1 outputs/cs7_eva_L14_r448 \
  > logs/cs7_eva_L14_r448.log 2>&1 &

🔹 GTA-7 (Synthetic-to-Real)

ViT-B/16

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg.yaml 1 outputs/gta_seen7_eva_b16_r512 \
  > logs/gta_seen7_eva_b16_r512.log 2>&1 &

ViT-L/14

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg_vitl.yaml 1 outputs/gta_seen7_eva_L14_r448 \
  > logs/gta_seen7_eva_L14_r448.log 2>&1 &

🔹 CS-19 (19 Classes)

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs19_catseg_vitl.yaml 1 outputs/cs19_eva_L14_r448 \
  > logs/cs19_eva_L14_r448.log 2>&1 &

👀 Visualization

🎨 Segmentation Masks

CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
  > logs/viz_cs7_eva_b16_r512.log 2>&1 &

Correlation / Attention Maps

CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis_atten.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
  > logs/viz_attention_cs7_eva_b16_r512.log 2>&1 &

📚 Citation

@misc{zhao2026OVDG,
      title={Open-Vocabulary Domain Generalization in Urban-Scene Segmentation}, 
      author={Dong Zhao and Qi Zang and Nan Pu and Wenjing Li and Nicu Sebe and Zhun Zhong},
      year={2026},
      eprint={2602.18853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.18853}, 
}

❤️ Acknowledgements

This project builds upon the CAT-Seg codebase.