April 11, 2025

Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

CVPRW 2025 – IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops

Thanos Delatolas · Vicky Kalogeiton · Dim Papadopoulos

Webpage · Paper

[Figure: teaser]

We leverage pre-trained image diffusion models for Zero-Shot Video Object Segmentation by addressing four key challenges:
  • selecting the appropriate diffusion model
  • determining the optimal time step
  • identifying the best feature extraction layer
  • designing an effective affinity matrix calculation strategy to match the features
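
To make the last point concrete, here is a minimal sketch of affinity-based label propagation between two frames, written with NumPy on toy data. It is not the code from this repository and it does not implement MAGFilter: the function name `propagate_labels` and the parameters `top_k` and `temperature` are assumptions made for illustration. In practice the feature arrays would come from a chosen layer of a diffusion model at a chosen time step; here they are small hand-written arrays.

```python
import numpy as np


def propagate_labels(feat_ref, feat_tgt, labels_ref, top_k=5, temperature=0.1):
    """Propagate labels from a reference frame to a target frame (hypothetical helper).

    feat_ref:   (N_ref, C) features, one row per spatial location of the reference frame.
    feat_tgt:   (N_tgt, C) features of the target frame.
    labels_ref: (N_ref, K) one-hot or soft labels of the reference frame.
    Returns an (N_tgt, K) array of soft labels for the target frame.
    """
    eps = 1e-8  # avoids division by zero for all-zero feature vectors
    # L2-normalise so that the dot product equals cosine similarity.
    ref = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + eps)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + eps)

    # Affinity matrix: entry (i, j) compares target location i with reference location j.
    affinity = tgt @ ref.T  # shape (N_tgt, N_ref)

    # Keep only the top-k most similar reference locations per target location.
    k = min(top_k, affinity.shape[1])
    top_idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]
    top_val = np.take_along_axis(affinity, top_idx, axis=1)

    # Softmax over the retained affinities; a lower temperature gives sharper weights.
    logits = top_val / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted average of the labels of the selected reference locations.
    return np.einsum("nk,nkc->nc", weights, labels_ref[top_idx])


# Toy example: four reference locations, two classes (background, object).
feat_ref = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels_ref = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
feat_tgt = np.array([[0.95, 0.05], [0.05, 0.95]])

soft = propagate_labels(feat_ref, feat_tgt, labels_ref, top_k=2)
pred = soft.argmax(axis=1)  # predicted class index per target location
```

Restricting the softmax to the top-k matches is one common way to keep weakly related locations from diluting the propagated mask; the values of `top_k` and `temperature` above are arbitrary.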

Installation

conda create -n diff-zvos python=3.10.8
conda activate diff-zvos
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
sh scripts/install_adm.sh

Datasets

To download the datasets, run:

python scripts/download_datasets.py

To run inference, please follow EVALUATION.md.

Citation

@article{delatolas2025studying,
  title={Studying Image Diffusion Features for Zero-Shot Video Object Segmentation},
  author={Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P},
  journal={arXiv preprint arXiv:2504.05468},
  year={2025}
}

State-of-the-art Comparison in Zero-Shot Video Segmentation

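| Model | #Images | #Segmentations (Image) | #Frames | #Segmentations (Video) | Datasets | DAVIS-17 val |
|---|---|---|---|---|---|---|
| Image + Video-level Data | | | | | | |
| XMem | 1.02M | 27K | 150K | 210K | I+S+D+Y | 86.2 |
| Cutie | 1.02M | 27K | 150K | 210K | I+S+D+Y | 88.8 |
| SAM2 | 11M | 1.1B | 4.2M | 35.5M | SA+SAV | 90.7 |
| Image-Level masks | | | | | | |
| SegIC | 1.3M | 1.8M | - | - | I+C+A+L | 73.7 |
| SegGPT | 147K | 1.62M | - | - | C+A+V | 75.6 |
| PerSAM-F | 11M | 1.1B | - | - | SA | 76.1 |
| Matcher | 11M | 1.1B | - | - | SA | 79.5 |
| No masks | | | | | | |
| FGVG | 1M | - | 116K | - | I+Y+FT | 72.4 |
| STT | 1M | - | 95K | - | I+Y | 74.1 |
| STC | - | - | 20M | - | K | 67.6 |
| INO | - | - | 20M | - | K | 72.5 |
| Mask-VOS | - | - | 95K | - | Y | 75.6 |
| MoCo | 1M | - | - | - | I | 65.4 |
| SHLS | 10K | - | - | - | M | 68.5 |
| DIFT-SD | 5B | - | - | - | LN | 70.0 |
| DINO | 1M | - | - | - | I | 71.4 |
| DIFT-ADM | 1M | - | - | - | I | 75.7 |
| Training-Free-VOS | 1M | - | - | - | I | 76.3 |
| Ours | | | | | | |
| SD-2.1 + Prompt Learning | 5B | - | - | - | LN | 70.5 |
| ADM + MAGFilter | 1M | - | - | - | I | 76.8 |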

Acknowledgements

We would like to thank the authors of DIFT, DINO and Cutie for making their code publicly available.