April 11, 2025

Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

CVPRW 2025 – IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops

Thanos Delatolas · Vicky Kalogeiton · Dim Papadopoulos

We leverage pre-trained diffusion models for Zero-Shot Video Object Segmentation by addressing key challenges:

- selecting the appropriate diffusion model
- determining the optimal time step
- identifying the best feature extraction layer
- designing an effective affinity matrix calculation strategy to match the features
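
The last item, matching features through an affinity matrix, can be illustrated with a minimal sketch. It shows a generic softmax-affinity label propagation of the kind commonly used for correspondence-based video segmentation: each target-frame location takes its label from the most similar reference-frame locations. This is not the repository's code or its MAGFilter strategy; the function names, `top_k`, and `temperature` are assumptions chosen for illustration, and plain Python lists stand in for real feature maps.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def propagate_labels(ref_feats, ref_labels, tgt_feats, num_classes,
                     top_k=2, temperature=0.1):
    """Assign a label to each target feature from its most similar references.

    Hypothetical helper: one row of the affinity matrix is built per target
    location from cosine similarities, restricted to the top_k references and
    normalized with a softmax.
    """
    ref = [l2_normalize(f) for f in ref_feats]
    tgt = [l2_normalize(f) for f in tgt_feats]
    predictions = []
    for t in tgt:
        # Cosine similarity between this target location and every reference.
        sims = [sum(a * b for a, b in zip(t, r)) for r in ref]
        # Keep only the top_k most similar reference locations.
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
        # Softmax over the retained similarities (shifted for stability).
        peak = max(sims[i] for i in top)
        weights = [math.exp((sims[i] - peak) / temperature) for i in top]
        total = sum(weights)
        # Accumulate the affinity weights per class and pick the largest.
        scores = [0.0] * num_classes
        for i, w in zip(top, weights):
            scores[ref_labels[i]] += w / total
        predictions.append(max(range(num_classes), key=lambda c: scores[c]))
    return predictions


# Toy example: two reference locations per class, two target locations.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_labels = [0, 0, 1, 1]
tgt_feats = [[0.8, 0.2], [0.2, 0.8]]
print(propagate_labels(ref_feats, ref_labels, tgt_feats, num_classes=2))  # [0, 1]
```

In practice the features would come from a chosen layer of the diffusion model at a chosen time step, and the same computation would be done with batched tensor operations.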

Installation

conda create -n diff-zvos python=3.10.8
conda activate diff-zvos
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
sh scripts/install_adm.sh
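
After running the commands above, a quick sanity check can confirm that the environment is the one you expect. The sketch below is not part of the repository; `check_environment` is a hypothetical helper that only reports the Python version and whether the packages named in the conda command can be located, without importing them.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Report the Python version and whether each package can be found."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated `diff-zvos` environment; a `False` next to a package suggests the corresponding install step did not complete.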

Datasets

To download the datasets, run:

python scripts/download_datasets.py

To run inference, follow the instructions in EVALUATION.md.

Citation

@article{delatolas2025studying,
  title={Studying Image Diffusion Features for Zero-Shot Video Object Segmentation},
  author={Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P},
  journal={arXiv preprint arXiv:2504.05468},
  year={2025}
}

State-of-the-art Comparison in Zero-Shot Video Segmentation

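| Model | #Images | #Segmentations (Image) | #Frames | #Segmentations (Video) | Datasets | DAVIS-17 val |
|---|---|---|---|---|---|---|
| Image + Video-level Data | | | | | | |
| XMem | 1.02M | 27K | 150K | 210K | I+S+D+Y | 86.2 |
| Cutie | 1.02M | 27K | 150K | 210K | I+S+D+Y | 88.8 |
| SAM2 | 11M | 1.1B | 4.2M | 35.5M | SA+SAV | 90.7 |
| Image-Level masks | | | | | | |
| SegIC | 1.3M | 1.8M | ❌ | ❌ | I+C+A+L | 73.7 |
| SegGPT | 147K | 1.62M | ❌ | ❌ | C+A+V | 75.6 |
| PerSAM-F | 11M | 1.1B | ❌ | ❌ | SA | 76.1 |
| Matcher | 11M | 1.1B | ❌ | ❌ | SA | 79.5 |
| No masks | | | | | | |
| FGVG | 1M | ❌ | 116K | ❌ | I+Y+FT | 72.4 |
| STT | 1M | ❌ | 95K | ❌ | I+Y | 74.1 |
| STC | ✗ | ❌ | 20M | ❌ | K | 67.6 |
| INO | ✗ | ❌ | 20M | ❌ | K | 72.5 |
| Mask-VOS | ✗ | ❌ | 95K | ❌ | Y | 75.6 |
| MoCo | 1M | ❌ | ❌ | ❌ | I | 65.4 |
| SHLS | 10K | ❌ | ❌ | ❌ | M | 68.5 |
| DIFT-SD | 5B | ❌ | ❌ | ❌ | LN | 70.0 |
| DINO | 1M | ❌ | ❌ | ❌ | I | 71.4 |
| DIFT-ADM | 1M | ❌ | ❌ | ❌ | I | 75.7 |
| Training-Free-VOS | 1M | ❌ | ❌ | ❌ | I | 76.3 |
| Ours | | | | | | |
| SD-2.1 + Prompt Learning | 5B | ❌ | ❌ | ❌ | LN | 70.5 |
| ADM + MAGFilter | 1M | ❌ | ❌ | ❌ | I | 76.8 |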

Acknowledgements

We would like to thank the authors of DIFT, DINO and Cutie for making their code publicly available.