Using Diffusion Priors for Video Amodal Segmentation
CVPR 2025
Official implementation of Using Diffusion Priors for Video Amodal Segmentation
Kaihua Chen, Deva Ramanan, Tarasha Khurana

TODO
- Release the checkpoint and inference code
- Release evaluation code for SAIL-VOS and TAO-Amodal
- Release fine-tuning code for Diffusion-VAS
Getting Started
Installation
1. Clone the repository
git clone https://github.com/Kaihua-Chen/diffusion-vas
cd diffusion-vas
2. Create and activate a virtual environment
conda create --name diffusion_vas python=3.10
conda activate diffusion_vas
pip install -r requirements.txt
Download Checkpoints
We provide Diffusion-VAS checkpoints fine-tuned on SAIL-VOS, hosted on Hugging Face. To download them, run:
mkdir checkpoints
cd checkpoints
git lfs install
git clone https://huggingface.co/kaihuac/diffusion-vas-amodal-segmentation
git clone https://huggingface.co/kaihuac/diffusion-vas-content-completion
cd ..
Note: Ignore any Windows-related warnings when downloading.
For Depth Anything V2's checkpoints, download the pre-trained models (e.g., Depth-Anything-V2-Large) from the official Depth Anything V2 repository and place them inside the checkpoints/ folder.
Inference
To run inference, simply execute:
python demo.py
This runs inference on the birdcage example in demo_data/.
To try different examples, modify the seq_name argument:
python demo.py --seq_name <your_sequence_name>
You can also change the checkpoint path, data output paths, and other parameters as needed.
Using custom data
Start with a video, use SAM2's web demo or codebase to segment the target object, and extract frames (preferably at 8 FPS). Make sure the output follows the same directory structure as the examples in demo_data/ before running inference; a minimal frame-extraction sketch is shown below.
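The snippet below is a minimal sketch of frame extraction with OpenCV; the output layout used in the example call (demo_data/&lt;seq_name&gt;/frames/) is only an assumption, so mirror whatever structure the bundled examples in demo_data/ actually use.

# Minimal sketch: sample frames from a video at ~8 FPS with OpenCV.
# The demo_data/<seq_name>/frames/ layout in the example call is an assumption;
# match the structure of the provided examples in demo_data/ instead.
import os
import cv2

def extract_frames(video_path, out_dir, target_fps=8):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or float(target_fps)
    step = max(1, round(src_fps / target_fps))  # keep every `step`-th frame
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:05d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example with a hypothetical sequence name:
# extract_frames("my_video.mp4", "demo_data/my_sequence/frames")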
Evaluation
We currently support evaluation on SAIL-VOS-2D and TAO-Amodal.
1. Download Datasets
Download SAIL-VOS-2D and TAO-Amodal by following their official instructions.
Additionally, download our curated annotations and precomputed evaluation results:
git clone https://huggingface.co/datasets/kaihuac/diffusion_vas_datasets
This includes:
- diffusion_vas_sailvos_train.json
- diffusion_vas_sailvos_val.json
- diffusion_vas_tao_amodal_val.json
- tao_amodal_track_ids_abs2rel_val.json
- sailvos_complete_objs_as_occluders.json
- Precomputed eval_outputs/ folder
2. Generate Evaluation Results
To evaluate the model, first generate result files using the scripts below. Alternatively, you can skip this step and directly use our precomputed results in eval_outputs/.
Note: Please replace the paths in the commands with your own dataset and json annotation paths.
SAIL-VOS-2D
cd eval
python eval_diffusion_vas_sailvos.py \
--eval_data_path /path/to/SAILVOS_2D/ \
--eval_annot_path /path/to/diffusion_vas_sailvos_val.json \
--eval_output_path /path/to/eval_outputs/
TAO-Amodal
python eval_diffusion_vas_tao_amodal.py \
--eval_data_path /path/to/TAO/frames/ \
--eval_annot_path /path/to/diffusion_vas_tao_amodal_val.json \
--track_ids_path /path/to/tao_amodal_track_ids_abs2rel_val.json \
--eval_output_path /path/to/eval_outputs/
3. Compute Metrics
Once the result files are ready, run the metric scripts:
SAIL-VOS-2D
python metric_diffusion_vas_sailvos.py \
--eval_data_path /path/to/SAILVOS_2D/ \
--eval_annot_path /path/to/diffusion_vas_sailvos_val.json \
--pred_annot_path /path/to/eval_outputs/diffusion_vas_sailvos_eval_results.json
TAO-Amodal
python metric_diffusion_vas_tao_amodal.py \
--eval_data_path /path/to/TAO/frames/ \
--eval_annot_path /path/to/diffusion_vas_tao_amodal_val.json \
--track_ids_path /path/to/tao_amodal_track_ids_abs2rel_val.json \
--pred_annot_path /path/to/eval_outputs/diffusion_vas_tao_amodal_eval_results.json
Finetuning on SAIL-VOS
We currently support fine-tuning for both the amodal segmentation and content completion stages on SAIL-VOS, based on Stable Video Diffusion and adapted from SVD Xtend.
Note: Please replace the paths in the commands with your own dataset and annotation paths. The json annotations can be downloaded as shown in the Evaluation section.
Amodal segmentation fine-tuning
We provide end-to-end fine-tuning conditioned on modal masks and depth maps. The training script is:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train/train_diffusion_vas_amodal_segm.py \
--data_path /path/to/SAILVOS_2D/ \
--train_annot_path /path/to/diffusion_vas_sailvos_train.json \
--eval_annot_path /path/to/diffusion_vas_sailvos_val.json \
--output_dir /path/to/train_diffusion_vas_amodal_seg_outputs
Note:
- Our default implementation runs the depth estimator during each training step, which requires more than 24 GB of memory per GPU and significantly increases training time (~120 hours on 8× A6000s).
- To reduce memory usage and training time, we highly recommend precomputing and saving pseudo-depth maps before training. This allows training on RTX 3090s and cuts training time to roughly 30 hours (see the sketch below).
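A rough sketch of such a precomputation pass is shown below, assuming the Depth Anything V2 repository is importable and its ViT-L checkpoint sits in checkpoints/; the frame/output paths and the .npy output format are placeholders, so adapt them to whatever the training script expects.

# Rough sketch: precompute pseudo-depth maps with Depth Anything V2 so the
# depth estimator does not have to run inside the training loop.
# Assumes the depth_anything_v2 package (from the official repo) is on
# PYTHONPATH and its ViT-L checkpoint is in checkpoints/. Paths and the
# .npy output format are placeholders.
import glob
import os

import cv2
import numpy as np
import torch
from depth_anything_v2.dpt import DepthAnythingV2

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DepthAnythingV2(encoder="vitl", features=256,
                        out_channels=[256, 512, 1024, 1024])
model.load_state_dict(
    torch.load("checkpoints/depth_anything_v2_vitl.pth", map_location="cpu"))
model = model.to(device).eval()

frame_dir = "/path/to/SAILVOS_2D/<sequence>/images"  # placeholder
depth_dir = "/path/to/SAILVOS_2D/<sequence>/depth"   # placeholder
os.makedirs(depth_dir, exist_ok=True)

with torch.no_grad():
    for frame_path in sorted(glob.glob(os.path.join(frame_dir, "*.png"))):
        img = cv2.imread(frame_path)           # BGR uint8, as infer_image expects
        depth = model.infer_image(img)          # HxW float32 relative depth
        out_name = os.path.splitext(os.path.basename(frame_path))[0] + ".npy"
        np.save(os.path.join(depth_dir, out_name), depth)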
Content completion fine-tuning
We provide end-to-end fine-tuning conditioned on modal RGB images and predicted amodal masks:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train/train_diffusion_vas_content_comp.py \
--data_path /path/to/SAILVOS_2D/ \
--train_annot_path /path/to/sailvos_complete_objs_as_occluders.json \
--eval_annot_path /path/to/sailvos_complete_objs_as_occluders.json \
--occluder_data_path /path/to/sailvos_complete_objs_as_occluders.json \
--output_dir /path/to/train_diffusion_vas_content_comp_outputs
This stage does not require depth estimation, and training typically completes in ~30 hours on 8× RTX 3090s.
Acknowledgement
This work builds on top of several excellent projects, including Depth Anything V2, SAM2, Stable Video Diffusion, and SVD Xtend. Our training and evaluation are based on SAIL-VOS and TAO-Amodal. We sincerely thank the authors for their contributions.
Citation
If you find this work helpful, please consider citing our papers:
@InProceedings{chen2025diffvas,
    author    = {Chen, Kaihua and Ramanan, Deva and Khurana, Tarasha},
    title     = {Using Diffusion Priors for Video Amodal Segmentation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {22890-22900}
}
@article{hsieh2023taoamodal,
    title   = {TAO-Amodal: A Benchmark for Tracking Any Object Amodally},
    author  = {Cheng-Yen Hsieh and Kaihua Chen and Achal Dave and Tarasha Khurana and Deva Ramanan},
    journal = {arXiv preprint arXiv:2312.12433},
    year    = {2023}
}