VCT_AVS
August 18, 2025
This repo is the official PyTorch implementation for:
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer. Accepted by CVPR 2025.
In this paper, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling the queries to better distinguish between different sounding objects in mixed audio and to accurately delineate their contours. We also introduce a Prototype Prompted Query Generation (PPQG) module within the VCT framework that generates vision-derived queries which are both semantically aware and visually rich, via audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation.
Installation
Create and activate a conda environment, then clone this repo and install the dependencies:
conda create -n vct_avs python=3.8 -y
conda activate vct_avs
git clone https://github.com/spyflying/VCT_AVS.git
cd VCT_AVS
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 cudatoolkit=11.7 -c pytorch -c nvidia
git clone https://github.com/facebookresearch/detectron2
cd detectron2
pip install -e .
cd ..
pip install -r requirements.txt
cd models/modeling/pixel_decoder/ops
bash make.sh
This repository is built upon the COMBO-AVS codebase. Please refer to the original COMBO-AVS repository for more detailed installation instructions.
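After installation, a quick way to confirm that the key packages are importable is a small check like the following (the package list is an assumption based on the commands above; adjust it to your setup):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Key dependencies installed by the steps above.
required = ["torch", "torchvision", "torchaudio", "detectron2"]
print("missing:", missing_packages(required))
```

An empty list means every dependency resolved; otherwise, revisit the corresponding install step.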
Data Preparation
Download the AVSBench dataset and organize the data folders as follows:
|--AVS_dataset
|--AVSBench_semantic/
|--AVSBench_object/Multi-sources/
|--AVSBench_object/Single-source/
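The layout above can be verified before preprocessing with a small stdlib-only helper (a convenience sketch, not part of the repo):

```python
from pathlib import Path

# Expected AVSBench sub-directories, per the tree above.
EXPECTED = [
    "AVSBench_semantic",
    "AVSBench_object/Multi-sources",
    "AVSBench_object/Single-source",
]

def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED if not (root / d).is_dir()]

# Usage: missing_dirs("AVS_dataset") should return [] once the data is in place.
```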
Preprocess the dataset at 384x384 resolution by running:
python avs_tools/preprocess_avss_audio.py
python avs_tools/generate_data_384/ms3_process.py
python avs_tools/generate_data_384/s4_process.py
python avs_tools/generate_data_384/ss_process.py
Download Pretrained Models
Download Swin-Base-384 pretrained on ImageNet-22K from download. Convert the original model with:
cd avs_tools
python swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl
Please refer to COMBO-AVS to download other pretrained models.
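The .pth-to-.pkl conversion packs the checkpoint into a pickle that Detectron2-style loaders can read. Below is a minimal sketch of the packing step only; the dict layout with `"matching_heuristics"` is an assumption about the expected format, and the real script also handles `torch.load` and tensor-to-numpy conversion, omitted here:

```python
import pickle

def save_d2_pkl(weights, out_path):
    """Pickle a weights dict in the layout Detectron2 checkpoint loaders
    commonly accept. `weights` is assumed to already map parameter names
    to numpy arrays (the torch.load / .numpy() step is not shown)."""
    payload = {
        "model": weights,
        "__author__": "third_party",
        "matching_heuristics": True,
    }
    with open(out_path, "wb") as f:
        pickle.dump(payload, f)
```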
Model Zoo
We provide state-of-the-art checkpoints for the Swin-B-384 setting:
| Subset | M_J (mIoU) | M_F (F-score) | HuggingFace Link |
|---|---|---|---|
| Single-Source | 86.2 | 93.4 | download |
| Multi-Source | 67.6 | 81.4 | download |
| Semantic | 52.5 | 56.9 | download |
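M_J and M_F in the table are the mean Jaccard index (mask IoU) and the F-score. As a rough illustration of the two metrics on a single pair of flat binary masks (plain Python, with β² = 0.3 as commonly used in AVS evaluation; the benchmark's official evaluation code should be used for real numbers):

```python
def jaccard(pred, gt):
    """Jaccard index (IoU) between two flat binary masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def f_score(pred, gt, beta2=0.3):
    """F-measure over a binary mask pair, weighted by beta^2."""
    tp = sum(p and g for p, g in zip(pred, gt))
    fp = sum(p and not g for p, g in zip(pred, gt))
    fn = sum(g and not p for p, g in zip(pred, gt))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return (1 + beta2) * prec * rec / (beta2 * prec + rec)
```

The reported numbers are these scores averaged over frames (and classes, for the semantic subset).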
Testing
Run the following command to evaluate a given checkpoint, replacing $subset$ with s4, ms3, or ss:
sh scripts/$subset$_swinb_384_test.sh
Training
Run the following command for training, replacing $subset$ with s4, ms3, or ss:
sh scripts/$subset$_swinb_384_train.sh
Citation
If you find this repo useful for your research, please cite:
@inproceedings{huang2025revisiting,
title={Revisiting Audio-Visual Segmentation with Vision-Centric Transformer},
author={Huang, Shaofei and Ling, Rui and Hui, Tianrui and Li, Hongyu and Zhou, Xu and Zhang, Shifeng and Liu, Si and Hong, Richang and Wang, Meng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
Contact
For questions about our paper or code, please contact Shaofei Huang (nowherespyfly@gmail.com).
Acknowledgement
This repo is largely derived from the COMBO-AVS codebase. Thanks for their efforts.