VCT_AVS

August 18, 2025

This repo is the official PyTorch implementation for:

Revisiting Audio-Visual Segmentation with Vision-Centric Transformer. Accepted by CVPR 2025.

In this paper, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. This enables the queries to better distinguish between different sounding objects in mixed audio and to accurately delineate their contours. We also introduce a Prototype Prompted Query Generation (PPQG) module within the VCT framework, which generates vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation.
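To make the vision-centric idea concrete, here is a minimal, hypothetical PyTorch sketch of one decoder block in which vision-derived queries cross-attend first to audio features and then to visual features. All module and variable names here are illustrative assumptions, not the repo's actual classes:

```python
import torch
import torch.nn as nn

class VisionCentricBlock(nn.Module):
    """Hypothetical sketch: vision-derived queries iteratively fetch
    audio and visual information via two cross-attention steps."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, queries, audio_feats, visual_feats):
        # queries: (B, Nq, C), audio_feats: (B, Na, C), visual_feats: (B, HW, C)
        attended, _ = self.audio_attn(queries, audio_feats, audio_feats)
        queries = queries + attended          # fetch audio information
        attended, _ = self.visual_attn(queries, visual_feats, visual_feats)
        queries = queries + attended          # fetch visual information
        return queries + self.ffn(queries)    # per-query feed-forward refinement

block = VisionCentricBlock()
out = block(torch.randn(2, 10, 256), torch.randn(2, 5, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```

In the paper's framework such blocks are stacked so that queries refine their audio-visual assignment iteratively; this sketch only shows the shape-level data flow.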

Installation

Set up a conda environment and clone this repo:

conda create -n vct_avs python==3.8 -y
conda activate vct_avs

git clone https://github.com/spyflying/VCT_AVS.git
cd VCT_AVS

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 cudatoolkit=11.7 -c pytorch -c nvidia
git clone https://github.com/facebookresearch/detectron2
cd detectron2
pip install -e .

cd ..
pip install -r requirements.txt

cd models/modeling/pixel_decoder/ops
bash make.sh

This repository is built upon the COMBO-AVS codebase. Please refer to the original COMBO-AVS repository for more detailed installation instructions.

Data Preparation

Download the AVSBench dataset and organize the data folders as follows:

|--AVS_dataset
   |--AVSBench_semantic/
   |--AVSBench_object/Multi-sources/
   |--AVSBench_object/Single-source/
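Before running the preprocessing scripts, a quick stdlib check can confirm the layout matches the tree above. This is a minimal sketch that assumes the `AVS_dataset` folder sits in the current working directory:

```python
import os

# Expected sub-folders from the layout above (paths assumed relative to the repo root)
expected = [
    "AVS_dataset/AVSBench_semantic",
    "AVS_dataset/AVSBench_object/Multi-sources",
    "AVS_dataset/AVSBench_object/Single-source",
]

missing = [p for p in expected if not os.path.isdir(p)]
if missing:
    print("Missing folders:", missing)
else:
    print("Dataset layout looks correct.")
```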

Preprocess the dataset to 384x384 resolution by running:

python avs_tools/preprocess_avss_audio.py
python avs_tools/generate_data_384/ms3_process.py
python avs_tools/generate_data_384/s4_process.py
python avs_tools/generate_data_384/ss_process.py
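Conceptually, these preprocessing scripts resize frames and annotation masks to 384x384. A minimal PIL sketch of that step (illustrative only, not the repo's actual code) is:

```python
from PIL import Image

def resize_pair(image_path, mask_path, out_size=(384, 384)):
    # Bilinear resampling for RGB frames; nearest-neighbor for masks
    # so that integer label IDs are preserved exactly.
    img = Image.open(image_path).convert("RGB").resize(out_size, Image.BILINEAR)
    mask = Image.open(mask_path).resize(out_size, Image.NEAREST)
    return img, mask
```

Nearest-neighbor resampling for masks matters: bilinear interpolation would blend neighboring label IDs into invalid values.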

Download Pretrained Models

Download the Swin-Base-384 backbone pretrained on ImageNet-22K, then convert the original checkpoint with:

cd avs_tools
python swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl
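The .pth-to-.pkl conversion follows the detectron2/Mask2Former convention for third-party backbone weights. A minimal sketch of what such a converter does (key names and the `matching_heuristics` flag follow that convention; this is not verified against this repo's script):

```python
import pickle
import sys

import torch

def convert(pth_path, pkl_path):
    """Convert a Swin .pth checkpoint to a detectron2-style .pkl file."""
    ckpt = torch.load(pth_path, map_location="cpu")
    # ImageNet-22K Swin checkpoints typically store weights under "model" (assumed)
    state = ckpt.get("model", ckpt)
    out = {
        "model": state,
        "__author__": "third_party",
        "matching_heuristics": True,  # lets detectron2 remap parameter names
    }
    with open(pkl_path, "wb") as f:
        pickle.dump(out, f)

if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```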

Please refer to COMBO-AVS to download other pretrained models.

Model Zoo

We provide SOTA checkpoints for the Swin-B-384 setting:

| Subset | M_J | M_F | HuggingFace Link |
| --- | --- | --- | --- |
| Single-Source | 86.2 | 93.4 | download |
| Multi-Source | 67.6 | 81.4 | download |
| Semantic | 52.5 | 56.9 | download |

Testing

Run the following command to evaluate a given checkpoint, replacing $subset$ with one of s4, ms3, or ss:

sh scripts/$subset$_swinb_384_test.sh

Training

Run the following command for training, replacing $subset$ with one of s4, ms3, or ss:

sh scripts/$subset$_swinb_384_train.sh

Citation

If you find this repo useful for your research, please cite:

@inproceedings{huang2025revisiting,
  title={Revisiting Audio-Visual Segmentation with Vision-Centric Transformer},
  author={Huang, Shaofei and Ling, Rui and Hui, Tianrui and Li, Hongyu and Zhou, Xu and Zhang, Shifeng and Liu, Si and Hong, Richang and Wang, Meng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Contact

For questions about our paper or code, please contact Shaofei Huang (nowherespyfly@gmail.com).

Acknowledgement

This repo is mostly derived from the COMBO-AVS codebase. Thanks for their efforts.