[ICASSP 2026] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

January 19, 2026

ACVIS Web Demo

Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee and Kwanghoon Sohn


Demo

https://github.com/user-attachments/assets/e1e72e81-0945-442a-b6d8-c552c6beae62

https://github.com/user-attachments/assets/1e872233-6934-41dc-8553-56405b54ae56

https://github.com/user-attachments/assets/d35820fb-6173-461d-ab75-9aa0f8f693c8

https://github.com/user-attachments/assets/f6530e52-5871-45f6-a538-8292865c152a

Installation

conda create --name acvis python=3.8 -y
conda activate acvis

conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -U opencv-python
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install -r requirements.txt
pip install timm

cd mask2former/modeling/pixel_decoder/ops
sh make.sh
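The make.sh step compiles the pixel decoder's deformable attention CUDA ops. If the build fails, a common cause is that CUDA_HOME does not point at a toolkit matching the PyTorch CUDA version installed above (12.1). A minimal sketch of the fix, assuming a system-installed toolkit (the path below is an example, not part of this repo; adjust it for your machine):

```shell
# Assumption: the CUDA 12.1 toolkit is installed at /usr/local/cuda-12.1 (example path).
export CUDA_HOME=/usr/local/cuda-12.1
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
```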

Setup

Datasets

Download and unzip the AVISeg datasets and place them in ./datasets.

Pretrained Backbones

Download and unzip the pre-trained backbones from OneDrive and place them in ./pre_models.

Checkpoints

Download the following checkpoints and put them in ./checkpoints.

| Backbone | Pre-trained Datasets | mAP | HOTA | FSLA | Model Weight |
|---|---|---|---|---|---|
| ResNet-50 | ImageNet | 42.01 | 62.04 | 42.43 | ACVIS_R50_IN.pth |
| ResNet-50 | ImageNet & COCO | 46.64 | 65.02 | 46.72 | ACVIS_R50_COCO.pth |
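Before running evaluation or the demo, it can help to confirm the released weights are where the commands below expect them. A small optional check (file names are taken from the table above, and the checkpoints/ path from this README; this script is not part of the repo):

```shell
# Report which of the released checkpoints are present in ./checkpoints.
for f in ACVIS_R50_IN.pth ACVIS_R50_COCO.pth; do
  if [ -f "checkpoints/$f" ]; then
    echo "found $f"
  else
    echo "missing $f"
  fi
done
```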

Getting Started

Train

python train_net.py --num-gpus 2 --config-file configs/acvis/acvis_saoc.yaml

Evaluation

python train_net.py --config-file configs/acvis/acvis_saoc.yaml --eval-only MODEL.WEIGHTS checkpoints/ACVIS_R50_COCO.pth

Demo

python demo_video/demo.py --config-file configs/acvis/acvis_saoc.yaml --opts MODEL.WEIGHTS checkpoints/ACVIS_R50_COCO.pth

Citation

@misc{seo2025acvis,
      title={Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation}, 
      author={Jinbae Seo and Hyeongjun Kwon and Kwonyoung Kim and Jiyoung Lee and Kwanghoon Sohn},
      year={2025},
      eprint={2509.22740},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2509.22740}, 
}

Acknowledgement

Our implementation is based on Detectron2, Mask2Former, VITA, and AVIS. We thank the authors for their great work.