Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
October 31, 2025
Official code for "Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation"
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation,
Yuanhong Chen*, Yuyuan Liu*, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, Gustavo Carneiro.
CVPR 2024 (arXiv 2304.02970)
Dataset
VPO datasets are available here
VGGSound audio files are available here
Visual comparison between datasets. We show four audio-visual classes: “female”, “cat”, “dog”, and “car”. AVSBench (SS) (1st frame) provides pixel-level multi-class annotations for images containing a single sounding object. The proposed VPO benchmarks (2nd to 4th frames) pair a subset of the segmented objects in an image with relevant audio files to produce pixel-level multi-class annotations.
Results
Please note that all tables in the original paper use conventional semantic segmentation (per-dataset) mIoU and F-score metrics. We have updated the following results to use per-image mIoU and per-video F-score, following TPAVI. The current repository version uses AVSBench-Semantics to facilitate training and evaluation on the AVSBench-Objects dataset. However, label noise in AVSBench-Semantics may affect the final results on AVSBench-Objects, so we recommend using the original AVSBench-Objects dataset instead.
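To make the difference between the two averaging schemes concrete, here is a minimal sketch for a single foreground class. It is an illustration only, not the repository's evaluation code: the function names and the toy masks are hypothetical, and the handling of empty masks is an assumption. Conventional semantic segmentation mIoU additionally averages the dataset-level IoU over classes, which is omitted here.

```python
def iou_counts(pred, target):
    """Return (intersection, union) for two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def per_image_miou(preds, targets):
    """Compute IoU separately for each image, then average the scores."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = iou_counts(pred, target)
        # Assumption: an image with an empty prediction and an empty target scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def per_dataset_iou(preds, targets):
    """Accumulate intersection and union over all images, then divide once."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = iou_counts(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Two toy "images" of four pixels each: one perfect prediction, one complete miss.
preds = [[1, 1, 1, 1], [1, 0, 0, 0]]
targets = [[1, 1, 1, 1], [0, 1, 0, 0]]

image_level = per_image_miou(preds, targets)
dataset_level = per_dataset_iou(preds, targets)
```

In this toy case the per-image score is the average of 1.0 and 0.0, while the dataset-level score divides 4 intersecting pixels by 6 union pixels, so images with large objects weigh more in the dataset-level number. This is one reason the two sets of tables are not directly comparable.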
Instance-level Evaluation (AVSBench Metrics)
ResNet-50 (ImageNet pretrain). SS = AVSBench-Object (SS), MS = AVSBench-Object (MS), AVSS = AVSBench-Semantics.
| Model | SS J&F Mean | SS J Mean | SS F Mean | MS J&F Mean | MS J Mean | MS F Mean | AVSS J&F Mean | AVSS J Mean | AVSS F Mean |
|---|---|---|---|---|---|---|---|---|---|
| CATR | 80.70 | 74.80 | 86.60 | 59.05 | 52.80 | 65.30 | - | - | - |
| AuTR | 80.10 | 75.00 | 85.20 | 55.30 | 49.40 | 61.20 | - | - | - |
| AVSegFormer | 80.67 | 76.54 | 84.80 | 56.17 | 49.53 | 62.80 | 27.12 | 24.93 | 29.30 |
| AVSC | 81.13 | 77.02 | 85.24 | 55.55 | 49.58 | 61.51 | - | - | - |
| BAVS | 81.63 | 77.96 | 85.29 | 56.30 | 50.23 | 62.37 | 27.16 | 24.68 | 29.63 |
| TPAVI | 78.80 | 72.79 | 84.80 | 52.84 | 47.88 | 57.80 | 22.69 | 20.18 | 25.20 |
| AVSBG | 79.77 | 74.13 | 85.40 | 50.88 | 44.95 | 56.80 | - | - | - |
| ECMVAE | 81.42 | 76.33 | 86.50 | 54.70 | 48.69 | 60.70 | - | - | - |
| DiffusionAVS | 81.35 | 75.80 | 86.90 | 55.94 | 49.77 | 62.10 | - | - | - |
| CAVP | 83.84 | 78.78 | 88.89 | 61.48 | 55.82 | 67.14 | 32.83 | 30.37 | 35.29 |
ResNet-50 (COCO pretrain). SS = AVSBench-Object (SS), MS = AVSBench-Object (MS).
| Model | SS J&F Mean | SS J Mean | SS F Mean | MS J&F Mean | MS J Mean | MS F Mean |
|---|---|---|---|---|---|---|
| AQFormer | 81.70 | 77.00 | 86.40 | 61.30 | 55.70 | 66.90 |
| CAVP | 83.75 | 78.72 | 88.77 | 62.34 | 56.42 | 68.25 |
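The J&F Mean values in the tables above are consistent with the arithmetic mean of the J Mean and F Mean columns, rounded half up to two decimals. The sketch below reproduces that relationship for a few rows. The helper name `jf_mean` is hypothetical, and `Decimal` is used only to avoid binary floating-point rounding surprises at the half-way point.

```python
from decimal import Decimal, ROUND_HALF_UP


def jf_mean(j_mean, f_mean):
    """Average J and F (given as decimal strings), rounded half up to 2 places."""
    mean = (Decimal(j_mean) + Decimal(f_mean)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (J Mean, F Mean) pairs copied from the tables above.
cavp_ss_imagenet = jf_mean("78.78", "88.89")
cavp_ms_coco = jf_mean("56.42", "68.25")
tpavi_avss = jf_mean("20.18", "25.20")
```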
Dataset-level Evaluation (Conventional Semantic Segmentation Metrics)
ResNet-50 (ImageNet pretrain). SS = AVSBench-Object (SS), MS = AVSBench-Object (MS), AVSS = AVSBench-Semantics.
| Model | SS mIoU | SS F-Score | MS mIoU | MS F-Score | AVSS mIoU | AVSS F-Score |
|---|---|---|---|---|---|---|
| CAVP | 89.43 | 94.50 | 72.79 | 83.05 | 44.70 | 57.76 |
Demo
https://github.com/user-attachments/assets/e113d3a7-cbb4-4696-941b-4e5966870bee
https://github.com/user-attachments/assets/821e3c55-7daf-4445-a0df-a869cba37d59
https://github.com/user-attachments/assets/d80d8a75-c038-4169-b40d-261a40767c31
Checkpoints
Checkpoints are available here: avsbench-object-ss-224, avsbench-object-ms-224, avss-224.
Usage
Requirements
git clone git@github.com:cyh-0/CAVP.git
cd CAVP
pip install -r requirements.txt
Path
ln -s /path/to/datasets ../audio_visual
ln -s /path/to/ckpts ./ckpts
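A quick way to confirm that both links resolve before starting a run is sketched below. This is not part of the repository: the helper `missing_links` is hypothetical, and it assumes the script is executed from inside the `CAVP` checkout so that the relative paths match the `ln -s` commands above.

```python
from pathlib import Path

# Link locations taken from the commands above, relative to the CAVP checkout.
EXPECTED_LINKS = [Path("../audio_visual"), Path("./ckpts")]


def missing_links(links=EXPECTED_LINKS):
    """Return the links that do not resolve to an existing directory."""
    # Path.is_dir() follows symlinks, so a dangling link is reported as missing.
    return [link for link in links if not link.is_dir()]


if __name__ == "__main__":
    for link in missing_links():
        print(f"missing or broken: {link}")
```

If the script prints nothing, both links point at existing directories.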
Training
Before training, set your WANDB_KEY in the config file.
Training script for AVSBench-Semantics.
python main_avss.py --experiment_name "CAVP" --setup avss --gpus 1 --batch_size 16 --lr 1e-3 --weight_decay 1e-4 --epochs 80 --wandb_mode disabled --num_workers 16
Training scripts for VPO-MONO.
python main_vpo_mono.py --experiment_name "CAVP" --setup "vpo_ss" --gpus 2 --batch_size 8 --lr 5e-4 --weight_decay 5e-4 --epochs 80 --num_workers 16 --wandb_mode online
python main_vpo_mono.py --experiment_name "CAVP" --setup "vpo_ms" --gpus 2 --batch_size 8 --lr 5e-4 --weight_decay 5e-4 --epochs 80 --num_workers 16 --wandb_mode online
python main_vpo_mono.py --experiment_name "CAVP" --setup "vpo_msmi" --gpus 2 --batch_size 8 --lr 5e-4 --weight_decay 5e-4 --epochs 80 --num_workers 16 --wandb_mode online
Training scripts for VPO-STEREO.
python main_vpo_stereo.py --experiment_name "CAVP" --setup "vpo_ss" --gpus 2 --batch_size 8 --lr 5e-4 --weight_decay 5e-4 --epochs 80 --num_workers 16 --wandb_mode online
python main_vpo_stereo.py --experiment_name "CAVP" --setup "vpo_ms" --gpus 2 --batch_size 8 --lr 5e-4 --weight_decay 5e-4 --epochs 80 --num_workers 16 --wandb_mode online
python main_vpo_stereo.py --experiment_name "CAVP" --setup "vpo_msmi" --gpus 2 --batch_size 8 --lr 5e-4 --weight_decay 5e-4 --epochs 80 --num_workers 16 --wandb_mode online
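The six VPO commands above differ only in the script name and the `--setup` value, so they can be generated rather than typed out. The sketch below builds the argument lists and prints them as a dry run. It is a convenience wrapper and not part of the repository: `build_command` is a hypothetical helper, and the flags and values are copied from the commands above.

```python
import shlex

SCRIPTS = ["main_vpo_mono.py", "main_vpo_stereo.py"]
SETUPS = ["vpo_ss", "vpo_ms", "vpo_msmi"]


def build_command(script, setup):
    """Build the argument list for one training run, using the flags shown above."""
    return [
        "python", script,
        "--experiment_name", "CAVP",
        "--setup", setup,
        "--gpus", "2",
        "--batch_size", "8",
        "--lr", "5e-4",
        "--weight_decay", "5e-4",
        "--epochs", "80",
        "--num_workers", "16",
        "--wandb_mode", "online",
    ]


# Mono runs first, then stereo, each over the three setups.
commands = [build_command(script, setup) for script in SCRIPTS for setup in SETUPS]

if __name__ == "__main__":
    # Dry run: print each command instead of launching it.
    for cmd in commands:
        print(shlex.join(cmd))
```

To launch the runs sequentially instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)`.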
Citation
@misc{chen2024unraveling,
title={Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation},
author={Yuanhong Chen and Yuyuan Liu and Hu Wang and Fengbei Liu and Chong Wang and Helen Frazer and Gustavo Carneiro},
year={2024},
eprint={2304.02970},
archivePrefix={arXiv},
primaryClass={cs.CV}
}