June 6, 2025

SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes



¹Tsinghua University ²Zhejiang University

📖 Overview

  1. We propose SAM2-LOVE, a novel framework that is the first to leverage SAM2 for pixel-wise understanding in language-aided audio-visual scenes (LAVS), enabled by a multimodal fusion module.

  2. We develop novel token propagation and accumulation strategies to improve the spatio-temporal comprehension of the promptable token.

  3. Extensive experiments on the Ref-AVS dataset demonstrate the superiority of our method, and ablation studies highlight the simplicity and effectiveness of its modules.
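To give a rough intuition for point 2, the sketch below illustrates one plausible form of token propagation and accumulation: a single promptable token attends to each frame's features in turn, and its history is blended into the result before being carried to the next frame. This is a minimal illustrative sketch, not the paper's implementation; the function name, the single-head attention, and the fixed blending weight `alpha` are all our assumptions.

```python
import numpy as np

def propagate_and_accumulate(frame_features, prompt_token, alpha=0.5):
    """Hypothetical sketch of token propagation + accumulation.

    frame_features: list of (N, D) arrays, one per video frame
    prompt_token:   (1, D) array, the promptable token
    alpha:          assumed blending weight between history and update
    """
    token = prompt_token
    history = []
    for feats in frame_features:
        # Single-head attention of the token over this frame's features.
        scores = token @ feats.T / np.sqrt(feats.shape[-1])  # (1, N)
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        update = attn @ feats  # (1, D) frame-conditioned update

        # Accumulate: blend the propagated history with the new update,
        # then carry the result forward to the next frame.
        token = alpha * token + (1 - alpha) * update
        history.append(token)
    return history
```

The carried token thus aggregates spatio-temporal context frame by frame instead of being re-initialized per frame.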


🌹 Acknowledgement

Our work builds primarily on EVF-SAM, SAM2, and Ref-AVS. We are sincerely grateful for their excellent work.

📚 Citation

If you find our paper and code helpful for your research, please consider starring our repository ⭐ and citing our work ✍️.

```bibtex
@inproceedings{wang2025sam2,
  title={SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes},
  author={Wang, Yuji and Xu, Haoran and Liu, Yong and Li, Jiaze and Tang, Yansong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={28932--28941},
  year={2025}
}
```