June 6, 2025

SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes



¹Tsinghua University ²Zhejiang University

📖 Overview

  1. We propose SAM2-LOVE, a novel framework that is the first to leverage SAM2 for pixel-wise understanding in language-aided audio-visual scenes (LAVS), enabled by a multimodal fusion module.

  2. We develop novel token propagation and accumulation strategies to improve the spatio-temporal comprehension of the promptable token.

  3. Extensive experiments on the Ref-AVS dataset demonstrate the superiority of our method, and ablation studies highlight the simplicity and effectiveness of its modules.
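To give a rough intuition for point 2, the sketch below illustrates one plausible form of token propagation and accumulation: a single promptable token attends to each frame's features in turn, and its history is blended into the result before being carried to the next frame. This is a minimal illustrative sketch, not the paper's implementation; the function name, the single-head attention, and the fixed blending weight `alpha` are all our assumptions.

```python
import numpy as np

def propagate_and_accumulate(frame_features, prompt_token, alpha=0.5):
    """Hypothetical sketch of token propagation + accumulation.

    frame_features: list of (N, D) arrays, one per video frame
    prompt_token:   (1, D) array, the promptable token
    alpha:          assumed blending weight between history and update
    """
    token = prompt_token
    history = []
    for feats in frame_features:
        # Single-head attention of the token over this frame's features.
        scores = token @ feats.T / np.sqrt(feats.shape[-1])  # (1, N)
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        update = attn @ feats  # (1, D) frame-conditioned update

        # Accumulate: blend the propagated history with the new update,
        # then carry the result forward to the next frame.
        token = alpha * token + (1 - alpha) * update
        history.append(token)
    return history
```

The carried token thus aggregates spatio-temporal context frame by frame instead of being re-initialized per frame.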


🌹 Acknowledgement

Our work builds primarily on EVF-SAM, SAM2, and Ref-AVS. We are sincerely grateful for their excellent work.

📚 Citation

If you find our paper and code helpful for your research, please consider starring our repository ⭐ and citing our work ✍️.

```bibtex
@inproceedings{wang2025sam2,
  title={SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes},
  author={Wang, Yuji and Xu, Haoran and Liu, Yong and Li, Jiaze and Tang, Yansong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={28932--28941},
  year={2025}
}
```