SoundCLIP: Can Sound Replace Vision in LLaVA With Token Substitution?
October 6, 2025
[Code is coming soon. Stay tuned.]
Live Demo: https://ali-vosoughi.github.io/SoundCLIP/
Paper: Can Sound Replace Vision in LLaVA With Token Substitution? (arXiv)
Dataset: AVE-2 on HuggingFace
Project Overview
This is the official project webpage for "Can Sound Replace Vision in LLaVA With Token Substitution?". It features an interactive demonstration of our SoundCLIP framework and of the fundamental trade-off between cross-modal retrieval and text generation.
Authors
- Ali Vosoughi - University of Rochester (Website)
- Jing Bi - University of Rochester (Website)
- Pinxin Liu - University of Rochester (Website)
- Yunlong Tang - University of Rochester (Website)
- Chenliang Xu - University of Rochester (Website)
Quick Links
- Interactive Demo: https://ali-vosoughi.github.io/SoundCLIP/
- Paper: arXiv:2506.10416
- Code: GitHub Repository
- Dataset: AVE-2 on HuggingFace
Key Contributions
1. AVE-2 Dataset
- 570,138 audio-visual clips, each annotated along five alignment dimensions
- Now available on HuggingFace with comprehensive documentation and usage examples
- Systematic scoring across: Temporal Alignment, Spatial Coherence, Contextual Relevance, Physical Causality, Sound Source Visibility
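To make the five annotation dimensions concrete, here is a minimal Python sketch of how one annotated clip might be represented and filtered. The class names, field names, `clip_id`, the numeric score scale, and the threshold are all assumptions made for illustration; the actual AVE-2 schema is documented on HuggingFace and may differ.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AlignmentScores:
    """Hypothetical container for the five alignment dimensions listed above."""
    temporal_alignment: float
    spatial_coherence: float
    contextual_relevance: float
    physical_causality: float
    sound_source_visibility: float

    def minimum(self) -> float:
        """Return the weakest of the five scores."""
        return min(getattr(self, f.name) for f in fields(self))


@dataclass(frozen=True)
class Clip:
    clip_id: str  # hypothetical identifier, not an AVE-2 field name
    scores: AlignmentScores


def well_aligned(clips: List[Clip], threshold: float) -> List[Clip]:
    """Keep clips whose every dimension scores at least `threshold`."""
    return [c for c in clips if c.scores.minimum() >= threshold]


if __name__ == "__main__":
    # Made-up scores on an assumed numeric scale, purely for illustration.
    clips = [
        Clip("a", AlignmentScores(9, 8, 9, 7, 8)),
        Clip("b", AlignmentScores(9, 2, 9, 7, 8)),
    ]
    print([c.clip_id for c in well_aligned(clips, threshold=5)])
```

Filtering on the weakest dimension is only one possible design; a weighted average or a per-dimension threshold would work equally well with the same record layout.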
2. SoundCLIP Framework
- Token substitution approach: Replace CLIP's [CLS] token with audio tokens in LLaVA
- Two alignment strategies:
  - Projected: MLP projection to CLIP space (maximizes I(A;V), better retrieval)
  - Raw: Padded audio features (preserves H(A|V), better generation)
- Lightweight integration: Only 1.9M parameters for projection layer
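The token substitution idea and the two alignment strategies can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not the released implementation: it assumes the [CLS] token sits at position 0 of the visual token sequence, that the projection is a two-layer MLP with ReLU, and that "padded" means zero-padding. All function names and dimensions are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per output unit


def pad_raw(audio: Vector, clip_dim: int) -> Vector:
    """Raw strategy: zero-pad audio features up to the CLIP token width."""
    if len(audio) > clip_dim:
        raise ValueError("audio feature is wider than the CLIP token")
    return list(audio) + [0.0] * (clip_dim - len(audio))


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map: weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def project_mlp(audio: Vector, w1: Matrix, b1: Vector,
                w2: Matrix, b2: Vector) -> Vector:
    """Projected strategy: two-layer MLP into CLIP space (ReLU is assumed)."""
    hidden = [max(0.0, h) for h in linear(audio, w1, b1)]
    return linear(hidden, w2, b2)


def substitute_cls(visual_tokens: Matrix, audio_token: Vector) -> Matrix:
    """Return a copy of the sequence with position 0 (assumed [CLS]) replaced."""
    if len(audio_token) != len(visual_tokens[0]):
        raise ValueError("audio token must match the CLIP token width")
    return [list(audio_token)] + [list(t) for t in visual_tokens[1:]]


def mlp_param_count(in_dim: int, hidden_dim: int, out_dim: int) -> int:
    """Weights plus biases of a two-layer MLP."""
    return in_dim * hidden_dim + hidden_dim + hidden_dim * out_dim + out_dim


if __name__ == "__main__":
    rng = random.Random(0)
    audio_dim, hidden_dim, clip_dim = 3, 4, 5  # toy sizes, not the paper's
    audio = [rng.uniform(-1, 1) for _ in range(audio_dim)]
    tokens = [[rng.uniform(-1, 1) for _ in range(clip_dim)] for _ in range(3)]

    w1 = [[rng.uniform(-1, 1) for _ in range(audio_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(clip_dim)]
    b1, b2 = [0.0] * hidden_dim, [0.0] * clip_dim

    raw_seq = substitute_cls(tokens, pad_raw(audio, clip_dim))
    proj_seq = substitute_cls(tokens, project_mlp(audio, w1, b1, w2, b2))
    print(len(raw_seq), len(proj_seq), mlp_param_count(audio_dim, hidden_dim, clip_dim))
```

Only the projected path has trainable parameters, which is where a small projection-layer budget would come from; the raw path adds none. The I(A;V) and H(A|V) labels are linked by the standard identity H(A) = I(A;V) + H(A|V), which is one way to read why the two strategies pull in different directions.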
3. Fundamental Trade-off Discovery
- Retrieval vs. generation: the two are linked by a linear relationship, y = 0.163x + 11.867
- Each percentage-point gain in retrieval incurs a ~0.163% loss in generation quality
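As a quick worked example of the fitted line, the snippet below evaluates y = 0.163x + 11.867 and the change in y for a given change in x. The function names are hypothetical, and the exact metrics and units behind x and y are defined in the paper rather than on this page, so treat the numbers as arithmetic on the reported coefficients only.

```python
SLOPE = 0.163       # reported slope of the linear fit
INTERCEPT = 11.867  # reported intercept of the linear fit


def fitted_y(x: float) -> float:
    """Evaluate the reported line y = 0.163x + 11.867 at x."""
    return SLOPE * x + INTERCEPT


def change_in_y(x_from: float, x_to: float) -> float:
    """Change in y implied by moving x from x_from to x_to."""
    return SLOPE * (x_to - x_from)


if __name__ == "__main__":
    for x in (0.0, 10.0, 50.0):
        print(f"x = {x:5.1f} -> y = {fitted_y(x):.3f}")
    print(f"one-unit step in x -> change in y = {change_in_y(20.0, 21.0):.3f}")
```

Because the fit is linear, the per-unit change is the same everywhere on the line: a one-unit step in x always moves y by the slope, 0.163.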
Citation
If you use SoundCLIP or the AVE-2 dataset in your research, please cite our paper:
@article{vosoughi2025soundclip,
title={Can Sound Replace Vision in LLaVA With Token Substitution?},
author={Vosoughi, Ali and Bi, Jing and Liu, Pinxin and Tang, Yunlong and Xu, Chenliang},
journal={ArXiv},
year={2025}
}