SoundCLIP: Can Sound Replace Vision in LLaVA With Token Substitution?

October 6, 2025

[Code is coming soon. Stay tuned.]

๐ŸŒ Live Demo: https://ali-vosoughi.github.io/SoundCLIP/

๐Ÿ“„ Paper: Can Sound Replace Vision in LLaVA With Token Substitution? (ArXiv)

๐Ÿ“Š Dataset: AVE-2 on HuggingFace

Project Overview

This is the official project webpage for "Can Sound Replace Vision in LLaVA With Token Substitution?" featuring an interactive demonstration of our SoundCLIP framework and the fundamental trade-off between cross-modal retrieval and text generation.

Authors

  • Ali Vosoughi - University of Rochester (Website)
  • Jing Bi - University of Rochester (Website)
  • Pinxin Liu - University of Rochester (Website)
  • Yunlong Tang - University of Rochester (Website)
  • Chenliang Xu - University of Rochester (Website)

Key Contributions

1. AVE-2 Dataset

  • 570,138 audio-visual clips annotated along five alignment dimensions
  • Now available on HuggingFace with comprehensive documentation and usage examples
  • Systematic scoring across: Temporal Alignment, Spatial Coherence, Contextual Relevance, Physical Causality, Sound Source Visibility
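
To make the five annotation dimensions concrete, here is a minimal Python sketch of how one clip's scores could be held and filtered. The class name, field names, helper function, and example score values are all hypothetical; the actual AVE-2 schema and score scale are defined by the dataset card on HuggingFace, not by this snippet.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class AlignmentScores:
    """Hypothetical container for one clip's five alignment scores."""
    temporal_alignment: float
    spatial_coherence: float
    contextual_relevance: float
    physical_causality: float
    sound_source_visibility: float

    def overall(self) -> float:
        # Unweighted mean over the five dimensions (an illustrative choice).
        return mean([getattr(self, f.name) for f in fields(self)])


def select_clips(clips, dimension, minimum):
    """Keep the clip ids whose score on `dimension` is at least `minimum`."""
    return [clip_id for clip_id, scores in clips.items()
            if getattr(scores, dimension) >= minimum]


# Placeholder values, not real AVE-2 annotations.
clips = {
    "clip_a": AlignmentScores(8.0, 6.0, 7.0, 9.0, 5.0),
    "clip_b": AlignmentScores(3.0, 4.0, 6.0, 2.0, 9.0),
}
visible_sources = select_clips(clips, "sound_source_visibility", 7.0)
```

Filtering on a single dimension, as in `select_clips`, is one way such per-dimension scores could be used to build subsets with a controlled degree of audio-visual alignment.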

2. SoundCLIP Framework

  • Token substitution approach: Replace CLIP's [CLS] token with audio tokens in LLaVA
  • Two alignment strategies:
    • Projected: MLP projection to CLIP space (maximizes I(A;V), better retrieval)
    • Raw: Padded audio features (preserves H(A|V), better generation)
  • Lightweight integration: Only 1.9M parameters for projection layer
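
The two alignment strategies can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the released implementation: the tiny dimensions, the two-layer ReLU MLP, the random weights, and all function names are hypothetical, and the layer sizes are not chosen to reproduce the 1.9M-parameter figure.

```python
import numpy as np

AUDIO_DIM = 8     # assumed audio-encoder feature width (illustrative)
CLIP_DIM = 16     # assumed CLIP token width (illustrative)
NUM_PATCHES = 4   # assumed number of patch tokens after [CLS]


def pad_audio(audio, clip_dim):
    """Raw strategy: zero-pad the audio feature up to the CLIP width."""
    if audio.shape[0] > clip_dim:
        raise ValueError("audio feature is wider than the CLIP token width")
    padded = np.zeros(clip_dim, dtype=audio.dtype)
    padded[: audio.shape[0]] = audio
    return padded


def project_audio(audio, w1, b1, w2, b2):
    """Projected strategy: a small MLP mapping audio features to CLIP space."""
    hidden = np.maximum(audio @ w1 + b1, 0.0)  # ReLU non-linearity
    return hidden @ w2 + b2


def substitute_cls(visual_tokens, audio_token):
    """Return a copy of the token sequence with position 0 ([CLS]) replaced."""
    out = visual_tokens.copy()
    out[0] = audio_token
    return out


rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(1 + NUM_PATCHES, CLIP_DIM))  # [CLS] + patches
audio_feature = rng.normal(size=AUDIO_DIM)

# Random stand-ins for trained projection weights.
w1, b1 = rng.normal(size=(AUDIO_DIM, CLIP_DIM)), np.zeros(CLIP_DIM)
w2, b2 = rng.normal(size=(CLIP_DIM, CLIP_DIM)), np.zeros(CLIP_DIM)

raw_sequence = substitute_cls(visual_tokens, pad_audio(audio_feature, CLIP_DIM))
projected_sequence = substitute_cls(
    visual_tokens, project_audio(audio_feature, w1, b1, w2, b2)
)
```

In both cases the sequence length and the patch tokens are unchanged, so the downstream language model would see the same input shape; only the first token carries audio information.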

3. Fundamental Trade-off Discovery

  • Retrieval vs. generation: a linear relationship, y = 0.163x + 11.867
  • Each percentage-point gain in retrieval incurs ~0.163% loss in generation quality
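
As a small worked example of the linear relationship above, the snippet below evaluates the line and the change in y for a given change in x. Only the slope and intercept come from the text; the function names are hypothetical, and the reading of x as retrieval performance and y as the associated generation cost is an assumption based on the bullet above.

```python
SLOPE = 0.163       # from the reported fit
INTERCEPT = 11.867  # from the reported fit


def fitted_y(x: float) -> float:
    """Evaluate the reported line y = 0.163x + 11.867 at x."""
    return SLOPE * x + INTERCEPT


def change_in_y(x_from: float, x_to: float) -> float:
    """Change in y implied by moving x from x_from to x_to."""
    return SLOPE * (x_to - x_from)


# A 10-point increase in x changes y by 10 * 0.163 = 1.63.
delta = change_in_y(20.0, 30.0)
```

Because the relationship is linear, the change in y depends only on the size of the change in x, not on where along the line it starts.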

Citation

If you use SoundCLIP or the AVE-2 dataset in your research, please cite our paper:

@article{vosoughi2025soundclip,
  title={Can Sound Replace Vision in LLaVA With Token Substitution?},
  author={Vosoughi, Ali and Bi, Jing and Liu, Pinxin and Tang, Yunlong and Xu, Chenliang},
  journal={ArXiv},
  year={2025}
}