🎧 audio-ai-hub

June 15, 2026

The hub for audio AI research. Curated papers, open models, benchmarks and datasets across audio LLMs · speech recognition · speech synthesis · music & audio generation.

129 entries · 11 categories · latest: 2026-06

👉 Browse the interactive hub → · Contribute · Suggest a paper

The page below is a quick snapshot. For search, filtering by category and sorting by stars or date, the live site is much faster than scrolling this README.


Top 8 by GitHub stars — refreshed weekly by .github/workflows/refresh-stars.yml.

| # | Project | Stars | What it does |
| --- | --- | --- | --- |
| 1 | Whisper | ⭐ 102k+ | Whisper is OpenAI's open-source speech recognition model trained on 680K hours of multilingual and multitask supervised data from the web. |
| 2 | MMS | ⭐ 32k+ | MMS (Massively Multilingual Speech) extends speech foundation models (wav2vec 2.0) to 1,107 languages for ASR and adds TTS and language i… |
| 3 | VoxCPM2 | ⭐ 29k+ | VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical… |
| 4 | MiniCPM-o | ⭐ 25k+ | MiniCPM-o 4.5 is OpenBMB's compact (8B-class) full-duplex omni-modal LLM supporting real-time vision, speech, and text interaction with l… |
| 5 | MusicGen | ⭐ 23k+ | MusicGen is Meta's single-stage autoregressive transformer for controllable text-conditioned music generation, operating over discrete En… |
| 6 | AudioGen | ⭐ 23k+ | AudioGen is a transformer-based autoregressive model for text-to-environmental-sound generation, trained on discrete audio tokens. |
| 7 | VALL-E | ⭐ 22k+ | VALL-E reframes text-to-speech as a conditional language modeling task over discrete audio codec tokens (EnCodec), enabling zero-shot voi… |
| 8 | CosyVoice 3 | ⭐ 21k+ | CosyVoice 3 scales the CosyVoice TTS stack with significantly larger pre-training data and a dedicated post-training stage, targeting in-… |

🆕 Recently added

The 10 most recent entries by date. See the interactive site for everything else.

  • 2026-06 · ACA-SER — A probing study testing whether instruction-following audio language models use explicit acoustic concept tokens (six interpretable cues derived from the eGeMAPS feature set: en…
  • 2026-06 · AVSR-Gen — Introduces MV2LRS3, a controlled unseen test set subsampled from MultiVSR to strictly match the acoustic, visual, and demographic distribution of LRS3, and shows that five state…
  • 2026-06 · Audio-Oscar — Audio-Oscar is a multi-agent framework that coordinates specialist agents (character and voice design, speech generation, fine-grained timeline planning, model selection, non-sp…
  • 2026-06 · CogAudio-LLM — CogAudio-LLM is a cognitive affective reasoning framework for audio language models that counters textual semantic dominance over acoustic nuance.
  • 2026-06 · DSFA — Proposes Domain-Shift Feature Augmentation (DSFA), which turns deterministic feature statistics into stochastic distributions during fine-tuning to simulate in-the-wild variatio…
  • 2026-06 · KIT-IWSLT2026 — KIT's cross-lingual voice cloning system for the IWSLT 2026 track, built on the multilingual TTS model FishAudio-S2-Pro.
  • 2026-06 · VoxCPM2 — VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical diffusion-autoregressive paradigm.
  • 2026-06 · dots.tts — dots.tts is a 2B-parameter continuous autoregressive TTS foundation model that models speech in a continuous latent space, combining an AudioVAE trained with multiple objectives…
  • 2026-05 · BEA-Dialogue+ — BEA-Dialogue+ is an expanded conversational Hungarian ASR corpus that relaxes the strictly speaker-disjoint split of BEA-Dialogue while preserving separation of the primary spea…
  • 2026-05 · Chatterbox-Flash — Chatterbox-Flash is a zero-shot TTS model created by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation withi…

📚 What's inside

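| Category | Entries |
| --- | --- |
| Model and Methods | 60 |
| Speech Recognition | 6 |
| Speech Synthesis | 13 |
| Audio Generation | 9 |
| Benchmark | 19 |
| Dataset Resource | 3 |
| Multimodal | 6 |
| Survey | 4 |
| Study | 3 |
| Safety | 3 |
| Chatbot | 3 |
| Total | 129 |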

Each row links into the live site with the corresponding category filter pre-applied.

🤝 Contributing

Add an items/<Abbreviation>.json file (template in schema.json), run python3 format_input.py to regenerate the README and site data, and open a PR. CI validates the JSON and checks that the README is in sync, and the site rebuilds automatically on merge. Full guide: CONTRIBUTING.md.
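Before opening a PR, it can help to confirm locally that every file under items/ at least parses as JSON. The sketch below is not the project's CI check and does not validate against schema.json. The helper name find_invalid_items is hypothetical, and the requirement that each file holds a single JSON object is an assumption about the item format.

```python
import json
from pathlib import Path


def find_invalid_items(items_dir):
    """Return (filename, message) pairs for item files that look malformed.

    Hypothetical helper: it only checks that each *.json file parses and,
    as an assumption about the item format, that the top level is an object.
    """
    problems = []
    for path in sorted(Path(items_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems.append((path.name, str(exc)))
            continue
        if not isinstance(data, dict):
            problems.append((path.name, "top-level value is not a JSON object"))
    return problems


if __name__ == "__main__":
    # Run from the repository root so that "items" resolves correctly.
    for name, message in find_invalid_items("items"):
        print(f"{name}: {message}")
```

If the script prints nothing, every item file parsed. Running python3 format_input.py afterwards is still needed to regenerate the README and site data.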

Don't want to write a PR yourself? Suggest a paper via the issue form and a maintainer will add it.

📑 Citation

If this hub is useful in your work, please cite it — metadata is in CITATION.cff (GitHub's "Cite this repository" button on the sidebar uses it).

🙏 Contributors

Thanks to zwenyu, Yuan-ManX, chaoweihuang, Liu-Tianchi, Sakshi113, hbwu-ntu, potsawee, czwxian, marianasignal, and many others who suggested entries or opened PRs.
