🎧 audio-ai-hub

June 15, 2026

The hub for audio AI research. Curated papers, open models, benchmarks and datasets across audio LLMs · speech recognition · speech synthesis · music & audio generation.

129 entries · 11 categories · latest: 2026-06

👉 Browse the interactive hub → · Contribute · Suggest a paper

The page below is a quick snapshot. For search, filtering by category and sorting by stars or date, the live site is much faster than scrolling this README.

⭐ Featured

Top 8 by GitHub stars — refreshed weekly by .github/workflows/refresh-stars.yml.

#	Project	Stars	What it does
1	Whisper	⭐ 102k+	Whisper is OpenAI's open-source speech recognition model trained on 680K hours of multilingual and multitask supervised data from the web.
2	MMS	⭐ 32k+	MMS (Massively Multilingual Speech) extends speech foundation models (wav2vec 2.0) to 1,107 languages for ASR and adds TTS and language i…
3	VoxCPM2	⭐ 29k+	VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical…
4	MiniCPM-o	⭐ 25k+	MiniCPM-o 4.5 is OpenBMB's compact (8B-class) full-duplex omni-modal LLM supporting real-time vision, speech, and text interaction with l…
5	MusicGen	⭐ 23k+	MusicGen is Meta's single-stage autoregressive transformer for controllable text-conditioned music generation, operating over discrete En…
6	AudioGen	⭐ 23k+	AudioGen is a transformer-based autoregressive model for text-to-environmental-sound generation, trained on discrete audio tokens.
7	VALL-E	⭐ 22k+	VALL-E reframes text-to-speech as a conditional language modeling task over discrete audio codec tokens (EnCodec), enabling zero-shot voi…
8	CosyVoice 3	⭐ 21k+	CosyVoice 3 scales the CosyVoice TTS stack with significantly larger pre-training data and a dedicated post-training stage, targeting in-…

🆕 Recently added

The 10 most recent entries by date. See the interactive site for everything else.

2026-06 · ACA-SER — A probing study testing whether instruction-following audio language models use explicit acoustic concept tokens (six interpretable cues derived from the eGeMAPS feature set: en…
2026-06 · AVSR-Gen — Introduces MV2LRS3, a controlled unseen test set subsampled from MultiVSR to strictly match the acoustic, visual, and demographic distribution of LRS3, and shows that five state…
2026-06 · Audio-Oscar — Audio-Oscar is a multi-agent framework that coordinates specialist agents (character and voice design, speech generation, fine-grained timeline planning, model selection, non-sp…
2026-06 · CogAudio-LLM — CogAudio-LLM is a cognitive affective reasoning framework for audio language models that counters textual semantic dominance over acoustic nuance.
2026-06 · DSFA — Proposes Domain-Shift Feature Augmentation (DSFA), which turns deterministic feature statistics into stochastic distributions during fine-tuning to simulate in-the-wild variatio…
2026-06 · KIT-IWSLT2026 — KIT's cross-lingual voice cloning system for the IWSLT 2026 track, built on the multilingual TTS model FishAudio-S2-Pro.
2026-06 · VoxCPM2 — VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical diffusion-autoregressive paradigm.
2026-06 · dots.tts — dots.tts is a 2B-parameter continuous autoregressive TTS foundation model that models speech in a continuous latent space, combining an AudioVAE trained with multiple objectives…
2026-05 · BEA-Dialogue+ — BEA-Dialogue+ is an expanded conversational Hungarian ASR corpus that relaxes the strictly speaker-disjoint split of BEA-Dialogue while preserving separation of the primary spea…
2026-05 · Chatterbox-Flash — Chatterbox-Flash is a zero-shot TTS model created by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation withi…

📚 What's inside

Category	Entries
Model and Methods	60
Speech Recognition	6
Speech Synthesis	13
Audio Generation	9
Benchmark	19
Dataset Resource	3
Multimodal	6
Survey	4
Study	3
Safety	3
Chatbot	3
Total	129

Each row links into the live site with the corresponding category filter pre-applied.

🤝 Contributing

To add an entry, create items/<Abbreviation>.json (template in schema.json), run python3 format_input.py to regenerate the README and site data, and open a PR. CI validates the JSON and checks that the README is in sync; the site rebuilds automatically on merge. Full guide: CONTRIBUTING.md.
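
Before opening a PR, it can help to confirm locally that every file under items/ at least parses as JSON. The snippet below is a minimal sketch of such a check, not the repository's CI or format_input.py. It assumes only that each entry is a single JSON object stored in its own file; the function name check_items is hypothetical, and validation against schema.json is left to CI.

```python
import json
from pathlib import Path


def check_items(items_dir):
    """Return (filename, problem) pairs for item files that fail basic checks.

    Hypothetical helper: it only verifies that each *.json file parses and
    that its top-level value is a JSON object. It does not apply schema.json.
    """
    problems = []
    for path in sorted(Path(items_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            # Record the parser's message and move on to the next file.
            problems.append((path.name, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(data, dict):
            problems.append((path.name, "top-level value is not a JSON object"))
    return problems
```

Calling check_items("items") from the repository root returns an empty list when every file passes these two checks; anything else is worth fixing before running python3 format_input.py.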

Don't want to write a PR yourself? Suggest a paper via the issue form and a maintainer will add it.

📑 Citation

If this hub is useful in your work, please cite it — metadata is in CITATION.cff (GitHub's "Cite this repository" button on the sidebar uses it).

🙏 Contributors

Thanks to zwenyu, Yuan-ManX, chaoweihuang, Liu-Tianchi, Sakshi113, hbwu-ntu, potsawee, czwxian, marianasignal, and many others who suggested entries or opened PRs.
