🎧 audio-ai-hub
June 15, 2026
The hub for audio AI research. Curated papers, open models, benchmarks and datasets across audio LLMs · speech recognition · speech synthesis · music & audio generation.
129 entries · 11 categories · latest: 2026-06
👉 Browse the interactive hub → · Contribute · Suggest a paper
The page below is a quick snapshot. For search, filtering by category and sorting by stars or date, the live site is much faster than scrolling this README.
⭐ Featured
Top 8 by GitHub stars — refreshed weekly by .github/workflows/refresh-stars.yml.
| # | Project | Stars | What it does |
|---|---|---|---|
| 1 | Whisper | ⭐ 102k+ | Whisper is OpenAI's open-source speech recognition model trained on 680K hours of multilingual and multitask supervised data from the web. |
| 2 | MMS | ⭐ 32k+ | MMS (Massively Multilingual Speech) extends speech foundation models (wav2vec 2.0) to 1,107 languages for ASR and adds TTS and language i… |
| 3 | VoxCPM2 | ⭐ 29k+ | VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical… |
| 4 | MiniCPM-o | ⭐ 25k+ | MiniCPM-o 4.5 is OpenBMB's compact (8B-class) full-duplex omni-modal LLM supporting real-time vision, speech, and text interaction with l… |
| 5 | MusicGen | ⭐ 23k+ | MusicGen is Meta's single-stage autoregressive transformer for controllable text-conditioned music generation, operating over discrete En… |
| 6 | AudioGen | ⭐ 23k+ | AudioGen is a transformer-based autoregressive model for text-to-environmental-sound generation, trained on discrete audio tokens. |
| 7 | VALL-E | ⭐ 22k+ | VALL-E reframes text-to-speech as a conditional language modeling task over discrete audio codec tokens (EnCodec), enabling zero-shot voi… |
| 8 | CosyVoice 3 | ⭐ 21k+ | CosyVoice 3 scales the CosyVoice TTS stack with significantly larger pre-training data and a dedicated post-training stage, targeting in-… |
🆕 Recently added
The 10 most recent entries by date. See the interactive site for everything else.
- 2026-06 · ACA-SER: A probing study testing whether instruction-following audio language models use explicit acoustic concept tokens (six interpretable cues derived from the eGeMAPS feature set: en…
- 2026-06 · AVSR-Gen: Introduces MV2LRS3, a controlled unseen test set subsampled from MultiVSR to strictly match the acoustic, visual, and demographic distribution of LRS3, and shows that five state…
- 2026-06 · Audio-Oscar: Audio-Oscar is a multi-agent framework that coordinates specialist agents (character and voice design, speech generation, fine-grained timeline planning, model selection, non-sp…
- 2026-06 · CogAudio-LLM: CogAudio-LLM is a cognitive affective reasoning framework for audio language models that counters textual semantic dominance over acoustic nuance.
- 2026-06 · DSFA: Proposes Domain-Shift Feature Augmentation (DSFA), which turns deterministic feature statistics into stochastic distributions during fine-tuning to simulate in-the-wild variatio…
- 2026-06 · KIT-IWSLT2026: KIT's cross-lingual voice cloning system for the IWSLT 2026 track, built on the multilingual TTS model FishAudio-S2-Pro.
- 2026-06 · VoxCPM2: VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical diffusion-autoregressive paradigm.
- 2026-06 · dots.tts: dots.tts is a 2B-parameter continuous autoregressive TTS foundation model that models speech in a continuous latent space, combining an AudioVAE trained with multiple objectives…
- 2026-05 · BEA-Dialogue+: BEA-Dialogue+ is an expanded conversational Hungarian ASR corpus that relaxes the strictly speaker-disjoint split of BEA-Dialogue while preserving separation of the primary spea…
- 2026-05 · Chatterbox-Flash: Chatterbox-Flash is a zero-shot TTS model created by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation withi…
📚 What's inside
| Category | Entries |
|---|---|
| Model and Methods | 60 |
| Speech Recognition | 6 |
| Speech Synthesis | 13 |
| Audio Generation | 9 |
| Benchmark | 19 |
| Dataset Resource | 3 |
| Multimodal | 6 |
| Survey | 4 |
| Study | 3 |
| Safety | 3 |
| Chatbot | 3 |
| Total | 129 |
Each row links into the live site with the corresponding category filter pre-applied.
🤝 Contributing
Add an `items/<Abbreviation>.json` file (template in `schema.json`), run `python3 format_input.py` to regenerate the README and site data, and open a PR. CI validates the JSON and checks that the README is in sync, and the site rebuilds automatically on merge. Full guide: CONTRIBUTING.md.
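Before opening a PR, it can help to confirm locally that every file under `items/` at least parses as JSON. The sketch below is not the repository's CI check; the helper name `find_invalid_items` is hypothetical, and the requirement that each item be a JSON object is an assumption. It does not validate against `schema.json`.

```python
import json
from pathlib import Path


def find_invalid_items(items_dir):
    """Return (filename, message) pairs for item files that fail basic checks.

    Hypothetical helper: it only checks that each *.json file parses and
    that its top-level value is an object (an assumption about item files).
    """
    problems = []
    for path in sorted(Path(items_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            # Record the parser's message and move on to the next file.
            problems.append((path.name, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(data, dict):
            problems.append((path.name, "top-level value is not an object"))
    return problems


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for name, message in find_invalid_items("items"):
        print(f"{name}: {message}")
```

Running it from the repository root prints one line per problem file and nothing when all files pass these basic checks.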
Don't want to write a PR yourself? Suggest a paper via the issue form and a maintainer will add it.
📑 Citation
If this hub is useful in your work, please cite it. The metadata is in CITATION.cff, which GitHub's "Cite this repository" button in the sidebar uses.
🙏 Contributors
Thanks to zwenyu, Yuan-ManX, chaoweihuang, Liu-Tianchi, Sakshi113, hbwu-ntu, potsawee, czwxian, marianasignal, and many others who suggested entries or opened PRs.