| 14.11 | Mustango: Toward Controllable Text-to-Music Generation | arXiv | GitHub | Hugging Face |
| 13.11 | Music ControlNet: Multiple Time-varying Controls for Music Generation | arXiv | - | - |
| 02.11 | E3 TTS: Easy End-to-End Diffusion-based Text to Speech | arXiv | - | - |
| 01.10 | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | arXiv | GitHub | - |
| 24.09 | VoiceLDM: Text-to-Speech with Environmental Context | arXiv | GitHub | - |
| 05.09 | PromptTTS 2: Describing and Generating Voices with Text Prompt | arXiv | - | - |
| 14.08 | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | arXiv | - | - |
| 10.08 | AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | arXiv | GitHub | Hugging Face |
| 09.08 | JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models | arXiv | - | - |
| 03.08 | MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies | arXiv | GitHub | - |
| 14.07 | Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts | arXiv | - | - |
| 10.07 | VampNet: Music Generation via Masked Acoustic Token Modeling | arXiv | GitHub | - |
| 22.06 | AudioPaLM: A Large Language Model That Can Speak and Listen | arXiv | - | - |
| 19.06 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | PDF | GitHub | - |
| 08.06 | MusicGen: Simple and Controllable Music Generation | arXiv | GitHub | Hugging Face, Colab |
| 06.06 | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | arXiv | - | - |
| 01.06 | Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | arXiv | GitHub | - |
| 29.05 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | arXiv | - | - |
| 25.05 | MeLoDy: Efficient Neural Music Generation | arXiv | - | - |
| 18.05 | CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training | arXiv | - | - |
| 18.05 | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | arXiv | GitHub | - |
| 16.05 | SoundStorm: Efficient Parallel Audio Generation | arXiv | GitHub (unofficial) | - |
| 03.05 | Diverse and Vivid Sound Generation from Text Descriptions | arXiv | - | - |
| 02.05 | Long-Term Rhythmic Video Soundtracker | arXiv | GitHub | - |
| 24.04 | TANGO: Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model | PDF | GitHub | Hugging Face |
| 18.04 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | arXiv | GitHub (unofficial) | - |
| 10.04 | Bark: Text-Prompted Generative Audio Model | - | GitHub | Hugging Face, Colab |
| 03.04 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | arXiv | - | - |
| 08.03 | VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | arXiv | - | - |
| 27.02 | I Hear Your True Colors: Image Guided Audio Generation | arXiv | GitHub | - |
| 08.02 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | arXiv | - | - |
| 04.02 | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | arXiv | GitHub | - |
| 30.01 | SingSong: Generating musical accompaniments from singing | arXiv | - | - |
| 30.01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | arXiv | GitHub | Hugging Face |
| 30.01 | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | arXiv | GitHub | - |
| 29.01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | PDF | - | - |
| 28.01 | Noise2Music | - | - | - |
| 27.01 | RAVE2 [Samples RAVE1] | arXiv | GitHub | - |
| 26.01 | MusicLM: Generating Music From Text | arXiv | GitHub (unofficial) | - |
| 18.01 | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | arXiv | GitHub | Hugging Face, Colab |
| 16.01 | ArchiSound: Audio Generation with Diffusion | arXiv | GitHub | - |
| 05.01 | VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | arXiv | GitHub (unofficial) (demo) | - |