Multimodal AI - Audio-Text-To-Text Modality (Resources, Notes)

December 8, 2025 · View on GitHub

Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.

Overview

This repository catalogs and analyzes a relatively small but significant subclassification of multimodal models: those with native audio support. Audio in addition to audio and text to text. another category of models which are potentially in scope include any-to-any: these are models which, as the name suggests, are built to handle any input and output pairing.

As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.

While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.

Hugging Face Task Classification Mapping

The focus of this resource list maps to these two tasks in Hugging Face's (current) classification system for AI tasks:

TaskDescriptionLinks
audio-text-to-textModels that accept audio + text input and produce text outputTask Models
any-to-anyOmni-modal models handling any input/output pairing (subsumes audio)Task Models

Hugging Face Resources

Audio Text To Text

ResourceLink
Task Overviewaudio-text-to-text
Models (Trending)Browse models
DatasetsBrowse datasets

Omni / All-Modality Multimodal

ResourceLink
Task Overviewany-to-any
Models (Trending)Browse models
DatasetsBrowse datasets

Repository Index

Core Documentation

DocumentDescription
models/index.mdComplete index of all audio multimodal models
models.mdFeatured open-source audio multimodal models with detailed profiles
companies.mdCompanies developing audio multimodal models (open source focus)
providers.mdOrganizations developing audio multimodal (open & closed source)
benchmarks.mdEvaluation frameworks and leaderboards
scope.mdDefinition of what "audio multimodal" means in this context

Notes & Research

LocationDescription
notes/Personal notes on nomenclature, parameters, and reference links
notes/nomenclature.mdTerminology and naming conventions
notes/parameters.mdModel parameter sizes for deployment planning
notes/ref.mdQuick reference links (HuggingFace task pages)

AI-Generated Analysis

The ask-ai/ directory contains AI-assisted research outputs:

DocumentDescription
ask-ai/prompt.mdThe prompt used to generate the analysis
ask-ai/outputs/models.mdComprehensive model list beyond featured models
ask-ai/outputs/nomenclature.mdTerminology analysis across vendors and research
ask-ai/outputs/benchmarks.mdExtended benchmark coverage by workflow type
ask-ai/outputs/pros-cons.mdComparison of STT vs pipeline vs multimodal approaches
ask-ai/outputs/redundancy-analysis.mdWill multimodal ASR make traditional STT redundant?
ask-ai/outputs/ecosystem.mdEcosystem overview and emerging trends

Data

LocationDescription
data/Raw exports from Hugging Face API (CSV/JSON)
DocumentDescription
resource-lists.mdCurated awesome-lists for multimodal AI
models-hf.mdGitHub repositories for audio multimodal models
papers.mdResearch papers and academic resources
tooling.mdData pipeline and processing tools
eval-tools.mdEvaluation frameworks and test prompts
inference-tools.mdTools for running inference at scale
demos-and-starters.mdExample implementations and starter projects
github-tags.mdGitHub topic pages for discovery

Evaluations & Benchmarking

A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.

LocationDescription
evaluations/README.mdEvaluation framework overview and methodology
evaluations/test-prompts/Complete test prompt library

Test Prompt Categories

Human-Authored Prompts (by-daniel/):

PromptTests
accent-identification.mdRegional accent detection with grounded examples
guess-my-mood.mdEmotional analysis, fatigue detection, word-tone dissonance
non-verbal-context.mdMulti-speaker interpersonal dynamics, pauses as communication
parameters.mdVocal frequency analysis for audio engineering (EQ recommendations)
who-is-this.mdSpeaker identification/recognition

AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.


Why Audio Multimodal Matters

Classic STT vs. Audio Multimodal

The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.

Audio multimodal models are fundamentally different:

  • Native audio understanding: Process audio tokens directly alongside text prompts
  • Unified inference: Single API call handles transcription, formatting, and summarization
  • Prompt-guided processing: Can be instructed to analyze accents, describe voices, or format output

Practical Advantages

Instead of chaining: Whisper → GPT-4 → Formatting

Audio multimodal enables: Single API call with system prompt → Formatted output

Use cases:

  • Voice journals with structured formatting
  • Conference call summarization
  • Accent/voice analysis
  • Long-form audio processing (tested with 1-hour recordings)

See models/ for detailed profiles:

Any-to-Any (Omni-Modal)

ModelDeveloperParametersLicense
Qwen OmniAlibaba7B-35BApache 2.0
Gemma 3nGoogle2B-4B effectiveGemma
Macaw-LLMChenyang Lyu et al.7B-13BApache 2.0

Audio-Text-to-Text

ModelDeveloperParametersLicense
Audio Flamingo 3NVIDIA8BNon-commercial
BuboGPTByteDance7B-13BBSD 3-Clause
Kimi-AudioMoonshot AI10BMIT/Apache 2.0
OmniAudioNexaAI2.6BApache 2.0
Phi-4-MultimodalMicrosoft5.6BMIT
Qwen2-AudioAlibaba8BApache 2.0
SALMONNByteDance/Tsinghua7B-13BApache 2.0
SoundwaveFreedomIntelligence9BApache 2.0
Step-Audio-ChatStepFun130BApache 2.0
Step-Audio-R1StepFun33BApache 2.0
UltravoxFixie.ai8B-70BMIT
VoxtralMistral AI5B-24BApache 2.0

Providers

See providers.md for the full list, or companies.md for a company-to-models mapping:

  • Open Source: Alibaba, ByteDance, Fixie.ai, FreedomIntelligence, Google DeepMind, Microsoft, Mistral AI, Moonshot AI, NexaAI, NVIDIA, StepFun
  • Closed Source: Google (Gemini), OpenAI (GPT-4o), Anthropic (Claude), Reka AI

Benchmarks

See benchmarks.md for full coverage of evaluation frameworks and leaderboards.

BenchmarkDeveloperFocusLinks
MSEBGoogle ResearchSound embedding evaluationGitHub · Blog
UltraEval-AudioOpenBMBSpeech understanding & generationGitHub
lmms-evalEvolvingLMMs Lab100+ multimodal tasksGitHub
VERSAWavLab Speech90+ speech/audio metricsGitHub
AudioBenchAudioLLMsComprehensive audio LLMGitHub · Leaderboard

Leaderboards: AudioBench · Open ASR


External Resources


Future of Voice AI

Audio multimodal represents what may be the successor to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference process—without the complexity of VAD, punctuation restoration, and post-processing chains—makes this an elegant and powerful approach to voice AI.

Updates

This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.


Created: December 7, 2025 | Updated: December 8, 2025