Multimodal AI - Audio-Text-To-Text Modality (Resources, Notes)

December 8, 2025 · View on GitHub

Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.

Overview

This repository catalogs and analyzes a relatively small but significant subclassification of multimodal models: those with native audio support. Audio in addition to audio and text to text. another category of models which are potentially in scope include any-to-any: these are models which, as the name suggests, are built to handle any input and output pairing.

As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.

While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.

Hugging Face Task Classification Mapping

The focus of this resource list maps to these two tasks in Hugging Face's (current) classification system for AI tasks:

Task	Description	Links
audio-text-to-text	Models that accept audio + text input and produce text output
any-to-any	Omni-modal models handling any input/output pairing (subsumes audio)

Hugging Face Resources

Audio Text To Text

Resource	Link
Task Overview	audio-text-to-text
Models (Trending)	Browse models
Datasets	Browse datasets

Omni / All-Modality Multimodal

Resource	Link
Task Overview	any-to-any
Models (Trending)	Browse models
Datasets	Browse datasets

Repository Index

Core Documentation

Document	Description
models/index.md	Complete index of all audio multimodal models
models.md	Featured open-source audio multimodal models with detailed profiles
companies.md	Companies developing audio multimodal models (open source focus)
providers.md	Organizations developing audio multimodal (open & closed source)
benchmarks.md	Evaluation frameworks and leaderboards
scope.md	Definition of what "audio multimodal" means in this context

Notes & Research

Location	Description
notes/	Personal notes on nomenclature, parameters, and reference links
notes/nomenclature.md	Terminology and naming conventions
notes/parameters.md	Model parameter sizes for deployment planning
notes/ref.md	Quick reference links (HuggingFace task pages)

AI-Generated Analysis

The ask-ai/ directory contains AI-assisted research outputs:

Document	Description
ask-ai/prompt.md	The prompt used to generate the analysis
ask-ai/outputs/models.md	Comprehensive model list beyond featured models
ask-ai/outputs/nomenclature.md	Terminology analysis across vendors and research
ask-ai/outputs/benchmarks.md	Extended benchmark coverage by workflow type
ask-ai/outputs/pros-cons.md	Comparison of STT vs pipeline vs multimodal approaches
ask-ai/outputs/redundancy-analysis.md	Will multimodal ASR make traditional STT redundant?
ask-ai/outputs/ecosystem.md	Ecosystem overview and emerging trends

Data

Location	Description
data/	Raw exports from Hugging Face API (CSV/JSON)

Resources & Links

Document	Description
resource-lists.md	Curated awesome-lists for multimodal AI
models-hf.md	GitHub repositories for audio multimodal models
papers.md	Research papers and academic resources
tooling.md	Data pipeline and processing tools
eval-tools.md	Evaluation frameworks and test prompts
inference-tools.md	Tools for running inference at scale
demos-and-starters.md	Example implementations and starter projects
github-tags.md	GitHub topic pages for discovery

Evaluations & Benchmarking

A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.

Location	Description
evaluations/README.md	Evaluation framework overview and methodology
evaluations/test-prompts/	Complete test prompt library

Test Prompt Categories

Human-Authored Prompts (by-daniel/):

Prompt	Tests
accent-identification.md	Regional accent detection with grounded examples
guess-my-mood.md	Emotional analysis, fatigue detection, word-tone dissonance
non-verbal-context.md	Multi-speaker interpersonal dynamics, pauses as communication
parameters.md	Vocal frequency analysis for audio engineering (EQ recommendations)
who-is-this.md	Speaker identification/recognition

AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.

Why Audio Multimodal Matters

Classic STT vs. Audio Multimodal

The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.

Audio multimodal models are fundamentally different:

Native audio understanding: Process audio tokens directly alongside text prompts
Unified inference: Single API call handles transcription, formatting, and summarization
Prompt-guided processing: Can be instructed to analyze accents, describe voices, or format output

Practical Advantages

Instead of chaining: Whisper → GPT-4 → Formatting

Audio multimodal enables: Single API call with system prompt → Formatted output

Use cases:

Voice journals with structured formatting
Conference call summarization
Accent/voice analysis
Long-form audio processing (tested with 1-hour recordings)

Featured Models

See models/ for detailed profiles:

Model	Developer	Parameters	License
Qwen Omni	Alibaba	7B-35B	Apache 2.0
Gemma 3n	Google	2B-4B effective	Gemma
Macaw-LLM	Chenyang Lyu et al.	7B-13B	Apache 2.0

Audio-Text-to-Text

Model	Developer	Parameters	License
Audio Flamingo 3	NVIDIA	8B	Non-commercial
BuboGPT	ByteDance	7B-13B	BSD 3-Clause
Kimi-Audio	Moonshot AI	10B	MIT/Apache 2.0
OmniAudio	NexaAI	2.6B	Apache 2.0
Phi-4-Multimodal	Microsoft	5.6B	MIT
Qwen2-Audio	Alibaba	8B	Apache 2.0
SALMONN	ByteDance/Tsinghua	7B-13B	Apache 2.0
Soundwave	FreedomIntelligence	9B	Apache 2.0
Step-Audio-Chat	StepFun	130B	Apache 2.0
Step-Audio-R1	StepFun	33B	Apache 2.0
Ultravox	Fixie.ai	8B-70B	MIT
Voxtral	Mistral AI	5B-24B	Apache 2.0

Providers

See providers.md for the full list, or companies.md for a company-to-models mapping:

Open Source: Alibaba, ByteDance, Fixie.ai, FreedomIntelligence, Google DeepMind, Microsoft, Mistral AI, Moonshot AI, NexaAI, NVIDIA, StepFun
Closed Source: Google (Gemini), OpenAI (GPT-4o), Anthropic (Claude), Reka AI

Benchmarks

See benchmarks.md for full coverage of evaluation frameworks and leaderboards.

Benchmark	Developer	Focus	Links
MSEB	Google Research	Sound embedding evaluation	GitHub · Blog
UltraEval-Audio	OpenBMB	Speech understanding & generation	GitHub
lmms-eval	EvolvingLMMs Lab	100+ multimodal tasks	GitHub
VERSA	WavLab Speech	90+ speech/audio metrics	GitHub
AudioBench	AudioLLMs	Comprehensive audio LLM	GitHub · Leaderboard

Leaderboards: AudioBench · Open ASR

External Resources

Awesome-Audio-LLM - Curated list of audio LLM research
Hugging Face audio-text-to-text models - Browse latest models
Hugging Face ASR models - Traditional ASR for comparison

Future of Voice AI

Audio multimodal represents what may be the successor to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference process—without the complexity of VAD, punctuation restoration, and post-processing chains—makes this an elegant and powerful approach to voice AI.

Updates

This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.

Created: December 7, 2025 | Updated: December 8, 2025