MOSS-Audio

April 20, 2026


English | 简体中文

MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

News

  • 2026.4.20: We have added the MOSS-Audio fine-tuning code and documentation. See finetune/FINETUNE.md for LoRA and full-parameter training examples.
  • 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!

Introduction

Understanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.

  • Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
  • Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
  • Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
  • Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
  • Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

Model Architecture

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.

Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.
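Schematically, the encoder → adapter → LLM pipeline can be sketched in PyTorch as follows. The hidden sizes and the single-linear adapter here are illustrative placeholders, not the released configuration:

```python
# Minimal sketch of the three-component layout described above.
# Hidden sizes are illustrative, not the released configuration.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Projects 12.5 Hz encoder features into the LLM embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

# 10 s of audio at 12.5 Hz -> 125 frames of encoder features
enc_dim, llm_dim = 1024, 2560          # illustrative sizes
feats = torch.randn(1, 125, enc_dim)   # stand-in for MOSS-Audio-Encoder output
adapter = AudioAdapter(enc_dim, llm_dim)
audio_embeds = adapter(feats)          # ready to be consumed by the LLM
print(audio_embeds.shape)              # torch.Size([1, 125, 2560])
```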

DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.

This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.
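A minimal sketch of the idea, assuming additive fusion and arbitrary tap layers (the released implementation may select layers and fuse features differently):

```python
# Hedged sketch of DeepStack-style cross-layer injection. The tap layers
# and the additive fusion rule are assumptions for illustration only.
import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    """Projects selected encoder layers and adds them into early LLM layers."""
    def __init__(self, enc_dim: int, llm_dim: int, tap_layers=(4, 8, 12)):
        super().__init__()
        self.tap_layers = tap_layers
        # One independent projection per tapped encoder layer
        self.projs = nn.ModuleList(nn.Linear(enc_dim, llm_dim) for _ in tap_layers)

    def forward(self, all_enc_states, llm_early_states):
        # all_enc_states: list of (B, T, enc_dim), one per encoder layer
        # llm_early_states: list of (B, T, llm_dim), the LLM's first layers
        fused = []
        for (tap, proj), h in zip(zip(self.tap_layers, self.projs), llm_early_states):
            fused.append(h + proj(all_enc_states[tap]))  # inject per early layer
        return fused

enc_dim, llm_dim, T = 512, 1024, 125
enc_states = [torch.randn(1, T, enc_dim) for _ in range(13)]
llm_states = [torch.randn(1, T, llm_dim) for _ in range(3)]
out = DeepStackInjector(enc_dim, llm_dim)(enc_states, llm_states)
print(len(out), out[0].shape)  # 3 torch.Size([1, 125, 1024])
```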

Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
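The insertion strategy can be illustrated as follows; the token format, the 2-second marker interval, and the frame placeholders are assumptions made for this sketch, not the released tokenizer:

```python
# Illustrative time-marker insertion: interleave explicit time tokens with
# audio-frame placeholders at a fixed interval. The token spelling and the
# 2 s interval are assumptions; only the 12.5 Hz rate comes from the text.
FRAME_RATE_HZ = 12.5
MARKER_EVERY_S = 2.0
frames_per_marker = int(FRAME_RATE_HZ * MARKER_EVERY_S)  # 25 frames

def insert_time_markers(num_frames: int) -> list[str]:
    seq = []
    for i in range(num_frames):
        if i % frames_per_marker == 0:
            # Explicit time token indicating the temporal position
            seq.append(f"<|time:{i / FRAME_RATE_HZ:.1f}s|>")
        seq.append(f"<frame_{i}>")
    return seq

seq = insert_time_markers(60)  # 4.8 s of audio
print(seq[:3])   # ['<|time:0.0s|>', '<frame_0>', '<frame_1>']
print(seq[26])   # <|time:2.0s|>  (marker after 25 frames)
```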

Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face | ModelScope |
| --- | --- | --- | --- | --- | --- |
| MOSS-Audio-4B-Instruct | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face | ModelScope |
| MOSS-Audio-4B-Thinking | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face | ModelScope |
| MOSS-Audio-8B-Instruct | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face | ModelScope |
| MOSS-Audio-8B-Thinking | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face | ModelScope |

More model families, sizes, and variants will be released in the future. Stay tuned!

Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

  • General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08, with 77.33 on MMAU, 64.92 on MMAU-Pro, 66.53 on MMAR, and 75.52 on MMSU, outperforming all open-source models.
  • Speech Captioning: MOSS-Audio-Instruct variants lead across 11 out of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
  • ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
  • Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech (lower is better), far ahead of Qwen3-Omni (833.66) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.
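For reference, CER (the metric in the ASR results above) is the character-level edit distance divided by the reference length. A toy implementation for sanity checks, not the benchmarks' official scorer:

```python
# CER as commonly defined: Levenshtein distance over characters,
# normalized by the reference length. Toy scorer, not the official one.
def cer(ref: str, hyp: str) -> float:
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # rolling row of the edit-distance DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / m

print(round(cer("hello world", "helo world"), 4))  # 0.0909 (1 edit / 11 chars)
```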

General Audio Understanding (Accuracy↑)

| Model | Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| *Open Source (small)* | | | | | | |
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.33 | 64.92 | 66.53 | 75.52 | 71.08 |
| *Open Source (large)* | | | | | | |
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| *Closed Source* | | | | | | |
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |

Speech Captioning (LLM-as-a-Judge Score↑)

| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |

ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Env. (Clean) | Acoustic Env. (Noisy) | Whisper | Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |
Detailed ASR Results

Per-dataset results covering the 12 evaluation dimensions above: Acoustic Environment (Clean/Noisy), Whisper, Far-Field / Near-Field, Multi-Speaker, Age, Health Condition, Semantic Content, Code-Switching, Dialect, Singing, and Non-Speech Vocalizations. Subset results are separated by "/".

| Model | AISHELL-1 (test) | AISHELL-2 (Android/IOS/Mic) | THCHS-30 (test) | MAGICDATA-READ (test) | AISHELL6-Whisper (normal/whisper) | AliMeeting (Test_Ali_far/Test_Ali_near) | AISHELL-4 (test) | SeniorTalk (sentence) | ChildMandarin (test) | AISHELL-6A (mild/moderate/severe/StutteringSpeech) | AISHELL_6B (LRDWWS/Uncontrol) | WenetSpeech (test-meeting) | Fleurs (cmn_hans_cn) | CS-Dialogue (test) | TALCS (test) | ASCEND (test) | KeSpeech (test) | WSYue-ASR-eval (short) | MIR-1K (test) | openc-pop (test) | MNV_17 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paraformer-Large | 1.98 | 3.28/3.21/3.00 | 4.07 | 4.67 | 1.11/8.92 | 25.64/9.27 | 20.33 | 17.31 | 12.60 | 6.98/9.30/13.34/10.74 | 47.59/45.08 | 7.88 | 6.40 | 10.64 | 10.77 | 16.55 | 11.48 | 75.42 | 57.70 | 6.98 | 4.95 |
| GLM-ASR-Nano | 2.89 | 3.75/3.73/3.78 | 4.23 | 5.02 | 0.83/9.06 | 40.27/14.76 | 28.02 | 20.33 | 14.06 | 8.74/12.11/14.38/12.29 | 50.34/49.09 | 9.70 | 4.94 | 11.06 | 11.07 | 13.50 | 9.72 | 35.07 | 95.87 | 8.03 | 4.65 |
| Fun-ASR-Nano | 2.16 | 3.04/2.99/3.07 | 3.65 | 3.46 | 0.81/6.76 | 27.21/9.55 | 19.82 | 16.96 | 12.94 | 6.60/8.81/12.98/10.30 | 47.42/45.84 | 7.39 | 4.76 | 10.47 | 8.09 | 15.13 | 7.43 | 8.17 | 35.85 | 2.84 | 4.76 |
| SenseVoice-Small | 3.23 | 4.16/4.02/3.96 | 5.26 | 4.93 | 1.25/9.88 | 37.01/16.31 | 24.06 | 21.07 | 14.18 | 7.62/9.85/14.39/11.47 | 52.92/47.97 | 8.35 | 6.75 | 12.81 | 10.52 | 18.38 | 10.45 | 7.34 | 39.51 | 8.07 | 4.92 |
| Kimi-Audio-7B-Instruct | 0.79 | 2.91/3.03/2.88 | 1.39 | 2.15 | 0.69/4.63 | 28.22/13.82 | 20.61 | 19.70 | 13.79 | 7.00/9.34/12.56/10.75 | 44.44/42.57 | 7.15 | 5.10 | 14.56 | 12.74 | 21.83 | 5.51 | 53.17 | 38.35 | 5.17 | 4.68 |
| Qwen2.5-Omni-3B | 1.51 | 3.10/2.94/2.93 | 3.32 | 3.56 | 0.82/7.82 | 32.14/12.16 | 22.91 | 17.38 | 12.96 | 6.87/10.55/14.57/11.33 | 54.54/50.03 | 9.04 | 5.45 | 10.78 | 10.94 | 13.25 | 7.67 | 60.06 | 45.00 | 3.47 | 5.54 |
| Qwen2.5-Omni-7B | 1.16 | 2.88/2.77/2.73 | 3.06 | 3.16 | 0.71/6.57 | 32.03/18.73 | 21.01 | 19.96 | 12.29 | 7.27/10.94/12.92/10.53 | 51.99/49.45 | 8.43 | 5.13 | 14.02 | 10.46 | 14.42 | 6.40 | 57.43 | 42.62 | 2.75 | 4.56 |
| Qwen3-Omni-30B-A3B-Instruct | 0.95 | 2.70/2.72/2.57 | 2.21 | 2.47 | 0.59/3.22 | 25.72/8.44 | 18.15 | 14.13 | 8.79 | 6.20/8.88/11.59/10.25 | 45.80/41.65 | 6.64 | 4.84 | 12.94 | 8.33 | 12.64 | 5.87 | 25.39 | 30.81 | 1.21 | 4.73 |
| MOSS-Audio-4B-Instruct | 2.26 | 3.22/3.20/3.33 | 3.53 | 3.72 | 0.73/5.86 | 27.27/9.68 | 20.33 | 16.93 | 13.25 | 6.36/9.77/12.68/10.28 | 43.35/44.25 | 8.17 | 8.13 | 9.14 | 8.37 | 12.83 | 14.65 | 9.04 | 18.47 | 3.10 | 4.01 |
| MOSS-Audio-8B-Instruct | 1.82 | 2.97/2.95/2.91 | 2.82 | 3.20 | 0.69/4.80 | 36.82/11.25 | 24.36 | 17.42 | 13.10 | 5.84/8.94/11.52/9.72 | 39.76/39.27 | 7.86 | 7.52 | 9.07 | 8.22 | 13.26 | 9.18 | 8.33 | 17.24 | 2.39 | 4.31 |

Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
| --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |

Quickstart

Environment Setup

We recommend Python 3.12 in a clean Conda environment. The commands below are sufficient for local inference.

```bash
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```

Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```

Basic Usage

Download the model first:

```bash
hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct --local-dir ./weights/MOSS-Audio-4B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-4B-Thinking --local-dir ./weights/MOSS-Audio-4B-Thinking
hf download OpenMOSS-Team/MOSS-Audio-8B-Instruct --local-dir ./weights/MOSS-Audio-8B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-8B-Thinking --local-dir ./weights/MOSS-Audio-8B-Thinking
```

Then edit `MODEL_PATH` / `AUDIO_PATH` in `infer.py` as needed, and run:

```bash
python infer.py
```

The default prompt in `infer.py` is `Describe this audio.` You can directly edit that line if you want to try transcription, audio QA, or speech captioning.

Fine-tuning

We now provide an official fine-tuning script in finetune/finetune.py, with full instructions in finetune/FINETUNE.md.

Install the extra dependencies needed for training:

```bash
pip install librosa peft
```

Minimal example for LoRA fine-tuning:

```bash
accelerate launch finetune/finetune.py \
    --model_dir ./weights/MOSS-Audio-4B-Instruct \
    --data_path train.jsonl \
    --output_dir ./output/lora \
    --use_lora \
    --bf16
```

The training data should be a JSONL file containing audio-text conversations. For data format, supported arguments, multi-GPU examples, and full-parameter fine-tuning, see finetune/FINETUNE.md.
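For illustration, one record might be written like this. The exact schema (field names, message roles, the `<audio>` placeholder) is an assumption in this sketch; consult finetune/FINETUNE.md for the authoritative format:

```python
# Writes one illustrative training record to train.jsonl.
# The field names and roles below are assumptions, not the official schema.
import json

record = {
    "audio": "data/sample_0001.wav",  # hypothetical path to the audio clip
    "conversations": [
        {"role": "user", "content": "<audio>Describe this audio."},
        {"role": "assistant", "content": "A woman speaks calmly over light rain."},
    ],
}
# JSONL = one JSON object per line; append more records the same way.
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

print(json.loads(open("train.jsonl", encoding="utf-8").readline())["audio"])
```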

Gradio App

Start the Gradio demo with:

```bash
python app.py
```

SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.

The shortest setup is:

```bash
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```

If you use the default `torch==2.9.1+cu128` runtime, installing `nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.

More Information

LICENSE

Models in MOSS-Audio are licensed under the Apache License 2.0.

Citation

```bibtex
@misc{mossaudio2026,
      title={MOSS-Audio Technical Report},
      author={OpenMOSS Team},
      year={2026},
      howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
      note={GitHub repository}
}
```

Star History

Star History Chart