MOSS-Audio
April 20, 2026
MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.
News
- 2026.4.20: We have added the MOSS-Audio fine-tuning code and documentation. See finetune/FINETUNE.md for LoRA and full-parameter training examples.
- 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!
Introduction
Understanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.
- Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
- Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
- Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
- Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
- Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
- Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
- Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.
Model Architecture
MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.
Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.
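The encoder → adapter → LLM pipeline above can be sketched in terms of the shapes flowing through it. Only the 12.5 Hz frame rate comes from the description; the encoder and LLM widths below are hypothetical placeholders, not the released models' actual dimensions.

```python
# Shape-level sketch of the MOSS-Audio pipeline.
# Hypothetical widths; only FRAME_RATE_HZ is stated in the text above.
ENCODER_DIM = 1024      # assumed encoder output width
LLM_DIM = 3584          # assumed LLM embedding width
FRAME_RATE_HZ = 12.5    # continuous representations per second (stated)

def encoder_frames(duration_s: float) -> int:
    """Number of temporal representations the encoder emits for a clip."""
    return int(duration_s * FRAME_RATE_HZ)

def adapter_shape(num_frames: int) -> tuple[int, int]:
    """The modality adapter projects (frames, ENCODER_DIM) -> (frames, LLM_DIM),
    after which the LLM consumes the frames for auto-regressive text generation."""
    return (num_frames, LLM_DIM)

# A 10-second clip yields 125 frames, each projected into the LLM space.
frames = encoder_frames(10.0)
print(frames, adapter_shape(frames))  # 125 (125, 3584)
```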
DeepStack Cross-Layer Feature Injection
Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.
This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.
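The injection idea can be sketched with toy lists standing in for feature tensors. The tapped layer indices, the per-layer projections, and the additive merge into early LLM hidden states are all illustrative assumptions, not the released implementation.

```python
# Toy sketch of DeepStack-style cross-layer injection: features from several
# encoder depths are independently projected and merged into the LLM's early
# hidden states. Layer choices and the additive merge are assumptions.
def project(features, scale):
    # Stand-in for a learned per-layer linear projection.
    return [x * scale for x in features]

def deepstack_inject(encoder_layers, llm_hidden, tap_layers=(4, 12, 23)):
    """Add projected features from selected encoder layers into llm_hidden,
    preserving low-, mid-, and high-level information side by side."""
    out = list(llm_hidden)
    for i, layer_idx in enumerate(tap_layers):
        projected = project(encoder_layers[layer_idx], scale=0.1 * (i + 1))
        out = [h + p for h, p in zip(out, projected)]
    return out

# Toy run: 24 encoder "layers", each a 4-frame feature vector.
enc = [[float(layer)] * 4 for layer in range(24)]
hidden = deepstack_inject(enc, llm_hidden=[0.0] * 4)
print(hidden)
```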
Time-Aware Representation
Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
Released Models
This release includes four models, all sharing the same architecture:

- MOSS-Audio-4B-Instruct
- MOSS-Audio-4B-Thinking
- MOSS-Audio-8B-Instruct
- MOSS-Audio-8B-Thinking

More model families, sizes, and variants will be released in the future. Stay tuned!
Evaluation
We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:
- General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08, with 77.33 on MMAU, 64.92 on MMAU-Pro, 66.53 on MMAR, and 75.52 on MMSU, outperforming all open-source models.
- Speech Captioning: MOSS-Audio-Instruct variants lead across 11 out of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
- ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
- Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech, dramatically outperforming Qwen3-Omni (833.66) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.
General Audio Understanding (Accuracy↑)
| Model | Model Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
|---|---|---|---|---|---|---|
| Open Source (small) | ||||||
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.33 | 64.92 | 66.53 | 75.52 | 71.08 |
| Open Source (large) | ||||||
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| Closed Source | ||||||
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |
Speech Captioning (LLM-as-a-Judge Score↑)
| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |
ASR
| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Environment (Clean) | Acoustic Environment (Noisy) | Acoustic Characteristics: Whisper | Acoustic Characteristics: Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |
Detailed ASR Results
Each column below is a single test set; the per-dimension scores in the summary table above are averages over the corresponding test sets (e.g. Acoustic Environment (Clean) averages AISHELL-1, AISHELL-2, and THCHS-30).

| Model | AISHELL-1 test | AISHELL-2 Android | AISHELL-2 IOS | AISHELL-2 Mic | THCHS-30 test | MAGICDATA-READ test | AISHELL6-Whisper normal | AISHELL6-Whisper whisper | AliMeeting Test_Ali_far | AliMeeting Test_Ali_near | AISHELL-4 test | SeniorTalk sentence | ChildMandarin test | AISHELL-6A mild | AISHELL-6A moderate | AISHELL-6A severe | StutteringSpeech | AISHELL_6B LRDWWS | AISHELL_6B Uncontrol | WenetSpeech test-meeting | Fleurs cmn_hans_cn | CS-Dialogue test | TALCS test | ASCEND test | KeSpeech test | WSYue-ASR-eval short | MIR-1K test | openc-pop test | MNV_17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 1.98 | 3.28 | 3.21 | 3.00 | 4.07 | 4.67 | 1.11 | 8.92 | 25.64 | 9.27 | 20.33 | 17.31 | 12.60 | 6.98 | 9.30 | 13.34 | 10.74 | 47.59 | 45.08 | 7.88 | 6.40 | 10.64 | 10.77 | 16.55 | 11.48 | 75.42 | 57.70 | 6.98 | 4.95 |
| GLM-ASR-Nano | 2.89 | 3.75 | 3.73 | 3.78 | 4.23 | 5.02 | 0.83 | 9.06 | 40.27 | 14.76 | 28.02 | 20.33 | 14.06 | 8.74 | 12.11 | 14.38 | 12.29 | 50.34 | 49.09 | 9.70 | 4.94 | 11.06 | 11.07 | 13.50 | 9.72 | 35.07 | 95.87 | 8.03 | 4.65 |
| Fun-ASR-Nano | 2.16 | 3.04 | 2.99 | 3.07 | 3.65 | 3.46 | 0.81 | 6.76 | 27.21 | 9.55 | 19.82 | 16.96 | 12.94 | 6.60 | 8.81 | 12.98 | 10.30 | 47.42 | 45.84 | 7.39 | 4.76 | 10.47 | 8.09 | 15.13 | 7.43 | 8.17 | 35.85 | 2.84 | 4.76 |
| SenseVoice-Small | 3.23 | 4.16 | 4.02 | 3.96 | 5.26 | 4.93 | 1.25 | 9.88 | 37.01 | 16.31 | 24.06 | 21.07 | 14.18 | 7.62 | 9.85 | 14.39 | 11.47 | 52.92 | 47.97 | 8.35 | 6.75 | 12.81 | 10.52 | 18.38 | 10.45 | 7.34 | 39.51 | 8.07 | 4.92 |
| Kimi-Audio-7B-Instruct | 0.79 | 2.91 | 3.03 | 2.88 | 1.39 | 2.15 | 0.69 | 4.63 | 28.22 | 13.82 | 20.61 | 19.70 | 13.79 | 7.00 | 9.34 | 12.56 | 10.75 | 44.44 | 42.57 | 7.15 | 5.10 | 14.56 | 12.74 | 21.83 | 5.51 | 53.17 | 38.35 | 5.17 | 4.68 |
| Qwen2.5-Omni-3B | 1.51 | 3.10 | 2.94 | 2.93 | 3.32 | 3.56 | 0.82 | 7.82 | 32.14 | 12.16 | 22.91 | 17.38 | 12.96 | 6.87 | 10.55 | 14.57 | 11.33 | 54.54 | 50.03 | 9.04 | 5.45 | 10.78 | 10.94 | 13.25 | 7.67 | 60.06 | 45.00 | 3.47 | 5.54 |
| Qwen2.5-Omni-7B | 1.16 | 2.88 | 2.77 | 2.73 | 3.06 | 3.16 | 0.71 | 6.57 | 32.03 | 18.73 | 21.01 | 19.96 | 12.29 | 7.27 | 10.94 | 12.92 | 10.53 | 51.99 | 49.45 | 8.43 | 5.13 | 14.02 | 10.46 | 14.42 | 6.40 | 57.43 | 42.62 | 2.75 | 4.56 |
| Qwen3-Omni-30B-A3B-Instruct | 0.95 | 2.70 | 2.72 | 2.57 | 2.21 | 2.47 | 0.59 | 3.22 | 25.72 | 8.44 | 18.15 | 14.13 | 8.79 | 6.20 | 8.88 | 11.59 | 10.25 | 45.80 | 41.65 | 6.64 | 4.84 | 12.94 | 8.33 | 12.64 | 5.87 | 25.39 | 30.81 | 1.21 | 4.73 |
| MOSS-Audio-4B-Instruct | 2.26 | 3.22 | 3.20 | 3.33 | 3.53 | 3.72 | 0.73 | 5.86 | 27.27 | 9.68 | 20.33 | 16.93 | 13.25 | 6.36 | 9.77 | 12.68 | 10.28 | 43.35 | 44.25 | 8.17 | 8.13 | 9.14 | 8.37 | 12.83 | 14.65 | 9.04 | 18.47 | 3.10 | 4.01 |
| MOSS-Audio-8B-Instruct | 1.82 | 2.97 | 2.95 | 2.91 | 2.82 | 3.20 | 0.69 | 4.80 | 36.82 | 11.25 | 24.36 | 17.42 | 13.10 | 5.84 | 8.94 | 11.52 | 9.72 | 39.76 | 39.27 | 7.86 | 7.52 | 9.07 | 8.22 | 13.26 | 9.18 | 8.33 | 17.24 | 2.39 | 4.31 |
Timestamp ASR (AAS↓)
| Model | AISHELL-1(zh) | LibriSpeech(en) |
|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |
Quickstart
Environment Setup
We recommend Python 3.12 in a clean Conda environment. The commands below are sufficient for local inference.
Recommended setup
```bash
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```
Optional: FlashAttention 2
If your GPU supports FlashAttention 2, you can replace the last install command with:
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```
Basic Usage
Download the model first:
```bash
hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct --local-dir ./weights/MOSS-Audio-4B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-4B-Thinking --local-dir ./weights/MOSS-Audio-4B-Thinking
hf download OpenMOSS-Team/MOSS-Audio-8B-Instruct --local-dir ./weights/MOSS-Audio-8B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-8B-Thinking --local-dir ./weights/MOSS-Audio-8B-Thinking
```
Then edit MODEL_PATH / AUDIO_PATH in infer.py as needed, and run:
```bash
python infer.py
```
The default prompt in infer.py is `Describe this audio.` You can edit that line directly to try transcription, audio QA, or speech captioning instead.
Fine-tuning
We now provide an official fine-tuning script in finetune/finetune.py, with full instructions in finetune/FINETUNE.md.
Install the extra dependencies needed for training:
```bash
pip install librosa peft
```
Minimal example for LoRA fine-tuning:
```bash
accelerate launch finetune/finetune.py \
    --model_dir ./weights/MOSS-Audio-4B-Instruct \
    --data_path train.jsonl \
    --output_dir ./output/lora \
    --use_lora \
    --bf16
```
The training data should be a JSONL file containing audio-text conversations. For data format, supported arguments, multi-GPU examples, and full-parameter fine-tuning, see finetune/FINETUNE.md.
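As a rough illustration of what one line of such a JSONL file might look like, the snippet below writes a single audio-text conversation record. The field names (`messages`, `audio`, the role layout) are hypothetical; the authoritative schema is the one documented in finetune/FINETUNE.md.

```python
# Hypothetical JSONL record for fine-tuning data. Field names are
# illustrative only -- check finetune/FINETUNE.md for the real schema.
import json

record = {
    "messages": [
        # User turn references an audio file and asks a question about it.
        {"role": "user", "content": "Describe this audio.",
         "audio": "clips/example.wav"},
        # Assistant turn is the target text the model should learn to produce.
        {"role": "assistant", "content": "A person speaks over light rain."},
    ]
}

# Each line of train.jsonl is one independent JSON object.
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```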
Gradio App
Start the Gradio demo with:
```bash
python app.py
```
SGLang Serving
If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.
The shortest setup is:
```bash
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio-8B-Instruct --trust-remote-code
```

Point `--model-path` at whichever downloaded variant you want to serve.
If you use the default torch==2.9.1+cu128 runtime, installing nvidia-cudnn-cu12==9.16.0.29 is recommended before starting sglang serve.
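SGLang servers typically expose an OpenAI-compatible /v1/chat/completions endpoint. As a hedged sketch, the helper below builds such a request for an audio clip; the default port (30000), model name, and the `input_audio` content format are assumptions borrowed from the OpenAI chat schema — consult moss_audio_usage_guide.md for the exact request format this branch expects.

```python
# Hedged sketch of a request for SGLang's OpenAI-compatible endpoint.
# Port, model name, and audio content format are assumptions.
import base64

def build_request(audio_path: str, prompt: str) -> dict:
    """Assemble a chat-completions payload with base64-encoded audio."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "MOSS-Audio",  # assumed served-model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# POST this JSON to http://localhost:30000/v1/chat/completions
# (30000 is SGLang's default serving port).
```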
More Information
- MOSI.AI: https://mosi.cn
- OpenMOSS: https://www.open-moss.com
LICENSE
Models in MOSS-Audio are licensed under the Apache License 2.0.
Citation
```bibtex
@misc{mossaudio2026,
  title={MOSS-Audio Technical Report},
  author={OpenMOSS Team},
  year={2026},
  howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
  note={GitHub repository}
}
```