MOSS-Audio

April 20, 2026


English | 简体中文

MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

News

  • 2026.4.20: We have added the MOSS-Audio fine-tuning code and documentation. See finetune/FINETUNE.md for LoRA and full-parameter training examples.
  • 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!

Introduction

Understanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.

  • Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
  • Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
  • Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
  • Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
  • Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

Model Architecture

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.

Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.
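Schematically, the encoder → adapter → LLM pipeline can be sketched in PyTorch as follows. The hidden sizes and the single-linear adapter here are illustrative placeholders, not the released configuration:

```python
# Minimal sketch of the three-component layout described above.
# Hidden sizes are illustrative, not the released configuration.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Projects 12.5 Hz encoder features into the LLM embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

# 10 s of audio at 12.5 Hz -> 125 frames of encoder features
enc_dim, llm_dim = 1024, 2560          # illustrative sizes
feats = torch.randn(1, 125, enc_dim)   # stand-in for MOSS-Audio-Encoder output
adapter = AudioAdapter(enc_dim, llm_dim)
audio_embeds = adapter(feats)          # ready to be consumed by the LLM
print(audio_embeds.shape)              # torch.Size([1, 125, 2560])
```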

DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.

This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.
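A minimal sketch of the idea, assuming additive fusion and arbitrary tap layers (the released implementation may select layers and fuse features differently):

```python
# Hedged sketch of DeepStack-style cross-layer injection. The tap layers
# and the additive fusion rule are assumptions for illustration only.
import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    """Projects selected encoder layers and adds them into early LLM layers."""
    def __init__(self, enc_dim: int, llm_dim: int, tap_layers=(4, 8, 12)):
        super().__init__()
        self.tap_layers = tap_layers
        # One independent projection per tapped encoder layer
        self.projs = nn.ModuleList(nn.Linear(enc_dim, llm_dim) for _ in tap_layers)

    def forward(self, all_enc_states, llm_early_states):
        # all_enc_states: list of (B, T, enc_dim), one per encoder layer
        # llm_early_states: list of (B, T, llm_dim), the LLM's first layers
        fused = []
        for (tap, proj), h in zip(zip(self.tap_layers, self.projs), llm_early_states):
            fused.append(h + proj(all_enc_states[tap]))  # inject per early layer
        return fused

enc_dim, llm_dim, T = 512, 1024, 125
enc_states = [torch.randn(1, T, enc_dim) for _ in range(13)]
llm_states = [torch.randn(1, T, llm_dim) for _ in range(3)]
out = DeepStackInjector(enc_dim, llm_dim)(enc_states, llm_states)
print(len(out), out[0].shape)  # 3 torch.Size([1, 125, 1024])
```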

Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
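The insertion strategy can be illustrated as follows; the token format, the 2-second marker interval, and the frame placeholders are assumptions made for this sketch, not the released tokenizer:

```python
# Illustrative time-marker insertion: interleave explicit time tokens with
# audio-frame placeholders at a fixed interval. The token spelling and the
# 2 s interval are assumptions; only the 12.5 Hz rate comes from the text.
FRAME_RATE_HZ = 12.5
MARKER_EVERY_S = 2.0
frames_per_marker = int(FRAME_RATE_HZ * MARKER_EVERY_S)  # 25 frames

def insert_time_markers(num_frames: int) -> list[str]:
    seq = []
    for i in range(num_frames):
        if i % frames_per_marker == 0:
            # Explicit time token indicating the temporal position
            seq.append(f"<|time:{i / FRAME_RATE_HZ:.1f}s|>")
        seq.append(f"<frame_{i}>")
    return seq

seq = insert_time_markers(60)  # 4.8 s of audio
print(seq[:3])   # ['<|time:0.0s|>', '<frame_0>', '<frame_1>']
print(seq[26])   # <|time:2.0s|>  (marker after 25 frames)
```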

Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face | ModelScope |
| --- | --- | --- | --- | --- | --- |
| MOSS-Audio-4B-Instruct | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face | ModelScope |
| MOSS-Audio-4B-Thinking | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face | ModelScope |
| MOSS-Audio-8B-Instruct | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face | ModelScope |
| MOSS-Audio-8B-Thinking | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face | ModelScope |

More model families, sizes, and variants will be released in the future. Stay tuned!

Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

  • General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08, with 77.33 on MMAU, 64.92 on MMAU-Pro, 66.53 on MMAR, and 75.52 on MMSU, outperforming all open-source models.
  • Speech Captioning: MOSS-Audio-Instruct variants lead across 11 out of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
  • ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
  • Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech (lower is better), far ahead of Qwen3-Omni (833.66) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.
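For reference, CER (the metric in the ASR results above) is the character-level edit distance divided by the reference length. A toy implementation for sanity checks, not the benchmarks' official scorer:

```python
# CER as commonly defined: Levenshtein distance over characters,
# normalized by the reference length. Toy scorer, not the official one.
def cer(ref: str, hyp: str) -> float:
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # rolling row of the edit-distance DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / m

print(round(cer("hello world", "helo world"), 4))  # 0.0909 (1 edit / 11 chars)
```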

General Audio Understanding (Accuracy↑)

| Model | Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| *Open Source (small)* | | | | | | |
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.33 | 64.92 | 66.53 | 75.52 | 71.08 |
| *Open Source (large)* | | | | | | |
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| *Closed Source* | | | | | | |
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |

Speech Captioning (LLM-as-a-Judge Score↑)

| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |

ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Env. (Clean) | Acoustic Env. (Noisy) | Whisper | Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |
Detailed ASR Results

Per-dataset results covering the 12 evaluation dimensions above: Acoustic Environment (Clean/Noisy), Whisper, Far-Field / Near-Field, Multi-Speaker, Age, Health Condition, Semantic Content, Code-Switching, Dialect, Singing, and Non-Speech Vocalizations. Subset results are separated by "/".

| Model | AISHELL-1 (test) | AISHELL-2 (Android/IOS/Mic) | THCHS-30 (test) | MAGICDATA-READ (test) | AISHELL6-Whisper (normal/whisper) | AliMeeting (Test_Ali_far/Test_Ali_near) | AISHELL-4 (test) | SeniorTalk (sentence) | ChildMandarin (test) | AISHELL-6A (mild/moderate/severe/StutteringSpeech) | AISHELL_6B (LRDWWS/Uncontrol) | WenetSpeech (test-meeting) | Fleurs (cmn_hans_cn) | CS-Dialogue (test) | TALCS (test) | ASCEND (test) | KeSpeech (test) | WSYue-ASR-eval (short) | MIR-1K (test) | openc-pop (test) | MNV_17 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paraformer-Large | 1.98 | 3.28/3.21/3.00 | 4.07 | 4.67 | 1.11/8.92 | 25.64/9.27 | 20.33 | 17.31 | 12.60 | 6.98/9.30/13.34/10.74 | 47.59/45.08 | 7.88 | 6.40 | 10.64 | 10.77 | 16.55 | 11.48 | 75.42 | 57.70 | 6.98 | 4.95 |
| GLM-ASR-Nano | 2.89 | 3.75/3.73/3.78 | 4.23 | 5.02 | 0.83/9.06 | 40.27/14.76 | 28.02 | 20.33 | 14.06 | 8.74/12.11/14.38/12.29 | 50.34/49.09 | 9.70 | 4.94 | 11.06 | 11.07 | 13.50 | 9.72 | 35.07 | 95.87 | 8.03 | 4.65 |
| Fun-ASR-Nano | 2.16 | 3.04/2.99/3.07 | 3.65 | 3.46 | 0.81/6.76 | 27.21/9.55 | 19.82 | 16.96 | 12.94 | 6.60/8.81/12.98/10.30 | 47.42/45.84 | 7.39 | 4.76 | 10.47 | 8.09 | 15.13 | 7.43 | 8.17 | 35.85 | 2.84 | 4.76 |
| SenseVoice-Small | 3.23 | 4.16/4.02/3.96 | 5.26 | 4.93 | 1.25/9.88 | 37.01/16.31 | 24.06 | 21.07 | 14.18 | 7.62/9.85/14.39/11.47 | 52.92/47.97 | 8.35 | 6.75 | 12.81 | 10.52 | 18.38 | 10.45 | 7.34 | 39.51 | 8.07 | 4.92 |
| Kimi-Audio-7B-Instruct | 0.79 | 2.91/3.03/2.88 | 1.39 | 2.15 | 0.69/4.63 | 28.22/13.82 | 20.61 | 19.70 | 13.79 | 7.00/9.34/12.56/10.75 | 44.44/42.57 | 7.15 | 5.10 | 14.56 | 12.74 | 21.83 | 5.51 | 53.17 | 38.35 | 5.17 | 4.68 |
| Qwen2.5-Omni-3B | 1.51 | 3.10/2.94/2.93 | 3.32 | 3.56 | 0.82/7.82 | 32.14/12.16 | 22.91 | 17.38 | 12.96 | 6.87/10.55/14.57/11.33 | 54.54/50.03 | 9.04 | 5.45 | 10.78 | 10.94 | 13.25 | 7.67 | 60.06 | 45.00 | 3.47 | 5.54 |
| Qwen2.5-Omni-7B | 1.16 | 2.88/2.77/2.73 | 3.06 | 3.16 | 0.71/6.57 | 32.03/18.73 | 21.01 | 19.96 | 12.29 | 7.27/10.94/12.92/10.53 | 51.99/49.45 | 8.43 | 5.13 | 14.02 | 10.46 | 14.42 | 6.40 | 57.43 | 42.62 | 2.75 | 4.56 |
| Qwen3-Omni-30B-A3B-Instruct | 0.95 | 2.70/2.72/2.57 | 2.21 | 2.47 | 0.59/3.22 | 25.72/8.44 | 18.15 | 14.13 | 8.79 | 6.20/8.88/11.59/10.25 | 45.80/41.65 | 6.64 | 4.84 | 12.94 | 8.33 | 12.64 | 5.87 | 25.39 | 30.81 | 1.21 | 4.73 |
| MOSS-Audio-4B-Instruct | 2.26 | 3.22/3.20/3.33 | 3.53 | 3.72 | 0.73/5.86 | 27.27/9.68 | 20.33 | 16.93 | 13.25 | 6.36/9.77/12.68/10.28 | 43.35/44.25 | 8.17 | 8.13 | 9.14 | 8.37 | 12.83 | 14.65 | 9.04 | 18.47 | 3.10 | 4.01 |
| MOSS-Audio-8B-Instruct | 1.82 | 2.97/2.95/2.91 | 2.82 | 3.20 | 0.69/4.80 | 36.82/11.25 | 24.36 | 17.42 | 13.10 | 5.84/8.94/11.52/9.72 | 39.76/39.27 | 7.86 | 7.52 | 9.07 | 8.22 | 13.26 | 9.18 | 8.33 | 17.24 | 2.39 | 4.31 |

Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
| --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |

Quickstart

Environment Setup

We recommend Python 3.12 in a clean Conda environment. The commands below are sufficient for local inference.

```bash
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```

Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```

Basic Usage

Download the model first:

```bash
hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct --local-dir ./weights/MOSS-Audio-4B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-4B-Thinking --local-dir ./weights/MOSS-Audio-4B-Thinking
hf download OpenMOSS-Team/MOSS-Audio-8B-Instruct --local-dir ./weights/MOSS-Audio-8B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-8B-Thinking --local-dir ./weights/MOSS-Audio-8B-Thinking
```

Then edit `MODEL_PATH` / `AUDIO_PATH` in `infer.py` as needed, and run:

```bash
python infer.py
```

The default prompt in `infer.py` is `Describe this audio.` You can directly edit that line if you want to try transcription, audio QA, or speech captioning.

Fine-tuning

We now provide an official fine-tuning script in finetune/finetune.py, with full instructions in finetune/FINETUNE.md.

Install the extra dependencies needed for training:

```bash
pip install librosa peft
```

Minimal example for LoRA fine-tuning:

```bash
accelerate launch finetune/finetune.py \
    --model_dir ./weights/MOSS-Audio-4B-Instruct \
    --data_path train.jsonl \
    --output_dir ./output/lora \
    --use_lora \
    --bf16
```

The training data should be a JSONL file containing audio-text conversations. For data format, supported arguments, multi-GPU examples, and full-parameter fine-tuning, see finetune/FINETUNE.md.
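For illustration, one record might be written like this. The exact schema (field names, message roles, the `<audio>` placeholder) is an assumption in this sketch; consult finetune/FINETUNE.md for the authoritative format:

```python
# Writes one illustrative training record to train.jsonl.
# The field names and roles below are assumptions, not the official schema.
import json

record = {
    "audio": "data/sample_0001.wav",  # hypothetical path to the audio clip
    "conversations": [
        {"role": "user", "content": "<audio>Describe this audio."},
        {"role": "assistant", "content": "A woman speaks calmly over light rain."},
    ],
}
# JSONL = one JSON object per line; append more records the same way.
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

print(json.loads(open("train.jsonl", encoding="utf-8").readline())["audio"])
```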

Gradio App

Start the Gradio demo with:

```bash
python app.py
```

SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.

The shortest setup is:

```bash
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```

If you use the default `torch==2.9.1+cu128` runtime, installing `nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.

More Information

LICENSE

Models in MOSS-Audio are licensed under the Apache License 2.0.

Citation

```bibtex
@misc{mossaudio2026,
      title={MOSS-Audio Technical Report},
      author={OpenMOSS Team},
      year={2026},
      howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
      note={GitHub repository}
}
```

Star History

Star History Chart