MOSS-TTS First-Class End-to-End Inference Pipeline
April 6, 2026
This document describes the first-class MOSS-TTS end-to-end inference pipeline in the current llama.cpp repository.
There are currently two ways to run it:
- Recommended native path: all three models run inside llama.cpp
  - moss-tts-delay backbone via llama_decode()
  - moss-tts-audio-encoder via llama_encode()
  - moss-tts-audio-decoder via llama_encode()
- Hybrid wrapper path: backbone in llama.cpp, audio tokenizer in ONNX, orchestrated by Python
Unlike the older moss_tts_delay/llama_cpp backend in the MOSS-TTS repository, this path moves multi-channel inputs, the transformer backbone, multi-head outputs, and delay-pattern decoding into llama.cpp.
Prerequisites
- llama.cpp built from source with the llama-moss-tts target
- Python >= 3.10 if you want to use the hybrid wrapper or the converter scripts
- Python packages required by the hybrid helper scripts:
  - numpy
  - soundfile
  - tokenizers
  - onnxruntime
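Before running the hybrid helper scripts, it can help to confirm that the packages listed above are importable. The following is a minimal sketch and not part of the repository; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names of the packages listed above for the hybrid helper scripts.
REQUIRED = ["numpy", "soundfile", "tokenizers", "onnxruntime"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All hybrid-wrapper dependencies are importable.")
```

The check only looks up top-level modules; it does not verify versions or that a GPU-enabled onnxruntime build is installed.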
Build
CPU-only build
cd /path/to/llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-moss-tts -j
Binary:
build/bin/llama-moss-tts
CUDA build
cd /path/to/llama.cpp
cmake -S . -B build-cuda -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build-cuda --target llama-moss-tts -j
Binary:
build-cuda/bin/llama-moss-tts
If you use the hybrid wrapper, you can also pass --build to the e2e script to build llama-moss-tts before the run.
Weight Preparation
Step 1: Prepare the backbone GGUF
You need a first-class MOSS-TTS-Delay GGUF model that already contains:
- text embedding tables
- 32 audio embedding tables
- Qwen3 backbone weights
- a text output head
- 32 audio output heads
For example:
out/moss_delay_firstclass_f16.gguf
You can generate it directly from the full Hugging Face MOSS-TTS model directory:
huggingface-cli download OpenMOSS-Team/MOSS-TTS --local-dir /path/to/MOSS-TTS-hf
python convert_hf_to_gguf.py \
/path/to/MOSS-TTS-hf \
--outfile /path/to/moss_delay_firstclass_f16.gguf \
--outtype f16
Important:
- The --model-gguf file used by this e2e pipeline is a special first-class MOSS-TTS-Delay GGUF generated from the full OpenMOSS-Team/MOSS-TTS Hugging Face model directory with the command above.
- It is not the same thing as a generic GGUF downloaded from OpenMOSS/MOSS-TTS-GGUF.
- Do not point this pipeline at a file from OpenMOSS/MOSS-TTS-GGUF unless that file was explicitly produced as a first-class MOSS-TTS-Delay GGUF for this llama.cpp implementation.
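A quick sanity check can catch the case where the path passed to the pipeline is not a GGUF file at all. The sketch below reads the fixed-size GGUF header, assuming the little-endian version 2 or later layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count). The helper name `read_gguf_header` is hypothetical, and the check cannot tell a first-class MOSS-TTS-Delay GGUF from a generic one.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, kv_count), or raise ValueError if not GGUF."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:24])
    return version, tensor_count, kv_count
```

For example, `read_gguf_header("/path/to/moss_delay_firstclass_f16.gguf")` returns the header counts for a valid file and raises `ValueError` for anything that does not start with the GGUF magic.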
Step 2: Prepare the native audio encoder / decoder GGUFs
You need two additional GGUF files:
- moss-tts-audio-encoder
- moss-tts-audio-decoder
They can be generated from the Hugging Face MOSS-Audio-Tokenizer directory with:
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer --local-dir /path/to/MOSS-Audio-Tokenizer-hf
python convert_moss_audio_tokenizer_split_to_gguf.py \
/path/to/MOSS-Audio-Tokenizer-hf \
--outdir /path/to/out \
--outtype f16
Typical outputs:
- /path/to/out/moss_tts_audio_encoder_f16.gguf
- /path/to/out/moss_tts_audio_decoder_f16.gguf
Step 3: Prepare the tokenizer directory for the hybrid wrapper
You need a tokenizer directory containing at least:
tokenizer.json
For example:
weights/extracted/qwen3_backbone/
Step 4: Prepare the ONNX audio tokenizer for the hybrid wrapper
You need both ONNX files:
- encoder.onnx
- decoder.onnx
For example:
- weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx
- weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx
Usage
Current Native Runtime: Three GGUFs
This is the current recommended path.
CPU
# Text-only TTS on CPU
build/bin/llama-moss-tts \
-m /path/to/moss_delay_firstclass_f16.gguf \
--audio-decoder-model /path/to/moss_tts_audio_decoder_f16.gguf \
--text "Hello, world!" \
--wav-out /path/to/output.wav
# Voice cloning on CPU
build/bin/llama-moss-tts \
-m /path/to/moss_delay_firstclass_f16.gguf \
--audio-encoder-model /path/to/moss_tts_audio_encoder_f16.gguf \
--audio-decoder-model /path/to/moss_tts_audio_decoder_f16.gguf \
--text-file /path/to/text.txt \
--reference-audio /path/to/reference_24k.wav \
--wav-out /path/to/output.wav
GPU
# Text-only TTS on GPU
build-cuda/bin/llama-moss-tts \
-m /path/to/moss_delay_firstclass_f16.gguf \
--audio-decoder-model /path/to/moss_tts_audio_decoder_f16.gguf \
--text "Hello, world!" \
--wav-out /path/to/output.wav \
-ngl -1
# Voice cloning on GPU
build-cuda/bin/llama-moss-tts \
-m /path/to/moss_delay_firstclass_f16.gguf \
--audio-encoder-model /path/to/moss_tts_audio_encoder_f16.gguf \
--audio-decoder-model /path/to/moss_tts_audio_decoder_f16.gguf \
--text-file /path/to/text.txt \
--reference-audio /path/to/reference_24k.wav \
--wav-out /path/to/output.wav \
-ngl -1
Notes:
- --reference-audio must be a 24 kHz mono wav.
- -ngl -1 means "offload all eligible layers to GPU".
- If you built build-cuda/bin/llama-moss-tts but want to force CPU execution, use -ngl 0.
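The 24 kHz mono requirement on the reference audio can be checked before launching a run. This is a minimal sketch using Python's standard `wave` module, which reads PCM wav files only; the helper name `check_reference_wav` is hypothetical and not part of the repository.

```python
import wave


def check_reference_wav(path, expected_rate=24000, expected_channels=1):
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"file has {channels} channel(s), expected {expected_channels}")
    return problems
```

If the list is non-empty, resample or downmix the file with an audio tool of your choice before passing it to --reference-audio.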
Hybrid Wrapper: Backbone in GGUF, Audio Tokenizer in ONNX
This path remains useful for parity checks and intermediate artifact inspection.
CLI
# Voice cloning: text + reference audio -> wav
python tools/tts/moss-tts-firstclass-e2e.py \
--model-gguf /path/to/moss_delay_firstclass.gguf \
--tokenizer-dir /path/to/tokenizer_dir \
--onnx-encoder /path/to/encoder.onnx \
--onnx-decoder /path/to/decoder.onnx \
--text-file /path/to/text.txt \
--reference-audio /path/to/reference_24k.wav \
--output-wav /path/to/output.wav
# Direct generation without reference audio
python tools/tts/moss-tts-firstclass-e2e.py \
--model-gguf /path/to/moss_delay_firstclass.gguf \
--tokenizer-dir /path/to/tokenizer_dir \
--onnx-encoder /path/to/encoder.onnx \
--onnx-decoder /path/to/decoder.onnx \
--text "Hello, world!" \
--output-wav /path/to/output.wav
# Build llama-moss-tts before running
python tools/tts/moss-tts-firstclass-e2e.py \
--build \
--model-gguf /path/to/moss_delay_firstclass.gguf \
--tokenizer-dir /path/to/tokenizer_dir \
--onnx-encoder /path/to/encoder.onnx \
--onnx-decoder /path/to/decoder.onnx \
--text "Hello!" \
--output-wav /path/to/output.wav
Key Options
| Option | Values | Description |
|---|---|---|
| --model-gguf | path | First-class MOSS-TTS GGUF model |
| --moss-tts-dir | path | Deprecated compatibility flag; no longer required |
| --tokenizer-dir | path | Directory containing tokenizer.json |
| --onnx-encoder | path | Audio tokenizer encoder ONNX |
| --onnx-decoder | path | Audio tokenizer decoder ONNX |
| --text / --text-file | string / path | Input text, choose exactly one |
| --reference-audio | path | Optional reference audio; if provided, it must be 24 kHz |
| --language | zh / en / tag | Language tag passed to the prompt builder |
| --max-new-tokens | int | Maximum generation steps |
| --text-temperature | float | Text-channel sampling temperature, default 1.5 |
| --audio-temperature | float | Audio-channel sampling temperature, default 1.7 |
| --n-gpu-layers | -1 / 0 / N | GPU offload layers, default -1 |
| --audio-decoder-cpu | flag | Force ONNX waveform decoding on CPU |
| --cpu-audio-encode | flag | Force ONNX reference-audio encoding on CPU |
| --build | flag | Build llama-moss-tts before running |
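The "choose exactly one" rule for --text and --text-file maps naturally onto a required mutually exclusive group in `argparse`. The sketch below illustrates that contract for a few of the documented options; it is a hypothetical mirror, not the actual parser in moss-tts-firstclass-e2e.py.

```python
import argparse


def build_parser():
    """Hypothetical parser covering a subset of the documented wrapper options."""
    parser = argparse.ArgumentParser(description="Sketch of the hybrid wrapper options")
    parser.add_argument("--model-gguf", required=True)
    # Exactly one of --text / --text-file must be supplied.
    text = parser.add_mutually_exclusive_group(required=True)
    text.add_argument("--text")
    text.add_argument("--text-file")
    parser.add_argument("--reference-audio", default=None)
    parser.add_argument("--n-gpu-layers", type=int, default=-1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--model-gguf", "model.gguf", "--text", "Hello!"])
    print(args.text, args.n_gpu_layers)
```

With this setup, passing both text options, or neither, makes `argparse` exit with a usage error instead of starting a run.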
Native Runtime Options
| Option | Values | Description |
|---|---|---|
| -m | path | Backbone moss-tts-delay GGUF |
| --audio-encoder-model | path | Native moss-tts-audio-encoder GGUF |
| --audio-decoder-model | path | Native moss-tts-audio-decoder GGUF |
| --text / --text-file | string / path | Input text, choose exactly one |
| --reference-audio | path | Optional 24 kHz reference wav |
| --language | zh / en / tag | Language tag passed to the prompt builder |
| --max-new-tokens | int | Maximum generation steps |
| --gpu-layers / -ngl | -1 / 0 / N | GPU offload layers |
| --wav-out | path | Output wav path |
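When the native binary is driven from a script, the options in the table above can be assembled into an argument list. The following sketch uses only flags documented in this file; the function name `native_command` and its parameters are hypothetical.

```python
import shlex


def native_command(binary, backbone, decoder, wav_out, text=None, text_file=None,
                   encoder=None, reference_audio=None, gpu_layers=None):
    """Build an argv list for llama-moss-tts from the documented native options."""
    if (text is None) == (text_file is None):
        raise ValueError("pass exactly one of text or text_file")
    argv = [binary, "-m", backbone, "--audio-decoder-model", decoder]
    if encoder is not None:
        argv += ["--audio-encoder-model", encoder]
    argv += ["--text", text] if text is not None else ["--text-file", text_file]
    if reference_audio is not None:
        argv += ["--reference-audio", reference_audio]
    argv += ["--wav-out", wav_out]
    if gpu_layers is not None:
        argv += ["-ngl", str(gpu_layers)]
    return argv


if __name__ == "__main__":
    cmd = native_command(
        "build/bin/llama-moss-tts",
        "/path/to/moss_delay_firstclass_f16.gguf",
        "/path/to/moss_tts_audio_decoder_f16.gguf",
        "/path/to/output.wav",
        text="Hello, world!",
    )
    # Print a shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a string avoids shell quoting problems when the text contains spaces or punctuation.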
Architecture
Native Three-GGUF Path
Input text (+ optional reference wav)
|
v
llama-moss-tts
|
|- text prompt packing
|- optional reference wav -> moss-tts-audio-encoder -> reference audio codes
|- moss-tts-delay backbone via llama_decode()
|- multi-head sampling + C++ delay-pattern decoding
|- raw audio codes -> moss-tts-audio-decoder -> waveform
v
wav
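To give a feel for the delay-pattern decoding step in the diagram above, here is a generic sketch of a delay pattern in which audio channel k lags channel 0 by k steps. Whether MOSS-TTS-Delay uses exactly this offset scheme and padding is an assumption; the function names and the pad value are hypothetical, and the real implementation lives in C++ inside llama-moss-tts.

```python
def apply_delay(codes, pad):
    """Shift channel k of a T x C code grid right by k steps, filling gaps with `pad`."""
    steps, channels = len(codes), len(codes[0])
    return [
        [codes[t - k][k] if 0 <= t - k < steps else pad for k in range(channels)]
        for t in range(steps + channels - 1)
    ]


def undo_delay(delayed, channels):
    """Invert apply_delay, recovering the time-aligned T x C grid."""
    steps = len(delayed) - channels + 1
    return [[delayed[t + k][k] for k in range(channels)] for t in range(steps)]


if __name__ == "__main__":
    aligned = [[1, 2, 3], [4, 5, 6]]      # 2 time steps, 3 channels
    delayed = apply_delay(aligned, pad=-1)
    print(delayed)
    print(undo_delay(delayed, channels=3) == aligned)
```

In this scheme the model emits the delayed grid one row per generation step, and decoding realigns the channels before the codes are handed to the audio decoder.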
Hybrid Wrapper Path
Input text (+ optional reference wav)
|
v
moss-tts-build-generation-ref.py
|
|- tokenizes text with the Qwen3 tokenizer
|- optionally encodes the reference wav into audio codes with ONNX
|- builds the packed prompt with the local lightweight MOSS-TTS processor
v
generation.ref.bin
|
v
llama-moss-tts
|
|- loads the first-class GGUF model
|- performs multi-channel embedding lookup in-graph
|- runs the Qwen3 backbone inside llama.cpp
|- samples multi-head logits
|- performs delay-pattern decoding in C++
v
raw.codes.bin
|
v
moss-tts-audio-decode.py
|
|- decodes raw audio codes into waveform with ONNX
v
wav
Temporary Artifacts
The e2e script creates a temporary directory and removes it automatically after the run.
The following intermediate files are not kept:
- generation.ref.bin
- raw.codes.bin
The only visible artifact after the run is the output wav you requested.
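The clean-up behaviour described here matches the usual `tempfile.TemporaryDirectory` pattern. The sketch below shows that pattern in isolation, assuming a hypothetical `produce` callback that writes its intermediate files and a wav into the working directory; it is not the wrapper's actual code.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_temp_artifacts(output_wav, produce):
    """Run produce(workdir) in a temporary directory and keep only the final wav."""
    with tempfile.TemporaryDirectory(prefix="moss-tts-") as tmp:
        workdir = Path(tmp)
        produced_wav = produce(workdir)        # may also write intermediate files
        shutil.copyfile(produced_wav, output_wav)
    # The directory and everything in it are removed when the block exits.
    return workdir
```

Copying the result out before the context manager exits is what leaves the requested wav as the only surviving file.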
Output
At the end of a successful run, the script prints:
- wav: output path
- wav_info: sample rate, channel count, frame count, and duration
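The same fields can be recomputed for any PCM wav with Python's standard `wave` module, which is handy for checking an output file independently. This is a sketch only; the dictionary keys are hypothetical and need not match the script's printed format.

```python
import wave


def wav_info(path):
    """Return sample rate, channel count, frame count, and duration of a PCM wav."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "frames": frames,
            "duration_s": frames / rate,   # frames per channel divided by rate
        }
```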
File Structure
llama.cpp/
├── docs/
│ ├── moss-tts-firstclass-e2e.md
│ └── moss-tts-firstclass-e2e_zh.md
├── convert_moss_audio_tokenizer_split_to_gguf.py
├── tools/tts/
│ ├── moss-tts-firstclass-e2e.py # End-to-end wrapper
│ ├── moss-tts-build-generation-ref.py # Prompt / input builder
│ ├── moss-tts-audio-decode.py # ONNX audio decode helper
│ └── run-moss-tts-delay.cpp # llama-moss-tts implementation
├── build/bin/
│ └── llama-moss-tts
└── build-cuda/bin/
└── llama-moss-tts