README.md
April 7, 2026
Overview
VyvoTTS converts text to speech by having an LLM generate interleaved audio codec tokens, which are then decoded to audio waveforms. It supports SNAC and Mimi audio codecs.
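To make "interleaved audio codec tokens" concrete, here is a minimal sketch of frame-major interleaving. It is an illustration only, not VyvoTTS's actual token layout: the helper names, the `codebook_size` and `base_offset` parameters, and the per-codebook offset scheme are all assumptions, and it only covers codebooks that run at the same frame rate (a multi-rate codec such as SNAC would need a different grouping).

```python
from typing import List


def interleave_codebooks(
    codes: List[List[int]], codebook_size: int, base_offset: int = 0
) -> List[int]:
    """Flatten per-codebook codes into one frame-major token stream.

    codes[k][t] is the code from codebook k at frame t. Each codebook gets
    its own id range so the LLM vocabulary can tell the codebooks apart.
    (Hypothetical layout, chosen for illustration.)
    """
    if not codes:
        return []
    num_frames = len(codes[0])
    if any(len(book) != num_frames for book in codes):
        raise ValueError("all codebooks must have the same number of frames")

    tokens: List[int] = []
    for t in range(num_frames):
        for k, book in enumerate(codes):
            code = book[t]
            if not 0 <= code < codebook_size:
                raise ValueError(f"code {code} out of range for codebook {k}")
            tokens.append(base_offset + k * codebook_size + code)
    return tokens


def deinterleave_tokens(
    tokens: List[int], num_codebooks: int, codebook_size: int, base_offset: int = 0
) -> List[List[int]]:
    """Invert interleave_codebooks: recover codes[k][t] from the flat stream."""
    if len(tokens) % num_codebooks != 0:
        raise ValueError("token count is not a multiple of num_codebooks")
    codes: List[List[int]] = [[] for _ in range(num_codebooks)]
    for i, token in enumerate(tokens):
        k = i % num_codebooks
        codes[k].append(token - base_offset - k * codebook_size)
    return codes
```

In this sketch the LLM would emit the flat stream, and the de-interleaved codes would then be handed to the codec decoder to produce the waveform.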
Installation
git clone https://github.com/Vyvo-Labs/VyvoTTS.git
cd VyvoTTS
uv venv --python 3.12 && source .venv/bin/activate
# Base install (includes both SNAC and Mimi support)
uv pip install -e "."
# With inference backends
uv pip install -e ".[vllm]" # vLLM
uv pip install -e ".[sglang]" # SGLang
# With training dependencies
uv pip install -e ".[train]"
Inference
from vyvotts.inference.transformers_inference import VyvoTTSTransformersInference
engine = VyvoTTSTransformersInference(
    config_path="vyvotts/configs/inference/lfm2_5.yaml",
    model_name="LiquidAI/LFM2.5-350M",
    tokenizer_name="LiquidAI/LFM2-350M",
    codec_type="mimi",  # or "snac"
)
audio, timing = engine.generate("Hello world", output_path="output.wav")
All four backends share the same interface; swap by changing the import:
from vyvotts.inference.vllm_inference import VyvoTTSInference # vLLM (fastest TTFT)
from vyvotts.inference.sglang_inference import VyvoTTSSGLangInference # SGLang (highest tok/s)
from vyvotts.inference.transformers_inference import VyvoTTSTransformersInference # HuggingFace
from vyvotts.inference.unsloth_inference import VyvoTTSUnslothInference # 4/8-bit quantized
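Since the four import lines above differ only in module path and class name, the backend can also be chosen at runtime. The sketch below is a hypothetical helper, not part of VyvoTTS: `BACKENDS` and `load_backend` are names introduced here, and only the module paths and class names are taken from the imports above.

```python
import importlib

# Module paths and class names copied from the import lines above.
BACKENDS = {
    "vllm": ("vyvotts.inference.vllm_inference", "VyvoTTSInference"),
    "sglang": ("vyvotts.inference.sglang_inference", "VyvoTTSSGLangInference"),
    "transformers": (
        "vyvotts.inference.transformers_inference",
        "VyvoTTSTransformersInference",
    ),
    "unsloth": ("vyvotts.inference.unsloth_inference", "VyvoTTSUnslothInference"),
}


def load_backend(name: str):
    """Return the engine class for a backend key (hypothetical helper).

    The module is imported lazily, so only the selected backend's
    dependencies (e.g. the vLLM or SGLang extras) need to be installed.
    """
    try:
        module_path, class_name = BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {sorted(BACKENDS)}"
        ) from None
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

With VyvoTTS installed, `load_backend("vllm")` would return the class, which can then be constructed and used like the Transformers example, assuming the constructors accept the same arguments.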
Benchmark (Qwen3-1.7B, H100 PCIe)
| Engine | TTFT | TTFA | Tokens/s |
|---|---|---|---|
| SGLang | 10ms | 32ms | 308 |
| vLLM | 6ms | 30ms | 292 |
| Unsloth | 25ms | 55ms | 54 |
| Transformers | 22ms | 50ms | 50 |
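The Tokens/s column can be turned into a rough "faster than playback" figure once the codec's token rate is known. The sketch below shows the arithmetic only: `ASSUMED_TOKENS_PER_AUDIO_SECOND` is a placeholder, not a measured property of SNAC or Mimi as configured in VyvoTTS, so replace it with the real rate of your codec setup before reading anything into the output.

```python
def realtime_speedup(tokens_per_second: float, tokens_per_audio_second: float) -> float:
    """Seconds of audio generated per second of wall-clock time.

    Values above 1.0 mean tokens are produced faster than the audio they
    encode would take to play back.
    """
    if tokens_per_second <= 0 or tokens_per_audio_second <= 0:
        raise ValueError("rates must be positive")
    return tokens_per_second / tokens_per_audio_second


# Tokens/s values copied from the benchmark table above.
THROUGHPUT = {"SGLang": 308, "vLLM": 292, "Unsloth": 54, "Transformers": 50}

# ASSUMPTION: placeholder token rate for illustration only; the real value
# depends on the codec and how many codebooks are used.
ASSUMED_TOKENS_PER_AUDIO_SECOND = 100.0

if __name__ == "__main__":
    for engine, tps in THROUGHPUT.items():
        speedup = realtime_speedup(tps, ASSUMED_TOKENS_PER_AUDIO_SECOND)
        print(f"{engine}: {speedup:.2f}x real time (under the assumed token rate)")
```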
Dataset Preparation
Standard dataset tokenization
from vyvotts.audio_tokenizer import process_dataset
process_dataset(
    original_dataset="MrDragonFox/Elise",
    output_dataset="username/dataset-name",
    model_type="lfm2_5",  # or "qwen3", "lfm2"
    codec_type="mimi",  # or "snac"
    num_gpus=8,  # multi-GPU support
)
Large-scale Emilia dataset tokenization
python -m vyvotts.tokenize_emilia \
--dataset ylacombe/emilia-subset \
--output_dataset /scratch/output \
--model_type lfm2_5 \
--codec_type mimi \
--num_gpus 8
Supports two Emilia sources:
- ylacombe/emilia-subset: 3.39M EN samples, parquet-based
- amphion/Emilia-Dataset: Emilia + Emilia-YODAS EN, tar-based
Training
Pre-training (multi-GPU FSDP)
python -m accelerate.commands.launch \
--config_file vyvotts/configs/train/accelerate_pretrain.yaml \
vyvotts/train/pretrain/train.py
Configure in vyvotts/configs/train/lfm2_5_pretrain.yaml:
- Model, tokenizer, codec type
- Dataset paths (local or HuggingFace)
- QA:TTS ratio scheduling (2:1 → 1:1)
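As an illustration of what a QA:TTS ratio schedule from 2:1 to 1:1 could look like, here is a minimal sketch. The linear shape, the function names, and the idea of sampling a batch type from the ratio are assumptions made for this example; the actual schedule is whatever the pre-training config and `train.py` implement.

```python
def qa_tts_ratio(
    step: int, total_steps: int, start: float = 2.0, end: float = 1.0
) -> float:
    """QA:TTS ratio at a training step, linearly interpolated (assumed shape).

    Returns r, meaning r QA samples for every 1 TTS sample.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress


def qa_probability(ratio: float) -> float:
    """Probability of drawing a QA sample given a QA:TTS ratio of ratio:1."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    return ratio / (ratio + 1.0)
```

Under this sketch, training would start by drawing QA samples two thirds of the time and end at an even split between QA and TTS.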
Fine-tuning
Fine-tuning with automatic tokenization, for one speaker or several:
# Single speaker
python -m vyvotts.finetune \
--dataset Vyvo/ElevenLabs-EN \
--speaker ElevenLabs \
--output_dir output/ElevenLabs
# Multiple speakers at once
python -m vyvotts.finetune \
--dataset Vyvo/ElevenLabs-EN Vyvo/ElevenLabs-EN-Elise2-Lpq0RJl4hRqNiDLfiBMr \
--speaker ElevenLabs Elise2 \
--output_dir output/ElevenLabs output/Elise2 \
--epochs 3 --batch_size 4 --lr 2e-5
The pipeline handles everything: download → tokenize with codec → train → generate test wav files.
See FINETUNE.md for the full guide.
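The multi-speaker command passes three parallel lists (`--dataset`, `--speaker`, `--output_dir`). The sketch below shows one way such lists could be paired into per-speaker jobs. It assumes the pairing is positional, which the example suggests but this README does not state; `FinetuneJob` and `pair_jobs` are hypothetical names, not part of `vyvotts.finetune`.

```python
from typing import List, NamedTuple, Sequence


class FinetuneJob(NamedTuple):
    """One fine-tuning run: a dataset, its speaker label, and an output directory."""

    dataset: str
    speaker: str
    output_dir: str


def pair_jobs(
    datasets: Sequence[str], speakers: Sequence[str], output_dirs: Sequence[str]
) -> List[FinetuneJob]:
    """Pair the i-th dataset, speaker, and output dir (assumed positional)."""
    if not (len(datasets) == len(speakers) == len(output_dirs)):
        raise ValueError(
            "expected the same number of datasets, speakers, and output dirs, "
            f"got {len(datasets)}, {len(speakers)}, {len(output_dirs)}"
        )
    return [FinetuneJob(d, s, o) for d, s, o in zip(datasets, speakers, output_dirs)]
```

Checking the list lengths up front catches a mismatched invocation before any download or tokenization starts.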
Full training (accelerate)
# Full fine-tuning
python -m accelerate.commands.launch \
--config_file vyvotts/configs/train/accelerate_finetune.yaml \
vyvotts/train/finetune/train.py
# LoRA fine-tuning
python -m accelerate.commands.launch \
--config_file vyvotts/configs/train/accelerate_finetune.yaml \
vyvotts/train/finetune/lora.py
Supported Models
| Model | Type | Config |
|---|---|---|
| LiquidAI/LFM2.5-350M | Hybrid conv+attention | lfm2_5.yaml |
| LiquidAI/LFM2-350M | Hybrid conv+attention | lfm2.yaml |
| Qwen/Qwen3-0.6B | Transformer | qwen3.yaml |
| Llama3 | Transformer | llama3.yaml |
Acknowledgements
License
MIT