April 7, 2026

VyvoTTS: LLM-Based Text-to-Speech Training Framework

Overview

VyvoTTS converts text to speech by having an LLM generate interleaved audio codec tokens, which are then decoded to audio waveforms. It supports SNAC and Mimi audio codecs.
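
To make "interleaved audio codec tokens" concrete, here is a minimal sketch. It assumes a codec that emits one index per codebook for every frame, and a vocabulary in which each codebook gets its own ID range. `AUDIO_TOKEN_OFFSET`, `CODEBOOK_SIZE`, and the flat frame layout are illustrative assumptions, not VyvoTTS's actual token layout (SNAC, for instance, is hierarchical and emits a different number of codes per level).

```python
from typing import List

# Hypothetical values for illustration only; the real offsets and codebook
# sizes come from the model and codec configuration.
AUDIO_TOKEN_OFFSET = 100_000   # first LLM token ID reserved for audio codes
CODEBOOK_SIZE = 2048           # entries per codebook


def interleave(frames: List[List[int]]) -> List[int]:
    """Flatten per-frame codebook indices into one LLM token stream.

    frames[t][k] is the index chosen from codebook k at frame t. Each
    codebook is shifted into its own ID range so the LLM can tell them apart.
    """
    tokens = []
    for frame in frames:
        for k, code in enumerate(frame):
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"code {code} out of range for codebook {k}")
            tokens.append(AUDIO_TOKEN_OFFSET + k * CODEBOOK_SIZE + code)
    return tokens


def deinterleave(tokens: List[int], num_codebooks: int) -> List[List[int]]:
    """Invert interleave(): recover per-frame codebook indices for decoding."""
    usable = len(tokens) - len(tokens) % num_codebooks  # drop a partial frame
    frames = []
    for start in range(0, usable, num_codebooks):
        frame = []
        for k in range(num_codebooks):
            frame.append(tokens[start + k] - AUDIO_TOKEN_OFFSET - k * CODEBOOK_SIZE)
        frames.append(frame)
    return frames


frames = [[5, 17, 2047], [0, 1, 2]]
stream = interleave(frames)
assert deinterleave(stream, num_codebooks=3) == frames
```

In this picture the LLM only ever sees and emits the flat `stream`; the codec decoder receives the de-interleaved frames and turns them back into a waveform.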

Installation

git clone https://github.com/Vyvo-Labs/VyvoTTS.git
cd VyvoTTS
uv venv --python 3.12 && source .venv/bin/activate

# Base install (includes both SNAC and Mimi support)
uv pip install -e "."

# With inference backends
uv pip install -e ".[vllm]"    # vLLM
uv pip install -e ".[sglang]"  # SGLang

# With training dependencies
uv pip install -e ".[train]"

Inference

from vyvotts.inference.transformers_inference import VyvoTTSTransformersInference

engine = VyvoTTSTransformersInference(
    config_path="vyvotts/configs/inference/lfm2_5.yaml",
    model_name="LiquidAI/LFM2.5-350M",
    tokenizer_name="LiquidAI/LFM2-350M",
    codec_type="mimi",       # or "snac"
)

audio, timing = engine.generate("Hello world", output_path="output.wav")

All four backends share the same interface; swap by changing the import:

from vyvotts.inference.vllm_inference import VyvoTTSInference              # vLLM (fastest TTFT)
from vyvotts.inference.sglang_inference import VyvoTTSSGLangInference      # SGLang (highest tok/s)
from vyvotts.inference.transformers_inference import VyvoTTSTransformersInference  # HuggingFace
from vyvotts.inference.unsloth_inference import VyvoTTSUnslothInference    # 4/8-bit quantized
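
If the backend needs to be chosen at runtime (for example from a config value), one option is a small lookup that defers the import until a backend is selected. This is a sketch, not part of VyvoTTS: the module paths and class names are the ones listed above, while the short keys and the `load_backend_class` helper are made up for illustration.

```python
import importlib

# Module paths and class names are taken from the imports above; the short
# backend keys ("vllm", "sglang", ...) are labels chosen for this sketch.
BACKENDS = {
    "vllm": ("vyvotts.inference.vllm_inference", "VyvoTTSInference"),
    "sglang": ("vyvotts.inference.sglang_inference", "VyvoTTSSGLangInference"),
    "transformers": (
        "vyvotts.inference.transformers_inference",
        "VyvoTTSTransformersInference",
    ),
    "unsloth": ("vyvotts.inference.unsloth_inference", "VyvoTTSUnslothInference"),
}


def load_backend_class(name: str):
    """Import and return the engine class for the given backend key.

    The import is deferred so that only the selected backend's optional
    dependencies need to be installed.
    """
    try:
        module_path, class_name = BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; choose from {sorted(BACKENDS)}"
        ) from None
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

For example, `load_backend_class("sglang")` returns `VyvoTTSSGLangInference`, which can then be constructed and called like the engine in the example above, provided the matching extra (`.[sglang]`) is installed.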

Benchmark (Qwen3-1.7B, H100 PCIe)

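| Engine       | TTFT  | TTFA  | Tokens/s |
|--------------|-------|-------|----------|
| SGLang       | 10 ms | 32 ms | 308      |
| vLLM         | 6 ms  | 30 ms | 292      |
| Unsloth      | 25 ms | 55 ms | 54       |
| Transformers | 22 ms | 50 ms | 50       |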
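
To relate tokens/s to playback speed, divide by the number of codec tokens needed per second of audio. The sketch below is a back-of-the-envelope estimate under two assumptions that the benchmark does not state: that the codec is Mimi (12.5 frames per second) and that 8 codebook tokens are generated per frame. It also treats the tokens/s figures as single-stream throughput.

```python
def real_time_factor(
    tokens_per_s: float, frame_rate_hz: float, tokens_per_frame: int
) -> float:
    """Seconds of audio produced per second of generation (>1 is faster than real time)."""
    audio_tokens_per_s = frame_rate_hz * tokens_per_frame  # tokens per second of audio
    return tokens_per_s / audio_tokens_per_s


# Mimi runs at 12.5 frames/s; 8 codebooks per frame is an ASSUMPTION here,
# since the benchmark does not say which codec or how many codebooks were used.
FRAME_RATE_HZ = 12.5
TOKENS_PER_FRAME = 8

benchmark = {"SGLang": 308, "vLLM": 292, "Unsloth": 54, "Transformers": 50}
for engine, tok_s in benchmark.items():
    rtf = real_time_factor(tok_s, FRAME_RATE_HZ, TOKENS_PER_FRAME)
    print(f"{engine}: {rtf:.2f}x real time")
```

Under these assumptions one second of audio costs 100 tokens, so 308 tokens/s is roughly 3x faster than playback while 50 tokens/s is about half of real time. A different codec or codebook count changes the result proportionally.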

Dataset Preparation

Standard dataset tokenization

from vyvotts.audio_tokenizer import process_dataset

process_dataset(
    original_dataset="MrDragonFox/Elise",
    output_dataset="username/dataset-name",
    model_type="lfm2_5",     # or "qwen3", "lfm2"
    codec_type="mimi",       # or "snac"
    num_gpus=8,              # multi-GPU support
)

Large-scale Emilia dataset tokenization

python -m vyvotts.tokenize_emilia \
    --dataset ylacombe/emilia-subset \
    --output_dataset /scratch/output \
    --model_type lfm2_5 \
    --codec_type mimi \
    --num_gpus 8

Supports two Emilia sources:

  • ylacombe/emilia-subset: 3.39M EN samples, parquet-based
  • amphion/Emilia-Dataset: Emilia + Emilia-YODAS EN, tar-based

Training

Pre-training (multi-GPU FSDP)

python -m accelerate.commands.launch \
    --config_file vyvotts/configs/train/accelerate_pretrain.yaml \
    vyvotts/train/pretrain/train.py

Configure in vyvotts/configs/train/lfm2_5_pretrain.yaml:

  • Model, tokenizer, codec type
  • Dataset paths (local or HuggingFace)
  • QA:TTS ratio scheduling (2:1 → 1:1)
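
As an illustration of what a 2:1 → 1:1 ratio schedule can look like, the sketch below moves the QA:TTS ratio linearly over training and turns it into a per-batch sampling probability. The linear shape and the function names (`qa_tts_ratio`, `qa_probability`, `pick_task`) are assumptions made for this example; the actual schedule is whatever the pre-training config and `train.py` define.

```python
import random


def qa_tts_ratio(
    step: int, total_steps: int, start: float = 2.0, end: float = 1.0
) -> float:
    """QA:TTS ratio at a given step, moving linearly from `start` to `end`."""
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def qa_probability(ratio: float) -> float:
    """Probability that the next batch is QA when QA:TTS = ratio:1."""
    return ratio / (ratio + 1.0)


def pick_task(step: int, total_steps: int, rng: random.Random) -> str:
    """Sample which dataset the next batch is drawn from."""
    p_qa = qa_probability(qa_tts_ratio(step, total_steps))
    return "qa" if rng.random() < p_qa else "tts"


rng = random.Random(0)
first = [pick_task(0, 1000, rng) for _ in range(5)]    # QA drawn with p = 2/3
last = [pick_task(999, 1000, rng) for _ in range(5)]   # QA drawn with p = 1/2
```

With a ratio of 2:1 a batch is QA with probability 2/3; by the end of training the two tasks are drawn equally often.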

Fine-tuning

Fine-tuning for one or more speakers, with automatic tokenization:

# Single speaker
python -m vyvotts.finetune \
    --dataset Vyvo/ElevenLabs-EN \
    --speaker ElevenLabs \
    --output_dir output/ElevenLabs

# Multiple speakers at once
python -m vyvotts.finetune \
    --dataset Vyvo/ElevenLabs-EN Vyvo/ElevenLabs-EN-Elise2-Lpq0RJl4hRqNiDLfiBMr \
    --speaker ElevenLabs Elise2 \
    --output_dir output/ElevenLabs output/Elise2 \
    --epochs 3 --batch_size 4 --lr 2e-5
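
In the multi-speaker form, `--dataset`, `--speaker`, and `--output_dir` are parallel lists: the first dataset goes with the first speaker and the first output directory, and so on. The sketch below shows that calling convention with `argparse`. It is an illustration only, not the actual `vyvotts.finetune` parser; `FinetuneJob` and `parse_jobs` are hypothetical names.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FinetuneJob:
    dataset: str
    speaker: str
    output_dir: str


def parse_jobs(argv: Optional[List[str]] = None) -> List[FinetuneJob]:
    """Parse parallel --dataset/--speaker/--output_dir lists into per-speaker jobs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", nargs="+", required=True)
    parser.add_argument("--speaker", nargs="+", required=True)
    parser.add_argument("--output_dir", nargs="+", required=True)
    args = parser.parse_args(argv)

    # All three lists must line up one-to-one.
    lengths = {len(args.dataset), len(args.speaker), len(args.output_dir)}
    if len(lengths) != 1:
        parser.error(
            "--dataset, --speaker and --output_dir need the same number of values"
        )

    return [
        FinetuneJob(d, s, o)
        for d, s, o in zip(args.dataset, args.speaker, args.output_dir)
    ]
```

Passing two datasets but only one speaker is rejected up front rather than silently dropping the unmatched entry.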

The pipeline handles everything: download → tokenize with codec → train → generate test WAV files.

See FINETUNE.md for the full guide.

Full training (accelerate)

# Full fine-tuning
python -m accelerate.commands.launch \
    --config_file vyvotts/configs/train/accelerate_finetune.yaml \
    vyvotts/train/finetune/train.py

# LoRA fine-tuning
python -m accelerate.commands.launch \
    --config_file vyvotts/configs/train/accelerate_finetune.yaml \
    vyvotts/train/finetune/lora.py
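
For a rough sense of how the two modes differ in cost: full fine-tuning updates every entry of a weight matrix, whereas LoRA freezes it and trains two low-rank factors. The arithmetic below compares trainable parameters for a single projection. The dimensions and rank are hypothetical and are not taken from `lora.py` or its config.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 1024   # hypothetical projection size
rank = 16             # hypothetical LoRA rank
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}")
```

The saving grows with layer width, since LoRA's cost scales with `d_in + d_out` rather than `d_in * d_out`.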

Supported Models

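| Model                | Type                  | Config      |
|----------------------|-----------------------|-------------|
| LiquidAI/LFM2.5-350M | Hybrid conv+attention | lfm2_5.yaml |
| LiquidAI/LFM2-350M   | Hybrid conv+attention | lfm2.yaml   |
| Qwen/Qwen3-0.6B      | Transformer           | qwen3.yaml  |
| Llama3               | Transformer           | llama3.yaml |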

Acknowledgements

License

MIT