FlashTTS

May 17, 2025


📘 Documentation | 📚 Deepwiki

Chinese | English


Powered by state-of-the-art models such as SparkTTS, OrpheusTTS, and MegaTTS 3, FlashTTS delivers high-quality Mandarin speech synthesis and zero-shot voice cloning. With a clean, intuitive web interface, you can quickly generate natural, lifelike voices for dubbing, narration, accessibility, virtual characters, and more.

If you find FlashTTS helpful, please leave us a ⭐ Star!

✨ Highlights

| Feature | Description |
| --- | --- |
| 🚀 Multi-backend Acceleration | Supports high-performance inference engines such as vllm, sglang, llama-cpp, mlx-lm, and tensorrt-llm |
| 🎯 High Concurrency | Dynamic batching and asynchronous queues to handle heavy traffic with ease |
| 🎛️ Full Parameter Control | Adjust pitch, speaking rate, temperature, emotion tags, and more |
| 📱 Lightweight Deployment | Built on FastAPI; starts with a single command and has minimal dependencies |
| 🔊 Long-form Synthesis | Supports very long texts while maintaining consistent voice quality |
| 🔄 Streaming TTS | Generates and plays audio in real time, reducing wait time and enhancing interactivity |
| 🎭 Multi-character Dialog | Synthesizes multiple roles within the same text; ideal for script dubbing |
| 🎨 Modern Frontend | Web-ready, responsive interface |
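Long-form synthesis of the kind listed above typically splits the input text at sentence boundaries and feeds the pieces to the engine one at a time so voice quality stays consistent. The sketch below is purely illustrative and is not FlashTTS's actual implementation; the function name and the 200-character chunk limit are assumptions:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split long text on sentence boundaries, greedily packing
    sentences into chunks no longer than max_chars."""
    # Split after sentence-ending punctuation (Western and CJK).
    sentences = [s for s in re.split(r"(?<=[.!?。！？])\s*", text) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks

long_text = "First sentence. " * 30
print(len(chunk_text(long_text)))  # → 3
```

Each chunk can then be synthesized independently while reusing the same cloned voice, which is how long inputs avoid blowing past the model's context window.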

🖼️ Frontend Demo

https://github.com/user-attachments/assets/1bd9d586-fac7-4016-b955-5a58d8fb9d7e

🔈 Voice Samples

Below are demos showcasing FlashTTS’s cloning capabilities across different models and characters.

SparkTTS Model: Donald Trump (EN) · Donald Trump (ZH) · Nezha · Li Jing · Yu Chengdong · Xu Zhisheng

MegaTTS 3 Model: Cai Xukun · Taiyi Zhenren

OrpheusTTS (ZH) Model: Changle · Baizhi

Quick Start

It is recommended to install flashtts in a Python 3.8–3.12 environment via pip:

pip install flashtts

For detailed installation steps, please refer to: installation guide

Local inference command:

flashtts infer \
  -i "hello world." \
  -o output.wav \
  -m ./models/your_model \
  -b vllm \
  [other optional parameters]

For detailed usage, please refer to: quick_start.md

Server deployment:

flashtts serve \
  --model_path Spark-TTS-0.5B \
  --backend vllm \
  --role_dir data/roles \
  --llm_device cuda \
  --tokenizer_device cuda \
  --detokenizer_device cuda \
  --wav2vec_attn_implementation sdpa \
  --llm_attn_implementation sdpa \
  --torch_dtype "bfloat16" \
  --max_length 32768 \
  --llm_gpu_memory_utilization 0.6 \
  --fix_voice \
  --host 0.0.0.0 \
  --port 8000

The --fix_voice flag fixes the built-in spark-tts timbres (female and male).

Web interface: http://localhost:8000

API documentation: http://localhost:8000/docs

For detailed deployment, please refer to: server.md

⚡ Inference Speed

Test environment: A800 GPU · Model: Spark-TTS-0.5B · Test script: speed_test.py

| Scenario | Engine | Device | Audio Length (s) | Inference Time (s) | RTF |
| --- | --- | --- | --- | --- | --- |
| Short | llama-cpp | CPU | 7.48 | 6.81 | 0.91 |
| Short | torch | GPU | 7.18 | 7.68 | 1.07 |
| Short | vllm | GPU | 7.24 | 1.66 | 0.23 |
| Short | sglang | GPU | 7.58 | 1.07 | 0.14 |
| Long | llama-cpp | CPU | 121.98 | 117.83 | 0.97 |
| Long | torch | GPU | 113.70 | 107.17 | 0.94 |
| Long | vllm | GPU | 111.82 | 7.28 | 0.07 |
| Long | sglang | GPU | 117.02 | 4.20 | 0.04 |

An RTF (real-time factor) below 1 means synthesis runs faster than real time.
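The RTF column is simply inference time divided by audio length, so the benchmark numbers can be recomputed directly from the table:

```python
# Recompute RTF (real-time factor) from the benchmark table above:
# RTF = inference_time / audio_length; values below 1.0 are faster
# than real time.
rows = [
    ("Short", "vllm",   7.24,   1.66),
    ("Short", "sglang", 7.58,   1.07),
    ("Long",  "vllm",   111.82, 7.28),
    ("Long",  "sglang", 117.02, 4.20),
]
for scenario, engine, audio_s, infer_s in rows:
    rtf = infer_s / audio_s
    # Matches the table's RTF column when rounded to two decimals.
    print(f"{scenario:5s} {engine:7s} RTF={rtf:.2f}")
```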

⚙️ Usage Tips

  1. SparkTTS weights must be bfloat16 or float32; using float16 will cause errors.
  2. If you experience long silent gaps, try increasing repetition_penalty (> 1.0).
  3. OrpheusTTS supports inserting <tag> in text to control emotion. See LANG_MAP in orpheus_engine.py.
  4. For safety reasons, MegaTTS 3 does not publish the WaveVAE encoder. Please follow the official instructions to download it: reference audio.
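The emotion tags from tip 3 are plain inline markers embedded in the input string before synthesis. A tiny illustrative helper is sketched below; the tag names are assumptions for the example, so consult LANG_MAP in orpheus_engine.py for the set your model actually supports:

```python
# Illustrative helper: embed an emotion tag (e.g. <laugh>) into the
# text sent to OrpheusTTS. The tag set here is an assumed example;
# the real supported tags live in LANG_MAP in orpheus_engine.py.
SUPPORTED_TAGS = {"laugh", "sigh", "gasp"}  # assumed example set

def tag_text(text: str, tag: str) -> str:
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"unknown emotion tag: {tag}")
    return f"<{tag}> {text}"

print(tag_text("That joke was great.", "laugh"))
# → <laugh> That joke was great.
```

Validating tags up front gives a clear error instead of the model silently reading an unknown tag aloud as literal text.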

🤝 Acknowledgments

⚠️ Disclaimer

FlashTTS is provided for academic research, education, and lawful purposes only, such as accessibility assistance and personalized speech synthesis. Do not use it for fraud, impersonation, deepfakes, or other illegal activities. Users are responsible for any misuse.

License

This project follows the same license as Spark-TTS. See LICENSE for details.

Star History

Star History Chart