FlashTTS

May 17, 2025


📘 Documentation | 📚 Deepwiki

Chinese | English


Powered by state-of-the-art models such as SparkTTS, OrpheusTTS, and MegaTTS 3, FlashTTS delivers high-quality Mandarin speech synthesis and zero-shot voice cloning. With a clean, intuitive web interface, you can quickly generate natural, lifelike voices for dubbing, narration, accessibility, virtual characters, and more.

If you find FlashTTS helpful, please leave us a ⭐ Star!

✨ Highlights

| Feature | Description |
| --- | --- |
| 🚀 Multi-backend Acceleration | Supports high-performance inference engines such as vllm, sglang, llama-cpp, mlx-lm, and tensorrt-llm |
| 🎯 High Concurrency | Dynamic batching and asynchronous queues to handle heavy traffic with ease |
| 🎛️ Full Parameter Control | Adjust pitch, speaking rate, temperature, emotion tags, and more |
| 📱 Lightweight Deployment | Built on FastAPI; starts with a single command and has minimal dependencies |
| 🔊 Long-form Synthesis | Supports very long texts while maintaining consistent voice quality |
| 🔄 Streaming TTS | Generates and plays audio in real time, reducing wait time and enhancing interactivity |
| 🎭 Multi-character Dialog | Synthesizes multiple roles within the same text; ideal for script dubbing |
| 🎨 Modern Frontend | Web-ready, responsive interface |
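Long-form synthesis of the kind listed above typically splits the input text at sentence boundaries and feeds the pieces to the engine one at a time so voice quality stays consistent. The sketch below is purely illustrative and is not FlashTTS's actual implementation; the function name and the 200-character chunk limit are assumptions:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split long text on sentence boundaries, greedily packing
    sentences into chunks no longer than max_chars."""
    # Split after sentence-ending punctuation (Western and CJK).
    sentences = [s for s in re.split(r"(?<=[.!?。！？])\s*", text) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks

long_text = "First sentence. " * 30
print(len(chunk_text(long_text)))  # → 3
```

Each chunk can then be synthesized independently while reusing the same cloned voice, which is how long inputs avoid blowing past the model's context window.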

🖼️ Frontend Demo

https://github.com/user-attachments/assets/1bd9d586-fac7-4016-b955-5a58d8fb9d7e

🔈 Voice Samples

Below are demos showcasing FlashTTS’s cloning capabilities across different models and characters.

SparkTTS Model: Donald Trump (EN) · Donald Trump (ZH) · Nezha · Li Jing · Yu Chengdong · Xu Zhisheng

MegaTTS 3 Model: Cai Xukun · Taiyi Zhenren

OrpheusTTS (ZH) Model: Changle · Baizhi

Quick Start

It is recommended to install flashtts in a Python 3.8–3.12 environment via pip:

pip install flashtts

For detailed installation steps, please refer to: installation guide

Local inference command:

flashtts infer \
  -i "hello world." \
  -o output.wav \
  -m ./models/your_model \
  -b vllm \
  [other optional parameters]

For detailed usage, please refer to: quick_start.md

Server deployment:

flashtts serve \
  --model_path Spark-TTS-0.5B \
  --backend vllm \
  --role_dir data/roles \
  --llm_device cuda \
  --tokenizer_device cuda \
  --detokenizer_device cuda \
  --wav2vec_attn_implementation sdpa \
  --llm_attn_implementation sdpa \
  --torch_dtype "bfloat16" \
  --max_length 32768 \
  --llm_gpu_memory_utilization 0.6 \
  --fix_voice \
  --host 0.0.0.0 \
  --port 8000

The --fix_voice flag fixes the built-in spark-tts timbres (female and male).

Web interface: http://localhost:8000

API documentation: http://localhost:8000/docs

For detailed deployment, please refer to: server.md

⚡ Inference Speed

Test environment: A800 GPU · Model: Spark-TTS-0.5B · Test script: speed_test.py

| Scenario | Engine | Device | Audio Length (s) | Inference Time (s) | RTF |
| --- | --- | --- | --- | --- | --- |
| Short | llama-cpp | CPU | 7.48 | 6.81 | 0.91 |
| Short | torch | GPU | 7.18 | 7.68 | 1.07 |
| Short | vllm | GPU | 7.24 | 1.66 | 0.23 |
| Short | sglang | GPU | 7.58 | 1.07 | 0.14 |
| Long | llama-cpp | CPU | 121.98 | 117.83 | 0.97 |
| Long | torch | GPU | 113.70 | 107.17 | 0.94 |
| Long | vllm | GPU | 111.82 | 7.28 | 0.07 |
| Long | sglang | GPU | 117.02 | 4.20 | 0.04 |

An RTF (real-time factor) below 1 means synthesis runs faster than real time.
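The RTF column is simply inference time divided by audio length, so the benchmark numbers can be recomputed directly from the table:

```python
# Recompute RTF (real-time factor) from the benchmark table above:
# RTF = inference_time / audio_length; values below 1.0 are faster
# than real time.
rows = [
    ("Short", "vllm",   7.24,   1.66),
    ("Short", "sglang", 7.58,   1.07),
    ("Long",  "vllm",   111.82, 7.28),
    ("Long",  "sglang", 117.02, 4.20),
]
for scenario, engine, audio_s, infer_s in rows:
    rtf = infer_s / audio_s
    # Matches the table's RTF column when rounded to two decimals.
    print(f"{scenario:5s} {engine:7s} RTF={rtf:.2f}")
```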

⚙️ Usage Tips

  1. SparkTTS weights must be bfloat16 or float32; using float16 will cause errors.
  2. If you experience long silent gaps, try increasing repetition_penalty (> 1.0).
  3. OrpheusTTS supports inserting <tag> in text to control emotion. See LANG_MAP in orpheus_engine.py.
  4. For safety reasons, MegaTTS 3 does not publish the WaveVAE encoder. Please follow the official instructions to download it: reference audio.
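The emotion tags from tip 3 are plain inline markers embedded in the input string before synthesis. A tiny illustrative helper is sketched below; the tag names are assumptions for the example, so consult LANG_MAP in orpheus_engine.py for the set your model actually supports:

```python
# Illustrative helper: embed an emotion tag (e.g. <laugh>) into the
# text sent to OrpheusTTS. The tag set here is an assumed example;
# the real supported tags live in LANG_MAP in orpheus_engine.py.
SUPPORTED_TAGS = {"laugh", "sigh", "gasp"}  # assumed example set

def tag_text(text: str, tag: str) -> str:
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"unknown emotion tag: {tag}")
    return f"<{tag}> {text}"

print(tag_text("That joke was great.", "laugh"))
# → <laugh> That joke was great.
```

Validating tags up front gives a clear error instead of the model silently reading an unknown tag aloud as literal text.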

🤝 Acknowledgments

⚠️ Disclaimer

FlashTTS is provided for academic research, education, and lawful purposes only, such as accessibility assistance and personalized speech synthesis. Do not use it for fraud, impersonation, deepfakes, or other illegal activities. Users are responsible for any misuse.

License

This project follows the same license as Spark-TTS. See LICENSE for details.

Star History

Star History Chart