# Flash-TTS Backend Deployment and API Usage Guide
## 1. Installation & Startup

- Refer to the installation guide: installation.md
- Start the server:
  - **Spark-TTS**

    ```bash
    # Change --model_path to your model path if needed.
    # --backend: choose from vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
    # --llm_attn_implementation sdpa is recommended for the torch backend.
    # Spark-TTS does not support bfloat16 on all devices; use --torch_dtype float32 if needed.
    # --fix_voice pins the built-in spark-tts timbres (female and male).
    flashtts serve \
      --model_path Spark-TTS-0.5B \
      --backend vllm \
      --llm_device cuda \
      --tokenizer_device cuda \
      --detokenizer_device cuda \
      --wav2vec_attn_implementation sdpa \
      --llm_attn_implementation sdpa \
      --torch_dtype "bfloat16" \
      --max_length 32768 \
      --llm_gpu_memory_utilization 0.6 \
      --fix_voice \
      --host 0.0.0.0 \
      --port 8000
    ```

  - **MegaTTS3**

    ```bash
    # Change --model_path to your model path if needed.
    # --backend: choose from vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
    # --llm_attn_implementation sdpa is recommended for the torch backend.
    flashtts serve \
      --model_path MegaTTS3 \
      --backend vllm \
      --llm_device cuda \
      --tokenizer_device cuda \
      --llm_attn_implementation sdpa \
      --torch_dtype "float16" \
      --max_length 8192 \
      --llm_gpu_memory_utilization 0.6 \
      --host 0.0.0.0 \
      --port 8000
    ```

  - **Orpheus-TTS**

    ```bash
    # Change --model_path to your model path if needed.
    # --backend: choose from vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
    # --llm_attn_implementation sdpa is recommended for the torch backend.
    flashtts serve \
      --model_path orpheus-3b-0.1-ft-bf16 \
      --snac_path snac_24khz \
      --lang english \
      --backend vllm \
      --llm_device cuda \
      --detokenizer_device cuda \
      --llm_attn_implementation sdpa \
      --torch_dtype "float16" \
      --max_length 8192 \
      --llm_gpu_memory_utilization 0.6 \
      --host 0.0.0.0 \
      --port 8000
    ```

- Access the web interface: http://localhost:8000
- View the API documentation: http://localhost:8000/docs
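Once the server is up, a quick smoke test is to request the docs page and check the status code:

```bash
# Should print 200 if the server is running
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/docs
```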
## 2. Server Startup Arguments (`server.py`)
| Argument | Type | Description | Default |
|---|---|---|---|
| `--model_path` | str | Required. Path to the TTS model directory | — |
| `--backend` | str | Required. TTS backend engine. Options: `llama-cpp`, `vllm`, `sglang`, `torch`, `mlx-lm`, `tensorrt-llm` | — |
| `--snac_path` | str | Path to the OrpheusTTS SNAC module. Required only for Orpheus models | None |
| `--llm_tensorrt_path` | str | Path to the TensorRT engine. Only effective with the `tensorrt-llm` backend. If not provided, defaults to `{model_path}/tensorrt-engine` | None |
| `--role_dir` | str | Directory for role audio references | Spark: `data/roles`; Mega: `data/mega-roles` |
| `--api_key` | str | API key. If set, all requests must include `Authorization: Bearer <KEY>` | None |
| `--llm_device` | str | Device for running the LLM (e.g., `cpu`, `cuda`) | auto |
| `--tokenizer_device` | str | Device for the audio tokenizer | auto |
| `--detokenizer_device` | str | Device for the audio detokenizer | auto |
| `--wav2vec_attn_implementation` | str | Attention implementation for wav2vec in Spark-TTS. Options: `sdpa`, `flash_attention_2`, `eager` | eager |
| `--llm_attn_implementation` | str | Attention implementation for the LLM (`torch` backend). Options: `sdpa`, `flash_attention_2`, `eager` | eager |
| `--max_length` | int | Maximum LLM context length | 32768 |
| `--llm_gpu_memory_utilization` | float | GPU memory usage ratio (vllm/sglang only) | 0.6 |
| `--torch_dtype` | str | Model precision. Options: `float16`, `bfloat16`, `float32`, `auto` | auto |
| `--cache_implementation` | str | Cache strategy for the `torch` backend: `static`, `offloaded_static`, `sliding_window`, etc. | None |
| `--seed` | int | Random seed | 0 |
| `--batch_size` | int | Maximum batch size for audio processing | 1 |
| `--llm_batch_size` | int | Maximum LLM batch size | 256 |
| `--wait_timeout` | float | Dynamic-batching wait timeout, in seconds | 0.01 |
| `--host` | str | Host address to bind | 0.0.0.0 |
| `--port` | int | Port number to listen on | 8000 |
| `--fix_voice` | bool | Pin the built-in female and male timbres of Spark-TTS so they stay consistent across runs | False |
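For example, restricting access with `--api_key` looks like this; the key value is a placeholder, and clients must then send the matching bearer token as shown in Section 3:

```bash
# Start the server with authentication enabled (MY_SECRET_KEY is a placeholder).
# Requests must then include: Authorization: Bearer MY_SECRET_KEY
flashtts serve \
  --model_path Spark-TTS-0.5B \
  --backend vllm \
  --api_key MY_SECRET_KEY \
  --host 0.0.0.0 \
  --port 8000
```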
## 3. API Usage Workflow

Example using cURL:

```bash
curl -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio_file=@/path/to/ref.wav" \
  -F "stream=false" \
  -F "response_format=wav" \
  --output output.wav
```
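Setting `stream=true` returns audio incrementally instead of one full file. A minimal sketch that plays the stream as it arrives, assuming FFmpeg's `ffplay` is installed:

```bash
# -N disables curl's output buffering so audio chunks are forwarded immediately
curl -N -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio_file=@/path/to/ref.wav" \
  -F "stream=true" \
  -F "response_format=wav" \
  | ffplay -autoexit -nodisp -
```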
## 4. API Endpoints and Parameters

### 4.1 Voice Cloning: POST /clone_voice

- Content-Type: `multipart/form-data`
- Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | string | Yes | Text to synthesize |
| `reference_audio` | string | No | Reference audio (URL or base64 string). Use this or `reference_audio_file` |
| `reference_audio_file` | file | No | Uploaded reference audio file (WAV) |
| `latent_file` | file | No | Uploaded latent file (npy) for MegaTTS3 |
| `reference_text` | string | No | Transcription of the reference audio |
| `pitch` | enum | No | Pitch: `very_low`, `low`, `moderate`, `high`, `very_high` |
| `speed` | enum | No | Speed: `very_low`, `low`, `moderate`, `high`, `very_high` |
| `temperature` | float | No | Controls randomness in generation |
| `top_k` | int | No | Top-K sampling |
| `top_p` | float | No | Nucleus sampling threshold |
| `repetition_penalty` | float | No | Penalty to reduce repetition |
| `max_tokens` | int | No | Maximum number of tokens to generate |
| `length_threshold` | int | No | Length threshold for splitting long text |
| `window_size` | int | No | Window size for text chunking |
| `stream` | boolean | No | Return streaming audio (`true`) or the full audio (`false`) |
| `response_format` | enum | No | Output audio format: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` |
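For comparison with the file-upload example in Section 3, here is a sketch of a request that passes the reference audio as a URL (the URL and transcript are illustrative):

```bash
# Clone a voice from a reference audio URL instead of a file upload
curl -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio=https://example.com/ref.wav" \
  -F "reference_text=Transcript of the reference audio" \
  -F "pitch=moderate" \
  -F "speed=moderate" \
  -F "response_format=mp3" \
  --output output.mp3
```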
### 4.2 Role-based Synthesis: POST /speak

- Content-Type: `application/json`
- Body example:

```json
{
  "name": "RoleName",
  "text": "Text to synthesize",
  "pitch": "moderate",
  "speed": "moderate",
  "temperature": 0.9,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.0,
  "max_tokens": 4096,
  "length_threshold": 50,
  "window_size": 50,
  "stream": false,
  "response_format": "mp3"
}
```
- Note: Same fields as `CloneRequest`, plus a `name` field selecting the voice role.
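A corresponding cURL request might look like this (the role name is assumed to already exist on the server; see 4.6):

```bash
# Synthesize with a pre-registered role
curl -X POST http://localhost:8000/speak \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "RoleName", "text": "Text to synthesize", "response_format": "mp3"}' \
  --output output.mp3
```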
### 4.3 Multi-Speaker Dialogue Synthesis: POST /multi_speak

- Content-Type: `application/json`
- Body example:

```json
{
  "text": "<role:female> Hello! <role:male> I'm good, thank you!",
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.0,
  "max_tokens": 4096,
  "length_threshold": 50,
  "window_size": 50,
  "stream": true,
  "response_format": "wav"
}
```
- Note: The `name` field is omitted; the speaker is indicated by a `<role:role_name>` prefix in the text.
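A sketch of the same request via cURL, streaming the dialogue to a file:

```bash
# Stream a two-speaker dialogue to dialogue.wav
curl -N -X POST http://localhost:8000/multi_speak \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "<role:female> Hello! <role:male> Hi, nice to meet you!", "stream": true, "response_format": "wav"}' \
  --output dialogue.wav
```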
### 4.4 OpenAI-Compatible Endpoint (Prefix /v1)

- Paths and functionality mirror the standard API.
- Uses the `OpenAISpeechRequest` format:
  - `model`: Model ID or name
  - `input`: Text to synthesize
  - `voice`: Name of the voice role to use, or a URL/base64 string of a reference audio
  - Other parameters are the same as in Clone/Speak
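A sketch of an OpenAI-style request, assuming the server exposes OpenAI's `/v1/audio/speech` route and that `spark` is a valid model name; check `/docs` on your server for the exact path and model ID:

```bash
# OpenAI-compatible request (path and model name are assumptions; verify via /docs)
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "spark", "input": "Hello, world", "voice": "female", "response_format": "wav"}' \
  --output output.wav
```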
### 4.5 Retrieve Available Roles: GET /audio_roles or GET /v1/audio_roles

- Response example:

```json
{ "success": true, "roles": ["alice", "bob", "tara"] }
```
### 4.6 Add Role: POST /add_speaker

- Content-Type: `multipart/form-data`
- Parameter description:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the role to be added |
| `audio` | string | No | URL of the reference audio sample or a base64-encoded string (alternative to `audio_file`) |
| `reference_text` | string | No | Transcription or text description of the reference audio |
| `audio_file` | file | No | Uploaded reference audio file (WAV format), alternative to `audio` |
| `latent_file` | file | No | Latent file used by the Mega engine (used in combination with `audio`/`audio_file`) |

- Response example:

```json
{ "success": true, "role": "Role Name" }
```
### 4.7 Delete Role: POST /delete_speaker

- Content-Type: `multipart/form-data`
- Parameter description:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the role to be deleted |

- Response example:

```json
{ "success": true, "role": "Role Name" }
```