# Flash-TTS Backend Deployment and API Usage Guide
## 1. Installation & Startup

- Refer to the installation guide: installation.md
- Start the server:
  - **Spark-TTS**

    ```bash
    # Change --model_path to your model path if needed.
    # --backend: choose from vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
    # --llm_attn_implementation sdpa is recommended for the torch backend.
    # Spark-TTS does not support bfloat16 on all devices; use --torch_dtype float32 if needed.
    # --fix_voice pins the built-in spark-tts timbres (female and male).
    flashtts serve \
      --model_path Spark-TTS-0.5B \
      --backend vllm \
      --llm_device cuda \
      --tokenizer_device cuda \
      --detokenizer_device cuda \
      --wav2vec_attn_implementation sdpa \
      --llm_attn_implementation sdpa \
      --torch_dtype "bfloat16" \
      --max_length 32768 \
      --llm_gpu_memory_utilization 0.6 \
      --fix_voice \
      --host 0.0.0.0 \
      --port 8000
    ```

  - **MegaTTS3**

    ```bash
    # Change --model_path to your model path if needed.
    # --backend: choose from vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
    # --llm_attn_implementation sdpa is recommended for the torch backend.
    flashtts serve \
      --model_path MegaTTS3 \
      --backend vllm \
      --llm_device cuda \
      --tokenizer_device cuda \
      --llm_attn_implementation sdpa \
      --torch_dtype "float16" \
      --max_length 8192 \
      --llm_gpu_memory_utilization 0.6 \
      --host 0.0.0.0 \
      --port 8000
    ```

  - **Orpheus-TTS**

    ```bash
    # Change --model_path to your model path if needed.
    # --backend: choose from vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
    # --llm_attn_implementation sdpa is recommended for the torch backend.
    flashtts serve \
      --model_path orpheus-3b-0.1-ft-bf16 \
      --snac_path snac_24khz \
      --lang english \
      --backend vllm \
      --llm_device cuda \
      --detokenizer_device cuda \
      --llm_attn_implementation sdpa \
      --torch_dtype "float16" \
      --max_length 8192 \
      --llm_gpu_memory_utilization 0.6 \
      --host 0.0.0.0 \
      --port 8000
    ```

- Access the web interface: http://localhost:8000
- View the API documentation: http://localhost:8000/docs
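Once the server is up, a quick smoke test is to request the docs page and check the status code:

```bash
# Should print 200 if the server is running
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/docs
```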
## 2. Server Startup Arguments (`server.py`)
| Argument | Type | Description | Default |
|---|---|---|---|
| `--model_path` | str | Required. Path to the TTS model directory | — |
| `--backend` | str | Required. TTS backend engine. Options: `llama-cpp`, `vllm`, `sglang`, `torch`, `mlx-lm`, `tensorrt-llm` | — |
| `--snac_path` | str | Path to the OrpheusTTS SNAC module. Required only for Orpheus models | None |
| `--llm_tensorrt_path` | str | Path to the TensorRT engine. Only effective with the `tensorrt-llm` backend. If not provided, defaults to `{model_path}/tensorrt-engine` | None |
| `--role_dir` | str | Directory for role audio references | Spark: `data/roles`; Mega: `data/mega-roles` |
| `--api_key` | str | API key. If set, all requests must include `Authorization: Bearer <KEY>` | None |
| `--llm_device` | str | Device for running the LLM (e.g., `cpu`, `cuda`) | auto |
| `--tokenizer_device` | str | Device for the audio tokenizer | auto |
| `--detokenizer_device` | str | Device for the audio detokenizer | auto |
| `--wav2vec_attn_implementation` | str | Attention implementation for wav2vec in Spark-TTS. Options: `sdpa`, `flash_attention_2`, `eager` | eager |
| `--llm_attn_implementation` | str | Attention implementation for the LLM (`torch` backend). Options: `sdpa`, `flash_attention_2`, `eager` | eager |
| `--max_length` | int | Maximum LLM context length | 32768 |
| `--llm_gpu_memory_utilization` | float | GPU memory usage ratio (vllm/sglang only) | 0.6 |
| `--torch_dtype` | str | Model precision. Options: `float16`, `bfloat16`, `float32`, `auto` | auto |
| `--cache_implementation` | str | Cache strategy for the `torch` backend: `static`, `offloaded_static`, `sliding_window`, etc. | None |
| `--seed` | int | Random seed | 0 |
| `--batch_size` | int | Maximum batch size for audio processing | 1 |
| `--llm_batch_size` | int | Maximum LLM batch size | 256 |
| `--wait_timeout` | float | Dynamic-batching wait timeout, in seconds | 0.01 |
| `--host` | str | Host address to bind | 0.0.0.0 |
| `--port` | int | Port number to listen on | 8000 |
| `--fix_voice` | bool | Pin the built-in female and male timbres of Spark-TTS so they stay consistent across runs | False |
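For example, restricting access with `--api_key` looks like this; the key value is a placeholder, and clients must then send the matching bearer token as shown in Section 3:

```bash
# Start the server with authentication enabled (MY_SECRET_KEY is a placeholder).
# Requests must then include: Authorization: Bearer MY_SECRET_KEY
flashtts serve \
  --model_path Spark-TTS-0.5B \
  --backend vllm \
  --api_key MY_SECRET_KEY \
  --host 0.0.0.0 \
  --port 8000
```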
## 3. API Usage Workflow

Example using cURL:

```bash
curl -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio_file=@/path/to/ref.wav" \
  -F "stream=false" \
  -F "response_format=wav" \
  --output output.wav
```
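Setting `stream=true` returns audio incrementally instead of one full file. A minimal sketch that plays the stream as it arrives, assuming FFmpeg's `ffplay` is installed:

```bash
# -N disables curl's output buffering so audio chunks are forwarded immediately
curl -N -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio_file=@/path/to/ref.wav" \
  -F "stream=true" \
  -F "response_format=wav" \
  | ffplay -autoexit -nodisp -
```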
## 4. API Endpoints and Parameters

### 4.1 Voice Cloning: POST /clone_voice

- Content-Type: `multipart/form-data`
- Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | string | Yes | Text to synthesize |
| `reference_audio` | string | No | Reference audio (URL or base64 string). Use this or `reference_audio_file` |
| `reference_audio_file` | file | No | Uploaded reference audio file (WAV) |
| `latent_file` | file | No | Uploaded latent file (npy) for MegaTTS3 |
| `reference_text` | string | No | Transcription of the reference audio |
| `pitch` | enum | No | Pitch: `very_low`, `low`, `moderate`, `high`, `very_high` |
| `speed` | enum | No | Speed: `very_low`, `low`, `moderate`, `high`, `very_high` |
| `temperature` | float | No | Controls randomness in generation |
| `top_k` | int | No | Top-K sampling |
| `top_p` | float | No | Nucleus sampling threshold |
| `repetition_penalty` | float | No | Penalty to reduce repetition |
| `max_tokens` | int | No | Maximum number of tokens to generate |
| `length_threshold` | int | No | Length threshold for splitting long text |
| `window_size` | int | No | Window size for text chunking |
| `stream` | boolean | No | Return streaming audio (`true`) or the full audio (`false`) |
| `response_format` | enum | No | Output audio format: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` |
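For comparison with the file-upload example in Section 3, here is a sketch of a request that passes the reference audio as a URL (the URL and transcript are illustrative):

```bash
# Clone a voice from a reference audio URL instead of a file upload
curl -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio=https://example.com/ref.wav" \
  -F "reference_text=Transcript of the reference audio" \
  -F "pitch=moderate" \
  -F "speed=moderate" \
  -F "response_format=mp3" \
  --output output.mp3
```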
### 4.2 Role-based Synthesis: POST /speak

- Content-Type: `application/json`
- Body example:

```json
{
  "name": "RoleName",
  "text": "Text to synthesize",
  "pitch": "moderate",
  "speed": "moderate",
  "temperature": 0.9,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.0,
  "max_tokens": 4096,
  "length_threshold": 50,
  "window_size": 50,
  "stream": false,
  "response_format": "mp3"
}
```
- Note: Same fields as `CloneRequest`, plus a `name` field selecting the voice role.
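A corresponding cURL request might look like this (the role name is assumed to already exist on the server; see 4.6):

```bash
# Synthesize with a pre-registered role
curl -X POST http://localhost:8000/speak \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "RoleName", "text": "Text to synthesize", "response_format": "mp3"}' \
  --output output.mp3
```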
### 4.3 Multi-Speaker Dialogue Synthesis: POST /multi_speak

- Content-Type: `application/json`
- Body example:

```json
{
  "text": "<role:female> Hello! <role:male> I'm good, thank you!",
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.0,
  "max_tokens": 4096,
  "length_threshold": 50,
  "window_size": 50,
  "stream": true,
  "response_format": "wav"
}
```
- Note: The `name` field is omitted; the speaker is indicated by a `<role:role_name>` prefix in the text.
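A sketch of the same request via cURL, streaming the dialogue to a file:

```bash
# Stream a two-speaker dialogue to dialogue.wav
curl -N -X POST http://localhost:8000/multi_speak \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "<role:female> Hello! <role:male> Hi, nice to meet you!", "stream": true, "response_format": "wav"}' \
  --output dialogue.wav
```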
### 4.4 OpenAI-Compatible Endpoint (Prefix /v1)

- Paths and functionality mirror the standard API.
- Uses the `OpenAISpeechRequest` format:
  - `model`: Model ID or name
  - `input`: Text to synthesize
  - `voice`: Name of the voice role to use, or a URL/base64 string of a reference audio
  - Other parameters are the same as in Clone/Speak
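A sketch of an OpenAI-style request, assuming the server exposes OpenAI's `/v1/audio/speech` route and that `spark` is a valid model name; check `/docs` on your server for the exact path and model ID:

```bash
# OpenAI-compatible request (path and model name are assumptions; verify via /docs)
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "spark", "input": "Hello, world", "voice": "female", "response_format": "wav"}' \
  --output output.wav
```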
### 4.5 Retrieve Available Roles: GET /audio_roles or GET /v1/audio_roles

- Response example:

```json
{ "success": true, "roles": ["alice", "bob", "tara"] }
```
### 4.6 Add Role: POST /add_speaker

- Content-Type: `multipart/form-data`
- Parameter description:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the role to be added |
| `audio` | string | No | URL of the reference audio sample or a base64-encoded string (alternative to `audio_file`) |
| `reference_text` | string | No | Transcription or text description of the reference audio |
| `audio_file` | file | No | Uploaded reference audio file (WAV format), alternative to `audio` |
| `latent_file` | file | No | Latent file used by the Mega engine (used in combination with `audio`/`audio_file`) |

- Response example:

```json
{ "success": true, "role": "Role Name" }
```
### 4.7 Delete Role: POST /delete_speaker

- Content-Type: `multipart/form-data`
- Parameter description:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the role to be deleted |

- Response example:

```json
{ "success": true, "role": "Role Name" }
```