server.md

May 18, 2025

Flash-TTS Backend Deployment and API Usage Guide

1. Installation & Startup

  1. Refer to the installation guide: installation.md

  2. Start the server:

    • Spark-TTS

    ```bash
    flashtts serve \
      --model_path Spark-TTS-0.5B \
      --backend vllm \
      --llm_device cuda \
      --tokenizer_device cuda \
      --detokenizer_device cuda \
      --wav2vec_attn_implementation sdpa \
      --llm_attn_implementation sdpa \
      --torch_dtype "bfloat16" \
      --max_length 32768 \
      --llm_gpu_memory_utilization 0.6 \
      --fix_voice \
      --host 0.0.0.0 \
      --port 8000
    ```

    Notes:
      • Change --model_path to your local model path.
      • --backend can be any of: vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
      • --llm_attn_implementation sdpa is recommended for the torch backend.
      • Spark-TTS does not support bfloat16 on all devices; use --torch_dtype float32 if needed.
      • --fix_voice keeps the built-in female and male timbres fixed.
    
    • MegaTTS3

    ```bash
    flashtts serve \
      --model_path MegaTTS3 \
      --backend vllm \
      --llm_device cuda \
      --tokenizer_device cuda \
      --llm_attn_implementation sdpa \
      --torch_dtype "float16" \
      --max_length 8192 \
      --llm_gpu_memory_utilization 0.6 \
      --host 0.0.0.0 \
      --port 8000
    ```

    Notes:
      • Change --model_path to your local model path.
      • --backend can be any of: vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
      • --llm_attn_implementation sdpa is recommended for the torch backend.
    
    • Orpheus-TTS

    ```bash
    flashtts serve \
      --model_path orpheus-3b-0.1-ft-bf16 \
      --snac_path snac_24khz \
      --lang english \
      --backend vllm \
      --llm_device cuda \
      --detokenizer_device cuda \
      --llm_attn_implementation sdpa \
      --torch_dtype "float16" \
      --max_length 8192 \
      --llm_gpu_memory_utilization 0.6 \
      --host 0.0.0.0 \
      --port 8000
    ```

    Notes:
      • Change --model_path to your local model path.
      • --snac_path is required for Orpheus models; --lang selects the synthesis language.
      • --backend can be any of: vllm, sglang, torch, llama-cpp, mlx-lm, tensorrt-llm.
      • --llm_attn_implementation sdpa is recommended for the torch backend.
    
  3. Access the web interface:

    http://localhost:8000
    
  4. View API documentation:

    http://localhost:8000/docs
    

2. Server Startup Arguments (server.py)

| Argument | Type | Description | Default |
|---|---|---|---|
| `--model_path` | str | **Required.** Path to the TTS model directory | — |
| `--backend` | str | **Required.** TTS backend engine. Options: `llama-cpp`, `vllm`, `sglang`, `torch`, `mlx-lm`, `tensorrt-llm` | — |
| `--snac_path` | str | Path to the OrpheusTTS SNAC module. Required only if the model is Orpheus | None |
| `--llm_tensorrt_path` | str | Path to the TensorRT model. Only effective when the backend is `tensorrt-llm`. If not provided, defaults to `{model_path}/tensorrt-engine` | None |
| `--role_dir` | str | Directory for role audio references | Spark: `data/roles`; Mega: `data/mega-roles` |
| `--api_key` | str | API key for access. If set, all requests must include `Authorization: Bearer <KEY>` | None |
| `--llm_device` | str | Device for running the LLM (e.g., `cpu`, `cuda`) | auto |
| `--tokenizer_device` | str | Device for the audio tokenizer | auto |
| `--detokenizer_device` | str | Device for the audio detokenizer | auto |
| `--wav2vec_attn_implementation` | str | Attention implementation for wav2vec in Spark-TTS. Options: `sdpa`, `flash_attention_2`, `eager` | eager |
| `--llm_attn_implementation` | str | Attention implementation for the LLM (torch backend). Options: `sdpa`, `flash_attention_2`, `eager` | eager |
| `--max_length` | int | Max LLM context length | 32768 |
| `--llm_gpu_memory_utilization` | float | GPU memory usage ratio (vllm/sglang only) | 0.6 |
| `--torch_dtype` | str | Model precision. Options: `float16`, `bfloat16`, `float32`, `auto` | auto |
| `--cache_implementation` | str | Cache strategy for the torch backend: `static`, `offloaded_static`, `sliding_window`, etc. | None |
| `--seed` | int | Random seed | 0 |
| `--batch_size` | int | Max batch size for audio processing | 1 |
| `--llm_batch_size` | int | Max LLM batch size | 256 |
| `--wait_timeout` | float | Timeout (in seconds) for dynamic batching | 0.01 |
| `--host` | str | Host address to bind | 0.0.0.0 |
| `--port` | int | Port number to listen on | 8000 |
| `--fix_voice` | bool | Fix the built-in female and male timbres of Spark-TTS so they remain unchanged | False |

3. API Usage Workflow

Example using cURL:

```bash
curl -X POST http://localhost:8000/clone_voice \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, world" \
  -F "reference_audio_file=@/path/to/ref.wav" \
  -F "stream=false" \
  -F "response_format=wav" \
  --output output.wav
```
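The same request can be made from Python. The sketch below uses only the standard library, so it includes a small multipart/form-data encoder; the endpoint, field names, and default port come from this guide, while the `Content-Type: audio/wav` part header and the helper names are illustrative assumptions.

```python
import io
import urllib.request
import uuid


def build_multipart(fields: dict, files: dict) -> tuple[bytes, str]:
    """Encode plain form fields and (filename, bytes) file parts as multipart/form-data."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    for name, (filename, data) in files.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write((
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: audio/wav\r\n\r\n"  # assumed MIME type for WAV references
        ).encode())
        buf.write(data)
        buf.write(b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def clone_voice(text: str, ref_path: str, out_path: str = "output.wav",
                api_key: str = "YOUR_API_KEY",
                url: str = "http://localhost:8000/clone_voice") -> None:
    """POST /clone_voice with a reference WAV and save the synthesized audio."""
    with open(ref_path, "rb") as f:
        ref = f.read()
    body, content_type = build_multipart(
        {"text": text, "stream": "false", "response_format": "wav"},
        {"reference_audio_file": ("ref.wav", ref)},
    )
    req = urllib.request.Request(
        url, data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


# Example (requires a running server):
# clone_voice("Hello, world", "/path/to/ref.wav")
```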

4. API Endpoints and Parameters

4.1 Voice Cloning: POST /clone_voice

  • Content-Type: multipart/form-data
  • Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | string | Yes | Text to synthesize |
| `reference_audio` | string | No | Reference audio (URL or base64 string). Use this or `reference_audio_file` |
| `reference_audio_file` | file | No | Uploaded reference audio file (WAV) |
| `latent_file` | file | No | Uploaded latent file (`.npy`) for MegaTTS3 |
| `reference_text` | string | No | Transcription of the reference audio |
| `pitch` | enum | No | Pitch: `very_low`, `low`, `moderate`, `high`, `very_high` |
| `speed` | enum | No | Speed: `very_low`, `low`, `moderate`, `high`, `very_high` |
| `temperature` | float | No | Controls randomness in generation |
| `top_k` | int | No | Top-k sampling |
| `top_p` | float | No | Nucleus sampling threshold |
| `repetition_penalty` | float | No | Penalty to reduce repetition |
| `max_tokens` | int | No | Max number of tokens to generate |
| `length_threshold` | int | No | Threshold at which long text is split |
| `window_size` | int | No | Window size for chunking |
| `stream` | boolean | No | Return streaming audio (`true`) or the full audio (`false`) |
| `response_format` | enum | No | Output audio format: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` |

4.2 Role-based Synthesis: POST /speak

  • Content-Type: application/json
  • Body Example:
```json
{
  "name": "RoleName",
  "text": "Text to synthesize",
  "pitch": "moderate",
  "speed": "moderate",
  "temperature": 0.9,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.0,
  "max_tokens": 4096,
  "length_threshold": 50,
  "window_size": 50,
  "stream": false,
  "response_format": "mp3"
}
```
  • Note: Same fields as CloneRequest, with an additional name field for the voice role.
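Since /speak takes a plain JSON body, it is easy to call from Python with the standard library. The sketch below builds the body shown above; the helper names are illustrative, and the server address is assumed to be the default `localhost:8000`.

```python
import json
import urllib.request


def build_speak_request(name: str, text: str, **options) -> dict:
    """Assemble the JSON body for POST /speak; extra sampling options pass through."""
    body = {"name": name, "text": text, "stream": False, "response_format": "mp3"}
    body.update(options)
    return body


def speak(body: dict, url: str = "http://localhost:8000/speak") -> bytes:
    """POST the JSON body and return the raw audio bytes."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Example (requires a running server):
# audio = speak(build_speak_request("RoleName", "Text to synthesize", temperature=0.9))
# open("output.mp3", "wb").write(audio)
```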

4.3 Multi-Speaker Dialogue Synthesis: POST /multi_speak

  • Content-Type: application/json
  • Body Example:
```json
{
  "text": "<role:female> Hello! <role:male> I'm good, thank you!",
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.0,
  "max_tokens": 4096,
  "length_threshold": 50,
  "window_size": 50,
  "stream": true,
  "response_format": "wav"
}
```
  • Note: The name field is omitted; speaker is indicated by the prefix <role:role_name> in the text.
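A minimal Python sketch for a streaming /multi_speak call follows. It assembles the `<role:...>` text from (role, line) pairs and reads the response in chunks; the helper names are illustrative assumptions, and the roles used in the commented example must exist on your server.

```python
import json
import urllib.request


def build_dialogue(turns: list[tuple[str, str]]) -> str:
    """Join (role, line) pairs into the <role:...> inline format used by /multi_speak."""
    return " ".join(f"<role:{role}> {line}" for role, line in turns)


def multi_speak_stream(text: str, url: str = "http://localhost:8000/multi_speak",
                       chunk_size: int = 4096):
    """Yield raw audio chunks from a streaming /multi_speak response."""
    body = {"text": text, "stream": True, "response_format": "wav"}
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(chunk_size):
            yield chunk


# Example (requires a running server with "female"/"male" roles):
# text = build_dialogue([("female", "Hello!"), ("male", "I'm good, thank you!")])
# with open("dialogue.wav", "wb") as f:
#     for chunk in multi_speak_stream(text):
#         f.write(chunk)
```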

4.4 OpenAI-Compatible Endpoint (Prefix /v1)

  • Paths and functionality mirror the standard API.
  • Uses OpenAISpeechRequest format:
    • model: Model ID or name
    • input: Text to synthesize
    • voice: The name of a saved voice role, or a URL or base64-encoded string of a reference audio.
    • Other parameters same as Clone/Speak
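As a sketch of the OpenAI-compatible flow, the snippet below posts an OpenAISpeechRequest-style body with the standard library. The exact path `/v1/audio/speech` (mirroring OpenAI's speech endpoint), the model name, and the helper names are assumptions; check `http://localhost:8000/docs` for the paths your server actually exposes.

```python
import json
import urllib.request


def build_openai_speech_request(model: str, input_text: str, voice: str,
                                response_format: str = "wav") -> dict:
    """Assemble an OpenAISpeechRequest-style JSON body."""
    return {"model": model, "input": input_text, "voice": voice,
            "response_format": response_format}


def openai_speech(body: dict, api_key: str = "YOUR_API_KEY",
                  url: str = "http://localhost:8000/v1/audio/speech") -> bytes:
    """POST the body to the assumed OpenAI-compatible path and return audio bytes."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Example (requires a running server; "spark" and "female" are placeholders):
# audio = openai_speech(build_openai_speech_request("spark", "Hello, world", "female"))
```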

4.5 Retrieve Available Roles: GET /audio_roles or GET /v1/audio_roles

  • Response Example:
    {
      "success": true,
      "roles": ["alice", "bob", "tara"]
    }
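Fetching and validating the role list is a one-liner with the standard library; this sketch splits out a pure parsing helper (the function names are illustrative) so the response check is explicit.

```python
import json
import urllib.request


def parse_roles(payload: dict) -> list:
    """Validate an /audio_roles response and return the role list."""
    if not payload.get("success"):
        raise RuntimeError(f"server returned failure: {payload}")
    return payload["roles"]


def list_roles(url: str = "http://localhost:8000/audio_roles") -> list:
    """GET /audio_roles and return the available role names."""
    with urllib.request.urlopen(url) as resp:
        return parse_roles(json.load(resp))


# Example (requires a running server):
# print(list_roles())
```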
    

4.6 Add Role: POST /add_speaker

  • Content-Type: multipart/form-data
  • Parameter Description:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the role to be added |
| `audio` | string | No | URL or base64-encoded string of the reference audio (alternative to `audio_file`) |
| `reference_text` | string | No | Transcription of the reference audio |
| `audio_file` | file | No | Uploaded reference audio file (WAV); alternative to `audio` |
| `latent_file` | file | No | Latent file used by the Mega engine (used in combination with `audio`/`audio_file`) |
  • Response Example:
    {
      "success": true,
      "role": "Role Name"
    }
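When the reference audio is given as a URL or base64 string (no file upload), the form can be sent URL-encoded from Python. This relies on the assumption that the server's form endpoints also accept `application/x-www-form-urlencoded`, as typical FastAPI `Form` handlers do; uploading `audio_file` instead would require a multipart body as in the /clone_voice example. Helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_add_speaker_form(name: str, audio: str, reference_text: str = "") -> bytes:
    """URL-encode the /add_speaker fields (audio given as URL or base64 string)."""
    fields = {"name": name, "audio": audio}
    if reference_text:
        fields["reference_text"] = reference_text
    return urllib.parse.urlencode(fields).encode()


def add_speaker(form: bytes, url: str = "http://localhost:8000/add_speaker") -> dict:
    """POST the encoded form and return the server's JSON response."""
    req = urllib.request.Request(
        url, data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running server; the URL is a placeholder):
# add_speaker(build_add_speaker_form("alice", "https://example.com/alice.wav"))
```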
    

4.7 Delete Role: POST /delete_speaker

  • Content-Type: multipart/form-data
  • Parameter Description:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the role to be deleted |
  • Response Example:
    {
      "success": true,
      "role": "Role Name"
    }
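Deleting a role only needs the single `name` field; the sketch below sends it URL-encoded under the same assumption as the /add_speaker example (FastAPI form handlers accepting `application/x-www-form-urlencoded`). Helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_delete_form(name: str) -> bytes:
    """URL-encode the single `name` field for POST /delete_speaker."""
    return urllib.parse.urlencode({"name": name}).encode()


def delete_speaker(name: str, url: str = "http://localhost:8000/delete_speaker") -> dict:
    """POST /delete_speaker and return the server's JSON response."""
    req = urllib.request.Request(
        url, data=build_delete_form(name),
        headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running server):
# delete_speaker("alice")
```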