API Reference

December 1, 2025 · View on GitHub

The server provides OpenAI-compatible endpoints for text-to-speech generation.

Base URL

http://localhost:8000

Request Model

All TTS endpoints accept the following request parameters:

{
  "model": "voxcpm-0.5b",           // Model identifier (fixed)
  "input": "Text to synthesize",     // Required: Text to generate speech for
  "voice": "voice_name",            // Optional: Use cached voice
  "prompt_wav_path": "/path/to/audio.wav",  // Optional: Path to prompt audio file
  "prompt_text": "Transcription of prompt audio",  // Optional: Text matching prompt audio
  "response_format": "wav",         // Optional: Audio format (wav, mp3, flac, opus, aac, pcm)
  "max_length": 2048,               // Optional: Max generated sequence length (1-2048)
  "cfg_value": 2.0,                 // Optional: Classifier-free guidance (0.0-10.0)
  "inference_timesteps": 10         // Optional: Diffusion steps (1-100)
}

Voice Selection

You have two options for voice control:

Cached Voices: Use pre-computed voice embeddings
- Set voice parameter to a cached voice name
- Available voices can be listed via /voices endpoint
- Ignores prompt_wav_path and prompt_text parameters
Custom Voice Cloning: Provide your own audio prompt
- Set prompt_wav_path to the path of local WAV file
- Set prompt_text to the exact transcription of the audio
- If prompt_wav_path is empty, generates with random voice

Parameters

max_length: Controls maximum generated audio length (each unit ≈ 0.04 seconds)
cfg_value: Classifier-free guidance strength.
inference_timesteps: Number of diffusion steps, defaults to 10.

Endpoints

1. Generate Speech (File)

POST /v1/audio/speech

Generates a complete audio file and returns it for download.

Request:

curl -X POST "http://localhost:8000/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voxcpm-0.5b",
    "input": "Hello, this is a test of the VoxCPM TTS system.",
    "response_format": "wav"
  }'

Response: Binary audio file with appropriate Content-Type header

Supported formats: wav, mp3, flac, opus, aac, pcm

2. Stream Speech (Real-time)

POST /v1/audio/speech/stream

Streams audio chunks in real-time for low-latency playback.

Request:

curl -X POST "http://localhost:8000/v1/audio/speech/stream" \
  -H "Content-Type: application/json" \
  -H "Accept: application/octet-stream" \
  -d '{
    "model": "voxcpm-0.5b",
    "input": "This speech will be streamed in real-time.",
    "response_format": "pcm"
  }'

Response: Streaming binary audio data (16-bit PCM, 16kHz)

Headers:

X-Sample-Rate: Sample rate of the audio (typically 16000)

3. Play on Server

POST /v1/audio/speech/playback

Generates speech and plays it directly on the server with progress indicators.

Request:

{
  "model": "voxcpm-0.5b",
  "input": "This will play on the server"
}

Response:

{
  "status": "success",
  "message": "Audio playback completed on server",
  "duration_seconds": 5.23,
}

4. Cancel Generation

POST /v1/audio/speech/cancel

Cancels the currently running audio generation.

Request:

curl -X POST "http://localhost:8000/v1/audio/speech/cancel"

Response:

{
  "status": "success",
  "message": "Cancellation signal sent to Job 123"
}

5. Create Custom Voice

POST /v1/voices

Creates a new cached voice from a prompt audio file.

Request:

{
  "voice_name": "my_custom_voice",
  "prompt_wav_path": "/path/to/prompt.wav",
  "prompt_text": "Transcription of the prompt audio", // Can be raw text or path to .txt file
  "replace": false
}

Response:

{
  "status": "success",
  "message": "Voice 'my_custom_voice' created successfully."
}

6. List Available Voices

GET /voices

Returns a list of available cached voice names.

Request:

curl -X GET "http://localhost:8000/voices"

Response:

{
  "voices": ["voice1", "voice2", "voice3"],
  "count": 3,
  "cache_directory": "assets/caches"
}

7. Health Check

GET /health

Returns server status and current processing state.

Request:

curl -X GET "http://localhost:8000/health"

Response:

{
  "status": "healthy",
  "is_processing": true,
  "current_job_id": 123,
  "queue_pending": false,
  "model": "voxcpm-0.5b"
}

8. Web Playground

GET /

Interactive web interface for testing the TTS functionality.

Access at: http://localhost:{PORT}

CLI Arguments

The server accepts the following command-line arguments:

--port: Port to run the server on (default: 8000)
--host: Host to bind the server to (default: 0.0.0.0)
--cache-dir: Directory for custom voice caches (default: ~/.cache/ane_tts)