Kokoro Text-to-Speech on Docker

June 25, 2026

Part of the Self-Hosted AI Stack — deploy a complete self-hosted AI stack with a single command.

Docker image to run a Kokoro text-to-speech server. Provides an OpenAI-compatible audio speech API. Based on Debian (python:3.12-slim). Designed to be simple, private, and self-hosted.

Features:

OpenAI-compatible POST /v1/audio/speech endpoint — any app using the OpenAI TTS API switches with a one-line change
54 high-quality voices across 9 languages (English, Japanese, Chinese, Spanish, French, Italian, and more)
Accepts OpenAI voice-name aliases (alloy, nova, echo, ...) that map to local Kokoro voices, plus native Kokoro voice IDs (af_heart, bm_george, ...)
Audio stays on your server — no data sent to third parties
All major output formats supported: mp3, wav, flac, opus, aac, pcm
Streaming support — set stream_format to "audio" or "sse" to receive audio as each sentence is synthesized, reducing time-to-first-audio
NVIDIA GPU (CUDA) acceleration for faster inference (:cuda image tag)
Offline/air-gapped mode (KOKORO_LOCAL_ONLY) runs without internet access using a pre-cached model
Automatically built and published via GitHub Actions
Persistent model cache via a Docker volume
Multi-arch: linux/amd64, linux/arm64

Also available:

AI stack: Self-Hosted AI Stack
Try it online: Open in Colab — no Docker or installation required
Related AI services: Whisper (STT), Embeddings, LiteLLM, Ollama (LLM), Docling, MCP Gateway

Tip: Whisper, Kokoro, Embeddings, LiteLLM, Ollama, Docling, and MCP Gateway can be used together to build a complete, self-hosted AI stack on your own server.

Community

📬 Subscribe for project updates (1–2 emails/month) — get free AI and VPN deployment guides (PDF)
💬 Join the r/selfhostedstack community for discussions and showcases
⭐ Star the repository if you find it useful — it helps others discover it

Other self-hosted projects: Setup IPsec VPN, IPsec VPN on Docker, WireGuard, OpenVPN, Headscale.

Quick start

Use this command to set up a Kokoro TTS server:

docker run \
    --name kokoro \
    --restart=always \
    -v kokoro-data:/var/lib/kokoro \
    -p 8880:8880 \
    -d hwdsl2/kokoro-server

GPU quick start (NVIDIA CUDA)

If you have an NVIDIA GPU, use the :cuda image for hardware-accelerated inference:

docker run \
    --name kokoro \
    --restart=always \
    --gpus=all \
    -v kokoro-data:/var/lib/kokoro \
    -p 8880:8880 \
    -d hwdsl2/kokoro-server:cuda

Requirements: NVIDIA GPU, NVIDIA driver 575.57.08+ (Linux) or 576.57+ (Windows), and the NVIDIA Container Toolkit installed on the host. The :cuda image is linux/amd64 only.
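
The driver requirement above can be checked before pulling the :cuda image. The following is a minimal sketch, assuming a Linux host where nvidia-smi is on the PATH; version_ge and check_nvidia_driver are hypothetical helper names, not part of this image.

```shell
# Succeeds when dotted version $1 is greater than or equal to $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "[.]")
        nb = split(b, y, "[.]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Hypothetical helper: reads the installed driver version from nvidia-smi
# and compares it with the Linux minimum stated above (575.57.08).
check_nvidia_driver() {
    found=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$found" "575.57.08"; then
        echo "Driver $found meets the minimum"
    else
        echo "Driver $found is older than 575.57.08" >&2
        return 1
    fi
}
```

Run check_nvidia_driver on the host, not inside the container. On Windows, compare against 576.57 instead.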

Important: This image requires at least 1.5 GB of available RAM due to the PyTorch runtime and Kokoro model. Systems with 1 GB or less of total RAM are not supported.

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. In that case, also replace -p 8880:8880 with -p 127.0.0.1:8880:8880 in the docker run command above, to prevent direct access to the unencrypted port.

The Kokoro model (~320 MB) is downloaded and cached on first start. Check the logs to confirm the server is ready:

docker logs kokoro

Once you see "Kokoro text-to-speech server is ready", synthesize your first audio file:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"Hello, world!","voice":"af_heart"}' \
    --output speech.mp3
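
For scripted deployments, the manual log check described above can be automated. This is a sketch only: it assumes the container is named kokoro and that the readiness message appears in docker logs exactly as quoted above. The function names and the 5-second poll interval are illustrative choices.

```shell
# Succeeds when the log text on stdin contains the readiness message.
kokoro_log_ready() {
    grep -q "Kokoro text-to-speech server is ready"
}

# Polls `docker logs` until the server reports ready or the timeout expires.
# $1 = container name (default: kokoro), $2 = timeout in seconds (default: 300)
wait_for_kokoro() {
    name=${1:-kokoro}
    remaining=${2:-300}
    while [ "$remaining" -gt 0 ]; do
        if docker logs "$name" 2>&1 | kokoro_log_ready; then
            return 0
        fi
        sleep 5
        remaining=$((remaining - 5))
    done
    echo "Timed out waiting for $name" >&2
    return 1
}
```

A typical use is `wait_for_kokoro kokoro 600 && echo "ready"`, with a longer timeout on slow links because the first start downloads the model.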

Requirements

A Linux server (local or cloud) with Docker installed
Supported architectures: amd64 (x86_64), arm64 (e.g. Raspberry Pi 4/5, AWS Graviton)
Minimum RAM: ~1.5 GB free (model is ~320 MB; PyTorch runtime uses additional memory)
Internet access for the initial model download (the model is cached locally afterwards). Not required if using KOKORO_LOCAL_ONLY=true with a pre-cached model.

For GPU acceleration (:cuda image):

NVIDIA GPU with CUDA support (Compute Capability 6.0+)
NVIDIA driver 575.57.08+ (Linux) or 576.57+ (Windows) installed on the host
NVIDIA Container Toolkit installed
The :cuda image supports linux/amd64 only

For internet-facing deployments, see Using a reverse proxy to add HTTPS.

Download

Get the trusted build from the Docker Hub registry:

docker pull hwdsl2/kokoro-server

For NVIDIA GPU acceleration, pull the :cuda tag instead:

docker pull hwdsl2/kokoro-server:cuda

Alternatively, you may download from Quay.io:

docker pull quay.io/hwdsl2/kokoro-server
docker image tag quay.io/hwdsl2/kokoro-server hwdsl2/kokoro-server

Supported platforms: linux/amd64 and linux/arm64. The :cuda tag supports linux/amd64 only.

Environment variables

All variables are optional. Fresh installs with a mounted /var/lib/kokoro volume auto-generate a Bearer token. Existing installs without a key remain open for backward compatibility.

This Docker image uses the following variables, which can be declared in an env file (see example):

Variable	Description	Default
`KOKORO_VOICE`	Default voice for synthesis. See voices for all options. Accepts Kokoro voice IDs (`af_heart`) or OpenAI aliases (`alloy`, `ballad`, etc.).	`af_heart`
`KOKORO_SPEED`	Default speech speed. Range: `0.25` (slowest) to `4.0` (fastest).	`1.0`
`KOKORO_PORT`	HTTP port for the API (1–65535).	`8880`
`KOKORO_LANG_CODE`	If set, loads only that language pipeline at startup (`a`=American English, `b`=British English, `e`=Spanish, `f`=French, `h`=Hindi, `i`=Italian, `j`=Japanese, `p`=Brazilian Portuguese, `z`=Mandarin Chinese). When unset, the pipeline is auto-selected from the `KOKORO_VOICE` prefix. Additional pipelines are created on demand when a request uses a different language.	(not set)
`KOKORO_API_KEY`	Optional Bearer token. Fresh persistent installs auto-generate one. If set, all API requests must include `Authorization: Bearer <key>`. Set explicitly empty to disable authentication.	Auto-generated for fresh persistent installs
`KOKORO_LOG_LEVEL`	Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.	`INFO`
`KOKORO_LOCAL_ONLY`	When set to any non-empty value (e.g. `true`), disables all HuggingFace model downloads. Intended for offline or air-gapped deployments with a pre-cached model.	(not set)

Note: In your env file, you may enclose values in single quotes, e.g. VAR='value'. Do not add spaces around =. If you change KOKORO_PORT, update the -p flag in the docker run command accordingly.
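
The "no spaces around =" rule is easy to break when editing by hand. The sketch below flags offending lines before you start the container. lint_env_file is a hypothetical helper, and the pattern only covers simple NAME=value lines.

```shell
# Prints any assignment lines that have whitespace next to "=" (with line
# numbers) and fails if at least one is found. Comment lines are ignored
# because they do not start with a variable name.
lint_env_file() {
    if grep -nE '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*([[:space:]]+=|=[[:space:]])' "$1"; then
        return 1
    fi
    return 0
}
```

For example, `lint_env_file kokoro.env` prints nothing and succeeds for a line such as KOKORO_VOICE=bm_george, but reports a line such as KOKORO_PORT = 8880.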

Example using an env file:

cp kokoro.env.example kokoro.env
# Edit kokoro.env with your settings, then:
docker run \
    --name kokoro \
    --restart=always \
    -v kokoro-data:/var/lib/kokoro \
    -v ./kokoro.env:/kokoro.env:ro \
    -p 8880:8880 \
    -d hwdsl2/kokoro-server

The env file is bind-mounted into the container, so changes are picked up on every restart without recreating the container.

Alternatively, pass it with --env-file:

docker run \
    --name kokoro \
    --restart=always \
    -v kokoro-data:/var/lib/kokoro \
    -p 8880:8880 \
    --env-file=kokoro.env \
    -d hwdsl2/kokoro-server

Using docker-compose

cp kokoro.env.example kokoro.env
# Edit kokoro.env as needed, then:
docker compose up -d
docker logs kokoro

Example docker-compose.yml (already included):

services:
  kokoro:
    image: hwdsl2/kokoro-server
    container_name: kokoro
    restart: always
    ports:
      - "8880:8880/tcp"  # For a host-based reverse proxy, change to "127.0.0.1:8880:8880/tcp"
    volumes:
      - kokoro-data:/var/lib/kokoro
      - ./kokoro.env:/kokoro.env:ro

volumes:
  kokoro-data:
    name: kokoro-data

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. In that case, also change "8880:8880/tcp" to "127.0.0.1:8880:8880/tcp" in docker-compose.yml, to prevent direct access to the unencrypted port.

Using docker-compose with GPU (NVIDIA CUDA)

A separate docker-compose.cuda.yml is provided for GPU deployments:

cp kokoro.env.example kokoro.env
# Edit kokoro.env as needed, then:
docker compose -f docker-compose.cuda.yml up -d
docker logs kokoro

Example docker-compose.cuda.yml (already included):

services:
  kokoro:
    image: hwdsl2/kokoro-server:cuda
    container_name: kokoro
    restart: always
    ports:
      - "8880:8880/tcp"  # For a host-based reverse proxy, change to "127.0.0.1:8880:8880/tcp"
    volumes:
      - kokoro-data:/var/lib/kokoro
      - ./kokoro.env:/kokoro.env:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  kokoro-data:
    name: kokoro-data

API reference

The API is compatible with OpenAI's text-to-speech endpoint. Any application already calling https://api.openai.com/v1/audio/speech can switch to the self-hosted server by setting:

OPENAI_BASE_URL=http://your_server_ip:8880

Some clients, including the official OpenAI SDKs, expect the base URL to already include the /v1 path segment. For those, use http://your_server_ip:8880/v1 instead.

OpenAI voice names are accepted as local aliases for client compatibility. These aliases map to Kokoro voices and do not reproduce OpenAI's proprietary voices. The voice field may be a string or an object with an id field; unknown voices return 400.
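
For shell scripts that call the endpoint directly, a small wrapper keeps the base URL and optional API key in one place. This is a sketch under stated assumptions: TTS_BASE_URL and TTS_API_KEY are hypothetical client-side variable names, and the request body is read from a JSON file you provide.

```shell
# Joins a base URL and the speech path, tolerating a trailing slash.
tts_speech_url() {
    printf '%s/v1/audio/speech\n' "${1%/}"
}

# Posts a JSON request body (file $1) and writes the audio to file $2.
# The Authorization header is only sent when TTS_API_KEY is non-empty.
tts_request() {
    url=$(tts_speech_url "${TTS_BASE_URL:-http://127.0.0.1:8880}")
    if [ -n "${TTS_API_KEY:-}" ]; then
        curl -fsS "$url" \
            -H "Authorization: Bearer $TTS_API_KEY" \
            -H "Content-Type: application/json" \
            --data-binary "@$1" --output "$2"
    else
        curl -fsS "$url" \
            -H "Content-Type: application/json" \
            --data-binary "@$1" --output "$2"
    fi
}
```

With a file body.json containing a request such as {"model":"tts-1","input":"Hello","voice":"nova"}, run `TTS_BASE_URL=http://your_server_ip:8880 tts_request body.json speech.mp3`.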

Synthesize speech

POST /v1/audio/speech
Content-Type: application/json

Request body:

Field	Type	Required	Description
`model`	string	✅	Pass `tts-1`, `tts-1-hd`, or `kokoro` (all use Kokoro-82M).
`input`	string	✅	The text to synthesize. Maximum 4096 characters.
`voice`	string or object	✅	Voice to use. See available voices. Accepts Kokoro IDs, OpenAI aliases that map to local Kokoro voices, or an object with an `id` field. Unknown voices return `400`.
`response_format`	string	—	Output format. Default: `mp3`. Options: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`. `pcm` is raw signed 16-bit little-endian audio at 24 kHz mono, with no header.
`speed`	float	—	Speech speed. Default: `1.0`. Range: `0.25`–`4.0`.
`instructions`	string	—	Control the voice with additional instructions. Accepted for API compatibility but not currently supported by the Kokoro engine (ignored).
`stream_format`	string	—	The format to stream the audio in. Options: `audio`, `sse`. When set to `audio`, audio bytes are streamed via chunked transfer encoding. When set to `sse`, the response uses Server-Sent Events with `speech.audio.delta` and `speech.audio.done` events (OpenAI streaming speech protocol). For SSE WAV, the first delta is a streaming WAV header and later deltas are raw PCM_S16LE at 24 kHz mono. SSE PCM deltas are raw PCM_S16LE at 24 kHz mono with no header. If omitted, the full audio is returned as a single response.
`volume_multiplier`	float	—	Output volume multiplier. Default: `1.0`. Range: `0.1`–`2.0`. Values above `1.0` amplify, below `1.0` attenuate. Samples are clipped after scaling to prevent distortion.
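
The limits in the table above can be checked on the client before a request is sent. The sketch below covers only the input length, speed, volume_multiplier, and response_format constraints. The helper names are hypothetical, and the character count depends on the shell's locale.

```shell
# Succeeds when $1 is a plain decimal number within the closed range [$2, $3].
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (v !~ /^[0-9]+([.][0-9]+)?$/) exit 1
        exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0)
    }'
}

# $1 = input text, $2 = speed, $3 = volume_multiplier, $4 = response_format
check_speech_request() {
    chars=$(printf '%s' "$1" | wc -m | tr -d ' ')
    if [ "$chars" -gt 4096 ]; then
        echo "input exceeds 4096 characters" >&2
        return 1
    fi
    if ! in_range "$2" 0.25 4.0; then
        echo "speed must be between 0.25 and 4.0" >&2
        return 1
    fi
    if ! in_range "$3" 0.1 2.0; then
        echo "volume_multiplier must be between 0.1 and 2.0" >&2
        return 1
    fi
    case "$4" in
        mp3|opus|aac|flac|wav|pcm) ;;
        *) echo "unsupported response_format: $4" >&2; return 1 ;;
    esac
}
```

For example, `check_speech_request "Hello from London." 0.9 1.0 wav` succeeds, while a speed of 5 or a format of ogg is rejected with a message on stderr.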

Example:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"The quick brown fox jumps over the lazy dog.","voice":"af_heart"}' \
    --output speech.mp3

With a different voice and format:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"Hello from London.","voice":"bm_george","response_format":"wav","speed":0.9}' \
    --output speech.wav

With API key authentication:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Authorization: Bearer your_api_key" \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"Hello world","voice":"nova"}' \
    --output speech.mp3

Response: Binary audio data with the appropriate Content-Type header.
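
When stream_format is set to audio, the body arrives in chunks as sentences are synthesized. The following sketch shows one way to request that from a script, assuming the af_heart voice and the default mp3 format. json_escape and kokoro_stream_to_file are hypothetical helpers, and the escaping handles only backslashes and double quotes.

```shell
# Escapes backslashes and double quotes so text can sit inside a JSON string.
# Control characters such as newlines are NOT handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Requests chunked audio and writes it to a file as it arrives.
# $1 = base URL, $2 = input text, $3 = output file
kokoro_stream_to_file() {
    text=$(json_escape "$2")
    # -N turns off curl's output buffering so chunks are written promptly.
    curl -fsS -N "$1/v1/audio/speech" \
        -H "Content-Type: application/json" \
        -d "{\"model\":\"tts-1\",\"input\":\"$text\",\"voice\":\"af_heart\",\"stream_format\":\"audio\"}" \
        --output "$3"
}
```

For example: `kokoro_stream_to_file http://your_server_ip:8880 'She said "hello".' speech.mp3`. Add an Authorization header to the curl call if your server requires an API key.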

List voices

GET /v1/voices

Returns all available Kokoro voice IDs and their OpenAI alias mappings.

curl http://your_server_ip:8880/v1/voices

List models

GET /v1/models

Returns the active models in OpenAI-compatible format.

curl http://your_server_ip:8880/v1/models

Interactive API docs

An interactive Swagger UI is available at:

http://your_server_ip:8880/docs

Available voices

Use kokoro_manage --listvoices to see the full list at any time:

docker exec kokoro kokoro_manage --listvoices

American English:

Voice ID	Gender	Style
`af_heart`	Female	Warm, natural — default
`af_aoede`	Female
`af_bella`	Female	Expressive
`af_jessica`	Female	Energetic
`af_kore`	Female
`af_nicole`	Female	Friendly
`af_nova`	Female	Clear
`af_river`	Female	Calm
`af_sarah`	Female	Conversational
`af_sky`	Female	Neutral, versatile
`af_alloy`	Female	Balanced
`am_adam`	Male	Deep
`am_michael`	Male	Clear
`am_echo`	Male	Neutral
`am_eric`	Male	Authoritative
`am_fenrir`	Male	Distinctive
`am_liam`	Male	Conversational
`am_onyx`	Male	Rich
`am_puck`	Male	Expressive
`am_santa`	Male	Warm

British English:

Voice ID	Gender	Style
`bf_emma`	Female	Clear, professional
`bf_isabella`	Female	Warm
`bf_alice`	Female	Crisp
`bf_lily`	Female	Soft
`bm_george`	Male	Authoritative
`bm_lewis`	Male	Smooth
`bm_daniel`	Male	Calm
`bm_fable`	Male	Expressive

Japanese: jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo

Mandarin Chinese: zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxi, zm_yunxia, zm_yunyang

Spanish: ef_dora, em_alex, em_santa

French: ff_siwis

Hindi: hf_alpha, hf_beta, hm_omega, hm_psi

Italian: if_sara, im_nicola

Brazilian Portuguese: pf_dora, pm_alex, pm_santa

OpenAI voice aliases (accepted in the voice field):

OpenAI alias	Maps to
`alloy`	`af_alloy`
`echo`	`am_echo`
`fable`	`bm_fable`
`onyx`	`am_onyx`
`nova`	`af_nova`
`shimmer`	`af_bella`
`ash`	`am_michael`
`coral`	`af_heart`
`sage`	`af_sky`
`verse`	`bm_george`
`ballad`	`bm_lewis`
`marin`	`af_nicole`
`cedar`	`am_adam`
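
If a script needs to know which Kokoro voice an alias resolves to without calling the server, the table above can be mirrored locally. This is a sketch of the documented mapping only; resolve_voice is a hypothetical helper, and GET /v1/voices remains the authoritative source.

```shell
# Prints the Kokoro voice ID for an OpenAI alias.
# Anything that is not a known alias is passed through unchanged.
resolve_voice() {
    case "$1" in
        alloy)   echo af_alloy ;;
        echo)    echo am_echo ;;
        fable)   echo bm_fable ;;
        onyx)    echo am_onyx ;;
        nova)    echo af_nova ;;
        shimmer) echo af_bella ;;
        ash)     echo am_michael ;;
        coral)   echo af_heart ;;
        sage)    echo af_sky ;;
        verse)   echo bm_george ;;
        ballad)  echo bm_lewis ;;
        marin)   echo af_nicole ;;
        cedar)   echo am_adam ;;
        *)       echo "$1" ;;
    esac
}
```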

Tip: The server automatically selects the correct language pipeline from the voice ID prefix — no configuration needed. For example, jf_alpha loads the Japanese pipeline, bf_emma loads British English. Additional language pipelines are created on demand when needed.
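
The prefix rule can be illustrated with a short lookup. The sketch below mirrors the language codes listed for KOKORO_LANG_CODE and assumes the second letter of the prefix is f or m, as in the voice tables above. kokoro_lang_for_voice is a hypothetical helper that expects a native Kokoro voice ID, not an OpenAI alias.

```shell
# Prints the language implied by a Kokoro voice ID prefix.
kokoro_lang_for_voice() {
    case "$1" in
        a[fm]_*) echo "American English" ;;
        b[fm]_*) echo "British English" ;;
        e[fm]_*) echo "Spanish" ;;
        f[fm]_*) echo "French" ;;
        h[fm]_*) echo "Hindi" ;;
        i[fm]_*) echo "Italian" ;;
        j[fm]_*) echo "Japanese" ;;
        p[fm]_*) echo "Brazilian Portuguese" ;;
        z[fm]_*) echo "Mandarin Chinese" ;;
        *) echo "unknown voice prefix: $1" >&2; return 1 ;;
    esac
}
```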

All voices use a single shared model file (~320 MB). No re-download is needed when switching voices.

Persistent data

All server data is stored in the Docker volume (/var/lib/kokoro inside the container):

/var/lib/kokoro/
├── hub/                           # Cached Kokoro model files (downloaded from HuggingFace)
├── .port                          # Active port (used by kokoro_manage)
├── .voice                         # Active default voice (used by kokoro_manage)
└── .server_addr                   # Cached server IP (used by kokoro_manage)

Back up the Docker volume to preserve the downloaded model. The model is ~320 MB and only needs to be downloaded once.
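
One common way to back up a named volume is to archive it from a throwaway container. This is a sketch, not a procedure documented by this project: it assumes the volume is named kokoro-data as in the examples above and uses the alpine image only for its tar command.

```shell
# Archives the kokoro-data volume into kokoro-data.tar.gz in a host directory.
# $1 = absolute path of the destination directory (default: current directory)
kokoro_backup() {
    dest_dir=${1:-$PWD}
    docker run --rm \
        -v kokoro-data:/data:ro \
        -v "$dest_dir":/backup \
        alpine tar czf /backup/kokoro-data.tar.gz -C /data .
}
```

The volume is mounted read-only, so the backup can run while the server is up. Because the volume also holds the hidden state files listed above, treat the archive as potentially sensitive.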

Managing the server

Use kokoro_manage inside the running container to inspect and manage the server.

Show server info:

docker exec kokoro kokoro_manage --showinfo

List available voices:

docker exec kokoro kokoro_manage --listvoices

Changing the voice

To change the default voice, update KOKORO_VOICE in your kokoro.env file and restart the container. No model re-download is required — all voices use the same Kokoro-82M model.

# Edit kokoro.env: set KOKORO_VOICE=bm_george
docker restart kokoro

Note: Individual API requests can always specify a different voice using the voice field, regardless of the container default.

Securing your server

If your Kokoro TTS server is reachable from the public internet — even briefly — apply at minimum these protections. Kokoro is CPU/GPU-intensive, so an unauthenticated endpoint can be abused to burn your compute resources.

1. Use an API key. Fresh installs with a mounted /var/lib/kokoro volume auto-generate an API key. Display it with docker exec kokoro kokoro_manage --showkey, or use docker exec kokoro kokoro_manage --getkey in scripts. Existing installs without a key remain open for backward compatibility; set KOKORO_API_KEY in your env file to enable authentication manually. All authenticated requests must include Authorization: Bearer <key>.

# Generate a 32-byte random key
openssl rand -hex 32

2. Bind to localhost when fronted by a reverse proxy. Replace -p 8880:8880 with -p 127.0.0.1:8880:8880 (or change "8880:8880/tcp" to "127.0.0.1:8880:8880/tcp" in docker-compose.yml) so the unencrypted port is not reachable directly from outside the host.

3. Limit request body size at the proxy. TTS requests carry text input; configure your reverse proxy to reject oversized request bodies (e.g. nginx client_max_body_size 1M;).

4. Mind the log level. KOKORO_LOG_LEVEL=DEBUG may write input text to logs. Keep it at INFO or higher on shared systems.

5. Enable CORS at the proxy if calling from a browser. The server does not set Access-Control-Allow-Origin headers by default; add them at your reverse proxy if you intend to call the API directly from a web page on a different origin.

6. Consider rate limiting. Place a rate limiter (e.g. nginx limit_req_zone, Caddy rate_limit) in front of the server to cap the rate of synthesis requests per client IP.
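
For step 1, an existing install can set its own key by writing KOKORO_API_KEY into the env file. The sketch below replaces any previous assignment instead of appending a duplicate. set_env_value is a hypothetical helper, and it assumes simple NAME=value lines.

```shell
# Sets NAME=VALUE in an env file, replacing any existing assignment of NAME.
# The file is rewritten in place (same inode), so a bind mount of that
# single file keeps pointing at it.
# $1 = env file, $2 = variable name, $3 = value
set_env_value() {
    tmp=$(mktemp) || return 1
    if [ -f "$1" ]; then
        # grep exits non-zero when no lines remain; that is not an error here.
        grep -v "^$2=" "$1" > "$tmp" || true
    fi
    printf '%s=%s\n' "$2" "$3" >> "$tmp"
    cat "$tmp" > "$1"
    rm -f "$tmp"
}
```

For example, `set_env_value kokoro.env KOKORO_API_KEY "$(openssl rand -hex 32)"` followed by `docker restart kokoro`.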

Using a reverse proxy

For internet-facing deployments, place a reverse proxy in front of the TTS server to handle HTTPS termination. The server works without HTTPS on a local or trusted network, but HTTPS is recommended when the API endpoint is exposed to the internet.

Use one of the following addresses to reach the TTS container from your reverse proxy:

kokoro:8880 — if your reverse proxy runs as a container in the same Docker network as the TTS server (e.g. defined in the same docker-compose.yml).
127.0.0.1:8880 — if your reverse proxy runs on the host and port 8880 is published (the default docker-compose.yml publishes it).

Example with Caddy (Docker image), using automatic TLS via Let's Encrypt with the reverse proxy in the same Docker network:

Caddyfile:

kokoro.example.com {
  reverse_proxy kokoro:8880
}

Example with nginx (reverse proxy on the host):

server {
    listen 443 ssl;
    server_name kokoro.example.com;

    ssl_certificate     /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass         http://127.0.0.1:8880;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
        proxy_read_timeout 120s;
    }
}

Update Docker image

To update the Docker image and container, first download the latest version:

docker pull hwdsl2/kokoro-server

If the Docker image is already up to date, you should see:

Status: Image is up to date for hwdsl2/kokoro-server:latest

Otherwise, Docker downloads the latest version. Then remove and re-create the container:

docker rm -f kokoro
# Then re-run the docker run command from Quick start with the same volume and port.
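
The update steps above can be combined into one function that only re-creates the container when the pull actually changed the image. This is a sketch assuming the CPU image and the Quick start options. Adjust the docker run arguments if you use an env file, a different port, or the :cuda tag.

```shell
IMAGE=hwdsl2/kokoro-server

kokoro_update() {
    # Record the local image ID before and after the pull.
    before=$(docker image inspect --format '{{.Id}}' "$IMAGE" 2>/dev/null) || before=""
    docker pull "$IMAGE" || return 1
    after=$(docker image inspect --format '{{.Id}}' "$IMAGE") || return 1
    if [ "$before" = "$after" ]; then
        echo "Image unchanged; keeping the running container"
        return 0
    fi
    # The model stays in the kokoro-data volume, so removal is safe.
    docker rm -f kokoro
    docker run \
        --name kokoro \
        --restart=always \
        -v kokoro-data:/var/lib/kokoro \
        -p 8880:8880 \
        -d "$IMAGE"
}
```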

Your downloaded model is preserved in the kokoro-data volume.

Using with other AI services

The Whisper (STT), Embeddings, LiteLLM, Kokoro (TTS), Ollama (LLM), Docling, and MCP Gateway images can be combined to build a complete, self-hosted AI stack on your own server — from voice I/O to RAG-powered question answering. Whisper, Kokoro, and Embeddings run fully locally. Ollama runs all LLM inference locally, so no data is sent to third parties. When using LiteLLM with external providers (e.g., OpenAI, Anthropic), your data will be sent to those providers.

Service	Role	Default port
Embeddings	Converts text to vectors for semantic search and RAG	`8000`
Whisper (STT)	Transcribes spoken audio to text	`9000`
LiteLLM	AI gateway — routes requests to Ollama, OpenAI, Anthropic, and 100+ providers	`4000`
Kokoro (TTS)	Converts text to natural-sounding speech	`8880`
Ollama (LLM)	Runs local LLM models (llama3, qwen, mistral, etc.)	`11434`
MCP Gateway	Exposes AI services as MCP tools for AI assistants (Claude, Cursor, etc.)	`3000`
Docling	Converts documents (PDF, DOCX, etc.) to structured text/Markdown	`5001`

See also: Self-Hosted AI Stack — deploy the full stack with a single command, with ready-made configurations and pipeline examples.

Technical details

Base image: python:3.12-slim (Debian)
Runtime: Python 3 (virtual environment at /opt/venv)
TTS engine: Kokoro (Kokoro-82M, Apache 2.0) with PyTorch (CPU and CUDA GPU)
API framework: FastAPI + Uvicorn
Audio encoding: soundfile (wav/flac), ffmpeg (mp3/aac/opus)
Data directory: /var/lib/kokoro (Docker volume)
Model storage: HuggingFace Hub format inside the volume — downloaded once, reused on restarts
Sample rate: 24 kHz (native Kokoro output)
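
Because the pcm response format is headerless signed 16-bit mono at this sample rate, playback length follows directly from the byte count: 24000 samples per second times 2 bytes per sample gives 48000 bytes per second. A small worked example, with pcm_duration_ms as a hypothetical helper name:

```shell
# Prints the duration in milliseconds of a raw PCM payload of $1 bytes.
# 24000 samples/s * 2 bytes/sample * 1 channel = 48000 bytes per second.
pcm_duration_ms() {
    echo $(( $1 * 1000 / 48000 ))
}
```

For instance, `pcm_duration_ms "$(wc -c < speech.pcm)"` reports the length of a saved response. To make such a file playable, a tool such as ffmpeg can add a container, for example `ffmpeg -f s16le -ar 24000 -ac 1 -i speech.pcm speech.wav`.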

License

Note: The software components inside the pre-built image (such as Kokoro and its dependencies) are under the licenses chosen by their respective copyright holders. As with any pre-built image, it is the user's responsibility to ensure that use of this image complies with the relevant licenses for all software contained within.

Kokoro TTS is Copyright (C) hexgrad, and is distributed under the Apache License 2.0.

This project is an independent Docker setup for Kokoro and is not affiliated with, endorsed by, or sponsored by hexgrad or OpenAI.