Kokoro Text-to-Speech on Docker

June 25, 2026

Part of the Self-Hosted AI Stack — deploy a complete self-hosted AI stack with a single command.

Docker image to run a Kokoro text-to-speech server. Provides an OpenAI-compatible audio speech API. Based on Debian (python:3.12-slim). Designed to be simple, private, and self-hosted.

Features:

  • OpenAI-compatible POST /v1/audio/speech endpoint — any app using the OpenAI TTS API switches with a one-line change
  • 54 high-quality voices across 9 languages (English, Japanese, Chinese, Spanish, French, Italian, and more)
  • Accepts OpenAI voice-name aliases (alloy, nova, echo, ...) that map to local Kokoro voices, plus native Kokoro voice IDs (af_heart, bm_george, ...)
  • Audio stays on your server — no data sent to third parties
  • All major output formats supported: mp3, wav, flac, opus, aac, pcm
  • Streaming support — set stream_format to "audio" or "sse" to receive audio as each sentence is synthesized, reducing time-to-first-audio
  • NVIDIA GPU (CUDA) acceleration for faster inference (:cuda image tag)
  • Offline/air-gapped mode — run without internet access using pre-cached model (KOKORO_LOCAL_ONLY)
  • Automatically built and published via GitHub Actions
  • Persistent model cache via a Docker volume
  • Multi-arch: linux/amd64, linux/arm64

Tip: Whisper, Kokoro, Embeddings, LiteLLM, Ollama, Docling, and MCP Gateway can be used together to build a complete, self-hosted AI stack on your own server.

Community

  • 📬 Subscribe for project updates (1–2 emails/month) — get free AI and VPN deployment guides (PDF)
  • 💬 Join the r/selfhostedstack community for discussions and showcases
  • ⭐ Star the repository if you find it useful — it helps others discover it

Other self-hosted projects: Setup IPsec VPN, IPsec VPN on Docker, WireGuard, OpenVPN, Headscale.

Quick start

Use this command to set up a Kokoro TTS server:

docker run \
    --name kokoro \
    --restart=always \
    -v kokoro-data:/var/lib/kokoro \
    -p 8880:8880 \
    -d hwdsl2/kokoro-server

GPU quick start (NVIDIA CUDA)

If you have an NVIDIA GPU, use the :cuda image for hardware-accelerated inference:

docker run \
    --name kokoro \
    --restart=always \
    --gpus=all \
    -v kokoro-data:/var/lib/kokoro \
    -p 8880:8880 \
    -d hwdsl2/kokoro-server:cuda

Requirements: NVIDIA GPU, NVIDIA driver 575.57.08+ (Linux) or 576.57+ (Windows), and the NVIDIA Container Toolkit installed on the host. The :cuda image is linux/amd64 only.

Important: This image requires at least 1.5 GB of available RAM due to the PyTorch runtime and Kokoro model. Systems with 1 GB or less of total RAM are not supported.

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. In that case, also replace -p 8880:8880 with -p 127.0.0.1:8880:8880 in the docker run command above, to prevent direct access to the unencrypted port.

The Kokoro model (~320 MB) is downloaded and cached on first start. Check the logs to confirm the server is ready:

docker logs kokoro

Once you see "Kokoro text-to-speech server is ready", synthesize your first audio file:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"Hello, world!","voice":"af_heart"}' \
    --output speech.mp3
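
The log check above can also be scripted. The following is a minimal sketch that polls docker logs for the ready message quoted above before the first request is sent. The function name, the timeout, and the poll interval are assumptions made for illustration and are not part of the image.

```shell
# Poll the container logs until the ready message appears or the timeout expires.
# Arguments (hypothetical defaults): container name, timeout in seconds, poll interval.
wait_for_kokoro() {
    local name="${1:-kokoro}" timeout="${2:-300}" interval="${3:-5}"
    local waited=0
    while :; do
        # Merge stderr into stdout so the message is found on either stream.
        if docker logs "$name" 2>&1 \
            | grep -F "Kokoro text-to-speech server is ready" >/dev/null; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${timeout}s waiting for $name" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

For example, `wait_for_kokoro kokoro 600` returns 0 once the server is ready, so it can be chained with `&&` in front of the curl command above. A generous timeout is sensible on first start because the model download is included in the wait.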

Requirements

  • A Linux server (local or cloud) with Docker installed
  • Supported architectures: amd64 (x86_64), arm64 (e.g. Raspberry Pi 4/5, AWS Graviton)
  • Minimum RAM: ~1.5 GB free (model is ~320 MB; PyTorch runtime uses additional memory)
  • Internet access for the initial model download (the model is cached locally afterwards). Not required if using KOKORO_LOCAL_ONLY=true with a pre-cached model.

For GPU acceleration (:cuda image):

  • NVIDIA GPU with CUDA support (Compute Capability 6.0+)
  • NVIDIA driver 575.57.08+ (Linux) or 576.57+ (Windows) installed on the host
  • NVIDIA Container Toolkit installed
  • The :cuda image supports linux/amd64 only

For internet-facing deployments, see Using a reverse proxy to add HTTPS.

Download

Get the trusted build from the Docker Hub registry:

docker pull hwdsl2/kokoro-server

For NVIDIA GPU acceleration, pull the :cuda tag instead:

docker pull hwdsl2/kokoro-server:cuda

Alternatively, you may download from Quay.io:

docker pull quay.io/hwdsl2/kokoro-server
docker image tag quay.io/hwdsl2/kokoro-server hwdsl2/kokoro-server

Supported platforms: linux/amd64 and linux/arm64. The :cuda tag supports linux/amd64 only.

Environment variables

All variables are optional. Fresh installs with a mounted /var/lib/kokoro volume auto-generate a Bearer token. Existing installs without a key remain open for backward compatibility.

This Docker image uses the following variables, which can be declared in an env file (see example):

| Variable | Description | Default |
|---|---|---|
| KOKORO_VOICE | Default voice for synthesis. See voices for all options. Accepts Kokoro voice IDs (af_heart) or OpenAI aliases (alloy, ballad, etc.). | af_heart |
| KOKORO_SPEED | Default speech speed. Range: 0.25 (slowest) to 4.0 (fastest). | 1.0 |
| KOKORO_PORT | HTTP port for the API (1–65535). | 8880 |
| KOKORO_LANG_CODE | If set, loads only that language pipeline at startup (a=American English, b=British English, e=Spanish, f=French, h=Hindi, i=Italian, j=Japanese, p=Brazilian Portuguese, z=Mandarin Chinese). When unset, the pipeline is auto-selected from the KOKORO_VOICE prefix. Additional pipelines are created on demand when a request uses a different language. | (not set) |
| KOKORO_API_KEY | Optional Bearer token. Fresh persistent installs auto-generate one. If set, all API requests must include Authorization: Bearer <key>. Set explicitly empty to disable authentication. | Auto-generated for fresh persistent installs |
| KOKORO_LOG_LEVEL | Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |
| KOKORO_LOCAL_ONLY | When set to any non-empty value (e.g. true), disables all HuggingFace model downloads. For offline or air-gapped deployments with a pre-cached model. | (not set) |

Note: In your env file, you may enclose values in single quotes, e.g. VAR='value'. Do not add spaces around =. If you change KOKORO_PORT, update the -p flag in the docker run command accordingly.
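
For reference, an env file is a plain list of NAME=value lines. The sketch below writes one with illustrative values taken from the table above. The helper name write_kokoro_env is hypothetical, and the values are examples rather than recommendations; every variable may be omitted.

```shell
# Write a sample env file (hypothetical helper; the values are illustrative only).
# No spaces around "=", as required by the note above.
write_kokoro_env() {
    local path="${1:-kokoro.env}"
    cat > "$path" <<'EOF'
KOKORO_VOICE=bm_george
KOKORO_SPEED=1.0
KOKORO_PORT=8880
KOKORO_LOG_LEVEL=INFO
EOF
}
```

Calling `write_kokoro_env` with no argument creates kokoro.env in the current directory, overwriting any existing file of that name.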

Example using an env file:

cp kokoro.env.example kokoro.env
# Edit kokoro.env with your settings, then:
docker run \
    --name kokoro \
    --restart=always \
    -v kokoro-data:/var/lib/kokoro \
    -v ./kokoro.env:/kokoro.env:ro \
    -p 8880:8880 \
    -d hwdsl2/kokoro-server

The env file is bind-mounted into the container, so changes are picked up on every restart without recreating the container.

Alternatively, pass it with --env-file:

docker run \
    --name kokoro \
    --restart=always \
    -v kokoro-data:/var/lib/kokoro \
    -p 8880:8880 \
    --env-file=kokoro.env \
    -d hwdsl2/kokoro-server

Using docker-compose

cp kokoro.env.example kokoro.env
# Edit kokoro.env as needed, then:
docker compose up -d
docker logs kokoro

Example docker-compose.yml (already included):

services:
  kokoro:
    image: hwdsl2/kokoro-server
    container_name: kokoro
    restart: always
    ports:
      - "8880:8880/tcp"  # For a host-based reverse proxy, change to "127.0.0.1:8880:8880/tcp"
    volumes:
      - kokoro-data:/var/lib/kokoro
      - ./kokoro.env:/kokoro.env:ro

volumes:
  kokoro-data:
    name: kokoro-data

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. In that case, also change "8880:8880/tcp" to "127.0.0.1:8880:8880/tcp" in docker-compose.yml, to prevent direct access to the unencrypted port.

Using docker-compose with GPU (NVIDIA CUDA)

A separate docker-compose.cuda.yml is provided for GPU deployments:

cp kokoro.env.example kokoro.env
# Edit kokoro.env as needed, then:
docker compose -f docker-compose.cuda.yml up -d
docker logs kokoro

Example docker-compose.cuda.yml (already included):

services:
  kokoro:
    image: hwdsl2/kokoro-server:cuda
    container_name: kokoro
    restart: always
    ports:
      - "8880:8880/tcp"  # For a host-based reverse proxy, change to "127.0.0.1:8880:8880/tcp"
    volumes:
      - kokoro-data:/var/lib/kokoro
      - ./kokoro.env:/kokoro.env:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  kokoro-data:
    name: kokoro-data

API reference

The API is compatible with OpenAI's text-to-speech endpoint. Any application already calling https://api.openai.com/v1/audio/speech can switch to the self-hosted server by setting:

OPENAI_BASE_URL=http://your_server_ip:8880

OpenAI voice names are accepted as local aliases for client compatibility. These aliases map to Kokoro voices and do not reproduce OpenAI's proprietary voices. The voice field may be a string or an object with an id field; unknown voices return 400.

Synthesize speech

POST /v1/audio/speech
Content-Type: application/json

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | | Pass tts-1, tts-1-hd, or kokoro (all use Kokoro-82M). |
| input | string | | The text to synthesize. Maximum 4096 characters. |
| voice | string or object | | Voice to use. See available voices. Accepts Kokoro IDs, OpenAI aliases that map to local Kokoro voices, or an object with an id field. Unknown voices return 400. |
| response_format | string | | Output format. Default: mp3. Options: mp3, opus, aac, flac, wav, pcm. pcm is raw signed 16-bit little-endian audio at 24 kHz mono, with no header. |
| speed | float | | Speech speed. Default: 1.0. Range: 0.25–4.0. |
| instructions | string | | Control the voice with additional instructions. Accepted for API compatibility but not currently supported by the Kokoro engine (ignored). |
| stream_format | string | | The format to stream the audio in. Options: audio, sse. When set to audio, audio bytes are streamed via chunked transfer encoding. When set to sse, the response uses Server-Sent Events with speech.audio.delta and speech.audio.done events (OpenAI streaming speech protocol). For SSE WAV, the first delta is a streaming WAV header and later deltas are raw PCM_S16LE at 24 kHz mono. SSE PCM deltas are raw PCM_S16LE at 24 kHz mono with no header. If omitted, the full audio is returned as a single response. |
| volume_multiplier | float | | Output volume multiplier. Default: 1.0. Range: 0.1–2.0. Values above 1.0 amplify, below 1.0 attenuate. Samples are clipped after scaling to prevent distortion. |
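
Two of the fields above benefit from a short illustration: stream_format set to audio, and the headerless pcm output. The helpers below are a sketch under stated assumptions: kokoro_stream and pcm_to_wav are hypothetical names, KOKORO_URL is an invented client-side variable, and KOKORO_API_KEY is reused here on the client purely for convenience. The input text is pasted into the JSON body unescaped, so it must not contain quotes or backslashes.

```shell
# Stream synthesized audio to stdout as it is generated (stream_format "audio").
# Arguments: text, voice (default af_heart), response_format (default mp3).
kokoro_stream() {
    local text="$1" voice="${2:-af_heart}" format="${3:-mp3}"
    local url="${KOKORO_URL:-http://your_server_ip:8880}"
    local body
    body="{\"model\":\"tts-1\",\"input\":\"$text\",\"voice\":\"$voice\",\"response_format\":\"$format\",\"stream_format\":\"audio\"}"
    # -N turns off curl's output buffering so chunks are passed on as they arrive.
    set -- -sS -N "$url/v1/audio/speech" -H "Content-Type: application/json"
    if [ -n "${KOKORO_API_KEY:-}" ]; then
        set -- "$@" -H "Authorization: Bearer $KOKORO_API_KEY"
    fi
    curl "$@" -d "$body"
}

# Wrap headerless pcm output in a WAV container using ffmpeg.
# The options before -i describe the raw input: signed 16-bit LE, 24 kHz, mono.
pcm_to_wav() {
    ffmpeg -f s16le -ar 24000 -ac 1 -i "$1" "$2"
}
```

With ffplay installed, `kokoro_stream "Hello from Kokoro." | ffplay -nodisp -autoexit -` should start playback before synthesis of the full text has finished. A file saved with response_format set to pcm can be made playable with `pcm_to_wav speech.pcm speech.wav`.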

Example:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"The quick brown fox jumps over the lazy dog.","voice":"af_heart"}' \
    --output speech.mp3

With a different voice and format:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"Hello from London.","voice":"bm_george","response_format":"wav","speed":0.9}' \
    --output speech.wav

With API key authentication:

curl http://your_server_ip:8880/v1/audio/speech \
    -H "Authorization: Bearer your_api_key" \
    -H "Content-Type: application/json" \
    -d '{"model":"tts-1","input":"Hello world","voice":"nova"}' \
    --output speech.mp3

Response: Binary audio data with the appropriate Content-Type header.

List voices

GET /v1/voices

Returns all available Kokoro voice IDs and their OpenAI alias mappings.

curl http://your_server_ip:8880/v1/voices

List models

GET /v1/models

Returns the active models in OpenAI-compatible format.

curl http://your_server_ip:8880/v1/models

Interactive API docs

An interactive Swagger UI is available at:

http://your_server_ip:8880/docs

Available voices

Use kokoro_manage --listvoices to see the full list at any time:

docker exec kokoro kokoro_manage --listvoices

American English:

| Voice ID | Gender | Style |
|---|---|---|
| af_heart | Female | Warm, natural (default) |
| af_aoede | Female | |
| af_bella | Female | Expressive |
| af_jessica | Female | Energetic |
| af_kore | Female | |
| af_nicole | Female | Friendly |
| af_nova | Female | Clear |
| af_river | Female | Calm |
| af_sarah | Female | Conversational |
| af_sky | Female | Neutral, versatile |
| af_alloy | Female | Balanced |
| am_adam | Male | Deep |
| am_michael | Male | Clear |
| am_echo | Male | Neutral |
| am_eric | Male | Authoritative |
| am_fenrir | Male | Distinctive |
| am_liam | Male | Conversational |
| am_onyx | Male | Rich |
| am_puck | Male | Expressive |
| am_santa | Male | Warm |
British English:

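| Voice ID | Gender | Style |
|---|---|---|
| bf_emma | Female | Clear, professional |
| bf_isabella | Female | Warm |
| bf_alice | Female | Crisp |
| bf_lily | Female | Soft |
| bm_george | Male | Authoritative |
| bm_lewis | Male | Smooth |
| bm_daniel | Male | Calm |
| bm_fable | Male | Expressive |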

Japanese: jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo

Mandarin Chinese: zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxi, zm_yunxia, zm_yunyang

Spanish: ef_dora, em_alex, em_santa

French: ff_siwis

Hindi: hf_alpha, hf_beta, hm_omega, hm_psi

Italian: if_sara, im_nicola

Brazilian Portuguese: pf_dora, pm_alex, pm_santa

OpenAI voice aliases (accepted in the voice field):

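| OpenAI alias | Maps to |
|---|---|
| alloy | af_alloy |
| echo | am_echo |
| fable | bm_fable |
| onyx | am_onyx |
| nova | af_nova |
| shimmer | af_bella |
| ash | am_michael |
| coral | af_heart |
| sage | af_sky |
| verse | bm_george |
| ballad | bm_lewis |
| marin | af_nicole |
| cedar | am_adam |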

Tip: The server automatically selects the correct language pipeline from the voice ID prefix — no configuration needed. For example, jf_alpha loads the Japanese pipeline, bf_emma loads British English. Additional language pipelines are created on demand when needed.

All voices use a single shared model file (~320 MB). No re-download is needed when switching voices.
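
The alias table and the prefix rule described above can be mirrored on the client side, for example to check a voice name before sending a request. This is an illustrative sketch only, not the server's implementation: the function names are hypothetical, and the pattern check accepts any ID of the right shape rather than consulting the actual voice list.

```shell
# Map an OpenAI alias to its Kokoro voice ID; other names pass through unchanged.
kokoro_resolve_voice() {
    case "$1" in
        alloy)   echo af_alloy ;;
        echo)    echo am_echo ;;
        fable)   echo bm_fable ;;
        onyx)    echo am_onyx ;;
        nova)    echo af_nova ;;
        shimmer) echo af_bella ;;
        ash)     echo am_michael ;;
        coral)   echo af_heart ;;
        sage)    echo af_sky ;;
        verse)   echo bm_george ;;
        ballad)  echo bm_lewis ;;
        marin)   echo af_nicole ;;
        cedar)   echo am_adam ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# Print the language pipeline code, which is the first letter of the voice ID.
kokoro_lang_code() {
    local voice
    voice=$(kokoro_resolve_voice "$1")
    case "$voice" in
        [abefhijpz][fm]_?*) printf '%.1s\n' "$voice" ;;
        *) echo "unknown voice prefix: $voice" >&2; return 1 ;;
    esac
}
```

For example, `kokoro_lang_code jf_alpha` prints j (Japanese) and `kokoro_lang_code verse` prints b, because verse maps to bm_george (British English).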

Persistent data

All server data is stored in the Docker volume (/var/lib/kokoro inside the container):

/var/lib/kokoro/
├── hub/                           # Cached Kokoro model files (downloaded from HuggingFace)
├── .port                          # Active port (used by kokoro_manage)
├── .voice                         # Active default voice (used by kokoro_manage)
└── .server_addr                   # Cached server IP (used by kokoro_manage)

Back up the Docker volume to preserve the downloaded model. The model is ~320 MB and only needs to be downloaded once.
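
One common way to back up a named volume is to archive it from a throwaway container. The sketch below assumes the alpine image is available and uses a hypothetical archive name, kokoro-data.tar.gz; neither is prescribed by this project. The host directory must be given as an absolute path.

```shell
# Archive the volume into <dest>/kokoro-data.tar.gz (volume mounted read-only).
backup_kokoro_volume() {
    local dest="${1:-$PWD}" volume="${2:-kokoro-data}"
    docker run --rm \
        -v "$volume":/data:ro \
        -v "$dest":/backup \
        alpine tar czf /backup/kokoro-data.tar.gz -C /data .
}

# Unpack <src>/kokoro-data.tar.gz into the volume (created if it does not exist).
restore_kokoro_volume() {
    local src="${1:-$PWD}" volume="${2:-kokoro-data}"
    docker run --rm \
        -v "$volume":/data \
        -v "$src":/backup:ro \
        alpine tar xzf /backup/kokoro-data.tar.gz -C /data
}
```

For example, `backup_kokoro_volume /srv/backups` writes /srv/backups/kokoro-data.tar.gz. Restoring into a fresh volume before the first start should avoid repeating the model download.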

Managing the server

Use kokoro_manage inside the running container to inspect and manage the server.

Show server info:

docker exec kokoro kokoro_manage --showinfo

List available voices:

docker exec kokoro kokoro_manage --listvoices

Changing the voice

To change the default voice, update KOKORO_VOICE in your kokoro.env file and restart the container. No model re-download is required — all voices use the same Kokoro-82M model.

# Edit kokoro.env: set KOKORO_VOICE=bm_george
docker restart kokoro

Note: Individual API requests can always specify a different voice using the voice field, regardless of the container default.

Securing your server

If your Kokoro TTS server is reachable from the public internet, even briefly, apply at least the following protections. Kokoro is CPU/GPU-intensive, so an unauthenticated endpoint can be abused to burn your compute resources.

1. Use an API key. Fresh installs with a mounted /var/lib/kokoro volume auto-generate an API key. Display it with docker exec kokoro kokoro_manage --showkey, or use docker exec kokoro kokoro_manage --getkey in scripts. Existing installs without a key remain open for backward compatibility; set KOKORO_API_KEY in your env file to enable authentication manually. All authenticated requests must include Authorization: Bearer <key>.

# Generate a 32-byte random key
openssl rand -hex 32

2. Bind to localhost when fronted by a reverse proxy. Replace -p 8880:8880 with -p 127.0.0.1:8880:8880 (or change "8880:8880/tcp" to "127.0.0.1:8880:8880/tcp" in docker-compose.yml) so the unencrypted port is not reachable directly from outside the host.

3. Limit request body size at the proxy. TTS requests carry text input; configure your reverse proxy to reject oversized request bodies (e.g. nginx client_max_body_size 1M;).

4. Mind the log level. KOKORO_LOG_LEVEL=DEBUG may write input text to logs. Keep it at INFO or higher on shared systems.

5. Enable CORS at the proxy if calling from a browser. The server does not set Access-Control-Allow-Origin headers by default; add them at your reverse proxy if you intend to call the API directly from a web page on a different origin.

6. Consider rate limiting. Place a rate limit (e.g. nginx limit_req_zone, Caddy rate_limit) in front of the server to cap the rate of synthesis requests per client IP.
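
For item 1, setting a key manually can be scripted. The following is a minimal sketch with hypothetical helper names; it assumes the env file layout described under Environment variables and rewrites the file in place so that a bind-mounted kokoro.env keeps pointing at the same file.

```shell
# Print a random 64-character hex key; falls back to /dev/urandom without openssl.
generate_api_key() {
    if command -v openssl >/dev/null 2>&1; then
        openssl rand -hex 32
    else
        head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
        echo
    fi
}

# Set KOKORO_API_KEY in the env file, replacing any existing entry.
set_api_key() {
    local env_file="${1:-kokoro.env}" key
    key=$(generate_api_key)
    touch "$env_file"
    # Keep every other line, then append the new key.
    grep -v '^KOKORO_API_KEY=' "$env_file" > "$env_file.tmp" || true
    printf 'KOKORO_API_KEY=%s\n' "$key" >> "$env_file.tmp"
    # Overwrite the contents rather than replacing the file itself.
    cat "$env_file.tmp" > "$env_file"
    rm -f "$env_file.tmp"
}
```

After running `set_api_key kokoro.env`, restart the container with `docker restart kokoro` so the new key takes effect, and send it as Authorization: Bearer <key> with each request.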

Using a reverse proxy

For internet-facing deployments, place a reverse proxy in front of the TTS server to handle HTTPS termination. The server works without HTTPS on a local or trusted network, but HTTPS is recommended when the API endpoint is exposed to the internet.

Use one of the following addresses to reach the TTS container from your reverse proxy:

  • kokoro:8880 — if your reverse proxy runs as a container in the same Docker network as the TTS server (e.g. defined in the same docker-compose.yml).
  • 127.0.0.1:8880 — if your reverse proxy runs on the host and port 8880 is published (the default docker-compose.yml publishes it).

Example with Caddy (Docker image), which provides automatic TLS via Let's Encrypt, running in the same Docker network as the TTS server:

Caddyfile:

kokoro.example.com {
  reverse_proxy kokoro:8880
}

Example with nginx (reverse proxy on the host):

server {
    listen 443 ssl;
    server_name kokoro.example.com;

    ssl_certificate     /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass         http://127.0.0.1:8880;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
        proxy_read_timeout 120s;
    }
}

Update Docker image

To update the Docker image and container, first download the latest version:

docker pull hwdsl2/kokoro-server

If the Docker image is already up to date, you should see:

Status: Image is up to date for hwdsl2/kokoro-server:latest

Otherwise, Docker downloads the latest version. Then remove and re-create the container:

docker rm -f kokoro
# Then re-run the docker run command from Quick start with the same volume and port.

Your downloaded model is preserved in the kokoro-data volume.
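
The update steps above can be combined so that the container is only re-created when a newer image was actually pulled. This is a sketch, not an official update script: update_kokoro is a hypothetical name, and the docker run options are copied from the CPU quick start, so adjust them if your container uses an env file, a different port, or the :cuda tag.

```shell
# Pull the image and re-create the container only if the image ID changed.
update_kokoro() {
    local image="${1:-hwdsl2/kokoro-server}"
    local before after
    # The image may not exist locally yet, so tolerate a failed inspect.
    before=$(docker image inspect --format '{{.Id}}' "$image" 2>/dev/null || true)
    docker pull "$image" || return 1
    after=$(docker image inspect --format '{{.Id}}' "$image")
    if [ "$before" = "$after" ]; then
        echo "Image is up to date; container left untouched."
        return 0
    fi
    docker rm -f kokoro
    docker run \
        --name kokoro \
        --restart=always \
        -v kokoro-data:/var/lib/kokoro \
        -p 8880:8880 \
        -d "$image"
}
```

Because the same kokoro-data volume is mounted again, the cached model is reused by the new container.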

Using with other AI services

The Whisper (STT), Embeddings, LiteLLM, Kokoro (TTS), Ollama (LLM), Docling, and MCP Gateway images can be combined to build a complete, self-hosted AI stack on your own server — from voice I/O to RAG-powered question answering. Whisper, Kokoro, and Embeddings run fully locally. Ollama runs all LLM inference locally, so no data is sent to third parties. When using LiteLLM with external providers (e.g., OpenAI, Anthropic), your data will be sent to those providers.

| Service | Role | Default port |
|---|---|---|
| Embeddings | Converts text to vectors for semantic search and RAG | 8000 |
| Whisper (STT) | Transcribes spoken audio to text | 9000 |
| LiteLLM | AI gateway that routes requests to Ollama, OpenAI, Anthropic, and 100+ providers | 4000 |
| Kokoro (TTS) | Converts text to natural-sounding speech | 8880 |
| Ollama (LLM) | Runs local LLM models (llama3, qwen, mistral, etc.) | 11434 |
| MCP Gateway | Exposes AI services as MCP tools for AI assistants (Claude, Cursor, etc.) | 3000 |
| Docling | Converts documents (PDF, DOCX, etc.) to structured text/Markdown | 5001 |

See also: Self-Hosted AI Stack — deploy the full stack with a single command, with ready-made configurations and pipeline examples.

Technical details

  • Base image: python:3.12-slim (Debian)
  • Runtime: Python 3 (virtual environment at /opt/venv)
  • TTS engine: Kokoro (Kokoro-82M, Apache 2.0) with PyTorch (CPU and CUDA GPU)
  • API framework: FastAPI + Uvicorn
  • Audio encoding: soundfile (wav/flac), ffmpeg (mp3/aac/opus)
  • Data directory: /var/lib/kokoro (Docker volume)
  • Model storage: HuggingFace Hub format inside the volume — downloaded once, reused on restarts
  • Sample rate: 24 kHz (native Kokoro output)

License

Note: The software components inside the pre-built image (such as Kokoro and its dependencies) are under the respective licenses chosen by their respective copyright holders. As for any pre-built image usage, it is the image user's responsibility to ensure that any use of this image complies with any relevant licenses for all software contained within.

Copyright (C) 2026 Lin Song
This work is licensed under the MIT License.

Kokoro TTS is Copyright (C) hexgrad, and is distributed under the Apache License 2.0.

This project is an independent Docker setup for Kokoro and is not affiliated with, endorsed by, or sponsored by hexgrad or OpenAI.