RealtimeTTS

May 10, 2026 · View on GitHub


RealtimeTTS is a Python text-to-speech library for applications that need to turn strings, generators, and LLM token streams into audio with low latency. It can play speech locally, stream chunks to another process, write WAV files, and fall back across multiple engines.

The project supports a broad engine matrix: local system voices, cloud APIs, free service wrappers, local neural models, and voice-cloning stacks.

Install

For the fastest local smoke test, install the system engine:

pip install "realtimetts[system]"

On Linux, install PortAudio headers before installing PyAudio:

sudo apt-get update
sudo apt-get install python3-dev portaudio19-dev

On macOS:

brew install portaudio

For cloud engines, local neural engines, CUDA, mpv, and current packaging caveats, see docs/installation.md.

First Audio

from RealtimeTTS import TextToAudioStream, SystemEngine


if __name__ == "__main__":
    stream = TextToAudioStream(SystemEngine())
    stream.feed("Hello from RealtimeTTS.")
    stream.play()

Use the if __name__ == "__main__": guard in scripts, especially on Windows and when using engines that start worker processes.

Streaming Text

feed() accepts an iterator, so text can arrive while audio is already playing:

from RealtimeTTS import TextToAudioStream, SystemEngine


def text_chunks():
    yield "This starts speaking quickly. "
    yield "More text can arrive while audio is already playing."


if __name__ == "__main__":
    stream = TextToAudioStream(SystemEngine())
    stream.feed(text_chunks())
    stream.play()

Use the same pattern with an LLM client by yielding only non-empty text chunks. See docs/llm-streaming.md.
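
A minimal, provider-neutral sketch of that filter. The dict-shaped chunks below are stand-ins for whatever objects your LLM client actually yields; adapt the extraction line to your provider's response format:

```python
def text_from_stream(chunks):
    """Yield only non-empty text pieces from an LLM-style stream of chunks."""
    for chunk in chunks:
        text = chunk.get("delta") or ""  # real clients expose richer objects
        if text:
            yield text


# Stand-in for a streamed LLM response; note the empty and None deltas.
fake_stream = [{"delta": "Hello "}, {"delta": None}, {"delta": "world."}, {}]
pieces = list(text_from_stream(fake_stream))  # ['Hello ', 'world.']
```

Pass the generator itself to the stream, e.g. `stream.feed(text_from_stream(response))`, so synthesis starts as soon as the first non-empty chunk arrives.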

Output

Write audio to a WAV file without local speaker playback:

from RealtimeTTS import TextToAudioStream, SystemEngine


if __name__ == "__main__":
    stream = TextToAudioStream(SystemEngine())
    stream.feed("Save this speech to a file.")
    stream.play(output_wavfile="speech.wav", muted=True)

For output devices, mpv playback, muted mode, callbacks, and chunk formats, see docs/output-and-files.md.

Features

  • Low-latency playback from strings, generators, and streamed model output.
  • Multiple engines with local, cloud, free-service, and neural model options.
  • Fallback engines for more resilient synthesis.
  • Sync and async playback with pause, resume, stop, and state inspection.
  • Text, audio, sentence, character, word-timing, and audio-chunk callbacks.
  • WAV output, muted synthesis, selected output devices, and volume control.
  • Voice switching and voice-cloning workflows where supported by the engine.
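
The fallback idea can be sketched generically. The classes below are hypothetical stand-ins, not the RealtimeTTS API (see docs/engine-selection.md for how the library itself handles fallback); the point is the try-in-order pattern:

```python
class FlakyEngine:
    """Hypothetical engine standing in for an unavailable backend."""

    def synthesize(self, text):
        raise RuntimeError("engine unavailable")


class WorkingEngine:
    """Hypothetical engine that returns stand-in PCM bytes."""

    def synthesize(self, text):
        return b"\x00" * len(text)


def synthesize_with_fallback(engines, text):
    """Try each engine in order and return the first successful result."""
    last_error = None
    for engine in engines:
        try:
            return engine.synthesize(text)
        except Exception as err:
            last_error = err
    raise RuntimeError("all engines failed") from last_error


audio = synthesize_with_fallback([FlakyEngine(), WorkingEngine()], "hello")
```

Ordering matters: put the preferred (e.g. cloud) engine first and a cheap local engine last, so a network outage degrades quality instead of silencing the application.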

Engine Overview

| Engine | Type | Install/status note | Best first use |
|---|---|---|---|
| SystemEngine | Local | realtimetts[system] | First local audio smoke test. |
| GTTSEngine | Free service | realtimetts[gtts] | Simple network-backed speech. |
| EdgeEngine | Free service | realtimetts[edge], needs mpv | Free streamed voices. |
| OpenAIEngine | Cloud API | realtimetts[openai] | OpenAI TTS voices. |
| AzureEngine | Cloud API | realtimetts[azure] | Azure voices and word timings. |
| ElevenlabsEngine | Cloud API | realtimetts[elevenlabs], needs mpv | High-quality API voices. |
| CambEngine | Cloud API | realtimetts[camb] | CAMB MARS API voices. |
| MiniMaxEngine | Cloud API | realtimetts[minimax] | MiniMax cloud voices. |
| CartesiaEngine | Cloud API | realtimetts[cartesia] | Cartesia API voices. |
| TypecastEngine | Cloud API | realtimetts[typecast] | Typecast API voices. |
| ModelsLabEngine | Cloud API | realtimetts[modelslab], root export pending | ModelsLab API voices. |
| CoquiEngine | Local neural | realtimetts[coqui] | Local XTTS voice cloning. |
| PiperEngine | Local executable | realtimetts[piper], external Piper setup | Fast local executable TTS. |
| StyleTTSEngine | Local neural | realtimetts[styletts], local checkout/assets | StyleTTS experiments. |
| ParlerEngine | Local neural | realtimetts[parler] | GPU local model experiments. |
| KokoroEngine | Local neural | realtimetts[kokoro] | Local voices and timing support. |
| OrpheusEngine | Local/API-style | realtimetts[orpheus] | Orpheus model workflows. |
| FasterQwenEngine | Local neural | realtimetts[qwen] | Qwen voice cloning. |
| OmniVoiceEngine | Local neural | realtimetts[omnivoice] | Multilingual voice cloning. |
| PocketTTSEngine | Local lightweight | realtimetts[pockettts] | CPU-oriented voice cloning. |
| NeuTTSEngine | Local neural | realtimetts[neutts], optional neutts-gguf | Reference-audio voice cloning. |
| ZipVoiceEngine | Local neural | realtimetts[zipvoice], external checkout | ZipVoice cloning/server demos. |
| LuxTTSEngine | Local neural | realtimetts[luxtts] | LuxTTS voice cloning. |
| ChatterboxEngine | Local neural | realtimetts[chatterbox] | Chatterbox prompt-audio voices. |
| SoproTTSEngine | Local neural | realtimetts[sopro] | Sopro reference-audio voices. |
| SopranoEngine | Local neural | realtimetts[soprano] | Soprano local synthesis. |
| MossTTSEngine | Local neural | realtimetts[moss], runtime assets | MOSS-TTS experiments. |

See docs/engine-selection.md before choosing an engine for an application. The engine-specific docs are being split out from the old README and source audit.

Documentation

  • Quick start: shortest working examples.
  • Installation: extras, platform setup, external tools, API keys, and known packaging mismatches.
  • Engine selection: engine matrix and selection guidance.
  • Feed and playback: feed(), play(), play_async(), pause, resume, stop, text state, and inline tags.
  • LLM streaming: provider-neutral streamed text patterns and latency tuning.
  • Output and files: WAV files, audio chunks, muted mode, output devices, mpv, buffering, and volume.
  • Engine setup: one focused page per concrete engine source.
  • FAQ: legacy troubleshooting page while topic docs are being split out.

Legacy translated docs remain under docs/<locale>/ while English is refactored as the canonical source.

Server Example

The browser and WebSocket server example lives in example_fast_api/:

python -m pip install fastapi uvicorn websockets pyaudio
python example_fast_api/async_server.py

Open http://localhost:8000 or connect to ws://localhost:8000/ws.

RealtimeSTT is the speech-to-text counterpart for realtime voice input.

Contributing

Focused docs, tests, and engine fixes are easiest to review. During the docs refactor, keep English docs canonical and note mismatches between source, packaging, examples, and tests rather than hiding them.

License

RealtimeTTS source code is MIT licensed. Engine providers, model weights, voice data, datasets, generated audio, and third-party services can have separate terms. Read LICENSING_ADDENDUM.md and the relevant provider or model licenses before commercial use.

Audio samples derived from the EARS dataset by Meta are licensed under CC BY-NC 4.0. See the original dataset terms for details.

Author

Kolja Beigel