Changelog

June 4, 2026

Notable changes to this project will be documented in this file.

Per-PR attribution and contributor credits are published automatically on the corresponding GitHub release page; this file is the curated, human-readable summary.

[Unreleased]

Added

POST /dev/unload releases the model from VRAM without stopping the container; the model reloads lazily on the next request. Useful for freeing a shared GPU while idle. Reclaimed memory scales with load (~0.7 GB; ~1.6 GB in a long-form test on a 4060 Ti). (#474)
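A minimal client-side sketch of calling the unload endpoint, assuming the server listens on http://localhost:8880 (the default port noted under v0.2.4) and that the endpoint needs no request body. The helper names are hypothetical, and because the response format is not described here, only the HTTP status is returned.

```python
import urllib.request


def build_unload_request(base_url: str = "http://localhost:8880") -> urllib.request.Request:
    """Build a POST request for /dev/unload (base_url is an assumption)."""
    url = f"{base_url.rstrip('/')}/dev/unload"
    # An empty body is assumed; the changelog does not describe any payload.
    return urllib.request.Request(url, data=b"", method="POST")


def unload_model(base_url: str = "http://localhost:8880", timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_unload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Usage, against a running server:
#   status = unload_model()
```

An idle-timer or scheduler on a shared GPU host could call `unload_model()` when no requests are expected; per the entry above, the next synthesis request triggers the reload.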

Fixed

Web UI long-playback failure around the 10-minute mark: the in-browser audio buffer is now bounded ahead of currentTime with trailing eviction behind it, so long generations no longer overflow the SourceBuffer.
Web UI stays responsive on extended sessions; waveform animation is transition-gated and PlayerState short-circuits no-op updates, so controls don't drift into lag after 10+ minutes of playback.
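The buffering policy described above can be sketched as a small pure function. This is an illustration in Python rather than the browser code itself, and the window sizes (`max_ahead`, `keep_behind`) and the function name are hypothetical values chosen for the example, not the Web UI's actual settings.

```python
from typing import Optional, Tuple


def plan_buffer(
    current_time: float,
    buffered_start: float,
    buffered_end: float,
    max_ahead: float = 30.0,    # hypothetical: seconds allowed ahead of playback
    keep_behind: float = 10.0,  # hypothetical: seconds retained behind playback
) -> Tuple[Optional[Tuple[float, float]], bool]:
    """Return (range_to_evict or None, whether more audio may be appended)."""
    # Trailing eviction: drop everything older than keep_behind seconds.
    evict_until = current_time - keep_behind
    evict = (buffered_start, evict_until) if evict_until > buffered_start else None
    # Bounded look-ahead: stop appending once enough audio is queued.
    can_append = (buffered_end - current_time) < max_ahead
    return evict, can_append
```

With both limits in place, the amount of buffered audio stays roughly within `keep_behind + max_ahead` seconds regardless of how long the generation runs.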

Notes

Scrubbing may not be fully supported during streamed MP3 playback in the current state. WAV and other formats play back fine on completion.

[v0.4.0] - 2026-05-24

Added

GPU image variants for Blackwell / RTX 50-series (:latest-cu128, :vX.Y.Z-cu128, amd64 only) with PyTorch cu128 wheels (#443). Default :latest and new :latest-cu126 alias stay on cu126 for Maxwell/Pascal compatibility.
Integration test suite (api/tests/integration/, opt-in integration marker) and a tts-api-test-client image that round-trips speech through faster-whisper against a live server. Run via docker/docker-compose.test.yml.
Web UI footer badge showing the server version from /config.

Breaking changes

/v1/audio/voices items in the voices array changed from plain strings to {"id", "name"} objects (#462) to match OpenWebUI/similar clients, and allow metadata in the response. Clients reading entries as strings will break; pass ?legacy=true to restore the old item shape.
- Old: {"voices": ["af_heart", ...]}
- New: {"voices": [{"id": "af_heart", "name": "af_heart"}, ...]}
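For clients that must work against servers on either side of this change, a small normalizer can accept both item shapes. This is a sketch under the assumption that the payload has exactly the forms shown above; the function name is hypothetical.

```python
from typing import Any, Dict, List


def voice_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract voice IDs from either the old or the new /v1/audio/voices shape."""
    ids: List[str] = []
    for item in payload.get("voices", []):
        if isinstance(item, str):
            # Old shape (or ?legacy=true): plain string entries.
            ids.append(item)
        else:
            # New shape: {"id": ..., "name": ...} objects.
            ids.append(item["id"])
    return ids
```

Reading `item["id"]` rather than `item["name"]` keeps the client keyed on the identifier, which is the value the old string entries carried.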

Changed

api_version now read from the VERSION file instead of hardcoded.
Removed the legacy docker/{cpu,gpu}/Dockerfile; the .optimized variants are the only build files now.
Docker images carry OCI metadata so GHCR pages render properly. Integration compose defaults to the published test-client image.
ROCm image defaults to MIOPEN_FIND_MODE=2 so the on-disk kernel cache is reused instead of re-searched per process, and ships an opt-in warmup script at docker/rocm/warmup_miopen.py to pre-populate it. Recipe and benchmarks from @realugbun in #454.

Fixed

WAV responses no longer include a junk size-field trailer that decoded as a click at chunk end. (#463)
ROCm MIOpen cache set to persist across compose restarts; switched bind mounts to named volumes at the path MIOpen writes to (prior mounts targeted an inaccessible location).
cpu/gpu composes set DOWNLOAD_MODEL=true for an idempotent model fetch on startup.
VERSION shipped into images so /config reports the real server version.
Silence trimming no longer treats full-scale-negative samples as silent (int16 abs() overflow).
Fixed invalid escape sequences in the text-normalizer URL regex.
CI test job uses the CPU PyTorch build and excludes integration tests by default.
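The silence-trimming fix listed above stems from a fixed-width pitfall: in 16-bit two's complement the magnitude of -32768 is not representable, so an abs() computed in int16 wraps back to -32768 and compares as below any positive threshold. The sketch below emulates this with the standard library only; the function names and the threshold are hypothetical, and it is not the project's actual trimming code.

```python
import ctypes


def abs_int16_wrapping(sample: int) -> int:
    """Emulate abs() evaluated in 16-bit arithmetic (ctypes truncates silently)."""
    return ctypes.c_int16(abs(sample)).value


def is_silent_buggy(sample: int, threshold: int) -> bool:
    # Full-scale negative wraps to -32768, which is "below" the threshold.
    return abs_int16_wrapping(sample) < threshold


def is_silent_fixed(sample: int, threshold: int) -> bool:
    # Widening first avoids the wrap; Python ints are arbitrary precision.
    return abs(int(sample)) < threshold
```

The usual remedy in array code is the same idea: widen the samples to a larger integer type before taking the absolute value.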

[v0.3.0] - 2026-05-15

Added

AMD GPU support via ROCm (docker/rocm/ build, rocm extra in pyproject.toml). Also explored/proposed via @asheghi in #393.
gpt-4o-mini-tts model alias for OpenAI-compatible clients.
Reverse-proxy support for the Web UI (new /config endpoint exposing UVICORN_ROOT_PATH).
Configurable logging level via the API_LOG_LEVEL environment variable.
INCLUDE_JAPANESE Docker build flag for opt-in Japanese support.
Transcription accuracy test harness under examples/assorted_checks/test_transcription/ (baselines, multilingual reports, long-form runner).
Override of docker-bake.hcl variables through GitHub Actions environment variables.

Changed

PyTorch bumped to 2.8.0 (x86_64: cu126, aarch64: cu129). x86_64 settled on cu126 to keep Maxwell/Pascal cards working, which drops native Blackwell (RTX 50-series) kernel support. Blackwell users need to override the torch index manually. See #443.
kokoro bumped to 0.9.4 and misaki to 0.9.4 (proposed by @jcheek in #371, superseded).
New optimized multi-stage Dockerfiles (docker/{cpu,gpu}/Dockerfile.optimized) become the default bake target. Reported image sizes: CPU 5.6 → 4.9 GB, GPU 14.8 → 9.9 GB.
Parallelized Docker bake targets per architecture for faster CI.
rocBLAS version pinned; ROCm docker-compose now builds locally.
CI/release workflow hardening: pinned BuildKit/runners, branch-tagged builds, manifest fixes, workflow_dispatch ref and tag-check race fixed, latest tag gated.

Fixed

OGG/Opus audio truncation where the final page was lost during write_chunk finalize.
Voice tensor loading hardened with weights_only=True (avoids unsafe pickle in torch.load).
Per-request voice-tensor memory leak resolved via caching (#453), with cache cleared on unload.
Custom phoneme handling made significantly more robust.
Firefox Web UI playback falls back gracefully when audio/mpeg MSE is unsupported; waveform rendering bugfix bundled in the same web rewrite.
CPU Docker builds: Rust now installed for appuser with proper PATH and longer uv timeouts.
cmake added to CI deps to unblock pyopenjtalk builds (proposed by @jcheek in #371; superseded).
start-gpu.sh uses #!/usr/bin/env bash for broader compatibility.
Apple Silicon: test_initial_state() no longer fails.

[v0.2.4] - 2025-06-18

Added

Apple Silicon (MPS) acceleration support for macOS users.
Voice subtraction capability for creating unique voice effects.
Windows PowerShell start scripts (start-cpu.ps1, start-gpu.ps1).
Automatic model downloading integrated into all start scripts.
Example Helm chart values for Azure AKS and Nvidia GPU Operator deployments.
Volume multiplier setting.
Chinese punctuation-based sentence splitting.
CONTRIBUTING.md guidelines for developers.

Changed

Version bump of underlying Kokoro and Misaki libraries.
Default API port reverted to 8880.
Docker containers now run as a non-root user.
Improved text normalization for numbers, currency, and time formats.
Improved MP3 encoding and audio-pause handling.
Updated and improved Helm chart configurations and documentation.
Enhanced temporary file management with better error tracking.
Web UI dependencies (Siriwave) are now served locally.
Standardized environment variable handling across shell/PowerShell scripts.
Rust installed in Dockerfile for builds requiring it.

Fixed

Download links no longer dropped when streaming=false and return_download_link=true.
Windows PowerShell start scripts fixed around virtual-environment activation order.
Potential segfaults during inference addressed.
Helm chart issues around health checks, ingress, and default values.
Audio-quality degradation from incorrect bitrate settings in some paths.
Custom phonemes provided in input text are now preserved end-to-end.
'MediaSource' error affecting playback stability in the web player.
CRLF line endings in custom_responses.py converted to LF.
Money parsing and related tests.
Additional safety checks on captioned-speech generation.
Phoneme handling fixes.

Removed

Obsolete GitHub Actions build workflow; build and publish now occur on merge to the Release branch.

[v0.2.3] - 2025-03-06

Added

Streaming word timestamps.
.gitattributes for consistent line endings.

Changed

Text normalization improvements.

Fixed

Audio-quality regression caused by lower-bitrate encoding.
Disabled uvicorn/FastAPI --reload to avoid pegging a CPU core.

[v0.2.2] - 2025-02-13

Added

Helm chart.
Settings-based override of the default lang_code.
Advanced normalization settings.

Fixed

Speech not engaging reliably on the CPU image fallback.
Audio quality bumped via adjusted compression settings.
Web UI format-selection bug.

[v0.2.1] - 2025-02-10

Added

Dummy /v1/models endpoint for OpenAI compatibility (#144).

Changed

Caption flow now streams audio with tempfile download at completion, removing duplicate captions (#139).

Fixed

Compatibility with the espeak-loader dependency on misaki (#127).
Build system and model-download issues.

[v0.2.0post1] - 2025-02-07

Fixed: Kokoro is now built from source, with adjustments, to avoid a CUDA lock
Fixed ARM64 compatibility of the spaCy dependency to avoid emulation slowdown
Added g++ for Japanese language support
Temporarily disabled Vietnamese language support due to ARM64 compatibility issues

[v0.2.0-pre] - 2025-02-06

Added

Complete Model Overhaul:
- Upgraded to Kokoro v1.0 model architecture
- Pre-installed multi-language support from Misaki:
  - English (en), Japanese (ja), Korean (ko), Chinese (zh), Vietnamese (vi)
- All voice packs included for supported languages, along with the original versions.
Enhanced Audio Generation Features:
- Per-word timestamped caption generation
- Phoneme-based audio generation capabilities
- Detailed phoneme generation
Web UI Improvements:
- Improved voice mixing with weighted combinations
- Text file upload support
- Enhanced formatting and user interface
- Cleaner UI (in progress)
- Integration with https://github.com/hexgrad/kokoro and https://github.com/hexgrad/misaki packages

Removed

Deprecated support for Kokoro v0.19 model

Changed

Combine Voices endpoint now returns a .pt file; otherwise, voice combinations used for generation are produced on the fly

[v0.1.4] - 2025-01-30

Added

Smart Chunking System:
- New text_processor with smart_split for improved sentence boundary detection
- Dynamically adjusts chunk sizes based on sentence structure, using phoneme/token information in an initial pass
- Should never exceed the 510-token limit per chunk, while preserving natural cadence
Web UI Added (to replace Gradio):
- Integrated streaming with tempfile generation
- Download links available in X-Download-Path header
- Configurable cleanup triggers for temp files
Debug Endpoints:
- /debug/threads for thread information and stack traces
- /debug/storage for temp file and output directory monitoring
- /debug/system for system resource information
- /debug/session_pools for ONNX/CUDA session status
Automated Model Management:
- Auto-download from releases page
- Included download scripts for manual installation
- Pre-packaged voice models in repository
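The idea behind the smart chunking entry above can be illustrated with a greedy sentence packer. This is a simplified sketch, not the project's `smart_split`: `count_tokens` is a hypothetical stand-in that counts characters, whereas the changelog describes using phoneme/token information, and the 510 limit is assumed here to be a token count.

```python
import re
from typing import List

MAX_TOKENS = 510  # per-chunk limit quoted in the changelog (assumed to be tokens)


def count_tokens(text: str) -> int:
    """Hypothetical stand-in: one token per character."""
    return len(text)


def greedy_chunks(text: str, limit: int = MAX_TOKENS) -> List[str]:
    """Pack whole sentences into chunks without exceeding the limit."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > limit:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than the limit is left as one oversized
    # chunk here; a fuller implementation would split it further.
    return chunks
```

Breaking only at sentence boundaries is what keeps the cadence natural; the limit check simply decides which boundary to use.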

Changed

Significant architectural improvements:
- Multi-model architecture support
- Enhanced concurrency handling
- Improved streaming header management
- Better resource/session pool management

[v0.1.2] - 2025-01-23

Structural Improvements

Models can be manually downloaded and placed in api/src/models, or fetched with the included script
TTSGPU/TPSCPU/STTSService classes replaced with a ModelManager service
- CPU/GPU of each of ONNX/PyTorch (Note: Only PyTorch GPU, and ONNX CPU/GPU have been tested)
- Should be able to improve new models as they become available, or new architectures, in a more modular way
Converted a number of internal processes to async handling to improve concurrency
Improving separation of concerns towards plug-in and modular structure, making PRs and new features easier

Web UI (test release)

An integrated simple web UI has been added on the FastAPI server directly
- This can be disabled via core/config.py or ENV variables if desired.
- Simplifies deployments, utility testing, aesthetics, etc
- Looking to deprecate/collaborate/hand off the Gradio UI

[v0.1.0] - 2025-01-13

Changed

Major Docker improvements:
- Baked model directly into Dockerfile for improved deployment reliability
- Switched to uv for dependency management
- Streamlined container builds and reduced image sizes
Dependency Management:
- Migrated from pip/poetry to uv for faster, more reliable package management
- Added uv.lock for deterministic builds
- Updated dependency resolution strategy

[v0.0.5post1] - 2025-01-11

Fixed

Docker image tagging and versioning improvements (-gpu, -cpu, -ui)
Minor VRAM management improvements
Gradio bugfix causing crashes and errant warnings
Updated GPU and UI container configurations

[v0.0.5] - 2025-01-10

Fixed

Resolved image tagging and structure issues from v0.0.4
Added automatic master to develop branch synchronization
Improved release tagging and structures
Initial CI/CD setup

2025-01-04

Added

ONNX Support:
- Added single batch ONNX support for CPU inference
- Roughly 0.4 RTF (2.4x real-time speed)

Modified

Code Refactoring:
- Work on modularizing phonemizer and tokenizer into separate services
- Incorporated these services into a dev endpoint
Testing and Benchmarking:
- Cleaned up benchmarking scripts
- Cleaned up test scripts
- Added auto-WAV validation scripts

2025-01-02

Audio Format Support:
- Added comprehensive audio format conversion support (mp3, wav, opus, flac)

2025-01-01

Added

Gradio Web Interface:
- Added simple web UI utility for audio generation from input or txt file

Modified

Configuration Changes

Updated Docker configurations:
- Changes to Dockerfile:
  - Improved layer caching by separating dependency and code layers
- Updates to docker-compose.yml and docker-compose.cpu.yml:
  - Removed commit lock from model fetching to allow automatic model updates from HF
  - Added git index lock cleanup

API Changes

Modified api/src/main.py
Updated TTS service implementation in api/src/services/tts.py:
- Added device management for better resource control:
  - Voices are now copied from model repository to api/src/voices directory for persistence
- Refactored voice pack handling:
  - Removed static voice pack dictionary
  - On-demand voice loading from disk
- Added model warm-up functionality:
  - Model now initializes with a dummy text generation
  - Uses default voice (af.pt) for warm-up
  - Model is ready for inference on first request