py-xiaozhi
May 25, 2026
About
py-xiaozhi is a lightweight, cross-platform multi-modal AI interaction framework built on Python's async architecture. It supports real-time voice streaming, vision-language tasks, and IoT device control. Deployable across Windows, macOS, Linux desktops, and ARM embedded platforms (Raspberry Pi, Horizon Robotics RDK, Jetson Nano), it bridges the gap between Large Language Models and physical hardware — out of the box.
Evolved from the xiaozhi-esp32 firmware project. Officially adopted by D-Robotics (xiaozhi-in-rdk) as an upstream dependency.
Related Projects
- xiaozhi-desktop — Electron desktop client with AEC echo cancellation, Live2D, floating window modes, and Windows / macOS installers
Demo

Key Features
- Real-time Voice AI — Opus codec with auto frame detection (RFC 6716 TOC parsing), async streaming, sub-20ms latency
- Multi-modal Vision — Camera capture + vision-language model integration for image understanding and scene perception
- MCP Tool Ecosystem — Modular JSON-RPC 2.0 tool server: music player, camera, screenshot, app management, weather, volume control
- Cross-platform Deployment — Windows 10+ / macOS 10.15+ / Linux (x86_64 & ARM), optimized for Raspberry Pi and edge boards
- Multiple UI Modes — PySide6 + QML GUI / CLI / GPIO, adapting to desktop, headless server, and embedded environments
- Offline Wake Word — Sherpa-ONNX based on-device keyword spotting with custom wake word support
- IoT & Embodied AI Ready — GPIO interface for robotics control, hardware actuation, and sensor integration
- WebSocket / MQTT — Dual protocol communication with WSS/TLS encrypted transmission and auto-reconnection
- Plugin Architecture — Event-driven async design, clean dependency injection, extensible plugin system
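The "auto frame detection" item refers to the TOC (table-of-contents) byte that RFC 6716 places at the start of every Opus packet. As a rough illustration of what parsing that byte involves, here is a minimal sketch; the function name `parse_opus_toc` and the returned dictionary layout are assumptions made for this example, not the project's actual API.

```python
# Frame durations in milliseconds, following RFC 6716, Section 3.1.
_SILK_MS = (10.0, 20.0, 40.0, 60.0)  # configs 0-11
_HYBRID_MS = (10.0, 20.0)            # configs 12-15
_CELT_MS = (2.5, 5.0, 10.0, 20.0)    # configs 16-31


def parse_opus_toc(packet: bytes) -> dict:
    """Decode the TOC byte of one Opus packet (hypothetical helper)."""
    if not packet:
        raise ValueError("empty Opus packet")
    toc = packet[0]
    config = toc >> 3          # top 5 bits: mode, bandwidth, frame size
    stereo = bool(toc & 0x04)  # 's' bit: 0 = mono, 1 = stereo
    code = toc & 0x03          # 'c' bits: frame-count code

    if config < 12:
        mode, frame_ms = "SILK", _SILK_MS[config % 4]
    elif config < 16:
        mode, frame_ms = "Hybrid", _HYBRID_MS[config % 2]
    else:
        mode, frame_ms = "CELT", _CELT_MS[config % 4]

    if code == 0:
        frames = 1
    elif code in (1, 2):
        frames = 2
    else:
        # Code 3: the frame count sits in the low 6 bits of the second byte.
        if len(packet) < 2:
            raise ValueError("code-3 packet is missing its frame-count byte")
        frames = packet[1] & 0x3F

    return {
        "mode": mode,
        "stereo": stereo,
        "frame_ms": frame_ms,
        "frames": frames,
        "packet_ms": frame_ms * frames,
    }


info = parse_opus_toc(b"\x78")  # config 15, mono, one frame
```

Knowing the per-packet duration lets a receiver size its decode buffer without being told the sender's frame length in advance.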
System Requirements
Basic Requirements
- Python Version: 3.10 - 3.12
- Operating System: Windows 10+, macOS 10.15+, Linux
- Audio Devices: Microphone and speaker devices
- Network Connection: Stable internet connection (for AI services and online features)
Recommended Configuration
- Memory: At least 4GB RAM (8GB+ recommended)
- Processor: Modern CPU with AVX instruction set support
- Storage: At least 2GB available disk space (for model files and cache)
- Audio: Audio devices supporting 16kHz sampling rate
Optional Feature Requirements
- Voice Wake-up: Requires downloading Sherpa-ONNX speech recognition models
- Camera Features: Requires camera device and OpenCV support
Read This First
- Carefully read the project documentation for startup tutorials and file descriptions
- The main branch always carries the latest code; after each update, reinstall the pip dependencies manually so that newly added dependencies are picked up
Zero to Xiaozhi Client (Video Tutorial)
Technical Architecture
Core Architecture Design
- Event-Driven Architecture: Based on asyncio asynchronous event loop, supporting high-concurrency processing
- Layered Design: Clear separation of application layer, protocol layer, and UI layer
- Dependency Injection: Component lifecycle managed via bootstrap container
- Plugin System: Audio, UI, MCP tools and other components loaded via plugin system
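To make the event-driven idea concrete, the following is a minimal sketch of an asyncio publish/subscribe bus. The class name `EventBus`, its methods, and the topic string are hypothetical and chosen for illustration; the project's own event bus under src/core/ may look quite different.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """Tiny async pub/sub bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    async def publish(self, topic: str, payload: Any = None) -> None:
        # Run every handler registered for the topic concurrently.
        handlers = self._handlers.get(topic, [])
        await asyncio.gather(*(handler(payload) for handler in handlers))


seen: list = []


async def on_wake(payload: Any) -> None:
    seen.append(payload)


bus = EventBus()
bus.subscribe("wake_word", on_wake)  # hypothetical topic name
asyncio.run(bus.publish("wake_word", "hello"))
```

Components that only talk through such a bus do not need to import each other, which is one common way to keep application, protocol, and UI layers separate.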
Key Technical Components
- Audio Processing: Opus codec, real-time resampling
- Speech Recognition: Sherpa-ONNX offline models, wake word recognition
- Protocol Communication: WebSocket/MQTT dual protocol support, encrypted transmission, auto-reconnection
- Configuration System: Hierarchical configuration, dot notation access, dynamic updates
Performance Optimization
- Async First: Full system asynchronous architecture, avoiding blocking operations
- Memory Management: Smart caching, garbage collection
- Audio Optimization: 5ms low-latency processing, queue management, streaming transmission
- Concurrency Control: Task pool management, semaphore control, thread safety
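The semaphore control mentioned above typically means capping how many tasks run at once. A minimal sketch with `asyncio.Semaphore` follows; `run_bounded` and the demo job are hypothetical names for this example, not functions from the project.

```python
import asyncio


async def run_bounded(coros, limit: int) -> list:
    """Await all coroutines, with at most `limit` in flight at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def demo() -> tuple[list, int]:
    active = 0
    peak = 0

    async def job(n: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)   # record the highest concurrency seen
        await asyncio.sleep(0.01)  # stand-in for real async work
        active -= 1
        return n * 2

    results = await run_bounded([job(i) for i in range(6)], limit=2)
    return results, peak


results, peak = asyncio.run(demo())
```

`asyncio.gather` preserves input order, so the results line up with the submitted jobs even though only two run concurrently.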
Security Mechanisms
- Encrypted Communication: WSS/TLS encryption, certificate verification
- Device Authentication: Dual protocol activation, device fingerprint recognition
- Access Control: Tool permission management, API access control
- Error Isolation: Exception isolation, fault recovery, graceful degradation
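For the certificate verification item, Python's standard library already provides a verifying client context. The sketch below shows one way to build it; the helper name `make_client_tls_context` and the TLS 1.2 floor are assumptions for this example, and how the project itself configures TLS is not shown in this document.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server certificate."""
    # create_default_context() loads the system CA store and enables both
    # certificate verification and hostname checking.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy
    return context


ctx = make_client_tls_context()
```

Client libraries that open WSS or MQTT-over-TLS connections commonly accept such a context object, which keeps the verification policy in one place.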
Development Guide
Project Structure
py-xiaozhi/
├── main.py # Application entry point
├── src/
│ ├── activation/ # Device activation
│ ├── audio_codecs/ # Audio codecs
│ ├── audio_processing/ # Wake word detection
│ ├── bootstrap/ # Application bootstrap & dependency injection
│ ├── constants/ # Constants
│ ├── core/ # Core infrastructure (event bus, state management, task management, etc.)
│ ├── logging/ # Logging subsystem
│ ├── mcp/ # MCP tool system
│ │ ├── mcp_server.py # MCP server
│ │ └── tools/ # Tool modules (music/camera/screenshot/app/weather/volume)
│ ├── plugins/ # Plugin system (audio, UI, MCP, wake word, shortcuts)
│ ├── protocols/ # Communication protocols (WebSocket/MQTT)
│ ├── ui/ # User interface
│ │ ├── gui/ # PySide6 + QML graphical interface
│ │ ├── cli/ # Command line interface
│ │ └── gpio/ # GPIO embedded interface
│ └── utils/ # Utility functions
├── libs/ # Third-party native libraries
│ ├── libopus/ # Opus audio codec library
│ └── webrtc_apm/ # WebRTC audio processing module
├── models/ # Wake word models
├── assets/ # Static resources
├── scripts/ # Auxiliary scripts
├── documents/ # VitePress documentation site
├── pyproject.toml # Project configuration
└── build.json # Build configuration
Development Environment Setup
# Clone project
git clone https://github.com/huangjunsen0406/py-xiaozhi.git
cd py-xiaozhi
# Base install (CLI / GPIO mode)
uv sync # Recommended (uv users)
# or: pip install -e . # pip users
# GUI mode (extra: PySide6 + qasync)
uv sync --extra gui # Recommended (uv users)
# or: pip install -e '.[gui]' # pip users
# Full development environment (GUI + test / packaging tools)
uv sync --extra gui --group dev
# Code formatting
./format_code.sh
# Run program - GUI mode (default; requires gui extra)
python main.py
# Run program - CLI mode (base install is enough)
python main.py --mode cli
# Specify communication protocol
python main.py --protocol websocket # WebSocket (default)
python main.py --protocol mqtt # MQTT protocol
Core Development Patterns
- Async First: Use async/await syntax, avoid blocking operations
- Error Handling: Complete exception handling and logging
- Configuration Management: Use ConfigManager for unified configuration access
- Test-Driven: Write unit tests to ensure code quality
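As an illustration of the "avoid blocking operations" rule, blocking work can be moved off the event loop with `asyncio.to_thread`. The function names and the file name below are hypothetical stand-ins, not code from the project.

```python
import asyncio
import time


def blocking_load(name: str) -> str:
    # Stand-in for blocking work such as disk I/O or a native library call.
    time.sleep(0.05)
    return f"loaded:{name}"


async def heartbeat(ticks: list) -> None:
    # Keeps running while the blocking call executes in a worker thread.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def main() -> tuple[str, list]:
    ticks: list = []
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_load, "model.bin"),
        heartbeat(ticks),
    )
    return result, ticks


result, ticks = asyncio.run(main())
```

Calling `blocking_load` directly inside a coroutine would stall every other task for its full duration; the thread hand-off lets the heartbeat keep ticking.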
Extension Development
- Add MCP Tools: Create new tool modules in the src/mcp/tools/ directory
- Add Protocols: Implement the Protocol abstract base class
- Add Plugins: Extend the plugin system via src/plugins/
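To give a feel for what a JSON-RPC 2.0 tool call involves, here is a simplified, self-contained sketch. The `tool` decorator, the `TOOLS` registry, the `demo.shout` tool, and the `handle` function are all hypothetical; the real registration API lives in src/mcp/ and is not reproduced here. `tools/call` is the MCP method name for invoking a tool, and the result shape is simplified compared with the MCP specification.

```python
import json
from typing import Any, Callable

# Registry of tool name -> callable. All names here are hypothetical.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function under a tool name."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("demo.shout")
def shout(text: str) -> str:
    return text.upper()


def handle(raw: str) -> str:
    """Handle one JSON-RPC 2.0 request string and return the response string."""
    req = json.loads(raw)
    resp: dict[str, Any] = {"jsonrpc": "2.0", "id": req.get("id")}
    params = req.get("params") or {}
    if req.get("method") != "tools/call":
        resp["error"] = {"code": -32601, "message": "Method not found"}
    elif params.get("name") not in TOOLS:
        resp["error"] = {"code": -32602, "message": "Unknown tool"}
    else:
        resp["result"] = TOOLS[params["name"]](**params.get("arguments", {}))
    return json.dumps(resp)


request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "demo.shout", "arguments": {"text": "hi"}},
}
reply = json.loads(handle(json.dumps(request)))
```

The response echoes the request `id`, which is how a JSON-RPC client matches replies to outstanding calls.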
State Transition Diagram
                      +----------------+
                      |                |
                      v                |
+------+  Wake/Button +------------+   |   +------------+
| IDLE | -----------> | CONNECTING | --+-> | LISTENING  |
+------+              +------------+       +------------+
   ^                                           |
   |                                           | Voice Recognition Complete
   |          +------------+                   v
   +--------- |  SPEAKING  | <-----------------+
     Playback +------------+
     Complete
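The labeled transitions in the diagram can be expressed as a small lookup table. This is a minimal sketch: the event names are hypothetical labels derived from the arrow captions, and the unlabeled loop at the top of the diagram is left out because its trigger is not stated.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECTING = auto()
    LISTENING = auto()
    SPEAKING = auto()


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "wake_or_button"): State.CONNECTING,
    (State.CONNECTING, "connected"): State.LISTENING,
    (State.LISTENING, "recognition_complete"): State.SPEAKING,
    (State.SPEAKING, "playback_complete"): State.IDLE,
}


def next_state(state: State, event: str) -> State:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in {state.name}") from None


state = State.IDLE
for event in ("wake_or_button", "connected",
              "recognition_complete", "playback_complete"):
    state = next_state(state, event)
```

Keeping transitions in one table makes illegal moves, such as jumping from IDLE straight to SPEAKING, fail loudly instead of silently corrupting state.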
Contributing
- Start with CONTRIBUTING.md for the repository workflow
- Chinese version: CONTRIBUTING_ZH.md
- Detailed docs: Contribution Guide
Maintainer Workflow
- Triage incoming work as bug, feature, docs, refactor, or maintenance
- Prefer focused pull requests with clear validation steps and linked context
- Require docs updates when behavior, configuration, or public APIs change
- Merge after CI passes and review feedback is resolved
- Release through the normal release flow; merge does not imply immediate shipping
Community and Support
Thanks to the Following Open Source Contributors
In no particular order
Xiaoxia zhh827 SmartArduino-Li Honggang HonestQiao vonweller Sun Weigong isamu2025 Rain120 kejily Radio bilibili Jun Cyber Intelligence
Sponsorship Support
Thanks to All Sponsors ❤️
Whether it's API resources, device compatibility testing, or financial support, every contribution makes the project more complete