py-xiaozhi

May 25, 2026

About

py-xiaozhi is a lightweight, cross-platform multi-modal AI interaction framework built on Python's async architecture. It supports real-time voice streaming, vision-language tasks, and IoT device control. It runs on Windows, macOS, and Linux desktops as well as ARM embedded platforms (Raspberry Pi, Horizon Robotics RDK, Jetson Nano), connecting Large Language Models to physical hardware out of the box.

py-xiaozhi evolved from the xiaozhi-esp32 firmware project and has been officially adopted by D-Robotics (xiaozhi-in-rdk) as an upstream dependency.

xiaozhi-desktop — Electron desktop client with AEC echo cancellation, Live2D, floating window modes, and Windows / macOS installers

Demo

Bilibili Demo Video

Key Features

Real-time Voice AI — Opus codec with auto frame detection (RFC 6716 TOC parsing), async streaming, sub-20ms latency
Multi-modal Vision — Camera capture + vision-language model integration for image understanding and scene perception
MCP Tool Ecosystem — Modular JSON-RPC 2.0 tool server: music player, camera, screenshot, app management, weather, volume control
Cross-platform Deployment — Windows 10+ / macOS 10.15+ / Linux (x86_64 & ARM), optimized for Raspberry Pi and edge boards
Multiple UI Modes — PySide6 + QML GUI / CLI / GPIO, adapting to desktop, headless server, and embedded environments
Offline Wake Word — Sherpa-ONNX based on-device keyword spotting with custom wake word support
IoT & Embodied AI Ready — GPIO interface for robotics control, hardware actuation, and sensor integration
WebSocket / MQTT — Dual protocol communication with WSS/TLS encrypted transmission and auto-reconnection
Plugin Architecture — Event-driven async design, clean dependency injection, extensible plugin system
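
The "auto frame detection (RFC 6716 TOC parsing)" feature above can be illustrated with a minimal sketch. It decodes the first byte of an Opus packet following the TOC layout in RFC 6716, Section 3.1. The function name `parse_opus_toc` and the returned dictionary keys are assumptions made for illustration, not the project's actual API.

```python
# Frame durations in milliseconds, indexed by the 5-bit "config" field of the
# TOC byte (RFC 6716, Section 3.1).
_SILK_MS = (10.0, 20.0, 40.0, 60.0)   # configs 0-11
_HYBRID_MS = (10.0, 20.0)             # configs 12-15
_CELT_MS = (2.5, 5.0, 10.0, 20.0)     # configs 16-31


def parse_opus_toc(packet: bytes) -> dict:
    """Decode the TOC byte of one Opus packet (hypothetical helper)."""
    if not packet:
        raise ValueError("empty Opus packet")
    toc = packet[0]
    config = toc >> 3            # top 5 bits: mode, bandwidth, frame size
    stereo = bool(toc & 0x04)    # "s" bit
    code = toc & 0x03            # "c" bits: frame-count code

    if config < 12:
        frame_ms = _SILK_MS[config % 4]
    elif config < 16:
        frame_ms = _HYBRID_MS[config % 2]
    else:
        frame_ms = _CELT_MS[config % 4]

    if code == 0:
        frames = 1
    elif code in (1, 2):
        frames = 2
    else:
        # Code 3: the frame count sits in the low 6 bits of the next byte.
        if len(packet) < 2:
            raise ValueError("code 3 packet is missing its frame-count byte")
        frames = packet[1] & 0x3F

    return {
        "frame_ms": frame_ms,
        "frames": frames,
        "stereo": stereo,
        "packet_ms": frame_ms * frames,
    }
```

A receiver could call a helper like this on the first incoming packet to learn the frame duration instead of hard-coding it.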

System Requirements

Basic Requirements

Python Version: 3.10 - 3.12
Operating System: Windows 10+, macOS 10.15+, Linux
Audio Devices: Microphone and speaker devices
Network Connection: Stable internet connection (for AI services and online features)

Recommended Configuration

Memory: At least 4GB RAM (8GB+ recommended)
Processor: Modern CPU with AVX instruction set support
Storage: At least 2GB available disk space (for model files and cache)
Audio: Audio devices supporting 16kHz sampling rate

Optional Feature Requirements

Voice Wake-up: Requires downloading Sherpa-ONNX speech recognition models
Camera Features: Requires camera device and OpenCV support

Read This First

Read the project documentation carefully for startup tutorials and file descriptions
The main branch has the latest code; reinstall the pip dependencies manually after each update so that newly added dependencies are picked up

Zero to Xiaozhi Client (Video Tutorial)

Technical Architecture

Core Architecture Design

Event-Driven Architecture: Based on asyncio asynchronous event loop, supporting high-concurrency processing
Layered Design: Clear separation of application layer, protocol layer, and UI layer
Dependency Injection: Component lifecycle managed via bootstrap container
Plugin System: Audio, UI, MCP tools and other components loaded via plugin system
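
To make the event-driven design above more concrete, here is a minimal publish/subscribe sketch on top of asyncio. The `EventBus` class, the topic name, and the handlers are hypothetical and only illustrate the pattern; the project's own event bus in `src/core/` may look quite different.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """Minimal async publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    async def publish(self, topic: str, payload: Any) -> None:
        # Run all handlers concurrently. return_exceptions=True keeps one
        # failing handler from cancelling the others.
        await asyncio.gather(
            *(handler(payload) for handler in self._handlers[topic]),
            return_exceptions=True,
        )


async def demo() -> list[str]:
    bus = EventBus()
    seen: list[str] = []

    async def on_wake(payload: Any) -> None:
        seen.append(f"wake:{payload}")

    async def broken(payload: Any) -> None:
        raise RuntimeError("handler failed")

    bus.subscribe("wake_word", broken)
    bus.subscribe("wake_word", on_wake)
    await bus.publish("wake_word", "xiaozhi")
    return seen
```

Running `asyncio.run(demo())` shows that the healthy handler still records its event even though another subscriber raised.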

Key Technical Components

Audio Processing: Opus codec, real-time resampling
Speech Recognition: Sherpa-ONNX offline models, wake word recognition
Protocol Communication: WebSocket/MQTT dual protocol support, encrypted transmission, auto-reconnection
Configuration System: Hierarchical configuration, dot notation access, dynamic updates
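
Dot notation access to a hierarchical configuration, as listed above, can be sketched with two small helpers over nested dictionaries. The function names and the example keys (`audio.sample_rate`, `network.protocol`) are assumptions for illustration and do not reflect the project's real configuration schema.

```python
from typing import Any


def get_by_path(config: dict, path: str, default: Any = None) -> Any:
    """Return the value at a dotted path, or `default` if any key is missing."""
    node: Any = config
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def set_by_path(config: dict, path: str, value: Any) -> None:
    """Set the value at a dotted path, creating intermediate dicts as needed.

    Assumes every existing intermediate node is itself a dict.
    """
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Hypothetical configuration content.
config = {"audio": {"sample_rate": 16000}}
set_by_path(config, "network.protocol", "websocket")
```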

Performance Optimization

Async First: Full system asynchronous architecture, avoiding blocking operations
Memory Management: Smart caching, garbage collection
Audio Optimization: 5ms low-latency processing, queue management, streaming transmission
Concurrency Control: Task pool management, semaphore control, thread safety
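
Semaphore control, mentioned above, is commonly implemented by wrapping each task in an `asyncio.Semaphore`. The sketch below is one way to cap the number of coroutines in flight; `run_bounded` and the demo job are hypothetical names, not part of the project.

```python
import asyncio


async def run_bounded(coro_factories, limit: int):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with semaphore:
            return await factory()

    # gather preserves the input order of the results.
    return await asyncio.gather(*(guarded(f) for f in coro_factories))


async def demo():
    active = 0
    peak = 0

    async def job(i: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)   # track the highest observed concurrency
        await asyncio.sleep(0.01)  # stand-in for real I/O
        active -= 1
        return i * 2

    factories = [lambda i=i: job(i) for i in range(6)]
    results = await run_bounded(factories, limit=2)
    return results, peak
```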

Security Mechanisms

Encrypted Communication: WSS/TLS encryption, certificate verification
Device Authentication: Dual protocol activation, device fingerprint recognition
Access Control: Tool permission management, API access control
Error Isolation: Exception isolation, fault recovery, graceful degradation
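
For the WSS/TLS encryption with certificate verification listed above, Python's standard `ssl` module already provides a verifying client context. The following is a minimal sketch; the helper name `make_client_tls_context` and the TLS 1.2 floor are assumptions, and the project may configure TLS differently.

```python
import ssl
from typing import Optional


def make_client_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server certificate."""
    # create_default_context() enables certificate verification and hostname
    # checking; ca_file optionally points at a custom CA bundle.
    context = ssl.create_default_context(cafile=ca_file)
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context like this can be handed to any WebSocket or MQTT client that accepts an `ssl.SSLContext`.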

Development Guide

Project Structure

py-xiaozhi/
├── main.py                     # Application entry point
├── src/
│   ├── activation/             # Device activation
│   ├── audio_codecs/           # Audio codecs
│   ├── audio_processing/       # Wake word detection
│   ├── bootstrap/              # Application bootstrap & dependency injection
│   ├── constants/              # Constants
│   ├── core/                   # Core infrastructure (event bus, state management, task management, etc.)
│   ├── logging/                # Logging subsystem
│   ├── mcp/                    # MCP tool system
│   │   ├── mcp_server.py       # MCP server
│   │   └── tools/              # Tool modules (music/camera/screenshot/app/weather/volume)
│   ├── plugins/                # Plugin system (audio, UI, MCP, wake word, shortcuts)
│   ├── protocols/              # Communication protocols (WebSocket/MQTT)
│   ├── ui/                     # User interface
│   │   ├── gui/                # PySide6 + QML graphical interface
│   │   ├── cli/                # Command line interface
│   │   └── gpio/               # GPIO embedded interface
│   └── utils/                  # Utility functions
├── libs/                       # Third-party native libraries
│   ├── libopus/                # Opus audio codec library
│   └── webrtc_apm/             # WebRTC audio processing module
├── models/                     # Wake word models
├── assets/                     # Static resources
├── scripts/                    # Auxiliary scripts
├── documents/                  # VitePress documentation site
├── pyproject.toml              # Project configuration
└── build.json                  # Build configuration

Development Environment Setup

# Clone project
git clone https://github.com/huangjunsen0406/py-xiaozhi.git
cd py-xiaozhi

# Base install (CLI / GPIO mode)
uv sync                                    # Recommended (uv users)
# or: pip install -e .                    # pip users

# GUI mode (extra: PySide6 + qasync)
uv sync --extra gui                        # Recommended (uv users)
# or: pip install -e '.[gui]'             # pip users

# Full development environment (GUI + test / packaging tools)
uv sync --extra gui --group dev

# Code formatting
./format_code.sh

# Run program - GUI mode (default; requires gui extra)
python main.py

# Run program - CLI mode (base install is enough)
python main.py --mode cli

# Specify communication protocol
python main.py --protocol websocket  # WebSocket (default)
python main.py --protocol mqtt       # MQTT protocol
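
The `--mode` and `--protocol` flags used above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions: the value `gui` for the default mode and the exact set of accepted choices are guesses based on the commands shown, and the real `main.py` may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument(
        "--mode",
        choices=["gui", "cli"],   # assumed values; GUI is the documented default
        default="gui",
        help="user interface mode",
    )
    parser.add_argument(
        "--protocol",
        choices=["websocket", "mqtt"],
        default="websocket",      # WebSocket is the documented default
        help="communication protocol",
    )
    return parser
```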

Core Development Patterns

Async First: Use async/await syntax, avoid blocking operations
Error Handling: Handle exceptions thoroughly and log them
Configuration Management: Use ConfigManager for unified configuration access
Test-Driven: Write unit tests to ensure code quality
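
As an illustration of the "Async First" pattern above, blocking calls can be moved off the event loop with `asyncio.to_thread`, so other coroutines keep running. The functions below are hypothetical stand-ins, not project code.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    """Stand-in for a blocking call such as file I/O or a synchronous SDK."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list) -> None:
    """Keeps ticking while the blocking call runs in a worker thread."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def demo():
    ticks: list = []
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_work, 0.05),  # runs in a thread, loop stays free
        heartbeat(ticks),
    )
    return result, ticks
```

Calling `blocking_work` directly inside a coroutine would instead stall every other task for the duration of the call.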

Extension Development

Add MCP Tools: Create new tool modules in src/mcp/tools/ directory
Add Protocols: Implement the Protocol abstract base class
Add Plugins: Extend the plugin system via src/plugins/
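
Since the MCP tool server is described as speaking JSON-RPC 2.0, a new tool essentially becomes a named function that a dispatcher calls with the request parameters. The sketch below shows that idea with a generic JSON-RPC 2.0 handler. The registry, the `tool` decorator, and the `echo_upper` tool are all hypothetical; the actual registration API in `src/mcp/` and the MCP method names are not shown here.

```python
import json
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function under a tool name (hypothetical decorator)."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("echo_upper")  # hypothetical tool name
def echo_upper(text: str) -> str:
    return text.upper()


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request string and return the response string."""
    request = json.loads(raw)
    func = TOOLS.get(request.get("method"))
    if func is None:
        body = {"error": {"code": -32601, "message": "Method not found"}}
    else:
        try:
            body = {"result": func(**request.get("params", {}))}
        except TypeError as exc:
            # Simplification: any TypeError is reported as invalid params.
            body = {"error": {"code": -32602, "message": f"Invalid params: {exc}"}}
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), **body})
```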

State Transition Diagram

                        +----------------+
                        |                |
                        v                |
+------+  Wake/Button  +------------+   |   +------------+
| IDLE | -----------> | CONNECTING | --+-> | LISTENING  |
+------+              +------------+       +------------+
   ^                                            |
   |                                            | Voice Recognition Complete
   |          +------------+                    v
   +--------- |  SPEAKING  | <-----------------+
     Playback +------------+
     Complete
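
The main cycle in the diagram above can be written as a small transition table. This is a sketch only: the event names are invented for illustration, and the loop drawn at the top of the diagram is left out because its exact meaning is not clear from the drawing.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECTING = auto()
    LISTENING = auto()
    SPEAKING = auto()


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "wake"): State.CONNECTING,
    (State.CONNECTING, "connected"): State.LISTENING,
    (State.LISTENING, "recognition_complete"): State.SPEAKING,
    (State.SPEAKING, "playback_complete"): State.IDLE,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```

Keeping the transitions in one table makes invalid events fail loudly instead of silently leaving the client in an unexpected state.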

Contributing

Start with CONTRIBUTING.md for the repository workflow
Chinese version: CONTRIBUTING_ZH.md
Detailed docs: Contribution Guide

Maintainer Workflow

Triage incoming work as bug, feature, docs, refactor, or maintenance
Prefer focused pull requests with clear validation steps and linked context
Require docs updates when behavior, configuration, or public APIs change
Merge after CI passes and review feedback is resolved
Release through the normal release flow; merge does not imply immediate shipping

Community and Support

Thanks to the Following Open Source Contributors

In no particular order

Xiaoxia zhh827 SmartArduino-Li Honggang HonestQiao vonweller Sun Weigong isamu2025 Rain120 kejily Radio bilibili Jun Cyber Intelligence

Sponsorship Support

Thanks to All Sponsors ❤️

Whether it's API resources, device compatibility testing, or financial support, every contribution makes the project more complete

License

MIT License