Fluid Server: Local AI server for your Windows apps
September 10, 2025 · View on GitHub
THIS PROJECT IS UNDER ACTIVE DEVELOPMENT. It is not ready for production use, but it serves as a good reference for how to run Whisper on Qualcomm and Intel NPUs.
A portable, packaged OpenAI-compatible server for Windows desktop applications. LLM, transcription, embeddings, and a vector DB, all out of the box.
Note that this requires you to run the .exe as a separate async process, like a local serving server alongside your application, and to make HTTP requests to it for inference.
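The sidecar pattern above can be sketched as follows. This is an illustrative snippet, not part of the project: `server_command` is a hypothetical helper, and the exe path and port are placeholders for your own layout.

```python
import subprocess

def server_command(exe_path, host="127.0.0.1", port=8080):
    """Build the argument list for launching fluid-server.exe with --host/--port."""
    return [exe_path, "--host", host, "--port", str(port)]

# Launch the server as a background process next to your app, then poll
# http://127.0.0.1:8080/health before sending inference requests.
# proc = subprocess.Popen(server_command(r".\dist\fluid-server.exe"))
# ...
# proc.terminate()  # shut the sidecar down when your app exits
```

Keeping the launch commented out here is deliberate: the exe location depends on whether you downloaded a release or built from source.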
Features
Core Capabilities
- LLM Chat Completions - OpenAI-compatible API with streaming, backed by llama.cpp and OpenVINO
- Audio Transcription - Whisper models with NPU acceleration, backed by OpenVINO and Qualcomm QNN
- Text Embeddings - Vector embeddings for search and RAG
- Vector Database - LanceDB integration for multimodal storage
Hardware Acceleration
- Intel NPU via OpenVINO backend
- Qualcomm NPU via QNN (Snapdragon X Elite)
- Vulkan GPU via llama-cpp
Quick Start
1. Download or Build
Option A: Download Release
- Download fluid-server.exe from releases
Option B: Run from Source
# Install dependencies and run
uv sync
uv run
2. Run the Server
# Run with default settings
.\dist\fluid-server.exe
# Or with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080
3. Test the API
- Health Check: http://localhost:8080/health
- API Docs: http://localhost:8080/docs
- Models: http://localhost:8080/v1/models
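From Python, a stdlib-only check of these endpoints might look like this. The helper names are illustrative, not part of the project's API:

```python
import urllib.error
import urllib.request

def is_healthy(base_url="http://localhost:8080"):
    """Return True if the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def model_ids(models_response):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]
```

Gating requests on `is_healthy()` avoids races while the sidecar process is still loading models.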
Usage Examples
Basic Chat Completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-8b-int8-ov", "messages": [{"role": "user", "content": "Hello!"}]}'
Python Integration
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Chat with streaming
for chunk in client.chat.completions.create(
    model="qwen3-8b-int8-ov",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
Audio Transcription
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper-large-v3-turbo-qnn"
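If you would rather not shell out to curl, the same request can be built with only the standard library. This is a sketch assuming the endpoint accepts standard OpenAI-style multipart fields (`model`, `file`), as the curl call above suggests:

```python
import io
import uuid

def build_transcription_request(audio_bytes, model, filename="audio.wav"):
    """Build a multipart/form-data body for POST /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # "model" form field
    body.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode()
    )
    # "file" form field carrying the raw audio bytes
    body.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'.encode()
    )
    body.write(audio_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", body.getvalue()
```

POST the returned body to the endpoint with a `urllib.request.Request`, setting `Content-Type` to the returned value.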
Documentation
Comprehensive Guides
- NPU Support Guide - Intel & Qualcomm NPU configuration
- Integration Guide - Python, .NET, Node.js examples
- Development Guide - Setup, building, and contributing
- LanceDB Integration - Vector database and embeddings
- GGUF Model Support - Using any GGUF model
- Compilation Guide - Build system details
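The embeddings endpoint covered by the guides above follows the OpenAI request shape. A hedged stdlib sketch, where the model id is a placeholder (check /v1/models for real ids) and the helper names are illustrative:

```python
import json
import urllib.request

def embedding_request(texts, model="embedding-model"):
    """Build the JSON payload for POST /v1/embeddings (model id is a placeholder)."""
    return json.dumps({"model": model, "input": texts}).encode()

def extract_embeddings(response_body):
    """Pull the vectors out of an OpenAI-style embeddings response."""
    return [item["embedding"] for item in response_body["data"]]

# req = urllib.request.Request(
#     "http://localhost:8080/v1/embeddings",
#     data=embedding_request(["hello world"]),
#     headers={"Content-Type": "application/json"},
# )
# vectors = extract_embeddings(json.load(urllib.request.urlopen(req)))
```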
FAQ
Why Python? Best ML ecosystem support and PyInstaller packaging.
Why not llama.cpp alone? We support multiple runtimes and AI accelerators beyond GGML.
Acknowledgements
Built using ty, FastAPI, Pydantic, ONNX Runtime, OpenAI Whisper, and various other AI libraries.
Runtime Technologies:
- OpenVINO - Intel NPU and GPU acceleration
- Qualcomm QNN - Snapdragon NPU optimization with HTP backend
- ONNX Runtime - Cross-platform AI inference