Fluid Server: Local AI server for your Windows apps
September 10, 2025 · View on GitHub
THIS PROJECT IS UNDER ACTIVE DEVELOPMENT. It is not ready for production use, but it serves as a good reference for how to run Whisper on Qualcomm and Intel NPUs.
A portable, packaged OpenAI-compatible server for Windows desktop applications. LLM, transcription, embeddings, and a vector DB, all out of the box.
Note that this requires you to run the .exe as a separate async process, like a local serving server alongside your application, and to make HTTP requests to it for inference.
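The sidecar pattern above can be sketched as follows. This is an illustrative snippet, not part of the project: `server_command` is a hypothetical helper, and the exe path and port are placeholders for your own layout.

```python
import subprocess

def server_command(exe_path, host="127.0.0.1", port=8080):
    """Build the argument list for launching fluid-server.exe with --host/--port."""
    return [exe_path, "--host", host, "--port", str(port)]

# Launch the server as a background process next to your app, then poll
# http://127.0.0.1:8080/health before sending inference requests.
# proc = subprocess.Popen(server_command(r".\dist\fluid-server.exe"))
# ...
# proc.terminate()  # shut the sidecar down when your app exits
```

Keeping the launch commented out here is deliberate: the exe location depends on whether you downloaded a release or built from source.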
Features
Core Capabilities
- LLM Chat Completions - OpenAI-compatible API with streaming, backed by llama.cpp and OpenVINO
- Audio Transcription - Whisper models with NPU acceleration, backed by OpenVINO and Qualcomm QNN
- Text Embeddings - Vector embeddings for search and RAG
- Vector Database - LanceDB integration for multimodal storage
Hardware Acceleration
- Intel NPU via OpenVINO backend
- Qualcomm NPU via QNN (Snapdragon X Elite)
- Vulkan GPU via llama-cpp
Quick Start
1. Download or Build
Option A: Download Release
- Download fluid-server.exe from releases
Option B: Run from Source
# Install dependencies and run
uv sync
uv run
2. Run the Server
# Run with default settings
.\dist\fluid-server.exe
# Or with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080
3. Test the API
- Health Check: http://localhost:8080/health
- API Docs: http://localhost:8080/docs
- Models: http://localhost:8080/v1/models
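From Python, a stdlib-only check of these endpoints might look like this. The helper names are illustrative, not part of the project's API:

```python
import urllib.error
import urllib.request

def is_healthy(base_url="http://localhost:8080"):
    """Return True if the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def model_ids(models_response):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]
```

Gating requests on `is_healthy()` avoids races while the sidecar process is still loading models.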
Usage Examples
Basic Chat Completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-8b-int8-ov", "messages": [{"role": "user", "content": "Hello!"}]}'
Python Integration
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Chat with streaming
for chunk in client.chat.completions.create(
    model="qwen3-8b-int8-ov",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
Audio Transcription
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper-large-v3-turbo-qnn"
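If you would rather not shell out to curl, the same request can be built with only the standard library. This is a sketch assuming the endpoint accepts standard OpenAI-style multipart fields (`model`, `file`), as the curl call above suggests:

```python
import io
import uuid

def build_transcription_request(audio_bytes, model, filename="audio.wav"):
    """Build a multipart/form-data body for POST /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # "model" form field
    body.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode()
    )
    # "file" form field carrying the raw audio bytes
    body.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'.encode()
    )
    body.write(audio_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", body.getvalue()
```

POST the returned body to the endpoint with a `urllib.request.Request`, setting `Content-Type` to the returned value.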
Documentation
Comprehensive Guides
- NPU Support Guide - Intel & Qualcomm NPU configuration
- Integration Guide - Python, .NET, Node.js examples
- Development Guide - Setup, building, and contributing
- LanceDB Integration - Vector database and embeddings
- GGUF Model Support - Using any GGUF model
- Compilation Guide - Build system details
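The embeddings endpoint covered by the guides above follows the OpenAI request shape. A hedged stdlib sketch, where the model id is a placeholder (check /v1/models for real ids) and the helper names are illustrative:

```python
import json
import urllib.request

def embedding_request(texts, model="embedding-model"):
    """Build the JSON payload for POST /v1/embeddings (model id is a placeholder)."""
    return json.dumps({"model": model, "input": texts}).encode()

def extract_embeddings(response_body):
    """Pull the vectors out of an OpenAI-style embeddings response."""
    return [item["embedding"] for item in response_body["data"]]

# req = urllib.request.Request(
#     "http://localhost:8080/v1/embeddings",
#     data=embedding_request(["hello world"]),
#     headers={"Content-Type": "application/json"},
# )
# vectors = extract_embeddings(json.load(urllib.request.urlopen(req)))
```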
FAQ
Why Python? Best ML ecosystem support and PyInstaller packaging.
Why not llama.cpp alone? We support multiple runtimes and AI accelerators beyond GGML.
Acknowledgements
Built using ty, FastAPI, Pydantic, ONNX Runtime, OpenAI Whisper, and various other AI libraries.
Runtime Technologies:
- OpenVINO - Intel NPU and GPU acceleration
- Qualcomm QNN - Snapdragon NPU optimization with HTP backend
- ONNX Runtime - Cross-platform AI inference