Stormvino
May 26, 2026
OpenAI-compatible LLM server for Intel Arc GPUs.
Runs local inference via OpenVINO. Speaks the OpenAI API — drop it behind any
client that accepts a base_url. No NVIDIA required.
Hardware compatibility
| GPU | VRAM | Status | Notes |
|---|---|---|---|
| Arc B60 | 24 GB | ✅ Production | EnvyStorm reference machine |
| Arc B50 | 16 GB | 🔜 Testing | TinyB — install in progress |
| Arc B65 | TBD | 🔜 Planned | Next after B50 confirmed |
| Arc B70 | TBD | 🔜 Planned | |
| Other Arc | any | ⚙️ Auto-tuned | VRAM detected at runtime |
Detecting B-series cards: Battlemage GPUs often report as `Intel(R) Graphics [0xExxx]` (e.g. `[0xe212]`) without the word "Arc"; `lspci` and the OpenVINO device name both omit it. Identify the discrete GPU by its OpenVINO device type (`DISCRETE` vs `INTEGRATED`), not by matching "Arc". If a detection step reports "no Arc GPU found" on a B-series card, the card is still fine: confirm with `clinfo` or `python -c "import openvino as ov; print(ov.Core().available_devices)"` and continue.
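The device-type check described above can be sketched in Python. This is a minimal sketch, not the installer's actual detection code: the helper names `pick_discrete_gpu` and `query_device_types` are hypothetical, and it assumes the string form of OpenVINO's `DEVICE_TYPE` property contains `DISCRETE` or `INTEGRATED`.

```python
def pick_discrete_gpu(device_types):
    """Return the first GPU device whose reported type is discrete, else None.

    device_types maps an OpenVINO device name (e.g. "GPU.1") to the string
    form of its DEVICE_TYPE property.
    """
    for name in sorted(device_types):
        if name.startswith("GPU") and "DISCRETE" in device_types[name].upper():
            return name
    return None


def query_device_types():
    """Ask OpenVINO for each GPU's device type (requires the openvino package)."""
    import openvino as ov  # imported lazily so pick_discrete_gpu works without it

    core = ov.Core()
    types = {}
    for name in core.available_devices:
        if name.startswith("GPU"):
            types[name] = str(core.get_property(name, "DEVICE_TYPE"))
    return types


# Hand-written sample: an iGPU plus a B-series card whose name never says "Arc".
sample = {"GPU.0": "Type.INTEGRATED", "GPU.1": "Type.DISCRETE"}
print(pick_discrete_gpu(sample))  # GPU.1
```

On a machine with OpenVINO installed, `pick_discrete_gpu(query_device_types())` would give the device name to compare against the `device` value in your configuration.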
OS: Linux Mint 22.x / Ubuntu 24.04 (Noble). Kernel: Battlemage (B-series) needs the `xe` driver. `linux-oem-24.04` provides it, but a newer generic/mainline kernel (6.11+) that already loads `xe` and creates a `/dev/dri/renderD*` node for the card works too. The installer checks whether the GPU is already live and upgrades the kernel only if it isn't, so a working newer kernel won't be downgraded. System RAM: 16 GB minimum (a 16 GB machine reports ~15 GiB usable). Disk: 50 GB+ for a useful model set.
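To see by hand whether a machine already meets the kernel requirement, something like the following may help. It is a rough sketch under stated assumptions, not the installer's real check: the function names are hypothetical, the "live" rule (the `xe` module is loaded and at least one render node exists) is a simplification, and it does not verify that the render node belongs to the Arc card rather than an iGPU. A kernel with `xe` built in rather than loaded as a module would also be missed.

```python
import glob
import re


def kernel_at_least(release, major, minor):
    """True if a release string such as '6.11.0-26-generic' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def xe_loaded(modules_text):
    """True if a module named exactly 'xe' is listed in /proc/modules-style text."""
    names = [line.split()[0] for line in modules_text.splitlines() if line.strip()]
    return "xe" in names


def render_nodes(pattern="/dev/dri/renderD*"):
    """List the DRM render nodes currently present."""
    return sorted(glob.glob(pattern))


def gpu_is_live(modules_text, nodes):
    """Simplified rule: xe is loaded and at least one render node exists."""
    return xe_loaded(modules_text) and bool(nodes)


# Hand-written sample input; on a real machine pass open("/proc/modules").read(),
# render_nodes(), and platform.release() instead.
sample_modules = "xe 3579904 0 - Live 0x0000000000000000\ndrm_buddy 20480 1 xe, Live 0x0\n"
print(kernel_at_least("6.11.0-26-generic", 6, 11))           # True
print(gpu_is_live(sample_modules, ["/dev/dri/renderD128"]))  # True
print(gpu_is_live(sample_modules, []))                       # False
```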
Install paths — pick one
🤖 Claude Code (recommended for single machine)
Fully automated. CC asks 3 questions, then handles everything — including a kernel upgrade + reboot only if your GPU isn't already working. You watch.
Step 1 — Install Claude Code if you haven't:
npm install -g @anthropic-ai/claude-code
Prerequisite — passwordless sudo for the install. The automated path runs system commands via
sudo, and Claude Code's non-interactive shell can't answer a password prompt. Grant a temporary
drop-in and remove it when the install finishes:
echo "$USER ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/stormvino-install
sudo chmod 0440 /etc/sudoers.d/stormvino-install
# when the install is done: sudo rm /etc/sudoers.d/stormvino-install
Step 2 — Clone the repo into your home dir and start CC there. Don't clone into /opt — it's
root-owned, so the clone fails; the runbook creates and owns /opt/ov_server for you during install:
git clone https://github.com/Jermalk/stormvino.git ~/stormvino
cd ~/stormvino
claude
Step 3 — In the CC chat, type exactly:
Run the Stormvino installation runbook. @CC_INSTALL.md
The @CC_INSTALL.md mention loads the runbook directly — no file dragging needed.
CC reads it and takes over. Answer the 3 questions it asks, then watch.
→ See CC_INSTALL.md for what CC does at each phase.
⚙️ Ansible (recommended for multiple machines / repeatable deploys)
One command installs on any number of Arc machines simultaneously. Detects GPU VRAM at runtime and tunes config automatically. Fully headless — handles reboots without human intervention.
git clone https://github.com/Jermalk/stormvino.git
cd stormvino
# edit vars/main.yml (3 lines) — then:
ansible-playbook -i hosts.yml stormvino.yml
→ See ANSIBLE.md for the full plan and current implementation status.
📖 Manual (full control, learn every step)
Step-by-step guide with a verification test between every phase. Covers kernel, drivers, Python env, PostgreSQL, models, and systemd services.
git clone https://github.com/Jermalk/stormvino.git
cd stormvino
./install.sh # detects hardware, routes to the right path
→ See INSTALL.md.
What you get
| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | OpenAI-compatible chat, streaming supported |
| POST /v1/embeddings | Sentence embeddings (multilingual-e5-large) |
| GET /v1/models | List discovered models |
| POST /v1/images/generations | Image generation (SDXL, optional) |
| POST /v1/audio/transcriptions | Speech-to-text (Whisper, optional) |
| POST /v1/audio/speech | Text-to-speech (Kokoro / Piper, optional) |
| GET /health | Server health + loaded models + VRAM stats |
| GET /monitor | Web dashboard: live VRAM, throughput, request log |
Default port: 11435. Accessible over LAN. Runs as an unprivileged stormvino systemd
service (not root); the embedding model is offloaded to the iGPU when present, leaving the Arc's
full VRAM for the LLM.
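Since the chat endpoint supports streaming, here is a minimal sketch of reading a streamed reply without any third-party client. It assumes the server emits the usual OpenAI server-sent-event framing (`data: {json}` lines, ending with `data: [DONE]`), which the compatibility claim implies but this README does not spell out. The helper name `iter_stream_text` is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a streamed response is assumed to look like.
sample = [
    b'data: {"choices":[{"delta":{"role":"assistant"}}]}\n',
    b"\n",
    b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n',
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample)))  # Hello
```

Against a live server, the response object returned by `urllib.request.urlopen` for a request with `"stream": true` in the body can be passed in directly, because iterating over it yields byte lines.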
Tested models (B60 / 24 GB VRAM)
| Model | VRAM | Role |
|---|---|---|
| qwen3-14b-int4-ov | 9.1 GB | Default: reasoning, coding, chat |
| qwen3-8b-int4-ov | 4.6 GB | Agent turns, fast responses |
| multilingual-e5-large-int8 | 563 MB | Embeddings + task routing |
| whisper-large-v3-int8-ov | ~2 GB | Speech-to-text |
| qwen2.5-vl-7b-int4-ov | ~5 GB | Vision: image understanding |
→ See MODELS.md for conversion instructions and VRAM budget tables.
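A quick way to reason about which models fit together is to add up the figures from the table. The sketch below does only that arithmetic. The per-model KV cache allowance is a parameter you supply, the 2 GB used in the example is illustrative rather than a Stormvino default, and the "~" values are treated as exact.

```python
# VRAM figures in GB, copied from the table above.
MODEL_VRAM_GB = {
    "qwen3-14b-int4-ov": 9.1,
    "qwen3-8b-int4-ov": 4.6,
    "multilingual-e5-large-int8": 0.563,
    "whisper-large-v3-int8-ov": 2.0,
    "qwen2.5-vl-7b-int4-ov": 5.0,
}


def vram_needed_gb(models, kv_cache_gb=0.0):
    """Sum model weights plus an assumed KV cache allowance per model."""
    return sum(MODEL_VRAM_GB[name] + kv_cache_gb for name in models)


def fits(models, vram_gb, kv_cache_gb=0.0):
    """True if the selected models plus their KV caches fit in vram_gb."""
    return vram_needed_gb(models, kv_cache_gb) <= vram_gb


pair = ["qwen3-14b-int4-ov", "qwen3-8b-int4-ov"]
print(round(vram_needed_gb(pair), 1))         # 13.7
print(fits(pair, vram_gb=24, kv_cache_gb=2))  # True  (13.7 + 4 = 17.7)
print(fits(pair, vram_gb=16, kv_cache_gb=2))  # False
```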
Quick health check
curl -s http://localhost:11435/health | python3 -m json.tool
curl -s http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b-int4-ov","messages":[{"role":"user","content":"Hello"}]}'
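The same two checks can be run from Python using only the standard library. This is a sketch mirroring the curl commands above; the function names are hypothetical, and the response shape (`choices[0].message.content`) is assumed from the OpenAI chat format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435"  # default port from this README


def build_chat_request(prompt, model="qwen3-8b-int4-ov", base_url=BASE_URL):
    """Build (but do not send) a chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request_or_url, timeout=60):
    """Send a request (or GET a URL) and decode the JSON response."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.load(response)


def main():
    """Print the health report, then one chat reply. Needs a running server."""
    print(json.dumps(fetch_json(f"{BASE_URL}/health"), indent=2))
    reply = fetch_json(build_chat_request("Hello"))
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` once the service is up. Nothing is sent at import time, so the helpers are safe to reuse in other scripts.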
Libraries stack
Inference (server runtime)
| Library | Version |
|---|---|
| openvino | 2026.1.0 |
| openvino-genai | 2026.1.0.0 |
| openvino-tokenizers | 2026.1.0.0 |
| infergate | 0.2.0 |
| optimum-intel | 1.27.0 |
| optimum | 2.1.0 |
| transformers | 4.57.6 |
| tokenizers | 0.22.2 |
Model conversion (offline, via optimum-cli)
| Library | Version |
|---|---|
| nncf | 3.1.0 |
| onnx | 1.21.0 |
| onnxruntime | 1.25.0 |
| safetensors | 0.7.0 |
| huggingface_hub | 0.36.2 |
Configuration
Runtime settings live in `config.json`. The installers auto-patch the key settings below
based on detected GPU VRAM:
| Key | Description |
|---|---|
| device | OpenVINO device, auto-detected (e.g. GPU.1) |
| kv_cache_size_gb | KV cache per model, tuned to VRAM tier |
| max_loaded_models | Models held in VRAM simultaneously |
| default_model | Model used when client doesn't specify |
| embedding_model | Embedding model directory name |
| postgres_dsn | Observability database connection string |
Full reference: INSTALL.md § Phase 7.
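After an installer has patched the file, a small check like the one below can confirm that the documented keys are present. It is a sketch only: it assumes `config.json` is a flat JSON object keyed exactly as in the table, the helper names are hypothetical, and the sample values are placeholders rather than recommended settings.

```python
import json

# Keys taken from the table above.
EXPECTED_KEYS = {
    "device",
    "kv_cache_size_gb",
    "max_loaded_models",
    "default_model",
    "embedding_model",
    "postgres_dsn",
}


def missing_keys(config):
    """Return the documented keys that are absent from a parsed config."""
    return sorted(EXPECTED_KEYS - set(config))


def load_config(path="config.json"):
    """Read and parse a config file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Placeholder values for illustration; postgres_dsn is left out on purpose.
sample = json.loads("""
{
  "device": "GPU.1",
  "kv_cache_size_gb": 4,
  "max_loaded_models": 2,
  "default_model": "qwen3-14b-int4-ov",
  "embedding_model": "multilingual-e5-large-int8"
}
""")
print(missing_keys(sample))  # ['postgres_dsn']
```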
Architecture
| Layer | Component |
|---|---|
| HTTP | FastAPI + Uvicorn, single worker |
| LLM inference | openvino_genai.LLMPipeline, executor-offloaded |
| VLM inference | openvino_genai.VLMPipeline |
| Embeddings | OVModelForFeatureExtraction (optimum-intel) |
| Task routing | Embedding similarity + signal detection |
| STT | openvino_genai.WhisperPipeline |
| TTS | Kokoro-ONNX (EN) + Piper (PL) |
| Observability | PostgreSQL 16 + pgvector |
| Monitor UI | Svelte + uPlot |
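"Executor-offloaded" in the table refers to running the blocking inference call outside the event loop, so a single Uvicorn worker can keep answering other requests while a generation is in progress. The sketch below shows that general pattern with plain `asyncio`. It is not Stormvino's code: `blocking_generate` is a stand-in for a real pipeline call, and the single-thread executor is an assumption made for the example.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One worker thread so only one generation runs at a time (an assumption for
# this sketch; the README does not state the executor's size).
executor = ThreadPoolExecutor(max_workers=1)


def blocking_generate(prompt):
    """Stand-in for a blocking inference call such as a pipeline's generate()."""
    time.sleep(0.2)
    return f"echo: {prompt}"


async def generate(prompt):
    """Run the blocking call off the event loop so other work stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_generate, prompt)


async def demo():
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run during generation.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    result = await generate("Hello")
    beat.cancel()
    return result, ticks


result, ticks = asyncio.run(demo())
print(result)      # echo: Hello
print(ticks >= 2)  # True: the loop kept running while the call blocked a thread
```

Calling `blocking_generate` directly inside the coroutine would freeze the heartbeat for the whole generation, which is the behaviour the offloading avoids.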
Hardware reports welcome
Tested Stormvino on a GPU not in the compatibility table? Open a hardware report issue — GPU model, VRAM, kernel version, tokens/sec. Builds the matrix for everyone.
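For the tokens/sec figure in a report, one simple approach is to divide the completion token count by wall-clock time for a single request. The sketch below assumes the response carries an OpenAI-style `usage` object, which this README does not confirm, and the helper name is hypothetical. Wall-clock time includes prompt processing, so the result understates pure decode speed.

```python
def tokens_per_second(usage, elapsed_seconds):
    """Completion tokens per second from an OpenAI-style usage object and wall time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return usage["completion_tokens"] / elapsed_seconds


# Hand-written numbers for illustration only, not a measured result.
sample_usage = {"prompt_tokens": 12, "completion_tokens": 256, "total_tokens": 268}
print(round(tokens_per_second(sample_usage, 8.0), 1))  # 32.0
```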
Origin
Stormvino grew out of Shangri-Lab, a personal lab built by an IT architect from Silesia who had no Python background, a pair of Intel Arc GPUs, and a firm belief that local inference shouldn't require NVIDIA hardware or magic frameworks.
The philosophy is unchanged: build the simplest thing that gives full visibility first, tune quality only after you can observe it.
Built with Claude Code.