Stormvino

May 26, 2026

Mentioned in Awesome OpenVINO

OpenAI-compatible LLM server for Intel Arc GPUs. Runs local inference via OpenVINO. Speaks the OpenAI API — drop it behind any client that accepts a base_url. No NVIDIA required.


Hardware compatibility

| GPU | VRAM | Status | Notes |
|---|---|---|---|
| Arc B60 | 24 GB | ✅ Production | EnvyStorm reference machine |
| Arc B50 | 16 GB | 🔜 Testing | TinyB (install in progress) |
| Arc B65 | TBD | 🔜 Planned | Next after B50 confirmed |
| Arc B70 | TBD | 🔜 Planned | |
| Other Arc | any | ⚙️ Auto-tuned | VRAM detected at runtime |

Detecting B-series cards: Battlemage GPUs often report as Intel(R) Graphics [0xExxx] (e.g. [0xe212]) — not the word "Arc"; lspci and the OpenVINO device name both omit it. Identify the discrete GPU by its OpenVINO device type (DISCRETE vs INTEGRATED), not by matching "Arc". If a detection step reports "no Arc GPU found" on a B-series card, the card is still fine — confirm with clinfo or python -c "import openvino as ov; print(ov.Core().available_devices)" and continue.
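The detection rule above (select by device type, not by name) can be sketched in Python. This is a minimal illustration, not the installer's actual logic: `pick_discrete_gpu` and `query_device_types` are hypothetical helper names, and the sketch assumes that the OpenVINO `DEVICE_TYPE` property of a GPU device stringifies to something containing `DISCRETE` or `INTEGRATED`.

```python
from typing import Dict, Mapping, Optional


def pick_discrete_gpu(device_types: Mapping[str, str]) -> Optional[str]:
    """Return the first GPU device whose type string contains DISCRETE."""
    for name in sorted(device_types):
        if name.startswith("GPU") and "DISCRETE" in device_types[name].upper():
            return name
    return None


def query_device_types() -> Dict[str, str]:
    """Collect DEVICE_TYPE for every GPU device (requires openvino)."""
    import openvino as ov  # imported lazily so pick_discrete_gpu works without it

    core = ov.Core()
    types = {}
    for name in core.available_devices:
        if name.startswith("GPU"):
            # Assumption: the value stringifies to e.g. "Type.DISCRETE".
            types[name] = str(core.get_property(name, "DEVICE_TYPE"))
    return types
```

On a machine with OpenVINO installed, `pick_discrete_gpu(query_device_types())` would be expected to return a name such as `GPU.1` when an iGPU and a discrete card are both present, and `None` when only an integrated GPU is found.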

OS: Linux Mint 22.x / Ubuntu 24.04 (Noble).

Kernel: Battlemage (B-series) needs the xe driver. linux-oem-24.04 provides it, but a newer generic/mainline kernel (6.11+) that already loads xe and creates a /dev/dri/renderD* node for the card works too. The installer checks whether the GPU is already live and upgrades the kernel only if it isn't, so a working newer kernel won't be downgraded.

System RAM: 16 GB minimum (a 16 GB machine reports ~15 GiB usable).

Disk: 50 GB+ for a useful model set.
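To check by hand whether a kernel already drives the card with xe, one option is to read the driver symlink that Linux exposes for each render node under /sys/class/drm. The sketch below is an assumption about how such a check could look, not the installer's own code; `render_node_drivers` and `gpu_is_live` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Dict


def render_node_drivers(sys_drm: str = "/sys/class/drm") -> Dict[str, str]:
    """Map each renderD* node to the kernel driver bound to it."""
    drivers = {}
    for node in sorted(Path(sys_drm).glob("renderD*")):
        link = node / "device" / "driver"
        if link.exists():
            # The driver entry is a symlink whose target ends in the driver name.
            drivers[node.name] = os.path.basename(os.path.realpath(link))
    return drivers


def gpu_is_live(drivers: Dict[str, str], wanted: str = "xe") -> bool:
    """True when at least one render node is bound to the wanted driver."""
    return wanted in drivers.values()
```

If `gpu_is_live(render_node_drivers())` is true, a render node bound to xe exists, which matches the situation where the README says no kernel upgrade is needed.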


Install paths — pick one

Claude Code (fully automated). Claude Code (CC) asks 3 questions, then handles everything, including a kernel upgrade and reboot if your GPU isn't already working. You watch.

Step 1 — Install Claude Code if you haven't:

npm install -g @anthropic-ai/claude-code

Prerequisite — passwordless sudo for the install. The automated path runs system commands via sudo, and Claude Code's non-interactive shell can't answer a password prompt. Grant a temporary drop-in and remove it when the install finishes:

echo "$USER ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/stormvino-install
sudo chmod 0440 /etc/sudoers.d/stormvino-install
# when the install is done:  sudo rm /etc/sudoers.d/stormvino-install

Step 2 — Clone the repo into your home dir and start CC there. Don't clone into /opt — it's root-owned, so the clone fails; the runbook creates and owns /opt/ov_server for you during install:

git clone https://github.com/Jermalk/stormvino.git ~/stormvino
cd ~/stormvino
claude

Step 3 — In the CC chat, type exactly:

Run the Stormvino installation runbook. @CC_INSTALL.md

The @CC_INSTALL.md mention loads the runbook directly — no file dragging needed. CC reads it and takes over. Answer the 3 questions it asks, then watch.

→ See CC_INSTALL.md for what CC does at each phase.

Ansible (multi-machine). One command installs on any number of Arc machines simultaneously. It detects GPU VRAM at runtime and tunes the config automatically. Fully headless: it handles reboots without human intervention.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
# edit vars/main.yml (3 lines) — then:
ansible-playbook -i hosts.yml stormvino.yml

→ See ANSIBLE.md for the full plan and current implementation status.

📖 Manual (full control, learn every step)

Step-by-step guide with a verification test between every phase. Covers kernel, drivers, Python env, PostgreSQL, models, and systemd services.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
./install.sh    # detects hardware, routes to the right path

→ See INSTALL.md.


What you get

| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | OpenAI-compatible chat, streaming supported |
| POST /v1/embeddings | Sentence embeddings (multilingual-e5-large) |
| GET /v1/models | List discovered models |
| POST /v1/images/generations | Image generation (SDXL, optional) |
| POST /v1/audio/transcriptions | Speech-to-text (Whisper, optional) |
| POST /v1/audio/speech | Text-to-speech (Kokoro / Piper, optional) |
| GET /health | Server health + loaded models + VRAM stats |
| GET /monitor | Web dashboard: live VRAM, throughput, request log |

Default port: 11435. Accessible over LAN. Runs as an unprivileged stormvino systemd service (not root); the embedding model is offloaded to the iGPU when present, leaving the Arc's full VRAM for the LLM.
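Since the chat endpoint supports streaming, a client needs to decode the stream. The sketch below assumes Stormvino follows the OpenAI streaming convention (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`); that wire format comes from the OpenAI API, not from this README, and `iter_stream_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style 'data: {...}' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Feeding it the decoded lines of a streaming HTTP response and joining the fragments reconstructs the full reply; most OpenAI client libraries do the equivalent internally.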

Tested models (B60 / 24 GB VRAM)

| Model | VRAM | Role |
|---|---|---|
| qwen3-14b-int4-ov | 9.1 GB | Default: reasoning, coding, chat |
| qwen3-8b-int4-ov | 4.6 GB | Agent turns, fast responses |
| multilingual-e5-large-int8 | 563 MB | Embeddings + task routing |
| whisper-large-v3-int8-ov | ~2 GB | Speech-to-text |
| qwen2.5-vl-7b-int4-ov | ~5 GB | Vision: image understanding |

→ See MODELS.md for conversion instructions and VRAM budget tables.
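As a rough illustration of VRAM budgeting, the figures listed above can be summed against a card's capacity. This is a back-of-the-envelope sketch, not the tuning logic the installers use: the per-model KV cache and reserve values are hypothetical parameters, the "~" figures are rounded, and the embedding model is left out on the assumption that it is offloaded to the iGPU.

```python
from typing import Iterable, Tuple

# Approximate weights footprint in GB, copied from the tested-models list.
MODEL_VRAM_GB = {
    "qwen3-14b-int4-ov": 9.1,
    "qwen3-8b-int4-ov": 4.6,
    "whisper-large-v3-int8-ov": 2.0,   # listed as ~2 GB
    "qwen2.5-vl-7b-int4-ov": 5.0,      # listed as ~5 GB
}


def fits(models: Iterable[str], vram_gb: float,
         kv_cache_gb_per_model: float = 0.0,
         reserve_gb: float = 0.0) -> Tuple[bool, float]:
    """Return (fits?, GB needed) for loading the given models together."""
    needed = sum(MODEL_VRAM_GB[m] + kv_cache_gb_per_model for m in models)
    needed = round(needed + reserve_gb, 2)
    return needed <= vram_gb, needed
```

For example, `fits(["qwen3-14b-int4-ov", "qwen3-8b-int4-ov"], 24)` needs 13.7 GB of weights, which leaves room on a 24 GB card; adding a hypothetical 2 GB of KV cache per model brings the same pair to 17.7 GB, which no longer fits in 16 GB.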


Quick health check

curl -s http://localhost:11435/health | python3 -m json.tool
curl -s http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b-int4-ov","messages":[{"role":"user","content":"Hello"}]}'
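The same chat request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; `build_chat_request` and `send` are hypothetical helper names, and the shape of the reply is assumed to follow the OpenAI chat completions format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435"  # default port from this README


def build_chat_request(prompt: str, model: str = "qwen3-8b-int4-ov",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("Hello"))` should return a dictionary; in the OpenAI format the generated text sits under `["choices"][0]["message"]["content"]`.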

Libraries stack

Inference (server runtime)

| Library | Version |
|---|---|
| openvino | 2026.1.0 |
| openvino-genai | 2026.1.0.0 |
| openvino-tokenizers | 2026.1.0.0 |
| infergate | 0.2.0 |
| optimum-intel | 1.27.0 |
| optimum | 2.1.0 |
| transformers | 4.57.6 |
| tokenizers | 0.22.2 |

Model conversion (offline, via optimum-cli)

| Library | Version |
|---|---|
| nncf | 3.1.0 |
| onnx | 1.21.0 |
| onnxruntime | 1.25.0 |
| safetensors | 0.7.0 |
| huggingface_hub | 0.36.2 |

Configuration

Runtime settings live in config.json. The installers auto-patch the key settings below based on detected GPU VRAM:

| Key | Description |
|---|---|
| device | OpenVINO device, auto-detected (e.g. GPU.1) |
| kv_cache_size_gb | KV cache per model, tuned to VRAM tier |
| max_loaded_models | Models held in VRAM simultaneously |
| default_model | Model used when client doesn't specify |
| embedding_model | Embedding model directory name |
| postgres_dsn | Observability database connection string |

Full reference: INSTALL.md § Phase 7.
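After an install, a quick sanity check is to confirm that config.json contains the keys listed above. The sketch below is an assumption-laden helper, not part of Stormvino: it treats all six keys as expected, takes the file path as a parameter because the README does not state where config.json lives, and uses the hypothetical names `missing_keys` and `load_config`.

```python
import json
from pathlib import Path
from typing import List

# Key names taken from the configuration list in this README.
EXPECTED_KEYS = (
    "device",
    "kv_cache_size_gb",
    "max_loaded_models",
    "default_model",
    "embedding_model",
    "postgres_dsn",
)


def missing_keys(config: dict) -> List[str]:
    """Return the expected keys that are absent from the config mapping."""
    return [key for key in EXPECTED_KEYS if key not in config]


def load_config(path: str = "config.json") -> dict:
    """Read a JSON config file from the given path."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `missing_keys(load_config("/path/to/config.json"))` returns an empty list when every listed key is present.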


Architecture

| Layer | Component |
|---|---|
| HTTP | FastAPI + Uvicorn, single worker |
| LLM inference | openvino_genai.LLMPipeline, executor-offloaded |
| VLM inference | openvino_genai.VLMPipeline |
| Embeddings | OVModelForFeatureExtraction (optimum-intel) |
| Task routing | Embedding similarity + signal detection |
| STT | openvino_genai.WhisperPipeline |
| TTS | Kokoro-ONNX (EN) + Piper (PL) |
| Observability | PostgreSQL 16 + pgvector |
| Monitor UI | Svelte + uPlot |
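"Executor-offloaded" in the LLM inference row refers to a common asyncio pattern: a blocking call runs on a worker thread so the single event loop can keep serving other requests. The sketch below illustrates that pattern only. `blocking_generate` is a stand-in for a real pipeline call, and the one-thread pool is an assumption rather than Stormvino's confirmed setup.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One worker thread, so blocking generations run one at a time (an assumption).
_executor = ThreadPoolExecutor(max_workers=1)


def blocking_generate(prompt: str) -> str:
    """Stand-in for a blocking inference call."""
    time.sleep(0.05)  # simulate inference time
    return prompt.upper()


async def generate(prompt: str) -> str:
    """Run the blocking call on the executor so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_generate, prompt)


async def demo() -> list:
    ticks = []

    async def heartbeat() -> None:
        # Keeps ticking while the worker thread is busy generating.
        for _ in range(3):
            ticks.append(time.monotonic())
            await asyncio.sleep(0.01)

    reply, _ = await asyncio.gather(generate("hello"), heartbeat())
    return [reply, len(ticks)]
```

Running `asyncio.run(demo())` shows the heartbeat coroutine completing its ticks while the simulated generation is in flight, which is the behaviour a single-worker server needs to keep endpoints such as /health responsive during inference.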

Hardware reports welcome

Tested Stormvino on a GPU not in the compatibility table? Open a hardware report issue with the GPU model, VRAM, kernel version, and tokens/sec. Each report builds the matrix for everyone.


Origin

Stormvino grew out of Shangri-Lab, a personal lab built by an IT architect from Silesia who had no Python background, a pair of Intel Arc GPUs, and a firm belief that local inference shouldn't require NVIDIA hardware or magic frameworks.

The philosophy is unchanged: build the simplest thing that gives full visibility first, tune quality only after you can observe it.

Built with Claude Code.