Stormvino

May 26, 2026

OpenAI-compatible LLM server for Intel Arc GPUs. Runs local inference via OpenVINO. Speaks the OpenAI API — drop it behind any client that accepts a base_url. No NVIDIA required.

Hardware compatibility

GPU	VRAM	Status	Notes
Arc B60	24 GB	✅ Production	EnvyStorm reference machine
Arc B50	16 GB	🔜 Testing	TinyB — install in progress
Arc B65	TBD	🔜 Planned	Next after B50 confirmed
Arc B70	TBD	🔜 Planned
Other Arc	any	⚙️ Auto-tuned	VRAM detected at runtime

Detecting B-series cards: Battlemage GPUs often report as Intel(R) Graphics [0xExxx] (e.g. [0xe212]), with no "Arc" in the name; both lspci and the OpenVINO device name omit it. Identify the discrete GPU by its OpenVINO device type (DISCRETE vs INTEGRATED), not by matching "Arc". If a detection step reports "no Arc GPU found" on a B-series card, the card is still fine: confirm with clinfo or python -c "import openvino as ov; print(ov.Core().available_devices)" and continue.
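As a minimal sketch of the type-based detection described above: the helper names pick_discrete_gpu and list_gpu_devices are hypothetical and not part of Stormvino. It assumes OpenVINO's DEVICE_TYPE property renders as a string containing DISCRETE or INTEGRATED, and it imports openvino lazily so the selection logic can be tried without the library installed.

```python
from typing import Iterable, List, Optional, Tuple


def pick_discrete_gpu(devices: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the first device whose reported type is DISCRETE, else None.

    `devices` holds (device_id, device_type) pairs, where device_type is the
    string form of OpenVINO's DEVICE_TYPE property (assumed format).
    """
    for device_id, device_type in devices:
        if "DISCRETE" in device_type.upper():
            return device_id
    return None


def list_gpu_devices() -> List[Tuple[str, str]]:
    """Ask OpenVINO for (device_id, device_type) pairs of every GPU device."""
    import openvino as ov  # lazy import: only needed when querying real hardware

    core = ov.Core()
    return [
        (dev, str(core.get_property(dev, "DEVICE_TYPE")))
        for dev in core.available_devices
        if dev.startswith("GPU")
    ]
```

On a machine with OpenVINO installed, pick_discrete_gpu(list_gpu_devices()) would be expected to return an ID such as GPU.1 when an iGPU occupies GPU.0, without ever matching on the word "Arc".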

OS: Linux Mint 22.x / Ubuntu 24.04 (Noble).

Kernel: Battlemage (B-series) needs the xe driver. linux-oem-24.04 provides it, but a newer generic/mainline kernel (6.11+) that already loads xe and creates a /dev/dri/renderD* node for the card works too. The installer checks whether the GPU is already live and upgrades the kernel only if it isn't, so a working newer kernel won't be downgraded.

System RAM: 16 GB minimum (a 16 GB machine reports ~15 GiB usable).

Disk: 50 GB+ for a useful model set.
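The "is the GPU already live" idea can be approximated by hand. The sketch below is not the installer's actual check: it assumes the standard Linux sysfs layout, where /sys/class/drm/renderD*/device/driver is a symlink to the bound kernel driver, and the function names are hypothetical.

```python
import glob
import os
from typing import List, Tuple


def render_nodes_with_driver(dev_root: str = "/dev/dri",
                             sys_root: str = "/sys/class/drm") -> List[Tuple[str, str]]:
    """Return (render_node, kernel_driver) pairs for every DRM render node."""
    found = []
    for node in sorted(glob.glob(os.path.join(dev_root, "renderD*"))):
        name = os.path.basename(node)
        driver_link = os.path.join(sys_root, name, "device", "driver")
        # The sysfs 'driver' symlink points at the bound kernel driver (e.g. xe, i915).
        if os.path.exists(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
        else:
            driver = "unknown"
        found.append((node, driver))
    return found


def gpu_is_live(nodes: List[Tuple[str, str]]) -> bool:
    """True if at least one render node is bound to the xe driver."""
    return any(driver == "xe" for _, driver in nodes)
```

If gpu_is_live(render_nodes_with_driver()) returns True, the running kernel already drives the card with xe, which is the situation in which a kernel upgrade should be unnecessary.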

Install paths — pick one

🤖 Claude Code (recommended for single machine)

Fully automated. Claude Code (CC) asks 3 questions, then handles everything, including a kernel upgrade and reboot only if your GPU isn't already working. You watch.

Step 1 — Install Claude Code if you haven't:

npm install -g @anthropic-ai/claude-code

Prerequisite — passwordless sudo for the install. The automated path runs system commands via sudo, and Claude Code's non-interactive shell can't answer a password prompt. Grant a temporary drop-in and remove it when the install finishes:

echo "$USER ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/stormvino-install
sudo chmod 0440 /etc/sudoers.d/stormvino-install
# when the install is done:  sudo rm /etc/sudoers.d/stormvino-install

Step 2 — Clone the repo into your home dir and start CC there. Don't clone into /opt — it's root-owned, so the clone fails; the runbook creates and owns /opt/ov_server for you during install:

git clone https://github.com/Jermalk/stormvino.git ~/stormvino
cd ~/stormvino
claude

Step 3 — In the CC chat, type exactly:

Run the Stormvino installation runbook. @CC_INSTALL.md

The @CC_INSTALL.md mention loads the runbook directly — no file dragging needed. CC reads it and takes over. Answer the 3 questions it asks, then watch.

→ See CC_INSTALL.md for what CC does at each phase.

⚙️ Ansible (recommended for multiple machines / repeatable deploys)

One command installs on any number of Arc machines simultaneously. Detects GPU VRAM at runtime and tunes config automatically. Fully headless — handles reboots without human intervention.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
# edit vars/main.yml (3 lines) — then:
ansible-playbook -i hosts.yml stormvino.yml

→ See ANSIBLE.md for the full plan and current implementation status.

📖 Manual (full control, learn every step)

Step-by-step guide with a verification test between every phase. Covers kernel, drivers, Python env, PostgreSQL, models, and systemd services.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
./install.sh    # detects hardware, routes to the right path

→ See INSTALL.md.

What you get

Endpoint	Description
`POST /v1/chat/completions`	OpenAI-compatible chat, streaming supported
`POST /v1/embeddings`	Sentence embeddings (multilingual-e5-large)
`GET /v1/models`	List discovered models
`POST /v1/images/generations`	Image generation (SDXL, optional)
`POST /v1/audio/transcriptions`	Speech-to-text (Whisper, optional)
`POST /v1/audio/speech`	Text-to-speech (Kokoro / Piper, optional)
`GET /health`	Server health + loaded models + VRAM stats
`GET /monitor`	Web dashboard — live VRAM, throughput, request log

Default port: 11435. Accessible over LAN. Runs as an unprivileged stormvino systemd service (not root); the embedding model is offloaded to the iGPU when present, leaving the Arc's full VRAM for the LLM.
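For streaming chat, an OpenAI-compatible server normally sends server-sent events: lines of the form data: {json}, ending with data: [DONE]. The parser below is a sketch under that assumption; it has not been checked against Stormvino's actual stream, and iter_stream_text is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI streaming format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Feeding it the decoded lines of a response to a request with "stream": true and joining the results should reconstruct the full reply.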

Tested models (B60 / 24 GB VRAM)

Model	VRAM	Role
`qwen3-14b-int4-ov`	9.1 GB	Default — reasoning, coding, chat
`qwen3-8b-int4-ov`	4.6 GB	Agent turns, fast responses
`multilingual-e5-large-int8`	563 MB	Embeddings + task routing
`whisper-large-v3-int8-ov`	~2 GB	Speech-to-text
`qwen2.5-vl-7b-int4-ov`	~5 GB	Vision — image understanding

→ See MODELS.md for conversion instructions and VRAM budget tables.
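A rough way to reason about the table above is to add up the listed footprints and compare them with the card's VRAM. This sketch uses only the figures from the table (the "~" values are approximate). The reserve_gb parameter is a hypothetical allowance for KV cache and runtime overhead, not a number taken from Stormvino or MODELS.md.

```python
from typing import Iterable

# VRAM footprints in GB, copied from the tested-models table.
MODEL_VRAM_GB = {
    "qwen3-14b-int4-ov": 9.1,
    "qwen3-8b-int4-ov": 4.6,
    "multilingual-e5-large-int8": 0.563,
    "whisper-large-v3-int8-ov": 2.0,   # listed as ~2 GB
    "qwen2.5-vl-7b-int4-ov": 5.0,      # listed as ~5 GB
}


def total_vram_gb(models: Iterable[str]) -> float:
    """Sum the listed footprints of the given models."""
    return sum(MODEL_VRAM_GB[name] for name in models)


def fits(models: Iterable[str], vram_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the models plus a reserve (hypothetical overhead) fit in vram_gb."""
    return total_vram_gb(models) + reserve_gb <= vram_gb
```

For example, the two Qwen3 models together need about 13.7 GB, which fits a 24 GB B60 with room to spare but leaves little headroom on a 16 GB card once any cache reserve is added.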

Quick health check

curl -s http://localhost:11435/health | python3 -m json.tool

curl -s http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b-int4-ov","messages":[{"role":"user","content":"Hello"}]}'
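The same chat request can be made from Python with only the standard library. This is a sketch that mirrors the curl call above; build_chat_request and send are hypothetical helper names, and the response shape is assumed to follow the OpenAI chat completions format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435"  # default port stated in this README


def build_chat_request(prompt: str, model: str = "qwen3-8b-int4-ov",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, send(build_chat_request("Hello")) should return a dict; if the response follows the OpenAI format, the reply text is at ["choices"][0]["message"]["content"].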

Library stack

Inference (server runtime)

Library	Version
openvino	2026.1.0
openvino-genai	2026.1.0.0
openvino-tokenizers	2026.1.0.0
infergate	0.2.0
optimum-intel	1.27.0
optimum	2.1.0
transformers	4.57.6
tokenizers	0.22.2

Model conversion (offline, via optimum-cli)

Library	Version
nncf	3.1.0
onnx	1.21.0
onnxruntime	1.25.0
safetensors	0.7.0
huggingface_hub	0.36.2

Configuration

Runtime settings live in config.json. Key settings are auto-patched by the installers based on detected GPU VRAM:

Key	Description
`device`	OpenVINO device — auto-detected (e.g. `GPU.1`)
`kv_cache_size_gb`	KV cache per model — tuned to VRAM tier
`max_loaded_models`	Models held in VRAM simultaneously
`default_model`	Model used when the client doesn't specify one
`embedding_model`	Embedding model directory name
`postgres_dsn`	Observability database connection string

Full reference: INSTALL.md § Phase 7.
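To make the key list concrete, here is a sketch of what a config.json with those keys might look like, plus a small check that none are missing. The flat layout and every value are assumptions for illustration, especially the cache size, model count, and DSN; only the key names, GPU.1, and the model names come from this README. INSTALL.md § Phase 7 is the authoritative reference.

```python
import json

REQUIRED_KEYS = {
    "device", "kv_cache_size_gb", "max_loaded_models",
    "default_model", "embedding_model", "postgres_dsn",
}

# Hypothetical example values; the real file is patched by the installers.
EXAMPLE_CONFIG = """
{
  "device": "GPU.1",
  "kv_cache_size_gb": 4,
  "max_loaded_models": 2,
  "default_model": "qwen3-14b-int4-ov",
  "embedding_model": "multilingual-e5-large-int8",
  "postgres_dsn": "postgresql://user:password@localhost:5432/stormvino"
}
"""


def load_config(text: str) -> dict:
    """Parse config JSON and fail loudly if a documented key is absent."""
    cfg = json.loads(text)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"missing config keys: {sorted(missing)}")
    return cfg
```

A check like this is handy after hand-editing the file, since a missing key is reported before the server starts rather than at first request.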

Architecture

Layer	Component
HTTP	FastAPI + Uvicorn, single worker
LLM inference	`openvino_genai.LLMPipeline`, executor-offloaded
VLM inference	`openvino_genai.VLMPipeline`
Embeddings	`OVModelForFeatureExtraction` (optimum-intel)
Task routing	Embedding similarity + signal detection
STT	`openvino_genai.WhisperPipeline`
TTS	Kokoro-ONNX (EN) + Piper (PL)
Observability	PostgreSQL 16 + pgvector
Monitor UI	Svelte + uPlot

Hardware reports welcome

Tested Stormvino on a GPU not in the compatibility table? Open a hardware report issue with the GPU model, VRAM, kernel version, and tokens/sec. Every report builds the matrix for everyone.

Origin

Stormvino grew out of Shangri-Lab, a personal lab built by an IT architect from Silesia who had no Python background, a pair of Intel Arc GPUs, and a firm belief that local inference shouldn't require NVIDIA hardware or magic frameworks.

The philosophy is unchanged: build the simplest thing that gives full visibility first, then tune quality only after you can observe it.

Built with Claude Code.