Shangri-Lab

April 24, 2026 ยท View on GitHub

The pursuit of the perfect local LLM pipeline. The journey is the point.

A personal lab for experimenting with local LLM pipelines, agentic workflows, and AI-assisted tooling. Built by a .NET developer learning Python, running entirely on local hardware with Intel Arc B60 24GB and OpenVINO.

No cloud APIs. No GPU rental. No magic frameworks yet.


What this is

A collection of experiments, working prototypes, and honest field notes from someone who:

  • Comes from .NET, systems architecture, and automation backgrounds
  • Is not a Python specialist code here is explicit over clever
  • Runs everything locally on Intel Arc B60 24GB with OpenVINO
  • Values observability over abstraction if I can't see what's happening, it doesn't count as working

Hardware

MachineSpecsRole
HP EnvyStormi7-12700K, 64GB RAM, Arc B60 24GB VRAM, LinuxPrimary inference server
NASRyzen, 16GB RAM, LinuxDocker farm (planned)

Inference stack: Intel OpenVINO not Ollama, not CUDA. If you're running Arc and struggling to find practical examples which works, you're in the right place.


Models in use

ModelPurposeThroughput
qwen3-8b-int4-ovFast pre-selection, filtering~105 t/s
qwen3-14b-int4-ovMain reasoning, summarization~40t/s
qwen2.5-coder-14b-int4Code-related tasks~40t/s
qwen2.5-vl-7b-int4-ovMultimodal / image input - custom conversion to ov formatnot benchmarked yet
multilingual-e5-large-int8Embeddings, RAG, deduplicationnot benchmarked yet

Projects


ov-server

An OpenAI-compatible REST API server backed by openvino_genai. Exposes /v1/chat/completions, /v1/embeddings, and /v1/models on port 11435 drop-in replacement for OpenAI API in local tooling (AnythingLLM, LangChain, etc.).

What it does:

  • Serves multiple Qwen INT4 models from Intel Arc via OpenVINO
  • Full streaming support via AsyncTokenStreamer (real token-by-token, not buffered)
  • Vision/multimodal support routes image-containing requests to VLMPipeline automatically
  • Tool call support parses Qwen <tool_call> blocks, returns OpenAI-compatible tool_calls response
  • LRU model eviction with VRAM headroom checks keeps up to 2 models loaded simultaneously
  • Thinking block extraction separates <think>...</think> from answer, formats for display
  • /health endpoint with live stats: throughput, token counts, loaded models, RAM

Notable engineering decisions:

  • Direct sysfs/fdinfo VRAM queries no nvidia-smi, no intel_gpu_top
  • Event loop captured at streamer construction time fixes get_event_loop() deprecation in threaded streaming
  • Per-model asyncio.Lock concurrent requests on different models run in parallel; same model is serialised
  • Single-file server by design keeps deployment simple

Status: Working in production on EnvyStorm


ov-monitor

A rich terminal monitor for the OpenVINO server and Intel Arc GPU. Reads hardware metrics directly from kernel interfaces no root required, no intel_gpu_top dependency.

What it shows:

  • Server status: busy/online, loaded models, last throughput, total request/token counts
  • GPU engine utilisation by type: Render (rcs), Compute (ccs), Video (vcs), VideoEnh (vecs), Blitter (bcs) from fdinfo cycle counter deltas
  • VRAM usage: global from vram0_mm + per-process breakdown via /proc/<pid>/fdinfo
  • GPU temperature (GT + VRAM), fan RPM, instantaneous power from energy counter delta
  • CPU per-core utilisation, frequency, load averages, temperatures
  • System RAM and swap

What makes it interesting:

  • Engine utilisation computed from drm-cycles-* / drm-total-cycles-* counter deltas in /proc/<pid>/fdinfo \u2014 works on xe driver where intel_gpu_top -J is broken
  • VRAM per process without root reads drm-total-vram0 from fdinfo, no debugfs clients required
  • Instantaneous power derived from energy1_input hwmon delta between polls

Status: Working on xe driver (Arc B580)


scraper-pipeline (in progress)

A two-stage local LLM pipeline for scraping and analyzing offers from the web.

Architecture:

URL list Scraper - Qwen3-8b (pre-selector) \ Qwen3-14b (summarizer) \ structured output

Design goals:

  • Full observability - every stage logged, every model decision visible
  • No black boxes - raw model output always captured before parsing
  • Built step by step - baseline first, frameworks later if at all

Status: In progress


Philosophy

Build the simplest thing that gives full visibility first.
Tune quality only after you can observe it.
Shangri-Lab is not a framework. It is a small, well-understood pipeline where every step is visible and trustworthy.

This lab exists because most local LLM content assumes Nvidia hardware, Python expertise, and comfort with heavyweight frameworks. These experiments assume none of those things.

If something works here, it's because it actually works - not because a framework hid the failure.


What you might find useful here

  • Intel Arc + OpenVINO practical setup and model conversion notes
  • Qwen model recipes for INT4 quantized inference on OpenVINO including VLM and coder variants
  • OpenAI-compatible local server that actually works with AnythingLLM, LangChain, and similar tools
  • xe driver GPU monitoring without intel_gpu_top - sysfs/fdinfo approach for Arc cards
  • Two-stage pipeline patterns - fast filter model + heavy reasoning model
  • Honest failure notes - what didn't work and why

Status

This is a personal lab, not a polished product. Things break. Approaches get abandoned. Notes are sometimes incomplete. That's the point (at least for now).


Author

IT manager and software architect from Silesia, Poland.
Background in .NET, Kubernetes, Kafka, automation, electronics, and even geography.
Learning Python one pipeline at a time.


Shangri-Lab - because the perfect pipeline is always just over the next mountain.