TurboQuant + llama.cpp
March 26, 2026
Compressing the KV cache to about 3 bits per value with no reported accuracy loss.
Making large language models run on phones, not data centers.
What is this?
A practical integration roadmap for bringing Google's TurboQuant (PolarQuant + QJL) into the llama.cpp ecosystem — specifically targeting on-device mobile inference.
This isn't a toy. This is a plan to run 8B parameter models on 6GB phones with 16K+ context windows.
The problem
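Running LLMs on mobile devices hits a wall: the KV cache eats your RAM. For Llama-3.1-8B at 4K context, the KV cache alone takes about 0.5 GB at FP16 (about 1 GB at FP32), and it grows linearly with context length. That comes on top of the model weights, on a device with 6GB total.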
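KV cache size follows directly from the model shape. The sketch below is a back-of-the-envelope calculation, assuming the published Llama-3.1-8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128); the function name `kv_cache_bytes` is illustrative and not part of llama.cpp.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value=16):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V
    return values_per_token * n_ctx * bits_per_value // 8

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16_4k = kv_cache_bytes(**LLAMA_3_1_8B, n_ctx=4096)
fp32_4k = kv_cache_bytes(**LLAMA_3_1_8B, n_ctx=4096, bits_per_value=32)
print(f"FP16, 4K context: {fp16_4k / 2**20:.0f} MiB")
print(f"FP32, 4K context: {fp32_4k / 2**20:.0f} MiB")
```

Doubling either the precision or the context length doubles the cache, which is why long contexts are the first thing to go on a memory-constrained device.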
The solution
TurboQuant compresses the KV cache by roughly 6x, with no measurable accuracy loss reported by its authors:
| Metric | Without TurboQuant | With TurboQuant | Improvement |
|---|---|---|---|
| KV Cache (8B, 4K ctx) | ~0.5 GB (FP16) | ~90 MB | ~5-6x |
| Max context (6GB device) | ~4K tokens | ~16-24K tokens | 4-6x |
| Attention speed | Baseline | Up to 8x (GPU) | Up to 8x |
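To see how a fixed memory budget stretches as the cache gets narrower, here is a rough estimate. It assumes the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and a hypothetical 512 MiB KV budget, and it ignores any per-vector metadata a real format would store (for example, norms), so treat the numbers as upper bounds.

```python
def max_context_tokens(budget_bytes, bits_per_value,
                       n_layers=32, n_kv_heads=8, head_dim=128):
    """Largest token count whose K and V entries fit in budget_bytes."""
    bits_per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_value
    return budget_bytes * 8 // bits_per_token

budget = 512 * 2**20  # hypothetical KV budget: 512 MiB
for bits in (16, 8, 4, 3):
    tokens = max_context_tokens(budget, bits)
    print(f"{bits:>2} bits/value -> {tokens:,} tokens")
```

Under these assumptions, the budget that holds 4K tokens at 16 bits holds a little over 21K tokens at 3 bits, which sits inside the 16-24K range in the table.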
How TurboQuant works
    Input KV Vector
           │
           ▼
┌─────────────────────┐
│ 1. Random Rotation  │ ← Smooths data distribution
│    (Hadamard)       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 2. PolarQuant       │ ← Cartesian → Polar coordinates
│    (main bits)      │   No normalization needed
│                     │   Zero memory overhead
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 3. QJL Residual     │ ← 1-bit error correction
│    (1 sign bit)     │   Johnson-Lindenstrauss transform
│                     │   Eliminates bias
└──────────┬──────────┘
           │
           ▼
  Compressed KV Cache
  (~3 bits per value)
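Step 1 can be illustrated with a randomized Hadamard transform: flip each coordinate's sign with a fixed random pattern, then apply a fast Walsh-Hadamard transform. This is a minimal sketch of that general technique, not the TurboQuant reference code; the head dimension of 128 and the helper names are assumptions.

```python
import math
import random

def fwht(x):
    """Fast Walsh-Hadamard transform, scaled to be orthonormal.

    len(x) must be a power of two.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def random_rotate(x, signs):
    """Apply a fixed random +/-1 sign pattern, then the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, x)])

def norm(v):
    return math.sqrt(sum(t * t for t in v))

rng = random.Random(0)
d = 128  # assumed head dimension
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

# A vector with one large outlier coordinate.
x = [0.1] * d
x[5] = 10.0

y = random_rotate(x, signs)
print(f"norm before {norm(x):.4f}, after {norm(y):.4f}")
print(f"max |coord| before {max(map(abs, x)):.2f}, after {max(map(abs, y)):.2f}")
```

The transform is orthonormal, so the vector's length is unchanged, but the single outlier is spread across all coordinates. That flatter distribution is what makes a fixed quantization grid workable in the next step.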
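PolarQuant converts each vector from Cartesian coordinates to polar form (a radius plus angles). Because the angles follow a known, tightly concentrated distribution after the random rotation, they can be quantized against a fixed grid. No per-block normalization constants (the scales and zero-points that conventional quantizers store at higher precision) are needed, which removes the memory overhead that kills traditional methods.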
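The Cartesian-to-polar idea can be shown on coordinate pairs. This is a deliberately simplified sketch: it quantizes each angle on a uniform grid and keeps the radii exact, whereas the PolarQuant paper describes a recursive transform with codebooks matched to the angle distribution. All function names are illustrative.

```python
import math

def to_polar_pairs(x):
    """Convert consecutive coordinate pairs (a, b) to (radius, angle)."""
    return [(math.hypot(a, b), math.atan2(b, a)) for a, b in zip(x[0::2], x[1::2])]

def quantize_angle(theta, bits):
    """Map an angle in [-pi, pi] to one of 2**bits uniform bin indices."""
    levels = 1 << bits
    step = 2 * math.pi / levels
    return int((theta + math.pi) / step) % levels

def dequantize_angle(code, bits):
    """Return the centre of the bin that `code` refers to."""
    step = 2 * math.pi / (1 << bits)
    return -math.pi + (code + 0.5) * step

def roundtrip(x, bits):
    """Quantize only the angles (radii kept exact here) and reconstruct x."""
    out = []
    for r, theta in to_polar_pairs(x):
        t = dequantize_angle(quantize_angle(theta, bits), bits)
        out += [r * math.cos(t), r * math.sin(t)]
    return out

x = [0.3, -1.2, 0.8, 0.5]
x_hat = roundtrip(x, bits=4)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
print(f"reconstruction error at 4 angle bits: {err:.4f}")
```

Note that the angle grid is the same for every vector: nothing about it depends on the data, so there is no scale or offset to store alongside the codes.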
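QJL applies a Johnson-Lindenstrauss projection to the residual error left by PolarQuant and keeps only the sign of each projected coordinate, one bit each. The sign bits need no stored quantization constants, and the inner-product estimate built from them is unbiased, which is what cancels the bias of the first stage.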
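The sign-bit estimator can be sketched in a few lines. This is a toy version of the QJL idea under stated assumptions: a dense Gaussian projection matrix, an unquantized query, and a stored norm per key. In the pipeline above, `k` would be the residual left after the first stage rather than a raw key; the names are illustrative.

```python
import math
import random

def qjl_encode(k, S):
    """Keep only the sign of each random projection of k, plus the norm of k."""
    signs = [1 if sum(s * v for s, v in zip(row, k)) >= 0 else -1 for row in S]
    return signs, math.sqrt(sum(v * v for v in k))

def qjl_inner_product(q, signs, k_norm, S):
    """Unbiased estimate of <q, k> from the sign bits of k."""
    m = len(S)
    acc = sum(sign * sum(s * v for s, v in zip(row, q))
              for sign, row in zip(signs, S))
    return math.sqrt(math.pi / 2) * k_norm * acc / m

rng = random.Random(0)
d, m = 16, 4096  # toy sizes; m is the number of projections (sign bits)
S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]  # Gaussian JL matrix

k = [rng.uniform(-1, 1) for _ in range(d)]
q = [rng.uniform(-1, 1) for _ in range(d)]

signs, k_norm = qjl_encode(k, S)
exact = sum(a * b for a, b in zip(q, k))
estimate = qjl_inner_product(q, signs, k_norm, S)
print(f"exact {exact:.3f}, estimate {estimate:.3f}")
```

The number of projections is deliberately large here so the estimate lands visibly close to the exact value. A real cache would spend far fewer bits per vector and accept more variance; the point of the sketch is that the estimate is centred on the true inner product.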
Integration Roadmap
Phase 1 — Today (immediate, no TurboQuant dependency)
Optimize existing GGUF quantization for mobile:
# Generate an importance matrix for calibration.
# --chunks caps how many chunks are processed (--chunk would instead skip the first N chunks).
./llama-imatrix \
    -m model-base-F16.gguf \
    -f calibration-data.txt \
    --chunks 512 \
    -o model-imatrix.dat
# Quantize, keeping attention K/V and ffn_down weights at higher precision (critical for long context)
./llama-quantize \
    --imatrix model-imatrix.dat \
    --tensor-type "attn_v=q5_k" \
    --tensor-type "attn_k=q5_k" \
    --tensor-type "ffn_down=q5_k" \
    model-base-F16.gguf \
    model-mobile-optimized.gguf Q4_K_M
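Why protect attention tensors? The attn_k and attn_v weights produce the keys and values that fill the KV cache, so quantization error in them carries over to every cached token. Quantizing them less aggressively (Q5_K instead of Q4_K) tends to give better retention of long conversations.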
Phase 2 — Short term (1-3 months)
KV cache quantization landing in llama.cpp:
- Issue #20977 — TurboQuant feature request (opened March 25, 2026)
- Experimental fork — already builds and quantizes
- Discussion #5932 — 4-bit KV cache (long-running community request)
Phase 3 — Medium term (3-6 months)
Native PolarQuant + QJL in GGML kernels, validated on:
- ARM NEON (Android)
- Apple AMX/ANE (iOS)
- Metal compute shaders
See docs/integration-plan.md for the full technical breakdown.
Mobile Model Recommendations
| Device | RAM | Model | Quantization | Size |
|---|---|---|---|---|
| Android mid-range (6-8GB) | 6GB | Qwen3-1.7B / Phi-3-mini | Q4_K_M + imatrix | ~1.2GB / ~2.4GB |
| Android flagship (12GB+) | 12GB | Llama-3.1-8B / Gemma-2-9B | Q4_K_M + imatrix | ~4.9GB / ~5.8GB |
| iPhone 15/16 | 6-8GB | Qwen3-1.7B / Phi-3-mini | Q4_K_M + imatrix | ~1.2GB / ~2.4GB |
| iPad Pro | 16GB | Llama-3.1-8B | Q5_K_M + imatrix | ~5.7GB |
Why this matters
The AI industry spends billions on data centers. But the real frontier is your pocket.
TurboQuant makes it possible to run models with meaningful context windows on consumer devices — no cloud, no API keys, no surveillance.
This is what democratized AI looks like.
References
| Resource | Link |
|---|---|
| TurboQuant paper | arXiv:2504.19874 |
| PolarQuant paper | arXiv:2502.02617 |
| QJL paper | arXiv:2406.03482 |
| Google Research blog | Blog post |
| llama.cpp TurboQuant issue | #20977 |
| Experimental fork | mudler/llama.cpp |
About
Built by Daniel Gamo — independent AI researcher from Spain.
Part of the Orion project: persistent memory + grief tech + mesh networking + total privacy + on-device AI. All in one app, built by one person.
David didn't ask permission to throw the stone.
License
MIT — use it, fork it, build on it.