TurboQuant + llama.cpp

March 26, 2026

Compressing the KV cache to 3 bits with zero accuracy loss.
Making large language models run on phones, not data centers.

What is this?

A practical integration roadmap for bringing Google's TurboQuant (PolarQuant + QJL) into the llama.cpp ecosystem — specifically targeting on-device mobile inference.

This isn't a toy. This is a plan to run 8B parameter models on 6GB phones with 16K+ context windows.

The problem

Running LLMs on mobile devices hits a wall: the KV cache eats your RAM. For Llama-3.1-8B, a 4K context needs about 0.5 GB of KV cache at FP16 and about 1 GB at FP32. That's on a device with 6GB total, most of which already goes to the model weights and the OS.
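The KV cache figure depends on the storage precision, so it helps to see the arithmetic. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128); the function name is illustrative and not part of llama.cpp.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes needed to store keys and values for n_tokens of context."""
    # Factor 2: each token stores one key vector and one value vector per layer.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * n_tokens * bits_per_value // 8


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head dimension 128.
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(**LLAMA_31_8B, n_tokens=4096, bits_per_value=16)
fp32 = kv_cache_bytes(**LLAMA_31_8B, n_tokens=4096, bits_per_value=32)
print(f"FP16: {fp16 / 2**20:.0f} MiB")  # FP16: 512 MiB
print(f"FP32: {fp32 / 2**20:.0f} MiB")  # FP32: 1024 MiB
```

The cache grows linearly with context length, so quadrupling the context at the same precision quadruples these numbers.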

The solution

TurboQuant compresses the KV cache by about 6x, with no measurable accuracy loss in the benchmarks its authors report:

Metric	Without TurboQuant	With TurboQuant	Improvement
KV Cache (8B, 4K ctx)	~1 GB	~170 MB	~6x
Max context (6GB device)	~4K tokens	~16-24K tokens	4-6x
Attention speed	Baseline	Up to 8x (on GPU, as reported by the authors)	Up to 8x
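The context-length row follows from simple arithmetic: for a fixed memory budget, fewer bits per cached value means more tokens fit. The sketch below is a rough estimate, not a measurement. It assumes the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), a hypothetical 512 MiB budget for the KV cache, and it ignores any per-vector metadata a real format would store.

```python
def max_context_tokens(budget_bytes, bits_per_value,
                       n_layers=32, n_kv_heads=8, head_dim=128):
    """Largest context whose key and value vectors fit in budget_bytes."""
    bits_per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_value
    return budget_bytes * 8 // bits_per_token


budget = 512 * 2**20  # hypothetical: 512 MiB set aside for the KV cache
for bits in (16, 4, 3):
    print(bits, max_context_tokens(budget, bits))
# 16 4096
# 4 16384
# 3 21845
```

Under these assumptions, going from 16 bits to 3 bits per value moves the same budget from about 4K tokens to about 21K tokens, which sits inside the 16-24K range in the table.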

How TurboQuant works

Input KV Vector
      │
      ▼
┌─────────────────────┐
│  1. Random Rotation  │  ← Smooths data distribution
│     (Hadamard)       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  2. PolarQuant       │  ← Cartesian → Polar coordinates
│     (main bits)      │     No normalization needed
│                      │     Zero memory overhead
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  3. QJL Residual     │  ← 1-bit error correction
│     (1 sign bit)     │     Johnson-Lindenstrauss transform
│                      │     Eliminates bias
└──────────┬──────────┘
           │
           ▼
    Compressed KV Cache
     (~3 bits per value)
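Step 1 can be illustrated with a randomized Hadamard transform: flip the sign of each coordinate with a fixed random pattern, then apply an orthonormal Walsh-Hadamard transform. This is one common way to build a fast random rotation; whether a given TurboQuant implementation uses exactly this construction is an assumption, and the function names are illustrative.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def random_rotation(vec, signs):
    """Apply a fixed random +/-1 sign pattern, then the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, vec)])


rng = random.Random(0)
dim = 128
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

spiky = [0.0] * dim
spiky[5] = 10.0  # a single large outlier channel
rotated = random_rotation(spiky, signs)

print(max(abs(x) for x in spiky))    # 10.0
print(max(abs(x) for x in rotated))  # 10 / sqrt(128): the outlier is spread over all coordinates
print(math.sqrt(sum(x * x for x in rotated)))  # the vector's length is unchanged (up to rounding)
```

The rotation preserves the vector's length but spreads the outlier's energy evenly, which is what makes a fixed quantization grid workable in the next step.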

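PolarQuant converts each vector from Cartesian coordinates (x, y, z, ...) to polar form (a radius plus angles). After the random rotation, the angles follow a known distribution, so they can be quantized on a fixed grid with no per-block normalization. That removes the scale and zero-point constants that traditional quantizers store alongside every block, which is where their memory overhead comes from.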
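To make the polar idea concrete, here is a deliberately simplified sketch for a single pair of coordinates: compute the radius and angle, then snap the angle to a fixed grid. It is not the PolarQuant algorithm itself. The uniform grid, the full-precision radius, and the function names are all assumptions for illustration; the paper's method handles full vectors and chooses its grid from the angle distribution.

```python
import math


def quantize_angle(x, y, bits):
    """Return (radius, code): the pair's radius and its angle snapped to 2**bits levels."""
    levels = 1 << bits
    radius = math.hypot(x, y)
    angle = math.atan2(y, x)  # in [-pi, pi]
    # Map the angle to [0, levels) and wrap so that +pi and -pi share a code.
    code = round((angle + math.pi) / (2 * math.pi) * levels) % levels
    return radius, code


def dequantize_angle(radius, code, bits):
    """Rebuild an approximate (x, y) pair from the radius and the angle code."""
    levels = 1 << bits
    angle = code * 2 * math.pi / levels - math.pi
    return radius * math.cos(angle), radius * math.sin(angle)


x, y = 0.6, -0.8
radius, code = quantize_angle(x, y, bits=4)
x_hat, y_hat = dequantize_angle(radius, code, bits=4)
print(code)          # 6 (one of 16 angle levels)
print(x_hat, y_hat)  # the grid point at -pi/4, roughly (0.707, -0.707)
```

Because the grid is fixed in advance, nothing besides the code (and, in this sketch, the radius) needs to be stored per pair. The angular error is at most half a grid step, here pi/16.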

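QJL applies a Johnson-Lindenstrauss projection to the residual error left by PolarQuant and keeps only the sign of each projected coordinate, one bit per value. Those sign bits are enough to correct the bias that the main quantizer would otherwise introduce into attention scores, and the authors describe the step as adding zero memory overhead.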
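The sign-bit trick can be sketched in a few lines. For a Gaussian projection row s, the expectation of sign(s·k)·(s·q) equals sqrt(2/pi)·⟨q, k⟩/‖k‖, so rescaling the average by sqrt(pi/2)·‖k‖ gives an unbiased estimate of the inner product. The sketch applies this to a plain vector; in TurboQuant the input would be the PolarQuant residual. The function names and the very large number of projection rows are assumptions chosen to make the averaging visible, whereas a 1-bit-per-value budget implies roughly as many rows as dimensions.

```python
import math
import random


def qjl_encode(key, projection):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if sum(s * k for s, k in zip(row, key)) >= 0 else -1
             for row in projection]
    norm = math.sqrt(sum(k * k for k in key))
    return signs, norm


def qjl_inner_product(query, signs, norm, projection):
    """Unbiased estimate of <query, key> from the stored sign bits."""
    m = len(projection)
    total = sum(
        sign * sum(s * q for s, q in zip(row, query))
        for sign, row in zip(signs, projection)
    )
    return math.sqrt(math.pi / 2) * norm * total / m


rng = random.Random(0)
key = [1.0, -2.0, 0.5, 3.0, -1.0, 0.0, 2.0, 1.0]
query = [0.5, 1.0, -1.0, 2.0, 0.0, 1.0, -0.5, 1.0]
# Illustrative only: 10000 rows so the estimate visibly concentrates.
projection = [[rng.gauss(0.0, 1.0) for _ in key] for _ in range(10000)]

signs, norm = qjl_encode(key, projection)
exact = sum(q * k for q, k in zip(query, key))
estimate = qjl_inner_product(query, signs, norm, projection)
print(exact)     # 4.0
print(estimate)  # close to 4.0; the error shrinks as the number of rows grows
```

Note that the query is projected at full precision and only the key side is reduced to signs, which is what keeps the estimate unbiased.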

Integration Roadmap

Phase 1 — Today (immediate, no TurboQuant dependency)

Optimize existing GGUF quantization for mobile:

# Generate importance matrix for calibration
# (-c sets the chunk size in tokens; --chunk would instead skip that many input chunks)
./llama-imatrix \
  -m model-base-F16.gguf \
  -f calibration-data.txt \
  -c 512 \
  -o model-imatrix.dat

# Quantize protecting attention tensors (critical for long context)
./llama-quantize \
  --imatrix model-imatrix.dat \
  --tensor-type "attn_v=q5_k" \
  --tensor-type "attn_k=q5_k" \
  --tensor-type "ffn_down=q5_k" \
  model-base-F16.gguf \
  model-mobile-optimized.gguf Q4_K_M

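Why protect attention tensors? attn_k and attn_v are the projection weights that produce the keys and values written to the KV cache, so any error in them is baked into every cached token. Keeping them at higher precision (Q5_K instead of Q4_K) means better retention over long conversations.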

Phase 2 — Short term (1-3 months)

KV cache quantization landing in llama.cpp:

Issue #20977 — TurboQuant feature request (opened March 25, 2026)
Experimental fork — already builds and quantizes
Discussion #5932 — 4-bit KV cache (long-running community request)

Phase 3 — Medium term (3-6 months)

Native PolarQuant + QJL in GGML kernels, validated on:

ARM NEON (Android)
Apple AMX/ANE (iOS)
Metal compute shaders

See docs/integration-plan.md for the full technical breakdown.

Mobile Model Recommendations

Device	RAM	Model	Quantization	Size (approx.)
Android mid-range (6-8GB)	6GB	Qwen3-1.7B / Phi-3-mini	Q4_K_M + imatrix	~1.2GB / ~2.4GB
Android flagship (12GB+)	12GB	Llama-3.1-8B / Gemma-2-9B	Q4_K_M + imatrix	~4.9GB / ~5.8GB
iPhone 15/16	6-8GB	Qwen3-1.7B / Phi-3-mini	Q4_K_M + imatrix	~1.2GB / ~2.4GB
iPad Pro	16GB	Llama-3.1-8B	Q5_K_M + imatrix	~5.7GB

Why this matters

The AI industry spends billions on data centers. But the real frontier is your pocket.

TurboQuant makes it possible to run models with meaningful context windows on consumer devices — no cloud, no API keys, no surveillance.

This is what democratized AI looks like.

References

Resource	Link
TurboQuant paper	arXiv:2504.19874
PolarQuant paper	arXiv:2502.02617
QJL paper	arXiv:2406.03482
Google Research blog	Blog post
llama.cpp TurboQuant issue	#20977
Experimental fork	mudler/llama.cpp

About

Built by Daniel Gamo — independent AI researcher from Spain.

Part of the Orion project: persistent memory + grief tech + mesh networking + total privacy + on-device AI. All in one app, built by one person.

David didn't ask permission to throw the stone.

License

MIT — use it, fork it, build on it.