TurboQuant + llama.cpp

March 26, 2026

Compressing the KV cache to 3 bits with zero accuracy loss.
Making large language models run on phones, not data centers.



What is this?

A practical integration roadmap for bringing Google's TurboQuant (PolarQuant + QJL) into the llama.cpp ecosystem — specifically targeting on-device mobile inference.

This isn't a toy. This is a plan to run 8B parameter models on 6GB phones with 16K+ context windows.

The problem

Running LLMs on mobile devices hits a wall: the KV cache eats all your RAM. A Llama-3.1-8B model with 4K context needs ~1GB just for the KV cache. That's on a device with 6GB total.
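As a rough check on that figure, the cache size can be estimated from the model shape. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128); those numbers and the function name `kv_cache_bytes` are not part of this roadmap. Under those assumptions a 4K window comes to 0.5 GiB at 16 bits per value and 1 GiB at 32 bits, so the ~1GB figure corresponds to the 32-bit case (or to an 8K window at 16 bits).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * n_ctx * bits_per_value // 8

# Assumed Llama-3.1-8B shape (from the published model config, not from this README).
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

GIB = 1024 ** 3
fp16_4k = kv_cache_bytes(**LLAMA_3_1_8B, n_ctx=4096, bits_per_value=16)
fp32_4k = kv_cache_bytes(**LLAMA_3_1_8B, n_ctx=4096, bits_per_value=32)
print(f"16-bit cache, 4K context: {fp16_4k / GIB:.2f} GiB")  # 0.50 GiB
print(f"32-bit cache, 4K context: {fp32_4k / GIB:.2f} GiB")  # 1.00 GiB
```

The cache grows linearly with context length, which is why long conversations are the first thing to run out of memory on a phone.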

The solution

TurboQuant compresses the KV cache by roughly 6x, with no accuracy loss on the benchmarks its authors report:

| Metric | Without TurboQuant | With TurboQuant | Improvement |
|---|---|---|---|
| KV Cache (8B, 4K ctx) | ~1 GB | ~170 MB | ~6x |
| Max context (6GB device) | ~4K tokens | ~16-24K tokens | 4-6x |
| Attention speed | Baseline | Up to 8x (GPU) | Significant |
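The context-length row can be sanity-checked with the same kind of arithmetic: for a fixed memory budget, the number of tokens that fit scales inversely with bits per cached value. This is a minimal sketch, assuming the published Llama-3.1-8B shape and a hypothetical 512 MiB KV budget; neither number comes from the TurboQuant papers, and `max_context_tokens` is an illustrative name.

```python
def max_context_tokens(budget_bytes, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Largest token count whose keys and values fit in budget_bytes."""
    bits_per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_value
    return budget_bytes * 8 // bits_per_token

SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # assumed Llama-3.1-8B shape
BUDGET = 512 * 1024 ** 2  # hypothetical 512 MiB reserved for the KV cache

tokens_16bit = max_context_tokens(BUDGET, **SHAPE, bits_per_value=16)
tokens_3bit = max_context_tokens(BUDGET, **SHAPE, bits_per_value=3)
print(tokens_16bit, tokens_3bit, round(tokens_3bit / tokens_16bit, 2))
# prints: 4096 21845 5.33
```

Going from 16 bits to 3 bits per value gives 16/3 ≈ 5.3x more tokens in the same budget, which sits inside the 4-6x range in the table.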

How TurboQuant works

    Input KV Vector
           │
           ▼
┌──────────────────────┐
│  1. Random Rotation  │  ← Smooths data distribution
│     (Hadamard)       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  2. PolarQuant       │  ← Cartesian → Polar coordinates
│     (main bits)      │     No normalization needed
│                      │     Zero memory overhead
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  3. QJL Residual     │  ← 1-bit error correction
│     (1 sign bit)     │     Johnson-Lindenstrauss transform
│                      │     Eliminates bias
└──────────┬───────────┘
           │
           ▼
  Compressed KV Cache
  (~3 bits per value)
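Step 1 in the diagram can be illustrated with a randomized Hadamard transform: flip the sign of each coordinate at random, then apply an orthonormal Walsh-Hadamard transform. This is a sketch of one common way to build such a rotation, not necessarily the exact construction TurboQuant uses; the function names and the seed handling are illustrative assumptions.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def random_signs(n, seed=0):
    """Fixed random +/-1 pattern; the same pattern must be reused to invert."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rotate(x, signs):
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    # The orthonormal Hadamard matrix is its own inverse; then undo the sign flips.
    return [s * v for s, v in zip(signs, fwht(y))]

# A "spiky" vector with all its energy in one coordinate...
spiky = [8.0] + [0.0] * 7
signs = random_signs(8, seed=0)
rotated = rotate(spiky, signs)
# ...is spread evenly: every entry has magnitude 8 / sqrt(8), about 2.828.
print([round(abs(v), 3) for v in rotated])
```

Because the transform is orthonormal, vector norms and inner products are preserved, so attention scores computed on rotated keys and queries are unchanged. Only the shape of the per-coordinate distribution changes, which is what makes a fixed quantizer workable.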

PolarQuant converts vectors from Cartesian (X,Y,Z) to polar (radius + angles). Because the angles follow a known distribution after the random rotation, they can be quantized against a fixed codebook, and no per-block normalization constants (scale and zero point) need to be stored. That removes the memory overhead that conventional quantizers carry.
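To make the Cartesian-to-polar idea concrete, here is a deliberately simplified sketch: it takes coordinates two at a time, keeps each pair's radius in full precision, and snaps the angle to one of `2**angle_bits` evenly spaced bins. It does not attempt to reproduce the paper's codebooks or its treatment of the radii; the function names and the uniform binning are assumptions made for illustration.

```python
import math

def polar_quantize(vec, angle_bits=3):
    """Encode consecutive (x, y) pairs as (radius, quantized angle index)."""
    levels = 1 << angle_bits
    step = 2 * math.pi / levels
    radii, codes = [], []
    for x, y in zip(vec[0::2], vec[1::2]):
        radii.append(math.hypot(x, y))
        theta = math.atan2(y, x) % (2 * math.pi)   # map angle into [0, 2*pi)
        codes.append(int(theta // step) % levels)  # % levels guards the 2*pi edge
    return radii, codes

def polar_dequantize(radii, codes, angle_bits=3):
    """Rebuild each pair from its radius and the centre of its angle bin."""
    step = 2 * math.pi / (1 << angle_bits)
    out = []
    for r, c in zip(radii, codes):
        theta = (c + 0.5) * step
        out.extend((r * math.cos(theta), r * math.sin(theta)))
    return out

vec = [0.9, -0.3, -0.2, 0.7]
radii, codes = polar_quantize(vec, angle_bits=3)
approx = polar_dequantize(radii, codes, angle_bits=3)
print(codes)
print([round(v, 3) for v in approx])
```

Note that the angle bins are fixed in advance and do not depend on the data, so nothing like a per-block scale has to be stored next to the codes. In this sketch the reconstruction error of each pair is bounded by the radius times half the bin width.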

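QJL applies a Johnson-Lindenstrauss projection to the residual error left by PolarQuant and keeps only the sign of each projected value, one bit apiece. The resulting inner-product estimate is unbiased, and like PolarQuant it needs no stored per-block quantization constants.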
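The sign-bit estimator can be sketched in a few lines. The version below uses a dense Gaussian projection and the standard estimator for sign-quantized projections, in which the query is projected at full precision and only the cached vector is reduced to signs. The names (`qjl_encode`, `qjl_inner_product`), the seeds and the sizes are illustrative assumptions. The number of projections `m` is deliberately much larger than the dimension so the estimate is tight; with one bit per value, `m` equals the dimension and a single estimate is noisier, though still unbiased.

```python
import math
import random

def gaussian_matrix(m, d, seed=0):
    """m x d matrix of i.i.d. standard normal entries (the JL projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def qjl_encode(vec, proj):
    """Keep one sign bit per projection row, plus the vector's norm."""
    bits = [1 if dot(row, vec) >= 0 else -1 for row in proj]
    return bits, math.sqrt(dot(vec, vec))

def qjl_inner_product(query, bits, norm, proj):
    """Estimate <query, vec> using only the sign bits and norm of vec."""
    m = len(proj)
    acc = sum(dot(row, query) * b for row, b in zip(proj, bits))
    return math.sqrt(math.pi / 2) * norm * acc / m

d, m = 16, 8192
proj = gaussian_matrix(m, d, seed=0)
rng = random.Random(1)
residual = [rng.uniform(-0.1, 0.1) for _ in range(d)]  # stand-in for a PolarQuant residual
query = [rng.uniform(-1.0, 1.0) for _ in range(d)]

bits, norm = qjl_encode(residual, proj)
estimate = qjl_inner_product(query, bits, norm, proj)
exact = dot(query, residual)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

In the pipeline described above, this estimate would be added to the inner product computed against the PolarQuant reconstruction, correcting the bias that the main quantizer leaves behind.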


Integration Roadmap

Phase 1 — Today (immediate, no TurboQuant dependency)

Optimize existing GGUF quantization for mobile:

# Generate an importance matrix from calibration data
# (--chunks caps how many chunks are processed; --chunk would skip chunks instead)
./llama-imatrix \
  -m model-base-F16.gguf \
  -f calibration-data.txt \
  --chunks 512 \
  -o model-imatrix.dat

# Quantize protecting attention tensors (critical for long context)
./llama-quantize \
  --imatrix model-imatrix.dat \
  --tensor-type "attn_v=q5_k" \
  --tensor-type "attn_k=q5_k" \
  --tensor-type "ffn_down=q5_k" \
  model-base-F16.gguf \
  model-mobile-optimized.gguf Q4_K_M

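Why protect attention tensors? The attn_k and attn_v weights produce the keys and values that fill the KV cache, so quantization error in them affects every cached token. Keeping them at higher precision than the rest of the model helps it retain long conversations.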

Phase 2 — Short term (1-3 months)

KV cache quantization landing in llama.cpp:

Phase 3 — Medium term (3-6 months)

Native PolarQuant + QJL in GGML kernels, validated on:

  • ARM NEON (Android)
  • Apple AMX/ANE (iOS)
  • Metal compute shaders

See docs/integration-plan.md for the full technical breakdown.


Mobile Model Recommendations

| Device | RAM | Model | Quantization | Size |
|---|---|---|---|---|
| Android mid-range (6-8GB) | 6GB | Qwen3-1.5B / Phi-3-mini | Q4_K_M + imatrix | ~1.2GB |
| Android flagship (12GB+) | 12GB | Llama-3.1-8B / Gemma-2-9B | Q4_K_M + imatrix | ~4.5GB |
| iPhone 15/16 | 6-8GB | Qwen3-1.5B / Phi-3-mini | Q4_K_M + imatrix | ~1.2GB |
| iPad Pro | 16GB | Llama-3.1-8B | Q5_K_M + imatrix | ~5.5GB |

Why this matters

The AI industry spends billions on data centers. But the real frontier is your pocket.

TurboQuant makes it possible to run models with meaningful context windows on consumer devices — no cloud, no API keys, no surveillance.

This is what democratized AI looks like.


References

| Resource | Link |
|---|---|
| TurboQuant paper | arXiv:2504.19874 |
| PolarQuant paper | arXiv:2502.02617 |
| QJL paper | arXiv:2406.03482 |
| Google Research blog | Blog post |
| llama.cpp TurboQuant issue | #20977 |
| Experimental fork | mudler/llama.cpp |

About

Built by Daniel Gamo — independent AI researcher from Spain.

Part of the Orion project: persistent memory + grief tech + mesh networking + total privacy + on-device AI. All in one app, built by one person.

David didn't ask permission to throw the stone.


License

MIT — use it, fork it, build on it.