TurboQuant: KV Cache Compression for LLM Inference

March 27, 2026 ยท View on GitHub

Implementation of TurboQuant KV cache compression (ICLR 2026, arXiv:2504.19874) with vLLM integration. Tested on dense and MoE architectures across RTX 3090 and RTX 5090 GPUs.

Benchmark Results

RTX 5090 (32GB) -- Qwen3.5-27B-AWQ (dense, 4-bit weights, TP=1)

Setup: Single RTX 5090, vLLM 0.18.0, gpu_memory_utilization=0.90, 16 full-attention layers out of 64 total (rest are linear-attention).

MetricBaseline (bf16 KV)TurboQuant (3b key / 2b val)
Prefill tok/s (30k ctx)1,8041,907 (+5.7%)
Decode tok/s (30k ctx)1.2641.303 (+3.1%)
KV cache freed--30.0 GB (across 4 GPUs)
Max token capacity457,072914,144 (2.0x)
Peak activation memory644.6 MB599.2 MB (-7.0%)

8x RTX 3090 (24GB each) -- Qwen3.5-35B-A3B MoE (pruned, 205 experts, TP=8)

Setup: 8x RTX 3090, vLLM 0.18.0, gpu_memory_utilization=0.92, AMD EPYC 7443P 24-Core, 504GB RAM. Model has 10 full-attention layers + 30 linear-attention layers (40 total). TQ compresses only the 10 full-attention layers.

Throughput & Latency (Baseline, bf16 KV)

ContextPrefill tok/sDecode tok/sTTFT (s)Needles Found
1,0007,127129.70.144/5
4,0008,887131.50.454/5
8,0009,684131.10.834/5
16,0009,933133.01.614/5
32,0009,761116.73.284/5
64,0008,843122.67.244/5
100,0008,479106.811.794/5
131,0008,23898.315.904/5
  • Prefill saturates around 10k tok/s, degrades gently to 8.2k at 131k context.
  • Decode drops from 133 to 98 tok/s at 131k (KV readback cost from full-attention layers).
  • TTFT scales linearly with context length (purely compute-bound).
  • Needles 4/5 found consistently at ALL context lengths -- the model reformats one answer.

VRAM Breakdown (per GPU at 131k context)

ComponentSize
Total VRAM24,576 MB
Reserved (0.92 util)22,610 MB
Model weights~6,750 MB
KV cache pool9,035 MB
-- full_attention (10 layers)3,614 MB
-- linear_attention (30 layers)5,421 MB
CUDA overhead + graphs~6,825 MB

Baseline vs TurboQuant KV Cache

ContextBaseline KV/GPUTQ KV/GPUSavings/GPUSavings %
8,00055.7 MB38.5 MB17.2 MB30.9%
32,000191.5 MB132.3 MB59.3 MB30.9%
64,000374.3 MB258.5 MB115.8 MB30.9%
100,000578.1 MB399.2 MB178.8 MB30.9%
131,000755.7 MB521.9 MB233.8 MB30.9%
  • Savings are 30.9% of total KV because TQ only compresses the 10 full-attention layers (40% of KV).
  • The 30 linear-attention layers (60% of KV) are not compressible by TQ.
  • On a pure dense transformer, savings would be 77% (4.4x compression).

Context Extension

TokensMultiplier
Baseline capacity1,411,6801.0x
With TQ2,043,8081.45x

Alternatively, freed VRAM supports 3 additional concurrent 131k-context requests.

Coherence & Quality

TestResult
Single needle (512-131k tokens)PASS at all lengths
5-needle at near-max context5/5 retrieved
3-needle multi-fact coherence3/3 retrieved
Golden ratio completion (all lengths)PASS, perplexity 1.05-1.35
Math reasoning at max contextCoherent (model math error from pruning, not context)

TQ Quantization Quality (head_dim=256, measured on GPU)

Componentcos_simNotes
TQ key compression (3-bit)1.000000Near-lossless
TQ key compression (4-bit)1.000000Near-lossless
Value quantization (2-bit)0.940Bottleneck for quality
Value quantization (4-bit)0.997Recommended for quality-sensitive use
Combined (3b key + 2b val)0.940Value quant dominates degradation

GPU Utilization During Inference

ContextPeak VRAM/GPUGPU UtilCPU %Power
1,00022,284 MB0% idle0.2%132W
32,00022,286 MB57% peak0.4%142W
131,00022,306 MB0% idle0.4%130W
  • VRAM is essentially flat -- KV cache at 131k is only 190 MB/GPU (0.8% of VRAM).
  • No CPU offloading. No KV offloading. Everything fits in VRAM.
  • GPU interconnect is PCIe (no NVLink) -- NODE topology between all GPUs.

Paper Validation (Theorems 1-3)

9 tests validating the paper's theoretical claims:

ClaimVerdictDetails
MSE distortion bounds (Thm 1)PASSWithin bounds for unit-norm vectors
Codebook MSE matches Table 1PASSLloyd-Max codebook is faithful
Unbiasedness (Thm 2)PASSRelative bias < 0.1%
Distortion 1/4^b scaling (Thm 3)PASS2-bit=0.70x, 3-bit=0.82x, 4-bit=0.97x of bound
Recall@8 (3-bit, N=4096)0.55Paper threshold met (>=0.40)
Rank correlation (N=2048)PASSSpearman rho > 0.85
Needle retrievalPASSWorks at all SNR levels
Compression ratio4.41xAt head_dim=256 on full-attention layers

Adversarial Audit

Honest assessment of claims (see audit_claims.py for full data):

ClaimVerdict
"5.1x compression"Misleading -- doesn't count Pi/S matrices or ring buffer. Honest: ~4.6x at 4k tokens, ~5x at 32k+
"Needle-in-haystack passes"True but trivial -- query=key test is too easy. Real LLM queries are not copies of keys
"Recall@8 >= 0.40"Low bar -- 3-bit recall@1 is only 38%. BUT dominant attention tokens are always preserved
"Hybrid decode saves memory"Storage yes, compute no -- dequantizes all history to float32 per decode step
"Distortion follows 1/4^b"True -- initial audit was wrong (unnormalized vectors). Unit-norm: within bound
"30k TQ is faster"Within noise -- N=1 run, total wall time TQ is actually slower
"200k context works"Unverified -- didn't crash, but output quality never checked
"2x context on dense model"True -- measured 30 GB freed on Qwen3.5-27B with 4x RTX 3090

How It Works

TurboQuant compresses KV cache entries using:

  1. Random orthogonal rotation to spread information across dimensions
  2. Lloyd-Max optimal scalar quantization (b-1 bits) on Beta-distributed rotated values
  3. QJL projection for residual sign bits (1 bit per dimension)
  4. Group quantization for values (2-bit or 4-bit, per-group scales and zeros)
  5. Bit-packing: 4 values per byte (2-bit) or 2 per byte (4-bit)

The combined estimator is unbiased: E[estimated inner product] = true inner product.

Architecture

turboquant/
  codebook.py          # Lloyd-Max optimal scalar quantizer for Beta distribution
  codebooks/           # Pre-generated codebook files (d=128/256, bits 2/3/4)
  rotation.py          # Random orthogonal rotation + QJL projection matrices
  quantizer.py         # TurboQuantMSE + TurboQuantProd (Algorithms 1 & 2)
  kv_cache.py          # KV cache manager with value bit-packing
  capture.py           # Modular KV capture hooks for attention layers
  store.py             # Compressed KV store (quantize + append + flat cache)
  score.py             # Attention scoring from compressed keys
  integration/vllm.py  # vLLM adapter (monkey-patch, free_kv_cache, hybrid decode)
  triton_kernels.py    # 3 fused Triton kernels for decode attention
  vllm_attn_backend.py # Thin shim delegating to integration/vllm.py

validate_paper.py      # 9 tests validating Theorems 1-3
audit_claims.py        # Adversarial audit of all claims
test_modular.py        # 19 modular architecture tests
test_turboquant.py     # 7 core quantizer tests
proof.py               # A/B benchmark (baseline vs TQ)

# Profiling scripts (8x RTX 3090 MoE validation)
validate_moe.py        # Baseline measurements via vLLM API
validate_moe_phase2.py # TQ quality on real GPU with head_dim=256
validate_moe_phase3.py # Logprobs, multi-needle, reasoning at max context
profile_100k.py        # Full profiling at 1k-131k context
profile_large.py       # Large context (64k-131k) with file-based payloads
baseline_vs_tq.py      # VRAM comparison: baseline bf16 vs TQ compressed
baseline_vs_tq_v2.py   # Block-level measurement during inference

Usage

pip install -e .

# Run paper validation (CPU, no GPU needed)
python validate_paper.py

# Run adversarial audit
python audit_claims.py

# Run modular tests
python -m pytest test_modular.py -v

# Run proof benchmark (requires 4x RTX 3090 + Qwen3.5-27B-AWQ)
CUDA_VISIBLE_DEVICES=0,1,4,6 python proof.py

Test Results

All 35 tests pass:

  • test_modular.py: 19/19 (modular architecture)
  • test_turboquant.py: 7/7 (core quantizer)
  • validate_paper.py: 9/9 (paper theorem validation)

Limitations

  • Prefill still uses paged cache: KV cache is allocated at engine init and used during prefill. TQ frees it after. True zero-allocation requires deeper vLLM integration.
  • Only full-attention layers: Linear-attention/Mamba layers are not compressed.
  • Value quantization is the bottleneck: 2-bit values cause cos_sim=0.94 degradation. Use 4-bit values (cos_sim=0.997) for quality-sensitive workloads.
  • Hybrid decode dequantizes all history: During compute, all compressed tokens are expanded to float32. The paper's fused Triton kernels exist but the hybrid path doesn't use them yet.
  • MoE models benefit less: Models with linear-attention layers (Qwen3.5 MoE, Mamba hybrids) have incompressible state that limits TQ's overall impact.

Environment

Tested on:

  • vLLM 0.18.0, PyTorch 2.10, CUDA 12.8
  • RTX 5090 (32GB) -- Qwen3.5-27B-AWQ, single GPU
  • 8x RTX 3090 (24GB) -- Qwen3.5-35B-A3B MoE, TP=8
  • Python 3.12