Long-Context Sparse V Validation

March 28, 2026 · View on GitHub

Date: 2026-03-27 Hardware: Apple M5 Max 128GB Branch: feature/turboquant-kv-cache @ a52586e

Motivation

Prior sparse V quality validation used 8-chunk wikitext-2 at c=512 (PPL 6.1756). At 512 tokens, sparse V barely triggers — most attention weights are above the 1e-6 threshold. This eval validates quality at context lengths where sparse V is actively skipping positions, with increased chunk counts for statistical power and q8_0 baselines to confirm corpus sanity.

Methodology

  • llama-perplexity with wikitext-2-raw at c=8192 (20 chunks), c=16384 (10 chunks), c=32768 (5 chunks)
  • KV cache: turbo3 (3.5-bit TurboQuant)
  • Three runs per context length: q8_0 baseline, turbo3 sparse V ON, turbo3 sparse V OFF
  • q8_0 baselines run first to confirm corpus/chunk settings produce sane PPL before evaluating turbo3
  • Sparse V OFF via TURBO_SPARSE_V=0 environment variable

Skip-rate measurement: Direct measurement via output_attentions=True on Qwen3-1.7B (same architecture family, fits in memory with eager attention). Counts attention weights below τ=1e-6 across all layers and heads. The 35B model cannot run eager attention (OOM), so 35B skip rates are estimated from decode speed improvements.

Results

MoE (Qwen3.5-35B-A3B Q8_0) — Full Validation

Each context length tested with three sequential runs: q8_0 baseline first (sanity check), then turbo3 without sparse V (isolates compression), then turbo3 with sparse V (isolates sparse V effect).

wikitext-2 (1.3MB corpus):

ContextChunksq8_0 baselineturbo3 + sparse Vturbo3 no sparse VSparse V Δvs q8_0
8K205.4592 ± 0.0455.5195 ± 0.0455.5195 ± 0.0450.0000+1.1%
16K105.0008 ± 0.0395.0630 ± 0.0405.0630 ± 0.0400.0000+1.2%
32K56.0274 ± 0.0506.1103 ± 0.0516.1103 ± 0.0510.0000+1.4%

wikitext-103 (516MB corpus, 10× statistical power):

ContextChunksq8_0q4_0turbo3 + sparse Vturbo3 no sparse VSparse V Δ
32K507.0638 ± 0.0217.0857 ± 0.0217.1796 ± 0.0217.1796 ± 0.0210.0000

The 50-chunk wikitext-103 run is the strongest validation: tight CI (±0.021), large corpus, high chunk count. Sparse V delta is exactly 0.0000.

Note on q4_0: Included as a reference baseline. No optimization or tuning effort was applied to q4_0 in this work. Development and optimization focused on q8_0 and turbo3 paths. turbo3 uses fewer bits (3.5 vs 4.0), so slightly higher PPL relative to q4_0 is expected.

Dense (Qwen3.5-27B Q8_0)

ContextChunksSparse V ONSparse V OFFDelta
8K87.0152 ± 0.1067.0152 ± 0.1060.0000

Skip Rate — Direct Measurement (Qwen3-1.7B)

Measured directly via output_attentions=True, counting positions where attention weight < τ:

ContextOverall skipMin layerMax layerMedian layer
5129.1%0.0%32.1%6.3%
204820.7%2.0%59.5%15.0%
409628.4%3.7%72.4%24.5%

Skip Rate — Estimated (Qwen3.5-35B MoE, from decode speed data)

ContextDecode ΔEst. skip rate
8K+7.2%~28%
16K+12.9%~51%
32K+22.8%~90%

Interpretation

Sparse V is directly measured to be active at long context. At 4096 tokens, 28.4% of V positions are below threshold (directly measured). Skip rate grows monotonically with context length.

PPL remains numerically identical when sparse V is active. ON/OFF delta is 0.0000 at every context length and corpus tested (8K through 32K on wikitext-2, 32K on wikitext-103 with 50 chunks). The +1.1-1.6% gap vs q8_0 is the underlying TurboQuant compression overhead — consistent across context lengths and unaffected by sparse V.

The 50-chunk wikitext-103 run is the definitive validation. With CI ±0.021, it has sufficient statistical power to detect a PPL difference of ~0.04 (~0.6%). No difference was observed. This supersedes the earlier 2-5 chunk wikitext-2 runs.

The 512-context PPL should be described as a no-regression sanity check. At c=512, sparse V skips ~9% of positions. The true validation is at c=8K+ where skip rates are 28-90% and the optimization is actively changing computation.

Remaining caveats:

  • Direct skip-rate measurement used Qwen3-1.7B (28 layers), not the 35B eval model (40 layers). The 35B model cannot run eager attention without OOM. Skip rates on the 35B are estimated from decode speed improvements.
  • Skip rate varies significantly by layer (0% to 72% at 4K). Early layers show higher skip rates.

Raw Commands

LLAMA=~/local_llms/llama.cpp/build-turbo/bin
MODEL_MOE=~/local_llms/models/Qwen3.5-35B-A3B-Q8_0.gguf
MODEL_DENSE=~/local_llms/models/Qwen3.5-27B-Q8_0.gguf
WIKI=~/local_llms/llama.cpp/wikitext-2-raw/wiki.test.raw

# q8_0 baselines first (sanity check)
for ctx_ch in "8192 20" "16384 10" "32768 5"; do
  set -- $ctx_ch
  $LLAMA/llama-perplexity -m $MODEL_MOE -f $WIKI -c \$1 \
    -ctk q8_0 -ctv q8_0 -fa on --chunks \$2 -ngl 99
done

# turbo3 ON/OFF at each context length
for ctx_ch in "8192 20" "16384 10" "32768 5"; do
  set -- $ctx_ch
  $LLAMA/llama-perplexity -m $MODEL_MOE -f $WIKI -c \$1 \
    -ctk turbo3 -ctv turbo3 -fa on --chunks \$2 -ngl 99
  TURBO_SPARSE_V=0 $LLAMA/llama-perplexity -m $MODEL_MOE -f $WIKI -c \$1 \
    -ctk turbo3 -ctv turbo3 -fa on --chunks \$2 -ngl 99
done

# Direct skip rate measurement
python3 scripts/measure_skip_rate.py 512 2048 4096

# 50-chunk 32K on wikitext-103 (strongest validation)
WIKI103=~/local_llms/llama.cpp/wikitext-103-raw/wiki.train.raw

# 1. q8_0 baseline (anchors the experiment)
$LLAMA/llama-perplexity -m $MODEL_MOE -f $WIKI103 -c 32768 \
  -ctk q8_0 -ctv q8_0 -fa on --chunks 50 -ngl 99

# 2. turbo3 WITHOUT sparse V (isolates compression effect)
TURBO_SPARSE_V=0 $LLAMA/llama-perplexity -m $MODEL_MOE -f $WIKI103 -c 32768 \
  -ctk turbo3 -ctv turbo3 -fa on --chunks 50 -ngl 99

# 3. turbo3 WITH sparse V (isolates sparse V effect)
$LLAMA/llama-perplexity -m $MODEL_MOE -f $WIKI103 -c 32768 \
  -ctk turbo3 -ctv turbo3 -fa on --chunks 50 -ngl 99