Pre-Rotate-Queries Investigation Log

March 25, 2026

Goal

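Move the inverse WHT (Walsh-Hadamard transform) rotation out of per-block dequantization, where it costs O(128) work per 4-element access, and into the compute graph: one forward rotation of Q plus one inverse rotation on the V side (the attention output), which is O(1) amortized per access. This would reclaim ~77 tok/s while keeping PPL at ~6.19.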

Model Under Test

  • Qwen3.5-35B-A3B-Q8_0 (MoE)
  • n_embd_head_k = 256, n_embd_head_v = 256, n_head = 16, n_head_kv = 2
  • WHT rotation group = 128 (QK_TURBO3 = 128)
  • Each head has 2 rotation groups (256/128 = 2)
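Because the head dimension (256) is twice the rotation group size (128), a graph-level rotation has to act on each 128-element group independently. A minimal sketch, assuming the groups are consecutive slices of the head vector (the log does not state the layout); `apply_per_group` is a hypothetical helper, not part of the codebase:

```python
def apply_per_group(rotate, head, group=128):
    """Apply `rotate` to each consecutive `group`-sized slice of `head`.

    `rotate` is any function mapping a list of length `group` to a list of
    the same length (for example, multiplication by a 128x128 matrix).
    """
    if len(head) % group != 0:
        raise ValueError("head size must be a multiple of the group size")
    out = []
    for start in range(0, len(head), group):
        out.extend(rotate(head[start:start + group]))
    return out


# Stand-in "rotation" (list reversal) that makes the group boundaries visible.
head = list(range(256))                      # n_embd_head_k = 256
rotated = apply_per_group(lambda g: g[::-1], head)
n_groups = len(head) // 128                  # 256 / 128 = 2
```

With the reversal stand-in, element 0 of the result comes from index 127 and element 128 from index 255, so nothing crosses the group boundary.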

ggml_mul_mat Semantics (Verified)

  • ggml stores 2D tensors column-major: element(i,j) at offset i + j*ne[0]
  • Storing a C row-major array M into ggml: ggml sees M^T
  • ggml_mul_mat(A, x) computes A^T @ x
  • NET EFFECT: storing row-major M and calling mul_mat gives M @ x
  • VERIFIED with 2x2 rotation test: stored R row-major, got R@x output
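The layout argument above can be replayed without ggml. This is a minimal pure-Python sketch, assuming only the indexing rule stated above (offset i + j*ne[0]) and that ggml_mul_mat contracts over ne[0] of both operands; `mul_mat_emulated` is a hypothetical helper, not a ggml function:

```python
def mul_mat_emulated(a, a_ne, b, b_ne):
    """Emulate ggml_mul_mat(a, b) on flat buffers.

    a_ne = (K, N), b_ne = (K, M); element (i0, i1) lives at offset i0 + i1 * ne0.
    The result has ne = (N, M) with out[i, j] = sum_k a[k, i] * b[k, j].
    """
    k, n = a_ne
    kb, m = b_ne
    if k != kb:
        raise ValueError("ne[0] of both operands must match")
    out = [0.0] * (n * m)
    for j in range(m):
        for i in range(n):
            out[i + j * n] = sum(a[kk + i * k] * b[kk + j * k] for kk in range(k))
    return out


# 90-degree rotation M = [[0, -1], [1, 0]], laid out like a C row-major array.
m_row_major = [0.0, -1.0, 1.0, 0.0]
x = [1.0, 0.0]  # a single column vector, ne = (2, 1)

y = mul_mat_emulated(m_row_major, (2, 2), x, (2, 1))
# M @ x would be [0, 1]; M^T @ x would be [0, -1].
print(y)  # prints [0.0, 1.0]
```

The result matches M @ x rather than M^T @ x, which agrees with the 2x2 rotation test recorded above.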

Rotation Matrix Storage

  • TURBO_ROTATION_R: R = diag(s2) * H/sqrt(128) * diag(s1) (forward rotation)
  • TURBO_ROTATION_RT: R^T = R^{-1} (inverse rotation)
  • For Q forward: store R (TURBO_ROTATION_R) -> mul_mat gives R @ q
  • For V inverse: store R^T (TURBO_ROTATION_RT) -> mul_mat gives R^T @ cur = R^{-1} @ cur
  • Python verified: R^T @ R = I, round-trip error = 1.2e-15
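The orthogonality check can be reproduced along these lines. This is a sketch, not the script used for this log: it assumes s1 and s2 are vectors of random +/-1 signs (the log does not say how they are generated) and builds H with the Sylvester construction; all function names are hypothetical.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def build_rotation(n, seed=0):
    """R = diag(s2) * H / sqrt(n) * diag(s1), with assumed random +/-1 signs."""
    rng = random.Random(seed)
    s1 = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    s2 = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    h = hadamard(n)
    c = 1.0 / math.sqrt(n)
    return [[s2[i] * h[i][j] * c * s1[j] for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def max_roundtrip_error(n=128, seed=0):
    """Largest absolute difference between x and R^T @ (R @ x) for a random x."""
    r = build_rotation(n, seed)
    rt = transpose(r)
    rng = random.Random(seed + 1)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = matvec(rt, matvec(r, x))
    return max(abs(a - b) for a, b in zip(x, y))


err = max_roundtrip_error()
```

In double precision the round-trip error should sit near machine epsilon; the exact value depends on the sign vectors and the test vector.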

Test Results (all on Qwen3.5-35B-A3B, wikitext-2, 8 chunks, turbo3 K+V, flash attn)

Baseline

| Config | PPL | Notes |
| --- | --- | --- |
| Dequant inverse ON, no graph rot | 6.194 | Known good baseline |
| No dequant inverse, no graph rot | 194 | Fully rotated, no compensation |

Graph Q Rotation Only (no V inverse)

| Storage | mul_mat gives | PPL | Notes |
| --- | --- | --- | --- |
| TURBO_ROTATION_RT for Q (R^T@q) | R^T@q = R_inv@q | 157 | Wrong direction for K matching |
| TURBO_ROTATION_R for Q (R@q) | R@q | 157 | Correct direction but V still rotated |

Both give ~157 because V output is still in rotated space (no V inverse). The attention scores differ but V corruption dominates.

Graph Q + V Rotation (both active)

| Q Storage | V Storage | Q gives | V gives | PPL |
| --- | --- | --- | --- | --- |
| RT (R^T@q) | R (R@cur) | R^T@q (wrong) | R@cur (wrong) | 26.6 |
| R (R@q) | RT (R^T@cur) | R@q (correct) | R^T@cur (correct) | 23.5 |

The corrected storage is better (23.5 < 26.6), confirming that direction matters, but 23.5 is still far above the 6.19 baseline.

Isolation Tests (dequant inverse ON)

| Graph rotation | PPL | Notes |
| --- | --- | --- |
| Q rot only (R@q) | 10.5 | Q rotation works - degrades quality vs un-rotated K |
| V inv only (R^T@cur) | 26.6 | V inverse works - degrades quality vs un-rotated V |

Key Finding

Both rotations mechanically work (modify output, verified with scale(2.0) test). But the full pre-rotate-queries approach (correct Q + correct V) gives PPL 23.5, NOT 6.19.

Unsolved: Why 23.5 Instead of 6.19?

On paper, the two approaches should be equivalent:

  • Dequant inverse: x_dequant = R^{-1}(quant(R(x))) ~ x + R^{-1}(epsilon)
  • Graph rotation: Q=R(q), K=quant(R(k)), V=quant(R(v)), out=R^{-1}(attn(Q,K,V))
  • Error magnitudes are identical (orthogonal rotation preserves norms)
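The equivalence claimed above can be checked numerically in an idealized setting. The sketch below is a toy model under stated assumptions: a uniform rounding quantizer stands in for turbo3, a single 8-element group stands in for the 128-element group, attention is plain softmax in double precision (no flash attention, no Metal), and every name is hypothetical.

```python
import math
import random


def hadamard(n):
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def make_rotation(n, rng):
    """R = diag(s2) * H / sqrt(n) * diag(s1) with assumed random +/-1 signs."""
    s1 = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    s2 = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    h, c = hadamard(n), 1.0 / math.sqrt(n)
    return [[s2[i] * h[i][j] * c * s1[j] for j in range(n)] for i in range(n)]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def quant(x, step=0.125):
    """Toy uniform quantizer; a stand-in for turbo3, not the real scheme."""
    return [round(v / step) * step for v in x]


def attention(q, ks, vs):
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, vs)) / total for d in range(len(vs[0]))]


def compare_paths(n=8, tokens=5, seed=0):
    """Max abs difference between dequant-inverse and graph-rotation outputs."""
    rng = random.Random(seed)
    r = make_rotation(n, rng)
    rt = transpose(r)
    q = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ks = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(tokens)]
    vs = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(tokens)]
    k_cache = [quant(matvec(r, k)) for k in ks]  # K = quant(R(k))
    v_cache = [quant(matvec(r, v)) for v in vs]  # V = quant(R(v))
    # Path A: undo the rotation at dequant time, attend with un-rotated q.
    out_dequant = attention(q, [matvec(rt, k) for k in k_cache],
                            [matvec(rt, v) for v in v_cache])
    # Path B: rotate q once, attend in rotated space, un-rotate the output.
    out_graph = matvec(rt, attention(matvec(r, q), k_cache, v_cache))
    return max(abs(a - b) for a, b in zip(out_dequant, out_graph))


diff = compare_paths()
```

In this toy setting the two paths agree to floating-point rounding, which is consistent with the gap between 23.5 and 6.19 coming from the implementation rather than from the algebra.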

Hypotheses to investigate:

  1. Flash attention precision: FA kernel may accumulate differently with rotated vs un-rotated values
  2. Block boundary effects: 128-element rotation groups split across flash attention tiles differently
  3. Metal mul_mat precision: GPU f32 mul_mat on (128,128)@(128,N) may have precision issues
  4. Non-contiguous tensor handling: ggml_cont + reshape chain may not preserve data correctly on Metal
  5. Graph optimizer interference: ggml graph optimizer may simplify/skip the rotation ops

Current Status

  • Dequant inverse rotation RESTORED (PPL = 6.194, speed ~10.7 tok/s)
  • Graph rotation code preserved as TODO comments for future investigation
  • Virtual method infrastructure (get_turbo_rot_forward/inverse) remains in place
  • Rotation tensor allocation and initialization remain in the KV cache