Pre-Rotate-Queries Investigation Log

March 25, 2026

Goal

Move the WHT inverse rotation out of per-block dequant (O(128) work per 4-element access) and into the graph as a Q forward rotation plus a V inverse rotation (O(1) amortized per access). This would reclaim ~77 tok/s speed while maintaining PPL ~6.19.

Model Under Test

Qwen3.5-35B-A3B-Q8_0 (MoE)
n_embd_head_k = 256, n_embd_head_v = 256, n_head = 16, n_head_kv = 2
WHT rotation group = 128 (QK_TURBO3 = 128)
Each head has 2 rotation groups (256/128 = 2)

ggml_mul_mat Semantics (Verified)

ggml addresses 2D tensors with ne[0] as the fastest-varying index: element(i,j) is at offset i + j*ne[0], i.e. column-major if i is read as the row index
Storing a C row-major array M into a ggml tensor therefore makes ggml see M^T
ggml_mul_mat(A, x) computes A^T @ x in that convention
NET EFFECT: storing M row-major and calling ggml_mul_mat gives M @ x
VERIFIED with a 2x2 rotation test: stored R row-major, output was R @ x
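
The 2x2 check can be reproduced off-device with a small NumPy emulation. This is only a sketch of the indexing convention described above, not ggml code: `ggml_mul_mat` here is a hypothetical Python stand-in that takes flat buffers plus `ne` tuples and assumes result element (i,j) is the dot product of row i of `a` with row j of `b`.

```python
import numpy as np

def ggml_mul_mat(a_buf, a_ne, b_buf, b_ne):
    """Hypothetical emulation of ggml_mul_mat(a, b) on flat buffers.

    a_ne = (k, n), b_ne = (k, m); ne[0] is the fastest-varying index.
    The result has ne = (n, m), with element (i, j) at offset i + j*n
    equal to dot(row i of a, row j of b).
    """
    k, n = a_ne
    k_b, m = b_ne
    assert k == k_b, "inner dimensions must match"
    a = np.asarray(a_buf, dtype=np.float64).reshape(n, k)  # n rows of length k
    b = np.asarray(b_buf, dtype=np.float64).reshape(m, k)  # m rows of length k
    # (b @ a.T)[j, i] = dot(b row j, a row i); C-order flatten gives offset i + j*n
    return (b @ a.T).reshape(-1)

# 2x2 rotation test: store R as a C row-major buffer, multiply one vector.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 2.0])

out = ggml_mul_mat(R.ravel(), (2, 2), x, (2, 1))
matches_R = np.allclose(out, R @ x)      # expected under the stated convention
matches_RT = np.allclose(out, R.T @ x)   # would indicate the opposite direction
```

Under these assumptions `matches_R` holds and `matches_RT` does not, which agrees with the NET EFFECT line.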

Rotation Matrix Storage

TURBO_ROTATION_R: R = diag(s2) * H/sqrt(128) * diag(s1) (forward rotation)
TURBO_ROTATION_RT: R^T = R^{-1} (inverse rotation)
For Q forward: store R (TURBO_ROTATION_R) -> mul_mat gives R @ q
For V inverse: store R^T (TURBO_ROTATION_RT) -> mul_mat gives R^T @ cur = R^{-1} @ cur
Python verified: R^T @ R = I, round-trip error = 1.2e-15
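
A minimal sketch of that Python check, for reference. Assumptions: s1 and s2 are taken to be random ±1 sign vectors (the log does not say how they are generated), H is built with the Sylvester construction, and the names `hadamard` and `make_rotation` are hypothetical helpers, not part of the code base.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def make_rotation(n=128, seed=0):
    """R = diag(s2) * H/sqrt(n) * diag(s1), with s1, s2 assumed to be +/-1 signs."""
    rng = np.random.default_rng(seed)
    s1 = rng.choice([-1.0, 1.0], size=n)
    s2 = rng.choice([-1.0, 1.0], size=n)
    return np.diag(s2) @ (hadamard(n) / np.sqrt(n)) @ np.diag(s1)

R = make_rotation()   # forward rotation (TURBO_ROTATION_R in the log)
RT = R.T              # inverse rotation (TURBO_ROTATION_RT in the log)

# Orthogonality and round-trip checks in float64.
ortho_err = np.max(np.abs(RT @ R - np.eye(128)))
x = np.random.default_rng(1).standard_normal(128)
roundtrip_err = np.max(np.abs(RT @ (R @ x) - x))
```

Both errors should sit at float64 rounding level; the exact value depends on the sign vectors and the test vector.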

Test Results (all on Qwen3.5-35B-A3B, wikitext-2, 8 chunks, turbo3 K+V, flash attn)

Baseline

Config	PPL	Notes
Dequant inverse ON, no graph rot	6.194	Known good baseline
No dequant inverse, no graph rot	194	Fully rotated, no compensation

Graph Q Rotation Only (no V inverse)

Storage	mul_mat gives	PPL	Notes
TURBO_ROTATION_RT for Q (R^T@q)	R^T@q = R_inv@q	157	Wrong direction for K matching
TURBO_ROTATION_R for Q (R@q)	R@q	157	Correct direction but V still rotated

Both give PPL ~157 because the V output is still in rotated space (no V inverse). The attention scores differ between the two runs, but the V corruption dominates.

Graph Q + V Rotation (both active)

Q Storage	V Storage	Q gives	V gives	PPL
RT (R^T@q)	R (R@cur)	R^T@q (wrong)	R@cur (wrong)	26.6
R (R@q)	RT (R^T@cur)	R@q (correct)	R^T@cur (correct)	23.5

Corrected storage is better (23.5 < 26.6), confirming direction matters. But 23.5 >> 6.19.

Isolation Tests (dequant inverse ON)

Graph rotation	PPL	Notes
Q rot only (R@q)	10.5	Q rotation takes effect - degrades quality against un-rotated K
V inv only (R^T@cur)	26.6	V inverse takes effect - degrades quality against un-rotated V

Key Finding

Both rotations mechanically work (they modify the output, verified with a scale(2.0) test). But the full pre-rotate-queries approach (correct Q + correct V) gives PPL 23.5, NOT 6.19.

Unsolved: Why 23.5 Instead of 6.19?

Mathematically, the two approaches should be equivalent:

Dequant inverse: x_dequant = R^{-1}(quant(R(x))) ~ x + R^{-1}(epsilon)
Graph rotation: Q=R(q), K=quant(R(k)), V=quant(R(v)), out=R^{-1}(attn(Q,K,V))
Error magnitudes are identical (orthogonal rotation preserves norms)
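
The equivalence can be checked numerically outside ggml. The sketch below makes several assumptions: a single 128-wide rotation group, a toy uniform quantizer `quant` standing in for turbo3, random ±1 signs for s1 and s2, and plain softmax attention in float64 rather than the flash attention kernel. All function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def quant(x, step=0.05):
    """Toy uniform quantizer; a stand-in for turbo3, not the real format."""
    return np.round(x / step) * step

def attn(q, K, V):
    """Single-query softmax attention; K and V hold one token per row."""
    s = K @ q / np.sqrt(q.size)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

d, n_tokens = 128, 16
rng = np.random.default_rng(0)
s1 = rng.choice([-1.0, 1.0], size=d)
s2 = rng.choice([-1.0, 1.0], size=d)
R = np.diag(s2) @ (hadamard(d) / np.sqrt(d)) @ np.diag(s1)

q = rng.standard_normal(d)
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))

# Cache contents in both paths: each row is quant(R @ k) or quant(R @ v).
Kq = quant(K @ R.T)
Vq = quant(V @ R.T)

# Path A (dequant inverse): undo the rotation per row, then attend with plain q.
out_dequant = attn(q, Kq @ R, Vq @ R)

# Path B (graph rotation): rotate q forward, attend in rotated space, rotate output back.
out_graph = R.T @ attn(R @ q, Kq, Vq)

max_diff = np.max(np.abs(out_dequant - out_graph))
```

In this setting the two outputs agree to rounding error, which supports the algebra but says nothing about the kernels, so it points the search at the implementation hypotheses rather than the math.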

Hypotheses to investigate:

Flash attention precision: FA kernel may accumulate differently with rotated vs un-rotated values
Block boundary effects: 128-element rotation groups split across flash attention tiles differently
Metal mul_mat precision: GPU f32 mul_mat on (128,128)@(128,N) may have precision issues
Non-contiguous tensor handling: ggml_cont + reshape chain may not preserve data correctly on Metal
Graph optimizer interference: ggml graph optimizer may simplify/skip the rotation ops

Current Status

Dequant inverse rotation RESTORED (PPL = 6.194, speed ~10.7 tok/s)
Graph rotation code preserved as TODO comments for future investigation
Virtual method infrastructure (get_turbo_rot_forward/inverse) remains in place
Rotation tensor allocation and initialization remains in KV cache