CPU performance

June 16, 2026 · View on GitHub

TL;DR — the build was SSE-only

The CPU slowness traced to a build trap, not the engine. Under Nix the gcc/clang wrapper strips -march=native (NIX_ENFORCE_NO_NATIVE), so a GGML_NATIVE=ON build silently compiles ggml-cpu with no AVX2/AVX-512/FMA — and the CI build (-DGGML_NATIVE=OFF) has no SIMD either. Confirmed by disassembly:

$ objdump -d libggml-cpu.so | grep -c zmm   # AVX-512   -> 0
$ objdump -d libggml-cpu.so | grep -c ymm   # AVX2      -> 0
$ objdump -d libggml-cpu.so | grep -c vfmadd# FMA       -> 0   (37k xmm/SSE only)

With SIMD actually enabled, ggml-f16 on CPU is ~10× faster and beats the PyTorch/transformers reference — no quantization needed. The fix is to build the CPU backend for all ISAs and pick at runtime (GGML_CPU_ALL_VARIANTS), which also sidesteps the Nix -march=native stripping.

512 tok, f16	tok/s
SSE-only (the trap: `GGML_NATIVE=OFF`, or `=ON` under Nix)	280
AVX-512 (explicit `-mavx512*`, or the zen4 runtime variant)	~3000
PyTorch CPU (fp32, MKL)	1935

The fix: GGML_CPU_ALL_VARIANTS (runtime ISA dispatch)

-march=native is fragile (stripped by Nix; wrong if you build on a different host than you run on). ggml's portable answer is to compile the CPU backend once per ISA level and score+load the best at run time:

cmake -B build -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON ...

This produces libggml-cpu-{sse42,haswell,skylakex,icelake,zen4,…}.so; on this Ryzen 9 7900 it loads libggml-cpu-zen4.so (AVX-512 + VNNI + BF16):

load_backend: loaded CPU backend from libggml-cpu-zen4.so

Engine support (src/backend.cpp): call ggml_backend_load_all() before ggml_backend_init_by_type, and set threads through the registry (ggml_backend_reg_get_proc_address(reg, "ggml_backend_set_n_threads")) since the CPU-specific symbol now lives in the variant .so, not in linked base. Both calls are no-ops for a static build, so one code path serves both. Use the release-portable preset.

ggml vs MKL once SIMD is real

pf-gemm-bench (ggml CPU mul_mat GFLOP/s) vs torch.matmul (MKL), 12 threads:

shape (M,K,N)	ggml SSE	ggml AVX-512	MKL f32
expert gate_up N=16 (f32)	39	426	463
expert gate_up N=512 (f32)	41	528	1098
large 4096² N=512 (f32)	39	442	1137

ggml's f32 GEMM goes 40 → ~500 GFLOP/s — within ~2× of MKL, and at the actual model level (lower per-op overhead than HF's Python expert loop) ggml-f16 wins: 512 tok f16 3006 vs PyTorch 1935. So there was never a missing blocked SGEMM — it just wasn't compiled with SIMD.

Profile (AVX-512) and the minor Q8 option

PF_PROF=noattn|nomoe ablation (512 tok, AVX-512): MoE 64%, attention 34%, rest <1%. PF_NTHREADS sweep: near-linear to 12 physical cores, SMT regresses — the default is optimal.

Quantization is now a minor lever, not a necessity: scripts/requant_q8.py (Q8_0 experts) adds ~15% over f16+AVX-512 (512 tok: 3006 → 3567) but is a strict precision drop — on the 3k-token case it falls below the f16 parity gate (cos 0.9972, 1 argmax flip in 3053), so it would need its own tier. Given f16+AVX-512 already beats the reference, Q8 is optional (e.g. for memory: 1.6 vs 2.8 GiB).

Flash attention (both backends)

With SIMD fixed, attention became the dominant cost at length on both backends (PF_PROF ablation — CPU 8192 tok: attention 72%; Vulkan 2k–32k: ~69%), because the engine built the full [n,n] score matrix and masked it to the sliding window — O(n²) work for an O(n·256) receptive field.

ggml_flash_attn_ext (default; PF_NOFLASH selects the explicit path) fuses QK·softmax·V with no materialized scores, carries the attention sinks (ggml_flash_attn_ext_add_sinks) and the sliding-window mask, and accumulates in F32. It is numerically exact here — passes the f32 cos>=0.99999 gate and window-stitch — and faster where attention dominates:

forward tok/s	CPU 2048	CPU 8192	Vulkan 8192	Vulkan 131072
explicit (`PF_NOFLASH`)	1881	798	11845	8992
flash (default)	3319	1928	26918	20631
speedup	1.8×	2.4×	2.3×	2.3×

Memory and the processing window (W)

PF_WINDOW (the pf_set_window knob, default 4096) sets tokens per forward; longer inputs run as overlapping halo windows. At the default, GGML's footprint is flat across document length — the compute buffer is bounded by the window, not the input:

length	PyTorch VRAM (eager)	GGML Vulkan VRAM (flash, W=4096)
4 096	5 439	2 883
8 192	13 637	2 883
32 768	OOM	2 883
131 072	OOM	2 883

PyTorch (single-pass) grows O(n²) and OOMs by ~16k tokens; GGML holds ~2.9 GiB at 131k. So the default W=4096 is a good fit for VRAM-constrained deployments.

Raising W to cut the halo recompute is tempting but currently a bad trade: it OOMs by W=16384. Flash removed the O(n²) scores, but the sliding-window mask is still a materialized [n,n] tensor — the last O(n²) term.

Banded mask (prototype, `pf-banded-proto`)

Grouping tokens into blocks of B ≥ radius and having each query block attend only to blocks {i-1, i, i+1} makes the mask O(n·B) (a [3B, B, n_blocks] band, constant per block) and the attention compute O(n·band) — while being bit-identical to full masked attention (same dot products, computed locally):

$ pf-banded-proto 256 8192
n=8192 B=256 r=128 | max|d|=0.00e+00 | mask: full 256.0 MiB, band 24.0 MiB (10.7x)

Mask scaling (B=256): 21× smaller at 16k, 85× at 64k.

On by default for sequences >= 2048 tokens (src/model.cpp; PF_BANDED forces it on/off): blocks of B=256, each query block flash-attends to blocks {i-1,i,i+1} with the F16 band mask + sinks; GQA broadcasts over heads; out-of-range tokens are padded and masked. Parity-exact — passes the f32 cos>=0.99999 gate and window-stitch on CPU and Vulkan. Speedups (flash → banded, default W):

tok/s	CPU 8192	Vulkan 8192	Vulkan 32768
flash	2068	42407	33893
banded	2325	105058	83664
	1.1×	2.5×	2.5×

Big on Vulkan (the flash kernel computes the full window; banded only the band), modest on CPU. The measured crossover (banded/flash): 0.9× at 256–512 tok, 1.0× at 2048, then 1.1× (CPU) / 2.5× (Vulkan) at 4096+. Hence the 2048 default cutoff.

Dropping the window (`PF_MOE_CHUNK`)

With banded attention the only remaining O(n) cap on a large single window was the MoE expert matmul's activation scratch (mul_mat_id y_sz > maxStorageBufferRange on Vulkan). The MoE is per-token, so PF_MOE_CHUNK=C runs it in C-token chunks (exact, no halo). It defaults to the forward window (4096), so it's inert at the default window (n <= W) but keeps a larger window from OOMing. Banded + chunking lets a 131072-token document run in one window instead of windowing at W=4096:

131072 tok, Vulkan	tok/s	compute buffer
banded, windowed W=4096	80 897	166 MiB
banded + chunk, single window	103 539	2 389 MiB

~1.28× faster (no halo recompute) for more memory -- the throughput/VRAM tradeoff the window now exposes, capped only by total VRAM. Passes the f32 parity gate.

Reproduce

cmake --preset release-portable && cmake --build --preset release-portable -j
build/release-portable/pf-gemm-bench 12          # GFLOP/s by dtype/shape
build/release-portable/pf-bench <f16.gguf> cpu 5 512
objdump -d build/release-portable/bin/libggml-cpu-zen4.so | grep -c zmm   # > 0