# Dots Benchmarks
March 22, 2026
Packed GEMM-style matrix multiplication benchmarks comparing NumKong against faer, matrixmultiply, ndarray, nalgebra, and NumPy.
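The precision column in the tables below pairs an input type with an accumulator type: `i8 → i32`, for instance, means 8-bit integer inputs with products summed in 32 bits. A minimal NumPy sketch of that widening semantics (illustration only, not NumKong's packed kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 8, 3
a = rng.integers(-128, 128, size=(m, k), dtype=np.int8)
b = rng.integers(-128, 128, size=(k, n), dtype=np.int8)

# Widen the i8 inputs to i32 before multiplying, so neither the
# products nor the running sums can overflow the 8-bit range.
c = a.astype(np.int32) @ b.astype(np.int32)
assert c.dtype == np.int32
```

Optimized kernels fuse this widening into the inner loop instead of materializing i32 copies, which is where the throughput gap in the tables comes from.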
## Rust
| Library | Precision | GSO/s |
|---|---|---|
numkong::try_dots_packed_into | i8 → i32 | 2783.00 |
numkong::try_dots_packed_into | u8 → i32 | 2784.10 |
numkong::try_dots_packed_into | bf16 → f32 | 1250.80 |
numkong::try_dots_packed_into | f16 → f32 | 1249.70 |
faer::linalg::matmul::matmul | f32 → f32 | 117.50 |
matrixmultiply::sgemm | f32 → f32 | 116.77 |
ndarray::dot | f32 → f32 | 117.51 |
nalgebra::gemm | f32 → f32 | 118.72 |
numkong::try_dots_packed_into | f32 → f64 | 197.79 |
numkong::try_dots_packed_into | e4m3 → f32 | 495.25 |
numkong::try_dots_packed_into | e5m2 → f32 | 746.22 |
numkong::try_dots_packed_into | e2m3 → f32 | 1355.00 |
numkong::try_dots_packed_into | e3m2 → f32 | 693.43 |
## Python
| Library | Precision | GSO/s |
|---|---|---|
numkong.dots_packed | i8 → i32 | 2621.97 |
numkong.dots_packed | bf16 → f32 | 1142.19 |
numpy.matmul | f32 → f32 | 1854.27 |
numkong.dots_packed | f16 → f32 | 1134.69 |
numkong.dots_packed | f32 → f64 | 194.15 |
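A figure like the `numpy.matmul` row can be reproduced with a plain timing loop. A sketch, assuming GSO/s counts 2·M·N·K scalar operations (one multiply plus one add per accumulation) per matmul; that definition is my assumption, not stated above, and this uses a 512³ workload rather than the 2048³ default for brevity:

```python
import time
import numpy as np

m = k = n = 512
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

reps = 5
np.matmul(a, b)  # warm-up so lazy initialization is not timed
start = time.perf_counter()
for _ in range(reps):
    np.matmul(a, b)
elapsed = time.perf_counter() - start

# Assumed metric: 2*M*N*K scalar ops per matmul, reported in billions/s.
gso_per_s = (2 * m * n * k * reps) / elapsed / 1e9
print(f"numpy.matmul f32: {gso_per_s:.2f} GSO/s")
```

Absolute numbers depend heavily on the BLAS backend NumPy was built against, so expect results to differ from the table.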
## Run It

### Rust
```bash
# Default 2048×2048×2048 workload
cargo bench --bench bench_dots --features bench_dots

# Smaller 512×512×512 workload
NUMWARS_DIMS_WIDTH=512 NUMWARS_DIMS_HEIGHT=512 NUMWARS_DIMS_DEPTH=512 \
    cargo bench --bench bench_dots --features bench_dots

# Focus on float32
NUMWARS_FILTER="dots/f32" cargo bench --bench bench_dots --features bench_dots
```
### Python

```bash
# Default 2048×2048×2048 workload, float32 only
python dots/bench.py --filter 'dots/numpy/f32/2048x2048x2048'
```