Dots Benchmarks

March 22, 2026

Packed GEMM-style matrix multiplication benchmarks comparing NumKong against faer, matrixmultiply, ndarray, nalgebra, and NumPy.
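The tables report throughput in GSO/s. Assuming that stands for giga scalar operations per second, with one m×k×n matmul counted as 2·m·k·n scalar operations (one multiply plus one add per accumulated pair), a minimal NumPy timing sketch of the same measurement looks like this; the op-counting convention is an assumption, not something the tables define:

```python
import time
import numpy as np

# Assumed convention: an m×k×n matmul does 2*m*k*n scalar ops
# (multiply + add per accumulated product).
m = k = n = 512
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

start = time.perf_counter()
c = np.matmul(a, b)
elapsed = time.perf_counter() - start

# Normalize op count by wall time to get giga scalar ops per second.
gso_per_s = 2 * m * k * n / elapsed / 1e9
print(f"{gso_per_s:.2f} GSO/s")
```

A single un-warmed run like this understates steady-state throughput; the numbers above come from a proper benchmark harness that repeats and aggregates.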

Rust

| Library | Precision | GSO/s |
|---|---|---|
| numkong::try_dots_packed_into | i8 → i32 | 2783.00 |
| numkong::try_dots_packed_into | u8 → i32 | 2784.10 |
| numkong::try_dots_packed_into | bf16 → f32 | 1250.80 |
| numkong::try_dots_packed_into | f16 → f32 | 1249.70 |
| faer::linalg::matmul::matmul | f32 → f32 | 117.50 |
| matrixmultiply::sgemm | f32 → f32 | 116.77 |
| ndarray::dot | f32 → f32 | 117.51 |
| nalgebra::gemm | f32 → f32 | 118.72 |
| numkong::try_dots_packed_into | f32 → f64 | 197.79 |
| numkong::try_dots_packed_into | e4m3 → f32 | 495.25 |
| numkong::try_dots_packed_into | e5m2 → f32 | 746.22 |
| numkong::try_dots_packed_into | e2m3 → f32 | 1355.00 |
| numkong::try_dots_packed_into | e3m2 → f32 | 693.43 |

Python

| Library | Precision | GSO/s |
|---|---|---|
| numkong.dots_packed | i8 → i32 | 2621.97 |
| numkong.dots_packed | bf16 → f32 | 1142.19 |
| numpy.matmul | f32 → f32 | 1854.27 |
| numkong.dots_packed | f16 → f32 | 1134.69 |
| numkong.dots_packed | f32 → f64 | 194.15 |

Run It

Rust

```bash
# Default 2048×2048×2048 workload
cargo bench --bench bench_dots --features bench_dots

# Smaller 512×512×512 workload
NUMWARS_DIMS_WIDTH=512 NUMWARS_DIMS_HEIGHT=512 NUMWARS_DIMS_DEPTH=512 \
  cargo bench --bench bench_dots --features bench_dots

# Focus on float32
NUMWARS_FILTER="dots/f32" \
  cargo bench --bench bench_dots --features bench_dots
```
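The NUMWARS_DIMS_* variables above size the workload, with unset dimensions falling back to the 2048 default. As an illustration of that pattern (the variable names come from the commands above; the parsing logic here is an assumption, not the bench's actual code):

```python
import os

def bench_dims(default: int = 2048) -> tuple[int, int, int]:
    """Read workload dimensions from NUMWARS_DIMS_* env vars,
    falling back to the default for any that are unset."""
    width = int(os.environ.get("NUMWARS_DIMS_WIDTH", default))
    height = int(os.environ.get("NUMWARS_DIMS_HEIGHT", default))
    depth = int(os.environ.get("NUMWARS_DIMS_DEPTH", default))
    return width, height, depth

# Override only one dimension; the others keep the 2048 default.
os.environ["NUMWARS_DIMS_WIDTH"] = "512"
print(bench_dims())
```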

Python

```bash
# Default 2048×2048×2048 workload, float32 only
python dots/bench.py --filter 'dots/numpy/f32/2048x2048x2048'
```