# Dots Benchmarks
March 22, 2026
Packed GEMM-style matrix multiplication benchmarks comparing NumKong against faer, matrixmultiply, ndarray, nalgebra, and NumPy.
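The precision column in the tables below pairs an input type with an accumulator type: `i8 → i32`, for instance, means 8-bit integer inputs with products summed in 32 bits. A minimal NumPy sketch of that widening semantics (illustration only, not NumKong's packed kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 8, 3
a = rng.integers(-128, 128, size=(m, k), dtype=np.int8)
b = rng.integers(-128, 128, size=(k, n), dtype=np.int8)

# Widen the i8 inputs to i32 before multiplying, so neither the
# products nor the running sums can overflow the 8-bit range.
c = a.astype(np.int32) @ b.astype(np.int32)
assert c.dtype == np.int32
```

Optimized kernels fuse this widening into the inner loop instead of materializing i32 copies, which is where the throughput gap in the tables comes from.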
## Rust
| Library | Precision | GSO/s |
|---|---|---|
numkong::try_dots_packed_into | i8 → i32 | 2783.00 |
numkong::try_dots_packed_into | u8 → i32 | 2784.10 |
numkong::try_dots_packed_into | bf16 → f32 | 1250.80 |
numkong::try_dots_packed_into | f16 → f32 | 1249.70 |
faer::linalg::matmul::matmul | f32 → f32 | 117.50 |
matrixmultiply::sgemm | f32 → f32 | 116.77 |
ndarray::dot | f32 → f32 | 117.51 |
nalgebra::gemm | f32 → f32 | 118.72 |
numkong::try_dots_packed_into | f32 → f64 | 197.79 |
numkong::try_dots_packed_into | e4m3 → f32 | 495.25 |
numkong::try_dots_packed_into | e5m2 → f32 | 746.22 |
numkong::try_dots_packed_into | e2m3 → f32 | 1355.00 |
numkong::try_dots_packed_into | e3m2 → f32 | 693.43 |
## Python
| Library | Precision | GSO/s |
|---|---|---|
numkong.dots_packed | i8 → i32 | 2621.97 |
numkong.dots_packed | bf16 → f32 | 1142.19 |
numpy.matmul | f32 → f32 | 1854.27 |
numkong.dots_packed | f16 → f32 | 1134.69 |
numkong.dots_packed | f32 → f64 | 194.15 |
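A figure like the `numpy.matmul` row can be reproduced with a plain timing loop. A sketch, assuming GSO/s counts 2·M·N·K scalar operations (one multiply plus one add per accumulation) per matmul; that definition is my assumption, not stated above, and this uses a 512³ workload rather than the 2048³ default for brevity:

```python
import time
import numpy as np

m = k = n = 512
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

reps = 5
np.matmul(a, b)  # warm-up so lazy initialization is not timed
start = time.perf_counter()
for _ in range(reps):
    np.matmul(a, b)
elapsed = time.perf_counter() - start

# Assumed metric: 2*M*N*K scalar ops per matmul, reported in billions/s.
gso_per_s = (2 * m * n * k * reps) / elapsed / 1e9
print(f"numpy.matmul f32: {gso_per_s:.2f} GSO/s")
```

Absolute numbers depend heavily on the BLAS backend NumPy was built against, so expect results to differ from the table.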
## Run It

### Rust
```bash
# Default 2048×2048×2048 workload
cargo bench --bench bench_dots --features bench_dots

# Smaller 512×512×512 workload
NUMWARS_DIMS_WIDTH=512 NUMWARS_DIMS_HEIGHT=512 NUMWARS_DIMS_DEPTH=512 \
    cargo bench --bench bench_dots --features bench_dots

# Focus on float32
NUMWARS_FILTER="dots/f32" cargo bench --bench bench_dots --features bench_dots
```
### Python

```bash
# Default 2048×2048×2048 workload, float32 only
python dots/bench.py --filter 'dots/numpy/f32/2048x2048x2048'
```