Each Benchmarks

March 23, 2026

Elementwise sum and scale bandwidth benchmarks comparing NumKong against serial-code baselines, ndarray, and nalgebra in Rust, and against NumPy in Python.
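For readers unfamiliar with the two operation families, the following is a minimal reference sketch of what "elementwise sum" and "elementwise scale" compute. It uses only the Python standard library and is not NumKong's implementation. The function names `each_sum` and `each_scale` are hypothetical, and the scale form `alpha * a[i]` is an assumption; the benchmarked kernels may also apply an offset or use a different signature.

```python
from array import array


def each_sum(a, b):
    """Elementwise sum: out[i] = a[i] + b[i]."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return array(a.typecode, (x + y for x, y in zip(a, b)))


def each_scale(a, alpha):
    """Elementwise scale: out[i] = alpha * a[i] (assumed form, no offset term)."""
    return array(a.typecode, (alpha * x for x in a))


# Typecode "f" stores single-precision floats, matching the f32 rows.
a = array("f", [1.0, 2.0, 3.0])
b = array("f", [0.5, 0.5, 0.5])

total = each_sum(a, b)
doubled = each_scale(a, 2.0)
print(list(total), list(doubled))
```

Both operations do very little arithmetic per element and touch every element exactly once, which is presumably why the results are reported as bandwidth (GB/s) rather than as operations per second.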

Rust

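| Operation | Library | Precision | GB/s |
| --- | --- | --- | ---: |
| Sum | numkong::EachSum | f32 → f32 | 97.55 |
| Sum | nalgebra::add | f32 → f32 | 95.31 |
| Sum | ndarray::add | f32 → f32 | 94.84 |
| Sum | serial code | f32 → f32 | 94.06 |
| Sum | serial code | f64 → f64 | 85.48 |
| Sum | ndarray::add | f64 → f64 | 84.91 |
| Sum | nalgebra::add | f64 → f64 | 84.55 |
| Sum | numkong::EachSum | f64 → f64 | 82.77 |
| Sum | numkong::EachSum | f16 → f16 | 96.56 |
| Sum | numkong::EachSum | bf16 → bf16 | 17.73 |
| Sum | numkong::EachSum | i8 → i8 | 111.47 |
| Sum | serial code | i8 → i8 | 110.81 |
| Scale | serial code | f32 → f32 | 82.22 |
| Scale | ndarray::scale | f32 → f32 | 81.75 |
| Scale | numkong::EachScale | f32 → f32 | 66.56 |
| Scale | nalgebra::scale | f32 → f32 | 39.52 |
| Scale | serial code | f64 → f64 | 72.46 |
| Scale | ndarray::scale | f64 → f64 | 72.39 |
| Scale | numkong::EachScale | f64 → f64 | 66.70 |
| Scale | nalgebra::scale | f64 → f64 | 38.58 |
| Scale | numkong::EachScale | f16 → f16 | 66.23 |
| Scale | numkong::EachScale | bf16 → bf16 | 33.19 |
| Scale | serial code | i8 → i8 | 89.21 |
| Scale | numkong::EachScale | i8 → i8 | 26.43 |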

Python

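| Operation | Library | Precision | GB/s |
| --- | --- | --- | ---: |
| Sum | numpy.add | i8 → i8 | 143.56 |
| Sum | numkong.add | i8 → i8 | 123.77 |
| Sum | numkong.add | f32 → f32 | 118.39 |
| Sum | numpy.add | f32 → f32 | 115.32 |
| Sum | numpy.add | f64 → f64 | 114.37 |
| Sum | numkong.add | f16 → f16 | 107.29 |
| Sum | numkong.add | f64 → f64 | 100.01 |
| Sum | numkong.add | bf16 → bf16 | 73.27 |
| Sum | numpy.add | f16 → f16 | 4.08 |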
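The source does not state how the GB/s figures are derived. As a rough sketch, a bandwidth number for an elementwise add can be obtained by dividing the bytes moved by the elapsed time. The helpers `bytes_moved` and `gigabytes_per_second` below are hypothetical, and two accounting choices are assumptions: that both inputs and the output are counted (three arrays), and that 1 GB means 10^9 bytes. The actual benchmark harness may count traffic differently.

```python
import operator
import time
from array import array


def bytes_moved(num_elements, itemsize, arrays_touched):
    """Total bytes read and written, assuming every array is traversed once."""
    return num_elements * itemsize * arrays_touched


def gigabytes_per_second(total_bytes, seconds):
    """Convert a byte count and a duration into GB/s (assuming 1 GB = 1e9 bytes)."""
    return total_bytes / seconds / 1e9


n = 1_000_000  # same order as the default 1M-element tensors
a = array("f", [1.0]) * n
b = array("f", [2.0]) * n

start = time.perf_counter()
out = array("f", map(operator.add, a, b))  # interpreted loop, not a SIMD kernel
elapsed = time.perf_counter() - start

# Assumption: an add reads two inputs and writes one output.
traffic = bytes_moved(n, a.itemsize, arrays_touched=3)
print(f"{gigabytes_per_second(traffic, elapsed):.3f} GB/s (pure-Python loop)")
```

A pure-Python loop like this will land orders of magnitude below the figures in the table; it only illustrates the arithmetic behind a bandwidth number, not a competitive baseline.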

Run It

Rust

# Default 1M-element tensors
cargo bench --bench bench_each --features bench_each

# Focus on one operation family
NUMWARS_FILTER="each/sum|each/scale" \
cargo bench --bench bench_each --features bench_each
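The `|` in the `NUMWARS_FILTER` value suggests an alternation pattern. Assuming the filter is treated as a regular expression searched against benchmark names (the source does not confirm this), the selection would behave roughly like the sketch below. The benchmark names in the list are hypothetical placeholders, not the real identifiers.

```python
import re

# Hypothetical benchmark names, for illustration only.
names = ["each/sum/f32", "each/scale/f32", "each/fma/f32", "dots/f32"]

# Same value as passed through NUMWARS_FILTER above.
pattern = re.compile("each/sum|each/scale")

# Keep every name in which either alternative occurs.
selected = [name for name in names if pattern.search(name)]
print(selected)  # ['each/sum/f32', 'each/scale/f32']
```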

Python

# Default 1M-element tensors, add on float32
python each/bench.py --filter 'add/float32'