Each Benchmarks

March 23, 2026

Elementwise sum and scale bandwidth benchmarks comparing NumKong against serial scalar baselines, ndarray, and nalgebra in Rust, and against NumPy in Python. Throughput is reported in GB/s; higher is better.
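
To make the "serial code" rows and the GB/s column concrete, here is a minimal sketch of what a scalar baseline and a bandwidth calculation might look like. It is not taken from the benchmark source: the names `serial_sum`, `serial_scale`, and `gb_per_s` are hypothetical, scale is assumed to mean multiplication by a constant, and the byte count (every input read and every output write) is an assumed convention that may differ from the one used for the tables below.

```rust
use std::hint::black_box;
use std::mem::size_of;
use std::time::Instant;

/// Serial elementwise sum: out[i] = a[i] + b[i]. (Hypothetical baseline.)
fn serial_sum(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Serial elementwise scale: out[i] = alpha * a[i].
/// Assumption: the benchmarked kernels may use a different definition.
fn serial_scale(a: &[f32], alpha: f32, out: &mut [f32]) {
    assert_eq!(a.len(), out.len());
    for (o, &x) in out.iter_mut().zip(a) {
        *o = alpha * x;
    }
}

/// Bandwidth in GB/s, taking 1 GB = 1e9 bytes.
fn gb_per_s(bytes_moved: usize, seconds: f64) -> f64 {
    bytes_moved as f64 / seconds / 1e9
}

fn main() {
    let n = 1 << 20; // about one million elements
    let a = vec![1.0f32; n];
    let b = vec![2.0f32; n];
    let mut out = vec![0.0f32; n];

    // Sum: assumed to move two input reads and one output write per element.
    let start = Instant::now();
    serial_sum(&a, &b, &mut out);
    let secs = start.elapsed().as_secs_f64();
    black_box(&out); // keep the result observable to the optimizer
    println!("sum:   {:.2} GB/s", gb_per_s(3 * n * size_of::<f32>(), secs));

    // Scale: assumed to move one input read and one output write per element.
    let start = Instant::now();
    serial_scale(&a, 0.5, &mut out);
    let secs = start.elapsed().as_secs_f64();
    black_box(&out);
    println!("scale: {:.2} GB/s", gb_per_s(2 * n * size_of::<f32>(), secs));
}
```

A single timed pass like this is noisy and includes cold-cache effects, so its output will not match the figures below; a benchmark harness that repeats and averages the measurement gives more stable numbers.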

Rust

| Library | Precision | GB/s |
| --- | --- | ---: |
| **Sum** | | |
| `numkong::EachSum` | f32 → f32 | 97.55 |
| `nalgebra::add` | f32 → f32 | 95.31 |
| `ndarray::add` | f32 → f32 | 94.84 |
| serial code | f32 → f32 | 94.06 |
| serial code | f64 → f64 | 85.48 |
| `ndarray::add` | f64 → f64 | 84.91 |
| `nalgebra::add` | f64 → f64 | 84.55 |
| `numkong::EachSum` | f64 → f64 | 82.77 |
| `numkong::EachSum` | f16 → f16 | 96.56 |
| `numkong::EachSum` | bf16 → bf16 | 17.73 |
| `numkong::EachSum` | i8 → i8 | 111.47 |
| serial code | i8 → i8 | 110.81 |
| **Scale** | | |
| serial code | f32 → f32 | 82.22 |
| `ndarray::scale` | f32 → f32 | 81.75 |
| `numkong::EachScale` | f32 → f32 | 66.56 |
| `nalgebra::scale` | f32 → f32 | 39.52 |
| serial code | f64 → f64 | 72.46 |
| `ndarray::scale` | f64 → f64 | 72.39 |
| `numkong::EachScale` | f64 → f64 | 66.70 |
| `nalgebra::scale` | f64 → f64 | 38.58 |
| `numkong::EachScale` | f16 → f16 | 66.23 |
| `numkong::EachScale` | bf16 → bf16 | 33.19 |
| serial code | i8 → i8 | 89.21 |
| `numkong::EachScale` | i8 → i8 | 26.43 |

Python

| Library | Precision | GB/s |
| --- | --- | ---: |
| **Sum** | | |
| `numpy.add` | i8 → i8 | 143.56 |
| `numkong.add` | i8 → i8 | 123.77 |
| `numkong.add` | f32 → f32 | 118.39 |
| `numpy.add` | f32 → f32 | 115.32 |
| `numpy.add` | f64 → f64 | 114.37 |
| `numkong.add` | f16 → f16 | 107.29 |
| `numkong.add` | f64 → f64 | 100.01 |
| `numkong.add` | bf16 → bf16 | 73.27 |
| `numpy.add` | f16 → f16 | 4.08 |

Run It

Rust

# Default 1M-element tensors
cargo bench --bench bench_each --features bench_each

# Restrict the run to the sum and scale benchmarks
NUMWARS_FILTER="each/sum|each/scale" \
cargo bench --bench bench_each --features bench_each

Python

# Default 1M-element tensors, add on float32
python each/bench.py --filter 'add/float32'