Reduce Benchmarks

March 23, 2026

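Horizontal sum and row-norm benchmarks comparing NumKong against Polars, ndarray, NumPy, and serial scalar baselines. Throughput is reported in GB/s, and precision is written as input type → result type.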

Rust

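| Operation | Library | Precision | GB/s |
| --- | --- | --- | --- |
| Sum | polars::ChunkedArray::sum | f64 → f64 | 113.57 |
| Sum | polars::ChunkedArray::sum | f32 → f32 | 110.70 |
| Sum | ndarray::sum | f64 → f64 | 99.49 |
| Sum | ndarray::sum | f32 → f32 | 49.83 |
| Sum | numkong::reduce_moments | bf16 → f64 | 33.17 |
| Sum | numkong::reduce_moments | u8 → u64 | 24.24 |
| Sum | serial code | u8 → u64 | 22.96 |
| Sum | numkong::reduce_moments | f64 → f64 | 18.26 |
| Sum | numkong::reduce_moments | f32 → f64 | 10.31 |
| Sum | serial code | f32 → f32 | 8.50 |
| Row Norms | ndarray::dot | f64 → f64 | 89.72 |
| Row Norms | ndarray::dot | f32 → f32 | 53.24 |
| Row Norms | numkong::Dot | bf16 → f32 | 30.64 |
| Row Norms | numkong::Dot | f64 → f64 | 23.44 |
| Row Norms | serial code | f64 → f64 | 17.95 |
| Row Norms | numkong::Dot | f16 → f32 | 12.93 |
| Row Norms | numkong::Dot | f32 → f32 | 10.60 |
| Row Norms | serial code | f32 → f32 | 9.20 |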
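
To make the precision notation and the two operations concrete, here is a minimal scalar sketch. It is not the repository's benchmark code. The function names are hypothetical, and computing row norms as the square root of a row's dot product with itself is an assumption based on the dot kernels listed for the row-norm results.

```rust
/// Sum u8 inputs into a u64 accumulator (the "u8 → u64" case).
fn sum_u8_to_u64(values: &[u8]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}

/// Sum f32 inputs while widening each one to f64 (the "f32 → f64" case).
fn sum_f32_to_f64(values: &[f32]) -> f64 {
    values.iter().map(|&v| f64::from(v)).sum()
}

/// Euclidean norm of every row of a row-major matrix: sqrt(dot(row, row)).
/// Trailing elements that do not fill a whole row are ignored.
fn row_norms(matrix: &[f32], cols: usize) -> Vec<f32> {
    assert!(cols > 0, "cols must be non-zero");
    matrix
        .chunks_exact(cols)
        .map(|row| row.iter().map(|&x| x * x).sum::<f32>().sqrt())
        .collect()
}

fn main() {
    let bytes = vec![255u8; 1_000];
    println!("u8 -> u64 sum: {}", sum_u8_to_u64(&bytes)); // prints 255000

    let halves = [0.5f32, 0.25];
    println!("f32 -> f64 sum: {}", sum_f32_to_f64(&halves)); // prints 0.75

    // Two rows of two columns: (3, 4) and (6, 8).
    let matrix = [3.0f32, 4.0, 6.0, 8.0];
    println!("row norms: {:?}", row_norms(&matrix, 2)); // prints [5.0, 10.0]
}
```

Widening the accumulator, as in the u8 → u64 case, avoids overflow on long inputs. The trade-off is that the result type is larger than the input type.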

Python

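| Operation | Library | Precision | GB/s |
| --- | --- | --- | --- |
| Sum | numpy.sum | f64 → f64 | 61.26 |
| Sum | numpy.sum | f32 → f32 | 33.92 |
| Sum | numkong.sum | u8 → u8 | 21.78 |
| Sum | numkong.sum | i8 → i8 | 21.40 |
| Sum | numkong.sum | f64 → f64 | 16.34 |
| Sum | numkong.sum | f32 → f32 | 9.49 |
| Sum | numpy.sum | u8 → u8 | 7.01 |
| Sum | numpy.sum | i8 → i8 | 6.73 |
| Norm | numpy.linalg.norm | f64 → f64 | 30.26 |
| Norm | numpy.linalg.norm | f32 → f64 | 20.15 |
| Norm | numkong.norm | f64 → f64 | 17.44 |
| Norm | numkong.norm | f32 → f64 | 15.10 |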

Run It

Rust

# Default 1M-element tensors
cargo bench --bench bench_reduce --features bench_reduce

# Smaller 10K-element tensors
NUMWARS_DIMS=10000 \
cargo bench --bench bench_reduce --features bench_reduce

# Restrict the run to matching benchmarks (here, sum and row norms)
NUMWARS_FILTER="reduce/sum|reduce/row_norms" \
cargo bench --bench bench_reduce --features bench_reduce
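
For readers who want to sanity-check a GB/s figure by hand, the sketch below times a plain f32 sum and converts the result to throughput. It is an illustration, not the benchmark harness. It assumes 1 GB = 10^9 bytes, counts only input bytes read, and treats `NUMWARS_DIMS` as an element count with a 1M default. The helper name `gigabytes_per_second` is hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Throughput in GB/s, assuming 1 GB = 10^9 bytes and counting input bytes only.
fn gigabytes_per_second(bytes: usize, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64() / 1e9
}

fn main() {
    // Element count: NUMWARS_DIMS if set and parseable, otherwise 1M elements.
    let n: usize = std::env::var("NUMWARS_DIMS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1_000_000);
    let data = vec![1.0f32; n];

    // black_box keeps the optimizer from folding the sum away.
    let start = Instant::now();
    let total: f32 = black_box(&data).iter().sum();
    let elapsed = start.elapsed();
    black_box(total);

    let bytes = n * std::mem::size_of::<f32>();
    println!(
        "sum = {total}, {:.2} GB/s",
        gigabytes_per_second(bytes, elapsed)
    );
}
```

A single timed pass like this is noisy. A real harness repeats the measurement and reports a statistic over many runs, so treat the printed number as a rough order-of-magnitude check only.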

Python

python reduce/bench.py