Reduce Benchmarks
March 23, 2026
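Horizontal sum and row-norm benchmarks comparing NumKong against Polars, ndarray, NumPy, and serial scalar baselines. Throughput is reported in GB/s (higher is better), and the Precision column reads as input type → output type.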
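For orientation, the two operations and the `input → output` precision notation can be sketched in NumPy. This is an illustration rather than the benchmark code: the helper names `horizontal_sum` and `row_norms` are hypothetical, and computing each norm as the square root of a row's dot product with itself is an assumption based on the Row Norms table listing dot-product kernels.

```python
import numpy as np

def horizontal_sum(values: np.ndarray, out_dtype) -> np.generic:
    """Reduce all elements to one scalar, accumulating in out_dtype."""
    return values.sum(dtype=out_dtype)

def row_norms(matrix: np.ndarray, out_dtype) -> np.ndarray:
    """Euclidean norm of each row, taken as sqrt(dot(row, row)) (assumed definition)."""
    widened = matrix.astype(out_dtype)
    return np.sqrt(np.einsum("ij,ij->i", widened, widened))

# u8 -> u64: the inputs are 8-bit, the accumulator is 64-bit, so the sum cannot wrap.
pixels = np.full(1_000, 255, dtype=np.uint8)
total = horizontal_sum(pixels, np.uint64)

# f32 -> f64: the inputs are single precision, the norms are double precision.
rows = np.array([[3.0, 4.0], [6.0, 8.0]], dtype=np.float32)
norms = row_norms(rows, np.float64)

print(total, norms)
```

A widening pair such as `u8 → u64` or `f32 → f64` reads the same number of input bytes as its non-widening counterpart but does more work per element, which is worth keeping in mind when comparing rows with different precisions.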
Rust
| Library | Precision | GB/s |
|---|---|---|
| Sum | | |
| polars::ChunkedArray::sum | f64 → f64 | 113.57 |
| polars::ChunkedArray::sum | f32 → f32 | 110.70 |
| ndarray::sum | f64 → f64 | 99.49 |
| ndarray::sum | f32 → f32 | 49.83 |
| numkong::reduce_moments | bf16 → f64 | 33.17 |
| numkong::reduce_moments | u8 → u64 | 24.24 |
| serial code | u8 → u64 | 22.96 |
| numkong::reduce_moments | f64 → f64 | 18.26 |
| numkong::reduce_moments | f32 → f64 | 10.31 |
| serial code | f32 → f32 | 8.50 |
| Row Norms | | |
| ndarray::dot | f64 → f64 | 89.72 |
| ndarray::dot | f32 → f32 | 53.24 |
| numkong::Dot | bf16 → f32 | 30.64 |
| numkong::Dot | f64 → f64 | 23.44 |
| serial code | f64 → f64 | 17.95 |
| numkong::Dot | f16 → f32 | 12.93 |
| numkong::Dot | f32 → f32 | 10.60 |
| serial code | f32 → f32 | 9.20 |
Python
| Library | Precision | GB/s |
|---|---|---|
| Sum | | |
| numpy.sum | f64 → f64 | 61.26 |
| numpy.sum | f32 → f32 | 33.92 |
| numkong.sum | u8 → u8 | 21.78 |
| numkong.sum | i8 → i8 | 21.40 |
| numkong.sum | f64 → f64 | 16.34 |
| numkong.sum | f32 → f32 | 9.49 |
| numpy.sum | u8 → u8 | 7.01 |
| numpy.sum | i8 → i8 | 6.73 |
| Norm | | |
| numpy.linalg.norm | f64 → f64 | 30.26 |
| numpy.linalg.norm | f32 → f64 | 20.15 |
| numkong.norm | f64 → f64 | 17.44 |
| numkong.norm | f32 → f64 | 15.10 |
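The GB/s column reads naturally as input bytes processed per second. The following is a minimal sketch of that calculation, assuming throughput is the input size divided by the best wall-clock time over a few runs. `throughput_gbs` is a hypothetical helper, and the benchmark script itself may count bytes or aggregate timings differently.

```python
import time
import numpy as np

def throughput_gbs(func, data: np.ndarray, repeats: int = 5) -> float:
    """Best-of-N throughput in GB/s (1 GB = 1e9 bytes), counting input bytes only."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer on very small inputs.
    best = max(best, 1e-12)
    return data.nbytes / best / 1e9

# 1M elements, the default size quoted for the Rust benchmark; assumed here for Python.
data = np.ones(1_000_000, dtype=np.float32)
print(f"numpy.sum f32:         {throughput_gbs(np.sum, data):.2f} GB/s")
print(f"numpy.linalg.norm f32: {throughput_gbs(np.linalg.norm, data):.2f} GB/s")
```

Figures from a sketch like this depend on the machine, the array size relative to cache, and timer resolution, so they are not directly comparable to the tables above.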
Run It
Rust
# Default 1M-element tensors
cargo bench --bench bench_reduce --features bench_reduce
# Smaller 10K-element tensors
NUMWARS_DIMS=10000 \
cargo bench --bench bench_reduce --features bench_reduce
# Focus on specific operations (the filter is a |-separated pattern)
NUMWARS_FILTER="reduce/sum|reduce/row_norms" \
cargo bench --bench bench_reduce --features bench_reduce
Python
python reduce/bench.py