Each Benchmarks
March 23, 2026
Elementwise sum and scale bandwidth benchmarks comparing NumKong against scalar baselines, ndarray, and nalgebra in Rust, and against NumPy in Python.
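As a rough illustration of what a GB/s figure for an elementwise sum means, the sketch below times a scalar loop and converts the elapsed time into bandwidth. The helper names `bandwidth_gbs` and `serial_sum` are hypothetical, and the byte-counting convention (two input reads plus one output write) is an assumption; the benchmark suite may count bytes differently.

```python
import time
from array import array


def bandwidth_gbs(n_elements: int, itemsize: int, seconds: float, streams: int = 3) -> float:
    """Convert an elapsed time into GB/s (1 GB = 10**9 bytes).

    `streams` is an assumption: 3 counts two input reads plus one output
    write for a sum; a scale would touch 2 streams (one read, one write).
    """
    return n_elements * itemsize * streams / seconds / 1e9


def serial_sum(a, b, out):
    # Scalar baseline: out[i] = a[i] + b[i]
    for i in range(len(out)):
        out[i] = a[i] + b[i]


n = 1_000_000  # same size as the default 1M-element tensors
a = array("f", [1.0]) * n  # "f" is a 4-byte float, comparable to f32
b = array("f", [2.0]) * n
out = array("f", [0.0]) * n

start = time.perf_counter()
serial_sum(a, b, out)
elapsed = time.perf_counter() - start

print(f"serial sum: {bandwidth_gbs(n, a.itemsize, elapsed):.3f} GB/s")
```

An interpreted loop like this lands far below the numbers in the tables; it only shows how the accounting works, not how the compiled baselines are implemented.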
Rust
| Library | Precision | GB/s |
|---|---|---|
| Sum | | |
| numkong::EachSum | f32 → f32 | 97.55 |
| nalgebra::add | f32 → f32 | 95.31 |
| ndarray::add | f32 → f32 | 94.84 |
| serial code | f32 → f32 | 94.06 |
| serial code | f64 → f64 | 85.48 |
| ndarray::add | f64 → f64 | 84.91 |
| nalgebra::add | f64 → f64 | 84.55 |
| numkong::EachSum | f64 → f64 | 82.77 |
| numkong::EachSum | f16 → f16 | 96.56 |
| numkong::EachSum | bf16 → bf16 | 17.73 |
| numkong::EachSum | i8 → i8 | 111.47 |
| serial code | i8 → i8 | 110.81 |
| Scale | | |
| serial code | f32 → f32 | 82.22 |
| ndarray::scale | f32 → f32 | 81.75 |
| numkong::EachScale | f32 → f32 | 66.56 |
| nalgebra::scale | f32 → f32 | 39.52 |
| serial code | f64 → f64 | 72.46 |
| ndarray::scale | f64 → f64 | 72.39 |
| numkong::EachScale | f64 → f64 | 66.70 |
| nalgebra::scale | f64 → f64 | 38.58 |
| numkong::EachScale | f16 → f16 | 66.23 |
| numkong::EachScale | bf16 → bf16 | 33.19 |
| serial code | i8 → i8 | 89.21 |
| numkong::EachScale | i8 → i8 | 26.43 |
Python
| Library | Precision | GB/s |
|---|---|---|
| Sum | | |
| numpy.add | i8 → i8 | 143.56 |
| numkong.add | i8 → i8 | 123.77 |
| numkong.add | f32 → f32 | 118.39 |
| numpy.add | f32 → f32 | 115.32 |
| numpy.add | f64 → f64 | 114.37 |
| numkong.add | f16 → f16 | 107.29 |
| numkong.add | f64 → f64 | 100.01 |
| numkong.add | bf16 → bf16 | 73.27 |
| numpy.add | f16 → f16 | 4.08 |
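To make the Python rows easier to compare, the short sketch below divides the `numkong.add` throughput by the `numpy.add` throughput for each precision that both libraries report. The numbers are copied from the table above; the dictionary names are only illustrative, and bf16 is left out because the table has no `numpy.add` row for it.

```python
# GB/s figures copied from the Python "Sum" table
numkong_add = {"i8": 123.77, "f32": 118.39, "f64": 100.01, "f16": 107.29}
numpy_add = {"i8": 143.56, "f32": 115.32, "f64": 114.37, "f16": 4.08}

# Ratio above 1.0 means numkong.add moved more bytes per second
ratios = {dtype: numkong_add[dtype] / numpy_add[dtype] for dtype in numpy_add}

for dtype, ratio in ratios.items():
    print(f"{dtype}: numkong.add / numpy.add = {ratio:.2f}x")
```

The f16 row stands out: the two libraries are within roughly 15% of each other on i8, f32, and f64, while the f16 ratio is more than an order of magnitude.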
Run It
Rust
# Default 1M-element tensors
cargo bench --bench bench_each --features bench_each
# Focus on one operation family
NUMWARS_FILTER="each/sum|each/scale" \
cargo bench --bench bench_each --features bench_each
Python
# Default 1M-element tensors, add on float32
python each/bench.py --filter 'add/float32'