Horizontal Reductions in NumKong

April 2, 2026

NumKong implements single-pass horizontal reductions over dense vectors: statistical moments (sum and sum-of-squares) and extrema (min and max, with argmin and argmax). Both reductions traverse the input once and produce scalar outputs, using compensated arithmetic for numerical stability. This is the only module with full stride support: stride_bytes controls the byte distance between consecutive logical elements, enabling column extraction from row-major matrices and strided array views without copying. The reductions are used internally by packing routines for norm precomputation and by distance kernels for normalization.

Moments:

$$\text{sum} = \sum a_i, \quad \text{sumsq} = \sum a_i^2$$

Min-max:

$$\text{min} = \min_i a_i, \quad \text{argmin} = \arg\min_i a_i$$

Reformulating as Python pseudocode:

import numpy as np

def moments(a: np.ndarray) -> tuple[float, float]:
    return np.sum(a), np.sum(a ** 2)

def minmax(a: np.ndarray) -> tuple[float, int, float, int]:
    return np.min(a), np.argmin(a), np.max(a), np.argmax(a)

Input & Output Types

Float reductions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f64 | 64-bit double precision |
| f32 | f32 | 32-bit single precision |
| f16 | f32 | 16-bit half precision, widened output |
| bf16 | f32 | 16-bit brain float, widened output |

Mini-float reductions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| e4m3 | f32 | 8-bit Float8: 4 exponent, 3 mantissa bits |
| e5m2 | f32 | 8-bit Float8: 5 exponent, 2 mantissa bits |
| e2m3 | f32 | 8-bit MX format: 2 exponent, 3 mantissa bits |
| e3m2 | f32 | 8-bit MX format: 3 exponent, 2 mantissa bits |

Integer reductions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| i8 | i64 | 8-bit signed, widened to 64-bit |
| u8 | u64 | 8-bit unsigned, widened to 64-bit |
| i16 | i64 | 16-bit signed, widened to 64-bit |
| u16 | u64 | 16-bit unsigned, widened to 64-bit |
| i32 | i64 | 32-bit signed, widened to 64-bit |
| u32 | u64 | 32-bit unsigned, widened to 64-bit |
| i64 | i64 | 64-bit signed |
| u64 | u64 | 64-bit unsigned |

Sub-byte reductions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| i4 | i64 | 4-bit signed nibbles, widened to 64-bit |
| u4 | u64 | 4-bit unsigned nibbles, widened to 64-bit |
| u1 | u64 | 1-bit binary packed octets |

Optimizations

Strided Access Across Backends

Reductions accept a stride_bytes parameter specifying the byte distance between consecutive logical elements. This is the only NumKong module where loads far outnumber stores (N loads against 2-4 scalar stores), which makes arbitrary strides practical. The serial backend iterates with byte-pointer arithmetic: ptr += stride_bytes per element. NEON uses hardware de-interleaving loads (vld2q_f32, vld3q_f32, vld4q_f32) for small integer strides (2-4 elements apart), extracting column 0 from interleaved data in a single instruction. Haswell and Skylake use blend masks for small strides and the _mm256_i32gather_ps / _mm512_i32gather_pd hardware gathers for larger strides, at about 8 cycles per gather on Haswell and ~5 cycles on Skylake for 16-element gathers. RVV uses native strided loads (__riscv_vlse32_v_f32m1) that accept arbitrary byte strides directly in the load instruction, with no gather overhead and no stride-dependent branching.
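
The serial byte-pointer walk can be sketched in plain Python. This is a minimal illustration, not the library's code: strided_moments_f32 is a hypothetical helper, struct.unpack_from stands in for a pointer dereference, and accumulation happens in Python's float64 rather than the library's compensated Float32 path.

```python
import struct

def strided_moments_f32(buffer: memoryview, count: int, stride_bytes: int) -> tuple[float, float]:
    """Sum and sum-of-squares of `count` float32 values spaced `stride_bytes` apart."""
    total, total_sq = 0.0, 0.0
    offset = 0
    for _ in range(count):
        (value,) = struct.unpack_from("<f", buffer, offset)  # load one logical element
        total += value            # both accumulators are fed by the same load
        total_sq += value * value
        offset += stride_bytes    # the serial `ptr += stride_bytes` step
    return total, total_sq

# A 3x4 row-major float32 matrix holding 0..11.
rows, cols = 3, 4
matrix = struct.pack(f"<{rows * cols}f", *range(rows * cols))

# Column 1 starts 4 bytes in; consecutive rows are cols * 4 = 16 bytes apart.
column_1 = memoryview(matrix)[4:]  # a view, so no bytes are copied
print(strided_moments_f32(column_1, rows, cols * 4))  # (15.0, 107.0)
```

Column 1 holds 1, 5, and 9, so the sum is 15 and the sum of squares is 107. Passing stride_bytes = 4 with count = 12 reduces the whole contiguous buffer instead.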

Kahan-Neumaier Compensated Summation

nk_reduce_moments_f32_serial and nk_reduce_moments_f32_haswell use Neumaier's variant of Kahan summation, maintaining a running compensation term that captures rounding errors. Uncompensated sequential summation typically accumulates $O(\sqrt{n})$ ULP of error for n elements; Neumaier compensation bounds the error to $O(1)$ ULP regardless of vector length. The serial path uses Neumaier's adaptive branch: if (abs(sum) >= abs(val)) selects the larger summand first, minimizing relative error in the compensation term. SIMD backends (nk_reduce_moments_f32_haswell) carry 8 independent compensation lanes in a YMM register, computing round_error = tentative - sum; correction = (sum - (tentative - round_error)) + (val - round_error) without branches, and fold all lanes into a single scalar correction at the end.
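
Both variants can be compared side by side in a short sketch. The function names are illustrative, Python floats are Float64 rather than Float32, and a single scalar accumulator stands in for the 8 SIMD lanes.

```python
def naive_sum(values):
    """Plain left-to-right accumulation, for comparison."""
    total = 0.0
    for value in values:
        total += value
    return total

def neumaier_sum(values):
    """Serial Neumaier summation with the adaptive branch."""
    total, compensation = 0.0, 0.0
    for value in values:
        tentative = total + value
        if abs(total) >= abs(value):
            compensation += (total - tentative) + value  # low bits of value were lost
        else:
            compensation += (value - tentative) + total  # low bits of total were lost
        total = tentative
    return total + compensation

def branchless_sum(values):
    """Same idea with the branch-free error term quoted for the SIMD path."""
    total, compensation = 0.0, 0.0
    for value in values:
        tentative = total + value
        round_error = tentative - total
        correction = (total - (tentative - round_error)) + (value - round_error)
        compensation += correction
        total = tentative
    return total + compensation

data = [1.0, 1e100, 1.0, -1e100]
print(naive_sum(data), neumaier_sum(data), branchless_sum(data))  # 0.0 2.0 2.0
```

The naive loop loses both small terms when they are absorbed by 1e100, while either compensated form recovers the exact answer of 2.0.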

Fused Moments in a Single Pass

nk_reduce_moments_f32_haswell and nk_reduce_moments_f64_skylake compute sum and sum-of-squares simultaneously: one load feeds both a VADDPS (sum accumulator) and a VFMADD231PS (square accumulator). The two accumulators share the same loaded data, halving memory bandwidth compared to separate sum + norm passes. The squared norm $\|a\|^2 = \sum a_i^2$ is a self-dot-product, reused by packing routines (nk_dots_pack_f32_haswell) to precompute per-vector norms during layout transformation. For Float16/BFloat16/Float8 inputs, all backends widen to Float32 before accumulation. NEON FHM (nk_reduce_moments_e4m3_neonfhm) converts e4m3 to f16 via lookup, then uses vfmlalq_low_f16 to fuse the Float16 → Float32 widening with the FMA into the Float32 accumulator.

Integer Saturation in Sum-of-Squares

Integer moments accumulate sums in the widest available type: Int8/UInt8/Int16 inputs produce Int64/UInt64 outputs. Sums use widening addition chains: NEON uses pairwise widening (vpaddlq_s16 -> UInt32 -> UInt64 stages); Haswell biases Int8 inputs with 0x80 and uses unsigned SAD (_mm256_sad_epu8) for the sum, correcting by subtracting $128 \times \text{count}$ at the end. Sum-of-squares can overflow UInt64 when squaring large Int32 values, so backends use an explicit saturating multiply: it checks if abs(val) < $2^{32}$ (square fits in UInt64), otherwise saturates to I64_MAX. Haswell emulates UInt64 saturating add via XOR-based unsigned comparison: flip sign bits to convert unsigned overflow detection into a signed comparison, then OR with the overflow mask to produce all-ones on saturation.
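
The three integer tricks above can be emulated with Python's arbitrary-precision integers. All names here are hypothetical, 64-bit wraparound is modeled with explicit masks, and the saturating square clamps to the UInt64 maximum as an illustrative choice (the text above names I64_MAX as the clamp value).

```python
U64_MAX = (1 << 64) - 1
SIGN_BIT = 1 << 63

def to_signed64(bits: int) -> int:
    """Reinterpret a 64-bit pattern as a two's-complement signed integer."""
    return bits - (1 << 64) if bits & SIGN_BIT else bits

def saturating_square_u64(value: int) -> int:
    """Square a signed 64-bit value, clamping when the square would not fit in UInt64."""
    magnitude = abs(value)
    return magnitude * magnitude if magnitude < (1 << 32) else U64_MAX

def saturating_add_u64(a: int, b: int) -> int:
    """UInt64 saturating add that only needs a signed comparison."""
    wrapped = (a + b) & U64_MAX  # what a plain 64-bit add would produce
    # Flipping the sign bit maps unsigned ordering onto signed ordering,
    # so "wrapped < a" (unsigned) becomes a signed compare.
    overflowed = to_signed64(wrapped ^ SIGN_BIT) < to_signed64(a ^ SIGN_BIT)
    mask = U64_MAX if overflowed else 0  # all-ones on overflow
    return wrapped | mask

def biased_sum_i8(values: list[int]) -> int:
    """Sum Int8 values as unsigned bytes biased by 0x80, then remove the bias."""
    unsigned_total = sum((v & 0xFF) ^ 0x80 for v in values)  # each term equals v + 128
    return unsigned_total - 128 * len(values)

print(saturating_add_u64(U64_MAX, 1) == U64_MAX)  # True
print(biased_sum_i8([-128, -1, 0, 127]))          # -2
```

The bias turns every Int8 into a value in 0..255, which is what an unsigned sum-of-absolute-differences against zero can accumulate directly.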

Recursive Blocking for Counter Overflow

All SIMD backends use loop iteration counters narrower than nk_size_t to reduce register pressure: UInt8 for Int8 minmax lanes, UInt16 for Float32 moments lanes. When count exceeds the counter's range times the lane count (e.g., Haswell Float32: $256 \times 8 = 2048$ elements for UInt8 counters), the reduction splits recursively: process the left half, process the right half, combine results with saturating arithmetic. Block caps vary by backend and element width: Haswell Int8 minmax uses UInt8 loop counters (cap = $256 \times 32 = 8192$); Skylake Float32 moments uses UInt16 counters (cap = $65536 \times 16 = 1048576$). The recursive split is invisible to the caller: the public API accepts arbitrary count values, and internal dispatch chooses between single-pass and recursive based on the cap.
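
A rough sketch of the dispatch, assuming a cap of 2048 elements. Here block_sum is a stand-in for a single-pass SIMD kernel, and observed_blocks exists only so the block sizes can be inspected; neither name comes from the library.

```python
observed_blocks = []  # sizes handed to the single-pass kernel, for inspection only

def block_sum(values, start, count):
    """Stand-in for a single-pass kernel that is only valid while count fits the lane counters."""
    observed_blocks.append(count)
    return sum(values[start:start + count])

def reduce_sum(values, start=0, count=None, block_cap=2048):
    """Sum `count` elements, halving the range until every piece fits under `block_cap`."""
    if count is None:
        count = len(values) - start
    if count <= block_cap:
        return block_sum(values, start, count)
    left_count = count // 2
    left = reduce_sum(values, start, left_count, block_cap)
    right = reduce_sum(values, start + left_count, count - left_count, block_cap)
    return left + right  # integer kernels would combine with saturating arithmetic here

values = list(range(10_000))
print(reduce_sum(values), max(observed_blocks))  # 49995000 1250
```

For 10,000 elements the range is halved three times, so the kernel only ever sees eight blocks of 1,250 elements.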

Index Tracking at Different Register Scales

Argmin/argmax requires tracking both values and their positions — but indices need wider storage than values (UInt64 for arbitrary-length vectors, vs UInt8/UInt16/Float32 for data). Haswell Int8 minmax tracks iteration counters in UInt8 lanes (same width as data) — after the loop, the winning lane's counter is multiplied by the lane count and added to the lane index within the register to reconstruct the global position. RVV uses u64m2 registers (LMUL=2) for indices alongside f32m1 for values — the wider index register holds one 64-bit position per Float32 lane, enabling direct merge without post-loop reconstruction. NEON uses same-width counters (u8x16 for i8x16 minmax), limiting block size to $256 \times 16 = 4096$ elements before recursive splitting.
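
The post-loop reconstruction for same-width counters can be sketched as follows. This is an emulation with Python lists standing in for SIMD lanes; argmin_lanes and its 4-lane default are illustrative assumptions.

```python
def argmin_lanes(values, lanes=4):
    """Lane-wise running minimum with per-lane iteration counters.

    Assumes len(values) is a non-zero multiple of `lanes`.
    """
    best_values = list(values[:lanes])  # one running minimum per lane
    best_iterations = [0] * lanes       # narrow counters: which iteration won in each lane
    for iteration in range(1, len(values) // lanes):
        chunk = values[iteration * lanes:(iteration + 1) * lanes]
        for lane in range(lanes):
            if chunk[lane] < best_values[lane]:  # strict: keeps the first occurrence per lane
                best_values[lane] = chunk[lane]
                best_iterations[lane] = iteration
    # Post-loop: global position = iteration counter * lane count + lane index.
    candidates = [
        (best_values[lane], best_iterations[lane] * lanes + lane)
        for lane in range(lanes)
    ]
    return min(candidates)  # smallest value; ties resolve to the smallest global index

print(argmin_lanes([5, 3, 9, 7, 1, 8, 1, 6]))  # (1, 4)
```

The value 1 appears at positions 4 and 6, in lanes 0 and 2. Both lanes record iteration 1, and the reconstruction yields 1 * 4 + 0 = 4 and 1 * 4 + 2 = 6, so the earlier position wins.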

NaN-Aware Extrema Tracking

nk_reduce_minmax_f32_haswell and nk_reduce_minmax_f64_skylake use IEEE ordered-quiet comparisons (_CMP_LT_OQ, _CMP_GT_OQ), which return false when either operand is NaN, so NaN inputs never replace the running extremum. Tail elements beyond the vector-aligned portion are masked by loading into a NaN-filled register via _mm256_mask_loadu_ps(nan_vec, mask, ptr): NaN tails cannot win any comparison, eliminating out-of-bounds artifacts. If all inputs are NaN, the sentinels remain (min = F32_MAX, max = F32_MIN) and indices are set to NK_SIZE_MAX, signaling no valid extremum. The final horizontal reduction across lanes uses pairwise VSHUFPS + VMINPS chains: 3 shuffles for a 256-bit register, $O(\log_2 w)$ for width $w$.
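
Python's < and > on floats are also ordered comparisons, so the scalar behavior is easy to reproduce. In this sketch NO_INDEX is a hypothetical stand-in for NK_SIZE_MAX, and the Float64 limits from sys.float_info replace the Float32 sentinels.

```python
import sys

NO_INDEX = 2**64 - 1  # hypothetical stand-in for NK_SIZE_MAX

def minmax_skip_nan(values):
    """Return (min, argmin, max, argmax); NaN never wins because it compares false."""
    min_value, min_index = sys.float_info.max, NO_INDEX
    max_value, max_index = -sys.float_info.max, NO_INDEX
    for index, value in enumerate(values):
        if value < min_value:  # False whenever value is NaN
            min_value, min_index = value, index
        if value > max_value:  # False whenever value is NaN
            max_value, max_index = value, index
    return min_value, min_index, max_value, max_index

nan = float("nan")
print(minmax_skip_nan([nan, 2.0, nan, -1.0, 5.0]))  # (-1.0, 3, 5.0, 4)
print(minmax_skip_nan([nan, nan])[1] == NO_INDEX)   # True
```

Padding a partial tail with NaN works for the same reason: the padded slots go through the same comparisons and can never be selected.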

Performance

The following performance tables are produced by manually re-running the bundled nk_test and nk_bench tools to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_DENSE_DIMENSIONS environment variable and set to 256, 1024, and 4096 elements. Throughput is measured in GB/s, as the number of input bytes processed per second. Accuracy is reported as mean ULP (units in last place) unless noted otherwise: the average number of representable floating-point values between the result and the exact answer. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
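
For readers who want to reproduce the two metrics, here is one way they could be computed. This is a sketch under assumptions, not the nk_test or nk_bench implementation: it measures ULP distance on Float64 bit patterns, handles only finite values, and takes 1 GB to be 10^9 bytes.

```python
import struct

def ordered_bits_f64(x: float) -> int:
    """Map a finite float64 to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))  # raw IEEE 754 bits as a signed int64
    # Negative floats are sign-magnitude, so mirror their magnitude below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance_f64(result: float, exact: float) -> int:
    """Number of representable float64 steps between two finite values."""
    return abs(ordered_bits_f64(result) - ordered_bits_f64(exact))

def mean_ulp(results, exacts) -> float:
    """Average ULP distance over paired results and reference values."""
    distances = [ulp_distance_f64(r, e) for r, e in zip(results, exacts)]
    return sum(distances) / len(distances)

def throughput_gb_per_s(input_bytes: int, seconds: float) -> float:
    """Input bytes processed per second, in units of 10^9 bytes."""
    return input_bytes / seconds / 1e9

print(ulp_distance_f64(1.0, 1.0 + 2.0**-52))  # 1
print(mean_ulp([1.0, 2.0], [1.0 + 2.0**-52, 2.0]))  # 0.5
```

A mean of 0.5 ULP therefore means that, on average, the result lands half a representable step away from the reference.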

Intel Sapphire Rapids

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64** | | | |
| nk_reduce_moments_f64_serial | 1.47 GB/s, 0 ULP | 1.73 GB/s, 0 ULP | 1.95 GB/s, 0 ULP |
| nk_reduce_minmax_f64_serial | 6.59 GB/s, 0 ULP | 5.95 GB/s, 0 ULP | 5.82 GB/s, 0 ULP |
| nk_reduce_moments_f64_haswell | 10.8 GB/s, 0.1 ULP | 9.18 GB/s, 0 ULP | 6.05 GB/s, 0 ULP |
| nk_reduce_minmax_f64_haswell | 8.11 GB/s, 0 ULP | 9.45 GB/s, 0 ULP | 6.59 GB/s, 0 ULP |
| nk_reduce_moments_f64_skylake | 14.7 GB/s, 0.3 ULP | 13.9 GB/s, 0.1 ULP | 11.4 GB/s, 0 ULP |
| nk_reduce_minmax_f64_skylake | 9.02 GB/s, 0 ULP | 18.3 GB/s, 0 ULP | 9.93 GB/s, 0 ULP |
| **f32** | | | |
| nk_reduce_moments_f32_serial | 0.458 GB/s, 0 ULP | 0.437 GB/s, 0 ULP | 0.449 GB/s, 0 ULP |
| nk_reduce_minmax_f32_serial | 3.35 GB/s, 0 ULP | 3.04 GB/s, 0 ULP | 3.27 GB/s, 0 ULP |
| nk_reduce_moments_f32_haswell | 18.4 GB/s, 0.8 ULP | 17.8 GB/s, 4.2 ULP | 11.7 GB/s, 7.7 ULP |
| nk_reduce_minmax_f32_haswell | 8.18 GB/s, 0 ULP | 8.92 GB/s, 0 ULP | 8.24 GB/s, 0 ULP |
| nk_reduce_moments_f32_skylake | 20.7 GB/s, 0.4 ULP | 20.3 GB/s, 3.1 ULP | 17.1 GB/s, 8.8 ULP |
| nk_reduce_minmax_f32_skylake | 7.35 GB/s, 0 ULP | 15.9 GB/s, 0 ULP | 21.8 GB/s, 0 ULP |
| **bf16** | | | |
| nk_reduce_moments_bf16_serial | 0.208 GB/s, 0 ULP | 0.245 GB/s, 0 ULP | 0.239 GB/s, 0 ULP |
| nk_reduce_minmax_bf16_serial | 0.935 GB/s, 0 ULP | 0.984 GB/s, 0 ULP | 1.00 GB/s, 0 ULP |
| nk_reduce_moments_bf16_haswell | 11.4 GB/s, 0 ULP | 12.2 GB/s, 0 ULP | 10.8 GB/s, 1.6 ULP |
| nk_reduce_minmax_bf16_haswell | 4.98 GB/s, 0 ULP | 7.54 GB/s, 0 ULP | 9.30 GB/s, 0 ULP |
| nk_reduce_moments_bf16_skylake | 18.2 GB/s, 0 ULP | 27.0 GB/s, 0 ULP | 17.9 GB/s, 0.7 ULP |
| nk_reduce_minmax_bf16_skylake | 6.53 GB/s, 0 ULP | 18.2 GB/s, 0 ULP | 13.7 GB/s, 0 ULP |
| nk_reduce_moments_bf16_genoa | 18.1 GB/s, 0 ULP | 20.5 GB/s, 0 ULP | 19.3 GB/s, 0.8 ULP |
| **f16** | | | |
| nk_reduce_moments_f16_serial | 0.391 GB/s, 0 ULP | 0.354 GB/s, 0 ULP | 0.407 GB/s, 0 ULP |
| nk_reduce_minmax_f16_serial | 0.901 GB/s, 0 ULP | 0.877 GB/s, 0 ULP | 0.974 GB/s, 0 ULP |
| nk_reduce_moments_f16_haswell | 13.5 GB/s, 0 ULP | 12.6 GB/s, 0 ULP | 11.0 GB/s, 0.3 ULP |
| nk_reduce_minmax_f16_haswell | 6.61 GB/s, 0 ULP | 9.19 GB/s, 0 ULP | 8.10 GB/s, 0 ULP |
| nk_reduce_moments_f16_skylake | 17.7 GB/s, 0 ULP | 29.1 GB/s, 0.1 ULP | 18.6 GB/s, 0 ULP |
| nk_reduce_minmax_f16_skylake | 10.2 GB/s, 0 ULP | 20.8 GB/s, 0 ULP | 22.0 GB/s, 0 ULP |
| **e5m2** | | | |
| nk_reduce_moments_e5m2_serial | 0.157 GB/s, 0 ULP | 0.296 GB/s, 0 ULP | 0.229 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_serial | 0.418 GB/s, 0 ULP | 0.417 GB/s, 0 ULP | 0.451 GB/s, 0 ULP |
| nk_reduce_moments_e5m2_haswell | 2.40 GB/s, 0 ULP | 2.69 GB/s, 0 ULP | 2.61 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_haswell | 4.48 GB/s, 0 ULP | 6.80 GB/s, 0 ULP | 7.21 GB/s, 0 ULP |
| nk_reduce_moments_e5m2_skylake | 4.66 GB/s, 0 ULP | 2.83 GB/s, 0 ULP | 4.04 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_skylake | 3.90 GB/s, 0 ULP | 11.8 GB/s, 0 ULP | 19.1 GB/s, 0 ULP |
| nk_reduce_moments_e5m2_genoa | 4.76 GB/s, 0 ULP | 6.08 GB/s, 0 ULP | 5.88 GB/s, 0 ULP |
| **e4m3** | | | |
| nk_reduce_moments_e4m3_serial | 0.121 GB/s, 0 ULP | 0.129 GB/s, 0 ULP | 0.158 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_serial | 0.460 GB/s, 0 ULP | 0.473 GB/s, 0 ULP | 0.464 GB/s, 0 ULP |
| nk_reduce_moments_e4m3_haswell | 1.82 GB/s, 0 ULP | 1.90 GB/s, 0 ULP | 1.77 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_haswell | 4.42 GB/s, 0 ULP | 7.00 GB/s, 0 ULP | 8.10 GB/s, 0 ULP |
| nk_reduce_moments_e4m3_skylake | 2.77 GB/s, 0 ULP | 3.53 GB/s, 0 ULP | 2.74 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_skylake | 3.79 GB/s, 0 ULP | 9.57 GB/s, 0 ULP | 17.0 GB/s, 0 ULP |
| nk_reduce_moments_e4m3_genoa | 4.67 GB/s, 0 ULP | 5.87 GB/s, 0 ULP | 5.67 GB/s, 0 ULP |
| **e3m2** | | | |
| nk_reduce_moments_e3m2_serial | 0.158 GB/s, 0 ULP | 0.279 GB/s, 0 ULP | 0.348 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_serial | 0.464 GB/s, 0 ULP | 0.416 GB/s, 0 ULP | 0.470 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_haswell | 2.37 GB/s, 0 ULP | 2.55 GB/s, 0 ULP | 2.53 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_haswell | 5.36 GB/s, 0 ULP | 7.89 GB/s, 0 ULP | 9.56 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_skylake | 2.77 GB/s, 0 ULP | 3.32 GB/s, 0 ULP | 3.58 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_skylake | 9.85 GB/s, 0 ULP | 20.1 GB/s, 0 ULP | 14.6 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_icelake | 8.82 GB/s, 0 ULP | 9.02 GB/s, 0 ULP | 13.4 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_alder | 4.80 GB/s, 0 ULP | 7.11 GB/s, 0 ULP | 7.89 GB/s, 0 ULP |
| **e2m3** | | | |
| nk_reduce_moments_e2m3_serial | 0.157 GB/s, 0 ULP | 0.294 GB/s, 0 ULP | 0.301 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_serial | 0.465 GB/s, 0 ULP | 0.421 GB/s, 0 ULP | 0.453 GB/s, 0 ULP |
| nk_reduce_moments_e2m3_haswell | 2.43 GB/s, 0 ULP | 2.45 GB/s, 0 ULP | 2.58 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_haswell | 5.31 GB/s, 0 ULP | 7.90 GB/s, 0 ULP | 9.36 GB/s, 0 ULP |
| nk_reduce_moments_e2m3_skylake | 3.49 GB/s, 0 ULP | 3.02 GB/s, 0 ULP | 3.66 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_skylake | 6.14 GB/s, 0 ULP | 17.5 GB/s, 0 ULP | 20.3 GB/s, 0 ULP |
| nk_reduce_moments_e2m3_icelake | 12.7 GB/s, 0 ULP | 22.7 GB/s, 0 ULP | 21.7 GB/s, 0 ULP |
| **i8** | | | |
| nk_reduce_moments_i8_serial | 2.21 GB/s | 2.40 GB/s | 2.29 GB/s |
| nk_reduce_minmax_i8_serial | 0.806 GB/s | 0.973 GB/s | 1.09 GB/s |
| nk_reduce_moments_i8_haswell | 9.35 GB/s | 11.9 GB/s | 12.7 GB/s |
| nk_reduce_minmax_i8_haswell | 7.11 GB/s | 11.7 GB/s | 13.2 GB/s |
| nk_reduce_moments_i8_skylake | 10.4 GB/s | 16.6 GB/s | 20.1 GB/s |
| nk_reduce_minmax_i8_skylake | 2.96 GB/s | 14.4 GB/s | 15.5 GB/s |
| nk_reduce_moments_i8_icelake | 14.0 GB/s | 28.3 GB/s | 28.4 GB/s |
| **u8** | | | |
| nk_reduce_moments_u8_serial | 2.40 GB/s | 2.49 GB/s | 2.15 GB/s |
| nk_reduce_minmax_u8_serial | 0.776 GB/s | 0.931 GB/s | 1.05 GB/s |
| nk_reduce_moments_u8_haswell | 10.3 GB/s | 12.9 GB/s | 13.6 GB/s |
| nk_reduce_minmax_u8_haswell | 7.08 GB/s | 11.2 GB/s | 12.0 GB/s |
| nk_reduce_moments_u8_skylake | 13.2 GB/s | 20.1 GB/s | 19.6 GB/s |
| nk_reduce_minmax_u8_skylake | 4.45 GB/s | 14.0 GB/s | 20.4 GB/s |
| nk_reduce_moments_u8_icelake | 14.6 GB/s | 21.7 GB/s | 30.4 GB/s |
| nk_reduce_moments_u8_alder | 11.5 GB/s | 13.3 GB/s | 13.7 GB/s |
| **i4** | | | |
| nk_reduce_moments_i4_serial | 0.345 GB/s | 0.757 GB/s | 0.752 GB/s |
| nk_reduce_minmax_i4_serial | 0.313 GB/s | 0.285 GB/s | 0.357 GB/s |
| nk_reduce_moments_i4_haswell | 6.36 GB/s | 9.17 GB/s | 10.3 GB/s |
| nk_reduce_moments_i4_skylake | 7.67 GB/s | 8.85 GB/s | 15.4 GB/s |
| **u4** | | | |
| nk_reduce_moments_u4_serial | 0.438 GB/s | 0.799 GB/s | 1.00 GB/s |
| nk_reduce_minmax_u4_serial | 0.352 GB/s | 0.292 GB/s | 0.397 GB/s |
| nk_reduce_moments_u4_haswell | 7.40 GB/s | 10.7 GB/s | 10.8 GB/s |
| nk_reduce_moments_u4_skylake | 9.45 GB/s | 15.0 GB/s | 18.3 GB/s |
| **u1** | | | |
| nk_reduce_moments_u1_serial | 1.36 GB/s | 1.96 GB/s | 2.04 GB/s |
| nk_reduce_minmax_u1_serial | 5.44 GB/s | 14.7 GB/s | 84.1 GB/s |
| nk_reduce_moments_u1_haswell | 4.29 GB/s | 9.69 GB/s | 12.0 GB/s |
| nk_reduce_moments_u1_skylake | 2.90 GB/s | 12.3 GB/s | 20.6 GB/s |
| **i16** | | | |
| nk_reduce_moments_i16_serial | 2.54 GB/s | 2.68 GB/s | 2.80 GB/s |
| nk_reduce_minmax_i16_serial | 1.60 GB/s | 1.75 GB/s | 2.07 GB/s |
| nk_reduce_moments_i16_haswell | 13.7 GB/s | 14.7 GB/s | 12.5 GB/s |
| nk_reduce_minmax_i16_haswell | 8.56 GB/s | 10.9 GB/s | 10.0 GB/s |
| nk_reduce_moments_i16_skylake | 16.8 GB/s | 21.0 GB/s | 20.5 GB/s |
| nk_reduce_minmax_i16_skylake | 6.74 GB/s | 15.9 GB/s | 19.1 GB/s |
| nk_reduce_moments_i16_icelake | 19.0 GB/s | 24.9 GB/s | 28.2 GB/s |
| nk_reduce_moments_i16_alder | 10.0 GB/s | 12.1 GB/s | 10.5 GB/s |
| **u16** | | | |
| nk_reduce_moments_u16_serial | 2.62 GB/s | 2.55 GB/s | 2.54 GB/s |
| nk_reduce_minmax_u16_serial | 1.28 GB/s | 1.41 GB/s | 1.62 GB/s |
| nk_reduce_moments_u16_haswell | 6.82 GB/s | 6.95 GB/s | 6.60 GB/s |
| nk_reduce_minmax_u16_haswell | 8.25 GB/s | 10.5 GB/s | 11.6 GB/s |
| nk_reduce_moments_u16_skylake | 10.2 GB/s | 13.9 GB/s | 12.6 GB/s |
| nk_reduce_minmax_u16_skylake | 16.0 GB/s | 22.6 GB/s | 16.9 GB/s |
| nk_reduce_moments_u16_alder | 7.17 GB/s | 8.10 GB/s | 7.57 GB/s |
| **i32** | | | |
| nk_reduce_moments_i32_serial | 2.39 GB/s | 2.25 GB/s | 2.32 GB/s |
| nk_reduce_minmax_i32_serial | 2.99 GB/s | 3.67 GB/s | 4.48 GB/s |
| nk_reduce_moments_i32_haswell | 5.43 GB/s | 5.37 GB/s | 4.41 GB/s |
| nk_reduce_minmax_i32_haswell | 11.1 GB/s | 10.2 GB/s | 10.4 GB/s |
| nk_reduce_moments_i32_skylake | 6.87 GB/s | 11.1 GB/s | 10.6 GB/s |
| nk_reduce_minmax_i32_skylake | 23.8 GB/s | 24.7 GB/s | 17.6 GB/s |
| **u32** | | | |
| nk_reduce_moments_u32_serial | 3.46 GB/s | 3.53 GB/s | 3.41 GB/s |
| nk_reduce_minmax_u32_serial | 2.81 GB/s | 3.34 GB/s | 4.05 GB/s |
| nk_reduce_moments_u32_haswell | 6.10 GB/s | 5.79 GB/s | 5.27 GB/s |
| nk_reduce_minmax_u32_haswell | 10.6 GB/s | 11.2 GB/s | 9.95 GB/s |
| nk_reduce_moments_u32_skylake | 15.9 GB/s | 9.96 GB/s | 15.3 GB/s |
| nk_reduce_minmax_u32_skylake | 23.6 GB/s | 25.3 GB/s | 21.7 GB/s |
| **i64** | | | |
| nk_reduce_moments_i64_serial | 2.43 GB/s | 2.43 GB/s | 2.43 GB/s |
| nk_reduce_minmax_i64_serial | 4.90 GB/s | 5.54 GB/s | 6.10 GB/s |
| nk_reduce_moments_i64_haswell | 7.16 GB/s | 6.54 GB/s | 5.38 GB/s |
| nk_reduce_minmax_i64_haswell | 9.50 GB/s | 9.87 GB/s | 7.63 GB/s |
| nk_reduce_moments_i64_skylake | 13.0 GB/s | 8.29 GB/s | 10.5 GB/s |
| nk_reduce_minmax_i64_skylake | 11.6 GB/s | 23.1 GB/s | 22.0 GB/s |
| **u64** | | | |
| nk_reduce_moments_u64_serial | 1.94 GB/s | 1.92 GB/s | 1.83 GB/s |
| nk_reduce_minmax_u64_serial | 5.99 GB/s | 7.20 GB/s | 7.33 GB/s |
| nk_reduce_moments_u64_haswell | 8.60 GB/s | 8.45 GB/s | 5.96 GB/s |
| nk_reduce_minmax_u64_haswell | 8.93 GB/s | 9.81 GB/s | 7.55 GB/s |
| nk_reduce_moments_u64_skylake | 15.6 GB/s | 19.3 GB/s | 8.87 GB/s |
| nk_reduce_minmax_u64_skylake | 9.90 GB/s | 23.1 GB/s | 21.6 GB/s |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64** | | | |
| nk_reduce_moments_f64_serial | 1.2 GB/s, 0 ULP | 1.22 GB/s, 0 ULP | 1.19 GB/s, 0 ULP |
| nk_reduce_moments_f64_v128relaxed | 6.97 GB/s, 0 ULP | 6.91 GB/s, 0 ULP | 7.06 GB/s, 0 ULP |
| nk_reduce_minmax_f64_serial | 5.23 GB/s, 0 ULP | 5.49 GB/s, 0 ULP | 5.71 GB/s, 0 ULP |
| nk_reduce_minmax_f64_v128relaxed | 4.9 GB/s, 0 ULP | 5.16 GB/s, 0 ULP | 5.11 GB/s, 0 ULP |
| **f32** | | | |
| nk_reduce_moments_f32_serial | 0.606 GB/s, 0 ULP | 0.622 GB/s, 0 ULP | 0.618 GB/s, 0 ULP |
| nk_reduce_moments_f32_v128relaxed | 8.84 GB/s, 0.1 ULP | 6.78 GB/s, 0.4 ULP | 6.49 GB/s, 0 ULP |
| nk_reduce_minmax_f32_serial | 2.63 GB/s, 0 ULP | 2.73 GB/s, 0 ULP | 2.86 GB/s, 0 ULP |
| nk_reduce_minmax_f32_v128relaxed | 4.59 GB/s, 0 ULP | 5.02 GB/s, 0 ULP | 5.05 GB/s, 0 ULP |
| **bf16** | | | |
| nk_reduce_moments_bf16_serial | 0.274 GB/s, 0 ULP | 0.28 GB/s, 0 ULP | 0.288 GB/s, 0 ULP |
| nk_reduce_moments_bf16_v128relaxed | 7.7 GB/s, 0 ULP | 6.06 GB/s, 0.1 ULP | 5.26 GB/s, 1.6 ULP |
| nk_reduce_minmax_bf16_serial | 0.873 GB/s, 0 ULP | 0.977 GB/s, 0 ULP | 1.04 GB/s, 0 ULP |
| nk_reduce_minmax_bf16_v128relaxed | 4.88 GB/s, 0 ULP | 5.14 GB/s, 0 ULP | 5.55 GB/s, 0 ULP |
| **f16** | | | |
| nk_reduce_moments_f16_serial | 0.234 GB/s, 0 ULP | 0.235 GB/s, 0 ULP | 0.247 GB/s, 0 ULP |
| nk_reduce_moments_f16_v128relaxed | 1.91 GB/s, 0 ULP | 1.95 GB/s, 0 ULP | 1.92 GB/s, 0.2 ULP |
| nk_reduce_minmax_f16_serial | 0.859 GB/s, 0 ULP | 0.967 GB/s, 0 ULP | 1.06 GB/s, 0 ULP |
| nk_reduce_minmax_f16_v128relaxed | 1.3 GB/s, 0 ULP | 1.38 GB/s, 0 ULP | 1.39 GB/s, 0 ULP |
| **e5m2** | | | |
| nk_reduce_moments_e5m2_serial | 0.116 GB/s, 0 ULP | 0.118 GB/s, 0 ULP | 0.122 GB/s, 0 ULP |
| nk_reduce_moments_e5m2_v128relaxed | 1.55 GB/s, 0 ULP | 1.59 GB/s, 0 ULP | 1.56 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_serial | 0.471 GB/s, 0 ULP | 0.499 GB/s, 0 ULP | 0.527 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_v128relaxed | 1.27 GB/s, 0 ULP | 2.76 GB/s, 0 ULP | 3.07 GB/s, 0 ULP |
| **e4m3** | | | |
| nk_reduce_moments_e4m3_serial | 0.109 GB/s, 0 ULP | 0.109 GB/s, 0 ULP | 0.111 GB/s, 0 ULP |
| nk_reduce_moments_e4m3_v128relaxed | 1.15 GB/s, 0 ULP | 1.16 GB/s, 0 ULP | 1.17 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_serial | 0.451 GB/s, 0 ULP | 0.495 GB/s, 0 ULP | 0.535 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_v128relaxed | 1.43 GB/s, 0 ULP | 2.61 GB/s, 0 ULP | 3.06 GB/s, 0 ULP |
| **e3m2** | | | |
| nk_reduce_moments_e3m2_serial | 0.121 GB/s, 0 ULP | 0.117 GB/s, 0 ULP | 0.12 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_v128relaxed | 2.04 GB/s, 0 ULP | 2.1 GB/s, 0 ULP | 2.09 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_serial | 0.431 GB/s, 0 ULP | 0.433 GB/s, 0 ULP | 0.437 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_v128relaxed | 1.45 GB/s, 0 ULP | 3.75 GB/s, 0 ULP | 4.32 GB/s, 0 ULP |
| **e2m3** | | | |
| nk_reduce_moments_e2m3_serial | 0.117 GB/s, 0 ULP | 0.118 GB/s, 0 ULP | 0.12 GB/s, 0 ULP |
| nk_reduce_moments_e2m3_v128relaxed | 3.1 GB/s, 0 ULP | 3.26 GB/s, 0 ULP | 3.28 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_serial | 0.434 GB/s, 0 ULP | 0.43 GB/s, 0 ULP | 0.439 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_v128relaxed | 2.84 GB/s, 0 ULP | 3.77 GB/s, 0 ULP | 4.32 GB/s, 0 ULP |
| **i8** | | | |
| nk_reduce_moments_i8_serial | 2.04 GB/s | 2.07 GB/s | 2.09 GB/s |
| nk_reduce_moments_i8_v128relaxed | 6.69 GB/s | 7.37 GB/s | 7.43 GB/s |
| nk_reduce_minmax_i8_serial | 0.928 GB/s | 0.92 GB/s | 0.935 GB/s |
| nk_reduce_minmax_i8_v128relaxed | 5.08 GB/s | 5.55 GB/s | 7 GB/s |
| **u8** | | | |
| nk_reduce_moments_u8_serial | 2.04 GB/s | 2.09 GB/s | 2.1 GB/s |
| nk_reduce_moments_u8_v128relaxed | 6.38 GB/s | 6.96 GB/s | 7.04 GB/s |
| nk_reduce_minmax_u8_serial | 0.851 GB/s | 0.851 GB/s | 0.858 GB/s |
| nk_reduce_minmax_u8_v128relaxed | 3.97 GB/s | 4.52 GB/s | 5.45 GB/s |
| **i16** | | | |
| nk_reduce_moments_i16_serial | 4.09 GB/s | 4.12 GB/s | 4.18 GB/s |
| nk_reduce_moments_i16_v128relaxed | 7.03 GB/s | 7.37 GB/s | 7.34 GB/s |
| nk_reduce_minmax_i16_serial | 1.86 GB/s | 1.85 GB/s | 1.87 GB/s |
| nk_reduce_minmax_i16_v128relaxed | 6.67 GB/s | 7.53 GB/s | 8.2 GB/s |
| **u16** | | | |
| nk_reduce_moments_u16_serial | 4.09 GB/s | 4.17 GB/s | 4.23 GB/s |
| nk_reduce_moments_u16_v128relaxed | 6.74 GB/s | 7.05 GB/s | 6.91 GB/s |
| nk_reduce_minmax_u16_serial | 1.67 GB/s | 1.67 GB/s | 1.68 GB/s |
| nk_reduce_minmax_u16_v128relaxed | 4.78 GB/s | 5.43 GB/s | 5.83 GB/s |
| **i32** | | | |
| nk_reduce_moments_i32_serial | 3.65 GB/s | 3.74 GB/s | 3.77 GB/s |
| nk_reduce_moments_i32_v128relaxed | 1.58 GB/s | 1.6 GB/s | 1.58 GB/s |
| nk_reduce_minmax_i32_serial | 4.27 GB/s | 4.3 GB/s | 4.32 GB/s |
| nk_reduce_minmax_i32_v128relaxed | 6.91 GB/s | 7.77 GB/s | 8.06 GB/s |
| **u32** | | | |
| nk_reduce_moments_u32_serial | 3.59 GB/s | 3.6 GB/s | 3.65 GB/s |
| nk_reduce_moments_u32_v128relaxed | 1.22 GB/s | 1.21 GB/s | 1.21 GB/s |
| nk_reduce_minmax_u32_serial | 3.81 GB/s | 3.81 GB/s | 3.88 GB/s |
| nk_reduce_minmax_u32_v128relaxed | 5.1 GB/s | 5.62 GB/s | 5.89 GB/s |
| **i64** | | | |
| nk_reduce_moments_i64_serial | 3.09 GB/s | 3.14 GB/s | 3.28 GB/s |
| nk_reduce_moments_i64_v128relaxed | 2.65 GB/s | 2.66 GB/s | 2.65 GB/s |
| nk_reduce_minmax_i64_serial | 8.49 GB/s | 8.38 GB/s | 8.65 GB/s |
| nk_reduce_minmax_i64_v128relaxed | 6.14 GB/s | 6.37 GB/s | 6.5 GB/s |
| **u64** | | | |
| nk_reduce_moments_u64_serial | 2.78 GB/s | 2.79 GB/s | 2.87 GB/s |
| nk_reduce_moments_u64_v128relaxed | 2.46 GB/s | 2.48 GB/s | 2.47 GB/s |
| nk_reduce_minmax_u64_serial | 7.83 GB/s | 7.59 GB/s | 7.74 GB/s |
| nk_reduce_minmax_u64_v128relaxed | 2.54 GB/s | 2.59 GB/s | 2.64 GB/s |

Apple M5

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64** | | | |
| nk_reduce_moments_f64_serial | 6.41 GB/s, 0 ULP | 7.19 GB/s, 0 ULP | 6.71 GB/s, 0 ULP |
| nk_reduce_minmax_f64_serial | 16.4 GB/s, 0 ULP | 16.1 GB/s, 0 ULP | 16.1 GB/s, 0 ULP |
| nk_reduce_moments_f64_neon | 16.4 GB/s, 0 ULP | 16.3 GB/s, 0 ULP | 17.0 GB/s, 0 ULP |
| nk_reduce_minmax_f64_neon | 18.0 GB/s, 0 ULP | 17.0 GB/s, 0 ULP | 16.2 GB/s, 0 ULP |
| **f32** | | | |
| nk_reduce_moments_f32_serial | 3.06 GB/s, 0 ULP | 3.40 GB/s, 0 ULP | 3.15 GB/s, 0 ULP |
| nk_reduce_minmax_f32_serial | 8.29 GB/s, 0 ULP | 7.97 GB/s, 0 ULP | 8.00 GB/s, 0 ULP |
| nk_reduce_moments_f32_neon | 16.6 GB/s, 0.4 ULP | 10.8 GB/s, 1.8 ULP | 9.78 GB/s, 1.1 ULP |
| nk_reduce_minmax_f32_neon | 19.7 GB/s, 0 ULP | 17.3 GB/s, 0 ULP | 16.5 GB/s, 0 ULP |
| **bf16** | | | |
| nk_reduce_moments_bf16_serial | 1.49 GB/s, 0 ULP | 1.62 GB/s, 0 ULP | 1.55 GB/s, 0 ULP |
| nk_reduce_minmax_bf16_serial | 2.19 GB/s, 0 ULP | 2.49 GB/s, 0 ULP | 2.69 GB/s, 0 ULP |
| nk_reduce_moments_bf16_neonbfdot | 26.1 GB/s, 0 ULP | 28.2 GB/s, 0.4 ULP | 29.9 GB/s, 0.3 ULP |
| **f16** | | | |
| nk_reduce_moments_f16_serial | 1.46 GB/s, 0 ULP | 1.55 GB/s, 0 ULP | 1.51 GB/s, 0 ULP |
| nk_reduce_minmax_f16_serial | 1.62 GB/s, 0 ULP | 1.87 GB/s, 0 ULP | 2.02 GB/s, 0 ULP |
| nk_reduce_moments_f16_neonhalf | 21.2 GB/s, 0.1 ULP | 15.4 GB/s, 0.1 ULP | 10.6 GB/s, 0.8 ULP |
| **e5m2** | | | |
| nk_reduce_moments_e5m2_serial | 0.750 GB/s, 0 ULP | 0.830 GB/s, 0 ULP | 0.758 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_serial | 1.10 GB/s, 0 ULP | 1.34 GB/s, 0 ULP | 1.39 GB/s, 0 ULP |
| nk_reduce_moments_e5m2_neon | 10.4 GB/s, ? ULP | 7.47 GB/s, ? ULP | 5.22 GB/s, ? ULP |
| nk_reduce_moments_e5m2_neonfhm | 12.4 GB/s, 0 ULP | 7.31 GB/s, 0 ULP | 4.78 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_neon | 15.5 GB/s, 0 ULP | 17.4 GB/s, 0 ULP | 18.1 GB/s, 0 ULP |
| **e4m3** | | | |
| nk_reduce_moments_e4m3_serial | 0.569 GB/s, 0 ULP | 0.640 GB/s, 0 ULP | 0.599 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_serial | 1.14 GB/s, 0 ULP | 1.39 GB/s, 0 ULP | 1.42 GB/s, 0 ULP |
| nk_reduce_moments_e4m3_neon | 6.52 GB/s, ? ULP | 5.52 GB/s, ? ULP | 4.88 GB/s, ? ULP |
| nk_reduce_moments_e4m3_neonfhm | 4.21 GB/s, 0 ULP | 4.23 GB/s, 0 ULP | 4.11 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_neon | 15.5 GB/s, 0 ULP | 17.4 GB/s, 0 ULP | 17.5 GB/s, 0 ULP |
| **e3m2** | | | |
| nk_reduce_moments_e3m2_serial | 0.750 GB/s, 0 ULP | 0.806 GB/s, 0 ULP | 0.773 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_serial | 0.640 GB/s, 0 ULP | 0.641 GB/s, 0 ULP | 0.634 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_neon | 8.64 GB/s, ? ULP | 6.96 GB/s, ? ULP | 5.70 GB/s, ? ULP |
| nk_reduce_minmax_e3m2_neon | 16.6 GB/s, 0 ULP | 19.1 GB/s, 0 ULP | 18.9 GB/s, 0 ULP |
| **e2m3** | | | |
| nk_reduce_moments_e2m3_serial | 0.753 GB/s, 0 ULP | 0.821 GB/s, 0 ULP | 0.775 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_serial | 0.640 GB/s, 0 ULP | 0.643 GB/s, 0 ULP | 0.618 GB/s, 0 ULP |
| nk_reduce_moments_e2m3_neon | 17.1 GB/s, ? ULP | 15.2 GB/s, ? ULP | 11.9 GB/s, ? ULP |
| nk_reduce_moments_e2m3_neonsdot | 29.4 GB/s, 0 ULP | 29.3 GB/s, 0 ULP | 24.4 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_neon | 17.0 GB/s, 0 ULP | 19.3 GB/s, 0 ULP | 19.6 GB/s, 0 ULP |
| **i8** | | | |
| nk_reduce_moments_i8_serial | 3.00 GB/s | 3.50 GB/s | 3.21 GB/s |
| nk_reduce_minmax_i8_serial | 1.81 GB/s | 1.94 GB/s | 1.89 GB/s |
| nk_reduce_moments_i8_neon | 27.9 GB/s | 17.7 GB/s | 13.3 GB/s |
| nk_reduce_minmax_i8_neon | 25.7 GB/s | 30.3 GB/s | 27.8 GB/s |
| nk_reduce_moments_i8_neonsdot | 44.4 GB/s | 46.8 GB/s | 33.4 GB/s |
| **u8** | | | |
| nk_reduce_moments_u8_serial | 3.08 GB/s | 3.51 GB/s | 3.23 GB/s |
| nk_reduce_minmax_u8_serial | 1.84 GB/s | 1.94 GB/s | 1.90 GB/s |
| nk_reduce_moments_u8_neon | 29.2 GB/s | 18.0 GB/s | 13.6 GB/s |
| nk_reduce_minmax_u8_neon | 27.0 GB/s | 28.3 GB/s | 29.2 GB/s |
| nk_reduce_moments_u8_neonsdot | 45.5 GB/s | 43.5 GB/s | 33.3 GB/s |
| **i4** | | | |
| nk_reduce_moments_i4_serial | 2.06 GB/s | 2.57 GB/s | 2.39 GB/s |
| nk_reduce_minmax_i4_serial | 0.701 GB/s | 0.794 GB/s | 0.803 GB/s |
| **u4** | | | |
| nk_reduce_moments_u4_serial | 2.20 GB/s | 2.75 GB/s | 2.62 GB/s |
| nk_reduce_minmax_u4_serial | 0.741 GB/s | 0.845 GB/s | 0.856 GB/s |
| **u1** | | | |
| nk_reduce_moments_u1_serial | 1.79 GB/s | 1.99 GB/s | 2.03 GB/s |
| nk_reduce_minmax_u1_serial | 9.74 GB/s | 38.5 GB/s | 121 GB/s |
| **i16** | | | |
| nk_reduce_moments_i16_serial | 6.02 GB/s | 7.03 GB/s | 6.70 GB/s |
| nk_reduce_minmax_i16_serial | 3.66 GB/s | 3.88 GB/s | 3.82 GB/s |
| nk_reduce_moments_i16_neon | 23.0 GB/s | 16.2 GB/s | 11.9 GB/s |
| nk_reduce_minmax_i16_neon | 26.8 GB/s | 28.5 GB/s | 25.9 GB/s |
| **u16** | | | |
| nk_reduce_moments_u16_serial | 6.13 GB/s | 6.79 GB/s | 6.45 GB/s |
| nk_reduce_minmax_u16_serial | 2.72 GB/s | 2.68 GB/s | 2.67 GB/s |
| nk_reduce_moments_u16_neon | 23.1 GB/s | 16.3 GB/s | 11.6 GB/s |
| nk_reduce_minmax_u16_neon | 26.6 GB/s | 28.5 GB/s | 25.8 GB/s |
| **i32** | | | |
| nk_reduce_moments_i32_serial | 7.70 GB/s | 8.09 GB/s | 7.52 GB/s |
| nk_reduce_minmax_i32_serial | 7.35 GB/s | 7.49 GB/s | 7.65 GB/s |
| nk_reduce_moments_i32_neon | 8.88 GB/s | 6.74 GB/s | 6.60 GB/s |
| nk_reduce_minmax_i32_neon | 27.3 GB/s | 27.9 GB/s | 25.8 GB/s |
| **u32** | | | |
| nk_reduce_moments_u32_serial | 7.51 GB/s | 8.10 GB/s | 7.73 GB/s |
| nk_reduce_minmax_u32_serial | 7.28 GB/s | 7.69 GB/s | 7.68 GB/s |
| nk_reduce_moments_u32_neon | 19.0 GB/s | 12.8 GB/s | 11.5 GB/s |
| nk_reduce_minmax_u32_neon | 27.6 GB/s | 28.3 GB/s | 26.4 GB/s |
| **i64** | | | |
| nk_reduce_moments_i64_serial | 10.4 GB/s | 12.9 GB/s | 11.9 GB/s |
| nk_reduce_minmax_i64_serial | 13.7 GB/s | 14.1 GB/s | 14.0 GB/s |
| nk_reduce_moments_i64_neon | 15.6 GB/s | 13.1 GB/s | 12.9 GB/s |
| nk_reduce_minmax_i64_neon | 18.0 GB/s | 16.6 GB/s | 15.6 GB/s |
| **u64** | | | |
| nk_reduce_moments_u64_serial | 10.4 GB/s | 11.9 GB/s | 11.2 GB/s |
| nk_reduce_minmax_u64_serial | 13.4 GB/s | 14.1 GB/s | 13.8 GB/s |
| nk_reduce_moments_u64_neon | 29.1 GB/s | 22.5 GB/s | 22.2 GB/s |
| nk_reduce_minmax_u64_neon | 18.3 GB/s | 16.7 GB/s | 16.1 GB/s |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64** | | | |
| nk_reduce_moments_f64_serial | 7.05 GB/s, 0 ULP | 6.89 GB/s, 0 ULP | 7.04 GB/s, 0 ULP |
| nk_reduce_moments_f64_v128relaxed | 17.9 GB/s, 0 ULP | 17.9 GB/s, 0 ULP | 17.9 GB/s, 0 ULP |
| nk_reduce_minmax_f64_serial | 11.8 GB/s, 0 ULP | 11.5 GB/s, 0 ULP | 11.4 GB/s, 0 ULP |
| nk_reduce_minmax_f64_v128relaxed | 16.4 GB/s, 0 ULP | 16.8 GB/s, 0 ULP | 16.9 GB/s, 0 ULP |
| **f32** | | | |
| nk_reduce_moments_f32_serial | 3.34 GB/s, 0 ULP | 3.32 GB/s, 0 ULP | 3.34 GB/s, 0 ULP |
| nk_reduce_moments_f32_v128relaxed | 16.7 GB/s, 0.1 ULP | 11.9 GB/s, 0.5 ULP | 10.3 GB/s, 0 ULP |
| nk_reduce_minmax_f32_serial | 4.40 GB/s, 0 ULP | 4.31 GB/s, 0 ULP | 4.28 GB/s, 0 ULP |
| nk_reduce_minmax_f32_v128relaxed | 14.5 GB/s, 0 ULP | 15.8 GB/s, 0 ULP | 16.7 GB/s, 0 ULP |
| **bf16** | | | |
| nk_reduce_moments_bf16_serial | 1.76 GB/s, 0 ULP | 1.72 GB/s, 0 ULP | 1.72 GB/s, 0 ULP |
| nk_reduce_moments_bf16_v128relaxed | 20.8 GB/s, 0 ULP | 14.8 GB/s, 0.2 ULP | 11.1 GB/s, 1.6 ULP |
| nk_reduce_minmax_bf16_serial | 1.80 GB/s, 0 ULP | 2.01 GB/s, 0 ULP | 2.10 GB/s, 0 ULP |
| nk_reduce_minmax_bf16_v128relaxed | 9.22 GB/s, 0 ULP | 9.53 GB/s, 0 ULP | 10.5 GB/s, 0 ULP |
| **f16** | | | |
| nk_reduce_moments_f16_serial | 1.77 GB/s, 0 ULP | 1.73 GB/s, 0 ULP | 1.73 GB/s, 0 ULP |
| nk_reduce_moments_f16_v128relaxed | 11.8 GB/s, 0 ULP | 10.8 GB/s, 0 ULP | 9.84 GB/s, 0.3 ULP |
| nk_reduce_minmax_f16_serial | 1.79 GB/s, 0 ULP | 1.99 GB/s, 0 ULP | 2.09 GB/s, 0 ULP |
| nk_reduce_minmax_f16_v128relaxed | 6.04 GB/s, 0 ULP | 7.62 GB/s, 0 ULP | 8.11 GB/s, 0 ULP |
| **e5m2** | | | |
| nk_reduce_moments_e5m2_serial | 0.886 GB/s, 0 ULP | 0.865 GB/s, 0 ULP | 0.870 GB/s, 0 ULP |
| nk_reduce_moments_e5m2_v128relaxed | 3.53 GB/s, 0 ULP | 3.55 GB/s, 0 ULP | 3.56 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_serial | 1.12 GB/s, 0 ULP | 1.18 GB/s, 0 ULP | 1.22 GB/s, 0 ULP |
| nk_reduce_minmax_e5m2_v128relaxed | 5.50 GB/s, 0 ULP | 10.8 GB/s, 0 ULP | 14.6 GB/s, 0 ULP |
| **e4m3** | | | |
| nk_reduce_moments_e4m3_serial | 0.617 GB/s, 0 ULP | 0.606 GB/s, 0 ULP | 0.610 GB/s, 0 ULP |
| nk_reduce_moments_e4m3_v128relaxed | 3.01 GB/s, 0 ULP | 3.04 GB/s, 0 ULP | 3.05 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_serial | 1.09 GB/s, 0 ULP | 1.21 GB/s, 0 ULP | 1.27 GB/s, 0 ULP |
| nk_reduce_minmax_e4m3_v128relaxed | 6.46 GB/s, 0 ULP | 10.5 GB/s, 0 ULP | 15.0 GB/s, 0 ULP |
| **e3m2** | | | |
| nk_reduce_moments_e3m2_serial | 0.881 GB/s, 0 ULP | 0.873 GB/s, 0 ULP | 0.868 GB/s, 0 ULP |
| nk_reduce_moments_e3m2_v128relaxed | 5.41 GB/s, 0 ULP | 4.67 GB/s, 0 ULP | 4.34 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_serial | 0.671 GB/s, 0 ULP | 0.667 GB/s, 0 ULP | 0.666 GB/s, 0 ULP |
| nk_reduce_minmax_e3m2_v128relaxed | 7.95 GB/s, 0 ULP | 13.0 GB/s, 0 ULP | 16.3 GB/s, 0 ULP |
| **e2m3** | | | |
| nk_reduce_moments_e2m3_serial | 0.884 GB/s, 0 ULP | 0.877 GB/s, 0 ULP | 0.857 GB/s, 0 ULP |
| nk_reduce_moments_e2m3_v128relaxed | 9.54 GB/s, 0 ULP | 9.49 GB/s, 0 ULP | 8.72 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_serial | 0.670 GB/s, 0 ULP | 0.667 GB/s, 0 ULP | 0.666 GB/s, 0 ULP |
| nk_reduce_minmax_e2m3_v128relaxed | 9.35 GB/s, 0 ULP | 12.9 GB/s, 0 ULP | 16.4 GB/s, 0 ULP |
| **i8** | | | |
| nk_reduce_moments_i8_serial | 3.78 GB/s | 3.68 GB/s | 3.62 GB/s |
| nk_reduce_moments_i8_v128relaxed | 21.6 GB/s | 22.3 GB/s | 18.7 GB/s |
| nk_reduce_minmax_i8_serial | 1.47 GB/s | 1.49 GB/s | 1.49 GB/s |
| nk_reduce_minmax_i8_v128relaxed | 13.0 GB/s | 19.1 GB/s | 23.1 GB/s |
| **u8** | | | |
| nk_reduce_moments_u8_serial | 3.71 GB/s | 3.68 GB/s | 3.62 GB/s |
nk_reduce_moments_u8_v128relaxed21.6 gb/s22.3 gb/s18.8 gb/s
nk_reduce_minmax_u8_serial1.51 gb/s1.55 gb/s1.55 gb/s
nk_reduce_minmax_u8_v128relaxed12.6 gb/s16.6 gb/s22.5 gb/s
i16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_reduce_moments_i16_serial7.52 gb/s7.35 gb/s7.33 gb/s
nk_reduce_moments_i16_v128relaxed13.3 gb/s9.97 gb/s8.83 gb/s
nk_reduce_minmax_i16_serial2.96 gb/s2.99 gb/s2.99 gb/s
nk_reduce_minmax_i16_v128relaxed18.8 gb/s14.5 gb/s16.1 gb/s
u16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_reduce_moments_u16_serial7.51 gb/s7.28 gb/s7.23 gb/s
nk_reduce_moments_u16_v128relaxed13.2 gb/s9.98 gb/s8.84 gb/s
nk_reduce_minmax_u16_serial3.04 gb/s3.11 gb/s3.11 gb/s
nk_reduce_minmax_u16_v128relaxed16.5 gb/s14.5 gb/s16.2 gb/s
i32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_reduce_moments_i32_serial5.73 gb/s5.66 gb/s5.58 gb/s
nk_reduce_moments_i32_v128relaxed6.40 gb/s6.35 gb/s6.39 gb/s
nk_reduce_minmax_i32_serial6.85 gb/s6.88 gb/s6.90 gb/s
nk_reduce_minmax_i32_v128relaxed14.0 gb/s15.9 gb/s16.7 gb/s
u32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_reduce_moments_u32_serial6.01 gb/s5.70 gb/s5.54 gb/s
nk_reduce_moments_u32_v128relaxed1.65 gb/s1.58 gb/s1.56 gb/s
nk_reduce_minmax_u32_serial6.80 gb/s6.87 gb/s6.90 gb/s
nk_reduce_minmax_u32_v128relaxed14.4 gb/s15.9 gb/s16.7 gb/s
i64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_reduce_moments_i64_serial9.52 gb/s9.73 gb/s9.62 gb/s
nk_reduce_moments_i64_v128relaxed7.42 gb/s7.40 gb/s7.50 gb/s
nk_reduce_minmax_i64_serial13.6 gb/s13.7 gb/s13.7 gb/s
nk_reduce_minmax_i64_v128relaxed16.4 gb/s16.9 gb/s17.0 gb/s
u64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_reduce_moments_u64_serial10.2 gb/s10.3 gb/s10.5 gb/s
nk_reduce_moments_u64_v128relaxed3.04 gb/s2.98 gb/s2.94 gb/s
nk_reduce_minmax_u64_serial13.6 gb/s13.7 gb/s13.7 gb/s
nk_reduce_minmax_u64_v128relaxed3.31 gb/s3.33 gb/s3.35 gb/s
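
Each floating-point cell above pairs a throughput with an error figure. As a reading aid, here is a minimal sketch of how such numbers are conventionally derived. It assumes that "gb/s" means 10^9 input bytes per second and that "ulp" is the distance from a higher-precision reference, measured in units of the last place of the output type. Both are assumptions, the helper names `throughput_gb_per_s` and `ulp_error` are hypothetical, and the workload is a NumPy stand-in rather than a NumKong kernel or its benchmark harness.

```python
import time
import numpy as np

def throughput_gb_per_s(num_bytes: int, seconds: float) -> float:
    """Input bytes per second in units of 1e9 bytes (assumed meaning of "gb/s")."""
    return num_bytes / seconds / 1e9

def ulp_error(result: float, reference: float, dtype=np.float32) -> float:
    """Absolute error divided by the spacing of `dtype` at the reference value."""
    spacing = float(np.spacing(dtype(abs(reference))))
    return abs(float(result) - float(reference)) / spacing

# Stand-in workload: a float32 sum over 4096 elements, checked against float64.
a = np.random.default_rng(0).random(4096, dtype=np.float32)
start = time.perf_counter()
measured = a.sum(dtype=np.float32)
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
reference = a.astype(np.float64).sum()

gbps = throughput_gb_per_s(a.nbytes, elapsed)
ulps = ulp_error(measured, reference)
print(f"{gbps:.3g} gb/s, {ulps:.1f} ulp")
```

A single timed call like this is noisy. A real harness would repeat the measurement and aggregate, so the printed throughput is only illustrative.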