Element-Wise Arithmetic in NumKong

April 2, 2026

NumKong implements element-wise vector arithmetic: addition, scaling, blending, and fused multiply-add across all supported numeric types. Each operation reads one to three input vectors and writes one output vector of the same length, with scalar coefficients $\alpha$ and $\beta$ controlling linear combinations. Mixed-precision workflows use narrower input types (Float16, BFloat16, Float8) with Float32 intermediate computation and narrowed output.

Sum (addition):

$$\text{result}_i = a_i + b_i$$

Scale:

$$\text{result}_i = \alpha \cdot a_i + \beta$$

Blend:

$$\text{result}_i = \alpha \cdot a_i + \beta \cdot b_i$$

Fused multiply-add:

$$\text{result}_i = \alpha \cdot a_i \cdot b_i + \beta \cdot c_i$$

Reformulating as Python pseudocode:

import numpy as np

# Each function mirrors one formula above; all input arrays share one length.

def fma(a: np.ndarray, b: np.ndarray, c: np.ndarray,
        alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    # Fused multiply-add: alpha * a_i * b_i + beta * c_i
    return alpha * a * b + beta * c

def scale(a: np.ndarray, alpha: float = 1.0, beta: float = 0.0) -> np.ndarray:
    # Scale: alpha * a_i + beta
    return alpha * a + beta

def add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Sum: a_i + b_i
    return a + b

def blend(a: np.ndarray, b: np.ndarray,
          alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    # Blend: alpha * a_i + beta * b_i
    return alpha * a + beta * b
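As a quick numeric check of the four formulas, the sketch below evaluates them on three-element vectors using plain Python lists. The function names (`each_sum`, `each_scale`, `each_blend`, `each_fma`) are illustrative only and are not claimed to match NumKong's Python bindings.

```python
# Illustrative list-based versions of the four formulas (hypothetical names).

def each_sum(a, b):
    return [x + y for x, y in zip(a, b)]

def each_scale(a, alpha, beta):
    return [alpha * x + beta for x in a]

def each_blend(a, b, alpha, beta):
    return [alpha * x + beta * y for x, y in zip(a, b)]

def each_fma(a, b, c, alpha, beta):
    return [alpha * x * y + beta * z for x, y, z in zip(a, b, c)]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [7.0, 8.0, 9.0]
alpha, beta = 2.0, 0.5

print(each_sum(a, b))                   # [5.0, 7.0, 9.0]
print(each_scale(a, alpha, beta))       # [2.5, 4.5, 6.5]
print(each_blend(a, b, alpha, beta))    # [4.0, 6.5, 9.0]
print(each_fma(a, b, c, alpha, beta))   # [11.5, 24.0, 40.5]
```

All values here are exactly representable in binary floating point, so the printed results match the formulas without rounding.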

Input & Output Types

Real and integer element-wise operations:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f64 | 64-bit IEEE 754 double precision |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f16 | 16-bit IEEE 754 half precision |
| bf16 | bf16 | 16-bit brain float |
| e4m3 | e4m3 | 8-bit Float8: 4 exponent, 3 mantissa bits |
| e5m2 | e5m2 | 8-bit Float8: 5 exponent, 2 mantissa bits |
| i8 | i8 | 8-bit signed integers, saturating |
| u8 | u8 | 8-bit unsigned integers, saturating |
| i16 | i16 | 16-bit signed integers |
| u16 | u16 | 16-bit unsigned integers |
| i32 | i32 | 32-bit signed integers |
| u32 | u32 | 32-bit unsigned integers |
| i64 | i64 | 64-bit signed integers |
| u64 | u64 | 64-bit unsigned integers |

Complex element-wise operations:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64c | f64c | 64-bit complex pairs |
| f32c | f32c | 32-bit complex pairs |

Optimizations

Widening-Narrowing Pipeline for Sub-32-bit Types

nk_each_fma_f16_haswell, nk_each_blend_bf16_neonbfdot, and nk_each_scale_e4m3_haswell widen inputs to Float32 before arithmetic, then narrow the result back to the original type. The widen-compute-narrow pipeline costs 2 extra conversion instructions per element but guarantees Float32-precision intermediate results. This matters most for FMA, where naive Float16 multiplication would lose 5+ bits of mantissa. Haswell processes 8 Float16 elements per cycle: VCVTPH2PS (widen) -> VFMADD231PS (FMA) -> VCVTPS2PH (narrow), fully pipelined across 3 execution ports.
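The effect of widening can be reproduced in a few lines. The sketch below is not NumKong code: it emulates Float16 and Float32 rounding with Python's `struct` module (formats `e` and `f`) and compares a widen-compute-narrow FMA against a hypothetical naive variant that rounds every intermediate to Float16. The helper names are made up for illustration, the inputs are assumed to be Float16-representable already, and the Float32 path rounds after each step, whereas a hardware FMA would round once.

```python
import struct

def round_f16(x: float) -> float:
    """Round a Python float to the nearest Float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_f32(x: float) -> float:
    """Round a Python float to the nearest Float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fma_widened(a, b, c, alpha=1.0, beta=1.0):
    """alpha*a*b + beta*c with Float32 intermediates, narrowed once at the end."""
    out = []
    for x, y, z in zip(a, b, c):
        product = round_f32(round_f32(alpha * x) * y)
        addend = round_f32(beta * z)
        out.append(round_f16(round_f32(product + addend)))
    return out

def fma_naive_f16(a, b, c, alpha=1.0, beta=1.0):
    """Same formula, but every intermediate is rounded to Float16."""
    out = []
    for x, y, z in zip(a, b, c):
        product = round_f16(round_f16(alpha * x) * y)
        addend = round_f16(beta * z)
        out.append(round_f16(product + addend))
    return out

# (1 + 2^-10)^2 = 1 + 2^-9 + 2^-20; the last term is lost in Float16.
a = [1.0 + 2.0 ** -10]
b = [1.0 + 2.0 ** -10]
c = [-(1.0 + 2.0 ** -9)]

print(fma_widened(a, b, c))    # [9.5367431640625e-07], i.e. 2**-20
print(fma_naive_f16(a, b, c))  # [0.0]
```

The Float32 intermediate keeps the low-order product bits long enough for the subtraction to expose them, while the all-Float16 path cancels to zero.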

Saturating Integer Arithmetic

nk_each_sum_i8_haswell and nk_each_sum_u8_neonhalf use saturating addition, clamping to type bounds instead of wrapping on overflow. Haswell uses VPADDSB / VPADDUSB for signed/unsigned 8-bit saturation in a single instruction (32 elements per cycle at YMM width). The serial fallback implements saturation via branch-free min/max, result = min(max(a + b, TYPE_MIN), TYPE_MAX), with overflow detection through sign-bit comparison.
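The min/max formula is easy to check by hand. The sketch below applies it to 8-bit ranges and contrasts it with two's-complement wrapping; the function names are illustrative. Python integers never overflow, so `a + b` is always exact here, whereas a C implementation has to widen the operands or detect overflow before clamping.

```python
I8_MIN, I8_MAX = -128, 127
U8_MIN, U8_MAX = 0, 255

def saturating_add(a, b, type_min, type_max):
    """Element-wise a + b clamped to [type_min, type_max]."""
    return [min(max(x + y, type_min), type_max) for x, y in zip(a, b)]

def wrapping_add_i8(a, b):
    """Element-wise a + b with two's-complement wraparound, for contrast."""
    return [((x + y + 128) % 256) - 128 for x, y in zip(a, b)]

print(saturating_add([100, -100, 5], [100, -100, 6], I8_MIN, I8_MAX))  # [127, -128, 11]
print(wrapping_add_i8([100, -100, 5], [100, -100, 6]))                 # [-56, 56, 11]
print(saturating_add([200, 10], [100, 20], U8_MIN, U8_MAX))            # [255, 30]
```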

Complex Number Layout

nk_each_fma_f32c_serial and nk_each_blend_f64c_serial operate on interleaved real/imaginary pairs: [re0, im0, re1, im1, ...]. Addition and scaling treat complex vectors as 2N-length real vectors, so no special handling is needed. FMA requires cross-lane operations because the complex product mixes real and imaginary lanes: re(a*b) = re(a)*re(b) - im(a)*im(b) and im(a*b) = re(a)*im(b) + im(a)*re(b). This is implemented via VFMADDSUB231PS, which subtracts in even (real) lanes and adds in odd (imaginary) lanes.
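A scalar sketch of the interleaved layout is shown below. It assumes real-valued `alpha` and `beta` to keep the example short; whether NumKong's complex kernels take real or complex coefficients is not stated here, and the function names are illustrative.

```python
def sum_interleaved(a, b):
    """Complex addition on [re0, im0, re1, im1, ...] is plain real addition."""
    return [x + y for x, y in zip(a, b)]

def fma_interleaved(a, b, c, alpha=1.0, beta=1.0):
    """alpha * a * b + beta * c on interleaved complex vectors (real alpha, beta)."""
    out = [0.0] * len(a)
    for i in range(0, len(a), 2):
        a_re, a_im = a[i], a[i + 1]
        b_re, b_im = b[i], b[i + 1]
        # Real lane subtracts the cross term, imaginary lane adds it.
        out[i] = alpha * (a_re * b_re - a_im * b_im) + beta * c[i]
        out[i + 1] = alpha * (a_re * b_im + a_im * b_re) + beta * c[i + 1]
    return out

a = [1.0, 2.0]   # 1 + 2j
b = [3.0, 4.0]   # 3 + 4j
c = [1.0, 1.0]   # 1 + 1j

print(sum_interleaved(a, b))                      # [4.0, 6.0]
print(fma_interleaved(a, b, c, alpha=2.0))        # [-9.0, 21.0]
print(2.0 * complex(1, 2) * complex(3, 4) + complex(1, 1))  # (-9+21j)
```

The last line cross-checks the result against Python's built-in complex type.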

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench, the internal tools included with the project, to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_DENSE_DIMENSIONS environment variable and set to 256, 1024, and 4096 elements. Throughput is measured in GB/s as the number of input bytes read per second. Accuracy is reported as mean ULP (units in last place) unless noted otherwise: the average number of representable floating-point values between the result and the exact answer. Rows marked 🧩 use external BLAS baselines rather than NumKong kernels. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
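To make the two metrics concrete, the sketch below computes a ULP distance for Float32 values and a throughput figure from a byte count and a duration. It is not the logic of nk_test or nk_bench: the helper names are hypothetical, the ULP distance is taken between the result and a reference already rounded to Float32, and GB is assumed to mean 10^9 bytes.

```python
import struct

def f32_ordered(x: float) -> int:
    """Map a Float32 value to an integer that preserves numeric ordering."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Values with the sign bit set are mirrored below zero; -0.0 maps to 0.
    return bits if bits < 0x80000000 else 0x80000000 - bits

def ulp_distance_f32(result: float, exact: float) -> int:
    """Number of representable Float32 values between result and exact."""
    return abs(f32_ordered(result) - f32_ordered(exact))

def mean_ulp_f32(results, exacts) -> float:
    distances = [ulp_distance_f32(r, e) for r, e in zip(results, exacts)]
    return sum(distances) / len(distances)

def throughput_gb_per_s(input_bytes: int, seconds: float) -> float:
    """Input bytes read per second, in units of 10^9 bytes."""
    return input_bytes / seconds / 1e9

one_ulp_above_one = 1.0 + 2.0 ** -23
print(mean_ulp_f32([1.0, one_ulp_above_one], [1.0, 1.0]))  # 0.5

# A blend over two f32 vectors of 4096 elements reads 2 * 4096 * 4 bytes.
blend_input_bytes = 2 * 4096 * 4
print(throughput_gb_per_s(blend_input_bytes, 2e-6))
```

With a made-up duration of 2 microseconds per call, the blend example works out to roughly 16.4 GB/s.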

Intel Sapphire Rapids

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| f64 | | | |
| sum_f64_with_blas 🧩 | 12.1 GB/s | 8.63 GB/s | 6.83 GB/s |
| each_blend_f64_with_blas 🧩 | 11.3 GB/s | 8.80 GB/s | 6.71 GB/s |
| nk_each_sum_f64_serial | 16.7 GB/s, 0 ulp | 17.2 GB/s, 0 ulp | 10.9 GB/s, 0 ulp |
| nk_each_sum_f64_haswell | 14.7 GB/s, 0 ulp | 10.2 GB/s, 0 ulp | 7.99 GB/s, 0 ulp |
| nk_each_sum_f64_skylake | 16.4 GB/s, 0 ulp | 16.7 GB/s, 0 ulp | 8.59 GB/s, 0 ulp |
| nk_each_scale_f64_serial | 11.4 GB/s, 0 ulp | 12.5 GB/s, 0 ulp | 7.99 GB/s, 0 ulp |
| nk_each_scale_f64_haswell | 9.61 GB/s, 0 ulp | 9.22 GB/s, 0 ulp | 5.08 GB/s, 0 ulp |
| nk_each_scale_f64_skylake | 11.2 GB/s, 0 ulp | 11.9 GB/s, 0 ulp | 6.30 GB/s, 0 ulp |
| nk_each_blend_f64_serial | 16.4 GB/s, 1.4 ulp | 16.4 GB/s, 1.1 ulp | 11.8 GB/s, 1.1 ulp |
| nk_each_blend_f64_haswell | 13.4 GB/s, 1.5 ulp | 11.2 GB/s, 1.5 ulp | 7.85 GB/s, 1.1 ulp |
| nk_each_blend_f64_skylake | 16.5 GB/s, 1.7 ulp | 15.9 GB/s, 1.5 ulp | 8.55 GB/s, 1.1 ulp |
| nk_each_fma_f64_serial | 20.1 GB/s, 1.5 ulp | 20.8 GB/s, 1.5 ulp | 11.7 GB/s, 1.3 ulp |
| nk_each_fma_f64_haswell | 16.8 GB/s, 1.5 ulp | 11.1 GB/s, 1.5 ulp | 9.42 GB/s, 2.8 ulp |
| nk_each_fma_f64_skylake | 19.8 GB/s, 1.4 ulp | 20.4 GB/s, 1.5 ulp | 11.6 GB/s, 2.7 ulp |
| f32 | | | |
| sum_f32_with_blas 🧩 | 11.6 GB/s | 12.8 GB/s | 7.57 GB/s |
| each_blend_f32_with_blas 🧩 | 10.3 GB/s | 9.60 GB/s | 6.69 GB/s |
| nk_each_sum_f32_serial | 16.9 GB/s, 0 ulp | 18.8 GB/s, 0 ulp | 15.7 GB/s, 0 ulp |
| nk_each_sum_f32_haswell | 14.6 GB/s, 0 ulp | 14.3 GB/s, 0 ulp | 8.49 GB/s, 0 ulp |
| nk_each_sum_f32_skylake | 16.7 GB/s, 0 ulp | 17.9 GB/s, 0 ulp | 16.5 GB/s, 0 ulp |
| nk_each_scale_f32_serial | 10.8 GB/s, 0 ulp | 13.0 GB/s, 0 ulp | 12.1 GB/s, 0 ulp |
| nk_each_scale_f32_haswell | 9.26 GB/s, 0 ulp | 10.2 GB/s, 0 ulp | 6.22 GB/s, 0 ulp |
| nk_each_scale_f32_skylake | 12.3 GB/s, 0 ulp | 11.9 GB/s, 0 ulp | 12.9 GB/s, 0 ulp |
| nk_each_blend_f32_serial | 16.2 GB/s, 351 ulp | 18.7 GB/s, 2.0 ulp | 17.4 GB/s, 1.4 ulp |
| nk_each_blend_f32_haswell | 14.5 GB/s, 2.3 ulp | 14.2 GB/s, 2.1 ulp | 8.16 GB/s, 1.3 ulp |
| nk_each_blend_f32_skylake | 15.7 GB/s, 1.9 ulp | 17.2 GB/s, 1.8 ulp | 15.9 GB/s, 1.3 ulp |
| nk_each_fma_f32_serial | 20.1 GB/s, 1.4 ulp | 18.7 GB/s, 2.1 ulp | 19.0 GB/s, 1.6 ulp |
| nk_each_fma_f32_haswell | 18.4 GB/s, 1.4 ulp | 15.3 GB/s, 1.8 ulp | 9.52 GB/s, 1.5 ulp |
| nk_each_fma_f32_skylake | 20.8 GB/s, 1.4 ulp | 19.3 GB/s, 1.7 ulp | 16.5 GB/s, 1.5 ulp |
| bf16 | | | |
| nk_each_sum_bf16_serial | 0.175 GB/s, 0 ulp | 0.178 GB/s, 0 ulp | 0.173 GB/s, 0 ulp |
| nk_each_sum_bf16_haswell | 7.97 GB/s, 0 ulp | 9.33 GB/s, 0 ulp | 10.3 GB/s, 0 ulp |
| nk_each_sum_bf16_skylake | 11.4 GB/s, 0 ulp | 11.2 GB/s, 0 ulp | 14.1 GB/s, 0 ulp |
| nk_each_scale_bf16_serial | 0.128 GB/s, 0 ulp | 0.119 GB/s, 0 ulp | 0.132 GB/s, 0 ulp |
| nk_each_scale_bf16_haswell | 6.08 GB/s, 0 ulp | 6.55 GB/s, 0 ulp | 6.92 GB/s, 0 ulp |
| nk_each_scale_bf16_skylake | 7.43 GB/s, 0 ulp | 8.04 GB/s, 0 ulp | 8.45 GB/s, 0 ulp |
| nk_each_blend_bf16_serial | 0.211 GB/s, 0 ulp | 0.204 GB/s, 0 ulp | 0.224 GB/s, 0 ulp |
| nk_each_blend_bf16_haswell | 8.58 GB/s, 2.2 ulp | 9.44 GB/s, 1.5 ulp | 10.2 GB/s, 1.5 ulp |
| nk_each_blend_bf16_skylake | 10.3 GB/s, 2.3 ulp | 11.9 GB/s, 1.3 ulp | 13.4 GB/s, 1.5 ulp |
| nk_each_fma_bf16_serial | 0.264 GB/s, 0 ulp | 0.260 GB/s, 0 ulp | 0.256 GB/s, 0 ulp |
| nk_each_fma_bf16_haswell | 10.9 GB/s, 1.5 ulp | 10.3 GB/s, 0.9 ulp | 11.4 GB/s, 1.0 ulp |
| nk_each_fma_bf16_skylake | 14.1 GB/s, 1.2 ulp | 13.0 GB/s, 0.7 ulp | 15.8 GB/s, 1.1 ulp |
| f16 | | | |
| nk_each_sum_f16_serial | 33.7 GB/s, 0 ulp | 16.1 GB/s, 0 ulp | 18.8 GB/s, 0 ulp |
| nk_each_sum_f16_haswell | 14.4 GB/s, 0 ulp | 11.8 GB/s, 0 ulp | 9.84 GB/s, 0 ulp |
| nk_each_sum_f16_sapphire | 39.2 GB/s, 0 ulp | 17.0 GB/s, 0 ulp | 18.9 GB/s, 0 ulp |
| nk_each_scale_f16_serial | 0.423 GB/s, 0 ulp | 0.282 GB/s, 0 ulp | 0.409 GB/s, 0 ulp |
| nk_each_scale_f16_haswell | 8.92 GB/s, 0 ulp | 8.59 GB/s, 0 ulp | 8.15 GB/s, 0 ulp |
| nk_each_scale_f16_skylake | 17.0 GB/s, 0 ulp | 10.7 GB/s, 0 ulp | 12.1 GB/s, 0 ulp |
| nk_each_blend_f16_serial | 0.769 GB/s, 1.3 ulp | 0.669 GB/s, 1.6 ulp | 0.792 GB/s, 1.5 ulp |
| nk_each_blend_f16_haswell | 13.5 GB/s, 1.2 ulp | 11.0 GB/s, 1.4 ulp | 11.7 GB/s, 1.5 ulp |
| nk_each_blend_f16_skylake | 16.9 GB/s, 1.3 ulp | 13.9 GB/s, 1.4 ulp | 14.2 GB/s, 1.1 ulp |
| nk_each_fma_f16_serial | 0.965 GB/s, 1.0 ulp | 0.787 GB/s, 1.1 ulp | 0.952 GB/s, 1.2 ulp |
| nk_each_fma_f16_haswell | 15.2 GB/s, 1.4 ulp | 13.6 GB/s, 1.0 ulp | 15.7 GB/s, 1.1 ulp |
| nk_each_fma_f16_skylake | 16.3 GB/s, 1.3 ulp | 16.2 GB/s, 1.3 ulp | 15.3 GB/s, 1.1 ulp |
| e4m3 | | | |
| nk_each_sum_e4m3_serial | 0.0966 GB/s, 0 ulp | 0.0943 GB/s, 0 ulp | 0.0946 GB/s, 0 ulp |
| nk_each_sum_e4m3_haswell | 0.895 GB/s, 0 ulp | 0.772 GB/s, 0 ulp | 0.824 GB/s, 0 ulp |
| nk_each_sum_e4m3_skylake | 1.48 GB/s, 0 ulp | 1.34 GB/s, 0 ulp | 1.40 GB/s, 0 ulp |
| nk_each_sum_e4m3_sapphire | 2.12 GB/s, 0 ulp | 1.89 GB/s, 0 ulp | 2.13 GB/s, 0 ulp |
| nk_each_scale_e4m3_serial | 0.0550 GB/s, 0 ulp | 0.0543 GB/s, 0 ulp | 0.0570 GB/s, 0 ulp |
| nk_each_scale_e4m3_haswell | 0.495 GB/s, 0 ulp | 0.532 GB/s, 0 ulp | 0.540 GB/s, 0 ulp |
| nk_each_scale_e4m3_skylake | 1.05 GB/s, 0 ulp | 1.02 GB/s, 0 ulp | 1.10 GB/s, 0 ulp |
| nk_each_blend_e4m3_serial | 0.0889 GB/s, 0 ulp | 0.0927 GB/s, 0 ulp | 0.0876 GB/s, 0 ulp |
| nk_each_blend_e4m3_haswell | 0.807 GB/s, 0.6 ulp | 0.756 GB/s, 0 ulp | 0.789 GB/s, 0 ulp |
| nk_each_blend_e4m3_skylake | 1.50 GB/s, 0 ulp | 1.44 GB/s, 0 ulp | 1.48 GB/s, 0 ulp |
| nk_each_fma_e4m3_serial | 0.120 GB/s, 0 ulp | 0.118 GB/s, 0.9 ulp | 0.115 GB/s, 0 ulp |
| nk_each_fma_e4m3_haswell | 0.989 GB/s, 0 ulp | 0.909 GB/s, 0 ulp | 0.967 GB/s, 0.5 ulp |
| nk_each_fma_e4m3_skylake | 1.89 GB/s, 0 ulp | 1.74 GB/s, 0 ulp | 1.80 GB/s, 0 ulp |
| e5m2 | | | |
| nk_each_sum_e5m2_serial | 0.115 GB/s, 0 ulp | 0.114 GB/s, 0 ulp | 0.110 GB/s, 0 ulp |
| nk_each_sum_e5m2_haswell | 1.04 GB/s, 0 ulp | 0.987 GB/s, 0 ulp | 0.972 GB/s, 0 ulp |
| nk_each_sum_e5m2_skylake | 1.71 GB/s, 0 ulp | 1.76 GB/s, 0 ulp | 1.79 GB/s, 0 ulp |
| nk_each_scale_e5m2_serial | 0.0593 GB/s, 0 ulp | 0.0630 GB/s, 0 ulp | 0.0630 GB/s, 0 ulp |
| nk_each_scale_e5m2_haswell | 0.601 GB/s, 0 ulp | 0.611 GB/s, 0 ulp | 0.588 GB/s, 0 ulp |
| nk_each_scale_e5m2_skylake | 1.11 GB/s, 0 ulp | 1.11 GB/s, 0 ulp | 1.12 GB/s, 0 ulp |
| nk_each_blend_e5m2_serial | 0.108 GB/s, 0 ulp | 0.113 GB/s, 0 ulp | 0.114 GB/s, 50 ulp |
| nk_each_blend_e5m2_haswell | 0.999 GB/s, 0 ulp | 0.895 GB/s, 0 ulp | 0.951 GB/s, 0 ulp |
| nk_each_blend_e5m2_skylake | 1.77 GB/s, 0 ulp | 1.65 GB/s, 0 ulp | 1.72 GB/s, 0 ulp |
| nk_each_fma_e5m2_serial | 0.155 GB/s, 5.1 ulp | 0.146 GB/s, 0 ulp | 0.149 GB/s, 0 ulp |
| nk_each_fma_e5m2_haswell | 1.24 GB/s, 0 ulp | 1.19 GB/s, 0 ulp | 1.25 GB/s, 0 ulp |
| nk_each_fma_e5m2_skylake | 2.36 GB/s, 0 ulp | 2.01 GB/s, 0 ulp | 2.11 GB/s, 0 ulp |
| e2m3 | | | |
| nk_each_sum_e2m3_serial | 0.109 GB/s, 0 ulp | 0.105 GB/s, 0 ulp | 0.110 GB/s, 0 ulp |
| nk_each_scale_e2m3_serial | 0.0500 GB/s, 0 ulp | 0.0474 GB/s, 0 ulp | 0.0495 GB/s, 0 ulp |
| nk_each_blend_e2m3_serial | 0.0864 GB/s, 0 ulp | 0.0888 GB/s, 0 ulp | 0.0908 GB/s, 0 ulp |
| nk_each_fma_e2m3_serial | 0.133 GB/s, 0 ulp | 0.128 GB/s, 0 ulp | 0.128 GB/s, 0 ulp |
| e3m2 | | | |
| nk_each_sum_e3m2_serial | 0.118 GB/s, 0 ulp | 0.115 GB/s, 0 ulp | 0.106 GB/s, 0 ulp |
| nk_each_scale_e3m2_serial | 0.0574 GB/s, 0 ulp | 0.0555 GB/s, 0 ulp | 0.0548 GB/s, 0 ulp |
| nk_each_blend_e3m2_serial | 0.110 GB/s, 0 ulp | 0.105 GB/s, 0 ulp | 0.0956 GB/s, 0 ulp |
| nk_each_fma_e3m2_serial | 0.151 GB/s, 0 ulp | 0.149 GB/s, 0 ulp | 0.138 GB/s, 0 ulp |
| i8 | | | |
| nk_each_sum_i8_serial | 21.7 GB/s | 15.9 GB/s | 12.5 GB/s |
| nk_each_sum_i8_haswell | 37.7 GB/s | 15.7 GB/s | 16.5 GB/s |
| nk_each_sum_i8_icelake | 47.2 GB/s | 17.7 GB/s | 17.7 GB/s |
| nk_each_scale_i8_serial | 2.34 GB/s | 1.50 GB/s | 1.70 GB/s |
| nk_each_scale_i8_haswell | 3.91 GB/s | 3.93 GB/s | 3.64 GB/s |
| nk_each_scale_i8_skylake | 6.74 GB/s | 6.70 GB/s | 6.93 GB/s |
| nk_each_scale_i8_sapphire | 23.0 GB/s | 11.4 GB/s | 10.8 GB/s |
| nk_each_blend_i8_serial | 3.66 GB/s | 2.23 GB/s | 2.60 GB/s |
| nk_each_blend_i8_haswell | 5.95 GB/s | 5.37 GB/s | 6.37 GB/s |
| nk_each_blend_i8_sapphire | 32.4 GB/s | 17.7 GB/s | 15.2 GB/s |
| nk_each_fma_i8_serial | 4.49 GB/s | 2.63 GB/s | 2.98 GB/s |
| nk_each_fma_i8_haswell | 7.36 GB/s | 6.84 GB/s | 7.15 GB/s |
| nk_each_fma_i8_skylake | 11.2 GB/s | 9.45 GB/s | 10.1 GB/s |
| u8 | | | |
| nk_each_sum_u8_serial | 17.0 GB/s | 13.7 GB/s | 12.5 GB/s |
| nk_each_sum_u8_haswell | 42.6 GB/s | 15.7 GB/s | 15.4 GB/s |
| nk_each_sum_u8_icelake | 45.9 GB/s | 17.3 GB/s | 17.4 GB/s |
| nk_each_scale_u8_serial | 2.11 GB/s | 2.03 GB/s | 1.86 GB/s |
| nk_each_scale_u8_haswell | 3.91 GB/s | 3.89 GB/s | 4.27 GB/s |
| nk_each_scale_u8_skylake | 6.93 GB/s | 5.99 GB/s | 6.70 GB/s |
| nk_each_scale_u8_sapphire | 24.7 GB/s | 12.1 GB/s | 11.9 GB/s |
| nk_each_blend_u8_serial | 3.23 GB/s | 2.62 GB/s | 3.43 GB/s |
| nk_each_blend_u8_haswell | 4.87 GB/s | 5.10 GB/s | 5.61 GB/s |
| nk_each_blend_u8_sapphire | 39.8 GB/s | 18.1 GB/s | 16.5 GB/s |
| nk_each_fma_u8_serial | 3.19 GB/s | 3.92 GB/s | 4.54 GB/s |
| nk_each_fma_u8_haswell | 6.98 GB/s | 6.29 GB/s | 7.62 GB/s |
| nk_each_fma_u8_skylake | 9.66 GB/s | 9.21 GB/s | 10.3 GB/s |
| i16 | | | |
| nk_each_sum_i16_serial | 12.0 GB/s | 11.6 GB/s | 14.2 GB/s |
| nk_each_sum_i16_haswell | 26.1 GB/s | 16.2 GB/s | 16.8 GB/s |
| nk_each_sum_i16_icelake | 39.3 GB/s | 18.2 GB/s | 18.0 GB/s |
| nk_each_scale_i16_serial | 3.19 GB/s | 4.25 GB/s | 3.77 GB/s |
| nk_each_scale_i16_haswell | 7.07 GB/s | 7.42 GB/s | 7.69 GB/s |
| nk_each_scale_i16_skylake | 12.8 GB/s | 8.94 GB/s | 10.5 GB/s |
| nk_each_blend_i16_serial | 6.00 GB/s | 6.04 GB/s | 6.37 GB/s |
| nk_each_fma_i16_serial | 8.37 GB/s | 7.64 GB/s | 7.07 GB/s |
| nk_each_fma_i16_haswell | 10.7 GB/s | 12.5 GB/s | 13.4 GB/s |
| nk_each_fma_i16_skylake | 18.3 GB/s | 15.7 GB/s | 19.0 GB/s |
| u16 | | | |
| nk_each_sum_u16_serial | 12.2 GB/s | 11.8 GB/s | 13.7 GB/s |
| nk_each_sum_u16_haswell | 24.1 GB/s | 14.4 GB/s | 17.7 GB/s |
| nk_each_sum_u16_icelake | 39.7 GB/s | 18.2 GB/s | 18.4 GB/s |
| nk_each_scale_u16_serial | 4.44 GB/s | 5.28 GB/s | 5.48 GB/s |
| nk_each_scale_u16_haswell | 7.82 GB/s | 7.21 GB/s | 7.66 GB/s |
| nk_each_scale_u16_skylake | 15.5 GB/s | 11.6 GB/s | 9.04 GB/s |
| nk_each_blend_u16_serial | 7.84 GB/s | 7.08 GB/s | 9.15 GB/s |
| nk_each_fma_u16_serial | 9.01 GB/s | 8.19 GB/s | 9.37 GB/s |
| nk_each_fma_u16_haswell | 11.2 GB/s | 12.2 GB/s | 13.9 GB/s |
| nk_each_fma_u16_skylake | 22.6 GB/s | 20.8 GB/s | 17.5 GB/s |
| i32 | | | |
| nk_each_sum_i32_serial | 12.6 GB/s | 12.7 GB/s | 10.1 GB/s |
| nk_each_sum_i32_haswell | 14.6 GB/s | 15.3 GB/s | 13.9 GB/s |
| nk_each_sum_i32_icelake | 17.1 GB/s | 18.5 GB/s | 15.4 GB/s |
| nk_each_scale_i32_serial | 3.04 GB/s | 3.28 GB/s | 3.44 GB/s |
| nk_each_scale_i32_haswell | 8.94 GB/s | 8.72 GB/s | 8.09 GB/s |
| nk_each_scale_i32_skylake | 11.1 GB/s | 12.6 GB/s | 11.3 GB/s |
| nk_each_blend_i32_serial | 4.99 GB/s | 5.99 GB/s | 5.06 GB/s |
| nk_each_fma_i32_serial | 6.33 GB/s | 6.38 GB/s | 6.40 GB/s |
| nk_each_fma_i32_haswell | 13.5 GB/s | 16.4 GB/s | 11.2 GB/s |
| nk_each_fma_i32_skylake | 20.1 GB/s | 21.3 GB/s | 16.0 GB/s |
| u32 | | | |
| nk_each_sum_u32_serial | 13.6 GB/s | 11.6 GB/s | 9.94 GB/s |
| nk_each_sum_u32_haswell | 16.0 GB/s | 17.7 GB/s | 12.7 GB/s |
| nk_each_sum_u32_icelake | 17.4 GB/s | 19.0 GB/s | 14.8 GB/s |
| nk_each_scale_u32_serial | 2.13 GB/s | 3.22 GB/s | 2.74 GB/s |
| nk_each_scale_u32_haswell | 8.20 GB/s | 9.40 GB/s | 9.18 GB/s |
| nk_each_scale_u32_skylake | 10.6 GB/s | 12.9 GB/s | 11.1 GB/s |
| nk_each_blend_u32_serial | 3.76 GB/s | 5.22 GB/s | 5.78 GB/s |
| nk_each_fma_u32_serial | 4.78 GB/s | 5.63 GB/s | 8.43 GB/s |
| nk_each_fma_u32_haswell | 13.7 GB/s | 16.1 GB/s | 12.1 GB/s |
| nk_each_fma_u32_skylake | 20.2 GB/s | 21.1 GB/s | 15.2 GB/s |
| i64 | | | |
| nk_each_sum_i64_serial | 15.9 GB/s | 15.8 GB/s | 10.3 GB/s |
| nk_each_sum_i64_icelake | 17.7 GB/s | 19.2 GB/s | 11.1 GB/s |
| nk_each_scale_i64_serial | 7.44 GB/s | 9.17 GB/s | 8.72 GB/s |
| nk_each_scale_i64_skylake | 11.8 GB/s | 13.8 GB/s | 9.10 GB/s |
| nk_each_blend_i64_serial | 10.9 GB/s | 14.3 GB/s | 10.5 GB/s |
| nk_each_fma_i64_serial | 13.5 GB/s | 19.0 GB/s | 11.8 GB/s |
| nk_each_fma_i64_skylake | 21.7 GB/s | 22.2 GB/s | 11.8 GB/s |
| u64 | | | |
| nk_each_sum_u64_serial | 14.7 GB/s | 16.9 GB/s | 10.5 GB/s |
| nk_each_sum_u64_icelake | 18.1 GB/s | 19.2 GB/s | 12.4 GB/s |
| nk_each_scale_u64_serial | 7.99 GB/s | 9.79 GB/s | 8.82 GB/s |
| nk_each_scale_u64_skylake | 11.6 GB/s | 13.9 GB/s | 7.24 GB/s |
| nk_each_blend_u64_serial | 11.9 GB/s | 16.5 GB/s | 13.8 GB/s |
| nk_each_fma_u64_serial | 15.3 GB/s | 21.6 GB/s | 14.0 GB/s |
| nk_each_fma_u64_skylake | 21.6 GB/s | 21.8 GB/s | 11.4 GB/s |
| f64c | | | |
| nk_each_scale_f64c_serial | 10.6 GB/s, 3.4 ulp | 7.04 GB/s, 2.6 ulp | 5.63 GB/s, 2.0 ulp |
| nk_each_scale_f64c_haswell | 9.89 GB/s, 3.5 ulp | 6.45 GB/s, 2.0 ulp | 5.90 GB/s, 2.1 ulp |
| nk_each_scale_f64c_skylake | 9.29 GB/s, 3.5 ulp | 6.01 GB/s, 2.0 ulp | 5.67 GB/s, 2.0 ulp |
| nk_each_blend_f64c_serial | 16.2 GB/s, 2.5 ulp | 8.80 GB/s, 2.4 ulp | 8.23 GB/s, 2.6 ulp |
| nk_each_blend_f64c_haswell | 14.0 GB/s, 2.5 ulp | 8.26 GB/s, 2.4 ulp | 8.59 GB/s, 2.7 ulp |
| nk_each_blend_f64c_skylake | 14.3 GB/s, 2.5 ulp | 8.82 GB/s, 2.7 ulp | 7.33 GB/s, 2.7 ulp |
| nk_each_fma_f64c_serial | 17.4 GB/s, 4.5 ulp | 10.1 GB/s, 3.2 ulp | 9.45 GB/s, 2.8 ulp |
| nk_each_fma_f64c_haswell | 14.5 GB/s, 4.9 ulp | 8.86 GB/s, 3.1 ulp | 9.31 GB/s, 2.7 ulp |
| nk_each_fma_f64c_skylake | 16.1 GB/s, 4.0 ulp | 8.93 GB/s, 3.4 ulp | 10.4 GB/s, 2.8 ulp |
| f32c | | | |
| nk_each_scale_f32c_serial | 10.1 GB/s, 2.1 ulp | 9.28 GB/s, 1.8 ulp | 6.20 GB/s, 2.1 ulp |
| nk_each_scale_f32c_haswell | 8.81 GB/s, 2.1 ulp | 8.87 GB/s, 1.8 ulp | 7.73 GB/s, 2.1 ulp |
| nk_each_scale_f32c_skylake | 9.67 GB/s, 2.2 ulp | 7.61 GB/s, 1.8 ulp | 7.09 GB/s, 2.1 ulp |
| nk_each_blend_f32c_serial | 14.6 GB/s, 6.9 ulp | 10.8 GB/s, 2.8 ulp | 8.43 GB/s, 8.7 ulp |
| nk_each_blend_f32c_haswell | 13.0 GB/s, 8.7 ulp | 12.1 GB/s, 2.9 ulp | 10.5 GB/s, 9.7 ulp |
| nk_each_blend_f32c_skylake | 14.0 GB/s, 7.7 ulp | 9.28 GB/s, 2.7 ulp | 9.47 GB/s, 10.2 ulp |
| nk_each_fma_f32c_serial | 15.7 GB/s, 8.8 ulp | 11.96 GB/s, 3.9 ulp | 9.33 GB/s, 8.5 ulp |
| nk_each_fma_f32c_haswell | 14.1 GB/s, 6.6 ulp | 10.3 GB/s, 2.9 ulp | 9.77 GB/s, 7.2 ulp |
| nk_each_fma_f32c_skylake | 15.5 GB/s, 9.2 ulp | 10.3 GB/s, 3.8 ulp | 9.95 GB/s, 8.3 ulp |

Apple M5

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| f64 | | | |
| nk_each_sum_f64_serial | 54.7 GB/s, 0 ulp | 66.7 GB/s, 0 ulp | 51.8 GB/s, 0 ulp |
| nk_each_sum_f64_neon | 53.8 GB/s, 0 ulp | 71.3 GB/s, 0 ulp | 56.0 GB/s, 0 ulp |
| nk_each_scale_f64_serial | 46.4 GB/s, 0 ulp | 58.2 GB/s, 0 ulp | 48.2 GB/s, 0 ulp |
| nk_each_scale_f64_neon | 42.1 GB/s, 0 ulp | 50.8 GB/s, 0 ulp | 52.0 GB/s, 0 ulp |
| nk_each_blend_f64_serial | 75.1 GB/s, 0 ulp | 71.8 GB/s, 0 ulp | 60.5 GB/s, 0 ulp |
| nk_each_blend_f64_neon | 64.9 GB/s, 1.9 ulp | 69.8 GB/s, 2.7 ulp | 64.5 GB/s, 1.8 ulp |
| nk_each_fma_f64_serial | 83.7 GB/s, 0 ulp | 76.3 GB/s, 0 ulp | 66.0 GB/s, 0 ulp |
| nk_each_fma_f64_neon | 71.5 GB/s, 1.4 ulp | 71.6 GB/s, 1.6 ulp | 69.6 GB/s, 1.6 ulp |
| f32 | | | |
| nk_each_sum_f32_serial | 89.9 GB/s, 0 ulp | 67.8 GB/s, 0 ulp | 59.6 GB/s, 0 ulp |
| nk_each_sum_f32_neon | 75.5 GB/s, 0 ulp | 66.2 GB/s, 0 ulp | 55.2 GB/s, 0 ulp |
| nk_each_scale_f32_serial | 66.0 GB/s, 0 ulp | 61.5 GB/s, 0 ulp | 54.4 GB/s, 0 ulp |
| nk_each_scale_f32_neon | 49.7 GB/s, 0 ulp | 44.7 GB/s, 0 ulp | 49.2 GB/s, 0 ulp |
| nk_each_blend_f32_serial | 87.1 GB/s, 26 ulp | 66.9 GB/s, 26 ulp | 52.1 GB/s, 2.0 ulp |
| nk_each_blend_f32_neon | 74.4 GB/s, 1.7 ulp | 66.7 GB/s, 1.6 ulp | 50.1 GB/s, 1.6 ulp |
| nk_each_fma_f32_serial | 89.4 GB/s, 2.1 ulp | 51.3 GB/s, 2.5 ulp | 56.2 GB/s, 2.2 ulp |
| nk_each_fma_f32_neon | 82.4 GB/s, 2.1 ulp | 55.2 GB/s, 21 ulp | 56.1 GB/s, 1.8 ulp |
| bf16 | | | |
| nk_each_sum_bf16_serial | 32.8 GB/s, 0 ulp | 31.6 GB/s, 0 ulp | 33.8 GB/s, 0 ulp |
| nk_each_sum_bf16_neonbfdot | 42.0 GB/s, 0 ulp | 41.7 GB/s, 0 ulp | 43.3 GB/s, 0 ulp |
| nk_each_scale_bf16_serial | 19.3 GB/s, 0 ulp | 17.7 GB/s, 0 ulp | 19.8 GB/s, 0 ulp |
| nk_each_scale_bf16_neonbfdot | 26.9 GB/s, 0 ulp | 23.8 GB/s, 0 ulp | 26.7 GB/s, 0 ulp |
| nk_each_blend_bf16_serial | 25.3 GB/s, 28 ulp | 24.0 GB/s, 26 ulp | 26.4 GB/s, 2.2 ulp |
| nk_each_blend_bf16_neonbfdot | 34.8 GB/s, 29 ulp | 36.1 GB/s, 29 ulp | 34.7 GB/s, 2.2 ulp |
| nk_each_fma_bf16_serial | 31.7 GB/s, 2.1 ulp | 29.3 GB/s, 2.0 ulp | 33.1 GB/s, 33 ulp |
| nk_each_fma_bf16_neonbfdot | 42.8 GB/s, 1.2 ulp | 39.8 GB/s, 1.2 ulp | 39.3 GB/s, 1.5 ulp |
| f16 | | | |
| nk_each_sum_f16_serial | 91.5 GB/s, 0 ulp | 68.2 GB/s, 0 ulp | 66.1 GB/s, 0 ulp |
| nk_each_sum_f16_neonhalf | 83.2 GB/s, 0 ulp | 67.3 GB/s, 0 ulp | 61.2 GB/s, 0 ulp |
| nk_each_scale_f16_serial | 37.6 GB/s, 0 ulp | 35.9 GB/s, 0 ulp | 36.2 GB/s, 0 ulp |
| nk_each_scale_f16_neonhalf | 50.0 GB/s, 87.3K ulp | 48.2 GB/s, 84.8K ulp | 41.2 GB/s, 87.3K ulp |
| nk_each_blend_f16_serial | 38.3 GB/s, 2.0 ulp | 35.8 GB/s, 2.0 ulp | 39.1 GB/s, 2.3 ulp |
| nk_each_blend_f16_neonhalf | 78.4 GB/s, 91.6K ulp | 59.2 GB/s, 92.6K ulp | 66.2 GB/s, 91.9K ulp |
| nk_each_fma_f16_serial | 43.5 GB/s, 2.1 ulp | 37.2 GB/s, 1.8 ulp | 43.4 GB/s, 2.2 ulp |
| nk_each_fma_f16_neonhalf | 86.5 GB/s, 97.1K ulp | 75.4 GB/s, 96.7K ulp | 71.2 GB/s, 99.2K ulp |
| e4m3 | | | |
| nk_each_sum_e4m3_serial | 0.358 GB/s, 0 ulp | 0.332 GB/s, 0 ulp | 0.369 GB/s, 0 ulp |
| nk_each_sum_e4m3_neon | 1.59 GB/s, 0 ulp | 1.75 GB/s, 0 ulp | 1.72 GB/s, 0 ulp |
| nk_each_scale_e4m3_serial | 0.148 GB/s, 0 ulp | 0.137 GB/s, 0 ulp | 0.153 GB/s, 0 ulp |
| nk_each_scale_e4m3_neon | 0.954 GB/s, 0 ulp | 1.07 GB/s, 0 ulp | 1.03 GB/s, 0 ulp |
| nk_each_blend_e4m3_serial | 0.253 GB/s, 0.4 ulp | 0.239 GB/s, 1.1 ulp | 0.259 GB/s, 2.8 ulp |
| nk_each_blend_e4m3_neon | 1.56 GB/s, 0.1 ulp | 1.74 GB/s, 0 ulp | 1.68 GB/s, 0 ulp |
| nk_each_fma_e4m3_serial | 0.334 GB/s, 0.7 ulp | 0.317 GB/s, 0.6 ulp | 0.343 GB/s, 2.2 ulp |
| nk_each_fma_e4m3_neon | 1.92 GB/s, 0.1 ulp | 2.20 GB/s, 0.6 ulp | 2.08 GB/s, 1.0 ulp |
| e5m2 | | | |
| nk_each_sum_e5m2_serial | 0.437 GB/s, 0 ulp | 0.413 GB/s, 0 ulp | 0.445 GB/s, 0 ulp |
| nk_each_sum_e5m2_neon | 3.11 GB/s, 0 ulp | 3.48 GB/s, 0 ulp | 3.36 GB/s, 0 ulp |
| nk_each_scale_e5m2_serial | 0.180 GB/s, 0 ulp | 0.173 GB/s, 0 ulp | 0.178 GB/s, 0 ulp |
| nk_each_scale_e5m2_neon | 1.64 GB/s, 0 ulp | 1.82 GB/s, 0 ulp | 1.75 GB/s, 0 ulp |
| nk_each_blend_e5m2_serial | 0.317 GB/s, 0 ulp | 0.322 GB/s, 0 ulp | 0.316 GB/s, 4.9 ulp |
| nk_each_blend_e5m2_neon | 3.02 GB/s, 0.7 ulp | 3.29 GB/s, 0 ulp | 3.20 GB/s, 0 ulp |
| nk_each_fma_e5m2_serial | 0.476 GB/s, 1.9 ulp | 0.455 GB/s, 0.9 ulp | 0.453 GB/s, 4.0 ulp |
| nk_each_fma_e5m2_neon | 4.14 GB/s, 0 ulp | 4.42 GB/s, 1.3 ulp | 4.31 GB/s, 0 ulp |
| e2m3 | | | |
| nk_each_sum_e2m3_serial | 0.337 GB/s, ? ulp | 0.297 GB/s, ? ulp | 0.290 GB/s, ? ulp |
| nk_each_scale_e2m3_serial | 0.102 GB/s, ? ulp | 0.0925 GB/s, ? ulp | 0.0939 GB/s, ? ulp |
| nk_each_blend_e2m3_serial | 0.226 GB/s, ? ulp | 0.201 GB/s, ? ulp | 0.201 GB/s, ? ulp |
| nk_each_fma_e2m3_serial | 0.368 GB/s, ? ulp | 0.331 GB/s, ? ulp | 0.331 GB/s, ? ulp |
| e3m2 | | | |
| nk_each_sum_e3m2_serial | 0.544 GB/s, ? ulp | 0.472 GB/s, ? ulp | 0.458 GB/s, ? ulp |
| nk_each_scale_e3m2_serial | 0.176 GB/s, ? ulp | 0.151 GB/s, ? ulp | 0.143 GB/s, ? ulp |
| nk_each_blend_e3m2_serial | 0.355 GB/s, ? ulp | 0.302 GB/s, ? ulp | 0.292 GB/s, ? ulp |
| nk_each_fma_e3m2_serial | 0.506 GB/s, ? ulp | 0.433 GB/s, ? ulp | 0.436 GB/s, ? ulp |
| i8 | | | |
| nk_each_sum_i8_serial | 94.6 GB/s | 55.5 GB/s | 42.7 GB/s |
| nk_each_sum_i8_neonhalf | 85.5 GB/s | 60.5 GB/s | 39.6 GB/s |
| nk_each_scale_i8_serial | 0.171 GB/s | 0.145 GB/s | 0.148 GB/s |
| nk_each_scale_i8_neonhalf | 22.6 GB/s | 22.4 GB/s | 20.8 GB/s |
| nk_each_blend_i8_serial | 0.302 GB/s | 0.254 GB/s | 0.272 GB/s |
| nk_each_blend_i8_neonhalf | 26.5 GB/s | 27.7 GB/s | 27.3 GB/s |
| nk_each_fma_i8_serial | 0.436 GB/s | 0.359 GB/s | 0.376 GB/s |
| u8 | | | |
| nk_each_sum_u8_serial | 13.2 GB/s | 12.3 GB/s | 12.3 GB/s |
| nk_each_sum_u8_neonhalf | 85.3 GB/s | 84.8 GB/s | 67.3 GB/s |
| nk_each_scale_u8_serial | 0.135 GB/s | 0.120 GB/s | 0.119 GB/s |
| nk_each_scale_u8_neonhalf | 23.0 GB/s | 21.9 GB/s | 21.2 GB/s |
| nk_each_blend_u8_serial | 0.256 GB/s | 0.218 GB/s | 0.220 GB/s |
| nk_each_blend_u8_neonhalf | 27.4 GB/s | 27.0 GB/s | 27.4 GB/s |
| nk_each_fma_u8_serial | 0.356 GB/s | 0.333 GB/s | 0.323 GB/s |
| i16 | | | |
| nk_each_sum_i16_serial | 100 GB/s | 69.6 GB/s | 79.2 GB/s |
| nk_each_sum_i16_neon | 75.3 GB/s | 65.6 GB/s | 66.9 GB/s |
| nk_each_scale_i16_serial | 0.358 GB/s | 0.297 GB/s | 0.284 GB/s |
| nk_each_scale_i16_neon | 17.6 GB/s | 18.5 GB/s | 15.5 GB/s |
| nk_each_blend_i16_serial | 0.644 GB/s | 0.523 GB/s | 0.492 GB/s |
| nk_each_fma_i16_serial | 0.889 GB/s | 0.731 GB/s | 0.712 GB/s |
| nk_each_fma_i16_neon | 27.9 GB/s | 27.7 GB/s | 28.1 GB/s |
| u16 | | | |
| nk_each_sum_u16_serial | 25.0 GB/s | 22.4 GB/s | 23.2 GB/s |
| nk_each_sum_u16_neon | 79.2 GB/s | 66.8 GB/s | 68.1 GB/s |
| nk_each_scale_u16_serial | 0.267 GB/s | 0.245 GB/s | 0.241 GB/s |
| nk_each_scale_u16_neon | 17.0 GB/s | 18.9 GB/s | 18.6 GB/s |
| nk_each_blend_u16_serial | 0.520 GB/s | 0.432 GB/s | 0.444 GB/s |
| nk_each_fma_u16_serial | 0.744 GB/s | 0.664 GB/s | 0.638 GB/s |
| nk_each_fma_u16_neon | 28.1 GB/s | 27.9 GB/s | 28.9 GB/s |
| i32 | | | |
| nk_each_sum_i32_serial | 68.6 GB/s | 62.0 GB/s | 71.4 GB/s |
| nk_each_sum_i32_neon | 66.2 GB/s | 66.9 GB/s | 62.4 GB/s |
| nk_each_scale_i32_serial | 0.672 GB/s | 0.613 GB/s | 0.589 GB/s |
| nk_each_scale_i32_neon | 18.2 GB/s | 19.3 GB/s | 18.6 GB/s |
| nk_each_blend_i32_serial | 1.27 GB/s | 1.11 GB/s | 0.993 GB/s |
| nk_each_fma_i32_serial | 1.71 GB/s | 1.54 GB/s | 1.47 GB/s |
| nk_each_fma_i32_neon | 28.4 GB/s | 28.8 GB/s | 27.6 GB/s |
| u32 | | | |
| nk_each_sum_u32_serial | 56.2 GB/s | 46.0 GB/s | 50.0 GB/s |
| nk_each_sum_u32_neon | 84.9 GB/s | 69.7 GB/s | 55.1 GB/s |
| nk_each_scale_u32_serial | 0.598 GB/s | 0.555 GB/s | 0.552 GB/s |
| nk_each_scale_u32_neon | 18.4 GB/s | 18.8 GB/s | 18.5 GB/s |
| nk_each_blend_u32_serial | 1.02 GB/s | 0.982 GB/s | 0.953 GB/s |
| nk_each_fma_u32_serial | 1.50 GB/s | 1.43 GB/s | 1.41 GB/s |
| nk_each_fma_u32_neon | 28.2 GB/s | 28.9 GB/s | 27.0 GB/s |
| i64 | | | |
| nk_each_sum_i64_serial | 55.2 GB/s | 41.2 GB/s | 53.9 GB/s |
| nk_each_sum_i64_neon | 52.9 GB/s | 41.7 GB/s | 51.4 GB/s |
| nk_each_scale_i64_serial | 7.90 GB/s | 7.64 GB/s | 8.13 GB/s |
| nk_each_scale_i64_neon | 39.4 GB/s | 22.8 GB/s | 34.7 GB/s |
| nk_each_blend_i64_serial | 13.8 GB/s | 13.8 GB/s | 14.1 GB/s |
| nk_each_fma_i64_serial | 13.3 GB/s | 9.94 GB/s | 12.8 GB/s |
| nk_each_fma_i64_neon | 57.7 GB/s | 34.2 GB/s | 49.6 GB/s |
| u64 | | | |
| nk_each_sum_u64_serial | 52.6 GB/s | 31.5 GB/s | 50.9 GB/s |
| nk_each_sum_u64_neon | 54.3 GB/s | 41.3 GB/s | 48.8 GB/s |
| nk_each_scale_u64_serial | 9.29 GB/s | 9.48 GB/s | 9.11 GB/s |
| nk_each_scale_u64_neon | 38.4 GB/s | 35.1 GB/s | 35.1 GB/s |
| nk_each_blend_u64_serial | 14.3 GB/s | 16.1 GB/s | 14.1 GB/s |
| nk_each_fma_u64_serial | 14.2 GB/s | 13.9 GB/s | 15.2 GB/s |
| nk_each_fma_u64_neon | 52.4 GB/s | 52.0 GB/s | 53.2 GB/s |
| f64c | | | |
| nk_each_scale_f64c_serial | 28.1 GB/s, 2.2 ulp | 39.2 GB/s, 1.9 ulp | 37.8 GB/s, 2.2 ulp |
| nk_each_scale_f64c_neon | 25.6 GB/s, 1.5 ulp | 51.0 GB/s, 1.5 ulp | 42.1 GB/s, 1.3 ulp |
| nk_each_blend_f64c_serial | 44.9 GB/s, 4.2 ulp | 35.5 GB/s, 3.0 ulp | 48.8 GB/s, 2.6 ulp |
| nk_each_blend_f64c_neon | 43.7 GB/s, 3.2 ulp | 46.3 GB/s, 3.0 ulp | 48.3 GB/s, 2.2 ulp |
| nk_each_fma_f64c_serial | 58.2 GB/s, 3.4 ulp | 43.9 GB/s, 5.2 ulp | 52.2 GB/s, 2.5 ulp |
| nk_each_fma_f64c_neon | 56.1 GB/s, 3.2 ulp | 52.2 GB/s, 3.0 ulp | 53.2 GB/s, 2.4 ulp |
| f32c | | | |
| nk_each_scale_f32c_serial | 44.8 GB/s, 17 ulp | 37.7 GB/s, 2.4 ulp | 44.3 GB/s, 17 ulp |
| nk_each_scale_f32c_neon | 50.8 GB/s, 1.8 ulp | 52.7 GB/s, 1.6 ulp | 45.7 GB/s, 1.6 ulp |
| nk_each_blend_f32c_serial | 50.3 GB/s, 2.4 ulp | 49.0 GB/s, 2.6 ulp | 53.5 GB/s, 28 ulp |
| nk_each_blend_f32c_neon | 49.0 GB/s, 2.2 ulp | 58.3 GB/s, 2.2 ulp | 51.0 GB/s, 3.2 ulp |
| nk_each_fma_f32c_serial | 56.4 GB/s, 3.1 ulp | 49.8 GB/s, 78 ulp | 51.5 GB/s, 3.5 ulp |
| nk_each_fma_f32c_neon | 62.7 GB/s, 2.9 ulp | 62.4 GB/s, 4.1 ulp | 58.0 GB/s, 81 ulp |