Vector-Vector Dot Products in NumKong

April 20, 2026

NumKong implements dot products for every numeric type supported by the library, as a core building block of higher-level functionality for vectors and higher-rank tensors.

The dot product for real numbers and integers is defined as:

$$\text{dot}(a, b) = \sum_{i=0}^{n-1} a_i \cdot b_i$$

For complex numbers, the dot product expands via the distributive property of complex multiplication:

$$\text{dot}(a, b) = \sum_{i=0}^{n-1} (a_{i,re} \cdot b_{i,re} - a_{i,im} \cdot b_{i,im}) + j \sum_{i=0}^{n-1} (a_{i,re} \cdot b_{i,im} + a_{i,im} \cdot b_{i,re})$$

The conjugate dot product negates the imaginary part of $b$:

$$\text{vdot}(a, b) = \sum_{i=0}^{n-1} a_i \cdot \bar{b_i} = \sum_{i=0}^{n-1} (a_{i,re} \cdot b_{i,re} + a_{i,im} \cdot b_{i,im}) + j \sum_{i=0}^{n-1} (a_{i,im} \cdot b_{i,re} - a_{i,re} \cdot b_{i,im})$$

Where $\bar{b_i}$ is the complex conjugate of $b_i$. Reformulating as Python pseudocode for interleaved real/imaginary scalar arrays:

```python
from typing import List, Tuple, Union

number = Union[int, float]  # any real scalar type

def dot_real(a: List[number], b: List[number]) -> number:
    return sum(ai * bi for ai, bi in zip(a, b))

def dot_complex(a: List[number], b: List[number]) -> Tuple[number, number]:
    # Even indices hold real parts, odd indices hold imaginary parts
    a_re, a_im = a[0::2], a[1::2]
    b_re, b_im = b[0::2], b[1::2]
    ab_re = sum(ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im))
    ab_im = sum(ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im))
    return ab_re, ab_im

def vdot_complex(a: List[number], b: List[number]) -> Tuple[number, number]:
    # Same layout as dot_complex, with b conjugated
    a_re, a_im = a[0::2], a[1::2]
    b_re, b_im = b[0::2], b[1::2]
    ab_re = sum(ar * br + ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im))
    ab_im = sum(ai * br - ar * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im))
    return ab_re, ab_im
```
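As a quick sanity check of the interleaved layout, the sketch below restates the two complex routines compactly and compares them against Python's built-in `complex` type. The `interleave` helper and the sample vectors are illustrative only and are not part of NumKong.

```python
def interleave(values):
    """Flatten complex numbers into [re0, im0, re1, im1, ...]."""
    flat = []
    for value in values:
        flat.extend((value.real, value.imag))
    return flat

def dot_complex(a, b):
    pairs = list(zip(a[0::2], a[1::2], b[0::2], b[1::2]))
    return (sum(ar * br - ai * bi for ar, ai, br, bi in pairs),
            sum(ar * bi + ai * br for ar, ai, br, bi in pairs))

def vdot_complex(a, b):
    pairs = list(zip(a[0::2], a[1::2], b[0::2], b[1::2]))
    return (sum(ar * br + ai * bi for ar, ai, br, bi in pairs),
            sum(ai * br - ar * bi for ar, ai, br, bi in pairs))

x = [1 + 2j, 3 - 1j]
y = [2 - 1j, 0.5 + 4j]

# Reference results using Python's complex arithmetic
reference_dot = sum(p * q for p, q in zip(x, y))
reference_vdot = sum(p * q.conjugate() for p, q in zip(x, y))

got_dot = dot_complex(interleave(x), interleave(y))
got_vdot = vdot_complex(interleave(x), interleave(y))
```

Under the definition above the conjugate falls on $b$. `numpy.vdot` conjugates its first argument instead, so its imaginary part has the opposite sign for the same inputs.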

Input & Output Types

Real and integer dot products:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f64 | 64-bit IEEE 754 double precision |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f32 | 16-bit IEEE 754 half precision, widened output |
| bf16 | f32 | 16-bit brain float, widened output |
| e4m3 | f32 | 8-bit Float8: 4 exponent, 3 mantissa bits |
| e5m2 | f32 | 8-bit Float8: 5 exponent, 2 mantissa bits |
| e2m3 | f32 | 8-bit MX format: 2 exponent, 3 mantissa bits |
| e3m2 | f32 | 8-bit MX format: 3 exponent, 2 mantissa bits |
| i8 | i32 | 8-bit signed integers |
| u8 | u32 | 8-bit unsigned integers |
| i4 | i32 | 4-bit signed integers, packed nibble pairs |
| u4 | u32 | 4-bit unsigned integers, packed nibble pairs |
| u1 | u32 | 1-bit binary packed octets, popcount of AND |

Complex dot products (both dot and vdot):

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64c | f64c | 64-bit complex pairs |
| f32c | f32c | 32-bit complex pairs |
| f16c | f32c | 16-bit complex pairs, widened output |
| bf16c | f32c | 16-bit brain complex pairs, widened output |

Optimizations

Compensated Arithmetic for Large Floats

`nk_dot_f64_serial` uses Neumaier compensated summation, tracking a correction term adjusted by magnitude comparison at each step. `nk_dot_f64_haswell`, `nk_dot_f64_skylake`, and `nk_dot_f64_sve` implement the Dot2 algorithm by Ogita, Rump, and Oishi: TwoProd via FMA captures the rounding error of each product exactly, and a TwoSum chain propagates it through the accumulator. On SVE, the final horizontal reduction uses `svtbl` to extract upper halves at each tree level, applying TwoSum at every stage. The serial path uses Neumaier because it processes one element at a time and can cheaply branch on magnitudes. Dot2 avoids those branches entirely: TwoProd and TwoSum are pure arithmetic with no comparisons, mapping naturally to wide SIMD where branching per lane is impossible.
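The two schemes can be sketched in scalar Python as follows. This is a minimal illustration, not the NumKong kernels: the function names are hypothetical, and `two_prod` uses exact rational arithmetic as a stand-in for the FMA-based error term that the SIMD kernels are described as using.

```python
from fractions import Fraction

def two_sum(a, b):
    """Branch-free TwoSum: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err

def two_prod(a, b):
    """Returns (p, err) with p + err == a * b exactly (barring underflow).
    Assumption: Fraction replaces the FMA trick err = fma(a, b, -p)."""
    p = a * b
    err = float(Fraction(a) * Fraction(b) - Fraction(p))
    return p, err

def dot2(a, b):
    """Dot2-style compensated dot product: no comparisons in the loop."""
    total, compensation = 0.0, 0.0
    for x, y in zip(a, b):
        product, product_err = two_prod(x, y)
        total, sum_err = two_sum(total, product)
        compensation += product_err + sum_err
    return total + compensation

def dot_neumaier(a, b):
    """Neumaier summation of the products: branches on magnitudes."""
    total, compensation = 0.0, 0.0
    for x, y in zip(a, b):
        product = x * y
        t = total + product
        if abs(total) >= abs(product):
            compensation += (total - t) + product
        else:
            compensation += (product - t) + total
        total = t
    return total + compensation

a = [1e16, 1.0, -1e16]
b = [1.0, 1.0, 1.0]
naive = sum(x * y for x, y in zip(a, b))  # the 1.0 is absorbed and lost
```

On this input the naive sum loses the middle term to cancellation, while both compensated variants recover it. Note that `dot_neumaier` compensates only the additions, whereas `dot2` also captures the rounding error of each product.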

Lookup Tables for Mini-Floats

`nk_dot_e2m3_haswell`, `nk_dot_e3m2_haswell`, `nk_dot_e2m3_skylake`, and `nk_dot_e3m2_skylake` encode 32 MX format values into scaled integers via dual 16-entry LUTs loaded into vector registers. The low 4 magnitude bits index `VPSHUFB`, bit 4 selects between the lower and upper table via blending, and the results feed into `VPMADDUBSW` + `VPMADDWD` chains with a final $\div 256$ scaling. The Sapphire-specific MX implementation in `sapphire.h` replaces this with a single 64-entry signed LUT via `VPERMUTEX2VAR`, where the sign bit naturally selects between positive and negative tables. That path accumulates in native Float16 via `VFMADD_PH` and flushes to Float32 every 128 elements to avoid overflow.
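A scalar sketch of the dual-table lookup for E2M3 is shown below. It assumes the OCP MX encoding (exponent bias 1, subnormals at exponent 0) with the sign stored in bit 5 of each byte; the table contents and function names are illustrative, not taken from the NumKong sources. Scaling every magnitude by 16 makes it an integer, so a product of two entries carries a factor of $16 \times 16 = 256$.

```python
def e2m3_times_16(code5):
    """Magnitude of a 5-bit E2M3 code (2 exponent + 3 mantissa bits), times 16."""
    exponent, mantissa = code5 >> 3, code5 & 0x7
    if exponent == 0:                   # subnormal: (m / 8) * 16
        return 2 * mantissa
    return (8 + mantissa) << exponent   # (1 + m / 8) * 2^(e - 1) * 16

# Two 16-entry tables, mirroring the lower/upper split described above
LUT_LOW = [e2m3_times_16(i) for i in range(16)]
LUT_HIGH = [e2m3_times_16(16 + i) for i in range(16)]

def lookup(byte):
    low4 = byte & 0x0F                                        # table index
    magnitude = LUT_HIGH[low4] if byte & 0x10 else LUT_LOW[low4]  # bit 4 blends
    return -magnitude if byte & 0x20 else magnitude           # assumed sign bit

def dot_e2m3(a, b):
    total = sum(lookup(x) * lookup(y) for x, y in zip(a, b))  # integer math
    return total / 256.0                                      # undo 16 * 16

# 0x08 encodes 1.0, 0x1F encodes 7.5, 0x2C encodes -1.5
result = dot_e2m3([0x08, 0x1F], [0x2C, 0x1F])  # 1.0 * -1.5 + 7.5 * 7.5
```

Because every table entry is an integer no larger than 120, the accumulation is exact and the only floating-point operation is the final division by a power of two.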

Algebraic Domain Shifting

`nk_dot_i8_icelake` and `nk_dot_u8_icelake` work around `VPDPBUSD` requiring UInt8 × Int8 operands. For Int8 × Int8, one operand is XORed with `0x80` to shift to unsigned, and the correction $128 \cdot \sum b_i$ is computed via `VPSADBW`, which runs on port 5 and avoids contention with `VPDPBUSD` on ports 0-1. `nk_dot_i4_icelake` extends this to packed nibbles using the identity $(a'-8)(b'-8) = a' b' - 8(a'+b') + 64$: two `VPDPBUSD` calls handle low and high nibbles separately, with SAD-based correction.

`nk_dot_i8_v128relaxed` and `nk_dot_u8_v128relaxed` face an even tighter constraint: WASM's `i32x4_relaxed_dot_i8x16_i7x16_add` computes Int8 × Int7, so the sign bit of one operand must be masked off entirely. For Int8 × Int8, the sign bit of $b$ is cleared to produce a 7-bit value, and a windowed correction $-128 \cdot \sum_{b_i < 0} a_i$ is accumulated in Int16 and flushed every 127 iterations to prevent overflow. For UInt8 × UInt8, $b$ is XORed with `0x80` to shift into signed range, same as Ice Lake, with the correction $128 \cdot \sum a_i$ computed via pairwise widening adds.
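The algebra behind these corrections can be checked with plain Python integers. The sketch below is only a scalar model with hypothetical function names: it ignores SIMD lanes and the Int16 flush window, and just confirms that each shifted product plus its correction reproduces the signed dot product.

```python
def dot_i8_via_u8_i8(a, b):
    """Int8 x Int8 using only UInt8 x Int8 multiplies (the VPDPBUSD shape)."""
    a_unsigned = [(x & 0xFF) ^ 0x80 for x in a]            # a + 128, in [0, 255]
    biased = sum(au * y for au, y in zip(a_unsigned, b))   # sum((a + 128) * b)
    return biased - 128 * sum(b)                           # remove 128 * sum(b)

def dot_i4_via_u4(a, b):
    """Int4 x Int4 via (a' - 8)(b' - 8) = a'b' - 8(a' + b') + 64."""
    a_shifted = [x + 8 for x in a]                         # a' in [0, 15]
    b_shifted = [y + 8 for y in b]                         # b' in [0, 15]
    products = sum(x * y for x, y in zip(a_shifted, b_shifted))
    sums = sum(x + y for x, y in zip(a_shifted, b_shifted))
    return products - 8 * sums + 64 * len(a)

def dot_i8_via_i8_i7(a, b):
    """Int8 x Int8 using Int8 x Int7 multiplies (the relaxed-SIMD shape)."""
    b_7bit = [y & 0x7F for y in b]                         # b + 128 when b < 0
    biased = sum(x * y7 for x, y7 in zip(a, b_7bit))
    correction = -128 * sum(x for x, y in zip(a, b) if y < 0)
    return biased + correction

a8 = [-128, -1, 0, 127, 5]
b8 = [127, -128, -7, 3, -1]
a4 = [-8, -1, 0, 7]
b4 = [7, -8, 3, -5]
reference_i8 = sum(x * y for x, y in zip(a8, b8))
reference_i4 = sum(x * y for x, y in zip(a4, b4))
```

In each case the expensive multiply-accumulate runs on the operand shape the hardware supports, and the correction needs only a sum of one operand.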

Octave Decomposition for E4M3 via VNNI

`nk_dot_e4m3_icelake` splits the 4-bit E4M3 exponent into 2 "octave" bits (top) and 2 "remainder" bits (bottom). The bottom 5 bits (2 remainder + 3 mantissa) map via `VPERMB` to u8 integers in [0, 120], the same structure as the E2M3 $\times 16$ LUT. A subnormal fixup replaces LUT entries for magnitude < 8 with $2 \times \text{mantissa}$ via a second masked `VPERMB`, avoiding `VPADDB` on the `VPDPBUSD` execution ports. Sign is computed via `VPTERNLOGD` with immediate `0x14`, fusing $(a \oplus b) \wedge \lnot \text{0x7F}$ in one instruction. The 4 octave bins per operand produce $4 \times 4 = 16$ `VPDPBUSD` cross-products accumulated into 7 registers grouped by octave sum $k = o_a + o_b \in [0, 6]$. Each accumulator is scaled by $2^{4k-20}$, an exact power of two that introduces no rounding. This processes 64 E4M3 bytes per iteration in u8, doubling the element density of the BF16 upcast path.
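The decomposition can be verified in scalar Python. The sketch below assumes the common E4M3 encoding (exponent bias 7, subnormals at exponent 0, NaN at magnitude `0x7F`) and skips NaN codes; the function names are hypothetical and the sign is applied by ordinary multiplication rather than by a fused bitwise instruction.

```python
def e4m3_decode(byte):
    """Reference decode of one E4M3 byte (NaN codes excluded by the caller)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent, mantissa = (byte >> 3) & 0xF, byte & 0x7
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)

def e4m3_split(byte):
    """Split into (sign, octave, integer) so value == sign * integer * 2^(4*octave - 10)."""
    magnitude = byte & 0x7F
    octave = magnitude >> 5                  # top 2 exponent bits
    remainder = (magnitude >> 3) & 0x3       # bottom 2 exponent bits
    mantissa = magnitude & 0x7
    if magnitude < 8:                        # subnormal fixup
        integer = 2 * mantissa
    else:
        integer = (8 + mantissa) << remainder  # in [8, 120]
    return (-1 if byte & 0x80 else 1), octave, integer

def dot_e4m3_octaves(a, b):
    accumulators = [0] * 7                   # one per octave sum k in [0, 6]
    for x, y in zip(a, b):
        sx, ox, ix = e4m3_split(x)
        sy, oy, iy = e4m3_split(y)
        accumulators[ox + oy] += sx * sy * ix * iy   # exact integer math
    return sum(acc * 2.0 ** (4 * k - 20) for k, acc in enumerate(accumulators))

codes = [c for c in range(256) if c & 0x7F != 0x7F]  # all non-NaN bytes
direct = sum(e4m3_decode(x) * e4m3_decode(y) for x, y in zip(codes, reversed(codes)))
via_octaves = dot_e4m3_octaves(codes, list(reversed(codes)))
```

Each operand contributes a factor of $2^{4o-10}$, which is where the combined $2^{4k-20}$ scale for octave sum $k$ comes from.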

Widening Fusion Through BFloat16 on x86

`nk_dot_e5m2_genoa` converts FP8 values to BF16, then accumulates via `VDPBF16PS`, reusing Genoa's BF16 dot-product instruction for FP8 types. Each `VDPBF16PS` fuses two BF16 multiply-adds per 32-bit lane at 6-cycle throughput. On Skylake-X–class CPUs without BF16 dot-product hardware, `nk_dot_e4m3_skylake` / `nk_dot_e5m2_skylake` (and their Haswell twins `nk_dot_e4m3_haswell` / `nk_dot_e5m2_haswell`) instead route through the Giesen-style FP8 → F16 fake-bit-pattern cast, widen via `VCVTPH2PS`, and accumulate in F32 with two independent FMA chains reducing into a single register, avoiding the 3-chain scheduler stall of the BF16 algebraic form on kernels without native BF16 FMA. `nk_dot_bf16c_genoa` also uses `VDPBF16PS` for complex BF16, preparing operands with `VPSHUFB` for lane swapping and `VPXORD` with `0x80000000` for sign flips before the dot-product instruction.

Deferred Sign-Flip in Complex Dot Products

The Haswell BFloat16Complex/Float16Complex/Float32Complex kernels compute $\sum (a_r b_r - a_i b_i)$ without per-pair subtraction. Instead, two accumulators collect interleaved products $[a_r b_r, a_i b_i, \ldots]$ and $[a_r b_i, a_i b_r, \ldots]$, and a post-loop XOR flips the sign of every odd lane to produce the subtraction. This gives one FMA per accumulator per iteration, but each lane grows $O(n)$ while the true result is $O(\sqrt{n})$. The Float32Complex kernel absorbs this via Float64 accumulators; Genoa's `VDPBF16PS` and ARM's `FMLSL` pair terms naturally. For BFloat16Complex/Float16Complex on Haswell the accumulator is Float32, so the $O(\log n)$ precision loss from lane separation is visible in max ULP at large $n$, though mean ULP remains low.
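A lane-by-lane model of the deferred sign flip, written in plain Python, might look like the sketch below. The function name and the `lanes` parameter are hypothetical; Python lists stand in for vector registers, and the sign flip is done with negation rather than an XOR on the sign bit.

```python
def dot_complex_deferred(a, b, lanes=8):
    """a and b are interleaved [re, im, ...]; lanes must be even."""
    acc_real = [0.0] * lanes   # lanes collect [ar*br, ai*bi, ar*br, ai*bi, ...]
    acc_imag = [0.0] * lanes   # lanes collect [ar*bi, ai*br, ar*bi, ai*br, ...]
    for start in range(0, len(a), lanes):
        a_chunk = a[start:start + lanes]
        b_chunk = b[start:start + lanes]
        # Swap re/im inside each pair of b (the lane shuffle)
        b_swapped = [b_chunk[i ^ 1] for i in range(len(b_chunk))]
        for lane in range(len(a_chunk)):
            acc_real[lane] += a_chunk[lane] * b_chunk[lane]    # one FMA
            acc_imag[lane] += a_chunk[lane] * b_swapped[lane]  # one FMA
    # Deferred step: negate the odd lanes once, after the loop
    acc_real = [-v if lane % 2 else v for lane, v in enumerate(acc_real)]
    return sum(acc_real), sum(acc_imag)

a = [1.0, 2.0, 3.0, -1.0]    # (1 + 2j), (3 - 1j)
b = [2.0, -1.0, 0.5, 4.0]    # (2 - 1j), (0.5 + 4j)
wide = dot_complex_deferred(a, b, lanes=8)
narrow = dot_complex_deferred(a, b, lanes=2)
```

The even and odd lanes of `acc_real` each accumulate same-kind products without cancellation inside the loop, which is the lane growth the paragraph above describes.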

Widening Fusion Through Float16 on Arm

`nk_dot_f16_neonfhm` and `nk_dot_f16c_neonfhm` use the ARMv8.4-FHM instructions `FMLAL`/`FMLSL`, which fuse FP16-to-FP32 conversion with multiply-accumulate in a single operation. `vfmlalq_low_f16` and `vfmlalq_high_f16` process the lower and upper 4 elements of an 8-wide FP16 vector respectively. For complex dot products, `FMLSL` provides the subtraction path $a_{re} b_{im} - a_{im} b_{re}$ without a separate negate step.

Widening Chains on RISC-V

`nk_dot_i8_rvv` and `nk_dot_u8_rvv` use `vwmul` for Int8 × Int8 → Int16 widening multiply followed by `vwadd` to widen-accumulate into Int32, a two-stage chain that naturally prevents overflow. `nk_dot_bf16_rvvbf16` uses the Zvfbfwma extension's `vfwmaccbf16` for fused BFloat16 × BFloat16 → Float32 widening multiply-accumulate. `nk_dot_e4m3_rvvbf16` and `nk_dot_e5m2_rvvbf16` convert Float8 to BFloat16 via 256-entry LUTs, then feed the same `vfwmaccbf16` path.
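The headroom of the two-stage integer chain can be checked with a scalar model. The sketch below uses hypothetical names and Python integers with explicit range assertions in place of fixed-width registers; it is a bound check, not a model of the RVV instructions themselves.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def dot_i8_widening(a, b):
    total = 0                                        # models the Int32 accumulator
    for x, y in zip(a, b):
        product = x * y                              # Int8 x Int8 -> Int16 stage
        assert INT16_MIN <= product <= INT16_MAX     # always fits: |p| <= 16384
        total += product                             # Int16 -> Int32 stage
        assert INT32_MIN <= total <= INT32_MAX
    return total

worst_case_product = (-128) * (-128)                 # largest possible magnitude
safe_length = INT32_MAX // worst_case_product        # worst-case terms before overflow
worst_case_total = dot_i8_widening([-128] * 1000, [-128] * 1000)
```

Every Int8 product fits in Int16 with room to spare, and under this worst-case bound an Int32 accumulator holds on the order of $10^5$ such products.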

Performance

The following performance tables are produced by manually re-running the included internal tools `nk_test` and `nk_bench` to measure both accuracy and throughput at different input shapes. The input size is controlled by the `NK_DENSE_DIMENSIONS` environment variable and set to 256, 1024, and 4096 elements. Throughput is measured in gb/s as the number of bytes read per second, amortized over a large batch of vector pairs. Accuracy is reported as mean ULP (units in the last place) unless noted otherwise: the average number of representable floating-point values between the result and the exact answer. Rows marked 🧩 use external BLAS baselines rather than NumKong kernels. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
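For readers unfamiliar with the ULP metric, the sketch below shows one common way to count representable values between two doubles. It is an illustration with hypothetical function names, not the code used by `nk_test`, and it handles Float64 only, whereas the tables report ULP in each kernel's output type.

```python
import struct

def ordered_bits_f64(value):
    """Map a double to an integer that increases monotonically with the value."""
    bits = struct.unpack("<q", struct.pack("<d", value))[0]
    # Negative doubles are sign-magnitude; mirror them below zero
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance_f64(x, y):
    """Number of representable doubles separating x and y (0 if identical)."""
    return abs(ordered_bits_f64(x) - ordered_bits_f64(y))

def mean_ulp(results, references):
    distances = [ulp_distance_f64(r, e) for r, e in zip(results, references)]
    return sum(distances) / len(distances)

one_ulp_above = 1.0 + 2.0 ** -52      # next double after 1.0
two_ulp_above = 1.0 + 2.0 ** -51
example_mean = mean_ulp([1.0, two_ulp_above], [1.0, 1.0])
```

A reported value of 0 ulp therefore means the kernel's result matched the reference bit for bit on average across the batch.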

Intel Sapphire Rapids

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64c** | | | |
| dot_f64c_with_blas 🧩 | 29.1 gb/s, 25 ulp | 27.6 gb/s, 97 ulp | 13.6 gb/s, 32 ulp |
| vdot_f64c_with_blas 🧩 | 29.1 gb/s, 18 ulp | 27.9 gb/s, 17 ulp | 15.3 gb/s, 25 ulp |
| nk_dot_f64c_serial | 5.45 gb/s, 3.9 ulp | 6.49 gb/s, 9.0 ulp | 6.84 gb/s, 2.9 ulp |
| nk_vdot_f64c_serial | 5.47 gb/s, 4.6 ulp | 6.41 gb/s, 1.6 ulp | 6.76 gb/s, 2.2 ulp |
| nk_dot_f64c_skylake | 23.8 gb/s, 0 ulp | 23.4 gb/s, 0 ulp | 11.8 gb/s, 0 ulp |
| nk_vdot_f64c_skylake | 23.6 gb/s, 0 ulp | 23.7 gb/s, 0 ulp | 11.6 gb/s, 0 ulp |
| **f32c** | | | |
| dot_f32c_with_blas 🧩 | 28.7 gb/s, 8.6 ulp | 29.8 gb/s, 13 ulp | 15.8 gb/s, 19 ulp |
| vdot_f32c_with_blas 🧩 | 29.2 gb/s, 11 ulp | 30.2 gb/s, 14 ulp | 15.7 gb/s, 21 ulp |
| nk_dot_f32c_serial | 9.46 gb/s, 0 ulp | 9.82 gb/s, 0 ulp | 9.71 gb/s, 0 ulp |
| nk_vdot_f32c_serial | 9.64 gb/s, 0 ulp | 9.95 gb/s, 0 ulp | 10.1 gb/s, 0 ulp |
| nk_dot_f32c_haswell | 22.4 gb/s, 0 ulp | 22.2 gb/s, 0 ulp | 12.6 gb/s, 0 ulp |
| nk_vdot_f32c_haswell | 22.4 gb/s, 0 ulp | 21.8 gb/s, 0 ulp | 14.7 gb/s, 0 ulp |
| nk_dot_f32c_skylake | 25.6 gb/s, 0 ulp | 27.2 gb/s, 0 ulp | 17.0 gb/s, 0 ulp |
| nk_vdot_f32c_skylake | 27.8 gb/s, 0 ulp | 27.4 gb/s, 0 ulp | 18.8 gb/s, 0 ulp |
| **bf16c** | | | |
| nk_dot_bf16c_serial | 0.628 gb/s, 0.1 ulp | 0.627 gb/s, 2.3 ulp | 0.626 gb/s, 7.9 ulp |
| nk_vdot_bf16c_serial | 0.622 gb/s, 0.2 ulp | 0.624 gb/s, 2.1 ulp | 0.627 gb/s, 11.2 ulp |
| nk_dot_bf16c_haswell | 21.5 gb/s, 0.1 ulp | 18.5 gb/s, 1.3 ulp | 18.4 gb/s, 3.4 ulp |
| nk_vdot_bf16c_haswell | 21.9 gb/s, 0.8 ulp | 19.0 gb/s, 2.0 ulp | 18.5 gb/s, 4.5 ulp |
| nk_dot_bf16c_genoa | 37.9 gb/s, 0 ulp | 30.3 gb/s, 1.1 ulp | 29.5 gb/s, 2.8 ulp |
| nk_vdot_bf16c_genoa | 36.1 gb/s, 0.7 ulp | 30.2 gb/s, 1.2 ulp | 30.2 gb/s, 3.3 ulp |
| **f16c** | | | |
| nk_dot_f16c_serial | 2.02 gb/s, 14.4 ulp | 2.00 gb/s, 27.3 ulp | 2.02 gb/s, 34.0 ulp |
| nk_vdot_f16c_serial | 1.67 gb/s, 15.0 ulp | 1.64 gb/s, 26.3 ulp | 1.64 gb/s, 34.2 ulp |
| nk_dot_f16c_haswell | 23.9 gb/s, 12.7 ulp | 19.4 gb/s, 22.3 ulp | 19.3 gb/s, 40.1 ulp |
| nk_vdot_f16c_haswell | 24.0 gb/s, 11.1 ulp | 20.0 gb/s, 17.4 ulp | 17.1 gb/s, 29.2 ulp |
| **f64** | | | |
| dot_f64_with_blas 🧩 | 27.8 gb/s, 6.9 ulp | 30.1 gb/s, 9.3 ulp | 15.7 gb/s, 20 ulp |
| nk_dot_f64_serial | 4.28 gb/s, 2.2 ulp | 4.39 gb/s, 2.0 ulp | 4.42 gb/s, 3.3 ulp |
| nk_dot_f64_haswell | 24.2 gb/s, 0 ulp | 25.7 gb/s, 0 ulp | 18.3 gb/s, 0 ulp |
| nk_dot_f64_skylake | 29.0 gb/s, 0 ulp | 28.6 gb/s, 0 ulp | 24.9 gb/s, 0 ulp |
| **f32** | | | |
| dot_f32_with_blas 🧩 | 47.8 gb/s, 14 ulp | 30.7 gb/s, 14 ulp | 29.7 gb/s, 15 ulp |
| nk_dot_f32_serial | 11.0 gb/s, 0 ulp | 11.2 gb/s, 0 ulp | 11.5 gb/s, 0 ulp |
| nk_dot_f32_haswell | 30.5 gb/s, 0 ulp | 23.9 gb/s, 0 ulp | 24.4 gb/s, 0 ulp |
| nk_dot_f32_skylake | 44.2 gb/s, 0 ulp | 29.8 gb/s, 0 ulp | 30.0 gb/s, 0 ulp |
| **bf16** | | | |
| nk_dot_bf16_serial | 0.633 gb/s, 0 ulp | 0.630 gb/s, 0.5 ulp | 0.638 gb/s, 5.4 ulp |
| nk_dot_bf16_haswell | 39.3 gb/s, 0 ulp | 25.5 gb/s, 0.2 ulp | 20.2 gb/s, 25.3 ulp |
| nk_dot_bf16_skylake | 62.7 gb/s, 0 ulp | 30.2 gb/s, 0.2 ulp | 29.5 gb/s, 2.3 ulp |
| nk_dot_bf16_genoa | 88.8 gb/s, 0 ulp | 29.7 gb/s, 0.2 ulp | 31.2 gb/s, 2.2 ulp |
| **f16** | | | |
| nk_dot_f16_serial | 1.31 gb/s, 11.5 ulp | 1.32 gb/s, 33.7 ulp | 1.30 gb/s, 59.7 ulp |
| nk_dot_f16_haswell | 31.3 gb/s, 7.0 ulp | 22.8 gb/s, 14.0 ulp | 19.8 gb/s, 29.8 ulp |
| nk_dot_f16_skylake | 54.9 gb/s, 6.2 ulp | 31.7 gb/s, 8.6 ulp | 30.9 gb/s, 22.8 ulp |
| **e5m2** | | | |
| nk_dot_e5m2_serial | 1.90 gb/s, 0 ulp | 1.07 gb/s, 0 ulp | 1.08 gb/s, 0 ulp |
| nk_dot_e5m2_haswell | 4.92 gb/s, 0 ulp | 4.95 gb/s, 0 ulp | 4.80 gb/s, 0 ulp |
| nk_dot_e5m2_skylake | 6.20 gb/s, 0 ulp | 6.36 gb/s, 0 ulp | 6.25 gb/s, 0 ulp |
| nk_dot_e5m2_genoa | 12.1 gb/s, 0 ulp | 12.6 gb/s, 0 ulp | 12.6 gb/s, 0 ulp |
| **e4m3** | | | |
| nk_dot_e4m3_serial | 0.762 gb/s, 0 ulp | 0.424 gb/s, 0 ulp | 0.420 gb/s, 0 ulp |
| nk_dot_e4m3_haswell | 3.78 gb/s, 0 ulp | 3.77 gb/s, 0 ulp | 3.75 gb/s, 0 ulp |
| nk_dot_e4m3_skylake | 5.10 gb/s, 0 ulp | 5.16 gb/s, 0 ulp | 5.21 gb/s, 0 ulp |
| nk_dot_e4m3_icelake | 13.2 gb/s, 0 ulp | 14.9 gb/s, 0 ulp | 14.7 gb/s, 0 ulp |
| **e3m2** | | | |
| nk_dot_e3m2_serial | 1.47 gb/s, 0 ulp | 1.05 gb/s, 0 ulp | 1.04 gb/s, 0 ulp |
| nk_dot_e3m2_haswell | 12.0 gb/s, 0 ulp | 12.2 gb/s, 0 ulp | 12.2 gb/s, 0 ulp |
| nk_dot_e3m2_skylake | 21.6 gb/s, 0 ulp | 23.1 gb/s, 0 ulp | 23.2 gb/s, 0 ulp |
| nk_dot_e3m2_icelake | 23.1 gb/s, 0 ulp | 24.3 gb/s, 0 ulp | 23.9 gb/s, 0 ulp |
| **e2m3** | | | |
| nk_dot_e2m3_serial | 1.87 gb/s, 0 ulp | 1.25 gb/s, 0 ulp | 1.96 gb/s, 0 ulp |
| nk_dot_e2m3_haswell | 20.5 gb/s, 0 ulp | 20.4 gb/s, 0 ulp | 19.3 gb/s, 0 ulp |
| nk_dot_e2m3_skylake | 35.7 gb/s, 0 ulp | 33.2 gb/s, 0 ulp | 30.7 gb/s, 0 ulp |
| nk_dot_e2m3_icelake | 58.0 gb/s, 0 ulp | 46.0 gb/s, 0 ulp | 31.5 gb/s, 0 ulp |
| nk_dot_e2m3_alder | 29.9 gb/s, 0 ulp | 30.8 gb/s, 0 ulp | 29.1 gb/s, 0 ulp |
| **i8** | | | |
| nk_dot_i8_serial | 16.9 gb/s | 16.8 gb/s | 15.6 gb/s |
| nk_dot_i8_haswell | 43.2 gb/s | 35.8 gb/s | 29.1 gb/s |
| nk_dot_i8_skylake | 52.9 gb/s | 36.5 gb/s | 28.5 gb/s |
| nk_dot_i8_icelake | 64.0 gb/s | 46.2 gb/s | 26.8 gb/s |
| nk_dot_i8_alder | 42.8 gb/s | 40.4 gb/s | 31.1 gb/s |
| **u8** | | | |
| nk_dot_u8_serial | 16.9 gb/s | 16.5 gb/s | 15.8 gb/s |
| nk_dot_u8_haswell | 47.7 gb/s | 37.7 gb/s | 29.1 gb/s |
| nk_dot_u8_skylake | 48.7 gb/s | 32.6 gb/s | 27.5 gb/s |
| nk_dot_u8_icelake | 68.4 gb/s | 46.9 gb/s | 30.2 gb/s |
| nk_dot_u8_alder | 42.1 gb/s | 41.8 gb/s | 31.6 gb/s |
| **i4** | | | |
| nk_dot_i4_serial | 9.37 gb/s | 11.8 gb/s | 11.8 gb/s |
| nk_dot_i4_haswell | 8.22 gb/s | 8.53 gb/s | 8.23 gb/s |
| nk_dot_i4_icelake | 24.3 gb/s | 36.3 gb/s | 25.5 gb/s |
| **u4** | | | |
| nk_dot_u4_serial | 10.6 gb/s | 12.0 gb/s | 11.9 gb/s |
| nk_dot_u4_haswell | 15.0 gb/s | 16.4 gb/s | 14.3 gb/s |
| nk_dot_u4_icelake | 48.1 gb/s | 64.4 gb/s | 30.9 gb/s |
| **u1** | | | |
| nk_dot_u1_serial | 3.92 gb/s | 5.04 gb/s | 4.97 gb/s |
| nk_dot_u1_haswell | 14.7 gb/s | 43.2 gb/s | 69.4 gb/s |
| nk_dot_u1_icelake | 17.9 gb/s | 68.8 gb/s | 110 gb/s |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64c** | | | |
| nk_dot_f64c_serial | 1.76 gb/s, 5.3 ulp | 2.43 gb/s, 3.3 ulp | 0.28 gb/s, 2.9 ulp |
| nk_vdot_f64c_serial | 1.02 gb/s, 3.5 ulp | 2.44 gb/s, 5.5 ulp | 0.15 gb/s, 2.2 ulp |
| nk_dot_f64c_v128relaxed | 2.80 gb/s, 37.8 ulp | 3.01 gb/s, 34.9 ulp | 0.21 gb/s, 167 ulp |
| nk_vdot_f64c_v128relaxed | 2.06 gb/s, 20.1 ulp | 2.87 gb/s, 51.4 ulp | 0.04 gb/s, 57.2 ulp |
| **f32c** | | | |
| nk_dot_f32c_serial | 1.26 gb/s, 0 ulp | 1.74 gb/s, 0 ulp | 0.08 gb/s, 0 ulp |
| nk_vdot_f32c_serial | 1.13 gb/s, 0 ulp | 1.78 gb/s, 0 ulp | 0.21 gb/s, 0 ulp |
| nk_dot_f32c_v128relaxed | 1.62 gb/s, 0 ulp | 1.92 gb/s, 0 ulp | 0.20 gb/s, 0 ulp |
| nk_vdot_f32c_v128relaxed | 1.66 gb/s, 0 ulp | 1.69 gb/s, 0 ulp | 0.13 gb/s, 0 ulp |
| **bf16c** | | | |
| nk_dot_bf16c_serial | 1.00 gb/s, 0.1 ulp | 2.29 gb/s, 1.7 ulp | 0.08 gb/s, 7.9 ulp |
| nk_vdot_bf16c_serial | 0.581 gb/s, 0.1 ulp | 0.919 gb/s, 2.9 ulp | 0.30 gb/s, 11.2 ulp |
| **f16c** | | | |
| nk_dot_f16c_serial | 1.39 gb/s, 12.6 ulp | 0.759 gb/s, 22.2 ulp | 0.23 gb/s, 34 ulp |
| nk_vdot_f16c_serial | 1.11 gb/s, 14.4 ulp | 0.828 gb/s, 41.8 ulp | 0.02 gb/s, 34.2 ulp |
| **f64** | | | |
| nk_dot_f64_serial | 1.53 gb/s, 3.1 ulp | 1.72 gb/s, 2.5 ulp | 0.20 gb/s, 3.3 ulp |
| nk_dot_f64_v128relaxed | 2.62 gb/s, 3.2 ulp | 2.11 gb/s, 3.6 ulp | 0.28 gb/s, 3.8 ulp |
| **f32** | | | |
| nk_dot_f32_serial | 1.95 gb/s, 0 ulp | 1.89 gb/s, 0 ulp | 0.28 gb/s, 0 ulp |
| nk_dot_f32_v128relaxed | 0.083 gb/s, 0 ulp | 1.61 gb/s, 0 ulp | 1.37 gb/s, 0 ulp |
| **bf16** | | | |
| nk_dot_bf16_serial | 2.90 gb/s, 0 ulp | 2.27 gb/s, 0.5 ulp | 0.22 gb/s, 5.2 ulp |
| nk_dot_bf16_v128relaxed | 0.521 gb/s, 0 ulp | 2.30 gb/s, 0.3 ulp | 0.30 gb/s, 2.4 ulp |
| **f16** | | | |
| nk_dot_f16_serial | 0.648 gb/s, 13.5 ulp | 0.712 gb/s, 32 ulp | 0.08 gb/s, 59.7 ulp |
| nk_dot_f16_v128relaxed | 1.58 gb/s, 7.0 ulp | 1.05 gb/s, 30.8 ulp | 0.09 gb/s, 65.1 ulp |
| **e5m2** | | | |
| nk_dot_e5m2_serial | 1.27 gb/s, 0 ulp | 0.679 gb/s, 0 ulp | 0.10 gb/s, 0 ulp |
| nk_dot_e5m2_v128relaxed | 0.970 gb/s, 0 ulp | 0.955 gb/s, 0 ulp | 0.17 gb/s, 0 ulp |
| **e4m3** | | | |
| nk_dot_e4m3_serial | 0.312 gb/s, 0 ulp | 0.342 gb/s, 0 ulp | 0.12 gb/s, 0 ulp |
| nk_dot_e4m3_v128relaxed | 1.05 gb/s, 0 ulp | 0.721 gb/s, 0 ulp | 0.30 gb/s, 0 ulp |
| **e3m2** | | | |
| nk_dot_e3m2_serial | 0.565 gb/s, 0 ulp | 0.552 gb/s, 0 ulp | 0.06 gb/s, 0 ulp |
| nk_dot_e3m2_v128relaxed | 0.670 gb/s, 0 ulp | 2.91 gb/s, 0 ulp | 0.24 gb/s, 0 ulp |
| **e2m3** | | | |
| nk_dot_e2m3_serial | 0.584 gb/s, 0 ulp | 0.661 gb/s, 0 ulp | 0.07 gb/s, 0 ulp |
| nk_dot_e2m3_v128relaxed | 2.69 gb/s, 0 ulp | 0.131 gb/s, 0 ulp | 0.09 gb/s, 0 ulp |
| **i8** | | | |
| nk_dot_i8_serial | 1.17 gb/s | 1.17 gb/s | 0.29 gb/s |
| nk_dot_i8_v128relaxed | 1.71 gb/s | 0.896 gb/s | 0.24 gb/s |
| **u8** | | | |
| nk_dot_u8_serial | 1.16 gb/s | 0.658 gb/s | 0.30 gb/s |
| nk_dot_u8_v128relaxed | 0.873 gb/s | 0.997 gb/s | 0.15 gb/s |
| **i4** | | | |
| nk_dot_i4_serial | 0.217 gb/s | 0.226 gb/s | 0.28 gb/s |
| nk_dot_i4_v128relaxed | 1.53 gb/s | 2.87 gb/s | 0.24 gb/s |
| **u4** | | | |
| nk_dot_u4_serial | 0.303 gb/s | 0.250 gb/s | 0.003 gb/s |
| nk_dot_u4_v128relaxed | 0.126 gb/s | 2.70 gb/s | 0.08 gb/s |
| **u1** | | | |
| nk_dot_u1_serial | 1.95 gb/s | 1.53 gb/s | 0.09 gb/s |
| nk_dot_u1_v128relaxed | 0.548 gb/s | 1.88 gb/s | 0.13 gb/s |

Apple M5

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64c** | | | |
| nk_dot_f64c_serial | 8.02 gb/s, 5 ulp | 7.30 gb/s, 3 ulp | 7.25 gb/s, 9.7 ulp |
| nk_vdot_f64c_serial | 8.29 gb/s, 4.2 ulp | 7.53 gb/s, 3.3 ulp | 7.38 gb/s, 3.3 ulp |
| nk_dot_f64c_neon | 23.7 gb/s, 0 ulp | 21.6 gb/s, 0 ulp | 21.3 gb/s, 0 ulp |
| nk_vdot_f64c_neon | 23.6 gb/s, 0 ulp | 21.8 gb/s, 0 ulp | 20.9 gb/s, 0 ulp |
| **f32c** | | | |
| nk_dot_f32c_serial | 27.8 gb/s, 0 ulp | 24.6 gb/s, 0 ulp | 23.2 gb/s, 0 ulp |
| nk_vdot_f32c_serial | 27.2 gb/s, 0 ulp | 24.0 gb/s, 0 ulp | 22.6 gb/s, 0 ulp |
| nk_dot_f32c_neon | 22.8 gb/s, 0 ulp | 18.2 gb/s, 0 ulp | 16.9 gb/s, 0 ulp |
| nk_vdot_f32c_neon | 22.7 gb/s, 0 ulp | 17.5 gb/s, 0 ulp | 16.7 gb/s, 0 ulp |
| **bf16c** | | | |
| nk_dot_bf16c_serial | 15.6 gb/s, 0.2 ulp | 12.5 gb/s, 2.8 ulp | 12.5 gb/s, 15.8 ulp |
| nk_vdot_bf16c_serial | 15.9 gb/s, 0.2 ulp | 12.9 gb/s, 2.6 ulp | 11.7 gb/s, 11.4 ulp |
| nk_dot_bf16c_neonbfdot | 26.3 gb/s, 0.1 ulp | 18.5 gb/s, 2 ulp | 17.6 gb/s, 8.8 ulp |
| nk_vdot_bf16c_neonbfdot | 26.5 gb/s, 0.1 ulp | 18.2 gb/s, 1.8 ulp | 17.3 gb/s, 8.8 ulp |
| **f16c** | | | |
| nk_dot_f16c_serial | 15.8 gb/s, 20.8 ulp | 13.0 gb/s, 64.1 ulp | 12.3 gb/s, 73.1 ulp |
| nk_vdot_f16c_serial | 15.8 gb/s, 24.8 ulp | 13.0 gb/s, 31.9 ulp | 12.3 gb/s, 137 ulp |
| nk_dot_f16c_neonhalf | 26.1 gb/s, 3.0 ulp | 18.4 gb/s, 6.5 ulp | 16.8 gb/s, 20.5 ulp |
| nk_vdot_f16c_neonhalf | 26.1 gb/s, 34.9 ulp | 18.5 gb/s, 40.7 ulp | 17.0 gb/s, 73.1 ulp |
| nk_dot_f16c_neonfhm | 25.3 gb/s, 3.0 ulp | 17.1 gb/s, 6.5 ulp | 15.9 gb/s, 20.5 ulp |
| nk_vdot_f16c_neonfhm | 25.0 gb/s, 31.4 ulp | 17.0 gb/s, 38.6 ulp | 15.8 gb/s, 67.6 ulp |
| **f64** | | | |
| nk_dot_f64_serial | 8.11 gb/s, 2.4 ulp | 8.13 gb/s, 175 ulp | 8.09 gb/s, 2.7 ulp |
| nk_dot_f64_neon | 44.2 gb/s, 0 ulp | 42.3 gb/s, 0 ulp | 38.4 gb/s, 0 ulp |
| **f32** | | | |
| nk_dot_f32_serial | 23.3 gb/s, 0 ulp | 15.8 gb/s, 0 ulp | 14.6 gb/s, 0 ulp |
| nk_dot_f32_neon | 46.4 gb/s, 0 ulp | 38.0 gb/s, 0 ulp | 34.8 gb/s, 0 ulp |
| **bf16** | | | |
| nk_dot_bf16_serial | 12.4 gb/s, 0 ulp | 8.59 gb/s, 0.9 ulp | 7.36 gb/s, 6 ulp |
| nk_dot_bf16_neon | 39.0 gb/s, 3.7 ulp | 27.2 gb/s, 3.7 ulp | 19.9 gb/s, 3.7 ulp |
| nk_dot_bf16_neonbfdot | 70.8 gb/s, 0 ulp | 60.8 gb/s, 0.6 ulp | 47.8 gb/s, 4.5 ulp |
| **f16** | | | |
| nk_dot_f16_serial | 12.0 gb/s, 19 ulp | 8.33 gb/s, 31.1 ulp | 7.11 gb/s, 57.8 ulp |
| nk_dot_f16_neon | 35.7 gb/s, 33.4 ulp | 25.8 gb/s, 37.4 ulp | 21.3 gb/s, 23.1 ulp |
| nk_dot_f16_neonfhm | 48.7 gb/s, 14.9 ulp | 27.5 gb/s, 26.7 ulp | 18.8 gb/s, 39.9 ulp |
| **e5m2** | | | |
| nk_dot_e5m2_serial | 3.80 gb/s, 0 ulp | 3.41 gb/s, 0 ulp | 3.41 gb/s, 0 ulp |
| nk_dot_e5m2_neon | 19.0 gb/s, 0 ulp | 13.2 gb/s, 0 ulp | 10.5 gb/s, 0 ulp |
| nk_dot_e5m2_neonfhm | 25.6 gb/s, 0 ulp | 15.3 gb/s, 0 ulp | 9.55 gb/s, 0 ulp |
| nk_dot_e5m2_neonbfdot | 3.65 gb/s, 0 ulp | 3.82 gb/s, 0 ulp | 3.68 gb/s, 0 ulp |
| **e4m3** | | | |
| nk_dot_e4m3_serial | 1.74 gb/s, 0 ulp | 1.72 gb/s, 0 ulp | 1.71 gb/s, 0 ulp |
| nk_dot_e4m3_neon | 4.44 gb/s, 0 ulp | 4.51 gb/s, 0 ulp | 4.57 gb/s, 0 ulp |
| nk_dot_e4m3_neonfhm | 10.1 gb/s, 0 ulp | 8.51 gb/s, 0 ulp | 7.96 gb/s, 0 ulp |
| nk_dot_e4m3_neonbfdot | 3.59 gb/s, 0 ulp | 3.68 gb/s, 0 ulp | 3.64 gb/s, 0 ulp |
| **e3m2** | | | |
| nk_dot_e3m2_serial | 2.51 gb/s, 0 ulp | 2.33 gb/s, 0 ulp | 2.24 gb/s, 0 ulp |
| nk_dot_e3m2_neonsdot | 20.5 gb/s, 0 ulp | 20.7 gb/s, 0 ulp | 20.1 gb/s, 0 ulp |
| **e2m3** | | | |
| nk_dot_e2m3_serial | 2.54 gb/s, 0 ulp | 2.27 gb/s, 0 ulp | 2.29 gb/s, 0 ulp |
| nk_dot_e2m3_neonsdot | 47.3 gb/s, 0 ulp | 47.5 gb/s, 0 ulp | 43.4 gb/s, 0 ulp |
| **i8** | | | |
| nk_dot_i8_serial | 115 gb/s | 102 gb/s | 92.3 gb/s |
| nk_dot_i8_neonsdot | 92.8 gb/s | 87.4 gb/s | 59.9 gb/s |
| **u8** | | | |
| nk_dot_u8_serial | 110 gb/s | 99.2 gb/s | 94.9 gb/s |
| nk_dot_u8_neonsdot | 92.5 gb/s | 86.6 gb/s | 59.5 gb/s |
| **i4** | | | |
| nk_dot_i4_serial | 23.0 gb/s | 24.4 gb/s | 24.2 gb/s |
| nk_dot_i4_neonsdot | 58.2 gb/s | 44.7 gb/s | 30.4 gb/s |
| **u4** | | | |
| nk_dot_u4_serial | 25.3 gb/s | 27.2 gb/s | 26.9 gb/s |
| nk_dot_u4_neonsdot | 67.3 gb/s | 47.4 gb/s | 29.4 gb/s |
| **u1** | | | |
| nk_dot_u1_serial | 7.01 gb/s | 7.63 gb/s | 7.19 gb/s |
| nk_dot_u1_neon | 33.3 gb/s | 64.6 gb/s | 88.0 gb/s |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| **f64c** | | | |
| nk_dot_f64c_serial | 4.17 gb/s, 3.8 ulp | 6.03 gb/s, 3.9 ulp | 6.37 gb/s, 3.2 ulp |
| nk_vdot_f64c_serial | 6.00 gb/s, 3.8 ulp | 6.55 gb/s, 3.4 ulp | 6.83 gb/s, 15.1 ulp |
| nk_dot_f64c_v128relaxed | 46.7 gb/s, 26 ulp | 38.2 gb/s, 42 ulp | 40.5 gb/s, 88 ulp |
| nk_vdot_f64c_v128relaxed | 46.1 gb/s, 22.8 ulp | 39.8 gb/s, 37.3 ulp | 39.9 gb/s, 43.6 ulp |
| **f32c** | | | |
| nk_dot_f32c_serial | 20.9 gb/s, 0 ulp | 21.3 gb/s, 0 ulp | 22.5 gb/s, 0 ulp |
| nk_vdot_f32c_serial | 19.9 gb/s, 0 ulp | 21.5 gb/s, 0 ulp | 22.4 gb/s, 0 ulp |
| nk_dot_f32c_v128relaxed | 22.4 gb/s, 0 ulp | 20.3 gb/s, 0 ulp | 19.8 gb/s, 0 ulp |
| nk_vdot_f32c_v128relaxed | 21.9 gb/s, 0 ulp | 20.3 gb/s, 0 ulp | 19.9 gb/s, 0 ulp |
| **bf16c** | | | |
| nk_dot_bf16c_serial | 10.3 gb/s, 0.1 ulp | 11.6 gb/s, 2.5 ulp | 11.6 gb/s, 10 ulp |
| nk_vdot_bf16c_serial | 10.7 gb/s, 0.2 ulp | 11.7 gb/s, 2.1 ulp | 11.6 gb/s, 11.4 ulp |
| **f16c** | | | |
| nk_dot_f16c_serial | 3.76 gb/s, 13 ulp | 3.83 gb/s, 20 ulp | 3.86 gb/s, 90 ulp |
| nk_vdot_f16c_serial | 3.81 gb/s, 13.9 ulp | 3.89 gb/s, 35.5 ulp | 3.85 gb/s, 42.4 ulp |
| **f64** | | | |
| nk_dot_f64_serial | 7.26 gb/s, 2.4 ulp | 7.45 gb/s, 2.6 ulp | 7.96 gb/s, 2.2 ulp |
| nk_dot_f64_v128relaxed | 38.7 gb/s, 2.6 ulp | 42.0 gb/s, 3.2 ulp | 43.9 gb/s, 2.6 ulp |
| **f32** | | | |
| nk_dot_f32_serial | 19.0 gb/s, 16 ulp | 14.6 gb/s, 69 ulp | 14.0 gb/s, 104 ulp |
| nk_dot_f32_v128relaxed | 20.4 gb/s, 0 ulp | 18.9 gb/s, 0 ulp | 18.7 gb/s, 0 ulp |
| **bf16** | | | |
| nk_dot_bf16_serial | 9.53 gb/s, 0 ulp | 7.42 gb/s, 0.6 ulp | 7.20 gb/s, 5.9 ulp |
| nk_dot_bf16_v128relaxed | 41.9 gb/s, 0 ulp | 28.3 gb/s, 0.4 ulp | 21.5 gb/s, 3.7 ulp |
| **f16** | | | |
| nk_dot_f16_serial | 3.31 gb/s, 16 ulp | 3.63 gb/s, 26 ulp | 3.66 gb/s, 53 ulp |
| nk_dot_f16_v128relaxed | 11.4 gb/s, 9.0 ulp | 11.2 gb/s, 23 ulp | 12.0 gb/s, 39 ulp |
| **e5m2** | | | |
| nk_dot_e5m2_serial | 3.02 gb/s, 0 ulp | 2.95 gb/s, 0 ulp | 3.16 gb/s, 0 ulp |
| nk_dot_e5m2_v128relaxed | 3.47 gb/s, 0 ulp | 3.45 gb/s, 0 ulp | 3.48 gb/s, 0 ulp |
| **e4m3** | | | |
| nk_dot_e4m3_serial | 0.978 gb/s, 0 ulp | 0.893 gb/s, 0 ulp | 0.936 gb/s, 0 ulp |
| nk_dot_e4m3_v128relaxed | 2.78 gb/s, 0 ulp | 2.75 gb/s, 0 ulp | 2.78 gb/s, 0 ulp |
| **e3m2** | | | |
| nk_dot_e3m2_serial | 3.13 gb/s, 0 ulp | 2.95 gb/s, 0 ulp | 3.16 gb/s, 0 ulp |
| nk_dot_e3m2_v128relaxed | 12.1 gb/s, 0 ulp | 11.9 gb/s, 0 ulp | 12.6 gb/s, 0 ulp |
| **e2m3** | | | |
| nk_dot_e2m3_serial | 2.99 gb/s, 0 ulp | 3.00 gb/s, 0 ulp | 3.17 gb/s, 0 ulp |
| nk_dot_e2m3_v128relaxed | 20.4 gb/s, 0 ulp | 20.6 gb/s, 0 ulp | 21.7 gb/s, 0 ulp |
| **i8** | | | |
| nk_dot_i8_serial | 22.2 gb/s | 19.6 gb/s | 17.8 gb/s |
| nk_dot_i8_v128relaxed | 42.0 gb/s | 49.0 gb/s | 49.7 gb/s |
| **u8** | | | |
| nk_dot_u8_serial | 23.0 gb/s | 19.9 gb/s | 17.8 gb/s |
| nk_dot_u8_v128relaxed | 29.3 gb/s | 33.0 gb/s | 35.1 gb/s |
| **i4** | | | |
| nk_dot_i4_serial | 0.990 gb/s | 0.923 gb/s | 0.985 gb/s |
| nk_dot_i4_v128relaxed | 15.2 gb/s | 17.9 gb/s | 19.2 gb/s |
| **u4** | | | |
| nk_dot_u4_serial | 0.992 gb/s | 0.933 gb/s | 0.988 gb/s |
| nk_dot_u4_v128relaxed | 30.2 gb/s | 32.1 gb/s | 33.8 gb/s |
| **u1** | | | |
| nk_dot_u1_serial | 5.26 gb/s | 5.80 gb/s | 6.48 gb/s |
| nk_dot_u1_v128relaxed | 21.2 gb/s | 47.4 gb/s | 67.3 gb/s |