Batched Dot Products in NumKong

April 20, 2026

NumKong implements batched GEMM computing C = A × Bᵀ (packed) and C = A × Aᵀ (symmetric). B is pre-packed once and reused across queries. This is the foundation for the spatials, sets, and maxsim modules.

Packed dot product computes the full cross-product matrix:

C_{ij} = \sum_{k} A_{ik} \cdot B_{jk}

Symmetric dot product uses the same matrix for both operands:

C_{ij} = \sum_{k} A_{ik} \cdot A_{jk}

Reformulating as Python pseudocode:

import numpy as np

def dots_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a @ b.T

def dots_symmetric(a: np.ndarray) -> np.ndarray:
    return a @ a.T

Input & Output Types

| Input Type | Output Type | Description |
|---|---|---|
| f64 | f64 | 64-bit IEEE 754 double precision |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f32 | 16-bit IEEE 754 half precision, widened output |
| bf16 | f32 | 16-bit brain float, widened output |
| e4m3 | f32 | 8-bit Float8: 4 exponent, 3 mantissa bits |
| e5m2 | f32 | 8-bit Float8: 5 exponent, 2 mantissa bits |
| e2m3 | f32 | 8-bit MX format: 2 exponent, 3 mantissa bits |
| e3m2 | f32 | 8-bit MX format: 3 exponent, 2 mantissa bits |
| i8 | i32 | 8-bit signed integers |
| u8 | u32 | 8-bit unsigned integers |
| i4 | i32 | 4-bit signed integers, packed nibble pairs |
| u4 | u32 | 4-bit unsigned integers, packed nibble pairs |
| u1 | u32 | 1-bit binary packed octets, popcount of AND |

Optimizations

B Matrix Pre-Packing with Stride Breaking

nk_dots_pack_f32_serial, nk_dots_pack_f32_haswell, nk_dots_pack_bf16_haswell, nk_dots_pack_i8_haswell pre-pack the B matrix into a contiguous buffer optimized for streaming access during GEMM. Power-of-2 stride detection — when stride_bytes & (stride_bytes - 1) == 0 — adds depth_simd_dimensions padding to avoid cache associativity conflicts on set-associative caches. Type conversion is amortized into the pack step: BFloat16 → Float32, Float16 → Float32, and Float8 → Float32 conversions happen once during packing instead of per-row during GEMM. A 64-byte header stores metadata: column count, depth dimensions, and padded depth. Row grouping (group_size=16) zero-pads partial groups at matrix edges for uniform SIMD processing.
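The packing step can be sketched in NumPy terms. This is a hedged illustration only: `pack_b` is a hypothetical name, the padding amount mirrors the stride-breaking rule described above, and the 64-byte header and row grouping are omitted.

```python
import numpy as np

def pack_b(b: np.ndarray, depth_simd_dims: int = 8) -> np.ndarray:
    """Illustrative sketch of B pre-packing, not NumKong's actual layout."""
    n, k = b.shape
    stride_bytes = b.strides[0]
    # Power-of-2 row strides alias to the same cache sets on set-associative
    # caches; pad the depth so consecutive rows land in different sets.
    pad = depth_simd_dims if stride_bytes & (stride_bytes - 1) == 0 else 0
    packed = np.zeros((n, k + pad), dtype=np.float32)
    packed[:, :k] = b  # narrow-type -> f32 conversion amortized into the pack
    return packed

b = np.arange(32, dtype=np.float16).reshape(4, 8)  # 16-byte rows: power of 2
p = pack_b(b)  # padded to (4, 16), converted to float32 once
```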

Tiled Register Accumulation

nk_dots_packed_f32_haswell, nk_dots_packed_f32_skylake, nk_dots_packed_f32_neon use a 4×4 tile kernel with 16 accumulators to handle ~80% of the work. A 1×8 tile kernel with 8 accumulators handles edge rows that don't fill a full 4-row tile. No depth blocking is used — the kernel relies on hardware prefetch for streaming A/B access patterns. Row loads are amortized across multiple dot products: each A row is loaded once and multiplied against 4 B columns per tile pass.
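The 4×4 tiling scheme can be restated as a minimal NumPy sketch (`dots_packed_tiled` is a hypothetical name; the 1×8 edge kernel is omitted and dimensions are assumed divisible by 4):

```python
import numpy as np

def dots_packed_tiled(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """4x4 register-tile sketch: b holds rows of B^T, i.e. B columns."""
    m, k = a.shape
    n, _ = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, 4):
        for j in range(0, n, 4):
            acc = np.zeros((4, 4), dtype=a.dtype)  # 16 scalar accumulators
            for d in range(k):  # no depth blocking: streaming reads
                # each A row element is reused against 4 B columns
                acc += np.outer(a[i:i + 4, d], b[j:j + 4, d])
            c[i:i + 4, j:j + 4] = acc
    return c

rng = np.random.default_rng(0)
a = rng.random((8, 16), dtype=np.float32)
b = rng.random((8, 16), dtype=np.float32)
c = dots_packed_tiled(a, b)
```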

AMX 2D Tile Engine

The Sapphire Rapids AMX backends for bf16, mini-floats, i8, and u8 use Intel AMX's 8 tile registers (TMM0–TMM7), each 1 KB (16 rows × 64 bytes). Convention: TMM0–1 hold A tiles, TMM2–3 hold B tiles, TMM4–7 are C accumulators — giving a 2×2 output tile (32×32 Float32 results) per tile pass. TDPBF16PS tmm_c, tmm_a, tmm_b performs a 16×16 outer product with 32 BFloat16 multiply-adds per cell (16×16×32 = 8,192 MACs per instruction). Each A row contains 16 BFloat16 pairs interleaved as [a₀, a₁, a₀, a₁, ...] and B columns as [b₀, b₁, b₀, b₁, ...] — the hardware consumes two BFloat16 elements per slot, accumulating into Float32. TDPBSSD tmm_c, tmm_a, tmm_b does the same for Int8: 64 bytes per row gives 16×16×64 = 16,384 Int8 MACs per instruction. Int8 data is quad-interleaved: [a₀, a₁, a₂, a₃, a₀, a₁, a₂, a₃, ...] so the hardware can consume four Int8 elements per 32-bit slot. Tile configuration via LDTILECFG sets row counts and column byte-widths per tile — allows undersized tiles at matrix edges without masking. Morton Z-curve ordering for tile traversal improves cache reuse when both A and B exceed L2. This eliminates the explicit M×N×K loop nesting and register file pressure of vector ISAs — the entire dot-product reduction happens inside the tile instruction. FP8 inputs on Sapphire AMX go through an on-the-fly E4M3/E5M2 → BF16 pack via the Ice Lake VPERMI2W LUT helpers — port-5-bound but the simplest correct route to feed TDPBF16PS tiles. Granite Rapids adds TDPFP16PS (same tile shape, FP16 operands); the E5M2 variant widens inputs with a single VPUNPCK*BW against zero into FP16 tiles at pack time and then reuses the native FP16 compute loop — keeps the intermediate at FP16 precision instead of truncating to BF16 like the Sapphire path.
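The pair-interleaved operand layout can be emulated in NumPy to confirm it reproduces a plain matrix product. `pack_b_pairs` and `tile_dot` are illustrative helpers, with float32 standing in for BFloat16:

```python
import numpy as np

def pack_b_pairs(b: np.ndarray) -> np.ndarray:
    """Each 32-bit slot of a B tile row holds two consecutive depth
    elements of one column: [b0, b1, b0, b1, ...]. k must be even."""
    k, n = b.shape
    return b.T.reshape(n, k // 2, 2)

def tile_dot(a: np.ndarray, b_packed: np.ndarray) -> np.ndarray:
    """Per-slot 2-element multiply-add, accumulated across slots —
    the arithmetic TDPBF16PS performs inside one tile instruction."""
    m, k = a.shape
    a_pairs = a.reshape(m, k // 2, 2)
    return np.einsum('mkp,nkp->mn', a_pairs, b_packed)

a = np.arange(16 * 32, dtype=np.float32).reshape(16, 32)
b = np.ones((32, 16), dtype=np.float32)
c = tile_dot(a, pack_b_pairs(b))
```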

SME Outer-Product Streaming

nk_dots_packed_f32_smef64, nk_dots_packed_bf16_sme, nk_dots_packed_f64_smef64 use Arm's SME ZA tile array (up to 4 named tiles ZA0–ZA3 in 32-bit mode, each SVL×SVL elements). FMOPA za, pn/m, pm/m, zn.s, zm.s computes a full SVL×SVL rank-1 update in one instruction — one row of A times one row of B, accumulated into ZA. ZA0 time-shares between data staging and accumulation: A rows are loaded horizontally into ZA0 (st1w {za0h.s[ws]}, ...), then read vertically (svread_ver_za32_f32_m) to produce transposed column vectors for B. This avoids explicit transpose operations — the tile's 2D addressing provides free transposition. ZA1–ZA3 serve as accumulators while ZA0 stages the next data. A 3-column-tile fast path handles B column count ≤ 3×SVL using ZA1–ZA3 as three separate accumulator tiles, avoiding spill/reload cycles. For wider B, the kernel falls back to multi-pass accumulation with ZA store/load between passes. BFMOPA for BFloat16 uses the same outer-product pattern but with BFloat16 → Float32 widening — 2× the depth per instruction vs Float32 FMOPA. SMSTART/SMSTOP streaming mode transitions cost ~50–100 cycles, amortized across the full M×N output. Ozaki splitting for Float64 (nk_dots_packed_f64_smef64) splits each Float64 into 3 mantissa-masked Float32 slices, computes 6 FMOPAs (all cross-products of 3×2 slices) into 3 ZA accumulators, then reconstructs the Float64 result — achieving Float64 precision using Float32 tile hardware.
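The splitting idea can be sketched in NumPy. This is a hedged illustration: `split3` and `dot_via_slices` are hypothetical names, the products are accumulated in float64 for clarity (the SME kernel instead accumulates mantissa-masked float32 products in ZA tiles), and the six retained cross-products shown are one common choice.

```python
import numpy as np

def split3(x: np.ndarray):
    """Slice each float64 into three float32 parts: hi + mid + lo == x
    exactly for well-scaled inputs (3 x 24 mantissa bits > 53)."""
    hi = x.astype(np.float32)
    rem = x - hi.astype(np.float64)
    mid = rem.astype(np.float32)
    lo = (rem - mid.astype(np.float64)).astype(np.float32)
    return hi, mid, lo

def dot_via_slices(a: np.ndarray, b: np.ndarray) -> float:
    a0, a1, a2 = split3(a)
    b0, b1, b2 = split3(b)
    # six significant cross-products; the lowest-order terms
    # (a1*b2, a2*b1, a2*b2) fall below float64 precision and are dropped
    pairs = [(a0, b0), (a0, b1), (a1, b0), (a0, b2), (a2, b0), (a1, b1)]
    return float(sum(np.dot(x.astype(np.float64), y.astype(np.float64))
                     for x, y in pairs))

rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = rng.standard_normal(1000)
res = dot_via_slices(a, b)
```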

Compensated Integer GEMM

nk_dots_packed_i8_icelake, nk_dots_packed_u8_icelake, nk_dots_packed_i8_haswell work around the unsigned×signed operand requirement of integer dot-product instructions. VPDPBUSD (Ice Lake+) computes UInt8×Int8 dot products accumulating directly to Int32 — but requires one unsigned and one signed operand. For signed×signed (Int8×Int8), one operand is XOR'd with 0x80 to shift it into unsigned range, introducing a bias of 128 · Σₖ bₖ per output element. Rather than computing the bias correction per-element inside the inner loop (which would require extra registers for running sums), the B column sums Σₖ bₖ are pre-computed once during packing and stored in the packed buffer metadata. The inner loop only needs the VPDPBUSD accumulator — the bias subtraction is a single post-loop correction: result[i][j] -= 128 * b_column_sum[j]. This reduces per-accumulator state from 2 registers (dot + running sum) to 1 register (dot only), freeing registers for more accumulators in the 4×4 tile. The Haswell fallback uses VPMADDUBSW (UInt8 × Int8 → Int16) + VPMADDWD (Int16 → Int32), a two-instruction chain with Int16 intermediate overflow risk — quantization ranges must be tighter ([-79, 79] vs [-127, 127]).
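The XOR-0x80 bias identity is easy to verify numerically: shifting one Int8 operand into UInt8 range adds exactly 128 · Σₖ bₖ to each dot product, which one post-loop subtraction removes. A small NumPy check (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=64, dtype=np.int8)
b = rng.integers(-128, 128, size=64, dtype=np.int8)

# a ^ 0x80 reinterpreted as unsigned equals a + 128
a_shifted = (a.astype(np.int16) + 128).astype(np.uint8)

# what a VPDPBUSD-style unsigned x signed dot product would accumulate
biased = int(a_shifted.astype(np.int32) @ b.astype(np.int32))

# the column sum is pre-computed once at pack time
b_column_sum = int(b.astype(np.int32).sum())

exact = int(a.astype(np.int32) @ b.astype(np.int32))
# single post-loop correction recovers the signed x signed result
assert biased - 128 * b_column_sum == exact
```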

4-Way Finalizer Amortization

All packed and symmetric kernels across the dots, spatials, and sets modules share a finalizer-based design. The 4×4 tile accumulates 16 dot products in registers, then stores results 4-wide via nk_b128_vec_t — a union of f32[4], i32[4], u32[4] fitting a 128-bit register. A finalizer function pointer processes 4 results simultaneously, amortizing horizontal reductions and type conversions:

// 4-wide finalizer signature
void finalizer(nk_b128_vec_t dots,          // 4 dot products
               nk_f32_t query_norm,         // precomputed query squared-norm
               nk_b128_vec_t target_norms,  // 4 target squared-norms
               nk_b128_vec_t *results)      // 4 output distances

// Angular: 4 divisions + 4 subtractions in one call
results->f32s[i] = 1 - dots.f32s[i] / sqrt(query_norm * target_norms.f32s[i])

// Euclidean: 4 sqrt(a² + b² - 2ab) in one call
results->f32s[i] = sqrt(query_norm + target_norms.f32s[i] - 2 * dots.f32s[i])

The 4×4 tile emits 4 rows of 4 results each — the finalizer is called 4 times per tile, once per query row. For the 1×8 edge tile, two finalizer calls handle 8 results. This design decouples the GEMM loop from the distance metric: the same tiled accumulation code serves dots, spatials, and sets by swapping only the finalizer function pointer.
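The finalizer contract can be restated in NumPy terms — `finalize_angular` and `finalize_euclidean` are illustrative stand-ins for the C function pointers, each consuming 4 dot products at once:

```python
import numpy as np

def finalize_angular(dots, query_norm, target_norms):
    """4 divisions + 4 subtractions per call: 1 - cos(q, t)."""
    return 1.0 - dots / np.sqrt(query_norm * target_norms)

def finalize_euclidean(dots, query_norm, target_norms):
    """sqrt(|q|^2 + |t|^2 - 2 q.t) = |q - t| per lane."""
    return np.sqrt(query_norm + target_norms - 2.0 * dots)

q = np.array([1.0, 2.0, 3.0])
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0]])
dots = t @ q                        # 4 dot products from one tile row
qn, tn = q @ q, np.sum(t * t, axis=1)  # squared norms, precomputed
dist = finalize_euclidean(dots, qn, tn)
```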

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench, the internal tools included with NumKong that measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_MATRIX_HEIGHT, NK_MATRIX_WIDTH, and NK_MATRIX_DEPTH environment variables, all set to the same value to produce products of two square matrices. Columns show throughput for 256³, 1024³, and 4096³ matrix products. Throughput is measured in GSO/s — Giga Scalar Operations per Second — with ops = 2 · M · N · K arithmetic complexity for an M × K by K × N product. Accuracy is reported as mean ULP (units in last place) unless noted otherwise — the average number of representable floating-point values between the result and the exact answer. Rows marked 🧩 use external BLAS or MKL baselines rather than NumKong kernels. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
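For reference, the conversion from wall-clock time to a table entry (`gso_per_second` is a hypothetical helper, not part of nk_bench):

```python
def gso_per_second(m: int, n: int, k: int, seconds: float) -> float:
    """2*M*N*K scalar ops (one multiply + one add per MAC), in giga-ops/s."""
    return 2 * m * n * k / seconds / 1e9

# e.g. a 4096^3 product finishing in 1.86 s sustains about 73.9 GSO/s
rate = gso_per_second(4096, 4096, 4096, 1.86)
```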

Intel Sapphire Rapids

Native

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| f64 |  |  |  |
| dots_packed_f64_with_blas 🧩 | 58.7 gso/s, 16 ulp | 73.1 gso/s, 58 ulp | 73.8 gso/s, 56.2 ulp |
| dots_packed_f64_with_mkl 🧩 | 59.9 gso/s, 16 ulp | 73.7 gso/s, 58 ulp | 73.3 gso/s, 56.2 ulp |
| dots_symmetric_f64_with_blas 🧩 | 50.8 gso/s, 13 ulp | 70.4 gso/s, 30 ulp | 74 gso/s, 50.8 ulp |
| nk_dots_packed_f64_serial | 0.850 gso/s, 2 ulp | 0.846 gso/s, 4.6 ulp | 0.862 gso/s, 5.9 ulp |
| nk_dots_symmetric_f64_serial | 0.484 gso/s, 2 ulp | 0.472 gso/s, 2.9 ulp | 0.471 gso/s, 3.9 ulp |
| nk_dots_packed_f64_haswell | 5.93 gso/s, 0 ulp | 6.11 gso/s, 0 ulp | 6.16 gso/s, 0 ulp |
| nk_dots_symmetric_f64_haswell | 5.68 gso/s, 0 ulp | 5.99 gso/s, 0 ulp | 5.86 gso/s, 0 ulp |
| nk_dots_packed_f64_skylake | 8.26 gso/s, 0 ulp | 9.27 gso/s, 0 ulp | 9.06 gso/s, 0 ulp |
| nk_dots_symmetric_f64_skylake | 7.53 gso/s, 0 ulp | 8.63 gso/s, 0 ulp | 8.58 gso/s, 0 ulp |
| f32 |  |  |  |
| dots_packed_f32_with_blas 🧩 | 113 gso/s, 18 ulp | 139 gso/s, 30 ulp | 147 gso/s, 267 ulp |
| dots_symmetric_f32_with_blas 🧩 | 94.5 gso/s, 23 ulp | 126 gso/s, 39 ulp | 146 gso/s, 260 ulp |
| nk_dots_packed_f32_serial | 9.98 gso/s, 5.3 ulp | 10.1 gso/s, 11.8 ulp | 10.1 gso/s, 14.5 ulp |
| nk_dots_symmetric_f32_serial | 4.96 gso/s, 11.1 ulp | 5.01 gso/s, 13.4 ulp | 5.01 gso/s, 14.1 ulp |
| nk_dots_packed_f32_haswell | 30.4 gso/s, 0 ulp | 32.5 gso/s, 0 ulp | 31.9 gso/s, 0 ulp |
| nk_dots_symmetric_f32_haswell | 15.5 gso/s, 0 ulp | 17.9 gso/s, 0 ulp | 18.4 gso/s, 0 ulp |
| nk_dots_packed_f32_skylake | 35.4 gso/s, 0 ulp | 41.4 gso/s, 0 ulp | 40.0 gso/s, 0 ulp |
| nk_dots_symmetric_f32_skylake | 22.4 gso/s, 0 ulp | 28.2 gso/s, 0 ulp | 28.1 gso/s, 0 ulp |
| bf16 |  |  |  |
| dots_packed_bf16_with_mkl 🧩 | 182 gso/s, 0 ulp | 523 gso/s, 0.7 ulp | 847 gso/s, 5.8 ulp |
| nk_dots_packed_bf16_serial | 1.20 gso/s, 0 ulp | 1.21 gso/s, 0.5 ulp | 1.22 gso/s, 5.4 ulp |
| nk_dots_symmetric_bf16_serial | 1.16 gso/s, 0 ulp | 1.19 gso/s, 0.9 ulp | 1.18 gso/s, 5.4 ulp |
| nk_dots_packed_bf16_haswell | 65.6 gso/s, 0 ulp | 73.3 gso/s, 0.3 ulp | 76.8 gso/s, 4.4 ulp |
| nk_dots_symmetric_bf16_haswell | 40.2 gso/s, 0 ulp | 55.6 gso/s, 0.5 ulp | 60.8 gso/s, 4.6 ulp |
| nk_dots_packed_bf16_skylake | 79.8 gso/s, 0 ulp | 92.1 gso/s, 0.3 ulp | 102 gso/s, 3.5 ulp |
| nk_dots_symmetric_bf16_skylake | 57.4 gso/s, 0 ulp | 78.9 gso/s, 0.5 ulp | 82.5 gso/s, 3.5 ulp |
| nk_dots_packed_bf16_genoa | 65.8 gso/s, 0 ulp | 83.2 gso/s, 0.3 ulp | 88.9 gso/s, 3.5 ulp |
| nk_dots_symmetric_bf16_genoa | 52.5 gso/s, 0 ulp | 70.5 gso/s, 0.5 ulp | 76.0 gso/s, 3.5 ulp |
| nk_dots_packed_bf16_sapphireamx | 348 gso/s, 0 ulp | 706 gso/s, 0.7 ulp | 667 gso/s, 5.8 ulp |
| nk_dots_symmetric_bf16_sapphireamx | 84.2 gso/s, 0 ulp | 120 gso/s, 0.5 ulp | 120 gso/s, 5.8 ulp |
| f16 |  |  |  |
| dots_packed_f16_with_mkl 🧩 | 123 gso/s, 17 ulp | 138 gso/s, 31 ulp | 138 gso/s, 39.5 ulp |
| nk_dots_packed_f16_serial | 8.19 gso/s, 14 ulp | 8.21 gso/s, 40 ulp | 8.11 gso/s, 326 ulp |
| nk_dots_symmetric_f16_serial | 4.02 gso/s, 8.9 ulp | 4.04 gso/s, 25 ulp | 4.03 gso/s, 55.6 ulp |
| nk_dots_packed_f16_haswell | 65.1 gso/s, 12 ulp | 74.4 gso/s, 22 ulp | 71.5 gso/s, 374 ulp |
| nk_dots_symmetric_f16_haswell | 34.4 gso/s, 7.7 ulp | 44.0 gso/s, 32 ulp | 46.5 gso/s, 486 ulp |
| nk_dots_packed_f16_skylake | 74.7 gso/s, 7.3 ulp | 99.0 gso/s, 21 ulp | 94.0 gso/s, 138 ulp |
| nk_dots_symmetric_f16_skylake | 40.9 gso/s, 5.9 ulp | 56.8 gso/s, 25 ulp | 58.8 gso/s, 32 ulp |
| e5m2 |  |  |  |
| nk_dots_packed_e5m2_serial | 4.86 gso/s, 0 ulp | 4.75 gso/s, 0 ulp | 4.88 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_serial | 3.97 gso/s, 0 ulp | 4.28 gso/s, 0 ulp | 4.50 gso/s, 0 ulp |
| nk_dots_packed_e5m2_haswell | 29.1 gso/s, 0 ulp | 31.5 gso/s, 0 ulp | 30.6 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_haswell | 15.6 gso/s, 0 ulp | 16.4 gso/s, 0 ulp | 17.0 gso/s, 0 ulp |
| nk_dots_packed_e5m2_skylake | 34.6 gso/s, 0 ulp | 37.9 gso/s, 0 ulp | 38.9 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_skylake | 21.2 gso/s, 0 ulp | 22.7 gso/s, 0 ulp | 22.5 gso/s, 0 ulp |
| nk_dots_packed_e5m2_genoa | 41.7 gso/s, 0 ulp | 48.7 gso/s, 0 ulp | 49.1 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_genoa | 30.0 gso/s, 0 ulp | 33.3 gso/s, 0 ulp | 33.7 gso/s, 0 ulp |
| nk_dots_packed_e5m2_sapphireamx | 254 gso/s, 0 ulp | 407 gso/s, 0 ulp | 419 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_sapphireamx | 50.9 gso/s, 0 ulp | 69.9 gso/s, 0 ulp | 67.4 gso/s, 0 ulp |
| e4m3 |  |  |  |
| nk_dots_packed_e4m3_serial | 0.489 gso/s, 0 ulp | 0.499 gso/s, 0 ulp | 0.489 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_serial | 0.394 gso/s, 0 ulp | 0.390 gso/s, 0 ulp | 0.391 gso/s, 0 ulp |
| nk_dots_packed_e4m3_haswell | 24.3 gso/s, 0 ulp | 26.1 gso/s, 0 ulp | 25.2 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_haswell | 13.5 gso/s, 0 ulp | 14.0 gso/s, 0 ulp | 14.3 gso/s, 0 ulp |
| nk_dots_packed_e4m3_skylake | 31.6 gso/s, 0 ulp | 32.6 gso/s, 0 ulp | 34.0 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_skylake | 17.3 gso/s, 0 ulp | 18.2 gso/s, 0 ulp | 18.6 gso/s, 0 ulp |
| nk_dots_packed_e4m3_genoa | 38.6 gso/s, 0 ulp | 43.8 gso/s, 0 ulp | 43.7 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_genoa | 27.3 gso/s, 0 ulp | 29.4 gso/s, 0 ulp | 29.2 gso/s, 0 ulp |
| nk_dots_packed_e4m3_sapphireamx | 222 gso/s, 0 ulp | 333 gso/s, 0 ulp | 332 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_sapphireamx | 33.1 gso/s, 0 ulp | 36.3 gso/s, 0 ulp | 35.4 gso/s, 0 ulp |
| e3m2 |  |  |  |
| nk_dots_packed_e3m2_serial | 4.97 gso/s, 0 ulp | 4.90 gso/s, 0 ulp | 5.03 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_serial | 3.40 gso/s, 0 ulp | 3.81 gso/s, 0 ulp | 3.88 gso/s, 0 ulp |
| nk_dots_packed_e3m2_haswell | 31.0 gso/s, 0 ulp | 32.2 gso/s, 0 ulp | 33.9 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_haswell | 29.0 gso/s, 0 ulp | 31.7 gso/s, 0 ulp | 31.1 gso/s, 0 ulp |
| nk_dots_packed_e3m2_skylake | 39.3 gso/s, 0 ulp | 43.4 gso/s, 0 ulp | 44.1 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_skylake | 40.0 gso/s, 0 ulp | 46.6 gso/s, 0 ulp | 47.1 gso/s, 0 ulp |
| nk_dots_packed_e3m2_sapphireamx | 263 gso/s, 0 ulp | 471 gso/s, 0 ulp | 471 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_sapphireamx | 62.9 gso/s, 0 ulp | 101 gso/s, 0 ulp | 89.1 gso/s, 0 ulp |
| e2m3 |  |  |  |
| nk_dots_packed_e2m3_serial | 4.98 gso/s, 0 ulp | 4.95 gso/s, 0 ulp | 5.00 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_serial | 3.48 gso/s, 0 ulp | 3.83 gso/s, 0 ulp | 3.85 gso/s, 0 ulp |
| nk_dots_packed_e2m3_haswell | 58.6 gso/s, 0 ulp | 62.5 gso/s, 0 ulp | 65.3 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_haswell | 50.5 gso/s, 0 ulp | 61.2 gso/s, 0 ulp | 64.2 gso/s, 0 ulp |
| nk_dots_packed_e2m3_skylake | 69.8 gso/s, 0 ulp | 81.8 gso/s, 0 ulp | 88.4 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_skylake | 65.5 gso/s, 0 ulp | 83.4 gso/s, 0 ulp | 84.6 gso/s, 0 ulp |
| nk_dots_packed_e2m3_sapphireamx | 419 gso/s, 0 ulp | 1,195 gso/s, 0 ulp | 1,067 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_sapphireamx | 94.5 gso/s, 0 ulp | 213 gso/s, 0 ulp | 184 gso/s, 0 ulp |
| nk_dots_packed_e2m3_alder | 72.9 gso/s, 0 ulp | 78.6 gso/s, 0 ulp | 85.7 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_alder | 61.6 gso/s, 0 ulp | 75.2 gso/s, 0 ulp | 54.9 gso/s, 0 ulp |
| i8 |  |  |  |
| dots_packed_i8u8_with_mkl 🧩 | 250 gso/s | 627 gso/s | 1,670 gso/s |
| nk_dots_packed_i8_serial | 6.44 gso/s | 6.62 gso/s | 7.44 gso/s |
| nk_dots_symmetric_i8_serial | 2.93 gso/s | 2.99 gso/s | 5.83 gso/s |
| nk_dots_packed_i8_haswell | 87.7 gso/s | 104 gso/s | 108 gso/s |
| nk_dots_symmetric_i8_haswell | 64 gso/s | 80.9 gso/s | 173 gso/s |
| nk_dots_packed_i8_icelake | 191 gso/s | 326 gso/s | 410 gso/s |
| nk_dots_symmetric_i8_icelake | 79.2 gso/s | 303 gso/s | 760 gso/s |
| nk_dots_packed_i8_sapphireamx | 547 gso/s | 1,610 gso/s | 1,300 gso/s |
| nk_dots_symmetric_i8_sapphireamx | 112 gso/s | 266 gso/s | 221 gso/s |
| nk_dots_packed_i8_alder | 180 gso/s | 229 gso/s | 270 gso/s |
| nk_dots_symmetric_i8_alder | 108 gso/s | 218 gso/s | 263 gso/s |
| u8 |  |  |  |
| nk_dots_packed_u8_serial | 7.45 gso/s | 7.79 gso/s | 7.88 gso/s |
| nk_dots_symmetric_u8_serial | 2.81 gso/s | 2.91 gso/s | 5.35 gso/s |
| nk_dots_packed_u8_haswell | 88 gso/s | 102 gso/s | 107 gso/s |
| nk_dots_symmetric_u8_haswell | 64.3 gso/s | 79.8 gso/s | 181 gso/s |
| nk_dots_packed_u8_icelake | 194 gso/s | 329 gso/s | 402 gso/s |
| nk_dots_symmetric_u8_icelake | 83.9 gso/s | 300 gso/s | 755 gso/s |
| nk_dots_packed_u8_sapphireamx | 550 gso/s | 1,680 gso/s | 1,330 gso/s |
| nk_dots_symmetric_u8_sapphireamx | 113 gso/s | 270 gso/s | 223 gso/s |
| nk_dots_packed_u8_alder | 181 gso/s | 230 gso/s | 266 gso/s |
| nk_dots_symmetric_u8_alder | 108 gso/s | 216 gso/s | 257 gso/s |
| i4 |  |  |  |
| nk_dots_packed_i4_serial | 2.43 gso/s | 2.43 gso/s | 2.24 gso/s |
| nk_dots_symmetric_i4_serial | 2.26 gso/s | 2.13 gso/s | 4.44 gso/s |
| nk_dots_packed_i4_icelake | 135 gso/s | 211 gso/s | 254 gso/s |
| nk_dots_symmetric_i4_icelake | 78.7 gso/s | 252 gso/s | 581 gso/s |
| u4 |  |  |  |
| nk_dots_packed_u4_serial | 3.27 gso/s | 3.37 gso/s | 3.33 gso/s |
| nk_dots_symmetric_u4_serial | 3.02 gso/s | 3.06 gso/s | 6.13 gso/s |
| nk_dots_packed_u4_icelake | 152 gso/s | 302 gso/s | 387 gso/s |
| nk_dots_symmetric_u4_icelake | 97.3 gso/s | 311 gso/s | 697 gso/s |
| u1 |  |  |  |
| nk_dots_packed_u1_haswell | 225 gso/s | 261 gso/s | 344 gso/s |
| nk_dots_symmetric_u1_haswell | 122 gso/s | 277 gso/s | 756 gso/s |
| nk_dots_packed_u1_icelake | 196 gso/s | 750 gso/s | 1,390 gso/s |
| nk_dots_symmetric_u1_icelake | 171 gso/s | 661 gso/s | 2,500 gso/s |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| f64 |  |  |  |
| nk_dots_packed_f64_serial | 0.947 gso/s, 3.4 ulp | 0.969 gso/s, 2.4 ulp | 0.969 gso/s, 0 ulp |
| nk_dots_symmetric_f64_serial | 0.957 gso/s, 3.7 ulp | 1.11 gso/s, 2.5 ulp | 1.16 gso/s, 0 ulp |
| nk_dots_packed_f64_v128relaxed | 2.73 gso/s, 23.6 ulp | 2.79 gso/s, 32.5 ulp | 2.81 gso/s, 3.9 ulp |
| nk_dots_symmetric_f64_v128relaxed | 2.01 gso/s, 21.6 ulp | 2.55 gso/s, 41.2 ulp | 2.77 gso/s, 2.9 ulp |
| f32 |  |  |  |
| nk_dots_packed_f32_serial | 4.27 gso/s, 14.6 ulp | 4.35 gso/s, 28.6 ulp | 4.47 gso/s, 25.3 ulp |
| nk_dots_symmetric_f32_serial | 3.13 gso/s, 11.5 ulp | 5.09 gso/s, 34.8 ulp | 5.78 gso/s, 44.7 ulp |
| nk_dots_packed_f32_v128relaxed | 10.4 gso/s, 12.9 ulp | 10.6 gso/s, 26.5 ulp | 10.9 gso/s, 39.7 ulp |
| nk_dots_symmetric_f32_v128relaxed | 3.73 gso/s, 10.3 ulp | 6.27 gso/s, 28.6 ulp | 7.43 gso/s, 76.2 ulp |
| bf16 |  |  |  |
| nk_dots_packed_bf16_serial | 4.33 gso/s, 0 ulp | 4.46 gso/s, 0.4 ulp | 4.45 gso/s, 9.5 ulp |
| nk_dots_symmetric_bf16_serial | 3.76 gso/s, 0 ulp | 6.36 gso/s, 0.5 ulp | 7.43 gso/s, 4.9 ulp |
| nk_dots_packed_bf16_v128relaxed | 23.2 gso/s, 0 ulp | 24.5 gso/s, 0.4 ulp | 24.9 gso/s, 6.8 ulp |
| nk_dots_symmetric_bf16_v128relaxed | 4.92 gso/s, 0 ulp | 10.5 gso/s, 0.5 ulp | 13.7 gso/s, 4.9 ulp |
| f16 |  |  |  |
| nk_dots_packed_f16_serial | 4.33 gso/s, 26 ulp | 4.46 gso/s, 26 ulp | 4.45 gso/s, 26 ulp |
| nk_dots_symmetric_f16_serial | 3.76 gso/s, 28 ulp | 6.36 gso/s, 28 ulp | 7.43 gso/s, 28 ulp |
| nk_dots_packed_f16_v128relaxed | 7.39 gso/s, 27 ulp | 7.36 gso/s, 27 ulp | 7.45 gso/s, 27 ulp |
| nk_dots_symmetric_f16_v128relaxed | 3.70 gso/s, 28 ulp | 3.83 gso/s, 28 ulp | 3.87 gso/s, 28 ulp |
| e5m2 |  |  |  |
| nk_dots_packed_e5m2_serial | 2.63 gso/s, 0 ulp | 2.69 gso/s, 0 ulp | 2.70 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_serial | 1.62 gso/s, 0 ulp | 2.04 gso/s, 0 ulp | 2.16 gso/s, 0 ulp |
| nk_dots_packed_e5m2_v128relaxed | 6.25 gso/s, 0 ulp | 6.50 gso/s, 0 ulp | 6.55 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_v128relaxed | 3.37 gso/s, 0 ulp | 5.23 gso/s, 0 ulp | 6.06 gso/s, 0 ulp |
| e4m3 |  |  |  |
| nk_dots_packed_e4m3_serial | 0.348 gso/s, 0 ulp | 0.345 gso/s, 0 ulp | 0.345 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_serial | 0.321 gso/s, 0 ulp | 0.340 gso/s, 0 ulp | 0.345 gso/s, 0 ulp |
| nk_dots_packed_e4m3_v128relaxed | 4.80 gso/s, 0 ulp | 4.92 gso/s, 0 ulp | 4.96 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_v128relaxed | 2.85 gso/s, 0 ulp | 4.17 gso/s, 0 ulp | 4.62 gso/s, 0 ulp |
| e2m3 |  |  |  |
| nk_dots_packed_e2m3_serial | 2.63 gso/s, 0 ulp | 2.69 gso/s, 0 ulp | 2.71 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_serial | 1.62 gso/s, 0 ulp | 2.06 gso/s, 0 ulp | 2.14 gso/s, 0 ulp |
| nk_dots_packed_e2m3_v128relaxed | 17.2 gso/s, 0 ulp | 18.2 gso/s, 0 ulp | 18.7 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_v128relaxed | 5.35 gso/s, 0 ulp | 11.6 gso/s, 0 ulp | 16.3 gso/s, 0 ulp |
| i8 |  |  |  |
| nk_dots_packed_i8_serial | 4.40 gso/s | 4.54 gso/s | 4.73 gso/s |
| nk_dots_symmetric_i8_serial | 2.74 gso/s | 3.89 gso/s | 4.29 gso/s |
| nk_dots_packed_i8_v128relaxed | 36.5 gso/s | 38.5 gso/s | 41.1 gso/s |
| nk_dots_symmetric_i8_v128relaxed | 29.2 gso/s | 36.3 gso/s | 39.2 gso/s |
| u8 |  |  |  |
| nk_dots_packed_u8_serial | 4.94 gso/s | 5.14 gso/s | 4.88 gso/s |
| nk_dots_symmetric_u8_serial | 2.74 gso/s | 3.94 gso/s | 4.40 gso/s |
| nk_dots_packed_u8_v128relaxed | 35.2 gso/s | 37.7 gso/s | 40.5 gso/s |
| nk_dots_symmetric_u8_v128relaxed | 21.0 gso/s | 26.6 gso/s | 28.6 gso/s |
| i4 |  |  |  |
| nk_dots_packed_i4_serial | 6.34 gso/s | 6.40 gso/s | 6.59 gso/s |
| nk_dots_symmetric_i4_serial | 2.70 gso/s | 3.76 gso/s | 4.13 gso/s |
| nk_dots_packed_i4_v128relaxed | 9.81 gso/s | 10.3 gso/s | 10.4 gso/s |
| nk_dots_symmetric_i4_v128relaxed | 4.95 gso/s | 15.6 gso/s | 32.8 gso/s |
| u4 |  |  |  |
| nk_dots_packed_u4_serial | 5.61 gso/s | 5.76 gso/s | 5.79 gso/s |
| nk_dots_symmetric_u4_serial | 3.01 gso/s | 4.34 gso/s | 4.94 gso/s |
| nk_dots_packed_u4_v128relaxed | 58.6 gso/s | 71.0 gso/s | 76.5 gso/s |
| nk_dots_symmetric_u4_v128relaxed | 6.97 gso/s | 21.9 gso/s | 46.7 gso/s |
| u1 |  |  |  |
| nk_dots_packed_u1_serial | 96.2 gso/s | 143 gso/s | 151 gso/s |
| nk_dots_packed_u1_v128relaxed | 166 gso/s | 280 gso/s | 294 gso/s |
| nk_dots_symmetric_u1_serial | 7.42 gso/s | 27.9 gso/s | 87.3 gso/s |
| nk_dots_symmetric_u1_v128relaxed | 7.35 gso/s | 27.5 gso/s | 81.9 gso/s |

Apple M5

Native

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| f64 |  |  |  |
| nk_dots_packed_f64_serial | 2.49 gso/s, 3 ulp | 2.36 gso/s, 5 ulp | 2.48 gso/s, 6 ulp |
| nk_dots_symmetric_f64_serial | 1.38 gso/s, 0 ulp | 1.36 gso/s, 0 ulp | 1.49 gso/s, 0 ulp |
| nk_dots_packed_f64_neon | 6.31 gso/s, 0 ulp | 6.00 gso/s, 0 ulp | 6.34 gso/s, 0 ulp |
| nk_dots_symmetric_f64_neon | 5.57 gso/s, 0 ulp | 5.41 gso/s, 0 ulp | 5.40 gso/s, 0 ulp |
| nk_dots_packed_f64_smef64 | 45.9 gso/s, 1.5 ulp | 46.3 gso/s, 1.1 ulp | 46.2 gso/s, 0.9 ulp |
| nk_dots_symmetric_f64_smef64 | 22.5 gso/s, 1.5 ulp | 24.3 gso/s, 1.2 ulp | 21.3 gso/s, 1.1 ulp |
| f32 |  |  |  |
| nk_dots_packed_f32_serial | 12.0 gso/s, 19 ulp | 11.4 gso/s, 30 ulp | 12.2 gso/s, 725 ulp |
| nk_dots_symmetric_f32_serial | 8.75 gso/s, 3.1 ulp | 9.15 gso/s, 12.8 ulp | 9.62 gso/s, 39.9 ulp |
| nk_dots_packed_f32_neon | 42.5 gso/s, 0 ulp | 40.6 gso/s, 0 ulp | 42.0 gso/s, 0 ulp |
| nk_dots_symmetric_f32_neon | 10.9 gso/s, 4.6 ulp | 10.5 gso/s, 17.7 ulp | 10.8 gso/s, 59 ulp |
| nk_dots_packed_f32_smef64 | 236 gso/s, 0 ulp | 268 gso/s, 15 ulp | 221 gso/s, 0 ulp |
| nk_dots_symmetric_f32_smef64 | 78.1 gso/s, 4.3 ulp | 94.1 gso/s, 19.0 ulp | 55.3 gso/s, 0 ulp |
| bf16 |  |  |  |
| nk_dots_packed_bf16_serial | 20.4 gso/s, 0.1 ulp | 19.6 gso/s, 0.5 ulp | 20.3 gso/s, 5 ulp |
| nk_dots_symmetric_bf16_serial | 16.3 gso/s, 0.01 ulp | 16.9 gso/s, 0.7 ulp | 17.8 gso/s, 115 ulp |
| nk_dots_packed_bf16_neon | 83.0 gso/s, 0 ulp | 80.2 gso/s, 0 ulp | 84.0 gso/s, 0 ulp |
| nk_dots_symmetric_bf16_neon | 39.5 gso/s, 0 ulp | 41.2 gso/s, 0 ulp | 41.9 gso/s, 0 ulp |
| nk_dots_packed_bf16_neonbfdot | 57.9 gso/s, 0 ulp | 58.5 gso/s, 0.5 ulp | 63.4 gso/s, 7.2 ulp |
| nk_dots_symmetric_bf16_neonbfdot | 38.6 gso/s, 0 ulp | 41.1 gso/s, 0.5 ulp | 43.5 gso/s, 0 ulp |
| nk_dots_packed_bf16_sme | 1,106 gso/s, 0 ulp | 1,208 gso/s, 4.2 ulp | 1,190 gso/s, 3.8 ulp |
| nk_dots_symmetric_bf16_sme | 606 gso/s, 0.07 ulp | 650 gso/s, 1.2 ulp | 458 gso/s, 1.8 ulp |
| f16 |  |  |  |
| nk_dots_packed_f16_serial | 14.8 gso/s, 204 ulp | 14.2 gso/s, 36 ulp | 14.8 gso/s, 326 ulp |
| nk_dots_symmetric_f16_serial | 24.3 gso/s, 13 ulp | 24.9 gso/s, 24.6 ulp | 26.7 gso/s, 506 ulp |
| nk_dots_packed_f16_neonhalf | 77.0 gso/s, 16.8 ulp | 79.1 gso/s, 25.5 ulp | 84.2 gso/s, 618 ulp |
| nk_dots_symmetric_f16_neonhalf | 20.5 gso/s, 12.1 ulp | 20.4 gso/s, 25.0 ulp | 22.5 gso/s, 506 ulp |
| nk_dots_packed_f16_neonfhm | 104 gso/s, 16.7 ulp | 110 gso/s, 25.5 ulp | 118 gso/s, 618 ulp |
| nk_dots_symmetric_f16_neonfhm | 34.5 gso/s, 12.1 ulp | 40.4 gso/s, 25.0 ulp | 41.5 gso/s, 506 ulp |
| nk_dots_packed_f16_sme | 1,106 gso/s, 14.8 ulp | 1,213 gso/s, 28.2 ulp | 1,190 gso/s, 28.2 ulp |
| nk_dots_symmetric_f16_sme | 607 gso/s, 12.1 ulp | 636 gso/s, 23.8 ulp | 458 gso/s, 24.4 ulp |
| e5m2 |  |  |  |
| nk_dots_packed_e5m2_serial | 15.9 gso/s, 0 ulp | 16.7 gso/s, 0 ulp | 17.2 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_serial | 7.56 gso/s, 0 ulp | 8.37 gso/s, 0 ulp | 8.99 gso/s, 0 ulp |
| nk_dots_packed_e5m2_neonfhm | 88.1 gso/s, 0 ulp | 97.3 gso/s, 0 ulp | 103 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_neonfhm | 61.0 gso/s, 0 ulp | 73.2 gso/s, 0 ulp | 79.3 gso/s, 0 ulp |
| nk_dots_packed_e5m2_sme | 729 gso/s, 0 ulp | 800 gso/s, 0 ulp | 792 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_sme | 208 gso/s, 0 ulp | 227 gso/s, 0 ulp | 229 gso/s, 0 ulp |
| e4m3 |  |  |  |
| nk_dots_packed_e4m3_serial | 1.24 gso/s, 0 ulp | 1.20 gso/s, 0 ulp | 1.24 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_serial | 1.20 gso/s, 0 ulp | 1.24 gso/s, 0 ulp | 1.32 gso/s, 0 ulp |
| nk_dots_packed_e4m3_neonfhm | 29.6 gso/s, 0 ulp | 32.2 gso/s, 0 ulp | 34.1 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_neonfhm | 32.0 gso/s, 0 ulp | 36.6 gso/s, 0 ulp | 38.9 gso/s, 0 ulp |
| nk_dots_packed_e4m3_sme | 284 gso/s, 0 ulp | 314 gso/s, 0 ulp | 316 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_sme | 74.3 gso/s, 0 ulp | 80.9 gso/s, 0 ulp | 77.8 gso/s, 0 ulp |
| e3m2 |  |  |  |
| nk_dots_packed_e3m2_serial | 14.0 gso/s, 0 ulp | 14.6 gso/s, 0 ulp | 15.5 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_serial | 7.51 gso/s, 0 ulp | 8.10 gso/s, 0 ulp | 9.05 gso/s, 0 ulp |
| nk_dots_packed_e3m2_sme | 671 gso/s, 0 ulp | 738 gso/s, 0 ulp | 730 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_sme | 191 gso/s, 0 ulp | 206 gso/s, 0 ulp | 207 gso/s, 0 ulp |
| e2m3 |  |  |  |
| nk_dots_packed_e2m3_serial | 14.4 gso/s, 0 ulp | 14.8 gso/s, 0 ulp | 15.5 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_serial | 7.58 gso/s, 0 ulp | 8.21 gso/s, 0 ulp | 9.09 gso/s, 0 ulp |
| nk_dots_packed_e2m3_sme | 1,211 gso/s, 0 ulp | 1,404 gso/s, 0 ulp | 1,313 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_sme | 372 gso/s, 0 ulp | 410 gso/s, 0 ulp | 416 gso/s, 0 ulp |
| i8 |  |  |  |
| nk_dots_packed_i8_serial | 18.9 gso/s | 20.0 gso/s | 20.2 gso/s |
| nk_dots_symmetric_i8_serial | 12.6 gso/s | 13.9 gso/s | 14.8 gso/s |
| nk_dots_packed_i8_neonsdot | 345 gso/s | 419 gso/s | 477 gso/s |
| nk_dots_symmetric_i8_neonsdot | 76.6 gso/s | 86.9 gso/s | 87.2 gso/s |
| nk_dots_packed_i8_sme | 2,348 gso/s | 2,687 gso/s | 2,570 gso/s |
| nk_dots_symmetric_i8_sme | 1,390 gso/s | 1,531 gso/s | 1,369 gso/s |
| u8 |  |  |  |
| nk_dots_packed_u8_serial | 16.3 gso/s | 16.3 gso/s | 17.4 gso/s |
| nk_dots_symmetric_u8_serial | 14.8 gso/s | 16.2 gso/s | 17.5 gso/s |
| nk_dots_packed_u8_neonsdot | 343 gso/s | 413 gso/s | 470 gso/s |
| nk_dots_symmetric_u8_neonsdot | 76.1 gso/s | 87.4 gso/s | 87.7 gso/s |
| nk_dots_packed_u8_sme | 2,351 gso/s | 2,684 gso/s | 2,570 gso/s |
| nk_dots_symmetric_u8_sme | 1,390 gso/s | 1,543 gso/s | 1,371 gso/s |
| i4 |  |  |  |
| nk_dots_packed_i4_serial | 18.3 gso/s | 18.2 gso/s | 19.6 gso/s |
| nk_dots_symmetric_i4_serial | 13.7 gso/s | 14.9 gso/s | 15.6 gso/s |
| nk_dots_packed_i4_neonsdot | 259 gso/s | 284 gso/s | 291 gso/s |
| nk_dots_symmetric_i4_neonsdot | 129 gso/s | 162 gso/s | 171 gso/s |
| nk_dots_packed_i4_sme | 2,269 gso/s | 2,455 gso/s | 2,396 gso/s |
| nk_dots_symmetric_i4_sme | 1,585 gso/s | 1,692 gso/s | 1,737 gso/s |
| u4 |  |  |  |
| nk_dots_packed_u4_serial | 19.4 gso/s | 19.4 gso/s | 20.6 gso/s |
| nk_dots_symmetric_u4_serial | 14.9 gso/s | 16.4 gso/s | 17.4 gso/s |
| nk_dots_packed_u4_neonsdot | 300 gso/s | 319 gso/s | 340 gso/s |
| nk_dots_symmetric_u4_neonsdot | 128 gso/s | 166 gso/s | 173 gso/s |
| nk_dots_packed_u4_sme | 2,342 gso/s | 2,503 gso/s | 2,471 gso/s |
| nk_dots_symmetric_u4_sme | 1,695 gso/s | 1,925 gso/s | 2,055 gso/s |
| u1 |  |  |  |
| nk_dots_packed_u1_serial | 405 gso/s | 467 gso/s | 534 gso/s |
| nk_dots_symmetric_u1_serial | 254 gso/s | 430 gso/s | 519 gso/s |
| nk_dots_packed_u1_neon | 849 gso/s | 932 gso/s | 1,014 gso/s |
| nk_dots_symmetric_u1_neon | 318 gso/s | 580 gso/s | 664 gso/s |
| nk_dots_packed_u1_smebi32 | 1,903 gso/s | 12,029 gso/s | 26,354 gso/s |
| nk_dots_symmetric_u1_smebi32 | 176 gso/s | 768 gso/s | 2,153 gso/s |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| f64 |  |  |  |
| nk_dots_packed_f64_serial | 2.15 gso/s, 3 ulp | 2.07 gso/s, 5 ulp | 2.23 gso/s, 2.2 ulp |
| nk_dots_symmetric_f64_serial | 2.35 gso/s, 4 ulp | 2.24 gso/s, 3 ulp | 2.46 gso/s, 2.4 ulp |
| nk_dots_packed_f64_v128relaxed | 5.59 gso/s, 32.4 ulp | 6.10 gso/s, 32.4 ulp | 6.24 gso/s, 32.4 ulp |
| nk_dots_symmetric_f64_v128relaxed | 5.26 gso/s, 37.6 ulp | 5.89 gso/s, 37.6 ulp | 6.04 gso/s, 37.6 ulp |
| f32 |  |  |  |
| nk_dots_packed_f32_serial | 8.95 gso/s, 19 ulp | 8.71 gso/s, 30 ulp | 9.17 gso/s, 41.7 ulp |
| nk_dots_symmetric_f32_serial | 10.9 gso/s, 20 ulp | 10.5 gso/s, 29 ulp | 11.6 gso/s, 58.8 ulp |
| nk_dots_packed_f32_v128relaxed | 27.4 gso/s, 44.1 ulp | 31.6 gso/s, 44.1 ulp | 32.7 gso/s, 44.1 ulp |
| nk_dots_symmetric_f32_v128relaxed | 10.0 gso/s, 48.2 ulp | 10.9 gso/s, 48.2 ulp | 11.2 gso/s, 48.2 ulp |
| bf16 |  |  |  |
| nk_dots_packed_bf16_serial | 23.1 gso/s, 0.1 ulp | 21.6 gso/s, 0.5 ulp | 24.3 gso/s, 1.3 ulp |
| nk_dots_symmetric_bf16_serial | 24.3 gso/s, 0 ulp | 24.9 gso/s, 0.6 ulp | 28.0 gso/s, 1.1 ulp |
| nk_dots_packed_bf16_v128relaxed | 70.4 gso/s, 1.4 ulp | 86.2 gso/s, 1.4 ulp | 90.3 gso/s, 1.4 ulp |
| nk_dots_symmetric_bf16_v128relaxed | 37.2 gso/s, 1.3 ulp | 45.5 gso/s, 1.3 ulp | 47.7 gso/s, 1.3 ulp |
| f16 |  |  |  |
| nk_dots_packed_f16_serial | 12.2 gso/s, 204 ulp | 11.6 gso/s, 36 ulp | 12.4 gso/s, 25.9 ulp |
| nk_dots_symmetric_f16_serial | 1.65 gso/s, 13 ulp | 1.54 gso/s, 29 ulp | 1.70 gso/s, 27.9 ulp |
| nk_dots_packed_f16_v128relaxed | 35.4 gso/s, ? ulp | 40.7 gso/s, ? ulp | 39.3 gso/s, ? ulp |
| nk_dots_symmetric_f16_v128relaxed | 14.7 gso/s, ? ulp | 17.1 gso/s, ? ulp | 17.3 gso/s, ? ulp |
| e5m2 |  |  |  |
| nk_dots_packed_e5m2_serial | 5.95 gso/s, 0 ulp | 5.59 gso/s, 0 ulp | 6.31 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_serial | 8.98 gso/s, 0 ulp | 9.09 gso/s, 0 ulp | 10.2 gso/s, 0 ulp |
| nk_dots_packed_e5m2_v128relaxed | 23.0 gso/s, 0 ulp | 25.5 gso/s, 0 ulp | 25.9 gso/s, 0 ulp |
| nk_dots_symmetric_e5m2_v128relaxed | 12.3 gso/s, 0 ulp | 13.8 gso/s, 0 ulp | 14.2 gso/s, 0 ulp |
| e4m3 |  |  |  |
| nk_dots_packed_e4m3_serial | 0.884 gso/s, 0 ulp | 0.840 gso/s, 0 ulp | 0.911 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_serial | 0.868 gso/s, 0 ulp | 0.826 gso/s, 0 ulp | 0.915 gso/s, 0 ulp |
| nk_dots_packed_e4m3_v128relaxed | 19.2 gso/s, 0 ulp | 20.8 gso/s, 0 ulp | 22.5 gso/s, 0 ulp |
| nk_dots_symmetric_e4m3_v128relaxed | 10.7 gso/s, 0 ulp | 11.7 gso/s, 0 ulp | 12.1 gso/s, 0 ulp |
| e3m2 |  |  |  |
| nk_dots_packed_e3m2_serial | 5.89 gso/s, 0 ulp | 5.73 gso/s, 0 ulp | 6.25 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_serial | 7.69 gso/s, 0 ulp | 7.45 gso/s, 0 ulp | 8.68 gso/s, 0 ulp |
| nk_dots_packed_e3m2_v128relaxed | 35.2 gso/s, 0 ulp | 38.9 gso/s, 0 ulp | 40.1 gso/s, 0 ulp |
| nk_dots_symmetric_e3m2_v128relaxed | 32.0 gso/s, 0 ulp | 38.1 gso/s, 0 ulp | 39.7 gso/s, 0 ulp |
| e2m3 |  |  |  |
| nk_dots_packed_e2m3_serial | 5.97 gso/s, 0 ulp | 5.69 gso/s, 0 ulp | 6.32 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_serial | 7.65 gso/s, 0 ulp | 7.71 gso/s, 0 ulp | 8.66 gso/s, 0 ulp |
| nk_dots_packed_e2m3_v128relaxed | 35.4 gso/s, 0 ulp | 39.0 gso/s, 0 ulp | 40.1 gso/s, 0 ulp |
| nk_dots_symmetric_e2m3_v128relaxed | 31.6 gso/s, 0 ulp | 37.6 gso/s, 0 ulp | 39.7 gso/s, 0 ulp |
| i8 |  |  |  |
| nk_dots_packed_i8_serial | 16.5 gso/s | 16.0 gso/s | 16.7 gso/s |
| nk_dots_symmetric_i8_serial | 12.5 gso/s | 11.8 gso/s | 13.6 gso/s |
| nk_dots_packed_i8_v128relaxed | 44.0 gso/s | 50.0 gso/s | 52.1 gso/s |
| nk_dots_symmetric_i8_v128relaxed | 37.7 gso/s | 45.5 gso/s | 50.6 gso/s |
| u8 |  |  |  |
| nk_dots_packed_u8_serial | 17.2 gso/s | 16.7 gso/s | 17.7 gso/s |
| nk_dots_symmetric_u8_serial | 13.0 gso/s | 12.1 gso/s | 14.1 gso/s |
| nk_dots_packed_u8_v128relaxed | 43.3 gso/s | 47.7 gso/s | 50.8 gso/s |
| nk_dots_symmetric_u8_v128relaxed | 34.6 gso/s | 42.2 gso/s | 48.6 gso/s |
| i4 |  |  |  |
| nk_dots_packed_i4_serial | 15.0 gso/s | 14.3 gso/s | 15.9 gso/s |
| nk_dots_symmetric_i4_serial | 12.8 gso/s | 12.6 gso/s | 14.0 gso/s |
| nk_dots_packed_i4_v128relaxed | 29.3 gso/s | 26.7 gso/s | 25.8 gso/s |
| nk_dots_symmetric_i4_v128relaxed | 54.0 gso/s | 70.9 gso/s | 80.8 gso/s |
| u4 |  |  |  |
| nk_dots_packed_u4_serial | 14.6 gso/s | 14.1 gso/s | 15.4 gso/s |
| nk_dots_symmetric_u4_serial | 11.9 gso/s | 11.8 gso/s | 13.0 gso/s |
| nk_dots_packed_u4_v128relaxed | 84.9 gso/s | 92.5 gso/s | 96.2 gso/s |
| nk_dots_symmetric_u4_v128relaxed | 67.4 gso/s | 87.7 gso/s | 93.7 gso/s |
| u1 |  |  |  |
| nk_dots_packed_u1_serial | 236 gso/s | 265 gso/s | 311 gso/s |
| nk_dots_symmetric_u1_serial | 173 gso/s | 321 gso/s | 443 gso/s |
| nk_dots_packed_u1_v128relaxed | 598 gso/s | 804 gso/s | 871 gso/s |
| nk_dots_symmetric_u1_v128relaxed | 183 gso/s | 390 gso/s | 543 gso/s |