Batched Distance Matrices in NumKong

April 2, 2026

NumKong implements batched distance matrix computation via pre-packed dot products plus normalization. Angular distance and Euclidean distance are computed from the packed dot product output without materializing an intermediate C matrix.

Angular distance from pre-packed dot products:

$$D_{ij} = 1 - \frac{C_{ij}}{\sqrt{\|A_i\|^2 \cdot \|B_j\|^2}}$$

Euclidean distance from pre-packed dot products:

$$D_{ij} = \sqrt{\|A_i\|^2 + \|B_j\|^2 - 2 C_{ij}}$$

Reformulating as Python pseudocode:

import numpy as np

def angulars_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    dots = a @ b.T
    a_norms = np.sum(a ** 2, axis=1, keepdims=True)
    b_norms = np.sum(b ** 2, axis=1, keepdims=True)
    return 1 - dots / np.sqrt(a_norms * b_norms.T)

def euclideans_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    dots = a @ b.T
    a_norms = np.sum(a ** 2, axis=1, keepdims=True)
    b_norms = np.sum(b ** 2, axis=1, keepdims=True)
    return np.sqrt(np.maximum(a_norms + b_norms.T - 2 * dots, 0))
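
As a sanity check on the two formulas, the sketch below compares the dot-product formulation against brute-force references that normalize or subtract the vectors explicitly. The helper names (`angulars_direct`, `euclideans_direct`, `distances_from_dots`) are illustrative and are not part of NumKong.

```python
import numpy as np

def angulars_direct(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Reference: normalize every row first, then take 1 - cosine similarity.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1 - a_unit @ b_unit.T

def euclideans_direct(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Reference: materialize every difference vector, shape (M, N, K).
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def distances_from_dots(a: np.ndarray, b: np.ndarray):
    # Same algebra as the formulas above, sharing one dot-product matrix.
    dots = a @ b.T
    a_norms = np.sum(a ** 2, axis=1, keepdims=True)
    b_norms = np.sum(b ** 2, axis=1, keepdims=True)
    angular = 1 - dots / np.sqrt(a_norms * b_norms.T)
    euclidean = np.sqrt(np.maximum(a_norms + b_norms.T - 2 * dots, 0))
    return angular, euclidean

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 16))
b = rng.standard_normal((7, 16))
angular, euclidean = distances_from_dots(a, b)
assert np.allclose(angular, angulars_direct(a, b))
assert np.allclose(euclidean, euclideans_direct(a, b))
```

The `np.maximum(..., 0)` clamp matters because rounding can push the expanded expression slightly below zero when two vectors nearly coincide.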

Input & Output Types

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f64 | 64-bit IEEE 754 double precision |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f32 | 16-bit IEEE 754 half precision, widened output |
| bf16 | f32 | 16-bit brain float, widened output |
| e4m3 | f32 | 8-bit Float8: 4 exponent, 3 mantissa bits |
| e5m2 | f32 | 8-bit Float8: 5 exponent, 2 mantissa bits |
| e2m3 | f32 | 8-bit MX format: 2 exponent, 3 mantissa bits |
| e3m2 | f32 | 8-bit MX format: 3 exponent, 2 mantissa bits |
| i8 | f32 | 8-bit signed integers, float output |
| u8 | f32 | 8-bit unsigned integers, float output |
| i4 | f32 | 4-bit signed integers, float output |
| u4 | f32 | 4-bit unsigned integers, float output |
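
To illustrate what a widened or float output means in practice, here is a minimal NumPy sketch. It assumes that widening is modelled by casting to a wider type before multiplying; the function names are hypothetical, and the actual kernels may accumulate differently.

```python
import numpy as np

def dots_widened_f16(a16: np.ndarray, b16: np.ndarray) -> np.ndarray:
    # Widen half-precision inputs to float32 before multiplying and summing,
    # so the result is float32 regardless of the storage type.
    return a16.astype(np.float32) @ b16.astype(np.float32).T

def euclideans_i8(a8: np.ndarray, b8: np.ndarray) -> np.ndarray:
    # Integer dot products and squared norms are computed exactly in wider
    # integers; only the final square root produces a float32 value.
    a32, b32 = a8.astype(np.int32), b8.astype(np.int32)
    dots = a32 @ b32.T
    a_norms = np.sum(a32 * a32, axis=1, keepdims=True)
    b_norms = np.sum(b32 * b32, axis=1, keepdims=True)
    return np.sqrt((a_norms + b_norms.T - 2 * dots).astype(np.float32))

rng = np.random.default_rng(0)
a16 = rng.standard_normal((4, 64)).astype(np.float16)
b16 = rng.standard_normal((3, 64)).astype(np.float16)
a8 = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)
b8 = rng.integers(-128, 128, size=(3, 64), dtype=np.int8)
print(dots_widened_f16(a16, b16).dtype, euclideans_i8(a8, b8).dtype)
```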

Optimizations

Distance-from-Dot Algebraic Reduction

nk_angulars_packed_f32_haswell, nk_angulars_packed_f32_skylake, nk_euclideans_packed_f32_haswell, nk_euclideans_packed_f32_skylake derive distance matrices from pre-packed dot product output without materializing an intermediate result matrix. Angular distance rewrites as $1 - \text{dot}(a,b) \cdot \text{rsqrt}(\|a\|^2 \cdot \|b\|^2)$, converting two separate square roots and a division into one rsqrt and one multiply. Euclidean distance expands the identity $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2 \cdot \text{dot}(a,b)$, requiring only one final sqrt per output element. Both formulas decompose into: (1) a batched GEMM for all M×N dot products, (2) per-vector squared norms precomputed once during packing. The singular spatial/ kernels compute these three sums ($\sum a_i b_i$, $\sum a_i^2$, $\sum b_i^2$) in a single pass with three interleaved accumulators; the batched spatials/ kernels separate them: norms are computed once per vector during packing, and dots come from the GEMM, trading register pressure for amortized cost across the full M×N output.
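
The contrast between the single-pair and batched strategies can be sketched in NumPy. This is an illustration of the algebra only, not NumKong's kernels, and `angular_single_pair` and `angulars_batched` are hypothetical names.

```python
import numpy as np

def angular_single_pair(a: np.ndarray, b: np.ndarray) -> float:
    # One pass over both vectors with three running sums.
    dot = a_sq = b_sq = 0.0
    for x, y in zip(a, b):
        dot += x * y
        a_sq += x * x
        b_sq += y * y
    return 1.0 - dot / np.sqrt(a_sq * b_sq)

def angulars_batched(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_sq = np.einsum("ik,ik->i", a, a)  # M squared norms, computed once
    b_sq = np.einsum("jk,jk->j", b, b)  # N squared norms, computed once
    dots = a @ b.T                      # M x N dot products from one GEMM
    return 1.0 - dots / np.sqrt(np.outer(a_sq, b_sq))

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 32))
b = rng.standard_normal((6, 32))
batched = angulars_batched(a, b)
pairwise = np.array([[angular_single_pair(x, y) for y in b] for x in a])
assert np.allclose(batched, pairwise)
```

Applying the single-pair routine to every pair costs about 3·M·N·K multiply-adds, while the batched form costs M·N·K for the dot products plus (M+N)·K for the norms.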

Serial vs Vectorized Sqrt and Rsqrt Cost

nk_angular_through_f32_from_dot_serial_ uses the Quake 3 fast inverse square root (magic constant 0x5F375A86, three Newton-Raphson iterations, ~34.9 correct bits for Float32) to compute dot * rsqrt(query_norm * target_norm). nk_angular_through_f32_from_dot_haswell_ replaces this with hardware _mm_rsqrt_ps (~12-bit approximation, 5cy latency, 1/cy on port 0) plus one Newton-Raphson refinement step (~22–24 correct bits). nk_euclidean_through_f32_from_dot_serial_ computes sqrt(x) as x * rsqrt(x), reusing the same rsqrt path. nk_euclidean_through_f32_from_dot_haswell_ uses exact _mm_sqrt_ps (11cy latency, 7cy throughput for XMM) instead of the rsqrt approximation, because the subtraction $\|a\|^2 + \|b\|^2 - 2 \cdot \text{dot}$ can produce values near zero where rsqrt error would be amplified by the subsequent multiply. For Float64, all backends use exact division and sqrt with no fast rsqrt approximation, since reaching 52 mantissa bits of precision would need 4+ Newton-Raphson iterations, negating the speed advantage. The 4-wide finalizer batching amortizes these costs: one rsqrt or sqrt call processes 4 output elements simultaneously, hiding the latency behind the GEMM tile's computation.
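
The bit-trick rsqrt with Newton-Raphson refinement can be sketched as follows. This NumPy version only illustrates the technique with the magic constant quoted above; `rsqrt_quake` and `sqrt_via_rsqrt` are hypothetical names, not the library's serial kernel.

```python
import numpy as np

def rsqrt_quake(x, iterations: int = 3) -> np.ndarray:
    # Reinterpret the float32 bits as an integer, halve and negate the
    # exponent with one subtraction, then refine the guess.
    # Valid for positive, normal float32 inputs.
    x = np.ascontiguousarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    guess_bits = np.uint32(0x5F375A86) - (bits >> 1)
    y = guess_bits.view(np.float32)
    half_x = np.float32(0.5) * x
    for _ in range(iterations):
        # Newton-Raphson step for f(y) = 1 / y^2 - x.
        y = y * (np.float32(1.5) - half_x * y * y)
    return y

def sqrt_via_rsqrt(x) -> np.ndarray:
    # sqrt(x) = x * rsqrt(x), reusing the same approximation path.
    x = np.ascontiguousarray(x, dtype=np.float32)
    return x * rsqrt_quake(x)

values = np.array([0.25, 1.0, 2.0, 100.0, 12345.0], dtype=np.float32)
exact = 1.0 / np.sqrt(values.astype(np.float64))
for steps in range(4):
    relative_error = np.abs(rsqrt_quake(values, steps) / exact - 1.0).max()
    print(steps, relative_error)
```

Each refinement step roughly squares the relative error until float32 rounding becomes the limit, which is why a small fixed number of iterations is enough for single precision.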

Norm Precomputation in Packed Buffers

nk_dots_pack_f32_serial, nk_dots_pack_f32_haswell, nk_dots_pack_bf16_haswell compute per-column squared norms $\|b_j\|^2 = \sum_k b_{jk}^2 = \text{dot}(b_j, b_j)$ during the packing step via nk_reduce_moments_* primitives. The squared norm is a self-dot-product, already a byproduct of touching every element for type conversion and layout transformation. Angular and Euclidean finalizers read norms from packed buffer metadata, eliminating a separate O(N·K) norm pass over B.
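
A rough model of this idea: the packing step returns both the converted matrix and its squared norms, and the finalizer reuses them for every query batch. `PackedMatrix`, `pack_with_norms`, and `euclideans_from_packed` are hypothetical names for this sketch and do not reflect NumKong's actual buffer layout.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class PackedMatrix:
    data: np.ndarray           # converted, contiguous copy of B
    squared_norms: np.ndarray  # one squared norm per row of B

def pack_with_norms(b: np.ndarray, dtype=np.float32) -> PackedMatrix:
    # Type conversion already touches every element, so the self-dot-products
    # are gathered in the same step.
    data = np.ascontiguousarray(b, dtype=dtype)
    squared_norms = np.einsum("jk,jk->j", data, data)
    return PackedMatrix(data, squared_norms)

def euclideans_from_packed(a: np.ndarray, packed: PackedMatrix) -> np.ndarray:
    a = np.asarray(a, dtype=packed.data.dtype)
    dots = a @ packed.data.T
    a_sq = np.einsum("ik,ik->i", a, a)[:, None]
    # Norms of B come from the packed metadata; no extra pass over B.
    squared = a_sq + packed.squared_norms[None, :] - 2 * dots
    return np.sqrt(np.maximum(squared, 0))

rng = np.random.default_rng(2)
b = rng.standard_normal((8, 32))
packed = pack_with_norms(b)  # pay the packing cost once
queries = [rng.standard_normal((3, 32)) for _ in range(2)]
results = [euclideans_from_packed(q, packed) for q in queries]
```

Packing once and querying many times is where the saving appears: each additional batch only computes its own norms and the dot products.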

Performance

The following performance tables are produced by manually re-running the included internal tools nk_test and nk_bench to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_MATRIX_HEIGHT, NK_MATRIX_WIDTH, and NK_MATRIX_DEPTH environment variables, all set to the same value for batched distance computations over square matrices. Columns show throughput for 256³, 1024³, and 4096³ configurations. Throughput is measured in GSO/s (Giga Scalar Operations per Second), with $\text{ops} = 2 \cdot M \cdot N \cdot K$ complexity for computing $M \times N$ pairwise distances over $K$-dimensional vectors. Accuracy is reported as mean ULP (units in last place) unless noted otherwise: the average number of representable floating-point values between the result and the exact answer. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
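
To make the two reported metrics concrete, the sketch below converts a timing into GSO/s and measures ULP distance for Float32 values. It is one plausible reading of the definitions above, not the code used by nk_bench or nk_test, and all function names are hypothetical.

```python
import struct

def gso_per_second(m: int, n: int, k: int, seconds: float) -> float:
    # ops = 2 * M * N * K, reported in units of 1e9 operations per second.
    return 2 * m * n * k / seconds / 1e9

def _ordered_bits_f32(value: float) -> int:
    # Map a float32 bit pattern to an integer that increases with the value,
    # so adjacent representable numbers differ by exactly one.
    bits = struct.unpack("<i", struct.pack("<f", value))[0]
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)

def ulp_distance_f32(result: float, exact: float) -> int:
    return abs(_ordered_bits_f32(result) - _ordered_bits_f32(exact))

def mean_ulp_f32(results, exacts) -> float:
    distances = [ulp_distance_f32(r, e) for r, e in zip(results, exacts)]
    return sum(distances) / len(distances)

# Time implied for one 4096³ call by a throughput of 31.6 GSO/s.
seconds_per_call = 2 * 4096**3 / (31.6 * 1e9)
print(f"{seconds_per_call:.2f} s per call")
print(mean_ulp_f32([1.0, 2.0], [1.0, 2.0 + 2.0**-22]))
```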

Intel Sapphire Rapids

Native

| Kernel | 256³ | 1024³ | 4096³ |
| --- | --- | --- | --- |
| f64 | | | |
| nk_angulars_packed_f64_serial | 0.578 gso/s, 0 ulp | 0.691 gso/s, 0 ulp | 0.787 gso/s, 0 ulp |
| nk_angulars_symmetric_f64_serial | 0.477 gso/s, 0 ulp | 0.569 gso/s, 0 ulp | 1.24 gso/s, 0 ulp |
| nk_euclideans_packed_f64_serial | 0.569 gso/s, 0.6 ulp | 0.692 gso/s, 0.6 ulp | 0.775 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_f64_serial | 0.477 gso/s, 0.6 ulp | 0.562 gso/s, 0.6 ulp | 1.26 gso/s, 0.3 ulp |
| nk_angulars_packed_f64_haswell | 5.89 gso/s, 0 ulp | 6.04 gso/s, 0 ulp | 6.08 gso/s, 0 ulp |
| nk_angulars_symmetric_f64_haswell | 5.17 gso/s, 0 ulp | 5.56 gso/s, 0 ulp | 11.3 gso/s, 0 ulp |
| nk_euclideans_packed_f64_haswell | 5.83 gso/s, 0.2 ulp | 6.21 gso/s, 0.2 ulp | 6.24 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f64_haswell | 5.33 gso/s, 0.2 ulp | 5.62 gso/s, 0.2 ulp | 11.7 gso/s, 0.2 ulp |
| nk_angulars_packed_f64_skylake | 7.56 gso/s, 0 ulp | 8.46 gso/s, 0 ulp | 8.92 gso/s, 0 ulp |
| nk_angulars_symmetric_f64_skylake | 7.37 gso/s, 0 ulp | 8.66 gso/s, 0 ulp | 17.1 gso/s, 0 ulp |
| nk_euclideans_packed_f64_skylake | 8.06 gso/s, 0.2 ulp | 8.37 gso/s, 0.2 ulp | 8.06 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f64_skylake | 7.14 gso/s, 0.2 ulp | 8.43 gso/s, 0.2 ulp | 17.4 gso/s, 0.2 ulp |
| f32 | | | |
| nk_angulars_packed_f32_serial | 15.0 gso/s, 0.1 ulp | 16.3 gso/s, 0.1 ulp | 16.4 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f32_serial | 3.86 gso/s, 0.1 ulp | 4.29 gso/s, 0.1 ulp | 8.62 gso/s, 0.1 ulp |
| nk_euclideans_packed_f32_serial | 15.3 gso/s, 0.6 ulp | 17.0 gso/s, 0.5 ulp | 17.0 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_f32_serial | 3.97 gso/s, 0.6 ulp | 4.16 gso/s, 0.5 ulp | 8.38 gso/s, 0.3 ulp |
| nk_angulars_packed_f32_haswell | 29.3 gso/s, 0 ulp | 31.6 gso/s, 0 ulp | 31.6 gso/s, 0 ulp |
| nk_angulars_symmetric_f32_haswell | 21.4 gso/s, 0 ulp | 24.8 gso/s, 0 ulp | 52 gso/s, 0 ulp |
| nk_euclideans_packed_f32_haswell | 29.7 gso/s, 0.2 ulp | 32 gso/s, 0.2 ulp | 32.9 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f32_haswell | 21.8 gso/s, 0.2 ulp | 25.7 gso/s, 0.2 ulp | 53 gso/s, 0.2 ulp |
| nk_angulars_packed_f32_skylake | 33.3 gso/s, 0 ulp | 39.4 gso/s, 0 ulp | 37.5 gso/s, 0 ulp |
| nk_angulars_symmetric_f32_skylake | 24.8 gso/s, 0 ulp | 25.5 gso/s, 0 ulp | 61.4 gso/s, 0 ulp |
| nk_euclideans_packed_f32_skylake | 34.4 gso/s, 0.2 ulp | 40.3 gso/s, 0.2 ulp | 40.3 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f32_skylake | 25.1 gso/s, 0.2 ulp | 29.3 gso/s, 0.2 ulp | 65.9 gso/s, 0.2 ulp |
| bf16 | | | |
| nk_angulars_packed_bf16_serial | 1.18 gso/s, 0 ulp | 1.21 gso/s, 0 ulp | 1.19 gso/s, 0.1 ulp |
| nk_angulars_symmetric_bf16_serial | 1.19 gso/s, 0 ulp | 1.18 gso/s, 0 ulp | 2.35 gso/s, 0 ulp |
| nk_euclideans_packed_bf16_serial | 1.20 gso/s, 0.6 ulp | 1.18 gso/s, 0.6 ulp | 1.16 gso/s, 6.0 ulp |
| nk_euclideans_symmetric_bf16_serial | 1.11 gso/s, 0.6 ulp | 1.14 gso/s, 0.6 ulp | 2.34 gso/s, 0.4 ulp |
| nk_angulars_packed_bf16_haswell | 54.6 gso/s, 0 ulp | 65.7 gso/s, 0 ulp | 66.1 gso/s, 0.1 ulp |
| nk_angulars_symmetric_bf16_haswell | 38.3 gso/s, 0 ulp | 50.1 gso/s, 0 ulp | 106 gso/s, 0 ulp |
| nk_euclideans_packed_bf16_haswell | 58 gso/s, 0.2 ulp | 65.7 gso/s, 0.3 ulp | 70.7 gso/s, 5.8 ulp |
| nk_euclideans_symmetric_bf16_haswell | 38.6 gso/s, 0.2 ulp | 49.8 gso/s, 0.3 ulp | 109 gso/s, 0.3 ulp |
| nk_angulars_packed_bf16_skylake | 67.8 gso/s, 0 ulp | 87.7 gso/s, 0 ulp | 86.4 gso/s, 0.1 ulp |
| nk_angulars_symmetric_bf16_skylake | 48.8 gso/s, 0 ulp | 58.7 gso/s, 0 ulp | 125 gso/s, 0 ulp |
| nk_euclideans_packed_bf16_skylake | 64 gso/s, 0.2 ulp | 87.4 gso/s, 0.3 ulp | 90.8 gso/s, 5.8 ulp |
| nk_euclideans_symmetric_bf16_skylake | 48.8 gso/s, 0.2 ulp | 58.9 gso/s, 0.3 ulp | 121 gso/s, 0.3 ulp |
| nk_angulars_packed_bf16_genoa | 59.7 gso/s, 0 ulp | 81.9 gso/s, 0 ulp | 87.2 gso/s, 0 ulp |
| nk_angulars_symmetric_bf16_genoa | 54.9 gso/s, 0 ulp | 61.2 gso/s, 0 ulp | 137 gso/s, 0 ulp |
| nk_euclideans_packed_bf16_genoa | 63 gso/s, 0.2 ulp | 79.6 gso/s, 0.3 ulp | 87.3 gso/s, 0.3 ulp |
| nk_euclideans_symmetric_bf16_genoa | 53.4 gso/s, 0.2 ulp | 60.2 gso/s, 0.3 ulp | 130 gso/s, 0.3 ulp |
| nk_angulars_packed_bf16_sapphireamx | 287 gso/s, 0 ulp | 364 gso/s, 0 ulp | 582 gso/s, 0 ulp |
| nk_angulars_symmetric_bf16_sapphireamx | 75.7 gso/s, 0 ulp | 114 gso/s, 0 ulp | 116 gso/s, 0 ulp |
| nk_euclideans_packed_bf16_sapphireamx | 328 gso/s, 0.3 ulp | 573 gso/s, 0.3 ulp | 632 gso/s, 0.3 ulp |
| nk_euclideans_symmetric_bf16_sapphireamx | 76.3 gso/s, 0.3 ulp | 115 gso/s, 0.3 ulp | 123 gso/s, 0.3 ulp |
| f16 | | | |
| nk_angulars_packed_f16_serial | 7.46 gso/s, 0.1 ulp | 7.97 gso/s, 0.1 ulp | 8.12 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f16_serial | 4.04 gso/s, 0.1 ulp | 4.09 gso/s, 0.1 ulp | 8.13 gso/s, 0.1 ulp |
| nk_euclideans_packed_f16_serial | 7.69 gso/s, 0.7 ulp | 7.73 gso/s, 1.1 ulp | 8.34 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_f16_serial | 4.08 gso/s, 0.7 ulp | 4.19 gso/s, 1.1 ulp | 8.23 gso/s, 0.5 ulp |
| nk_angulars_packed_f16_haswell | 62 gso/s, 0.1 ulp | 74.4 gso/s, 0.1 ulp | 70.6 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f16_haswell | 38.3 gso/s, 0.1 ulp | 54.9 gso/s, 0.1 ulp | 121 gso/s, 0.1 ulp |
| nk_euclideans_packed_f16_haswell | 62.9 gso/s, 0.4 ulp | 75.2 gso/s, 0.9 ulp | 75.7 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_f16_haswell | 39.6 gso/s, 0.4 ulp | 54.2 gso/s, 0.9 ulp | 123 gso/s, 0.3 ulp |
| nk_angulars_packed_f16_skylake | 66.6 gso/s, 0.1 ulp | 85.2 gso/s, 0.1 ulp | 88.3 gso/s, 0 ulp |
| nk_angulars_symmetric_f16_skylake | 50.1 gso/s, 0.1 ulp | 57.7 gso/s, 0.1 ulp | 126 gso/s, 0 ulp |
| nk_euclideans_packed_f16_skylake | 69.6 gso/s, 0.4 ulp | 93.3 gso/s, 0.9 ulp | 91 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_f16_skylake | 49.4 gso/s, 0.4 ulp | 59.8 gso/s, 0.9 ulp | 134 gso/s, 0.3 ulp |
| e5m2 | | | |
| nk_angulars_packed_e5m2_serial | 0.587 gso/s, 0 ulp | 0.553 gso/s, 0 ulp | 0.563 gso/s, 0 ulp |
| nk_angulars_symmetric_e5m2_serial | 0.446 gso/s, 0 ulp | 0.427 gso/s, 0 ulp | 0.847 gso/s, 0 ulp |
| nk_euclideans_packed_e5m2_serial | 0.576 gso/s, 0.5 ulp | 0.571 gso/s, 0.5 ulp | 0.557 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_e5m2_serial | 0.424 gso/s, 0.5 ulp | 0.437 gso/s, 0.5 ulp | 0.836 gso/s, 0.2 ulp |
| nk_angulars_packed_e5m2_haswell | 27.4 gso/s, 0 ulp | 30.4 gso/s, 0 ulp | 31 gso/s, 0 ulp |
| nk_angulars_symmetric_e5m2_haswell | 15.3 gso/s, 0 ulp | 15.7 gso/s, 0 ulp | 32.3 gso/s, 0 ulp |
| nk_euclideans_packed_e5m2_haswell | 28 gso/s, 0 ulp | 30.8 gso/s, 0 ulp | 30.6 gso/s, 0 ulp |
| nk_euclideans_symmetric_e5m2_haswell | 15.4 gso/s, 0 ulp | 15.9 gso/s, 0 ulp | 32 gso/s, 0 ulp |
| nk_angulars_packed_e5m2_skylake | 32.9 gso/s, 0 ulp | 36.7 gso/s, 0 ulp | 40.1 gso/s, 0 ulp |
| nk_angulars_symmetric_e5m2_skylake | 19 gso/s, 0 ulp | 21 gso/s, 0 ulp | 42.7 gso/s, 0 ulp |
| nk_euclideans_packed_e5m2_skylake | 34.1 gso/s, 0 ulp | 37.9 gso/s, 0 ulp | 39.6 gso/s, 0 ulp |
| nk_euclideans_symmetric_e5m2_skylake | 20 gso/s, 0 ulp | 18.4 gso/s, 0 ulp | 41.6 gso/s, 0 ulp |
| nk_angulars_packed_e5m2_genoa | 39.6 gso/s, 0 ulp | 46.8 gso/s, 0 ulp | 47.5 gso/s, 0 ulp |
| nk_angulars_symmetric_e5m2_genoa | 30 gso/s, 0 ulp | 32.5 gso/s, 0 ulp | 66.3 gso/s, 0 ulp |
| nk_euclideans_packed_e5m2_genoa | 42.3 gso/s, 0 ulp | 49.1 gso/s, 0 ulp | 51.3 gso/s, 0 ulp |
| nk_euclideans_symmetric_e5m2_genoa | 30.1 gso/s, 0 ulp | 32.8 gso/s, 0 ulp | 64.9 gso/s, 0 ulp |
| nk_angulars_packed_e5m2_sapphireamx | 216 gso/s, 0 ulp | 355 gso/s, 0 ulp | 427 gso/s, 0 ulp |
| nk_angulars_symmetric_e5m2_sapphireamx | 48.7 gso/s, 0 ulp | 73.3 gso/s, 0 ulp | 72.3 gso/s, 0 ulp |
| nk_euclideans_packed_e5m2_sapphireamx | 220 gso/s, 0 ulp | 375 gso/s, 0 ulp | 408 gso/s, 0 ulp |
| nk_euclideans_symmetric_e5m2_sapphireamx | 48.3 gso/s, 0 ulp | 73.3 gso/s, 0 ulp | 74 gso/s, 0 ulp |
| e4m3 | | | |
| nk_angulars_packed_e4m3_serial | 0.479 gso/s, 0 ulp | 0.473 gso/s, 0 ulp | 0.485 gso/s, 0 ulp |
| nk_angulars_symmetric_e4m3_serial | 0.395 gso/s, 0 ulp | 0.390 gso/s, 0 ulp | 0.795 gso/s, 0 ulp |
| nk_euclideans_packed_e4m3_serial | 0.467 gso/s, 0.5 ulp | 0.484 gso/s, 0.5 ulp | 0.480 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_e4m3_serial | 0.395 gso/s, 0.5 ulp | 0.395 gso/s, 0.5 ulp | 0.781 gso/s, 0.3 ulp |
| nk_angulars_packed_e4m3_haswell | 20.6 gso/s, 0 ulp | 22.5 gso/s, 0 ulp | 21.8 gso/s, 0 ulp |
| nk_angulars_symmetric_e4m3_haswell | 12.2 gso/s, 0 ulp | 12.1 gso/s, 0 ulp | 24.7 gso/s, 0 ulp |
| nk_euclideans_packed_e4m3_haswell | 20.7 gso/s, 0 ulp | 22.4 gso/s, 0 ulp | 23.4 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_e4m3_haswell | 11.2 gso/s, 0 ulp | 11.9 gso/s, 0 ulp | 24.6 gso/s, 0.1 ulp |
| nk_angulars_packed_e4m3_skylake | 28.8 gso/s, 0 ulp | 32.8 gso/s, 0 ulp | 31.3 gso/s, 0 ulp |
| nk_angulars_symmetric_e4m3_skylake | 16.4 gso/s, 0 ulp | 17.4 gso/s, 0 ulp | 35.1 gso/s, 0 ulp |
| nk_euclideans_packed_e4m3_skylake | 27.8 gso/s, 0 ulp | 31.2 gso/s, 0 ulp | 31.7 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_e4m3_skylake | 16.1 gso/s, 0 ulp | 16.8 gso/s, 0 ulp | 34.4 gso/s, 0.1 ulp |
| nk_angulars_packed_e4m3_genoa | 40.8 gso/s, 0 ulp | 48.4 gso/s, 0 ulp | 52.1 gso/s, 0 ulp |
| nk_angulars_symmetric_e4m3_genoa | 30.3 gso/s, 0 ulp | 31.5 gso/s, 0 ulp | 69.2 gso/s, 0 ulp |
| nk_euclideans_packed_e4m3_genoa | 43.3 gso/s, 0 ulp | 50.9 gso/s, 0 ulp | 48.8 gso/s, 0.1 ulp |
| nk_euclideans_symmetric_e4m3_genoa | 29.9 gso/s, 0 ulp | 31.9 gso/s, 0 ulp | 64.6 gso/s, 0.1 ulp |
| nk_angulars_packed_e4m3_sapphireamx | 212 gso/s, 0 ulp | 325 gso/s, 0 ulp | 418 gso/s, 0 ulp |
| nk_angulars_symmetric_e4m3_sapphireamx | 50.5 gso/s, 0 ulp | 73.4 gso/s, 0 ulp | 72 gso/s, 0 ulp |
| nk_euclideans_packed_e4m3_sapphireamx | 216 gso/s, 0.1 ulp | 372 gso/s, 0.1 ulp | 394 gso/s, 0.1 ulp |
| nk_euclideans_symmetric_e4m3_sapphireamx | 49.3 gso/s, 0.1 ulp | 70.1 gso/s, 0.1 ulp | 73.1 gso/s, 0.1 ulp |
| e3m2 | | | |
| nk_angulars_packed_e3m2_serial | 0.554 gso/s, 0 ulp | 0.524 gso/s, 0 ulp | 0.534 gso/s, 0 ulp |
| nk_angulars_symmetric_e3m2_serial | 0.439 gso/s, 0 ulp | 0.427 gso/s, 0 ulp | 0.839 gso/s, 0 ulp |
| nk_euclideans_packed_e3m2_serial | 0.556 gso/s, 0.5 ulp | 0.549 gso/s, 0.5 ulp | 0.509 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_e3m2_serial | 0.413 gso/s, 0.5 ulp | 0.427 gso/s, 0.5 ulp | 0.829 gso/s, 0.2 ulp |
| nk_angulars_packed_e3m2_haswell | 30.3 gso/s, 0 ulp | 32.2 gso/s, 0 ulp | 32.8 gso/s, 0 ulp |
| nk_angulars_symmetric_e3m2_haswell | 27.1 gso/s, 0 ulp | 32.8 gso/s, 0 ulp | 65.7 gso/s, 0 ulp |
| nk_euclideans_packed_e3m2_haswell | 30.1 gso/s, 0 ulp | 32.3 gso/s, 0 ulp | 33.5 gso/s, 0 ulp |
| nk_euclideans_symmetric_e3m2_haswell | 28.5 gso/s, 0 ulp | 32.6 gso/s, 0 ulp | 66.1 gso/s, 0 ulp |
| nk_angulars_packed_e3m2_skylake | 37.4 gso/s, 0 ulp | 41.4 gso/s, 0 ulp | 44.1 gso/s, 0 ulp |
| nk_angulars_symmetric_e3m2_skylake | 39 gso/s, 0 ulp | 41.9 gso/s, 0 ulp | 87.3 gso/s, 0 ulp |
| nk_euclideans_packed_e3m2_skylake | 35.7 gso/s, 0 ulp | 41.3 gso/s, 0 ulp | 43 gso/s, 0 ulp |
| nk_euclideans_symmetric_e3m2_skylake | 36.2 gso/s, 0 ulp | 36.4 gso/s, 0 ulp | 87.8 gso/s, 0 ulp |
| nk_angulars_packed_e3m2_genoa | 48 gso/s, 0 ulp | 56 gso/s, 0 ulp | 59.3 gso/s, 0 ulp |
| nk_angulars_symmetric_e3m2_genoa | 40 gso/s, 0 ulp | 40.8 gso/s, 0 ulp | 87.4 gso/s, 0 ulp |
| nk_euclideans_packed_e3m2_genoa | 49.8 gso/s, 0 ulp | 58.4 gso/s, 0 ulp | 61 gso/s, 0 ulp |
| nk_euclideans_symmetric_e3m2_genoa | 38.4 gso/s, 0 ulp | 41.6 gso/s, 0 ulp | 87.7 gso/s, 0 ulp |
| nk_angulars_packed_e3m2_sapphireamx | 238 gso/s, 0 ulp | 420 gso/s, 0 ulp | 431 gso/s, 0 ulp |
| nk_angulars_symmetric_e3m2_sapphireamx | 60.7 gso/s, 0 ulp | 96.5 gso/s, 0 ulp | 90.9 gso/s, 0 ulp |
| nk_euclideans_packed_e3m2_sapphireamx | 224 gso/s, 0 ulp | 426 gso/s, 0 ulp | 443 gso/s, 0 ulp |
| nk_euclideans_symmetric_e3m2_sapphireamx | 60.8 gso/s, 0 ulp | 99.2 gso/s, 0 ulp | 92.6 gso/s, 0 ulp |
| e2m3 | | | |
| nk_angulars_packed_e2m3_serial | 0.332 gso/s, 0 ulp | 0.325 gso/s, 0 ulp | 0.320 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_serial | 0.298 gso/s, 0 ulp | 0.305 gso/s, 0 ulp | 0.568 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_serial | 0.324 gso/s, 0.5 ulp | 0.310 gso/s, 0.5 ulp | 0.313 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_e2m3_serial | 0.293 gso/s, 0.5 ulp | 0.295 gso/s, 0.5 ulp | 0.586 gso/s, 0.2 ulp |
| nk_angulars_packed_e2m3_haswell | 54.2 gso/s, 0 ulp | 61 gso/s, 0 ulp | 66.2 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_haswell | 48.2 gso/s, 0 ulp | 60 gso/s, 0 ulp | 128 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_haswell | 55.9 gso/s, 0 ulp | 63.4 gso/s, 0 ulp | 64.8 gso/s, 0 ulp |
| nk_euclideans_symmetric_e2m3_haswell | 48.6 gso/s, 0 ulp | 62.1 gso/s, 0 ulp | 128 gso/s, 0 ulp |
| nk_angulars_packed_e2m3_skylake | 65.1 gso/s, 0 ulp | 79.4 gso/s, 0 ulp | 85.4 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_skylake | 61.7 gso/s, 0 ulp | 81.1 gso/s, 0 ulp | 163 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_skylake | 65.1 gso/s, 0 ulp | 80.4 gso/s, 0 ulp | 80.8 gso/s, 0 ulp |
| nk_euclideans_symmetric_e2m3_skylake | 60.8 gso/s, 0 ulp | 62.3 gso/s, 0 ulp | 167 gso/s, 0 ulp |
| nk_angulars_packed_e2m3_genoa | 47.7 gso/s, 0 ulp | 55.4 gso/s, 0 ulp | 60 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_genoa | 36.4 gso/s, 0 ulp | 41.5 gso/s, 0 ulp | 86.7 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_genoa | 50 gso/s, 0 ulp | 59.1 gso/s, 0 ulp | 58.3 gso/s, 0 ulp |
| nk_euclideans_symmetric_e2m3_genoa | 38 gso/s, 0 ulp | 42.3 gso/s, 0 ulp | 85.1 gso/s, 0 ulp |
| nk_angulars_packed_e2m3_sapphireamx | 350 gso/s, 0 ulp | 956 gso/s, 0 ulp | 1,020 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_sapphireamx | 88.4 gso/s, 0 ulp | 203 gso/s, 0 ulp | 188 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_sapphireamx | 337 gso/s, 0 ulp | 990 gso/s, 0 ulp | 992 gso/s, 0 ulp |
| nk_euclideans_symmetric_e2m3_sapphireamx | 88.7 gso/s, 0 ulp | 193 gso/s, 0 ulp | 201 gso/s, 0 ulp |
| i8 | | | |
| nk_angulars_packed_i8_serial | 8.84 gso/s, 0 ulp | 9.49 gso/s, 0 ulp | 10.1 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_serial | 4.40 gso/s, 0 ulp | 4.45 gso/s, 0 ulp | 9.58 gso/s, ? ulp |
| nk_euclideans_packed_i8_serial | 8.64 gso/s, 0.4 ulp | 9.84 gso/s, 0.4 ulp | 9.94 gso/s, 0.4 ulp |
| nk_euclideans_symmetric_i8_serial | 4.47 gso/s, 0.4 ulp | 4.64 gso/s, 0.4 ulp | 9.15 gso/s, ? ulp |
| nk_angulars_packed_i8_haswell | 79.5 gso/s, 0 ulp | 102 gso/s, 0 ulp | 109 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_haswell | 60.6 gso/s, 0 ulp | 77.4 gso/s, 0 ulp | 168 gso/s, ? ulp |
| nk_euclideans_packed_i8_haswell | 82.5 gso/s, 0 ulp | 102 gso/s, 0 ulp | 109 gso/s, 0 ulp |
| nk_euclideans_symmetric_i8_haswell | 62 gso/s, 0 ulp | 76.5 gso/s, 0 ulp | 166 gso/s, ? ulp |
| nk_angulars_packed_i8_icelake | 155 gso/s, 0 ulp | 206 gso/s, 0 ulp | 402 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_icelake | 103 gso/s, 0 ulp | 263 gso/s, 0 ulp | 690 gso/s, ? ulp |
| nk_euclideans_packed_i8_icelake | 169 gso/s, 0 ulp | 313 gso/s, 0 ulp | 393 gso/s, 0 ulp |
| nk_euclideans_symmetric_i8_icelake | 108 gso/s, 0 ulp | 268 gso/s, 0 ulp | 695 gso/s, ? ulp |
| nk_angulars_packed_i8_sapphireamx | 427 gso/s, 0 ulp | 1,020 gso/s, 0 ulp | 1,170 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_sapphireamx | 106 gso/s, 0 ulp | 261 gso/s, 0 ulp | 210 gso/s, 0 ulp |
| nk_euclideans_packed_i8_sapphireamx | 428 gso/s, 0 ulp | 1,240 gso/s, 0 ulp | 1,170 gso/s, 0 ulp |
| nk_euclideans_symmetric_i8_sapphireamx | 104 gso/s, 0 ulp | 243 gso/s, 0 ulp | 219 gso/s, 0 ulp |
| u8 | | | |
| nk_angulars_packed_u8_serial | 12.2 gso/s, 0.3 ulp | 12.8 gso/s, 0.3 ulp | 13.0 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_serial | 4.48 gso/s, 0.3 ulp | 4.73 gso/s, 0.3 ulp | 9.50 gso/s, ? ulp |
| nk_euclideans_packed_u8_serial | 12.0 gso/s, 0.5 ulp | 13.1 gso/s, 0.5 ulp | 13.4 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_u8_serial | 4.52 gso/s, 0.5 ulp | 4.69 gso/s, 0.5 ulp | 9.65 gso/s, ? ulp |
| nk_angulars_packed_u8_haswell | 54.6 gso/s, 0.3 ulp | 87.8 gso/s, 0.3 ulp | 104 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_haswell | 44.6 gso/s, 0.3 ulp | 70.2 gso/s, 0.3 ulp | 161 gso/s, ? ulp |
| nk_euclideans_packed_u8_haswell | 55.5 gso/s, 0.5 ulp | 87.7 gso/s, 0.5 ulp | 105 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_u8_haswell | 45.3 gso/s, 0.5 ulp | 68.4 gso/s, 0.5 ulp | 159 gso/s, ? ulp |
| nk_angulars_packed_u8_icelake | 154 gso/s, 0.3 ulp | 301 gso/s, 0.3 ulp | 404 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_icelake | 108 gso/s, 0.3 ulp | 267 gso/s, 0.3 ulp | 699 gso/s, ? ulp |
| nk_euclideans_packed_u8_icelake | 168 gso/s, 0 ulp | 300 gso/s, 0 ulp | 402 gso/s, 0 ulp |
| nk_euclideans_symmetric_u8_icelake | 109 gso/s, 0 ulp | 253 gso/s, 0 ulp | 695 gso/s, ? ulp |
| nk_angulars_packed_u8_sapphireamx | 444 gso/s, 0.2 ulp | 1,210 gso/s, 0.2 ulp | 1,220 gso/s, 0.2 ulp |
| nk_angulars_symmetric_u8_sapphireamx | 103 gso/s, 0.2 ulp | 257 gso/s, 0.2 ulp | 227 gso/s, 0.2 ulp |
| nk_euclideans_packed_u8_sapphireamx | 432 gso/s, 0 ulp | 1,240 gso/s, 0 ulp | 1,200 gso/s, 0 ulp |
| nk_euclideans_symmetric_u8_sapphireamx | 102 gso/s, 0 ulp | 256 gso/s, 0 ulp | 220 gso/s, 0 ulp |
| i4 | | | |
| nk_angulars_packed_i4_serial | 3.79 gso/s, ? ulp | 3.83 gso/s, ? ulp | 4.06 gso/s, ? ulp |
| nk_angulars_symmetric_i4_serial | 3.52 gso/s, ? ulp | 3.58 gso/s, ? ulp | 7.08 gso/s, ? ulp |
| nk_euclideans_packed_i4_serial | 3.69 gso/s, ? ulp | 3.91 gso/s, ? ulp | 3.76 gso/s, ? ulp |
| nk_euclideans_symmetric_i4_serial | 3.45 gso/s, ? ulp | 3.64 gso/s, ? ulp | 6.99 gso/s, ? ulp |
| nk_angulars_packed_i4_icelake | 117 gso/s, ? ulp | 208 gso/s, ? ulp | 249 gso/s, ? ulp |
| nk_angulars_symmetric_i4_icelake | 103 gso/s, ? ulp | 233 gso/s, ? ulp | 561 gso/s, ? ulp |
| nk_euclideans_packed_i4_icelake | 121 gso/s, ? ulp | 173 gso/s, ? ulp | 246 gso/s, ? ulp |
| nk_euclideans_symmetric_i4_icelake | 101 gso/s, ? ulp | 228 gso/s, ? ulp | 572 gso/s, ? ulp |
| u4 | | | |
| nk_angulars_packed_u4_serial | 5.49 gso/s, ? ulp | 5.60 gso/s, ? ulp | 5.78 gso/s, ? ulp |
| nk_angulars_symmetric_u4_serial | 5.18 gso/s, ? ulp | 5.57 gso/s, ? ulp | 11.5 gso/s, ? ulp |
| nk_euclideans_packed_u4_serial | 5.23 gso/s, ? ulp | 5.50 gso/s, ? ulp | 5.64 gso/s, ? ulp |
| nk_euclideans_symmetric_u4_serial | 5.22 gso/s, ? ulp | 5.47 gso/s, ? ulp | 11.1 gso/s, ? ulp |
| nk_angulars_packed_u4_icelake | 153 gso/s, ? ulp | 270 gso/s, ? ulp | 381 gso/s, ? ulp |
| nk_angulars_symmetric_u4_icelake | 122 gso/s, ? ulp | 264 gso/s, ? ulp | 658 gso/s, ? ulp |
| nk_euclideans_packed_u4_icelake | 158 gso/s, ? ulp | 285 gso/s, ? ulp | 385 gso/s, ? ulp |
| nk_euclideans_symmetric_u4_icelake | 120 gso/s, ? ulp | 279 gso/s, ? ulp | 624 gso/s, ? ulp |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256³ | 1024³ | 4096³ |
| --- | --- | --- | --- |
| f64 | | | |
| nk_angulars_packed_f64_serial | 1.38 gso/s, 0 ulp | 1.37 gso/s, 0 ulp | 1.36 gso/s, 0 ulp |
| nk_angulars_symmetric_f64_serial | 0.267 gso/s, 0 ulp | 0.268 gso/s, 0 ulp | 0.258 gso/s, 0 ulp |
| nk_euclideans_packed_f64_serial | 1.41 gso/s, 0.6 ulp | 1.37 gso/s, 0.6 ulp | 1.36 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_f64_serial | 0.272 gso/s, 0.6 ulp | 0.271 gso/s, 0.5 ulp | 0.161 gso/s, 0.5 ulp |
| nk_angulars_packed_f64_v128relaxed | 10.9 gso/s, 0.1 ulp | 10.9 gso/s, 0.1 ulp | 10.9 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f64_v128relaxed | 0.238 gso/s, 0.1 ulp | 0.240 gso/s, 0.1 ulp | 0.271 gso/s, 0.1 ulp |
| nk_euclideans_packed_f64_v128relaxed | 11.0 gso/s, 0.6 ulp | 11.2 gso/s, 0.6 ulp | 11.2 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_f64_v128relaxed | 0.0463 gso/s, 0.6 ulp | 0.0465 gso/s, 0.5 ulp | 0.00806 gso/s, 0.5 ulp |
| f32 | | | |
| nk_angulars_packed_f32_serial | 4.16 gso/s, 0.1 ulp | 4.26 gso/s, 0.1 ulp | 4.39 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f32_serial | 3.08 gso/s, 0.1 ulp | 4.88 gso/s, 0.1 ulp | 5.69 gso/s, 0.1 ulp |
| nk_euclideans_packed_f32_serial | 4.19 gso/s, 0.6 ulp | 4.32 gso/s, 0.6 ulp | 4.33 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_f32_serial | 3.05 gso/s, 0.5 ulp | 4.97 gso/s, 0.5 ulp | 5.64 gso/s, 0.5 ulp |
| nk_angulars_packed_f32_v128relaxed | 9.41 gso/s, 0.1 ulp | 10.6 gso/s, 0.1 ulp | 10.7 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f32_v128relaxed | 3.64 gso/s, 0.1 ulp | 6.14 gso/s, 0.1 ulp | 7.33 gso/s, 0.1 ulp |
| nk_euclideans_packed_f32_v128relaxed | 9.55 gso/s, 0.2 ulp | 10.6 gso/s, 0.2 ulp | 10.6 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f32_v128relaxed | 3.55 gso/s, 0.2 ulp | 6.15 gso/s, 0.2 ulp | 7.27 gso/s, 0.2 ulp |
| bf16 | | | |
| nk_angulars_packed_bf16_serial | 4.10 gso/s, 0 ulp | 4.33 gso/s, 0.2 ulp | 4.45 gso/s, 0.6 ulp |
| nk_angulars_symmetric_bf16_serial | 3.74 gso/s, 0 ulp | 6.15 gso/s, 0.2 ulp | 7.39 gso/s, 0.6 ulp |
| nk_euclideans_packed_bf16_serial | 4.26 gso/s, 0.7 ulp | 4.35 gso/s, 6.1 ulp | 4.40 gso/s, 32 ulp |
| nk_euclideans_symmetric_bf16_serial | 3.80 gso/s, 0.6 ulp | 6.16 gso/s, 5.3 ulp | 7.40 gso/s, 28 ulp |
| nk_angulars_packed_bf16_v128relaxed | 22.0 gso/s, 0 ulp | 24.8 gso/s, 0.2 ulp | 24.7 gso/s, 0.6 ulp |
| nk_angulars_symmetric_bf16_v128relaxed | 4.78 gso/s, 0 ulp | 9.61 gso/s, 0.2 ulp | 12.5 gso/s, 0.6 ulp |
| nk_euclideans_packed_bf16_v128relaxed | 22.2 gso/s, 0.7 ulp | 24.1 gso/s, 6.1 ulp | 24.8 gso/s, 32 ulp |
| nk_euclideans_symmetric_bf16_v128relaxed | 4.72 gso/s, 0.3 ulp | 9.53 gso/s, 5.1 ulp | 12.4 gso/s, 28 ulp |
| e2m3 | | | |
| nk_angulars_packed_e2m3_serial | 2.66 gso/s, 0 ulp | 2.71 gso/s, 0 ulp | 2.63 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_serial | 0.0400 gso/s, 0 ulp | 0.0413 gso/s, 0 ulp | 0.238 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_serial | 2.74 gso/s, 0.5 ulp | 2.70 gso/s, 0.5 ulp | 2.67 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_e2m3_serial | 0.0403 gso/s, 0.5 ulp | 0.0411 gso/s, 0.4 ulp | 0.0401 gso/s, 0.4 ulp |
| nk_angulars_packed_e2m3_v128relaxed | 18.4 gso/s, 0 ulp | 18.6 gso/s, 0 ulp | 18.5 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_v128relaxed | 0.0559 gso/s, 0 ulp | 0.0180 gso/s, 0 ulp | 0.131 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_v128relaxed | 18.5 gso/s, 0 ulp | 18.7 gso/s, 0 ulp | 18.1 gso/s, 0 ulp |
| nk_euclideans_symmetric_e2m3_v128relaxed | 0.206 gso/s, 0 ulp | 0.0170 gso/s, 0 ulp | 0.0554 gso/s, 0 ulp |
| i8 | | | |
| nk_angulars_packed_i8_serial | 4.73 gso/s, 0 ulp | 4.81 gso/s, 0 ulp | 4.59 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_serial | 0.00447 gso/s, 0 ulp | 0.198 gso/s, 0 ulp | 0.190 gso/s, 0 ulp |
| nk_euclideans_packed_i8_serial | 4.77 gso/s, 0.5 ulp | 4.80 gso/s, 0.4 ulp | 4.65 gso/s, 0.4 ulp |
| nk_euclideans_symmetric_i8_serial | 0.201 gso/s, 0.5 ulp | 0.0819 gso/s, 0.4 ulp | 0.0823 gso/s, 0.4 ulp |
| nk_angulars_packed_i8_v128relaxed | 31.6 gso/s, 0 ulp | 31.7 gso/s, 0 ulp | 31.1 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_v128relaxed | 0.0304 gso/s, 0 ulp | 0.0680 gso/s, 0 ulp | 0.298 gso/s, 0 ulp |
| nk_euclideans_packed_i8_v128relaxed | 31.5 gso/s, 0 ulp | 32.3 gso/s, 0 ulp | 30.8 gso/s, 0 ulp |
| nk_euclideans_symmetric_i8_v128relaxed | 0.224 gso/s, 0 ulp | 0.222 gso/s, 0 ulp | 0.143 gso/s, 0 ulp |
| u8 | | | |
| nk_angulars_packed_u8_serial | 4.26 gso/s, 0.4 ulp | 5.07 gso/s, 0.3 ulp | 5.11 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_serial | 2.64 gso/s, 0.4 ulp | 4.02 gso/s, 0.3 ulp | 4.34 gso/s, 0.3 ulp |
| nk_euclideans_packed_u8_serial | 4.35 gso/s, 0.5 ulp | 4.67 gso/s, 0.5 ulp | 5.09 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_u8_serial | 2.64 gso/s, 0.5 ulp | 3.97 gso/s, 0.5 ulp | 4.38 gso/s, 0.5 ulp |
| nk_angulars_packed_u8_v128relaxed | 23.7 gso/s, 0.3 ulp | 25.1 gso/s, 0.3 ulp | 25.8 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_v128relaxed | 19.6 gso/s, 0.3 ulp | 23.2 gso/s, 0.3 ulp | 24.1 gso/s, 0.3 ulp |
| nk_euclideans_packed_u8_v128relaxed | 23.8 gso/s, 0 ulp | 25.3 gso/s, 0 ulp | 25.8 gso/s, 0 ulp |
| nk_euclideans_symmetric_u8_v128relaxed | 19.5 gso/s, 0 ulp | 23.0 gso/s, 0 ulp | 24.6 gso/s, 0 ulp |
| i4 | | | |
| nk_angulars_packed_i4_serial | 6.22 gso/s, 0.35 ulp | 6.41 gso/s, 0.34 ulp | 6.55 gso/s, 0.35 ulp |
| nk_angulars_symmetric_i4_serial | 2.64 gso/s, 0.34 ulp | 3.69 gso/s, 0.34 ulp | 4.18 gso/s, 0.34 ulp |
| nk_euclideans_packed_i4_serial | 6.00 gso/s, 0.49 ulp | 6.43 gso/s, 0.54 ulp | 6.56 gso/s, 0.64 ulp |
| nk_euclideans_symmetric_i4_serial | 2.61 gso/s, 0.48 ulp | 3.68 gso/s, 0.53 ulp | 4.14 gso/s, 0.63 ulp |
| u4 | | | |
| nk_angulars_packed_u4_serial | 5.38 gso/s, 0.35 ulp | 5.60 gso/s, 0.34 ulp | 5.81 gso/s, 0.35 ulp |
| nk_angulars_symmetric_u4_serial | 2.90 gso/s, 0.34 ulp | 4.28 gso/s, 0.34 ulp | 4.90 gso/s, 0.34 ulp |
| nk_euclideans_packed_u4_serial | 5.25 gso/s, 0.49 ulp | 5.64 gso/s, 0.54 ulp | 5.82 gso/s, 0.64 ulp |
| nk_euclideans_symmetric_u4_serial | 2.89 gso/s, 0.48 ulp | 4.30 gso/s, 0.53 ulp | 4.86 gso/s, 0.63 ulp |

Apple M5

Native

| Kernel | 256³ | 1024³ | 4096³ |
| --- | --- | --- | --- |
| f64 | | | |
| nk_angulars_packed_f64_serial | 2.37 gso/s, 0 ulp | 2.35 gso/s, 0 ulp | 2.67 gso/s, 0 ulp |
| nk_angulars_symmetric_f64_serial | 1.36 gso/s, 0.04 ulp | 1.41 gso/s, 0.02 ulp | 1.56 gso/s, 0.01 ulp |
| nk_euclideans_packed_f64_serial | 2.36 gso/s, 0.6 ulp | 2.41 gso/s, 0.6 ulp | 2.67 gso/s, 0.6 ulp |
| nk_euclideans_symmetric_f64_serial | 1.44 gso/s, 0.6 ulp | 1.52 gso/s, 0.6 ulp | 1.56 gso/s, 0.6 ulp |
| nk_angulars_packed_f64_neon | 6.05 gso/s, 7,798 ulp | 6.28 gso/s, 3,868 ulp | 6.34 gso/s, 1,720 ulp |
| nk_angulars_symmetric_f64_neon | 5.29 gso/s, 7,660 ulp | 5.39 gso/s, 3,790 ulp | 5.44 gso/s, 1,720 ulp |
| nk_euclideans_packed_f64_neon | 5.97 gso/s, 0.2 ulp | 5.97 gso/s, 0.2 ulp | 6.37 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f64_neon | 5.25 gso/s, 0.2 ulp | 5.29 gso/s, 0.2 ulp | 5.48 gso/s, 0.2 ulp |
| nk_angulars_packed_f64_smef64 | 40.6 gso/s, 0.02 ulp | 44.7 gso/s, 0.02 ulp | 46.0 gso/s, 0.02 ulp |
| nk_angulars_symmetric_f64_smef64 | 19.9 gso/s, 0.02 ulp | 24.1 gso/s, 0.02 ulp | 20.8 gso/s, 0.02 ulp |
| nk_euclideans_packed_f64_smef64 | 41.0 gso/s, 0.24 ulp | 44.9 gso/s, 0.24 ulp | 46.1 gso/s, 0.24 ulp |
| nk_euclideans_symmetric_f64_smef64 | 20.2 gso/s, 0.28 ulp | 24.1 gso/s, 0.28 ulp | 20.9 gso/s, 0.28 ulp |
| f32 | | | |
| nk_angulars_packed_f32_serial | 10.7 gso/s, 0.1 ulp | 11.7 gso/s, 0.1 ulp | 12.2 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f32_serial | 8.31 gso/s, 0.3 ulp | 8.76 gso/s, 0.3 ulp | 9.69 gso/s, 0.1 ulp |
| nk_euclideans_packed_f32_serial | 10.5 gso/s, 0.6 ulp | 11.4 gso/s, 0.5 ulp | 12.5 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_f32_serial | 8.89 gso/s, 3.9 ulp | 8.91 gso/s, 7.9 ulp | 9.69 gso/s, 3.4 ulp |
| nk_angulars_packed_f32_neon | 37.6 gso/s, 0 ulp | 40.6 gso/s, 0 ulp | 42.2 gso/s, 1,740 ulp |
| nk_angulars_symmetric_f32_neon | 9.73 gso/s, 7,690 ulp | 10.5 gso/s, 3,830 ulp | 10.8 gso/s, 1,730 ulp |
| nk_euclideans_packed_f32_neon | 37.9 gso/s, 0.2 ulp | 39.7 gso/s, 0.2 ulp | 42.0 gso/s, 3.5 ulp |
| nk_euclideans_symmetric_f32_neon | 10.1 gso/s, 3.8 ulp | 10.3 gso/s, 7.8 ulp | 10.9 gso/s, 3.5 ulp |
| nk_angulars_packed_f32_smef64 | 149 gso/s, 0.15 ulp | 230 gso/s, 0.15 ulp | 214 gso/s, 0.15 ulp |
| nk_angulars_symmetric_f32_smef64 | 50.7 gso/s, 0.13 ulp | 85.4 gso/s, 0.13 ulp | 54.0 gso/s, 0.13 ulp |
| nk_euclideans_packed_f32_smef64 | 151 gso/s, 2.2 ulp | 230 gso/s, 2.2 ulp | 213 gso/s, 2.2 ulp |
| nk_euclideans_symmetric_f32_smef64 | 51.7 gso/s, 1.5 ulp | 86.1 gso/s, 1.5 ulp | 54.2 gso/s, 1.5 ulp |
| bf16 | | | |
| nk_angulars_packed_bf16_serial | 18.0 gso/s, 0 ulp | 19.9 gso/s, 0.1 ulp | 20.7 gso/s, 0 ulp |
| nk_angulars_symmetric_bf16_serial | 15.6 gso/s, 0.04 ulp | 16.9 gso/s, 0.1 ulp | 18.6 gso/s, 0.04 ulp |
| nk_euclideans_packed_bf16_serial | 19.1 gso/s, 0.6 ulp | 19.5 gso/s, 3.1 ulp | 21.8 gso/s, 2.1 ulp |
| nk_euclideans_symmetric_bf16_serial | 15.9 gso/s, 0.6 ulp | 16.9 gso/s, 3.1 ulp | 18.6 gso/s, 2.1 ulp |
| nk_angulars_packed_bf16_neonbfdot | 56.0 gso/s, 0 ulp | 57.4 gso/s, 0.1 ulp | 63.2 gso/s, 0.04 ulp |
| nk_angulars_symmetric_bf16_neonbfdot | 37.5 gso/s, 0 ulp | 39.6 gso/s, 0.1 ulp | 43.4 gso/s, 0.04 ulp |
| nk_euclideans_packed_bf16_neonbfdot | 55.7 gso/s, 0.3 ulp | 56.5 gso/s, 2.9 ulp | 62.1 gso/s, 1.9 ulp |
| nk_euclideans_symmetric_bf16_neonbfdot | 39.0 gso/s, 0.3 ulp | 42.1 gso/s, 2.9 ulp | 43.2 gso/s, 1.9 ulp |
| nk_angulars_packed_bf16_sme | 400 gso/s, 0.04 ulp | 821 gso/s, 0.04 ulp | 1,082 gso/s, 0.04 ulp |
| nk_angulars_symmetric_bf16_sme | 218 gso/s, 0.03 ulp | 464 gso/s, 0.03 ulp | 442 gso/s, 0.03 ulp |
| nk_euclideans_packed_bf16_sme | 468 gso/s, 0.54 ulp | 886 gso/s, 0.54 ulp | 1,109 gso/s, 0.54 ulp |
| nk_euclideans_symmetric_bf16_sme | 207 gso/s, 0.28 ulp | 473 gso/s, 0.28 ulp | 445 gso/s, 0.28 ulp |
| f16 | | | |
| nk_angulars_packed_f16_serial | 12.8 gso/s, 0.1 ulp | 14.4 gso/s, 0.1 ulp | 14.9 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f16_serial | 21.7 gso/s, 0.1 ulp | 25.2 gso/s, 0.09 ulp | 28.2 gso/s, 0.1 ulp |
| nk_euclideans_packed_f16_serial | 13.1 gso/s, 1.1 ulp | 13.9 gso/s, 0.7 ulp | 15.7 gso/s, 5.6 ulp |
| nk_euclideans_symmetric_f16_serial | 23.6 gso/s, 1.1 ulp | 25.2 gso/s, 0.7 ulp | 28.4 gso/s, 5.6 ulp |
| nk_angulars_packed_f16_neonhalf | 72.2 gso/s, 0.1 ulp | 78.6 gso/s, 0.1 ulp | 83.8 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f16_neonhalf | 19.3 gso/s, 0.1 ulp | 20.9 gso/s, 0.1 ulp | 21.8 gso/s, 0.1 ulp |
| nk_euclideans_packed_f16_neonhalf | 73.0 gso/s, 0.9 ulp | 76.2 gso/s, 0.7 ulp | 83.7 gso/s, 5.9 ulp |
| nk_euclideans_symmetric_f16_neonhalf | 19.2 gso/s, 0.9 ulp | 20.2 gso/s, 0.6 ulp | 21.9 gso/s, 5.8 ulp |
| nk_angulars_packed_f16_neonfhm | 96.2 gso/s, 0.1 ulp | 107 gso/s, 0.1 ulp | 118 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f16_neonfhm | 35.4 gso/s, 0.1 ulp | 39.1 gso/s, 0.1 ulp | 42.5 gso/s, 0.1 ulp |
| nk_euclideans_packed_f16_neonfhm | 100 gso/s, 0.9 ulp | 110 gso/s, 0.7 ulp | 119 gso/s, 5.9 ulp |
| nk_euclideans_symmetric_f16_neonfhm | 37.2 gso/s, 0.9 ulp | 39.4 gso/s, 0.6 ulp | 42.0 gso/s, 5.8 ulp |
| nk_angulars_packed_f16_sme | 419 gso/s, 0.1 ulp | 839 gso/s, 0.1 ulp | 1,091 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f16_sme | 241 gso/s, 0.1 ulp | 487 gso/s, 0.1 ulp | 450 gso/s, 0.1 ulp |
| nk_euclideans_packed_f16_sme | 491 gso/s, 0 ulp | 906 gso/s, 0.06 ulp | 1,118 gso/s, 2.9 ulp |
| nk_euclideans_symmetric_f16_sme | 227 gso/s, 0.3 ulp | 500 gso/s, 0.6 ulp | 451 gso/s, 0.3 ulp |
| e5m2 | | | |
| nk_angulars_packed_e5m2_serial | 15.8 gso/s, 0 ulp | 16.7 gso/s, 0 ulp | 17.2 gso/s, 0 ulp |
| nk_angulars_symmetric_e5m2_serial | 7.78 gso/s, 0 ulp | 8.37 gso/s, 0 ulp | 8.99 gso/s, 0 ulp |
| nk_euclideans_packed_e5m2_serial | 15.5 gso/s, 0.5 ulp | 16.7 gso/s, 0.5 ulp | 17.2 gso/s, 0.5 ulp |
nk_euclideans_symmetric_e5m2_serial7.93 gso/s, 0.5 ulp8.37 gso/s, 0.5 ulp8.99 gso/s, 0.5 ulp
nk_angulars_packed_e5m2_neonfhm84.3 gso/s, 0 ulp97.3 gso/s, 0 ulp103 gso/s, 0 ulp
nk_angulars_symmetric_e5m2_neonfhm58.8 gso/s, 0 ulp73.2 gso/s, 0 ulp79.3 gso/s, 0 ulp
nk_euclideans_packed_e5m2_neonfhm88.1 gso/s, 0 ulp110 gso/s, 0 ulp119 gso/s, 0 ulp
nk_euclideans_symmetric_e5m2_neonfhm66.1 gso/s, 0 ulp60.3 gso/s, 0 ulp64.4 gso/s, 0 ulp
nk_angulars_packed_e5m2_sme350 gso/s, 0.01 ulp609 gso/s, 0.01 ulp744 gso/s, 0.01 ulp
nk_angulars_symmetric_e5m2_sme138 gso/s, 0.01 ulp204 gso/s, 0.01 ulp226 gso/s, 0.01 ulp
nk_euclideans_packed_e5m2_sme399 gso/s, 0.005 ulp655 gso/s, 0.005 ulp762 gso/s, 0.005 ulp
nk_euclideans_symmetric_e5m2_sme132 gso/s, 0.004 ulp206 gso/s, 0.004 ulp227 gso/s, 0.004 ulp
e4m3░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_e4m3_serial1.15 gso/s, 0 ulp1.20 gso/s, 0 ulp1.24 gso/s, 0 ulp
nk_angulars_symmetric_e4m3_serial1.22 gso/s, 0.03 ulp1.24 gso/s, 0.02 ulp1.32 gso/s, 0.01 ulp
nk_euclideans_packed_e4m3_serial1.23 gso/s, 0.5 ulp1.20 gso/s, 0.5 ulp1.24 gso/s, 0.5 ulp
nk_euclideans_symmetric_e4m3_serial1.25 gso/s, 0.5 ulp1.24 gso/s, 0.5 ulp1.32 gso/s, 0.3 ulp
nk_angulars_packed_e4m3_neonfhm29.1 gso/s, 0 ulp32.2 gso/s, 0 ulp34.1 gso/s, 0 ulp
nk_angulars_symmetric_e4m3_neonfhm32.0 gso/s, 0 ulp36.6 gso/s, 0 ulp38.9 gso/s, 0 ulp
nk_euclideans_packed_e4m3_neonfhm30.0 gso/s, 0 ulp32.2 gso/s, 0 ulp34.1 gso/s, 0.2 ulp
nk_euclideans_symmetric_e4m3_neonfhm34.1 gso/s, 0 ulp36.6 gso/s, 0 ulp38.9 gso/s, 0.2 ulp
nk_angulars_packed_e4m3_sme184 gso/s, 0.01 ulp272 gso/s, 0.01 ulp307 gso/s, 0.01 ulp
nk_angulars_symmetric_e4m3_sme56.4 gso/s, 0.01 ulp74.6 gso/s, 0.01 ulp78.5 gso/s, 0.01 ulp
nk_euclideans_packed_e4m3_sme200 gso/s, 0.11 ulp279 gso/s, 0.11 ulp310 gso/s, 0.11 ulp
nk_euclideans_symmetric_e4m3_sme55.5 gso/s, 0.11 ulp75.1 gso/s, 0.11 ulp78.4 gso/s, 0.11 ulp
e3m2░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_e3m2_serial14.2 gso/s, 0 ulp14.6 gso/s, 0 ulp15.5 gso/s, 0 ulp
nk_angulars_symmetric_e3m2_serial7.77 gso/s, 0 ulp8.10 gso/s, 0 ulp9.05 gso/s, 0 ulp
nk_euclideans_packed_e3m2_serial13.9 gso/s, 0.5 ulp14.6 gso/s, 0.5 ulp15.5 gso/s, 0.5 ulp
nk_euclideans_symmetric_e3m2_serial8.08 gso/s, 0.5 ulp8.10 gso/s, 0.5 ulp9.05 gso/s, 0.5 ulp
nk_angulars_packed_e3m2_sme327 gso/s, 0.01 ulp573 gso/s, 0.01 ulp690 gso/s, 0.01 ulp
nk_angulars_symmetric_e3m2_sme124 gso/s, 0.01 ulp184 gso/s, 0.01 ulp204 gso/s, 0.01 ulp
nk_euclideans_packed_e3m2_sme379 gso/s, 0 ulp604 gso/s, 0 ulp702 gso/s, 0 ulp
nk_euclideans_symmetric_e3m2_sme119 gso/s, 0 ulp186 gso/s, 0 ulp205 gso/s, 0 ulp
e2m3░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_e2m3_serial14.1 gso/s, 0 ulp14.8 gso/s, 0 ulp15.5 gso/s, 0 ulp
nk_angulars_symmetric_e2m3_serial7.89 gso/s, 0 ulp8.21 gso/s, 0 ulp9.09 gso/s, 0 ulp
nk_euclideans_packed_e2m3_serial13.6 gso/s, 0.5 ulp14.8 gso/s, 0.5 ulp15.7 gso/s, 0.5 ulp
nk_euclideans_symmetric_e2m3_serial7.93 gso/s, 0.5 ulp8.21 gso/s, 0.5 ulp9.09 gso/s, 0.5 ulp
nk_angulars_packed_e2m3_sme415 gso/s, 0.01 ulp926 gso/s, 0.01 ulp1,216 gso/s, 0.01 ulp
nk_angulars_symmetric_e2m3_sme170 gso/s, 0.01 ulp342 gso/s, 0.01 ulp404 gso/s, 0.01 ulp
nk_euclideans_packed_e2m3_sme470 gso/s, 0 ulp1,011 gso/s, 0 ulp1,269 gso/s, 0 ulp
nk_euclideans_symmetric_e2m3_sme163 gso/s, 0 ulp348 gso/s, 0 ulp408 gso/s, 0 ulp
i8░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_i8_serial18.3 gso/s, 0 ulp20.0 gso/s, 0 ulp20.2 gso/s, 0 ulp
nk_angulars_symmetric_i8_serial13.5 gso/s, 0 ulp13.9 gso/s, 0 ulp14.8 gso/s, 0 ulp
nk_euclideans_packed_i8_serial18.7 gso/s, 0.4 ulp20.0 gso/s, 0.4 ulp20.2 gso/s, 0.4 ulp
nk_euclideans_symmetric_i8_serial13.7 gso/s, 0.4 ulp13.9 gso/s, 0.4 ulp14.8 gso/s, 0.4 ulp
nk_angulars_packed_i8_neonsdot280 gso/s, 0 ulp357 gso/s, 0 ulp477 gso/s, 0 ulp
nk_angulars_symmetric_i8_neonsdot74.0 gso/s, 0 ulp86.9 gso/s, 0 ulp87.2 gso/s, 0 ulp
nk_euclideans_packed_i8_neonsdot305 gso/s, 0 ulp419 gso/s, 0 ulp477 gso/s, 0 ulp
nk_euclideans_symmetric_i8_neonsdot73.4 gso/s, 0 ulp87.0 gso/s, 0 ulp87.2 gso/s, 0 ulp
nk_angulars_packed_i8_sme492 gso/s, 0.01 ulp1,356 gso/s, 0.01 ulp2,166 gso/s, 0.01 ulp
nk_angulars_symmetric_i8_sme200 gso/s, 0.01 ulp873 gso/s, 0.01 ulp1,214 gso/s, 0.01 ulp
nk_euclideans_packed_i8_sme584 gso/s, 0 ulp1,546 gso/s, 0 ulp2,263 gso/s, 0 ulp
nk_euclideans_symmetric_i8_sme201 gso/s, 0 ulp917 gso/s, 0 ulp1,256 gso/s, 0 ulp
u8░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_u8_serial15.5 gso/s, 0.3 ulp16.3 gso/s, 0.3 ulp17.4 gso/s, 0.3 ulp
nk_angulars_symmetric_u8_serial15.7 gso/s, 0.3 ulp16.2 gso/s, 0.3 ulp17.5 gso/s, 0.3 ulp
nk_euclideans_packed_u8_serial16.4 gso/s, 0.5 ulp16.3 gso/s, 0.5 ulp17.4 gso/s, 0.6 ulp
nk_euclideans_symmetric_u8_serial16.1 gso/s, 0.5 ulp16.2 gso/s, 0.5 ulp17.5 gso/s, 0.6 ulp
nk_angulars_packed_u8_neonsdot284 gso/s, 0.3 ulp369 gso/s, 0.3 ulp470 gso/s, 0.3 ulp
nk_angulars_symmetric_u8_neonsdot72.4 gso/s, 0.3 ulp87.4 gso/s, 0.3 ulp87.7 gso/s, 0.3 ulp
nk_euclideans_packed_u8_neonsdot302 gso/s, 0 ulp419 gso/s, 0 ulp470 gso/s, 0 ulp
nk_euclideans_symmetric_u8_neonsdot72.0 gso/s, 0 ulp87.0 gso/s, 0 ulp87.7 gso/s, 0 ulp
nk_angulars_packed_u8_sme492 gso/s, 0.32 ulp1,369 gso/s, 0.32 ulp2,169 gso/s, 0.32 ulp
nk_angulars_symmetric_u8_sme201 gso/s, 0.32 ulp874 gso/s, 0.32 ulp1,217 gso/s, 0.32 ulp
nk_euclideans_packed_u8_sme584 gso/s, 0 ulp1,545 gso/s, 0 ulp2,260 gso/s, 0 ulp
nk_euclideans_symmetric_u8_sme199 gso/s, 0 ulp905 gso/s, 0 ulp1,248 gso/s, 0 ulp
i4░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_i4_serial17.3 gso/s, 0.3 ulp18.2 gso/s, 0.3 ulp19.6 gso/s, 0.3 ulp
nk_angulars_symmetric_i4_serial14.3 gso/s, 0.3 ulp14.9 gso/s, 0.3 ulp15.6 gso/s, 0.3 ulp
nk_euclideans_packed_i4_serial17.9 gso/s, 0.5 ulp18.2 gso/s, 0.5 ulp19.6 gso/s, 0.6 ulp
nk_euclideans_symmetric_i4_serial14.6 gso/s, 0.5 ulp14.9 gso/s, 0.5 ulp15.6 gso/s, 0.6 ulp
nk_angulars_packed_i4_neonsdot215 gso/s, 0.3 ulp284 gso/s, 0.3 ulp291 gso/s, 0.3 ulp
nk_angulars_symmetric_i4_neonsdot104 gso/s, 0.3 ulp162 gso/s, 0.3 ulp171 gso/s, 0.3 ulp
nk_euclideans_packed_i4_neonsdot225 gso/s, 0 ulp284 gso/s, 0 ulp291 gso/s, 0 ulp
nk_euclideans_symmetric_i4_neonsdot105 gso/s, 0 ulp162 gso/s, 0 ulp171 gso/s, 0 ulp
nk_angulars_packed_i4_sme486 gso/s, 0.32 ulp1,309 gso/s, 0.32 ulp2,041 gso/s, 0.32 ulp
nk_angulars_symmetric_i4_sme201 gso/s, 0.32 ulp913 gso/s, 0.32 ulp1,488 gso/s, 0.32 ulp
nk_euclideans_packed_i4_sme576 gso/s, 0 ulp1,453 gso/s, 0 ulp2,126 gso/s, 0 ulp
nk_euclideans_symmetric_i4_sme200 gso/s, 0 ulp948 gso/s, 0 ulp1,527 gso/s, 0 ulp
u4░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_angulars_packed_u4_serial18.0 gso/s, 0.3 ulp19.4 gso/s, 0.3 ulp20.6 gso/s, 0.3 ulp
nk_angulars_symmetric_u4_serial15.5 gso/s, 0.3 ulp16.4 gso/s, 0.3 ulp17.4 gso/s, 0.3 ulp
nk_euclideans_packed_u4_serial19.1 gso/s, 0.5 ulp19.4 gso/s, 0.5 ulp20.6 gso/s, 0.6 ulp
nk_euclideans_symmetric_u4_serial15.7 gso/s, 0.5 ulp16.4 gso/s, 0.5 ulp17.4 gso/s, 0.6 ulp
nk_angulars_packed_u4_neonsdot241 gso/s, 0.3 ulp319 gso/s, 0.3 ulp340 gso/s, 0.3 ulp
nk_angulars_symmetric_u4_neonsdot107 gso/s, 0.3 ulp166 gso/s, 0.3 ulp173 gso/s, 0.3 ulp
nk_euclideans_packed_u4_neonsdot250 gso/s, 0 ulp340 gso/s, 0 ulp340 gso/s, 0 ulp
nk_euclideans_symmetric_u4_neonsdot105 gso/s, 0 ulp173 gso/s, 0 ulp173 gso/s, 0 ulp
nk_angulars_packed_u4_sme490 gso/s, 0.32 ulp1,322 gso/s, 0.32 ulp2,081 gso/s, 0.32 ulp
nk_angulars_symmetric_u4_sme205 gso/s, 0.32 ulp974 gso/s, 0.32 ulp1,682 gso/s, 0.32 ulp
nk_euclideans_packed_u4_sme582 gso/s, 0 ulp1,487 gso/s, 0 ulp2,162 gso/s, 0 ulp
nk_euclideans_symmetric_u4_sme205 gso/s, 0 ulp1,013 gso/s, 0 ulp1,734 gso/s, 0 ulp

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256³ | 1024³ | 4096³ |
| --- | --- | --- | --- |
| **f64** | | | |
| nk_angulars_packed_f64_serial | 2.33 gso/s, 0 ulp | 2.12 gso/s, 0 ulp | 2.19 gso/s, 0 ulp |
| nk_angulars_symmetric_f64_serial | 1.35 gso/s, 0 ulp | 1.41 gso/s, 0 ulp | 1.57 gso/s, 0 ulp |
| nk_euclideans_packed_f64_serial | 2.35 gso/s, 0.4 ulp | 2.37 gso/s, 0.4 ulp | 2.50 gso/s, 0.4 ulp |
| nk_euclideans_symmetric_f64_serial | 1.40 gso/s, 0.4 ulp | 1.50 gso/s, 0.4 ulp | 1.59 gso/s, 0.4 ulp |
| nk_angulars_packed_f64_v128relaxed | 5.65 gso/s, 0.1 ulp | 5.44 gso/s, 0.1 ulp | 6.22 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f64_v128relaxed | 5.29 gso/s, 0.1 ulp | 5.74 gso/s, 0.1 ulp | 6.01 gso/s, 0.1 ulp |
| nk_euclideans_packed_f64_v128relaxed | 5.69 gso/s, 0.4 ulp | 6.05 gso/s, 0.4 ulp | 6.22 gso/s, 0.4 ulp |
| nk_euclideans_symmetric_f64_v128relaxed | 5.29 gso/s, 0.4 ulp | 5.89 gso/s, 0.4 ulp | 6.02 gso/s, 0.4 ulp |
| **f32** | | | |
| nk_angulars_packed_f32_serial | 10.3 gso/s, 0.1 ulp | 10.1 gso/s, 0.1 ulp | 10.6 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f32_serial | 8.18 gso/s, 0.1 ulp | 8.59 gso/s, 0.1 ulp | 9.54 gso/s, 0.1 ulp |
| nk_euclideans_packed_f32_serial | 10.4 gso/s, 0.3 ulp | 10.4 gso/s, 0.3 ulp | 11.0 gso/s, 0.3 ulp |
| nk_euclideans_symmetric_f32_serial | 8.58 gso/s, 0.3 ulp | 8.67 gso/s, 0.3 ulp | 9.52 gso/s, 0.3 ulp |
| nk_angulars_packed_f32_v128relaxed | 25.2 gso/s, 0.1 ulp | 30.7 gso/s, 0.1 ulp | 32.3 gso/s, 0.1 ulp |
| nk_angulars_symmetric_f32_v128relaxed | 9.91 gso/s, 0.1 ulp | 10.9 gso/s, 0.1 ulp | 11.1 gso/s, 0.1 ulp |
| nk_euclideans_packed_f32_v128relaxed | 26.6 gso/s, 0.2 ulp | 30.7 gso/s, 0.2 ulp | 32.2 gso/s, 0.2 ulp |
| nk_euclideans_symmetric_f32_v128relaxed | 9.98 gso/s, 0.2 ulp | 10.9 gso/s, 0.2 ulp | 11.1 gso/s, 0.2 ulp |
| **bf16** | | | |
| nk_angulars_packed_bf16_serial | 21.7 gso/s, 0.3 ulp | 21.3 gso/s, 0.3 ulp | 24.3 gso/s, 0.3 ulp |
| nk_angulars_symmetric_bf16_serial | 22.2 gso/s, 0.3 ulp | 24.4 gso/s, 0.3 ulp | 27.8 gso/s, 0.3 ulp |
| nk_euclideans_packed_bf16_serial | 19.1 gso/s, 5.3 ulp | 21.4 gso/s, 5.3 ulp | 24.2 gso/s, 5.3 ulp |
| nk_euclideans_symmetric_bf16_serial | 22.0 gso/s, 5.3 ulp | 24.7 gso/s, 5.3 ulp | 27.7 gso/s, 5.3 ulp |
| nk_angulars_packed_bf16_v128relaxed | 70.2 gso/s, 0.3 ulp | 82.3 gso/s, 0.3 ulp | 89.9 gso/s, 0.3 ulp |
| nk_angulars_symmetric_bf16_v128relaxed | 36.9 gso/s, 0.3 ulp | 44.9 gso/s, 0.3 ulp | 47.3 gso/s, 0.3 ulp |
| nk_euclideans_packed_bf16_v128relaxed | 76.4 gso/s, 5.3 ulp | 87.3 gso/s, 5.3 ulp | 89.7 gso/s, 5.3 ulp |
| nk_euclideans_symmetric_bf16_v128relaxed | 37.1 gso/s, 5.3 ulp | 43.1 gso/s, 5.3 ulp | 45.5 gso/s, 5.3 ulp |
| **e2m3** | | | |
| nk_angulars_packed_e2m3_serial | 5.78 gso/s, 0 ulp | 5.93 gso/s, 0 ulp | 6.28 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_serial | 6.52 gso/s, 0 ulp | 8.09 gso/s, 0 ulp | 8.52 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_serial | 5.25 gso/s, 0.3 ulp | 5.94 gso/s, 0.3 ulp | 6.20 gso/s, 0.3 ulp |
| nk_euclideans_symmetric_e2m3_serial | 6.99 gso/s, 0.3 ulp | 8.08 gso/s, 0.3 ulp | 8.52 gso/s, 0.3 ulp |
| nk_angulars_packed_e2m3_v128relaxed | 36.7 gso/s, 0 ulp | 38.8 gso/s, 0 ulp | 39.9 gso/s, 0 ulp |
| nk_angulars_symmetric_e2m3_v128relaxed | 31.4 gso/s, 0 ulp | 37.3 gso/s, 0 ulp | 39.5 gso/s, 0 ulp |
| nk_euclideans_packed_e2m3_v128relaxed | 36.8 gso/s, 0 ulp | 38.9 gso/s, 0 ulp | 40.0 gso/s, 0 ulp |
| nk_euclideans_symmetric_e2m3_v128relaxed | 31.8 gso/s, 0 ulp | 37.5 gso/s, 0 ulp | 39.5 gso/s, 0 ulp |
| **i8** | | | |
| nk_angulars_packed_i8_serial | 13.7 gso/s, 0 ulp | 16.4 gso/s, 0 ulp | 17.3 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_serial | 10.5 gso/s, 0 ulp | 12.8 gso/s, 0 ulp | 13.4 gso/s, 0 ulp |
| nk_euclideans_packed_i8_serial | 14.4 gso/s, 0.5 ulp | 16.5 gso/s, 0.5 ulp | 17.3 gso/s, 0.5 ulp |
| nk_euclideans_symmetric_i8_serial | 11.3 gso/s, 0.5 ulp | 12.6 gso/s, 0.5 ulp | 13.4 gso/s, 0.5 ulp |
| nk_angulars_packed_i8_v128relaxed | 45.2 gso/s, 0 ulp | 50.0 gso/s, 0 ulp | 52.0 gso/s, 0 ulp |
| nk_angulars_symmetric_i8_v128relaxed | 37.7 gso/s, 0 ulp | 47.5 gso/s, 0 ulp | 50.4 gso/s, 0 ulp |
| nk_euclideans_packed_i8_v128relaxed | 45.6 gso/s, 0 ulp | 50.2 gso/s, 0 ulp | 52.0 gso/s, 0 ulp |
| nk_euclideans_symmetric_i8_v128relaxed | 37.4 gso/s, 0 ulp | 46.8 gso/s, 0 ulp | 50.4 gso/s, 0 ulp |
| **u8** | | | |
| nk_angulars_packed_u8_serial | 14.7 gso/s, 0.3 ulp | 17.0 gso/s, 0.3 ulp | 17.8 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_serial | 10.9 gso/s, 0.3 ulp | 13.2 gso/s, 0.3 ulp | 13.9 gso/s, 0.3 ulp |
| nk_euclideans_packed_u8_serial | 14.9 gso/s, 0.4 ulp | 17.0 gso/s, 0.4 ulp | 17.8 gso/s, 0.4 ulp |
| nk_euclideans_symmetric_u8_serial | 11.7 gso/s, 0.4 ulp | 13.1 gso/s, 0.4 ulp | 13.9 gso/s, 0.4 ulp |
| nk_angulars_packed_u8_v128relaxed | 43.7 gso/s, 0.3 ulp | 49.0 gso/s, 0.3 ulp | 50.7 gso/s, 0.3 ulp |
| nk_angulars_symmetric_u8_v128relaxed | 34.7 gso/s, 0.3 ulp | 45.6 gso/s, 0.3 ulp | 48.4 gso/s, 0.3 ulp |
| nk_euclideans_packed_u8_v128relaxed | 44.8 gso/s, 0 ulp | 49.0 gso/s, 0 ulp | 50.7 gso/s, 0 ulp |
| nk_euclideans_symmetric_u8_v128relaxed | 35.0 gso/s, 0 ulp | 44.2 gso/s, 0 ulp | 48.5 gso/s, 0 ulp |