Spatial Distances in NumKong

April 2, 2026

NumKong implements spatial distance functions for dense vectors: squared Euclidean distance, Euclidean distance, and angular (cosine) distance. These metrics are commonly used in nearest-neighbor search, clustering, and dimensionality reduction, and are implemented for every numeric type supported by the library.

Squared Euclidean distance measures the sum of squared element-wise differences:

$$\text{sqeuclidean}(a, b) = \sum_{i=0}^{n-1} (a_i - b_i)^2$$

Euclidean distance is the square root of the squared Euclidean distance:

$$\text{euclidean}(a, b) = \sqrt{\sum_{i=0}^{n-1} (a_i - b_i)^2}$$

Angular distance (cosine distance) measures the angle between two vectors:

$$\text{angular}(a, b) = 1 - \frac{\sum_{i=0}^{n-1} a_i \cdot b_i}{\sqrt{\sum_{i=0}^{n-1} a_i^2} \cdot \sqrt{\sum_{i=0}^{n-1} b_i^2}}$$

Reformulating as Python pseudocode:

import numpy as np

def sqeuclidean(a: np.ndarray, b: np.ndarray) -> float:
    return np.sum((a - b) ** 2)

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return np.sqrt(np.sum((a - b) ** 2))

def angular(a: np.ndarray, b: np.ndarray) -> float:
    ab = np.dot(a, b)
    a2 = np.dot(a, a)
    b2 = np.dot(b, b)
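    if a2 == 0 and b2 == 0: return 0.0  # both vectors are zero: treat them as identical
    if ab == 0: return 1.0  # orthogonal vectors, or exactly one zero vector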
    return 1 - ab / (np.sqrt(a2) * np.sqrt(b2))
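As a quick sanity check of the definitions above, the sketch below re-implements the three functions with only the Python standard library and evaluates them on small vectors. Avoiding NumPy is a choice made here so the example runs anywhere; the function names are illustrative and are not NumKong's API.

```python
import math

def sqeuclidean(a, b):
    # Sum of squared element-wise differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the squared Euclidean distance.
    return math.sqrt(sqeuclidean(a, b))

def angular(a, b):
    ab = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)
    b2 = sum(y * y for y in b)
    if a2 == 0 and b2 == 0:
        return 0.0  # two zero vectors are treated as identical
    if ab == 0:
        return 1.0  # orthogonal vectors, or exactly one zero vector
    return 1.0 - ab / (math.sqrt(a2) * math.sqrt(b2))

print(sqeuclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 9 + 16 + 0 = 25.0
print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))    # sqrt(25) = 5.0
print(angular([1.0, 0.0], [0.0, 1.0]))                # orthogonal: 1.0
print(angular([0.0, 0.0], [0.0, 0.0]))                # both zero: 0.0
```

Note that the angular distance ranges from 0 for parallel vectors to 2 for anti-parallel ones.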

Input & Output Types

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f64 | 64-bit IEEE 754 double precision |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f32 | 16-bit IEEE 754 half precision, widened output |
| bf16 | f32 | 16-bit brain float, widened output |
| e5m2 | f32 | 8-bit Float8: 5 exponent, 2 mantissa bits |
| e4m3 | f32 | 8-bit Float8: 4 exponent, 3 mantissa bits |
| e3m2 | f32 | 8-bit MX format: 3 exponent, 2 mantissa bits |
| e2m3 | f32 | 8-bit MX format: 2 exponent, 3 mantissa bits |
| i8 | f32 | 8-bit signed integers |
| u8 | f32 | 8-bit unsigned integers |
| i4 | f32 | 4-bit signed integers, packed nibble pairs |
| u4 | f32 | 4-bit unsigned integers, packed nibble pairs |
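To illustrate why a widened output type matters for narrow inputs, here is a minimal sketch that accumulates squares of half-precision values in two ways. It assumes Python's `struct` format `"e"` as a stand-in for f16 rounding, and uses Python's native f64 floats as the "wide" accumulator in place of f32. The helper names are hypothetical and unrelated to NumKong's internals.

```python
import struct

def round_to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_squares_narrow(values):
    """Accumulate squares, rounding the running sum to f16 after every step."""
    acc = 0.0
    for v in values:
        acc = round_to_f16(acc + v * v)
    return acc

def sum_squares_wide(values):
    """Accumulate the same squares in a wider type (Python floats are f64)."""
    acc = 0.0
    for v in values:
        acc += v * v
    return acc

ones = [1.0] * 4096
print(sum_squares_narrow(ones))  # 2048.0: the f16 sum stops growing
print(sum_squares_wide(ones))    # 4096.0
```

Half precision has an 11-bit significand, so once the running sum reaches 2048 adding 1 no longer changes it. A wider accumulator avoids this stall.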

Optimizations

Three-Accumulator Angular Pattern

nk_angular_f32_haswell, nk_angular_f32_skylake, nk_angular_f32_neon compute cosine distance as $1 - ab / (\sqrt{a^2} \cdot \sqrt{b^2})$, requiring three concurrent dot products in a single pass: $\sum a_i b_i$, $\sum a_i^2$, and $\sum b_i^2$. All spatial angular kernels interleave these three FMA streams so that each vector element is loaded once and immediately contributes to all three accumulators. This triples register pressure compared to a plain dot product: on Haswell with 16 YMM registers, three independent 4-register accumulator chains leave only 4 registers for temporaries. The single-pass design is essential because reading two vectors of length $n$ once costs $2n$ cache line fetches, while a three-pass approach would cost $6n$.
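The single-pass structure can be sketched in scalar Python as follows. This is only an illustration of the access pattern, with each element read once and feeding three running sums. It does not model SIMD registers or FMA instructions, and the function name is hypothetical.

```python
import math

def angular_single_pass(a, b):
    """Cosine distance using three accumulators updated in one loop."""
    ab = a2 = b2 = 0.0
    for x, y in zip(a, b):  # each element pair is read exactly once
        ab += x * y         # running dot product of a and b
        a2 += x * x         # running squared norm of a
        b2 += y * y         # running squared norm of b
    if a2 == 0.0 and b2 == 0.0:
        return 0.0
    if ab == 0.0:
        return 1.0
    return 1.0 - ab / (math.sqrt(a2) * math.sqrt(b2))

print(angular_single_pass([3.0, 4.0], [3.0, 4.0]))  # identical vectors: 0.0
print(angular_single_pass([1.0, 0.0], [1.0, 1.0]))  # 45 degrees apart
```

A three-pass variant would call a dot-product routine three times and traverse both inputs on each call, which is the extra memory traffic the paragraph above refers to.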

Reciprocal Square Root with Newton-Raphson Refinement

nk_angular_f32_haswell, nk_angular_f64_haswell, nk_angular_f32_neon, nk_angular_f64_neon compute the final normalization via in-hardware reciprocal square root estimates refined by Newton-Raphson iteration. The iteration formula is $x_{n+1} = x_n \cdot (3 - d \cdot x_n^2) / 2$, where $d$ is the value whose reciprocal square root is needed. NEON vrsqrte + vrsqrts performs one refinement step, reaching roughly 22 bits of precision. Haswell VRSQRT14 provides $2^{-14}$ relative error and one Newton-Raphson step doubles the precision to approximately 28 bits. Skylake VRSQRT28 achieves $2^{-28}$ accuracy directly, eliminating the need for a refinement step entirely. This reciprocal square root is needed for both euclidean distance ($\sqrt{d}$ via $d \cdot \text{rsqrt}(d)$) and angular distance ($1/\sqrt{a^2} \cdot 1/\sqrt{b^2}$).
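The effect of one refinement step can be checked numerically. The sketch below applies the iteration formula above to a deliberately perturbed starting value. The perturbation of $2^{-14}$ is an assumption standing in for a hardware estimate, and `refine_rsqrt` is a hypothetical helper, not a NumKong function.

```python
import math

def refine_rsqrt(d: float, estimate: float) -> float:
    """One Newton-Raphson step for 1/sqrt(d): x * (3 - d * x^2) / 2."""
    return estimate * (3.0 - d * estimate * estimate) / 2.0

d = 2.0
exact = 1.0 / math.sqrt(d)
coarse = exact * (1.0 + 2.0 ** -14)  # stand-in for a coarse hardware estimate
refined = refine_rsqrt(d, coarse)

coarse_error = abs(coarse - exact) / exact
refined_error = abs(refined - exact) / exact
print(f"{-math.log2(coarse_error):.1f} bits -> {-math.log2(refined_error):.1f} bits")

# sqrt(d) recovered as d * rsqrt(d), as described for the Euclidean case.
print(d * refined)
```

For a starting relative error $e$, one step leaves an error of about $1.5 e^2$, which is why the number of correct bits roughly doubles.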

Absolute Differences for Integer Types

nk_sqeuclidean_i8_haswell, nk_sqeuclidean_u8_haswell, nk_sqeuclidean_i8_icelake, nk_sqeuclidean_u8_icelake compute squared Euclidean distance by first obtaining element-wise absolute differences, then squaring and accumulating. For signed i8, XOR with 0x80 converts the range from [-128, 127] to unsigned [0, 255], then saturating subtract in both directions followed by OR gives $|a - b|$:

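__m256i bias = _mm256_set1_epi8((char)0x80); // broadcast 0x80 to all 32 bytes
__m256i bias_a = _mm256_xor_si256(a, bias);  // a and b are __m256i holding 32 i8 values
__m256i bias_b = _mm256_xor_si256(b, bias);
__m256i abs_diff = _mm256_or_si256(_mm256_subs_epu8(bias_a, bias_b), _mm256_subs_epu8(bias_b, bias_a));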

For unsigned u8, the same saturating subtract trick works without the XOR bias. The absolute differences are then zero-extended via VPUNPCKLBW/VPUNPCKHBW (1 cycle, cheaper than VPMOVZXBW) and squared and accumulated via VPMADDWD, which computes $d_i^2 + d_{i+1}^2$ in one instruction.
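The bias-and-saturate trick is easy to verify exhaustively in scalar form. The following sketch emulates the byte-level operations on plain Python integers. It mirrors the arithmetic only, not the 32-lane SIMD layout, and both function names are hypothetical.

```python
def abs_diff_i8(a: int, b: int) -> int:
    """|a - b| for signed 8-bit inputs using only unsigned byte operations."""
    bias_a = (a & 0xFF) ^ 0x80  # maps [-128, 127] onto [0, 255], order preserved
    bias_b = (b & 0xFF) ^ 0x80
    a_minus_b = max(bias_a - bias_b, 0)  # saturating unsigned subtract
    b_minus_a = max(bias_b - bias_a, 0)
    return a_minus_b | b_minus_a         # one of the two sides is always zero

def sqeuclidean_i8(a, b):
    """Sum squared differences two at a time, like a pairwise multiply-add."""
    diffs = [abs_diff_i8(x, y) for x, y in zip(a, b)]
    if len(diffs) % 2:
        diffs.append(0)  # pad so elements can be consumed in pairs
    total = 0
    for i in range(0, len(diffs), 2):
        total += diffs[i] * diffs[i] + diffs[i + 1] * diffs[i + 1]
    return total

print(abs_diff_i8(-128, 127))                     # 255
print(sqeuclidean_i8([-128, 0, 5], [127, 0, 2]))  # 255^2 + 0 + 3^2 = 65034
```

The largest possible difference, 255, still fits in an unsigned byte, which is why the absolute difference can be formed before widening to 16 bits.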

Masked Neumaier Compensation on Skylake

nk_sqeuclidean_f64_skylake uses VGETEXP-based Neumaier TwoSum inside AVX-512 masked loops. The mask register tracks which lanes are active, handling tail elements when the vector length is not a multiple of the SIMD width. The compensation term accumulates the low-order rounding errors from each addition, and because the mask propagates through both the main sum and the compensation update, even the final partial iteration maintains full Neumaier accuracy. This avoids the need for a separate scalar tail loop that would otherwise lose the compensated error tracking.
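For readers unfamiliar with Neumaier compensation, here is a minimal scalar sketch of the summation scheme. It deliberately leaves out the SIMD lanes, the mask register, and the VGETEXP-based magnitude comparison mentioned above, using a plain `abs` comparison instead. The function names are illustrative only.

```python
def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for x in values:
        total += x
    return total

def neumaier_sum(values):
    """Compensated summation (Neumaier's variant of Kahan summation)."""
    total = 0.0
    compensation = 0.0
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x  # low-order bits of x were lost
        else:
            compensation += (x - t) + total  # low-order bits of total were lost
        total = t
    return total + compensation

def sqeuclidean_compensated(a, b):
    """Squared Euclidean distance with a compensated accumulator."""
    return neumaier_sum((x - y) ** 2 for x, y in zip(a, b))

values = [1.0, 1e100, 1.0, -1e100]
print(naive_sum(values))     # 0.0: both 1.0 terms are absorbed by 1e100
print(neumaier_sum(values))  # 2.0
```

The compensation term is a second accumulator that has to be updated for every element, so a tail that bypassed it would forfeit the error tracking for those elements.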

Performance

The following performance tables are produced by manually re-running the included internal tools nk_test and nk_bench to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_DENSE_DIMENSIONS environment variable and set to 256, 1024, and 4096 elements. Throughput is measured in GB/s as the number of input bytes processed per second. Accuracy is reported as mean ULP (units in last place) unless noted otherwise, that is, the average number of representable floating-point values between the result and the exact answer. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
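To make the ULP metric concrete, the sketch below counts the representable doubles between two finite values by comparing their bit patterns. This is one common way to compute such a distance and is not necessarily how nk_test does it; the helper names are hypothetical.

```python
import struct

def _ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the float ordering."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative floats have the sign bit set; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(result: float, reference: float) -> int:
    """Number of representable steps between two finite doubles."""
    return abs(_ordered_bits(result) - _ordered_bits(reference))

def mean_ulp(results, references):
    """Average ULP distance over paired results and reference values."""
    distances = [ulp_distance(r, e) for r, e in zip(results, references)]
    return sum(distances) / len(distances)

print(ulp_distance(1.0, 1.0))                         # 0
print(ulp_distance(1.0, 1.0 + 2.0 ** -52))            # 1
print(mean_ulp([1.0, 1.0 + 2.0 ** -52], [1.0, 1.0]))  # 0.5
```

A mean of 0.5 ULP, as in the last line, means results are on average half a representable step away from the reference.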

Intel Sapphire Rapids

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| f64 | | | |
| nk_sqeuclidean_f64_serial | 8.00 gb/s, 0.1 ulp | 8.32 gb/s, 0 ulp | 8.13 gb/s, 0 ulp |
| nk_euclidean_f64_serial | 7.81 gb/s, 0.6 ulp | 7.95 gb/s, 0.5 ulp | 8.34 gb/s, 0.5 ulp |
| nk_angular_f64_serial | 2.80 gb/s, 0 ulp | 3.03 gb/s, 0 ulp | 3.18 gb/s, 0 ulp |
| nk_sqeuclidean_f64_skylake | 32.4 gb/s, 0.4 ulp | 30.6 gb/s, 0.7 ulp | 22.2 gb/s, 1.3 ulp |
| nk_euclidean_f64_skylake | 31.7 gb/s, 0.3 ulp | 29.4 gb/s, 0.4 ulp | 22.9 gb/s, 0.7 ulp |
| nk_angular_f64_skylake | 26.5 gb/s, 0 ulp | 26.8 gb/s, 0 ulp | 17.8 gb/s, 0 ulp |
| f32 | | | |
| nk_sqeuclidean_f32_serial | 4.01 gb/s, 0 ulp | 4.06 gb/s, 0 ulp | 4.19 gb/s, 0 ulp |
| nk_euclidean_f32_serial | 3.99 gb/s, 0.1 ulp | 4.07 gb/s, 0.1 ulp | 4.11 gb/s, 0.1 ulp |
| nk_angular_f32_serial | 1.29 gb/s, 0 ulp | 1.41 gb/s, 0 ulp | 1.53 gb/s, 0 ulp |
| nk_sqeuclidean_f32_skylake | 36.5 gb/s, 0 ulp | 27.0 gb/s, 0 ulp | 23.2 gb/s, 0 ulp |
| nk_euclidean_f32_skylake | 36.4 gb/s, 0.1 ulp | 28.1 gb/s, 0.1 ulp | 26.7 gb/s, 0.1 ulp |
| nk_angular_f32_skylake | 24.3 gb/s, 0 ulp | 23.2 gb/s, 0 ulp | 22.5 gb/s, 0 ulp |
| bf16 | | | |
| nk_sqeuclidean_bf16_serial | 0.582 gb/s, 0 ulp | 0.358 gb/s, 0 ulp | 0.390 gb/s, 0 ulp |
| nk_euclidean_bf16_serial | 0.569 gb/s, 0.5 ulp | 0.373 gb/s, 0.5 ulp | 0.372 gb/s, 0.4 ulp |
| nk_angular_bf16_serial | 0.455 gb/s, 0 ulp | 0.241 gb/s, 0 ulp | 0.259 gb/s, 0 ulp |
| nk_sqeuclidean_bf16_haswell | 27.7 gb/s, 0.5 ulp | 14.0 gb/s, 7.5 ulp | 11.8 gb/s, 27 ulp |
| nk_euclidean_bf16_haswell | 23.3 gb/s, 0.3 ulp | 13.4 gb/s, 4.1 ulp | 12.0 gb/s, 15 ulp |
| nk_angular_bf16_haswell | 20.1 gb/s, 0 ulp | 13.4 gb/s, 0 ulp | 10.6 gb/s, 0.2 ulp |
| nk_sqeuclidean_bf16_genoa | 50.1 gb/s, 0.3 ulp | 21.0 gb/s, 0.5 ulp | 20.5 gb/s, 10 ulp |
| nk_euclidean_bf16_genoa | 48.3 gb/s, 0.2 ulp | 23.1 gb/s, 0.3 ulp | 20.4 gb/s, 5.8 ulp |
| nk_angular_bf16_genoa | 36.4 gb/s, 0 ulp | 22.4 gb/s, 0 ulp | 21.0 gb/s, 0.1 ulp |
| f16 | | | |
| nk_sqeuclidean_f16_serial | 0.950 gb/s, 0.1 ulp | 0.872 gb/s, 0.1 ulp | 0.864 gb/s, 0.1 ulp |
| nk_euclidean_f16_serial | 0.934 gb/s, 0.5 ulp | 0.913 gb/s, 0.5 ulp | 0.906 gb/s, 0.5 ulp |
| nk_angular_f16_serial | 0.881 gb/s, 0 ulp | 0.531 gb/s, 0 ulp | 0.543 gb/s, 0 ulp |
| nk_sqeuclidean_f16_haswell | 29.8 gb/s, 0.4 ulp | 14.8 gb/s, 1.4 ulp | 11.8 gb/s, 5.2 ulp |
| nk_euclidean_f16_haswell | 22.9 gb/s, 0.3 ulp | 12.9 gb/s, 0.8 ulp | 10.6 gb/s, 2.8 ulp |
| nk_angular_f16_haswell | 19.9 gb/s, 0.1 ulp | 17.5 gb/s, 0.1 ulp | 16.1 gb/s, 0.1 ulp |
| e5m2 | | | |
| nk_sqeuclidean_e5m2_serial | 0.955 gb/s, 0 ulp | 1.01 gb/s, 0 ulp | 1.02 gb/s, 0 ulp |
| nk_euclidean_e5m2_serial | 0.954 gb/s, 0.5 ulp | 0.985 gb/s, 0.5 ulp | 1.03 gb/s, 0.5 ulp |
| nk_angular_e5m2_serial | 0.336 gb/s, 0 ulp | 0.385 gb/s, 0 ulp | 0.407 gb/s, 0 ulp |
| nk_sqeuclidean_e5m2_skylake | 4.44 gb/s, 0 ulp | 4.65 gb/s, 0 ulp | 5.80 gb/s, 0 ulp |
| nk_euclidean_e5m2_skylake | 4.34 gb/s, 0 ulp | 4.65 gb/s, 0 ulp | 5.88 gb/s, 0 ulp |
| nk_angular_e5m2_skylake | 3.83 gb/s, 0 ulp | 4.39 gb/s, 0 ulp | 6.10 gb/s, 0 ulp |
| nk_sqeuclidean_e5m2_genoa | 7.12 gb/s, 0 ulp | 8.07 gb/s, 0 ulp | 8.05 gb/s, 0 ulp |
| nk_euclidean_e5m2_genoa | 7.01 gb/s, 0 ulp | 6.97 gb/s, 0 ulp | 8.16 gb/s, 0 ulp |
| nk_angular_e5m2_genoa | 6.33 gb/s, 0 ulp | 6.79 gb/s, 0 ulp | 7.99 gb/s, 0 ulp |
| e4m3 | | | |
| nk_sqeuclidean_e4m3_serial | 0.569 gb/s, 0 ulp | 0.606 gb/s, 0 ulp | 0.609 gb/s, 0 ulp |
| nk_euclidean_e4m3_serial | 0.587 gb/s, 0.5 ulp | 0.602 gb/s, 0.5 ulp | 0.578 gb/s, 0.5 ulp |
| nk_angular_e4m3_serial | 0.326 gb/s, 0 ulp | 0.196 gb/s, 0 ulp | 0.366 gb/s, 0 ulp |
| nk_sqeuclidean_e4m3_skylake | 3.84 gb/s, 0 ulp | 3.62 gb/s, 0 ulp | 3.95 gb/s, 0.2 ulp |
| nk_euclidean_e4m3_skylake | 3.48 gb/s, 0 ulp | 3.69 gb/s, 0 ulp | 3.33 gb/s, 0.2 ulp |
| nk_angular_e4m3_skylake | 4.22 gb/s, 0 ulp | 3.38 gb/s, 0 ulp | 4.54 gb/s, 0 ulp |
| nk_sqeuclidean_e4m3_icelake | 10.2 gb/s, 0 ulp | 12.0 gb/s, 0 ulp | 12.0 gb/s, 0.2 ulp |
| nk_euclidean_e4m3_icelake | 10.3 gb/s, 0 ulp | 11.8 gb/s, 0 ulp | 11.9 gb/s, 0.2 ulp |
| nk_angular_e4m3_icelake | 8.78 gb/s, 0 ulp | 11.3 gb/s, 0 ulp | 11.9 gb/s, 0 ulp |
| e3m2 | | | |
| nk_sqeuclidean_e3m2_serial | 1.01 gb/s, 0 ulp | 0.971 gb/s, 0 ulp | 1.03 gb/s, 0 ulp |
| nk_euclidean_e3m2_serial | 0.997 gb/s, 0.5 ulp | 0.990 gb/s, 0.5 ulp | 0.999 gb/s, 0.4 ulp |
| nk_angular_e3m2_serial | 0.332 gb/s, 0 ulp | 0.361 gb/s, 0 ulp | 0.437 gb/s, 0 ulp |
| nk_sqeuclidean_e3m2_skylake | 4.47 gb/s, 0 ulp | 5.46 gb/s, 0 ulp | 5.04 gb/s, 0 ulp |
| nk_euclidean_e3m2_skylake | 4.34 gb/s, 0 ulp | 6.20 gb/s, 0 ulp | 5.10 gb/s, 0 ulp |
| nk_angular_e3m2_skylake | 3.79 gb/s, 0 ulp | 4.41 gb/s, 0 ulp | 4.82 gb/s, 0 ulp |
| nk_sqeuclidean_e3m2_genoa | 8.79 gb/s, 0 ulp | 9.52 gb/s, 0 ulp | 10.6 gb/s, 0 ulp |
| nk_euclidean_e3m2_genoa | 8.68 gb/s, 0 ulp | 9.01 gb/s, 0 ulp | 12.8 gb/s, 0 ulp |
| nk_angular_e3m2_genoa | 6.89 gb/s, 0 ulp | 9.30 gb/s, 0 ulp | 10.3 gb/s, 0 ulp |
| nk_sqeuclidean_e3m2_icelake | 21.2 gb/s, 0 ulp | 22.1 gb/s, 0 ulp | 21.9 gb/s, 0 ulp |
| nk_euclidean_e3m2_icelake | 21.2 gb/s, 0 ulp | 22.9 gb/s, 0 ulp | 21.2 gb/s, 0 ulp |
| nk_angular_e3m2_icelake | 14.1 gb/s, 0 ulp | 18.0 gb/s, 0 ulp | 17.6 gb/s, 0 ulp |
| e2m3 | | | |
| nk_sqeuclidean_e2m3_serial | 0.964 gb/s, 0 ulp | 0.981 gb/s, 0 ulp | 1.03 gb/s, 0 ulp |
| nk_euclidean_e2m3_serial | 0.979 gb/s, 0.5 ulp | 0.966 gb/s, 0.5 ulp | 1.02 gb/s, 0.5 ulp |
| nk_angular_e2m3_serial | 0.347 gb/s, 0 ulp | 0.389 gb/s, 0 ulp | 0.418 gb/s, 0 ulp |
| nk_sqeuclidean_e2m3_skylake | 4.58 gb/s, 0 ulp | 4.65 gb/s, 0 ulp | 5.08 gb/s, 0 ulp |
| nk_euclidean_e2m3_skylake | 4.48 gb/s, 0 ulp | 4.39 gb/s, 0 ulp | 4.96 gb/s, 0 ulp |
| nk_angular_e2m3_skylake | 3.94 gb/s, 0 ulp | 4.25 gb/s, 0 ulp | 4.90 gb/s, 0 ulp |
| nk_sqeuclidean_e2m3_genoa | 9.62 gb/s, 0 ulp | 10.9 gb/s, 0 ulp | 10.8 gb/s, 0 ulp |
| nk_euclidean_e2m3_genoa | 8.45 gb/s, 0 ulp | 9.80 gb/s, 0 ulp | 10.3 gb/s, 0 ulp |
| nk_angular_e2m3_genoa | 7.21 gb/s, 0 ulp | 10.1 gb/s, 0 ulp | 10.4 gb/s, 0 ulp |
| nk_sqeuclidean_e2m3_icelake | 50.7 gb/s, 0 ulp | 42.6 gb/s, 0 ulp | 31.0 gb/s, 0 ulp |
| nk_euclidean_e2m3_icelake | 50.2 gb/s, 0 ulp | 44.3 gb/s, 0 ulp | 31.0 gb/s, 0 ulp |
| nk_angular_e2m3_icelake | 27.2 gb/s, 0 ulp | 34.9 gb/s, 0 ulp | 30.5 gb/s, 0 ulp |
| i8 | | | |
| nk_sqeuclidean_i8_serial | 34.0 gb/s | 18.4 gb/s | 16.5 gb/s |
| nk_euclidean_i8_serial | 29.0 gb/s, 0.4 ulp | 18.0 gb/s, 0.4 ulp | 15.6 gb/s, 0.4 ulp |
| nk_angular_i8_serial | 7.88 gb/s, 0 ulp | 6.31 gb/s, 0 ulp | 6.12 gb/s, 0 ulp |
| nk_sqeuclidean_i8_haswell | 38.4 gb/s | 17.9 gb/s | 18.4 gb/s |
| nk_euclidean_i8_haswell | 35.6 gb/s, 0 ulp | 17.0 gb/s, 0 ulp | 15.5 gb/s, 0 ulp |
| nk_angular_i8_haswell | 20.3 gb/s, 0.1 ulp | 12.9 gb/s, 0 ulp | 11.9 gb/s, 0 ulp |
| nk_sqeuclidean_i8_icelake | 60.2 gb/s | 24.5 gb/s | 23.5 gb/s |
| nk_euclidean_i8_icelake | 59.0 gb/s, 0 ulp | 23.0 gb/s, 0 ulp | 22.3 gb/s, 0 ulp |
| nk_angular_i8_icelake | 25.2 gb/s, 0.1 ulp | 18.4 gb/s, 0 ulp | 20.5 gb/s, 0 ulp |
| nk_sqeuclidean_i8_alder | 33.4 gb/s | 17.4 gb/s | 17.6 gb/s |
| nk_euclidean_i8_alder | 31.9 gb/s, 0 ulp | 19.1 gb/s, 0 ulp | 17.8 gb/s, 0 ulp |
| nk_angular_i8_alder | 26.2 gb/s, 0.1 ulp | 17.1 gb/s, 0 ulp | 17.8 gb/s, 0 ulp |
| u8 | | | |
| nk_sqeuclidean_u8_serial | 11.7 gb/s | 8.77 gb/s | 7.07 gb/s |
| nk_euclidean_u8_serial | 11.6 gb/s, 0.5 ulp | 8.31 gb/s, 0.5 ulp | 8.36 gb/s, 0.6 ulp |
| nk_angular_u8_serial | 7.95 gb/s, 0.4 ulp | 6.68 gb/s, 0.4 ulp | 5.88 gb/s, 0.4 ulp |
| nk_sqeuclidean_u8_haswell | 45.4 gb/s | 17.7 gb/s | 18.5 gb/s |
| nk_euclidean_u8_haswell | 38.9 gb/s, 0 ulp | 18.8 gb/s, 0 ulp | 19.3 gb/s, 0 ulp |
| nk_angular_u8_haswell | 21.9 gb/s, 0.7 ulp | 11.7 gb/s, 0.6 ulp | 13.4 gb/s, 0.5 ulp |
| nk_sqeuclidean_u8_icelake | 70.1 gb/s | 28.8 gb/s | 21.0 gb/s |
| nk_euclidean_u8_icelake | 66.4 gb/s, 0 ulp | 27.6 gb/s, 0 ulp | 23.5 gb/s, 0 ulp |
| nk_angular_u8_icelake | 28.9 gb/s, 0.7 ulp | 21.2 gb/s, 0.6 ulp | 21.5 gb/s, 0.5 ulp |
| nk_sqeuclidean_u8_alder | 32.2 gb/s | 17.5 gb/s | 19.0 gb/s |
| nk_euclidean_u8_alder | 31.3 gb/s, 0 ulp | 17.0 gb/s, 0 ulp | 19.6 gb/s, 0 ulp |
| nk_angular_u8_alder | 26.5 gb/s, 0.7 ulp | 17.1 gb/s, 0.6 ulp | 17.5 gb/s, 0.5 ulp |
| i4 | | | |
| nk_sqeuclidean_i4_serial | 15.4 gb/s | 16.5 gb/s | 15.6 gb/s |
| nk_euclidean_i4_serial | 12.2 gb/s, 0.5 ulp | 15.6 gb/s, 0.5 ulp | 15.2 gb/s, 0.6 ulp |
| nk_angular_i4_serial | 5.60 gb/s, 0.4 ulp | 6.42 gb/s, 0.4 ulp | 6.69 gb/s, 0.4 ulp |
| nk_sqeuclidean_i4_icelake | 23.6 gb/s | 51.5 gb/s | 29.3 gb/s |
| nk_euclidean_i4_icelake | 20.6 gb/s, 0 ulp | 45.2 gb/s, 0 ulp | 28.9 gb/s, 0 ulp |
| nk_angular_i4_icelake | 5.14 gb/s, 0.7 ulp | 18.0 gb/s, 0.6 ulp | 17.6 gb/s, 0.5 ulp |
| u4 | | | |
| nk_sqeuclidean_u4_serial | 15.6 gb/s | 17.3 gb/s | 15.8 gb/s |
| nk_euclidean_u4_serial | 12.0 gb/s, 0.5 ulp | 15.9 gb/s, 0.5 ulp | 15.3 gb/s, 0.6 ulp |
| nk_angular_u4_serial | 5.20 gb/s, 0.4 ulp | 6.63 gb/s, 0.4 ulp | 7.01 gb/s, 0.4 ulp |
| nk_sqeuclidean_u4_icelake | 22.7 gb/s | 23.7 gb/s | 24.5 gb/s |
| nk_euclidean_u4_icelake | 20.9 gb/s, 0 ulp | 18.8 gb/s, 0 ulp | 24.1 gb/s, 0 ulp |
| nk_angular_u4_icelake | 9.32 gb/s, 0.7 ulp | 27.4 gb/s, 0.6 ulp | 24.2 gb/s, 0.5 ulp |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| f64 | | | |
| nk_sqeuclidean_f64_serial | 2.97 gb/s, 0.1 ulp | 3.16 gb/s, 0 ulp | 0.02 gb/s, 0 ulp |
| nk_euclidean_f64_serial | 0.104 gb/s, 0.6 ulp | 1.06 gb/s, 0.6 ulp | 0.33 gb/s, 0.5 ulp |
| nk_angular_f64_serial | 1.91 gb/s, 0.1 ulp | 1.93 gb/s, 0 ulp | 0.18 gb/s, 0 ulp |
| nk_sqeuclidean_f64_v128relaxed | 1.23 gb/s, 1.3 ulp | 1.87 gb/s, 2.5 ulp | 0.15 gb/s, 5.0 ulp |
| nk_euclidean_f64_v128relaxed | 0.315 gb/s, 0.7 ulp | 2.21 gb/s, 1.4 ulp | 0.03 gb/s, 2.8 ulp |
| nk_angular_f64_v128relaxed | 1.14 gb/s, 0.1 ulp | 0.928 gb/s, 0.1 ulp | 0.26 gb/s, 0.1 ulp |
| f32 | | | |
| nk_sqeuclidean_f32_serial | 0.657 gb/s, 0 ulp | 0.928 gb/s, 0 ulp | 0.06 gb/s, 0 ulp |
| nk_euclidean_f32_serial | 0.757 gb/s, 0.1 ulp | 0.914 gb/s, 0.1 ulp | 0.05 gb/s, 0.1 ulp |
| nk_angular_f32_serial | 0.882 gb/s, 0 ulp | 0.902 gb/s, 0 ulp | 0.26 gb/s, 0 ulp |
| nk_sqeuclidean_f32_v128relaxed | 2.87 gb/s, 0.7 ulp | 3.03 gb/s, 1.3 ulp | 1.77 gb/s, 2.6 ulp |
| nk_euclidean_f32_v128relaxed | 1.83 gb/s, 0.4 ulp | 3.00 gb/s, 0.7 ulp | 0.22 gb/s, 1.4 ulp |
| nk_angular_f32_v128relaxed | 3.37 gb/s, 0 ulp | 0.991 gb/s, 0 ulp | 0.19 gb/s, 0 ulp |
| bf16 | | | |
| nk_sqeuclidean_bf16_serial | 1.89 gb/s, 0 ulp | 1.09 gb/s, 0 ulp | 0.31 gb/s, 0 ulp |
| nk_euclidean_bf16_serial | 2.02 gb/s, 0.6 ulp | 2.13 gb/s, 0.5 ulp | 0.29 gb/s, 0.5 ulp |
| nk_angular_bf16_serial | 0.399 gb/s, 0 ulp | 0.308 gb/s, 0 ulp | 0.11 gb/s, 0 ulp |
| nk_sqeuclidean_bf16_v128relaxed | 2.10 gb/s, 0.9 ulp | 1.94 gb/s, 12.6 ulp | 0.17 gb/s, 20.8 ulp |
| nk_euclidean_bf16_v128relaxed | 2.08 gb/s, 0.5 ulp | 2.22 gb/s, 7.0 ulp | 0.13 gb/s, 11.4 ulp |
| nk_angular_bf16_v128relaxed | 1.08 gb/s, 0 ulp | 2.09 gb/s, 0.2 ulp | 0.20 gb/s, 0.6 ulp |
| f16 | | | |
| nk_sqeuclidean_f16_serial | 1.10 gb/s, 0.1 ulp | 1.13 gb/s, 0.1 ulp | 0.20 gb/s, 0.1 ulp |
| nk_euclidean_f16_serial | 1.17 gb/s, 0.6 ulp | 1.16 gb/s, 0.6 ulp | 0.26 gb/s, 0.5 ulp |
| nk_angular_f16_serial | 0.363 gb/s, 0 ulp | 0.372 gb/s, 0 ulp | 0.06 gb/s, 0 ulp |
| nk_sqeuclidean_f16_v128relaxed | 1.12 gb/s, 0.9 ulp | 0.633 gb/s, 3.6 ulp | 0.03 gb/s, 9.7 ulp |
| nk_euclidean_f16_v128relaxed | 0.806 gb/s, 0.5 ulp | 0.991 gb/s, 2.0 ulp | 0.09 gb/s, 5.4 ulp |
| nk_angular_f16_v128relaxed | 1.79 gb/s, 0.1 ulp | 0.976 gb/s, 0.1 ulp | 0.00 gb/s, 0.1 ulp |
| e5m2 | | | |
| nk_sqeuclidean_e5m2_serial | 0.713 gb/s, 0 ulp | 0.689 gb/s, 0 ulp | 0.16 gb/s, 0 ulp |
| nk_euclidean_e5m2_serial | 0.637 gb/s, 0.5 ulp | 0.736 gb/s, 0.5 ulp | 0.12 gb/s, 0.5 ulp |
| nk_angular_e5m2_serial | 0.169 gb/s, 0 ulp | 0.162 gb/s, 0 ulp | 0.17 gb/s, 0 ulp |
| e4m3 | | | |
| nk_sqeuclidean_e4m3_serial | 0.374 gb/s, 0 ulp | 0.383 gb/s, 0 ulp | 0.09 gb/s, 0 ulp |
| nk_euclidean_e4m3_serial | 0.374 gb/s, 0.5 ulp | 0.360 gb/s, 0.5 ulp | 0.09 gb/s, 0.5 ulp |
| nk_angular_e4m3_serial | 0.162 gb/s, 0 ulp | 0.166 gb/s, 0 ulp | 0.17 gb/s, 0 ulp |
| e3m2 | | | |
| nk_sqeuclidean_e3m2_serial | 0.712 gb/s, 0 ulp | 0.744 gb/s, 0 ulp | 0.17 gb/s, 0 ulp |
| nk_euclidean_e3m2_serial | 0.709 gb/s, 0.5 ulp | 0.759 gb/s, 0.5 ulp | 0.17 gb/s, 0.5 ulp |
| nk_angular_e3m2_serial | 0.152 gb/s, 0 ulp | 0.165 gb/s, 0 ulp | 0.17 gb/s, 0 ulp |
| e2m3 | | | |
| nk_sqeuclidean_e2m3_serial | 0.702 gb/s, 0 ulp | 0.760 gb/s, 0 ulp | 0.13 gb/s, 0 ulp |
| nk_euclidean_e2m3_serial | 0.650 gb/s, 0.5 ulp | 0.753 gb/s, 0.5 ulp | 0.15 gb/s, 0.5 ulp |
| nk_angular_e2m3_serial | 0.158 gb/s, 0 ulp | 0.168 gb/s, 0 ulp | 0.17 gb/s, 0 ulp |
| i8 | | | |
| nk_sqeuclidean_i8_serial | 0.327 gb/s | 0.328 gb/s | 0.09 gb/s |
| nk_euclidean_i8_serial | 2.93 gb/s, 0.5 ulp | 0.174 gb/s, 0.4 ulp | 0.14 gb/s, 0.4 ulp |
| nk_angular_i8_serial | 1.23 gb/s, 0 ulp | 0.946 gb/s, 0 ulp | 0.10 gb/s, 0 ulp |
| nk_sqeuclidean_i8_v128relaxed | 1.84 gb/s | 0.736 gb/s | 0.08 gb/s |
| nk_euclidean_i8_v128relaxed | 1.36 gb/s, 0 ulp | 0.805 gb/s, 0 ulp | 0.21 gb/s, 0 ulp |
| nk_angular_i8_v128relaxed | 1.80 gb/s, 0 ulp | 2.79 gb/s, 0 ulp | 0.14 gb/s, 0 ulp |
| u8 | | | |
| nk_sqeuclidean_u8_serial | 0.528 gb/s | 0.496 gb/s | 0.30 gb/s |
| nk_euclidean_u8_serial | 0.00982 gb/s, 0.5 ulp | 0.311 gb/s, 0.5 ulp | 0.04 gb/s, 0.6 ulp |
| nk_angular_u8_serial | 0.813 gb/s, 0.5 ulp | 1.46 gb/s, 0.4 ulp | 0.29 gb/s, 0.5 ulp |
| nk_sqeuclidean_u8_v128relaxed | 3.05 gb/s | 1.68 gb/s | 0.28 gb/s |
| nk_euclidean_u8_v128relaxed | 2.52 gb/s, 0 ulp | 1.70 gb/s, 0 ulp | 0.09 gb/s, 0 ulp |
| nk_angular_u8_v128relaxed | 2.47 gb/s, 526M ulp | 1.91 gb/s, 501M ulp | 0.09 gb/s, 443M ulp |
| i4 | | | |
| nk_sqeuclidean_i4_serial | 1.91 gb/s | 1.94 gb/s | 0.30 gb/s |
| nk_euclidean_i4_serial | 1.76 gb/s, 0.5 ulp | 1.90 gb/s, 0.5 ulp | 0.02 gb/s, 0.0 ulp |
| nk_angular_i4_serial | 1.28 gb/s, 0.5 ulp | 1.34 gb/s, 0.5 ulp | 0.10 gb/s, 0.5 ulp |
| u4 | | | |
| nk_sqeuclidean_u4_serial | 2.91 gb/s | 3.00 gb/s | 0.09 gb/s |
| nk_euclidean_u4_serial | 2.78 gb/s, 0.5 ulp | 3.01 gb/s, 0.5 ulp | 0.10 gb/s, 0.0 ulp |
| nk_angular_u4_serial | 1.84 gb/s, 0.5 ulp | 2.03 gb/s, 0.5 ulp | 0.21 gb/s, 0.5 ulp |

Apple M5

Native

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| f64 | | | |
| nk_sqeuclidean_f64_serial | 12.4 gb/s, 0.1 ulp | 12.8 gb/s, 0 ulp | 12.8 gb/s, 0 ulp |
| nk_euclidean_f64_serial | 12.7 gb/s, 0.6 ulp | 12.9 gb/s, 0.5 ulp | 12.6 gb/s, 0.5 ulp |
| nk_angular_f64_serial | 8.42 gb/s, 0 ulp | 8.57 gb/s, 0 ulp | 8.30 gb/s, 0 ulp |
| nk_sqeuclidean_f64_neon | 50.6 gb/s, 1.3 ulp | 40.0 gb/s, 2.6 ulp | 36.1 gb/s, 5.1 ulp |
| nk_euclidean_f64_neon | 48.4 gb/s, 0.7 ulp | 38.7 gb/s, 1.4 ulp | 35.1 gb/s, 2.8 ulp |
| nk_angular_f64_neon | 33.3 gb/s, 0.1 ulp | 33.4 gb/s, 0 ulp | 32.4 gb/s, 0 ulp |
| f32 | | | |
| nk_sqeuclidean_f32_serial | 6.32 gb/s, 0 ulp | 6.25 gb/s, 0 ulp | 6.30 gb/s, 0 ulp |
| nk_euclidean_f32_serial | 6.31 gb/s, 0.1 ulp | 6.37 gb/s, 0.1 ulp | 6.41 gb/s, 0.1 ulp |
| nk_angular_f32_serial | 4.03 gb/s, 0 ulp | 4.06 gb/s, 0 ulp | 4.07 gb/s, 0 ulp |
| nk_sqeuclidean_f32_neon | 25.3 gb/s, 0.1 ulp | 19.1 gb/s, 0 ulp | 17.5 gb/s, 0 ulp |
| nk_euclidean_f32_neon | 25.0 gb/s, 0.1 ulp | 20.8 gb/s, 0.1 ulp | 18.6 gb/s, 0.1 ulp |
| nk_angular_f32_neon | 22.2 gb/s, 0 ulp | 17.3 gb/s, 0 ulp | 16.6 gb/s, 0 ulp |
| bf16 | | | |
| nk_sqeuclidean_bf16_serial | 3.19 gb/s, 0 ulp | 3.16 gb/s, 0 ulp | 3.14 gb/s, 0 ulp |
| nk_euclidean_bf16_serial | 3.16 gb/s, 0.5 ulp | 3.08 gb/s, 0.5 ulp | 3.11 gb/s, 0.5 ulp |
| nk_angular_bf16_serial | 1.88 gb/s, 0 ulp | 1.91 gb/s, 0 ulp | 1.93 gb/s, 0 ulp |
| nk_sqeuclidean_bf16_neonbfdot | 35.0 gb/s, 0.9 ulp | 22.7 gb/s, 13 ulp | 18.8 gb/s, 21 ulp |
| nk_euclidean_bf16_neonbfdot | 33.4 gb/s, 0.5 ulp | 23.0 gb/s, 7.0 ulp | 18.6 gb/s, 12 ulp |
| nk_angular_bf16_neonbfdot | 23.8 gb/s, 0 ulp | 32.7 gb/s, 0.1 ulp | 35.9 gb/s, 0 ulp |
| f16 | | | |
| nk_sqeuclidean_f16_serial | 3.09 gb/s, 0.1 ulp | 3.16 gb/s, 0.1 ulp | 3.10 gb/s, 0.1 ulp |
| nk_euclidean_f16_serial | 3.13 gb/s, 0.6 ulp | 3.14 gb/s, 0.5 ulp | 3.11 gb/s, 0.5 ulp |
| nk_angular_f16_serial | 1.84 gb/s, 0 ulp | 1.92 gb/s, 0 ulp | 1.88 gb/s, 0 ulp |
| nk_sqeuclidean_f16_neonhalf | 34.7 gb/s, 0.9 ulp | 21.5 gb/s, 3.6 ulp | 18.3 gb/s, 9.7 ulp |
| nk_euclidean_f16_neonhalf | 32.7 gb/s, 0.5 ulp | 21.7 gb/s, 2.0 ulp | 18.4 gb/s, 5.3 ulp |
| nk_angular_f16_neonhalf | 25.2 gb/s, 0.1 ulp | 19.6 gb/s, 0.1 ulp | 17.3 gb/s, 0.1 ulp |
| e5m2 | | | |
| nk_sqeuclidean_e5m2_serial | 2.09 gb/s, 0 ulp | 2.08 gb/s, 0 ulp | 2.10 gb/s, 0 ulp |
| nk_euclidean_e5m2_serial | 2.06 gb/s, 0.5 ulp | 2.10 gb/s, 0.5 ulp | 2.05 gb/s, 0.5 ulp |
| nk_angular_e5m2_serial | 0.921 gb/s, 0 ulp | 0.956 gb/s, 0 ulp | 0.938 gb/s, 0 ulp |
| nk_sqeuclidean_e5m2_neon | 18.2 gb/s, 0 ulp | 12.8 gb/s, 0 ulp | 9.84 gb/s, 0 ulp |
| nk_euclidean_e5m2_neon | 18.0 gb/s, 0.5 ulp | 11.8 gb/s, 0.5 ulp | 9.33 gb/s, 0.5 ulp |
| nk_angular_e5m2_neon | 13.7 gb/s, 0 ulp | 10.9 gb/s, 0 ulp | 9.83 gb/s, 0 ulp |
| e4m3 | | | |
| nk_sqeuclidean_e4m3_serial | 1.07 gb/s, 0 ulp | 1.12 gb/s, 0 ulp | 1.11 gb/s, 0 ulp |
| nk_euclidean_e4m3_serial | 1.01 gb/s, 0.5 ulp | 1.12 gb/s, 0.5 ulp | 1.09 gb/s, 0.5 ulp |
| nk_angular_e4m3_serial | 0.711 gb/s, 0 ulp | 0.732 gb/s, 0 ulp | 0.729 gb/s, 0 ulp |
| nk_sqeuclidean_e4m3_neon | 4.29 gb/s, 0.2 ulp | 4.36 gb/s, 0.2 ulp | 4.33 gb/s, 0.2 ulp |
| nk_euclidean_e4m3_neon | 4.20 gb/s, 0.5 ulp | 4.11 gb/s, 0.5 ulp | 4.17 gb/s, 0.5 ulp |
| nk_angular_e4m3_neon | 4.13 gb/s, 0 ulp | 4.21 gb/s, 0 ulp | 4.16 gb/s, 0 ulp |
| e3m2 | | | |
| nk_sqeuclidean_e3m2_serial | 1.95 gb/s, 0 ulp | 2.15 gb/s, 0 ulp | 2.08 gb/s, 0 ulp |
| nk_euclidean_e3m2_serial | 1.97 gb/s, 0.5 ulp | 2.18 gb/s, 0.5 ulp | 2.09 gb/s, 0.5 ulp |
| nk_angular_e3m2_serial | 0.900 gb/s, 0 ulp | 0.985 gb/s, 0 ulp | 0.943 gb/s, 0 ulp |
| nk_sqeuclidean_e3m2_neon | 4.73 gb/s, 0 ulp | 5.19 gb/s, 0 ulp | 5.03 gb/s, 0 ulp |
| nk_euclidean_e3m2_neon | 4.78 gb/s, 0 ulp | 5.23 gb/s, 0 ulp | 5.05 gb/s, 0 ulp |
| nk_angular_e3m2_neon | 4.24 gb/s, 0 ulp | 4.85 gb/s, 0 ulp | 4.73 gb/s, 0 ulp |
| e2m3 | | | |
| nk_sqeuclidean_e2m3_serial | 1.98 gb/s, 0 ulp | 2.20 gb/s, 0 ulp | 2.11 gb/s, 0 ulp |
| nk_euclidean_e2m3_serial | 1.91 gb/s, 0.5 ulp | 2.16 gb/s, 0.5 ulp | 2.09 gb/s, 0.4 ulp |
| nk_angular_e2m3_serial | 0.885 gb/s, 0 ulp | 0.985 gb/s, 0 ulp | 0.953 gb/s, 0 ulp |
| nk_sqeuclidean_e2m3_neon | 4.67 gb/s, 0 ulp | 5.06 gb/s, 0 ulp | 5.07 gb/s, 0 ulp |
| nk_euclidean_e2m3_neon | 4.84 gb/s, 0 ulp | 5.17 gb/s, 0 ulp | 4.98 gb/s, 0 ulp |
| nk_angular_e2m3_neon | 4.45 gb/s, 0 ulp | 4.88 gb/s, 0 ulp | 4.73 gb/s, 0 ulp |
| i8 | | | |
| nk_sqeuclidean_i8_serial | 61.6 gb/s | 75.2 gb/s | 64.9 gb/s |
| nk_euclidean_i8_serial | 43.5 gb/s | 54.0 gb/s | 64.3 gb/s |
| nk_angular_i8_serial | 55.8 gb/s | 63.7 gb/s | 49.6 gb/s |
| nk_sqeuclidean_i8_neonsdot | 89.1 gb/s | 85.9 gb/s | 58.8 gb/s |
| nk_euclidean_i8_neonsdot | 86.9 gb/s | 78.9 gb/s | 57.4 gb/s |
| nk_angular_i8_neonsdot | 66.5 gb/s | 68.9 gb/s | 50.9 gb/s |
| u8 | | | |
| nk_sqeuclidean_u8_serial | 62.9 gb/s | 77.2 gb/s | 66.5 gb/s |
| nk_euclidean_u8_serial | 45.7 gb/s | 52.3 gb/s | 61.4 gb/s |
| nk_angular_u8_serial | 17.8 gb/s | 18.5 gb/s | 16.0 gb/s |
| nk_sqeuclidean_u8_neonsdot | 91.7 gb/s | 83.1 gb/s | 56.6 gb/s |
| nk_euclidean_u8_neonsdot | 87.9 gb/s | 79.3 gb/s | 56.5 gb/s |
| nk_angular_u8_neonsdot | 68.0 gb/s | 64.8 gb/s | 49.5 gb/s |
| i4 | | | |
| nk_sqeuclidean_i4_serial | 22.9 gb/s | 25.2 gb/s | 25.1 gb/s |
| nk_euclidean_i4_serial | 20.1 gb/s | 23.6 gb/s | 24.2 gb/s |
| nk_angular_i4_serial | 9.11 gb/s | 10.4 gb/s | 10.4 gb/s |
| u4 | | | |
| nk_sqeuclidean_u4_serial | 26.7 gb/s | 26.8 gb/s | 22.4 gb/s |
| nk_euclidean_u4_serial | 20.9 gb/s | 22.5 gb/s | 21.2 gb/s |
| nk_angular_u4_serial | 9.00 gb/s | 9.57 gb/s | 9.62 gb/s |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256 | 1024 | 4096 |
| --- | --- | --- | --- |
| f64 | | | |
| nk_sqeuclidean_f64_serial | 20.3 gb/s, 0.1 ulp | 19.3 gb/s, 0 ulp | 20.3 gb/s, 0 ulp |
| nk_euclidean_f64_serial | 20.0 gb/s, 0.6 ulp | 19.4 gb/s, 0.6 ulp | 20.3 gb/s, 0.5 ulp |
| nk_angular_f64_serial | 9.28 gb/s, 0 ulp | 8.83 gb/s, 0 ulp | 9.29 gb/s, 0 ulp |
| nk_sqeuclidean_f64_v128relaxed | 48.2 gb/s, 1.3 ulp | 35.7 gb/s, 2.6 ulp | 37.0 gb/s, 5.0 ulp |
| nk_euclidean_f64_v128relaxed | 50.1 gb/s, 0.7 ulp | 36.5 gb/s, 1.4 ulp | 36.7 gb/s, 2.8 ulp |
| nk_angular_f64_v128relaxed | 31.5 gb/s, 0.1 ulp | 22.3 gb/s, 0.1 ulp | 22.4 gb/s, 0.1 ulp |
| f32 | | | |
| nk_sqeuclidean_f32_serial | 8.90 gb/s, 0 ulp | 8.54 gb/s, 0 ulp | 8.86 gb/s, 0 ulp |
| nk_euclidean_f32_serial | 8.86 gb/s, 0.1 ulp | 8.58 gb/s, 0.1 ulp | 8.86 gb/s, 0.1 ulp |
| nk_angular_f32_serial | 4.33 gb/s, 0 ulp | 4.17 gb/s, 0 ulp | 4.30 gb/s, 0 ulp |
| nk_sqeuclidean_f32_v128relaxed | 20.4 gb/s, 0.7 ulp | 17.9 gb/s, 1.3 ulp | 18.3 gb/s, 2.6 ulp |
| nk_euclidean_f32_v128relaxed | 20.3 gb/s, 0.4 ulp | 18.0 gb/s, 0.7 ulp | 18.3 gb/s, 1.4 ulp |
| nk_angular_f32_v128relaxed | 19.7 gb/s, 0 ulp | 17.8 gb/s, 0 ulp | 18.5 gb/s, 0 ulp |
| bf16 | | | |
| nk_sqeuclidean_bf16_serial | 5.16 gb/s, 0 ulp | 4.88 gb/s, 0 ulp | 5.10 gb/s, 0 ulp |
| nk_euclidean_bf16_serial | 5.12 gb/s, 0.6 ulp | 4.89 gb/s, 0.5 ulp | 5.10 gb/s, 0.5 ulp |
| nk_angular_bf16_serial | 2.24 gb/s, 0 ulp | 2.15 gb/s, 0 ulp | 2.24 gb/s, 0 ulp |
| nk_sqeuclidean_bf16_v128relaxed | 39.9 gb/s, 0.9 ulp | 27.0 gb/s, 13 ulp | 20.3 gb/s, 21 ulp |
| nk_euclidean_bf16_v128relaxed | 38.6 gb/s, 0.5 ulp | 27.1 gb/s, 7.0 ulp | 21.2 gb/s, 12 ulp |
| nk_angular_bf16_v128relaxed | 27.9 gb/s, 0 ulp | 22.6 gb/s, 0.2 ulp | 20.5 gb/s, 0.6 ulp |
| f16 | | | |
| nk_sqeuclidean_f16_serial | 3.22 gb/s, 0.1 ulp | 3.06 gb/s, 0.1 ulp | 3.09 gb/s, 0.1 ulp |
| nk_euclidean_f16_serial | 3.19 gb/s, 0.6 ulp | 2.92 gb/s, 0.5 ulp | 3.26 gb/s, 0.5 ulp |
| nk_angular_f16_serial | 2.33 gb/s, 0 ulp | 2.21 gb/s, 0 ulp | 2.32 gb/s, 0 ulp |
| nk_sqeuclidean_f16_v128relaxed | 11.2 gb/s, 0.9 ulp | 11.0 gb/s, 3.6 ulp | 11.8 gb/s, 9.6 ulp |
| nk_euclidean_f16_v128relaxed | 11.3 gb/s, 0.5 ulp | 11.0 gb/s, 2.0 ulp | 11.8 gb/s, 5.3 ulp |
| nk_angular_f16_v128relaxed | 9.41 gb/s, 0.1 ulp | 9.56 gb/s, 0.1 ulp | 10.4 gb/s, 0.1 ulp |
| i8 | | | |
| nk_sqeuclidean_i8_serial | 14.9 gb/s | 14.7 gb/s | 16.5 gb/s |
| nk_euclidean_i8_serial | 14.7 gb/s, 0.5 ulp | 14.8 gb/s, 0.4 ulp | 16.3 gb/s, 0.4 ulp |
| nk_angular_i8_serial | 8.06 gb/s, 0 ulp | 8.42 gb/s, 0 ulp | 10.7 gb/s, 0 ulp |
| nk_sqeuclidean_i8_v128relaxed | 30.7 gb/s | 22.9 gb/s | 18.0 gb/s |
| nk_euclidean_i8_v128relaxed | 27.4 gb/s, 0 ulp | 22.6 gb/s, 0 ulp | 17.9 gb/s, 0 ulp |
| nk_angular_i8_v128relaxed | 17.2 gb/s, 0 ulp | 18.1 gb/s, 0 ulp | 19.7 gb/s, 0 ulp |
| u8 | | | |
| nk_sqeuclidean_u8_serial | 14.8 gb/s | 14.5 gb/s | 16.3 gb/s |
| nk_euclidean_u8_serial | 14.5 gb/s, 0.5 ulp | 14.5 gb/s, 0.5 ulp | 16.0 gb/s, 0.6 ulp |
| nk_angular_u8_serial | 7.86 gb/s, 0.5 ulp | 8.25 gb/s, 0.5 ulp | 10.7 gb/s, 0.4 ulp |
| nk_sqeuclidean_u8_v128relaxed | 33.2 gb/s | 24.6 gb/s | 18.3 gb/s |
| nk_euclidean_u8_v128relaxed | 28.6 gb/s, 0 ulp | 23.7 gb/s, 0 ulp | 18.2 gb/s, 0 ulp |
| nk_angular_u8_v128relaxed | 14.1 gb/s, 0 ulp | 15.0 gb/s, 0 ulp | 16.1 gb/s, 0 ulp |