Curved Space Distances in NumKong

April 2, 2026

NumKong implements distance functions for curved metric spaces: bilinear forms compute $a^T C b$ for an arbitrary metric tensor $C$, while the Mahalanobis distance generalizes Euclidean distance to account for correlations between dimensions. Complex bilinear forms extend this to Hermitian inner products. These operations are central to Gaussian process inference, metric learning, and statistical distance measures.

The bilinear form for real vectors is:

$$\text{bilinear}(a, b, C) = a^T C b = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i \cdot c_{ij} \cdot b_j$$

The Mahalanobis distance is:

$$\text{mahalanobis}(a, b, C) = \sqrt{(a - b)^T C (a - b)}$$

For complex vectors, the bilinear form uses the conjugate transpose:

$$\text{bilinear}(a, b, C) = a^H C b = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \bar{a_i} \cdot c_{ij} \cdot b_j$$

Reformulating as Python pseudocode:

```python
import numpy as np

def bilinear(a: np.ndarray, b: np.ndarray, C: np.ndarray) -> float:
    # aᵀ C b for an arbitrary metric tensor C
    return float(a @ C @ b)

def mahalanobis(a: np.ndarray, b: np.ndarray, C: np.ndarray) -> float:
    # √((a − b)ᵀ C (a − b)); C is typically an inverse covariance matrix
    diff = a - b
    return float(np.sqrt(diff @ C @ diff))

def bilinear_complex(a: np.ndarray, b: np.ndarray, C: np.ndarray) -> complex:
    # aᴴ C b: conjugate a before contracting
    return complex(np.conj(a) @ C @ b)
```
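
As a quick sanity check of these definitions, here is a standalone NumPy sketch: with the identity metric, the Mahalanobis distance reduces to plain Euclidean distance.

```python
import numpy as np

def mahalanobis(a, b, C):
    diff = a - b
    return float(np.sqrt(diff @ C @ diff))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# With C = I, the quadratic form collapses to the squared Euclidean norm
d_maha = mahalanobis(a, b, np.eye(3))
d_eucl = float(np.linalg.norm(a - b))
assert abs(d_maha - d_eucl) < 1e-12  # both equal 5.0
```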

Input & Output Types

Real bilinear and Mahalanobis:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f64 | 64-bit IEEE 754 double precision |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f32 | 16-bit IEEE 754 half precision, widened output |
| bf16 | f32 | 16-bit brain float, widened output |

Complex bilinear:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64c | f64c | 64-bit complex pairs |
| f32c | f32c | 32-bit complex pairs |
| f16c | f32c | 16-bit complex pairs, widened output |
| bf16c | f32c | 16-bit brain complex pairs, widened output |

Optimizations

Row-Major Streaming with Nested Dot2

nk_bilinear_f64_skylake and nk_mahalanobis_f64_skylake decompose the bilinear form $a^T C b$ as $\sum_i a_i \cdot \text{dot}(C_i, b)$, where $C_i$ is the $i$-th row of the metric tensor. Each inner dot product uses Dot2 compensation: TwoProd via FMA captures the rounding error of each $c_{ij} \cdot b_j$ product exactly, and a TwoSum chain propagates it through the accumulator. The outer sum over rows uses a second level of compensation, tracking the rounding error of each $a_i \cdot r_i$ accumulation, where $r_i = \text{dot}(C_i, b)$. This nested structure gives cache-friendly sequential access to the $n \times n$ matrix $C$, since each row is read once and discarded. nk_bilinear_f32_neon, nk_bilinear_f32_skylake, nk_mahalanobis_f32_neon, and nk_mahalanobis_f32_skylake use the same row-major streaming pattern but accumulate in f64 instead of Dot2, which provides sufficient precision for f32 inputs.
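
The compensation scheme can be sketched in portable Python. This is an illustrative Dot2 (compensated dot product), using Dekker splitting in place of the hardware FMA the actual kernels rely on:

```python
def two_sum(a: float, b: float):
    # Knuth's TwoSum: s is the rounded sum, err the exact rounding error
    s = a + b
    bp = s - a
    err = (a - (s - bp)) + (b - bp)
    return s, err

def split(a: float):
    # Dekker splitting: a == hi + lo, each half fits in 26 bits of mantissa
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a: float, b: float):
    # TwoProd without FMA: p is the rounded product, err the exact error
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(x, y) -> float:
    # Compensated dot product: accumulate products and their rounding errors
    s, c = 0.0, 0.0
    for xi, yi in zip(x, y):
        p, pe = two_prod(xi, yi)
        s, se = two_sum(s, p)
        c += se + pe
    return s + c

# Ill-conditioned example: naive summation cancels catastrophically
ill_x = [1e16, 1.0, -1e16]
ill_y = [1.0, 1.0, 1.0]
naive = sum(x * y for x, y in zip(ill_x, ill_y))  # rounds to 0.0
assert dot2(ill_x, ill_y) == 1.0                  # exact answer recovered
```

The nested structure in the kernels applies the same pair of primitives a second time, at the level of the outer sum over rows.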

SME Outer-Product Accumulation

nk_bilinear_f32_smef64, nk_bilinear_f64_smef64, nk_bilinear_f32c_smef64, nk_bilinear_f64c_smef64, nk_mahalanobis_f32_smef64, and nk_mahalanobis_f64_smef64 use the Scalable Matrix Extension to compute the bilinear form as an outer-product accumulation. Each FMOPA instruction performs a rank-1 update $a_i \cdot b^T$ into the SME ZA tile array, and the matrix $C$ is streamed row-by-row and multiplied into the accumulator. This differs from the row-major dot approach: it reformulates $a^T C b$ as a matrix-multiply problem, letting SME's 2D tile registers exploit the matrix engine's throughput. For dimensions that align to the tile size, this approach achieves high throughput; dimensions that do not align fall back to NEON for cleanup of the residual elements.
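
The reformulation rests on a simple identity: $a^T C b$ equals the Frobenius inner product of $C$ with the rank-1 outer product $a b^T$. A NumPy sketch of that equivalence (illustrative only; the kernels accumulate this inside SME tile registers):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 8
a, b = rng.standard_normal(n), rng.standard_normal(n)
C = rng.standard_normal((n, n))

direct = float(a @ C @ b)                       # row-major dot formulation
outer = float(np.sum(C * np.outer(a, b)))       # Frobenius product with the rank-1 update
assert np.isclose(direct, outer)
```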

Complex Bilinear Decomposition

nk_bilinear_f32c_neon, nk_bilinear_f32c_skylake, and nk_bilinear_f64c_skylake compute $a^H C b$, where each element involves 4 real multiplications from the complex product $\bar{a_i} \cdot c_{ij} \cdot b_j$. The kernel decomposes this into real and imaginary dot products over rows of $C$: for each row $i$, it computes the real part as $a_{i,re} \cdot \text{dot}(C_i, b)_{re} + a_{i,im} \cdot \text{dot}(C_i, b)_{im}$ and the imaginary part with the conjugation baked in as sign flips. This fuses the conjugation of $a$ into the sign of the cross terms rather than explicitly negating the imaginary components, saving one negate operation per element.
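
A standalone NumPy sketch of that decomposition, checked against the direct complex contraction (note how conjugating $a_i$ flips the usual minus in the real part to a plus):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
C = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

reference = np.conj(a) @ C @ b

# Per-row decomposition with the conjugation folded into the cross-term signs
acc = 0.0 + 0.0j
for i in range(n):
    z = C[i] @ b                                   # dot(C_i, b), complex
    re = a[i].real * z.real + a[i].imag * z.imag   # conjugation flips − to +
    im = a[i].real * z.imag - a[i].imag * z.real
    acc += complex(re, im)

assert np.isclose(acc, reference)
```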

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench, the internal tools included for measuring both accuracy and throughput at different input shapes. The input size is controlled by the NK_CURVED_DIMENSIONS environment variable. The metric tensor is a square matrix of side $N$, so each bilinear form $\mathbf{x}^\top M \mathbf{x}$ has $O(N^2)$ arithmetic complexity. Columns show matrix side length: 256², 1024², 4096².

Throughput is measured in GSO/s, Giga Scalar Operations per Second. Accuracy is reported as mean ULP (units in the last place) averaged over all test pairs, i.e. the average number of representable floating-point values between the computed result and the exact answer. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels. Rows marked 🧩 use external BLAS baselines rather than NumKong kernels.
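
The ULP metric can be made concrete: map each double's bit pattern to an integer key that is monotone in the float ordering, and the ULP distance is the difference of the keys. A minimal sketch (not NumKong's actual harness):

```python
import struct

def ordered_key(x: float) -> int:
    # Reinterpret the double's bits so that integer order matches float order
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = (1 << 64) - 1
    return u ^ (1 << 63) if u < (1 << 63) else ~u & mask

def ulp_distance(a: float, b: float) -> int:
    # Number of representable doubles between a and b
    return abs(ordered_key(a) - ordered_key(b))

assert ulp_distance(1.0, 1.0) == 0
assert ulp_distance(1.0, 1.0 + 2**-52) == 1  # adjacent doubles
```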

Intel Sapphire Rapids

Native

| Kernel | 256² | 1024² | 4096² |
| --- | --- | --- | --- |
| **f64c** | | | |
| bilinear_f64c_with_blas 🧩 | 1.25 gso/s | 1.36 gso/s | 1.38 gso/s |
| nk_bilinear_f64c_serial | 0.0862 gso/s, 0.5 ulp | 0.161 gso/s, 0.2 ulp | 0.171 gso/s, 0.5 ulp |
| nk_bilinear_f64c_skylake | 0.583 gso/s, 3.5 ulp | 0.718 gso/s, 3.5 ulp | 0.765 gso/s, 3.5 ulp |
| **f32c** | | | |
| bilinear_f32c_with_blas 🧩 | 2.14 gso/s | 2.61 gso/s | 2.57 gso/s |
| nk_bilinear_f32c_serial | 0.756 gso/s, 0 ulp | 1.37 gso/s, 0 ulp | 1.37 gso/s, 0 ulp |
| nk_bilinear_f32c_skylake | 1.72 gso/s, 0 ulp | 1.75 gso/s, 0 ulp | 1.46 gso/s, 0 ulp |
| **bf16c** | | | |
| nk_bilinear_bf16c_serial | 0.154 gso/s, 5 ulp | 0.158 gso/s, 5.8 ulp | 0.155 gso/s, 5 ulp |
| nk_bilinear_bf16c_genoa | 2.81 gso/s, 5 ulp | 4.57 gso/s, 5 ulp | 4.47 gso/s, 5 ulp |
| **f16c** | | | |
| nk_bilinear_f16c_serial | 0.585 gso/s, 7.2 ulp | 0.592 gso/s, 7.2 ulp | 0.600 gso/s, 7.2 ulp |
| **f64** | | | |
| bilinear_f64_with_blas 🧩 | 2.84 gso/s | 3.23 gso/s | 3.14 gso/s |
| nk_bilinear_f64_serial | 0.291 gso/s, 0.7 ulp | 0.565 gso/s, 0.4 ulp | 0.577 gso/s, 0.7 ulp |
| nk_mahalanobis_f64_serial | 0.267 gso/s, 0 ulp | 0.537 gso/s, 0 ulp | 0.539 gso/s, 0 ulp |
| nk_bilinear_f64_skylake | 1.79 gso/s, 1.6 ulp | 1.71 gso/s, 1.3 ulp | 1.59 gso/s, 1 ulp |
| nk_mahalanobis_f64_skylake | 1.77 gso/s, 0 ulp | 1.82 gso/s, 0 ulp | 2.12 gso/s, 0.2 ulp |
| **f32** | | | |
| bilinear_f32_with_blas 🧩 | 4.09 gso/s | 5.61 gso/s | 6.59 gso/s |
| nk_bilinear_f32_serial | 1.19 gso/s, 0 ulp | 2.71 gso/s, 0 ulp | 2.68 gso/s, 0 ulp |
| nk_mahalanobis_f32_serial | 2.36 gso/s, 0 ulp | 2.53 gso/s, 0 ulp | 2.40 gso/s, 0 ulp |
| nk_bilinear_f32_haswell | 3.45 gso/s, 0 ulp | 3.66 gso/s, 0 ulp | 3.24 gso/s, 0 ulp |
| nk_mahalanobis_f32_haswell | 3.37 gso/s, 0 ulp | 3.28 gso/s, 0 ulp | 3.30 gso/s, 0 ulp |
| nk_bilinear_f32_skylake | 3.68 gso/s, 0 ulp | 3.08 gso/s, 0 ulp | 2.71 gso/s, 0 ulp |
| nk_mahalanobis_f32_skylake | 3.45 gso/s, 0 ulp | 2.94 gso/s, 0 ulp | 3.32 gso/s, 0 ulp |
| **bf16** | | | |
| nk_bilinear_bf16_serial | 0.321 gso/s, 16 ulp | 0.331 gso/s, 13 ulp | 0.314 gso/s, 12 ulp |
| nk_mahalanobis_bf16_serial | 0.216 gso/s, 2.2 ulp | 0.215 gso/s, 2.1 ulp | 0.211 gso/s, 2.3 ulp |
| nk_bilinear_bf16_haswell | 6.75 gso/s, 11 ulp | 7.04 gso/s, 13 ulp | 6.80 gso/s, 13 ulp |
| nk_mahalanobis_bf16_haswell | 5.93 gso/s, 1 ulp | 5.77 gso/s, 1 ulp | 5.86 gso/s, 1 ulp |
| nk_bilinear_bf16_genoa | 6.22 gso/s, 18 ulp | 10.9 gso/s, 18 ulp | 10.3 gso/s, 18 ulp |
| nk_mahalanobis_bf16_genoa | 7.04 gso/s, 8.55K ulp | 8.76 gso/s, 8.41K ulp | 8.57 gso/s, 8.41K ulp |
| **f16** | | | |
| nk_bilinear_f16_serial | 0.654 gso/s, 23 ulp | 0.652 gso/s, 23 ulp | 0.657 gso/s, 23 ulp |
| nk_mahalanobis_f16_serial | 0.510 gso/s, 2.7 ulp | 0.520 gso/s, 3.2 ulp | 0.500 gso/s, 2.7 ulp |
| nk_bilinear_f16_haswell | 7.36 gso/s, 37 ulp | 7.30 gso/s, 37 ulp | 7.29 gso/s, 37 ulp |
| nk_mahalanobis_f16_haswell | 6.75 gso/s, 1 ulp | 6.24 gso/s, 1 ulp | 6.83 gso/s, 1 ulp |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256² | 1024² | 4096² |
| --- | --- | --- | --- |
| **f64c** | | | |
| nk_bilinear_f64c_serial | 0.21 gso/s, 1.2 ulp | 0.21 gso/s, 1.2 ulp | 0.21 gso/s, 1.2 ulp |
| **f32c** | | | |
| nk_bilinear_f32c_serial | 1.10 gso/s, 0 ulp | 1.07 gso/s, 0 ulp | 1.10 gso/s, 0 ulp |
| **bf16c** | | | |
| nk_bilinear_bf16c_serial | 1.26 gso/s, 9.8 ulp | 1.31 gso/s, 9.8 ulp | 1.27 gso/s, 9.5 ulp |
| **f16c** | | | |
| nk_bilinear_f16c_serial | 0.40 gso/s, 39 ulp | 0.38 gso/s, 39 ulp | 0.40 gso/s, 39 ulp |
| **f64** | | | |
| nk_bilinear_f64_serial | 0.49 gso/s, 0.6 ulp | 0.49 gso/s, 0.6 ulp | 0.48 gso/s, 0.6 ulp |
| **f32** | | | |
| nk_bilinear_f32_serial | 2.54 gso/s, 0 ulp | 2.62 gso/s, 0 ulp | 2.53 gso/s, 0 ulp |
| **bf16** | | | |
| nk_bilinear_bf16_serial | 2.91 gso/s, 27 ulp | 2.90 gso/s, 22 ulp | 2.98 gso/s, 22 ulp |
| **f16** | | | |
| nk_bilinear_f16_serial | 0.76 gso/s, 74 ulp | 0.76 gso/s, 74 ulp | 0.78 gso/s, 74 ulp |

Apple M4

Native

| Kernel | 256² | 1024² | 4096² |
| --- | --- | --- | --- |
| **f64c** | | | |
| nk_bilinear_f64c_serial | 0.368 gso/s, 2.2 ulp | 0.371 gso/s, 2.2 ulp | 0.367 gso/s, 2.2 ulp |
| **f32c** | | | |
| nk_bilinear_f32c_serial | 2.33 gso/s, 0 ulp | 2.27 gso/s, 0 ulp | 2.28 gso/s, 0 ulp |
| nk_bilinear_f32c_neon | 2.11 gso/s, 0 ulp | 1.89 gso/s, 0 ulp | 1.85 gso/s, 0 ulp |
| **bf16c** | | | |
| nk_bilinear_bf16c_serial | 2.83 gso/s, 33.0 ulp | 2.54 gso/s, 34.5 ulp | 2.49 gso/s, 34.5 ulp |
| nk_bilinear_bf16c_neonbfdot | 5.05 gso/s, 17.0 ulp | 4.20 gso/s, 17.0 ulp | 4.04 gso/s, 17.0 ulp |
| **f16c** | | | |
| nk_bilinear_f16c_serial | 2.81 gso/s, 51.8 ulp | 2.54 gso/s, 51.8 ulp | 2.48 gso/s, 51.8 ulp |
| nk_bilinear_f16c_neonhalf | 5.00 gso/s, 17.3 ulp | 4.16 gso/s, 17.3 ulp | 4.00 gso/s, 16.4 ulp |
| **f64** | | | |
| nk_bilinear_f64_serial | 0.717 gso/s, 0.4 ulp | 0.711 gso/s, 0.4 ulp | 0.721 gso/s, 0.4 ulp |
| nk_mahalanobis_f64_serial | 0.664 gso/s, 0.5 ulp | 0.667 gso/s, 0.5 ulp | 0.672 gso/s, 0.5 ulp |
| **f32** | | | |
| nk_bilinear_f32_serial | 3.92 gso/s, 0 ulp | 3.05 gso/s, 0 ulp | 2.87 gso/s, 0 ulp |
| nk_mahalanobis_f32_serial | 3.42 gso/s, 0 ulp | 2.88 gso/s, 0 ulp | 2.74 gso/s, 0 ulp |
| nk_bilinear_f32_neon | 4.90 gso/s, 0 ulp | 3.82 gso/s, 0 ulp | 3.49 gso/s, 0 ulp |
| nk_mahalanobis_f32_neon | 4.68 gso/s, 0 ulp | 3.71 gso/s, 0 ulp | 3.48 gso/s, 0 ulp |
| **bf16** | | | |
| nk_bilinear_bf16_serial | 4.17 gso/s, 20.7 ulp | 3.19 gso/s, 21.2 ulp | 2.94 gso/s, 20.7 ulp |
| nk_mahalanobis_bf16_serial | 3.86 gso/s, 2.1 ulp | 2.98 gso/s, 2.2 ulp | 2.79 gso/s, 2.1 ulp |
| nk_bilinear_bf16_neonbfdot | 28.0 gso/s, 28.0 ulp | 23.5 gso/s, 41.2 ulp | 20.4 gso/s, 41.1 ulp |
| nk_mahalanobis_bf16_neonbfdot | 9.14 gso/s, 2.2 ulp | 7.93 gso/s, 2.2 ulp | 7.43 gso/s, 2.2 ulp |
| **f16** | | | |
| nk_bilinear_f16_serial | ? gso/s, ? ulp | ? gso/s, ? ulp | ? gso/s, ? ulp |
| nk_mahalanobis_f16_serial | ? gso/s, ? ulp | ? gso/s, ? ulp | ? gso/s, ? ulp |
| nk_bilinear_f16_neonhalf | ? gso/s, ? ulp | ? gso/s, ? ulp | ? gso/s, ? ulp |
| nk_mahalanobis_f16_neonhalf | ? gso/s, ? ulp | ? gso/s, ? ulp | ? gso/s, ? ulp |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256² | 1024² | 4096² |
| --- | --- | --- | --- |
| **f64c** | | | |
| nk_bilinear_f64c_serial | 0.445 gso/s, ? ulp | 0.445 gso/s, ? ulp | 0.445 gso/s, ? ulp |
| **f32c** | | | |
| nk_bilinear_f32c_serial | 2.83 gso/s, ? ulp | 2.83 gso/s, ? ulp | 2.84 gso/s, ? ulp |
| **bf16c** | | | |
| nk_bilinear_bf16c_serial | 3.05 gso/s, ? ulp | 3.02 gso/s, ? ulp | 3.03 gso/s, ? ulp |
| **f16c** | | | |
| nk_bilinear_f16c_serial | 0.984 gso/s, ? ulp | 0.992 gso/s, ? ulp | 0.995 gso/s, ? ulp |
| **f64** | | | |
| nk_bilinear_f64_serial | 0.998 gso/s, ? ulp | 0.999 gso/s, ? ulp | 0.999 gso/s, ? ulp |
| **f32** | | | |
| nk_bilinear_f32_serial | 5.00 gso/s, ? ulp | 3.73 gso/s, ? ulp | 3.49 gso/s, ? ulp |
| **bf16** | | | |
| nk_bilinear_bf16_serial | 4.84 gso/s, ? ulp | 3.83 gso/s, ? ulp | 3.60 gso/s, ? ulp |
| **f16** | | | |
| nk_bilinear_f16_serial | 1.90 gso/s, ? ulp | 1.75 gso/s, ? ulp | 1.93 gso/s, ? ulp |