Point Cloud Alignment in NumKong

April 16, 2026 · View on GitHub

NumKong implements three algorithms for 3D point cloud comparison and alignment, used in structural biology (protein alignment), robotics (point cloud registration), and computer graphics (mesh registration).

RMSD measures raw point-pair deviation without centering or alignment:

RMSD=1naibi2\text{RMSD} = \sqrt{\frac{1}{n}\sum \|a_i - b_i\|^2}

Kabsch finds the optimal rotation RR that minimizes RMSD after centering both clouds at their centroids aˉ\bar{a}, bˉ\bar{b}, recovering RR from the SVD of the cross-covariance matrix HH:

H=(aiaˉ)(bibˉ)T=UΣVT,R=VUTH = \sum (a_i - \bar{a})(b_i - \bar{b})^T = U \Sigma V^T, \quad R = V U^T

Umeyama extends Kabsch with a uniform scale factor ss derived from the singular values and source variance σa2\sigma_a^2:

s=tr(Σ)nσa2,RMSD=1nsR(aiaˉ)(bibˉ)2s = \frac{\text{tr}(\Sigma)}{n \cdot \sigma_a^2}, \quad \text{RMSD} = \sqrt{\frac{1}{n}\sum \|s \cdot R(a_i - \bar{a}) - (b_i - \bar{b})\|^2}

Reformulating as Python pseudocode:

import numpy as np

def kabsch(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_c, b_c = a - a.mean(0), b - b.mean(0)
    H = a_c.T @ b_c
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    return R

def umeyama(a: np.ndarray, b: np.ndarray) -> tuple:
    a_c, b_c = a - a.mean(0), b - b.mean(0)
    H = a_c.T @ b_c
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    scale = S.sum() / (len(a) * np.var(a_c))
    return R, scale

def rmsd(a: np.ndarray, b: np.ndarray) -> float:
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

Input & Output Types

Input TypeOutput TypeDescription
f64f6464-bit IEEE 754 double precision
f32f3232-bit IEEE 754 single precision
f16f3216-bit IEEE 754 half precision, widened output
bf16`f32$16-\text{bit} \text{brain} \text{float}, \text{widened} \text{output}

\text{Optimizations}

\text{McAdams} \text{Branching}-\text{Free} 3 \times 3 \text{SVD}

$nk_kabsch_f32_serial, nk_kabsch_f64_haswell, nk_umeyama_f32_neonuse a Jacobi eigenanalysis with fixed 16 iterations (no convergence check) for deterministic behavior. Quaternion-accumulated rotations: each Jacobi sweep updates a 4-element quaternion instead of recomputing eigenvectors. Approximate Givens angles viank_approximate_givens_quaternion_` — a γ-threshold test selects between computed angles and precomputed cos(π/8), sin(π/8) constants. Cyclic permutation of matrix elements avoids explicit sorting of eigenvalues.

Stride-3 Deinterleaving

Point clouds are stored interleaved as [x₀,y₀,z₀, x₁,y₁,z₁, ...]. NEON uses vld3q_f32 to hardware-deinterleave 4 XYZ triplets in one instruction — no gather needed. Haswell uses _mm256_i32gather_ps with indices [0,3,6,9,12,15,18,21] to load 8 x-coordinates from 8 points. RVV uses indexed loads with dynamic stride to adapt to variable vector length.

Reflection Correction

nk_kabsch_f32_haswell, nk_kabsch_f64_skylake check for improper rotations after computing R=VUTR = V U^T from the SVD of the cross-covariance matrix H=UΣVTH = U \Sigma V^T. If det(R)=1\det(R) = -1 (a reflection rather than a rotation), the last column of VV is negated before recomputing RR. This ensures the output is always a proper rotation matrix with det(R)=+1\det(R) = +1.

Pre-Scaled Rotation for Umeyama

nk_umeyama_f32_haswell, nk_umeyama_f64_skylake fold the computed scale factor ss into the rotation matrix before applying to points. The Umeyama transform is bi=sRai+tb_i = s R a_i + t; by precomputing R=sRR' = s R once, the per-point operation reduces to bi=Rai+tb_i = R' a_i + t, avoiding a per-point scalar multiply.

Why SME and SVE Were Removed

Historical note: experimental SME variants of RMSD, Kabsch, and Umeyama were implemented in 1,052 lines across sme.h and smef64.h (commit 0e0bc30c) and removed 4 days later (commit f55e9a71). The fundamental mismatch: the algorithm computes a 3×3 cross-covariance matrix H=(aiaˉ)(bibˉ)TH = \sum (a_i - \bar{a})(b_i - \bar{b})^T — a sum of outer products of 3D vectors. SME's FMOPA operates on SVL-wide vectors (16+ elements at SVL=512), but the outer products here are 3×3 — the tile is 99.6% wasted (9 useful cells out of 256). Three approaches were explored in a design document (sme_design.h, 398 lines): (1) batched outer products — reformulates as 9 independent dot products but loses SME's outer-product strength, falling back to what NEON already does; (2) streaming SVE with svld3 — hardware stride-3 deinterleaving processes 16 points per iteration vs NEON's 4, but SMSTART/SMSTOP mode transitions cost ~100 cycles and the 3×3 SVD step cannot use streaming mode at all; (3) SME for SVD — the 3×3 matrix cannot fill even one 16×16 tile. Performance estimates from the design document: NEON baseline ~2.25N cycles for N points; streaming SVE ~1.2N cycles but with ~100-cycle mode transition overhead — for typical protein alignment workloads (N = 100–500 atoms), the overhead dominates. Experimental SVE mesh kernels (sve.h, svehalf.h, 112 lines total) were removed in the same commit — variable vector length added complexity without clear benefit over fixed-width NEON for the 3D point cloud problem.

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_MESH_POINTS environment variable and set to 256, 1024, and 4096 points. Each alignment computes centroids, covariance, and a 3×3 SVD over NN point pairs, so cost is O(N)O(N) per alignment with a large constant. The throughput is measured in mp/s as millions of 3D points aligned per second. Accuracy is reported as mean ULP (units in last place) unless noted otherwise — the average number of representable floating-point values between the result and the exact answer. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.

Intel Granite Rapids

Xeon 6776P, 2.3 GHz base, cpu_scaling_enabled=false. Serial kernels compiled with -fno-tree-vectorize.

Native

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f64_serial93.7 mp/s, 0.5 ulp87.4 mp/s, 0.5 ulp69.8 mp/s, 0.5 ulp
nk_kabsch_f64_serial11.8 mp/s, 0.8 ulp13.6 mp/s, 0.8 ulp12.8 mp/s, 0.8 ulp
nk_umeyama_f64_serial10.4 mp/s, 0.3 ulp11.7 mp/s, 0.3 ulp11.5 mp/s, 0.3 ulp
nk_rmsd_f64_haswell523 mp/s, 0.3 ulp564 mp/s, 0.4 ulp449 mp/s, 0.8 ulp
nk_kabsch_f64_haswell65.3 mp/s, 0.5 ulp203 mp/s, 0.9 ulp326 mp/s, 1.5 ulp
nk_umeyama_f64_haswell68.0 mp/s, 0.5 ulp200 mp/s, 0.8 ulp324 mp/s, 1.5 ulp
nk_rmsd_f64_skylake546 mp/s, 0.2 ulp587 mp/s, 0.3 ulp583 mp/s, 0.4 ulp
nk_kabsch_f64_skylake34.5 mp/s, 0.4 ulp107 mp/s, 0.5 ulp261 mp/s, 0.8 ulp
nk_umeyama_f64_skylake24.3 mp/s, 0.3 ulp82.7 mp/s, 0.5 ulp201 mp/s, 0.8 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f32_serial68.9 mp/s, 0.5 ulp70.7 mp/s, 0.5 ulp72.1 mp/s, 0.5 ulp
nk_kabsch_f32_serial11.2 mp/s, 0.8 ulp12.8 mp/s, 0.8 ulp14.0 mp/s, 0.9 ulp
nk_umeyama_f32_serial10.1 mp/s, 0.3 ulp11.2 mp/s, 0.3 ulp12.1 mp/s, 0.4 ulp
nk_rmsd_f32_haswell686 mp/s, 0.3 ulp848 mp/s, 0.5 ulp841 mp/s, 0.9 ulp
nk_kabsch_f32_haswell90.4 mp/s, 0.9 ulp250 mp/s, 1.3 ulp455 mp/s, 7.6 ulp
nk_umeyama_f32_haswell87.7 mp/s, 0.3 ulp250 mp/s, 0.4 ulp374 mp/s, 0.7 ulp
nk_rmsd_f32_skylake1,016 mp/s, 1.2 ulp1,112 mp/s, 1.2 ulp1,042 mp/s, 4.3 ulp
nk_kabsch_f32_skylake81.8 mp/s, 0.9 ulp241 mp/s, 4.1 ulp549 mp/s, 3.1 ulp
nk_umeyama_f32_skylake58.0 mp/s, 0.6 ulp168 mp/s, 2.9 ulp459 mp/s, 2.1 ulp
bf16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_bf16_haswell284 mp/s, 0.3 ulp281 mp/s, 3.5 ulp273 mp/s, 12.8 ulp
nk_kabsch_bf16_haswell36.2 mp/s, 0.4 ulp106 mp/s, 7.6 ulp186 mp/s, 33.0 ulp
nk_umeyama_bf16_haswell34.5 mp/s, 0.3 ulp102 mp/s, 5.3 ulp186 mp/s, 23.1 ulp
nk_rmsd_bf16_skylake1,837 mp/s, 0.4 ulp2,357 mp/s, 5.4 ulp2,422 mp/s, 11.8 ulp
nk_kabsch_bf16_skylake34.1 mp/s, 0.3 ulp131 mp/s, 3.2 ulp487 mp/s, 20.4 ulp
nk_umeyama_bf16_skylake34.6 mp/s, 0.3 ulp130 mp/s, 2.2 ulp394 mp/s, 14.3 ulp
nk_rmsd_bf16_genoa1,743 mp/s, 0.3 ulp2,323 mp/s, 3.1 ulp2,066 mp/s, 20.2 ulp
nk_kabsch_bf16_genoa33.4 mp/s, 0.3 ulp133 mp/s, 3.2 ulp405 mp/s, 20.3 ulp
nk_umeyama_bf16_genoa33.2 mp/s, 0.3 ulp129 mp/s, 2.2 ulp439 mp/s, 14.3 ulp
f16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f16_haswell273 mp/s, 0.2 ulp274 mp/s, 0.7 ulp291 mp/s, 2.5 ulp
nk_kabsch_f16_haswell34.4 mp/s, 0.5 ulp98.0 mp/s, 1.8 ulp197 mp/s, 8.2 ulp
nk_umeyama_f16_haswell35.5 mp/s, 0.4 ulp97.9 mp/s, 1.2 ulp196 mp/s, 5.7 ulp
nk_rmsd_f16_skylake1,834 mp/s, 0.3 ulp2,341 mp/s, 1.3 ulp2,418 mp/s, 3.9 ulp
nk_kabsch_f16_skylake34.0 mp/s, 0.7 ulp132 mp/s, 0.5 ulp480 mp/s, 4.7 ulp
nk_umeyama_f16_skylake33.8 mp/s, 0.5 ulp127 mp/s, 0.4 ulp481 mp/s, 3.3 ulp

WASM

Measured with Wasmtime v43 (Cranelift backend), WASI-SDK 24, -msimd128 -mrelaxed-simd.

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f64_serial89.9 mp/s, 0.5 ulp86.1 mp/s, 0.5 ulp73.4 mp/s, 0.5 ulp
nk_rmsd_f64_v128relaxed485 mp/s, 0.4 ulp552 mp/s, 0.7 ulp412 mp/s, 1.3 ulp
nk_kabsch_f64_serial12.1 mp/s, 0.8 ulp13.9 mp/s, 0.8 ulp14.0 mp/s, 0.9 ulp
nk_kabsch_f64_v128relaxed66.0 mp/s, 0.9 ulp188 mp/s, 1.7 ulp177 mp/s, 3.1 ulp
nk_umeyama_f64_serial10.8 mp/s, 0.3 ulp12.3 mp/s, 0.3 ulp12.2 mp/s, 0.4 ulp
nk_umeyama_f64_v128relaxed64.0 mp/s, 0.8 ulp187 mp/s, 1.6 ulp178 mp/s, 3.2 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f32_serial80.6 mp/s, 0.5 ulp82.7 mp/s, 0.5 ulp70.3 mp/s, 0.5 ulp
nk_rmsd_f32_v128relaxed452 mp/s, 1.5 ulp416 mp/s, 1.3 ulp399 mp/s, 4.8 ulp
nk_kabsch_f32_serial11.4 mp/s, 0.8 ulp12.8 mp/s, 0.9 ulp12.7 mp/s, 0.8 ulp
nk_kabsch_f32_v128relaxed79.5 mp/s, 4.2 ulp132 mp/s, 3.9 ulp177 mp/s, 14.3 ulp
nk_umeyama_f32_serial10.1 mp/s, 0.3 ulp11.2 mp/s, 0.3 ulp11.2 mp/s, 0.3 ulp
nk_umeyama_f32_v128relaxed79.4 mp/s, 2.8 ulp138 mp/s, 2.8 ulp194 mp/s, 10.1 ulp

Apple M5

Native

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f64_serial279 mp/s, 0.5 ulp267 mp/s, 0.5 ulp279 mp/s, 0.5 ulp
nk_kabsch_f64_serial40.4 mp/s, 1.4 ulp47.3 mp/s, 2.6 ulp50.2 mp/s, 5.4 ulp
nk_umeyama_f64_serial34.5 mp/s, 1.0 ulp39.2 mp/s, 1.9 ulp41.6 mp/s, 3.7 ulp
nk_rmsd_f64_neon1,776 mp/s, 0.4 ulp1,536 mp/s, 0.7 ulp2,037 mp/s, 1.3 ulp
nk_kabsch_f64_neon119 mp/s, 0.8 ulp222 mp/s, 1.3 ulp304 mp/s, 2.2 ulp
nk_umeyama_f64_neon115 mp/s, 0.4 ulp220 mp/s, 0.8 ulp296 mp/s, 1.6 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f32_serial264 mp/s, 0.5 ulp264 mp/s, 0.5 ulp261 mp/s, 0.5 ulp
nk_kabsch_f32_serial39.4 mp/s, 1.4 ulp46.0 mp/s, 2.7 ulp49.9 mp/s, 5.0 ulp
nk_umeyama_f32_serial33.6 mp/s, 0.9 ulp38.8 mp/s, 1.8 ulp41.4 mp/s, 3.5 ulp
nk_rmsd_f32_neon1,912 mp/s, 1.5 ulp2,239 mp/s, 1.3 ulp1,966 mp/s, 4.8 ulp
nk_kabsch_f32_neon135 mp/s, 0.7 ulp288 mp/s, 0.9 ulp385 mp/s, 1.4 ulp
nk_umeyama_f32_neon130 mp/s, 0.3 ulp272 mp/s, 0.4 ulp367 mp/s, 0.8 ulp
bf16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_bf16_neonbfdot3,728 mp/s, 0.4 ulp3,756 mp/s, 6.0 ulp3,769 mp/s, 10.0 ulp
nk_kabsch_bf16_neonbfdot180 mp/s, 0.7 ulp448 mp/s, 0.9 ulp726 mp/s, 1.3 ulp
nk_umeyama_bf16_neonbfdot176 mp/s, 0.2 ulp433 mp/s, 0.4 ulp705 mp/s, 0.8 ulp
f16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f16_neonhalf2,998 mp/s, 0.4 ulp3,215 mp/s, 1.7 ulp3,216 mp/s, 4.6 ulp
nk_kabsch_f16_neonhalf178 mp/s, 0.9 ulp443 mp/s, 1.3 ulp711 mp/s, 2.4 ulp
nk_umeyama_f16_neonhalf175 mp/s, 0.4 ulp408 mp/s, 0.8 ulp620 mp/s, 1.5 ulp

WASM

Measured with Wasmtime v43 (Cranelift backend).

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f64_serial137 mp/s, 2.6 ulp134 mp/s, 2.6 ulp142 mp/s, 2.6 ulp
nk_rmsd_f64_v128relaxed1,377 mp/s, 0.8 ulp1,038 mp/s, 0.8 ulp1,566 mp/s, 0.8 ulp
nk_kabsch_f64_serial42.3 mp/s, 2.7 ulp50.4 mp/s, 2.7 ulp55.5 mp/s, 2.7 ulp
nk_kabsch_f64_v128relaxed121 mp/s, 2.2 ulp225 mp/s, 2.2 ulp345 mp/s, 2.2 ulp
nk_umeyama_f64_serial36.1 mp/s, 1.8 ulp41.3 mp/s, 1.8 ulp46.0 mp/s, 1.8 ulp
nk_umeyama_f64_v128relaxed112 mp/s, 1.5 ulp207 mp/s, 1.5 ulp293 mp/s, 1.5 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_rmsd_f32_serial120 mp/s, 2.7 ulp120 mp/s, 2.7 ulp124 mp/s, 2.7 ulp
nk_rmsd_f32_v128relaxed1,025 mp/s, 0.5 ulp1,038 mp/s, 0.5 ulp1,093 mp/s, 0.5 ulp
nk_kabsch_f32_serial39.6 mp/s, 2.6 ulp47.6 mp/s, 2.6 ulp51.4 mp/s, 2.6 ulp
nk_kabsch_f32_v128relaxed125 mp/s, 1.3 ulp255 mp/s, 1.3 ulp366 mp/s, 1.3 ulp
nk_umeyama_f32_serial30.5 mp/s, 1.8 ulp35.0 mp/s, 1.8 ulp38.9 mp/s, 1.8 ulp
nk_umeyama_f32_v128relaxed118 mp/s, 0.8 ulp240 mp/s, 0.8 ulp338 mp/s, 0.8 ulp