Point Cloud Alignment in NumKong
April 16, 2026 · View on GitHub
NumKong implements three algorithms for 3D point cloud comparison and alignment, used in structural biology (protein alignment), robotics (point cloud registration), and computer graphics (mesh registration).
RMSD measures raw point-pair deviation without centering or alignment:
Kabsch finds the optimal rotation that minimizes RMSD after centering both clouds at their centroids , , recovering from the SVD of the cross-covariance matrix :
Umeyama extends Kabsch with a uniform scale factor derived from the singular values and source variance :
Reformulating as Python pseudocode:
import numpy as np
def kabsch(a: np.ndarray, b: np.ndarray) -> np.ndarray:
a_c, b_c = a - a.mean(0), b - b.mean(0)
H = a_c.T @ b_c
U, S, Vt = np.linalg.svd(H)
d = np.sign(np.linalg.det(Vt.T @ U.T))
R = Vt.T @ np.diag([1, 1, d]) @ U.T
return R
def umeyama(a: np.ndarray, b: np.ndarray) -> tuple:
a_c, b_c = a - a.mean(0), b - b.mean(0)
H = a_c.T @ b_c
U, S, Vt = np.linalg.svd(H)
d = np.sign(np.linalg.det(Vt.T @ U.T))
R = Vt.T @ np.diag([1, 1, d]) @ U.T
scale = S.sum() / (len(a) * np.var(a_c))
return R, scale
def rmsd(a: np.ndarray, b: np.ndarray) -> float:
return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))
Input & Output Types
| Input Type | Output Type | Description |
|---|---|---|
f64 | f64 | 64-bit IEEE 754 double precision |
f32 | f32 | 32-bit IEEE 754 single precision |
f16 | f32 | 16-bit IEEE 754 half precision, widened output |
bf16 | `f32$ | 16-\text{bit} \text{brain} \text{float}, \text{widened} \text{output} |
\text{Optimizations}
\text{McAdams} \text{Branching}-\text{Free} 3 \times 3 \text{SVD}
$nk_kabsch_f32_serial, nk_kabsch_f64_haswell, nk_umeyama_f32_neonuse a Jacobi eigenanalysis with fixed 16 iterations (no convergence check) for deterministic behavior. Quaternion-accumulated rotations: each Jacobi sweep updates a 4-element quaternion instead of recomputing eigenvectors. Approximate Givens angles viank_approximate_givens_quaternion_` — a γ-threshold test selects between computed angles and precomputed cos(π/8), sin(π/8) constants.
Cyclic permutation of matrix elements avoids explicit sorting of eigenvalues.
Stride-3 Deinterleaving
Point clouds are stored interleaved as [x₀,y₀,z₀, x₁,y₁,z₁, ...].
NEON uses vld3q_f32 to hardware-deinterleave 4 XYZ triplets in one instruction — no gather needed.
Haswell uses _mm256_i32gather_ps with indices [0,3,6,9,12,15,18,21] to load 8 x-coordinates from 8 points.
RVV uses indexed loads with dynamic stride to adapt to variable vector length.
Reflection Correction
nk_kabsch_f32_haswell, nk_kabsch_f64_skylake check for improper rotations after computing from the SVD of the cross-covariance matrix .
If (a reflection rather than a rotation), the last column of is negated before recomputing .
This ensures the output is always a proper rotation matrix with .
Pre-Scaled Rotation for Umeyama
nk_umeyama_f32_haswell, nk_umeyama_f64_skylake fold the computed scale factor into the rotation matrix before applying to points.
The Umeyama transform is ; by precomputing once, the per-point operation reduces to , avoiding a per-point scalar multiply.
Why SME and SVE Were Removed
Historical note: experimental SME variants of RMSD, Kabsch, and Umeyama were implemented in 1,052 lines across sme.h and smef64.h (commit 0e0bc30c) and removed 4 days later (commit f55e9a71).
The fundamental mismatch: the algorithm computes a 3×3 cross-covariance matrix — a sum of outer products of 3D vectors.
SME's FMOPA operates on SVL-wide vectors (16+ elements at SVL=512), but the outer products here are 3×3 — the tile is 99.6% wasted (9 useful cells out of 256).
Three approaches were explored in a design document (sme_design.h, 398 lines):
(1) batched outer products — reformulates as 9 independent dot products but loses SME's outer-product strength, falling back to what NEON already does;
(2) streaming SVE with svld3 — hardware stride-3 deinterleaving processes 16 points per iteration vs NEON's 4, but SMSTART/SMSTOP mode transitions cost ~100 cycles and the 3×3 SVD step cannot use streaming mode at all;
(3) SME for SVD — the 3×3 matrix cannot fill even one 16×16 tile.
Performance estimates from the design document: NEON baseline ~2.25N cycles for N points; streaming SVE ~1.2N cycles but with ~100-cycle mode transition overhead — for typical protein alignment workloads (N = 100–500 atoms), the overhead dominates.
Experimental SVE mesh kernels (sve.h, svehalf.h, 112 lines total) were removed in the same commit — variable vector length added complexity without clear benefit over fixed-width NEON for the 3D point cloud problem.
Performance
The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes.
The input size is controlled by the NK_MESH_POINTS environment variable and set to 256, 1024, and 4096 points.
Each alignment computes centroids, covariance, and a 3×3 SVD over point pairs, so cost is per alignment with a large constant.
The throughput is measured in mp/s as millions of 3D points aligned per second.
Accuracy is reported as mean ULP (units in last place) unless noted otherwise — the average number of representable floating-point values between the result and the exact answer.
Each kernel runs for at least 20 seconds per configuration.
Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used.
Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
Intel Granite Rapids
Xeon 6776P, 2.3 GHz base, cpu_scaling_enabled=false.
Serial kernels compiled with -fno-tree-vectorize.
Native
| Kernel | 256 | 1024 | 4096 |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f64_serial | 93.7 mp/s, 0.5 ulp | 87.4 mp/s, 0.5 ulp | 69.8 mp/s, 0.5 ulp |
nk_kabsch_f64_serial | 11.8 mp/s, 0.8 ulp | 13.6 mp/s, 0.8 ulp | 12.8 mp/s, 0.8 ulp |
nk_umeyama_f64_serial | 10.4 mp/s, 0.3 ulp | 11.7 mp/s, 0.3 ulp | 11.5 mp/s, 0.3 ulp |
nk_rmsd_f64_haswell | 523 mp/s, 0.3 ulp | 564 mp/s, 0.4 ulp | 449 mp/s, 0.8 ulp |
nk_kabsch_f64_haswell | 65.3 mp/s, 0.5 ulp | 203 mp/s, 0.9 ulp | 326 mp/s, 1.5 ulp |
nk_umeyama_f64_haswell | 68.0 mp/s, 0.5 ulp | 200 mp/s, 0.8 ulp | 324 mp/s, 1.5 ulp |
nk_rmsd_f64_skylake | 546 mp/s, 0.2 ulp | 587 mp/s, 0.3 ulp | 583 mp/s, 0.4 ulp |
nk_kabsch_f64_skylake | 34.5 mp/s, 0.4 ulp | 107 mp/s, 0.5 ulp | 261 mp/s, 0.8 ulp |
nk_umeyama_f64_skylake | 24.3 mp/s, 0.3 ulp | 82.7 mp/s, 0.5 ulp | 201 mp/s, 0.8 ulp |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f32_serial | 68.9 mp/s, 0.5 ulp | 70.7 mp/s, 0.5 ulp | 72.1 mp/s, 0.5 ulp |
nk_kabsch_f32_serial | 11.2 mp/s, 0.8 ulp | 12.8 mp/s, 0.8 ulp | 14.0 mp/s, 0.9 ulp |
nk_umeyama_f32_serial | 10.1 mp/s, 0.3 ulp | 11.2 mp/s, 0.3 ulp | 12.1 mp/s, 0.4 ulp |
nk_rmsd_f32_haswell | 686 mp/s, 0.3 ulp | 848 mp/s, 0.5 ulp | 841 mp/s, 0.9 ulp |
nk_kabsch_f32_haswell | 90.4 mp/s, 0.9 ulp | 250 mp/s, 1.3 ulp | 455 mp/s, 7.6 ulp |
nk_umeyama_f32_haswell | 87.7 mp/s, 0.3 ulp | 250 mp/s, 0.4 ulp | 374 mp/s, 0.7 ulp |
nk_rmsd_f32_skylake | 1,016 mp/s, 1.2 ulp | 1,112 mp/s, 1.2 ulp | 1,042 mp/s, 4.3 ulp |
nk_kabsch_f32_skylake | 81.8 mp/s, 0.9 ulp | 241 mp/s, 4.1 ulp | 549 mp/s, 3.1 ulp |
nk_umeyama_f32_skylake | 58.0 mp/s, 0.6 ulp | 168 mp/s, 2.9 ulp | 459 mp/s, 2.1 ulp |
| bf16 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_bf16_haswell | 284 mp/s, 0.3 ulp | 281 mp/s, 3.5 ulp | 273 mp/s, 12.8 ulp |
nk_kabsch_bf16_haswell | 36.2 mp/s, 0.4 ulp | 106 mp/s, 7.6 ulp | 186 mp/s, 33.0 ulp |
nk_umeyama_bf16_haswell | 34.5 mp/s, 0.3 ulp | 102 mp/s, 5.3 ulp | 186 mp/s, 23.1 ulp |
nk_rmsd_bf16_skylake | 1,837 mp/s, 0.4 ulp | 2,357 mp/s, 5.4 ulp | 2,422 mp/s, 11.8 ulp |
nk_kabsch_bf16_skylake | 34.1 mp/s, 0.3 ulp | 131 mp/s, 3.2 ulp | 487 mp/s, 20.4 ulp |
nk_umeyama_bf16_skylake | 34.6 mp/s, 0.3 ulp | 130 mp/s, 2.2 ulp | 394 mp/s, 14.3 ulp |
nk_rmsd_bf16_genoa | 1,743 mp/s, 0.3 ulp | 2,323 mp/s, 3.1 ulp | 2,066 mp/s, 20.2 ulp |
nk_kabsch_bf16_genoa | 33.4 mp/s, 0.3 ulp | 133 mp/s, 3.2 ulp | 405 mp/s, 20.3 ulp |
nk_umeyama_bf16_genoa | 33.2 mp/s, 0.3 ulp | 129 mp/s, 2.2 ulp | 439 mp/s, 14.3 ulp |
| f16 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f16_haswell | 273 mp/s, 0.2 ulp | 274 mp/s, 0.7 ulp | 291 mp/s, 2.5 ulp |
nk_kabsch_f16_haswell | 34.4 mp/s, 0.5 ulp | 98.0 mp/s, 1.8 ulp | 197 mp/s, 8.2 ulp |
nk_umeyama_f16_haswell | 35.5 mp/s, 0.4 ulp | 97.9 mp/s, 1.2 ulp | 196 mp/s, 5.7 ulp |
nk_rmsd_f16_skylake | 1,834 mp/s, 0.3 ulp | 2,341 mp/s, 1.3 ulp | 2,418 mp/s, 3.9 ulp |
nk_kabsch_f16_skylake | 34.0 mp/s, 0.7 ulp | 132 mp/s, 0.5 ulp | 480 mp/s, 4.7 ulp |
nk_umeyama_f16_skylake | 33.8 mp/s, 0.5 ulp | 127 mp/s, 0.4 ulp | 481 mp/s, 3.3 ulp |
WASM
Measured with Wasmtime v43 (Cranelift backend), WASI-SDK 24, -msimd128 -mrelaxed-simd.
| Kernel | 256 | 1024 | 4096 |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f64_serial | 89.9 mp/s, 0.5 ulp | 86.1 mp/s, 0.5 ulp | 73.4 mp/s, 0.5 ulp |
nk_rmsd_f64_v128relaxed | 485 mp/s, 0.4 ulp | 552 mp/s, 0.7 ulp | 412 mp/s, 1.3 ulp |
nk_kabsch_f64_serial | 12.1 mp/s, 0.8 ulp | 13.9 mp/s, 0.8 ulp | 14.0 mp/s, 0.9 ulp |
nk_kabsch_f64_v128relaxed | 66.0 mp/s, 0.9 ulp | 188 mp/s, 1.7 ulp | 177 mp/s, 3.1 ulp |
nk_umeyama_f64_serial | 10.8 mp/s, 0.3 ulp | 12.3 mp/s, 0.3 ulp | 12.2 mp/s, 0.4 ulp |
nk_umeyama_f64_v128relaxed | 64.0 mp/s, 0.8 ulp | 187 mp/s, 1.6 ulp | 178 mp/s, 3.2 ulp |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f32_serial | 80.6 mp/s, 0.5 ulp | 82.7 mp/s, 0.5 ulp | 70.3 mp/s, 0.5 ulp |
nk_rmsd_f32_v128relaxed | 452 mp/s, 1.5 ulp | 416 mp/s, 1.3 ulp | 399 mp/s, 4.8 ulp |
nk_kabsch_f32_serial | 11.4 mp/s, 0.8 ulp | 12.8 mp/s, 0.9 ulp | 12.7 mp/s, 0.8 ulp |
nk_kabsch_f32_v128relaxed | 79.5 mp/s, 4.2 ulp | 132 mp/s, 3.9 ulp | 177 mp/s, 14.3 ulp |
nk_umeyama_f32_serial | 10.1 mp/s, 0.3 ulp | 11.2 mp/s, 0.3 ulp | 11.2 mp/s, 0.3 ulp |
nk_umeyama_f32_v128relaxed | 79.4 mp/s, 2.8 ulp | 138 mp/s, 2.8 ulp | 194 mp/s, 10.1 ulp |
Apple M5
Native
| Kernel | 256 | 1024 | 4096 |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f64_serial | 279 mp/s, 0.5 ulp | 267 mp/s, 0.5 ulp | 279 mp/s, 0.5 ulp |
nk_kabsch_f64_serial | 40.4 mp/s, 1.4 ulp | 47.3 mp/s, 2.6 ulp | 50.2 mp/s, 5.4 ulp |
nk_umeyama_f64_serial | 34.5 mp/s, 1.0 ulp | 39.2 mp/s, 1.9 ulp | 41.6 mp/s, 3.7 ulp |
nk_rmsd_f64_neon | 1,776 mp/s, 0.4 ulp | 1,536 mp/s, 0.7 ulp | 2,037 mp/s, 1.3 ulp |
nk_kabsch_f64_neon | 119 mp/s, 0.8 ulp | 222 mp/s, 1.3 ulp | 304 mp/s, 2.2 ulp |
nk_umeyama_f64_neon | 115 mp/s, 0.4 ulp | 220 mp/s, 0.8 ulp | 296 mp/s, 1.6 ulp |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f32_serial | 264 mp/s, 0.5 ulp | 264 mp/s, 0.5 ulp | 261 mp/s, 0.5 ulp |
nk_kabsch_f32_serial | 39.4 mp/s, 1.4 ulp | 46.0 mp/s, 2.7 ulp | 49.9 mp/s, 5.0 ulp |
nk_umeyama_f32_serial | 33.6 mp/s, 0.9 ulp | 38.8 mp/s, 1.8 ulp | 41.4 mp/s, 3.5 ulp |
nk_rmsd_f32_neon | 1,912 mp/s, 1.5 ulp | 2,239 mp/s, 1.3 ulp | 1,966 mp/s, 4.8 ulp |
nk_kabsch_f32_neon | 135 mp/s, 0.7 ulp | 288 mp/s, 0.9 ulp | 385 mp/s, 1.4 ulp |
nk_umeyama_f32_neon | 130 mp/s, 0.3 ulp | 272 mp/s, 0.4 ulp | 367 mp/s, 0.8 ulp |
| bf16 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_bf16_neonbfdot | 3,728 mp/s, 0.4 ulp | 3,756 mp/s, 6.0 ulp | 3,769 mp/s, 10.0 ulp |
nk_kabsch_bf16_neonbfdot | 180 mp/s, 0.7 ulp | 448 mp/s, 0.9 ulp | 726 mp/s, 1.3 ulp |
nk_umeyama_bf16_neonbfdot | 176 mp/s, 0.2 ulp | 433 mp/s, 0.4 ulp | 705 mp/s, 0.8 ulp |
| f16 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f16_neonhalf | 2,998 mp/s, 0.4 ulp | 3,215 mp/s, 1.7 ulp | 3,216 mp/s, 4.6 ulp |
nk_kabsch_f16_neonhalf | 178 mp/s, 0.9 ulp | 443 mp/s, 1.3 ulp | 711 mp/s, 2.4 ulp |
nk_umeyama_f16_neonhalf | 175 mp/s, 0.4 ulp | 408 mp/s, 0.8 ulp | 620 mp/s, 1.5 ulp |
WASM
Measured with Wasmtime v43 (Cranelift backend).
| Kernel | 256 | 1024 | 4096 |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f64_serial | 137 mp/s, 2.6 ulp | 134 mp/s, 2.6 ulp | 142 mp/s, 2.6 ulp |
nk_rmsd_f64_v128relaxed | 1,377 mp/s, 0.8 ulp | 1,038 mp/s, 0.8 ulp | 1,566 mp/s, 0.8 ulp |
nk_kabsch_f64_serial | 42.3 mp/s, 2.7 ulp | 50.4 mp/s, 2.7 ulp | 55.5 mp/s, 2.7 ulp |
nk_kabsch_f64_v128relaxed | 121 mp/s, 2.2 ulp | 225 mp/s, 2.2 ulp | 345 mp/s, 2.2 ulp |
nk_umeyama_f64_serial | 36.1 mp/s, 1.8 ulp | 41.3 mp/s, 1.8 ulp | 46.0 mp/s, 1.8 ulp |
nk_umeyama_f64_v128relaxed | 112 mp/s, 1.5 ulp | 207 mp/s, 1.5 ulp | 293 mp/s, 1.5 ulp |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_rmsd_f32_serial | 120 mp/s, 2.7 ulp | 120 mp/s, 2.7 ulp | 124 mp/s, 2.7 ulp |
nk_rmsd_f32_v128relaxed | 1,025 mp/s, 0.5 ulp | 1,038 mp/s, 0.5 ulp | 1,093 mp/s, 0.5 ulp |
nk_kabsch_f32_serial | 39.6 mp/s, 2.6 ulp | 47.6 mp/s, 2.6 ulp | 51.4 mp/s, 2.6 ulp |
nk_kabsch_f32_v128relaxed | 125 mp/s, 1.3 ulp | 255 mp/s, 1.3 ulp | 366 mp/s, 1.3 ulp |
nk_umeyama_f32_serial | 30.5 mp/s, 1.8 ulp | 35.0 mp/s, 1.8 ulp | 38.9 mp/s, 1.8 ulp |
nk_umeyama_f32_v128relaxed | 118 mp/s, 0.8 ulp | 240 mp/s, 0.8 ulp | 338 mp/s, 0.8 ulp |