Geospatial Distances in NumKong
April 2, 2026 · View on GitHub
NumKong implements geodesic distance functions for points on Earth's surface: Haversine computes great-circle distance on a perfect sphere, while Vincenty solves the inverse geodesic problem on the WGS-84 oblate spheroid. Both operate on arrays of latitude/longitude pairs in radians and produce distances in meters.
The Haversine formula computes the great-circle distance between two points:
where is Earth's mean radius and are latitude and longitude in radians.
Vincenty's formula solves the inverse geodesic problem on an oblate spheroid, iteratively refining the reduced latitude difference until convergence:
where and are the equatorial and polar semi-axes of the WGS-84 ellipsoid, is the angular separation, and is the correction term computed through iterative convergence.
Reformulating as Python pseudocode:
import numpy as np
def haversine(lat1, lon1, lat2, lon2, R=6371000):
dlat = lat2 - lat1
dlon = lon2 - lon1
a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
return 2 * R * np.arcsin(np.sqrt(a))
Input coordinates are in radians, output distances are in meters.
Input & Output Types
| Input Type | Output Type | Description |
|---|---|---|
f64 | f64 | 64-bit double precision |
f32 | f32 | 32-bit single precision |
Optimizations
Trigonometric Polynomial Approximations
nk_haversine_f32_haswell, nk_haversine_f32_skylake, nk_haversine_f32_neon replace libm sin, cos, atan2, and asin with SIMD polynomial approximations achieving approximately 3.5 ULP accuracy.
Range reduction maps the input angle to using Cody-Waite extended precision constants, then an odd-degree minimax polynomial evaluates sin and an even-degree polynomial evaluates cos.
The f32 kernels use 5-term polynomials while f64 kernels use 11-term polynomials for the extra precision required by double-precision inputs.
This avoids the latency of scalar libm calls — each trigonometric evaluation would otherwise serialize through a single execution port, while the polynomial chains pipeline across multiple FMA units.
Vincenty Iterative Convergence with Masked Lanes
nk_vincenty_f64_haswell, nk_vincenty_f64_skylake, nk_vincenty_f64_neon implement the full Vincenty inverse formula with up to 100 iterations and a convergence threshold of $10^{-12}$ radians (approximately 6 micrometers on Earth's surface).
Each SIMD lane may converge at a different iteration count, so the kernel accumulates a converged_mask via _mm256_or_pd(converged_mask, newly_converged) and selectively freezes converged lanes with _mm256_blendv_pd(lambda_new, lambda, converged_mask).
Early exit uses _mm256_movemask_pd — when all 4 bits (for f64) or 8 bits (for f32) are set, the loop breaks.
Coincident points and equatorial edge cases are handled by blending safe values (ones) into the intermediate terms to avoid division by zero, without requiring branches that would diverge across SIMD lanes.
Potential Optimization: Haversine Without Final Arc Conversion
The haversine formula computes where .
Since both asin and sqrt are monotonically increasing, comparing values directly produces the same ordering as comparing full haversine distances.
For ranking-only use cases, a future "similarity mode" could skip the final sqrt/atan2 conversion and return directly, eliminating the two most expensive operations in the pipeline.
Currently, all kernels compute the full distance.
Performance
The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes.
The input size is controlled by the NK_MAX_COORD_ANGLE environment variable and set to ≤1°, ≤30°, and ≤180° maximum angular separation between pairs of coordinates.
The larger the angular separation between pairs, the longer the algorithm may take to converge and the higher the error.
The throughput is measured in MP/s as the number of Millions of pairwise point distances computed per second - amortized for a large batch size, with NK_DENSE_DIMENSIONS=1536 by default.
Current nk_test output reports geospatial accuracy in two forms: mean/max absolute error in meters against Vincenty's formula computed at double-double (f118) precision, and mean/max ULP against the matching high-precision implementation of the same formula. The historical tables below use the meter-based summary where it has been remeasured; older x86 rows still retain their original ULP figures until rerun.
Each kernel runs for at least 20 seconds per configuration.
Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used.
Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
Intel Sapphire Rapids
Native
| Kernel | ≤1° | ≤30° | ≤180° |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f64_serial | 1.95 mp/s, 0.8 ulp | 2.10 mp/s, 0.8 ulp | 2.02 mp/s, 1.7 ulp |
nk_vincenty_f64_serial | 0.565 mp/s, 82 ulp | 0.481 mp/s, 3.9 ulp | 0.514 mp/s, 1.1 ulp |
nk_haversine_f64_haswell | 73.3 mp/s, 0.6 ulp | 68.3 mp/s, 0.6 ulp | 70.4 mp/s, 1.5 ulp |
nk_vincenty_f64_haswell | 12.2 mp/s, 80 ulp | 10.2 mp/s, 3.6 ulp | 7.15 mp/s, 1.1 ulp |
nk_haversine_f64_skylake | 106 mp/s, 0.6 ulp | 107 mp/s, 0.6 ulp | 99.8 mp/s, 1.5 ulp |
nk_vincenty_f64_skylake | 20.4 mp/s, 171K ulp | 17.5 mp/s, 6.57K ulp | 11.2 mp/s, 1.02K ulp |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f32_serial | 56.2 mp/s, 3.4 ulp | 62.3 mp/s, 2.9 ulp | 57.2 mp/s, 55 ulp |
nk_vincenty_f32_serial | 3.25 mp/s, 58.3K ulp | 2.39 mp/s, 306 ulp | 1.79 mp/s, 103 ulp |
nk_haversine_f32_haswell | 247 mp/s, 3.2 ulp | 282 mp/s, 2.7 ulp | 281 mp/s, 54 ulp |
nk_vincenty_f32_haswell | 53.6 mp/s, 26.2K ulp | 46.4 mp/s, 289 ulp | 16.5 mp/s, 61 ulp |
nk_haversine_f32_skylake | 350 mp/s, 3.1 ulp | 328 mp/s, 2.7 ulp | 356 mp/s, 53 ulp |
nk_vincenty_f32_skylake | 78.7 mp/s, 7.16K ulp | 73.6 mp/s, 406 ulp | 20.1 mp/s, 105 ulp |
WASM
Measured with Wasmtime v42 (Cranelift backend).
| Kernel | ≤1° | ≤30° | ≤180° |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f64_serial | ? mp/s, 0.9 ulp | ? mp/s, 0.9 ulp | ? mp/s, 1.8 ulp |
nk_vincenty_f64_serial | ? mp/s, 102 ulp | ? mp/s, 3.7 ulp | ? mp/s, 1.1 ulp |
nk_haversine_f64_v128relaxed | ? mp/s, 0.6 ulp | ? mp/s, 0.6 ulp | ? mp/s, 1.7 ulp |
nk_vincenty_f64_v128relaxed | ? mp/s, 104 ulp | ? mp/s, 3.4 ulp | ? mp/s, 1.1 ulp |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f32_serial | ? mp/s, 3.5 ulp | ? mp/s, 2.9 ulp | ? mp/s, 53.6 ulp |
nk_vincenty_f32_serial | ? mp/s, 70.5K ulp | ? mp/s, 326 ulp | ? mp/s, 65.5 ulp |
nk_haversine_f32_v128relaxed | ? mp/s, 6.5 ulp | ? mp/s, 5.6 ulp | ? mp/s, 53.3 ulp |
nk_vincenty_f32_v128relaxed | ? mp/s, 23.8K ulp | ? mp/s, 323 ulp | ? mp/s, 64.0 ulp |
Apple M4
Native
| Kernel | ≤1° | ≤30° | ≤180° |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f64_serial | 3.47 mp/s, 1.12 km | 3.47 mp/s, 32.8 km | 3.48 mp/s, 150 km |
nk_vincenty_f64_serial | 0.888 mp/s, 2.20 nm | 0.770 mp/s, 2.79 nm | 0.662 mp/s, 622 nm |
nk_haversine_f64_neon | 72.8 mp/s, 1.12 km | 72.5 mp/s, 32.8 km | 72.8 mp/s, 150 km |
nk_vincenty_f64_neon | 9.34 mp/s, 2.12 nm | 7.61 mp/s, 2.33 nm | 5.99 mp/s, 622 nm |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f32_serial | 14.1 mp/s, 1.12 km | 13.8 mp/s, 32.8 km | 14.0 mp/s, 146 km |
nk_vincenty_f32_serial | 4.54 mp/s, 12.0 m | 3.25 mp/s, 12.5 m | 2.42 mp/s, 22.0 m |
nk_haversine_f32_neon | 247 mp/s, 1.12 km | 235 mp/s, 32.8 km | 252 mp/s, 146 km |
nk_vincenty_f32_neon | 45.7 mp/s, 12.2 m | 37.9 mp/s, 12.8 m | 15.7 mp/s, 22.0 m |
WASM
Measured with Wasmtime v43 (Cranelift backend).
| Kernel | ≤1° | ≤30° | ≤180° |
|---|---|---|---|
| f64 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f64_serial | 5.12 mp/s, 1.12 km | 5.12 mp/s, 32.8 km | 5.07 mp/s, 148 km |
nk_vincenty_f64_serial | 1.27 mp/s, 1.86 nm | 1.11 mp/s, 2.33 nm | 1.02 mp/s, 594 nm |
nk_haversine_f64_v128relaxed | 109 mp/s, 1.12 km | 109 mp/s, 32.8 km | 109 mp/s, 148 km |
nk_vincenty_f64_v128relaxed | 12.8 mp/s, 1.89 nm | 10.4 mp/s, 2.33 nm | 8.28 mp/s, 594 nm |
| f32 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_haversine_f32_serial | 20.9 mp/s, 20,000 km | 20.9 mp/s, 32.7 km | 21.0 mp/s, 136 km |
nk_vincenty_f32_serial | 6.11 mp/s, 20,000 km | 4.30 mp/s, 12.0 m | 3.27 mp/s, 22.0 m |
nk_haversine_f32_v128relaxed | 523 mp/s, 20,000 km | 524 mp/s, 32.7 km | 524 mp/s, 153 km |
nk_vincenty_f32_v128relaxed | 76.9 mp/s, 12.0 m | 68.2 mp/s, 16.2 m | 26.8 mp/s, 18.0 m |