Batched Set Distances in NumKong
April 2, 2026 · View on GitHub
NumKong implements batched M×N Hamming and Jaccard distance matrices for binary vectors. The module reuses the dots u1 packing and GEMM infrastructure, converting popcount-of-AND dot products to set distances via precomputed norms.
Hamming distance from batched dot products:
Where dot = popcount(AND), measuring intersection size.
Jaccard distance from batched dot products:
Reformulating as Python pseudocode:
import numpy as np
def hammings_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
dots = np.array([[np.unpackbits(np.bitwise_and(ai, bj)).sum()
for bj in b] for ai in a])
a_pop = np.array([np.unpackbits(ai).sum() for ai in a])[:, None]
b_pop = np.array([np.unpackbits(bj).sum() for bj in b])[None, :]
return a_pop + b_pop - 2 * dots
def jaccards_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
dots = np.array([[np.unpackbits(np.bitwise_and(ai, bj)).sum()
for bj in b] for ai in a])
a_pop = np.array([np.unpackbits(ai).sum() for ai in a])[:, None]
b_pop = np.array([np.unpackbits(bj).sum() for bj in b])[None, :]
union = a_pop + b_pop - dots
return np.where(union > 0, 1.0 - dots / union, 0.0)
Input & Output Types
| Input Type | Output Type | Description |
|---|---|---|
u1 | u32 | Binary Hamming distance, packed octets |
u1 | f32 | Binary Jaccard distance, packed octets |
Optimizations
Hamming and Jaccard from Intersection Counts
nk_hammings_packed_u1_serial, nk_hammings_packed_u1_haswell, nk_jaccards_packed_u1_serial, nk_jaccards_packed_u1_haswell reuse the dots u1 GEMM output where each dot product counts intersection bits.
The L1 norm of a binary vector is its popcount: .
By inclusion-exclusion, .
Hamming distance counts positions where exactly one bit is set: .
Finalizer nk_hamming_u32x4_from_dot_serial_ computes pop_a + pop_b - 2 * dot in pure UInt32 arithmetic — no division, no float conversion, no sqrt.
Jaccard distance: .
Finalizer nk_jaccard_f32x4_from_dot_serial_ requires UInt32 → Float32 cast plus Float32 division (~11cy latency on Haswell), making it ~3× more expensive per element than Hamming's integer subtraction chain.
Per-column popcount norms (, ) are precomputed during packing and stored in packed buffer metadata, avoiding per-pair recomputation.
SME Binary Outer-Product Accumulation
nk_hammings_packed_u1_smebi32, nk_jaccards_packed_u1_smebi32 use the BMOPA instruction which computes — counting matching bits in a single outer-product operation over 16×16 output tiles with 512-bit depth chunks.
This is fundamentally different from the AND+POPCNT used by scalar/NEON/x86 kernels, which count intersection bits.
Hamming from BMOPA: , since XOR popcount (differing bits) is the Hamming distance directly — no per-vector norm correction needed.
Jaccard from BMOPA: must convert matching-bit counts to intersection via , then apply the Jaccard formula — more arithmetic than the AND-based path.
Streaming mode overhead (~50–100 cycles for SMSTART/SMSTOP) is amortized across the full M×N output.
Performance
The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes.
The input size is controlled by NK_MATRIX_HEIGHT, NK_MATRIX_WIDTH, and NK_MATRIX_DEPTH environment variables, all set to the same value for batched set operations over square matrices.
Columns show throughput for 256³, 1024³, and 4096³ configurations.
The throughput is measured in GSO/s as Giga Scalar Operations per Second.
Accuracy is reported where applicable as exact distance in the result representation; floating Jaccard rows are shown as mean ULP (units in last place).
Each kernel runs for at least 20 seconds per configuration.
Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used.
Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
Intel Sapphire Rapids
Native
| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| u1 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_hammings_packed_u1_serial | 109 gso/s | 162 gso/s | 284 gso/s |
nk_hammings_symmetric_u1_serial | 39.7 gso/s | 133 gso/s | 325 gso/s |
nk_jaccards_packed_u1_serial | 54.8 gso/s, 0 ulp | 128 gso/s, 0 ulp | 259 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_serial | 29.8 gso/s, 0 ulp | 110 gso/s, 0 ulp | 292 gso/s, 0 ulp |
nk_hammings_packed_u1_haswell | 100 gso/s | 126 gso/s | 168 gso/s |
nk_hammings_symmetric_u1_haswell | 58.5 gso/s | 132 gso/s | 328 gso/s |
nk_jaccards_packed_u1_haswell | 84.2 gso/s, 0.3 ulp | 124 gso/s, 0.3 ulp | 165 gso/s, 0.3 ulp |
nk_jaccards_symmetric_u1_haswell | 57.6 gso/s, 0.3 ulp | 131 gso/s, 0.3 ulp | 324 gso/s, 0.3 ulp |
nk_hammings_packed_u1_icelake | 110 gso/s | 340 gso/s | 604 gso/s |
nk_hammings_symmetric_u1_icelake | 76.2 gso/s | 258 gso/s | 1,040 gso/s |
nk_jaccards_packed_u1_icelake | 89.2 gso/s, 0.3 ulp | 312 gso/s, 0.3 ulp | 601 gso/s, 0.3 ulp |
nk_jaccards_symmetric_u1_icelake | 66.9 gso/s, 0.3 ulp | 260 gso/s, 0.3 ulp | 965 gso/s, 0.3 ulp |
WASM
Measured with Wasmtime v42 (Cranelift backend).
| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| u1 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_hammings_packed_u1_serial | 43.7 gso/s | 68.0 gso/s | 74.7 gso/s |
nk_hammings_packed_u1_v128relaxed | 75.3 gso/s | 134 gso/s | 144 gso/s |
nk_hammings_symmetric_u1_serial | 3.72 gso/s | 13.5 gso/s | 41.0 gso/s |
nk_hammings_symmetric_u1_v128relaxed | 3.64 gso/s | 13.9 gso/s | 42.2 gso/s |
nk_jaccards_packed_u1_serial | 33.7 gso/s, 0 ulp | 61.3 gso/s, 0 ulp | 73.2 gso/s, 0 ulp |
nk_jaccards_packed_u1_v128relaxed | 66.4 gso/s, 0 ulp | 129 gso/s, 0 ulp | 143 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_serial | 3.57 gso/s, 0 ulp | 13.3 gso/s, 0 ulp | 40.6 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_v128relaxed | 3.65 gso/s, 0 ulp | 13.9 gso/s, 0 ulp | 42.2 gso/s, 0 ulp |
Apple M5
Native
| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| u1 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_hammings_packed_u1_serial | 156 gso/s | 231 gso/s | 262 gso/s |
nk_hammings_symmetric_u1_serial | 106 gso/s | 196 gso/s | 246 gso/s |
nk_jaccards_packed_u1_serial | 136 gso/s, 0 ulp | 221 gso/s, 0 ulp | 262 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_serial | 96.5 gso/s, 0 ulp | 183 gso/s, 0 ulp | 244 gso/s, 0 ulp |
nk_hammings_packed_u1_neon | 321 gso/s | 436 gso/s | 508 gso/s |
nk_hammings_symmetric_u1_neon | 126 gso/s | 239 gso/s | 318 gso/s |
nk_jaccards_packed_u1_neon | 271 gso/s, 0 ulp | 423 gso/s, 0 ulp | 503 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_neon | 120 gso/s, 0 ulp | 233 gso/s, 0 ulp | 316 gso/s, 0 ulp |
nk_hammings_packed_u1_smebi32 | 3,286 gso/s | 7,303 gso/s | 11,269 gso/s |
nk_hammings_symmetric_u1_smebi32 | 1,872 gso/s | 5,332 gso/s | 4,079 gso/s |
nk_jaccards_packed_u1_smebi32 | 371 gso/s, 0 ulp | 1,735 gso/s, 0 ulp | 4,348 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_smebi32 | 83.1 gso/s, 0 ulp | 358 gso/s, 0 ulp | 1,005 gso/s, 0 ulp |
WASM
Measured with Wasmtime v43 (Cranelift backend).
| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| u1 | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░░░░░░ |
nk_hammings_packed_u1_serial | 99.3 gso/s | 127 gso/s | 154 gso/s |
nk_hammings_symmetric_u1_serial | 63.7 gso/s | 142 gso/s | 210 gso/s |
nk_jaccards_packed_u1_serial | 92.2 gso/s, 0 ulp | 123 gso/s, 0 ulp | 153 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_serial | 59.3 gso/s, 0 ulp | 142 gso/s, 0 ulp | 207 gso/s, 0 ulp |
nk_hammings_packed_u1_v128relaxed | 266 gso/s | 378 gso/s | 426 gso/s |
nk_hammings_symmetric_u1_v128relaxed | 72.2 gso/s | 185 gso/s | 259 gso/s |
nk_jaccards_packed_u1_v128relaxed | 243 gso/s, 0 ulp | 370 gso/s, 0 ulp | 424 gso/s, 0 ulp |
nk_jaccards_symmetric_u1_v128relaxed | 72.9 gso/s, 0 ulp | 183 gso/s, 0 ulp | 257 gso/s, 0 ulp |