Batched Set Distances in NumKong

April 2, 2026 · View on GitHub

NumKong implements batched M×N Hamming and Jaccard distance matrices for binary vectors. The module reuses the dots u1 packing and GEMM infrastructure, converting popcount-of-AND dot products to set distances via precomputed norms.

Hamming distance from batched dot products:

Dij=Ai1+Bj12dot(Ai,Bj)D_{ij} = \|A_i\|_1 + \|B_j\|_1 - 2 \cdot \text{dot}(A_i, B_j)

Where dot = popcount(AND), measuring intersection size.

Jaccard distance from batched dot products:

Dij=1dot(Ai,Bj)Ai1+Bj1dot(Ai,Bj)D_{ij} = 1 - \frac{\text{dot}(A_i, B_j)}{\|A_i\|_1 + \|B_j\|_1 - \text{dot}(A_i, B_j)}

Reformulating as Python pseudocode:

import numpy as np

def hammings_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    dots = np.array([[np.unpackbits(np.bitwise_and(ai, bj)).sum()
                      for bj in b] for ai in a])
    a_pop = np.array([np.unpackbits(ai).sum() for ai in a])[:, None]
    b_pop = np.array([np.unpackbits(bj).sum() for bj in b])[None, :]
    return a_pop + b_pop - 2 * dots

def jaccards_packed(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    dots = np.array([[np.unpackbits(np.bitwise_and(ai, bj)).sum()
                      for bj in b] for ai in a])
    a_pop = np.array([np.unpackbits(ai).sum() for ai in a])[:, None]
    b_pop = np.array([np.unpackbits(bj).sum() for bj in b])[None, :]
    union = a_pop + b_pop - dots
    return np.where(union > 0, 1.0 - dots / union, 0.0)

Input & Output Types

Input TypeOutput TypeDescription
u1u32Binary Hamming distance, packed octets
u1f32Binary Jaccard distance, packed octets

Optimizations

Hamming and Jaccard from Intersection Counts

nk_hammings_packed_u1_serial, nk_hammings_packed_u1_haswell, nk_jaccards_packed_u1_serial, nk_jaccards_packed_u1_haswell reuse the dots u1 GEMM output where each dot product dot(a,b)=popcount(a&b)=AB\text{dot}(a, b) = \text{popcount}(a \mathbin{\&} b) = |A \cap B| counts intersection bits. The L1 norm of a binary vector is its popcount: A=popcount(a)=a1|A| = \text{popcount}(a) = \|a\|_1. By inclusion-exclusion, AB=A+BAB|A \cup B| = |A| + |B| - |A \cap B|. Hamming distance counts positions where exactly one bit is set: DH=A+B2AB=popcount(ab)D_H = |A| + |B| - 2|A \cap B| = \text{popcount}(a \oplus b). Finalizer nk_hamming_u32x4_from_dot_serial_ computes pop_a + pop_b - 2 * dot in pure UInt32 arithmetic — no division, no float conversion, no sqrt. Jaccard distance: DJ=1ABAB=1dotpopa+popbdotD_J = 1 - \frac{|A \cap B|}{|A \cup B|} = 1 - \frac{\text{dot}}{\text{pop}_a + \text{pop}_b - \text{dot}}. Finalizer nk_jaccard_f32x4_from_dot_serial_ requires UInt32 → Float32 cast plus Float32 division (~11cy latency on Haswell), making it ~3× more expensive per element than Hamming's integer subtraction chain. Per-column popcount norms (a1\|a\|_1, b1\|b\|_1) are precomputed during packing and stored in packed buffer metadata, avoiding per-pair recomputation.

SME Binary Outer-Product Accumulation

nk_hammings_packed_u1_smebi32, nk_jaccards_packed_u1_smebi32 use the BMOPA instruction which computes popcount(XNOR(a,b))\text{popcount}(\text{XNOR}(a, b)) — counting matching bits in a single outer-product operation over 16×16 output tiles with 512-bit depth chunks. This is fundamentally different from the AND+POPCNT used by scalar/NEON/x86 kernels, which count intersection bits. Hamming from BMOPA: DH=depth_bitspopcount(XNOR)D_H = \text{depth\_bits} - \text{popcount}(\text{XNOR}), since XOR popcount (differing bits) is the Hamming distance directly — no per-vector norm correction needed. Jaccard from BMOPA: must convert matching-bit counts to intersection via AB=(popcount(XNOR)(depth_bitsAB))/2|A \cap B| = (\text{popcount}(\text{XNOR}) - (\text{depth\_bits} - |A| - |B|)) / 2, then apply the Jaccard formula — more arithmetic than the AND-based path. Streaming mode overhead (~50–100 cycles for SMSTART/SMSTOP) is amortized across the full M×N output.

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes. The input size is controlled by NK_MATRIX_HEIGHT, NK_MATRIX_WIDTH, and NK_MATRIX_DEPTH environment variables, all set to the same value for batched set operations over square matrices. Columns show throughput for 256³, 1024³, and 4096³ configurations. The throughput is measured in GSO/s as Giga Scalar Operations per Second. Accuracy is reported where applicable as exact distance in the result representation; floating Jaccard rows are shown as mean ULP (units in last place). Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.

Intel Sapphire Rapids

Native

Kernel256³1024³4096³
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hammings_packed_u1_serial109 gso/s162 gso/s284 gso/s
nk_hammings_symmetric_u1_serial39.7 gso/s133 gso/s325 gso/s
nk_jaccards_packed_u1_serial54.8 gso/s, 0 ulp128 gso/s, 0 ulp259 gso/s, 0 ulp
nk_jaccards_symmetric_u1_serial29.8 gso/s, 0 ulp110 gso/s, 0 ulp292 gso/s, 0 ulp
nk_hammings_packed_u1_haswell100 gso/s126 gso/s168 gso/s
nk_hammings_symmetric_u1_haswell58.5 gso/s132 gso/s328 gso/s
nk_jaccards_packed_u1_haswell84.2 gso/s, 0.3 ulp124 gso/s, 0.3 ulp165 gso/s, 0.3 ulp
nk_jaccards_symmetric_u1_haswell57.6 gso/s, 0.3 ulp131 gso/s, 0.3 ulp324 gso/s, 0.3 ulp
nk_hammings_packed_u1_icelake110 gso/s340 gso/s604 gso/s
nk_hammings_symmetric_u1_icelake76.2 gso/s258 gso/s1,040 gso/s
nk_jaccards_packed_u1_icelake89.2 gso/s, 0.3 ulp312 gso/s, 0.3 ulp601 gso/s, 0.3 ulp
nk_jaccards_symmetric_u1_icelake66.9 gso/s, 0.3 ulp260 gso/s, 0.3 ulp965 gso/s, 0.3 ulp

WASM

Measured with Wasmtime v42 (Cranelift backend).

Kernel256³1024³4096³
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hammings_packed_u1_serial43.7 gso/s68.0 gso/s74.7 gso/s
nk_hammings_packed_u1_v128relaxed75.3 gso/s134 gso/s144 gso/s
nk_hammings_symmetric_u1_serial3.72 gso/s13.5 gso/s41.0 gso/s
nk_hammings_symmetric_u1_v128relaxed3.64 gso/s13.9 gso/s42.2 gso/s
nk_jaccards_packed_u1_serial33.7 gso/s, 0 ulp61.3 gso/s, 0 ulp73.2 gso/s, 0 ulp
nk_jaccards_packed_u1_v128relaxed66.4 gso/s, 0 ulp129 gso/s, 0 ulp143 gso/s, 0 ulp
nk_jaccards_symmetric_u1_serial3.57 gso/s, 0 ulp13.3 gso/s, 0 ulp40.6 gso/s, 0 ulp
nk_jaccards_symmetric_u1_v128relaxed3.65 gso/s, 0 ulp13.9 gso/s, 0 ulp42.2 gso/s, 0 ulp

Apple M5

Native

Kernel256³1024³4096³
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hammings_packed_u1_serial156 gso/s231 gso/s262 gso/s
nk_hammings_symmetric_u1_serial106 gso/s196 gso/s246 gso/s
nk_jaccards_packed_u1_serial136 gso/s, 0 ulp221 gso/s, 0 ulp262 gso/s, 0 ulp
nk_jaccards_symmetric_u1_serial96.5 gso/s, 0 ulp183 gso/s, 0 ulp244 gso/s, 0 ulp
nk_hammings_packed_u1_neon321 gso/s436 gso/s508 gso/s
nk_hammings_symmetric_u1_neon126 gso/s239 gso/s318 gso/s
nk_jaccards_packed_u1_neon271 gso/s, 0 ulp423 gso/s, 0 ulp503 gso/s, 0 ulp
nk_jaccards_symmetric_u1_neon120 gso/s, 0 ulp233 gso/s, 0 ulp316 gso/s, 0 ulp
nk_hammings_packed_u1_smebi323,286 gso/s7,303 gso/s11,269 gso/s
nk_hammings_symmetric_u1_smebi321,872 gso/s5,332 gso/s4,079 gso/s
nk_jaccards_packed_u1_smebi32371 gso/s, 0 ulp1,735 gso/s, 0 ulp4,348 gso/s, 0 ulp
nk_jaccards_symmetric_u1_smebi3283.1 gso/s, 0 ulp358 gso/s, 0 ulp1,005 gso/s, 0 ulp

WASM

Measured with Wasmtime v43 (Cranelift backend).

Kernel256³1024³4096³
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hammings_packed_u1_serial99.3 gso/s127 gso/s154 gso/s
nk_hammings_symmetric_u1_serial63.7 gso/s142 gso/s210 gso/s
nk_jaccards_packed_u1_serial92.2 gso/s, 0 ulp123 gso/s, 0 ulp153 gso/s, 0 ulp
nk_jaccards_symmetric_u1_serial59.3 gso/s, 0 ulp142 gso/s, 0 ulp207 gso/s, 0 ulp
nk_hammings_packed_u1_v128relaxed266 gso/s378 gso/s426 gso/s
nk_hammings_symmetric_u1_v128relaxed72.2 gso/s185 gso/s259 gso/s
nk_jaccards_packed_u1_v128relaxed243 gso/s, 0 ulp370 gso/s, 0 ulp424 gso/s, 0 ulp
nk_jaccards_symmetric_u1_v128relaxed72.9 gso/s, 0 ulp183 gso/s, 0 ulp257 gso/s, 0 ulp