Set Similarity Measures in NumKong

April 2, 2026 · View on GitHub

NumKong implements set similarity functions for binary and integer vectors: Hamming distance measures the number of differing elements, while Jaccard distance measures the complement of the intersection-over-union ratio. These are fundamental to locality-sensitive hashing, MinHash sketches, and binary feature matching.

Hamming distance counts the number of positions where elements differ. For binary vectors packed as octets, this is the popcount of the XOR. For byte-level vectors, it counts the number of mismatched bytes:

hamming(a,b)=i=0n1[aibi]\text{hamming}(a, b) = \sum_{i=0}^{n-1} [a_i \neq b_i]

Jaccard distance measures the dissimilarity of two sets. For binary vectors, the intersection and union are computed via bitwise AND and OR with popcount:

jaccard(a,b)=1ABAB=1popcount(a&b)popcount(ab)\text{jaccard}(a, b) = 1 - \frac{|A \cap B|}{|A \cup B|} = 1 - \frac{\text{popcount}(a \mathbin{\&} b)}{\text{popcount}(a \mathbin{|} b)}

For word-level vectors (MinHash signatures), Jaccard similarity is the fraction of matching elements:

jaccard(a,b)=1i=0n1[ai=bi]n\text{jaccard}(a, b) = 1 - \frac{\sum_{i=0}^{n-1} [a_i = b_i]}{n}

Reformulating as Python pseudocode:

import numpy as np

def hamming_bits(a: np.ndarray, b: np.ndarray) -> int:
    return np.unpackbits(np.bitwise_xor(a, b)).sum()

def jaccard_bits(a: np.ndarray, b: np.ndarray) -> float:
    intersection = np.unpackbits(np.bitwise_and(a, b)).sum()
    union = np.unpackbits(np.bitwise_or(a, b)).sum()
    return 1 - intersection / union if union else 0

def jaccard_words(a: np.ndarray, b: np.ndarray) -> float:
    return 1 - np.mean(a == b)

Input & Output Types

Input TypeOutput TypeDescription
u1u32Binary Hamming distance, packed octets
u1f32Binary Jaccard distance, packed octets
u8u32Byte-level Hamming distance
u16f32Word-level Jaccard distance, 16-bit MinHash
u32f32Word-level Jaccard distance, 32-bit MinHash

Optimizations

Harley-Seal Carry-Save Adders for U1

nk_hamming_u1_haswell, nk_jaccard_u1_haswell amortize the cost of popcount by using Harley-Seal carry-save adder trees. Instead of computing popcount on every XOR/AND/OR result independently, three intermediate values are combined through a full-adder circuit:

ones  = a ^ b ^ c
twos  = (a & b) | (c & (a ^ b))

This circuit takes three popcount inputs and produces a ones and twos accumulator, where twos has double the weight of ones. By chaining two levels, a fours accumulator is also produced, so the actual VPSHUFB-based popcount is called only on the final accumulated ones, twos, and fours values. The total number of popcount operations is reduced by roughly a factor of three compared to computing popcount on every vector independently.

Native VPOPCNTQ on Ice Lake

nk_hamming_u1_icelake, nk_jaccard_u1_icelake use VPOPCNTQ on 512-bit vectors, which directly produces per-quadword population counts for 8 quadwords at once. This single instruction replaces the entire nibble-LUT + Harley-Seal pipeline used on Haswell. The kernels batch 16 vectors before horizontal reduction to minimize VPSADBW overhead, accumulating the per-quadword counts into a running total via VPADDQ.

Jaccard via Precomputed Norms

nk_jaccard_u1_haswell, nk_jaccard_u1_icelake exploit the identity AB=A+BAB|A \cup B| = |A| + |B| - |A \cap B| to avoid computing both AND-popcount and OR-popcount in the inner loop. When vector norms (popcount of each vector) are precomputed and passed via the streaming API, only the intersection popcount is needed per pair, halving the work in the critical path.

Byte Hamming via VPSADBW

nk_hamming_u8_haswell, nk_hamming_u8_icelake compute byte-level Hamming distance using XOR to produce per-byte difference indicators, then VPSADBW against zero to horizontally sum the nonzero bytes. XOR produces 0 for equal bytes and nonzero for different ones, and VPSADBW sums the absolute values of byte differences within each 64-bit lane. Since XOR results are either 0 or nonzero (not necessarily 1), the kernel masks XOR output through VPMIN with a vector of ones to clamp each byte to 0 or 1 before feeding VPSADBW.

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_DENSE_DIMENSIONS environment variable and set to 256, 1024, and 4096 elements. The throughput is measured in GB/s as the number of input bytes per second. Accuracy is reported where applicable as exact distance in the result representation; floating Jaccard rows are shown as mean ULP (units in last place). Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.

Intel Sapphire Rapids

Native

Kernel25610244096
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u1_serial2.30 gb/s2.62 gb/s2.54 gb/s
nk_jaccard_u1_serial1.35 gb/s, 0 ulp1.46 gb/s, 0 ulp1.50 gb/s, 0 ulp
nk_hamming_u1_haswell9.63 gb/s25.2 gb/s56.2 gb/s
nk_jaccard_u1_haswell5.24 gb/s, 0 ulp15.5 gb/s, 0 ulp27.0 gb/s, 0 ulp
nk_hamming_u1_icelake11.2 gb/s38.2 gb/s56.1 gb/s
nk_jaccard_u1_icelake6.46 gb/s, 0 ulp22.4 gb/s, 0 ulp33.3 gb/s, 0 ulp
u8░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u8_serial15.0 gb/s14.9 gb/s14.8 gb/s
nk_hamming_u8_haswell22.4 gb/s21.6 gb/s17.9 gb/s
nk_hamming_u8_icelake55.2 gb/s37.7 gb/s24.3 gb/s
u16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u16_serial27.8 gb/s, 0 ulp23.0 gb/s, 0 ulp19.2 gb/s, 0 ulp
nk_jaccard_u16_haswell22.2 gb/s, 0 ulp18.4 gb/s, 0 ulp13.7 gb/s, 0 ulp
nk_jaccard_u16_icelake54.2 gb/s, 0 ulp24.3 gb/s, 0 ulp20.9 gb/s, 0 ulp
u32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u32_serial33.1 gb/s, 0 ulp23.5 gb/s, 0 ulp18.3 gb/s, 0 ulp
nk_jaccard_u32_haswell19.0 gb/s, 0 ulp16.9 gb/s, 0 ulp11.0 gb/s, 0 ulp
nk_jaccard_u32_icelake33.0 gb/s, 0 ulp24.6 gb/s, 0 ulp16.3 gb/s, 0 ulp

WASM

Measured with Wasmtime v42 (Cranelift backend).

Kernel25610244096
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u1_v128relaxed0.138 gb/s0.149 gb/s0.979 gb/s
nk_jaccard_u1_v128relaxed0.153 gb/s, 0 ulp0.352 gb/s, 0 ulp2.50 gb/s, 0 ulp
u8░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u8_v128relaxed0.370 gb/s0.400 gb/s2.19 gb/s
u16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u16_v128relaxed2.30 gb/s, 0 ulp2.34 gb/s, 0 ulp0.381 gb/s, 0 ulp
u32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u32_v128relaxed0.430 gb/s, 0 ulp2.46 gb/s, 0 ulp1.08 gb/s, 0 ulp

Apple M5

Native

Kernel25610244096
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u1_serial6.79 gb/s7.48 gb/s6.92 gb/s
nk_jaccard_u1_serial4.36 gb/s, 0 ulp5.38 gb/s, 0 ulp5.45 gb/s, 0 ulp
nk_hamming_u1_neon31.6 gb/s65.6 gb/s90.9 gb/s
nk_jaccard_u1_neon28.4 gb/s, 0 ulp48.1 gb/s, 0 ulp51.0 gb/s, 0 ulp
u8░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u8_serial27.8 gb/s30.1 gb/s31.2 gb/s
nk_hamming_u8_neon96.9 gb/s79.5 gb/s56.3 gb/s
u16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u16_serial59.3 gb/s, 0 ulp69.4 gb/s, 0 ulp66.8 gb/s, 0 ulp
nk_jaccard_u16_neon67.8 gb/s, 0 ulp61.6 gb/s, 0 ulp50.8 gb/s, 0 ulp
u32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u32_serial105 gb/s, 0 ulp101 gb/s, 0 ulp89.1 gb/s, 0 ulp
nk_jaccard_u32_neon89.3 gb/s, 0 ulp72.8 gb/s, 0 ulp68.2 gb/s, 0 ulp

WASM

Measured with Wasmtime v43 (Cranelift backend).

Kernel25610244096
u1░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u1_serial5.18 gb/s5.66 gb/s6.52 gb/s
nk_jaccard_u1_serial1.74 gb/s, 0 ulp3.32 gb/s, 0 ulp3.61 gb/s, 0 ulp
nk_hamming_u1_v128relaxed22.6 gb/s46.5 gb/s67.9 gb/s
nk_jaccard_u1_v128relaxed16.1 gb/s, 0 ulp34.5 gb/s, 0 ulp50.8 gb/s, 0 ulp
u8░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_hamming_u8_serial8.32 gb/s6.09 gb/s5.84 gb/s
nk_hamming_u8_v128relaxed47.7 gb/s68.5 gb/s72.1 gb/s
u16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u16_serial19.2 gb/s, 0 ulp12.4 gb/s, 0 ulp11.9 gb/s, 0 ulp
nk_jaccard_u16_v128relaxed89.8 gb/s, 0 ulp74.0 gb/s, 0 ulp71.3 gb/s, 0 ulp
u32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_jaccard_u32_serial91.6 gb/s, 0 ulp69.4 gb/s, 0 ulp68.4 gb/s, 0 ulp
nk_jaccard_u32_v128relaxed94.8 gb/s, 0 ulp76.2 gb/s, 0 ulp68.8 gb/s, 0 ulp