MaxSim Late-Interaction Scoring in NumKong

April 2, 2026

NumKong implements ColBERT-style late-interaction scoring: the MaxSim score sums, over each query token, the minimum angular distance to any document token. A two-stage coarse-to-fine strategy uses i8-quantized screening to find the best document token for each query token; full-precision refinement then computes the final angular distance for the winning pairs.

MaxSim score:

$$\text{MaxSim}(Q, D) = \sum_{i=0}^{m-1} \min_{j=0}^{n-1} \text{angular}(q_i, d_j)$$

Coarse screening finds the best document via i8 dot products as a proxy for argmin angular:

$$j^* = \arg\max_j \text{dot}_{\text{i8}}(q_i, d_j)$$

Full-precision refinement:

$$\text{angular}(q_i, d_{j^*}) = 1 - \frac{\text{dot}(q_i, d_{j^*})}{\|q_i\| \cdot \|d_{j^*}\|}$$

Reformulated as Python pseudocode (full precision only, without the i8 coarse stage):

import numpy as np

def maxsim(queries: np.ndarray, documents: np.ndarray) -> float:
    """Full-precision reference: sum over query tokens of the minimum
    angular distance to any document token."""
    score = 0.0
    for q in queries:
        # Argmax of the raw dot product stands in for argmin of angular
        # distance; the two agree when document norms are equal, which is
        # why the real kernels refine the winner at full precision.
        dots = documents @ q
        best = np.argmax(dots)
        d = documents[best]
        # Angular distance of the winning pair.
        angular = 1.0 - np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
        score += angular
    return score
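A minimal usage sketch; the shapes are illustrative, mirroring the 32×128×128 ColBERT row from the benchmark table below:

```python
import numpy as np

# Illustrative shapes: 32 query tokens vs 256 document tokens, dimension 128.
rng = np.random.default_rng(0)
queries = rng.standard_normal((32, 128)).astype(np.float32)
documents = rng.standard_normal((256, 128)).astype(np.float32)

print(maxsim(queries, documents))  # one scalar relevance score for this (Q, D) pair
```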

Input & Output Types

| Input Type | Output Type | Description |
|---|---|---|
| bf16 | f32 | 16-bit brain float, widened output |
| f32 | f32 | 32-bit IEEE 754 single precision |
| f16 | f32 | 16-bit IEEE 754 half precision |

Optimizations

Dual Pre-Packing

nk_maxsim_packed_bf16_sme and nk_maxsim_packed_f32_sme benefit from having both query and document matrices pre-packed into identical contiguous formats, unlike the nk_dots_packed_* family, where only B is pre-packed and A is accessed with arbitrary stride. In the dots GEMM, one ZA tile must be reserved for A-side staging (loading unpacked A rows into the tile array), leaving 3 ZA tiles for accumulation. With both sides pre-packed, all 4 ZA tiles (ZA0–ZA3) serve as accumulators, a +33% increase in MOPA throughput.

There is also no output-matrix materialization: dots_packed writes a full M×N f32 result matrix, while maxsim reduces each query row to a single argmax index in flight, eliminating the M×N memory round-trip.

Benchmark data (Apple M4, SVL = 512):

| Dimensions | dots_packed GEMM | maxsim fused | GEMM speedup | End-to-end |
|---|---|---|---|---|
| 32×128×128 (ColBERT) | 840 GFLOPS | 1516 GFLOPS | 1.81× | 5.10× |
| 32×256×128 | 1037 GFLOPS | 1591 GFLOPS | 1.53× | 5.17× |
| 64×512×128 | 1016 GFLOPS | 1651 GFLOPS | 1.62× | 5.42× |
| 32×128×256 | 859 GFLOPS | 1725 GFLOPS | 2.01× | 4.06× |
| 32×1024×768 (BERT) | 1124 GFLOPS | 1932 GFLOPS | 1.72× | 2.61× |

End-to-end speedup (2.6–5.4×) exceeds GEMM-only speedup (1.5–2×) because maxsim eliminates output materialization and fuses argmax + angular refinement into the tile extraction loop.
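To make the fusion argument concrete, here is a rough NumPy analogy, not the SME kernel itself; `maxsim_materialized` and `maxsim_fused` are illustrative names of mine:

```python
import numpy as np

def maxsim_materialized(queries: np.ndarray, documents: np.ndarray) -> np.ndarray:
    # Materializes the full m x n score matrix before reducing:
    # the extra memory round-trip that dots_packed pays.
    scores = queries @ documents.T            # m x n intermediate
    return scores.max(axis=1)                 # best score per query

def maxsim_fused(queries: np.ndarray, documents: np.ndarray, tile: int = 16) -> np.ndarray:
    # Walks the documents tile by tile, keeping only a running
    # per-query maximum; no m x n intermediate ever exists.
    best = np.full(queries.shape[0], -np.inf, dtype=queries.dtype)
    for start in range(0, documents.shape[0], tile):
        chunk = queries @ documents[start:start + tile].T   # m x tile
        best = np.maximum(best, chunk.max(axis=1))
    return best
```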

Two-Stage Coarse-to-Fine Scoring

All backends use i8-quantized coarse screening at O(m·n·k) with 1 byte/element instead of 2–4, followed by full-precision refinement at O(m·k) for only the winning pairs. Break-even sits at roughly 4 documents per query; beyond that, coarse screening dominates and the i8 bandwidth advantage compounds.
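A NumPy sketch of the two stages under simple assumptions: symmetric per-vector i8 quantization, with per-vector scales folded back in before the coarse argmax (the real kernels read those scales from packed metadata; all names here are illustrative):

```python
import numpy as np

def quantize_i8(x: np.ndarray, limit: int = 127):
    # Symmetric per-vector quantization: one scale per row.
    scale = limit / np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-30)
    q = np.clip(np.round(x * scale), -limit, limit).astype(np.int8)
    return q, scale

def two_stage_maxsim(queries: np.ndarray, documents: np.ndarray) -> float:
    # Coarse stage: O(m*n*k) integer dot products at 1 byte/element.
    q_i8, q_scale = quantize_i8(queries)
    d_i8, d_scale = quantize_i8(documents)
    coarse = q_i8.astype(np.int32) @ d_i8.astype(np.int32).T
    approx = coarse / (q_scale * d_scale.T)   # undo per-vector scales
    winners = approx.argmax(axis=1)

    # Fine stage: full-precision angular distance, O(m*k) total.
    score = 0.0
    for qi, ji in enumerate(winners):
        q, d = queries[qi], documents[ji]
        score += 1.0 - np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
    return float(score)
```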

ISA-Specific Quantization Ranges

Haswell uses [-79, 79]: VPMADDUBSW produces i16 intermediates from adjacent pairs, so the pair sum must stay below i16 saturation (after the XOR-0x80 bias shift described below, 2 · (128 + 79) · 79 = 32706 < 32767). Alder Lake and Ice Lake use [-127, 127]: VPDPBUSD accumulates directly to i32, with no i16 bottleneck. WASM v128relaxed uses [-63, 63], since i32x4_relaxed_dot_i8x16_i7x16_add requires 7-bit operands. Serial uses [-127, 127].
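A sketch of how the range limit would parameterize quantization; the dictionary and helper restate the ranges above but are illustrative, not NumKong's API:

```python
import numpy as np

# Symmetric quantization limits, per ISA, restated from the text above.
I8_LIMITS = {
    "haswell": 79,        # VPMADDUBSW: keep i16 pair sums below saturation
    "alder": 127,         # VPDPBUSD accumulates straight to i32
    "icelake": 127,
    "v128relaxed": 63,    # relaxed dot requires 7-bit operands
    "serial": 127,
}

def quantize_for(isa: str, v: np.ndarray):
    limit = I8_LIMITS[isa]
    scale = limit / max(float(np.abs(v).max()), 1e-30)   # guard all-zero input
    q = np.clip(np.round(v * scale), -limit, limit).astype(np.int8)
    return q, scale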

XOR-0x80 Bias Correction

nk_maxsim_packed_bf16_haswell, nk_maxsim_packed_f32_alder, and nk_maxsim_packed_f32_icelake work around the unsigned×signed operand requirement of VPMADDUBSW and VPDPBUSD. Both query and document are signed after quantization, so queries are XOR'd with 0x80 to shift them into the unsigned range. A post-multiply correction subtracts $128 \cdot \text{sum}_{\text{i8}}(d_j)$ per document, where the sums are precomputed in the packed metadata.
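The identity behind the correction is easy to check in NumPy: XOR-ing an i8 value with 0x80 equals adding 128 and reinterpreting as u8, so the unsigned×signed dot overshoots the true dot by exactly $128 \cdot \text{sum}_{\text{i8}}(d_j)$. A self-contained check (values and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.integers(-79, 80, size=128, dtype=np.int8)   # signed query, Haswell range
d = rng.integers(-79, 80, size=128, dtype=np.int8)   # signed document

# XOR with 0x80 == add 128 and reinterpret as unsigned:
q_u8 = (q.view(np.uint8) ^ 0x80).astype(np.int32)    # q + 128, now in [49, 207]

# The unsigned x signed dot, as VPMADDUBSW / VPDPBUSD compute it...
biased = int(q_u8 @ d.astype(np.int32))
# ...overshoots by exactly 128 * sum_i8(d), precomputed in packed metadata.
true_dot = int(q.astype(np.int32) @ d.astype(np.int32))
assert biased - 128 * int(d.astype(np.int32).sum()) == true_dot
```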

Vertical Column Extraction on SME

nk_maxsim_packed_bf16_sme and nk_maxsim_packed_f32_sme accumulate Q×D dot products into ZA tiles (4 tiles ZA0–ZA3, each SVL×SVL). The argmax step then needs to find the best document for each query.

The naive approach reads rows horizontally (svread_hor_za32) and reduces each row with svmaxv, but svmaxv is a horizontal reduction costing ~8 cycles on typical SVE implementations. Vertical column extraction flips the access pattern: svread_ver_za32_f32_m reads one column of ZA, returning one dot-product score per query for a single document. Element-wise svcmpgt_f32 + svsel_f32 (~1 cycle each) then update the running maximum across all queries simultaneously.

For 32 queries × 256 documents, the horizontal approach costs 32 × 256 = 8,192 svmaxv reductions, while the vertical approach costs 256 column reads plus 256 element-wise compare/selects (~270 cycles vs ~2,048 cycles for the argmax phase alone). The argmax index is tracked in-flight, using svsel to conditionally update an index vector alongside the maximum values, so no separate argmax pass is needed. Once the best document index per query is found, full-precision angular refinement uses the originals stored in the packed buffer's third region.
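The same running-max-with-index pattern is easy to express in NumPy, where np.where stands in for svcmpgt_f32 + svsel_f32; this models the access pattern only, not the SME code:

```python
import numpy as np

def argmax_in_flight(scores_tile: np.ndarray):
    # scores_tile: m x n dot products, a stand-in for the ZA tile contents.
    m, n = scores_tile.shape
    best_val = np.full(m, -np.inf, dtype=scores_tile.dtype)
    best_idx = np.zeros(m, dtype=np.int32)
    for j in range(n):                    # one svread_ver_za32 per column
        col = scores_tile[:, j]           # one score per query for document j
        mask = col > best_val             # svcmpgt_f32 analog
        best_val = np.where(mask, col, best_val)   # svsel_f32 analog
        best_idx = np.where(mask, j, best_idx)     # index tracked in-flight
    return best_idx, best_val
```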

Three-Region Packed Buffer

All backends use the three-region packed buffer layout: a 64-byte header, followed by three 64B-aligned regions holding the i8 vectors, the per-vector metadata, and the full-precision originals. Per-vector metadata (12 bytes) stores the quantization scale, the i8 sum (for bias correction), and the inverse norm (for angular finalization). The originals region stores full-precision vectors for refinement via the existing nk_dot_* primitives.
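A sketch of the offset arithmetic under the stated layout, with 1 byte per i8 element and 12 bytes of metadata per vector; the helper names are mine:

```python
def align64(offset: int) -> int:
    # Round up to the next 64-byte boundary.
    return (offset + 63) & ~63

def packed_layout(n_vectors: int, dim: int, scalar_bytes: int) -> dict:
    # [Header 64B][i8 vectors][metadata][originals], each region 64B-aligned.
    i8_off = align64(64)                                   # header is 64 bytes
    meta_off = align64(i8_off + n_vectors * dim)           # 1 byte per i8 element
    orig_off = align64(meta_off + n_vectors * 12)          # 12 B/vector: scale, i8 sum, inverse norm
    total = align64(orig_off + n_vectors * dim * scalar_bytes)
    return {"i8": i8_off, "metadata": meta_off, "originals": orig_off, "total": total}

# For example, 256 f32 vectors of dimension 128:
# packed_layout(256, 128, 4) -> {'i8': 64, 'metadata': 32832, 'originals': 35904, 'total': 166976}
```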

Performance

The following performance tables are produced by manually re-running the nk_test and nk_bench internal tools, which measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_MATRIX_HEIGHT, NK_MATRIX_WIDTH, and NK_MATRIX_DEPTH environment variables, all set to the same value for late-interaction scoring over square matrices. Columns show throughput for the 256³, 1024³, and 4096³ configurations.

Throughput is measured in GSO/s, Giga Scalar Operations per Second, with $\text{ops} = 2 \cdot M \cdot N \cdot K$ for scoring $M$ query tokens against $N$ document tokens of dimension $K$. Accuracy is reported as mean ULP (units in the last place) unless noted otherwise: the average number of representable floating-point values between the result and the exact answer.

Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.

Intel Sapphire Rapids

Native

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| **f32** | | | |
| nk_maxsim_packed_f32_serial | 15.7 gso/s, 48.9K ulp | 15.2 gso/s, 48.9K ulp | 16.3 gso/s, 48.9K ulp |
| nk_maxsim_packed_f32_haswell | 77.2 gso/s, 49.3K ulp | 70.7 gso/s, 49.3K ulp | 74.5 gso/s, 49.3K ulp |
| nk_maxsim_packed_f32_alder | 99.7 gso/s, 48.9K ulp | 97.7 gso/s, 48.9K ulp | 94.5 gso/s, 48.9K ulp |
| nk_maxsim_packed_f32_icelake | 131 gso/s, 48.9K ulp | 124 gso/s, 48.9K ulp | 136 gso/s, 48.9K ulp |
| nk_maxsim_packed_f32_sapphireamx | 273 gso/s, 48.9K ulp | 293 gso/s, 48.9K ulp | 285 gso/s, 48.9K ulp |
| **bf16** | | | |
| nk_maxsim_packed_bf16_serial | 15.9 gso/s, 49.0K ulp | 17.0 gso/s, 49.0K ulp | 15.3 gso/s, 49.0K ulp |
| nk_maxsim_packed_bf16_haswell | 79.2 gso/s, 49.3K ulp | 85.0 gso/s, 49.3K ulp | 81.0 gso/s, 49.3K ulp |
| nk_maxsim_packed_bf16_alder | 114 gso/s, 49.0K ulp | 110 gso/s, 49.0K ulp | 115 gso/s, 49.0K ulp |
| nk_maxsim_packed_bf16_genoa | 163 gso/s, 49.0K ulp | 165 gso/s, 49.0K ulp | 174 gso/s, 49.0K ulp |
| nk_maxsim_packed_bf16_sapphireamx | 418 gso/s, 994 ulp | 418 gso/s, 994 ulp | 445 gso/s, 994 ulp |
| **f16** | | | |
| nk_maxsim_packed_f16_serial | 15.5 gso/s, 49.4K ulp | 15.6 gso/s, 49.4K ulp | 16.9 gso/s, 49.4K ulp |
| nk_maxsim_packed_f16_haswell | 79.1 gso/s, 49.8K ulp | 78.1 gso/s, 49.8K ulp | 79.1 gso/s, 49.8K ulp |
| nk_maxsim_packed_f16_alder | 113 gso/s, 49.4K ulp | 112 gso/s, 49.4K ulp | 107 gso/s, 49.4K ulp |
| nk_maxsim_packed_f16_icelake | 154 gso/s, 49.4K ulp | 164 gso/s, 49.4K ulp | 163 gso/s, 49.4K ulp |
| nk_maxsim_packed_f16_sapphireamx | 339 gso/s, 49.5K ulp | 395 gso/s, 49.5K ulp | 381 gso/s, 49.5K ulp |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| **f32** | | | |
| nk_maxsim_packed_f32_serial | ? gso/s, 46.8K ulp | ? gso/s, 46.8K ulp | ? gso/s, 46.8K ulp |
| nk_maxsim_packed_f32_v128relaxed | ? gso/s, 1.58M ulp | ? gso/s, 1.58M ulp | ? gso/s, 1.58M ulp |
| **bf16** | | | |
| nk_maxsim_packed_bf16_serial | ? gso/s, 47.0K ulp | ? gso/s, 47.0K ulp | ? gso/s, 47.0K ulp |
| nk_maxsim_packed_bf16_v128relaxed | ? gso/s, 1.58M ulp | ? gso/s, 1.58M ulp | ? gso/s, 1.58M ulp |
| **f16** | | | |
| nk_maxsim_packed_f16_serial | ? gso/s, 46.4K ulp | ? gso/s, 46.4K ulp | ? gso/s, 46.4K ulp |
| nk_maxsim_packed_f16_v128relaxed | ? gso/s, 1.58M ulp | ? gso/s, 1.58M ulp | ? gso/s, 1.58M ulp |

Apple M4

Native

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| **f32** | | | |
| nk_maxsim_packed_f32_serial | 124 gso/s, 166K ulp | 136 gso/s, 104K ulp | 130 gso/s, 55.1K ulp |
| nk_maxsim_packed_f32_neonsdot | 170 gso/s, 167K ulp | 240 gso/s, 104K ulp | 167 gso/s, 55.1K ulp |
| nk_maxsim_packed_f32_sme | 291 gso/s, 200K ulp | 1,800 gso/s, 64.6K ulp | ? gso/s, ? ulp |
| **bf16** | | | |
| nk_maxsim_packed_bf16_serial | 135 gso/s, 167K ulp | 139 gso/s, 105K ulp | 132 gso/s, 54.8K ulp |
| nk_maxsim_packed_bf16_neonsdot | 192 gso/s, 167K ulp | 257 gso/s, 105K ulp | 161 gso/s, 54.8K ulp |
| nk_maxsim_packed_bf16_sme | 580 gso/s, 16.1K ulp | 1,620 gso/s, 735 ulp | ? gso/s, ? ulp |
| **f16** | | | |
| nk_maxsim_packed_f16_serial | 136 gso/s, 169K ulp | 140 gso/s, 104K ulp | 134 gso/s, 55.1K ulp |
| nk_maxsim_packed_f16_neonsdot | 193 gso/s, 166K ulp | 255 gso/s, 104K ulp | 172 gso/s, 55.1K ulp |
| nk_maxsim_packed_f16_sme | 573 gso/s, 16.0K ulp | 1,620 gso/s, 725 ulp | ? gso/s, ? ulp |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | 256³ | 1024³ | 4096³ |
|---|---|---|---|
| **f32** | | | |
| nk_maxsim_packed_f32_serial | 33.7 gso/s, 46.8K ulp | 35.0 gso/s, 46.8K ulp | 35.8 gso/s, 46.8K ulp |
| nk_maxsim_packed_f32_v128relaxed | 88.5 gso/s, 46.0K ulp | 98.1 gso/s, 46.0K ulp | 82.7 gso/s, 46.0K ulp |
| **bf16** | | | |
| nk_maxsim_packed_bf16_serial | 34.4 gso/s, 49.2K ulp | 35.1 gso/s, 49.2K ulp | 35.7 gso/s, 49.2K ulp |
| nk_maxsim_packed_bf16_v128relaxed | 92.3 gso/s, 49.4K ulp | 100 gso/s, 49.4K ulp | 83.2 gso/s, 49.4K ulp |
| **f16** | | | |
| nk_maxsim_packed_f16_serial | 33.8 gso/s, 49.5K ulp | 35.0 gso/s, 49.5K ulp | 35.7 gso/s, 49.5K ulp |
| nk_maxsim_packed_f16_v128relaxed | 87.0 gso/s, 49.3K ulp | 95.8 gso/s, 49.3K ulp | 82.3 gso/s, 49.3K ulp |