NumKong Built-in Benchmarks
April 2, 2026
Internal profiling suite comparing NumKong's SIMD backends against each other and optionally against BLAS libraries. For broader comparisons — Rust, Python, etc. — see NumWars.
- On x86 it compares serial code to manually-vectorized SIMD kernels targeting Haswell, Skylake, Ice Lake, Genoa, Sapphire Rapids, and newer generations.
- On Arm it compares serial code to manually-vectorized NEON, SVE, SVE2, SME, SME2 with various extensions for BF16 and mixed-precision dot-products.
- On RISC-V it compares serial code to manually-vectorized RVV 1.0 kernels with and without the bit-manipulation (Zvbb), BF16, and F16 extensions.
- In WASM environments it compares serial code to manually-vectorized V128 kernels with Relaxed SIMD extensions.
C++
Building
cmake -B build_release -D CMAKE_BUILD_TYPE=Release -D NK_BUILD_BENCH=1
cmake --build build_release --config Release --parallel
With BLAS or MKL cross-validation:
cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
-D NK_BUILD_BENCH=1 \
-D NK_COMPARE_TO_BLAS=1 \
-D NK_COMPARE_TO_MKL=1
On macOS with Homebrew Clang and OpenBLAS, follow the full recipe in CONTRIBUTING.md and add -DNK_BUILD_BENCH=1 to the cmake flags.
Compiler requirements vary by ISA target — see CONTRIBUTING.md for the full table.
Running
build_release/nk_bench # run all benchmarks
build_release/nk_bench --benchmark_filter=dot # filter by name
build_release/nk_bench --benchmark_min_time=10s # longer runs for stable results
build_release/nk_bench --filter=dot # shorthand for --benchmark_filter
Environment Variables
| Variable | Default | Description |
|---|---|---|
| NK_FILTER | .* | Regex to filter benchmarks by name |
| NK_SEED | 42 | RNG seed for reproducible inputs |
| NK_BUDGET_SECS | 10 | Minimum time per benchmark in seconds |
| NK_BUDGET_MB | 1024 | Memory budget for pre-allocated inputs |
| NK_DENSE_DIMENSIONS | 1536 | Vector dimension for dot/spatial benchmarks |
| NK_CURVED_DIMENSIONS | 64 | Vector dimension for curved / bilinear form benchmarks |
| NK_MESH_POINTS | 1000 | Point count for mesh / RMSD / Kabsch benchmarks |
| NK_MATRIX_HEIGHT | 1024 | GEMM M dimension; dataset size in kNN |
| NK_MATRIX_WIDTH | 128 | GEMM N dimension; query count in kNN |
| NK_MATRIX_DEPTH | 1536 | GEMM K dimension; vector dimension in kNN |
| NK_SPARSE_FIRST_LENGTH | 1024 | First set size for sparse benchmarks |
| NK_SPARSE_SECOND_LENGTH | 8192 | Second set size for sparse benchmarks |
| NK_SPARSE_INTERSECTION | 0.5 | Intersection share [0.0, 1.0] for sparse benchmarks |
| NK_MAX_COORD_ANGLE | 180 | Maximum angle in degrees for geospatial benchmarks |
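For example, to benchmark only dot products over 768-dimensional vectors with a fixed seed (binary path as built above):

```shell
NK_FILTER="dot" NK_DENSE_DIMENSIONS=768 NK_SEED=7 build_release/nk_bench
```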
Disable multi-threading in BLAS libraries to avoid interference:
export OPENBLAS_NUM_THREADS=1 # for OpenBLAS
export MKL_NUM_THREADS=1 # for Intel MKL
export VECLIB_MAXIMUM_THREADS=1 # for Apple Accelerate
export BLIS_NUM_THREADS=1 # for BLIS
Reported Units
| Benchmark Type | Counter | Meaning |
|---|---|---|
| Vector kernels — dot, spatial, set, ... | bytes/s | Bytes of input consumed per second, both input vectors combined |
| GEMM, symmetric, batch | scalar-ops/s | Scalar multiply and add operations per second (FLOPS) |
| Reductions, casts, trigonometry | bytes/s | Bytes of input consumed per second, single input vector |
- bytes: total bytes across all input vectors read per call. For a pair of 1536-dimensional f32 vectors: 2 * 1536 * 4 = 12288 bytes per call.
- scalar-ops: number of scalar arithmetic operations. For dense GEMM: 2 * M * N * K per call. For symmetric GEMM: N * (N + 1) * K per call.
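As a sanity check, the per-call figures can be reproduced with a few lines of arithmetic using the default NK_* values from the table above (a sketch, not how the binary computes them):

```python
# Per-call work for the default NK_* configuration.
dims = 1536                # NK_DENSE_DIMENSIONS
m, n, k = 1024, 128, 1536  # NK_MATRIX_HEIGHT, NK_MATRIX_WIDTH, NK_MATRIX_DEPTH
f32 = 4                    # bytes per f32 scalar

# Vector kernels: bytes across both input vectors per call.
pair_bytes = 2 * dims * f32
print(pair_bytes)          # 12288

# Dense GEMM: one multiply plus one add per (m, n, k) triple.
dense_ops = 2 * m * n * k
print(dense_ops)           # 402653184

# Symmetric GEMM: only one triangle of the output is computed.
symmetric_ops = n * (n + 1) * k
print(symmetric_ops)       # 25362432
```

Dividing a reported bytes/s or scalar-ops/s figure by the matching per-call count gives calls per second.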
JavaScript
Running
npm run bench:native # Node.js native addon
npm run bench:emscripten # Emscripten WASM with SIMD
npm run bench:wasi # WASI portable execution
npm run bench:browser # Chromium via Playwright
npm run bench:all # all runtimes
NK_DIMENSIONS=768 NK_FILTER="dot" npm run bench:native # custom config
| Variable | Default | Description |
|---|---|---|
NK_DIMENSIONS | 1536 | Vector dimensionality |
NK_ITERATIONS | 1000 | Number of benchmark iterations |
NK_FILTER | .* | Regex to filter benchmarks |
NK_RUNTIME | native | Runtime: native, emscripten, wasi |
NK_SEED | 42 | Random seed for reproducible data |
Output
JSON results are written to bench/results/.
Generate a Markdown comparison report:
npm run bench:report
cat bench/results/report.md
WASM
Emscripten — wasm32 and wasm64
source ~/emsdk/emsdk_env.sh
cmake -B build-wasm -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm.cmake -DNK_BUILD_BENCH=1
cmake --build build-wasm --parallel
For wasm64:
cmake -B build-wasm64 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm64.cmake -DNK_BUILD_BENCH=1
cmake --build build-wasm64 --parallel
The toolchain files enable -msimd128 and -mrelaxed-simd automatically.
WASI
export WASI_SDK_PATH=~/wasi-sdk-24.0-x86_64-linux
cmake -B build-wasi -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasi.cmake -DNK_BUILD_BENCH=1
cmake --build build-wasi --parallel
Running
wasmtime run -W simd=y,relaxed-simd=y,threads=y,shared-memory=y -S threads=y,inherit-env=y ./build-wasi/nk_bench.wasm
wasmer run --enable-simd --enable-relaxed-simd ./build-wasi/nk_bench.wasm
node ./build-wasm/nk_bench.js
Browser benchmarks via Playwright:
npm run bench:browser
Interpreting WASM Results
WASM benchmarks run slower than native due to JIT compilation overhead and memory indirection. Expected performance relative to native:
| Runtime | Typical Throughput vs Native |
|---|---|
| Emscripten / Node.js | 60–80% |
| WASI / Wasmtime | 50–70% |
| Browser / Chromium | 40–60% |
wasm64 / Memory64 adds ~5–10% overhead vs wasm32 due to 64-bit pointer arithmetic.
Relaxed SIMD provides measurable gains for fused multiply-add patterns — compare runs with relaxed SIMD enabled and disabled to quantify.
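For example, under Wasmtime the same binary can be timed with the feature on and off (flags as in the Running section above):

```shell
wasmtime run -W simd=y,relaxed-simd=y ./build-wasi/nk_bench.wasm
wasmtime run -W simd=y,relaxed-simd=n ./build-wasi/nk_bench.wasm
```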
Frequency Scaling on AMX and SME
Intel AMX tiles on Sapphire Rapids and later cause P-state throttling: the CPU reduces its frequency when AMX instructions execute, similar to heavy AVX-512 workloads. Arm SME streaming mode, for example on Apple M4, has analogous frequency effects when entering and exiting streaming SVE mode.
This means AMX/SME benchmarks that interleave with non-AMX/SME work will show misleading throughput numbers as the CPU oscillates between frequency states.
Mitigations:
- Use --benchmark_min_time=10s or higher to amortize warm-up over a longer measurement window.
- Disable turbo boost with echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo on Linux.
- Run AMX/SME benchmarks in isolation — do not mix them with non-AMX/SME benchmarks in the same invocation.
- Filter with --benchmark_filter=amx or --benchmark_filter=sme for dedicated runs.
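Putting these together, a dedicated AMX run on Linux might look like this (assuming the build path used above and root access for the turbo toggle):

```shell
# Disable turbo, then measure AMX kernels alone with a long window.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
build_release/nk_bench --benchmark_filter=amx --benchmark_min_time=10s
```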
Pinning to Performance Cores
Linux
taskset -c 0-3 ./build_release/nk_bench
numactl --physcpubind=0-3 ./build_release/nk_bench
For dedicated benchmarking machines, add isolcpus=4-7 to the kernel command line and pin benchmarks to isolated cores.
macOS
No direct core-pinning API exists on macOS, and on Apple Silicon there is no public way to pin to P- or E-cores. Avoid taskpolicy -b, which clamps the process to background QoS and schedules it onto efficiency cores. Launch the benchmark as a regular foreground process and keep background load minimal for reproducible results:
./build_release/nk_bench
Windows
start /affinity 0xF nk_bench.exe
The hex mask 0xF pins to cores 0-3.
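The affinity mask is a bitset over core indices: bit i set means core i is allowed. A small hypothetical helper for building masks for arbitrary core sets:

```python
def affinity_mask(cores):
    """Build a Windows-style affinity bitmask: bit i set = core i allowed."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask

print(hex(affinity_mask(range(4))))      # 0xf  -> cores 0-3, as in the example above
print(hex(affinity_mask([4, 5, 6, 7])))  # 0xf0 -> the next group of four cores
```

Pass the result to start, e.g. start /affinity 0xf0 nk_bench.exe for cores 4-7.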