NumKong Built-in Benchmarks

April 2, 2026

Internal profiling suite comparing NumKong's SIMD backends against each other and optionally against BLAS libraries. For broader comparisons — Rust, Python, etc. — see NumWars.

  • On x86 it compares serial code to manually-vectorized Haswell, Skylake, Ice Lake, Genoa, Sapphire Rapids and newer-generation SIMD kernels.
  • On Arm it compares serial code to manually-vectorized NEON, SVE, SVE2, SME, SME2 with various extensions for BF16 and mixed-precision dot-products.
  • On RISC-V it compares serial code to manually-vectorized RVV 1.0 kernels with and without BB, BF16, and F16 extensions.
  • In WASM environments it compares serial code to manually-vectorized V128 kernels with Relaxed SIMD extensions.

C++

Building

cmake -B build_release -D CMAKE_BUILD_TYPE=Release -D NK_BUILD_BENCH=1
cmake --build build_release --config Release --parallel

With BLAS or MKL cross-validation:

cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
      -D NK_BUILD_BENCH=1 \
      -D NK_COMPARE_TO_BLAS=1 \
      -D NK_COMPARE_TO_MKL=1

On macOS with Homebrew Clang and OpenBLAS, follow the full recipe in CONTRIBUTING.md and add -D NK_BUILD_BENCH=1 to the CMake flags.

Compiler requirements vary by ISA target — see CONTRIBUTING.md for the full table.

Running

build_release/nk_bench                                    # run all benchmarks
build_release/nk_bench --benchmark_filter=dot             # filter by name
build_release/nk_bench --benchmark_min_time=10s           # longer runs for stable results
build_release/nk_bench --filter=dot                       # shorthand for --benchmark_filter

Environment Variables

| Variable | Default | Description |
|---|---|---|
| NK_FILTER | .* | Regex to filter benchmarks by name |
| NK_SEED | 42 | RNG seed for reproducible inputs |
| NK_BUDGET_SECS | 10 | Minimum time per benchmark in seconds |
| NK_BUDGET_MB | 1024 | Memory budget in MB for pre-allocated inputs |
| NK_DENSE_DIMENSIONS | 1536 | Vector dimension for dot/spatial benchmarks |
| NK_CURVED_DIMENSIONS | 64 | Vector dimension for curved / bilinear form benchmarks |
| NK_MESH_POINTS | 1000 | Point count for mesh / RMSD / Kabsch benchmarks |
| NK_MATRIX_HEIGHT | 1024 | GEMM M dimension, dataset size in kNN |
| NK_MATRIX_WIDTH | 128 | GEMM N dimension, query count in kNN |
| NK_MATRIX_DEPTH | 1536 | GEMM K dimension, vector dimension in kNN |
| NK_SPARSE_FIRST_LENGTH | 1024 | First set size for sparse benchmarks |
| NK_SPARSE_SECOND_LENGTH | 8192 | Second set size for sparse benchmarks |
| NK_SPARSE_INTERSECTION | 0.5 | Intersection share in [0.0, 1.0] for sparse benchmarks |
| NK_MAX_COORD_ANGLE | 180 | Maximum angle in degrees for geospatial benchmarks |
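For example, a hypothetical invocation that shrinks the sparse problem and fixes the seed exports the variables before launching the binary (variable names from the table above; the nk_bench path assumes the default build layout):

```shell
# Shrink the sparse benchmark problem and fix the RNG seed.
export NK_SEED=7
export NK_SPARSE_FIRST_LENGTH=256
export NK_SPARSE_SECOND_LENGTH=2048
export NK_SPARSE_INTERSECTION=0.25
# Verify what the benchmark process will inherit:
env | grep '^NK_' | sort
# Then launch as usual:
# build_release/nk_bench --benchmark_filter=sparse
```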

Disable multi-threading in BLAS libraries to avoid interference:

export OPENBLAS_NUM_THREADS=1    # for OpenBLAS
export MKL_NUM_THREADS=1         # for Intel MKL
export VECLIB_MAXIMUM_THREADS=1  # for Apple Accelerate
export BLIS_NUM_THREADS=1        # for BLIS

Reported Units

| Benchmark Type | Counter | Meaning |
|---|---|---|
| Vector kernels (dot, spatial, set, ...) | bytes/s | Bytes of input consumed per second, both input vectors combined |
| GEMM, symmetric, batch | scalar-ops/s | Scalar multiply-accumulate operations per second (FLOPS) |
| Reductions, casts, trigonometry | bytes/s | Bytes of input consumed per second, single input vector |

bytes: total bytes across all input vectors read per call. For a pair of 1536-dimensional f32 vectors: 2 * 1536 * 4 = 12288 bytes per call.

scalar-ops: number of scalar arithmetic operations. For dense GEMM: 2 * M * N * K per call. For symmetric GEMM: N * (N + 1) * K per call.
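As a sanity check, the per-call counters for the default shapes above can be reproduced with shell arithmetic (the numbers are just the formulas instantiated, not library output):

```shell
# bytes per call: a pair of 1536-dimensional f32 vectors.
DIMS=1536; F32_BYTES=4
echo $((2 * DIMS * F32_BYTES))       # 12288 bytes
# scalar-ops per call for the default GEMM shape M=1024, N=128, K=1536.
M=1024; N=128; K=1536
echo $((2 * M * N * K))              # 402653184 dense GEMM ops
echo $((N * (N + 1) * K))            # 25362432 symmetric GEMM ops
```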

JavaScript

Running

npm run bench:native                            # Node.js native addon
npm run bench:emscripten                        # Emscripten WASM with SIMD
npm run bench:wasi                              # WASI portable execution
npm run bench:browser                           # Chromium via Playwright
npm run bench:all                               # all runtimes
NK_DIMENSIONS=768 NK_FILTER="dot" npm run bench:native    # custom config
Environment Variables

| Variable | Default | Description |
|---|---|---|
| NK_DIMENSIONS | 1536 | Vector dimensionality |
| NK_ITERATIONS | 1000 | Number of benchmark iterations |
| NK_FILTER | .* | Regex to filter benchmarks |
| NK_RUNTIME | native | Runtime: native, emscripten, wasi |
| NK_SEED | 42 | Random seed for reproducible data |

Output

JSON results are written to bench/results/. Generate a Markdown comparison report:

npm run bench:report
cat bench/results/report.md

WASM

Emscripten — wasm32 and wasm64

source ~/emsdk/emsdk_env.sh
cmake -B build-wasm -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm.cmake -DNK_BUILD_BENCH=1
cmake --build build-wasm --parallel

For wasm64:

cmake -B build-wasm64 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm64.cmake -DNK_BUILD_BENCH=1
cmake --build build-wasm64 --parallel

The toolchain files enable -msimd128 and -mrelaxed-simd automatically.

WASI

export WASI_SDK_PATH=~/wasi-sdk-24.0-x86_64-linux
cmake -B build-wasi -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasi.cmake -DNK_BUILD_BENCH=1
cmake --build build-wasi --parallel

Running

wasmtime run -W simd=y,relaxed-simd=y,threads=y,shared-memory=y -S threads=y,inherit-env=y ./build-wasi/nk_bench.wasm
wasmer run --enable-simd --enable-relaxed-simd ./build-wasi/nk_bench.wasm
node ./build-wasm/nk_bench.js

Browser benchmarks via Playwright:

npm run bench:browser

Interpreting WASM Results

WASM benchmarks run slower than native due to JIT compilation overhead and memory indirection. Expected performance relative to native:

| Runtime | Typical Throughput vs Native |
|---|---|
| Emscripten / Node.js | 60–80% |
| WASI / Wasmtime | 50–70% |
| Browser / Chromium | 40–60% |

wasm64 / Memory64 adds roughly 5–10% overhead vs wasm32 due to 64-bit pointer arithmetic. Relaxed SIMD provides measurable gains for fused multiply-add patterns; compare runs with relaxed SIMD enabled and disabled (in Wasmtime, -W relaxed-simd=y vs -W relaxed-simd=n) to quantify the difference.

Frequency Scaling on AMX and SME

Intel AMX tiles on Sapphire Rapids and later cause P-state throttling: the CPU reduces its frequency when AMX instructions execute, similar to heavy AVX-512 workloads. Arm SME streaming mode on Graviton4 and Apple M4 has analogous frequency effects when entering and exiting streaming SVE mode.

This means AMX/SME benchmarks that interleave with non-AMX/SME work will show misleading throughput numbers as the CPU oscillates between frequency states.

Mitigations:

  • Use --benchmark_min_time=10s or higher to amortize warm-up over a longer measurement window.
  • Disable turbo boost with echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo on Linux (requires root).
  • Run AMX/SME benchmarks in isolation — do not mix them with non-AMX/SME benchmarks in the same invocation.
  • Filter with --benchmark_filter=amx or --benchmark_filter=sme for dedicated runs.

Pinning to Performance Cores

Linux

taskset -c 0-3 ./build_release/nk_bench
numactl --physcpubind=0-3 ./build_release/nk_bench

For dedicated benchmarking machines, add isolcpus=4-7 to the kernel command line and pin benchmarks to isolated cores.

macOS

No direct core-pinning API exists on macOS; the scheduler places work based on QoS class. Note that taskpolicy -b applies the background policy, which steers the process toward efficiency cores, so it is unsuitable for benchmarking. Launch benchmarks from a regular foreground shell at default QoS, which keeps them eligible for performance cores.

On Apple Silicon there is no public API for P/E core pinning. Run with minimal background load for reproducible results.

Windows

start /affinity 0xF nk_bench.exe

The hex mask 0xF pins to cores 0-3.
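Since the mask is one bit per logical core, masks for other core ranges follow from simple bit arithmetic (a sketch of the computation, shown in POSIX shell rather than as a Windows command):

```shell
# Each bit N in the affinity mask selects logical core N.
printf '0x%X\n' $(( (1 << 4) - 1 ))          # cores 0-3 -> 0xF
printf '0x%X\n' $(( ((1 << 4) - 1) << 4 ))   # cores 4-7 -> 0xF0
printf '0x%X\n' $(( (1 << 8) - 1 ))          # cores 0-7 -> 0xFF
```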