Contributing
April 13, 2026
To keep the quality of the code high, we follow the coding style and conventions shared across multiple projects — covering Git history, C++ and Python formatting, dependency management, and documentation.
Directory Tree
- `include/numkong/` — C and C++ headers; one `.h` per kernel family, one `.hpp` per C++ API
- `include/numkong/*/` — per-ISA kernel implementations: `serial`, `haswell`, `neon`, `rvv`, `sme`, etc.
- `c/` — runtime dispatch layer; one `dispatch_*.c` per dtype
- `test/` — C++ precision tests; see test/README.md
- `bench/` — C++ Google Benchmark suite and JS bench runner; see bench/README.md
- `python/` — CPython extension, no SWIG or PyBind11
- `javascript/` — Node.js native addon, Emscripten WASM, and TypeScript API
- `rust/` — Rust FFI bindings
- `swift/` — Swift Package Manager bindings
- `golang/` — Go cgo bindings
- `cmake/` — toolchain files for cross-compilation: WASM, WASI, RISC-V, AArch64
C and C++
Building
cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
-D NK_BUILD_TEST=1 \
-D NK_BUILD_BENCH=1 \
-D NK_COMPARE_TO_BLAS=1
cmake --build build_release --config Release --parallel
build_release/nk_bench
build_release/nk_test
| CMake Flag | Default | Description |
|---|---|---|
| `NK_BUILD_TEST` | OFF | Compile precision tests with ULP error analysis |
| `NK_BUILD_BENCH` | OFF | Compile micro-benchmarks |
| `NK_BUILD_SHARED` | ON, if top-level | Compile dynamic library |
| `NK_BUILD_SHARED_TEST` | OFF | Compile tests against the shared library |
| `NK_COMPARE_TO_BLAS` | AUTO | Include OpenBLAS in test/bench comparisons |
| `NK_COMPARE_TO_ACCELERATE` | AUTO | Include Apple's Accelerate in test/bench comparisons |
| `NK_COMPARE_TO_MKL` | AUTO | Include Intel's MKL in test/bench comparisons |
| `NK_MARCH_NATIVE` | OFF | Tune for host CPU with `-march=native` |
Target Baseline Policy
CMakeLists.txt, build.rs, setup.py, and binding.gyp pin the TU-level baseline to each architecture's ABI floor so distributable artifacts run on any CPU matching the ABI, not just the build host.
SIMD kernels live inside #pragma GCC target(...) regions and are only called after runtime probing — see the README's Compile-Time and Run-Time Dispatch section.
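The baseline-plus-pragma pattern can be sketched as follows. This is an illustrative example of the technique, not NumKong's actual symbols or probing code: the serial fallback compiles at the ABI floor, the AVX2 kernel is only legal inside its `#pragma GCC target` region, and a runtime probe picks the widest kernel the running CPU supports.

```c
#include <stddef.h>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#endif

/* Serial fallback: compiled at the ABI-floor baseline. */
static float dot_serial(float const *a, float const *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i != n; ++i) sum += a[i] * b[i];
    return sum;
}

#if defined(__x86_64__) && defined(__GNUC__)
#pragma GCC push_options
#pragma GCC target("avx2")
/* SIMD kernel: AVX2 instructions are legal only inside this target region,
   so the translation unit's baseline can stay at plain x86-64. */
static float dot_avx2(float const *a, float const *b, size_t n) {
    __m256 sum_f32x8 = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        sum_f32x8 = _mm256_add_ps(sum_f32x8, _mm256_mul_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    float partial[8];
    _mm256_storeu_ps(partial, sum_f32x8);
    float sum = partial[0] + partial[1] + partial[2] + partial[3] + //
                partial[4] + partial[5] + partial[6] + partial[7];
    for (; i != n; ++i) sum += a[i] * b[i]; /* scalar tail */
    return sum;
}
#pragma GCC pop_options
#endif

/* Runtime probe: called once, returns the widest kernel the CPU supports. */
typedef float (*dot_fn_t)(float const *, float const *, size_t);
static dot_fn_t resolve_dot(void) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) return dot_avx2;
#endif
    return dot_serial;
}
```

Because the AVX2 body never executes unless the probe succeeds, the resulting binary runs on any CPU matching the ABI floor.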
| Target arch | GCC/Clang baseline | MSVC baseline | Notes |
|---|---|---|---|
| x86_64 | `-march=x86-64` | `/arch:SSE2` | System V psABI / Microsoft x64 ABI floor; SSE2 is mandatory |
| aarch64 | `-march=armv8-a` | `/arch:armv8.0` | ARMv8-A ABI floor; NEON is mandatory |
| riscv64 | `-march=rv64gc` | n/a | V extension is runtime-probed and dispatched |
| powerpc64le | `-mcpu=power8` | n/a | ELFv2 ABI floor; VSX is mandatory |
| loongarch64 | `-march=loongarch64 -mlasx` | n/a | LASX baked into the baseline — see LoongArch note below |
GCC/Clang builds also pass -fno-tree-vectorize -fno-tree-slp-vectorize so the auto-vectorizer cannot promote serial fallbacks to baseline SIMD (NEON, SSE2, VSX, …).
That keeps the tiered dispatch design intact: "serial" kernels stay actually serial, and the per-pragma SIMD kernels — which use explicit intrinsics, not vectorized scalar code — are the sole source of SIMD emission.
MSVC has no per-function target pragma and no command-line vectorizer toggle, so the explicit /arch: flags above match defaults and document intent only; NumKong's MSVC strategy is compile-time gating via _MSC_VER version checks (see include/numkong/types.h).
LoongArch is the one arch that can't honor the per-function-pragma model: __attribute__((target("lasx"))) and #pragma GCC target("lasx") only landed in GCC 15.1 (Feb 2025) and Clang 22.1 (May 2025), and the bundled lasxintrin.h gates every wrapper on the __loongarch_asx macro that those older toolchains only set via TU-level -mlasx.
Until NumKong's minimum supported toolchain catches up, LoongArch artifacts require LASX-capable hardware (LA464+, c. 2021).
Package.swift and golang/numkong.go do not pin baselines: SPM forbids .unsafeFlags() on remotely consumed targets, and the cgo bindings rely on the surrounding compiler default.
For host-tuned local builds, set NK_MARCH_NATIVE=1 (env var honored by build.rs and setup.py; CMake option -DNK_MARCH_NATIVE=ON).
The resulting artifact bakes host-specific instructions into scaffolding code and is not portable.
Compiler Requirements
| ISA Family | GCC | Clang | AppleClang | MSVC |
|---|---|---|---|---|
| Base — serial, NEON, AVX2 | 9+ | 10+ | Any | 2019+ |
| Float16 — NEONHalf, Sapphire FP16, Zvfh | 12+ | 16+ | Any | 2022 17.14+ |
| AVX-512 — Skylake, Ice Lake | 9+ | 10+ | N/A | 2019+ |
| AVX-512BF16 — Genoa | 12+ | 16+ | N/A | 2022 17.14+ |
| Intel AMX — Sapphire, Granite | 14+ | 18+ | N/A | 2022 17.14+ |
| Arm SME/SME2 | 14+ | 18+ | 16+ / Xcode 16 | N/A |
| RISC-V Vector — RVV 1.0 | 13+ | 17+ | N/A | N/A |
| RVV + Zvfh/Zvfbfwma/Zvbb | 14+ | 18+ | N/A | N/A |
To install on Ubuntu 22.04:
sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100
Cross-Compilation
NumKong ships 8 toolchain files in cmake/ for cross-compiling to non-native targets.
Tests and benchmarks run transparently under QEMU via CMAKE_CROSSCOMPILING_EMULATOR.
| Target | Toolchain File | Emulator | Prerequisites |
|---|---|---|---|
| ARM64 Linux | toolchain-aarch64-gnu.cmake | qemu-aarch64 -cpu max | gcc-aarch64-linux-gnu, qemu-user |
| RISC-V 64 GCC | toolchain-riscv64-gnu.cmake | qemu-riscv64 -cpu max | riscv-gnu-toolchain, qemu-user |
| RISC-V 64 LLVM | toolchain-riscv64-llvm.cmake | qemu-riscv64 -cpu max | LLVM 17+, RISCV_SYSROOT |
| Android ARM64 | toolchain-android-arm64.cmake | — | ANDROID_NDK_ROOT |
| x86_64 from Apple Silicon | toolchain-x86_64-llvm.cmake | arch -x86_64 | Homebrew LLVM |
| WASM Emscripten | toolchain-wasm.cmake | Node.js | Emscripten 3.1.27+ |
| WASM64 Memory64 | toolchain-wasm64.cmake | Node.js | Emscripten 3.1.35+ |
| WASI | toolchain-wasi.cmake | Wasmtime / Wasmer | WASI SDK 24+ |
Set NK_IN_QEMU=1 to relax half-precision accuracy thresholds under emulation.
ARM64 Linux
cmake -B build_arm64 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-aarch64-gnu.cmake
cmake --build build_arm64 --parallel
To build and run tests under emulation, see test/README.md.
Default arch: armv9-a+sve2+fp16+bf16+i8mm+dotprod+fp16fml.
RISC-V 64 with GCC
cmake -B build_riscv -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-riscv64-gnu.cmake
cmake --build build_riscv --parallel
To build and run tests under emulation, see test/README.md.
Default arch: rv64gcv_zvfh_zvfbfwma_zvbb.
RISC-V 64 with LLVM
export RISCV_SYSROOT=/path/to/riscv-sysroot
cmake -B build_riscv_llvm -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-riscv64-llvm.cmake
cmake --build build_riscv_llvm --parallel
To build and run tests under emulation, see test/README.md.
Android ARM64
cmake -B build_android -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-android-arm64.cmake
cmake --build build_android --parallel
To build and run tests under emulation, see test/README.md.
WASM via Emscripten
source ~/emsdk/emsdk_env.sh
cmake -B build-wasm -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm.cmake
cmake --build build-wasm --parallel
For wasm64 — Memory64:
cmake -B build-wasm64 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm64.cmake
cmake --build build-wasm64 --parallel
WASI
export WASI_SDK_PATH=~/wasi-sdk-24.0-x86_64-linux
cmake -B build-wasi -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasi.cmake
cmake --build build-wasi --parallel
iOS Simulator via Xcode
xcodebuild test -scheme NumKong -destination 'platform=iOS Simulator,name=iPhone 16'
x86_64 from Apple Silicon
cmake -B build_x86 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-x86_64-llvm.cmake
cmake --build build_x86 --parallel
macOS
With Apple Clang and Homebrew OpenBLAS:
brew install openblas
cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
-D NK_BUILD_TEST=1 \
-D NK_BUILD_BENCH=1 \
-D NK_COMPARE_TO_BLAS=1 \
-D CMAKE_PREFIX_PATH="$(brew --prefix openblas)" \
-D CMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES="$(brew --prefix openblas)/include"
cmake --build build_release --config Release --parallel
With Homebrew Clang — recommended for full ISA support:
brew install llvm openblas
unset DEVELOPER_DIR
cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
-D NK_BUILD_TEST=1 \
-D NK_BUILD_BENCH=1 \
-D NK_COMPARE_TO_BLAS=1 \
-D CMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES="$(brew --prefix openblas)/include" \
-D CMAKE_C_LINK_FLAGS="-L$(xcrun --sdk macosx --show-sdk-path)/usr/lib" \
-D CMAKE_EXE_LINKER_FLAGS="-L$(xcrun --sdk macosx --show-sdk-path)/usr/lib" \
-D CMAKE_C_COMPILER="$(brew --prefix llvm)/bin/clang" \
-D CMAKE_CXX_COMPILER="$(brew --prefix llvm)/bin/clang++" \
-D CMAKE_OSX_SYSROOT="$(xcrun --sdk macosx --show-sdk-path)" \
-D CMAKE_OSX_DEPLOYMENT_TARGET=$(sw_vers -productVersion)
cmake --build build_release --config Release --parallel
BLAS Threading
When benchmarking with BLAS cross-validation, disable multi-threading in BLAS libraries to avoid interference — see bench/README.md for the *_NUM_THREADS variables.
Debugger Breakpoints
Useful breakpoints for debugging:
- `__asan::ReportGenericError` — illegal memory accesses.
- `__GI_exit` — exit points at the end of any executable.
- `__builtin_unreachable` — unexpected code paths.
- `_sz_assert_failure` — StringZilla logic assertions.
See test/README.md for test framework details and bench/README.md for benchmark configuration.
Static Analysis & Formatting
Once done editing the code, please run analyzers and formatters:
git ls-files '*.h' '*.c' '*.hpp' '*.cpp' | xargs clang-format -i # Use Clang Format 21 or newer
Python
Python bindings are implemented against the bare CPython C API, so you don't need to install SWIG, PyBind11, or any other third-party binding library. You do need a virtual environment. If you already have one:
pip install -e . # build locally from source
pip install pytest pytest-repeat pytest-randomly # testing dependencies
pip install numpy scipy ml_dtypes tabulate # optional reference libraries
pytest test/ -s -x -Wd # to run tests
# to check supported SIMD instructions:
python -c "import numkong; print(numkong.get_capabilities())"
Alternatively, use uv to create the virtual environment.
uv venv --python 3.13t # or your preferred version
source .venv/bin/activate # activate the environment
uv pip install -e . # build locally from source
# to run GIL-related tests in a free-threaded environment:
uv pip install pytest pytest-repeat pytest-randomly numpy scipy ml_dtypes tabulate
PYTHON_GIL=0 python -m pytest test/ -s -x -Wd -k gil
Here, -s prints logs to the console, -x stops on the first failure, and -Wd enables Python's default warning filters, surfacing deprecation and runtime warnings that are otherwise hidden.
When building on macOS, as with C/C++, use a non-Apple Clang:
brew install llvm libomp
CC=$(brew --prefix llvm)/bin/clang CXX=$(brew --prefix llvm)/bin/clang++ pip install -e .
Wheels pin a portable per-arch baseline by default — see Target Baseline Policy.
For host-tuned local installs, run NK_MARCH_NATIVE=1 pip install -e . (the resulting build is not redistributable).
Before merging, you may want to test your changes against the entire matrix of Python versions NumKong supports.
For that you need cibuildwheel, which is tricky to use on macOS and Windows, as it targets only the local environment there.
Still, if you have Docker running on any desktop OS, you can use it to build and test the Python bindings for every supported Python version on Linux:
pip install cibuildwheel
cibuildwheel
cibuildwheel --platform linux # works on any OS and builds all Linux backends
cibuildwheel --platform linux --archs x86_64 # 64-bit x86, the most common on desktop and servers
cibuildwheel --platform linux --archs aarch64 # 64-bit Arm for mobile devices, Apple M-series, and AWS Graviton
cibuildwheel --platform linux --archs i686 # 32-bit Linux
cibuildwheel --platform macos # works only on macOS
cibuildwheel --platform windows # works only on Windows
You may need root privileges for multi-architecture builds:
sudo $(which cibuildwheel) --platform linux
On Windows and macOS, to avoid frequent path resolution issues, you may want to use:
python -m cibuildwheel --platform windows
Static Analysis & Formatting
Once done editing the code, please run analyzers and formatters:
ruff check test/ # linting
black . # format with default settings
Rust
cargo test -p numkong
cargo test -p numkong -- --nocapture # to see the output
NK_MARCH_NATIVE=1 cargo build --release # for host-tuned local builds
The crate pins a portable per-arch baseline by default — see Target Baseline Policy.
To automatically detect the Minimum Supported Rust Version — MSRV:
cargo +stable install cargo-msrv
cargo msrv find --ignore-lockfile
Please avoid using macros in this Rust code.
JavaScript
See javascript/README.md for JavaScript/TypeScript development, WASM support, and API documentation.
Quick reference:
npm run build-js # Build TypeScript
npm test # Run tests
npm run bench # Run benchmarks
Swift
swift build && swift test -v
Running Swift on Linux requires a couple of extra steps, as the Swift compiler is not available in the default repositories. Please get the most recent Swift tarball from the official website. At the time of writing, for a 64-bit Arm CPU running Ubuntu 22.04, the following commands would work:
wget https://download.swift.org/swift-5.9.2-release/ubuntu2204-aarch64/swift-5.9.2-RELEASE/swift-5.9.2-RELEASE-ubuntu22.04-aarch64.tar.gz
tar xzf swift-5.9.2-RELEASE-ubuntu22.04-aarch64.tar.gz
sudo mv swift-5.9.2-RELEASE-ubuntu22.04-aarch64 /usr/share/swift
echo "export PATH=/usr/share/swift/usr/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
You can check the available images on the swift.org/download page.
For x86 CPUs, the following commands would work:
wget https://download.swift.org/swift-5.9.2-release/ubuntu2204/swift-5.9.2-RELEASE/swift-5.9.2-RELEASE-ubuntu22.04.tar.gz
tar xzf swift-5.9.2-RELEASE-ubuntu22.04.tar.gz
sudo mv swift-5.9.2-RELEASE-ubuntu22.04 /usr/share/swift
echo "export PATH=/usr/share/swift/usr/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
Alternatively, on Linux, the official Swift Docker image can be used for builds and tests:
sudo docker run --rm -v "$PWD:/workspace" -w /workspace swift:5.9 /bin/bash -cl "swift build -c release --static-swift-stdlib && swift test -c release --enable-test-discovery"
GoLang
cd golang
go test # To test
go test -run=^$ -bench=. -benchmem # To benchmark
Adding a New Kernel Family
To add a new operation family, for example `foo`:

- C header: create `include/numkong/foo.h` with the serial implementation and dispatch function signatures.
- ISA implementations: add `include/numkong/foo/serial.h`, `foo/neon.h`, `foo/haswell.h`, etc.
- Dispatch layer: add entries to the appropriate `c/dispatch_*.c` files for each dtype the kernel supports.
- C++ wrapper: create `include/numkong/foo.hpp` with the typed C++ API.
- Test: create `test/test_foo.cpp` with precision validation against `f118_t` references.
- Benchmark: create `bench/bench_foo.cpp` with a Google Benchmark harness.
- Cross-platform tests: add entries to `test/test_cross.hpp` and the relevant `test_cross_*.cpp` files.
- CMakeLists.txt: wire the new source files into the `nk_test` and `nk_bench` targets.
- Language bindings: update `python/numkong.c`, `javascript/numkong.c`, `rust/numkong.rs`, etc. as needed.
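The first step's header might be shaped roughly like the sketch below. The names and the operation itself (L1 distance as a stand-in "foo") are hypothetical; real headers also carry Doxygen comments, `NK_PUBLIC` markers, and per-dtype variants:

```c
/* Hypothetical sketch of an include/numkong/foo.h layout; all names are illustrative. */
#include <stddef.h>

/* Serial reference implementation: one per dtype the family supports. */
static void nk_foo_f32_serial(float const *a, float const *b, size_t n, float *result) {
    float sum = 0.0f;
    for (size_t i = 0; i != n; ++i) {
        float difference = a[i] - b[i];
        sum += difference < 0 ? -difference : difference; /* L1 distance as a stand-in "foo" */
    }
    *result = sum;
}

/* Dispatch-facing signature: the c/dispatch_*.c layer resolves a pointer of
   this type to the widest ISA-specific kernel at runtime. */
typedef void (*nk_foo_f32_t)(float const *, float const *, size_t, float *);
```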
Adding a Backend Kernel to an Existing Family
For primary kernels, every backend implementation should be wired in five places beyond the backend header itself:
- Forward declaration: add the
NK_PUBLICdeclaration with the matching@copydocin the first half ofinclude/numkong/<family>.h. - Compile-time dispatch: add the
#if !NK_DYNAMIC_DISPATCHbranch in the second half ofinclude/numkong/<family>.h. - Run-time dispatch: add the dtype-specific entry to the relevant
c/dispatch_*.ctable. - Precision tests: register the kernel in
nk_test, usually in the existingtest/test_<family>.cppsuite. - Benchmarks: register the kernel in
nk_bench, usually in the existingbench/bench_<family>.cppsuite.
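The run-time dispatch step amounts to a capability-keyed table lookup. Here is a simplified sketch of that pattern, with made-up capability flags and kernel names; NumKong's real `c/dispatch_*.c` files use their own capability enum and per-dtype tables:

```c
#include <stddef.h>

/* Illustrative capability bits; real code would query them from CPUID or similar. */
typedef unsigned nk_caps_t;
enum { caps_serial_k = 0, caps_neon_k = 1u << 0, caps_haswell_k = 1u << 1 };

typedef void (*nk_kernel_t)(void const *, void const *, size_t, void *);
static void foo_f32_serial(void const *a, void const *b, size_t n, void *out) { (void)a, (void)b, (void)n, (void)out; }
static void foo_f32_haswell(void const *a, void const *b, size_t n, void *out) { (void)a, (void)b, (void)n, (void)out; }

typedef struct {
    nk_caps_t required_caps;
    nk_kernel_t kernel;
} dispatch_entry_t;

/* Ordered widest-first: the first entry whose requirements are met wins. */
static dispatch_entry_t const foo_f32_table[] = {
    {caps_haswell_k, foo_f32_haswell},
    {caps_serial_k, foo_f32_serial},
};

static nk_kernel_t resolve(dispatch_entry_t const *table, size_t count, nk_caps_t available) {
    for (size_t i = 0; i != count; ++i)
        if ((table[i].required_caps & available) == table[i].required_caps) return table[i].kernel;
    return NULL;
}
```

Adding a backend then means adding one `{caps, kernel}` row above the entries it should outrank.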
Use the existing family suite unless the kernel introduces a genuinely new test shape. The rule is about coverage and reachability, not about creating a brand new source file for every symbol.
There are two intentional exceptions:
- `cast`: the family-level `nk_cast_*` kernels follow the same header/dispatch/test/bench rule, but scalar conversion helpers are wired through `c/dispatch_other.c` and are covered through `test/test_cast.cpp` and `bench/bench_cast.cpp`.
- `scalar`: scalar helpers are centrally declared in `include/numkong/scalar.h`, wired through `c/dispatch_other.c`, and currently do not follow the per-helper `nk_test` and `nk_bench` registration pattern.
Wording & Styling
A lot of effort goes into keeping the wording and styling of the code consistent. Variable names must reflect the semantic operation, not just the intrinsic name.
Variable Names & Type Suffixes
Reading mixed-precision kernels can be very confusing when different wide registers encode numbers differently. So most of the kernel code encodes the inner register representation into the symbol name:
- Fixed-width ISAs (NEON, x86, WASM) use
<name>_<dtype>x<count>variable naming convention — e.g.sum_f32x4,a_f64x2,query_f64x8. - SVE uses
<name>_<dtype>xwith no count, since VL is runtime — e.g.a_f32x,accumulator_f64x. - RVV uses
<name>_<dtype>m<lmul>for the LMUL register-group multiplier — e.g.a_f32m1,sum_f64m2.
For the <name> part, prefer full words over abbreviations: accumulator instead of acc, sum instead of s, low & high instead of lo & hi.
Regardless of the intrinsic name used to produce a value, the variable name should reflect its relation to surrounding code.
A good example is naming upcasted register halves. With `svunpklo` & `svunpkhi` in SVE, `vget_low` & `vget_high` in NEON, or `_mm256_extractf128` in x86, the values are contiguous halves of the register — so we call them `low` and `high`:

```c
svfloat32_t values_low_f32x = svreinterpret_f32_u32(svlsl_n_u32_x(p, svunpklo_u32(raw), 16));
svfloat32_t values_high_f32x = svreinterpret_f32_u32(svlsl_n_u32_x(p, svunpkhi_u32(raw), 16));
```

But `svcvt` & `svcvtlt` in SVE select interleaved even/odd elements, not contiguous halves. Using `low` & `high` here would mislead reviewers into assuming a different control flow. So we compensate for the non-expressive intrinsic name with a more accurate variable name:

```c
svfloat32_t values_even_f32x = svcvt_f32_f16_x(pred_even_b32x, values_f16x);  // elements 0,2,4,...
svfloat32_t values_odd_f32x = svcvtlt_f32_f16_x(pred_odd_b32x, values_f16x);  // elements 1,3,5,...
```

Similarly, in AMX tile-based GEMMs, the A matrix is split into a top half and a bottom half, while B tiles cover left and right halves. Using `high` & `low` would suggest register halves; `top` & `bottom` reflect the spatial role in the matrix multiplication:

```c
_tile_loadd(0, a_tile_top, a_stride_bytes);    // A top rows
_tile_loadd(1, a_tile_bottom, a_stride_bytes); // A bottom rows
_tile_loadd(2, b_tile_left, 64);               // B left columns
_tile_loadd(3, b_tile_right, 64);              // B right columns
```
For the <dtype> part, values like u8, bf16, f64c, i4, and e3m2 are used — except where the type doesn't matter, such as predicate masks, loads, and stores.
Those use b32 or b8, reflecting the number of bits in each mask element.
For scalar variables, similar preferences for cleaner and longer variable names apply:
- Loop variables use
ifor simple loops;row_tile_index,column_tile_index,depth_stepfor nested tile loops. - Matrix / GEMM dimensions use
rows,columns,depth— never single-letterm,n,k. - Tile terminology is descriptive:
tile_dimension,row_in_tile,column_within_tile. - Element counts are explicit about what's counted:
count_scalars,count_pairs. - Strides explicitly mention the units:
a_stride_in_bytes,a_stride_elements = a_stride_in_bytes / sizeof(nk_f16_t).
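Put together, a tile loop following these conventions reads like the contrived serial sketch below; it is not a real NumKong kernel, just the naming applied end to end:

```c
#include <stddef.h>

/* Contrived example: sums a row-major matrix tile by tile, using the
   descriptive loop, dimension, and stride names recommended above. */
static float sum_matrix(float const *matrix, size_t rows, size_t columns, size_t stride_in_bytes) {
    size_t const stride_elements = stride_in_bytes / sizeof(float);
    size_t const tile_dimension = 4;
    float sum = 0.0f;
    for (size_t row_tile_index = 0; row_tile_index < rows; row_tile_index += tile_dimension)
        for (size_t column_tile_index = 0; column_tile_index < columns; column_tile_index += tile_dimension)
            for (size_t row_in_tile = 0; //
                 row_in_tile < tile_dimension && row_tile_index + row_in_tile < rows; ++row_in_tile)
                for (size_t column_within_tile = 0; //
                     column_within_tile < tile_dimension && column_tile_index + column_within_tile < columns;
                     ++column_within_tile)
                    sum += matrix[(row_tile_index + row_in_tile) * stride_elements + //
                                  (column_tile_index + column_within_tile)];
    return sum;
}
```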
Intrinsic Style
Prefer explicit named intrinsics over implicit syntax or manual bit manipulation.
Power VSX uses vec_xl(), vec_xst() — never implicit Altivec vector operators.
x86 AVX-512 uses _mm512_mask_* K-mask intrinsics — never manual bitwise ops on __mmask16.
When hardware has no intrinsic, wrap raw assembly in an NK_INTERNAL helper and document the instruction mnemonic:
NK_INTERNAL void nk_sme_start_streaming_(void) {
__asm__ __volatile__("smstart sm" ::: "memory");
}
Function Naming
Public API: nk_<operation>_<dtype>_<isa> — e.g. nk_dot_f32_sve, nk_angular_f16_sme.
Internal helpers use a trailing underscore: nk_reduce_add_f32x16_skylake_.
Conversions: nk_<src>x<count>_to_<dst>x<count>_<isa>_ — e.g. nk_e4m3x8_to_f32x8_haswell_.
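Applied to a trivial serial kernel, the naming pattern looks like this; the symbols are hypothetical examples following the convention, not actual NumKong exports:

```c
#include <stddef.h>

/* Internal helper: trailing underscore marks it as non-public,
   following the nk_<operation>_<dtype>_<isa>_ pattern. */
static float nk_reduce_add_f32_serial_(float const *values, size_t count) {
    float sum = 0.0f;
    for (size_t i = 0; i != count; ++i) sum += values[i];
    return sum;
}

/* Public API symbol: nk_<operation>_<dtype>_<isa>, no trailing underscore. */
void nk_sum_f32_serial(float const *values, size_t count, float *result) {
    *result = nk_reduce_add_f32_serial_(values, count);
}
```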