NumKong for Go
March 23, 2026 · View on GitHub
NumKong's Go binding gives you native SIMD-accelerated kernels without building a custom cGo shim. It covers dot products, dense distances, geospatial helpers, packed matrix kernels, symmetric self-similarity, binary set metrics, probability metrics, and MaxSim late interaction through a slice-based API that fits naturally into Go's numeric idioms.
Quickstart
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
a := []float32{1, 2, 3}
b := []float32{4, 5, 6}
fmt.Println(nk.DotF32(a, b)) // 32 (returned as float64)
fmt.Println(nk.AngularF32(a, b)) // cosine distance (returned as float64)
}
Highlights
Slice-based API.
Plain Go slices are the input model.
Widened outputs.
int8 and float32 storage widens into safer return types.
Packed matrix kernels.
GEMM-like batch workloads with pack-once-reuse-many semantics via PackedMatrix.
Symmetric self-similarity.
SYRK-like kernels that skip duplicate (i, j) and (j, i) work.
MaxSim late interaction.
ColBERT-style scoring with pre-packed queries and documents via MaxSimPacked.
Binary set metrics.
Hamming and Jaccard on packed bit vectors and set-hash vectors.
Probability metrics.
Kullback-Leibler divergence and Jensen-Shannon distance.
Geospatial included.
Haversine* and Vincenty* are part of the same package.
Capability bits exposed.
You can inspect the runtime SIMD surface from Go.
Ecosystem Comparison
| Feature | NumKong | GoNum |
|---|---|---|
| Operation families | dots, distances, binary, probability, geospatial, MaxSim | dots, distances, some statistics |
| Precision | BFloat16 through sub-byte; automatic widening; Kahan summation; 0 ULP in Float32/Float64 | Float64 only; standard accuracy |
| Runtime SIMD dispatch | auto-selects best ISA per-thread at runtime across x86, ARM, RISC-V | no runtime dispatch; some hand-written assembly routines |
| Packed matrix, GEMM-like | pack once, reuse across query batches via PackedMatrix | mat.Dense.Mul — no persistent packing |
| Symmetric kernels, SYRK-like | skips duplicate pairs, up to 2x speedup for self-distance | no duplicate-pair skipping |
| Memory model | slice-based, caller-owned; cGo zero-copy pointer passing | allocates internally in many functions |
| Host-side parallelism | reusable WorkerPool for packed and symmetric batch ops | partial — gonum/optimize has some parallel support |
Installation
The Go binding compiles the C library from headers at go build time via cGo.
No pre-compiled shared library is required — just a C compiler.
Import the subpackage from the root module:
import nk "github.com/ashvardanian/NumKong/golang"
The module path is github.com/ashvardanian/NumKong.
The Go binding lives under github.com/ashvardanian/NumKong/golang.
CGO must be enabled (the default). Any C11-capable compiler works: GCC, Clang, or MSVC.
Dot Products
Dot products cover float64, float32, int8, and uint8.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
a64 := []float64{1, 2, 3}
b64 := []float64{4, 5, 6}
fmt.Println(nk.DotF64(a64, b64))
a32 := []float32{1, 2, 3}
b32 := []float32{4, 5, 6}
fmt.Println(nk.DotF32(a32, b32)) // widened to float64
a8 := []int8{1, 2, 3, 4}
b8 := []int8{4, 3, 2, 1}
fmt.Println(nk.DotI8(a8, b8)) // widened to int32
au := []uint8{1, 2, 3, 4}
bu := []uint8{4, 3, 2, 1}
fmt.Println(nk.DotU8(au, bu)) // widened to uint32
}
DotF32 returns float64.
DotI8 returns int32.
DotU8 returns uint32.
Those widened outputs are deliberate.
Dense Distances
The dense distance family includes squared Euclidean, Euclidean, and angular distance.
Each metric supports float64, float32, int8, and uint8.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
a := []float64{1, 2, 3, 4}
b := []float64{4, 3, 2, 1}
fmt.Println(nk.SqEuclideanF64(a, b))
fmt.Println(nk.EuclideanF64(a, b))
fmt.Println(nk.AngularF64(a, b))
}
The int8 and uint8 families widen their outputs.
AngularF32 and EuclideanF32 return float64.
AngularI8, AngularU8, EuclideanI8, EuclideanU8 return float32.
SqEuclideanI8 and SqEuclideanU8 return uint32.
Binary Metrics
Hamming and Jaccard distances for binary and set-hash vectors.
// Byte-level Hamming distance
a := []uint8{1, 2, 3, 4}
b := []uint8{1, 0, 3, 5}
dist := nk.HammingU8(a, b) // 2
// Bit-level Hamming distance (packed binary vectors)
x := []byte{0xFF}
y := []byte{0x0F}
bits := nk.HammingU1(x, y, 8) // 4
// Bit-level Jaccard distance
jd := nk.JaccardU1(x, y, 8) // 0.5
// Set-hash Jaccard (MinHash-style)
h16a := []uint16{1, 2, 3, 4}
h16b := []uint16{3, 4, 5, 6}
nk.JaccardU16(h16a, h16b)
h32a := []uint32{10, 20, 30}
h32b := []uint32{30, 40, 50}
nk.JaccardU32(h32a, h32b)
Probability Metrics
Kullback-Leibler divergence and Jensen-Shannon distance for probability distributions.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
p := []float64{0.25, 0.25, 0.25, 0.25}
q := []float64{0.1, 0.2, 0.3, 0.4}
fmt.Println(nk.KullbackLeiblerF64(p, q)) // KL divergence
fmt.Println(nk.JensenShannonF64(p, q)) // JS distance (symmetric)
}
KullbackLeiblerF64 and JensenShannonF64 take []float64 and return float64.
KullbackLeiblerF32 and JensenShannonF32 take []float32 and return float64 (widened).
Geospatial Metrics
The Go package also exposes Haversine and Vincenty helpers. Inputs are in radians. Outputs are written into caller-owned slices.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
// Statue of Liberty (40.6892°N, 74.0445°W) → Big Ben (51.5007°N, 0.1246°W)
libertyLat := []float64{0.7101605100}
libertyLon := []float64{-1.2923203180}
bigBenLat := []float64{0.8988567821}
bigBenLon := []float64{-0.0021746802}
distance := make([]float64, 1)
nk.VincentyF64(libertyLat, libertyLon, bigBenLat, bigBenLon, distance) // ≈ 5,589,857 m (ellipsoidal, baseline)
nk.HaversineF64(libertyLat, libertyLon, bigBenLat, bigBenLon, distance) // ≈ 5,543,723 m (spherical, ~46 km less)
fmt.Println(distance[0])
}
The output slice is caller-owned. That keeps allocation behavior explicit and predictable.
Packed Matrix Kernels for GEMM-Like Workloads
Packed kernels are the main batch-throughput path.
The PackedMatrix struct wraps the packed buffer with its dimensions and dtype, providing type safety.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
height, width, depth := 4, 8, 16
a := make([]float32, height*depth) // 4 query vectors of dimension 16
b := make([]float32, width*depth) // 8 database vectors of dimension 16
// Fill with sample data
for i := range a { a[i] = float32(i % 7) }
for i := range b { b[i] = float32(i % 5) }
// Pack the right-hand side (once, reuse across batches)
bPacked := nk.NewPackedMatrixF32(b, width, depth)
// Compute A × Bᵀ
c := make([]float64, height*width)
nk.DotsPackedF32(a, bPacked, c, height)
// Angular and Euclidean distances use the same PackedMatrix
angDist := make([]float64, height*width)
nk.AngularsPackedF32(a, bPacked, angDist, height)
eucDist := make([]float64, height*width)
nk.EuclideansPackedF32(a, bPacked, eucDist, height)
fmt.Println(c[:width]) // first row of the result matrix
}
DotsPackedF64 takes []float64 and produces []float64.
DotsPackedF32 takes []float32 and produces []float64 (widened).
DotsPackedI8 takes []int8 and produces []int32 (widened).
DotsPackedU8 takes []uint8 and produces []uint32 (widened).
The same widening pattern applies to AngularsPacked* and EuclideansPacked* variants.
Symmetric Kernels for SYRK-Like Workloads
Symmetric kernels compute self-similarity or self-distance matrices.
They skip duplicate (i, j) and (j, i) pairs, filling only the upper triangle and mirroring the result.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
n, depth := 4, 8
vectors := make([]float32, n*depth)
for i := range vectors { vectors[i] = float32(i % 5) }
// Gram matrix: all-pairs dot products
gram := make([]float64, n*n)
nk.DotsSymmetricF32(vectors, n, depth, gram)
// Angular distance matrix
angDist := make([]float64, n*n)
nk.AngularsSymmetricF32(vectors, n, depth, angDist)
// Euclidean distance matrix
eucDist := make([]float64, n*n)
nk.EuclideansSymmetricF32(vectors, n, depth, eucDist)
fmt.Println("gram[0]:", gram[0])
fmt.Println("angular[0,1]:", angDist[1])
fmt.Println("euclidean[0,1]:", eucDist[1])
}
Available symmetric variants: DotsSymmetric{F64,F32,I8,U8}, AngularsSymmetric{F64,F32,I8,U8}, EuclideansSymmetric{F64,F32,I8,U8}.
Binary Packed and Symmetric Kernels
Binary vectors use []byte storage where depth is the number of bits.
Packing uses NewPackedMatrixU1.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
n, depth := 4, 64 // 4 vectors of 64 bits each
bytesPerVec := (depth + 7) / 8
vectors := make([]byte, n*bytesPerVec)
for i := range vectors { vectors[i] = byte(i * 37) }
// Pack for batch queries
cols := 2
queryVectors := vectors[:cols*bytesPerVec]
queryPacked := nk.NewPackedMatrixU1(queryVectors, cols, depth)
// Hamming distances: n database vectors × cols query vectors
hammingResult := make([]uint32, n*cols)
nk.HammingsPackedU1(vectors, queryPacked, hammingResult, n)
// Jaccard distances
jaccardResult := make([]float32, n*cols)
nk.JaccardsPackedU1(vectors, queryPacked, jaccardResult, n)
// Symmetric Hamming distance matrix
hammingSym := make([]uint32, n*n)
nk.HammingsSymmetricU1(vectors, n, depth, hammingSym)
// Symmetric Jaccard distance matrix
jaccardSym := make([]float32, n*n)
nk.JaccardsSymmetricU1(vectors, n, depth, jaccardSym)
fmt.Println("hamming packed:", hammingResult[:cols])
fmt.Println("jaccard symmetric[0,1]:", jaccardSym[1])
}
MaxSim and ColBERT-Style Late Interaction
MaxSim is the late-interaction primitive used by systems such as ColBERT.
It computes the sum of per-query-token maximum cosine similarities across document tokens.
The result is an angular distance: sum(1 - max_cosine).
The MaxSimPacked struct wraps packed vectors with their metadata.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
queryTokens, docTokens, depth := 4, 8, 16
queries := make([]float32, queryTokens*depth)
docs := make([]float32, docTokens*depth)
for i := range queries { queries[i] = float32(i%5) + 1 }
for i := range docs { docs[i] = float32(i%3) + 1 }
// Pack both sides
qPacked := nk.NewMaxSimPackedF32(queries, queryTokens, depth)
dPacked := nk.NewMaxSimPackedF32(docs, docTokens, depth)
// Compute MaxSim score
score := nk.MaxSimF32(qPacked, dPacked)
fmt.Println("MaxSim score:", score)
}
Parallel Batch Processing
WorkerPool provides pre-pinned goroutines with pre-configured SIMD state.
Create once, reuse across batch calls, close when done.
The pool amortizes ConfigureThread + LockOSThread cost across all batch operations.
package main
import (
"fmt"
nk "github.com/ashvardanian/NumKong/golang"
)
func main() {
width, depth, totalQueries := 1024, 128, 10000
db := make([]float32, width*depth)
queries := make([]float32, totalQueries*depth)
for i := range db { db[i] = float32(i%7) * 0.1 }
for i := range queries { queries[i] = float32(i%11) * 0.1 }
dbPacked := nk.NewPackedMatrixF32(db, width, depth)
// Create a reusable pool (defaults to GOMAXPROCS workers)
pool := nk.NewWorkerPool(8)
defer pool.Close()
// Packed batch operations dispatch to the pool
results := make([]float64, totalQueries*width)
dbPacked.DotsF32WithPool(queries, results, totalQueries, pool)
// Angular and Euclidean distances use the same pool
angResults := make([]float64, totalQueries*width)
dbPacked.AngularsF32WithPool(queries, angResults, totalQueries, pool)
fmt.Println("first result row:", results[:width])
}
Symmetric operations also support pool-based parallelism:
n, depth := 1000, 128
vectors := make([]float32, n*depth)
gram := make([]float64, n*n)
pool := nk.NewWorkerPool(0) // 0 = GOMAXPROCS
defer pool.Close()
nk.DotsSymmetricF32WithPool(vectors, n, depth, gram, pool)
For one-off parallel work without a pool, you can still use ConfigureThread directly:
go func() {
defer nk.ConfigureThread()() // lock thread + configure SIMD; defer unlocks
nk.DotsPackedF32(queries, dbPacked, results, height)
}()
Thread Configuration and Capabilities
ConfigureThread pins the current goroutine to an OS thread via runtime.LockOSThread, enables CPU-specific acceleration features such as Intel AMX, then returns an unlock function.
Goroutines can migrate between OS threads, so thread-local state (AMX tiles) would be lost without pinning.
defer nk.ConfigureThread()() // auto-detect, lock thread, defer unlock
defer nk.ConfigureThreadWith(caps)() // explicit capability mask variant
caps := nk.Capabilities() // inspect SIMD surface
Idiomatic usage is defer nk.ConfigureThread()() — the first () calls ConfigureThread (which locks the thread and returns the unlock function), the second () is deferred and calls the unlock function when the surrounding function returns.
ConfigureThreadWith lets you narrow the enabled feature set.
The package also exposes capability bit constants, like CapSerial, CapNeon, CapHaswell, CapSkylake, CapSapphire, CapSapphireAmx, and CapSme.
These are useful for logging the active platform or gating optional benchmark paths.
cGo Integration Notes
This package is a cGo wrapper over the C library. That means a few rules matter:
- Input slices must have matching lengths where the API expects paired vectors.
- Length mismatches and insufficient slice capacity panic uniformly across all functions.
- Empty slices return zero for scalar outputs rather than crashing.
- The slice backing arrays remain owned by Go.
PackedMatrixandMaxSimPackedstructs own their packed buffers and carry dimensions and dtype metadata.- Constructors validate that input slices are large enough for the given dimensions.
- Batch functions validate both input and output slice sizes.
- Symmetric output matrices must be
n × nin size.
Memory Safety
Go automatically pins slice backing arrays for the duration of each cGo call.
No runtime.Pinner or manual pinning is needed from the caller.
PackedMatrix and MaxSimPacked hold strong Go references to their []byte buffers.
This keeps the packed data alive for garbage collection as long as the struct is reachable.
Memory footprint in Go is easiest to think about in two layers. The slice header is the ordinary Go slice header. The payload is the backing array you already own. NumKong does not wrap those slices in extra heap-owning tensor objects in this binding.