Trigonometric Functions in NumKong

April 2, 2026 · View on GitHub

NumKong implements element-wise trigonometric functions — sine, cosine, and arc tangent — with ~3 ulp error bounds for f32 and faithful rounding for f64. Each function operates on dense vectors, reading input angles (radians) and writing output values of the same length. The implementations derive from SLEEF (SIMD Library for Evaluating Elementary Functions), adapted for NumKong's ISA dispatch and type system.

Sine:

sin:R[1,1]\text{sin}: \mathbb{R} \to [-1, 1]

Cosine:

cos:R[1,1]\text{cos}: \mathbb{R} \to [-1, 1]

Arc tangent:

atan:R(π2,π2)\text{atan}: \mathbb{R} \to \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)

Reformulating as Python pseudocode:

import numpy as np

def sin(a: np.ndarray) -> np.ndarray:
    return np.sin(a)

def cos(a: np.ndarray) -> np.ndarray:
    return np.cos(a)

def atan(a: np.ndarray) -> np.ndarray:
    return np.arctan(a)

Input & Output Types

Input TypeOutput TypeDescription
f64f6464-bit IEEE 754 double precision
f32f3232-bit IEEE 754 single precision
f16f1616-bit half precision, widened to f32 internally

Optimizations

Cody-Waite Range Reduction

All trigonometric kernels reduce the input angle to [π/4,π/4][-\pi/4, \pi/4] before polynomial evaluation using Cody-Waite argument reduction. The constant π\pi is split into high and low parts (πhi+πlo\pi_{\text{hi}} + \pi_{\text{lo}}) to maintain precision during the subtraction xnπx - n\pi: reduced = (x - n * pi_hi) - n * pi_lo. Single-part subtraction would lose ~3 bits of precision for large multiples of π\pi; the two-part split preserves the full mantissa. The quadrant index n=round(x/π)n = \text{round}(x / \pi) selects which trigonometric identity to apply (sin-cos swap, sign flip) via a 2-bit branch.

Minimax Polynomial Approximation

nk_each_sin_f32_serial, nk_each_cos_f32_serial evaluate degree-9 minimax polynomials via Horner's method after range reduction. The polynomial coefficients are precomputed to minimize maximum error over [π/4,π/4][-\pi/4, \pi/4] — Chebyshev-optimal, not Taylor truncation. Horner evaluation: p = c9*x^2 + c7; p = p*x^2 + c5; p = p*x^2 + c3; p = p*x^2 + c1; p = p*x — 4 FMA operations plus 1 multiply for the final odd-power term. nk_each_sin_f64_serial uses degree-19 polynomials for 52-bit mantissa coverage.

Vectorized Polynomial Evaluation

nk_each_sin_f32_haswell, nk_each_cos_f32_skylake evaluate the same polynomial on 8 (AVX2) or 16 (AVX-512) elements simultaneously. Range reduction, quadrant selection, and polynomial evaluation all operate on packed vectors — the only scalar operation is the final sign correction via VBLENDVPS with the quadrant mask. nk_each_sin_f32_neon processes 4 elements per iteration using vfmaq_f32 for the Horner chain. WASM v128relaxed (nk_each_sin_f32_v128relaxed) uses f32x4.relaxed_madd for the FMA steps, achieving ~2x throughput over strict f32x4.mul + f32x4.add sequences.

Performance

The following performance tables are produced by manually re-running nk_test and nk_bench included internal tools to measure both accuracy and throughput at different input shapes. The input size is controlled by the NK_DENSE_DIMENSIONS environment variable and set to 256, 1024, and 4096 elements. The throughput is measured in GB/s as the number of input bytes per second. Accuracy is reported as mean ULP (units in last place) unless noted otherwise — the average number of representable floating-point values between the result and the exact answer. Each kernel runs for at least 20 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.

Intel Sapphire Rapids

Native

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f64_serial0.994 gb/s, 0 ulp0.783 gb/s, 0 ulp0.827 gb/s, 0 ulp
nk_each_cos_f64_serial0.906 gb/s, 0 ulp0.784 gb/s, 0 ulp0.824 gb/s, 0 ulp
nk_each_atan_f64_serial0.307 gb/s, 0 ulp0.291 gb/s, 0 ulp0.291 gb/s, 0 ulp
nk_each_sin_f64_haswell4.59 gb/s, 0 ulp4.19 gb/s, 0 ulp4.04 gb/s, 0 ulp
nk_each_cos_f64_haswell4.25 gb/s, 0 ulp4.14 gb/s, 0 ulp3.92 gb/s, 0 ulp
nk_each_atan_f64_haswell3.83 gb/s, 0 ulp3.21 gb/s, 0 ulp3.49 gb/s, 0 ulp
nk_each_sin_f64_skylake7.65 gb/s, 0 ulp6.55 gb/s, 0 ulp4.70 gb/s, 0 ulp
nk_each_cos_f64_skylake7.88 gb/s, 0 ulp5.76 gb/s, 0 ulp5.01 gb/s, 0 ulp
nk_each_atan_f64_skylake5.08 gb/s, 0 ulp4.72 gb/s, 0 ulp4.58 gb/s, 0 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f32_serial6.29 gb/s, 5 ulp6.07 gb/s, 5 ulp5.41 gb/s, 5 ulp
nk_each_cos_f32_serial7.03 gb/s, 15 ulp6.24 gb/s, 15 ulp5.16 gb/s, 15 ulp
nk_each_atan_f32_serial0.642 gb/s, 0.4 ulp0.541 gb/s, 0.4 ulp0.567 gb/s, 0.4 ulp
nk_each_sin_f32_haswell10.0 gb/s, 5 ulp7.36 gb/s, 5 ulp5.63 gb/s, 5 ulp
nk_each_cos_f32_haswell7.82 gb/s, 15 ulp7.11 gb/s, 15 ulp5.09 gb/s, 15 ulp
nk_each_atan_f32_haswell7.63 gb/s, 0.4 ulp5.94 gb/s, 0.4 ulp5.38 gb/s, 0.4 ulp
nk_each_sin_f32_skylake11.9 gb/s, 5 ulp9.14 gb/s, 5 ulp5.43 gb/s, 5 ulp
nk_each_cos_f32_skylake10.4 gb/s, 15 ulp8.26 gb/s, 15 ulp5.40 gb/s, 15 ulp
nk_each_atan_f32_skylake9.07 gb/s, 0.4 ulp7.80 gb/s, 0.4 ulp5.75 gb/s, 0.4 ulp
f16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f16_serial0.112 gb/s, 0.9 ulp0.102 gb/s, 1.1 ulp0.110 gb/s, 0.9 ulp
nk_each_cos_f16_serial0.105 gb/s, 12 ulp0.0962 gb/s, 12 ulp0.0976 gb/s, 12 ulp
nk_each_atan_f16_serial0.0208 gb/s, 6.4 ulp0.0201 gb/s, 6.7 ulp0.0204 gb/s, 6.6 ulp
nk_each_sin_f16_skylake6.05 gb/s, 8.41K ulp5.81 gb/s, 8.43K ulp5.24 gb/s, 8.41K ulp
nk_each_cos_f16_skylake6.05 gb/s, 8.34K ulp5.20 gb/s, 8.34K ulp5.09 gb/s, 8.35K ulp
nk_each_atan_f16_skylake4.86 gb/s, 16.5K ulp5.25 gb/s, 16.6K ulp4.76 gb/s, 16.5K ulp

WASM

Measured with Wasmtime v42 (Cranelift backend).

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f64_serial0.34 gb/s, 0.2 ulp0.38 gb/s, 0.2 ulp0.08 gb/s, 0.2 ulp
nk_each_cos_f64_serial0.36 gb/s, 0.3 ulp0.39 gb/s, 0.3 ulp0.08 gb/s, 0.3 ulp
nk_each_atan_f64_serial0.11 gb/s, 0.3 ulp0.12 gb/s, 0.3 ulp0.11 gb/s, 0.3 ulp
nk_each_sin_f64_v128relaxed0.59 gb/s, 0.2 ulp0.26 gb/s, 0.2 ulp0.05 gb/s, 0.2 ulp
nk_each_cos_f64_v128relaxed0.29 gb/s, 0.3 ulp0.50 gb/s, 0.3 ulp0.03 gb/s, 0.3 ulp
nk_each_atan_f64_v128relaxed0.11 gb/s, 0.3 ulp0.48 gb/s, 0.3 ulp0.21 gb/s, 0.3 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f32_serial0.17 gb/s, 4.9 ulp0.51 gb/s, 4.9 ulp0.07 gb/s, 4.9 ulp
nk_each_cos_f32_serial0.05 gb/s, 14.4 ulp0.41 gb/s, 14.4 ulp0.10 gb/s, 14.4 ulp
nk_each_atan_f32_serial0.08 gb/s, 0.4 ulp0.08 gb/s, 0.4 ulp0.09 gb/s, 0.4 ulp
nk_each_sin_f32_v128relaxed0.13 gb/s, 20.7 ulp0.01 gb/s, 20.7 ulp0.10 gb/s, 20.7 ulp
nk_each_cos_f32_v128relaxed0.15 gb/s, 21.9 ulp0.32 gb/s, 21.9 ulp0.05 gb/s, 21.9 ulp
nk_each_atan_f32_v128relaxed0.45 gb/s, 0.4 ulp0.39 gb/s, 0.4 ulp0.15 gb/s, 0.4 ulp
f16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f16_serial0.07 gb/s, 1.1 ulp0.07 gb/s, 1.1 ulp0.07 gb/s, 1.1 ulp
nk_each_cos_f16_serial0.07 gb/s, 11.8 ulp0.07 gb/s, 11.8 ulp0.07 gb/s, 11.8 ulp
nk_each_atan_f16_serial0.03 gb/s, 6.5 ulp0.03 gb/s, 6.5 ulp0.03 gb/s, 6.5 ulp

Apple M4

Native

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f64_serial0.627 gb/s, 0.2 ulp0.634 gb/s, 0.2 ulp0.639 gb/s, 0.2 ulp
nk_each_cos_f64_serial0.621 gb/s, 0.3 ulp0.632 gb/s, 0.3 ulp0.619 gb/s, 0.3 ulp
nk_each_atan_f64_serial0.153 gb/s, 0.3 ulp0.154 gb/s, 0.3 ulp0.153 gb/s, 0.3 ulp
nk_each_sin_f64_neon5.94 gb/s, 0.2 ulp5.75 gb/s, 0.2 ulp5.82 gb/s, 0.2 ulp
nk_each_cos_f64_neon5.15 gb/s, 0.3 ulp5.36 gb/s, 0.3 ulp5.37 gb/s, 0.3 ulp
nk_each_atan_f64_neon3.53 gb/s, 0.3 ulp3.50 gb/s, 0.3 ulp3.50 gb/s, 0.3 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f32_serial7.94 gb/s, 4.9 ulp7.94 gb/s, 4.9 ulp7.23 gb/s, 4.9 ulp
nk_each_cos_f32_serial7.26 gb/s, 14 ulp6.41 gb/s, 14 ulp6.52 gb/s, 14 ulp
nk_each_atan_f32_serial0.128 gb/s, 0.4 ulp0.129 gb/s, 0.4 ulp0.126 gb/s, 0.4 ulp
nk_each_sin_f32_neon9.75 gb/s, 4.9 ulp9.44 gb/s, 4.9 ulp8.13 gb/s, 4.9 ulp
nk_each_cos_f32_neon8.68 gb/s, 18 ulp7.77 gb/s, 18 ulp7.84 gb/s, 18 ulp
nk_each_atan_f32_neon5.57 gb/s, 0.4 ulp5.00 gb/s, 0.4 ulp5.10 gb/s, 0.4 ulp
f16░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f16_serial3.66 gb/s, 1.3 ulp3.71 gb/s, 1.3 ulp3.38 gb/s, 1.3 ulp
nk_each_cos_f16_serial3.28 gb/s, 12 ulp3.29 gb/s, 12 ulp3.15 gb/s, 12 ulp
nk_each_atan_f16_serial0.0639 gb/s, 6.5 ulp0.0626 gb/s, 6.5 ulp0.0627 gb/s, 6.5 ulp

WASM

Measured with Wasmtime v43 (Cranelift backend).

Kernel25610244096
f64░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f64_serial0.771 gb/s, 0.2 ulp0.733 gb/s, 0.2 ulp0.798 gb/s, 0.2 ulp
nk_each_sin_f64_v128relaxed8.65 gb/s, 0.2 ulp8.40 gb/s, 0.2 ulp9.45 gb/s, 0.2 ulp
nk_each_cos_f64_serial0.746 gb/s, 0.3 ulp0.703 gb/s, 0.3 ulp0.735 gb/s, 0.3 ulp
nk_each_cos_f64_v128relaxed9.07 gb/s, 0.3 ulp8.80 gb/s, 0.3 ulp9.58 gb/s, 0.3 ulp
nk_each_atan_f64_serial0.247 gb/s, 0.3 ulp0.234 gb/s, 0.3 ulp0.232 gb/s, 0.3 ulp
nk_each_atan_f64_v128relaxed5.51 gb/s, 0.3 ulp5.21 gb/s, 0.3 ulp5.96 gb/s, 0.3 ulp
f32░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
nk_each_sin_f32_serial9.94 gb/s, 4.9 ulp9.44 gb/s, 4.9 ulp10.2 gb/s, 4.9 ulp
nk_each_sin_f32_v128relaxed17.8 gb/s, 20 ulp17.5 gb/s, 20 ulp17.9 gb/s, 20 ulp
nk_each_cos_f32_serial9.13 gb/s, 14 ulp8.96 gb/s, 14 ulp9.55 gb/s, 14 ulp
nk_each_cos_f32_v128relaxed16.7 gb/s, 21 ulp14.7 gb/s, 21 ulp16.9 gb/s, 21 ulp
nk_each_atan_f32_serial0.191 gb/s, 0.4 ulp0.185 gb/s, 0.4 ulp0.199 gb/s, 0.4 ulp
nk_each_atan_f32_v128relaxed11.2 gb/s, 0.4 ulp10.7 gb/s, 0.4 ulp11.6 gb/s, 0.4 ulp