Type Conversions in NumKong

April 20, 2026

NumKong implements bidirectional type conversions between all supported numeric formats through Float32 as a hub type. Conversions cover IEEE 754 floats (Float16, Float32, Float64), brain float (BFloat16), Float8 formats (e4m3, e5m2, e2m3, e3m2), and integers (Int8–Int64, UInt8–UInt64, packed i4x2/u4x2). All conversions use round-to-nearest-even (RNE) for narrowing and exact widening where the target format has sufficient range and precision.

BFloat16 relates to Float32 by truncation with rounding:

$$\text{bf16} \approx \text{f32} \gg 16$$

Ties are broken with RNE: when the discarded 16 bits lie exactly halfway between two BFloat16 values, the result is the one whose least significant bit is zero.

Float16 range and precision:

$$\text{f16} \in [-65504, 65504], \quad \text{min positive normal} = 2^{-14}$$
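To make the two bounds concrete, here is a minimal sketch that decodes a Float16 bit pattern by hand, using the standard IEEE 754 half-precision layout (1 sign, 5 exponent, 10 mantissa bits, bias 15). The function name `f16_bits_to_float` is hypothetical and is not part of NumKong.

```python
def f16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit IEEE 754 half-precision pattern by hand."""
    sign = -1.0 if bits & 0x8000 else 1.0
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed scale of 2^-24.
        return sign * mantissa * 2.0 ** -24
    if exponent == 0x1F:
        # All-ones exponent encodes infinity (mantissa 0) or NaN.
        return sign * float("inf") if mantissa == 0 else float("nan")
    return sign * (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)


largest_finite = f16_bits_to_float(0x7BFF)   # exponent 30, mantissa all ones
smallest_normal = f16_bits_to_float(0x0400)  # exponent 1, mantissa zero
```

The largest finite pattern evaluates to 65504 and the smallest normal pattern to 2^-14, matching the range stated above.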

Reformulating as Python pseudocode:

```python
import numpy as np

def cast(a: np.ndarray, target_dtype: np.dtype) -> np.ndarray:
    return a.astype(target_dtype)
```

Input & Output Types

Float-to-float conversions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f64 | f32 | 64-bit to 32-bit, narrowing with RNE |
| f32 | f64 | 32-bit to 64-bit, exact widening |
| f32 | f16 | 32-bit to 16-bit half precision |
| f16 | f32 | 16-bit half to 32-bit, exact widening |
| f32 | bf16 | 32-bit to brain float, truncation with RNE |
| bf16 | f32 | Brain float to 32-bit, exact widening |

Float-to-Float8 conversions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f32 | e4m3 | 32-bit to Float8: 4 exponent, 3 mantissa bits |
| e4m3 | f32 | Float8 to 32-bit, exact via lookup table |
| f32 | e5m2 | 32-bit to Float8: 5 exponent, 2 mantissa bits |
| e5m2 | f32 | Float8 to 32-bit, exact via lookup table |
| f32 | e2m3 | 32-bit to MX: 2 exponent, 3 mantissa bits |
| e2m3 | f32 | MX to 32-bit, exact via lookup table |
| f32 | e3m2 | 32-bit to MX: 3 exponent, 2 mantissa bits |
| e3m2 | f32 | MX to 32-bit, exact via lookup table |

Float-to-integer conversions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| f32 | i8 | Clamped to [-128, 127], rounded |
| f32 | u8 | Clamped to [0, 255], rounded |
| f32 | i16 | Clamped to [-32768, 32767], rounded |
| f32 | u16 | Clamped to [0, 65535], rounded |
| f64 | i32 | Clamped to Int32 range, rounded |
| f64 | u32 | Clamped to UInt32 range, rounded |
| f64 | i64 | Clamped to Int64 range, rounded |
| f64 | u64 | Clamped to UInt64 range, rounded |
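The "clamped, rounded" behavior in the table can be sketched as a saturating cast. This is only an illustration under stated assumptions: the helper names are hypothetical, ties are rounded to even to match the RNE policy described earlier, and the mapping of NaN to 0 is a guess, since the source does not say how NaN inputs are handled.

```python
import math


def saturating_cast(x: float, lo: int, hi: int) -> int:
    """Round to nearest (ties to even), then clamp into [lo, hi]."""
    if math.isnan(x):
        return 0  # assumption: the source does not specify the NaN result
    if math.isinf(x):
        return hi if x > 0 else lo  # infinities saturate to the range ends
    return max(lo, min(hi, round(x)))  # Python's round() breaks ties to even


def f32_to_i8(x: float) -> int:
    return saturating_cast(x, -128, 127)


def f32_to_u8(x: float) -> int:
    return saturating_cast(x, 0, 255)
```

For example, `f32_to_i8(300.0)` saturates to 127 and `f32_to_u8(-3.2)` saturates to 0, instead of wrapping around as a plain integer truncation would.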

Packed sub-byte conversions:

| Input Type | Output Type | Description |
| --- | --- | --- |
| i4x2 | i8 | Signed 4-bit pair to two signed 8-bit values |
| u4x2 | u8 | Unsigned 4-bit pair to two unsigned 8-bit values |
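Unpacking a packed nibble pair might look like the sketch below. The function names are hypothetical, and the nibble order (low nibble first) is an assumption, since the source does not state which half of the byte holds the first element.

```python
from typing import Tuple


def unpack_u4x2(byte: int) -> Tuple[int, int]:
    """Split one byte into two unsigned 4-bit values (low nibble first, assumed)."""
    low = byte & 0x0F
    high = (byte >> 4) & 0x0F
    return low, high


def sign_extend_4bit(nibble: int) -> int:
    """Interpret a 4-bit value as two's complement: 0x8..0xF map to -8..-1."""
    return nibble - 16 if nibble & 0x8 else nibble


def unpack_i4x2(byte: int) -> Tuple[int, int]:
    """Split one byte into two signed 4-bit values (low nibble first, assumed)."""
    low, high = unpack_u4x2(byte)
    return sign_extend_4bit(low), sign_extend_4bit(high)
```

With this ordering, the byte `0xF3` unpacks to `(3, 15)` as unsigned and `(3, -1)` as signed.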

Optimizations

Lookup Tables for Mini-Floats

nk_e4m3_to_f32_serial, nk_e5m2_to_f32_serial, nk_e2m3_to_f32_serial, and nk_e3m2_to_f32_serial use 256-entry precomputed lookup tables: each 8-bit input indexes directly into a Float32 result array. The reverse direction (nk_f32_to_e4m3_serial) clamps to the format range, multiplies by a scale, rounds to nearest, and casts to UInt8. SIMD backends (nk_cast_haswell, nk_cast_skylake) use VPGATHERDD to perform 8 or 16 simultaneous table lookups from the same 256-entry table. AVX-512 gathers on Skylake achieve ~3cy throughput per 16-element lookup vs ~8cy on Haswell for 8-element gathers.
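The table-lookup idea can be sketched in a few lines. This is not NumKong's table, only an illustration that builds an equivalent 256-entry table arithmetically. It assumes the E4M3 layout this document describes elsewhere: bias 7, no infinities, and NaN only when the low seven bits are 0x7F. The names `decode_e4m3` and `E4M3_TO_F32` are hypothetical.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte arithmetically (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN encodings: 0x7F and 0xFF
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal, no implicit 1
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Precompute once; afterwards every conversion is a single indexed load.
E4M3_TO_F32 = [decode_e4m3(b) for b in range(256)]


def e4m3_to_f32(values: bytes) -> list:
    """Convert a buffer of E4M3 bytes by table lookup."""
    return [E4M3_TO_F32[b] for b in values]
```

A SIMD gather performs the same indexed load for 8 or 16 bytes at once; the list comprehension here is the scalar equivalent.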

BFloat16 as Truncated Float32

nk_bf16_to_f32_serial widens by left-shifting 16 bits, which is exact, has no rounding error, and is single-cycle on all platforms. nk_f32_to_bf16_serial right-shifts with round-to-nearest-even: it adds a rounding bias of 0x7FFF + ((bits >> 16) & 1) before truncating, matching the IEEE 754 RNE tie-breaking rule. The NEON backend uses vreinterpretq_u16_u8 + vzip for zero-extension; Haswell uses VPSLLD / VPSRLD shifts.
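The rounding-bias trick described above can be sketched in Python using the `struct` module to reach the Float32 bit pattern. The function names are hypothetical. The explicit NaN guard is an addition of this sketch, not something the source describes: without it, the bias could carry a NaN payload over into the sign bit.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a value as its 32-bit IEEE 754 single-precision pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_to_bf16(x: float) -> int:
    """Narrow to BFloat16 bits with round-to-nearest-even."""
    bits = f32_to_bits(x)
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # Assumption of this sketch: keep NaN as a quiet NaN instead of rounding.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # 0x7FFF rounds ties down; the extra 1 rounds ties up when the kept LSB is odd.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_to_f32(h: int) -> float:
    """Widen BFloat16 bits exactly by placing them in the upper 16 bits."""
    return bits_to_f32((h & 0xFFFF) << 16)
```

A value exactly halfway between two BFloat16 neighbors, such as the Float32 pattern `0x3F808000`, rounds to `0x3F80` (even), while `0x3F818000` rounds up to `0x3F82`.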

F16C Hardware Conversion

nk_f16_to_f32_haswell and nk_f32_to_f16_haswell use the F16C extension instructions VCVTPH2PS / VCVTPS2PH: single-instruction conversion of 8 elements with correct denormal handling, NaN propagation, and RNE rounding. The serial fallback (nk_f16_to_f32_serial) must handle denormals via explicit exponent/mantissa extraction and conditional re-normalization, costing ~15 integer ops per element vs 1 instruction with F16C. AVX-512 (nk_cast_skylake) doubles throughput to 16 elements per instruction.

F16C also unlocks a cheaper FP8 → F32 path that bypasses i32-lane bit math. nk_e5m2x16_to_f32x16_skylake_ and nk_e5m2x8_to_f32x8_haswell_ widen u8 → u16 and left-shift by 8 (E5M2 shares F16's bias 15, so the result is a bit-exact F16 encoding of every input including subnormals and NaN), then feed VCVTPH2PS: three ops total. E4M3 can't use a plain shift (bias 7 vs 15), but the Giesen-style fake-F16 ((byte & 0x7F) << 7) | ((byte & 0x80) << 8) gives an F16 whose value differs from the E4M3 magnitude by a factor of exactly 2⁸; nk_e4m3x16_to_f32x16_skylake_ and nk_e4m3x8_to_f32x8_haswell_ widen through VCVTPH2PS, multiply by 256 in F32 to correct, and blend in F32 NaN for the lone |byte|==0x7F encoding.

For E4M3 GEMM specifically, nk_e4m3x16_to_f16x16_skylake_ produces true F16 (bias-corrected, with a small subnormal LUT and NaN blend) so the packed buffer stores 2 bytes/element instead of 4. The inner loop reads F16 and widens to F32 once per B-load, trading ~10% compute for 50% pack memory.
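The two FP8 → F16 bit tricks can be checked in scalar Python, with `struct`'s half-precision format (`"e"`) standing in for VCVTPH2PS. This is a sketch of the arithmetic only, not the SIMD kernels; the function names are hypothetical.

```python
import math
import struct


def f16_bits_to_float(bits: int) -> float:
    """Stand-in for VCVTPH2PS: decode a 16-bit half-precision pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def e5m2_to_f32(byte: int) -> float:
    """E5M2 shares F16's sign/exponent layout, so a shift by 8 is a valid F16."""
    return f16_bits_to_float((byte & 0xFF) << 8)


def e4m3_to_f32(byte: int) -> float:
    """Fake-F16 trick: the shifted pattern is off by exactly 2^8, fixed by a multiply."""
    if byte & 0x7F == 0x7F:
        return math.nan  # the lone NaN magnitude, blended in separately
    fake_f16 = ((byte & 0x7F) << 7) | ((byte & 0x80) << 8)
    return f16_bits_to_float(fake_f16) * 256.0
```

The multiply by 256 is exact in Float32 because it only changes the exponent, which is why the same correction also covers E4M3 subnormals: `e4m3_to_f32(0x01)` yields 2^-9.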

Performance

The following performance tables are produced by manually running nk_bench, one of the included internal tools, to measure throughput at different input sizes. The input size is controlled by the NK_DENSE_DIMENSIONS environment variable and set to 256, 1024, and 4096 elements. Throughput is reported in GB/s, counting bytes both read and written per second, with ↓ for downcasts and ↑ for upcasts. Each kernel runs for at least 5 seconds per configuration. Benchmark threads are pinned to specific cores; on machines with heterogeneous core types (e.g., Apple P/E cores), only the fastest cores are used. Workloads that significantly degrade CPU frequencies (Intel AMX, Apple SME) run in separate passes to avoid affecting throughput measurements of other kernels.
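As a rough illustration of the metric, the sketch below computes a GB/s figure from an element count and a per-call time. The function name is hypothetical, and it assumes 1 GB = 10^9 bytes, which the source does not state.

```python
def cast_throughput_gbps(elements: int, src_bytes: int, dst_bytes: int,
                         seconds_per_call: float) -> float:
    """Bytes read plus bytes written per second, in GB/s (1 GB = 1e9 bytes, assumed)."""
    bytes_moved = elements * (src_bytes + dst_bytes)
    return bytes_moved / seconds_per_call / 1e9


# Hypothetical timing: a 4096-element f32 -> bf16 downcast taking 0.5 microseconds
# moves 4096 * (4 + 2) = 24576 bytes per call.
example = cast_throughput_gbps(4096, 4, 2, 0.5e-6)
```

Under this definition an upcast and a downcast between the same two types move the same number of bytes per element, so their figures are directly comparable.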

Intel Sapphire Rapids

Native

| Kernel | ↓ 256 | ↓ 1K | ↓ 4K | ↑ 256 | ↑ 1K | ↑ 4K |
| --- | --- | --- | --- | --- | --- | --- |
| f32 ↔ bf16 | | | | | | |
| nk_cast_serial | 0.542 GB/s | 0.521 GB/s | 0.553 GB/s | 1.10 GB/s | 1.12 GB/s | 1.17 GB/s |
| nk_cast_haswell | 40.8 GB/s | 52.4 GB/s | 55.1 GB/s | 27.7 GB/s | 43.2 GB/s | 46.3 GB/s |
| nk_cast_skylake | 23.6 GB/s | 44.8 GB/s | 46.8 GB/s | 37.6 GB/s | 60.1 GB/s | 61.3 GB/s |
| nk_cast_icelake | 21.4 GB/s | 26.0 GB/s | 27.2 GB/s | 32.6 GB/s | 39.4 GB/s | 44.3 GB/s |
| nk_cast_sapphire | 21.5 GB/s | 21.1 GB/s | 49.5 GB/s | 39.2 GB/s | 38.3 GB/s | 56.3 GB/s |
| f32 ↔ f16 | | | | | | |
| nk_cast_serial | 6.28 GB/s | 6.36 GB/s | 6.20 GB/s | 2.93 GB/s | 2.95 GB/s | 2.89 GB/s |
| nk_cast_haswell | 50.2 GB/s | 106 GB/s | 105 GB/s | 31.7 GB/s | 60.2 GB/s | 66.1 GB/s |
| nk_cast_skylake | 38.0 GB/s | 56.6 GB/s | 39.4 GB/s | 39.7 GB/s | 58.3 GB/s | 43.7 GB/s |
| nk_cast_icelake | 51.8 GB/s | 60.2 GB/s | 54.3 GB/s | 52.2 GB/s | 57.7 GB/s | 60.6 GB/s |
| nk_cast_sapphire | 31.8 GB/s | 33.8 GB/s | 38.8 GB/s | 35.0 GB/s | 33.6 GB/s | 51.5 GB/s |
| f32 ↔ e5m2 | | | | | | |
| nk_cast_serial | 0.785 GB/s | 0.725 GB/s | 0.569 GB/s | 2.62 GB/s | 2.57 GB/s | 2.69 GB/s |
| nk_cast_haswell | 7.93 GB/s | 8.39 GB/s | 5.44 GB/s | 12.6 GB/s | 17.9 GB/s | 10.6 GB/s |
| nk_cast_skylake | 10.3 GB/s | 10.8 GB/s | 10.0 GB/s | 27.2 GB/s | 28.6 GB/s | 28.0 GB/s |
| nk_cast_icelake | 5.07 GB/s | 4.96 GB/s | 6.08 GB/s | 14.9 GB/s | 13.7 GB/s | 14.5 GB/s |
| nk_cast_sapphire | 7.81 GB/s | 5.25 GB/s | 10.7 GB/s | 24.7 GB/s | 15.2 GB/s | 25.0 GB/s |
| f32 ↔ e4m3 | | | | | | |
| nk_cast_serial | 0.653 GB/s | 0.623 GB/s | 0.445 GB/s | 1.51 GB/s | 1.43 GB/s | 1.44 GB/s |
| nk_cast_haswell | 6.74 GB/s | 7.35 GB/s | 6.68 GB/s | 10.4 GB/s | 12.1 GB/s | 7.47 GB/s |
| nk_cast_skylake | 7.70 GB/s | 9.83 GB/s | 9.79 GB/s | 17.3 GB/s | 23.2 GB/s | 22.2 GB/s |
| nk_cast_icelake | 8.51 GB/s | 9.01 GB/s | 9.43 GB/s | 17.8 GB/s | 20.5 GB/s | 21.4 GB/s |
| nk_cast_sapphire | 4.98 GB/s | 4.90 GB/s | 8.56 GB/s | 15.7 GB/s | 11.0 GB/s | 17.1 GB/s |
| f32 ↔ e3m2 | | | | | | |
| nk_cast_serial | 0.863 GB/s | 1.44 GB/s | 1.21 GB/s | 2.46 GB/s | 4.20 GB/s | 4.14 GB/s |
| nk_cast_haswell | 4.70 GB/s | 5.04 GB/s | 5.00 GB/s | 7.47 GB/s | 7.82 GB/s | 8.03 GB/s |
| nk_cast_skylake | 6.34 GB/s | 6.37 GB/s | 6.46 GB/s | 14.7 GB/s | 17.6 GB/s | 17.1 GB/s |
| nk_cast_icelake | 5.34 GB/s | 5.10 GB/s | 6.36 GB/s | 13.3 GB/s | 14.2 GB/s | 21.3 GB/s |
| nk_cast_sapphire | 8.78 GB/s | 9.93 GB/s | 7.02 GB/s | 23.0 GB/s | 18.5 GB/s | 20.8 GB/s |
| f32 ↔ e2m3 | | | | | | |
| nk_cast_serial | 0.941 GB/s | 1.39 GB/s | 0.688 GB/s | 2.68 GB/s | 4.79 GB/s | 2.70 GB/s |
| nk_cast_haswell | 4.76 GB/s | 4.51 GB/s | 5.00 GB/s | 8.26 GB/s | 8.92 GB/s | 9.02 GB/s |
| nk_cast_skylake | 6.55 GB/s | 6.54 GB/s | 6.42 GB/s | 13.4 GB/s | 15.9 GB/s | 16.1 GB/s |
| nk_cast_icelake | 5.03 GB/s | 6.41 GB/s | 6.44 GB/s | 12.4 GB/s | 14.8 GB/s | 16.2 GB/s |
| nk_cast_sapphire | 9.95 GB/s | 8.90 GB/s | 9.17 GB/s | 19.7 GB/s | 24.1 GB/s | 16.8 GB/s |
| f32 ↔ i16 | | | | | | |
| nk_cast_serial | 1.99 GB/s | 2.02 GB/s | 2.04 GB/s | 4.59 GB/s | 4.63 GB/s | 4.68 GB/s |
| nk_cast_haswell | 46.4 GB/s | 51.8 GB/s | 53.0 GB/s | 19.8 GB/s | 21.0 GB/s | 21.9 GB/s |
| nk_cast_skylake | 31.0 GB/s | 34.2 GB/s | 36.7 GB/s | 48.7 GB/s | 58.5 GB/s | 61.1 GB/s |
| f32 ↔ u16 | | | | | | |
| nk_cast_serial | 3.19 GB/s | 3.13 GB/s | 3.14 GB/s | 4.60 GB/s | 4.82 GB/s | 4.75 GB/s |
| nk_cast_haswell | 36.4 GB/s | 43.6 GB/s | 48.4 GB/s | 19.1 GB/s | 20.6 GB/s | 21.2 GB/s |
| nk_cast_skylake | 32.0 GB/s | 36.1 GB/s | 37.3 GB/s | 48.4 GB/s | 55.0 GB/s | 59.5 GB/s |
| f32 ↔ i8 | | | | | | |
| nk_cast_serial | 3.22 GB/s | 3.62 GB/s | 3.40 GB/s | 5.41 GB/s | 5.65 GB/s | 5.73 GB/s |
| nk_cast_haswell | 21.6 GB/s | 25.5 GB/s | 27.5 GB/s | 12.8 GB/s | 13.6 GB/s | 14.0 GB/s |
| nk_cast_skylake | 13.0 GB/s | 13.2 GB/s | 13.9 GB/s | 22.1 GB/s | 23.4 GB/s | 22.9 GB/s |
| nk_cast_icelake | 14.2 GB/s | 16.4 GB/s | 21.5 GB/s | 25.4 GB/s | 29.4 GB/s | 34.8 GB/s |
| nk_cast_sapphire | 26.0 GB/s | 27.3 GB/s | 19.5 GB/s | 33.1 GB/s | 48.9 GB/s | 49.4 GB/s |
| f32 ↔ u8 | | | | | | |
| nk_cast_serial | 4.44 GB/s | 4.58 GB/s | 5.84 GB/s | 7.45 GB/s | 7.20 GB/s | 4.24 GB/s |
| nk_cast_haswell | 41.2 GB/s | 42.2 GB/s | 41.4 GB/s | 17.9 GB/s | 19.2 GB/s | 20.8 GB/s |
| nk_cast_skylake | 27.8 GB/s | 31.1 GB/s | 33.4 GB/s | 39.8 GB/s | 48.7 GB/s | 51.5 GB/s |
| f64 ↔ f32 | | | | | | |
| nk_cast_serial | 11.6 GB/s | 12.2 GB/s | 12.3 GB/s | 12.1 GB/s | 12.9 GB/s | 13.2 GB/s |
| nk_cast_skylake | 52.1 GB/s | 59.4 GB/s | 53.8 GB/s | 54.4 GB/s | 65.9 GB/s | 60.6 GB/s |
| f64 ↔ i64 | | | | | | |
| nk_cast_serial | 5.30 GB/s | 5.21 GB/s | 5.21 GB/s | 15.4 GB/s | 16.1 GB/s | 14.0 GB/s |
| nk_cast_skylake | 8.73 GB/s | 9.81 GB/s | 9.03 GB/s | 25.3 GB/s | 26.8 GB/s | 20.3 GB/s |
| f64 ↔ u64 | | | | | | |
| nk_cast_serial | 9.17 GB/s | 8.55 GB/s | 8.57 GB/s | 16.3 GB/s | 15.1 GB/s | 15.0 GB/s |
| nk_cast_skylake | 13.8 GB/s | 14.5 GB/s | 15.4 GB/s | 25.5 GB/s | 28.1 GB/s | 19.6 GB/s |
| f64 ↔ i32 | | | | | | |
| nk_cast_serial | 3.71 GB/s | 3.97 GB/s | 3.71 GB/s | 11.6 GB/s | 12.3 GB/s | 12.6 GB/s |
| nk_cast_skylake | 38.7 GB/s | 48.1 GB/s | 45.9 GB/s | 54.1 GB/s | 64.2 GB/s | 60.8 GB/s |
| f64 ↔ u32 | | | | | | |
| nk_cast_serial | 6.37 GB/s | 6.16 GB/s | 6.08 GB/s | 10.9 GB/s | 11.9 GB/s | 10.3 GB/s |
| nk_cast_skylake | 46.6 GB/s | 48.9 GB/s | 49.5 GB/s | 50.2 GB/s | 60.5 GB/s | 62.3 GB/s |

WASM

Measured with Wasmtime v42 (Cranelift backend).

| Kernel | ↓ 256 | ↓ 1K | ↓ 4K | ↑ 256 | ↑ 1K | ↑ 4K |
| --- | --- | --- | --- | --- | --- | --- |
| f32 ↔ bf16 | | | | | | |
| nk_cast_serial | ? GB/s | ? GB/s | 1.63 GB/s | ? GB/s | ? GB/s | 2.21 GB/s |
| f32 ↔ f16 | | | | | | |
| nk_cast_serial | ? GB/s | ? GB/s | 0.436 GB/s | ? GB/s | ? GB/s | 1.19 GB/s |
| f32 ↔ e5m2 | | | | | | |
| nk_cast_serial | ? GB/s | ? GB/s | 0.294 GB/s | ? GB/s | ? GB/s | 1.45 GB/s |
| f32 ↔ e4m3 | | | | | | |
| nk_cast_serial | ? GB/s | ? GB/s | 0.239 GB/s | ? GB/s | ? GB/s | 0.746 GB/s |

Apple M5

Native

| Kernel | ↓ 256 | ↓ 1K | ↓ 4K | ↑ 256 | ↑ 1K | ↑ 4K |
| --- | --- | --- | --- | --- | --- | --- |
| f32 ↔ bf16 | | | | | | |
| nk_cast_serial | 1.37 GB/s | 1.35 GB/s | 1.41 GB/s | 1.37 GB/s | 1.34 GB/s | 1.38 GB/s |
| nk_cast_neon | 19.3 GB/s | 23.7 GB/s | 23.2 GB/s | 59.4 GB/s | 58.9 GB/s | 57.3 GB/s |
| f32 ↔ f16 | | | | | | |
| nk_cast_serial | 1.37 GB/s | 1.31 GB/s | 1.32 GB/s | 1.37 GB/s | 1.31 GB/s | 1.40 GB/s |
| nk_cast_neon | 20.1 GB/s | 21.9 GB/s | 25.0 GB/s | 52.1 GB/s | 60.2 GB/s | 70.2 GB/s |
| f32 ↔ e5m2 | | | | | | |
| nk_cast_serial | 0.681 GB/s | 0.621 GB/s | 0.600 GB/s | 1.17 GB/s | 1.17 GB/s | 1.23 GB/s |
| nk_cast_neon | 8.50 GB/s | 8.45 GB/s | 8.35 GB/s | 40.6 GB/s | 46.5 GB/s | 46.5 GB/s |
| f32 ↔ e4m3 | | | | | | |
| nk_cast_serial | 0.683 GB/s | 0.618 GB/s | 0.586 GB/s | 1.02 GB/s | 1.01 GB/s | 1.02 GB/s |
| nk_cast_neon | 7.85 GB/s | 7.91 GB/s | 7.66 GB/s | 18.9 GB/s | 19.2 GB/s | 18.3 GB/s |
| f32 ↔ e3m2 | | | | | | |
| nk_cast_serial | 0.702 GB/s | 0.632 GB/s | 0.596 GB/s | 1.17 GB/s | 1.13 GB/s | 1.15 GB/s |
| nk_cast_neon | 8.94 GB/s | 9.02 GB/s | 8.91 GB/s | 24.9 GB/s | 25.0 GB/s | 24.4 GB/s |
| f32 ↔ e2m3 | | | | | | |
| nk_cast_serial | 0.921 GB/s | 0.843 GB/s | 0.715 GB/s | 1.21 GB/s | 1.21 GB/s | 1.26 GB/s |
| nk_cast_neon | 8.89 GB/s | 9.03 GB/s | 8.82 GB/s | 24.9 GB/s | 25.1 GB/s | 24.6 GB/s |
| f32 ↔ i16 | | | | | | |
| nk_cast_serial | 0.785 GB/s | 0.679 GB/s | 0.678 GB/s | 1.44 GB/s | 1.39 GB/s | 1.49 GB/s |
| nk_cast_neon | 19.4 GB/s | 22.6 GB/s | 23.9 GB/s | 19.9 GB/s | 23.2 GB/s | 25.9 GB/s |
| f32 ↔ u16 | | | | | | |
| nk_cast_serial | 0.916 GB/s | 0.822 GB/s | 0.726 GB/s | 1.37 GB/s | 1.36 GB/s | 1.48 GB/s |
| nk_cast_neon | 20.3 GB/s | 20.6 GB/s | 22.1 GB/s | 15.6 GB/s | 18.5 GB/s | 17.4 GB/s |
| f32 ↔ i8 | | | | | | |
| nk_cast_serial | 0.725 GB/s | 0.616 GB/s | 0.578 GB/s | 1.21 GB/s | 1.21 GB/s | 1.28 GB/s |
| nk_cast_neon | 18.2 GB/s | 24.5 GB/s | 21.7 GB/s | 16.3 GB/s | 18.9 GB/s | 19.8 GB/s |
| f32 ↔ u8 | | | | | | |
| nk_cast_serial | 0.967 GB/s | 0.795 GB/s | 0.723 GB/s | 1.29 GB/s | 1.25 GB/s | 1.40 GB/s |
| nk_cast_neon | 17.5 GB/s | 19.8 GB/s | 19.4 GB/s | 13.8 GB/s | 17.8 GB/s | 15.1 GB/s |
| f64 ↔ f32 | | | | | | |
| nk_cast_serial | 2.65 GB/s | 2.60 GB/s | 2.70 GB/s | 2.59 GB/s | 2.55 GB/s | 2.65 GB/s |
| nk_cast_neon | 2.87 GB/s | 2.60 GB/s | 2.73 GB/s | 2.64 GB/s | 2.63 GB/s | 2.57 GB/s |
| f64 ↔ i64 | | | | | | |
| nk_cast_serial | 2.42 GB/s | 2.00 GB/s | 1.86 GB/s | 3.79 GB/s | 3.61 GB/s | 4.03 GB/s |
| nk_cast_neon | 2.51 GB/s | 1.94 GB/s | 1.78 GB/s | 3.83 GB/s | 3.68 GB/s | 3.79 GB/s |
| f64 ↔ u64 | | | | | | |
| nk_cast_serial | 2.56 GB/s | 2.19 GB/s | 2.06 GB/s | 3.71 GB/s | 3.50 GB/s | 3.87 GB/s |
| nk_cast_neon | 2.68 GB/s | 2.10 GB/s | 1.97 GB/s | 3.68 GB/s | 3.61 GB/s | 3.58 GB/s |
| f64 ↔ i32 | | | | | | |
| nk_cast_serial | 1.58 GB/s | 1.32 GB/s | 1.29 GB/s | 2.65 GB/s | 2.58 GB/s | 2.84 GB/s |
| nk_cast_neon | 1.61 GB/s | 1.33 GB/s | 1.24 GB/s | 2.73 GB/s | 2.63 GB/s | 2.66 GB/s |
| f64 ↔ u32 | | | | | | |
| nk_cast_serial | 1.83 GB/s | 1.53 GB/s | 1.47 GB/s | 2.55 GB/s | 2.48 GB/s | 2.69 GB/s |
| nk_cast_neon | 1.89 GB/s | 1.53 GB/s | 1.38 GB/s | 2.56 GB/s | 2.54 GB/s | 2.59 GB/s |

WASM

Measured with Wasmtime v43 (Cranelift backend).

| Kernel | ↓ 256 | ↓ 1K | ↓ 4K | ↑ 256 | ↑ 1K | ↑ 4K |
| --- | --- | --- | --- | --- | --- | --- |
| f32 ↔ bf16 | | | | | | |
| nk_cast_serial | 0.514 GB/s | 0.522 GB/s | 0.538 GB/s | 0.511 GB/s | 0.526 GB/s | 0.519 GB/s |
| f32 ↔ f16 | | | | | | |
| nk_cast_serial | 0.368 GB/s | 0.363 GB/s | 0.360 GB/s | 0.490 GB/s | 0.480 GB/s | 0.489 GB/s |
| f32 ↔ e5m2 | | | | | | |
| nk_cast_serial | 0.323 GB/s | 0.312 GB/s | 0.304 GB/s | 0.423 GB/s | 0.425 GB/s | 0.425 GB/s |
| f32 ↔ e4m3 | | | | | | |
| nk_cast_serial | 0.315 GB/s | 0.304 GB/s | 0.295 GB/s | 0.396 GB/s | 0.396 GB/s | 0.397 GB/s |