GPU Compute Backend (wgpu)

April 7, 2026 · View on GitHub

Module: sc_neurocore_engine.GpuDenseLayer Rust path: sc_neurocore_engine::gpu::GpuDenseLayer Feature gate: --features gpu Backends: Vulkan (Linux/Windows), Metal (macOS), DX12 (Windows) Tested on: NVIDIA GTX 1060 6GB, AMD RX 6600 XT 8GB

1. Mathematical Formalism

Stochastic computing dense layer

The GPU backend accelerates the SC-NeuroCore DenseLayer, which performs stochastic computing matrix-vector multiplication using Bernoulli-encoded bitstreams and AND+popcount accumulation.

Bernoulli encoding:

$B_i[k] \sim \text{Bernoulli}(p_i), \quad k = 1, \ldots, L$

Each input probability $p_i$ is encoded as a bitstream of length $L$ where each bit is independently 1 with probability $p_i$ .

Stochastic AND multiplication:

$Y_{ij}[k] = W_{ij}[k] \wedge B_i[k]$

where $W_{ij}$ is the pre-encoded weight bitstream and $B_i$ is the input bitstream. The AND operation implements stochastic multiplication: $\mathbb{E}[Y] = w_{ij} \cdot p_i$ .

Popcount accumulation:

$\text{out}_j = \frac{1}{L} \sum_{i=1}^{N_{in}} \text{popcount}(Y_{ij})$

The output for neuron $j$ is the normalised sum of bit-counts across all inputs.

GPU kernel architecture

Two compute shader kernels mirror the CPU hot path:

Kernel 1 — Encode (encode.wgsl):

Each GPU thread generates one packed vec2<u32> word (64 bits) of a Bernoulli bitstream. The PRNG is Philox 4×32-10, a counter-based RNG that enables massively parallel independent random streams without state sharing.

Dispatch: $(\text{ceil}(\text{n\_inputs} \times \text{words\_per\_input} / 256), \text{n\_samples}, 1)$

Kernel 2 — Accumulate (accumulate.wgsl):

Each workgroup processes one (neuron, sample) pair. 256 threads cooperatively AND+popcount all input words for one neuron, then perform workgroup-local tree reduction via shared memory.

Dispatch: (n_neurons, n_samples, 1) — one workgroup per (neuron, sample)

Philox 4×32-10 PRNG

Counter-based RNG producing 4 random u32 values per round:

Input: (counter[4], key[2])
10 rounds of: {
    x0 = mulhi(R0, counter[0]) ^ key[0] ^ counter[3]
    x1 = mullo(R0, counter[0])
    x2 = mulhi(R1, counter[2]) ^ key[1] ^ counter[1]
    x3 = mullo(R1, counter[2])
    counter = (x0, x1, x2, x3)
    key += (PHILOX_W0, PHILOX_W1)
}

Constants: $R_0 = \text{0xD2511F53}$ , $R_1 = \text{0xCD9E8D57}$ , $W_0 = \text{0x9E3779B9}$ (golden ratio), $W_1 = \text{0xBB67AE85}$ .

The PRNG is implemented in WGSL with 16-bit decomposition for mulhilo since WGSL has no native 64-bit integer multiplication.

u64 emulation via vec2

WGSL has no native u64 type. Bitstreams are packed as vec2<u32> (2 × 32 bits = 64 bits). AND and popcount operate on both halves independently:

let result = vec2<u32>(a.x & b.x, a.y & b.y);
let count = countOneBits(result.x) + countOneBits(result.y);

This incurs ~15-25% overhead vs native u64 popcount but is memory-bound anyway, so the penalty is negligible at large layer sizes.

2. Theoretical Context

Problem statement

SC-NeuroCore's single-neuron Rust engine is best-in-class (3.8 ns/step LIF), but network-level stochastic computing is O(n²) in the number of neurons × inputs, dominated by Bernoulli encoding and AND+popcount. At 1K+ neurons, CPU performance is insufficient for real-time simulation:

Framework	1K neurons	10K neurons
SC-NeuroCore (CPU)	11 ms	~1.1 s
Brian2 (CPU)	2.5 ms	250 ms
snnTorch (GPU)	0.5 ms	5 ms
GeNN (GPU)	0.3 ms	3 ms

The GPU backend closes this gap by offloading the encoding and accumulation to massively parallel GPU compute shaders.

Why wgpu?

Alternative	Pros	Cons
CUDA	Fastest on NVIDIA	NVIDIA-only, no AMD support
ROCm	AMD support	Linux-only, poor API stability
OpenCL	Cross-platform	Stagnant, poor tooling
Vulkan Compute	Cross-platform, low-level	Verbose API
wgpu	Cross-platform, Rust-native, safe	Slight overhead vs raw Vulkan
Metal	Best on macOS	Apple-only

wgpu was chosen because:

Cross-platform: Vulkan (Linux/Windows), Metal (macOS), DX12 (Windows)
Rust-native: First-class Rust API, no FFI overhead
Production-grade: Powers Firefox WebGPU implementation
AMD support: User's primary GPUs are AMD RX 6600 XT (RDNA2)
No CUDA dependency: Avoids vendor lock-in

GPU-native vs CPU-upload design

Two design alternatives were considered:

Option A: Upload CPU random bytes to GPU

Simple: generate bitstreams on CPU, upload packed words via PCIe
Bottleneck: PCIe bandwidth (8-16 GB/s) limits throughput at batch scale
For 1K inputs × 1024-bit bitstreams × 256 batch = 32 MB/batch → 2 ms upload alone

Option B: GPU-native RNG (chosen)

Philox PRNG generates random bits directly on GPU
Only probabilities (f32, small) and output (f32, small) cross PCIe
For 1K inputs: upload 4 KB probabilities, download 4 KB output
Eliminates PCIe bottleneck entirely

SC_GPU_ADAPTER environment variable

Multi-GPU systems can select a specific adapter via substring match:

export SC_GPU_ADAPTER="AMD"    # selects AMD GPU
export SC_GPU_ADAPTER="1060"   # selects GTX 1060
export SC_GPU_ADAPTER="RDNA"   # selects RDNA-based GPU

If unset, wgpu's HighPerformance heuristic selects the most powerful adapter.

3. Pipeline Position

Python / Rust application
    │
    ▼
┌─────────────────────────────────────────────────────┐
│  GpuDenseLayer                                       │
│                                                     │
│  ┌──────────┐     ┌──────────────────────────────┐  │
│  │ CPU      │     │ GPU (wgpu)                   │  │
│  │ DenseLayer│     │                              │  │
│  │ (fallback)│     │  Encode kernel (Philox RNG)  │  │
│  │          │     │       │                      │  │
│  │          │     │  Accumulate kernel (AND+pop)  │  │
│  │          │     │       │                      │  │
│  │          │     │  Output buffer → readback     │  │
│  └──────────┘     └──────────────────────────────┘  │
│                                                     │
│  Weights: uploaded once, persistent on GPU           │
│  Inputs: uploaded per forward() call (f32 probs)     │
│  Output: downloaded per forward() call (f32 values)  │
└─────────────────────────────────────────────────────┘

Inputs

Input	Type	Range	Description
`input_values`	`Vec<f64>` or `list[float]`	$[0, 1]$	Input probabilities per neuron
`seed`	`u64`	Any	Philox PRNG seed (different seed = different bitstreams)
`n_samples`	`usize` (batch only)	$\geq 1$	Batch size for batch forward

Outputs

Output	Type	Range	Description
`Vec<f64>` or `list[float]`	Per-neuron	$[0, n_{inputs}]$	Stochastic dot product output

4. Features

Feature	Description
Cross-platform GPU	Vulkan, Metal, DX12 via wgpu
Feature-gated	`--features gpu$, \text{no} \text{compile}-\text{time} \text{cost} \text{without}
\text{Philox} 4 \times 32-10 \text{PRNG}	\text{GPU}-\text{native} \text{counter}-\text{based} \text{RNG}
\text{Two}-\text{kernel} \text{architecture}	\text{Encode} + \text{Accumulate}, \text{matching} \text{CPU} \text{hot} \text{path}
\text{Persistent} \text{weights}	\text{Uploaded} \text{once}, \text{reused} \text{across} \text{forward} \text{calls}
\text{Batch} \text{mode}	$forward_batch()` processes multiple samples in one dispatch
Single-sample mode	`forward_fast()` for interactive use
SC_GPU_ADAPTER	Environment variable for GPU selection
CPU fallback	GpuDenseLayer wraps a CPU DenseLayer internally
Pre-allocated buffers	No per-call allocation up to max_batch
PyO3 bindings	`from sc_neurocore_engine import GpuDenseLayer`
Workgroup reduction	Shared memory tree reduction for popcount
Adapter detection	Reports GPU name and availability

5. Usage Examples

Python: basic GPU forward

from sc_neurocore_engine import GpuDenseLayer

# Check GPU availability.
if GpuDenseLayer.is_gpu_available():
    layer = GpuDenseLayer(n_inputs=256, n_neurons=128, length=1024, seed=42)
    print(f"GPU: {layer.gpu_name()}")

    # Single-sample forward.
    inputs = [0.5] * 256
    output = layer.forward_fast(inputs, seed=42)
    print(f"Output shape: {len(output)}, first 5: {output[:5]}")
else:
    print("No GPU available — use CPU DenseLayer instead")

Python: batch forward

import numpy as np

layer = GpuDenseLayer(256, 128, length=1024, seed=42, max_batch=512)

# Batch of 64 samples.
n_samples = 64
inputs_flat = np.random.uniform(0.0, 1.0, size=n_samples * 256).tolist()
outputs = layer.forward_batch(inputs_flat, n_samples=n_samples, seed=42)

# Reshape to (n_samples, n_neurons).
import numpy as np
out_array = np.array(outputs).reshape(n_samples, 128)
print(f"Batch output shape: {out_array.shape}")

Rust: GPU forward

use sc_neurocore_engine::gpu::GpuDenseLayer;

if let Some(layer) = GpuDenseLayer::try_new(256, 128, 1024, 42, 256) {
    let inputs = vec![0.5f64; 256];
    let output = layer.forward_gpu(&inputs, 42);
    println!("GPU output: {:?}", &output[..5]);
    println!("GPU: {}", layer.gpu_name());
}

GPU adapter selection

# List available adapters.
python3 -c "
from sc_neurocore_engine import GpuDenseLayer
layer = GpuDenseLayer(64, 32)
print(f'Selected: {layer.gpu_name()}')
"

# Force AMD GPU.
SC_GPU_ADAPTER="AMD" python3 -c "
from sc_neurocore_engine import GpuDenseLayer
layer = GpuDenseLayer(64, 32)
print(f'Selected: {layer.gpu_name()}')
"

Build with GPU support

# Compile Rust engine with GPU feature.
cd engine
cargo build --release --features gpu

# Run GPU tests.
cargo test --features gpu

# Run GPU benchmarks.
cargo bench --features gpu --bench gpu_bench

GPU vs CPU comparison

import time
from sc_neurocore_engine import GpuDenseLayer

sizes = [(64, 32), (256, 128), (512, 256), (1000, 500)]
for n_in, n_out in sizes:
    layer = GpuDenseLayer(n_in, n_out, length=1024, seed=42)
    inputs = [0.5] * n_in

    # Warm up.
    for _ in range(10):
        layer.forward_fast(inputs, seed=42)

    # Measure.
    start = time.perf_counter()
    for _ in range(100):
        layer.forward_fast(inputs, seed=42)
    elapsed = (time.perf_counter() - start) / 100
    print(f"{n_in}x{n_out}: {elapsed*1e6:.0f} us/forward")

6. Technical Reference

Class: `GpuDenseLayer` (Python)

Exposed via PyO3. Module: sc_neurocore_engine.

Constructor

GpuDenseLayer(n_inputs, n_neurons, length=1024, seed=24301, max_batch=256)

Parameter	Type	Default	Description
`n_inputs`	`int`	Required	Number of input features
`n_neurons`	`int`	Required	Number of output neurons
`length`	`int`	`1024`	Bitstream length (higher = more accurate, slower)
`seed`	`int`	`24301`	Weight initialisation seed
`max_batch`	`int`	`256`	Maximum batch size (pre-allocates buffers)

Raises RuntimeError if no GPU is available.

Methods

Method	Signature	Description
`forward_fast`	`(inputs: list[float], seed: int) -> list[float]`	Single-sample forward
`forward_batch`	`(inputs_flat: list[float], n_samples: int, seed: int) -> list[float]`	Batch forward (flat row-major)
`gpu_name`	`() -> str`	Name of the GPU adapter
`is_gpu_available`	`() -> bool` (static)	Check GPU availability

Properties

Property	Type	Description
`n_inputs`	`int`	Number of input features
`n_neurons`	`int`	Number of output neurons
`length`	`int`	Bitstream length

Rust API

pub struct GpuDenseLayer {
    pub cpu: DenseLayer,
    ctx: Arc<GpuContext>,
    weight_buf: wgpu::Buffer,
    // ... pre-allocated buffers
}

impl GpuDenseLayer {
    pub fn try_new(n_inputs, n_neurons, length, seed, max_batch) -> Option<Self>
    pub fn forward_gpu(&self, inputs: &[f64], seed: u64) -> Vec<f64>
    pub fn forward_batch_gpu(&self, inputs_flat: &[f64], n_samples: usize, seed: u64) -> Vec<f64>
    pub fn gpu_name(&self) -> &str
}

GpuContext singleton

pub struct GpuContext {
    pub device: wgpu::Device,
    pub queue: wgpu::Queue,
    pub encode_pipeline: wgpu::ComputePipeline,
    pub accumulate_pipeline: wgpu::ComputePipeline,
    pub adapter_name: String,
}

Initialised once via OnceLock. Shared across all GpuDenseLayer instances.

WGSL shader files

File	Workgroup size	Description
`philox.wgsl$	—	\text{Philox} 4 \times 32-10 \text{RNG} \text{functions} (\text{included} \text{by} \text{encode})
$encode.wgsl`	256	Bernoulli encoding: probability → packed bitstream
`accumulate.wgsl`	256	AND + popcount + workgroup tree reduction

Buffer layout

Buffer	Size	Persistence	Description
`weight_buf`	$N_{in} \times N_{out} \times W$ × 8 bytes	Persistent	Packed weight bitstreams
`input_prob_buf`	$N_{in} \times B$ × 4 bytes	Per-call	Input probabilities (f32)
`packed_input_buf`	$N_{in} \times W \times B$ × 8 bytes	Per-call	Encoded input bitstreams
`output_buf`	$N_{out} \times B$ × 4 bytes	Per-call	Output values (f32)
`output_staging`	Same as output	Per-call	MAP_READ staging buffer

where $W = \lceil L/64 \rceil$ (words per bitstream) and $B$ = batch size.

Memory limits

The maxStorageBufferBindingSize on most GPUs is 128-256 MB. For 1000×1000 layers with length=1024 (16 words): weight buffer = $1000 \times 1000 \times 16 \times 8$ = 128 MB. Larger layers may require chunking or reduced bitstream length.

7. Performance Benchmarks

Local GPU benchmarks (i5-11600K host)

Measured with Criterion micro-benchmarks (commit 27299c8b):

Layer size	CPU (i5-11600K)	GTX 1060 (x16 PCIe)	RX 6600 XT (x1 PCIe)
64×32	58 µs	137 µs (0.4×)	1.0 ms (0.06×)
512×256	5.6 ms	263 µs (21.3×)	1.65 ms (3.4×)
1000×500	11.2 ms	539 µs (20.9×)	2.8 ms (4.0×)

Key findings

GPU crossover at ~128 neurons: Below this, dispatch overhead (~50 µs) dominates
21× speedup at 512+ neurons on GTX 1060 (x16 PCIe 3.0)
RX 6600 XT limited by PCIe x1 riser: Only 985 MB/s bandwidth vs 15.75 GB/s for x16
Batch mode amortises dispatch: 256 samples in one dispatch costs ~2× single-sample

GTX 1060 vs RX 6600 XT

The RX 6600 XT has more compute units (32 vs 10) and higher bandwidth (256 GB/s vs 192 GB/s) but is connected via a PCIe x1 mining riser, limiting host-GPU transfer to ~985 MB/s. With a proper x16 connection, the RX 6600 XT would likely outperform the GTX 1060 by 2-3×.

Scaling behaviour

Dimension	Complexity	GPU advantage
$N < 128$	O(N²) small	CPU faster (dispatch overhead)
$128 \leq N < 512$	O(N²) medium	GPU ~5-10× faster
$N \geq 512$	O(N²) large	GPU ~20× faster (memory-bound)
Batch $B$	O(B × N²)	GPU ~40× (amortised dispatch)

Memory bandwidth analysis

For 512×256 layer with length=1024:

Weight reads: $512 \times 256 \times 16 \times 8$ = 16.7 MB
Input reads: $512 \times 16 \times 8$ = 65 KB (negligible)
Output writes: $256 \times 4$ = 1 KB (negligible)

GTX 1060 memory bandwidth: 192 GB/s → theoretical minimum: 16.7 MB / 192 GB/s = 87 µs. Measured: 263 µs → ~33% memory bandwidth utilisation (reasonable for compute shaders with workgroup synchronisation overhead).

8. Citations

wgpu. "wgpu: Safe and portable GPU abstraction in Rust." https://wgpu.rs/ — Cross-platform GPU compute framework.
Philox PRNG. Salmon, J. K. et al. "Parallel Random Numbers: As Easy as 1, 2, 3." SC '11: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2011. — Counter-based RNG algorithm used for GPU Bernoulli encoding.
Stochastic computing. Alaghi, A. & Hayes, J. P. "Survey of Stochastic Computing." ACM Transactions on Embedded Computing Systems 12(2s):92:1-92:19, 2013. — Theoretical foundation for bitstream-based computation.
WGSL specification. W3C. "WebGPU Shading Language." https://www.w3.org/TR/WGSL/ — Shader language used for compute kernels.
GPU-accelerated SNNs. Knight, J. C. & Nowotny, T. "GPUs outperform current HPC and neuromorphic solutions in terms of speed and energy when simulating a highly-connected cortical model." Frontiers in Neuroscience 12:941, 2018. — Motivation for GPU acceleration of spiking network simulation.
GeNN. Yavuz, E. et al. "GeNN: a code generation framework for accelerated brain simulations." Scientific Reports 6:18854, 2016. — GPU-accelerated SNN framework, comparison target.

Validation

Test suite results (11 GPU integration tests)

All tests passing with cargo test --features gpu:

Test	What it verifies	Status
`gpu_layer_creation`	GpuDenseLayer::try_new succeeds	PASS
`gpu_output_shape`	Output length = n_neurons	PASS
`gpu_output_non_negative`	All outputs ≥ 0	PASS
`gpu_output_range`	All outputs ≤ n_inputs	PASS
`gpu_deterministic`	Same seed → same output	PASS
`gpu_cpu_agreement`	GPU ≈ CPU within 5% relative	PASS
`gpu_batch`	Batch output shape = n_samples × n_neurons	PASS
`gpu_zero_inputs`	All-zero inputs → ~zero output	PASS
`gpu_all_ones`	All-one inputs → output ≈ sum(weights)	PASS
`gpu_different_seeds`	Different seeds → different output	PASS
`gpu_adapter_name`	Non-empty adapter name string	PASS

GPU-CPU parity

GPU and CPU outputs agree within 5% relative tolerance (at the scale of n_inputs). The remaining difference arises from:

Different PRNG (Philox on GPU vs rand on CPU)
f32 accumulation on GPU vs f64 on CPU
Different operation ordering (parallel vs sequential)

These differences are within the inherent stochastic noise of the bitstream computation.

Design Decisions

Why two kernels instead of one fused kernel?

Separating encode and accumulate allows:

Memory efficiency: Encoded bitstreams can be reused across neurons
Debugging: Each stage can be tested independently
Flexibility: Different encode strategies (Bernoulli, LFSR, Sobol) can be swapped without changing the accumulate kernel

A fused kernel would reduce global memory traffic but increase register pressure and limit workgroup occupancy.

Why f32 on GPU, f64 on CPU?

RDNA2 (RX 6600 XT) has 1:1 f32:f16 ratio but only 1:16 f64 rate. Pascal (GTX 1060) has 1:32 f64 rate. Since the Bernoulli encoding quantises to 8-bit effective precision anyway (p mapped to [0, 255] threshold), f32 accumulation is more than sufficient.

Why persistent weights?

Weights are uploaded once at construction time and stored in a GPU buffer. They are reused across all forward calls. This amortises the upload cost (which can be significant for large layers) and keeps the per-call data transfer minimal (only input probabilities and output values).

No auto-chunking: Layers exceeding the maxStorageBufferBindingSize (typically 128-256 MB) will fail at creation time. Manual chunking is needed.
No mixed precision: All GPU computation is f32. No f16 or int8 modes.
No sparse support: Only dense layers. Sparse connectivity patterns still use the full weight matrix.
No gradient computation: Forward pass only. Training requires CPU fallback with surrogate gradients.
No multi-GPU: Only one adapter is used (selected by SC_GPU_ADAPTER or wgpu heuristic). Multi-GPU data parallelism is not supported.
Dispatch overhead: ~50 µs per kernel dispatch makes GPU slower than CPU for small layers (<128 neurons).
PCIe bandwidth dependency: Performance depends heavily on PCIe link speed. x1 risers (mining rigs) significantly reduce throughput.
No timeline profiling: No built-in GPU profiling. Use external tools (Nsight, RenderDoc) for kernel-level analysis.

SC-NeuroCore v3.14.0 — Stochastic Computing Spiking Neural Network Framework