GPU Compute Backend (wgpu)

April 7, 2026 · View on GitHub

Module: sc_neurocore_engine.GpuDenseLayer Rust path: sc_neurocore_engine::gpu::GpuDenseLayer Feature gate: --features gpu Backends: Vulkan (Linux/Windows), Metal (macOS), DX12 (Windows) Tested on: NVIDIA GTX 1060 6GB, AMD RX 6600 XT 8GB


1. Mathematical Formalism

Stochastic computing dense layer

The GPU backend accelerates the SC-NeuroCore DenseLayer, which performs stochastic computing matrix-vector multiplication using Bernoulli-encoded bitstreams and AND+popcount accumulation.

Bernoulli encoding:

Bi[k]Bernoulli(pi),k=1,,LB_i[k] \sim \text{Bernoulli}(p_i), \quad k = 1, \ldots, L

Each input probability pip_i is encoded as a bitstream of length LL where each bit is independently 1 with probability pip_i.

Stochastic AND multiplication:

Yij[k]=Wij[k]Bi[k]Y_{ij}[k] = W_{ij}[k] \wedge B_i[k]

where WijW_{ij} is the pre-encoded weight bitstream and BiB_i is the input bitstream. The AND operation implements stochastic multiplication: E[Y]=wijpi\mathbb{E}[Y] = w_{ij} \cdot p_i.

Popcount accumulation:

outj=1Li=1Ninpopcount(Yij)\text{out}_j = \frac{1}{L} \sum_{i=1}^{N_{in}} \text{popcount}(Y_{ij})

The output for neuron jj is the normalised sum of bit-counts across all inputs.

GPU kernel architecture

Two compute shader kernels mirror the CPU hot path:

Kernel 1 — Encode (encode.wgsl):

Each GPU thread generates one packed vec2<u32> word (64 bits) of a Bernoulli bitstream. The PRNG is Philox 4×32-10, a counter-based RNG that enables massively parallel independent random streams without state sharing.

Dispatch: (ceil(n_inputs×words_per_input/256),n_samples,1)(\text{ceil}(\text{n\_inputs} \times \text{words\_per\_input} / 256), \text{n\_samples}, 1)

Kernel 2 — Accumulate (accumulate.wgsl):

Each workgroup processes one (neuron, sample) pair. 256 threads cooperatively AND+popcount all input words for one neuron, then perform workgroup-local tree reduction via shared memory.

Dispatch: (n_neurons, n_samples, 1) — one workgroup per (neuron, sample)

Philox 4×32-10 PRNG

Counter-based RNG producing 4 random u32 values per round:

Input: (counter[4], key[2])
10 rounds of: {
    x0 = mulhi(R0, counter[0]) ^ key[0] ^ counter[3]
    x1 = mullo(R0, counter[0])
    x2 = mulhi(R1, counter[2]) ^ key[1] ^ counter[1]
    x3 = mullo(R1, counter[2])
    counter = (x0, x1, x2, x3)
    key += (PHILOX_W0, PHILOX_W1)
}

Constants: R0=0xD2511F53R_0 = \text{0xD2511F53}, R1=0xCD9E8D57R_1 = \text{0xCD9E8D57}, W0=0x9E3779B9W_0 = \text{0x9E3779B9} (golden ratio), W1=0xBB67AE85W_1 = \text{0xBB67AE85}.

The PRNG is implemented in WGSL with 16-bit decomposition for mulhilo since WGSL has no native 64-bit integer multiplication.

u64 emulation via vec2

WGSL has no native u64 type. Bitstreams are packed as vec2<u32> (2 × 32 bits = 64 bits). AND and popcount operate on both halves independently:

let result = vec2<u32>(a.x & b.x, a.y & b.y);
let count = countOneBits(result.x) + countOneBits(result.y);

This incurs ~15-25% overhead vs native u64 popcount but is memory-bound anyway, so the penalty is negligible at large layer sizes.


2. Theoretical Context

Problem statement

SC-NeuroCore's single-neuron Rust engine is best-in-class (3.8 ns/step LIF), but network-level stochastic computing is O(n²) in the number of neurons × inputs, dominated by Bernoulli encoding and AND+popcount. At 1K+ neurons, CPU performance is insufficient for real-time simulation:

Framework1K neurons10K neurons
SC-NeuroCore (CPU)11 ms~1.1 s
Brian2 (CPU)2.5 ms250 ms
snnTorch (GPU)0.5 ms5 ms
GeNN (GPU)0.3 ms3 ms

The GPU backend closes this gap by offloading the encoding and accumulation to massively parallel GPU compute shaders.

Why wgpu?

AlternativeProsCons
CUDAFastest on NVIDIANVIDIA-only, no AMD support
ROCmAMD supportLinux-only, poor API stability
OpenCLCross-platformStagnant, poor tooling
Vulkan ComputeCross-platform, low-levelVerbose API
wgpuCross-platform, Rust-native, safeSlight overhead vs raw Vulkan
MetalBest on macOSApple-only

wgpu was chosen because:

  1. Cross-platform: Vulkan (Linux/Windows), Metal (macOS), DX12 (Windows)
  2. Rust-native: First-class Rust API, no FFI overhead
  3. Production-grade: Powers Firefox WebGPU implementation
  4. AMD support: User's primary GPUs are AMD RX 6600 XT (RDNA2)
  5. No CUDA dependency: Avoids vendor lock-in

GPU-native vs CPU-upload design

Two design alternatives were considered:

Option A: Upload CPU random bytes to GPU

  • Simple: generate bitstreams on CPU, upload packed words via PCIe
  • Bottleneck: PCIe bandwidth (8-16 GB/s) limits throughput at batch scale
  • For 1K inputs × 1024-bit bitstreams × 256 batch = 32 MB/batch → 2 ms upload alone

Option B: GPU-native RNG (chosen)

  • Philox PRNG generates random bits directly on GPU
  • Only probabilities (f32, small) and output (f32, small) cross PCIe
  • For 1K inputs: upload 4 KB probabilities, download 4 KB output
  • Eliminates PCIe bottleneck entirely

SC_GPU_ADAPTER environment variable

Multi-GPU systems can select a specific adapter via substring match:

export SC_GPU_ADAPTER="AMD"    # selects AMD GPU
export SC_GPU_ADAPTER="1060"   # selects GTX 1060
export SC_GPU_ADAPTER="RDNA"   # selects RDNA-based GPU

If unset, wgpu's HighPerformance heuristic selects the most powerful adapter.


3. Pipeline Position

Python / Rust application


┌─────────────────────────────────────────────────────┐
│  GpuDenseLayer                                       │
│                                                     │
│  ┌──────────┐     ┌──────────────────────────────┐  │
│  │ CPU      │     │ GPU (wgpu)                   │  │
│  │ DenseLayer│     │                              │  │
│  │ (fallback)│     │  Encode kernel (Philox RNG)  │  │
│  │          │     │       │                      │  │
│  │          │     │  Accumulate kernel (AND+pop)  │  │
│  │          │     │       │                      │  │
│  │          │     │  Output buffer → readback     │  │
│  └──────────┘     └──────────────────────────────┘  │
│                                                     │
│  Weights: uploaded once, persistent on GPU           │
│  Inputs: uploaded per forward() call (f32 probs)     │
│  Output: downloaded per forward() call (f32 values)  │
└─────────────────────────────────────────────────────┘

Inputs

InputTypeRangeDescription
input_valuesVec<f64> or list[float][0,1][0, 1]Input probabilities per neuron
seedu64AnyPhilox PRNG seed (different seed = different bitstreams)
n_samplesusize (batch only)1\geq 1Batch size for batch forward

Outputs

OutputTypeRangeDescription
Vec<f64> or list[float]Per-neuron[0,ninputs][0, n_{inputs}]Stochastic dot product output

4. Features

FeatureDescription
Cross-platform GPUVulkan, Metal, DX12 via wgpu
Feature-gated`--features gpu$, \text{no} \text{compile}-\text{time} \text{cost} \text{without}
\text{Philox} 4 \times 32-10 \text{PRNG}\text{GPU}-\text{native} \text{counter}-\text{based} \text{RNG}
\text{Two}-\text{kernel} \text{architecture}\text{Encode} + \text{Accumulate}, \text{matching} \text{CPU} \text{hot} \text{path}
\text{Persistent} \text{weights}\text{Uploaded} \text{once}, \text{reused} \text{across} \text{forward} \text{calls}
\text{Batch} \text{mode}$forward_batch()` processes multiple samples in one dispatch
Single-sample modeforward_fast() for interactive use
SC_GPU_ADAPTEREnvironment variable for GPU selection
CPU fallbackGpuDenseLayer wraps a CPU DenseLayer internally
Pre-allocated buffersNo per-call allocation up to max_batch
PyO3 bindingsfrom sc_neurocore_engine import GpuDenseLayer
Workgroup reductionShared memory tree reduction for popcount
Adapter detectionReports GPU name and availability

5. Usage Examples

Python: basic GPU forward

from sc_neurocore_engine import GpuDenseLayer

# Check GPU availability.
if GpuDenseLayer.is_gpu_available():
    layer = GpuDenseLayer(n_inputs=256, n_neurons=128, length=1024, seed=42)
    print(f"GPU: {layer.gpu_name()}")

    # Single-sample forward.
    inputs = [0.5] * 256
    output = layer.forward_fast(inputs, seed=42)
    print(f"Output shape: {len(output)}, first 5: {output[:5]}")
else:
    print("No GPU available — use CPU DenseLayer instead")

Python: batch forward

import numpy as np

layer = GpuDenseLayer(256, 128, length=1024, seed=42, max_batch=512)

# Batch of 64 samples.
n_samples = 64
inputs_flat = np.random.uniform(0.0, 1.0, size=n_samples * 256).tolist()
outputs = layer.forward_batch(inputs_flat, n_samples=n_samples, seed=42)

# Reshape to (n_samples, n_neurons).
import numpy as np
out_array = np.array(outputs).reshape(n_samples, 128)
print(f"Batch output shape: {out_array.shape}")

Rust: GPU forward

use sc_neurocore_engine::gpu::GpuDenseLayer;

if let Some(layer) = GpuDenseLayer::try_new(256, 128, 1024, 42, 256) {
    let inputs = vec![0.5f64; 256];
    let output = layer.forward_gpu(&inputs, 42);
    println!("GPU output: {:?}", &output[..5]);
    println!("GPU: {}", layer.gpu_name());
}

GPU adapter selection

# List available adapters.
python3 -c "
from sc_neurocore_engine import GpuDenseLayer
layer = GpuDenseLayer(64, 32)
print(f'Selected: {layer.gpu_name()}')
"

# Force AMD GPU.
SC_GPU_ADAPTER="AMD" python3 -c "
from sc_neurocore_engine import GpuDenseLayer
layer = GpuDenseLayer(64, 32)
print(f'Selected: {layer.gpu_name()}')
"

Build with GPU support

# Compile Rust engine with GPU feature.
cd engine
cargo build --release --features gpu

# Run GPU tests.
cargo test --features gpu

# Run GPU benchmarks.
cargo bench --features gpu --bench gpu_bench

GPU vs CPU comparison

import time
from sc_neurocore_engine import GpuDenseLayer

sizes = [(64, 32), (256, 128), (512, 256), (1000, 500)]
for n_in, n_out in sizes:
    layer = GpuDenseLayer(n_in, n_out, length=1024, seed=42)
    inputs = [0.5] * n_in

    # Warm up.
    for _ in range(10):
        layer.forward_fast(inputs, seed=42)

    # Measure.
    start = time.perf_counter()
    for _ in range(100):
        layer.forward_fast(inputs, seed=42)
    elapsed = (time.perf_counter() - start) / 100
    print(f"{n_in}x{n_out}: {elapsed*1e6:.0f} us/forward")

6. Technical Reference

Class: GpuDenseLayer (Python)

Exposed via PyO3. Module: sc_neurocore_engine.

Constructor

GpuDenseLayer(n_inputs, n_neurons, length=1024, seed=24301, max_batch=256)
ParameterTypeDefaultDescription
n_inputsintRequiredNumber of input features
n_neuronsintRequiredNumber of output neurons
lengthint1024Bitstream length (higher = more accurate, slower)
seedint24301Weight initialisation seed
max_batchint256Maximum batch size (pre-allocates buffers)

Raises RuntimeError if no GPU is available.

Methods

MethodSignatureDescription
forward_fast(inputs: list[float], seed: int) -> list[float]Single-sample forward
forward_batch(inputs_flat: list[float], n_samples: int, seed: int) -> list[float]Batch forward (flat row-major)
gpu_name() -> strName of the GPU adapter
is_gpu_available() -> bool (static)Check GPU availability

Properties

PropertyTypeDescription
n_inputsintNumber of input features
n_neuronsintNumber of output neurons
lengthintBitstream length

Rust API

pub struct GpuDenseLayer {
    pub cpu: DenseLayer,
    ctx: Arc<GpuContext>,
    weight_buf: wgpu::Buffer,
    // ... pre-allocated buffers
}

impl GpuDenseLayer {
    pub fn try_new(n_inputs, n_neurons, length, seed, max_batch) -> Option<Self>
    pub fn forward_gpu(&self, inputs: &[f64], seed: u64) -> Vec<f64>
    pub fn forward_batch_gpu(&self, inputs_flat: &[f64], n_samples: usize, seed: u64) -> Vec<f64>
    pub fn gpu_name(&self) -> &str
}

GpuContext singleton

pub struct GpuContext {
    pub device: wgpu::Device,
    pub queue: wgpu::Queue,
    pub encode_pipeline: wgpu::ComputePipeline,
    pub accumulate_pipeline: wgpu::ComputePipeline,
    pub adapter_name: String,
}

Initialised once via OnceLock. Shared across all GpuDenseLayer instances.

WGSL shader files

FileWorkgroup sizeDescription
`philox.wgsl$\text{Philox} 4 \times 32-10 \text{RNG} \text{functions} (\text{included} \text{by} \text{encode})
$encode.wgsl`256Bernoulli encoding: probability → packed bitstream
accumulate.wgsl256AND + popcount + workgroup tree reduction

Buffer layout

BufferSizePersistenceDescription
weight_bufNin×Nout×WN_{in} \times N_{out} \times W × 8 bytesPersistentPacked weight bitstreams
input_prob_bufNin×BN_{in} \times B × 4 bytesPer-callInput probabilities (f32)
packed_input_bufNin×W×BN_{in} \times W \times B × 8 bytesPer-callEncoded input bitstreams
output_bufNout×BN_{out} \times B × 4 bytesPer-callOutput values (f32)
output_stagingSame as outputPer-callMAP_READ staging buffer

where W=L/64W = \lceil L/64 \rceil (words per bitstream) and BB = batch size.

Memory limits

The maxStorageBufferBindingSize on most GPUs is 128-256 MB. For 1000×1000 layers with length=1024 (16 words): weight buffer = $1000 \times 1000 \times 16 \times 8$ = 128 MB. Larger layers may require chunking or reduced bitstream length.


7. Performance Benchmarks

Local GPU benchmarks (i5-11600K host)

Measured with Criterion micro-benchmarks (commit 27299c8b):

Layer sizeCPU (i5-11600K)GTX 1060 (x16 PCIe)RX 6600 XT (x1 PCIe)
64×3258 µs137 µs (0.4×)1.0 ms (0.06×)
512×2565.6 ms263 µs (21.3×)1.65 ms (3.4×)
1000×50011.2 ms539 µs (20.9×)2.8 ms (4.0×)

Key findings

  1. GPU crossover at ~128 neurons: Below this, dispatch overhead (~50 µs) dominates
  2. 21× speedup at 512+ neurons on GTX 1060 (x16 PCIe 3.0)
  3. RX 6600 XT limited by PCIe x1 riser: Only 985 MB/s bandwidth vs 15.75 GB/s for x16
  4. Batch mode amortises dispatch: 256 samples in one dispatch costs ~2× single-sample

GTX 1060 vs RX 6600 XT

The RX 6600 XT has more compute units (32 vs 10) and higher bandwidth (256 GB/s vs 192 GB/s) but is connected via a PCIe x1 mining riser, limiting host-GPU transfer to ~985 MB/s. With a proper x16 connection, the RX 6600 XT would likely outperform the GTX 1060 by 2-3×.

Scaling behaviour

DimensionComplexityGPU advantage
N<128N < 128O(N²) smallCPU faster (dispatch overhead)
$128 \leq N < 512$O(N²) mediumGPU ~5-10× faster
N512N \geq 512O(N²) largeGPU ~20× faster (memory-bound)
Batch BBO(B × N²)GPU ~40× (amortised dispatch)

Memory bandwidth analysis

For 512×256 layer with length=1024:

  • Weight reads: $512 \times 256 \times 16 \times 8$ = 16.7 MB
  • Input reads: $512 \times 16 \times 8$ = 65 KB (negligible)
  • Output writes: $256 \times 4$ = 1 KB (negligible)

GTX 1060 memory bandwidth: 192 GB/s → theoretical minimum: 16.7 MB / 192 GB/s = 87 µs. Measured: 263 µs → ~33% memory bandwidth utilisation (reasonable for compute shaders with workgroup synchronisation overhead).


8. Citations

  1. wgpu. "wgpu: Safe and portable GPU abstraction in Rust." https://wgpu.rs/ — Cross-platform GPU compute framework.

  2. Philox PRNG. Salmon, J. K. et al. "Parallel Random Numbers: As Easy as 1, 2, 3." SC '11: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2011. — Counter-based RNG algorithm used for GPU Bernoulli encoding.

  3. Stochastic computing. Alaghi, A. & Hayes, J. P. "Survey of Stochastic Computing." ACM Transactions on Embedded Computing Systems 12(2s):92:1-92:19, 2013. — Theoretical foundation for bitstream-based computation.

  4. WGSL specification. W3C. "WebGPU Shading Language." https://www.w3.org/TR/WGSL/ — Shader language used for compute kernels.

  5. GPU-accelerated SNNs. Knight, J. C. & Nowotny, T. "GPUs outperform current HPC and neuromorphic solutions in terms of speed and energy when simulating a highly-connected cortical model." Frontiers in Neuroscience 12:941, 2018. — Motivation for GPU acceleration of spiking network simulation.

  6. GeNN. Yavuz, E. et al. "GeNN: a code generation framework for accelerated brain simulations." Scientific Reports 6:18854, 2016. — GPU-accelerated SNN framework, comparison target.


Validation

Test suite results (11 GPU integration tests)

All tests passing with cargo test --features gpu:

TestWhat it verifiesStatus
gpu_layer_creationGpuDenseLayer::try_new succeedsPASS
gpu_output_shapeOutput length = n_neuronsPASS
gpu_output_non_negativeAll outputs ≥ 0PASS
gpu_output_rangeAll outputs ≤ n_inputsPASS
gpu_deterministicSame seed → same outputPASS
gpu_cpu_agreementGPU ≈ CPU within 5% relativePASS
gpu_batchBatch output shape = n_samples × n_neuronsPASS
gpu_zero_inputsAll-zero inputs → ~zero outputPASS
gpu_all_onesAll-one inputs → output ≈ sum(weights)PASS
gpu_different_seedsDifferent seeds → different outputPASS
gpu_adapter_nameNon-empty adapter name stringPASS

GPU-CPU parity

GPU and CPU outputs agree within 5% relative tolerance (at the scale of n_inputs). The remaining difference arises from:

  • Different PRNG (Philox on GPU vs rand on CPU)
  • f32 accumulation on GPU vs f64 on CPU
  • Different operation ordering (parallel vs sequential)

These differences are within the inherent stochastic noise of the bitstream computation.


Design Decisions

Why two kernels instead of one fused kernel?

Separating encode and accumulate allows:

  1. Memory efficiency: Encoded bitstreams can be reused across neurons
  2. Debugging: Each stage can be tested independently
  3. Flexibility: Different encode strategies (Bernoulli, LFSR, Sobol) can be swapped without changing the accumulate kernel

A fused kernel would reduce global memory traffic but increase register pressure and limit workgroup occupancy.

Why f32 on GPU, f64 on CPU?

RDNA2 (RX 6600 XT) has 1:1 f32:f16 ratio but only 1:16 f64 rate. Pascal (GTX 1060) has 1:32 f64 rate. Since the Bernoulli encoding quantises to 8-bit effective precision anyway (p mapped to [0, 255] threshold), f32 accumulation is more than sufficient.

Why persistent weights?

Weights are uploaded once at construction time and stored in a GPU buffer. They are reused across all forward calls. This amortises the upload cost (which can be significant for large layers) and keeps the per-call data transfer minimal (only input probabilities and output values).

Why OnceLock singleton for GpuContext?

The wgpu Device and Queue are heavyweight objects. Creating them for each GpuDenseLayer would waste GPU memory and initialisation time. The singleton ensures all layers share a single device/queue, and the compiled compute pipelines are also shared.

Why max_batch pre-allocation?

GPU buffer allocation is expensive (~100 µs). Pre-allocating for the maximum expected batch size eliminates per-call allocation overhead. If the actual batch size is smaller, only the used portion is dispatched.


Known Limitations

  1. No auto-chunking: Layers exceeding the maxStorageBufferBindingSize (typically 128-256 MB) will fail at creation time. Manual chunking is needed.

  2. No mixed precision: All GPU computation is f32. No f16 or int8 modes.

  3. No sparse support: Only dense layers. Sparse connectivity patterns still use the full weight matrix.

  4. No gradient computation: Forward pass only. Training requires CPU fallback with surrogate gradients.

  5. No multi-GPU: Only one adapter is used (selected by SC_GPU_ADAPTER or wgpu heuristic). Multi-GPU data parallelism is not supported.

  6. Dispatch overhead: ~50 µs per kernel dispatch makes GPU slower than CPU for small layers (<128 neurons).

  7. PCIe bandwidth dependency: Performance depends heavily on PCIe link speed. x1 risers (mining rigs) significantly reduce throughput.

  8. No timeline profiling: No built-in GPU profiling. Use external tools (Nsight, RenderDoc) for kernel-level analysis.


SC-NeuroCore v3.14.0 — Stochastic Computing Spiking Neural Network Framework

© 2020–2026 Miroslav Šotek / ANULUM. AGPL-3.0-or-later.