SIMD Performance Guide

June 10, 2026

VelesDB uses native SIMD dispatch for ultra-fast vector operations, automatically selecting the optimal implementation based on CPU features and vector size.

Native SIMD Architecture (EPIC-052/077)

The simd_native module provides hand-tuned SIMD implementations using core::arch intrinsics:

┌─────────────────────────────────────────────────────────────────┐
│              simd_native::cosine_similarity_native()             │
│                                                                  │
│  Runtime: feature detection → tiered dispatch → native SIMD     │
│  - AVX-512: 4/2/1 accumulators based on size                    │
│  - AVX2: 4/2/1 accumulators based on size                       │
│  - ARM NEON: 128-bit SIMD                                       │
│  - Scalar: fallback for small vectors                           │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
  ┌───────────┐        ┌───────────┐        ┌───────────┐
  │ AVX-512   │        │ AVX2/FMA  │        │  Scalar   │
  │ (512-bit) │        │ (256-bit) │        │ (native)  │
  └───────────┘        └───────────┘        └───────────┘
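
The detection step in the diagram can be sketched with the standard library's runtime feature-detection macros. This is a minimal illustration rather than the simd_native implementation: the `Backend` enum and `detect_backend` function are hypothetical names introduced here.

```rust
/// Hypothetical backend tag used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx512,
    Avx2Fma,
    Neon,
    Scalar,
}

/// Pick the widest backend the running CPU supports, checked at runtime.
pub fn detect_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Backend::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return Backend::Avx2Fma;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    // Any other target, or an older CPU, uses the scalar path.
    Backend::Scalar
}
```

A real dispatcher would typically run this check once and cache the result, which is what eager initialization at startup achieves.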

Architecture Support

Platform	Implementation	Instructions	Performance (768D)
x86_64 AVX-512	simd_native	512-bit 2/4-acc	~38-42ns
x86_64 AVX2	simd_native	256-bit 2/4-acc	~40-82ns
aarch64	simd_native	NEON 128-bit	~60-100ns
WASM	scalar fallback (SIMD128 planned)	Native Rust	see Fallback
Fallback	Scalar	Native Rust	~150-200ns

Tiered Dispatch Strategy (EPIC-077)

Implementations adapt based on vector size and ISA to minimize register pressure and maximize throughput:

AVX-512 (cosine):

Size Range	Accumulators	Stride	Use Case
>= 512 elements	4-acc (12 zmm regs)	64	Large vectors (BERT, ada-002, text-embedding-3-large)
16-511 elements	2-acc (6 zmm regs)	32	Medium vectors (MiniLM)
< 16 elements	Scalar	1	Tiny vectors

AVX2 (cosine):

Size Range	Accumulators	Stride	Use Case
>= 512 elements	4-acc (12 ymm regs)	32	Large vectors
8-511 elements	2-acc (6 ymm regs)	16	Medium/small vectors
< 8 elements	Scalar	1	Tiny vectors

AVX2 (dot product, squared L2):

Size Range	Accumulators	Stride	Use Case
>= 256 elements	4-acc	32	Large vectors
64-255 elements	2-acc	16	Medium vectors
8-63 elements	1-acc	8	Small vectors
< 8 elements	Scalar	1	Tiny vectors
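
The size thresholds in the AVX2 dot product table above map naturally onto a `match` on the vector length. The sketch below only encodes that table; `Tier`, `avx2_dot_tier` and `stride` are hypothetical names, and the actual dispatch code may be organized differently.

```rust
/// Hypothetical tier tag mirroring the rows of the AVX2 dot product table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Scalar,
    OneAcc,
    TwoAcc,
    FourAcc,
}

/// Choose a kernel tier from the vector length (AVX2 dot product / squared L2).
pub fn avx2_dot_tier(len: usize) -> Tier {
    match len {
        0..=7 => Tier::Scalar,
        8..=63 => Tier::OneAcc,
        64..=255 => Tier::TwoAcc,
        _ => Tier::FourAcc,
    }
}

/// Elements consumed per loop iteration for each tier.
pub fn stride(tier: Tier) -> usize {
    match tier {
        Tier::Scalar => 1,
        Tier::OneAcc => 8,
        Tier::TwoAcc => 16,
        Tier::FourAcc => 32,
    }
}
```

With these thresholds a 768D embedding lands in the 4-accumulator tier, while a 128D vector uses two accumulators.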

All AVX2 cosine kernels use vectorized 8-wide remainder handling, reducing the scalar tail from up to 31 elements to at most 7. AVX-512 kernels use masked loads, so the remainder needs no separate scalar tail loop.
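
To make the remainder arithmetic concrete, the sketch below splits a length into full 32-element blocks, 8-wide remainder steps and a scalar tail, and builds the lane mask a 16-lane masked load would need. It assumes the 32-element stride of the 4-accumulator AVX2 kernel; `split_avx2` and `avx512_tail_mask` are hypothetical helpers, not VelesDB functions.

```rust
/// Split `len` into (full 32-element blocks, 8-wide remainder steps, scalar tail).
pub fn split_avx2(len: usize) -> (usize, usize, usize) {
    let blocks = len / 32;
    let rest = len % 32; // up to 31 elements left after the main loop
    (blocks, rest / 8, rest % 8) // scalar tail is at most 7
}

/// Bit mask selecting the valid lanes of the final, partial 16-lane AVX-512 load.
/// Returns 0 when the length is a multiple of 16 (no partial load needed).
pub fn avx512_tail_mask(len: usize) -> u16 {
    let rem = (len % 16) as u32;
    ((1u32 << rem) - 1) as u16
}
```

For example, a 127-element vector yields 3 full blocks, 3 eight-wide steps and a 7-element scalar tail, which is the worst case mentioned above.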

Performance Benchmarks (March 27, 2026)

Distance Functions (768D vectors)

Function	Latency	Throughput	vs Previous
`dot_product_native`	19.8ns	38.8 Gelem/s	Baseline
`euclidean_native`	22.5ns	34.1 Gelem/s	Improved
`cosine_similarity_native`	33.1ns	23.2 Gelem/s	Optimized (4-acc, single-sqrt finish)
`cosine_normalized_native`	19.8ns	38.8 Gelem/s	Same as dot
`hamming_distance_native`	35.8ns	21.5 Gelem/s	FP-domain 4-acc (no cross-domain penalty) + NEON + batch
`jaccard_similarity_native`	35.1ns	21.9 Gelem/s	Optimized (4-acc + NEON + batch)

Measured March 27, 2026 on i9-14900KF (24C/32T, AVX2+FMA), 64GB DDR5, Rust 1.92.0, Windows 11 Pro, sequential run on idle machine.
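
The throughput column appears to follow directly from the latency column: elements per nanosecond is numerically the same as Gelem/s. A small sketch of that conversion, with a hypothetical helper name:

```rust
/// Throughput in Gelem/s for one call over `dimension` elements taking `latency_ns`.
/// One element per nanosecond equals 1e9 elements per second.
pub fn gelem_per_sec(dimension: usize, latency_ns: f64) -> f64 {
    dimension as f64 / latency_ns
}
```

For the 768D dot product, 768 / 19.8 ns gives about 38.8 Gelem/s, matching the table.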

Scaling by Dimension (simd_native)

Dimension	Cosine	Dot Product	Model
128	8.1ns	5.4ns	MiniLM
384	20.1ns	12.0ns	all-MiniLM-L6-v2
768	33.1ns	19.8ns	BERT
1536	69.0ns	43.8ns	ada-002, text-embedding-3-small
3072	112.2ns	91.2ns	text-embedding-3-large

Optimization Techniques

1. 32-Wide Unrolling (4x f32x8)

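// 4 parallel accumulators for maximum ILP.
// Assumes f32x8 from the `wide` crate and equal-length slices `a` and `b`.
use wide::f32x8;

// Load 8 consecutive f32 values starting at index i
let load = |s: &[f32], i: usize| f32x8::from(<[f32; 8]>::try_from(&s[i..i + 8]).unwrap());

let mut sum0 = f32x8::ZERO;
let mut sum1 = f32x8::ZERO;
let mut sum2 = f32x8::ZERO;
let mut sum3 = f32x8::ZERO;

let simd_len = a.len() / 32; // number of full 32-element blocks
for i in 0..simd_len {
    let offset = i * 32;
    sum0 = load(a, offset).mul_add(load(b, offset), sum0);
    sum1 = load(a, offset + 8).mul_add(load(b, offset + 8), sum1);
    sum2 = load(a, offset + 16).mul_add(load(b, offset + 16), sum2);
    sum3 = load(a, offset + 24).mul_add(load(b, offset + 24), sum3);
}

// Combine the accumulators; the remaining a.len() % 32 elements are handled separately
let dot = (sum0 + sum1 + sum2 + sum3).reduce_add();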

Why it works:

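A single accumulator forms one dependency chain, so each FMA waits roughly 4 cycles for the previous result
Recent x86 cores (Zen 3+, Alder Lake+) can start about 2 FMAs per cycle, so independent accumulators let out-of-order execution overlap them
~15-20% faster than single-accumulator SIMD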
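
The same four-accumulator pattern can be written directly against `std::arch` intrinsics. The following is a minimal sketch under stated assumptions, not the simd_native kernel: the function names are hypothetical, the tail is handled with a plain scalar loop instead of the 8-wide remainder described earlier, and non-x86_64 targets simply use the scalar path.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference, also used for the tail and as the fallback.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX2+FMA dot product with 4 independent accumulators (32 floats per iteration).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_4acc(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let blocks = n / 32;
    let (pa, pb) = (a.as_ptr(), b.as_ptr());
    let mut s0 = _mm256_setzero_ps();
    let mut s1 = _mm256_setzero_ps();
    let mut s2 = _mm256_setzero_ps();
    let mut s3 = _mm256_setzero_ps();
    for i in 0..blocks {
        let o = i * 32;
        // Each accumulator has its own dependency chain, so the FMAs can overlap.
        s0 = _mm256_fmadd_ps(_mm256_loadu_ps(pa.add(o)), _mm256_loadu_ps(pb.add(o)), s0);
        s1 = _mm256_fmadd_ps(_mm256_loadu_ps(pa.add(o + 8)), _mm256_loadu_ps(pb.add(o + 8)), s1);
        s2 = _mm256_fmadd_ps(_mm256_loadu_ps(pa.add(o + 16)), _mm256_loadu_ps(pb.add(o + 16)), s2);
        s3 = _mm256_fmadd_ps(_mm256_loadu_ps(pa.add(o + 24)), _mm256_loadu_ps(pb.add(o + 24)), s3);
    }
    // Reduce the four vectors to one, then sum its 8 lanes.
    let sum = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), sum);
    lanes.iter().sum::<f32>() + dot_scalar(&a[blocks * 32..n], &b[blocks * 32..n])
}

/// Runtime-dispatched entry point.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the required CPU features were detected at runtime.
            return unsafe { dot_avx2_4acc(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Because the accumulators are summed in a different order than a scalar loop, results can differ from the scalar reference in the last few bits for non-integer inputs.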

2. Pre-Normalized Vectors

For cosine similarity with pre-normalized vectors:

// Standard cosine: 3 passes (dot, norm_a, norm_b)
pub fn cosine_similarity_fast(a: &[f32], b: &[f32]) -> f32;

// Normalized: 1 pass (dot only), ~40% lower latency at 768D (19.8ns vs 33.1ns)
pub fn cosine_similarity_normalized(a: &[f32], b: &[f32]) -> f32;

Use when:

Vectors are normalized at insertion time
Same vector is compared multiple times
Building custom distance functions
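
The shortcut works because cosine similarity divides the dot product by both norms, and unit-length vectors have norms of 1. A scalar sketch of the equivalence follows; the helper names are hypothetical and the code is deliberately unoptimized.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Full cosine similarity: dot(a, b) / (|a| * |b|), or 0.0 if either vector is zero.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let (na, nb) = (dot(a, a).sqrt(), dot(b, b).sqrt());
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot(a, b) / (na * nb)
    }
}

/// Unit-length copy of `v`; an all-zero vector is returned unchanged.
pub fn normalized(v: &[f32]) -> Vec<f32> {
    let n = dot(v, v).sqrt();
    if n == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / n).collect()
    }
}

/// Cosine for inputs that are already unit length: just the dot product.
pub fn cosine_of_normalized(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b)
}
```

Both paths agree up to floating-point rounding, but the normalized path pays for the square roots and divisions once at insertion instead of on every comparison.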

3. CPU Prefetch Hints

// Prefetch next vectors into L1 cache
#[cfg(target_arch = "x86_64")]
unsafe {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    _mm_prefetch(next_vector.as_ptr().cast::<i8>(), _MM_HINT_T0);
}

Benefits:

Hides memory latency during HNSW traversal
~10-20% improvement on large datasets
Critical for cold cache scenarios
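
One way to apply the hint is to prefetch the next candidate while scoring the current one. The sketch below assumes candidates stored as separate `Vec<f32>` values and a plain scalar dot product; `batch_dot_prefetch` is a hypothetical name, and on targets other than x86_64 the hint is simply skipped.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Score every candidate against `query`, prefetching one candidate ahead.
pub fn batch_dot_prefetch(candidates: &[Vec<f32>], query: &[f32]) -> Vec<f32> {
    let mut scores = Vec::with_capacity(candidates.len());
    for (i, candidate) in candidates.iter().enumerate() {
        #[cfg(target_arch = "x86_64")]
        {
            use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
            if let Some(next) = candidates.get(i + 1) {
                // SAFETY: prefetch is only a hint and does not fault on any address.
                unsafe { _mm_prefetch(next.as_ptr().cast::<i8>(), _MM_HINT_T0) };
            }
        }
        scores.push(dot(candidate, query));
    }
    scores
}
```

A single prefetch only pulls in the first cache line of the next vector; how far ahead to prefetch, and how many lines, is workload-dependent and worth measuring.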

4. Contiguous Memory Layout

pub struct ContiguousVectors {
    data: *mut f32,  // Single contiguous buffer
    dimension: usize,
    count: usize,
}

Why it matters:

Cache line alignment (64 bytes)
Sequential access pattern
Enables hardware prefetching
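
The idea can be illustrated with a safe, simplified variant that keeps all vectors in one `Vec<f32>` and hands out slices by index. This is a sketch only: `FlatVectors` is a hypothetical type, and unlike the raw-pointer struct above it does not control alignment (a `Vec<f32>` is only guaranteed 4-byte alignment, so 64-byte alignment would need a custom allocation).

```rust
/// All vectors stored back to back in a single buffer.
pub struct FlatVectors {
    data: Vec<f32>,
    dimension: usize,
}

impl FlatVectors {
    pub fn new(dimension: usize) -> Self {
        assert!(dimension > 0, "dimension must be non-zero");
        Self { data: Vec::new(), dimension }
    }

    /// Append one vector and return its index.
    pub fn push(&mut self, v: &[f32]) -> usize {
        assert_eq!(v.len(), self.dimension, "dimension mismatch");
        self.data.extend_from_slice(v);
        self.len() - 1
    }

    /// Number of stored vectors.
    pub fn len(&self) -> usize {
        self.data.len() / self.dimension
    }

    /// Borrow vector `i` as a slice, or None if out of range.
    pub fn get(&self, i: usize) -> Option<&[f32]> {
        let start = i.checked_mul(self.dimension)?;
        let end = start.checked_add(self.dimension)?;
        self.data.get(start..end)
    }
}
```

Iterating indices in order walks the buffer sequentially, which is the access pattern hardware prefetchers handle best.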

AVX-512 Transition Cost (Intel Skylake+)

On Intel Skylake-X and later CPUs, AVX-512 instructions incur a significant warmup cost:

Phase	Cycles	Time @ 4GHz
License transition	~20,000	~5μs
Register file power-up	~36,000	~9μs
Total warmup	~56,000	~14μs

Why This Matters

First AVX-512 instruction triggers CPU frequency throttling (P-state transition)
Subsequent instructions run at reduced frequency until warmup completes
Short bursts of AVX-512 may be slower than AVX2 due to transition overhead
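
To put the warmup figures in perspective, the conversion from cycles to time and a rough count of calls that fit inside the warmup window can be sketched as follows. The helper names are hypothetical, and the 33.1 ns figure used in the usage note is the 768D cosine latency from the benchmark table, measured on an AVX2 machine, so it is only a rough yardstick.

```rust
/// Convert a cycle count to microseconds at a given clock speed in GHz.
pub fn cycles_to_micros(cycles: u64, clock_ghz: f64) -> f64 {
    // 1 GHz = 1000 cycles per microsecond.
    cycles as f64 / (clock_ghz * 1000.0)
}

/// How many calls of `call_ns` nanoseconds fit into a window of `window_us` microseconds.
pub fn calls_during(window_us: f64, call_ns: f64) -> f64 {
    window_us * 1000.0 / call_ns
}
```

At 4 GHz, 56,000 cycles is 14 μs, which is on the order of 400 cosine calls at roughly 33 ns each. A query that performs fewer distance computations than that may finish before the warmup does.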

VelesDB Mitigation

The adaptive dispatch system handles this automatically:

// 500 iterations per benchmark captures warmup cost
const BENCHMARK_ITERATIONS: usize = 500;

// Eager initialization at Database::open() avoids first-call latency
let info = simd_ops::init_dispatch();

Result: The dispatch table reflects real-world performance after warmup, ensuring AVX-512 is only selected when it provides a genuine advantage over AVX2.
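
A benchmark-driven selection of this kind can be sketched with `std::time::Instant`: time each candidate kernel over the same inputs and keep the fastest. This is an illustration under assumptions, not the `simd_ops::init_dispatch` implementation; `Kernel`, `pick_fastest` and the two scalar kernels are hypothetical stand-ins for real SIMD backends.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Signature shared by all candidate kernels (hypothetical alias).
pub type Kernel = fn(&[f32], &[f32]) -> f32;

const BENCHMARK_ITERATIONS: usize = 500;

pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scalar kernel with 4 accumulators; assumes equal-length inputs.
pub fn dot_unrolled4(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }
    let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}

/// Total time for BENCHMARK_ITERATIONS calls, so one-off warmup cost is amortized.
fn time_kernel(kernel: Kernel, a: &[f32], b: &[f32]) -> Duration {
    let start = Instant::now();
    for _ in 0..BENCHMARK_ITERATIONS {
        black_box(kernel(black_box(a), black_box(b)));
    }
    start.elapsed()
}

/// Name of the fastest candidate at the given dimension, or None if there are none.
pub fn pick_fastest(candidates: &[(&'static str, Kernel)], dimension: usize) -> Option<&'static str> {
    let a = vec![1.0f32; dimension];
    let b = vec![0.5f32; dimension];
    candidates
        .iter()
        .min_by_key(|(_, kernel)| time_kernel(*kernel, &a, &b))
        .map(|(name, _)| *name)
}
```

Running the selection once at startup, per dimension of interest, keeps the measurement cost off the query path.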

Recommendations

Workload	Recommendation
Sustained vector ops (batch search)	AVX-512 beneficial
Sporadic single queries	AVX2 may be faster
Mixed workloads	Let adaptive dispatch decide

To check which backend was selected:

velesdb simd info

Best Practices

1. Pre-normalize at Insertion

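// Normalize once at insertion; an all-zero vector is left unchanged to avoid dividing by zero
let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
let normalized: Vec<f32> = if norm > 0.0 {
    vector.iter().map(|x| x / norm).collect()
} else {
    vector.to_vec()
};

// Fast cosine at search time (the query must be normalized the same way)
let similarity = cosine_similarity_normalized(&stored, &query);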

2. Batch Operations

// Single query, multiple candidates
let results = batch_cosine_normalized(&candidates, &query);

3. Use Appropriate Metric

Use Case	Recommended Metric
Semantic search	Cosine (normalized)
Image embeddings	Euclidean
Recommendations	Dot Product
Binary features	Hamming
Set similarity	Jaccard

Running Benchmarks

# All SIMD benchmarks
cargo bench --bench simd_benchmark

# Specific dimension
cargo bench --bench simd_benchmark -- "768"

# Compare implementations
cargo bench --bench simd_benchmark -- "explicit_simd|auto_vec"

Native SIMD API

use velesdb_core::simd_native;

// Direct native SIMD calls (no dispatch overhead)
let sim = simd_native::cosine_similarity_native(&a, &b);
let dist = simd_native::euclidean_native(&a, &b);
let dot = simd_native::dot_product_native(&a, &b);
let n = simd_native::norm_native(&v);
simd_native::normalize_inplace_native(&mut v);

// Batch operations with prefetching
let results = simd_native::batch_dot_product_native(&candidates, &query);

// Fast approximate (Newton-Raphson rsqrt)
let fast_sim = simd_native::cosine_similarity_fast(&a, &b);

Module Structure

Module	Purpose	Use When
`simd_native`	Hand-tuned intrinsics (AVX2/AVX-512/NEON)	Maximum performance, native CPU
`wide_simd`	Portable SIMD (f32x8)	WASM, cross-platform
`simd`	Auto-vectorized fallback	Generic builds

Future Optimizations

ARM SVE - Scalable vectors for ARM servers
WASM relaxed SIMD - Additional browser performance
GPU offload - Optional CUDA/Metal for batch operations

License

VelesDB Core is licensed under VelesDB Core License 1.0.