SIMD Performance Guide

June 10, 2026

VelesDB uses native SIMD dispatch for ultra-fast vector operations, automatically selecting the optimal implementation based on CPU features and vector size.

Native SIMD Architecture (EPIC-052/077)

The simd_native module provides hand-tuned SIMD implementations using core::arch intrinsics:

┌─────────────────────────────────────────────────────────────────┐
│              simd_native::cosine_similarity_native()             │
│                                                                  │
│  Runtime: feature detection → tiered dispatch → native SIMD     │
│  - AVX-512: 4/2/1 accumulators based on size                    │
│  - AVX2: 4/2/1 accumulators based on size                       │
│  - ARM NEON: 128-bit SIMD                                       │
│  - Scalar: fallback for small vectors                           │
└─────────────────────────────────────────────────────────────────┘

        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
  ┌───────────┐        ┌───────────┐        ┌───────────┐
  │ AVX-512   │        │ AVX2/FMA  │        │  Scalar   │
  │ (512-bit) │        │ (256-bit) │        │ (native)  │
  └───────────┘        └───────────┘        └───────────┘
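
The detection step in the diagram can be sketched with the standard library's runtime feature macros. This is a minimal illustration rather than VelesDB's dispatch code: `Backend` and `detect_backend` are hypothetical names, and only the backend choice is shown (size-based tiering is omitted).

```rust
/// Hypothetical backend tag; VelesDB's real dispatch types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx512,
    Avx2Fma,
    Neon,
    Scalar,
}

/// Pick the widest instruction set the running CPU reports at runtime.
pub fn detect_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Backend::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return Backend::Avx2Fma;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    // Any other target (or an older CPU) falls through to scalar code.
    Backend::Scalar
}
```

Running the detection once and caching the result keeps the check out of the per-call hot path.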

Architecture Support

| Platform | Implementation | Instructions | Performance (768D) |
|---|---|---|---|
| x86_64 AVX-512 | simd_native | 512-bit 2/4-acc | ~38-42ns |
| x86_64 AVX2 | simd_native | 256-bit 2/4-acc | ~40-82ns |
| aarch64 | simd_native | NEON 128-bit | ~60-100ns |
| WASM | scalar fallback (SIMD128 planned) | Native Rust | see Fallback |
| Fallback | Scalar | Native Rust | ~150-200ns |

Tiered Dispatch Strategy (EPIC-077)

Implementations adapt based on vector size and ISA to minimize register pressure and maximize throughput:

AVX-512 (cosine):

| Size Range | Accumulators | Stride | Use Case |
|---|---|---|---|
| >= 512 elements | 4-acc (12 zmm regs) | 64 | Large vectors (BERT, ada-002, text-embedding-3-large) |
| 16-511 elements | 2-acc (6 zmm regs) | 32 | Medium vectors (MiniLM) |
| < 16 elements | Scalar | 1 | Tiny vectors |

AVX2 (cosine):

| Size Range | Accumulators | Stride | Use Case |
|---|---|---|---|
| >= 512 elements | 4-acc (12 ymm regs) | 32 | Large vectors |
| 8-511 elements | 2-acc (6 ymm regs) | 16 | Medium/small vectors |
| < 8 elements | Scalar | 1 | Tiny vectors |

AVX2 (dot product, squared L2):

| Size Range | Accumulators | Stride | Use Case |
|---|---|---|---|
| >= 256 elements | 4-acc | 32 | Large vectors |
| 64-255 elements | 2-acc | 16 | Medium vectors |
| 8-63 elements | 1-acc | 8 | Small vectors |
| < 8 elements | Scalar | 1 | Tiny vectors |

All AVX2 cosine kernels handle the remainder with an additional 8-wide vectorized loop, reducing the scalar tail from up to 31 elements to at most 7. AVX-512 kernels use masked loads for the remainder, so they need no scalar tail loop.
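
The AVX2 cosine thresholds and the remainder handling described above can be written down as a small selection function. This is a sketch under stated assumptions: `Tier`, `avx2_cosine_tier`, and `split_work` are hypothetical names, and the split only models how the element count divides into main-loop iterations, 8-wide remainder steps, and a scalar tail.

```rust
/// Kernel tiers from the "AVX2 (cosine)" table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Acc4,   // stride 32
    Acc2,   // stride 16
    Scalar, // stride 1
}

impl Tier {
    /// Elements consumed per main-loop iteration.
    pub fn stride(self) -> usize {
        match self {
            Tier::Acc4 => 32,
            Tier::Acc2 => 16,
            Tier::Scalar => 1,
        }
    }
}

/// Map a vector length to a tier using the thresholds in the table.
pub fn avx2_cosine_tier(len: usize) -> Tier {
    match len {
        0..=7 => Tier::Scalar,
        8..=511 => Tier::Acc2,
        _ => Tier::Acc4,
    }
}

/// Split a length into (main iterations, 8-wide remainder steps, scalar tail).
pub fn split_work(len: usize) -> (usize, usize, usize) {
    let stride = avx2_cosine_tier(len).stride();
    let rest = len % stride;
    (len / stride, rest / 8, rest % 8)
}
```

For a 543-element vector this gives 16 main iterations, 3 remainder steps, and a 7-element scalar tail, which is the worst case for the tail.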

Performance Benchmarks (March 27, 2026)

Distance Functions (768D vectors)

| Function | Latency | Throughput | vs Previous |
|---|---|---|---|
| dot_product_native | 19.8ns | 38.8 Gelem/s | Baseline |
| euclidean_native | 22.5ns | 34.1 Gelem/s | Improved |
| cosine_similarity_native | 33.1ns | 23.2 Gelem/s | Optimized (4-acc, single-sqrt finish) |
| cosine_normalized_native | 19.8ns | 38.8 Gelem/s | Same as dot |
| hamming_distance_native | 35.8ns | 21.5 Gelem/s | FP-domain 4-acc (no cross-domain penalty) + NEON + batch |
| jaccard_similarity_native | 35.1ns | 21.9 Gelem/s | Optimized (4-acc + NEON + batch) |

Measured March 27, 2026 on i9-14900KF (24C/32T, AVX2+FMA), 64GB DDR5, Rust 1.92.0, Windows 11 Pro, sequential run on idle machine.
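
The throughput figures appear to be the element count divided by the latency. That is an inference from the numbers, not something the benchmark output states, and `gelem_per_s` is a hypothetical helper written only to show the arithmetic.

```rust
/// Elements processed per nanosecond, which is numerically equal to Gelem/s.
pub fn gelem_per_s(dimension: usize, latency_ns: f64) -> f64 {
    dimension as f64 / latency_ns
}
```

For example, 768 elements in 19.8ns is about 38.8 Gelem/s, and 768 elements in 33.1ns is about 23.2 Gelem/s.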

Scaling by Dimension (simd_native)

| Dimension | Cosine | Dot Product | Model |
|---|---|---|---|
| 128 | 8.1ns | 5.4ns | MiniLM |
| 384 | 20.1ns | 12.0ns | all-MiniLM-L6-v2 |
| 768 | 33.1ns | 19.8ns | BERT |
| 1536 | 69.0ns | 43.8ns | ada-002, text-embedding-3-small |
| 3072 | 112.2ns | 91.2ns | text-embedding-3-large |

Optimization Techniques

1. 32-Wide Unrolling (4x f32x8)

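// 4 parallel accumulators for maximum ILP
let mut sum0 = f32x8::ZERO;
let mut sum1 = f32x8::ZERO;
let mut sum2 = f32x8::ZERO;
let mut sum3 = f32x8::ZERO;

// Each iteration consumes 32 elements of a and b
for i in 0..simd_len {
    let offset = i * 32;
    // Loads elided: va0..va3 and vb0..vb3 are the four consecutive f32x8
    // chunks of a and b at offset, offset + 8, offset + 16, offset + 24
    sum0 = va0.mul_add(vb0, sum0);
    sum1 = va1.mul_add(vb1, sum1);
    sum2 = va2.mul_add(vb2, sum2);
    sum3 = va3.mul_add(vb3, sum3);
}
// After the loop, add the four accumulators and reduce horizontally;
// elements past simd_len * 32 are handled separately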

Why it works:

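  • Modern x86 cores (Zen 3+, Alder Lake+) pipeline FMAs with a latency of about 4 cycles, so a single accumulator stalls on its own dependency chain
  • Independent accumulators let out-of-order execution keep several FMAs in flight at once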
  • ~15-20% faster than single-accumulator SIMD
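
A dependency-free version of the same idea is sketched below, using plain arrays in place of `f32x8` so it compiles without any SIMD crate. `dot_4acc` is a hypothetical name; the compiler may or may not vectorize the inner loops, so this shows the accumulator structure rather than the tuned kernel.

```rust
/// Dot product with four independent 8-lane accumulators (32 elements per step).
pub fn dot_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let mut acc = [[0.0f32; 8]; 4];

    let mut chunks_a = a.chunks_exact(32);
    let mut chunks_b = b.chunks_exact(32);
    for (ca, cb) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        // Accumulator k only depends on its own previous value,
        // so the four chains can overlap in the pipeline.
        for k in 0..4 {
            for lane in 0..8 {
                let i = k * 8 + lane;
                acc[k][lane] += ca[i] * cb[i];
            }
        }
    }

    // Combine the accumulators pairwise, then sum the 8 lanes.
    let mut total = 0.0f32;
    for lane in 0..8 {
        total += (acc[0][lane] + acc[1][lane]) + (acc[2][lane] + acc[3][lane]);
    }

    // Scalar tail for the remaining len % 32 elements.
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        total += x * y;
    }
    total
}
```

Because the four partial sums are combined only at the end, the result can differ from a strictly sequential sum in the last few bits.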

2. Pre-Normalized Vectors

For cosine similarity with pre-normalized vectors:

// Standard cosine: 3 passes (dot, norm_a, norm_b)
pub fn cosine_similarity_fast(a: &[f32], b: &[f32]) -> f32;

// Normalized: 1 pass (dot only), ~40% lower latency (19.8ns vs 33.1ns at 768D)
pub fn cosine_similarity_normalized(a: &[f32], b: &[f32]) -> f32;

Use when:

  • Vectors are normalized at insertion time
  • Same vector is compared multiple times
  • Building custom distance functions
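
The reason the dot product suffices is that cosine similarity divides by both norms, and both norms are 1 after normalization. The following sketch checks that equivalence with plain scalar code; `dot`, `cosine`, and `normalize` are illustrative helpers, not VelesDB functions.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Full cosine: dot product divided by both norms.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt())
}

/// Scale a vector to unit length (zero vectors are returned unchanged).
pub fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = dot(v, v).sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}
```

With `a = [3, 4]` and `b = [4, 3]`, `cosine(a, b)` and the dot product of `normalize(a)` and `normalize(b)` agree up to floating-point rounding.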

3. CPU Prefetch Hints

// Prefetch next vectors into L1 cache
#[cfg(target_arch = "x86_64")]
unsafe {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    _mm_prefetch(next_vector.as_ptr().cast::<i8>(), _MM_HINT_T0);
}

Benefits:

  • Hides memory latency during HNSW traversal
  • ~10-20% improvement on large datasets
  • Critical for cold cache scenarios
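
In a scoring loop, the hint is typically issued for the next candidate while the current one is being processed. The sketch below assumes candidates stored as separate `Vec<f32>` values and uses hypothetical names (`prefetch_l1`, `score_with_prefetch`); how far ahead VelesDB prefetches during HNSW traversal is not stated here.

```rust
/// Hint the CPU to pull the cache line at `ptr` into L1 (no-op off x86_64).
#[inline]
fn prefetch_l1(ptr: *const f32) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // A prefetch is only a hint; it does not read or fault on the address.
        _mm_prefetch::<_MM_HINT_T0>(ptr.cast::<i8>());
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = ptr;
}

/// Score every candidate against `query`, prefetching one candidate ahead.
pub fn score_with_prefetch(candidates: &[Vec<f32>], query: &[f32]) -> Vec<f32> {
    let mut scores = Vec::with_capacity(candidates.len());
    for (i, candidate) in candidates.iter().enumerate() {
        // Start loading the next candidate while this one is being scored.
        if let Some(next) = candidates.get(i + 1) {
            prefetch_l1(next.as_ptr());
        }
        let dot: f32 = candidate.iter().zip(query).map(|(x, y)| x * y).sum();
        scores.push(dot);
    }
    scores
}
```

The prefetch changes only timing, never results, so the scores match a loop without it.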

4. Contiguous Memory Layout

pub struct ContiguousVectors {
    data: *mut f32,  // Single contiguous buffer
    dimension: usize,
    count: usize,
}

Why it matters:

  • Cache line alignment (64 bytes)
  • Sequential access pattern
  • Enables hardware prefetching
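
A safe approximation of this layout stores every vector back to back in one `Vec<f32>` and hands out slices by index. `FlatVectors` is a hypothetical type for illustration: unlike the raw-pointer struct above, a plain `Vec<f32>` does not guarantee 64-byte alignment, so this shows only the contiguity and indexing.

```rust
/// Hypothetical safe variant: all vectors live back to back in one buffer.
pub struct FlatVectors {
    data: Vec<f32>,
    dimension: usize,
}

impl FlatVectors {
    pub fn new(dimension: usize) -> Self {
        assert!(dimension > 0, "dimension must be non-zero");
        Self { data: Vec::new(), dimension }
    }

    /// Append one vector and return its index.
    pub fn push(&mut self, vector: &[f32]) -> usize {
        assert_eq!(vector.len(), self.dimension, "dimension mismatch");
        self.data.extend_from_slice(vector);
        self.len() - 1
    }

    /// Number of stored vectors.
    pub fn len(&self) -> usize {
        self.data.len() / self.dimension
    }

    /// Borrow vector `index` as a slice into the shared buffer.
    pub fn get(&self, index: usize) -> Option<&[f32]> {
        let start = index.checked_mul(self.dimension)?;
        self.data.get(start..start + self.dimension)
    }
}
```

Iterating indices in order then walks memory sequentially, which is the access pattern hardware prefetchers detect.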

AVX-512 Transition Cost (Intel Skylake+)

On Intel Skylake-X and later CPUs, AVX-512 instructions incur a significant warmup cost:

| Phase | Cycles | Time @ 4GHz |
|---|---|---|
| License transition | ~20,000 | ~5μs |
| Register file power-up | ~36,000 | ~9μs |
| Total warmup | ~56,000 | ~14μs |

Why This Matters

  1. First AVX-512 instruction triggers CPU frequency throttling (P-state transition)
  2. Subsequent instructions run at reduced frequency until warmup completes
  3. Short bursts of AVX-512 may be slower than AVX2 due to transition overhead
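
To put the warmup in perspective, the cycle counts can be converted to time and compared with a single distance call. The helpers below are hypothetical and just restate the arithmetic; the 40ns call latency used in the example is an assumed round figure near the 768D AVX-512 range quoted earlier.

```rust
/// Convert a cycle count to microseconds at a given clock frequency.
pub fn cycles_to_micros(cycles: u64, ghz: f64) -> f64 {
    cycles as f64 / (ghz * 1_000.0)
}

/// How many back-to-back calls of `call_ns` fit inside the warmup window.
pub fn calls_during_warmup(warmup_cycles: u64, ghz: f64, call_ns: f64) -> u64 {
    (cycles_to_micros(warmup_cycles, ghz) * 1_000.0 / call_ns) as u64
}
```

At 4GHz, 56,000 cycles is 14μs, which is room for roughly 350 calls of 40ns each. A burst shorter than that would run entirely inside the warmup window.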

VelesDB Mitigation

The adaptive dispatch system handles this automatically:

// 500 iterations per benchmark captures warmup cost
const BENCHMARK_ITERATIONS: usize = 500;

// Eager initialization at Database::open() avoids first-call latency
let info = simd_ops::init_dispatch();

Result: The dispatch table reflects real-world performance after warmup, ensuring AVX-512 is only selected when it provides a genuine advantage over AVX2.
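
The general idea of benchmark-driven selection can be sketched as follows: time each candidate kernel over a fixed number of iterations and keep the fastest. This is not the `init_dispatch` implementation; `Kernel`, `time_kernel`, `pick_fastest`, and the two stand-in kernels are hypothetical, and real candidates would be the AVX2 and AVX-512 code paths.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

pub type Kernel = fn(&[f32], &[f32]) -> f32;

/// Stand-in kernel 1: straightforward iterator dot product.
pub fn dot_simple(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stand-in kernel 2: two independent partial sums.
pub fn dot_two_sums(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let (mut s0, mut s1) = (0.0f32, 0.0f32);
    let mut i = 0;
    while i + 1 < n {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        i += 2;
    }
    if i < n {
        s0 += a[i] * b[i];
    }
    s0 + s1
}

/// Total wall-clock time for `iterations` calls, warmup included.
pub fn time_kernel(kernel: Kernel, a: &[f32], b: &[f32], iterations: usize) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(kernel(black_box(a), black_box(b)));
    }
    start.elapsed()
}

/// Benchmark every candidate at the given dimension and return the fastest.
pub fn pick_fastest(
    candidates: &[(&'static str, Kernel)],
    dim: usize,
    iterations: usize,
) -> (&'static str, Kernel) {
    let a = vec![1.0f32; dim];
    let b = vec![0.5f32; dim];
    *candidates
        .iter()
        .min_by_key(|(_, kernel)| time_kernel(*kernel, &a, &b, iterations))
        .expect("at least one candidate kernel")
}
```

Because the timed loop includes the first calls, any one-time transition cost counts against the kernel that incurs it.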

Recommendations

| Workload | Recommendation |
|---|---|
| Sustained vector ops (batch search) | AVX-512 beneficial |
| Sporadic single queries | AVX2 may be faster |
| Mixed workloads | Let adaptive dispatch decide |

To check which backend was selected:

velesdb simd info

Best Practices

1. Pre-normalize at Insertion

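// Normalize once at insertion (leave zero vectors unchanged to avoid dividing by zero)
let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
let normalized: Vec<f32> = if norm > 0.0 {
    vector.iter().map(|x| x / norm).collect()
} else {
    vector.to_vec()
};

// Fast cosine at search time (the query must be unit-length as well)
let similarity = cosine_similarity_normalized(&stored, &query);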

2. Batch Operations

// Single query, multiple candidates
let results = batch_cosine_normalized(&candidates, &query);

3. Use Appropriate Metric

| Use Case | Recommended Metric |
|---|---|
| Semantic search | Cosine (normalized) |
| Image embeddings | Euclidean |
| Recommendations | Dot Product |
| Binary features | Hamming |
| Set similarity | Jaccard |
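
For readers less familiar with the last two metrics, here is a minimal sketch of their definitions over bit-packed `u64` words. This representation is an assumption made for illustration: the benchmark notes above suggest VelesDB's native Hamming kernel operates in the floating-point domain, so `hamming_bits` and `jaccard_bits` are hypothetical helpers, not the library's functions.

```rust
/// Hamming distance between two bit-packed vectors: number of differing bits.
pub fn hamming_bits(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| over bit sets.
/// Two empty sets are treated as identical (similarity 1.0).
pub fn jaccard_bits(a: &[u64], b: &[u64]) -> f32 {
    let (mut both, mut either) = (0u32, 0u32);
    for (x, y) in a.iter().zip(b) {
        both += (x & y).count_ones();
        either += (x | y).count_ones();
    }
    if either == 0 {
        1.0
    } else {
        both as f32 / either as f32
    }
}
```

With `a = 0b1011` and `b = 0b0110`, three bits differ, one bit is shared, and four bits are set in either word, giving a Hamming distance of 3 and a Jaccard similarity of 0.25.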

Running Benchmarks

# All SIMD benchmarks
cargo bench --bench simd_benchmark

# Specific dimension
cargo bench --bench simd_benchmark -- "768"

# Compare implementations
cargo bench --bench simd_benchmark -- "explicit_simd|auto_vec"

Native SIMD API

use velesdb_core::simd_native;

// Direct native SIMD calls (no dispatch overhead)
let sim = simd_native::cosine_similarity_native(&a, &b);
let dist = simd_native::euclidean_native(&a, &b);
let dot = simd_native::dot_product_native(&a, &b);
let n = simd_native::norm_native(&v);
simd_native::normalize_inplace_native(&mut v);

// Batch operations with prefetching
let results = simd_native::batch_dot_product_native(&candidates, &query);

// Fast approximate (Newton-Raphson rsqrt)
let fast_sim = simd_native::cosine_similarity_fast(&a, &b);
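
The Newton-Raphson refinement mentioned in the last comment can be illustrated in scalar code. This sketch substitutes the well-known bit-level initial guess for the hardware reciprocal-square-root estimate a SIMD kernel would use, so it shows the refinement step only; `rsqrt_approx` and `cosine_approx` are hypothetical names and not the internals of `cosine_similarity_fast`.

```rust
/// Approximate 1/sqrt(x) for positive, finite x.
pub fn rsqrt_approx(x: f32) -> f32 {
    // Initial guess from the classic bit-level trick
    // (stand-in for a hardware rsqrt estimate).
    let mut y = f32::from_bits(0x5f37_59df_u32.wrapping_sub(x.to_bits() >> 1));
    // Two Newton-Raphson steps: y <- y * (1.5 - 0.5 * x * y^2).
    for _ in 0..2 {
        y *= 1.5 - 0.5 * x * y * y;
    }
    y
}

/// Cosine similarity in one pass, finishing with one approximate rsqrt.
pub fn cosine_approx(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    // dot / (|a| * |b|) == dot * rsqrt(|a|^2 * |b|^2)
    dot * rsqrt_approx(na * nb)
}
```

Each Newton-Raphson step roughly squares the relative error, which is why a coarse estimate plus one or two steps is usually enough for similarity ranking.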

Module Structure

| Module | Purpose | Use When |
|---|---|---|
| simd_native | Hand-tuned intrinsics (AVX2/AVX-512/NEON) | Maximum performance, native CPU |
| wide_simd | Portable SIMD (f32x8) | WASM, cross-platform |
| simd | Auto-vectorized fallback | Generic builds |

Future Optimizations

  1. ARM SVE - Scalable vectors for ARM servers
  2. WASM relaxed SIMD - Additional browser performance
  3. GPU offload - Optional CUDA/Metal for batch operations

License

VelesDB Core is licensed under VelesDB Core License 1.0.