SIMD Performance Guide
June 10, 2026 · View on GitHub
VelesDB uses native SIMD dispatch for ultra-fast vector operations, automatically selecting the optimal implementation based on CPU features and vector size.
Native SIMD Architecture (EPIC-052/077)
The simd_native module provides hand-tuned SIMD implementations using core::arch intrinsics:
┌─────────────────────────────────────────────────────────────────┐
│ simd_native::cosine_similarity_native() │
│ │
│ Runtime: feature detection → tiered dispatch → native SIMD │
│ - AVX-512: 4/2/1 accumulators based on size │
│ - AVX2: 4-acc (>1024), 2-acc (64-1023), 1-acc (<64) │
│ - ARM NEON: 128-bit SIMD │
│ - Scalar: fallback for small vectors │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ AVX-512 │ │ AVX2/FMA │ │ Scalar │
│ (512-bit) │ │ (256-bit) │ │ (native) │
└───────────┘ └───────────┘ └───────────┘
Architecture Support
| Platform | Implementation | Instructions | Performance (768D) |
|---|---|---|---|
| x86_64 AVX-512 | simd_native | 512-bit 2/4-acc | ~38-42ns |
| x86_64 AVX2 | simd_native | 256-bit 2/4-acc | ~40-82ns |
| aarch64 | simd_native | NEON 128-bit | ~60-100ns |
| WASM | scalar fallback (SIMD128 planned) | Native Rust | see Fallback |
| Fallback | Scalar | Native Rust | ~150-200ns |
Tiered Dispatch Strategy (EPIC-077)
Implementations adapt based on vector size and ISA to minimize register pressure and maximize throughput:
AVX-512 (cosine):
| Size Range | Accumulators | Stride | Use Case |
|---|---|---|---|
| >= 512 elements | 4-acc (12 zmm regs) | 64 | Large vectors (ada-002, text-embedding-3-large) |
| 16-511 elements | 2-acc (6 zmm regs) | 32 | Medium vectors (BERT, MiniLM) |
| < 16 elements | Scalar | 1 | Tiny vectors |
AVX2 (cosine):
| Size Range | Accumulators | Stride | Use Case |
|---|---|---|---|
| >= 512 elements | 4-acc (12 ymm regs) | 32 | Large vectors |
| 8-511 elements | 2-acc (6 ymm regs) | 16 | Medium/small vectors |
| < 8 elements | Scalar | 1 | Tiny vectors |
AVX2 (dot product, squared L2):
| Size Range | Accumulators | Stride | Use Case |
|---|---|---|---|
| >= 256 elements | 4-acc | 32 | Large vectors |
| 64-255 elements | 2-acc | 16 | Medium vectors |
| 8-63 elements | 1-acc | 8 | Small vectors |
| < 8 elements | Scalar | 1 | Tiny vectors |
All AVX2 cosine kernels use vectorized 8-wide remainder handling, reducing the scalar tail from up to 31 elements to at most 7. AVX-512 kernels use masked loads for zero-cost remainder.
Performance Benchmarks (March 27, 2026)
Distance Functions (768D vectors)
| Function | Latency | Throughput | vs Previous |
|---|---|---|---|
dot_product_native | 19.8ns | 38.8 Gelem/s | Baseline |
euclidean_native | 22.5ns | 34.1 Gelem/s | Improved |
cosine_similarity_native | 33.1ns | 23.2 Gelem/s | Optimized (4-acc, single-sqrt finish) |
cosine_normalized_native | 19.8ns | 38.8 Gelem/s | Same as dot |
hamming_distance_native | 35.8ns | 21.5M ops/s | FP-domain 4-acc (no cross-domain penalty) + NEON + batch |
jaccard_similarity_native | 35.1ns | 21.9 Gelem/s | Optimized (4-acc + NEON + batch) |
Measured March 27, 2026 on i9-14900KF (24C/32T, AVX2+FMA), 64GB DDR5, Rust 1.92.0, Windows 11 Pro, sequential run on idle machine.
Scaling by Dimension (simd_native)
| Dimension | Cosine | Dot Product | Model |
|---|---|---|---|
| 128 | 8.1ns | 5.4ns | MiniLM |
| 384 | 20.1ns | 12.0ns | all-MiniLM-L6-v2 |
| 768 | 33.1ns | 19.8ns | BERT, ada-002 |
| 1536 | 69.0ns | 43.8ns | text-embedding-3-small |
| 3072 | 112.2ns | 91.2ns | text-embedding-3-large |
Optimization Techniques
1. 32-Wide Unrolling (4x f32x8)
// 4 parallel accumulators for maximum ILP
let mut sum0 = f32x8::ZERO;
let mut sum1 = f32x8::ZERO;
let mut sum2 = f32x8::ZERO;
let mut sum3 = f32x8::ZERO;
for i in 0..simd_len {
let offset = i * 32;
sum0 = va0.mul_add(vb0, sum0);
sum1 = va1.mul_add(vb1, sum1);
sum2 = va2.mul_add(vb2, sum2);
sum3 = va3.mul_add(vb3, sum3);
}
Why it works:
- Modern CPUs have 4+ FMA units (Zen 3+, Alder Lake+)
- Out-of-order execution can run all 4 accumulators in parallel
- ~15-20% faster than single-accumulator SIMD
2. Pre-Normalized Vectors
For cosine similarity with pre-normalized vectors:
// Standard cosine: 3 passes (dot, norm_a, norm_b)
pub fn cosine_similarity_fast(a: &[f32], b: &[f32]) -> f32;
// Normalized: 1 pass (dot only) - 40% faster!
pub fn cosine_similarity_normalized(a: &[f32], b: &[f32]) -> f32;
Use when:
- Vectors are normalized at insertion time
- Same vector is compared multiple times
- Building custom distance functions
3. CPU Prefetch Hints
// Prefetch next vectors into L1 cache
#[cfg(target_arch = "x86_64")]
unsafe {
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
_mm_prefetch(next_vector.as_ptr().cast::<i8>(), _MM_HINT_T0);
}
Benefits:
- Hides memory latency during HNSW traversal
- ~10-20% improvement on large datasets
- Critical for cold cache scenarios
4. Contiguous Memory Layout
pub struct ContiguousVectors {
data: *mut f32, // Single contiguous buffer
dimension: usize,
count: usize,
}
Why it matters:
- Cache line alignment (64 bytes)
- Sequential access pattern
- Enables hardware prefetching
AVX-512 Transition Cost (Intel Skylake+)
On Intel Skylake-X and later CPUs, AVX-512 instructions incur a significant warmup cost:
| Phase | Cycles | Time @ 4GHz |
|---|---|---|
| License transition | ~20,000 | ~5μs |
| Register file power-up | ~36,000 | ~9μs |
| Total warmup | ~56,000 | ~14μs |
Why This Matters
- First AVX-512 instruction triggers CPU frequency throttling (P-state transition)
- Subsequent instructions run at reduced frequency until warmup completes
- Short bursts of AVX-512 may be slower than AVX2 due to transition overhead
VelesDB Mitigation
The adaptive dispatch system handles this automatically:
// 500 iterations per benchmark captures warmup cost
const BENCHMARK_ITERATIONS: usize = 500;
// Eager initialization at Database::open() avoids first-call latency
let info = simd_ops::init_dispatch();
Result: The dispatch table reflects real-world performance after warmup, ensuring AVX-512 is only selected when it provides a genuine advantage over AVX2.
Recommendations
| Workload | Recommendation |
|---|---|
| Sustained vector ops (batch search) | AVX-512 beneficial |
| Sporadic single queries | AVX2 may be faster |
| Mixed workloads | Let adaptive dispatch decide |
To check which backend was selected:
velesdb simd info
Best Practices
1. Pre-normalize at Insertion
// Normalize once at insertion
let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
let normalized: Vec<f32> = vector.iter().map(|x| x / norm).collect();
// Fast cosine at search time
let similarity = cosine_similarity_normalized(&stored, &query);
2. Batch Operations
// Single query, multiple candidates
let results = batch_cosine_normalized(&candidates, &query);
3. Use Appropriate Metric
| Use Case | Recommended Metric |
|---|---|
| Semantic search | Cosine (normalized) |
| Image embeddings | Euclidean |
| Recommendations | Dot Product |
| Binary features | Hamming |
| Set similarity | Jaccard |
Running Benchmarks
# All SIMD benchmarks
cargo bench --bench simd_benchmark
# Specific dimension
cargo bench --bench simd_benchmark -- "768"
# Compare implementations
cargo bench --bench simd_benchmark -- "explicit_simd|auto_vec"
Native SIMD API
use velesdb_core::simd_native;
// Direct native SIMD calls (no dispatch overhead)
let sim = simd_native::cosine_similarity_native(&a, &b);
let dist = simd_native::euclidean_native(&a, &b);
let dot = simd_native::dot_product_native(&a, &b);
let n = simd_native::norm_native(&v);
simd_native::normalize_inplace_native(&mut v);
// Batch operations with prefetching
let results = simd_native::batch_dot_product_native(&candidates, &query);
// Fast approximate (Newton-Raphson rsqrt)
let fast_sim = simd_native::cosine_similarity_fast(&a, &b);
Module Structure
| Module | Purpose | Use When |
|---|---|---|
simd_native | Hand-tuned intrinsics (AVX2/AVX-512/NEON) | Maximum performance, native CPU |
wide_simd | Portable SIMD (f32x8) | WASM, cross-platform |
simd | Auto-vectorized fallback | Generic builds |
Future Optimizations
- ARM SVE - Scalable vectors for ARM servers
- WASM SIMD relaxed - Additional browser performance
- GPU offload - Optional CUDA/Metal for batch operations
License
VelesDB Core is licensed under VelesDB Core License 1.0.