ReasonKit Performance Tuning Guide

January 3, 2026 ยท View on GitHub

Target: All core loops < 5ms (excluding LLM latency) Version: 1.0.0 Last Updated: 2026-01-01

This guide provides comprehensive performance optimization recommendations for ReasonKit deployments, from development laptops to production clusters.


Table of Contents

  1. Baseline Performance Expectations
  2. Configuration Options
  3. Memory Optimization
  4. Embedding Model Selection
  5. Batch Processing
  6. Profiling and Benchmarking
  7. Production Deployment
  8. Troubleshooting

1. Baseline Performance Expectations

1.1 Core Operation Targets

All non-LLM operations must complete within the specified targets:

OperationTargetExpected BaselineNotes
Retrieval Operations
Sparse search (BM25) @ 100 docs< 5ms~2.5msPure text matching
Hybrid search @ 100 docs< 10ms~8.0msSparse + dense fusion
Concurrent 8 queries< 15ms~12msParallel execution
Fusion Operations
RRF @ 100 results (2 methods)< 5ms~1.2msReciprocal Rank Fusion
Weighted sum @ 100 results< 5ms~1.5msLinear combination
Multi-method (3 methods)< 10ms~3.5msComplex fusion
Embedding Operations
Mock embedding (cache hit)< 0.1ms~0.02msIn-memory lookup
Batch embedding @ 100< 5ms~4.5msParallel processing
Cosine similarity @ 384-dim< 1ms~0.08msVector math
RAPTOR Operations
Tree creation @ 100 leaves< 50ms~8msHierarchical clustering
Tree traversal @ 1000 nodes< 5ms~2.5msDFS/BFS
Node lookup< 0.1ms~0.03msHashMap access
Ingestion Operations
Fixed chunking @ 10KB< 5ms~1.8msText splitting
Sentence chunking @ 10KB< 10ms~3.2msNLP-based splitting
Parallel @ 100 docs< 50ms~35msRayon parallelism
ThinkTools Operations
Protocol orchestration< 5ms~1-2msExcluding LLM time
Template rendering< 1ms~100-500usString substitution
Confidence extraction< 0.5ms~50-200usRegex parsing
Profile lookup< 0.1ms~0.03msRegistry access

1.2 LLM Latency Considerations

LLM calls are the dominant factor in end-to-end latency:

ProviderModelTypical LatencyNotes
OpenAIgpt-4o1-5sVaries by prompt length
Anthropicclaude-sonnet-41-4sConsistent performance
Anthropicclaude-opus-42-8sHigher quality, slower
LocalDeepSeek-V30.5-2sSelf-hosted, GPU required
LocalLlama 3.3 70B1-3sSelf-hosted, GPU required

Key Insight: ReasonKit's overhead is typically <1% of total execution time. Focus optimization efforts on reducing LLM calls, not on ReasonKit internals.

1.3 Variance Reduction Metrics

Structured prompting via ThinkTools reduces output variance:

MetricRaw PromptsStructured PromptsImprovement
Mean Agreement Rate (TARa)96.0%98.0%+2.0 pp
Inconsistency Rate4.0%2.0%-2.0 pp
Variance Reduction--5.0%

Results from Claude Sonnet 4 at temperature=0, N=5 runs per question.


2. Configuration Options

2.1 Core Configuration (config/default.toml)

[general]
data_dir = "./data"
log_level = "info"  # trace, debug, info, warn, error

[storage]
backend = "qdrant"  # "qdrant" or "local"

[storage.qdrant]
host = "localhost"
port = 6334
grpc_port = 6333
embedded = true           # Use embedded mode (no external server)
collection = "reasonkit_docs"
vector_size = 1024        # Must match embedding model dimensions
distance = "Cosine"       # Cosine, Euclid, or Dot
quantization = true       # 4x memory reduction, ~2% accuracy loss

[storage.local]
documents_path = "./data/documents"
index_path = "./data/indexes"

[embedding]
backend = "api"  # "api" or "local"

[embedding.api]
provider = "openai"
model = "text-embedding-3-small"
dimensions = 1536
batch_size = 100

[embedding.local]
model_path = "./models/bge-m3-onnx"
device = "cpu"  # or "cuda"

[indexing]
bm25_enabled = true
hnsw_m = 16              # Higher = more accuracy, more memory
hnsw_ef_construction = 200  # Higher = slower indexing, better quality

[processing]
chunk_size = 512         # tokens
chunk_overlap = 50       # tokens
min_chunk_size = 100     # Avoid tiny fragments

[retrieval]
top_k = 10
alpha = 0.7              # 0 = BM25 only, 1 = vector only
min_score = 0.0
rerank = false           # Enable for better quality, higher latency

[raptor]
enabled = false
max_depth = 3
cluster_size = 10
summarizer = "claude-sonnet-4-5"

[server]
host = "127.0.0.1"
port = 8080
cors = true
timeout = 30

2.2 Performance-Critical Settings

Storage Backend Selection

SettingPerformance ImpactWhen to Use
embedded = trueFastest startup, no network overheadDevelopment, single-node
embedded = falseSlightly higher latency, scalableProduction, multi-node
quantization = true4x memory reduction, ~2% accuracy lossMemory-constrained
quantization = falseFull precisionMaximum accuracy required

HNSW Parameters

[indexing]
hnsw_m = 16              # Connections per node (8-64)
hnsw_ef_construction = 200  # Build-time beam width (100-500)
hnsw_mMemory/VectorSearch SpeedAccuracy
8~256 bytesVery fast95%
16~512 bytesFast98%
32~1KBMedium99%
64~2KBSlow99.5%

Recommendation: Start with hnsw_m = 16 for most use cases.

Retrieval Alpha Tuning

The alpha parameter controls the balance between sparse (BM25) and dense (vector) search:

alpha = 0.0  -> 100% BM25 (keyword matching)
alpha = 0.5  -> 50/50 hybrid
alpha = 0.7  -> 70% vector, 30% BM25 (recommended default)
alpha = 1.0  -> 100% vector (semantic only)
Use CaseRecommended AlphaRationale
Technical documentation0.5Keywords matter
General knowledge0.7Semantic similarity important
Code search0.3Exact matches critical
Conversational search0.8Semantic understanding key

2.3 Environment Variables

# Core settings
export REASONKIT_DATA_DIR="/path/to/data"
export REASONKIT_LOG_LEVEL="info"

# API keys (never commit these!)
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export COHERE_API_KEY="..."
export VOYAGE_API_KEY="..."

# Qdrant connection (if not embedded)
export QDRANT_HOST="localhost"
export QDRANT_PORT="6334"
export QDRANT_API_KEY="..."

# Performance tuning
export RAYON_NUM_THREADS="8"      # Parallel processing threads
export TOKIO_WORKER_THREADS="4"   # Async runtime threads

3. Memory Optimization

3.1 Memory Usage by Component

ComponentMemory Per UnitScaling Factor
Raw document~1x document sizeLinear
Chunks~1.1x raw (overlap)Linear
BM25 index~0.2x corpus sizeLinear
Dense embeddings (f32)4 bytes _ dim _ chunksLinear
Dense embeddings (quantized)1 byte _ dim _ chunksLinear
HNSW graph~512 bytes * chunks (m=16)Linear
RAPTOR tree~2x leaf nodesLogarithmic

3.2 Memory Reduction Strategies

1. Enable Quantization (4x reduction)

[storage.qdrant]
quantization = true  # Scalar quantization to uint8

Impact: ~2% accuracy reduction for 4x memory savings.

2. Reduce Embedding Dimensions

[embedding.api]
model = "text-embedding-3-small"
dimensions = 512  # Down from 1536
DimensionsMemory/ChunkAccuracy (relative)
15366KB100%
10244KB98%
5122KB95%
2561KB90%

3. Increase Chunk Size

[processing]
chunk_size = 1024  # Up from 512
chunk_overlap = 100

Doubles content per chunk, halves total chunks.

4. Disable RAPTOR (if not needed)

[raptor]
enabled = false

RAPTOR trees add ~2x overhead for hierarchical summaries.

3.3 Memory Monitoring

# Check process memory
ps aux | grep rk

# Detailed memory breakdown
cat /proc/$(pgrep rk)/status | grep -E "VmRSS|VmSize"

# Qdrant collection stats
curl localhost:6333/collections/reasonkit_docs | jq '.result.points_count, .result.vectors_count'
Corpus SizeChunks (512 tok)Memory (quantized)Memory (full)
1,000 docs~10,000~100MB~400MB
10,000 docs~100,000~1GB~4GB
100,000 docs~1,000,000~10GB~40GB
1,000,000 docs~10,000,000~100GB~400GB

4. Embedding Model Selection

4.1 Supported Models

Cloud Providers (via API)

ProviderModelDimensionsCost/1M tokensQuality
OpenAItext-embedding-3-large3072$0.13Excellent
OpenAItext-embedding-3-small1536$0.02Good
Cohereembed-english-v3.01024$0.10Excellent
Voyagevoyage-21024$0.10Excellent
Anthropic(via Voyage)1024$0.10Excellent

Local Models (via ONNX)

ModelDimensionsMemorySpeed (CPU)Speed (GPU)
BGE-M31024~2GB50ms/text5ms/text
BGE-Large-EN1024~1.5GB40ms/text4ms/text
all-MiniLM-L6384~100MB10ms/text1ms/text
E5-Large-v21024~1.5GB40ms/text4ms/text

4.2 Model Selection Guide

High accuracy, cost acceptable:
  -> OpenAI text-embedding-3-large (3072-dim)
  -> Cohere embed-english-v3.0 (1024-dim)

Good accuracy, cost-sensitive:
  -> OpenAI text-embedding-3-small (1536-dim)
  -> Local BGE-M3 with GPU

Low latency, self-hosted:
  -> Local all-MiniLM-L6 (384-dim, CPU-friendly)
  -> Local E5-Large-v2 with GPU

Multilingual:
  -> BGE-M3 (best multilingual support)
  -> Cohere embed-multilingual-v3.0

4.3 Local Embedding Setup

# Install with local-embeddings feature
cargo build --release --features local-embeddings

# Download BGE-M3 ONNX model
mkdir -p models
curl -L https://huggingface.co/BAAI/bge-m3/resolve/main/onnx/model.onnx \
  -o models/bge-m3-onnx/model.onnx
curl -L https://huggingface.co/BAAI/bge-m3/resolve/main/tokenizer.json \
  -o models/bge-m3-onnx/tokenizer.json

Configure in config/default.toml:

[embedding]
backend = "local"

[embedding.local]
model_path = "./models/bge-m3-onnx"
device = "cuda"  # or "cpu"

4.4 Embedding Cache

ReasonKit automatically caches embeddings to avoid recomputation:

Cache location: $REASONKIT_DATA_DIR/embeddings_cache/
Cache key: SHA-256(model_id + text)
Cache format: Binary (4 bytes per dimension)

Cache performance:

OperationLatency
Cache hit< 0.1ms
Cache miss (API)100-500ms
Cache miss (local)5-50ms

5. Batch Processing

5.1 Ingestion Optimization

Single Document Ingestion

rk ingest /path/to/document.md
# Process directory recursively
rk ingest /path/to/documents/ --recursive

# With parallel processing
rk ingest /path/to/documents/ --parallel 8

# Stream from stdin (for pipelines)
find /path -name "*.md" | rk ingest --stdin

5.2 Parallel Processing Configuration

# Set thread count via environment
export RAYON_NUM_THREADS=8

# Or via CLI
rk ingest /path --threads 8

Optimal thread count:

CPU CoresRecommended ThreadsNotes
22Limited parallelism
44Good for development
86-8Leave 2 for system
16+12-14Diminishing returns

5.3 Batch Embedding Strategies

Sequential (Default)

[embedding.api]
batch_size = 1  # One at a time
[embedding.api]
batch_size = 100  # Batch requests

Pipeline (Advanced)

// In custom code
let embedder = BatchEmbedder::new(config)
    .with_batch_size(100)
    .with_concurrent_batches(4);

embedder.embed_streaming(chunks_iter).await?;

5.4 Ingestion Performance Expectations

Corpus SizeDocumentsChunksIngestion Time
Small1001,000~30s
Medium1,00010,000~5min
Large10,000100,000~1hr
Enterprise100,0001,000,000~10hr

Times assume API embeddings with batch_size=100. Local embeddings with GPU are ~5x faster.


6. Profiling and Benchmarking

6.1 Running Benchmarks

Quick Benchmark

# Run all benchmarks
cargo bench

# Run specific benchmark suite
cargo bench --bench retrieval_bench
cargo bench --bench fusion_bench
cargo bench --bench embedding_bench
cargo bench --bench thinktool_bench

With Baseline Comparison

# Save current performance as baseline
cargo bench -- --save-baseline main

# Compare against baseline after changes
cargo bench -- --baseline main

Using the Helper Script

./scripts/run_benchmarks.sh            # Run all
./scripts/run_benchmarks.sh fusion     # Run specific
./scripts/run_benchmarks.sh --baseline # Save baseline
./scripts/run_benchmarks.sh --compare  # Check regressions

6.2 Benchmark Suites

SuiteFileWhat It Tests
Retrievalbenches/retrieval_bench.rsBM25, hybrid search, scaling
Fusionbenches/fusion_bench.rsRRF, weighted sum, multi-method
Embeddingbenches/embedding_bench.rsDense/sparse, batch, cache
RAPTORbenches/raptor_bench.rsTree operations, traversal
Ingestionbenches/ingestion_bench.rsChunking, cleaning, parallel
ThinkToolsbenches/thinktool_bench.rsProtocol execution, profiles

6.3 Reading Benchmark Results

retrieval_bench/sparse_search/query_0
                        time:   [2.3821 ms 2.4234 ms 2.4697 ms]
                        change: [-1.8234% +0.2543% +2.3109%] (p = 0.78 > 0.05)
                        No change in performance detected.
  • time: [lower bound, estimate, upper bound] at 95% confidence
  • change: Comparison to previous run
  • p-value: < 0.05 indicates statistically significant change

6.4 HTML Reports

After running benchmarks, view detailed reports:

open target/criterion/report/index.html

Reports include:

  • Performance trends over time
  • Violin plots showing distribution
  • Regression analysis
  • Historical comparisons

6.5 Profiling Tools

# Install flamegraph
cargo install flamegraph

# Generate flamegraph
cargo flamegraph --bin rk -- query "test query"

Perf (Linux)

# Record performance data
perf record -g cargo run --release -- query "test"

# Generate report
perf report

Instruments (macOS)

# Time Profiler
xcrun xctrace record --template 'Time Profiler' --launch rk query "test"

6.6 Variance Benchmarking

Test LLM output consistency with structured prompting:

cd benchmarks
python variance_benchmark.py --runs 20 --model claude-sonnet-4-20250514

# Quick test
python variance_benchmark.py --quick

# Output to specific directory
python variance_benchmark.py -o results/

7. Production Deployment

Minimum (Development/Testing)

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 50GB SSD
  • Corpus: Up to 10,000 documents

Standard (Production)

  • CPU: 8-16 cores
  • RAM: 32GB
  • Storage: 200GB NVMe
  • Corpus: Up to 100,000 documents

Enterprise (Large Scale)

  • CPU: 32+ cores
  • RAM: 128GB+
  • Storage: 1TB+ NVMe
  • GPU: NVIDIA A100 (for local embeddings)
  • Corpus: 1M+ documents

7.2 Qdrant Cluster Setup

For high availability and scale:

# docker-compose.yml
version: "3.8"
services:
  qdrant-1:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant_data_1:/qdrant/storage
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    ports:
      - "6333:6333"
      - "6334:6334"

  qdrant-2:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant_data_2:/qdrant/storage
    environment:
      - QDRANT__CLUSTER__ENABLED=true
      - QDRANT__CLUSTER__P2P__PORT=6335

volumes:
  qdrant_data_1:
  qdrant_data_2:

7.3 Connection Pooling

ReasonKit uses HTTP connection pooling for optimal performance:

// Already configured in llm.rs
reqwest::Client::builder()
    .timeout(Duration::from_secs(120))
    .pool_max_idle_per_host(10)
    .pool_idle_timeout(Duration::from_secs(90))
    .tcp_keepalive(Duration::from_secs(60))
    .build()

Impact: Eliminates TLS handshake overhead (100-500ms per call).

7.4 Monitoring

Metrics to Track

MetricTargetAlert Threshold
Query latency p50< 100ms> 500ms
Query latency p99< 500ms> 2s
Ingestion rate> 10 docs/sec< 1 doc/sec
Memory usage< 80%> 90%
Error rate< 0.1%> 1%
Cache hit rate> 80%< 50%

Prometheus Metrics

# Enable metrics endpoint
rk serve --metrics-port 9090

Grafana Dashboard

Import the provided dashboard:

grafana/dashboards/reasonkit-overview.json

8. Troubleshooting

8.1 Common Performance Issues

Issue: Slow Query Performance

Symptoms: Queries take > 1s (excluding LLM time)

Diagnosis:

# Check collection stats
curl localhost:6333/collections/reasonkit_docs | jq

# Check index status
rk stats --verbose

Solutions:

  1. Ensure HNSW index is built: rk index rebuild
  2. Enable quantization for memory pressure
  3. Reduce top_k if returning too many results
  4. Check BM25 index: rk index check-bm25

Issue: High Memory Usage

Symptoms: OOM errors or swap thrashing

Diagnosis:

ps aux | grep rk
cat /proc/$(pgrep rk)/status | grep VmRSS

Solutions:

  1. Enable quantization (quantization = true)
  2. Reduce embedding dimensions
  3. Increase chunk size
  4. Use external Qdrant instead of embedded

Issue: Slow Ingestion

Symptoms: Documents take > 10s each to ingest

Diagnosis:

# Check embedding latency
time rk embed "test text"

# Check chunking speed
time rk chunk /path/to/doc.md

Solutions:

  1. Increase batch_size for API embeddings
  2. Use local embeddings with GPU
  3. Increase parallel threads
  4. Check network latency to embedding API

Issue: Inconsistent LLM Outputs

Symptoms: Same query produces different results

Diagnosis:

# Run variance benchmark
python benchmarks/variance_benchmark.py --quick

Solutions:

  1. Use structured prompting (ThinkTools profiles)
  2. Set temperature = 0 for deterministic output
  3. Enable self-consistency voting
  4. Use the --paranoid profile for critical queries

8.2 Performance Regression Checklist

Before every release:

  • All benchmarks compile without errors
  • No regressions > 5% vs baseline
  • Core loops remain < 5ms
  • Scalability confirmed (linear growth)
  • No memory leaks in batch operations
  • Cache hit rates stable
# Full quality check
just qa

# Benchmark comparison
cargo bench -- --baseline main
./scripts/run_benchmarks.sh --compare

8.3 Debug Logging

Enable detailed performance logging:

RUST_LOG=reasonkit=debug rk query "test"

# Even more detail
RUST_LOG=reasonkit=trace rk query "test"

8.4 Getting Help

  1. Check the GitHub Issues
  2. Review TROUBLESHOOTING.md
  3. Ask in GitHub Discussions
  4. Email: support@reasonkit.sh

Appendix A: Benchmark File Reference

FilePurposeKey Tests
benches/retrieval_bench.rsSearch performanceBM25, hybrid, scaling, concurrent
benches/fusion_bench.rsResult fusionRRF, weighted sum, multi-method
benches/embedding_bench.rsEmbedding opsDense, sparse, batch, cache
benches/raptor_bench.rsRAPTOR treesCreate, traverse, lookup
benches/ingestion_bench.rsDocument processingChunk, clean, parallel
benches/thinktool_bench.rsThinkTools protocolsExecute, profile, concurrent
benchmarks/variance_benchmark.pyLLM consistencyTARa, structured vs raw

Appendix B: Performance Optimization Checklist

Before Deployment

  • Configured appropriate embedding model
  • Set optimal chunk size for content type
  • Enabled quantization if memory-constrained
  • Configured connection pooling
  • Set appropriate HNSW parameters
  • Tested with representative workload

Ongoing Monitoring

  • Query latency within targets
  • Memory usage stable
  • Cache hit rates acceptable
  • Error rates minimal
  • Ingestion throughput sufficient

Periodic Maintenance

  • Run full benchmark suite weekly
  • Compare against baseline monthly
  • Review and compact indexes quarterly
  • Update embedding models as needed

References


"All core loops < 5ms" - This is our contract with users.

ReasonKit Performance Engineering Version 1.0.0 | https://reasonkit.sh