Performance tuning

June 7, 2026

This guide covers fitting Memtrace to your machine. The defaults auto-tune from detected RAM, CPU, and accelerator signals, so most users never touch these knobs. This doc is for when the defaults aren't right for you.

Glossary (read this first if the rest looks like jargon)

A few terms come up repeatedly in this doc. Each is plain enough once you've seen it once.

  • ONNX runtime — the local inference engine that runs the embedding and rerank models. No network calls, no GPU required.
  • Intra-op threads — how many CPU threads the ONNX runtime uses for one operation (a matrix multiply, etc.). More threads = faster but more memory per op. The single biggest lever for memory on tight machines.
  • Batch size — how many symbols Memtrace embeds in one go. Memory scales linearly with this.
  • RSS — Resident Set Size, the actual amount of physical RAM the process is holding. Shown by Activity Monitor on macOS, htop on Linux. The "RSS guardrail" is a soft ceiling that triggers back-pressure during indexing.
  • CoreML / ANE — Apple's on-chip accelerator for ML inference. Memtrace uses it on Apple Silicon by default; turning it off (MEMTRACE_DISABLE_COREML=1) forces CPU-only inference, which is slower but avoids first-run graph-compile delays.

How auto-tuning works

On startup, Memtrace classifies your host into one of three tiers based on a 0–11 score combining:

  • RAM — 0 pts (<12 GB) → 5 pts (≥48 GB)
  • CPU — 0 pts (≤3 cores) → 4 pts. Apple Silicon variants get a fixed table: M-base = 1, M-Pro = 2, M-Max = 3, M-Ultra = 4.
  • Accelerator — discrete GPU = 2 pts, Apple Neural Engine = 1 pt

The score maps to a tier as follows:

   score   tier         picks            embed quant
   ─────   ──────────   ──────────────   ──────────────
   0–2     light        small batches    int8
   3–6     standard     medium batches   int8
   7+      heavy        large batches    fp32
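
The score-to-tier mapping above can be sketched in a few lines of Python. This is an illustrative sketch, not Memtrace's implementation: the function and dictionary names are hypothetical, and because the intermediate RAM and CPU thresholds are not listed here, the component points are taken as inputs rather than derived from hardware.

```python
# Points for Apple Silicon variants, as listed in the CPU bullet above.
APPLE_SILICON_CPU_POINTS = {"base": 1, "pro": 2, "max": 3, "ultra": 4}

# Accelerator points from the list above; "none" is an assumed label
# for hosts without an accelerator.
ACCELERATOR_POINTS = {"none": 0, "ane": 1, "discrete_gpu": 2}


def host_score(ram_points: int, cpu_points: int, accelerator: str) -> int:
    """Sum the three component scores (RAM 0-5, CPU 0-4, accelerator 0-2)."""
    if not 0 <= ram_points <= 5:
        raise ValueError("ram_points must be in 0..5")
    if not 0 <= cpu_points <= 4:
        raise ValueError("cpu_points must be in 0..4")
    return ram_points + cpu_points + ACCELERATOR_POINTS[accelerator]


def tier_for_score(score: int) -> str:
    """Map a 0-11 host score to a tier name using the table above."""
    if not 0 <= score <= 11:
        raise ValueError(f"score must be in 0..11, got {score}")
    if score <= 2:
        return "light"
    if score <= 6:
        return "standard"
    return "heavy"


if __name__ == "__main__":
    # Hypothetical host: 2 RAM points (assumed), M-Pro CPU, Neural Engine.
    score = host_score(2, APPLE_SILICON_CPU_POINTS["pro"], "ane")
    print(score, tier_for_score(score))
```

The maximum of 11 comes from 5 (RAM) + 4 (CPU) + 2 (accelerator), which is why the accelerator contributes a single value rather than a sum of GPU and ANE points in this sketch.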

The actual tuning happens via three RuntimeProfile accessors:

   Setting                  Light   Standard   Heavy
   ──────────────────────   ─────   ────────   ─────
   embed_intra_op_threads   1       2          4
   embed_batch_size         8       16         64
   embed_rss_limit_gb       3–4     6          10–20

Override the tier with MEMTRACE_TIER=light|standard|heavy, or set the individual env vars listed in environment-variables.md.
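
One way to picture how a forced tier and individual overrides combine is the sketch below. The precedence shown (individual variable beats tier default, MEMTRACE_TIER beats the detected tier) is an assumption, and `resolve_profile`, `TIER_DEFAULTS`, and `ENV_OVERRIDES` are hypothetical names rather than Memtrace's API. The RSS limit is left out because its default depends on host RAM within each tier.

```python
import os

# Per-tier defaults for the two settings that have a single value per tier.
TIER_DEFAULTS = {
    "light": {"embed_intra_op_threads": 1, "embed_batch_size": 8},
    "standard": {"embed_intra_op_threads": 2, "embed_batch_size": 16},
    "heavy": {"embed_intra_op_threads": 4, "embed_batch_size": 64},
}

# Environment variables used elsewhere in this doc for the same settings.
ENV_OVERRIDES = {
    "embed_intra_op_threads": "MEMTRACE_EMBED_INTRA_OP_THREADS",
    "embed_batch_size": "MEMTRACE_EMBED_BATCH_SIZE",
}


def resolve_profile(detected_tier, env=None):
    """Return the effective settings for a host.

    Assumed precedence: individual env var > MEMTRACE_TIER > detected tier.
    """
    env = os.environ if env is None else env
    tier = env.get("MEMTRACE_TIER", detected_tier).lower()
    if tier not in TIER_DEFAULTS:
        raise ValueError(f"unknown tier: {tier!r}")
    profile = dict(TIER_DEFAULTS[tier], tier=tier)
    for setting, var in ENV_OVERRIDES.items():
        raw = env.get(var)
        if raw is not None:
            profile[setting] = int(raw)
    return profile


if __name__ == "__main__":
    # A Standard host forced to Light, with an even smaller batch.
    print(resolve_profile("standard", {"MEMTRACE_TIER": "light",
                                       "MEMTRACE_EMBED_BATCH_SIZE": "4"}))
```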

Common scenarios

v0.4.60 memory-footprint baseline

The defaults moved in v0.4.60. The table shows per-workload results on a 1000-query mempalace bench (M3 Max, 14 cores); each figure is the median of 3 runs:

   Workload                              v0.4.50           v0.4.60     Δ
   ───────────────────────────────────   ───────────────   ─────────   ─────────────
   Cold reindex peak RSS                 538 MB            457 MB      −15.2%
   Cold reindex variance (spread of 3)   145 MB            4 MB        −97%
   Concurrent rerank+embed peak RSS      1514 MB           1289 MB     −14.9%
   Concurrent throughput                 11.04 qps         11.00 qps   flat
   1k find_symbol p50 / acc@1            0.24 ms / 96.6%   identical   bit-identical
   Binary size (release)                 144 MB            85 MB       −41%

The under-reported headline is the variance collapse. Before v0.4.60, the same workload on the same host swung 29% of median between runs; now it stays within 1% of median. Container memory limits can therefore be sized from a single number: pick ~600 MB for single-repo or ~2 GB for concurrent-rerank, with no safety-margin guesswork.

What changed (most operators need to set nothing; the defaults moved):

  • mimalloc 3 is now the default allocator on every target except musl and Windows MSVC (which keep the system allocator unchanged). --no-default-features falls back to jemalloc if you need the rebuild path.
  • Single in-RAM hot cache (MEMTRACE_UNIFIED_CACHE_MB, default 256 MB) replaces several per-subsystem caches that previously grew independently. W-TinyLFU eviction. Raise to 512 / 1024 on RAM-rich hosts; 0 disables it.
  • ORT memory-arena toggle (MEMTRACE_ORT_LOW_RSS=1) on every ONNX session site (rerank, sparse encoder, embedder). Measured on our shipping model sizes: 2.7% lower RSS at the cost of 19% lower throughput, so it is off by default. Turn it on if you're running much smaller custom models where the arena dwarfs the model.
  • Other internals. Shared tree-sitter parser pool, string interning on dup-heavy node fields, inline-small-string for short identifiers, bitmap-backed adjacency primitive. None of this changes the user surface; the on-disk format and the MCP JSON wire format are byte-identical to v0.4.50.

Full numbers + cross-platform compile gates: release notes.

"16 GB M1/M2/M3 Pro — Memtrace is eating my RAM"

This was the v0.3.30-and-earlier failure mode: 27+ GB resident during indexing. The fix shipped in v0.3.31:

  • ORT intra-op threads capped to 2 (was: num_cpus = 10)
  • Embed batch capped to 16 (was: 128)
  • RSS guardrail at 6 GB triggers back-pressure

If you're on v0.3.31+ and still seeing high RSS:

# Force the tier even tighter
export MEMTRACE_TIER=light

# Lower batch further (extreme)
export MEMTRACE_EMBED_BATCH_SIZE=4
export MEMTRACE_EMBED_INTRA_OP_THREADS=1

# Tighter RSS ceiling
export MEMTRACE_EMBED_RSS_LIMIT_GB=4

# Drop the rerank model to save ~250 MB resident
export MEMTRACE_RERANK=off

memtrace stop && memtrace start

Watch for the log line:

embed: RSS sample batch_idx=32 rss_mb=<N> limit_mb=<L>

If rss_mb stays under your limit_mb, you're in good shape.

"8 GB laptop / Raspberry Pi — even bge-small is too big"

Drop to the smallest sensible config:

export MEMTRACE_TIER=light
export MEMTRACE_EMBED_MODEL=bge-small         # 384d, ~140 MB resident
export MEMTRACE_VECTOR_DIMS=384               # MUST match the model
export MEMTRACE_EMBED_QUANT=int8
export MEMTRACE_EMBED_BATCH_SIZE=4
export MEMTRACE_EMBED_INTRA_OP_THREADS=1
export MEMTRACE_RERANK=off
export MEMTRACE_DISABLE_COREML=1              # Apple Silicon only: forces CPU-only inference
memtrace start

You'll lose ~6 pts of acc@1 vs the default jina model on agent-style queries. The tradeoff is intentional: this configuration fits on 4 GB hosts.

"Workstation with 64 GB + a discrete GPU — go fast"

The defaults already pick Heavy tier on this host. If you want even more throughput:

export MEMTRACE_TIER=heavy
export MEMTRACE_EMBED_BATCH_SIZE=128          # bump from 64 default
export MEMTRACE_EMBED_INTRA_OP_THREADS=8      # bump from 4 default
export MEMTRACE_EMBED_QUANT=fp32              # already default on Heavy
memtrace start

For multi-tenant (orchestrator) deployments, bind streamable-HTTP and let many agents share one workspace runtime — see mcp-and-transports.md.

"Indexing Django takes 14s — I want it faster"

Most of indexing time on big repos is the embedding pass, not the parser. To skip embedding entirely (you lose find_code semantic search but keep BM25, structural search, and time travel):

export MEMTRACE_EMBED_MIN_LINES=1000          # de facto skip everything
memtrace index <path>

To use a smaller, faster embedding model:

export MEMTRACE_EMBED_MODEL=bge-small
export MEMTRACE_VECTOR_DIMS=384
memtrace reset && memtrace index <path>

bge-small is ~3× faster to embed than jina-code at the cost of ~6 pts retrieval accuracy.

"Rerank takes 400ms per query — I want faster queries"

export MEMTRACE_RERANK=off
memtrace stop && memtrace start

You'll get ~50–150 ms p50 instead of ~450–870 ms, with ~3–4 pp lower acc@1 on agent-style queries. Worth it for auto-completion-style use cases; not worth it when correctness matters.

"Re-indexing the same repo over and over (in CI)"

Take advantage of the embed cache — ~/.memtrace/embed-cache/ is keyed by symbol AST hash, so unchanged symbols are cache hits even on a fresh .memdb/.

If you nuke .memdb/ in CI but keep ~/.memtrace/embed-cache/ mounted as a volume, the embed pass becomes nearly free. The first CI run is slow; everything after is fast.
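
The reason a fresh .memdb/ still gets cache hits is that the cache key depends only on the symbol's content, not on the database. The sketch below illustrates that idea under stated assumptions: it uses a SHA-256 of the symbol's source text as a stand-in for the AST hash, stores vectors as JSON files, and the `EmbedCache` class is hypothetical. Memtrace's real on-disk layout may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class EmbedCache:
    """Content-addressed embedding cache (illustrative, not Memtrace's format)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(symbol_source: str) -> str:
        # Stand-in for the AST hash: SHA-256 of the symbol's source text.
        return hashlib.sha256(symbol_source.encode("utf-8")).hexdigest()

    def get_or_embed(self, symbol_source, embed_fn):
        """Return the cached vector, or compute and store it on a miss."""
        path = self.root / f"{self.key(symbol_source)}.json"
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text())
        self.misses += 1
        vector = embed_fn(symbol_source)
        path.write_text(json.dumps(vector))
        return vector


if __name__ == "__main__":
    def fake_embed(text):
        # Placeholder for the expensive model call.
        return [float(len(text))]

    symbols = ["def a(): pass", "def b(): return 1"]
    with tempfile.TemporaryDirectory() as cache_dir:
        first = EmbedCache(cache_dir)   # first CI run: cold cache
        for s in symbols:
            first.get_or_embed(s, fake_embed)
        second = EmbedCache(cache_dir)  # later run: same volume, fresh index
        for s in symbols:
            second.get_or_embed(s, fake_embed)
        print(first.misses, second.hits)
```

Only symbols whose content changed between runs produce a different key, so only those pay the embedding cost again.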

What "auto-tuned for your host" actually decides

When you run memtrace start, the banner shows:

  ◆  Host profile: Apple M3 Pro · 12 (6P+6E) · 18 GB · score=5 · tier=standard · embed=int8

Decoded:

  • M3 Pro CPU
  • 12 cores total, 6 performance + 6 efficiency
  • 18 GB RAM
  • score 5 → Standard tier
  • int8 embedding quantisation

If the auto-pick is wrong for your situation, override it with MEMTRACE_TIER=light|standard|heavy.
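
If you want to capture the host profile programmatically (for example, to attach it to a bug report), the banner line can be pulled apart with a regular expression. This is a sketch that assumes the banner always follows the format of the example above; the field names in the returned dictionary are hypothetical, and the core-layout group is treated as optional on the assumption that hosts without performance/efficiency cores omit it.

```python
import re

BANNER_RE = re.compile(
    r"Host profile:\s*(?P<cpu>.+?)\s*·\s*(?P<cores>\d+)"
    r"(?:\s*\((?P<layout>[^)]*)\))?"          # e.g. "(6P+6E)", assumed optional
    r"\s*·\s*(?P<ram_gb>\d+)\s*GB"
    r"\s*·\s*score=(?P<score>\d+)"
    r"\s*·\s*tier=(?P<tier>\w+)"
    r"\s*·\s*embed=(?P<quant>\w+)"
)


def parse_host_profile(line):
    """Return the banner fields as a dict, or None if the line doesn't match."""
    match = BANNER_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in ("cores", "ram_gb", "score"):
        fields[name] = int(fields[name])
    return fields


if __name__ == "__main__":
    banner = ("  ◆  Host profile: Apple M3 Pro · 12 (6P+6E) · 18 GB"
              " · score=5 · tier=standard · embed=int8")
    print(parse_host_profile(banner))
```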

Specific knobs and their cost / benefit

MEMTRACE_EMBED_BATCH_SIZE

   Value   RAM impact                 Throughput       Best for
   ─────   ────────────────────────   ──────────────   ───────────────────
   4       ~30% lower than default    ~30% slower      RPi / very tight RAM
   8       tier Light default         baseline         M1/M2 8 GB
   16      tier Standard default      baseline         M1/M2/M3 16 GB
   32      ~30% higher than default   ~10–15% faster   32 GB workstation
   64      tier Heavy default         best             32+ GB workstation
   128     extreme                    marginal gain    64+ GB GPU box

Memory is O(batch × seq_len × hidden × per_thread_scratch). The sweet spot is "as large as fits without swapping".
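
To get a feel for the scaling term, the following sketch computes the size of a single activation tensor and the relative growth when batch size and thread count change. The numbers are illustrative assumptions: `seq_len=512` is an assumed sequence length, `hidden=384` is borrowed from bge-small's 384 dimensions, and fp32 is taken as 4 bytes per element. It models only the batch-dependent term, not fixed costs such as model weights.

```python
def activation_bytes(batch, seq_len, hidden, bytes_per_elem=4):
    """Size of one (batch x seq_len x hidden) tensor in bytes."""
    return batch * seq_len * hidden * bytes_per_elem


def relative_scratch(batch, threads, base_batch=16, base_threads=2):
    """Growth of the batch x per-thread-scratch term relative to a baseline.

    The baseline defaults are the Standard tier values from this doc.
    """
    return (batch * threads) / (base_batch * base_threads)


if __name__ == "__main__":
    one_tensor = activation_bytes(batch=16, seq_len=512, hidden=384)
    print(one_tensor / 2**20)                    # 12.0 (MiB)
    print(relative_scratch(batch=64, threads=4))  # 8.0
```

Moving from the Standard defaults (batch 16, 2 threads) to the Heavy defaults (batch 64, 4 threads) multiplies this term by 8, which is why both knobs are capped on smaller hosts.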

MEMTRACE_EMBED_INTRA_OP_THREADS

ORT spawns intra-op threads for parallel MatMul. Each thread holds its own scratch buffers, so doubling threads ~doubles the per-op RAM. Default is 2 on Standard hosts. Going above 4 on a non-GPU host rarely helps (memory-bandwidth-bound, not CPU-bound).

MEMTRACE_RERANK

   Setting        acc@1 (Django, agent queries)   p50 latency
   ────────────   ─────────────────────────────   ───────────
   off            ~70%                            ~50 ms
   on (default)   ~74%                            ~450 ms

The reranker holds an extra ~75 MB resident (int8 model). For most agent workflows the +4 pp accuracy is worth it; for sub-100ms auto-completion paths, turn it off.

MEMTRACE_EMBED_RSS_LIMIT_GB

A soft ceiling. When the daemon's RSS crosses it during indexing, the embed loop yields for 50 ms and logs a warning. This lets the MemDB writer drain in-flight batches and gives the allocator a chance to return pages.

Don't set this so low that it fires constantly — that just throttles indexing. The default scales with host RAM and is generally right.
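
The back-pressure behaviour described above can be sketched as a loop that samples RSS before each batch and pauses when it is over the limit. This is a simplified model, not the daemon's code: it assumes one 50 ms yield per over-limit sample, the warning text is invented for illustration, and `read_rss_mb` and `embed_fn` are hypothetical callables supplied by the caller.

```python
import logging
import time

log = logging.getLogger("embed")

YIELD_SECONDS = 0.05  # 50 ms, per the description above


def embed_with_backpressure(batches, embed_fn, read_rss_mb, limit_mb,
                            sleep=time.sleep):
    """Embed each batch, yielding briefly whenever RSS exceeds the soft limit.

    Returns (results, number_of_yields). The limit is soft: work always
    continues after the pause, it is never aborted.
    """
    results = []
    yields = 0
    for idx, batch in enumerate(batches):
        rss = read_rss_mb()
        if rss > limit_mb:
            # Hypothetical message format, not Memtrace's actual log line.
            log.warning("RSS over soft limit: batch_idx=%d rss_mb=%d limit_mb=%d",
                        idx, rss, limit_mb)
            sleep(YIELD_SECONDS)
            yields += 1
        results.append(embed_fn(batch))
    return results, yields


if __name__ == "__main__":
    samples = iter([4800, 6500, 5900])  # simulated RSS readings in MB
    out, paused = embed_with_backpressure(
        batches=[["a"], ["b"], ["c"]],
        embed_fn=len,
        read_rss_mb=lambda: next(samples),
        limit_mb=6144,
    )
    print(out, paused)
```

A limit set below the steady-state RSS makes every iteration take the pause branch, which is the constant-throttling case the paragraph above warns about.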

Diagnosing without changing anything

memtrace status                    # high-level: data dir, counts, models

In the daemon's log (run memtrace start in foreground or tail -f ~/.memtrace/logs/...), look for:

ort: global thread pool capped — intra_op=2, inter_op=1
embed: RSS sample batch_idx=32 rss_mb=4892 limit_mb=6144

Those tell you the auto-tuner kicked in and the embed loop is respecting the RSS budget.
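
If you are tailing the log and want headroom at a glance, the RSS sample line is easy to parse. The sketch below assumes the line format matches the example above exactly; `rss_headroom` is a hypothetical helper, not part of Memtrace.

```python
import re

RSS_SAMPLE_RE = re.compile(
    r"embed: RSS sample batch_idx=(\d+) rss_mb=(\d+) limit_mb=(\d+)"
)


def rss_headroom(line):
    """Parse one RSS sample line; return None if the line is something else."""
    match = RSS_SAMPLE_RE.search(line)
    if match is None:
        return None
    batch_idx, rss_mb, limit_mb = (int(g) for g in match.groups())
    return {
        "batch_idx": batch_idx,
        "rss_mb": rss_mb,
        "limit_mb": limit_mb,
        "headroom_mb": limit_mb - rss_mb,   # negative means over the limit
        "used_fraction": rss_mb / limit_mb,
    }


if __name__ == "__main__":
    sample = "embed: RSS sample batch_idx=32 rss_mb=4892 limit_mb=6144"
    info = rss_headroom(sample)
    print(info["headroom_mb"])  # 1252
```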

For per-query latency, the local UI at localhost:3030 has a Value Ledger panel that breaks down where time goes.

When something is genuinely too slow

Open a GitHub issue with:

  1. memtrace status output
  2. Your memtrace --version
  3. The host profile line from the daemon banner
  4. A short repro — what command, what repo size, what observed time

We tune based on real workloads, not synthetic ones. Field reports move the defaults faster than anything else.