Benchmarks
April 7, 2026
unch ships with a versioned benchmark suite and an end-to-end CLI runner.
The runner answers three practical questions:
- How long does indexing take?
- How long does search take?
- How often does the tool return the exact expected path:line?
It measures real unch index and unch search subprocesses, not internal Go APIs.
Today the checked-in adapters, suites, and release workflows benchmark unch only.
The report format is generic enough to support future adapters, but the current benchmark story should be read as unch regression and quality tracking, not as a published cross-tool shootout.
Benchmark Suites
The checked-in suites now have explicit identity and version fields:
- suite_id
- suite_version
This is important because benchmark results are only directly comparable when they come from the same suite definition.
Smoke Suite
File: benchmarks/suites/smoke.json
- suite_id: smoke
- suite_version: 1
- size: 12 queries
- purpose: the smallest local sanity-check suite and harness smoke run
CI Suite
File: benchmarks/suites/ci.json
- suite_id: ci
- suite_version: 1
- size: 32 queries across 8 pinned repositories
- purpose: cross-platform GitHub Actions benchmarking with balanced semantic, auto, and lexical coverage
Default Suite
File: benchmarks/suites/default.json
- suite_id: default
- suite_version: 3
- size: 161 queries across 8 pinned repositories
- purpose: broader unch regression and quality tracking across pinned repositories
Quick Start
Run the full default suite:
go run ./cmd/bench
Run the smaller smoke suite:
go run ./cmd/bench -suite ./benchmarks/suites/smoke.json
Run the checked-in CI suite locally:
go run ./cmd/bench -suite ./benchmarks/suites/ci.json
The shell wrapper forwards directly to the Go runner:
scripts/benchmark_repos.sh
Write the JSON report to a specific path:
go run ./cmd/bench \
  -suite ./benchmarks/suites/default.json \
  -output ./benchmarks/results/unch-local.json
Pass a custom model or yzma installation:
go run ./cmd/bench \
  -tool-option model=/path/to/model.gguf \
  -tool-option lib=/path/to/yzma
Use a known model id instead of a full path:
go run ./cmd/bench \
  -tool-option model=qwen3
What The Runner Measures
The runner records:
- cold index mean
- warm index mean
- warm search mean
- per-query warm search mean
- per-query top hit and observed rank
- top1
- top3
- mrr
- quality score
- suite coverage and per-repository mode mix
Timing Definitions
cold index
- one index run on a repo with its local .semsearch/ removed
- model/runtime caches are already present
- embeddings are recomputed because the local index database starts empty
- network download time is not included
warm index
- repeated index runs on the same pinned checkout
- shared model/runtime caches stay warm
- the existing local .semsearch/ directory is kept between repeats
- stored embedding hashes can be reused, so unchanged symbols should skip model inference on cache hit
warm search
- repeated searches against an already-built local index
- averaged per query and then per repository / per benchmark run
Quality Scoring
Each query defines one or more acceptable exact hits:
{
  "id": "new-router-semantic",
  "text": "create a new router",
  "mode": "auto",
  "expected_hits": ["mux.go:32"]
}
The runner scores ranked output against those exact targets:
- top1: expected hit is ranked first
- top3: expected hit appears in the first three results
- mrr: reciprocal rank of the first expected hit in the top 10
Composite score:
score = round(100 * (0.5 * top1 + 0.2 * top3 + 0.3 * mrr))
This score is intentionally strict. It measures exact symbol localization, not vague semantic similarity.
Query Matrix
The source of truth for benchmark cases is the suite JSON itself.
Today the CI suite covers:
- gorilla/mux: 4 queries
- developit/mitt: 4 queries
- expressjs/morgan: 4 queries
- pallets-eco/blinker: 4 queries
- sindresorhus/p-limit: 4 queries
- sindresorhus/p-queue: 4 queries
- expressjs/cors: 4 queries
- theskumar/python-dotenv: 4 queries
That smaller matrix is intentionally balanced for CI:
- every repository contributes 4 queries
- every repository includes at least one exact lexical query
- the suite mixes 9 explicit semantic queries, 15 auto queries, and 8 lexical queries
- the overall footprint is small enough to run cross-platform without turning ordinary release verification into a multi-hour job
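The balance rules above are easy to check mechanically. The following sketch assumes a hypothetical in-memory shape (repository name mapped to the list of query modes); it does not reflect how the suite JSON or the runner actually represents queries.

```go
package main

import (
	"fmt"
	"sort"
)

// modeMix counts queries per mode across all repositories.
func modeMix(repos map[string][]string) map[string]int {
	mix := make(map[string]int)
	for _, modes := range repos {
		for _, m := range modes {
			mix[m]++
		}
	}
	return mix
}

// unbalanced returns the sorted names of repositories that do not have
// exactly perRepo queries or that lack a lexical query.
func unbalanced(repos map[string][]string, perRepo int) []string {
	var bad []string
	for name, modes := range repos {
		hasLexical := false
		for _, m := range modes {
			if m == "lexical" {
				hasLexical = true
			}
		}
		if len(modes) != perRepo || !hasLexical {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	// Hypothetical repositories, not the real suite contents.
	repos := map[string][]string{
		"example/a": {"semantic", "auto", "auto", "lexical"},
		"example/b": {"auto", "auto", "auto", "semantic"}, // no lexical query
	}
	fmt.Println(modeMix(repos))
	fmt.Println(unbalanced(repos, 4))
}
```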
Today the default suite covers:
- gorilla/mux: 39 queries
- developit/mitt: 28 queries
- expressjs/morgan: 31 queries
- pallets-eco/blinker: 31 queries
- go-chi/chi: 8 queries
- sindresorhus/p-queue: 8 queries
- expressjs/cors: 8 queries
- theskumar/python-dotenv: 8 queries
The cases are a mix of:
- explicit semantic queries
- auto natural-language queries
- lexical symbol-name queries
- paraphrases that hit the same expected path:line
That larger matrix is deliberate. It makes the suite more resistant to “one lucky query phrasing” and gives better signal when ranking changes.
How To Read The Output
Typical summary:
Tool: unch (v0.3.0)
Suite: /.../benchmarks/suites/smoke.json [smoke v1]
Suite revision: sha256:...
Environment: darwin/arm64 • Apple M4 • 10 cores
Suite coverage: 4 repos • 12 queries • auto=8 lexical=4
Run profile: 1 cold / 3 warm / 5 search repeats • top 10 hits
Cold index mean: 1.95s
Warm index mean: 316.68ms
Warm search mean: 305.57ms
Quality: 95/100 (top1=0.917 top3=1.000 mrr=0.958)
Interpretation:
- cold index mean tells you rebuild cost from empty local index state
- warm index mean tells you rebuild cost once caches are ready
- warm search mean is the user-facing query latency
- top1 shows how often the first answer is exactly right
- top3 shows whether the right answer still stays near the top
- mrr punishes rank drift
- quality score is the compact summary number for comparing unch runs, but it should always be read together with the raw metrics
- suite coverage and run profile make it explicit how broad the run was and how many repeats were used
- latest index snapshot per repo tells you roughly how much code was indexed in the most recent successful run
- top1 misses and the GitHub summary's slowest queries section make it easier to see whether regressions came from ranking drift, latency spikes, or both
The GitHub Actions benchmark matrix runs on manual workflow_dispatch and release-tag pushes. It uses the ci suite with a lighter 1 cold / 1 warm / 1 search repeat profile across Linux, Linux arm64, macOS, Windows x86_64, and Windows arm64. Ordinary push CI skips that matrix to keep feedback fast.
When benchmark artifacts are available from multiple platforms, the workflow also renders an aggregated summary with:
- a platform overview table
- one repository table per platform
- the same machine-readable per-platform JSON reports uploaded as workflow artifacts
Result Files
Each run writes machine-readable JSON under benchmarks/results/.
The report contains:
- benchmark environment
- suite path and suite revision hash
- suite metadata including suite_id and suite_version
- suite coverage including repo count, query count, and mode mix
- per-repository timing and quality breakdown
- per-repository query count, mode mix, and latest indexed symbol/file counts
- per-query hits, top hit, observed rank, and mean search timing
That JSON is the source of truth for comparing unch runs produced from the same suite definition.
Governance
Changing a suite is meaningful.
At minimum:
- changing queries
- changing expected path:line
- changing pinned commits
- adding or removing repositories
should be treated as a suite change, not as “the same benchmark”.
Rule of thumb:
- small text cleanup with identical semantics: keep the version
- meaningful benchmark behavior change: bump suite_version
Results from different suite versions should not be compared as if they were the same benchmark.
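A comparison script can enforce this rule before diffing two reports. The sketch below is an assumption-laden illustration: suite_id and suite_version are the documented field names, but their position at the top level of the report JSON and the integer type of suite_version are guesses, and SuiteMeta, parseMeta, and Comparable are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SuiteMeta is a hypothetical subset of a report. The field names come
// from the suite definition; their placement in the report is assumed.
type SuiteMeta struct {
	SuiteID      string `json:"suite_id"`
	SuiteVersion int    `json:"suite_version"`
}

// parseMeta decodes only the suite identity fields and ignores the rest.
func parseMeta(raw []byte) (SuiteMeta, error) {
	var m SuiteMeta
	err := json.Unmarshal(raw, &m)
	return m, err
}

// Comparable reports whether two results come from the same suite
// definition as identified by id and version.
func Comparable(a, b SuiteMeta) bool {
	return a.SuiteID == b.SuiteID && a.SuiteVersion == b.SuiteVersion
}

func main() {
	oldRun, err := parseMeta([]byte(`{"suite_id":"default","suite_version":2}`))
	if err != nil {
		panic(err)
	}
	newRun, err := parseMeta([]byte(`{"suite_id":"default","suite_version":3}`))
	if err != nil {
		panic(err)
	}
	if !Comparable(oldRun, newRun) {
		fmt.Printf("refusing to compare %s v%d with %s v%d\n",
			oldRun.SuiteID, oldRun.SuiteVersion, newRun.SuiteID, newRun.SuiteVersion)
	}
}
```

A stricter check could also require matching suite revision hashes, which would catch edits made without a version bump.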