Embedding Models
May 9, 2026 · View on GitHub
pdf-mcp uses fastembed for local, offline embedding. The four models below are validated end-to-end against the project's arxiv benchmark corpus (see Live Benchmark Results). Any other model in the fastembed TextEmbedding catalogue is accepted by the BYOM config, but is unvalidated — see the Unvalidated section below for the gotchas we've already hit.
Configuration
Add to ~/.config/pdf-mcp/config.toml:
[embedding]
model = "snowflake/snowflake-arctic-embed-s"
Missing key → default BAAI/bge-small-en-v1.5. The model downloads once on first use. Switching models clears the embedding cache for that PDF; re-embedding happens automatically on the next search.
Validated Models
These four models have been live-tested against the 7-scenario arxiv ground-truth corpus (Attention paper + GPT-3 paper). MTEB Retrieval = nDCG@10 averaged over 15 retrieval tasks (English MTEB benchmark). Higher is better. Sources: model cards on HuggingFace; see Notes below.
Fast English (384 dimensions)
| Model | Size | MTEB Retrieval | License | Notes |
|---|---|---|---|---|
BAAI/bge-small-en-v1.5 (default) | 67 MB | 51.68 | MIT | Best retrieval-per-MB at this size; proven default |
snowflake/snowflake-arctic-embed-s | 130 MB | 51.98 | Apache 2.0 | Slightly better retrieval than default; good Apache 2.0 alternative |
Mid-Size English (768 dimensions)
| Model | Size | MTEB Retrieval | License | Notes |
|---|---|---|---|---|
BAAI/bge-base-en-v1.5 | 210 MB | 53.25 | MIT | Solid mid-size step-up |
snowflake/snowflake-arctic-embed-m | 430 MB | 54.90 | Apache 2.0 | Best MTEB under 500 MB |
Selection Guide
| Goal | Model |
|---|---|
| Keep it simple | BAAI/bge-small-en-v1.5 (default) |
| Apache 2.0 drop-in for default | snowflake/snowflake-arctic-embed-s |
| Mid-size step-up (MIT) | BAAI/bge-base-en-v1.5 |
| Best validated retrieval | snowflake/snowflake-arctic-embed-m |
Unvalidated Models
The fastembed catalogue includes additional models (long-context, very-large, multilingual) that the BYOM config will accept but that we have not validated end-to-end. Use at your own risk; if you successfully run the benchmark on one, send numbers and we'll promote it.
Known gotchas we've already hit:
nomic-ai/nomic-embed-text-v1.5(520 MB, 768-dim, 8192-token context) — fastembed's defaultbatch_size=256makes the model OOM/hang when embedding PDFs with ~75+ pages of long text on commodity hardware. Loweringbatch_sizehelps but didn't make it reliable in our tests.mixedbread-ai/mxbai-embed-large-v1(640 MB, 1024-dim) — not run against the live corpus.BAAI/bge-large-en-v1.5(1.2 GB, 1024-dim) — not run against the live corpus.intfloat/multilingual-e5-small(384-dim, 100+ languages) — not run against the live corpus.intfloat/multilingual-e5-large(2.2 GB, 1024-dim, 100+ languages) — not run against the live corpus.
If you need any of these (long contexts, multilingual, larger English models), pin via BYOM and validate the retrieval yourself before depending on it.
Notes
- MTEB scores for BGE v1.5 models from their HuggingFace model cards.
- Snowflake Arctic Embed scores from snowflake-arctic-embed-m and -l model cards.
- All validated models run fully locally via fastembed — no external API calls.
Live Benchmark Results
Measured on the existing arxiv ground-truth corpus (Attention paper + GPT-3 paper, 7 hand-annotated scenarios). MRR aggregated across all 7 scenarios at each scenario's native k. Latency = p50 query time on a warm embedding cache. Run via scripts/benchmark_embedding_models.py.
| Model | MRR | p50 latency | Size | MTEB |
|---|---|---|---|---|
BAAI/bge-small-en-v1.5 (baseline) | 0.806 | 6.1 ms | 67 MB | 51.68 |
snowflake/snowflake-arctic-embed-s | 0.690 | 4.1 ms | 130 MB | 51.98 |
BAAI/bge-base-en-v1.5 | 0.667 | 5.7 ms | 210 MB | 53.25 |
snowflake/snowflake-arctic-embed-m | 0.029 | 5.8 ms | 430 MB | 54.90 |
Default decision (2026-05-09): kept — no challenger passed the gate (MRR lift ≥ 0.05 AND p50 ≤ 1.5x baseline). bge-small wins MRR by 0.116 over the best challenger on this corpus. The arctic-embed-m collapse (0.029) likely reflects a missing query/passage prefix protocol that fastembed does not apply automatically; users running BYOM with that family should validate their results before relying on them.