Embedding Models

May 9, 2026

pdf-mcp uses fastembed for local, offline embedding. The four models below are validated end-to-end against the project's arxiv benchmark corpus (see Live Benchmark Results). Any other model in the fastembed TextEmbedding catalogue is accepted by the BYOM (bring-your-own-model) config but is unvalidated; see Unvalidated Models below for the gotchas we've already hit.

Configuration

Add to ~/.config/pdf-mcp/config.toml:

[embedding]
model = "snowflake/snowflake-arctic-embed-s"

If the model key is missing, pdf-mcp falls back to the default, BAAI/bge-small-en-v1.5. The model downloads once on first use. Switching models clears a PDF's cached embeddings; that PDF is re-embedded automatically on the next search.


Validated Models

These four models have been live-tested against the 7-scenario arxiv ground-truth corpus (Attention paper + GPT-3 paper). MTEB Retrieval = nDCG@10 averaged over 15 retrieval tasks (English MTEB benchmark). Higher is better. Sources: model cards on HuggingFace; see Notes below.
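For readers unfamiliar with the metric, nDCG@10 for a single query can be sketched as below. This is a simplified illustration, not the MTEB evaluation code: it assumes `relevances` holds the graded relevance of every candidate in ranked order (so the ideal ordering can be taken from the same list) and uses linear gain.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A relevant document at rank 1 scores 1.0; the same document at rank 2 scores 1 / log2(3), roughly 0.63. The benchmark figure is this per-query value averaged over queries and then over tasks.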

Fast English (384 dimensions)

| Model | Size | MTEB Retrieval | License | Notes |
|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 (default) | 67 MB | 51.68 | MIT | Best retrieval-per-MB at this size; proven default |
| snowflake/snowflake-arctic-embed-s | 130 MB | 51.98 | Apache 2.0 | Slightly higher MTEB retrieval than default; good Apache 2.0 alternative |

Mid-Size English (768 dimensions)

| Model | Size | MTEB Retrieval | License | Notes |
|---|---|---|---|---|
| BAAI/bge-base-en-v1.5 | 210 MB | 53.25 | MIT | Solid mid-size step-up |
| snowflake/snowflake-arctic-embed-m | 430 MB | 54.90 | Apache 2.0 | Best MTEB under 500 MB |

Selection Guide

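| Goal | Model |
|---|---|
| Keep it simple | BAAI/bge-small-en-v1.5 (default) |
| Apache 2.0 drop-in for default | snowflake/snowflake-arctic-embed-s |
| Mid-size step-up (MIT) | BAAI/bge-base-en-v1.5 |
| Highest MTEB retrieval among validated models | snowflake/snowflake-arctic-embed-m (its MRR collapsed on our corpus; see Live Benchmark Results) |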

Unvalidated Models

The fastembed catalogue includes additional models (long-context, very-large, multilingual) that the BYOM config will accept but that we have not validated end-to-end. Use at your own risk; if you successfully run the benchmark on one, send numbers and we'll promote it.

Known gotchas and untested models:

  • nomic-ai/nomic-embed-text-v1.5 (520 MB, 768-dim, 8192-token context): with fastembed's default batch_size=256, the model runs out of memory or hangs when embedding PDFs with roughly 75 or more pages of long text on commodity hardware. Lowering batch_size helped but did not make it reliable in our tests.
  • mixedbread-ai/mxbai-embed-large-v1 (640 MB, 1024-dim): not run against the live corpus.
  • BAAI/bge-large-en-v1.5 (1.2 GB, 1024-dim): not run against the live corpus.
  • intfloat/multilingual-e5-small (384-dim, 100+ languages): not run against the live corpus.
  • intfloat/multilingual-e5-large (2.2 GB, 1024-dim, 100+ languages): not run against the live corpus.
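To illustrate what lowering the batch size means in practice, here is a library-independent sketch of batching. `embed_fn` is a hypothetical stand-in for whatever embeds a list of texts, and the default of 16 is an arbitrary illustrative value, not a recommendation from our tests. With fastembed itself, the equivalent is passing a smaller `batch_size` to the model's `embed` call.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], Iterable[List[float]]],
    batch_size: int = 16,
) -> Iterator[List[float]]:
    """Yield one vector per text, embedding at most batch_size texts at a time.

    Smaller batches cap peak memory at the cost of more calls to embed_fn.
    """
    if batch_size < 1:
        # Generator function: this is raised when iteration starts.
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield from embed_fn(texts[start:start + batch_size])
```

As noted above, a smaller batch reduced memory pressure for nomic-embed-text-v1.5 but did not make it reliable, so treat this as a mitigation to test rather than a fix.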

If you need any of these (long contexts, multilingual, larger English models), pin via BYOM and validate the retrieval yourself before depending on it.


Notes


Live Benchmark Results

Measured on the existing arxiv ground-truth corpus (Attention paper + GPT-3 paper, 7 hand-annotated scenarios). MRR aggregated across all 7 scenarios at each scenario's native k. Latency = p50 query time on a warm embedding cache. Run via scripts/benchmark_embedding_models.py.
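The MRR aggregation described here can be sketched as follows. This is an illustration of the metric, not the contents of scripts/benchmark_embedding_models.py; the data layout (a ranked list of IDs, a set of relevant IDs, and a per-scenario k) is an assumption, as is taking "aggregated" to mean a plain average.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant hit within the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(scenarios):
    """Average reciprocal rank over (ranked_ids, relevant_ids, k) scenarios."""
    if not scenarios:
        return 0.0
    total = sum(reciprocal_rank(ranked, relevant, k) for ranked, relevant, k in scenarios)
    return total / len(scenarios)
```

A scenario whose first relevant chunk falls outside its own k contributes 0, which is how a model can score close to zero even when relevant chunks are ranked just below the cutoff.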

| Model | MRR | p50 latency | Size | MTEB |
|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 (baseline) | 0.806 | 6.1 ms | 67 MB | 51.68 |
| snowflake/snowflake-arctic-embed-s | 0.690 | 4.1 ms | 130 MB | 51.98 |
| BAAI/bge-base-en-v1.5 | 0.667 | 5.7 ms | 210 MB | 53.25 |
| snowflake/snowflake-arctic-embed-m | 0.029 | 5.8 ms | 430 MB | 54.90 |

Default decision (2026-05-09): kept. No challenger passed the gate (MRR lift ≥ 0.05 AND p50 ≤ 1.5x baseline). bge-small beats the best challenger, snowflake-arctic-embed-s, by 0.116 MRR on this corpus. The arctic-embed-m collapse (0.029) likely reflects a missing query/passage prefix protocol that fastembed does not apply automatically; users running BYOM with that family should validate their results before relying on them.
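The gate can be written out as a small check and applied to the measured numbers. This is a sketch: the `Result` class and `passes_gate` function are hypothetical names, and "MRR lift" is read as an absolute difference from the baseline, consistent with the 0.116 figure above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    mrr: float
    p50_ms: float


def passes_gate(challenger, baseline, min_mrr_lift=0.05, max_latency_ratio=1.5):
    """True if the challenger gains enough MRR without exceeding the latency budget."""
    mrr_lift = challenger.mrr - baseline.mrr  # absolute lift (assumption)
    latency_ok = challenger.p50_ms <= max_latency_ratio * baseline.p50_ms
    return mrr_lift >= min_mrr_lift and latency_ok


# Figures from the Live Benchmark Results table.
baseline = Result("BAAI/bge-small-en-v1.5", 0.806, 6.1)
challengers = [
    Result("snowflake/snowflake-arctic-embed-s", 0.690, 4.1),
    Result("BAAI/bge-base-en-v1.5", 0.667, 5.7),
    Result("snowflake/snowflake-arctic-embed-m", 0.029, 5.8),
]

winners = [c.model for c in challengers if passes_gate(c, baseline)]
```

Every challenger is within the latency budget (1.5 x 6.1 ms = 9.15 ms), but all three have a negative MRR lift, so `winners` is empty and the default stays.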