BM25 Benchmarks
April 29, 2026
CLI
Installation
From PyPI with pip:
pip install bm25-benchmarks
From GitHub with pip:
pip install "bm25-benchmarks @ git+https://github.com/xhluca/bm25-benchmarks.git"
With uv as a globally available tool:
uv tool install bm25-benchmarks
With uv into the current virtual environment:
uv pip install "bm25-benchmarks @ git+https://github.com/xhluca/bm25-benchmarks.git"
For local development:
pip install -e "."
# or
uv pip install -e "."
The default install includes bm25s. The bm25s backend uses the dataset
helpers shipped by bm25s, so the default CLI path does not install beir.
Install another backend with
bm25-benchmark install rank, bm25-benchmark install bm25-pt,
bm25-benchmark install pyserini, bm25-benchmark install elastic,
bm25-benchmark install pisa, or bm25-benchmark install all.
For rank-bm25, use the CLI installer so the pinned Git dependency is installed
for you. PyPI rejects direct Git dependencies in package metadata, so the pin is
kept in requirements-rank-bm25.txt and the CLI installer:
bm25-benchmark install rank
bm25-benchmark install all
bm25-benchmark install rank --installer uv
Usage
The benchmark package exposes one CLI for running evals:
bm25-benchmark --help
bm25-benchmark models
bm25-benchmark datasets
Run an eval by choosing a backend and dataset:
bm25-benchmark eval bm25s -d fiqa
bm25-benchmark eval rank-bm25 -d fiqa --samples 1000
bm25-benchmark eval pyserini -d fiqa --threads 4
bm25-benchmark eval elastic -d fiqa --hostname localhost
bm25-benchmark eval pisa -d fiqa
bm25-benchmark eval bm25-pt -d fiqa --batch-size 32
Common eval options include:
bm25-benchmark eval bm25s -d fiqa -d scifact --result-dir results --save-dir datasets
bm25-benchmark eval bm25s -d fiqa,scifact --num-runs 3
bm25-benchmark eval bm25s -d fiqa --dry-run
Use bm25-benchmark eval <backend> --help to see backend-specific options.
The module form is also available:
python -m benchmark eval bm25s -d fiqa
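To script several runs, the commands above can be assembled programmatically. The sketch below only builds argument lists from flags shown in this section (`-d`, `--num-runs`, `--result-dir`, `--dry-run`). The helper name `build_eval_command` is hypothetical and not part of the package, and combining `--num-runs` with `--dry-run` is an assumption.

```python
import shlex


def build_eval_command(backend, datasets, num_runs=None, result_dir=None, dry_run=False):
    """Build an argument list for `bm25-benchmark eval` (hypothetical helper)."""
    cmd = ["bm25-benchmark", "eval", backend]
    for name in datasets:
        cmd += ["-d", name]  # -d may be repeated, as in the examples above
    if num_runs is not None:
        cmd += ["--num-runs", str(num_runs)]
    if result_dir is not None:
        cmd += ["--result-dir", result_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_eval_command("bm25s", ["fiqa", "scifact"], num_runs=3, dry_run=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the CLI is installed.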
Running Benchmarks
BM25S options
For bm25s, you can specify which scoring methods and retrieval backends to benchmark:
# Default: runs the jit scorer and numba backend
bm25-benchmark eval bm25s -d fiqa
# Specify scorers (uncompiled, legacy, jit)
bm25-benchmark eval bm25s -d fiqa --scorers legacy jit
bm25-benchmark eval bm25s -d fiqa --scorers jit
bm25-benchmark eval bm25s -d fiqa --scorers uncompiled legacy jit
# Specify backends (jax, numba, numpy)
bm25-benchmark eval bm25s -d fiqa --backends numba
bm25-benchmark eval bm25s -d fiqa --backends jax numba numpy
# Combine both
bm25-benchmark eval bm25s -d fiqa --scorers jit --backends numba
Scorer options:
- uncompiled: Default NumPy implementation (optimized with np.add.at)
- legacy: Legacy implementation (similar to uncompiled, kept for comparison)
- jit: Numba JIT-compiled version (fastest after warmup)
Backend options:
- jax: JAX-based retrieval
- numba: Numba JIT-compiled retrieval (default)
- numpy: Pure NumPy retrieval
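To time each scorer and backend pairing in isolation, one option is to issue a separate command per pair rather than passing several values at once. The following is a minimal sketch that enumerates the pairs using the scorer and backend names listed above. The helper `combination_commands` is hypothetical, and whether every pairing is meaningful is not stated here.

```python
from itertools import product

SCORERS = ["uncompiled", "legacy", "jit"]
BACKENDS = ["jax", "numba", "numpy"]


def combination_commands(dataset, scorers=SCORERS, backends=BACKENDS):
    """Return one `bm25-benchmark eval bm25s` argument list per (scorer, backend) pair."""
    return [
        ["bm25-benchmark", "eval", "bm25s", "-d", dataset,
         "--scorers", scorer, "--backends", backend]
        for scorer, backend in product(scorers, backends)
    ]


commands = combination_commands("fiqa")
for cmd in commands:
    print(" ".join(cmd))
```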
Available datasets
The available datasets are public BEIR datasets: trec-covid, nfcorpus, fiqa, arguana, webis-touche2020, quora, scidocs, scifact, cqadupstack, nq, msmarco, hotpotqa, dbpedia-entity, fever, and climate-fever.
Sampling during benchmarking
For rank-bm25, the runtime is long, so we can benchmark on a sample of the queries:
bm25-benchmark eval rank-bm25 -d "<dataset>" --samples <num_samples>
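How the package draws its sample is not described here. As an illustration only, a reproducible subset of queries can be drawn with a seeded random generator. The function `sample_query_ids` and the toy query dictionary below are assumptions made for this sketch.

```python
import random


def sample_query_ids(query_ids, num_samples, seed=0):
    """Pick a reproducible random subset of query IDs (illustrative only)."""
    ids = sorted(query_ids)  # fix the order so the seed fully determines the subset
    if num_samples >= len(ids):
        return ids
    return random.Random(seed).sample(ids, num_samples)


# Toy stand-in for a dataset's queries, keyed by query ID.
queries = {f"q{i}": f"query text {i}" for i in range(5000)}
subset = sample_query_ids(queries, 1000)
sampled_queries = {qid: queries[qid] for qid in subset}
print(len(sampled_queries))
```

Fixing the seed keeps the sampled queries identical across runs, so throughput numbers stay comparable.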
Rank-bm25 variants
For rank-bm25, we can also specify the variant to use with --method:
- rank (default)
- bm25l
- bm25+
Results are saved in the results/ directory.
Elasticsearch server
If you want to use Elasticsearch, you need to start the server first.
First, download Elasticsearch. You will get a file, e.g. elasticsearch-8.14.0-linux-x86_64.tar.gz. Extract the file and ensure it is in the same directory as the bm25-benchmarks directory.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-8.14.0-linux-x86_64.tar.gz
# remove the tar file
rm elasticsearch-8.14.0-linux-x86_64.tar.gz
Then, start the server with the following command:
./elasticsearch-8.14.0/bin/elasticsearch -E xpack.security.enabled=false -E thread_pool.search.size=1 -E thread_pool.write.size=1
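Before launching the elastic eval, it can help to confirm that the server is answering. The sketch below assumes the server listens on Elasticsearch's default HTTP port 9200 on localhost and that security is disabled as in the command above. The function name `elasticsearch_is_up` is hypothetical.

```python
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET on the server root answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


print(elasticsearch_is_up())
```

The server can take a while to start, so polling this check in a loop with a short sleep is a reasonable way to wait for it.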
Results
The results are benchmarked using Kaggle notebooks to ensure reproducibility. Each one is run on a single core of an Intel Xeon CPU @ 2.20GHz, with 30GB of RAM.
The shorthands used are:
- BM25PT (PT in the tables) for bm25_pt
- PSRN for pyserini
- R-BM25 (Rank in some tables) for rank-bm25
- BM25S for bm25s, and BM25S+J for the Numba JIT version of bm25s (v0.2.0+)
- ES for elasticsearch
- PISA for the Pisa Engine (via the pyterrier_pisa Python bindings)
- OOM for out-of-memory error
- DNT for did not terminate (i.e. went over 12 hours)
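The timing code used by the benchmark is not shown in this section. As a rough illustration of what a queries-per-second figure means, the sketch below times an arbitrary retrieval callable over a list of queries. `queries_per_second` and `toy_retrieve` are hypothetical names, and the toy function is a stand-in rather than a real BM25 backend.

```python
import time


def queries_per_second(retrieve, queries):
    """Time `retrieve` over all queries and return the throughput in queries/s."""
    start = time.perf_counter()
    for query in queries:
        retrieve(query)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed


# Stand-in retrieval function over a toy corpus (not a real BM25 backend).
corpus = ["a cat sat", "a dog ran", "birds fly south"]


def toy_retrieve(query):
    return [doc for doc in corpus if query in doc]


qps = queries_per_second(toy_retrieve, ["cat", "dog", "fly"] * 1000)
print(f"{qps:.2f} queries/s")
```

A real measurement would also need to decide whether tokenization and JIT warmup count toward the elapsed time.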
Queries per second
| dataset | PISA | BM25S+J | BM25S | ES | PSRN | PT | R-BM25 |
|---|---|---|---|---|---|---|---|
| arguana | 270.53 | 869.95 | 573.91 | 13.67 | 11.95 | 110.51 | 2 |
| climate-fever | 35.95 | 38.49 | 13.09 | 4.02 | 8.06 | OOM | 0.03 |
| cqadupstack | 362.39 | 396.5 | 170.91 | 13.38 | DNT | OOM | 0.77 |
| dbpedia-entity | 197.45 | 71.8 | 13.44 | 10.68 | 12.69 | OOM | 0.11 |
| fever | 81.42 | 53.84 | 20.19 | 7.45 | 10.52 | OOM | 0.06 |
| fiqa | 714.35 | 1237.39 | 717.78 | 16.96 | 12.51 | 20.52 | 4.46 |
| hotpotqa | 54.98 | 47.16 | 20.88 | 7.11 | 10.41 | OOM | 0.04 |
| msmarco | 178.65 | 39.18 | 12.2 | 11.88 | 11.01 | OOM | 0.07 |
| nfcorpus | 5111.72 | 5696.21 | 1196.16 | 45.84 | 32.94 | 256.67 | 224.66 |
| nq | 168.12 | 109.47 | 41.85 | 12.16 | 11.04 | OOM | 0.1 |
| quora | 735.20 | 479.71 | 272.04 | 21.8 | 15.58 | 6.49 | 1.18 |
| scidocs | 818.97 | 1448.32 | 767.05 | 17.93 | 14.1 | 41.34 | 9.01 |
| scifact | 1463.73 | 2787.84 | 1317.12 | 20.81 | 15.02 | 184.3 | 47.6 |
| trec-covid | 282.94 | 483.84 | 85.64 | 7.34 | 8.53 | 3.73 | 1.48 |
| webis-touche2020 | 431.12 | 390.03 | 60.59 | 13.53 | 12.36 | OOM | 1.1 |
Notes:
- For Rank-BM25, larger datasets are run with 1000 samples rather than the full dataset to ensure it finishes within 12h (limit for Kaggle notebooks).
- For ES and BM25S, we can set the number of threads to use. However, you might not see an improvement; with BM25S you might even see a decrease in throughput due to how multi-threading is implemented. The multi-threaded results are shown below.
BM25S & ES multi-threaded (4T) performance (Q/s)
| dataset | PISA | BM25S | ES |
|---|---|---|---|
| arguana | 590.93 | 211 | 33.37 |
| climate-fever | 91.68 | 22.06 | 8.13 |
| cqadupstack | 945.66 | 248.87 | 27.76 |
| dbpedia-entity | 478.26 | 26.18 | 15.49 |
| fever | 222.08 | 47.03 | 14.07 |
| fiqa | 1382.32 | 449.82 | 36.33 |
| hotpotqa | 134.60 | 45.02 | 10.35 |
| msmarco | 393.16 | 21.64 | 18.19 |
| nfcorpus | 6706.53 | 784.24 | 81.07 |
| nq | 423.54 | 77.49 | 19.18 |
| quora | 1892.98 | 308.58 | 43.02 |
| scidocs | 1757.44 | 614.23 | 46.36 |
| scifact | 2480.86 | 645.88 | 50.93 |
| trec-covid | 676.40 | 100.88 | 13.5 |
| webis-touche2020 | 938.57 | 202.39 | 26.55 |
Queries per second normalized with respect to Rank-BM25
| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
|---|---|---|---|---|---|---|
| arguana | 135.27 | 286.96 | 6.84 | 5.98 | 55.26 | 1 |
| climate-fever | 1198.33 | 436.33 | 134 | 268.67 | nan | 1 |
| cqadupstack | 470.64 | 221.96 | 17.38 | nan | nan | 1 |
| dbpedia-entity | 1795.00 | 122.18 | 97.09 | 115.36 | nan | 1 |
| fever | 1357.00 | 336.5 | 124.17 | 175.33 | nan | 1 |
| fiqa | 160.17 | 160.94 | 3.8 | 2.8 | 4.6 | 1 |
| hotpotqa | 1374.50 | 522 | 177.75 | 260.25 | nan | 1 |
| msmarco | 2552.14 | 174.29 | 169.71 | 157.29 | nan | 1 |
| nfcorpus | 22.75 | 5.32 | 0.2 | 0.15 | 1.14 | 1 |
| nq | 1681.20 | 418.5 | 121.6 | 110.4 | nan | 1 |
| quora | 623.05 | 230.54 | 18.47 | 13.2 | 5.5 | 1 |
| scidocs | 90.90 | 85.13 | 1.99 | 1.56 | 4.59 | 1 |
| scifact | 30.75 | 27.67 | 0.44 | 0.32 | 3.87 | 1 |
| trec-covid | 191.18 | 57.86 | 4.96 | 5.76 | 2.52 | 1 |
| webis-touche2020 | 391.93 | 55.08 | 12.3 | 11.24 | nan | 1 |
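The normalized values appear to be each backend's single-threaded queries per second divided by the Rank-BM25 figure for the same dataset (for example, fiqa BM25S: 717.78 / 4.46 ≈ 160.94), with OOM and DNT entries shown as nan. A minimal sketch of that calculation, using three rows copied from the queries-per-second table, follows. The function name `normalize_row` is hypothetical.

```python
import math

BASELINE = "R-BM25"

# Queries per second copied from the single-threaded table for three datasets.
qps = {
    "fiqa": {"PISA": 714.35, "BM25S": 717.78, "ES": 16.96,
             "PSRN": 12.51, "PT": 20.52, "R-BM25": 4.46},
    "nfcorpus": {"PISA": 5111.72, "BM25S": 1196.16, "ES": 45.84,
                 "PSRN": 32.94, "PT": 256.67, "R-BM25": 224.66},
    "nq": {"PISA": 168.12, "BM25S": 41.85, "ES": 12.16,
           "PSRN": 11.04, "PT": "OOM", "R-BM25": 0.1},
}


def normalize_row(row, baseline=BASELINE):
    """Divide every numeric entry by the baseline; non-numeric entries become nan."""
    base = row[baseline]
    out = {}
    for name, value in row.items():
        if isinstance(value, (int, float)) and isinstance(base, (int, float)):
            out[name] = round(value / base, 2)
        else:
            out[name] = math.nan  # OOM / DNT have no throughput to compare
    return out


normalized = {dataset: normalize_row(row) for dataset, row in qps.items()}
for dataset, row in normalized.items():
    print(dataset, row)
```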
Stats
| dataset | # Docs | # Queries | # Tokens |
|---|---|---|---|
| msmarco | 8,841,823 | 6,980 | 340,859,891 |
| hotpotqa | 5,233,329 | 7,405 | 169,530,287 |
| trec-covid | 171,332 | 50 | 20,231,412 |
| webis-touche2020 | 382,545 | 49 | 74,180,340 |
| arguana | 8,674 | 1,406 | 947,470 |
| fiqa | 57,638 | 648 | 5,189,035 |
| nfcorpus | 3,633 | 323 | 614,081 |
| climate-fever | 5,416,593 | 1,535 | 318,190,120 |
| nq | 2,681,468 | 3,452 | 148,249,808 |
| scidocs | 25,657 | 1,000 | 3,211,248 |
| quora | 522,931 | 10,000 | 4,202,123 |
| dbpedia-entity | 4,635,922 | 400 | 162,336,256 |
| cqadupstack | 457,199 | 13,145 | 44,857,487 |
| fever | 5,416,568 | 6,666 | 318,184,321 |
| scifact | 5,183 | 300 | 812,074 |
Indexing speed (docs/s)
The following results follow the same setup as the queries/s benchmarks described above (single-core).
| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
|---|---|---|---|---|---|---|
| arguana | 3432.50 | 4314.79 | 3591.63 | 1225.18 | 638.1 | 5021.3 |
| climate-fever | 5462.73 | 4364.43 | 3825.89 | 6880.42 | nan | 7085.51 |
| cqadupstack | 3963.76 | 4800.89 | 3725.43 | nan | nan | 5370.32 |
| dbpedia-entity | 9019.62 | 7576.28 | 6333.82 | 8501.7 | nan | 9110.36 |
| fever | 4903.06 | 4921.88 | 3879.63 | 7007.5 | nan | 5482.64 |
| fiqa | 4426.92 | 5959.25 | 4035.11 | 3735.38 | 421.51 | 6455.53 |
| hotpotqa | 9883.85 | 7420.39 | 5455.6 | 10342.5 | nan | 9407.9 |
| msmarco | 10205.53 | 7480.71 | 5391.29 | 9686.07 | nan | 12455.9 |
| nfcorpus | 2381.11 | 3169.4 | 1688.15 | 692.05 | 442.2 | 3579.47 |
| nq | 7122.05 | 6083.86 | 5742.13 | 6652.33 | nan | 6048.85 |
| quora | 38512.02 | 28002.4 | 8189.75 | 22818.5 | 6251.26 | 47609.2 |
| scidocs | 3085.13 | 4107.46 | 3008.45 | 2137.64 | 312.72 | 4232.15 |
| scifact | 2449.91 | 3253.63 | 2649.57 | 880.53 | 442.61 | 3792.84 |
| trec-covid | 4642.59 | 4600.14 | 2966.98 | 3768.1 | 406.37 | 4672.62 |
| webis-touche2020 | 2228.10 | 2971.96 | 2484.87 | 2718.41 | nan | 3115.96 |
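To turn a docs/s figure into an approximate wall-clock indexing time, divide the corpus size from the Stats table by the rate. The sketch below does this for BM25S on msmarco. It assumes the rate was computed as documents divided by total indexing time, which this section does not state explicitly, and `indexing_seconds` is a hypothetical helper.

```python
def indexing_seconds(num_docs, docs_per_second):
    """Estimate indexing wall time, assuming rate = docs / total indexing time."""
    return num_docs / docs_per_second


msmarco_docs = 8_841_823   # "# Docs" for msmarco in the Stats table
bm25s_rate = 7480.71       # BM25S docs/s on msmarco in the table above
seconds = indexing_seconds(msmarco_docs, bm25s_rate)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")
```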
NDCG@10
We use abbreviations for datasets of BEIR benchmarks.
Dataset abbreviations:
- AG for arguana
- CD for cqadupstack
- CF for climate-fever
- DB for dbpedia-entity
- FQ for fiqa
- FV for fever
- HP for hotpotqa
- MS for msmarco
- NF for nfcorpus
- NQ for nq
- QR for quora
- SD for scidocs
- SF for scifact
- TC for trec-covid
- WT for webis-touche2020
| k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9 | 0.4 | Lucene | 41.1 | 40.8 | 28.2 | 16.2 | 31.9 | 23.8 | 63.8 | 62.9 | 22.8 | 31.8 | 30.5 | 78.7 | 15.0 | 67.6 | 58.9 | 44.2 |
| 1.2 | 0.75 | ATIRE | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
| 1.2 | 0.75 | BM25+ | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
| 1.2 | 0.75 | BM25L | 39.5 | 49.6 | 29.8 | 13.5 | 29.4 | 25.0 | 46.6 | 55.9 | 21.4 | 32.2 | 28.1 | 80.3 | 15.8 | 68.7 | 62.9 | 33.0 |
| 1.2 | 0.75 | Lucene | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.0 | 61.0 | 33.2 |
| 1.2 | 0.75 | Robertson | 39.9 | 49.2 | 29.9 | 13.7 | 30.3 | 25.4 | 50.3 | 58.5 | 22.6 | 31.9 | 29.2 | 80.4 | 15.5 | 68.3 | 59.0 | 33.8 |
| 1.5 | 0.75 | ES | 42.0 | 47.7 | 29.8 | 17.8 | 31.1 | 25.3 | 62.0 | 58.6 | 22.1 | 34.4 | 31.6 | 80.6 | 16.3 | 69.0 | 68.0 | 35.4 |
| 1.5 | 0.75 | Lucene | 39.7 | 49.3 | 29.9 | 13.6 | 29.9 | 25.1 | 48.1 | 56.9 | 21.9 | 32.1 | 28.5 | 80.4 | 15.8 | 68.7 | 62.3 | 33.1 |
| 1.5 | 0.75 | PSRN | 40.0 | 48.4 | 29.8 | 14.2 | 30.0 | 25.3 | 50.0 | 57.6 | 22.1 | 32.6 | 28.6 | 80.6 | 15.6 | 68.8 | 63.4 | 33.5 |
| 1.5 | 0.75 | PT | 45.0 | 44.9 | -- | -- | -- | 22.5 | -- | -- | -- | 31.9 | -- | 75.1 | 14.7 | 67.8 | 58.0 | -- |
| 1.5 | 0.75 | Rank | 39.6 | 49.5 | 29.6 | 13.6 | 29.9 | 25.3 | 49.3 | 58.1 | 21.1 | 32.1 | 28.5 | 80.3 | 15.8 | 68.5 | 60.1 | 32.9 |
| 1.2 | 0.75 | PISA | 38.8 | 41.1 | 27.8 | 13.9 | 30.5 | 24.5 | 49.2 | 58.2 | 22.8 | 34.3 | 28.2 | 72.0 | 15.7 | 68.9 | 64.2 | 30.9 |
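For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 for a single query, using the common linear-gain form (relevance divided by log2 of rank plus one). The evaluation tool actually used for the table is not named in this section, so treat this as an illustration of the metric rather than the benchmark's code. The table values look like per-dataset averages over queries expressed as percentages, which is an assumption.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k with linear gain: DCG of the ranking divided by the ideal DCG."""
    def dcg(gains):
        # Rank positions start at 1, so 0-based index i is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


relevance = {"d1": 1, "d3": 1}        # graded judgments for one query (toy data)
ranking = ["d1", "d2", "d3", "d4"]    # system output, best first (toy data)
score = ndcg_at_k(ranking, relevance)
print(round(100 * score, 1))
```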
Recall@1000
| k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9 | 0.4 | Lucene | 77.3 | 98.8 | 71.1 | 63.3 | 67.5 | 74.3 | 95.7 | 88.0 | 85.3 | 47.7 | 89.6 | 99.5 | 56.5 | 97.0 | 39.2 | 86.0 |
| 1.2 | 0.75 | ATIRE | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | BM25+ | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | BM25L | 77.2 | 99.4 | 73.4 | 57.3 | 66.1 | 77.3 | 93.7 | 85.7 | 85.0 | 47.7 | 89.3 | 99.5 | 57.7 | 97.0 | 40.8 | 87.5 |
| 1.2 | 0.75 | Lucene | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.6 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | Robertson | 77.4 | 99.3 | 73.2 | 59.1 | 66.7 | 76.8 | 94.2 | 86.8 | 85.9 | 47.5 | 89.8 | 99.5 | 57.3 | 96.7 | 40.2 | 87.4 |
| 1.5 | 0.75 | ES | 76.9 | 99.2 | 74.2 | 58.8 | 63.6 | 76.7 | 95.9 | 85.2 | 85.1 | 39.0 | 90.8 | 99.6 | 57.9 | 98.0 | 41.3 | 88.0 |
| 1.5 | 0.75 | Lucene | 77.2 | 99.3 | 73.3 | 57.8 | 66.3 | 77.2 | 93.8 | 86.1 | 85.2 | 47.7 | 89.5 | 99.6 | 57.5 | 97.0 | 40.6 | 87.4 |
| 1.5 | 0.75 | PSRN | 76.7 | 99.2 | 74.2 | 58.7 | 66.2 | 76.7 | 94.2 | 86.4 | 85.1 | 37.1 | 89.4 | 99.6 | 57.4 | 97.7 | 41.1 | 87.2 |
| 1.5 | 0.75 | PT | 73.0 | 98.3 | -- | -- | -- | 72.5 | -- | -- | -- | 51.0 | -- | 98.9 | 56.0 | 97.8 | 36.3 | -- |
| 1.5 | 0.75 | Rank | 77.1 | 99.4 | 73.4 | 57.5 | 66.4 | 77.4 | 93.6 | 87.7 | 82.6 | 47.6 | 89.5 | 99.5 | 57.4 | 96.7 | 40.5 | 87.5 |
| 1.2 | 0.75 | PISA | 77.1 | 98.7 | 72.2 | 60.2 | 67.7 | 76.5 | 93.7 | 86.8 | 86.9 | 38.4 | 89.1 | 98.9 | 56.9 | 97.0 | 45.9 | 87.4 |
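Recall@1000 is the fraction of a query's relevant documents that appear among the top 1000 results. A minimal per-query sketch is given below. As with NDCG, the benchmark's own evaluation code is not shown here, and the name `recall_at_k` and the toy data are assumptions.

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=1000):
    """Fraction of relevant documents found in the top-k results for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & relevant) / len(relevant)


ranking = ["d1", "d2", "d3", "d4"]   # system output, best first (toy data)
relevant = {"d1", "d3", "d9"}        # d9 was never retrieved
recall = recall_at_k(ranking, relevant)
print(round(100 * recall, 1))
```

Averaging this value over all queries of a dataset gives a per-dataset figure comparable in form to the table entries.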
Links
- BM25+
- BM25L
- ATIRE
- Robertson
- Lucene (k1=1.2, b=0.75)
- Lucene (k1=0.9, b=0.4)
- Lucene (k1=1.5, b=0.75): DB, MS, FV, CF, NQ, HP, Remaining
- ES: FV, CF, NQ, MS, HP, DB, Remaining
- PT: MS, CF, FV, DB, HP, NQ, WT, CD, Remaining
- Rank: DB, HP, CF, FV, MS, NQ, CD, Remaining
- PSRN: CD, FV, HP, MS, DB, NQ, Remaining
- PISA: NQ, DB, CF, HP, FV, MS, CD, Remaining
- BM25S+J: Sub-1m, Remaining
Legacy Module Entry Points
Prefer the bm25-benchmark eval ... CLI for new runs. The older module entry
points are still available for existing scripts:
# For bm25_pt
python -m benchmark.on_bm25_pt -d "<dataset>"
# For rank-bm25
python -m benchmark.on_rank_bm25 -d "<dataset>"
# For Pyserini
python -m benchmark.on_pyserini -d "<dataset>"
# For elastic, after starting the server, run:
python -m benchmark.on_elastic -d "<dataset>"
# For PISA
python -m benchmark.on_pisa -d "<dataset>"
# For bm25s
python -m benchmark.on_bm25s -d "<dataset>"
where <dataset> is the name of the dataset to be used.