BM25 Benchmarks

April 29, 2026

CLI

Installation

From PyPI with pip:

pip install bm25-benchmarks

From GitHub with pip:

pip install "bm25-benchmarks @ git+https://github.com/xhluca/bm25-benchmarks.git"

With uv as a globally available tool:

uv tool install bm25-benchmarks

With uv into the current virtual environment:

uv pip install "bm25-benchmarks @ git+https://github.com/xhluca/bm25-benchmarks.git"

For local development:

pip install -e "."
# or
uv pip install -e "."

The default install includes bm25s. The bm25s backend uses the dataset helpers shipped by bm25s, so the default CLI path does not install beir. Install another backend with bm25-benchmark install rank, bm25-benchmark install bm25-pt, bm25-benchmark install pyserini, bm25-benchmark install elastic, bm25-benchmark install pisa, or bm25-benchmark install all.

For rank-bm25, use the CLI installer so the pinned Git dependency is installed for you. PyPI rejects direct Git dependencies in package metadata, so the pin is kept in requirements-rank-bm25.txt and the CLI installer:

bm25-benchmark install rank
bm25-benchmark install all
bm25-benchmark install rank --installer uv

Usage

The benchmark package exposes one CLI for running evals:

bm25-benchmark --help
bm25-benchmark models
bm25-benchmark datasets

Run an eval by choosing a backend and dataset:

bm25-benchmark eval bm25s -d fiqa
bm25-benchmark eval rank-bm25 -d fiqa --samples 1000
bm25-benchmark eval pyserini -d fiqa --threads 4
bm25-benchmark eval elastic -d fiqa --hostname localhost
bm25-benchmark eval pisa -d fiqa
bm25-benchmark eval bm25-pt -d fiqa --batch-size 32

Common eval options include:

bm25-benchmark eval bm25s -d fiqa -d scifact --result-dir results --save-dir datasets
bm25-benchmark eval bm25s -d fiqa,scifact --num-runs 3
bm25-benchmark eval bm25s -d fiqa --dry-run

Use bm25-benchmark eval <backend> --help to see backend-specific options.
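
To script several runs, the commands above can be assembled programmatically. The following is a minimal sketch that uses only flags shown in this section (-d, --num-runs, --dry-run). The helper name build_eval_command is hypothetical and not part of the package, and it assumes these flags can be combined freely:

```python
import shlex


def build_eval_command(backend, datasets, num_runs=None, dry_run=False):
    """Assemble a `bm25-benchmark eval` argument list (hypothetical helper)."""
    cmd = ["bm25-benchmark", "eval", backend]
    for name in datasets:
        # -d may be repeated, as in `-d fiqa -d scifact` above
        cmd += ["-d", name]
    if num_runs is not None:
        cmd += ["--num-runs", str(num_runs)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


commands = [
    build_eval_command("bm25s", ["fiqa", "scifact"], num_runs=3),
    build_eval_command("pisa", ["fiqa"], dry_run=True),
]
for cmd in commands:
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when it is later handed to subprocess.run.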

The module form is also available:

python -m benchmark eval bm25s -d fiqa

Running Benchmarks

BM25S options

For bm25s, you can specify which scoring methods and retrieval backends to benchmark:

# Default: runs the jit scorer and numba backend
bm25-benchmark eval bm25s -d fiqa

# Specify scorers (uncompiled, legacy, jit)
bm25-benchmark eval bm25s -d fiqa --scorers legacy jit
bm25-benchmark eval bm25s -d fiqa --scorers jit
bm25-benchmark eval bm25s -d fiqa --scorers uncompiled legacy jit

# Specify backends (jax, numba, numpy)
bm25-benchmark eval bm25s -d fiqa --backends numba
bm25-benchmark eval bm25s -d fiqa --backends jax numba numpy

# Combine both
bm25-benchmark eval bm25s -d fiqa --scorers jit --backends numba

Scorer options:

  • uncompiled: Default NumPy implementation (optimized with np.add.at)
  • legacy: Legacy implementation (similar to uncompiled, kept for comparison)
  • jit: Numba JIT-compiled version (fastest after warmup)

Backend options:

  • jax: JAX-based retrieval
  • numba: Numba JIT-compiled retrieval (default)
  • numpy: Pure NumPy retrieval
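
The uncompiled scorer is described above as relying on np.add.at. As a rough illustration of that idea (not the bm25s source code), the sketch below accumulates precomputed per-token scores into a dense per-document array. The toy postings, their scores, and the function name are made up for the example:

```python
import numpy as np

# Toy postings: for each token, the documents containing it and a
# precomputed BM25 score per (token, document) pair. Values are made up.
postings = {
    "cat": (np.array([0, 2]), np.array([1.2, 0.7])),
    "sat": (np.array([2, 3]), np.array([0.5, 0.9])),
}


def score_query(query_tokens, postings, num_docs):
    """Sum per-token scores into one score per document."""
    scores = np.zeros(num_docs, dtype=np.float64)
    for token in query_tokens:
        if token not in postings:
            continue  # unseen tokens contribute nothing
        doc_ids, token_scores = postings[token]
        # Unbuffered in-place add: repeated indices are accumulated
        # rather than overwritten.
        np.add.at(scores, doc_ids, token_scores)
    return scores


scores = score_query(["cat", "sat", "dog"], postings, num_docs=4)
# Stable sort so that ties keep the lower document index first.
top = np.argsort(-scores, kind="stable")[:2]
print(scores, top)
```

Document 2 matches both query tokens, so its two contributions are summed, while document 1 matches none and keeps a score of zero.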

Available datasets

The available datasets are public BEIR datasets: trec-covid, nfcorpus, fiqa, arguana, webis-touche2020, quora, scidocs, scifact, cqadupstack, nq, msmarco, hotpotqa, dbpedia-entity, fever, climate-fever.

Sampling during benchmarking

Because rank-bm25 has a long runtime, you can evaluate it on a sample of the queries:

bm25-benchmark eval rank-bm25 -d "<dataset>" --samples <num_samples>
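
Conceptually, sampling restricts evaluation to a random subset of the query set. The following is a minimal sketch of that idea, assuming a fixed seed for repeatability. The function name, the seed, and the toy queries are illustrative, and the CLI's actual sampling logic may differ:

```python
import random


def sample_queries(queries, num_samples, seed=42):
    """Return a reproducible random subset of a {query_id: text} mapping."""
    if num_samples >= len(queries):
        return dict(queries)  # nothing to sample; keep every query
    rng = random.Random(seed)  # local RNG, leaves global state untouched
    chosen = rng.sample(sorted(queries), num_samples)
    return {qid: queries[qid] for qid in chosen}


queries = {f"q{i}": f"query text {i}" for i in range(10)}
subset = sample_queries(queries, num_samples=3)
print(sorted(subset))
```

Sorting the query IDs before sampling makes the subset independent of dictionary insertion order, so two runs with the same seed evaluate the same queries.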

Rank-bm25 variants

For rank-bm25, you can also select the variant with --method:

  • rank (default)
  • bm25l
  • bm25+

Results are saved in the results/ directory.

Elasticsearch server

To use Elasticsearch, you need to start the server first.

First, download Elasticsearch (for example with the wget command below). You will get a file, e.g. elasticsearch-8.14.0-linux-x86_64.tar.gz. Extract the file and ensure the extracted directory sits in the same directory as the bm25-benchmarks directory.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-8.14.0-linux-x86_64.tar.gz
# remove the tar file
rm elasticsearch-8.14.0-linux-x86_64.tar.gz

Then, start the server with the following command:

./elasticsearch-8.14.0/bin/elasticsearch -E xpack.security.enabled=false -E thread_pool.search.size=1 -E thread_pool.write.size=1
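
Before launching an eval against Elasticsearch, it can help to confirm that the server is accepting connections. The sketch below polls the HTTP endpoint using only the standard library. It assumes Elasticsearch's default port 9200 and plain HTTP (security is disabled in the command above); the function name and retry settings are illustrative:

```python
import time
import urllib.request


def wait_for_server(url="http://localhost:9200", retries=30, delay=2.0, timeout=2.0):
    """Poll `url` until it answers with HTTP 200; return True on success."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError, timeouts, and refused connections all derive from OSError
            pass
        if attempt < retries - 1:
            time.sleep(delay)  # give the server time to finish starting
    return False
```

A script could call wait_for_server() once and abort with a clear message if it returns False, instead of failing partway through indexing.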

Results

The results are benchmarked using Kaggle notebooks to ensure reproducibility. Each one is run on a single core of an Intel Xeon CPU @ 2.20GHz, with 30GB of RAM.

The shorthands used are:

  • BM25PT (PT in the tables) for bm25_pt
  • PSRN for pyserini
  • R-BM25 (Rank in some tables) for rank-bm25
  • BM25S for bm25s, and BM25S+J for the Numba JIT version of bm25s (v0.2.0+)
  • ES for elasticsearch
  • PISA for the Pisa Engine (via the pyterrier_pisa Python bindings)
  • OOM for out-of-memory error
  • DNT for did not terminate (i.e. went over 12 hours)

Queries per second

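| dataset | PISA | BM25S+J | BM25S | ES | PSRN | PT | R-BM25 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| arguana | 270.53 | 869.95 | 573.91 | 13.67 | 11.95 | 110.51 | 2 |
| climate-fever | 35.95 | 38.49 | 13.09 | 4.02 | 8.06 | OOM | 0.03 |
| cqadupstack | 362.39 | 396.5 | 170.91 | 13.38 | DNT | OOM | 0.77 |
| dbpedia-entity | 197.45 | 71.8 | 13.44 | 10.68 | 12.69 | OOM | 0.11 |
| fever | 81.42 | 53.84 | 20.19 | 7.45 | 10.52 | OOM | 0.06 |
| fiqa | 714.35 | 1237.39 | 717.78 | 16.96 | 12.51 | 20.52 | 4.46 |
| hotpotqa | 54.98 | 47.16 | 20.88 | 7.11 | 10.41 | OOM | 0.04 |
| msmarco | 178.65 | 39.18 | 12.2 | 11.88 | 11.01 | OOM | 0.07 |
| nfcorpus | 5111.72 | 5696.21 | 1196.16 | 45.84 | 32.94 | 256.67 | 224.66 |
| nq | 168.12 | 109.47 | 41.85 | 12.16 | 11.04 | OOM | 0.1 |
| quora | 735.20 | 479.71 | 272.04 | 21.8 | 15.58 | 6.49 | 1.18 |
| scidocs | 818.97 | 1448.32 | 767.05 | 17.93 | 14.1 | 41.34 | 9.01 |
| scifact | 1463.73 | 2787.84 | 1317.12 | 20.81 | 15.02 | 184.3 | 47.6 |
| trec-covid | 282.94 | 483.84 | 85.64 | 7.34 | 8.53 | 3.73 | 1.48 |
| webis-touche2020 | 431.12 | 390.03 | 60.59 | 13.53 | 12.36 | OOM | 1.1 |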

Notes:

  • For Rank-BM25, larger datasets are run with 1000 samples rather than the full dataset so that each run finishes within 12 hours (the limit for Kaggle notebooks).
  • For ES and BM25S, the number of threads can be set. However, you might not see an improvement; for BM25S, throughput might even decrease because of how multi-threading is implemented. The 4-thread results are listed below.
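
BM25S & ES multi-threaded (4T) performance (Q/s):

| dataset | PISA | BM25S | ES |
| --- | --- | --- | --- |
| arguana | 590.93 | 211 | 33.37 |
| climate-fever | 91.68 | 22.06 | 8.13 |
| cqadupstack | 945.66 | 248.87 | 27.76 |
| dbpedia-entity | 478.26 | 26.18 | 15.49 |
| fever | 222.08 | 47.03 | 14.07 |
| fiqa | 1382.32 | 449.82 | 36.33 |
| hotpotqa | 134.60 | 45.02 | 10.35 |
| msmarco | 393.16 | 21.64 | 18.19 |
| nfcorpus | 6706.53 | 784.24 | 81.07 |
| nq | 423.54 | 77.49 | 19.18 |
| quora | 1892.98 | 308.58 | 43.02 |
| scidocs | 1757.44 | 614.23 | 46.36 |
| scifact | 2480.86 | 645.88 | 50.93 |
| trec-covid | 676.40 | 100.88 | 13.5 |
| webis-touche2020 | 938.57 | 202.39 | 26.55 |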
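
Normalized table with respect to Rank-BM25:

| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| arguana | 135.27 | 286.96 | 6.84 | 5.98 | 55.26 | 1 |
| climate-fever | 1198.33 | 436.33 | 134 | 268.67 | nan | 1 |
| cqadupstack | 470.64 | 221.96 | 17.38 | nan | nan | 1 |
| dbpedia-entity | 1795.00 | 122.18 | 97.09 | 115.36 | nan | 1 |
| fever | 1357.00 | 336.5 | 124.17 | 175.33 | nan | 1 |
| fiqa | 160.17 | 160.94 | 3.8 | 2.8 | 4.6 | 1 |
| hotpotqa | 1374.50 | 522 | 177.75 | 260.25 | nan | 1 |
| msmarco | 2552.14 | 174.29 | 169.71 | 157.29 | nan | 1 |
| nfcorpus | 22.75 | 5.32 | 0.2 | 0.15 | 1.14 | 1 |
| nq | 1681.20 | 418.5 | 121.6 | 110.4 | nan | 1 |
| quora | 623.05 | 230.54 | 18.47 | 13.2 | 5.5 | 1 |
| scidocs | 90.90 | 85.13 | 1.99 | 1.56 | 4.59 | 1 |
| scifact | 30.75 | 27.67 | 0.44 | 0.32 | 3.87 | 1 |
| trec-covid | 191.18 | 57.86 | 4.96 | 5.76 | 2.52 | 1 |
| webis-touche2020 | 391.93 | 55.08 | 12.3 | 11.24 | nan | 1 |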
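
The normalized figures are each library's queries per second divided by the Rank-BM25 value for the same dataset. A small sketch reproducing the fiqa row from the single-threaded numbers reported earlier (the function name is illustrative):

```python
# Queries per second for fiqa, copied from the single-threaded results above.
fiqa_qps = {
    "PISA": 714.35,
    "BM25S": 717.78,
    "ES": 16.96,
    "PSRN": 12.51,
    "PT": 20.52,
    "Rank": 4.46,
}


def normalize(qps, baseline="Rank"):
    """Divide each throughput by the baseline's throughput."""
    base = qps[baseline]
    return {name: round(value / base, 2) for name, value in qps.items()}


print(normalize(fiqa_qps))
```

Read this way, a value of 160 for fiqa means roughly 160 times the Rank-BM25 throughput on that dataset.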

Stats

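| dataset | # Docs | # Queries | # Tokens |
| --- | --- | --- | --- |
| msmarco | 8,841,823 | 6,980 | 340,859,891 |
| hotpotqa | 5,233,329 | 7,405 | 169,530,287 |
| trec-covid | 171,332 | 50 | 20,231,412 |
| webis-touche2020 | 382,545 | 49 | 74,180,340 |
| arguana | 8,674 | 1,406 | 947,470 |
| fiqa | 57,638 | 648 | 5,189,035 |
| nfcorpus | 3,633 | 323 | 614,081 |
| climate-fever | 5,416,593 | 1,535 | 318,190,120 |
| nq | 2,681,468 | 3,452 | 148,249,808 |
| scidocs | 25,657 | 1,000 | 3,211,248 |
| quora | 522,931 | 10,000 | 4,202,123 |
| dbpedia-entity | 4,635,922 | 400 | 162,336,256 |
| cqadupstack | 457,199 | 13,145 | 44,857,487 |
| fever | 5,416,568 | 6,666 | 318,184,321 |
| scifact | 5,183 | 300 | 812,074 |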

Indexing time (docs/s)

The following results follow the same setup as the queries/s benchmarks described above (single-core).

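| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| arguana | 3432.50 | 4314.79 | 3591.63 | 1225.18 | 638.1 | 5021.3 |
| climate-fever | 5462.73 | 4364.43 | 3825.89 | 6880.42 | nan | 7085.51 |
| cqadupstack | 3963.76 | 4800.89 | 3725.43 | nan | nan | 5370.32 |
| dbpedia-entity | 9019.62 | 7576.28 | 6333.82 | 8501.7 | nan | 9110.36 |
| fever | 4903.06 | 4921.88 | 3879.63 | 7007.5 | nan | 5482.64 |
| fiqa | 4426.92 | 5959.25 | 4035.11 | 3735.38 | 421.51 | 6455.53 |
| hotpotqa | 9883.85 | 7420.39 | 5455.6 | 10342.5 | nan | 9407.9 |
| msmarco | 10205.53 | 7480.71 | 5391.29 | 9686.07 | nan | 12455.9 |
| nfcorpus | 2381.11 | 3169.4 | 1688.15 | 692.05 | 442.2 | 3579.47 |
| nq | 7122.05 | 6083.86 | 5742.13 | 6652.33 | nan | 6048.85 |
| quora | 38512.02 | 28002.4 | 8189.75 | 22818.5 | 6251.26 | 47609.2 |
| scidocs | 3085.13 | 4107.46 | 3008.45 | 2137.64 | 312.72 | 4232.15 |
| scifact | 2449.91 | 3253.63 | 2649.57 | 880.53 | 442.61 | 3792.84 |
| trec-covid | 4642.59 | 4600.14 | 2966.98 | 3768.1 | 406.37 | 4672.62 |
| webis-touche2020 | 2228.10 | 2971.96 | 2484.87 | 2718.41 | nan | 3115.96 |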

NDCG@10

We use abbreviations for datasets of BEIR benchmarks.

Dataset abbreviations:
  • AG for arguana
  • CD for cqadupstack
  • CF for climate-fever
  • DB for dbpedia-entity
  • FQ for fiqa
  • FV for fever
  • HP for hotpotqa
  • MS for msmarco
  • NF for nfcorpus
  • NQ for nq
  • QR for quora
  • SD for scidocs
  • SF for scifact
  • TC for trec-covid
  • WT for webis-touche2020
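
| k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.9 | 0.4 | Lucene | 41.1 | 40.8 | 28.2 | 16.2 | 31.9 | 23.8 | 63.8 | 62.9 | 22.8 | 31.8 | 30.5 | 78.7 | 15.0 | 67.6 | 58.9 | 44.2 |
| 1.2 | 0.75 | ATIRE | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
| 1.2 | 0.75 | BM25+ | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
| 1.2 | 0.75 | BM25L | 39.5 | 49.6 | 29.8 | 13.5 | 29.4 | 25.0 | 46.6 | 55.9 | 21.4 | 32.2 | 28.1 | 80.3 | 15.8 | 68.7 | 62.9 | 33.0 |
| 1.2 | 0.75 | Lucene | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.0 | 61.0 | 33.2 |
| 1.2 | 0.75 | Robertson | 39.9 | 49.2 | 29.9 | 13.7 | 30.3 | 25.4 | 50.3 | 58.5 | 22.6 | 31.9 | 29.2 | 80.4 | 15.5 | 68.3 | 59.0 | 33.8 |
| 1.5 | 0.75 | ES | 42.0 | 47.7 | 29.8 | 17.8 | 31.1 | 25.3 | 62.0 | 58.6 | 22.1 | 34.4 | 31.6 | 80.6 | 16.3 | 69.0 | 68.0 | 35.4 |
| 1.5 | 0.75 | Lucene | 39.7 | 49.3 | 29.9 | 13.6 | 29.9 | 25.1 | 48.1 | 56.9 | 21.9 | 32.1 | 28.5 | 80.4 | 15.8 | 68.7 | 62.3 | 33.1 |
| 1.5 | 0.75 | PSRN | 40.0 | 48.4 | 29.8 | 14.2 | 30.0 | 25.3 | 50.0 | 57.6 | 22.1 | 32.6 | 28.6 | 80.6 | 15.6 | 68.8 | 63.4 | 33.5 |
| 1.5 | 0.75 | PT | 45.0 | 44.9 | -- | -- | -- | 22.5 | -- | -- | -- | 31.9 | -- | 75.1 | 14.7 | 67.8 | 58.0 | -- |
| 1.5 | 0.75 | Rank | 39.6 | 49.5 | 29.6 | 13.6 | 29.9 | 25.3 | 49.3 | 58.1 | 21.1 | 32.1 | 28.5 | 80.3 | 15.8 | 68.5 | 60.1 | 32.9 |
| 1.2 | 0.75 | PISA | 38.8 | 41.1 | 27.8 | 13.9 | 30.5 | 24.5 | 49.2 | 58.2 | 22.8 | 34.3 | 28.2 | 72.0 | 15.7 | 68.9 | 64.2 | 30.9 |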

Recall@1000

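| k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.9 | 0.4 | Lucene | 77.3 | 98.8 | 71.1 | 63.3 | 67.5 | 74.3 | 95.7 | 88.0 | 85.3 | 47.7 | 89.6 | 99.5 | 56.5 | 97.0 | 39.2 | 86.0 |
| 1.2 | 0.75 | ATIRE | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | BM25+ | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | BM25L | 77.2 | 99.4 | 73.4 | 57.3 | 66.1 | 77.3 | 93.7 | 85.7 | 85.0 | 47.7 | 89.3 | 99.5 | 57.7 | 97.0 | 40.8 | 87.5 |
| 1.2 | 0.75 | Lucene | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.6 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | Robertson | 77.4 | 99.3 | 73.2 | 59.1 | 66.7 | 76.8 | 94.2 | 86.8 | 85.9 | 47.5 | 89.8 | 99.5 | 57.3 | 96.7 | 40.2 | 87.4 |
| 1.5 | 0.75 | ES | 76.9 | 99.2 | 74.2 | 58.8 | 63.6 | 76.7 | 95.9 | 85.2 | 85.1 | 39.0 | 90.8 | 99.6 | 57.9 | 98.0 | 41.3 | 88.0 |
| 1.5 | 0.75 | Lucene | 77.2 | 99.3 | 73.3 | 57.8 | 66.3 | 77.2 | 93.8 | 86.1 | 85.2 | 47.7 | 89.5 | 99.6 | 57.5 | 97.0 | 40.6 | 87.4 |
| 1.5 | 0.75 | PSRN | 76.7 | 99.2 | 74.2 | 58.7 | 66.2 | 76.7 | 94.2 | 86.4 | 85.1 | 37.1 | 89.4 | 99.6 | 57.4 | 97.7 | 41.1 | 87.2 |
| 1.5 | 0.75 | PT | 73.0 | 98.3 | -- | -- | -- | 72.5 | -- | -- | -- | 51.0 | -- | 98.9 | 56.0 | 97.8 | 36.3 | -- |
| 1.5 | 0.75 | Rank | 77.1 | 99.4 | 73.4 | 57.5 | 66.4 | 77.4 | 93.6 | 87.7 | 82.6 | 47.6 | 89.5 | 99.5 | 57.4 | 96.7 | 40.5 | 87.5 |
| 1.2 | 0.75 | PISA | 77.1 | 98.7 | 72.2 | 60.2 | 67.7 | 76.5 | 93.7 | 86.8 | 86.9 | 38.4 | 89.1 | 98.9 | 56.9 | 97.0 | 45.9 | 87.4 |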
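
For readers unfamiliar with the two quality metrics reported above, the sketch below computes NDCG@k and Recall@k for a single query. It assumes linear gains and a log2 rank discount, a common trec_eval-style definition; the benchmark's own evaluation code is not shown in this document and may differ in details. The document IDs and relevance labels are invented:

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains in rank order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranking, qrels, k=10):
    """NDCG@k for one query; qrels maps doc_id -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranking, qrels, k=1000):
    """Fraction of relevant documents found in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranking[:k])) / len(relevant)


qrels = {"d1": 1, "d3": 1}       # invented relevance judgments
ranking = ["d1", "d2", "d3"]     # invented retrieval output, best first
print(round(ndcg_at_k(ranking, qrels), 4), recall_at_k(ranking, qrels, k=2))
```

The values in the tables appear to be such per-query scores averaged over each dataset's queries and expressed as percentages.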

Legacy Module Entry Points

Prefer the bm25-benchmark eval ... CLI for new runs. The older module entry points are still available for existing scripts:

# For bm25_pt
python -m benchmark.on_bm25_pt -d "<dataset>"

# For rank-bm25
python -m benchmark.on_rank_bm25 -d "<dataset>"

# For Pyserini
python -m benchmark.on_pyserini -d "<dataset>"

# For elastic, after starting the server, run:
python -m benchmark.on_elastic -d "<dataset>"

# For PISA
python -m benchmark.on_pisa -d "<dataset>"

# For bm25s
python -m benchmark.on_bm25s -d "<dataset>"

where <dataset> is the name of the dataset to be used.