Anserini: Dense Retrieval for MS MARCO Passage Ranking

May 31, 2026

If you're a Waterloo student traversing the onboarding path, start here. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.

Learning outcomes for this guide, building on previous lessons in the onboarding path:

Be able to use Anserini prebuilt indexes to skip indexing, for both BM25 and dense retrieval.
Be able to use Anserini to perform a batch retrieval run using a dense retrieval model.

Repeating Retrieval with Prebuilt Indexes

In the previous lesson, you learned that indexing and retrieval are two distinct phases. Indexing only needs to be done once, and once it's done we can perform retrieval on as many queries as we'd like. Of course, if the document collection changes, we'll need to modify the index, but nearly all collections used for research are static.

This is where prebuilt indexes come in: Instead of making everyone build their own indexes, we can directly share indexes that have already been built. Take the time to read this guide about prebuilt indexes.

Here's the same retrieval run that you've done before, on the MS MARCO passage collection with the dev queries, but now using a prebuilt index:

bin/run.sh io.anserini.search.SearchCollection \
  -index msmarco-v1-passage \
  -topics collections/msmarco-passage/queries.dev.small.tsv \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.dev.bm25.txt \
  -parallelism 4 \
  -bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000

Instead of passing the path to an index in the -index parameter, we specify the name of an index that Anserini already "knows about". Anserini downloads the index from a known location on UWaterloo servers, and stores a copy in ~/.cache/pyserini/indexes. Go ahead and confirm it's there.
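One way to confirm the download, as a minimal sketch: list the cache directory and report its total size. The path is the one mentioned above; the variable name INDEX_CACHE is only for illustration.

```shell
# Location where Anserini stores downloaded prebuilt indexes (as stated above).
INDEX_CACHE="${HOME}/.cache/pyserini/indexes"

if [ -d "$INDEX_CACHE" ]; then
  # Show the downloaded indexes and how much disk space they use in total.
  ls -lh "$INDEX_CACHE"
  du -sh "$INDEX_CACHE"
else
  echo "No prebuilt indexes found at $INDEX_CACHE yet"
fi
```

If the directory is missing, the retrieval command above has probably not been run yet, or the download did not complete.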

The complete list of prebuilt indexes (and where to find them) is in the class IndexInfo.

We can then evaluate the run with the trec_eval tool. Let's compute the MRR@10 score, which is the official metric:

bin/trec_eval -c -M 10 -m recip_rank \
  collections/msmarco-passage/qrels.dev.small.trec \
  runs/run.msmarco-passage.dev.bm25.txt

The MRR@10 should be 0.1875.

There's a tiny bit of difference between this result and the one from the previous lesson. Previously, we used -format msmarco to generate the output in the MS MARCO format, which we then converted into the TREC format before evaluating. This conversion is lossy and causes slight score differences due to tie-breaking effects (i.e., what happens when two documents are tied in terms of score).
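To make the metric concrete, here is a small sketch that computes MRR@10 with awk over a toy run and toy qrels in TREC format. The files, document ids, and scores are invented for illustration, and the script is not how trec_eval is implemented: it trusts the rank column of the run file, whereas trec_eval orders results by score, which is where tie-breaking effects can creep in.

```shell
# Toy data, invented for illustration (not from MS MARCO).
tmp=$(mktemp -d)

# qrels format: query-id, unused, doc-id, relevance
cat > "$tmp/qrels.txt" <<'EOF'
1 0 d3 1
2 0 d7 1
3 0 d9 1
EOF

# TREC run format: query-id, Q0, doc-id, rank, score, tag
cat > "$tmp/run.txt" <<'EOF'
1 Q0 d1 1 9.0 toy
1 Q0 d3 2 8.0 toy
2 Q0 d7 1 7.5 toy
2 Q0 d2 2 7.0 toy
3 Q0 d4 1 6.0 toy
EOF

mrr=$(awk '
  # First file (qrels): remember relevant (query, doc) pairs and all query ids.
  NR == FNR { queries[$1] = 1; if ($4 > 0) rel[$1, $3] = 1; next }
  # Second file (run): keep the best reciprocal rank within the top 10.
  $4 <= 10 && (($1, $3) in rel) { r = 1 / $4; if (r > rr[$1]) rr[$1] = r }
  END {
    # Queries with no relevant hit in the top 10 contribute 0.
    for (q in queries) { n++; sum += rr[q] }
    printf "%.4f\n", sum / n
  }
' "$tmp/qrels.txt" "$tmp/run.txt")

echo "MRR@10 = $mrr"
rm -rf "$tmp"
```

In this toy example, query 1 finds its relevant passage at rank 2 (reciprocal rank 0.5), query 2 at rank 1 (1.0), and query 3 misses entirely (0), so the mean is 0.5.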

Retrieval with Dense Indexes

Next, we're going to look at retrieval using dense vector representations (or just dense vectors). This is also called dense retrieval or vector search. We'll learn more about how they work later in the onboarding path in Pyserini, but for now, let's perform a retrieval run, using the same queries on the same collection.

Here, we are using the BGE-base model. Retrieval using dense vectors requires different indexes. In this case, we're using what is known as an HNSW index (in contrast to the inverted index that you used for BM25).

Putting everything together, the complete retrieval command is:

bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index msmarco-v1-passage.bge-base-en-v1.5.hnsw \
  -topics collections/msmarco-passage/queries.dev.small.tsv \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.dev.bge.txt \
  -encoder BgeBaseEn15 -hits 1000 -threads 4

Note: If you encounter encoder loading errors, clear the cache at ~/.cache/pyserini/encoders.

Instead of SearchCollection, we use SearchHnswDenseVectors since it's a different type of index. We are using a prebuilt index, specified as -index msmarco-v1-passage.bge-base-en-v1.5.hnsw. The above retrieval command automatically downloads the HNSW index for the MS MARCO passage collection. Beware, it's 26 GB.
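Given the size of the download, it may be worth checking free disk space first. The following is a rough sketch, assuming that "26 GB" can be approximated as 26 GiB and that the cache directory lives on the same filesystem as your home directory; the download may also need extra working space beyond the final index size.

```shell
# Space needed for the index, in 1 KiB blocks (assumes 26 GB is roughly 26 GiB).
required_kb=$((26 * 1024 * 1024))

# Free space on the filesystem holding the home directory (POSIX df output, column 4).
available_kb=$(df -Pk "${HOME:-.}" | awk 'NR == 2 { print $4 }')

if [ "$available_kb" -ge "$required_kb" ]; then
  echo "OK: ${available_kb} KiB available, ${required_kb} KiB needed"
else
  echo "Warning: only ${available_kb} KiB available, ${required_kb} KiB needed"
fi
```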

For reference, on a circa 2022 MacBook Air with an Apple M2 processor and 24 GB RAM, the retrieval run takes around 4 minutes.

Let's compute the MRR@10 score:

bin/trec_eval -c -M 10 -m recip_rank \
  collections/msmarco-passage/qrels.dev.small.trec \
  runs/run.msmarco-passage.dev.bge.txt

You should get a score of 0.3521, which is much higher than the 0.1875 score from BM25. Yes, dense retrieval is better.

Wrapping Up

As a next step in the onboarding path, you basically do the same thing again in Python with Pyserini (as opposed to Java with Anserini here).

Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-character hexadecimal prefix for the link anchor text. In the description of your pull request, please provide some details on your setup (e.g., operating system, environment, and configuration). Also provide some indication of success (e.g., everything worked) or document any issues you encountered. If you think this guide can be improved in any way (e.g., you caught a typo or think a clarification is warranted), feel free to include it in the pull request.
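The pieces of a log entry can be gathered from the command line. This is a minimal sketch using standard date and git commands, meant to be run inside your Anserini clone. The handle "your-github-handle" is a placeholder, and the printed line is only the plain-text form of an entry: copy the link markup from the existing entries in the page source.

```shell
# Today's date in the yyyy-mm-dd format used by the log.
today=$(date +%Y-%m-%d)

# Short prefix of the checked-out commit (7 characters unless git needs more to be unambiguous).
if commit=$(git rev-parse --short=7 HEAD 2>/dev/null); then
  # List the remote branches that contain this commit, to check it is on the main trunk.
  git branch --remotes --contains "$commit"
  echo "Results reproduced by @your-github-handle on ${today} (commit ${commit})"
else
  echo "Could not read a commit id; run this inside your Anserini clone."
fi
```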

Reproduction Log

Results reproduced by @b8zhong on 2025-02-23 (commit daceb40)
Results reproduced by @lilyjge on 2025-02-23 (commit 9b13fe4)
Results reproduced by @JJGreen0 on 2025-04-19 (commit 2d8674c)
Results reproduced by @ricky42613 on 2025-04-25 (commit adce4e3)
Results reproduced by @lzguan on 2025-04-30 (commit 4d94f55)
Results reproduced by @Yaohui2019 on 2025-05-02 (commit 4d94f55)
Results reproduced by @karush17 on 2025-05-09 (commit b21b7da)
Results reproduced by @YousefNafea on 2025-05-10 (commit b21b7da)
Results reproduced by @AnthonyZ0425 on 2025-05-13 (commit 83b7b45)
Results reproduced by @luisvenezian on 2025-05-15 (commit bd4c3c7)
Results reproduced by @MINGYISU on 2025-05-14 (commit bd4c3c7)
Results reproduced by @Armd04 on 2025-05-16 (commit bd4c3c7)
Results reproduced by @Cassidy-Li on 2025-05-20 (commit a6fe05c)
Results reproduced by @James-Begin on 2025-05-21 (commit a6fe05c)
Results reproduced by @Roselynzzz on 2025-05-26 (commit ef25129)
Results reproduced by @AnnieZhang2 on 2025-05-28 (commit bd4c3c7)
Results reproduced by @Vik7am10 on 2025-06-03 (commit b216a5f)
Results reproduced by @kevin-zkc on 2025-06-05 (commit 173312d)
Results reproduced by @YuvaanshKapila on 2025-06-08 (commit 17bd9ac)
Results reproduced by @erfan-yazdanparast on 2025-06-08 (commit 2201471)
Results reproduced by @nahalhz on 2025-06-09 (commit c07454b)
Results reproduced by @sadlulu on 2025-06-17 (commit 2e6a58b)
Results reproduced by @goodzcyabc on 2025-06-20 (commit c07454b)
Results reproduced by @adefioye on 2025-06-26 (commit 0299e2a)
Results reproduced by @hari495 on 2025-06-31 (commit 0299e2a)
Results reproduced by @suraj-subrahmanyan on 2025-07-02 (commit 017841f)
Results reproduced by @ed-ward-huang on 2025-07-06 (commit 4039c30)
Results reproduced by @OmarKhaled0K on 2025-07-08 (commit dafa81e)
Results reproduced by @mindlesstruffle on 2025-07-09 (commit b3ab936)
Results reproduced by @niruhan on 2025-07-17 (commit d6a8b36)
Results reproduced by @br0mabs on 2025-07-19 (commit b3ab936)
Results reproduced by @bikram993298 on 2025-08-19 (commit c6ea078)
Results reproduced by @JoshElkind on 2025-08-24 (commit 7c3010f)
Results reproduced by @Dinesh7K on 2025-09-03 (commit e7cb101)
Results reproduced by @CereNova on 2025-09-05 (commit b01c121)
Results reproduced by @NathanNCN on 2025-09-06 (commit 9e39f94)
Results reproduced by @ShivamSingal on 2025-09-07 (commit f59c8ee)
Results reproduced by @FarmersWrap on 2025-09-10 (commit 676af03)
Results reproduced by @shreyaadritabanik on 2025-09-10 (commit 3112afd)
Results reproduced by @k464wang on 2025-09-18 (commit df3b06c)
Results reproduced by @mahdi-behnam on 2025-09-19 (commit 6799a7d)
Results reproduced by @InanSyed on 2025-09-23 (commit 52265a6)
Results reproduced by @rashadjn on 2025-09-19 (commit a92e25c)
Results reproduced by @samin-mehdizadeh on 2025-09-27 (commit a92e25c)
Results reproduced by @AniruddhThakur on 2025-09-27 (commit eeb7756)
Results reproduced by @prav0761 on 2025-10-13 (commit 4a2f9a0)
Results reproduced by @henry4516 on 2025-10-14 (commit 338ac0e)
Results reproduced by @yazdanzv on 2025-10-15 (commit e011b38)
Results reproduced by @royary on 2025-10-23 (commit 7302ce9)
Results reproduced by @Raptors65 on 2025-10-23 (commit 7302ce9)
Results reproduced by @LiHuua258 on 2025-10-24 (commit 7302ce9)
Results reproduced by @RichHene on 2025-10-24 (commit 7302ce9)
Results reproduced by @ivan-0862 on 2025-10-25 (commit 7fc1b57)
Results reproduced by @brandonzhou2002 on 2025-10-26 (commit 5c7d986)
Results reproduced by @MahdiNoori2003 on 2025-10-28 (commit 5c7d986)
Results reproduced by @minj22 on 2025-11-04 (commit c274dd8)
Results reproduced by @ipouyall on 2025-11-05 (commit 82dfba0)
Results reproduced by @AdrianGri on 2025-11-12 (commit c19077b)
Results reproduced by @jianxyou on 2025-11-17 (commit 9406dd8)
Results reproduced by @xincanfeng on 2025-11-18 (commit 9406dd8)
Results reproduced by @Blank9999 on 2025-11-18 (commit 9406dd8)
Results reproduced by @ball2004244 on 2025-11-23 (commit 9aea5f3)
Results reproduced by @Hasebul21 on 2025-11-27 (commit 9aea5f3)
Results reproduced by @RudraMantri123 on 2025-11-28 (commit 9aea5f3)
Results reproduced by @imishrr on 2025-12-01 (commit 79e7777)
Results reproduced by @MehdiJmlkh on 2025-12-08 (commit 259d483)
Results reproduced by @Kushion32 on 2025-12-09 (commit 3e65fbd)
Results reproduced by @anjanpa on 2025-12-17 (commit 1d062ef)
Results reproduced by @MuhammadAli13562 on 2025-12-18 (commit 68311a1)
Results reproduced by @Hossein-Molaeian on 2025-12-19 (commit b64fba2)
Results reproduced by @FayizMohideen on 2025-12-21 (commit 1c5cd32)
Results reproduced by @nli33 on 2025-12-22 (commit 1c5cd32)
Results reproduced by @VarnitOS on 2025-12-26 (commit 1c5cd32)
Results reproduced by @zizimind on 2026-01-06 (commit d276b57)
Results reproduced by @izzat5233 on 2026-01-17 (commit 5bda670)
Results reproduced by @HusamIsied on 2026-01-25 (commit 952ac5e4)
Results reproduced by @aaryanshroff on 2026-01-26 (commit 952ac5e)
Results reproduced by @maherapp on 2026-02-01 (commit f0ecf565)
Results reproduced by @Karrrthik on 2026-02-20 (commit 792d9cb)
Results reproduced by @namatvir on 2026-02-28 (commit 2133d40)
Results reproduced by @aaliyan1230 on 2026-03-01 (commit 6f7d205)
Results reproduced by @VCY019 on 2026-03-01 (commit 6f7d205)
Results reproduced by @raghav-ai on 2026-03-03 (commit f496153)
Results reproduced by @mahdijaf on 2026-03-13 (commit 61c8beb)
Results reproduced by @justin13888 on 2026-03-18 (commit f2ae641e)
Results reproduced by @KhanShaheb34 on 2026-03-20 (commit a64b6e4)
Results reproduced by @ShanaxWorld on 2026-03-27 (commit aaba107)
Results reproduced by @egekaya1 on 2026-03-29 (commit 9434de7)
Results reproduced by @h79yan on 2026-04-10 (commit e441bc83)
Results reproduced by @kwamearhinPORTFL on 2026-04-15 (commit 5c51ee5)
Results reproduced by @zatchbell1311-wq on 2026-04-17 (commit c6eed686)
Results reproduced by @Zixi-Sam-Tang on 2026-04-21 (commit 1e7b470)
Results reproduced by @Seun-Ajayi on 2026-04-25 (commit 2896644)
Results reproduced by @alex-wang101 on 2026-04-25 (commit 112cab4)
Results reproduced by @TahseenSust on 2026-04-28 (commit 912d7c3)
Results reproduced by @xiandadu on 2026-05-01 (commit 67e9fc4)
Results reproduced by @blissuche90 on 2026-05-01 (commit fb2fd258)
Results reproduced by @mohamedshakir3 on 2026-05-02 (commit ec9cf56)
Results reproduced by @mazleon on 2026-05-03 (commit a74898d)
Results reproduced by @nasazzam on 2026-05-13 (commit 878b97d)
Results reproduced by @VanshJain4 on 2026-05-14 (commit deb4c7b)
Results reproduced by @Adebara123 on 2026-05-15 (commit 4d16490)
Results reproduced by @david23131 on 2026-05-17 (commit ced4918)
Results reproduced by @masud70 on 2026-05-23 (commit 878b97d)
Results reproduced by @ibot1 on 2026-05-23 (commit 878b97d)
Results reproduced by @ReyhanehAhani on 2026-05-25 (commit 11703ed)
Results reproduced by @grf932 on 2026-05-29 (commit 6663a15)
Results reproduced by @amulyabenarji777 on 2026-05-30 (commit 6663a15)
