Anserini: Dense Retrieval for MS MARCO Passage Ranking
May 31, 2026
If you're a Waterloo student traversing the onboarding path, make sure you've completed the previous lesson first, since this guide builds on it. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.
Learning outcomes for this guide, building on previous lessons in the onboarding path:
- Be able to use Anserini prebuilt indexes to skip indexing, for both BM25 and dense retrieval.
- Be able to use Anserini to perform a batch retrieval run using a dense retrieval model.
Repeating Retrieval with Prebuilt Indexes
In the previous lesson, you learned that indexing and retrieval are two distinct phases. Indexing only needs to be done once, and once it's done we can perform retrieval on as many queries as we'd like. Of course, if the document collection changes, we'll need to modify the index, but nearly all collections used for research are static.
This is where prebuilt indexes come in: instead of making everyone build their own indexes, we can directly share indexes that have already been built. Take the time to read this guide about prebuilt indexes.
Here's the same retrieval run that you've done before, on the MS MARCO passage collection with the dev queries, but now using a prebuilt index:
bin/run.sh io.anserini.search.SearchCollection \
  -index msmarco-v1-passage \
  -topics collections/msmarco-passage/queries.dev.small.tsv \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.dev.bm25.txt \
  -parallelism 4 \
  -bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000
Instead of passing the path to an index in the -index parameter, we specify the name of an index that Anserini already "knows about".
Anserini downloads the index from a known location on UWaterloo servers, and stores a copy in ~/.cache/pyserini/indexes.
Go ahead and confirm it's there.
The complete list of prebuilt indexes (and where to find them) is in the class IndexInfo.
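One way to confirm the download, sketched in Python: the helper below lists the subdirectories of a cache root. The function name `list_cached_indexes` is made up for this illustration, and the directory names you see will depend on which index version Anserini fetched.

```python
from pathlib import Path


def list_cached_indexes(cache_root: Path) -> list[str]:
    """Return the sorted names of subdirectories under cache_root.

    Hypothetical helper for illustration; returns [] if the directory
    does not exist (e.g., nothing has been downloaded yet).
    """
    if not cache_root.is_dir():
        return []
    return sorted(p.name for p in cache_root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Cache location mentioned in this guide.
    root = Path.home() / ".cache" / "pyserini" / "indexes"
    for name in list_cached_indexes(root):
        print(name)
```

An empty listing right after a successful retrieval run would suggest the index was stored somewhere else, which is worth investigating before moving on.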
We can then evaluate the run with the trec_eval tool.
Let's compute the MRR@10 score, which is the official metric:
bin/trec_eval -c -M 10 -m recip_rank \
  collections/msmarco-passage/qrels.dev.small.trec \
  runs/run.msmarco-passage.dev.bm25.txt
The MRR@10 should be 0.1875.
There's a tiny bit of difference between this result and the one from the previous lesson.
Previously, we used -format msmarco to generate the output in the MS MARCO format, which we then converted into the TREC format before evaluating.
This conversion is lossy and causes slight score differences due to tie-breaking effects (i.e., what happens when two documents are tied in terms of score).
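To see how tie-breaking can shift a score, here's a toy sketch in Python. It rests on two assumptions: that the evaluator ignores the rank column and re-sorts each query's hits by score, breaking ties by document id in descending order, and that the MS MARCO-to-TREC conversion writes 1/rank as the score. The ids and scores are invented, and the direction of the change on real data depends on which documents happen to be tied.

```python
def evaluator_order(hits):
    """Order hits as the assumed evaluator would: by score (descending),
    ties broken by docid (descending). The rank column is ignored."""
    # Two stable sorts: first by the secondary key, then by the primary key.
    by_docid = sorted(hits, key=lambda h: h["docid"], reverse=True)
    return sorted(by_docid, key=lambda h: h["score"], reverse=True)


def reciprocal_rank(ordered_hits, relevant, k=10):
    """Return 1/rank of the first relevant hit within the top k, else 0."""
    for rank, hit in enumerate(ordered_hits[:k], start=1):
        if hit["docid"] in relevant:
            return 1.0 / rank
    return 0.0


# The retrieval system's own ranking: docs "7" and "9" are tied on score,
# and the system happened to emit "7" first.
system_ranking = [
    {"docid": "3", "score": 12.5},
    {"docid": "7", "score": 11.0},
    {"docid": "9", "score": 11.0},
]
relevant = {"7"}

# Path 1: evaluate the TREC-format run directly; the tie gets re-broken.
direct = evaluator_order(system_ranking)

# Path 2: write MS MARCO format (rank only), then convert back to TREC
# format with score = 1/rank (assumed), which removes the tie.
converted = [
    {"docid": h["docid"], "score": 1.0 / rank}
    for rank, h in enumerate(system_ranking, start=1)
]
via_msmarco = evaluator_order(converted)

print(reciprocal_rank(direct, relevant))       # 0.3333333333333333
print(reciprocal_rank(via_msmarco, relevant))  # 0.5
```

The same retrieval results produce two different reciprocal ranks purely because of how the tie is resolved. Averaged over thousands of queries, effects like this show up as small differences in the fourth decimal place.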
Retrieval with Dense Indexes
Next, we're going to look at retrieval using dense vector representations (or just dense vectors). This is also called dense retrieval or vector search. We'll learn more about how they work later in the onboarding path in Pyserini, but for now, let's perform a retrieval run, using the same queries on the same collection.
Here, we are using the BGE-base model. Retrieval using dense vectors requires different indexes. In this case, we're using what is known as an HNSW index (in contrast to the inverted index that you used for BM25).
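To give a rough sense of what searching with dense vectors means, here's a toy sketch in Python: the query and each document are represented as a vector, and retrieval ranks documents by their inner product with the query. The three-dimensional vectors and ids are invented (BGE-base produces 768-dimensional vectors), and the exhaustive scan is only a stand-in: an HNSW index exists to approximate this top-k search without comparing the query against every document.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def brute_force_search(query_vec, doc_vecs, k=2):
    """Return the k (docid, score) pairs with the highest inner product.

    Exhaustive scan over all documents; an HNSW index approximates this
    result by walking a graph instead of scoring everything.
    """
    scored = ((docid, dot(query_vec, vec)) for docid, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dimensional "embeddings" (made up for illustration).
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.1],
    "d3": [0.7, 0.3, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

for docid, score in brute_force_search(query_vec, doc_vecs):
    print(docid, score)
```

Scanning every vector like this is fine for three documents but far too slow for millions of passages, which is why a specialized index structure is used instead.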
Putting everything together, the complete retrieval command is:
bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index msmarco-v1-passage.bge-base-en-v1.5.hnsw \
  -topics collections/msmarco-passage/queries.dev.small.tsv \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.dev.bge.txt \
  -encoder BgeBaseEn15 -hits 1000 -threads 4
Note: If you encounter encoder loading errors, clear the cache at ~/.cache/pyserini/encoders.
Instead of SearchCollection, we use SearchHnswDenseVectors since it's a different type of index.
We are using a prebuilt index, specified as -index msmarco-v1-passage.bge-base-en-v1.5.hnsw.
The above retrieval command automatically downloads the HNSW index for the MS MARCO passage collection.
Beware, it's 26 GB.
For reference, on a circa 2022 MacBook Air with an Apple M2 processor and 24 GB RAM, the retrieval run takes around 4 minutes.
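Given the download size, it can be worth checking free disk space before kicking off the run. A minimal Python sketch, assuming the index lands somewhere under your home directory and using the 26 GB figure above as the threshold; the helper name is hypothetical, and unpacking the index may need some headroom beyond the download itself.

```python
import shutil
from pathlib import Path

GB = 10**9  # decimal gigabytes


def has_free_space(path: Path, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_gb` gigabytes free. Hypothetical helper for illustration."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB


if __name__ == "__main__":
    home = Path.home()
    if has_free_space(home, 26):
        print("Enough free space for the 26 GB index.")
    else:
        print("Less than 26 GB free; clear some space before running.")
```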
Let's compute the MRR@10 score:
bin/trec_eval -c -M 10 -m recip_rank \
  collections/msmarco-passage/qrels.dev.small.trec \
  runs/run.msmarco-passage.dev.bge.txt
You should get a score of 0.3521, which is much higher than the 0.1875 we got from BM25 above. On this benchmark, at least, dense retrieval is clearly more effective.
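For intuition about what these numbers measure, here's a sketch of MRR@10 in Python: for each query, take the reciprocal of the rank of the first relevant passage in the top 10 (zero if there is none), then average over queries. It assumes the run has already been parsed into ranked docid lists and the qrels into sets of relevant docids. Averaging over every judged query, including ones missing from the run, is intended to mirror the -c flag in the commands above. The toy data is invented, and this is not a replacement for trec_eval.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank at cutoff k.

    run:   dict mapping qid -> list of docids, best first
    qrels: dict mapping qid -> set of relevant docids
    Averages over every query in qrels; a query missing from the run
    contributes 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        ranking = run.get(qid, [])
        for rank, docid in enumerate(ranking[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Toy data (invented): q1 hits at rank 2, q2 at rank 1, q3 has no hit.
run = {
    "q1": ["d5", "d2", "d9"],
    "q2": ["d1", "d4"],
    "q3": ["d7", "d8"],
}
qrels = {"q1": {"d2"}, "q2": {"d1"}, "q3": {"d3"}}

print(mrr_at_k(run, qrels))  # 0.5
```

Read this way, a jump from roughly 0.19 to 0.35 means the first relevant passage tends to appear noticeably closer to the top of the list.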
Wrapping Up
As a next step in the onboarding path, you'll do essentially the same thing again in Python with Pyserini (as opposed to Java with Anserini here).
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text.
In the description of your pull request, please provide some details on your setup (e.g., operating system, environment, and configuration).
Also provide some indication of success (e.g., everything worked) or document any issues you encountered.
If you think this guide can be improved in any way (e.g., you caught a typo or think a clarification is warranted), feel free to include it in the pull request.
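As a sketch of the entry format, the Python helper below validates the date and commit id and assembles a log line. The function name is hypothetical, and emitting plain text (rather than the linked form used on the actual page) is an assumption made for illustration; copy the exact markup from the existing entries when you edit the page.

```python
import re
from datetime import date


def format_log_entry(username: str, day: str, commit: str) -> str:
    """Build a reproduction-log line (hypothetical helper).

    day must be a real calendar date in yyyy-mm-dd form; commit must be
    a hexadecimal commit id of at least 7 characters, which is shortened
    to its 7-character prefix.
    """
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", day):
        raise ValueError(f"date must be yyyy-mm-dd: {day!r}")
    date.fromisoformat(day)  # raises ValueError for dates like 2025-02-30
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit):
        raise ValueError(f"not a commit id: {commit!r}")
    return f"- Results reproduced by @{username} on {day} (commit {commit[:7]})"


# Placeholder username and commit id, for illustration only.
print(format_log_entry("someuser", "2026-06-01", "0123456789abcdef"))
```

Running a check like this before opening the pull request catches impossible dates and over-long commit prefixes early.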
Reproduction Log
- Results reproduced by @b8zhong on 2025-02-23 (commit daceb40)
- Results reproduced by @lilyjge on 2025-02-23 (commit 9b13fe4)
- Results reproduced by @JJGreen0 on 2025-04-19 (commit 2d8674c)
- Results reproduced by @ricky42613 on 2025-04-25 (commit adce4e3)
- Results reproduced by @lzguan on 2025-04-30 (commit 4d94f55)
- Results reproduced by @Yaohui2019 on 2025-05-02 (commit 4d94f55)
- Results reproduced by @karush17 on 2025-05-09 (commit b21b7da)
- Results reproduced by @YousefNafea on 2025-05-10 (commit b21b7da)
- Results reproduced by @AnthonyZ0425 on 2025-05-13 (commit 83b7b45)
- Results reproduced by @luisvenezian on 2025-05-15 (commit bd4c3c7)
- Results reproduced by @MINGYISU on 2025-05-14 (commit bd4c3c7)
- Results reproduced by @Armd04 on 2025-05-16 (commit bd4c3c7)
- Results reproduced by @Cassidy-Li on 2025-05-20 (commit a6fe05c)
- Results reproduced by @James-Begin on 2025-05-21 (commit a6fe05c)
- Results reproduced by @Roselynzzz on 2025-05-26 (commit ef25129)
- Results reproduced by @AnnieZhang2 on 2025-05-28 (commit bd4c3c7)
- Results reproduced by @Vik7am10 on 2025-06-03 (commit b216a5f)
- Results reproduced by @kevin-zkc on 2025-06-05 (commit 173312d)
- Results reproduced by @YuvaanshKapila on 2025-06-08 (commit 17bd9ac)
- Results reproduced by @erfan-yazdanparast on 2025-06-08 (commit 2201471)
- Results reproduced by @nahalhz on 2025-06-09 (commit c07454b)
- Results reproduced by @sadlulu on 2025-06-17 (commit 2e6a58b)
- Results reproduced by @goodzcyabc on 2025-06-20 (commit c07454b)
- Results reproduced by @adefioye on 2025-06-26 (commit 0299e2a)
- Results reproduced by @hari495 on 2025-06-31 (commit 0299e2a)
- Results reproduced by @suraj-subrahmanyan on 2025-07-02 (commit 017841f)
- Results reproduced by @ed-ward-huang on 2025-07-06 (commit 4039c30)
- Results reproduced by @OmarKhaled0K on 2025-07-08 (commit dafa81e)
- Results reproduced by @mindlesstruffle on 2025-07-09 (commit b3ab936)
- Results reproduced by @niruhan on 2025-07-17 (commit d6a8b36)
- Results reproduced by @br0mabs on 2025-07-19 (commit b3ab936)
- Results reproduced by @bikram993298 on 2025-08-19 (commit c6ea078)
- Results reproduced by @JoshElkind on 2025-08-24 (commit 7c3010f)
- Results reproduced by @Dinesh7K on 2025-09-03 (commit e7cb101)
- Results reproduced by @CereNova on 2025-09-05 (commit b01c121)
- Results reproduced by @NathanNCN on 2025-09-06 (commit 9e39f94)
- Results reproduced by @ShivamSingal on 2025-09-07 (commit f59c8ee)
- Results reproduced by @FarmersWrap on 2025-09-10 (commit 676af03)
- Results reproduced by @shreyaadritabanik on 2025-09-10 (commit 3112afd)
- Results reproduced by @k464wang on 2025-09-18 (commit df3b06c)
- Results reproduced by @mahdi-behnam on 2025-09-19 (commit 6799a7d)
- Results reproduced by @InanSyed on 2025-09-23 (commit 52265a6)
- Results reproduced by @rashadjn on 2025-09-19 (commit a92e25c)
- Results reproduced by @samin-mehdizadeh on 2025-09-27 (commit a92e25c)
- Results reproduced by @AniruddhThakur on 2025-09-27 (commit eeb7756)
- Results reproduced by @prav0761 on 2025-10-13 (commit 4a2f9a0)
- Results reproduced by @henry4516 on 2025-10-14 (commit 338ac0e)
- Results reproduced by @yazdanzv on 2025-10-15 (commit e011b38)
- Results reproduced by @royary on 2025-10-23 (commit 7302ce9)
- Results reproduced by @Raptors65 on 2025-10-23 (commit 7302ce9)
- Results reproduced by @LiHuua258 on 2025-10-24 (commit 7302ce9)
- Results reproduced by @RichHene on 2025-10-24 (commit 7302ce9)
- Results reproduced by @ivan-0862 on 2025-10-25 (commit 7fc1b57)
- Results reproduced by @brandonzhou2002 on 2025-10-26 (commit 5c7d986)
- Results reproduced by @MahdiNoori2003 on 2025-10-28 (commit 5c7d986)
- Results reproduced by @minj22 on 2025-11-04 (commit c274dd8)
- Results reproduced by @ipouyall on 2025-11-05 (commit 82dfba0)
- Results reproduced by @AdrianGri on 2025-11-12 (commit c19077b)
- Results reproduced by @jianxyou on 2025-11-17 (commit 9406dd8)
- Results reproduced by @xincanfeng on 2025-11-18 (commit 9406dd8)
- Results reproduced by @Blank9999 on 2025-11-18 (commit 9406dd8)
- Results reproduced by @ball2004244 on 2025-11-23 (commit 9aea5f3)
- Results reproduced by @Hasebul21 on 2025-11-27 (commit 9aea5f3)
- Results reproduced by @RudraMantri123 on 2025-11-28 (commit 9aea5f3)
- Results reproduced by @imishrr on 2025-12-01 (commit 79e7777)
- Results reproduced by @MehdiJmlkh on 2025-12-08 (commit 259d483)
- Results reproduced by @Kushion32 on 2025-12-09 (commit 3e65fbd)
- Results reproduced by @anjanpa on 2025-12-17 (commit 1d062ef)
- Results reproduced by @MuhammadAli13562 on 2025-12-18 (commit 68311a1)
- Results reproduced by @Hossein-Molaeian on 2025-12-19 (commit b64fba2)
- Results reproduced by @FayizMohideen on 2025-12-21 (commit 1c5cd32)
- Results reproduced by @nli33 on 2025-12-22 (commit 1c5cd32)
- Results reproduced by @VarnitOS on 2025-12-26 (commit 1c5cd32)
- Results reproduced by @zizimind on 2026-01-06 (commit d276b57)
- Results reproduced by @izzat5233 on 2026-01-17 (commit 5bda670)
- Results reproduced by @HusamIsied on 2026-01-25 (commit 952ac5e4)
- Results reproduced by @aaryanshroff on 2026-01-26 (commit 952ac5e)
- Results reproduced by @maherapp on 2026-02-01 (commit f0ecf565)
- Results reproduced by @Karrrthik on 2026-02-20 (commit 792d9cb)
- Results reproduced by @namatvir on 2026-02-28 (commit 2133d40)
- Results reproduced by @aaliyan1230 on 2026-03-01 (commit 6f7d205)
- Results reproduced by @VCY019 on 2026-03-01 (commit 6f7d205)
- Results reproduced by @raghav-ai on 2026-03-03 (commit f496153)
- Results reproduced by @mahdijaf on 2026-03-13 (commit 61c8beb)
- Results reproduced by @justin13888 on 2026-03-18 (commit f2ae641e)
- Results reproduced by @KhanShaheb34 on 2026-03-20 (commit a64b6e4)
- Results reproduced by @ShanaxWorld on 2026-03-27 (commit aaba107)
- Results reproduced by @egekaya1 on 2026-03-29 (commit 9434de7)
- Results reproduced by @h79yan on 2026-04-10 (commit e441bc83)
- Results reproduced by @kwamearhinPORTFL on 2026-04-15 (commit 5c51ee5)
- Results reproduced by @zatchbell1311-wq on 2026-04-17 (commit c6eed686)
- Results reproduced by @Zixi-Sam-Tang on 2026-04-21 (commit 1e7b470)
- Results reproduced by @Seun-Ajayi on 2026-04-25 (commit 2896644)
- Results reproduced by @alex-wang101 on 2026-04-25 (commit 112cab4)
- Results reproduced by @TahseenSust on 2026-04-28 (commit 912d7c3)
- Results reproduced by @xiandadu on 2026-05-01 (commit 67e9fc4)
- Results reproduced by @blissuche90 on 2026-05-01 (commit fb2fd258)
- Results reproduced by @mohamedshakir3 on 2026-05-02 (commit ec9cf56)
- Results reproduced by @mazleon on 2026-05-03 (commit a74898d)
- Results reproduced by @nasazzam on 2026-05-13 (commit 878b97d)
- Results reproduced by @VanshJain4 on 2026-05-14 (commit deb4c7b)
- Results reproduced by @Adebara123 on 2026-05-15 (commit 4d16490)
- Results reproduced by @david23131 on 2026-05-17 (commit ced4918)
- Results reproduced by @masud70 on 2026-05-23 (commit 878b97d)
- Results reproduced by @ibot1 on 2026-05-23 (commit 878b97d)
- Results reproduced by @ReyhanehAhani on 2026-05-25 (commit 11703ed)
- Results reproduced by @grf932 on 2026-05-29 (commit 6663a15)
- Results reproduced by @amulyabenarji777 on 2026-05-30 (commit 6663a15)