Pyserini: A Deeper Dive into Learned Sparse Representations
May 31, 2026
In a previous guide, we introduced a conceptual framework for a representational approach to information retrieval that integrates dense and sparse representations into the same underlying (bi-encoder) architecture. This guide offers a deeper dive into learned sparse retrieval: we use SPLADE-v3, a learned sparse model, to encode the corpus into sparse vectors, index them into a retrieval system with an inverted index, and finally perform retrieval and evaluation.
If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've first done the previous step, reproducing a dense retrieval baseline for NFCorpus. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.
Following the onboarding path, this lesson does not introduce any new concepts. Rather, the focus is to solidify previously introduced concepts and to connect the bi-encoder architecture to implementations in Pyserini. Informally, we're "peeling back the covers".
Learning outcomes for this guide, building on previous steps in the onboarding path, are as follows:
- Be able to encode a corpus into its sparse vector representations with SPLADE-v3.
- Be able to index them into a retrieval system using a Lucene inverted index.
- Be able to compute query-document scores (i.e., retrieval scores) with Pyserini for SPLADE retrieval.
- Be able to perform retrieval with Pyserini given a query.
Recap
As a recap from here, this is the "core retrieval" problem that we're trying to solve:
Given an information need expressed as a query q, the text retrieval task is to return a ranked list of k texts {d1, d2 ... dk} from an arbitrarily large but finite collection of texts C = {di} that maximizes a metric of interest, for example, nDCG, AP, etc.
And this is the bi-encoder architecture for tackling the above challenge:
It's all about representations! BM25 generates bag-of-words sparse lexical vectors where the terms are assigned BM25 weights in an unsupervised manner. Contriever and BGE-base, which are examples of dense retrieval models, use transformer-based encoders, trained on large amounts of supervised data, that generate dense vectors.
Learned Sparse Retrieval Models
Now, we're going to basically do the same thing, but with SPLADE-v3 instead of BM25. A learned sparse model, such as SPLADE-v3, extends traditional bag-of-words models like BM25 by using machine learning to optimize term weights and representations. While BM25 relies on a fixed scoring formula (based on, e.g., term frequency and inverse document frequency), learned sparse models use neural networks to predict the importance of terms in a query or document, producing sparse vectors where only the most relevant terms have non-zero weights. This allows learned sparse models to capture semantic relationships and context better than bag-of-words models, which treat terms independently. However, both approaches result in sparse representations, which can be searched efficiently with an inverted index.
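To make the comparison concrete, here is a minimal sketch of how any sparse representation, whether from BM25 or a learned model, scores a query against a document: as a dot product over the terms they share. The terms and weights below are made up for illustration and are not output from SPLADE-v3.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the shorter vector; only shared terms contribute.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Hypothetical weights. A learned sparse model may give weight to terms
# (here, "artery") that never appear in the original text.
query = {"aortic": 2.1, "aneurysm": 2.4, "prevent": 1.2, "artery": 0.6}
doc_a = {"aneurysm": 1.8, "aortic": 1.5, "surgery": 0.9}
doc_b = {"artery": 1.1, "diet": 1.4, "prevent": 0.7}

scores = {"doc_a": sparse_dot(query, doc_a), "doc_b": sparse_dot(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

The only difference between BM25 and a learned sparse model in this picture is where the weights come from.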
Start by creating the directories where we will store the encoded documents:
mkdir encode
cd encode
mkdir nfcorpus.splade-v3
cd ..
We can then set up SPLADE-v3. First, we need to request access to the SPLADE-v3 model on Hugging Face, since it is gated:
- Create an account for Hugging Face: https://huggingface.co/join
- Go to the model page on Hugging Face: Splade-v3
- Click the "Log In" button.
Next, we need to authenticate with Hugging Face. If you don’t already have the Hugging Face CLI installed, install it using:
pip install huggingface_hub
Run the following command to log in to your Hugging Face account:
hf auth login
You’ll be prompted to enter your Hugging Face API token. You can generate a token from your Hugging Face account settings:
- Go to https://huggingface.co/settings/tokens.
- Click "New token" to generate a token.
- For your token's permissions, give “Read access to contents of all public gated repos you can access”.
- Copy the token and paste it into the terminal when prompted.
We are now all set to use the SPLADE-v3 model!
Start by running the following command to encode the corpus into sparse vector representations.
python -m pyserini.encode \
input --corpus collections/nfcorpus/corpus.jsonl \
--fields title text \
output --embeddings encode/nfcorpus.splade-v3 \
encoder --encoder naver/splade-v3 \
--encoder-class splade \
--fields title text \
--max-length 512 \
--device cpu
Use --device cuda for faster encoding if you have a CUDA-enabled GPU.
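Once encoding finishes, it's worth looking at what was actually written. The sketch below assumes each output line is a JSON object with `id`, `contents`, and `vector` fields, where `vector` maps tokens to weights; this layout is an assumption, so check the files in encode/nfcorpus.splade-v3 before relying on it. The sample record is invented.

```python
import json


def top_terms(record_line, k=5):
    """Return the docid and the k highest-weighted tokens of one encoded document."""
    record = json.loads(record_line)
    vector = record["vector"]  # assumed layout: {token: weight}
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    return record["id"], ranked[:k]


# Invented example line; real lines would come from the encoder's JSONL output.
sample = json.dumps({
    "id": "MED-0000",
    "contents": "example title example text",
    "vector": {"aneurysm": 212, "aortic": 187, "artery": 95, "risk": 41},
})
docid, terms = top_terms(sample, k=2)
print(docid, terms)
```

Running this over a few real documents is a quick way to see which tokens the model considers important, including tokens that are not in the original text.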
Next, we will index the encoded corpus into a retrieval system using an inverted index.
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input encode/nfcorpus.splade-v3 \
--index index/nfcorpus.splade-v3 \
--generator DefaultLuceneDocumentGenerator \
--threads 4 \
--impact \
--pretokenized
Here, we use the --pretokenized flag because SPLADE has already split the text into tokens (words and subwords) in the sparse vector.
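Conceptually, an impact index stores, for each token, a postings list of (document, integer weight) pairs, and scores a query by summing the products of query and document weights over the matching postings. The toy sketch below illustrates the idea in plain Python. It is not how Lucene implements it, and the scale factor of 100 used to turn float weights into integers is an assumption for illustration.

```python
from collections import defaultdict


def quantize(vector, scale=100):
    """Turn float weights into integer impacts (the scale is an assumed value)."""
    impacts = {t: round(w * scale) for t, w in vector.items()}
    return {t: i for t, i in impacts.items() if i > 0}


def build_index(docs):
    """Build a toy inverted index: token -> list of (docid, impact)."""
    index = defaultdict(list)
    for docid, vector in docs.items():
        for token, impact in quantize(vector).items():
            index[token].append((docid, impact))
    return index


def search(index, query_vector, k=10):
    """Score documents by summing query impact * document impact per shared token."""
    scores = defaultdict(int)
    for token, q_impact in quantize(query_vector).items():
        for docid, d_impact in index.get(token, []):
            scores[docid] += q_impact * d_impact
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical documents and query, for illustration only.
docs = {
    "d1": {"aortic": 1.5, "aneurysm": 1.8},
    "d2": {"aneurysm": 0.4, "diet": 1.2},
}
index = build_index(docs)
hits = search(index, {"aneurysm": 2.0, "diet": 0.5})
print(hits)
```

Because both sides are integers, the resulting scores are whole numbers, and only documents that share at least one token with the query are ever touched.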
Perform retrieval:
python -m pyserini.search.lucene \
--index index/nfcorpus.splade-v3 \
--topics collections/nfcorpus/queries.tsv \
--output runs/run.splade.txt \
--hits 1000 \
--encoder naver/splade-v3 \
--remove-query \
--output-format trec \
--impact \
--threads 4
The run will be stored in runs/run.splade.txt.
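With --output-format trec, each line of the run file has six whitespace-separated columns: query id, the literal Q0, document id, rank, score, and a run tag. A small sketch for loading such lines into memory follows; the sample lines and the tag are invented.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse TREC run lines into {qid: [(docid, rank, score), ...]} ordered by rank."""
    run = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _q0, docid, rank, score, _tag = line.split()
        run[qid].append((docid, int(rank), float(score)))
    for qid in run:
        run[qid].sort(key=lambda entry: entry[1])
    return dict(run)


# Invented sample; a real run would be read from runs/run.splade.txt.
sample = """\
Q1 Q0 MED-0002 2 310.0 demo
Q1 Q0 MED-0001 1 512.0 demo
Q2 Q0 MED-0003 1 77.0 demo
""".splitlines()
run = parse_trec_run(sample)
print(run["Q1"])
```

Since the function accepts any iterable of lines, an open file handle can be passed in directly.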
And evaluate the retrieval run:
python -m pyserini.eval.trec_eval \
-c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
runs/run.splade.txt
The expected results are:
ndcg_cut_10 all 0.3624
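For intuition about what ndcg_cut.10 measures, here is a simplified sketch of nDCG@k for a single query, using the relevance grade as the gain and a log2 rank discount. It is meant to illustrate the metric, not to replace trec_eval, and it may not match trec_eval on edge cases. The document ids and relevance grades are made up.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_docids, qrels, k=10):
    """nDCG@k for one query; qrels maps docid -> relevance grade."""
    gains = [qrels.get(d, 0) for d in ranked_docids[:k]]  # unjudged docs count as 0
    ideal = sorted(qrels.values(), reverse=True)[:k]       # best possible ordering
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and rankings.
qrels = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)
swapped = ndcg_at_k(["d2", "d1", "d4"], qrels)
print(perfect, round(swapped, 4))
```

The number reported for a run is the mean of this per-query value over all queries in the qrels.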
We can also perform retrieval interactively:
import torch
from pyserini.search.lucene import LuceneImpactSearcher
from pyserini.encode import SpladeQueryEncoder
encoder = SpladeQueryEncoder(model_name_or_path="naver/splade-v3", device='cuda' if torch.cuda.is_available() else 'cpu')
searcher = LuceneImpactSearcher('index/nfcorpus.splade-v3', query_encoder=encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')
The results should be as follows:
1 MED-4555 51131.000000
2 MED-4423 36854.000000
3 MED-3180 30411.000000
4 MED-1679 30396.000000
5 MED-3253 29326.000000
6 MED-2007 28814.000000
7 MED-1395 28016.000000
8 MED-5300 27989.000000
9 MED-4030 27699.000000
10 MED-1194 27588.000000
To recap, what's the point of this exercise?
- We see that a machine learning model can also be applied to generate sparse vectors.
- You now know how to reconstruct the document vector representations.
- You now know how to encode a query into a query vector.
Okay, that's it for this lesson.
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.
Reproduction Log
- Results reproduced by @JJGreen0 on 2025-02-16 (commit f7ed14d)
- Results reproduced by @lilyjge on 2025-04-22 (commit ba896e2)
- Results reproduced by @ricky42613 on 2025-04-25 (commit ea70638)
- Results reproduced by @lzguan on 2025-05-02 (commit 252ee06)
- Results reproduced by @mindlesstruffle on 2025-07-15 (commit b5d4838)
- Results reproduced by @FarmersWrap on 2025-11-02 (commit 80395dc)
- Results reproduced by @minj22 on 2025-11-05 (commit 0fc0b62)
- Results reproduced by @ipouyall on 2025-11-05 (commit 7e54c0e7)
- Results reproduced by @AdrianGri on 2025-11-12 (commit f4a8d0e)
- Results reproduced by @Amirhosseinpoor on 2025-11-13 (commit f4a8d0e)
- Results reproduced by @jianxyou on 2025-11-19 (commit 7fd6115)
- Results reproduced by @xincanfeng on 2025-11-19 (commit 7fd6115)
- Results reproduced by @ball2004244 on 2025-11-23 (commit cadcbd9)
- Results reproduced by @RudraMantri123 on 2025-11-28 (commit 566243c)
- Results reproduced by @Kushion32 on 2025-12-09 (commit 301db78)
- Results reproduced by @Hasebul21 on 2025-12-10 (commit d26c2fd)
- Results reproduced by @MehdiJmlkh on 2025-12-10 (commit d26c2fd)
- Results reproduced by @MuhammadAli13562 on 2025-12-18 (commit e4bf66e)
- Results reproduced by @Hossein-Molaeian on 2025-12-19 (commit fee9962)
- Results reproduced by @anjanpa on 2025-12-22 (commit d12db6a)
- Results reproduced by @nli33 on 2025-12-26 (commit d12db6a)
- Results reproduced by @VarnitOS on 2025-12-29 (commit 4106eed)
- Results reproduced by @zizimind on 2026-01-07 (commit 74d7182)
- Results reproduced by @izzat5233 on 2026-01-17 (commit 4bfbb9e)
- Results reproduced by @aaryanshroff on 2026-01-27 (commit 9e92b42)
- Results reproduced by @maherapp on 2026-02-01 (commit e9b559c)
- Results reproduced by @Karrrthik on 2026-02-20 (commit 2cecfb0)
- Results reproduced by @VCY019 on 2026-03-02 (commit bcae91d)
- Results reproduced by @raghav-ai on 2026-03-03 (commit 99e2810)
- Results reproduced by @justin13888 on 2026-03-18 (commit 76c95a4)
- Results reproduced by @namatvir on 2026-03-21 (commit b3922b8)
- Results reproduced by @egekaya1 on 2026-03-29 (commit 59dfd0e)
- Results reproduced by @ShanaxWorld on 2026-04-05 (commit eba5ad0)
- Results reproduced by @h79yan on 2026-04-12 (commit b973a40)
- Results reproduced by @kwamearhinPORTFL on 2026-04-23 (commit a8d2625)
- Results reproduced by @Zixi-Sam-Tang on 2026-04-23 (commit b823fb0)
- Results reproduced by @alex-wang101 on 2026-04-25 (commit e9cd125)
- Results reproduced by @Seun-Ajayi on 2026-04-26 (commit 1f12c67)
- Results reproduced by @TahseenSust on 2026-04-29 (commit c8e8273)
- Results reproduced by @mohamedshakir3 on 2026-05-02 (commit 7ed1f31)
- Results reproduced by @tqmsh on 2026-05-03 (commit 6adee73)
- Results reproduced by @mazleon on 2026-05-03 (commit 6adee73)
- Results reproduced by @blissuche90 on 2026-05-04 (commit 7ed1f31)
- Results reproduced by @nasazzam on 2026-05-13 (commit f8c4bd6)
- Results reproduced by @VanshJain4 on 2026-05-14 (commit b0bf30c)
- Results reproduced by @Adebara123 on 2026-05-15 (commit e5d3a4a)
- Results reproduced by @david23131 on 2026-05-18 (commit ced4918)
- Results reproduced by @masud70 on 2026-05-23 (commit 21a411a)
- Results reproduced by @ibot1 on 2026-05-23 (commit 21a411a)
- Results reproduced by @grf932 on 2026-05-30 (commit 4536186)
- Results reproduced by @amulyabenarji777 on 2026-05-30 (commit 4536186)