Pyserini: Usage of the Query Builder API
March 5, 2022 ยท View on GitHub
The querybuilder provides functionality to construct Lucene queries through Pyserini.
These queries can be directly issued through the LuceneSearcher.
Instead of issuing the query hubble space telescope directly, we can also construct the same exact query manually as follows:
from pyserini.search.lucene import querybuilder
# First, create term queries for each individual query term:
term1 = querybuilder.get_term_query('hubble')
term2 = querybuilder.get_term_query('space')
term3 = querybuilder.get_term_query('telescope')
# Then, assemble into a "bag of words" query:
should = querybuilder.JBooleanClauseOccur['should'].value
boolean_query_builder = querybuilder.get_boolean_query_builder()
boolean_query_builder.add(term1, should)
boolean_query_builder.add(term2, should)
boolean_query_builder.add(term3, should)
query = boolean_query_builder.build()
Then issue the query:
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index('robust04')
# Generate your query, per above...
hits = searcher.search(query)
for i in range(0, 10):
print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
The results should be exactly the same as:
hits = searcher.search('hubble space telescope')
By manually constructing queries, it is possible to define the boost for each query term individually. For example:
boost1 = querybuilder.get_boost_query(term1, 2.)
boost2 = querybuilder.get_boost_query(term2, 1.)
boost3 = querybuilder.get_boost_query(term3, 1.)
should = querybuilder.JBooleanClauseOccur['should'].value
boolean_query_builder = querybuilder.get_boolean_query_builder()
boolean_query_builder.add(boost1, should)
boolean_query_builder.add(boost2, should)
boolean_query_builder.add(boost3, should)
query = boolean_query_builder.build()
hits = searcher.search(query)
for i in range(0, 10):
print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
Note that the results are different, because we've placed more weight on the term hubble.