Pyserini Regressions: BM25 Baselines for HC4 on NeuCLIR22

August 6, 2022

This page documents BM25 regression experiments for HC4 (v1.0) on the NeuCLIR22 corpus. The HC4 qrels have been filtered to keep only judgments for documents that appear in both the HC4 and NeuCLIR22 corpora.

Corpus Download

1. Manual Download

The HC4 corpus can be downloaded following the instructions here. After downloading, verify that all of the specified documents, and only those, are present by running the code provided here.

With the corpus downloaded, create three separate folders, one per language (Persian, Chinese, and Russian), and unpack the data for each language into its folder:

mkdir collections/neuclir22-fa collections/neuclir22-zh collections/neuclir22-ru
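As a rough alternative to the mkdir call above, the same folders can be created from Python. This is a minimal sketch under stated assumptions: the helper names prepare_workspace and empty_collections are made up for illustration, and creating indexes/ and logs/ at this point is an assumption based on the paths that the indexing commands on this page write to.

```python
from __future__ import annotations

from pathlib import Path

LANGUAGES = ("fa", "zh", "ru")  # Persian, Chinese, Russian


def prepare_workspace(root: Path) -> list[Path]:
    """Create the per-language collection folders plus indexes/ and logs/.

    Hypothetical helper: the folder names mirror the paths used on this page.
    """
    folders = [root / "collections" / f"neuclir22-{lang}" for lang in LANGUAGES]
    folders += [root / "indexes", root / "logs"]
    for folder in folders:
        # parents=True also creates collections/ itself if it is missing.
        folder.mkdir(parents=True, exist_ok=True)
    return folders


def empty_collections(root: Path) -> list[str]:
    """Return the languages whose collection folder contains nothing yet.

    Call this after prepare_workspace(); a missing folder raises an error.
    """
    return [
        lang
        for lang in LANGUAGES
        if not any((root / "collections" / f"neuclir22-{lang}").iterdir())
    ]
```

Calling prepare_workspace(Path(".")) from the repository root sets up the layout, and empty_collections(Path(".")) can be used after unpacking to spot a language whose data is still missing.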

We can now index these documents as a NeuClirCollection using the Anserini bindings in Pyserini:


python -m pyserini.index.lucene --collection NeuClirCollection \
  --input collections/neuclir22-zh --index indexes/lucene-index.neuclir22-zh \
  --generator DefaultLuceneDocumentGenerator --threads 8 \
  --storePositions --storeDocvectors --storeRaw -language zh \
  >& logs/log.neuclir22-zh &

python -m pyserini.index.lucene --collection NeuClirCollection \
  --input collections/neuclir22-fa --index indexes/lucene-index.neuclir22-fa \
  --generator DefaultLuceneDocumentGenerator --threads 8 \
  --storePositions --storeDocvectors --storeRaw -language fa \
  >& logs/log.neuclir22-fa &

python -m pyserini.index.lucene --collection NeuClirCollection \
  --input collections/neuclir22-ru --index indexes/lucene-index.neuclir22-ru \
  --generator DefaultLuceneDocumentGenerator --threads 8 \
  --storePositions --storeDocvectors --storeRaw -language ru \
  >& logs/log.neuclir22-ru &
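The three indexing commands above differ only in the language code, so they can be generated in a loop. The sketch below only builds the argument lists and prints them; the function name index_command is hypothetical, and the flags are copied from the commands above rather than taken from any additional documentation.

```python
from __future__ import annotations

import shlex

LANGUAGES = ("zh", "fa", "ru")


def index_command(lang: str, threads: int = 8) -> list[str]:
    """Build the indexing command for one language as an argument list."""
    return [
        "python", "-m", "pyserini.index.lucene",
        "--collection", "NeuClirCollection",
        "--input", f"collections/neuclir22-{lang}",
        "--index", f"indexes/lucene-index.neuclir22-{lang}",
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storePositions", "--storeDocvectors", "--storeRaw",
        "-language", lang,  # written with a single dash, as in the commands above
    ]


if __name__ == "__main__":
    # Print the commands for inspection instead of running them.
    for lang in LANGUAGES:
        print(shlex.join(index_command(lang)))
```

To actually run a command, the list can be passed to subprocess.run(cmd, check=True); redirecting the output to logs/log.neuclir22-{lang}, as the shell versions do, is left to the caller.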

2. Download Pre-Built Sparse Indexes (for BM25)

Retrieval: Test Topics

Condition: Title

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics hc4-v1.0-zh-test-title \
    --output runs/run.neuclir22-zh.bm25.topics.hc4-v1.0-zh.test.title.txt \
    --bm25 --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics hc4-v1.0-fa-test-title \
    --output runs/run.neuclir22-fa.bm25.topics.hc4-v1.0-fa.test.title.txt \
    --bm25 --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics hc4-v1.0-ru-test-title \
    --output runs/run.neuclir22-ru.bm25.topics.hc4-v1.0-ru.test.title.txt \
    --bm25 --language ru

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics hc4-v1.0-zh-test-title \
    --output runs/run.neuclir22-zh.bm25-default+rm3.topics.hc4-v1.0-zh.test.title.txt \
    --bm25 --rm3 --language zh

python -m pyserini.search.lucene  --index neuclir22-fa \
    --topics hc4-v1.0-fa-test-title \
    --output runs/run.neuclir22-fa.bm25-default+rm3.topics.hc4-v1.0-fa.test.title.txt \
    --bm25 --rm3 --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics hc4-v1.0-ru-test-title \
    --output runs/run.neuclir22-ru.bm25-default+rm3.topics.hc4-v1.0-ru.test.title.txt \
    --bm25 --rm3 --language ru

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics hc4-v1.0-zh-test-title \
    --output runs/run.neuclir22-zh.bm25-default+rocchio.topics.hc4-v1.0-zh.test.title.txt \
    --bm25 --rocchio --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics hc4-v1.0-fa-test-title \
    --output runs/run.neuclir22-fa.bm25-default+rocchio.topics.hc4-v1.0-fa.test.title.txt \
    --bm25 --rocchio --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics hc4-v1.0-ru-test-title \
    --output runs/run.neuclir22-ru.bm25-default+rocchio.topics.hc4-v1.0-ru.test.title.txt \
    --bm25 --rocchio --language ru 

Condition: Description

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics hc4-v1.0-zh-test-description \
    --output runs/run.neuclir22-zh.bm25.topics.hc4-v1.0-zh.test.description.txt \
    --bm25 --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics hc4-v1.0-fa-test-description \
    --output runs/run.neuclir22-fa.bm25.topics.hc4-v1.0-fa.test.description.txt \
    --bm25 --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics hc4-v1.0-ru-test-description \
    --output runs/run.neuclir22-ru.bm25.topics.hc4-v1.0-ru.test.description.txt \
    --bm25 --language ru 

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics hc4-v1.0-zh-test-description \
    --output runs/run.neuclir22-zh.bm25-default+rm3.topics.hc4-v1.0-zh.test.description.txt \
    --bm25 --rm3 --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics hc4-v1.0-fa-test-description \
    --output runs/run.neuclir22-fa.bm25-default+rm3.topics.hc4-v1.0-fa.test.description.txt \
    --bm25 --rm3 --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics hc4-v1.0-ru-test-description \
    --output runs/run.neuclir22-ru.bm25-default+rm3.topics.hc4-v1.0-ru.test.description.txt \
    --bm25 --rm3 --language ru

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics hc4-v1.0-zh-test-description \
    --output runs/run.neuclir22-zh.bm25-default+rocchio.topics.hc4-v1.0-zh.test.description.txt \
    --bm25 --rocchio --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics hc4-v1.0-fa-test-description \
    --output runs/run.neuclir22-fa.bm25-default+rocchio.topics.hc4-v1.0-fa.test.description.txt \
    --bm25 --rocchio --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics hc4-v1.0-ru-test-description \
    --output runs/run.neuclir22-ru.bm25-default+rocchio.topics.hc4-v1.0-ru.test.description.txt \
    --bm25 --rocchio --language ru 

Condition: Description + Title

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-zh.test.desc.title.tsv \
    --output runs/run.neuclir22-zh.bm25.topics.hc4-v1.0-zh.test.description.title.txt \
    --bm25 --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-fa.test.desc.title.tsv \
    --output runs/run.neuclir22-fa.bm25.topics.hc4-v1.0-fa.test.description.title.txt \
    --bm25 --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-ru.test.desc.title.tsv \
    --output runs/run.neuclir22-ru.bm25.topics.hc4-v1.0-ru.test.description.title.txt \
    --bm25 --language ru 

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-zh.test.desc.title.tsv \
    --output runs/run.neuclir22-zh.bm25-default+rm3.topics.hc4-v1.0-zh.test.description.title.txt \
    --bm25 --rm3 --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-fa.test.desc.title.tsv \
    --output runs/run.neuclir22-fa.bm25-default+rm3.topics.hc4-v1.0-fa.test.description.title.txt \
    --bm25 --rm3 --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-ru.test.desc.title.tsv \
    --output runs/run.neuclir22-ru.bm25-default+rm3.topics.hc4-v1.0-ru.test.description.title.txt \
    --bm25 --rm3 --language ru

python -m pyserini.search.lucene  --index  neuclir22-zh \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-zh.test.desc.title.tsv \
    --output runs/run.neuclir22-zh.bm25-default+rocchio.topics.hc4-v1.0-zh.test.description.title.txt \
    --bm25 --rocchio --language zh

python -m pyserini.search.lucene  --index  neuclir22-fa \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-fa.test.desc.title.tsv \
    --output runs/run.neuclir22-fa.bm25-default+rocchio.topics.hc4-v1.0-fa.test.description.title.txt \
    --bm25 --rocchio --language fa 

python -m pyserini.search.lucene  --index  neuclir22-ru \
    --topics tools/topics-and-qrels/topics.hc4-v1.0-ru.test.desc.title.tsv \
    --output runs/run.neuclir22-ru.bm25-default+rocchio.topics.hc4-v1.0-ru.test.description.title.txt \
    --bm25 --rocchio --language ru 
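The retrieval commands above form a grid of 3 languages, 3 topic conditions, and 3 models, for 27 runs in total. The following sketch enumerates that grid and reproduces the topic names and run-file naming used above. The names CONDITIONS, MODELS, run_path, and search_command are illustrative only, not part of Pyserini.

```python
from __future__ import annotations

from itertools import product

LANGUAGES = ("zh", "fa", "ru")

# condition -> (topics argument template, suffix used in the run file name)
CONDITIONS = {
    "title": ("hc4-v1.0-{lang}-test-title", "title"),
    "description": ("hc4-v1.0-{lang}-test-description", "description"),
    "description+title": (
        "tools/topics-and-qrels/topics.hc4-v1.0-{lang}.test.desc.title.tsv",
        "description.title",
    ),
}

# model -> (tag used in the run file name, extra flags after --bm25)
MODELS = {
    "bm25": ("bm25", []),
    "rm3": ("bm25-default+rm3", ["--rm3"]),
    "rocchio": ("bm25-default+rocchio", ["--rocchio"]),
}


def run_path(lang: str, condition: str, model: str) -> str:
    """Return the run file path following the naming scheme used on this page."""
    tag = MODELS[model][0]
    suffix = CONDITIONS[condition][1]
    return f"runs/run.neuclir22-{lang}.{tag}.topics.hc4-v1.0-{lang}.test.{suffix}.txt"


def search_command(lang: str, condition: str, model: str) -> list[str]:
    """Build one retrieval command as an argument list."""
    topics = CONDITIONS[condition][0].format(lang=lang)
    return [
        "python", "-m", "pyserini.search.lucene",
        "--index", f"neuclir22-{lang}",
        "--topics", topics,
        "--output", run_path(lang, condition, model),
        "--bm25", *MODELS[model][1],
        "--language", lang,
    ]


def all_search_commands() -> list[list[str]]:
    """Enumerate every command, ordered by condition, then model, then language."""
    return [
        search_command(lang, condition, model)
        for condition, model, lang in product(CONDITIONS, MODELS, LANGUAGES)
    ]


if __name__ == "__main__":
    print(len(all_search_commands()), "retrieval commands")
```

Keeping the naming scheme in one place makes it easier to pair each run file with its evaluation command later.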

Evaluation: Test Topics

Condition: Title

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25.topics.hc4-v1.0-zh.test.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25.topics.hc4-v1.0-fa.test.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25.topics.hc4-v1.0-ru.test.title.txt

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25-default+rm3.topics.hc4-v1.0-zh.test.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25-default+rm3.topics.hc4-v1.0-fa.test.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25-default+rm3.topics.hc4-v1.0-ru.test.title.txt

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25-default+rocchio.topics.hc4-v1.0-zh.test.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25-default+rocchio.topics.hc4-v1.0-fa.test.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25-default+rocchio.topics.hc4-v1.0-ru.test.title.txt

Condition: Description

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25.topics.hc4-v1.0-zh.test.description.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25.topics.hc4-v1.0-fa.test.description.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25.topics.hc4-v1.0-ru.test.description.txt

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25-default+rm3.topics.hc4-v1.0-zh.test.description.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25-default+rm3.topics.hc4-v1.0-fa.test.description.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25-default+rm3.topics.hc4-v1.0-ru.test.description.txt

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25-default+rocchio.topics.hc4-v1.0-zh.test.description.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25-default+rocchio.topics.hc4-v1.0-fa.test.description.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25-default+rocchio.topics.hc4-v1.0-ru.test.description.txt

Condition: Description + Title

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25.topics.hc4-v1.0-zh.test.description.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25.topics.hc4-v1.0-fa.test.description.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25.topics.hc4-v1.0-ru.test.description.title.txt

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25-default+rm3.topics.hc4-v1.0-zh.test.description.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25-default+rm3.topics.hc4-v1.0-fa.test.description.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25-default+rm3.topics.hc4-v1.0-ru.test.description.title.txt

python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-zh.test.txt runs/run.neuclir22-zh.bm25-default+rocchio.topics.hc4-v1.0-zh.test.description.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-fa.test.txt runs/run.neuclir22-fa.bm25-default+rocchio.topics.hc4-v1.0-fa.test.description.title.txt
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.20 -m judged.20 -m recall.1000 tools/topics-and-qrels/qrels.hc4-neuclir22-ru.test.txt runs/run.neuclir22-ru.bm25-default+rocchio.topics.hc4-v1.0-ru.test.description.title.txt
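Each evaluation command prints a short summary, which can be collected into a dictionary for later comparison. The sketch below assumes the usual trec_eval summary layout of three whitespace-separated columns (measure name, the literal "all", and the score). The SAMPLE text is a hand-written illustration of that layout using the Chinese title BM25 scores reported on this page, not captured tool output, and parse_trec_eval is a hypothetical helper.

```python
from __future__ import annotations


def parse_trec_eval(output: str) -> dict[str, float]:
    """Collect 'measure  all  value' summary lines into a dictionary.

    Lines that do not match the assumed three-column layout are skipped.
    """
    scores: dict[str, float] = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[1] != "all":
            continue
        try:
            scores[parts[0]] = float(parts[2])
        except ValueError:
            continue  # the third column was not a number
    return scores


# Hand-written illustration of the assumed layout (not captured output).
SAMPLE = """\
map                   \tall\t0.0561
ndcg_cut_20           \tall\t0.0759
recall_1000           \tall\t0.4401
"""

if __name__ == "__main__":
    print(parse_trec_eval(SAMPLE))
```

In practice the text would come from running one of the commands above, for example with subprocess.run(..., capture_output=True, text=True), and reading its stdout.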

Effectiveness

With the above commands, you should be able to reproduce the following results:

Chinese

| MAP | BM25 (default) | +RM3 | +Rocchio |
|---|---|---|---|
| HC4 (Chinese): test-topic title | 0.0561 | 0.0449 | 0.0488 |
| HC4 (Chinese): test-topic description | 0.0428 | 0.0262 | 0.0277 |
| HC4 (Chinese): test-topic description+title | 0.0597 | 0.0435 | 0.0462 |
| nDCG@20 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Chinese): test-topic title | 0.0759 | 0.0622 | 0.0767 |
| HC4 (Chinese): test-topic description | 0.0687 | 0.0379 | 0.0529 |
| HC4 (Chinese): test-topic description+title | 0.0881 | 0.0640 | 0.0735 |
| J@20 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Chinese): test-topic title | 0.0620 | 0.0490 | 0.0760 |
| HC4 (Chinese): test-topic description | 0.0590 | 0.0360 | 0.0610 |
| HC4 (Chinese): test-topic description+title | 0.0710 | 0.0420 | 0.0740 |
| Recall@1000 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Chinese): test-topic title | 0.4401 | 0.3909 | 0.4128 |
| HC4 (Chinese): test-topic description | 0.3565 | 0.2383 | 0.3858 |
| HC4 (Chinese): test-topic description+title | 0.4442 | 0.2673 | 0.4259 |

Russian

| MAP | BM25 (default) | +RM3 | +Rocchio |
|---|---|---|---|
| HC4 (Russian): test-topic title | 0.0964 | 0.0811 | 0.1245 |
| HC4 (Russian): test-topic description | 0.0926 | 0.0605 | 0.1064 |
| HC4 (Russian): test-topic description+title | 0.1113 | 0.0771 | 0.1341 |
| nDCG@20 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Russian): test-topic title | 0.1380 | 0.1257 | 0.1668 |
| HC4 (Russian): test-topic description | 0.1459 | 0.0963 | 0.1643 |
| HC4 (Russian): test-topic description+title | 0.1640 | 0.1318 | 0.1899 |
| J@20 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Russian): test-topic title | 0.0860 | 0.0730 | 0.0940 |
| HC4 (Russian): test-topic description | 0.0790 | 0.0610 | 0.0890 |
| HC4 (Russian): test-topic description+title | 0.0900 | 0.0750 | 0.0980 |
| Recall@1000 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Russian): test-topic title | 0.6319 | 0.6154 | 0.6887 |
| HC4 (Russian): test-topic description | 0.6640 | 0.5408 | 0.6407 |
| HC4 (Russian): test-topic description+title | 0.6667 | 0.6221 | 0.6743 |

Persian

| MAP | BM25 (default) | +RM3 | +Rocchio |
|---|---|---|---|
| HC4 (Persian): test-topic title | 0.1198 | 0.1050 | 0.1221 |
| HC4 (Persian): test-topic description | 0.1435 | 0.0845 | 0.1254 |
| HC4 (Persian): test-topic description+title | 0.1438 | 0.1079 | 0.1351 |
| nDCG@20 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Persian): test-topic title | 0.1806 | 0.1549 | 0.1794 |
| HC4 (Persian): test-topic description | 0.2288 | 0.1323 | 0.1968 |
| HC4 (Persian): test-topic description+title | 0.2233 | 0.1760 | 0.2001 |
| J@20 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Persian): test-topic title | 0.1430 | 0.1220 | 0.1520 |
| HC4 (Persian): test-topic description | 0.1480 | 0.1100 | 0.1480 |
| HC4 (Persian): test-topic description+title | 0.1570 | 0.1210 | 0.1530 |
| Recall@1000 | BM25 (default) | +RM3 | +Rocchio |
| HC4 (Persian): test-topic title | 0.7234 | 0.6742 | 0.7929 |
| HC4 (Persian): test-topic description | 0.7431 | 0.6107 | 0.7768 |
| HC4 (Persian): test-topic description+title | 0.7652 | 0.6436 | 0.8058 |
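
One way to treat the reported scores as a regression check is to compare freshly computed scores against them with a small tolerance. The sketch below does this for a single row, using the Chinese test-topic title scores for BM25 (default) copied from the results above. The function name mismatches and the tolerance of 5e-5 (half a unit in the fourth decimal place, matching the four-decimal reporting) are assumptions for illustration.

```python
from __future__ import annotations

# Expected scores for HC4 (Chinese): test-topic title, BM25 (default),
# copied from the results reported on this page.
EXPECTED_ZH_TITLE_BM25 = {
    "MAP": 0.0561,
    "nDCG@20": 0.0759,
    "J@20": 0.0620,
    "Recall@1000": 0.4401,
}


def mismatches(
    observed: dict[str, float],
    expected: dict[str, float],
    tolerance: float = 5e-5,
) -> dict[str, tuple[float | None, float]]:
    """Return {metric: (observed, expected)} for every metric that is
    missing from `observed` or differs from `expected` by more than
    `tolerance`."""
    bad: dict[str, tuple[float | None, float]] = {}
    for metric, want in expected.items():
        got = observed.get(metric)
        if got is None or abs(got - want) > tolerance:
            bad[metric] = (got, want)
    return bad


if __name__ == "__main__":
    # Comparing the expected row with itself reports no mismatches.
    print(mismatches(EXPECTED_ZH_TITLE_BM25, EXPECTED_ZH_TITLE_BM25))
```

An empty result means the run reproduces the reported row to four decimal places; any entry in the result names the metric that drifted along with both values.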