Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

May 28, 2024

🌟Paper (Findings of ACL 2024)🌟: http://arxiv.org/abs/2405.16546
🌟Datasets and Checkpoints🌟: https://huggingface.co/IR-Cocktail

Introduction

Cocktail is a comprehensive benchmark designed to evaluate Information Retrieval (IR) models amid the evolving landscape of AI-generated content (AIGC). In the era of Large Language Models (LLMs), the traditional IR corpus, previously composed solely of human-written texts, has expanded to include a significant proportion of LLM-generated content. Cocktail responds to this transformation by providing a robust framework for assessing the performance and bias of IR models on mixed corpora.

Features

  • Comprehensive Dataset Collection: Cocktail comprises 15 existing IR datasets in a standard format, diversified across a range of text retrieval tasks and domains, each enriched with an LLM-generated corpus using Llama2.

  • Up-to-Date Evaluation Dataset: Introducing Natural Question Up-To-Date (NQ-UTD), a dataset featuring queries derived from the latest events, specifically designed to test the responsiveness of LLM-based IR models to new information not included in their pre-training data.

  • Easy-to-use Evaluation Tool: Cocktail includes a user-friendly evaluation tool that simplifies assessing various IR models on the benchmarked datasets. The tool is designed to be adaptable, allowing new models and datasets to be integrated easily, so that researchers and developers can efficiently evaluate the performance and bias of their IR systems.

File Structure

.
├── dataset  # * dataset path
│   ├── climate-fever
│   ├── cqadupstack
│   ├── ...
│   ├── trec-covid
│   └── webis-touche2020
└── benchmark  # * evaluation benchmark
    ├── beir  # * required code from BEIR
    ├── evaluate  # * code for evaluation
    │   ├── rerank # * code for re-rankers
    │   ├── retrieval # * code for retrievers
    │   └── utils # * code for different evaluation settings
    └── shell  # * scripts for quick evaluation

Quick Start

We provide detailed scripts for all the benchmarked models in the folder benchmark/shell. Using neural retrieval models as an example, you can reproduce our results with the following script:

GPU=0
batch_size=128
# Paths below are relative to the benchmark folder (see File Structure).
for dataset in "msmarco" "dl19" "dl20" "trec-covid" "nfcorpus" "nq" "hotpotqa" "fiqa" "webis-touche2020" "cqadupstack" "dbpedia-entity" "scidocs" "fever" "climate-fever" "nq-utd"
do
    for model in "bert" "roberta" "tasb" "contriever" "dragon" "cocondenser" "ance" "retromae"
    do
        mkdir -p "./log/${dataset}/${model}/"

        # human-written corpus only
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list human \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > "./log/${dataset}/${model}/human.log" 2>&1

        # LLM-generated corpus only
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list llama-2-7b-chat-tmp0.2 \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > "./log/${dataset}/${model}/llama2.log" 2>&1

        # mixed corpus (human-written + LLM-generated)
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list human llama-2-7b-chat-tmp0.2 \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > "./log/${dataset}/${model}/human_llama2.log" 2>&1

        # mixed corpus, with metrics computed for each target corpus (--target_list)
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list human llama-2-7b-chat-tmp0.2 \
        --target_list human llama-2-7b-chat-tmp0.2 \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > "./log/${dataset}/${model}/human+llama2.log" 2>&1
    done
done
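
To run only a subset of the settings, or to drive the runs from Python, the four invocations in the script can be generated programmatically. The following is a minimal sketch: the flags and corpus names are copied from the script above, while `SETTINGS` and `build_commands` are hypothetical helper names that are not part of the repository.

```python
HUMAN = "human"
LLM = "llama-2-7b-chat-tmp0.2"

# setting name -> (corpus_list, target_list or None);
# the names mirror the log file names used in the shell script.
SETTINGS = {
    "human": ([HUMAN], None),
    "llama2": ([LLM], None),
    "human_llama2": ([HUMAN, LLM], None),
    "human+llama2": ([HUMAN, LLM], [HUMAN, LLM]),
}


def build_commands(dataset, model, batch_size=128,
                   k_values=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100)):
    """Return {setting name: argv list} for one dataset/model pair."""
    commands = {}
    for name, (corpora, targets) in SETTINGS.items():
        argv = ["python", f"evaluate/retrieval/{model}.py",
                "--k_values", *map(str, k_values),
                "--corpus_list", *corpora]
        if targets is not None:
            # Only the last setting asks for per-target metrics.
            argv += ["--target_list", *targets]
        argv += ["--save_results=1", f"--dataset={dataset}",
                 f"--batch_size={batch_size}"]
        commands[name] = argv
    return commands
```

Each argv list can be printed for inspection or passed to `subprocess.run` from the benchmark folder, with `CUDA_VISIBLE_DEVICES` set in the environment as in the shell script.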

Our evaluation tool supports a variety of customized assessments, including the integration of corpora from different sources and the computation of metrics for specific target corpora. For further customization options, please refer to the code in our evaluate folder.
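
The idea of computing a metric for a specific target corpus within a mixed corpus can be illustrated with a minimal sketch. This is not the repository's implementation. It assumes that each LLM-generated document shares its id and relevance label with its human-written counterpart, and it uses a linear-gain NDCG. The function names `ndcg_at_k` and `target_ndcg` and the `source::doc_id` key scheme are hypothetical.

```python
import math


def ndcg_at_k(ranked_keys, relevance, k):
    """NDCG@k with linear gain; `relevance` maps a document key to its label."""
    dcg = sum(
        relevance.get(key, 0) / math.log2(rank + 2)
        for rank, key in enumerate(ranked_keys[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def target_ndcg(ranked, qrels, target, k):
    """NDCG@k over a mixed ranking, counting only `target` documents as relevant.

    ranked: list of (source, doc_id) pairs in ranked order over the mixed corpus.
    qrels:  {doc_id: relevance label} for one query (assumed shared across sources).
    """
    ranked_keys = [f"{source}::{doc_id}" for source, doc_id in ranked]
    relevance = {
        f"{target}::{doc_id}": rel for doc_id, rel in qrels.items() if rel > 0
    }
    return ndcg_at_k(ranked_keys, relevance, k)
```

Calling `target_ndcg` once per source on the same ranking gives one score per corpus. A gap between the two scores indicates that the retriever ranks documents from one source above equally relevant documents from the other.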

Available Datasets

All 16 benchmarked datasets in Cocktail are listed in the following table and are available on HuggingFace (https://huggingface.co/IR-Cocktail).

| Dataset | Raw Website | Cocktail Download | Cocktail-Name | md5 for Processed Data | Domain | Relevancy | # Test Query | # Corpus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MS MARCO | Homepage | Homepage | msmarco | 985926f3e906fadf0dc6249f23ed850f | Misc. | Binary | 6,979 | 542,203 |
| DL19 | Homepage | Homepage | dl19 | d652af47ec0e844af43109c0acf50b74 | Misc. | Binary | 43 | 542,203 |
| DL20 | Homepage | Homepage | dl20 | 3afc48141dce3405ede2b6b937c65036 | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | Homepage | Homepage | trec-covid | 1e1e2264b623d9cb7cb50df8141bd535 | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | Homepage | Homepage | nfcorpus | 695327760647984c5014d64b2fee8de0 | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | Homepage | Homepage | nq | a10bfe33efdec54aafcc974ac989c338 | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | Homepage | Homepage | hotpotqa | 74467760fff8bf8fbdadd5094bf9dd7b | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | Homepage | Homepage | fiqa | 4e1e688539b0622630fb6e65d39d26fa | Finance | Binary | 648 | 57,450 |
| Touché-2020 | Homepage | Homepage | webis-touche2020 | d58ec465ccd567d8f75edb419b0faaed | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | Homepage | Homepage | cqadupstack | d48d963bc72689c765f381f04fc26f8b | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | Homepage | Homepage | dbpedia-entity | 43292f4f1a1927e2e323a4a7fa165fc1 | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | Homepage | Homepage | scidocs | 4058c0915594ab34e9b2b67f885c595f | Scientific | Binary | 1,000 | 25,259 |
| FEVER | Homepage | Homepage | fever | 98b631887d8c38772463e9633c477c69 | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | Homepage | Homepage | climate-fever | 5734d6ac34f24f5da496b27e04ff991a | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | Homepage | Homepage | scifact | b5b8e24ccad98c9ca959061af14bf833 | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | Homepage | Homepage | nq-utd | 2e12e66393829cd4be715718f99d2436 | Misc. | 3-level | 80 | 800 |

To verify a downloaded file, generate its MD5 hash in a terminal with md5sum filename.zip and compare the result with the value listed in the table.

Checkpoints

We also provide checkpoints trained with train_msmarco_v3.py from BEIR, listed in the following table:

| Model | PLM | Pooling Strategy | Download |
| --- | --- | --- | --- |
| bert-base-uncased-mean-v3-msmarco | bert-base-uncased | mean | Link |
| bert-base-uncased-cls-v3-msmarco | bert-base-uncased | cls | Link |
| bert-base-uncased-last-v3-msmarco | bert-base-uncased | last | Link |
| bert-base-uncased-max-v3-msmarco | bert-base-uncased | max | Link |
| bert-base-uncased-weightedmean-v3-msmarco | bert-base-uncased | weighted-mean | Link |
| bert-mini-mean-v3-msmarco | bert-mini | mean | Link |
| bert-small-mean-v3-msmarco | bert-small | mean | Link |
| bert-large-uncased-mean-v3-msmarco | bert-large-uncased | mean | Link |
| roberta-base-mean-v3-msmarco | roberta-base | mean | Link |
| robreta-base-cls-v3-msmarco | roberta-base | cls | Link |
| robreta-base-last-v3-msmarco | roberta-base | last | Link |
| robreta-base-max-v3-msmarco | roberta-base | max | Link |
| robreta-base-weightedmean-v3-msmarco | roberta-base | weighted-mean | Link |
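
The pooling strategies named in the table differ in how token embeddings are reduced to a single vector. The sketch below illustrates them in plain Python. It assumes the names follow the usual sentence-embedding conventions (first token for cls, last non-padding token for last, position-weighted average for weighted-mean). It is not the checkpoints' actual code, and the function name `pool` is hypothetical.

```python
def pool(token_embeddings, attention_mask, strategy):
    """Reduce per-token vectors to one vector.

    token_embeddings: list of equal-length float lists, one per token.
    attention_mask:   list of 0/1 flags; 0 marks padding tokens.
    """
    valid = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not valid:
        raise ValueError("no unmasked tokens to pool")
    dim = len(valid[0])

    if strategy == "cls":
        return list(valid[0])  # first token, e.g. [CLS]
    if strategy == "last":
        return list(valid[-1])  # last non-padding token
    if strategy == "max":
        return [max(vec[j] for vec in valid) for j in range(dim)]
    if strategy == "mean":
        return [sum(vec[j] for vec in valid) / len(valid) for j in range(dim)]
    if strategy == "weighted-mean":
        # Later tokens receive linearly larger weights: 1, 2, ..., n.
        weights = range(1, len(valid) + 1)
        total = sum(weights)
        return [
            sum(w * vec[j] for w, vec in zip(weights, valid)) / total
            for j in range(dim)
        ]
    raise ValueError(f"unknown pooling strategy: {strategy}")
```

With three tokens of which the last is padding, `pool` ignores the padded vector under every strategy, so the choice of strategy only changes how the two real tokens are combined.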

Reference

The Cocktail benchmark is built based on the following projects:

Citation

If you find our benchmark or work useful for your research, please cite our work.

@article{dai2024cocktail,
  title={Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration},
  author={Dai, Sunhao and Liu, Weihao and Zhou, Yuqi and Pang, Liang and Ruan, Rongju and Wang, Gang and Dong, Zhenhua and Xu, Jun and Wen, Ji-Rong},
  journal={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}

@article{dai2024neural,
  title={Neural Retrievers are Biased Towards LLM-Generated Content},
  author={Dai, Sunhao and Zhou, Yuqi and Pang, Liang and Liu, Weihao and Hu, Xiaolin and Liu, Yong and Zhang, Xiao and Wang, Gang and Xu, Jun},
  journal={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year={2024}
}

License

The proposed NQ-UTD dataset uses the MIT license. All data and code in this project can only be used for academic purposes.