HAGRID: A Human-LLM Collaborative Dataset for Generative Information-seeking with Attribution

August 2, 2023

HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) is a dataset for generative information-seeking scenarios. It is constructed on top of MIRACL 🌍🙌🌏, an information retrieval dataset that consists of queries along with a set of manually labelled relevant passages (quotes).

We collect attributed explanations for each question by prompting GPT-3.5 with the given relevant passages. The explanations follow an in-context citation style, similar to that of scientific articles, in which citations reference the supporting quotes. We then ask human annotators to judge the explanations on two criteria:

Informativeness: whether they provide a direct answer to the question.
Attributability: whether they are attributable to the source passages.
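
To make the citation style concrete, the sketch below pulls bracketed citation markers out of an explanation and checks that each one points at an available quote. The `[1]`-style marker format, the 1-based numbering, the example sentence, and the helper names `extract_citations` and `dangling_citations` are assumptions for illustration, not part of the HAGRID release.

```python
import re
from typing import List

# Assumption: citations appear as bracketed integers such as "[1]" or "[2]".
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def extract_citations(explanation: str) -> List[int]:
    """Return cited quote numbers in order of first appearance, without duplicates."""
    seen: List[int] = []
    for match in CITATION_PATTERN.finditer(explanation):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def dangling_citations(explanation: str, num_quotes: int) -> List[int]:
    """Return cited numbers that match none of the num_quotes quotes (1-based)."""
    return [n for n in extract_citations(explanation) if not 1 <= n <= num_quotes]


# Made-up explanation that cites two quotes, one of them twice.
example = "The first claim is supported [1]. The second claim cites two quotes [2][1]."
print(extract_citations(example))
print(dangling_citations(example, num_quotes=1))
```

A check like this only verifies that a citation points at an existing quote. Whether the quote actually supports the sentence is the question the human attributability judgment answers.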

Data

HAGRID is hosted on Hugging Face 🤗 as miracl/hagrid.

import datasets
hagrid = datasets.load_dataset("miracl/hagrid", split="train")
print(hagrid[0])
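
Once a split is loaded, answers can be filtered by their human judgments. The following is a minimal sketch that assumes each record has an `answers` list whose items carry `answer`, `informative`, and `attributable` fields, with 1 meaning "yes" and a missing or `None` value meaning "not annotated". These field names are assumptions and should be checked against the dataset card; the helper `keep_verified_answers` and the toy record are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def keep_verified_answers(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect answer texts judged both informative and attributable.

    Assumed schema (not confirmed here): record["answers"] is a list of dicts
    with "answer", "informative", and "attributable" keys, where 1 means yes.
    """
    kept = []
    for record in records:
        for answer in record.get("answers", []):
            if answer.get("informative") == 1 and answer.get("attributable") == 1:
                kept.append(answer["answer"])
    return kept


# Toy record in the assumed schema, so the sketch runs without any download.
toy_records = [
    {
        "query": "example question",
        "answers": [
            {"answer": "Supported answer [1].", "informative": 1, "attributable": 1},
            {"answer": "Unsupported answer.", "informative": 1, "attributable": 0},
            {"answer": "Unjudged answer [1].", "informative": 1, "attributable": None},
        ],
    }
]
print(keep_verified_answers(toy_records))
```

With the real data, the same call would be `keep_verified_answers(hagrid)`, provided the assumed field names match.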

Split	#Q	#A	#Informativeness	#Attributability
Train	1,922	3,214	3,214	754
Dev	716	1,318	1,157	826
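
The counts in the table translate into a few ratios that help when sizing experiments: answers per question and the share of answers covered by each judgment. The sketch below only does arithmetic on the table's numbers, assuming the last two columns count answers that received the respective judgment. The dictionary layout and the function name `split_ratios` are illustrative.

```python
# Counts copied from the table above: questions, answers, and judged answers.
SPLITS = {
    "train": {"questions": 1922, "answers": 3214,
              "informativeness": 3214, "attributability": 754},
    "dev": {"questions": 716, "answers": 1318,
            "informativeness": 1157, "attributability": 826},
}


def split_ratios(counts):
    """Return answers per question and the fraction of answers judged per criterion."""
    return {
        "answers_per_question": counts["answers"] / counts["questions"],
        "informativeness_coverage": counts["informativeness"] / counts["answers"],
        "attributability_coverage": counts["attributability"] / counts["answers"],
    }


for name, counts in SPLITS.items():
    ratios = split_ratios(counts)
    print(name, {key: round(value, 3) for key, value in ratios.items()})
```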

@article{hagrid,
      title={{HAGRID}: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution}, 
      author={Ehsan Kamalloo and Aref Jafari and Xinyu Zhang and Nandan Thakur and Jimmy Lin},
      year={2023},
      journal={arXiv:2307.16883},
}

Quick Links

Data
Baselines (Coming soon!)
Contact
License
Citation