HAGRID: A Human-LLM Collaborative Dataset for Generative Information-seeking with Attribution

August 2, 2023

HAGRID

HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) is a dataset for generative information-seeking scenarios. It is constructed on top of MIRACL 🌍🙌🌏, an information retrieval dataset that consists of queries along with a set of manually labelled relevant passages (quotes).

We collect attributed explanations for each question by prompting GPT-3.5 with the given relevant passages. The explanations follow an in-context citation style, similar to that of scientific articles, in which citations reference the supporting quotes. We then ask human annotators to judge the explanations on two criteria:

  1. Informativeness: whether they provide a direct answer to the question.
  2. Attributability: whether they are attributable to the source passages.

Figure: HAGRID workflow
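
As a rough illustration of the citation style described above, the sketch below pulls citation markers out of an explanation and flags any that do not point at an available quote. The bracketed-integer marker format (such as [1]), the function names extract_citations and unsupported_citations, and the sample sentence are all assumptions made for illustration; they are not taken from the dataset itself.

```python
import re
from typing import List

# Assumption: citations appear as bracketed integers such as "[1]" or "[2]".
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def extract_citations(explanation: str) -> List[int]:
    """Return the distinct citation numbers in order of first appearance."""
    seen: List[int] = []
    for match in CITATION_PATTERN.finditer(explanation):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def unsupported_citations(explanation: str, num_quotes: int) -> List[int]:
    """Return cited numbers that fall outside the range 1..num_quotes."""
    return [n for n in extract_citations(explanation) if not 1 <= n <= num_quotes]


# Hypothetical example sentence, not drawn from HAGRID.
explanation = "The river is 120 km long [1] and flows into the sea [2][1]."
print(extract_citations(explanation))         # [1, 2]
print(unsupported_citations(explanation, 1))  # [2]
```

A check like this only covers whether a citation points at an existing quote. Whether the cited quote actually supports the claim is the attributability judgment that the human annotators make.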

Data

HAGRID is hosted on Hugging Face 🤗 as miracl/hagrid.

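import datasets

# Load the training split of HAGRID from the Hugging Face Hub.
hagrid = datasets.load_dataset("miracl/hagrid", split="train")
# Print the first example to inspect its fields.
print(hagrid[0])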

Split   #Q      #A      #Informativeness   #Attributability
Train   1,922   3,214   3,214              754
Dev     716     1,318   1,157              826
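
For a quick sense of scale, the #Q and #A counts in the table can be turned into an average number of answers per question. The snippet below is a small arithmetic sketch that hard-codes those two columns; the names SPLIT_COUNTS and answers_per_question are illustrative and not part of the repository.

```python
# (#Q, #A) counts copied from the statistics table.
SPLIT_COUNTS = {
    "train": (1922, 3214),
    "dev": (716, 1318),
}


def answers_per_question(num_questions: int, num_answers: int) -> float:
    """Average number of answers per question in a split."""
    return num_answers / num_questions


for split, (num_q, num_a) in SPLIT_COUNTS.items():
    ratio = answers_per_question(num_q, num_a)
    print(f"{split}: {ratio:.2f} answers per question")
# train: 1.67 answers per question
# dev: 1.84 answers per question
```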

Baselines (Coming soon!)

We are planning to release baseline models soon! Stay tuned!

Contact

If you have any questions, feel free to email us (project.miracl [at] gmail.com) or open a GitHub issue in this repository.

License

This work is licensed under the Apache License 2.0. See LICENSE for details.

Citation

If you find this dataset and repository helpful, please cite HAGRID as follows:

@article{hagrid,
      title={{HAGRID}: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution}, 
      author={Ehsan Kamalloo and Aref Jafari and Xinyu Zhang and Nandan Thakur and Jimmy Lin},
      year={2023},
      journal={arXiv:2307.16883},
}