README.md
February 10, 2022
We provide the following three pretraining data files, extracted from Wikipedia and the OpenWebText Corpus:
- `openwebtext_questions.txt` contains questions extracted from a subset of the OpenWebText Corpus downloaded here.
- `wiki_long.txt` contains long Wikipedia sequences (between 20 and 70 words) extracted from the 1M Wikipedia sentences downloaded with this script.
- `wiki_short.txt` contains short Wikipedia sequences (between 5 and 30 words) extracted from the 1M Wikipedia sentences downloaded with this script.
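
The word-count ranges above can be illustrated with a small filtering sketch. This is not the extraction code used to build the files. It assumes one sequence per line, whitespace tokenization, and inclusive bounds, none of which this README confirms. The function name `filter_by_word_count` and the sample sentences are hypothetical.

```python
from typing import Iterable, List


def filter_by_word_count(lines: Iterable[str], min_words: int, max_words: int) -> List[str]:
    """Keep stripped lines whose whitespace-split word count lies in [min_words, max_words].

    Assumption: words are separated by whitespace and both bounds are inclusive.
    """
    kept = []
    for line in lines:
        text = line.strip()
        n_words = len(text.split())
        if min_words <= n_words <= max_words:
            kept.append(text)
    return kept


# Hypothetical in-memory sample standing in for downloaded Wikipedia sentences.
sample = [
    "Too short.",
    "This sentence has exactly seven words total.",
    " ".join(["word"] * 25),
    " ".join(["word"] * 80),
]

# Ranges taken from the file descriptions: short = 5 to 30 words, long = 20 to 70 words.
short_seqs = filter_by_word_count(sample, 5, 30)
long_seqs = filter_by_word_count(sample, 20, 70)
print(len(short_seqs), len(long_seqs))
```

Because the two ranges overlap between 20 and 30 words, a sequence of that length passes both filters in this sketch.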