Data Preparation for BERT Pretraining
May 14, 2020
The following steps prepare the Wikipedia corpus for BERT pretraining. They can be used with little or no modification to preprocess other datasets as well:
- Download the Wikipedia dump file from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This is a compressed archive and needs to be decompressed first.
- Clone Wikiextractor and run it (running time can be 5-10 minutes/GB):

  ```
  git clone https://github.com/attardi/wikiextractor
  python3 wikiextractor/WikiExtractor.py -o out -b 1000M enwiki-latest-pages-articles.xml
  ```

  output: `out` directory
- Run the paragraph-extraction script. It removes HTML tags and empty lines and writes everything to one file where each line is a paragraph (`pip install tqdm` if needed; a rough sketch of this step appears after the list):

  ```
  ln -s out out2
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/single_line_doc_file_creation.py
  ```

  output: `wikipedia.txt`
- Run the sentence-segmentation script. It converts `wikipedia.txt` to one file where each line is a sentence (`pip install nltk` if needed; see the sketch after the list):

  ```
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/sentence_segmentation.py wikipedia.txt wikipedia.segmented.nltk.txt
  ```

  output: `wikipedia.segmented.nltk.txt`
- Split the above output file into ~100 files by line (see the sketch after the list):

  ```
  mkdir data_shards
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/split_data_into_files.py
  ```

  output: `data_shards` directory
- Run the pretraining-data script. It converts each shard into a pickled `.bin` file (see the sketch after the list):

  ```
  python3 AzureML-BERT/pretrain/PyTorch/dataprep/create_pretraining.py --input_dir=data_shards --output_dir=pickled_pretrain_data --do_lower_case=true
  ```

  output: `pickled_pretrain_data` directory
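
For reference, here is a minimal sketch of what the paragraph-extraction step does: walk the WikiExtractor output, strip the `<doc>` markers and empty lines, and write one paragraph per line. The directory layout (`out/*/wiki_*`) and the regex are assumptions based on WikiExtractor's defaults; the repo's `single_line_doc_file_creation.py` may differ in details.

```python
# Sketch only: collapse WikiExtractor output into one paragraph-per-line file.
import glob
import re

DOC_TAG = re.compile(r"</?doc.*?>")  # matches <doc id=... title=...> and </doc>

def to_single_file(extracted_dir="out", out_path="wikipedia.txt"):
    with open(out_path, "w", encoding="utf-8") as fout:
        # WikiExtractor writes files like out/AA/wiki_00, out/AB/wiki_01, ...
        for path in sorted(glob.glob(f"{extracted_dir}/*/wiki_*")):
            with open(path, encoding="utf-8") as fin:
                for line in fin:
                    line = DOC_TAG.sub("", line).strip()
                    if line:  # drop empty lines
                        fout.write(line + "\n")

if __name__ == "__main__":
    to_single_file()
```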
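The sentence-segmentation step can be approximated with NLTK's punkt tokenizer, turning each paragraph line into one sentence per line. This is only a sketch; the repo's `sentence_segmentation.py` may handle edge cases differently.

```python
# Sketch only: one sentence per output line, using NLTK sentence tokenization.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

def segment(in_path="wikipedia.txt", out_path="wikipedia.segmented.nltk.txt"):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for paragraph in fin:
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            for sentence in sent_tokenize(paragraph):
                fout.write(sentence + "\n")

if __name__ == "__main__":
    segment()
```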
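Splitting the segmented file into roughly 100 shards can be done with a simple round-robin write by line, as sketched below. The shard naming scheme and round-robin strategy are assumptions; the repo's `split_data_into_files.py` may instead keep related lines together.

```python
# Sketch only: split one large sentence-per-line file into ~100 shard files.
import os

def split_into_shards(input_path="wikipedia.segmented.nltk.txt",
                      output_dir="data_shards", n_shards=100):
    os.makedirs(output_dir, exist_ok=True)
    shards = [open(os.path.join(output_dir, f"shard_{i:03d}.txt"), "w", encoding="utf-8")
              for i in range(n_shards)]
    try:
        with open(input_path, encoding="utf-8") as fin:
            for i, line in enumerate(fin):
                shards[i % n_shards].write(line)  # distribute lines round-robin
    finally:
        for shard in shards:
            shard.close()

if __name__ == "__main__":
    split_into_shards()
```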
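Finally, the last step tokenizes each shard and serializes training instances to a pickled `.bin` file per shard. The sketch below uses the Hugging Face `BertTokenizer` and a simplified 15% random masking purely to illustrate the shape of the output; the repo's `create_pretraining.py` builds full BERT pretraining examples, and its on-disk format may differ.

```python
# Sketch only, not the repo's actual create_pretraining.py: tokenize each shard,
# apply simplified random masking, and pickle the instances to a .bin file.
import os
import pickle
import random
from transformers import BertTokenizer  # assumption: HuggingFace tokenizer as a stand-in

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # matches --do_lower_case=true

def mask_tokens(token_ids, mask_prob=0.15):
    """Replace ~15% of token ids with [MASK]; keep originals as labels (-1 = not masked)."""
    labels = [-1] * len(token_ids)
    masked = list(token_ids)
    for i in range(len(token_ids)):
        if random.random() < mask_prob:
            labels[i] = token_ids[i]
            masked[i] = tokenizer.mask_token_id
    return masked, labels

def pickle_shard(in_path, out_path, max_len=128):
    instances = []
    with open(in_path, encoding="utf-8") as fin:
        for line in fin:
            ids = tokenizer.encode(line.strip(), max_length=max_len, truncation=True)
            masked_ids, labels = mask_tokens(ids)
            instances.append({"input_ids": masked_ids, "masked_lm_labels": labels})
    with open(out_path, "wb") as fout:
        pickle.dump(instances, fout)

if __name__ == "__main__":
    os.makedirs("pickled_pretrain_data", exist_ok=True)
    for name in sorted(os.listdir("data_shards")):
        pickle_shard(os.path.join("data_shards", name),
                     os.path.join("pickled_pretrain_data", name + ".bin"))
```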