German Transformer Training

July 9, 2020

The goal of this repository is to plan the training of German transformer models.

1. Datasets / Data Sources

Germeval 2017: https://sites.google.com/view/germeval2017-absa/data

Dataset	Raw Size / Characters	Quality / Filtered?	URL	Notes / Status	Dupe Factor	Total = 178 GB
German Wikipedia Dump + Comments	5.4 GB / 5.3b	++			10	54 GB: 30 %
Oscar Corpus (Common Crawl 2018-47)	145 GB / 21b Words		Download		-----	------
FB cc_net (Common Crawl 2019-09)	Head 75 GB	+	Code	More broadly filtered versions (middle and tail) are also available	1	75 GB: 42 %
EU Book Shop	2.3 GB / 2.3b	+			5	11.5 GB: 6.5 %
News 2018	4.3 GB / 4.3b	+			5	20 GB: 11 %
Wortschatz Uni Leipzig	> 20 * 200 MB	Part of News 2018?	Code		----	----
Paracrawl	3.2 GB / 3.2b	--			---	----
Open Subtitles	1.3 GB / 288m Tokens	o			2	2.6 GB: 1.5 %
Open Legal Dump	3.6 GB / 3.5b	+	Announcement	Used by Deepset	5	15 GB: 8.4 %
Corpus of German-Language Fiction (txt)	2735 Prose Works		Download	Old (1510-1940)
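
The Dupe Factor column appears to act as an upsampling weight: each corpus is repeated that many times, and the weighted sizes add up to the 178 GB total. The following is a minimal sketch of that arithmetic, using the weighted sizes exactly as listed in the last column of the table. The names `weighted_gb` and `mixture_shares` are illustrative and not part of this repository.

```python
# Weighted sizes in GB, copied from the last column of the table above.
# Rows without a weighted size (Oscar, Wortschatz, Paracrawl) are left out.
weighted_gb = {
    "German Wikipedia Dump + Comments": 54.0,
    "FB cc_net (head)": 75.0,
    "EU Book Shop": 11.5,
    "News 2018": 20.0,
    "Open Subtitles": 2.6,
    "Open Legal Dump": 15.0,
}


def mixture_shares(sizes):
    """Return the total size and each corpus's share of it in percent."""
    total = sum(sizes.values())
    shares = {name: 100.0 * size / total for name, size in sizes.items()}
    return total, shares


total_gb, shares = mixture_shares(weighted_gb)
print(f"total: {total_gb:.1f} GB")
for name, pct in shares.items():
    print(f"{name}: {pct:.1f} %")
```

Raw size times dupe factor gives 21.5 GB for News 2018 and 18 GB for Open Legal Dump, slightly more than the 20 GB and 15 GB listed. The sketch uses the listed values so that the total matches the table.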

One Million Posts Corpus: https://ofai.github.io/million-post-corpus/

Additional Sources

Originally meant for translation tasks: WMT 19
Possibly identical to News 2018 (unconfirmed): Leipzig Corpus Collection
Huge German Corpus (HGC)

Data Preparation

Clean Files
Split into distinct sentences (2.2 Create Vocab)
Tokenize
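
The three preparation steps above could be sketched as follows. This is a standard-library illustration only, under simplifying assumptions. The regex sentence splitter is naive and will split wrongly on abbreviations such as "z. B.", and the word-level tokenizer stands in for the subword tokenizer that an actual training run would use. The function names `clean_text`, `split_sentences`, and `tokenize` are hypothetical.

```python
import re
import unicodedata

# Assumed boundary rule: sentence-final punctuation, whitespace, then an
# uppercase letter. Real German text needs a proper sentence segmenter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")

# Word characters as one token, every other non-space character on its own.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def clean_text(raw):
    """Normalize Unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split cleaned text into sentences using the naive boundary rule."""
    return [s for s in _SENTENCE_END.split(text) if s]


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


sample = "Das  ist ein Test.\nFunktioniert   das? Ja, es funktioniert!"
sentences = split_sentences(clean_text(sample))
tokens = [tokenize(s) for s in sentences]
```

Writing one sentence per line after the split step gives a file layout that most of the training guides linked under Training can consume directly.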

2. Training

Training

Pre-training SmallBERTa - A tiny model to train on a tiny dataset: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
Pretraining RoBERTa using your own data (Fairseq): https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md
How to train a new language model from scratch using Transformers and Tokenizers: https://huggingface.co/blog/how-to-train
Language model training: https://github.com/huggingface/transformers/tree/master/examples/language-modeling
How to train a new language model from scratch using Transformers and Tokenizers (Colab notebook): https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

NLP Libs

Fairseq - GitHub
Hugging Face - GitHub
FARM - GitHub
DeepSpeed: Speeding Up BERT Training @ Microsoft GitHub

Training Runs from scratch

Name	Steps	Result URL	Training Time	Code	Paper
RoBERTa Base					RoBERTa
BERT Large				GitHub	BERT

TPU Info

Overview preemptible TPUs TPU Unicorn

3. Evaluation Metrics

Comparison to other German & Multilingual Models

Name	Steps	Result URL	Training Time	Code	Metrics
Deepset German BERT Base	810k (1024 SL) + 30k (512 SL)	Deepset	9 Days TPU v2-8
dbmdz German BERT Base	1500k (512 SL)	dbmdz		dbmdz	stefan-it
Europeana BERT		dbmdz		Europeana-bert

4. Contact

Gitter