German Transformer Training

July 9, 2020

The goal of this repository is to plan the training of German transformer models.

1. Datasets / Data Sources

| Dataset | Raw Size / Characters | Quality / Filtered? | URL | Notes / Status | Dupe Factor | Total = 178 GB |
|---|---|---|---|---|---|---|
| German Wikipedia Dump + Comments | 5.4 GB / 5.3b | ++ | | | 10 | 54 GB = 30 % |
| Oscar Corpus (Common Crawl 2018-47) | 145 GB / 21b Words | | Download | | - | - |
| FB cc_net (Common Crawl 2019-09) | Head 75 GB | + | Code | More broadly filtered versions (middle & tail) available too | 1 | 75 GB = 42 % |
| EU Book Shop | 2.3 GB / 2.3b | + | | | 5 | 11.5 GB = 6.5 % |
| News 2018 | 4.3 GB / 4.3b | + | | | 5 | 20 GB = 11 % |
| Wortschatz Uni Leipzig | > 20 * 200 MB | Part of News 2018??? | Code | | - | - |
| Paracrawl | 3.2 GB / 3.2b | | | | - | - |
| Open Subtitles | 1.3 GB / 288m Tokens | o | | | 2 | 2.6 GB = 1.5 % |
| Open Legal Dump | 3.6 GB / 3.5b | + | Announcement | Used by Deepset | 5 | 15 GB = 8.4 % |
| Corpus of German-Language Fiction (txt) | 2735 Prose Works | | Download | Old (1510-1940) | | |
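
The Total column in the table above appears to be the raw size multiplied by the dupe factor, and the percentage appears to be that dataset's share of the combined mix. The sketch below reproduces this arithmetic under that assumption, using the raw sizes and dupe factors from the table. The names `SOURCES` and `mixture` are illustrative and not part of this repository. A straight multiplication gives 21.5 GB for News 2018 and 18 GB for Open Legal Dump, slightly more than the 20 GB and 15 GB listed, so the computed total comes out near 182.6 GB rather than 178 GB.

```python
# Raw size in GB and dupe factor per dataset, copied from the table above.
# Rows without a dupe factor are left out of the mix.
SOURCES = [
    ("German Wikipedia Dump + Comments", 5.4, 10),
    ("FB cc_net (head)", 75.0, 1),
    ("EU Book Shop", 2.3, 5),
    ("News 2018", 4.3, 5),
    ("Open Subtitles", 1.3, 2),
    ("Open Legal Dump", 3.6, 5),
]


def mixture(sources):
    """Return ({name: (weighted_gb, share)}, total_gb).

    Assumption: weighted size = raw size * dupe factor.
    """
    weighted = {name: raw_gb * factor for name, raw_gb, factor in sources}
    total_gb = sum(weighted.values())
    shares = {name: (gb, gb / total_gb) for name, gb in weighted.items()}
    return shares, total_gb


if __name__ == "__main__":
    shares, total_gb = mixture(SOURCES)
    for name, (gb, share) in shares.items():
        print(f"{name:35s} {gb:6.1f} GB  {share:5.1%}")
    print(f"Total: {total_gb:.1f} GB")
```

Changing a dupe factor in `SOURCES` shows directly how much a small, high-quality corpus such as Wikipedia is upweighted relative to the Common Crawl data.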

https://ofai.github.io/million-post-corpus/

Additional Sources

Data Preparation

  1. Clean Files
  2. Split into distinct sentences (2.2 Create Vocab)
  3. Tokenize
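
The three preparation steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in this repository: the regex-based sentence splitter and word tokenizer are simple stand-ins (German abbreviations such as "z. B." will be split incorrectly), and all function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def clean_text(raw: str) -> str:
    """Step 1: normalize Unicode, replace control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    # Unicode categories starting with "C" cover control and format characters.
    text = "".join(" " if unicodedata.category(ch).startswith("C") else ch for ch in text)
    return re.sub(r"\s+", " ", text).strip()


# Naive boundary: sentence-final punctuation, whitespace, then an uppercase
# letter, digit, or opening quote. A real splitter needs abbreviation handling.
_SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-ZÄÖÜ0-9"„])')


def split_sentences(text: str) -> list:
    """Step 2: split cleaned text into distinct sentences."""
    return [s for s in _SENT_BOUNDARY.split(text) if s]


def tokenize(sentence: str) -> list:
    """Step 3: split a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def build_vocab(sentences) -> Counter:
    """Step 2.2: count token frequencies over all sentences."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(tokenize(sentence))
    return vocab
```

For actual pre-training, the whitespace-and-punctuation tokenizer would typically be replaced by a subword tokenizer trained on the cleaned sentences; the counting step here only illustrates where vocabulary creation fits in the sequence.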

2. Training

Training

NLP Libs

Training Runs from scratch

| Name | Steps | Result URL | Training Time | Code | Paper |
|---|---|---|---|---|---|
| RoBERTa Base | | | | | RoBERTa |
| BERT Large | | | | Github | BERT |

TPU Infos

3. Evaluation Metrics

Comparison to other German & Multilingual Models

| Name | Steps | Result URL | Training Time | Code | Metrics |
|---|---|---|---|---|---|
| Deepset German Bert Base | 810k (1024 SL) + 30k (512 SL) | Deepset | 9 Days TPU v2-8 | | |
| dbmdz German Bert Base | 1500k (512 SL) | dbmdz | | dbmdz | stefan-it |
| Europeana BERT | | dbmdz | | Europeana-bert | |

4. Contact