NLPdeLUX
June 9, 2023
This repo collects and documents NLP tools and resources for Luxembourgish. If you know of any that should be featured here, let us know.
NLP tools
LuxemBERT
Description: First Luxembourgish BERT model trained from scratch, with optimizations for several downstream tasks such as text classification, NER, or intent classification.
Link: huggingface.co/lothritz/LuxemBERT
spaCy
Description: Starting with version 2.2.2, spaCy has language support for Luxembourgish. This includes tokenization and POS tagging.
Link: github.com/explosion/spaCy
spellux
Description: Automatic text normalization tool for Luxembourgish (spelling correction, lemmatization). Currently in development for training and evaluation.
Link: github.com/questoph/spellux
Syllabifier-for-Luxembourgish
Description: Implementation of the Penn Phonetics Toolkit for Luxembourgish, developed by Peter Gilles. Allows the phonetic syllabification of transcribed words.
Link: github.com/PeterGilles/Syllabifier-for-Luxembourgish
Automatic voice recognition
wav2vec2-large-xls-r-LUXEMBOURGISH2
Description: This is a first experimental build of an automatic voice recognition system by Peter Gilles, trained on a custom data set (~8 hours of Luxembourgish audio and transcript data).
Link: https://huggingface.co/pgilles/wav2vec2-large-xls-r-LUXEMBOURGISH2
OCR
tesseract
Description: Starting with version 4.0, tesseract has language support for Luxembourgish for Optical Character Recognition.
Link: github.com/tesseract-ocr/tesseract
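A minimal sketch of calling tesseract on a Luxembourgish scan from Python. It assumes the tesseract executable and the Luxembourgish language data are installed, and that the language code is `ltz`; the file names `scan.png` and `scan_out` are hypothetical placeholders.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "ltz") -> List[str]:
    """Return the argument list for a tesseract CLI call (does not run it)."""
    # CLI shape: tesseract <image> <output base> -l <language code>
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str) -> str:
    """Run tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # By default tesseract appends ".txt" to the output base name.
    return output_base + ".txt"


# Only build and show the command here; call run_ocr() on a real image to do OCR.
command = build_tesseract_command("scan.png", "scan_out")
print(" ".join(command))
```

Keeping the command construction separate from the call makes it easy to inspect the exact invocation before running it on a batch of images.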
Resources
Luxembourgish language resources
Description: Phonetic transcriptions of the lemma lists from spellchecker.lu and lod.lu
Link: github.com/PeterGilles/Luxembourgish-language-resources
Luxembourgish dictionaries
Description: HunSpell dictionary and MyThes thesaurus for the Luxembourgish language based on spellchecker.lu
Link: github.com/spellchecker-lu/dictionary-lb-lu
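As a rough illustration of how a HunSpell word list can be read without extra libraries, the sketch below parses text in the `.dic` format (first line: entry count, following lines: `word/FLAGS`). The sample entries and flags are made up for the example and are not taken from the repository; the parser is simplified and keeps the flags as an uninterpreted string.

```python
from typing import List, Tuple


def parse_dic(text: str) -> List[Tuple[str, str]]:
    """Parse HunSpell .dic content into (word, flags) pairs."""
    lines = text.splitlines()
    # The first line of a .dic file holds the (approximate) number of entries.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Simplification: ignore optional morphological fields after whitespace.
        head = line.split()[0]
        # Affix flags, if any, follow the first slash.
        word, _, flags = head.partition("/")
        entries.append((word, flags))
    return entries


# Illustrative sample only; real entries and flags come from the dictionary file.
sample = "3\nHaus/A\nKand\nschwätzen/BC\n"
entries = parse_dic(sample)
print(entries)
```

How the flags expand into word forms is defined by the accompanying `.aff` file, which this sketch does not read.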
Luxembourgish word embedding
Description: This dataset is a word embedding model trained on Luxembourgish user comments from the media platform RTL.lu. It contains data from roughly 544k Luxembourgish texts published between December 2008 and December 2018.
Link: https://zenodo.org/record/3978066
Universal dependencies
Description: Repository for Luxembourgish as part of the Universal Dependencies project, with POS-annotated data.
Link: github.com/UniversalDependencies/UD_Luxembourgish-LuxBank
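Universal Dependencies treebanks are distributed in the CoNLL-U format (ten tab-separated columns per token, blank lines between sentences). The sketch below reads such text with the standard library and extracts word/POS pairs. The sample sentence and its annotation are written for this example and are not taken from the treebank; the helper name `read_conllu` is hypothetical.

```python
from typing import Dict, Iterator, List

CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of token dicts per sentence in CoNLL-U text."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Illustrative sample only (hand-written annotation, not from the repository).
rows = [
    ["1", "Ech", "ech", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "schwätzen", "schwätzen", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "Lëtzebuergesch", "Lëtzebuergesch", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
sample = "# text = Ech schwätzen Lëtzebuergesch.\n"
sample += "\n".join("\t".join(row) for row in rows) + "\n\n"

sentences = list(read_conllu(sample))
pos_tags = [(tok["form"], tok["upos"]) for tok in sentences[0]]
print(pos_tags)
```

To use it on the treebank, pass the contents of one of the repository's `.conllu` files to `read_conllu` instead of the sample string.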