NLP for Urdu
January 5, 2020
This repository contains state-of-the-art language models and a classifier for Urdu, which is spoken mainly in Pakistan and India, and also in Nepal, Bangladesh and several other countries.
The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).
Dataset
Created as part of this project
Results
Language Model Perplexity
| Architecture/Dataset | Urdu Wikipedia Articles |
|---|---|
| ULMFiT | 13.19 |
| TransformerXL | 12.55 |
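
For readers unfamiliar with the metric, the sketch below shows the usual definition of perplexity: the exponential of the mean negative log-likelihood per token. It assumes the values in the table follow this definition; the repository does not spell out how they were computed. The function name `perplexity` and the toy probabilities are illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input (not real model output): a model that assigns probability 0.25
# to each of four tokens is as uncertain as a uniform choice among 4 options.
toy_log_probs = [math.log(0.25)] * 4
print(perplexity(toy_log_probs))
```

Lower is better. Because the value is computed per token, perplexities are only directly comparable between models that share the same tokenizer and evaluation text.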
Classification Metrics
ULMFiT
| Dataset | Accuracy | Kappa Score |
|---|---|---|
| Urdu News Dataset | 95.28 | 91.58 |
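
As a reminder of what the two columns measure, here is a minimal sketch of accuracy and Cohen's kappa, assuming the "Kappa Score" column refers to Cohen's kappa and that both columns are reported on a 0 to 100 scale. The function names and the toy labels are hypothetical and are not taken from the Urdu News Dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if predictions were independent of the true labels.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for illustration only.
y_true = ["sports", "sports", "business", "business", "science", "science"]
y_pred = ["sports", "sports", "business", "science", "science", "science"]
print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

Kappa is lower than accuracy whenever some agreement would be expected by chance, which makes it a useful complement on datasets with imbalanced classes.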
Visualizations
Embedding Space
| Architecture | Visualization |
|---|---|
| ULMFiT | Embeddings projection |
| TransformerXL | Embeddings projection |
Pretrained Language Model
Download pretrained ULMFiT LM from here
Download pretrained TransformerXL LM from here
Classifier
Download classifier from here
Tokenizer
The tokenizer was trained using Google's SentencePiece.
Download the trained model and vocabulary from here
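
Once the model file is downloaded, it can be loaded with the `sentencepiece` Python package. The following is a minimal sketch rather than the repository's own loading code: the helper name `tokenize_urdu` and the file name in the usage comment are hypothetical, so substitute the path of the file you actually downloaded.

```python
def tokenize_urdu(text, model_path):
    """Split text into subword pieces using a trained SentencePiece model."""
    # Imported lazily so this module loads even if the package is missing.
    import sentencepiece as spm  # third-party: pip install sentencepiece

    processor = spm.SentencePieceProcessor(model_file=model_path)
    # out_type=str returns the subword pieces; out_type=int returns vocabulary ids.
    return processor.encode(text, out_type=str)


# Usage (the file name is an assumption, not the name used in this repository):
# pieces = tokenize_urdu("اردو ایک زبان ہے", "urdu_tokenizer.model")
```

The same tokenizer should be used at inference time as was used during training, since the language models and classifier depend on its vocabulary.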