No Language Left Behind Seed Data
November 20, 2023
NLLB-Seed is a set of professionally translated sentences in the Wikipedia domain. The data was sampled from Wikimedia’s List of articles every Wikipedia should have, a collection of topics in different fields of knowledge and human activity. NLLB-Seed consists of around six thousand sentences in 39 languages. It is meant to be used for training rather than model evaluation. Because of this difference in purpose, NLLB-Seed does not go through the human quality assurance process present in FLORES-200.
Download
⚠️ This repository is no longer being updated ⚠️
For newer versions of this dataset, see https://github.com/openlanguagedata/seed and https://www.oldi.org.
The original version of the dataset can still be downloaded here.
Languages in NLLB-Seed
| Language | FLORES-200 code |
|---|---|
| Acehnese (Arabic script) | ace_Arab |
| Acehnese (Latin script) | ace_Latn |
| Moroccan Arabic | ary_Arab |
| Egyptian Arabic | arz_Arab |
| Bambara | bam_Latn |
| Balinese | ban_Latn |
| Bhojpuri | bho_Deva |
| Banjar (Arabic script) | bjn_Arab |
| Banjar (Latin script) | bjn_Latn |
| Buginese | bug_Latn |
| Crimean Tatar | crh_Latn |
| Southwestern Dinka | dik_Latn |
| Dzongkha | dzo_Tibt |
| Friulian | fur_Latn |
| Nigerian Fulfulde | fuv_Latn |
| Guarani | grn_Latn |
| Chhattisgarhi | hne_Deva |
| Kashmiri (Arabic script) | kas_Arab |
| Kashmiri (Devanagari script) | kas_Deva |
| Central Kanuri (Arabic script) | knc_Arab |
| Central Kanuri (Latin script) | knc_Latn |
| Ligurian | lij_Latn |
| Limburgish | lim_Latn |
| Lombard | lmo_Latn |
| Latgalian | ltg_Latn |
| Magahi | mag_Deva |
| Meitei (Bengali script) | mni_Beng |
| Maori | mri_Latn |
| Nuer | nus_Latn |
| Dari | prs_Arab |
| Southern Pashto | pbt_Arab |
| Sicilian | scn_Latn |
| Shan | shn_Mymr |
| Sardinian | srd_Latn |
| Silesian | szl_Latn |
| Tamasheq (Latin script) | taq_Latn |
| Tamasheq (Tifinagh script) | taq_Tfng |
| Central Atlas Tamazight | tzm_Tfng |
| Venetian | vec_Latn |
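
Each code in the table joins a three-letter language code and a four-letter script code with an underscore. The following is a minimal sketch of splitting such codes, for example to find languages that appear in more than one script. The names `FloresCode` and `split_flores_code` are hypothetical helpers introduced here for illustration. They are not part of any NLLB tooling.

```python
from typing import NamedTuple


class FloresCode(NamedTuple):
    language: str  # three-letter language part, e.g. "ace"
    script: str    # four-letter script part, e.g. "Arab"


def split_flores_code(code: str) -> FloresCode:
    """Split a code such as 'ace_Arab' into its language and script parts.

    Hypothetical helper: assumes the '<3 letters>_<4 letters>' shape
    seen in the table above.
    """
    language, sep, script = code.partition("_")
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"unexpected code format: {code!r}")
    return FloresCode(language, script)


# A few codes copied from the table above.
codes = ["ace_Arab", "ace_Latn", "kas_Arab", "kas_Deva", "taq_Latn", "taq_Tfng"]

# Group scripts by language to see which languages come in several scripts.
scripts_by_language = {}
for code in codes:
    parsed = split_flores_code(code)
    scripts_by_language.setdefault(parsed.language, []).append(parsed.script)
```

Keeping the script as a separate field makes it easy to treat, say, `kas_Arab` and `kas_Deva` as the same language when grouping data, while still distinguishing them when a script-specific tokenizer or font is needed.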