No Language Left Behind Seed Data

November 20, 2023

NLLB-Seed is a set of professionally translated sentences in the Wikipedia domain. The data was sampled from Wikimedia's "List of articles every Wikipedia should have", a collection of topics in different fields of knowledge and human activity. NLLB-Seed consists of around six thousand sentences in 39 languages. It is meant to be used for training rather than for model evaluation. For that reason, NLLB-Seed does not go through the human quality assurance process present in FLORES-200.


Download

⚠️ This repository is no longer being updated ⚠️

For newer versions of this dataset, see https://github.com/openlanguagedata/seed and https://www.oldi.org.

The original version of the dataset can still be downloaded here.

Languages in NLLB-Seed

| Language | FLORES-200 code |
| --- | --- |
| Acehnese (Arabic script) | ace_Arab |
| Acehnese (Latin script) | ace_Latn |
| Moroccan Arabic | ary_Arab |
| Egyptian Arabic | arz_Arab |
| Bambara | bam_Latn |
| Balinese | ban_Latn |
| Bhojpuri | bho_Deva |
| Banjar (Arabic script) | bjn_Arab |
| Banjar (Latin script) | bjn_Latn |
| Buginese | bug_Latn |
| Crimean Tatar | crh_Latn |
| Southwestern Dinka | dik_Latn |
| Dzongkha | dzo_Tibt |
| Friulian | fur_Latn |
| Nigerian Fulfulde | fuv_Latn |
| Guarani | grn_Latn |
| Chhattisgarhi | hne_Deva |
| Kashmiri (Arabic script) | kas_Arab |
| Kashmiri (Devanagari script) | kas_Deva |
| Central Kanuri (Arabic script) | knc_Arab |
| Central Kanuri (Latin script) | knc_Latn |
| Ligurian | lij_Latn |
| Limburgish | lim_Latn |
| Lombard | lmo_Latn |
| Latgalian | ltg_Latn |
| Magahi | mag_Deva |
| Meitei (Bengali script) | mni_Beng |
| Maori | mri_Latn |
| Nuer | nus_Latn |
| Dari | prs_Arab |
| Southern Pashto | pbt_Arab |
| Sicilian | scn_Latn |
| Shan | shn_Mymr |
| Sardinian | srd_Latn |
| Silesian | szl_Latn |
| Tamasheq (Latin script) | taq_Latn |
| Tamasheq (Tifinagh script) | taq_Tfng |
| Central Atlas Tamazight | tzm_Tfng |
| Venetian | vec_Latn |
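
Each code in the table above pairs a three-letter language part with a four-letter script part, joined by an underscore. The following is a minimal Python sketch of how such codes could be split and grouped, for example to find languages that appear in more than one script. The names `FloresCode`, `parse_flores_code`, and `group_by_language` are hypothetical helpers written for this illustration and are not part of any NLLB tooling.

```python
from collections import defaultdict
from typing import NamedTuple


class FloresCode(NamedTuple):
    """Hypothetical container for the two parts of a code."""
    language: str  # three-letter language part, e.g. "ace"
    script: str    # four-letter script part, e.g. "Arab"


def parse_flores_code(code: str) -> FloresCode:
    """Split a code such as 'ace_Arab' into its language and script parts."""
    language, sep, script = code.partition("_")
    # Assumption: every code follows the lll_Ssss shape seen in the table.
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"unexpected code format: {code!r}")
    return FloresCode(language, script)


def group_by_language(codes):
    """Map each language part to the list of scripts it appears with."""
    scripts = defaultdict(list)
    for code in codes:
        parsed = parse_flores_code(code)
        scripts[parsed.language].append(parsed.script)
    return dict(scripts)


# A few codes copied from the table above.
sample = ["ace_Arab", "ace_Latn", "kas_Arab", "kas_Deva", "mri_Latn"]

# Keep only the languages that occur with more than one script.
multi_script = {
    lang: found for lang, found in group_by_language(sample).items()
    if len(found) > 1
}
print(multi_script)  # {'ace': ['Arab', 'Latn'], 'kas': ['Arab', 'Deva']}
```

Grouping by the language part is useful when one language is counted once regardless of script, while the full code remains the key that distinguishes the individual translation sets.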