No Language Left Behind Seed Data

November 20, 2023

NLLB-Seed is a set of professionally translated sentences in the Wikipedia domain. The data was sampled from Wikimedia’s "List of articles every Wikipedia should have", a collection of topics in different fields of knowledge and human activity. NLLB-Seed consists of around six thousand sentences in 39 languages. It is meant to be used for training rather than for model evaluation. Because of this difference in purpose, NLLB-Seed does not go through the human quality assurance process present in FLORES-200.

Download

⚠️ This repository is no longer being updated ⚠️

For newer versions of this dataset, see https://github.com/openlanguagedata/seed and https://www.oldi.org.

The original version of the dataset can still be downloaded here.

Languages in NLLB-Seed

Language	FLORES-200 code
Acehnese (Arabic script)	ace_Arab
Acehnese (Latin script)	ace_Latn
Moroccan Arabic	ary_Arab
Egyptian Arabic	arz_Arab
Bambara	bam_Latn
Balinese	ban_Latn
Bhojpuri	bho_Deva
Banjar (Arabic script)	bjn_Arab
Banjar (Latin script)	bjn_Latn
Buginese	bug_Latn
Crimean Tatar	crh_Latn
Southwestern Dinka	dik_Latn
Dzongkha	dzo_Tibt
Friulian	fur_Latn
Nigerian Fulfulde	fuv_Latn
Guarani	grn_Latn
Chhattisgarhi	hne_Deva
Kashmiri (Arabic script)	kas_Arab
Kashmiri (Devanagari script)	kas_Deva
Central Kanuri (Arabic script)	knc_Arab
Central Kanuri (Latin script)	knc_Latn
Ligurian	lij_Latn
Limburgish	lim_Latn
Lombard	lmo_Latn
Latgalian	ltg_Latn
Magahi	mag_Deva
Meitei (Bengali script)	mni_Beng
Maori	mri_Latn
Nuer	nus_Latn
Dari	prs_Arab
Southern Pashto	pbt_Arab
Sicilian	scn_Latn
Shan	shn_Mymr
Sardinian	srd_Latn
Silesian	szl_Latn
Tamasheq (Latin script)	taq_Latn
Tamasheq (Tifinagh script)	taq_Tfng
Central Atlas Tamazight	tzm_Tfng
Venetian	vec_Latn
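
Each code in the table joins a three-letter language identifier and a four-letter script tag with an underscore, which is why languages written in more than one script appear twice. The snippet below is a minimal sketch of splitting the codes and grouping them by language. The helper names (`split_code`, `scripts_by_language`) and the short `CODES` sample are illustrative assumptions and are not part of the dataset or its tooling.

```python
from collections import defaultdict

# A handful of codes copied from the table above (not the full list).
CODES = [
    "ace_Arab", "ace_Latn", "bam_Latn", "dzo_Tibt",
    "kas_Arab", "kas_Deva", "taq_Latn", "taq_Tfng",
]


def split_code(code: str) -> tuple[str, str]:
    """Split a code such as 'ace_Arab' into ('ace', 'Arab')."""
    language, script = code.split("_", 1)
    return language, script


def scripts_by_language(codes):
    """Map each language identifier to the list of scripts it appears in."""
    grouped = defaultdict(list)
    for code in codes:
        language, script = split_code(code)
        grouped[language].append(script)
    return dict(grouped)


grouped = scripts_by_language(CODES)

# Keep only the languages that occur with more than one script.
multi_script = {lang: scripts for lang, scripts in grouped.items() if len(scripts) > 1}
print(multi_script)
```

Grouping this way can be handy when a training pipeline should treat the two script variants of one language as related rather than as unrelated languages.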