The FLORES-200 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

November 20, 2023


FLORES-200 doubles the language coverage of FLORES-101. Because many of the new languages are less standardized and require more specialized professional translation, the verification process became more complex, which in turn required modifications to the translation workflow. Several FLORES-200 languages were not translated from English; they were translated from Spanish, French, Russian, or Modern Standard Arabic instead. FLORES-200 also includes two script alternatives for four languages.


Composition

FLORES-200 consists of translations from 842 distinct web articles, totaling 3001 sentences. These sentences are divided into three splits: dev, devtest, and test (hidden). On average, sentences are approximately 21 words long.
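As a rough illustration of how the average sentence length could be checked on a downloaded split, the sketch below computes the sentence count and the mean number of whitespace-separated words. It assumes one sentence per line; the path in the usage comment is a hypothetical placeholder, and whitespace splitting only approximates word counts for languages written without spaces.

```python
from pathlib import Path
from statistics import mean


def sentence_stats(sentences):
    """Return (number of sentences, mean whitespace-separated words per sentence)."""
    # Ignore blank lines so a trailing newline does not count as a sentence.
    cleaned = [s.strip() for s in sentences if s.strip()]
    if not cleaned:
        return 0, 0.0
    return len(cleaned), mean([len(s.split()) for s in cleaned])


def split_stats(path):
    """Same statistics for a file holding one sentence per line (assumed layout)."""
    text = Path(path).read_text(encoding="utf-8")
    return sentence_stats(text.splitlines())


# Hypothetical usage; replace the path with a real split file:
# count, avg_words = split_stats("/path/to/flores_dataset/dev/eng_Latn.dev")
```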

Download

⚠️ This repository is no longer being updated ⚠️

For newer versions of this dataset, see https://github.com/openlanguagedata/flores and https://www.oldi.org.

The original version of the dataset can still be downloaded here and is also available on HuggingFace here.

SPM and Dictionary

  • Dictionary Download here
  • SPM Model Download here

Example SentencePiece Usage

Note: Install SentencePiece from here

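# Placeholder paths; replace them with your local copies.
flores_dataset=/path/to/flores_dataset
fairseq=/path/to/fairseq
cd "$fairseq"

# Encode raw text into SentencePiece pieces using the FLORES SPM model.
python scripts/spm_encode.py \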
    --model flores_spm_model_here \
    --output_format=piece \
    --inputs=data_input_path_here \
    --outputs=data_output_path_here

Evaluation

We primarily evaluate with chrF++:

sacrebleu -m chrf --chrf-word-order 2 {ref_file} < {hyp_file}

and also evaluate with spBLEU:

# tokenize with SPM
python scripts/spm_encode.py \
    --model flores_spm_model_here \
    --output_format=piece \
    --inputs={untok_hyp_file} \
    --outputs={hyp_file}

# calculate with sacrebleu
cat {hyp_file} | sacrebleu {ref_file}
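
When many translation directions need scoring, the two commands above can be wrapped in a small Python helper. This is a minimal sketch, assuming `sacrebleu` is installed and available on `PATH`. The function names are illustrative and not part of this repository, and the flags are exactly the ones shown above.

```python
import subprocess


def chrf_pp_command(ref_file):
    """Argument list for the chrF++ command shown above (hypothesis is read from stdin)."""
    return ["sacrebleu", "-m", "chrf", "--chrf-word-order", "2", str(ref_file)]


def spbleu_command(ref_file):
    """Argument list for the spBLEU command shown above (SPM-encoded hypothesis on stdin)."""
    return ["sacrebleu", str(ref_file)]


def run_metric(command, hyp_file):
    """Run sacrebleu with the hypothesis file on stdin and return its raw text output."""
    with open(hyp_file, "rb") as hyp:
        result = subprocess.run(command, stdin=hyp, capture_output=True, check=True)
    return result.stdout.decode("utf-8")
```

For example, `run_metric(chrf_pp_command(ref_file), hyp_file)` mirrors the chrF++ invocation, with `ref_file` and `hyp_file` standing in for your own paths.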

Languages in FLORES-200

| Language | FLORES-200 code |
|---|---|
| Acehnese (Arabic script) | ace_Arab |
| Acehnese (Latin script) | ace_Latn |
| Mesopotamian Arabic | acm_Arab |
| Ta’izzi-Adeni Arabic | acq_Arab |
| Tunisian Arabic | aeb_Arab |
| Afrikaans | afr_Latn |
| South Levantine Arabic | ajp_Arab |
| Akan | aka_Latn |
| Amharic | amh_Ethi |
| North Levantine Arabic | apc_Arab |
| Modern Standard Arabic | arb_Arab |
| Modern Standard Arabic (Romanized) | arb_Latn |
| Najdi Arabic | ars_Arab |
| Moroccan Arabic | ary_Arab |
| Egyptian Arabic | arz_Arab |
| Assamese | asm_Beng |
| Asturian | ast_Latn |
| Awadhi | awa_Deva |
| Central Aymara | ayr_Latn |
| South Azerbaijani | azb_Arab |
| North Azerbaijani | azj_Latn |
| Bashkir | bak_Cyrl |
| Bambara | bam_Latn |
| Balinese | ban_Latn |
| Belarusian | bel_Cyrl |
| Bemba | bem_Latn |
| Bengali | ben_Beng |
| Bhojpuri | bho_Deva |
| Banjar (Arabic script) | bjn_Arab |
| Banjar (Latin script) | bjn_Latn |
| Standard Tibetan | bod_Tibt |
| Bosnian | bos_Latn |
| Buginese | bug_Latn |
| Bulgarian | bul_Cyrl |
| Catalan | cat_Latn |
| Cebuano | ceb_Latn |
| Czech | ces_Latn |
| Chokwe | cjk_Latn |
| Central Kurdish | ckb_Arab |
| Crimean Tatar | crh_Latn |
| Welsh | cym_Latn |
| Danish | dan_Latn |
| German | deu_Latn |
| Southwestern Dinka | dik_Latn |
| Dyula | dyu_Latn |
| Dzongkha | dzo_Tibt |
| Greek | ell_Grek |
| English | eng_Latn |
| Esperanto | epo_Latn |
| Estonian | est_Latn |
| Basque | eus_Latn |
| Ewe | ewe_Latn |
| Faroese | fao_Latn |
| Fijian | fij_Latn |
| Finnish | fin_Latn |
| Fon | fon_Latn |
| French | fra_Latn |
| Friulian | fur_Latn |
| Nigerian Fulfulde | fuv_Latn |
| Scottish Gaelic | gla_Latn |
| Irish | gle_Latn |
| Galician | glg_Latn |
| Guarani | grn_Latn |
| Gujarati | guj_Gujr |
| Haitian Creole | hat_Latn |
| Hausa | hau_Latn |
| Hebrew | heb_Hebr |
| Hindi | hin_Deva |
| Chhattisgarhi | hne_Deva |
| Croatian | hrv_Latn |
| Hungarian | hun_Latn |
| Armenian | hye_Armn |
| Igbo | ibo_Latn |
| Ilocano | ilo_Latn |
| Indonesian | ind_Latn |
| Icelandic | isl_Latn |
| Italian | ita_Latn |
| Javanese | jav_Latn |
| Japanese | jpn_Jpan |
| Kabyle | kab_Latn |
| Jingpho | kac_Latn |
| Kamba | kam_Latn |
| Kannada | kan_Knda |
| Kashmiri (Arabic script) | kas_Arab |
| Kashmiri (Devanagari script) | kas_Deva |
| Georgian | kat_Geor |
| Central Kanuri (Arabic script) | knc_Arab |
| Central Kanuri (Latin script) | knc_Latn |
| Kazakh | kaz_Cyrl |
| Kabiyè | kbp_Latn |
| Kabuverdianu | kea_Latn |
| Khmer | khm_Khmr |
| Kikuyu | kik_Latn |
| Kinyarwanda | kin_Latn |
| Kyrgyz | kir_Cyrl |
| Kimbundu | kmb_Latn |
| Northern Kurdish | kmr_Latn |
| Kikongo | kon_Latn |
| Korean | kor_Hang |
| Lao | lao_Laoo |
| Ligurian | lij_Latn |
| Limburgish | lim_Latn |
| Lingala | lin_Latn |
| Lithuanian | lit_Latn |
| Lombard | lmo_Latn |
| Latgalian | ltg_Latn |
| Luxembourgish | ltz_Latn |
| Luba-Kasai | lua_Latn |
| Ganda | lug_Latn |
| Luo | luo_Latn |
| Mizo | lus_Latn |
| Standard Latvian | lvs_Latn |
| Magahi | mag_Deva |
| Maithili | mai_Deva |
| Malayalam | mal_Mlym |
| Marathi | mar_Deva |
| Minangkabau (Arabic script) | min_Arab |
| Minangkabau (Latin script) | min_Latn |
| Macedonian | mkd_Cyrl |
| Plateau Malagasy | plt_Latn |
| Maltese | mlt_Latn |
| Meitei (Bengali script) | mni_Beng |
| Halh Mongolian | khk_Cyrl |
| Mossi | mos_Latn |
| Maori | mri_Latn |
| Burmese | mya_Mymr |
| Dutch | nld_Latn |
| Norwegian Nynorsk | nno_Latn |
| Norwegian Bokmål | nob_Latn |
| Nepali | npi_Deva |
| Northern Sotho | nso_Latn |
| Nuer | nus_Latn |
| Nyanja | nya_Latn |
| Occitan | oci_Latn |
| West Central Oromo | gaz_Latn |
| Odia | ory_Orya |
| Pangasinan | pag_Latn |
| Eastern Panjabi | pan_Guru |
| Papiamento | pap_Latn |
| Western Persian | pes_Arab |
| Polish | pol_Latn |
| Portuguese | por_Latn |
| Dari | prs_Arab |
| Southern Pashto | pbt_Arab |
| Ayacucho Quechua | quy_Latn |
| Romanian | ron_Latn |
| Rundi | run_Latn |
| Russian | rus_Cyrl |
| Sango | sag_Latn |
| Sanskrit | san_Deva |
| Santali | sat_Olck |
| Sicilian | scn_Latn |
| Shan | shn_Mymr |
| Sinhala | sin_Sinh |
| Slovak | slk_Latn |
| Slovenian | slv_Latn |
| Samoan | smo_Latn |
| Shona | sna_Latn |
| Sindhi | snd_Arab |
| Somali | som_Latn |
| Southern Sotho | sot_Latn |
| Spanish | spa_Latn |
| Tosk Albanian | als_Latn |
| Sardinian | srd_Latn |
| Serbian | srp_Cyrl |
| Swati | ssw_Latn |
| Sundanese | sun_Latn |
| Swedish | swe_Latn |
| Swahili | swh_Latn |
| Silesian | szl_Latn |
| Tamil | tam_Taml |
| Tatar | tat_Cyrl |
| Telugu | tel_Telu |
| Tajik | tgk_Cyrl |
| Tagalog | tgl_Latn |
| Thai | tha_Thai |
| Tigrinya | tir_Ethi |
| Tamasheq (Latin script) | taq_Latn |
| Tamasheq (Tifinagh script) | taq_Tfng |
| Tok Pisin | tpi_Latn |
| Tswana | tsn_Latn |
| Tsonga | tso_Latn |
| Turkmen | tuk_Latn |
| Tumbuka | tum_Latn |
| Turkish | tur_Latn |
| Twi | twi_Latn |
| Central Atlas Tamazight | tzm_Tfng |
| Uyghur | uig_Arab |
| Ukrainian | ukr_Cyrl |
| Umbundu | umb_Latn |
| Urdu | urd_Arab |
| Northern Uzbek | uzn_Latn |
| Venetian | vec_Latn |
| Vietnamese | vie_Latn |
| Waray | war_Latn |
| Wolof | wol_Latn |
| Xhosa | xho_Latn |
| Eastern Yiddish | ydd_Hebr |
| Yoruba | yor_Latn |
| Yue Chinese | yue_Hant |
| Chinese (Simplified) | zho_Hans |
| Chinese (Traditional) | zho_Hant |
| Standard Malay | zsm_Latn |
| Zulu | zul_Latn |
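
Each code in the table above joins a three-letter language identifier and a four-letter script tag with an underscore (these follow the ISO 639-3 and ISO 15924 conventions). The sketch below shows one way to split codes and find languages listed with more than one script. The helper names are illustrative and the sample list is a small excerpt, not the full table.

```python
from collections import defaultdict


def split_code(code):
    """Split a FLORES-200 code such as 'ace_Arab' into (language, script)."""
    language, sep, script = code.partition("_")
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"not a FLORES-200 style code: {code!r}")
    return language, script


def scripts_by_language(codes):
    """Group script tags by language identifier, preserving input order."""
    grouped = defaultdict(list)
    for code in codes:
        language, script = split_code(code)
        grouped[language].append(script)
    return dict(grouped)


# A few codes copied from the table above.
sample = ["ace_Arab", "ace_Latn", "eng_Latn", "kas_Arab", "kas_Deva", "zho_Hans", "zho_Hant"]
multi_script = {lang: s for lang, s in scripts_by_language(sample).items() if len(s) > 1}
print(multi_script)
```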

Updates to Previous Languages

Based on feedback and further Q/A, we've improved the quality of several languages:

  • Quechua (quy_Latn)
  • Aymara (ayr_Latn)
  • Cebuano (ceb_Latn)
  • Kimbundu (kmb_Latn)
  • Umbundu (umb_Latn)

As a result, scores for these languages will differ slightly between FLORES-101 and FLORES-200.

Map between FLORES-101 Language Codes and FLORES-200 Language Codes

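| FLORES-200 code | FLORES-101 code |
|---|---|
| afr_Latn | afr |
| amh_Ethi | amh |
| arb_Arab | ara |
| asm_Beng | asm |
| ast_Latn | ast |
| azj_Latn | azj |
| bel_Cyrl | bel |
| ben_Beng | ben |
| bos_Latn | bos |
| bul_Cyrl | bul |
| cat_Latn | cat |
| ceb_Latn | ceb |
| ces_Latn | ces |
| ckb_Arab | ckb |
| cym_Latn | cym |
| dan_Latn | dan |
| deu_Latn | deu |
| ell_Grek | ell |
| eng_Latn | eng |
| est_Latn | est |
| fin_Latn | fin |
| fra_Latn | fra |
| fuv_Latn | ful |
| gle_Latn | gle |
| glg_Latn | glg |
| guj_Gujr | guj |
| hau_Latn | hau |
| heb_Hebr | heb |
| hin_Deva | hin |
| hrv_Latn | hrv |
| hun_Latn | hun |
| hye_Armn | hye |
| ibo_Latn | ibo |
| ind_Latn | ind |
| isl_Latn | isl |
| ita_Latn | ita |
| jav_Latn | jav |
| jpn_Jpan | jpn |
| kam_Latn | kam |
| kan_Knda | kan |
| kat_Geor | kat |
| kaz_Cyrl | kaz |
| kea_Latn | kea |
| khm_Khmr | khm |
| kir_Cyrl | kir |
| kor_Hang | kor |
| lao_Laoo | lao |
| lin_Latn | lin |
| lit_Latn | lit |
| ltz_Latn | ltz |
| lug_Latn | lug |
| luo_Latn | luo |
| lvs_Latn | lav |
| mal_Mlym | mal |
| mar_Deva | mar |
| mkd_Cyrl | mkd |
| mlt_Latn | mlt |
| khk_Cyrl | mon |
| mri_Latn | mri |
| mya_Mymr | mya |
| nld_Latn | nld |
| nob_Latn | nob |
| npi_Deva | npi |
| nso_Latn | nso |
| nya_Latn | nya |
| oci_Latn | oci |
| gaz_Latn | orm |
| ory_Orya | ory |
| pan_Guru | pan |
| pes_Arab | fas |
| pol_Latn | pol |
| por_Latn | por |
| pbt_Arab | pus |
| ron_Latn | ron |
| rus_Cyrl | rus |
| slk_Latn | slk |
| sna_Latn | sna |
| snd_Arab | snd |
| som_Latn | som |
| spa_Latn | spa |
| srp_Cyrl | srp |
| swe_Latn | swe |
| swh_Latn | swh |
| tam_Taml | tam |
| tel_Telu | tel |
| tgk_Cyrl | tgk |
| tgl_Latn | tgl |
| tha_Thai | tha |
| tur_Latn | tur |
| ukr_Cyrl | ukr |
| umb_Latn | umb |
| urd_Arab | urd |
| uzn_Latn | uzb |
| vie_Latn | vie |
| wol_Latn | wol |
| xho_Latn | xho |
| yor_Latn | yor |
| zho_Hans | zho_simpl |
| zho_Hant | zho_trad |
| zsm_Latn | msa |
| zul_Latn | zul |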
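
To reuse FLORES-101 era scripts or results with FLORES-200 codes, the mapping above can be loaded into a dictionary. The following is a minimal sketch: it includes only a handful of rows from the table (mostly the ones where the three-letter part changes), and the names `FLORES200_TO_101`, `to_flores200`, and `to_flores101` are illustrative rather than part of this repository.

```python
# Excerpt of the FLORES-200 -> FLORES-101 table above; extend with the remaining rows as needed.
FLORES200_TO_101 = {
    "arb_Arab": "ara",
    "eng_Latn": "eng",
    "fuv_Latn": "ful",
    "khk_Cyrl": "mon",
    "gaz_Latn": "orm",
    "pes_Arab": "fas",
    "pbt_Arab": "pus",
    "uzn_Latn": "uzb",
    "zho_Hans": "zho_simpl",
    "zho_Hant": "zho_trad",
    "zsm_Latn": "msa",
}

# Reverse direction, built once; this works because each row pairs exactly one code with one code.
FLORES101_TO_200 = {old: new for new, old in FLORES200_TO_101.items()}


def to_flores200(code_101):
    """Return the FLORES-200 code for a FLORES-101 code, or None if it is not in the excerpt."""
    return FLORES101_TO_200.get(code_101)


def to_flores101(code_200):
    """Return the FLORES-101 code, or None (for example, a language new in FLORES-200)."""
    return FLORES200_TO_101.get(code_200)
```

Returning `None` instead of raising keeps the lookup usable in loops over all FLORES-200 codes, where most of the newly added languages have no FLORES-101 counterpart.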

Previous FLORES Releases

FLORES-101

FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages.

FLORESv1

FLORESv1 included Nepali, Sinhala, Pashto, and Khmer.

Citation

If you use this data in your work, please cite:

@article{nllb2022,
  author    = {{NLLB Team} and Marta R. Costa-jussà and James Cross and Onur Çelebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Al Youngblood and Bapi Akula and Loic Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hoffman and Semarley Jarrett and Kaushik Ram Sadagopan and Dirk Rowe and Shannon Spruit and Chau Tran and Pierre Andrews and Necip Fazil Ayan and Shruti Bhosale and Sergey Edunov and Angela Fan and Cynthia Gao and Vedanuj Goswami and Francisco Guzmán and Philipp Koehn and Alexandre Mourachko and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Jeff Wang},
  title     = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year      = {2022}
}

@inproceedings{,
  title={The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
  author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
  year={2021}
}

@inproceedings{,
  title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
  author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1902.01382},
  year={2019}
}