LASER3 - No Language Left Behind

July 6, 2022

As part of the project No Language Left Behind (NLLB) we have developed new LASER encoders, referred to here as LASER3. Each LASER3 encoder has a particular focus language which it supports, and the full list of available LASER3 encoders can be found at the bottom of this README.

We have also included an updated version of the original LASER encoder: LASER2. This improved model supports the same languages that LASER was trained on. For more details on how both the LASER2 and LASER3 encoders were trained, please see Heffernan et al., 2022.

We also provide code to train LASER3 teacher-student models, as well as stopes, a new powerful and flexible mining library.

Downloading encoders

To download the available encoders, please run the download_models.sh script within this directory.

bash ./download_models.sh

LASER2 and all LASER3 encoders are downloaded by default. Because the full set of LASER3 encoders may take up a lot of disk space, you can instead download individual LASER3 encoders by supplying their language codes (see the full list at the bottom of this README). For example:

bash ./download_models.sh wol_Latn zul_Latn ...

By default, this download script will place all supported models within the calling directory.

Note: LASER3 encoders for each focus language are in the format: laser3-{language_code}.
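
As a small illustration of this naming scheme, the sketch below maps a language code to the corresponding encoder name. The function name laser3_name is hypothetical and not part of the LASER repository, and the pattern check only reflects the shape of the codes listed at the bottom of this README (three lowercase letters, an underscore, and a four-letter script tag).

```shell
# Hypothetical helper (not part of LASER): build the encoder name for a focus language.
laser3_name() {
    case "$1" in
        # Codes in this README look like wol_Latn: 3 lowercase letters, "_", 4-letter script tag.
        [a-z][a-z][a-z]_[A-Z][a-z][a-z][a-z])
            printf 'laser3-%s\n' "$1"
            ;;
        *)
            echo "unexpected language code: $1" >&2
            return 1
            ;;
    esac
}

laser3_name wol_Latn   # prints laser3-wol_Latn
```

A check like this can catch a mistyped code before it is passed to the download or embedding scripts.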

Embedding texts

Once encoders are downloaded, you can then begin embedding texts by following the instructions here.

For example: ./LASER/tasks/embed/embed.sh [INFILE] [OUTFILE] wol_Latn
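
When several focus languages need to be embedded, the same command can be repeated once per language code. The following is a minimal sketch, not part of the LASER repository: the function embed_all, the DRY_RUN variable, and the file layout (one input file named <code>.txt per language, with output written to <code>.emb) are all assumptions made for illustration. Only the argument order of embed.sh is taken from the example above. By default the sketch only prints the commands it would run.

```shell
# Hypothetical wrapper: embed one input file per language with the matching LASER3 encoder.
# Assumed layout (not from this README): inputs at <dir>/<code>.txt, outputs at <dir>/<code>.emb.
embed_all() {
    embed_dir="$1"
    shift
    for embed_code in "$@"; do
        embed_in="$embed_dir/$embed_code.txt"
        embed_out="$embed_dir/$embed_code.emb"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): only show the command that would be executed.
            echo "./LASER/tasks/embed/embed.sh $embed_in $embed_out $embed_code"
        else
            # Real run: stop at the first language that fails.
            ./LASER/tasks/embed/embed.sh "$embed_in" "$embed_out" "$embed_code" || return 1
        fi
    done
}

embed_all data wol_Latn zul_Latn
```

Setting DRY_RUN=0 before calling embed_all would execute the commands instead of printing them, assuming the encoders for the listed languages have already been downloaded.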

List of available LASER3 encoders

Code      Language
ace_Latn  Acehnese (Latin script)
aka_Latn  Akan
als_Latn  Tosk Albanian
amh_Ethi  Amharic
asm_Beng  Assamese
awa_Deva  Awadhi
ayr_Latn  Central Aymara
azb_Arab  South Azerbaijani
azj_Latn  North Azerbaijani
bak_Cyrl  Bashkir
bam_Latn  Bambara
ban_Latn  Balinese
bel_Cyrl  Belarusian
bem_Latn  Bemba
ben_Beng  Bengali
bho_Deva  Bhojpuri
bjn_Latn  Banjar (Latin script)
bod_Tibt  Standard Tibetan
bug_Latn  Buginese
ceb_Latn  Cebuano
cjk_Latn  Chokwe
ckb_Arab  Central Kurdish
crh_Latn  Crimean Tatar
cym_Latn  Welsh
dik_Latn  Southwestern Dinka
diq_Latn  Southern Zaza
dyu_Latn  Dyula
dzo_Tibt  Dzongkha
ewe_Latn  Ewe
fao_Latn  Faroese
fij_Latn  Fijian
fon_Latn  Fon
fur_Latn  Friulian
fuv_Latn  Nigerian Fulfulde
gaz_Latn  West Central Oromo
gla_Latn  Scottish Gaelic
gle_Latn  Irish
grn_Latn  Guarani
guj_Gujr  Gujarati
hat_Latn  Haitian Creole
hau_Latn  Hausa
hin_Deva  Hindi
hne_Deva  Chhattisgarhi
hye_Armn  Armenian
ibo_Latn  Igbo
ilo_Latn  Ilocano
ind_Latn  Indonesian
jav_Latn  Javanese
kab_Latn  Kabyle
kac_Latn  Jingpho
kam_Latn  Kamba
kan_Knda  Kannada
kas_Arab  Kashmiri (Arabic script)
kas_Deva  Kashmiri (Devanagari script)
kat_Geor  Georgian
kaz_Cyrl  Kazakh
kbp_Latn  Kabiyè
kea_Latn  Kabuverdianu
khk_Cyrl  Halh Mongolian
khm_Khmr  Khmer
kik_Latn  Kikuyu
kin_Latn  Kinyarwanda
kir_Cyrl  Kyrgyz
kmb_Latn  Kimbundu
kmr_Latn  Northern Kurdish
knc_Arab  Central Kanuri (Arabic script)
knc_Latn  Central Kanuri (Latin script)
kon_Latn  Kikongo
lao_Laoo  Lao
lij_Latn  Ligurian
lim_Latn  Limburgish
lin_Latn  Lingala
lmo_Latn  Lombard
ltg_Latn  Latgalian
ltz_Latn  Luxembourgish
lua_Latn  Luba-Kasai
lug_Latn  Ganda
luo_Latn  Luo
lus_Latn  Mizo
mag_Deva  Magahi
mai_Deva  Maithili
mal_Mlym  Malayalam
mar_Deva  Marathi
min_Latn  Minangkabau (Latin script)
mlt_Latn  Maltese
mni_Beng  Meitei (Bengali script)
mos_Latn  Mossi
mri_Latn  Maori
mya_Mymr  Burmese
npi_Deva  Nepali
nso_Latn  Northern Sotho
nus_Latn  Nuer
nya_Latn  Nyanja
ory_Orya  Odia
pag_Latn  Pangasinan
pan_Guru  Eastern Panjabi
pap_Latn  Papiamento
pbt_Arab  Southern Pashto
pes_Arab  Western Persian
plt_Latn  Plateau Malagasy
prs_Arab  Dari
quy_Latn  Ayacucho Quechua
run_Latn  Rundi
sag_Latn  Sango
san_Deva  Sanskrit
sat_Beng  Santali
scn_Latn  Sicilian
shn_Mymr  Shan
sin_Sinh  Sinhala
smo_Latn  Samoan
sna_Latn  Shona
snd_Arab  Sindhi
som_Latn  Somali
sot_Latn  Southern Sotho
srd_Latn  Sardinian
ssw_Latn  Swati
sun_Latn  Sundanese
swh_Latn  Swahili
szl_Latn  Silesian
tam_Taml  Tamil
taq_Latn  Tamasheq (Latin script)
tat_Cyrl  Tatar
tel_Telu  Telugu
tgk_Cyrl  Tajik
tgl_Latn  Tagalog
tha_Thai  Thai
tir_Ethi  Tigrinya
tpi_Latn  Tok Pisin
tsn_Latn  Tswana
tso_Latn  Tsonga
tuk_Latn  Turkmen
tum_Latn  Tumbuka
tur_Latn  Turkish
twi_Latn  Twi
tzm_Tfng  Central Atlas Tamazight
uig_Arab  Uyghur
umb_Latn  Umbundu
urd_Arab  Urdu
uzn_Latn  Northern Uzbek
vec_Latn  Venetian
war_Latn  Waray
wol_Latn  Wolof
xho_Latn  Xhosa
ydd_Hebr  Eastern Yiddish
yor_Latn  Yoruba
zsm_Latn  Standard Malay
zul_Latn  Zulu