Toxicity-200

December 14, 2022

**Warning: The files included in this repository contain toxic language.**

This repository contains files that include frequent words and phrases generally considered toxic because they represent:

  • Frequently used profanities
  • Frequently used insults and hate speech terms, or language used to bully, denigrate, or demean
  • Pornographic terms
  • Terms for body parts associated with sexual activity

Download

Toxicity-200 can be downloaded with the following command:

wget --trust-server-names https://tinyurl.com/NLLB200TWL

Purpose, Ethical Considerations, and Use of the Lists

The primary purpose of such lists is to help with translation model safety by monitoring for hallucinated toxicity. By hallucinated toxicity, we mean the presence of toxic items in the translated text when no such toxic items can be found in the source text.
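The definition above can be sketched as a simple list lookup. This is a minimal illustration, not the method used by the maintainers: the function names, the whole-word matching rule, and the placeholder terms are all assumptions, and whole-word matching will not suit languages written without spaces between words.

```python
import re


def find_toxic_items(text, toxic_terms):
    """Return the list entries that occur in `text` as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for term in toxic_terms:
        # Assumption: an entry counts as present only when it is not embedded
        # in a longer word. This rule is illustrative, not from the source.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(term)
    return found


def has_hallucinated_toxicity(source, translation, source_terms, target_terms):
    """True when the translation contains a listed item and the source contains none."""
    source_hits = find_toxic_items(source, source_terms)
    target_hits = find_toxic_items(translation, target_terms)
    return bool(target_hits) and not source_hits
```

In practice `source_terms` and `target_terms` would be loaded from the lists for the source and target languages. A translation is flagged only when it matches a target-language entry and the source text matches no source-language entry.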

The lists were collected via human translation. Any such translation effort inevitably poses risks of bias. The likelihood of getting access to professionals with diverse backgrounds and worldviews is not equal across all supported languages. In addition to the work that has already been done to mitigate biases, which can also introduce its own potential biases, the ultimate mitigation strategy can be to provide the community with free access to the lists, and to welcome feedback and contributions from the community in all supported languages.

The files are in zip format, and unzipping is password protected. To unzip the files after downloading, you may use the following command line:

unzip --password tL4nLLb [BCP47_code]_twl.zip

The unzipping of the files implies that you consent to viewing their contents.
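For scripted workflows, the same step can be sketched with Python's standard-library `zipfile` module. The helper name `extract_list` is hypothetical. Whether the archives use an encryption scheme that `zipfile` can read is an assumption: the module only decrypts the traditional ZIP encryption, so archives encrypted another way would still need an external tool.

```python
import zipfile


def extract_list(zip_path, password, out_dir="."):
    """Extract a password-protected archive and return the names of its members."""
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile expects the password as bytes, not str.
        archive.extractall(path=out_dir, pwd=password.encode("utf-8"))
        return archive.namelist()
```

For example, `extract_list("eng_Latn_twl.zip", "tL4nLLb")` would follow the `[BCP47_code]_twl.zip` naming pattern and the password given above. As with the command line, extracting the files implies consent to viewing their contents.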

Language codes for all languages can be found in the table below (see Languages in Toxicity-200). The BCP 47 language codes include an ISO 639-3 base tag to identify the language and an ISO 15924 supplemental tag to identify the script (e.g., taq_Tfng for Tamasheq in Tifinagh script). The codes mirror those used for the release of the FLORES-200 data sets. However, in cases where FLORES-200 targets a specific lect, the corresponding lists may be less restrictive, in that they may include items from closely related lects.
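The code structure described above (a three-letter ISO 639-3 tag, an underscore, and a four-letter ISO 15924 script tag) can be checked and split with a short helper. This is a sketch: `ListCode`, `parse_list_code`, and `archive_name` are hypothetical names, and the validation only checks the shape of the code, not whether a list exists for it.

```python
from typing import NamedTuple


class ListCode(NamedTuple):
    language: str  # ISO 639-3 base tag, e.g. "taq"
    script: str    # ISO 15924 script tag, e.g. "Tfng"


def parse_list_code(code):
    """Split a code such as 'taq_Tfng' into its language and script tags."""
    language, sep, script = code.partition("_")
    language_ok = (
        len(language) == 3 and language.isascii()
        and language.isalpha() and language.islower()
    )
    script_ok = (
        len(script) == 4 and script.isascii() and script.isalpha()
        and script[0].isupper() and script[1:].islower()
    )
    if not (sep and language_ok and script_ok):
        raise ValueError(f"not a language_Script code: {code!r}")
    return ListCode(language, script)


def archive_name(code):
    """Build the archive file name following the [BCP47_code]_twl.zip pattern."""
    parse_list_code(code)  # raises ValueError on malformed codes
    return f"{code}_twl.zip"
```

For example, `parse_list_code("taq_Tfng")` yields the language tag `taq` and the script tag `Tfng`, and `archive_name("taq_Tfng")` gives `taq_Tfng_twl.zip`.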

Languages in Toxicity-200

Toxicity lists are currently available in the following languages:

BCP 47 Code  Language
ace_Arab     Acehnese (Arabic script)
ace_Latn     Acehnese (Latin script)
acm_Arab     Mesopotamian Arabic
acq_Arab     Ta’izzi-Adeni Arabic
aeb_Arab     Tunisian Arabic
afr_Latn     Afrikaans
ajp_Arab     South Levantine Arabic
aka_Latn     Akan
als_Latn     Tosk Albanian
amh_Ethi     Amharic
apc_Arab     North Levantine Arabic
arb_Arab     Modern Standard Arabic
arb_Latn     Modern Standard Arabic (Romanized)
ars_Arab     Najdi Arabic
ary_Arab     Moroccan Arabic
arz_Arab     Egyptian Arabic
asm_Beng     Assamese
ast_Latn     Asturian
awa_Deva     Awadhi
ayr_Latn     Central Aymara
azb_Arab     South Azerbaijani
azj_Latn     North Azerbaijani
bak_Cyrl     Bashkir
bam_Latn     Bambara
ban_Latn     Balinese
bel_Cyrl     Belarusian
bem_Latn     Bemba
ben_Beng     Bengali
bho_Deva     Bhojpuri
bjn_Arab     Banjar (Arabic script)
bjn_Latn     Banjar (Latin script)
bod_Tibt     Standard Tibetan
bos_Latn     Bosnian
bug_Latn     Buginese
bul_Cyrl     Bulgarian
cat_Latn     Catalan
ceb_Latn     Cebuano
ces_Latn     Czech
cjk_Latn     Chokwe
ckb_Arab     Central Kurdish
crh_Latn     Crimean Tatar
cym_Latn     Welsh
dan_Latn     Danish
deu_Latn     German
dik_Latn     Southwestern Dinka
dyu_Latn     Dyula
dzo_Tibt     Dzongkha
ell_Grek     Greek
eng_Latn     English
epo_Latn     Esperanto
est_Latn     Estonian
eus_Latn     Basque
ewe_Latn     Ewe
fao_Latn     Faroese
fij_Latn     Fijian
fin_Latn     Finnish
fon_Latn     Fon
fra_Latn     French
fur_Latn     Friulian
fuv_Latn     Nigerian Fulfulde
gaz_Latn     West Central Oromo
gla_Latn     Scottish Gaelic
gle_Latn     Irish
glg_Latn     Galician
grn_Latn     Guarani
guj_Gujr     Gujarati
hat_Latn     Haitian Creole
hau_Latn     Hausa
heb_Hebr     Hebrew
hin_Deva     Hindi
hne_Deva     Chhattisgarhi
hrv_Latn     Croatian
hun_Latn     Hungarian
hye_Armn     Armenian
ibo_Latn     Igbo
ilo_Latn     Ilocano
ind_Latn     Indonesian
isl_Latn     Icelandic
ita_Latn     Italian
jav_Latn     Javanese
jpn_Jpan     Japanese
kab_Latn     Kabyle
kac_Latn     Jingpho
kam_Latn     Kamba
kan_Knda     Kannada
kas_Arab     Kashmiri (Arabic script)
kas_Deva     Kashmiri (Devanagari script)
kat_Geor     Georgian
kaz_Cyrl     Kazakh
kbp_Latn     Kabiyè
kea_Latn     Kabuverdianu
khk_Cyrl     Halh Mongolian
khm_Khmr     Khmer
kik_Latn     Kikuyu
kin_Latn     Kinyarwanda
kir_Cyrl     Kyrgyz
kmb_Latn     Kimbundu
kmr_Latn     Northern Kurdish
knc_Arab     Central Kanuri (Arabic script)
knc_Latn     Central Kanuri (Latin script)
kon_Latn     Kikongo
kor_Hang     Korean
lao_Laoo     Lao
lij_Latn     Ligurian
lim_Latn     Limburgish
lin_Latn     Lingala
lit_Latn     Lithuanian
lmo_Latn     Lombard
ltg_Latn     Latgalian
ltz_Latn     Luxembourgish
lua_Latn     Luba-Kasai
lug_Latn     Ganda
luo_Latn     Luo
lus_Latn     Mizo
lvs_Latn     Standard Latvian
mag_Deva     Magahi
mai_Deva     Maithili
mal_Mlym     Malayalam
mar_Deva     Marathi
min_Arab     Minangkabau (Arabic script)
min_Latn     Minangkabau (Latin script)
mkd_Cyrl     Macedonian
mlt_Latn     Maltese
mni_Beng     Meitei (Bengali script)
mos_Latn     Mossi
mri_Latn     Maori
mya_Mymr     Burmese
nld_Latn     Dutch
nno_Latn     Norwegian Nynorsk
nob_Latn     Norwegian Bokmål
npi_Deva     Nepali
nso_Latn     Northern Sotho
nus_Latn     Nuer
nya_Latn     Nyanja
oci_Latn     Occitan
ory_Orya     Odia
pag_Latn     Pangasinan
pan_Guru     Eastern Panjabi
pap_Latn     Papiamento
pbt_Arab     Southern Pashto
pes_Arab     Western Persian
plt_Latn     Plateau Malagasy
pol_Latn     Polish
por_Latn     Portuguese
prs_Arab     Dari
quy_Latn     Ayacucho Quechua
ron_Latn     Romanian
run_Latn     Rundi
rus_Cyrl     Russian
sag_Latn     Sango
san_Deva     Sanskrit
sat_Olck     Santali
scn_Latn     Sicilian
shn_Mymr     Shan
sin_Sinh     Sinhala
slk_Latn     Slovak
slv_Latn     Slovenian
smo_Latn     Samoan
sna_Latn     Shona
snd_Arab     Sindhi
som_Latn     Somali
sot_Latn     Southern Sotho
spa_Latn     Spanish
srd_Latn     Sardinian
srp_Cyrl     Serbian
ssw_Latn     Swati
sun_Latn     Sundanese
swe_Latn     Swedish
swh_Latn     Swahili
szl_Latn     Silesian
tam_Taml     Tamil
taq_Latn     Tamasheq (Latin script)
taq_Tfng     Tamasheq (Tifinagh script)
tat_Cyrl     Tatar
tel_Telu     Telugu
tgk_Cyrl     Tajik
tgl_Latn     Tagalog
tha_Thai     Thai
tir_Ethi     Tigrinya
tpi_Latn     Tok Pisin
tsn_Latn     Tswana
tso_Latn     Tsonga
tuk_Latn     Turkmen
tum_Latn     Tumbuka
tur_Latn     Turkish
twi_Latn     Twi
tzm_Tfng     Central Atlas Tamazight
uig_Arab     Uyghur
ukr_Cyrl     Ukrainian
umb_Latn     Umbundu
urd_Arab     Urdu
uzn_Latn     Northern Uzbek
vec_Latn     Venetian
vie_Latn     Vietnamese
war_Latn     Waray
wol_Latn     Wolof
xho_Latn     Xhosa
ydd_Hebr     Eastern Yiddish
yor_Latn     Yoruba
yue_Hant     Yue Chinese
zho_Hans     Chinese (Simplified)
zho_Hant     Chinese (Traditional)
zsm_Latn     Standard Malay
zul_Latn     Zulu

Latest Update

Date: 2022-12-14

Files:

BCP 47 Code  Language
est_Latn     Estonian
fra_Latn     French
nld_Latn     Dutch