KODOLI

May 14, 2023

KODOLI is a novel KOrean Dataset for Offensive Language Identification.

Warning: it contains highly offensive expressions.

  • KODOLI uses three fine-grained offensiveness categories: not offensive, likely offensive, and offensive.
  • Likely offensive language refers to text with implicit offensiveness, or to abusive language used without offensive intent.
  • In addition, we propose two auxiliary tasks to help identify offensive language: abusive language detection and sentiment analysis.
    • You can also use the auxiliary tasks for toxicity detection. (Be careful with the raw expressions.)
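As a minimal sketch of the label scheme described above, the three offensiveness categories and the two auxiliary labels could be modeled as follows. The names `Offensiveness`, `Example`, and `to_binary` are hypothetical and not part of the released data, and the types chosen for the auxiliary labels (a boolean for abusiveness, a string for sentiment) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Offensiveness(Enum):
    """Three-way main label, named after the categories listed above."""
    NOT_OFFENSIVE = "not offensive"
    LIKELY_OFFENSIVE = "likely offensive"
    OFFENSIVE = "offensive"


@dataclass(frozen=True)
class Example:
    """One annotated text (hypothetical layout, not the released schema)."""
    text: str
    offensiveness: Offensiveness
    abusive: bool    # auxiliary task: abusive language detection (binary is an assumption)
    sentiment: str   # auxiliary task: sentiment analysis (label set is an assumption)


def to_binary(label: Offensiveness, likely_as_offensive: bool = True) -> bool:
    """Collapse the three-way label into offensive / not offensive.

    The caller decides which side the borderline "likely offensive" class falls on.
    """
    if label is Offensiveness.LIKELY_OFFENSIVE:
        return likely_as_offensive
    return label is Offensiveness.OFFENSIVE
```

Making the treatment of the likely offensive class an explicit parameter keeps the choice visible when comparing against binary offensive-language datasets.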

Download

You can download the KODOLI benchmark from this repository. Please follow the dataset's license.
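Once downloaded, a quick check of the label distribution might look like the sketch below. The file format and column names are not specified here, so the tab-separated layout, the `offensiveness` column name, and the helper `label_distribution` are all assumptions; adjust them to match the actual files in the repository.

```python
import csv
import io
from collections import Counter


def label_distribution(tsv_text: str, label_column: str) -> Counter:
    """Count how often each label occurs in tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[label_column] for row in reader)


# Placeholder rows standing in for real data (assumed layout, not the released file).
sample = (
    "text\toffensiveness\n"
    "<text 1>\tnot offensive\n"
    "<text 2>\toffensive\n"
    "<text 3>\toffensive\n"
)
counts = label_distribution(sample, "offensiveness")
```

For a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the result in place of `sample`.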

Dataset Description

Source

  • Texts are mainly collected and sampled from online communities and news articles.

[Figure: data sources]

Statistics

[Figure: dataset statistics]

Guideline Details

Guideline (ENG.)

[Guideline (KOR.)] Coming soon

Updates

  • Apr 20, 2023: We released 3.6k examples for the offensive language identification task.

Citation

@inproceedings{park2023feel,
  title={“Why do I feel offended?” - Korean Dataset for Offensive Language Identification},
  author={Park, San-Hee and Kim, Kang-Min and Lee, O-joun and Kang, Youjin and Lee, Jaewon and Lee, Su-min and Lee, Sangkeun},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2023},
  pages={1112--1123},
  year={2023}
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.