Overview of dataset

August 9, 2024

This repository hosts the English Worldwide Newswire dataset, introduced in "Do 'English' Named Entity Recognizers Work Well on Global Englishes?" (Findings of EMNLP 2023) by Shan et al.: https://arxiv.org/abs/2404.13465

Alex Shan (azshan@cs.stanford.edu) is the corresponding author and maintainer of this repository.

This dataset comprises roughly 1,100 news articles from around the world, sourced from non-Western newswire. It is specifically designed to exclude Western-sourced texts and focuses on less common contexts of the English language. We encourage authors to benchmark their English NER models on this dataset to explore how well modern models perform on unseen contexts. A detailed breakdown of article origins by region and country is given further down.

Overview of dataset

  • 1075 hand-annotated English newswire articles from local sources around the world (bucketed into Asia, Africa, Latin America, the Middle East, and Indigenous Commonwealth (Oceania + Canada)).
  • 700,000 tokens
  • Created in collaboration with Datasaur NLP (https://datasaur.ai) and MLTwist (https://mltwist.com)
  • 9 class labels: Date, Person, Location, Facility, Organization, Miscellaneous, Money, NORP, and Product. A more detailed definition of each class can be found in the appendix of the arXiv paper.
  • BIOES format: each token is tagged with its class and its position, denoting whether the token begins (B), is inside (I), or ends (E) a multi-token named entity, is a single-token entity (S), or lies outside any entity (O).
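
As an illustration of the BIOES scheme described above, the following is a minimal sketch of how per-token tags could be turned back into entity spans. It assumes tags are written as a position prefix, a hyphen, and a class name (for example `B-Person`), with a bare `O` for tokens outside any entity; the exact tag strings used in the dataset files may differ, and the function name `bioes_to_spans` is hypothetical.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a BIOES tag sequence to (start, end_inclusive, class) spans.

    Assumes tags look like "B-Person" / "I-Person" / "E-Person" / "S-Person",
    with "O" for tokens outside any entity (an assumption, not confirmed
    against the dataset files).
    """
    spans = []
    start = None          # index where the currently open entity began
    current_type = None   # class of the currently open entity
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current_type = None, None
            continue
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity: emit it immediately.
            spans.append((i, i, etype))
            start, current_type = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, current_type = i, etype
        elif prefix == "I":
            # Continue the open entity; discard it if the class changed.
            if current_type != etype:
                start, current_type = None, None
        elif prefix == "E":
            # Close the open entity only if it is well formed.
            if start is not None and current_type == etype:
                spans.append((start, i, etype))
            start, current_type = None, None
    return spans


example_tags = ["B-Person", "E-Person", "O", "S-Location",
                "B-Organization", "I-Organization", "E-Organization"]
example_spans = bioes_to_spans(example_tags)
```

Malformed sequences (for example an `E-` tag with no preceding `B-`) are silently dropped in this sketch; a stricter reader might raise an error instead.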

To process the dataset, check out StanfordNLP's Stanza library, which contains the dataset preparation script: https://github.com/stanfordnlp/stanza/blob/main/stanza/utils/datasets/ner/prepare_ner_dataset.py

South America: 94
  • Argentina: 20
  • Bolivia: 3
  • Chile: 12
  • Colombia: 10
  • Ecuador: 10
  • Guyana: 3
  • Paraguay: 13
  • Peru: 10
  • Uruguay: 5
  • Venezuela: 8

Central and North America: 178
  • Costa Rica: 20
  • Cuba: 15
  • El Salvador: 20
  • Honduras: 14
  • Mexico: 29
  • Nicaragua: 20
  • Panama: 20
  • Indigenous Canadian: 40

Africa: 265
  • General: 65
  • Pan-Africa: 20
  • Algeria: 20
  • Ghana: 20
  • Kenya: 23
  • Mauritius: 20
  • Egypt: 22
  • Ethiopia: 9
  • Namibia: 28
  • South Africa: 38

Asia: 347
  • General: 14
  • China: 104
  • Japan: 15
  • India: 71
  • Korea: 37
  • Taiwan: 26
  • Malaysia: 11
  • Bangladesh: 31
  • Thailand: 27
  • Mongolia: 11

Middle East: 167
  • Oman: 12
  • Jordan: 21
  • Israel: 20
  • Iran: 16
  • UAE: 17
  • Saudi Arabia: 27
  • Pakistan: 2
  • Qatar: 16
  • Kuwait: 36

Oceania: 48
  • Indigenous Australia: 28
  • Indigenous New Zealand: 20

Repo organization

The original_articles directory contains the complete collection of raw text data, as it was before labeling.

The procesed_annotated directory contains the complete collection of our annotations of the original data, stored as .tsv files of BIOES-format labeled data. Each line holds one token, with the token text and its corresponding label separated by a tab. File names take the form <country>_<newswire_company>_<id>.txt.tsv. To see which geographic bucket a country belongs to, refer to the /regions.txt file, which maps each country prefix to its region.

The other directories contain the complete collection of our annotations together with the annotator metadata computed on the Datasaur platform. To access the labeled data there, use the REVIEW subdirectory of each folder.
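
The file layout described above can be read with a few lines of Python. The following is a rough sketch, not the official loader (the Stanza preparation script linked earlier is the supported route). The helper names `read_token_labels` and `parse_file_name` are hypothetical, blank lines are assumed to be separators and are skipped, and the file-name parser assumes that neither the country prefix nor the id contains an underscore.

```python
from typing import List, Tuple


def read_token_labels(path: str) -> List[Tuple[str, str]]:
    """Read one annotated .tsv file into a list of (token, label) pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # assumption: blank lines carry no token
            # Split on the first tab: token text on the left, label on the right.
            token, _, label = line.partition("\t")
            pairs.append((token, label))
    return pairs


def parse_file_name(name: str) -> Tuple[str, str, str]:
    """Split '<country>_<newswire_company>_<id>.txt.tsv' into its three parts.

    Assumes the country prefix and the id contain no underscores, so any
    extra underscores are attributed to the newswire company name.
    """
    suffix = ".txt.tsv"
    stem = name[:-len(suffix)] if name.endswith(suffix) else name
    country, rest = stem.split("_", 1)
    company, doc_id = rest.rsplit("_", 1)
    return country, company, doc_id
```

For example, a hypothetical file named `xx_some_wire_12.txt.tsv` would be split into the country prefix `xx`, the company `some_wire`, and the id `12`; the country prefix can then be looked up in regions.txt.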

If you use this dataset, please use the following citation:

@inproceedings{Shan_2023,
   title={Do “English” Named Entity Recognizers Work Well on Global Englishes?},
   url={http://dx.doi.org/10.18653/v1/2023.findings-emnlp.788},
   DOI={10.18653/v1/2023.findings-emnlp.788},
   booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
   publisher={Association for Computational Linguistics},
   author={Shan, Alexander and Bauer, John and Carlson, Riley and Manning, Christopher},
   year={2023} }