EverestNER - The Benchmark Data Set for Nepali NER

May 20, 2022

We have created the largest human-annotated Named Entity Recognition (NER) data set for Nepali available to date. Highlights:

  • EverestNER covers five named entity types: Person Name, Location, Organization, Event, and Date.
  • EverestNER provides high-quality annotations, produced under clear annotation guidelines.
  • EverestNER contains 24,587 entities and 308,353 tokens across 15,798 sentences.
  • We split the EverestNER data set into EverestNER-train and EverestNER-test. These standard data sets, therefore, become the first benchmark data sets for evaluating Nepali NER systems.
  • We report a comprehensive evaluation of state-of-the-art neural and Transformer models using these data sets. This is the first study to apply a BERT model to named entity recognition for Nepali.
  • We also discuss the remaining challenges for discovering NEs for Nepali (see our paper below).

Data Set Stats

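| Data  | Articles | Sentences | Tokens  | Avg. Sent. Len | LOC   | ORG   | PER   | EVT | DAT   |
|-------|----------|-----------|---------|----------------|-------|-------|-------|-----|-------|
| Train | 847      | 13,848    | 268,741 | 19.40          | 5,148 | 4,756 | 7,707 | 312 | 3,394 |
| Test  | 149      | 1,950     | 39,612  | 20.31          | 809   | 715   | 1,115 | 59  | 572   |
| Total | 996      | 15,798    | 308,353 | 19.51          | 5,957 | 5,471 | 8,822 | 371 | 3,966 |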
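
The figures in the table can be cross-checked with a few lines of Python. This is only a sketch: the counts are copied from the table above, and the dictionary layout and function names are illustrative choices, not part of the released data.

```python
# Counts copied from the statistics table; the structure is illustrative only.
STATS = {
    "train": {"sentences": 13848, "tokens": 268741,
              "entities": {"LOC": 5148, "ORG": 4756, "PER": 7707, "EVT": 312, "DAT": 3394}},
    "test": {"sentences": 1950, "tokens": 39612,
             "entities": {"LOC": 809, "ORG": 715, "PER": 1115, "EVT": 59, "DAT": 572}},
}


def avg_sentence_length(split):
    """Average number of tokens per sentence for one split."""
    return STATS[split]["tokens"] / STATS[split]["sentences"]


def total_entities(split):
    """Total number of annotated entities in one split."""
    return sum(STATS[split]["entities"].values())


all_entities = total_entities("train") + total_entities("test")
all_tokens = STATS["train"]["tokens"] + STATS["test"]["tokens"]
all_sentences = STATS["train"]["sentences"] + STATS["test"]["sentences"]

print(f"train avg. sentence length: {avg_sentence_length('train'):.4f}")
print(f"test avg. sentence length:  {avg_sentence_length('test'):.4f}")
print(f"total entities: {all_entities}, tokens: {all_tokens}, sentences: {all_sentences}")
```

The totals reproduce the 24,587 entities, 308,353 tokens, and 15,798 sentences listed in the highlights. The computed averages agree with the table to within 0.01; the table values look truncated rather than rounded to two decimals.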

Data Format

The EverestNER data set is divided into train (EverestNER-train) and test (EverestNER-test) sets. Each set has character-level as well as token-level annotations. Please read our paper for more information.
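
To illustrate how character-level and token-level annotations can relate to each other, here is a minimal sketch. Everything in it is an assumption made for illustration: the actual file layout is described in the paper, not here, and the BIO tagging scheme, the whitespace tokenization, the romanized example sentence, and the hypothetical `spans_to_bio` helper are not taken from the data set.

```python
import re


def spans_to_bio(text, spans):
    """Map character-level spans (start, end, label) to token-level BIO tags.

    Hypothetical helper: assumes whitespace tokenization, end-exclusive
    offsets, and a BIO scheme, none of which is confirmed by the data set.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies entirely inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Made-up example sentence with end-exclusive character offsets.
text = "Sita Sharma visited Pokhara in 2020"
spans = [(0, 11, "PER"), (20, 27, "LOC"), (31, 35, "DAT")]

tagged = spans_to_bio(text, spans)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Character-level spans are independent of the tokenizer, while token-level tags are what sequence labelers such as BLSTM-CRF typically consume, which is one plausible reason to ship both.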

Our Results

Model comparison on EverestNER-test for (a) the rule-based baseline, (b) BLSTM-CRF, and (c) multilingual BERT:

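| Model                 | Pre. | Rec. | F1-micro |
|-----------------------|------|------|----------|
| Baseline (Rule-based) | 0.71 | 0.55 | 0.62     |
| BLSTM-CRF-wc.ft       | 0.89 | 0.74 | 0.81     |
| BERT-bbmu             | 0.87 | 0.84 | 0.85     |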
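
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded values in the table. The function name is an illustrative choice, and small differences are expected because the inputs are already rounded to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-micro), copied from the table above.
REPORTED = {
    "Baseline (Rule-based)": (0.71, 0.55, 0.62),
    "BLSTM-CRF-wc.ft": (0.89, 0.74, 0.81),
    "BERT-bbmu": (0.87, 0.84, 0.85),
}

for name, (precision, recall, reported_f1) in REPORTED.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {reported_f1:.2f}")
```

The baseline and BERT-bbmu have fairly similar gaps between precision and recall in opposite senses of balance: the baseline and BLSTM-CRF trade recall for precision, whereas BERT-bbmu's precision and recall are close, which is why it has the highest F1 despite slightly lower precision than BLSTM-CRF.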

Performance of the best-performing model (BERT-bbmu) per named entity type:

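| Entity | Pre. | Rec. | F1   | Support |
|--------|------|------|------|---------|
| PER    | 0.90 | 0.85 | 0.88 | 1115    |
| LOC    | 0.85 | 0.80 | 0.82 | 809     |
| ORG    | 0.85 | 0.83 | 0.84 | 715     |
| EVT    | 0.46 | 0.42 | 0.44 | 59      |
| DAT    | 0.91 | 0.91 | 0.91 | 572     |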
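
The per-entity rows can be tied back to the micro-averaged recall of BERT-bbmu: micro recall is the support-weighted mean of the per-entity recalls. The sketch below assumes that "Support" is the number of gold entities of each type in EverestNER-test (the values match the Test row of the statistics table). Because the inputs are rounded, the result is only approximate. Micro precision cannot be recovered the same way, since it depends on the number of predicted entities, which is not reported here.

```python
# (recall, support) per entity type, copied from the table above.
PER_ENTITY = {
    "PER": (0.85, 1115),
    "LOC": (0.80, 809),
    "ORG": (0.83, 715),
    "EVT": (0.42, 59),
    "DAT": (0.91, 572),
}

total_support = sum(support for _, support in PER_ENTITY.values())

# recall * support approximates the number of correctly found entities per type.
micro_recall = sum(recall * support for recall, support in PER_ENTITY.values()) / total_support

print(f"total support: {total_support}")
print(f"support-weighted (micro) recall: {micro_recall:.3f}")
```

The result is close to the reported micro recall of 0.84. It also shows why the weak EVT scores barely move the overall numbers: EVT accounts for only 59 of the 3,270 test entities.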

License

Non-commercial purposes only. For commercial use, permission must be obtained from the authors and the relevant parties. See the contact address below.

Unless required by applicable law or agreed to in writing, software and data distributed here are provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Cite Our Work

If you use the EverestNER data set, please cite our publication:

@inproceedings{niraula2022named,
  title={Named Entity Recognition for Nepali: Data Sets and Algorithms},
  author={Niraula, Nobal and Chapagain, Jeevan},
  booktitle={The International FLAIRS Conference Proceedings},
  volume={35},
  year={2022}
}

Contact

Feel free to contact nobal @AT nowalab .DOT com for any inquiries regarding this work.

Acknowledgments

Nepali Shabdakosh - https://nepalishabdakosh.com