Chinese Entity Tagging

July 27, 2020

Background

Entity tagging identifies pieces of text (“mentions”) and marks them with types such as Person, Organization, Geo-political Entity, Location, etc. In addition to proper names (“Bob”), mentions may also include nominals (“the player”).

Example

Input:

美国国防部长马蒂斯说,与首尔举行的名为“秃鹫”的军事演习每年春天在韩国进行,但2019年将“缩小规模”。

Output:

[美国]GPE国防部长[马蒂斯]PER说,与[首尔]GPE举行的名为“秃鹫”的军事演习每年春天在[韩国]GPE进行,但[2019年]TMP将“缩小规模”。
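The bracketed output above can be converted into typed character spans for downstream scoring. The following is a minimal sketch, assuming the `[mention]TYPE` notation used in this example, with types written as uppercase ASCII letters and no nested brackets. The function name `parse_bracketed` and the tuple layout are illustrative choices, not part of any shared-task tooling.

```python
import re

# Matches "[mention]TYPE", where TYPE is a run of uppercase ASCII letters.
# Assumption: mentions are not nested and contain no square brackets.
MENTION = re.compile(r"\[([^\[\]]+)\]([A-Z]+)")

def parse_bracketed(annotated):
    """Return (plain_text, mentions) for a bracket-annotated string.

    Each mention is (start, end, type, surface), with character offsets
    into plain_text and an exclusive end offset.
    """
    pieces = []
    mentions = []
    pos = 0      # read position in the annotated string
    offset = 0   # current length of the plain text built so far
    for m in MENTION.finditer(annotated):
        before = annotated[pos:m.start()]
        pieces.append(before)
        offset += len(before)
        surface, etype = m.group(1), m.group(2)
        mentions.append((offset, offset + len(surface), etype, surface))
        pieces.append(surface)
        offset += len(surface)
        pos = m.end()
    pieces.append(annotated[pos:])
    return "".join(pieces), mentions

annotated = "[美国]GPE国防部长[马蒂斯]PER说,与[首尔]GPE举行的名为“秃鹫”的军事演习每年春天在[韩国]GPE进行,但[2019年]TMP将“缩小规模”。"
text, mentions = parse_bracketed(annotated)
for start, end, etype, surface in mentions:
    print(start, end, etype, surface)
```

Character offsets are used here rather than word offsets because Chinese text has no whitespace word boundaries; a word-level representation would depend on the segmenter chosen.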

Standard Metrics

F-score for selecting the correct piece of text (“mention”) and assigning it the correct type.
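As an illustration of this metric, the sketch below computes micro-averaged precision, recall, and F-score, counting a predicted mention as correct only when its start offset, end offset, and type all match a gold mention exactly. Exact matching is an assumption made here for simplicity; the official scorers for the shared tasks listed on this page may apply different matching rules. The function name `prf` and the sample spans are hypothetical.

```python
def prf(gold, predicted):
    """Precision, recall, and F1 over exact (start, end, type) matches."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0
    # when nothing is correct to avoid dividing by zero.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Hypothetical gold and system mentions as (start, end, type).
gold = {(0, 2, "GPE"), (6, 9, "PER"), (12, 14, "GPE")}
pred = {(0, 2, "GPE"), (6, 9, "ORG"), (12, 14, "GPE"), (20, 22, "LOC")}

p, r, f = prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this sample, the span (6, 9) is found but typed as ORG instead of PER, so it counts as both a false positive and a false negative: two of four predictions are correct, and two of three gold mentions are recovered.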

TAC-KBP / EDL Track (2015-2017).

The NIST TAC Knowledge Base Population (KBP) Entity Discovery and Linking (EDL) track includes Chinese entity tagging for 5 types: person (PER), geo-political entity (GPE), location (LOC), organization (ORG) and facility (FAC).

Data for this evaluation is available from the Linguistic Data Consortium (LDC).

| Test set | Size (documents) | Genre |
| --- | --- | --- |
| TAC-KBP-EDL 2015 | 313 (train + eval) | News |
| TAC-KBP-EDL 2016 | 166 | News |
| TAC-KBP-EDL 2017 | 167 | News |

Metrics

NERC F-score

Results

| System | TAC-KBP / EDL 2015 (Names) | TAC-KBP / EDL 2016 (Names and nominals) | TAC-KBP / EDL 2017 (Names and nominals) |
| --- | --- | --- | --- |
| Best anonymous system in shared task writeup | 79.9 | 80.8 | 72.2 |

Resources

OntoNotes 5.0 (https://catalog.ldc.upenn.edu/LDC2013T19) from the Linguistic Data Consortium includes Chinese entity tagging.

  • 698 articles Xinhua (1994-1998)
  • 55 articles Information Services Department of HKSAR (1997)
  • 132 articles Sinorama magazine, Taiwan (1996-1998 & 2000-2001)

ACE 2005.

ACE 2005 evaluates on seven entity types: Facility (FAC), Geopolitical Entity (GPE), Location (LOC), Organization (ORG), Person (PER), Vehicle (VEH), and Weapon (WEA).

Data for this evaluation was prepared by the Linguistic Data Consortium (LDC).

A standard train/dev/test split does not seem to be available. Authors frequently split the data randomly 8:1:1 (Ju et al., 2018).
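Because no standard split exists, results are only comparable when the random split can be reproduced. The following is a minimal sketch of a seeded 8:1:1 split, assuming the split is made at the document level; the cited papers may split at a different granularity, and the function name `split_811`, the seed, and the document IDs are all hypothetical.

```python
import random

def split_811(doc_ids, seed=0):
    """Shuffle document IDs with a fixed seed and split them 8:1:1."""
    ids = sorted(doc_ids)              # fix the input order so the seed alone determines the split
    random.Random(seed).shuffle(ids)   # local RNG; leaves the global random state untouched
    n_train = len(ids) * 8 // 10
    n_dev = len(ids) // 10
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]       # the remainder goes to test
    return train, dev, test

# Hypothetical document IDs standing in for the corpus files.
docs = [f"doc{i:03d}" for i in range(100)]
train, dev, test = split_811(docs)
print(len(train), len(dev), len(test))
```

Reporting the seed and the resulting ID lists alongside the scores lets others evaluate on the same partition.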

| Train + test set | Size (characters) | Genre |
| --- | --- | --- |
| ACE 2005 | 325,834 | Newswire, Broadcast News, Weblog |

Results

| System | F-score |
| --- | --- |
| Wang et al. (2020) | 81.7 |
| Huang et al. (2020) | 81.7 |
| Wang & Lu (2018) | 73.00 |
| Ju et al. (2018) | 72.25 |

SIGHAN bakeoff 2006 NER MSRA.

This bakeoff evaluates entity taggers on three types of entities: Person (PER), Location (LOC), and Organization (ORG).

Paper summarizing the bakeoff:

| Test set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 100,000 | Newswire, Broadcast News, Weblog |

Results

| System | F-score |
| --- | --- |
| Liu et al. (2020) | 95.7 |
| Meng et al. (2019) | 95.5 |
| Ma et al. (2020) | 95.4 |
| Sun et al. (2020) | 95.0 |
| Yan et al. (2020) | 94.1 |
| Liu et al. (2019) | 93.74 |
| Sui et al. (2019) | 93.47 |
| Gui et al. (2019) | 93.46 |
| Zhang & Yang (2018) | 93.18 |

Resources

The “closed” task restricts participants to using only the following training material:

| Train set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 1.3M | Newswire, Broadcast News, Weblog |

Weibo NER.

This social media entity tagging task includes GPE, ORG, LOC, and PER. It was introduced by

Using the test split by http://www.aclweb.org/anthology/E17-2113:

| Test set | Size (name mentions) | Size (nominal mentions) | Genre |
| --- | --- | --- | --- |
| Weibo NER | 209 | 196 | Social media (Weibo) |

Results

| System | F-score (name mentions) | F-score (nominal mentions) | F-score (Overall) |
| --- | --- | --- | --- |
| Ma et al. (2020) | 70.9 | 67.0 | 70.5 |
| Meng et al. (2019) | | | 67.6 |
| Hu and Zheng (2020) | | | 56.4 |
| Sui et al. (2019) | 56.45 | 68.32 | 63.09 |
| Gui et al. (2019) | 55.34 | 64.98 | 60.21 |
| Liu et al. (2019) | 52.55 | 67.41 | 59.84 |
| Zhu (2019) | 55.38 | 62.98 | 59.31 |
| Zhang & Yang (2018) | 53.04 | 62.25 | 58.79 |
| Peng & Dredze (2015) | 55.28 | 62.97 | 58.99 |

Resources

| Train & Dev data | Size (name mentions) | Size (nominal mentions) | Genre |
| --- | --- | --- | --- |
| Weibo NER train | -- | -- | Social media (Weibo) |
| Weibo NER dev | 153 | 226 | Social media (Weibo) |

Also included are 112M unlabeled Weibo text messages.

Other Resources

This paper presents an NER-annotated corpus in the genres of social media, human-computer interaction, and e-commerce:


Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com