Chinese Spelling Correction

July 29, 2020

Background

A spelling corrector finds and corrects typographical errors in text. These errors often occur between characters that are similar in appearance, pronunciation, or both.

Example

Input:

1986年毕业于国防科技大学计算机应用专业,获学时学位。

Output:

1986年毕业于国防科技大学计算机应用专业,获学士学位。
(时 -> 士)
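The correction in the example above can be recovered by comparing the input and output character by character. The following is a minimal sketch, assuming substitution-only errors (so both strings have the same length); the function name `find_substitutions` is hypothetical and not part of any SIGHAN tooling.

```python
from __future__ import annotations


def find_substitutions(source: str, corrected: str) -> list[tuple[int, str, str]]:
    """Return (position, wrong_char, right_char) for each differing character.

    Assumes substitution-only errors, so both strings must have equal length.
    """
    if len(source) != len(corrected):
        raise ValueError("source and corrected must have the same length")
    return [
        (i, s, c)
        for i, (s, c) in enumerate(zip(source, corrected))
        if s != c
    ]


source = "1986年毕业于国防科技大学计算机应用专业,获学时学位。"
corrected = "1986年毕业于国防科技大学计算机应用专业,获学士学位。"

# Print every substituted character with its 0-based position.
for pos, wrong, right in find_substitutions(source, corrected):
    print(f"{pos}: {wrong} -> {right}")
```

Positions are 0-based character offsets into the input sentence.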

Standard Metrics

Spelling correction performance is typically evaluated using accuracy, precision, recall, and F1 score. These metrics can be computed at the character level or the sentence level. Detection and correction are typically evaluated separately.

  • Detection: the locations of all incorrect characters in a given passage must exactly match the gold standard.
  • Correction: both the locations and the corresponding corrections of all incorrect characters must exactly match the gold standard.
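The two criteria above can be illustrated with a small sentence-level scorer. This is a sketch under stated assumptions, not the official evaluation tool: it assumes substitution-only errors, counts a sentence as a true positive only on an exact match, and uses the number of sentences the system edited as the precision denominator. The helper names (`char_edits`, `prf`, `sentence_level_scores`) and the toy data are hypothetical.

```python
from __future__ import annotations


def char_edits(source: str, target: str) -> set[tuple[int, str]]:
    """Set of (position, replacement) where target differs from source."""
    return {(i, t) for i, (s, t) in enumerate(zip(source, target)) if s != t}


def prf(tp: int, n_pred: int, n_gold: int) -> tuple[float, float, float]:
    """Precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sentence_level_scores(triples):
    """Score (source, gold, prediction) triples of equal-length strings."""
    det_tp = cor_tp = n_pred = n_gold = 0
    for source, gold, pred in triples:
        gold_edits = char_edits(source, gold)
        pred_edits = char_edits(source, pred)
        n_gold += bool(gold_edits)  # sentences that contain errors
        n_pred += bool(pred_edits)  # sentences the system changed
        if gold_edits:
            # Detection: error positions must match exactly.
            det_tp += {i for i, _ in gold_edits} == {i for i, _ in pred_edits}
            # Correction: positions and replacement characters must match.
            cor_tp += gold_edits == pred_edits
    return {
        "detection": prf(det_tp, n_pred, n_gold),
        "correction": prf(cor_tp, n_pred, n_gold),
    }


# Toy data (assumed for illustration only).
triples = [
    ("获学时学位", "获学士学位", "获学士学位"),  # detected and corrected
    ("获学时学位", "获学士学位", "获学是学位"),  # detected, wrong correction
    ("获学士学位", "获学士学位", "获学士学位"),  # no error, left untouched
    ("获学士学位", "获学士学位", "获学时学位"),  # false alarm
]
scores = sentence_level_scores(triples)
print(scores)
```

The second sentence shows why the two scores diverge: the system found the right position but chose the wrong character, so it counts toward detection and not toward correction.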

SIGHAN Bake-off: Chinese Spelling Check Task.

| test set | # sentence pairs | # characters | # spelling errors (chars) | character set | genre |
| --- | --- | --- | --- | --- | --- |
| SIGHAN 2015 (Tseng et al. 2015) | 1,100 | 33,711 | 715 | traditional | second-language learning |
| SIGHAN 2014 (Yu et al. 2014) | 1,062 | 53,114 | 792 | traditional | second-language learning |
| SIGHAN 2013 (Wu et al. 2013) | 2,000 | 143,039 | 1,641 | traditional | second-language learning |

Metrics

  • (1) False Positive Rate, (2) Detection Accuracy, (3) Detection Precision, (4) Detection Recall, (5) Detection F1, (6) Correction Accuracy, (7) Correction Precision, (8) Correction Recall, (9) Correction F1
  • Implementation: http://nlp.ee.ncu.edu.tw/resource/csc_download.html
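For reference, the metrics listed above are usually written in terms of confusion-matrix counts. The formulas below assume the standard definitions, with TP, FP, TN, and FN counted per sentence, and apply to detection and correction alike (only the criterion for a true positive changes). The linked implementation remains the authority on the exact counting rules.

```latex
\begin{aligned}
\text{False Positive Rate} &= \frac{FP}{FP + TN} \\
\text{Accuracy} &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```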

Results

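| System | False Positive Rate | Detection Accuracy | Detection Precision | Detection Recall | Detection F1 | Correction Accuracy | Correction Precision | Correction Recall | Correction F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Soft-Masked BERT (Zhang et al. 2020) | - | 80.9 | 73.7 | 73.2 | 73.5 | 77.4 | 66.7 | 66.2 | 66.4 |
| Confusion-set (Wang et al. 2019) | - | - | 66.8 | 73.1 | 69.8 | - | 71.5 | 59.5 | 64.9 |
| FASPell (Hong et al. 2019) | - | 74.2 | 67.6 | 60.0 | 63.5 | 73.7 | 66.6 | 59.1 | 62.6 |
| CAS (Zhang et al. 2015) | 11.6 | 70.1 | 80.3 | 53.3 | 64.0 | 69.2 | 79.7 | 51.5 | 62.5 |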

Results above are all on the SIGHAN 2015 test set.

Resources

| Source | # sentence pairs | # chars | # spelling errors | character set | genre |
| --- | --- | --- | --- | --- | --- |
| SIGHAN 2015 Training data (Tseng et al. 2015) | 3,174 | 95,220 | 4,237 | traditional | second-language learning |
| SIGHAN 2014 Training data (Yu et al. 2014) | 6,526 | 324,342 | 10,087 | traditional | second-language learning |
| SIGHAN 2013 Training data (Wu et al. 2013) | 350 | 17,220 | 350 | traditional | second-language learning |

Other Resources

| Source | # sentence pairs | # chars | # spelling errors | character set | genre |
| --- | --- | --- | --- | --- | --- |
| Synthetic training dataset (Wang et al. 2018) | 271,329 | 12M | 382,702 | simplified | news |

Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com