Chinese Text Classification

July 29, 2020

Background

Text classification assigns tags or categories to a text according to its topical content; classifiers are typically trained on labeled documents. Topics are sometimes broad and akin to genre (news, sports, arts) and sometimes as fine-grained as hashtags.

Example input/output

Input:

[国足]有信心了 中国国奥队取得热身赛三连胜

Output:

Sports

Standard Metrics

  • Accuracy: the percentage of correctly classified samples.
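
As a minimal sketch of this metric, assuming gold and predicted labels arrive as two equal-length lists (the function name `accuracy` and the sample labels below are illustrative and not taken from any benchmark tooling):

```python
def accuracy(gold, predicted):
    """Return the fraction of samples whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample set")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical labels for four headlines; three of the four predictions match.
gold_labels = ["Sports", "Finance", "Sports", "Education"]
predicted_labels = ["Sports", "Finance", "Education", "Education"]
print(f"{accuracy(gold_labels, predicted_labels):.2%}")  # 75.00%
```

The function returns a fraction between 0 and 1; the results on this page report the same quantity as a percentage.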

THUCNews.

Data from Sina News RSS subscription channels from 2005 to 2011, containing 740,000 news documents (2.19 GB) in 14 topics, all in UTF-8 plain text format.

Source      # Classes   Size (sentences)
THUCNews    14          740,000

Metrics

  • Accuracy

Results

System                       Accuracy
J. Chen, C. Cao, X. Jiang    98.7%
Y. Song                      97.56%
W. Liu, P. Zhou, et al.      96.71%
S. Xin                       96.04%
Sun, Baohua, et al.          94.85%

SogouCS.

Sohu News from June to July 2012 in 18 channels.

Source                # Classes   Size (sentences)
Sogou news dataset    5           86,597

Metrics

  • Accuracy

Results

System                  Error rate
Chung, Tonglee, et al.  3.37%
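
This result is reported as an error rate, while the other datasets on this page report accuracy. Assuming the error rate is simply the complement of accuracy on the same test set, a small sketch of the conversion (the helper name `error_rate_to_accuracy` is illustrative):

```python
def error_rate_to_accuracy(error_rate_percent):
    """Convert an error rate in percent to accuracy in percent (accuracy = 100 - error)."""
    if not 0.0 <= error_rate_percent <= 100.0:
        raise ValueError("error rate must be between 0 and 100 percent")
    return 100.0 - error_rate_percent


print(f"{error_rate_to_accuracy(3.37):.2f}%")  # 96.63%
```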

Resources

Dataset               Classes   Train size (samples)
Sogou news dataset    5         490,717

Fudan corpus.

The corpus contains 9,804 documents of long sentences and paragraphs in 20 categories.

Source          # Classes   Size (sentences)
Fudan corpus    5           1,836

Metrics

  • Accuracy

Results

System                 Accuracy
Sun, Baohua, et al.    97.8%
Meng et al., 2019      96.3%

Resources

Source          # Classes   Size (sentences)
Fudan corpus    5           4,284

Ifeng.

First paragraphs of Chinese news articles from 2006-2016 were evenly split into 5 news channels.

Source    # Classes   Size (sentences)
Ifeng     5           50,000

Metrics

  • Accuracy

Results

System                  Accuracy
Meng et al., 2019       85.8%
Sun, Baohua, et al.     84.4%
Zhang and LeCun, 2017   83.7%

Resources

Dataset   Classes   Train size (samples)
Ifeng     5         800,000

Chinanews.

Chinese news articles from 2008-2016 were evenly split into 7 news channels, with duplicates removed.

Source       # Classes   Size (sentences)
Chinanews    7           112,000

Metrics

  • Accuracy

Results

System                  Accuracy
Sun, Baohua, et al.     92.0%
Meng et al., 2019       91.9%
Zhang and LeCun, 2017   90.9%

Resources

Dataset      Classes   Train size (samples)
Chinanews    7         1,400,000

Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com