Text classification assigns tags or categories to text according to its topical content, typically by training on labeled documents. Topics are sometimes broad and akin to genre (news, sports, arts) and sometimes as fine-grained as hashtags.
Input:
[国足]有信心了 中国国奥队取得热身赛三连胜
Output:
Sports
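To make the input/output mapping above concrete, here is a minimal sketch of a multinomial naive Bayes classifier over character bigrams, written in plain Python. It is only an illustration: the four training headlines and the `Finance` label are made up for this example, and `char_bigrams` and `NaiveBayes` are hypothetical names, not part of any dataset or system listed here.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace.

    Character n-grams avoid the need for Chinese word segmentation.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()                 # documents per label
        self.feature_counts = defaultdict(Counter)    # bigram counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for feat in char_bigrams(text):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            total = sum(self.feature_counts[label].values())
            for feat in char_bigrams(text):
                count = self.feature_counts[label][feat]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training headlines (assumption, not taken from any corpus).
train_texts = [
    "国足热身赛取得胜利",
    "篮球联赛季后赛开打",
    "股市今日大幅上涨",
    "央行宣布下调利率",
]
train_labels = ["Sports", "Sports", "Finance", "Finance"]

clf = NaiveBayes()
clf.fit(train_texts, train_labels)
print(clf.predict("[国足]有信心了 中国国奥队取得热身赛三连胜"))  # Sports
```

With so little training data the prediction rests on a handful of shared bigrams (such as 国足 and 热身), so this should be read as a demonstration of the task format rather than a usable model.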
- Accuracy: the percentage of correctly classified samples.
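
As a small sketch of the metric above, accuracy can be computed by comparing predicted labels against gold labels one by one. The function name `accuracy` and the sample labels are assumptions chosen for illustration.

```python
def accuracy(gold, predicted):
    """Return the percentage of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample set")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Hypothetical labels: 3 of the 4 predictions are correct.
gold = ["Sports", "Finance", "Sports", "Technology"]
predicted = ["Sports", "Finance", "Technology", "Technology"]
print(accuracy(gold, predicted))  # 75.0
```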
THUCNews is drawn from Sina News RSS subscription channel data from 2005 to 2011. It contains about 740,000 news documents (2.19 GB) in 14 topics, all in UTF-8 plain text format.
| Source | # Classes | Size (documents) |
|---|---|---|
| THUCNews | 14 | 740,000 |
Sohu News articles from June to July 2012, drawn from 18 channels.
A corpus of 9,804 documents, made up of long sentences and paragraphs, in 20 categories.
The first paragraphs of Chinese news articles from 2006 to 2016, evenly split across 5 news channels.
| Source | # Classes | Size (sentences) |
|---|---|---|
| Ifeng | 5 | 50,000 |
| Dataset | Classes | Train (sample size) |
|---|---|---|
| Ifeng | 5 | 800,000 |
Chinese news articles from 2008 to 2016, evenly split across 7 news channels, with duplicates removed.
| Source | # Classes | Size (sentences) |
|---|---|---|
| Chinanews | 7 | 112,000 |
| Dataset | Classes | Train (sample size) |
|---|---|---|
| Chinanews | 7 | 1,400,000 |
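
The construction described for this dataset (dropping duplicate articles, then keeping the same number of samples for every channel) can be sketched roughly as follows. This is an illustration under assumptions, not the dataset authors' actual pipeline: the function name `balanced_sample`, the exact-match notion of a duplicate, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_class, seed=0):
    """Deduplicate (text, channel) pairs and draw per_class texts per channel."""
    seen = set()
    by_channel = defaultdict(list)
    for text, channel in records:
        if text in seen:  # exact-match duplicate: keep only the first copy
            continue
        seen.add(text)
        by_channel[channel].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for channel in sorted(by_channel):
        texts = by_channel[channel]
        if len(texts) < per_class:
            raise ValueError(f"channel {channel!r} has only {len(texts)} unique texts")
        sample.extend((text, channel) for text in rng.sample(texts, per_class))
    return sample


# Toy records (made up); "a1" appears twice and is counted once.
records = [
    ("a1", "sports"), ("a1", "sports"), ("a2", "sports"), ("a3", "sports"),
    ("b1", "finance"), ("b2", "finance"),
]
subset = balanced_sample(records, per_class=2)
print(len(subset))  # 4
```

At the scale in the table, an even split of 1,400,000 training samples over 7 channels corresponds to 200,000 samples per channel.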