Datasets

February 16, 2023

Area of Collected Datasets

Named Entity Recognition

| Language | Dataset | Size | #Types | Description | Paper | Download |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese | msra | 46364/-/4365 | 3 | | Levow | damo/msra_ner |
| Chinese | resume | 3821/463/477 | 9 | | Zhang & Yang | damo/resume_ner |
| Chinese | weibo | 1350/269/270 | 4 | | Peng & Dredze | damo/weibo_ner |
| Chinese | ontonotes-v4-zh | 15724/4301/4346 | | | - | ldc/ontonotes-v4 |
| Chinese | cluener2020 | 10748/1343/1345 | 10 | | Xu et al., 2020 | github/cluener2020 |
| Chinese | people_dairy1998 | | 3 | | | github/ChineseNLPCorpus |
| Chinese | people_dairy2014 | | 3 | | | baidu-pan (password: 1fa3) |
| Chinese | cmeee | 15000/5000/3000 | | CMeEE dataset in CBLUE benchmark | Zhang et al., 2022 | github/cblue |
| Chinese | yidu-s4k | | | | - | openkg/yidu-s4k |
| Chinese | ecommerce | | | | Jie et al., 2019 | github/ner_incomplete_annotation/ecommerce |
| Chinese | dlner | | | | Xu et al., 2017 | github/dlner |
| Dutch | conll2002-nl | 15796/2895/5196 | 4 | | Tjong Kim Sang, 2002 | |
| English | wnut2016 | 2394/1000/3850 | | Noisy User-generated Text | Strauss et al., 2016 | damo/wnut16 |
| English | wnut2017 | 3394/1009/1287 | | | Derczynski et al., 2017 | damo/wnut17 |
| English | conll2003-en | 14041/3250/3453 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
| English | conllpp | 14041/3250/3453 | 4 | corrected version of the conll03-en NER dataset | Wang et al., 2019 | damo/conllpp_ner |
| English | ontonotes-v5-en | 59924/8528/8262 | (TBD) | | Pradhan et al., 2013 | ldc/ontonotes-v5 |
| English | ai | 100/350/431 | | | Liu et al., 2020 | damo/cross_ner |
| English | literature | 100/400/416 | | | Liu et al., 2020 | damo/cross_ner |
| English | music | 100/541/465 | | | Liu et al., 2020 | damo/cross_ner |
| English | politics | 200/541/651 | | | Liu et al., 2020 | damo/cross_ner |
| English | science | 200/450/543 | | | Liu et al., 2020 | damo/cross_ner |
| English | bc5cdr | 4560/4581/4797 | | | Li et al., 2016 | |
| English | ncbi | 5424/923/940 | | | Doğan et al., 2014 | |
| English | mit-movie | 6816/1000/1953 | (TBD) | | Liu et al., 2013 | mit/movie |
| English | mit-restaurant | 6900/760/1521 | | | Liu et al., 2013 | mit/restaurant |
| English | ace2004-en | | 7 | nested ner | Doddington et al., 2005 | ldc/ace04 |
| English | ace2005-en | | 7 | nested ner | - | ldc/ace05 |
| English | kbp2017 | | | nested ner | - | - |
| English | genia | | | nested ner | Ohta et al., 2002 | |
| English | few-nerd | 131767/18824/37548 | 8 / 66 | a few-shot ner dataset | Ding et al., 2021 | |
| English | wikigold | | | | Balasuriya et al., 2009 | |
| English | bionlp2014 | | | | Collier & Kim, 2004 | |
| English | fin | | | | Alvarado et al., 2015 | |
| English | btc | 6338/1001/2000 | 3 | | Derczynski et al., 2016 | |
| English | ttc | | | | Rijhwani & Preoţiuc-Pietro | github/ttc |
| English | tweebank | | | | Jiang et al., 2022 | github/tweebank |
| English | tweetner7 | | | | Ushio et al., 2022 | huggingface/tweetner7 |
| German | conll2003-de | 12152/2866/3005 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
| Spanish | conll2002-es | 8302/1919/1517 | 4 | | Tjong Kim Sang, 2002 | |
| English | twitter2015 | | | multi-modal | Zhang et al., 2018 | |
| English | snap | | | multi-modal | Lu et al., 2018 | github/UMT |
| English | twitter2017 | | | multi-modal | Yu et al., 2020 | github/UMT |
| English | wiki-diverse | | | constructed from wiki-diverse (a multi-modal entity typing dataset) | Wang et al., 2022 | github/wikidiverse |
| 11 langs | multiconer2022 | - | 6 | dataset of SemEval 2022 Task 11 (English, Spanish, Dutch, Russian, Turkish, Korean, Farsi, German, Chinese, Hindi, and Bangla) | Malmasi et al., 2022 | aws/multiconer |
| 282 langs | wikiann | - | | silver-standard data | Pan et al., 2017 | github/wikiann |
| 9 langs | wikiner | - | | silver-standard data | Nothman et al., 2013 | |
| 9 langs | wikineural | - | | silver-standard data | Tedeschi et al., 2021 | |
| 10 langs | multinerd | - | | silver-standard data | Tedeschi & Navigli, 2022 | |
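
The Size cells in these tables appear to follow a `train/dev/test` convention, with `-` marking a split that is not provided (for example `46364/-/4365` for msra). This reading is an assumption inferred from the entries, not something the list states. Under that assumption, a minimal Python sketch for turning a Size cell into per-split counts could look like the following; `parse_split_sizes` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Optional


def parse_split_sizes(cell: str) -> Dict[str, Optional[int]]:
    """Parse a Size cell such as '46364/-/4365' into per-split counts.

    Assumes the order train/dev/test and that '-' means the split is absent.
    """
    names = ("train", "dev", "test")
    parts = [part.strip() for part in cell.split("/")]
    if len(parts) != len(names):
        raise ValueError(f"expected 3 '/'-separated fields, got {cell!r}")
    # Map '-' to None so a missing split is distinguishable from a count of 0.
    return {
        name: (None if part == "-" else int(part))
        for name, part in zip(names, parts)
    }


print(parse_split_sizes("46364/-/4365"))
# {'train': 46364, 'dev': None, 'test': 4365}
```

Cells that are empty or carry extra fields (such as the `Train/Dev/Test/KB Size` column in the entity linking table) raise a `ValueError` in this sketch and would need separate handling.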

Chinese Word Segmentation

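| Language | Dataset | Size | #Types | Description | Paper | Download |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese | PKU | 19056/-/1944 | - | - | sighan05 | train / test |
| Chinese | MSRA | 86924/-/3985 | - | - | sighan05 | train / test |
| Chinese | CTB6 | 23401/2078/2795 | - | - | Chinese Tree Bank v6 | train / dev / test |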

Part-of-Speech Tagging

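| Language | Dataset | Size | #Types | Description | Paper | Download |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese | CTB5 | - | - | - | | train / dev / test |
| Chinese | CTB8 | 23401/2078/2795 | - | - | Chinese Tree Bank v6 | train / dev / test |
| Chinese | CTB9 | - | - | - | | train / dev / test |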

Ultra-fine Entity-Typing

| Language | Dataset | Size | #Types | Description | Paper | Download |
| --- | --- | --- | --- | --- | --- | --- |
| English | UFET | 1998/1998/1998 | 10331 | Ultra-fine Entity Typing | Choi et al., 2018 | izhx404/ufet |
| Chinese | CFET | 2880/960/958 | 1299 | Unofficial split, no official split provided. | Lee et al., 2020 | izhx404/cfet |

Event Extraction

| Language | Dataset | Size | Description | Paper | Download |
| --- | --- | --- | --- | --- | --- |
| Chinese | FewFC | 7185/899/898 | Passage level | Zhou et al., 2021 | here |
| Chinese | Duee | 11908/1492/34904 | Passage level | Li et al., 2020 | here |
| Chinese | Duee-fin | 7015/1171/59394 | Document level | Li et al., 2020 | here |
| Chinese | ChFinAnn | 25632/3204/3204 | Document level | Zheng et al., 2019 | here |
| English | WIKIEVENTS | 206/20/20 | Document level | Li et al., 2021 | train / dev / test |
| English | RAMS | 7329/924/871 | Document level | Ebner et al., 2020 | here |

Entity Relation Joint Extraction

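| Language | Dataset | Size | Description | Paper | Download |
| --- | --- | --- | --- | --- | --- |
| English | NYT | - | - | Ren et al., 2017 | here |
| English | NYT10-HRL/11-HRL | 70339/-/4006; 62648/-/369 | obtained with the preprocessing from the HRL paper | Takanobu et al., 2019 | here |
| English | WebNLG | 5019/-/703 | - | Gardent et al., 2017 | here |
| English | ADE | - | - | Gurulingappa et al., 2012 | - |
| English | SciERC | 1816/275/551 | - | Luan et al., 2018 | here |
| English | CoNLL04 | - | - | Roth et al., 2004 | - |
| English | ACE04 | - | - | - | here |
| English | ACE05 | 10051/2424/2050 | - | - | here |
| Chinese | DuIE2.0 | 171135/-/21055 | - | Li et al., 2019 | here |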

End-to-End Entity Linking

| Language | Domain | Dataset | Train/Dev/Test/KB Size | Paper/Link | Download |
| --- | --- | --- | --- | --- | --- |
| English | News | AIDA-CoNLL | 12820/4242/3953/5903530 | Hoffart et al., 2011 | here |
| English | Medical | BC5CDR | 9535/9481/10032/2291 | Li et al., 2016 | here |
| English | Speech | NLPCC2022 | 28400/7640/2905/118795 | NLPCC2022 | here |
| Chinese | ShortText | CCKS2020 | 69691/9148/-/3234418 | CCKS2020 | - |