HunFlair - Data Sets

June 28, 2022

This page gives an overview of the biomedical NER data sets integrated in HunFlair.

Content: Overview | HUNER Data Sets | BioBERT Evaluation Splits

Overview

HunFlair integrates 31 biomedical named entity recognition (NER) data sets and provides them in a unified format to foster the development and evaluation of new NER models. All data set implementations can be found in flair.datasets.biomedical.

| Corpus | Data Set Class | Entity Types | Reference |
| --- | --- | --- | --- |
| AnatEM | ANAT_EM | Anatomical entities | Paper, Website |
| Arizona Disease | AZDZ | Disease | Website |
| BioCreative II GM | BC2GM | Gene | Paper |
| BioCreative V CDR task | CDR | Chemical, Disease | Paper, Website |
| BioInfer | BIO_INFER | Gene/Protein | Paper |
| BioNLP'2013 Cancer Genetics (ST) | BIONLP2013_CG | Chemical, Disease, Gene/Protein, Species | Paper |
| BioNLP'2013 Pathway Curation (ST) | BIONLP2013_PC | Chemical, Gene/Protein | Paper |
| BioSemantics | BIOSEMANTICS | Chemical, Disease | Paper, Website |
| CellFinder | CELL_FINDER | Cell line, Gene, Species | Paper |
| CEMP | CEMP | Chemical | Website |
| CHEBI | CHEBI | Chemical, Gene, Species | Paper |
| CHEMDNER | CHEMDNER | Chemical | Paper |
| CLL | CLL | Cell line | Paper |
| DECA | DECA | Gene | Paper |
| FSU | FSU | Gene | Paper |
| GPRO | GPRO | Gene | Website |
| CRAFT (v2.0) | CRAFT | Chemical, Gene, Species | Paper |
| CRAFT (v4.0.1) | CRAFT_V4 | Chemical, Gene, Species | Website |
| GELLUS | GELLUS | Cell line | Paper |
| IEPA | IEPA | Gene | Paper |
| JNLPBA | JNLPBA | Cell line, Gene | Paper |
| LINNEAUS | LINNEAUS | Species | Paper |
| LocText | LOCTEXT | Gene, Species | Paper |
| miRNA | MIRNA | Disease, Gene, Species | Paper |
| NCBI Disease | NCBI_DISEASE | Disease | Paper |
| Osiris v1.2 | OSIRIS | Gene | Paper |
| Plant-Disease-Relations | PDR | Disease | Paper, Website |
| S800 | S800 | Species | Paper |
| SCAI Chemicals | SCAI_CHEMICALS | Chemical | Paper |
| SCAI Disease | SCAI_DISEASE | Disease | Paper |
| Variome | VARIOME | Gene, Disease, Species | Paper |

Note: The table only gives an overview of the entity types annotated in the individual corpora. Please refer to the original publications for annotation details.
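
As a rough illustration of how the data set classes listed above might be used, the following sketch instantiates a class from flair.datasets.biomedical by its name. The helper `load_biomedical_corpus` and its `module` parameter are hypothetical names introduced here, not part of HunFlair. The sketch also assumes the class can be constructed with its default arguments, which typically downloads the corpus on first use.

```python
import importlib
from typing import Any, Optional


def load_biomedical_corpus(class_name: str, module: Optional[Any] = None) -> Any:
    """Instantiate a data set class such as "NCBI_DISEASE" by its name.

    `module` is a hypothetical hook for passing in an already imported module;
    by default, flair.datasets.biomedical is imported on first use.
    """
    if module is None:
        # Imported lazily, so defining this helper does not require flair.
        module = importlib.import_module("flair.datasets.biomedical")
    corpus_class = getattr(module, class_name, None)
    if corpus_class is None:
        raise ValueError(f"Unknown data set class: {class_name!r}")
    # Assumption: the class can be constructed with its default arguments.
    return corpus_class()
```

With flair installed, `corpus = load_biomedical_corpus("NCBI_DISEASE")` should return a corpus object whose `train`, `dev`, and `test` attributes hold the respective splits.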

HUNER Data Sets

In addition to the individual biomedical data sets, HunFlair provides the fixed splits used by HUNER (Weber et al.) to improve the comparability of evaluations.

| Entity Type | Data Set Class | Contained Data Sets |
| --- | --- | --- |
| Cell Line | HUNER_CELL_LINE | HUNER_CELL_LINE_CELL_FINDER, HUNER_CELL_LINE_CLL, HUNER_CELL_LINE_GELLUS, HUNER_CELL_LINE_JNLPBA |
| Chemical | HUNER_CHEMICAL | HUNER_CHEMICAL_CDR, HUNER_CHEMICAL_CEMP, HUNER_CHEMICAL_CHEBI, HUNER_CHEMICAL_CHEMDNER, HUNER_CHEMICAL_CRAFT_V4, HUNER_CHEMICAL_SCAI |
| Disease | HUNER_DISEASE | HUNER_DISEASE_CDR, HUNER_DISEASE_MIRNA, HUNER_DISEASE_NCBI, HUNER_DISEASE_SCAI, HUNER_DISEASE_VARIOME |
| Gene/Protein | HUNER_GENE | HUNER_GENE_BC2GM, HUNER_GENE_BIO_INFER, HUNER_GENE_CELL_FINDER, HUNER_GENE_CHEBI, HUNER_GENE_CRAFT_V4, HUNER_GENE_DECA, HUNER_GENE_FSU, HUNER_GENE_GPRO, HUNER_GENE_IEPA, HUNER_GENE_JNLPBA, HUNER_GENE_LOCTEXT, HUNER_GENE_MIRNA, HUNER_GENE_OSIRIS, HUNER_GENE_VARIOME |
| Species | HUNER_SPECIES | HUNER_SPECIES_CELL_FINDER, HUNER_SPECIES_CHEBI, HUNER_SPECIES_CRAFT_V4, HUNER_SPECIES_LINNEAUS, HUNER_SPECIES_LOCTEXT, HUNER_SPECIES_MIRNA, HUNER_SPECIES_S800, HUNER_SPECIES_VARIOME |
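
The naming scheme in the table above (combined class name plus a source suffix) can be captured in a small lookup, sketched below. The dictionary `HUNER_SPLITS` and both helper functions are hypothetical names used for illustration only. The entries are copied from the table, and the code builds class names as strings without touching flair itself.

```python
from typing import Dict, List

# Source-corpus suffixes per combined HUNER class, copied from the table above.
HUNER_SPLITS: Dict[str, List[str]] = {
    "HUNER_CELL_LINE": ["CELL_FINDER", "CLL", "GELLUS", "JNLPBA"],
    "HUNER_CHEMICAL": ["CDR", "CEMP", "CHEBI", "CHEMDNER", "CRAFT_V4", "SCAI"],
    "HUNER_DISEASE": ["CDR", "MIRNA", "NCBI", "SCAI", "VARIOME"],
    "HUNER_GENE": [
        "BC2GM", "BIO_INFER", "CELL_FINDER", "CHEBI", "CRAFT_V4", "DECA", "FSU",
        "GPRO", "IEPA", "JNLPBA", "LOCTEXT", "MIRNA", "OSIRIS", "VARIOME",
    ],
    "HUNER_SPECIES": [
        "CELL_FINDER", "CHEBI", "CRAFT_V4", "LINNEAUS", "LOCTEXT", "MIRNA",
        "S800", "VARIOME",
    ],
}


def huner_split_classes(combined: str) -> List[str]:
    """Return the per-corpus class names contained in a combined HUNER class."""
    return [f"{combined}_{suffix}" for suffix in HUNER_SPLITS[combined]]


def huner_classes_for_source(suffix: str) -> List[str]:
    """Return every HUNER class name that is built from the given source corpus."""
    return [
        f"{combined}_{suffix}"
        for combined, suffixes in HUNER_SPLITS.items()
        if suffix in suffixes
    ]
```

For example, `huner_classes_for_source("CELL_FINDER")` lists the cell line, gene, and species splits derived from CellFinder. Any of the resulting names can then be looked up in flair.datasets.biomedical.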

BioBERT Evaluation Splits

To ease comparison with BioBERT, HunFlair provides the splits used by Lee et al.: BIOBERT_CHEMICAL_BC4CHEMD, BIOBERT_GENE_BC2GM, BIOBERT_GENE_JNLPBA, BIOBERT_CHEMICAL_BC5CDR, BIOBERT_DISEASE_BC5CDR, BIOBERT_DISEASE_NCBI, BIOBERT_SPECIES_LINNAEUS, and BIOBERT_SPECIES_S800.

Note: To download and use the BioBERT corpora, you need to install the package googledrivedownloader, since the files are hosted on Google Drive:

pip install googledrivedownloader
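
To fail early with a clear message when the dependency is missing, one could check for it before instantiating a BioBERT corpus. The sketch below uses only the Python standard library (Python 3.8 or later). The function names `is_installed` and `require_gdrive_downloader` are hypothetical and not part of HunFlair.

```python
from importlib import metadata


def is_installed(distribution: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return False
    return True


def require_gdrive_downloader() -> None:
    """Raise a descriptive error if googledrivedownloader is not installed."""
    if not is_installed("googledrivedownloader"):
        raise RuntimeError(
            "The BioBERT corpora need googledrivedownloader; "
            "install it with: pip install googledrivedownloader"
        )
```

Calling `require_gdrive_downloader()` before creating, for instance, `BIOBERT_DISEASE_NCBI` turns a missing package into an explicit installation hint.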