Customizing Dataset
December 29, 2022
This part of the tutorial shows how to prepare a custom dataset.
1. Overview
To load a custom dataset, you need to answer two questions:
- Where is the dataset?
- Is the data format already supported?
The following sections introduce the available dataset loading methods and the data formats currently supported (more are on the way).
2. Loading Dataset
Currently, five dataset loading methods are supported. You select one of them in the configuration file.
2.1 Loading dataset from ModelScope
dataset:
  name: ${modelscope_dataset_name}
  access_token: ${access_token}
`name` should be one of the datasets uploaded to ModelScope, such as `damo/resume_ner`. `access_token` is only needed if the dataset is private.
2.2 Loading dataset from Huggingface
dataset:
  path: ${huggingface_dataset_name}
`path` should be one of the datasets uploaded to Huggingface, such as `conll2003`.
2.3 Loading dataset via dataset loading script
dataset:
  path: ${path_to_py_script_or_folder}
`path` should be the absolute path to a custom Python script for `datasets.load_dataset`, or to a directory containing that script.
2.4 Loading dataset files
dataset:
  data_file:
    train: ${path_to_train_file}
    valid: ${path_to_validation_file}
    test: ${path_to_test_file}
  data_type: ${data_format}
`train`, `valid`, and `test` can be URLs or local paths (absolute paths) to the dataset files. `data_type` should be one of the supported data formats, such as `conll`.
2.5 Loading dataset directory or archive
dataset:
  data_file: ${path_or_url_to_dir_or_archive}
  data_type: ${data_format}
`data_file` can be a URL such as "https://data.deepai.org/conll2003.zip", a local directory (absolute path) such as "/home/data/conll2003", or a local archive file (absolute path) such as "/home/data/conll2003.zip". As before, `data_type` should be one of the supported data formats, such as `conll`.
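To make the three accepted kinds of `data_file` value concrete, here is a minimal Python sketch that tells them apart. The helper `describe_data_file` and the suffix list are hypothetical and written only for illustration; they are not part of the library, which may detect these cases differently.

```python
import os
from urllib.parse import urlparse

# Illustrative suffix list only; the formats actually supported may differ.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz")


def describe_data_file(data_file: str) -> str:
    """Classify a data_file value as 'url', 'archive', or 'directory'.

    Hypothetical helper: it inspects the string only and never touches
    the network or the file system.
    """
    if urlparse(data_file).scheme in ("http", "https"):
        return "url"
    if not os.path.isabs(data_file):
        # The tutorial asks for absolute paths for local files and directories.
        raise ValueError(f"local paths must be absolute: {data_file!r}")
    if data_file.lower().endswith(ARCHIVE_SUFFIXES):
        return "archive"
    return "directory"


for value in (
    "https://data.deepai.org/conll2003.zip",
    "/home/data/conll2003",
    "/home/data/conll2003.zip",
):
    print(value, "->", describe_data_file(value))
```

A check like this can be useful before launching a run, since a relative path is easy to write by mistake.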
3. Supported Dataset Formats
3.1 Sequence Labeling Tasks
For example, NER, CWS, POS Tagging, etc.
3.1.1 CoNLL format
The widely-used CoNLL format is a specific vertical format (like TSV) that represents a tagged dataset. Normally it is a
text file with one word per line with sentences separated by an empty line. The first column in a line should be a
word and the last column should be the word's tag (usually from BIO or BIOES).
Data Example:
鲁 B-ORG
迅 I-ORG
文 I-ORG
学 I-ORG
院 I-ORG
组 O
织 O
有 O
关 O
专 O
家 O

我 O
是 O
另 O
一 O
句 O
话 O
To use the CoNLL format, set `data_type: conll`. Optionally, you can use `delimiter: ${custom_delimiter}` to set a custom delimiter for the CoNLL file. By default, the delimiter is whitespace or tab.
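The reading rules above (one token per line, first column is the word, last column is the tag, an empty line ends a sentence) can be sketched in a few lines of Python. This is an illustrative parser, not the library's own reader; the function name `parse_conll` and its `delimiter` argument are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Sentence = Tuple[List[str], List[str]]  # (words, tags)


def parse_conll(text: str, delimiter: Optional[str] = None) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (words, tags).

    delimiter=None splits on any run of spaces or tabs, mirroring the
    default described above; pass a string to use a custom delimiter.
    """
    sentences: List[Sentence] = []
    words: List[str] = []
    tags: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        columns = line.split(delimiter)
        words.append(columns[0])   # first column: the word
        tags.append(columns[-1])   # last column: the tag
    if words:
        # The text may not end with an empty line.
        sentences.append((words, tags))
    return sentences


sample = "鲁 B-ORG\n迅 I-ORG\n\n我 O\n是 O\n"
for words, tags in parse_conll(sample):
    print(words, tags)
```

Taking the last column as the tag means files with extra middle columns (for example POS tags) are still read correctly.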
3.1.2 json-tags format
The json-tags format is similar to the CoNLL format: each sentence is a JSON object with a `text` field and a `labels` field. `text` and `labels` must have exactly the same length, so that each label can be assigned to its corresponding character.
{
"text": "鲁迅文学院组织有关专家",
"labels": ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O", "O", "O"]
}
To use the json-tags format, set `data_type: json_tags`.
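Because the format depends on `text` and `labels` having equal length, it can help to validate records before training. The sketch below checks one record and pairs each character with its label; `pair_chars_with_labels` is a hypothetical helper, not a library function.

```python
import json
from typing import List, Tuple


def pair_chars_with_labels(record: dict) -> List[Tuple[str, str]]:
    """Return (character, label) pairs for one json-tags record."""
    text, labels = record["text"], record["labels"]
    if len(text) != len(labels):
        raise ValueError(
            f"length mismatch: {len(text)} characters vs {len(labels)} labels"
        )
    return list(zip(text, labels))


raw = (
    '{"text": "鲁迅文学院组织有关专家", '
    '"labels": ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", '
    '"O", "O", "O", "O", "O", "O"]}'
)
pairs = pair_chars_with_labels(json.loads(raw))
print(pairs[:6])
```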
3.1.3 json-spans format
The json-spans format is another widely used format for both flat NER and nested NER. Each meaningful span is
represented as a dict with `start`, `end`, and `type` fields, giving the [start, end) offsets and the type of the span.
{
"text": "鲁迅文学院组织有关专家",
"spans": [{"start": 0, "end": 5, "type": "ORG"}, ...]
}
In addition, `type` may be a list of labels, which makes multi-label tagging possible.
{
"text": "人民日报出版社新近出版了王梦奎的短文集《翠微居杂笔》。",
"spans": [{"start": 0, "end": 7, "type": ["组织", "出版商", "出版社"]}, ...]
}
To use the json-spans format, set `data_type: json_spans`.
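The half-open [start, end) convention maps directly onto Python slicing, which the following sketch uses to recover the surface text of each span. It also normalises `type` to a list so that single-label and multi-label spans are handled the same way. `span_texts` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List, Tuple


def span_texts(record: Dict) -> List[Tuple[str, List[str]]]:
    """Return (surface text, labels) per span, reading offsets as [start, end)."""
    text = record["text"]
    result = []
    for span in record["spans"]:
        start, end = span["start"], span["end"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {span!r}")
        labels = span["type"]
        if isinstance(labels, str):
            labels = [labels]  # treat a single label as a one-element list
        result.append((text[start:end], list(labels)))
    return result


flat = {
    "text": "鲁迅文学院组织有关专家",
    "spans": [{"start": 0, "end": 5, "type": "ORG"}],
}
multi = {
    "text": "人民日报出版社新近出版了王梦奎的短文集《翠微居杂笔》。",
    "spans": [{"start": 0, "end": 7, "type": ["组织", "出版商", "出版社"]}],
}
print(span_texts(flat))
print(span_texts(multi))
```

Printing the sliced text is a quick way to spot off-by-one errors when converting from a format that uses inclusive end offsets.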
3.1.4 CLUENER format
The CLUENER format is the official format used in the CLUENER benchmark, which gathers entities of the same type in a group.
{
"text": "鲁迅文学院组织有关专家",
"label": {"ORG": [[0, 5], ...]}
}
To use the CLUENER format, set `data_type: cluener`.
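To show how the grouped layout relates to json-spans, the sketch below flattens a record into a list of spans. It assumes exactly the shape shown in the example above (a mapping from type to a list of [start, end] pairs) and assumes the offsets are half-open, as in json-spans. Neither assumption is confirmed here, so check both against your own data. `cluener_to_spans` is a hypothetical helper.

```python
from typing import Dict, List


def cluener_to_spans(record: Dict) -> Dict:
    """Flatten {"label": {type: [[start, end], ...]}} into a json-spans record.

    Assumption: offsets are copied unchanged, i.e. treated as [start, end).
    """
    spans: List[Dict] = []
    for span_type, offsets in record["label"].items():
        for start, end in offsets:
            spans.append({"start": start, "end": end, "type": span_type})
    # Sort by position so the output does not depend on the grouping order.
    spans.sort(key=lambda s: (s["start"], s["end"]))
    return {"text": record["text"], "spans": spans}


record = {"text": "鲁迅文学院组织有关专家", "label": {"ORG": [[0, 5]]}}
converted = cluener_to_spans(record)
print(converted)
```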