April 22, 2025
To help users apply DeepKE to entity recognition tasks, we provide an easy-to-use automatic annotation tool for entity recognition based on dictionary matching.
Dict
- The format of Dict:
- Two entity dicts (one Chinese, one English) are provided in advance. Samples are tagged automatically by combining entity dictionary matching with jieba part-of-speech tagging.
  - The Chinese example dict is adapted from the People's Daily dataset, an NER dataset covering three types of named entities: persons (PER), locations (LOC), and organizations (ORG).
  - The English example dict is adapted from the CoNLL dataset, an NER dataset covering four types of named entities: persons (PER), locations (LOC), organizations (ORG), and others (MISC). You can get the CoNLL dataset with the following command.
    wget 121.41.117.246:8080/Data/ner/few_shot/data.tar.gz
- Pre-provided dict from Google Drive:
  - From BaiduNetDisk:
- If you need to build your own domain-specific dictionary, please follow the format of the pre-provided dictionary (csv):

  | Entity | Label |
  | --- | --- |
  | Washington | LOC |
  | ... | ... |
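
As an illustration of the dictionary format above, here is a minimal sketch of reading such a csv into a Python mapping. It assumes a comma-delimited file whose first row is the `Entity,Label` header; the function name `load_entity_dict` and the sample rows other than `Washington,LOC` are hypothetical and not taken from the DeepKE script.

```python
import csv
import io


def load_entity_dict(csv_file):
    """Read (entity, label) rows into a dict, skipping an assumed header row."""
    entity_to_label = {}
    for i, row in enumerate(csv.reader(csv_file)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        entity, label = row[0].strip(), row[1].strip()
        if i == 0 and (entity, label) == ("Entity", "Label"):
            continue  # assumed header row
        if entity:
            entity_to_label[entity] = label
    return entity_to_label


# In-memory stand-in for a dictionary file such as vocab_dict.csv
sample_csv = io.StringIO("Entity,Label\nWashington,LOC\nUnited Nations,ORG\n")
entity_dict = load_entity_dict(sample_csv)
print(entity_dict)
```

For a real file, pass an open file handle instead, for example `open("vocab_dict.csv", newline="", encoding="utf-8")`.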
Source File
- The input dictionary is a csv file with two columns: entities and their corresponding labels.
- Data to be automatically labeled (txt format, one sample per line, as shown in the figure below) should be placed under the source_data path. The script traverses all txt files in this folder and labels them line by line.
- The output (the split ratio among training set, validation set, and test set can be customized) can be used directly as training data in DeepKE.
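
The line-by-line dictionary labeling described above can be sketched roughly as follows. This is not the DeepKE implementation: it assumes whitespace tokenization (so it only suits English text), a greedy longest-match strategy, and BIO-style output tags, and it leaves out the jieba part-of-speech step entirely. The function name `tag_tokens` and the `max_len` parameter are hypothetical.

```python
def tag_tokens(tokens, entity_dict, max_len=5):
    """Label tokens with BIO tags by greedy longest match against entity_dict."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = entity_dict.get(candidate)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n  # continue after the matched entity
                matched = True
                break
        if not matched:
            i += 1
    return tags


entity_dict = {"Washington": "LOC", "United Nations": "ORG"}
line = "The United Nations met in Washington"
tokens = line.split()
tags = tag_tokens(tokens, entity_dict)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

For Chinese text the same idea would apply to characters or segmented words rather than whitespace-separated tokens, with candidates joined without spaces.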
Environment
Implementation Environment:
- jieba = 0.42.1
Args Description
- language: cn or en
- source_dir: Corpus path (the script traverses all txt files in this folder and labels them line by line; defaults to source_data)
- dict_dir: Entity dict path (defaults to vocab_dict.csv)
- train_rate, dev_rate, test_rate: The ratio of training set, validation set, and test set (please make sure the sum is 1; default 0.8:0.1:0.1)
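
To make the ratio arguments concrete, here is a minimal sketch of a three-way split that checks the ratios sum to 1. The function name `split_samples`, the `seed` parameter, and the shuffling step are assumptions for illustration; the actual script may split the data differently.

```python
import math
import random


def split_samples(samples, train_rate=0.8, dev_rate=0.1, test_rate=0.1, seed=0):
    """Shuffle samples and split them into train, dev, and test lists."""
    if not math.isclose(train_rate + dev_rate + test_rate, 1.0, abs_tol=1e-9):
        raise ValueError("train_rate + dev_rate + test_rate must equal 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = int(len(shuffled) * train_rate)
    n_dev = int(len(shuffled) * dev_rate)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


samples = [f"sentence {i}" for i in range(10)]
train, dev, test = split_samples(samples)
print(len(train), len(dev), len(test))
```

Assigning the remainder to the test set keeps every sample in exactly one split even when the counts do not divide evenly.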
Run
- Chinese
python prepare_weaksupervised_data.py --language cn --dict_dir vocab_dict_cn.csv
- English
python prepare_weaksupervised_data.py --language en --dict_dir vocab_dict_en.csv