
April 22, 2025

To help users apply DeepKE to entity recognition tasks, we provide an easy-to-use automatic annotation tool for entity recognition based on dictionary matching.

Dict

The format of Dict:
Two entity dicts (one Chinese, one English) are provided in advance. Samples are tagged automatically using the entity dictionary together with jieba part-of-speech tagging.
- The Chinese example dict is adapted from the People's Daily dataset, an NER dataset covering three types of named entities: persons (PER), locations (LOC), and organizations (ORG).
- The English example dict is adapted from the CoNLL dataset, an NER dataset covering four types of named entities: persons (PER), locations (LOC), organizations (ORG), and others (MISC). You can get the CoNLL dataset with the following command.
```
wget 121.41.117.246:8080/Data/ner/few_shot/data.tar.gz
```
- Pre-provided dict from Google Drive:
  - CN(vocab_dict_cn), EN(vocab_dict_en)
- From BaiduNetDisk:
  - CN(vocab_dict_cn), EN(vocab_dict_en)
  - (x7ba)
To build your own domain-specific dictionary, follow the format of the pre-provided dictionaries (csv):

| Entity | Label |
| --- | --- |
| Washington | LOC |
| ... | ... |
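
As a rough illustration of the two-column format above, the following sketch reads such a dictionary into a Python mapping. It assumes a comma-separated file with an `Entity,Label` header row. The function name `load_entity_dict` is hypothetical and is not part of DeepKE.

```python
import csv
import io


def load_entity_dict(csv_file):
    """Read (entity, label) rows from a two-column CSV into a dict.

    Hypothetical helper for illustration; not DeepKE's own loader.
    """
    entity_to_label = {}
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        entity, label = row[0].strip(), row[1].strip()
        if entity == "Entity" and label == "Label":
            continue  # skip the header row (assumed to be present)
        if entity:
            entity_to_label[entity] = label
    return entity_to_label


# In-memory stand-in for a file such as vocab_dict_en.csv
sample_csv = io.StringIO("Entity,Label\nWashington,LOC\nUnited Nations,ORG\n")
entity_dict = load_entity_dict(sample_csv)
print(entity_dict)
```

For a real file, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.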

Source File

The input dictionary is a csv file with two columns: entities and their corresponding labels.
Data to be annotated automatically (txt format, one sample per line, as shown in the figure below) should be placed under the source_data path. The script traverses all txt files in this folder and annotates them line by line.
The output files (the split ratio of training, validation, and test sets can be customized) can be used directly as training data in DeepKE.
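
To make the idea of dictionary-based annotation concrete, here is a minimal sketch of greedy longest-match tagging over whitespace tokens. It is not the script's actual implementation: it skips the jieba part-of-speech step, and it assumes a BIO tagging scheme. The function name `annotate_tokens` is hypothetical.

```python
def annotate_tokens(tokens, entity_dict):
    """Tag tokens with BIO labels using greedy longest-match dictionary lookup.

    Illustrative only: assumes whitespace tokenization and a BIO scheme.
    """
    # Split each dictionary entry into a tuple of tokens for phrase lookup
    phrase_to_label = {
        tuple(entity.split()): label for entity, label in entity_dict.items()
    }
    max_len = max((len(phrase) for phrase in phrase_to_label), default=0)

    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase starting at position i first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = phrase_to_label.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


tokens = "The United Nations met in Washington".split()
tags = annotate_tokens(tokens, {"Washington": "LOC", "United Nations": "ORG"})
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Trying the longest phrase first keeps a multi-word entry such as "United Nations" from being shadowed by a shorter entry that happens to start at the same token.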

Environment

Implementation Environment:

jieba = 0.42.1

Args Description

language: cn or en
source_dir: Corpus path (the script traverses all txt files under this folder and annotates them line by line; defaults to source_data)
dict_dir: Entity dict path (defaults to vocab_dict.csv)
train_rate, dev_rate, test_rate: The ratios of the training set, validation set, and test set (make sure they sum to 1; default 0.8:0.1:0.1)
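
The ratio arguments can be pictured with the following sketch, which splits a list of samples into three parts by ratio. The shuffling, the fixed seed, the rounding rule, and the name `split_samples` are all assumptions for illustration. The actual script may split differently.

```python
import random


def split_samples(samples, train_rate=0.8, dev_rate=0.1, test_rate=0.1, seed=0):
    """Shuffle samples and split them into train/dev/test parts by ratio.

    Hypothetical helper for illustration; not DeepKE's own splitting code.
    """
    if abs(train_rate + dev_rate + test_rate - 1.0) > 1e-9:
        raise ValueError("train_rate, dev_rate and test_rate must sum to 1")

    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = round(len(shuffled) * train_rate)
    n_dev = round(len(shuffled) * dev_rate)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


train, dev, test = split_samples(range(100))
print(len(train), len(dev), len(test))
```

Giving the remainder to the last part means no sample is dropped when the ratios do not divide the sample count evenly.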

Run

Chinese

python prepare_weaksupervised_data.py --language cn --dict_dir vocab_dict_cn.csv

English

python prepare_weaksupervised_data.py --language en --dict_dir vocab_dict_en.csv