Easy Start
April 22, 2025
Model
This project implements extraction models for NER tasks in the Standard scenario. The corresponding paths are:
- BiLSTM-CRF
- Bert (Tip: If using the dataset provided below, we recommend setting learning_rate to 2e-5 and num_train_epochs to 10)
- W2NER
Experimental Results
| Model | Precision | Recall | F1 Score | Inference Time (People's Daily) |
|---|---|---|---|---|
| BERT | 91.15 | 93.68 | 92.40 | 106s |
| BiLSTM-CRF | 92.11 | 88.56 | 90.29 | 39s |
| W2NER | 96.76 | 96.11 | 96.43 | - |
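As a quick sanity check, the F1 values in the table can be reproduced as the harmonic mean of the two metric columns to their left, to within about 0.01 from rounding. The sketch below only redoes that arithmetic on the numbers copied from the table; the helper name `f1_score` is illustrative and not part of DeepKE.

```python
def f1_score(p: float, r: float) -> float:
    """Harmonic mean of two scores given in percent."""
    return 2 * p * r / (p + r)


# (first metric column, recall, reported F1), copied from the table above
reported = {
    "BERT": (91.15, 93.68, 92.40),
    "BiLSTM-CRF": (92.11, 88.56, 90.29),
    "W2NER": (96.76, 96.11, 96.43),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: computed {f1_score(p, r):.2f}, reported {f1:.2f}")
```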
Clone the Repository
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/standard
Environment Setup
1. Create a Python virtual environment and activate it:
conda create -n deepke python=3.8
conda activate deepke
2. Install dependencies:
pip install -r requirements.txt
Parameter Configuration
1. Model Parameters
Model-specific configurations (e.g., model path, hidden layer dimensions, case sensitivity) can be found in the YAML files under conf/hydra/model/.
2. Other Parameters
Settings for environment paths and other hyperparameters during training are located in train.yaml and custom.yaml.
Note: Vocabulary usage during training:
- For the Bert model, the vocabulary is derived from the pre-trained weights on Hugging Face.
- For BiLSTM-CRF, the vocabulary must be built from the training dataset and saved in a .pkl file for prediction and evaluation (configured in the model_vocab_path attribute of lstmcrf.yaml).
- If model downloads fail because of network errors, we recommend using the Hugging Face mirror site. After downloading, modify the model path in conf/hydra/model/*.yaml.
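To illustrate what "building the vocabulary from the training dataset and saving it in a .pkl file" can look like, here is a minimal sketch using only the standard library. It is not DeepKE's implementation: the function names, the special tokens `<pad>` and `<unk>`, and the plain-dict layout of the pickle are all assumptions made for the example.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Map every training token to an integer id, most frequent first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok: i for i, tok in enumerate(specials)}  # assumed special tokens
    for tok, _ in counts.most_common():
        vocab.setdefault(tok, len(vocab))
    return vocab


def save_vocab(vocab, path):
    """Write the vocabulary so prediction and evaluation can reuse it."""
    with Path(path).open("wb") as f:
        pickle.dump(vocab, f)


def load_vocab(path):
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Toy training sentences; a real run would read them from train.txt.
train_sentences = [["在", "厦", "门"], ["在", "北", "京"]]
vocab = build_vocab(train_sentences)

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vocab.pkl"  # hypothetical file name
    save_vocab(vocab, vocab_path)
    restored = load_vocab(vocab_path)

print(restored)
```

Reloading the same file at prediction time keeps token ids consistent with the ones the model saw during training, which is why the path is fixed in the configuration rather than rebuilt on each run.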
Training with Dataset
1. Supported Data Formats
The model supports json, docx, and txt formats. For details, refer to the data folder. The default dataset is the People's Daily (Chinese NER) with text data in {word, label} pairs.
- Note for English datasets: Update lan in config.yaml before prediction, then install nltk and download its tokenizer data with nltk.download('punkt').
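The {word, label} layout can be read with a few lines of Python. The sketch below assumes one whitespace-separated token and label per line, with a blank line between sentences; that layout and the `B-LOC`/`I-LOC` tags in the sample are assumptions for illustration, so check the files in the data folder before relying on it. The function name `read_pairs` is hypothetical.

```python
def read_pairs(lines):
    """Group 'token label' lines into sentences; a blank line ends a sentence."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        parts = line.rsplit(maxsplit=1)  # split off the last field as the label
        if len(parts) != 2:
            raise ValueError(f"expected 'token label', got: {line!r}")
        tokens.append(parts[0])
        labels.append(parts[1])
    if tokens:  # flush the final sentence if the input has no trailing blank line
        sentences.append((tokens, labels))
    return sentences


sample = ["在 O", "厦 B-LOC", "门 I-LOC", "", "北 B-LOC", "京 I-LOC"]
parsed = read_pairs(sample)
for toks, labs in parsed:
    print(list(zip(toks, labs)))
```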
2. Prepare Data
Download the dataset:
wget 121.41.117.246:8080/Data/ner/standard/data.tar.gz
tar -xzvf data.tar.gz
Place the following files in the data folder:
- train.txt: Training dataset
- valid.txt: Validation dataset
- test.txt: Test dataset
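Before starting a run, it can help to confirm that all three files are in place. The following is a small sketch of such a check, not a DeepKE utility: `missing_data_files` is a hypothetical helper, and the demo uses a temporary directory in place of the real data folder.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("train.txt", "valid.txt", "test.txt")


def missing_data_files(data_dir="data"):
    """Return the required dataset files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


# Demo on a throwaway folder that contains only train.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.txt").write_text("", encoding="utf-8")
    demo_missing = missing_data_files(tmp)

print("missing:", demo_missing)
```

In the example directory itself, calling `missing_data_files("data")` should return an empty list once the archive has been extracted.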
3. Start Training
Choose the appropriate model for your target scenario:
- Bert
  python run_bert.py
  - Update hydra/model in config.yaml to bert. Hyperparameters for BERT are in bert.yaml.
  - Multi-GPU training is supported by setting use_multi_gpu to True in train.yaml. Specify GPUs with os.environ['CUDA_VISIBLE_DEVICES'].
  - For the default dataset downloaded above: Set learning_rate to 2e-5 and num_train_epochs to 10 in train.yaml.
  - Tip: If the dataset is small, reduce the learning rate to avoid fast parameter updates. For large datasets, try increasing the learning rate to speed up convergence. You can also monitor the training process using wandb (set config.yaml's use_wandb parameter to True) and adjust the learning rate and epochs for optimal performance.
- BiLSTM-CRF
  python run_lstmcrf.py
  Configure BiLSTM-CRF hyperparameters in lstmcrf.yaml. Modify other training parameters in the conf folder.
- W2NER
  cd w2ner
  python run.py
  Hyperparameters are in model.yaml. Specify the GPU index using the device parameter (set to 0 for single GPU setups).
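For the multi-GPU setting mentioned under Bert, the sketch below shows one way to restrict a process to specific GPUs through os.environ['CUDA_VISIBLE_DEVICES']. The helper `select_gpus` is hypothetical, and where DeepKE's scripts set this variable is not shown in this section. The variable has to be set before any CUDA-based library initializes the GPU, so it belongs at the very top of the entry script.

```python
import os


def select_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA-based libraries in this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = select_gpus([0, 1])  # make GPUs 0 and 1 visible
print("CUDA_VISIBLE_DEVICES =", visible)

# Import the deep learning framework only after the variable is set, e.g.:
# import torch
```

Inside the process, the selected devices are renumbered from 0, so with the call above physical GPUs 0 and 1 appear as devices 0 and 1, while `select_gpus([2, 3])` would expose physical GPUs 2 and 3 as devices 0 and 1.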
Training Output
- Logs and Results
  - Training logs are saved in the logs folder.
  - Model checkpoints are stored in the checkpoints folder.
- Batch Size
  - For BERT training, a batch size of 64 or more is recommended.
Prediction
Run predictions using:
python predict.py
Prepare weak_supervised data
If you only have text data and corresponding dictionaries but no canonical training data, you can generate weakly supervised training data in the required format through automated labeling.
Please make sure that you have:
- High-quality dictionaries
- Enough text data
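To make the idea of dictionary-based automated labeling concrete, here is a generic sketch that tags raw text by longest-match lookup against an entity dictionary. It is not the script shipped with DeepKE: the function name, the character-level BIO tagging scheme, and the toy dictionary are assumptions chosen for illustration.

```python
def label_with_dictionary(text, entity_dict):
    """Tag each character with BIO labels using longest-match dictionary lookup."""
    max_len = max((len(term) for term in entity_dict), default=0)
    labels = ["O"] * len(text)
    i = 0
    while i < len(text):
        # Try the longest candidate first so that longer entries win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            span = text[i:i + size]
            if span in entity_dict:
                etype = entity_dict[span]
                labels[i] = f"B-{etype}"
                for j in range(i + 1, i + size):
                    labels[j] = f"I-{etype}"
                i += size
                break
        else:
            i += 1  # no dictionary entry starts at this position
    return list(zip(text, labels))


# Toy dictionary mapping surface forms to entity types (assumed labels).
entity_dict = {"北京": "LOC", "北京大学": "ORG"}
labeled = label_with_dictionary("我在北京大学", entity_dict)
for char, tag in labeled:
    print(char, tag)
```

Because every label comes from a dictionary hit, missing or ambiguous entries turn directly into wrong labels, which is why dictionary quality and a sufficient amount of text matter for this approach.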