July 13, 2022

General usage

Installation

We recommend using Python 3.8+ as well as Pyenv and Virtualenv.

./install.sh

1. Data format (.mrp)

The training, validation, and test files should follow the MRP format (JSON Lines), which was introduced in the CoNLL 2019 and 2020 shared tasks. Unlike the original MRP format, nodes must not contain properties, and anchors must not overlap.

Example train.mrp:

{"id": "001", "input": "This is the sentence A. I show the sentence B. Finally, you see the sentence C.", "framework": "sample_tree_corpus", "time": "2020-08-05", "flavor": 0, "language": "en", "version": 1.0, "provenance": "temp", "source": "tmp", "nodes": [{"id": 0, "label": "Claim", "anchors": [{"from": 0, "to": 22}]}, {"id": 1, "label": "Premise", "anchors": [{"from": 24, "to": 45}]}, {"id": 2, "label": "Premise", "anchors": [{"from": 56, "to": 78}]}], "edges": [{"source": 0, "target": 1, "label": "Support"}, {"source": 0, "target": 2, "label": "Attack"}], "tops": [0]}
{"id": "002", ...}
...

See some examples for more details on the format.
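The two restrictions above can be checked before training. The following is a minimal sketch, not part of amparse: the function names `check_mrp_graph` and `check_mrp_file` are hypothetical, and it assumes that "properties" refers to the `properties` field of MRP nodes and that anchors use character offsets with an exclusive `to`.

```python
import json


def check_mrp_graph(graph):
    """Return a list of problems found in one MRP graph (a dict)."""
    problems = []
    spans = []
    for node in graph.get("nodes", []):
        # Assumption: "properties" means the MRP node field of that name.
        if "properties" in node:
            problems.append(f"node {node['id']} has properties")
        for anchor in node.get("anchors", []):
            spans.append((anchor["from"], anchor["to"], node["id"]))

    # After sorting by start offset, any overlap in the graph shows up
    # between at least one pair of neighbouring spans.
    spans.sort()
    for (_, prev_to, prev_id), (cur_from, _, cur_id) in zip(spans, spans[1:]):
        if cur_from < prev_to:
            problems.append(f"anchors of nodes {prev_id} and {cur_id} overlap")
    return problems


def check_mrp_file(path):
    """Check each line of a JSON Lines MRP file; return {graph id: problems}."""
    report = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            graph = json.loads(line)
            problems = check_mrp_graph(graph)
            if problems:
                report[graph["id"]] = problems
    return report
```

Calling `check_mrp_file("train.mrp")` returns an empty dict when no graph violates either restriction.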

2. Training

# To train only relations, set --lambda_bio and --lambda_proposition to 0.0
# and --evaluate_with_oracle_span to true.
# To train only components, set --lambda_arc and --lambda_rel to 0.0.
python -m amparse.trainer.train \
    --log ./log/train \
    --ftrain train.mrp \
    --seed 42 \
    --model_name_or_path "allenai/longformer-base-4096" \
    --build_numericalizer_on_entire_corpus true \
    --batch_size 4 \
    --eval_batch_size 16 \
    --postprocessor "sample_tree_corpus:tree,sample_trees_corpus:trees,sample_graph_corpus:graph" \
    --embed_dropout 0.1 \
    --mlp_dropout 0.1 \
    --dim_mlp 512 \
    --dim_biaffine 512 \
    --lambda_bio 1. \
    --lambda_proposition 0.1 \
    --lambda_arc 1. \
    --lambda_rel 0.1 \
    --lr 5e-5 \
    --beta1 0.9 \
    --beta2 0.998 \
    --warmup_ratio 0.1 \
    --clip 1.0 \
    --epochs 20 \
    --terminate_epochs 20 \
    --evaluate_epochs 20 \
    --evaluate_with_oracle_span false

# -> The trained model will be saved at ./log/train/model

Some key options (for more detail, use python -m amparse.trainer.train --help):

Name	Description
log	The output directory path
ftrain	Train file (*.mrp)
fvalid	[Optional] Dev file (*.mrp)
ftest	[Optional] Test file (*.mrp)
model_name_or_path	If a Hugging Face model name is specified, the pre-trained model is downloaded. Otherwise, the trained model at the given local path is used.
postprocessor	The post-processor script to use for each framework, given as comma-separated `framework:postprocessor` pairs. By default, the following post-processors are provided: `tree`: The graph is composed of a tree. `trees`: The graph is composed of a set of trees. `graph`: The graph does not form a tree. See some examples for more details on the format.
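When it is unclear which post-processor suits a corpus, the shape of its gold graphs can be inspected first. The sketch below is an illustration only, not the amparse implementation: `classify_structure` is a hypothetical helper, and it assumes that each edge points from parent (`source`) to child (`target`), as in the example train.mrp, and that every edge refers to an existing node id.

```python
from collections import defaultdict


def classify_structure(graph):
    """Return "tree", "trees", or "graph" for one MRP graph (a dict)."""
    node_ids = [node["id"] for node in graph.get("nodes", [])]
    children = defaultdict(list)
    in_degree = {node_id: 0 for node_id in node_ids}
    for edge in graph.get("edges", []):
        children[edge["source"]].append(edge["target"])
        in_degree[edge["target"]] += 1

    # A node with more than one parent rules out any tree shape.
    if any(degree > 1 for degree in in_degree.values()):
        return "graph"

    # Walk down from the roots; nodes never reached must lie on a cycle.
    roots = [node_id for node_id in node_ids if in_degree[node_id] == 0]
    seen = set()
    stack = list(roots)
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        stack.extend(children[node_id])
    if len(seen) != len(node_ids):
        return "graph"

    return "tree" if len(roots) == 1 else "trees"
```

Running this over every graph of a framework and collecting the distinct results gives a rough hint: if any graph is classified as "graph", the `tree` and `trees` post-processors are unlikely to fit that framework.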

3. Prediction

python -m amparse.predictor.predict \
      --log ./log/predict \
      --models ./log/train/model \
      --input test.mrp \
      --batch_size 16 \
      --oracle_span false

# -> The predicted result (an MRP file) will be saved at ./log/predict/prediction.mrp

Some key options (for more detail, use python -m amparse.predictor.predict --help):

Name	Description
log	The output directory path
models	The trained model path. When multiple models are specified, ensemble prediction is enabled.
input	Input file (*.mrp)
oracle_span	Whether to use oracle spans (i.e., gold node anchors) during prediction
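Because the prediction is itself an MRP file, it can be inspected with a few lines of Python. This is a minimal sketch under the assumption that the predicted graphs keep the `input`, `nodes`, and `edges` fields shown in the example train.mrp; `read_mrp` and `describe_graph` are hypothetical helper names, not part of amparse.

```python
import json


def read_mrp(path):
    """Load a JSON Lines MRP file into a list of graph dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def describe_graph(graph):
    """Resolve node anchors to text and list edges as (source, label, target)."""
    text = graph["input"]
    nodes = {}
    for node in graph.get("nodes", []):
        # Join the text of all anchors that belong to this node.
        span = " ".join(
            text[anchor["from"]:anchor["to"]] for anchor in node.get("anchors", [])
        )
        nodes[node["id"]] = (node.get("label"), span)
    edges = [
        (edge["source"], edge.get("label"), edge["target"])
        for edge in graph.get("edges", [])
    ]
    return nodes, edges
```

For instance, applying `describe_graph` to each graph returned by `read_mrp("./log/predict/prediction.mrp")` yields the predicted component labels with their text spans, plus the predicted relations between them.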

4. Evaluation

python amparse/evaluator/scorer.py \
  -gold gold.mrp \
  -system prediction.mrp

The results contain the following values:

Key	Description
g	The number of gold positive samples
s	The number of predicted positive samples
c	The number of correctly predicted positive samples
p	Precision (= c / s)
r	Recall (= c / g)
f	F1 (= (2 * p * r) / (p + r))
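The relationship between these values can be written out directly. The sketch below recomputes p, r, and f from the three counts. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption, not necessarily what scorer.py does.

```python
def precision_recall_f1(g, s, c):
    """Compute (p, r, f) from gold, system, and correct counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    p = c / s if s else 0.0
    r = c / g if g else 0.0
    f = (2 * p * r) / (p + r) if (p + r) else 0.0
    return p, r, f
```

With 4 gold samples, 4 predicted samples, and 2 correct ones, `precision_recall_f1(4, 4, 2)` gives a precision, recall, and F1 of 0.5 each.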
