-> The trained model will be saved at ./log/train/model

July 13, 2022 ยท View on GitHub

General usage

Installation

We recommend using Python 3.8+ as well as Pyenv and Virtualenv.

./install.sh

1. Data format (.mrp)

The data format of training, validation and test files should follow the MRP format (jsonline) which was introduced in the CoNLL 2019 and 2020 shared tasks. Different from the original MRP format, one should not contain properties and should not use overlap anchors.

Example train.mrp:

{"id": "001", "input": "This is the sentence A. I show the sentence B. Finally, you see the sentence C.", "framework": "sample_tree_corpus", "time": "2020-08-05", "flavor": 0, "language": "en", "version": 1.0, "provenance": "temp", "source": "tmp", "nodes": [{"id": 0, "label": "Claim", "anchors": [{"from": 0, "to": 22}]}, {"id": 1, "label": "Premise", "anchors": [{"from": 24, "to": 45}]}, {"id": 2, "label": "Premise", "anchors": [{"from": 56, "to": 78}]}], "edges": [{"source": 0, "target": 1, "label": "Support"}, {"source": 0, "target": 2, "label": "Attack"}], "tops": [0]}
{"id": "002", ...}
...

See some examples for more format detail.

2. Training

python -m amparse.trainer.train \
    --log ./log/train \
    --ftrain train.mrp \
    --seed 42 \
    --model_name_or_path "allenai/longformer-base-4096" \
    --build_numericalizer_on_entire_corpus true \
    --batch_size 4 \
    --eval_batch_size 16 \
    --postprocessor "sample_tree_corpus:tree,sample_trees_corpus:trees,sample_graph_corpus:graph" \
    --embed_dropout 0.1 \
    --mlp_dropout 0.1 \
    --dim_mlp 512 \
    --dim_biaffine 512 \
    --lambda_bio 1. \  # if you want to train only relations, set to 0.0
    --lambda_proposition 0.1 \  # if you want to train only relations, set to 0.0
    --lambda_arc 1. \  # if you want to train only components, set to 0.0
    --lambda_rel 0.1 \  # if you want to train only components, set to 0.0
    --lr 5e-5 \
    --beta1 0.9 \
    --beta2 0.998 \
    --warmup_ratio 0.1 \
    --clip 1.0 \
    --epochs 20 \
    --terminate_epochs 20 \
    --evaluate_epochs 20 \
    --evaluate_with_oracle_span false  # if you want to train only relations, set to true

# -> The trained model will be saved at ./log/train/model

Some key options (for more detail, use python -m amparse.trainer.train --help):

NameDescription
logThe output directory path
ftrainTrain file (*.mrp)
fvalid[Optional] Dev file (*.mrp)
ftest[Optional] Test file (*.mrp)
model_name_or_path(i) Downloads the pre-trained model when Huggingface model name is specified. Otherwise, (ii) uses a trained model on the local path.
postprocessorSpecify the post-processor script for each framework. In default, we provide following post-processors:
tree: The graph is composed of a tree.
trees: The graph is composed of a set of trees.
graph: The graph does not form a tree.
See some examples for more format detail.

3. Prediction

python -m amparse.predictor.predict \
      --log ./log/predict \
      --models ./log/train/model \
      --input test.mrp \
      --batch_size 16 \
      --oracle_span false

# -> The predicted result (an MRP file) will be saved at ./log/predict/prediction.mrp

Some key options (for more detail, use python -m amparse.predictor.predict --help):

NameDescription
logThe output directory path
modelsThe trained model path. When multiple models are specified, the ensemble prediction will be enabled.
inputInput file (*.mrp)
oracle_spanWhether to use oracle span (i.e., gold node anchors) when prediction

4. Evaluation

python amparse/evaluator/scorer.py \
  -gold gold.mrp \
  -system prediction.mrp

The results contain following values:

KeyDescription
gThe number of gold positive samples
sThe number of predicted positive samples
cThe number of correctly predicted positive samples
pPrecision (= c / s)
rRecall (= c / g)
fF1 (= (2 * p * r) / (p + r))