summarus

November 19, 2021


Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

You can also check out the MBART-based Russian summarization model on Hugging Face: mbart_ru_sum_gazeta

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

| Argument | Required | Description |
|:---------|:---------|:------------|
| -c | true | path to file with configuration |
| -s | true | path to directory where model will be saved |
| -t | true | path to train dataset |
| -v | true | path to val dataset |
| -r | false | recover from checkpoint |
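
To make the arguments listed above concrete, here is a minimal sketch that assembles a `train.sh` invocation from Python. The helper `build_train_command` and every path in the usage line are hypothetical. The sketch also assumes that the script is started as `./train.sh` and that `-r` is a bare flag without a value; neither is confirmed here.

```python
from typing import List


def build_train_command(config_path: str, model_dir: str,
                        train_path: str, val_path: str,
                        recover: bool = False) -> List[str]:
    """Assemble the argument list for train.sh (hypothetical helper)."""
    cmd = [
        "./train.sh",      # assumption: run from the repository root
        "-c", config_path,  # path to file with configuration
        "-s", model_dir,    # directory where the model will be saved
        "-t", train_path,   # train dataset
        "-v", val_path,     # val dataset
    ]
    if recover:
        cmd.append("-r")    # assumption: -r takes no value
    return cmd


# Placeholder paths, for illustration only.
train_cmd = build_train_command("config.json", "models/run1",
                                "train.jsonl", "val.jsonl")
print(" ".join(train_cmd))
```

The resulting list can be passed to `subprocess.run(train_cmd, check=True)` once real paths are filled in.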

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

| Argument | Required | Default | Description |
|:---------|:---------|:--------|:------------|
| -t | true | | path to test dataset |
| -m | true | | path to tar.gz archive with model |
| -p | true | | name of Predictor |
| -c | false | 0 | CUDA device |
| -L | true | | Language ("ru" or "en") |
| -b | false | 32 | size of a batch with test examples to run simultaneously |
| -M | false | | path to meteor.jar for Meteor metric |
| -T | false | | tokenize gold and predicted summaries before metrics calculation |
| -D | false | | save temporary files with gold and predicted summaries |
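
The required and optional arguments above can be combined as in the following sketch. It is an illustration only: `build_predict_command` is a hypothetical helper, `test.jsonl` is a placeholder path, and treating `-D` as a bare flag is an assumption (`-T` is used as a bare flag in the prediction commands in this README). Optional values are passed only when set, so the script's own defaults (`-c 0`, `-b 32`) apply otherwise.

```python
import shlex
from typing import List, Optional


def build_predict_command(test_path: str, model_archive: str,
                          predictor: str, language: str,
                          cuda_device: Optional[int] = None,
                          batch_size: Optional[int] = None,
                          meteor_jar: Optional[str] = None,
                          tokenize: bool = False,
                          save_temp_files: bool = False) -> List[str]:
    """Assemble the argument list for predict.sh (hypothetical helper)."""
    if language not in ("ru", "en"):
        raise ValueError('language must be "ru" or "en"')
    cmd = ["./predict.sh", "-t", test_path, "-m", model_archive,
           "-p", predictor, "-L", language]
    if cuda_device is not None:
        cmd += ["-c", str(cuda_device)]  # CUDA device
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]   # batch size
    if meteor_jar is not None:
        cmd += ["-M", meteor_jar]        # path to meteor.jar
    if tokenize:
        cmd.append("-T")                 # tokenize before metrics
    if save_temp_files:
        cmd.append("-D")                 # assumption: bare flag
    return cmd


predict_cmd = build_predict_command("test.jsonl", "gazeta_pgn_7kk.tar.gz",
                                    "subwords_summary", "ru", tokenize=True)
print(shlex.join(predict_cmd))
```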

summarus.util.train_subword_model

Script for subword model training.

| Argument | Default | Description |
|:---------|:--------|:------------|
| --train-path | | path to train dataset |
| --model-path | | path to directory where generated subword model will be saved |
| --model-type | bpe | type of subword model, see sentencepiece |
| --vocab-size | 50000 | size of the resulting subword model vocabulary |
| --config-path | | path to file with configuration for DatasetReader (with parse_set) |
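
To show how the documented defaults resolve when only some options are given, here is a small `argparse` sketch that mirrors the options listed above. It is not the script's actual parser, which may declare its options differently; `build_parser` and the example paths are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Subword model training options (sketch)")
    parser.add_argument("--train-path", help="path to train dataset")
    parser.add_argument("--model-path",
                        help="directory for the generated subword model")
    parser.add_argument("--model-type", default="bpe",
                        help="type of subword model, see sentencepiece")
    parser.add_argument("--vocab-size", type=int, default=50000,
                        help="size of the resulting vocabulary")
    parser.add_argument("--config-path",
                        help="configuration file for DatasetReader")
    return parser


# Placeholder paths; only two options are given, the rest use defaults.
args = build_parser().parse_args(
    ["--train-path", "train.jsonl", "--model-path", "subword_model"])
print(args.model_type, args.vocab_size)
```

With only the two paths supplied, `args.model_type` and `args.vocab_size` fall back to the documented defaults, `bpe` and `50000`.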

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results

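Train dataset: RIA, test dataset: RIA

| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 40.0 | 23.3 | 37.5 | - |
| ria_pgn_24kk | 42.3 | 25.1 | 39.6 | - |
| ria_mbart | 42.8 | 25.5 | 39.9 | - |
| First Sentence | 24.1 | 10.6 | 16.7 | - |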

Train dataset: RIA, eval dataset: Lenta

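| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 25.6 | 12.3 | 23.0 | - |
| ria_pgn_24kk | 26.4 | 12.3 | 24.0 | - |
| ria_mbart | 30.3 | 14.5 | 27.1 | - |
| First Sentence | 25.5 | 11.2 | 19.2 | - |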
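
For readers unfamiliar with the R-1-f column and the First Sentence row, the sketch below illustrates both ideas: ROUGE-1 F-measure as the harmonic mean of unigram precision and recall, and a baseline that takes the first sentence of the text as the headline. It is a simplified illustration, not the evaluation code behind the numbers above. Whitespace tokenization, lowercasing, and splitting sentences on ". " are assumptions made for brevity, and both function names are hypothetical.

```python
from collections import Counter


def rouge_1_f(gold: str, predicted: str) -> float:
    """Unigram-overlap F-measure on whitespace tokens (simplified)."""
    gold_counts = Counter(gold.lower().split())
    pred_counts = Counter(predicted.lower().split())
    # Clipped overlap: each token counts at most as often as in both texts.
    overlap = sum((gold_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


def first_sentence(text: str) -> str:
    """Naive first-sentence baseline: cut at the first '. ' if present."""
    head, separator, _ = text.partition(". ")
    return head + "." if separator else text


article = "The council approved the new budget. Critics called it rushed."
gold_headline = "Council approved new budget"
baseline = first_sentence(article)
print(baseline)
print(round(100 * rouge_1_f(gold_headline, baseline), 1))
```

A real evaluation would use a proper sentence splitter and tokenizer, which matters for Russian text in particular.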

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| cnndm_pgn_25kk | 38.5 | 16.5 | 33.4 | 17.6 | - |

Summarization - Gazeta, Russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

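| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| gazeta_pgn_7kk | 29.4 | 12.7 | 24.6 | 21.2 | 9.0 |
| gazeta_pgn_7kk_cov | 29.8 | 12.8 | 25.4 | 22.1 | 10.1 |
| gazeta_pgn_25kk | 29.6 | 12.8 | 24.6 | 21.5 | 9.3 |
| gazeta_pgn_words_13kk | 29.4 | 12.6 | 24.4 | 20.9 | 8.9 |
| gazeta_summarunner_3kk | 31.6 | 13.7 | 27.1 | 26.0 | 11.5 |
| gazeta_mbart | 32.6 | 14.6 | 28.2 | 25.7 | 12.4 |
| gazeta_mbart_lower | 32.7 | 14.7 | 28.3 | 25.8 | 12.5 |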

Demo

python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>

Citations

Headline generation (PGN):

@article{Gusev2019headlines,
    author={Gusev, I.O.},
    title={Importance of copying mechanism for news headline generation},
    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
    year={2019},
    volume={2019-May},
    number={18},
    pages={229--236}
}

Headline generation (transformers):

@InProceedings{Bukhtiyarov2020headlines,
    author={Bukhtiyarov, Alexey and Gusev, Ilya},
    title="Advances of Transformer-Based Models for News Headline Generation",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages={54--61},
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_4}
}

Summarization:

@InProceedings{Gusev2020gazeta,
    author="Gusev, Ilya",
    title="Dataset for Automatic Summarization of Russian News",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages="122--134",
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_9}
}