README.txt

December 8, 2019

This is the code for the highest-performing lexical simplification system featured in the paper "SIMPLEX-PB: A Lexical Simplification Database and Benchmark for Portuguese".

It contains three files:

  • lib.py: A library with the classes and functions necessary to perform simplification.
  • simplifier.py: A simple script that tests the simplifier.
  • dataset_propor2018.txt: The test set used for the experiments featured in the paper.

To test the simplifier, run the following command, which passes dataset_propor2018.txt as the <test_corpus> argument:

python simplifier.py dataset_propor2018.txt <embeddings_model> <language_model> <how_many_to_generate>

The parameters are:

  • <test_corpus>: A lexical simplification corpus in the VICTOR format, which is the format of the "dataset_propor2018.txt" file. Each line contains a sentence, a target complex word, the word's index in the sentence, and a series of gold substitutions, each accompanied by its simplicity rank. To learn more about the VICTOR format, see the LEXenstein manual (https://github.com/ghpaetzold/LEXenstein).
  • <embeddings_model>: A word embeddings model in the binary format produced by word2vec (https://radimrehurek.com/gensim/models/word2vec.html).
  • <language_model>: A language model in the binary format produced by the KenLM toolkit (https://kheafield.com/code/kenlm).
  • <how_many_to_generate>: The number of candidate substitutions that the model will generate for each target complex word.
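
To make the <test_corpus> description concrete, here is a minimal sketch of reading one corpus line. It is not taken from lib.py. It assumes that fields are separated by tabs, that each gold substitution is written as rank:candidate, and that the target index counts tokens from zero; check these assumptions against the LEXenstein manual before relying on them. The names VictorInstance and parse_victor_line, and the example sentence, are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class VictorInstance(NamedTuple):
    """One corpus line (hypothetical container, not part of lib.py)."""
    sentence: str
    target: str
    index: int
    gold: List[Tuple[int, str]]  # (simplicity rank, candidate) pairs


def parse_victor_line(line: str) -> VictorInstance:
    # Assumption: tab-separated fields in the order
    # sentence, target word, target index, then rank:candidate items.
    fields = line.rstrip("\n").split("\t")
    sentence, target, index = fields[0], fields[1], int(fields[2])

    gold = []
    for item in fields[3:]:
        if not item:
            continue  # skip empty trailing fields
        # Split on the first colon only, so the candidate text stays whole.
        rank, candidate = item.split(":", 1)
        gold.append((int(rank), candidate))

    return VictorInstance(sentence, target, index, gold)


# Made-up example line, not taken from dataset_propor2018.txt.
example = "o menino estava exausto depois da corrida\texausto\t3\t1:cansado\t2:fatigado"
instance = parse_victor_line(example)
print(instance.target, instance.index, instance.gold)
```

A loader for the whole file could call parse_victor_line on each non-empty line of <test_corpus>.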

This repository is the result of the following paper:

Hartmann, Nathan S., Gustavo H. Paetzold, and Sandra M. Aluísio. "SIMPLEX-PB: A Lexical Simplification Database and Benchmark for Portuguese." International Conference on Computational Processing of the Portuguese Language. Springer, Cham, 2018.