LanguageToolModifier

December 5, 2024 ยท View on GitHub

Description

This modifier applies the LanguageTool to the raw version of the target layer and propagates the changes to the tokenized version.

Target layer is fixed to "norm", source layer is fixed to "orig".

Note: Depending on the data size and number of rules this modification may take a while. For ~150M tokens in 6.1M sentences and ~1000 rules, the script ran for 60 hours.

Required

Rules

Select LanguageTool rules to apply and store them by their id in a one-line-per-rule text format. A rule file may look like this:

$ cat rules.txt
ZUVIEL
STATT
GENANT_SPELLING_RULE

Usage

python3 src/transnormer_data/cli/modify_dataset.py \
    -m languagetoolmodifier \
    --modifier-kwargs "rule_file=<file-path-in>" \
    --data <dir-path-in> \
    -o <dir-path-out>