ReplaceToken1toNModifier

December 5, 2024 ยท View on GitHub

Description

Modifier that replaces unigrams with ngrams.

This modifier replaces the occurrences of a unigram type on the tokenized version of the target layer (here "norm_tok" or "orig_tok") with an ngram, propagates the changes to the raw version ("norm" or "orig") and computes a new alignment with the source layer ("orig_tok" or "norm_tok", respectively).

Default target layer is norm.

For related uses:

  • Replacements on the target layer where a single token is replaced by a unigram can be performed with the faster ReplaceToken1to1Modifier.
  • Replacements that depend on the source layer and/or depend on n-grams can be performed with the ReplaceNtoMCrossLayerModifier (much slower).

Required

Usage

$ python3 src/transnormer_data/cli/modify_dataset.py \
    -m replacetoken1tonmodifier \
    --modifier-kwargs "mapping_files=<file-path>+ layer={orig,norm}" \
    --data <dir-path-in> \
    -o <dir-path-out> &