ReplaceNtoMCrossLayerModifier

December 12, 2024

Description

n-gram to m-gram cross-layer replacement modifier.

This modifier replaces a token sequence on the tokenized target layer ("norm_tok" or "orig_tok") whenever that sequence corresponds to a given token sequence on the source layer. The changes are propagated to the raw version of the target layer ("norm" or "orig"), and dependent data such as the alignments are recomputed.

Example: All occurrences of the n-gram (X,Y) on the "orig" layer will get normalized as (X',Y',Z). That is, we exchange the m-gram on the "norm" layer that corresponds to (X,Y) with (X',Y',Z).
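The idea behind the example can be sketched in a few lines of Python. This is a minimal illustration, not the modifier's actual implementation: the function names `find_ngram` and `replace_cross_layer` are hypothetical, and the alignment is assumed to be a monotone list of (source index, target index) pairs. The sketch only rewrites the tokenized target layer; it does not propagate changes to the raw text or recompute alignments, and it does not handle overlapping matches.

```python
def find_ngram(tokens, ngram):
    """Return the start index of every occurrence of ngram in tokens."""
    n = len(ngram)
    return [
        i for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == tuple(ngram)
    ]


def replace_cross_layer(source_tok, target_tok, alignment, ngram, mgram):
    """Replace the target tokens aligned with each source n-gram by mgram.

    alignment is assumed to be a list of (source_index, target_index) pairs.
    """
    result = list(target_tok)
    # Work from right to left so that earlier target indices stay valid
    # (this relies on the alignment being monotone).
    for start in reversed(find_ngram(source_tok, ngram)):
        source_span = set(range(start, start + len(ngram)))
        target_idxs = [t for s, t in alignment if s in source_span]
        if not target_idxs:
            continue  # nothing aligned on the target layer
        lo, hi = min(target_idxs), max(target_idxs)
        result[lo:hi + 1] = mgram
    return result


orig_tok = ["a", "X", "Y", "b", "X", "Y"]
norm_tok = ["a", "X", "Y", "b", "X", "Y"]
identity = [(i, i) for i in range(len(orig_tok))]  # 1:1 alignment for the demo

new_norm_tok = replace_cross_layer(
    orig_tok, norm_tok, identity, ("X", "Y"), ("X'", "Y'", "Z")
)
print(new_norm_tok)
```

Here both occurrences of (X,Y) on the source layer cause the aligned target tokens to be exchanged for (X',Y',Z).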

Required

A TSV or CSV file with n:m mappings of source sequences (column 1) to target sequences (column 2). The elements of a sequence are separated by a single space character.

$ cat mappings.tsv
Sag ' was	Sag was
irgend ' was	irgendetwas
' mal	mal
Neu-York	New York
Nieder-Jagd	Niederjagd
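To make the file format concrete, the following sketch shows one way such a file could be read into a lookup table keyed by source token tuples. It is an assumption about a reasonable representation, not the code the modifier uses; `load_mappings` is a hypothetical helper, and the sample data is inlined so the snippet runs without a file on disk.

```python
import csv
import io


def load_mappings(lines, delimiter="\t"):
    """Read n:m mappings into a dict: source token tuple -> target token tuple."""
    mapping = {}
    # QUOTE_NONE keeps quote characters in the data as literal text.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0], row[1]
        # Sequence elements are separated by a single space character.
        mapping[tuple(source.split(" "))] = tuple(target.split(" "))
    return mapping


sample = (
    "Sag ' was\tSag was\n"
    "irgend ' was\tirgendetwas\n"
    "' mal\tmal\n"
    "Neu-York\tNew York\n"
)
mappings = load_mappings(io.StringIO(sample))
print(mappings[("Neu-York",)])
```

With this representation, a 1:2 mapping such as "Neu-York" to "New York" and a 3:1 mapping such as "irgend ' was" to "irgendetwas" are handled uniformly.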

Usage

nohup nice python3 src/transnormer_data/cli/modify_dataset.py \
    -m replacentomcrosslayermodifier \
    --modifier-kwargs "mapping_files=<file-path>+ delimiter=<delimiter> source_layer={orig,norm} target_layer={norm,orig} [transliterate_source={true,t,yes,1}]" \
    --data <dir-path-in> \
    -o <dir-path-out> &

Note:

If the delimiter is the TAB character, pass delimiter={TAB}.
To transliterate the source tokens before the dictionary lookup, pass transliterate_source with any of the values {true,t,yes,1}.