LCNAT

October 17, 2021

Code for "Understanding and Improving Lexical Choice in Non-Autoregressive Translation" (ICLR 2021).

We add a data-dependent prior (derived from word alignment) to NAT training: an extra KL term whose weight λ decays over the first half of training.
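To make the objective concrete, here is a minimal sketch of how a prior-based KL term could be mixed into the training loss with a decaying weight. The linear schedule, the KL direction (prior against model), and the names `lambda_at`, `kl_divergence` and `total_loss` are assumptions for illustration only; the actual logic lives in the bundled fairseq code and may differ.

```python
import math

def lambda_at(step, total_steps, lambda0=1.0):
    """Weight of the prior term; reaches 0 at the midpoint of training.

    Linear decay is an assumption: the README only states that lambda
    decays over the first half of training.
    """
    half = total_steps / 2
    if step >= half:
        return 0.0
    return lambda0 * (1.0 - step / half)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the target vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def total_loss(nll, prior, model_probs, step, total_steps):
    """Translation loss plus the lambda-weighted KL term against the prior."""
    return nll + lambda_at(step, total_steps) * kl_divergence(prior, model_probs)
```

With this schedule the prior dominates early updates and the model trains on the plain translation loss for the whole second half.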

Paper: OpenReview

Setup

git clone https://github.com/alphadl/LCNAT.git
cd LCNAT
pip install -e fairseq/

Dependencies: PyTorch, fairseq (included), h5py, numpy.

Data

Put the KD (knowledge-distilled) data in a directory such as data/ende_data/: train_kd.{src,tgt}, valid.{src,tgt}, test.{src,tgt}, all BPE-segmented, where src and tgt are the language codes (e.g. en, de).
Binarize: SRC=en TGT=de DATA_DIR=./data/ende_data bash preprocess.sh
Build the prior: a [V_src, V_tgt] matrix stored in an .h5 file under the key "weights" (e.g. with Bottleneck_LC/scripts/build_prior.py). Save it as data/ende_data/prior.h5.
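As a rough illustration of the expected file layout, the sketch below builds a row-normalized [V_src, V_tgt] matrix from word-alignment links and writes it under the "weights" key. It is not the referenced build_prior.py: the count-and-normalize estimate, the float32 dtype, and the function names `build_prior` and `save_prior` are assumptions, and the row/column order must match whatever indexing the training code expects (presumably the fairseq dictionaries).

```python
def build_prior(aligned_pairs, src_vocab, tgt_vocab):
    """Return a [V_src, V_tgt] row-normalized matrix as a list of lists.

    aligned_pairs: iterable of (src_word, tgt_word) links from a word aligner.
    src_vocab / tgt_vocab: dicts mapping word -> row / column index.
    """
    counts = [[0.0] * len(tgt_vocab) for _ in src_vocab]
    for s, t in aligned_pairs:
        if s in src_vocab and t in tgt_vocab:
            counts[src_vocab[s]][tgt_vocab[t]] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:  # rows for unaligned source words stay all-zero
            for j in range(len(row)):
                row[j] /= total
    return counts

def save_prior(matrix, path="data/ende_data/prior.h5"):
    """Write the matrix to an HDF5 file under the "weights" key."""
    # Imported lazily so build_prior can be used without these packages.
    import h5py
    import numpy as np
    with h5py.File(path, "w") as f:
        f.create_dataset("weights", data=np.asarray(matrix, dtype="float32"))
```

Usage would look like `save_prior(build_prior(links, src_vocab, tgt_vocab))`, where `links` comes from aligning the training source and target sides.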

Training

SRC=en TGT=de databin=./data/ende_data/databin checkpoint=./checkpoint/ende/mask lcnat_weight_path=./data/ende_data/prior.h5 bash train_mask.sh

Levenshtein Transformer: use the same environment variables with train_lev.sh.

Eval

SRC=en TGT=de DATA=./data/ende_data/databin CHECKPOINT=./checkpoint/ende/mask bash eval_mask.sh
REF=./data/ende_data/test.de bash test_mask.sh

Lexical choice accuracy

Word-level translation accuracy, stratified by source-word frequency (high/medium/low, H/M/L), as in the paper and RLFW-NAT Table 8. For each source token that has an aligned reference word, check whether the hypothesis contains that reference word at the aligned position. Tokens are bucketed by frequency estimated on the source side of the training data.

Requirements: Tokenized src, ref, hyp (one sentence per line); alignments in fast_align format (i-j per link, one line per sentence).
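For reference, the alignment format described above can be read with a few lines of Python. This is a small sketch; the helper names `parse_alignment_line` and `read_alignments` are hypothetical and not part of the repository.

```python
def parse_alignment_line(line):
    """Parse one line such as "0-0 1-2 2-1" into a list of (i, j) index pairs."""
    links = []
    for token in line.split():
        i, j = token.split("-")
        links.append((int(i), int(j)))
    return links

def read_alignments(path):
    """Read an alignment file: one list of links per sentence, in file order."""
    with open(path, encoding="utf-8") as f:
        return [parse_alignment_line(line) for line in f]
```

An empty line yields an empty list, so sentence indices stay in step with the src, ref and hyp files.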

Align source–reference and reference–hypothesis (e.g. fast_align):
```
# input: one tokenized "source ||| reference" pair per line
fast_align -i test.en-de.raw -d -v -o > test.en-de.align
# output: space-separated "src_idx-ref_idx" links, one line per sentence
```
Do the same for ref–hyp (ref as first file, hyp as second) to get test.de-hyp.align.
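fast_align reads a single file with one `left ||| right` pair per line, so the two tokenized files have to be joined first. A minimal sketch of that step follows; the function name `write_fast_align_input` is hypothetical, and raising on empty lines is a design choice made here so that line numbers in the alignment output keep matching the test files.

```python
def write_fast_align_input(first_path, second_path, out_path):
    """Join two tokenized files line by line as "first ||| second".

    Returns the number of pairs written. Raises ValueError on an empty
    side instead of skipping it, so sentence indices are never shifted.
    """
    written = 0
    with open(first_path, encoding="utf-8") as f1, \
         open(second_path, encoding="utf-8") as f2, \
         open(out_path, "w", encoding="utf-8") as out:
        for n, (left, right) in enumerate(zip(f1, f2), start=1):
            left, right = left.strip(), right.strip()
            if not left or not right:
                raise ValueError(f"empty sentence on line {n}")
            out.write(f"{left} ||| {right}\n")
            written += 1
    return written
```

For the source/reference pair this would be called as `write_fast_align_input("test.en", "test.de", "test.en-de.raw")`; for reference/hypothesis, pass the reference file first.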

Source-side frequency for H/M/L buckets: use training corpus or fairseq dict:

python run/lexical_choice_accuracy.py \
  --src test.en --ref test.de --hyp test.hyp.de \
  --align-src-ref test.en-de.align --align-ref-hyp test.de-hyp.align \
  --dict-src data/ende_data/databin/dict.en.txt

Or --train-src data/ende_data/train_kd.en to count tokens in the training source file.
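The two frequency sources can be sketched as follows. This is not the code in run/lexical_choice_accuracy.py: the function names are hypothetical, the dict reader assumes fairseq's plain `<token> <count>` line format, and the H/M/L thresholds are left to the caller because this README does not state them.

```python
from collections import Counter

def freq_from_fairseq_dict(path):
    """Read "<token> <count>" lines; lines without an integer count are skipped."""
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").rsplit(" ", 1)
            if len(parts) == 2 and parts[1].isdigit():
                freq[parts[0]] = int(parts[1])
    return freq

def freq_from_corpus(path):
    """Count whitespace-separated tokens in a tokenized training source file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

def bucket_of(count, high_min, low_max):
    """Return "H", "M" or "L" for a token count.

    high_min and low_max are caller-chosen thresholds (assumption: they are
    not taken from the paper or the repository).
    """
    if count >= high_min:
        return "H"
    if count <= low_max:
        return "L"
    return "M"
```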

Output: accuracy overall and per bucket (H / M / L).
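The metric itself can be sketched in a few lines, which may help when checking the script's numbers by hand. This is one reading of the description above rather than the script's actual code: it counts every source/reference link separately, treats a link as correct if any hypothesis token aligned to that reference position equals the reference word, and uses the hypothetical name `lexical_accuracy`.

```python
from collections import defaultdict

def lexical_accuracy(src, ref, hyp, src_ref, ref_hyp, bucket_fn):
    """Accuracy per bucket and overall (key "all").

    src, ref, hyp: lists of token lists, one per sentence.
    src_ref, ref_hyp: lists of (i, j) link lists, one per sentence.
    bucket_fn: maps a source token to a label such as "H", "M" or "L".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s, r, h, sr, rh in zip(src, ref, hyp, src_ref, ref_hyp):
        # Map each reference position to the hypothesis positions aligned to it.
        ref_to_hyp = defaultdict(list)
        for j, k in rh:
            ref_to_hyp[j].append(k)
        for i, j in sr:
            if i >= len(s) or j >= len(r):
                continue  # skip links that point outside the sentence
            hit = any(k < len(h) and h[k] == r[j] for k in ref_to_hyp[j])
            for key in (bucket_fn(s[i]), "all"):
                total[key] += 1
                correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}
```

Source tokens without an aligned reference word never enter the loop, so they are excluded from both the numerator and the denominator.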

Citation

@inproceedings{ding2021understanding,
  title={Understanding and Improving Lexical Choice in Non-Autoregressive Translation},
  author={Ding, Liang and Wang, Longyue and Liu, Xuebo and Wong, Derek F. and Tao, Dacheng and Tu, Zhaopeng},
  booktitle={ICLR},
  year={2021}
}
