Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)
January 28, 2021
This page includes instructions for reproducing results from the paper Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019).
Download data
First, follow the instructions to download and preprocess the WMT'17 En-De dataset.
Make sure to learn a joint vocabulary by passing the --joined-dictionary option to fairseq-preprocess.
Train a model
Then we can train a mixture of experts model using the translation_moe task.
Use the --method flag to choose the MoE variant; we support hard mixtures with a learned or uniform prior (--method hMoElp and --method hMoEup, respectively) and soft mixtures (--method sMoElp and --method sMoEup).
The model is trained with online responsibility assignment and shared parameterization.
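To make the four variants concrete, here is a minimal pure-Python sketch of how the per-sentence loss could differ between hard and soft mixtures, with a learned or uniform prior. It assumes we already have, for each expert z, the log-likelihood log p(y | z, x) and, for the learned prior, log p(z | x). The function name `moe_loss` and its arguments are hypothetical and not part of fairseq. Responsibilities are recomputed for every example as it is seen (the "online" part), and the hard variants train only the expert with the highest responsibility.

```python
import math


def moe_loss(lprob_y, lprob_z=None, hard=True):
    """Per-sentence mixture loss (illustrative sketch, not the fairseq code).

    lprob_y: list of log p(y | z, x), one entry per expert z.
    lprob_z: list of log p(z | x) from a gating network (learned prior),
             or None for a uniform prior.
    hard:    True for hard mixtures (hMoE), False for soft mixtures (sMoE).
    """
    if lprob_z is None:
        # Uniform prior: log p(z) is the same constant for every expert,
        # so this sketch simply drops it.
        lprob_yz = list(lprob_y)
    else:
        lprob_yz = [ly + lz for ly, lz in zip(lprob_y, lprob_z)]

    if hard:
        # Hard assignment: the expert with the highest joint score takes
        # full responsibility for this example.
        return -max(lprob_yz)

    # Soft assignment: marginalize over experts with a stable log-sum-exp.
    m = max(lprob_yz)
    return -(m + math.log(sum(math.exp(v - m) for v in lprob_yz)))


lprob_y = [-2.0, -1.0, -3.0]  # made-up scores for 3 experts
print(moe_loss(lprob_y, hard=True))   # uniform prior, hard mixture
print(moe_loss(lprob_y, hard=False))  # uniform prior, soft mixture
```

In this sketch the hard loss only depends on the winning expert, so only that expert would receive a gradient, while the soft loss spreads the gradient over all experts in proportion to their responsibilities.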
The following command will train a hMoElp model with 3 experts:
fairseq-train --ddp-backend='legacy_ddp' \
data-bin/wmt17_en_de \
--max-update 100000 \
--task translation_moe --user-dir examples/translation_moe/translation_moe_src \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 \
--dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
--max-tokens 3584
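For intuition about the learning-rate flags above, the sketch below reproduces the usual inverse-square-root schedule: a linear warmup from --warmup-init-lr to --lr over --warmup-updates steps, followed by decay proportional to the inverse square root of the update number. The function `inverse_sqrt_lr` is a hypothetical helper written for illustration; it is not fairseq's implementation.

```python
def inverse_sqrt_lr(num_updates, lr=0.0007, warmup_init_lr=1e-07, warmup_updates=4000):
    """Learning rate at a given update (illustrative helper, not fairseq's code)."""
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr up to lr.
        lr_step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * lr_step
    # After warmup, decay proportionally to 1 / sqrt(num_updates).
    return lr * (warmup_updates / num_updates) ** 0.5


for step in (0, 2000, 4000, 16000, 100000):
    print(step, inverse_sqrt_lr(step))
```

With the values used in the command, the rate peaks at 0.0007 at update 4000 and has halved by update 16000.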
Translate
Once a model is trained, we can generate translations from different experts using the --gen-expert option.
For example, to generate from expert 0:
fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt \
--beam 1 --remove-bpe \
--task translation_moe --user-dir examples/translation_moe/translation_moe_src \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert 0
Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
Next apply BPE on the fly and run generation for each expert:
BPE_CODE=examples/translation/wmt17_en_de/code
for EXPERT in $(seq 0 2); do \
cat wmt14-en-de.extra_refs.tok \
| grep ^S | cut -f 2 \
| fairseq-interactive data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt \
--beam 1 \
--bpe subword_nmt --bpe-codes $BPE_CODE \
--buffer-size 500 --max-tokens 6000 \
--task translation_moe --user-dir examples/translation_moe/translation_moe_src \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert $EXPERT ; \
done > wmt14-en-de.extra_refs.tok.gen.3experts
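The `grep ^S | cut -f 2` step in the loop keeps only the source lines of the multi-reference file and strips their leading tag. A rough Python equivalent is sketched below, assuming tab-separated lines whose first field starts with `S`. The helper name `extract_sources` is hypothetical, and the sample lines (including the `T-0` tag) are invented for illustration.

```python
def extract_sources(lines):
    """Mimic `grep ^S | cut -f 2`: keep lines starting with "S", return field 2."""
    sources = []
    for line in lines:
        if line.startswith("S"):
            fields = line.rstrip("\n").split("\t")
            # cut -f 2 prints the whole line when it contains no tab.
            sources.append(fields[1] if len(fields) > 1 else fields[0])
    return sources


sample = [
    "S-0\tThe cat sits .\n",
    "T-0\tDie Katze sitzt .\n",
    "S-1\tHello world\n",
]
print(extract_sources(sample))
```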
Finally use score.py to compute pairwise BLEU and average oracle BLEU:
python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
# pairwise BLEU: 48.26
# #refs covered: 2.11
# multi-reference BLEU (leave-one-out): 59.46
This matches row 3 from Table 7 in the paper.
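Pairwise BLEU measures how similar the experts' outputs are to one another: each hypothesis is scored against the other hypotheses for the same source, so a lower value indicates more diverse translations. The sketch below shows only the pairing step, under the assumption that hypotheses are already grouped per source sentence. `pairwise_pairs` is a hypothetical helper, and the resulting lists would be passed to a corpus-level BLEU scorer of your choice.

```python
def pairwise_pairs(hypos_per_source):
    """Build hypothesis and pseudo-reference lists over all ordered pairs.

    hypos_per_source: list of lists; each inner list holds the translations
    produced by the different experts for one source sentence.
    """
    hyps, refs = [], []
    for hypos in hypos_per_source:
        for i, hyp in enumerate(hypos):
            for j, ref in enumerate(hypos):
                if i != j:  # never compare a hypothesis with itself
                    hyps.append(hyp)
                    refs.append(ref)
    return hyps, refs


# Invented outputs of 3 experts for 2 source sentences.
grouped = [["a b c", "a b d", "x y z"], ["u v", "u w", "u v"]]
hyps, refs = pairwise_pairs(grouped)
print(len(hyps))
```

With 3 experts, each source sentence contributes 6 ordered pairs.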
Citation
@article{shen2019mixture,
title = {Mixture Models for Diverse Machine Translation: Tricks of the Trade},
author = {Tianxiao Shen and Myle Ott and Michael Auli and Marc'Aurelio Ranzato},
journal = {International Conference on Machine Learning},
year = 2019,
}