Fully Character-Level Neural Machine Translation
December 16, 2016
Theano implementation of the models described in the paper Fully Character-Level Neural Machine Translation without Explicit Segmentation.
We present code for training and decoding four different models:
- bilingual bpe2char (from Chung et al., 2016).
- bilingual char2char
- multilingual bpe2char
- multilingual char2char
Dependencies
Python
- Theano
- Numpy
- NLTK
GPU
- CUDA (we recommend the latest version; version 8.0 was used in all our experiments).
Related code
- For preprocessing and evaluation, we used scripts from MOSES.
- This code is based on Subword-NMT and dl4mt-cdec.
Downloading Datasets & Pre-trained Models
The original WMT'15 corpora can be downloaded from here. For the preprocessed corpora used in our experiments, see below.
- WMT'15 preprocessed corpora
To obtain the pre-trained top-performing models, see below.
- Pre-trained models (6.0GB): Tarball updated on Nov 21st 2016. The CS-EN bi-char2char model in the previous tarball was not the best-performing model.
Training Details
Using GPUs
Do the following before executing train*.py.
$ export THEANO_FLAGS=device=gpu,floatX=float32
If your GPU has enough free memory, enabling cnmem may speed up training:
$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
On a pre-2016 Titan X GPU with 12GB RAM, our bpe2char models were trained with cnmem. Our char2char models (both bilingual and multilingual) were trained without cnmem because the GPU did not have enough memory.
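If you launch training from a Python wrapper rather than a shell, the same flags can be assembled programmatically. The following is a minimal sketch, not part of this repository: the helper names `build_theano_flags` and `training_env` are hypothetical, and the flag values are copied from the exports above.

```python
import os


def build_theano_flags(use_cnmem=False, cnmem_fraction=0.95):
    """Return a THEANO_FLAGS string equivalent to the exports shown above."""
    flags = ["device=gpu", "floatX=float32"]
    if use_cnmem:
        # Same extra options as the second export line above.
        flags.append("lib.cnmem={}".format(cnmem_fraction))
        flags.append("allow_gc=False")
    return ",".join(flags)


def training_env(use_cnmem=False):
    """Copy the current environment and set THEANO_FLAGS on the copy."""
    env = dict(os.environ)
    env["THEANO_FLAGS"] = build_theano_flags(use_cnmem)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting a train*.py script. Theano reads THEANO_FLAGS when it is imported, so the variable has to be set before the training process starts.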
Training models
Before executing the following, modify train*.py so that it points to the directory containing the WMT'15 corpora.
Bilingual bpe2char
$ python bpe2char/train_bi_bpe2char.py -translate <LANGUAGE_PAIR>
Bilingual char2char
$ python char2char/train_bi_char2char.py -translate <LANGUAGE_PAIR>
Multilingual bpe2char
$ python bpe2char/train_multi_bpe2char.py
Multilingual char2char
$ python char2char/train_multi_char2char.py
Checkpoint
To resume training a model from a checkpoint, append -re_load and -re_load_old_setting to the training command above. Make sure the checkpoint resides in the correct directory (.../dl4mt-c2c/models).
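The training commands and the resume flags can be combined in a small helper. This is only a sketch: `build_train_command` is a hypothetical name, and it assumes that -re_load and -re_load_old_setting are switches that take no value, as the sentence above suggests. The language pair is left as the placeholder used in this README.

```python
def build_train_command(script, language_pair=None, resume=False):
    """Assemble the argument list for one of the train*.py scripts.

    script        -- path to the training script, e.g. one of those listed above
    language_pair -- value for -translate (bilingual models only)
    resume        -- if True, append the two checkpoint-reload flags
    """
    cmd = ["python", script]
    if language_pair is not None:
        cmd += ["-translate", language_pair]
    if resume:
        # Assumed to be value-less switches.
        cmd += ["-re_load", "-re_load_old_setting"]
    return cmd
```

For example, `build_train_command("char2char/train_multi_char2char.py", resume=True)` yields the multilingual char2char command with both reload flags appended.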
Using Custom Datasets
To train models on your own dataset (rather than the WMT'15 corpora), you first need to build a vocabulary using build_dictionary_char.py for char2char models or build_dictionary_word.py for bpe2char models. For bpe2char models, you additionally need to learn BPE segmentation rules on the source corpus using the Subword-NMT repository (see below).
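To give a rough idea of what building a character vocabulary involves, here is a minimal sketch. It is not the code in build_dictionary_char.py: the function name, the special tokens `<eos>` and `<unk>`, and the frequency-based ordering are all assumptions made for illustration, and the repository's script may use a different format.

```python
import collections


def build_char_vocab(lines, specials=("<eos>", "<unk>")):
    """Map every character seen in `lines` to an integer index.

    Special tokens (hypothetical names) get the lowest indices; the remaining
    characters are ordered by descending frequency, with ties broken by
    code point so the result is deterministic.
    """
    counts = collections.Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))  # spaces count as characters too
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = {token: index for index, token in enumerate(specials)}
    for char, _ in ordered:
        vocab[char] = len(vocab)
    return vocab
```

With an open file, `build_char_vocab(open(path, encoding="utf-8"))` would scan the corpus line by line.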
Decoding
Decoding WMT'15 validation / test files
Before executing the following, modify translate*.py so that it points to the directory containing the WMT'15 corpora.
$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
$ python translate/translate_bpe2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for bpe2char models
$ python translate/translate_char2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for char2char models
When choosing a pre-trained model to pass to -model, pick a file with .grads in its name, e.g. .grads.123000.npz. These are the best-performing models, and you should decode from them.
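A quick way to list candidate files in a model directory is to filter on that naming pattern. The helper below is a hypothetical sketch that relies only on the `.grads.` infix and the `.npz` extension mentioned above.

```python
def grads_checkpoints(filenames):
    """Return the names that contain '.grads.' and end in '.npz', sorted."""
    return sorted(
        name for name in filenames
        if ".grads." in name and name.endswith(".npz")
    )
```

Calling `grads_checkpoints(os.listdir(model_dir))` after `import os` would apply it to a real directory; `model_dir` stands for wherever you unpacked the pre-trained models.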
Decoding an arbitrary file
Remove -which <VALID/TEST_SET> and append -source <PATH_TO_SOURCE>.
If you choose to decode your own source file, make sure it is:
- properly tokenized (using preprocess/preprocess.sh).
- bpe-tokenized for bpe2char models.
- converted from Cyrillic to Latin characters for multilingual models.
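The last item in the checklist can be verified before decoding. This sketch, with the hypothetical helper name `find_cyrillic`, reports any characters from the basic Unicode Cyrillic block (U+0400 to U+04FF) that remain in a text; it does not cover the Cyrillic supplement blocks.

```python
def find_cyrillic(text):
    """Return the distinct characters of `text` in U+0400..U+04FF, sorted."""
    return sorted({ch for ch in text if "\u0400" <= ch <= "\u04ff"})
```

An empty result for the whole source file suggests the conversion step has already been applied.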
Decoding multilingual models
Append -many, and pass the path to a multilingual model to -model.
Evaluation
We use the multi-bleu.perl script from MOSES to compute the BLEU score. The reference translations can be found in .../wmt15.
$ perl preprocess/multi-bleu.perl reference.txt < model_output.txt
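The same call can be made from Python, for instance to score several output files in a loop. This is a sketch for Python 3 under the assumption that `perl` is on the PATH and that it is run from the repository root; the function names are hypothetical. The script's output is returned as-is and is not parsed.

```python
import subprocess


def multi_bleu_argv(reference_path, script="preprocess/multi-bleu.perl"):
    """Argument list equivalent to the shell command above (without stdin)."""
    return ["perl", script, reference_path]


def run_multi_bleu(reference_path, output_path, script="preprocess/multi-bleu.perl"):
    """Feed the model output to multi-bleu.perl on stdin and return its stdout."""
    with open(output_path, "rb") as hypotheses:
        result = subprocess.run(
            multi_bleu_argv(reference_path, script),
            stdin=hypotheses,
            stdout=subprocess.PIPE,
            check=True,  # raise if the script exits with an error
        )
    return result.stdout.decode("utf-8").strip()
```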
Extra
Extracting & applying BPE rules
Clone the Subword-NMT repository.
$ git clone https://github.com/rsennrich/subword-nmt
Use the following commands (see the Subword-NMT repository for more information):
$ ./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
$ ./apply_bpe.py -c {codes_file} < {test_file}
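For intuition about what the learning step produces, the following toy sketch learns BPE merge operations by repeatedly merging the most frequent adjacent symbol pair. It is not the Subword-NMT implementation: the function names and the `</w>` end-of-word marker are choices made for this illustration, and the real tool should be used for actual preprocessing.

```python
import collections
import re

END = "</w>"  # end-of-word marker used in this sketch


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge(pair, vocab):
    """Replace every occurrence of `pair` (as whole symbols) by its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(words, num_operations):
    """Return up to `num_operations` merges learned from a list of words."""
    # Each word starts as space-separated characters plus the end marker.
    vocab = collections.Counter(" ".join(w) + " " + END for w in words)
    merges = []
    for _ in range(num_operations):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        vocab = merge(best, vocab)
    return merges
```

Here `num_operations` plays the same role as the `-s` argument above: it caps the number of merges that are learned.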
Converting Cyrillic to Latin
$ python preprocess/iso.py russian_source.txt
This writes the converted text to russian_source.txt.iso9.
Citation
@article{Lee:16,
author = {Jason Lee and Kyunghyun Cho and Thomas Hofmann},
title = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
year = {2016},
journal = {arXiv preprint arXiv:1610.03017},
}