TrueSkill for WMT

June 11, 2020 · View on GitHub

Source code used in 2014 WMT paper, "Efficient Elicitation of Annotations for Human Evaluation of Machine Translation"

Keisuke Sakaguchi
Matt Post
Benjamin Van Durme

Last updated: June 11th, 2020

This document describes the proposed method described in the following paper:

@InProceedings{sakaguchi-post-vandurme:2014:W14-33,
  author    = {Sakaguchi, Keisuke  and  Post, Matt  and  Van Durme, Benjamin},
  title     = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {1--11},
  url       = {http://www.aclweb.org/anthology/W14-3301}
}

Prerequisites python modules:

python 2.7

pip install -r requirements.txt

Example Procedure:

1. Preprocessing: converting an xml file (from Appraise) to a csv file.
- mkdir result if not exist.
- cd data
- python xml2csv.py ABC.xml
- The xml/csv file must consist of a single language pair.
1. Training: run python infer_TS.py (TrueSkill) in the src directory.
- cd ../src
- cat ../data/ABC.csv |python infer_TS.py ../result/ABC -n 2 -d 0 -s 2
- for more details: python infer_TS.py --help
- You can change other parameters in infer_TS.py, if needed.
- For clustering (i.e. grouped ranking), we need to execute multiple runs of infer_TS.py (100+ is recommended) for each language pair (e.g. fr-en from fr-en0 to fr-en99).
- You will get the result named OUT_ID_mu_sigma.json in the result directory
- For using Expected Win, run python infer_EW.py -s 2 ../result/ABC
1. To see the grouped ranking, run cluster.py in the eval directory.
- cd ../eval
- python cluster.py fr-en ../result/fr-en -n 100 -by-rank -pdf
- for more details: python cluster.py --help
- pdf option might cause RuntimeError, but please check if a pdf file is successfully generated.
1. (optional) To tune decision radius in (accuracy), run tune_acc.py.
- e.g. cat data/sample-fr-en-{dev|test}.csv |python src/eval_acc.py -d 0.1 -i result/fr-en0_mu_sigma.json
1. (optional) To see the next systems to be compared, run python src/scripts/next_comparisons.py *_mu_sigma.json N
- This outputs the next comparison under the current result mu and sigma (.json) for N free-for-all matches.

Questions and comments:

Please e-mail to Keisuke Sakaguchi (keisuke[at]cs.jhu.edu).