alignment-scripts

April 25, 2019

Scripts to preprocess training and test data for alignment experiments and to run and evaluate FastAlign and Mgiza.

Dependencies

Usage Instructions

  • Install all necessary dependencies
  • Export install locations for dependencies: export {MOSES_DIR,FASTALIGN_DIR,MGIZA_DIR}=/foo/bar (brace shorthand; point each variable at its own install directory)
  • Make sure you set a reasonable default locale, e.g.: export LC_ALL=en_US.UTF-8
  • Create folder for your test data: mkdir -p test
  • Download the test data for German-English and move it into the folder test
  • Run preprocessing: ./preprocess/run.sh
  • Run FastAlign: ./scripts/run_fast_align.sh
  • Run Mgiza: ./scripts/run_giza.sh (this might take multiple days)
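
The steps above can be strung together roughly as follows. This is only a sketch: `check_dirs` and `run_pipeline` are hypothetical helpers that are not part of the repository, and the sketch assumes the test data has already been downloaded into the folder test and that it is run from the repository root.

```shell
# Hypothetical helper: succeed only if every named variable is set
# and points to an existing directory.
check_dirs() {
  for name in "$@"; do
    eval "dir=\${$name:-}"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
      echo "error: $name is unset or not a directory" >&2
      return 1
    fi
  done
}

# Hypothetical wrapper around the documented steps.
run_pipeline() {
  check_dirs MOSES_DIR FASTALIGN_DIR MGIZA_DIR || return 1
  export LC_ALL=en_US.UTF-8
  mkdir -p test                     # test data must already be in ./test
  ./preprocess/run.sh &&
    ./scripts/run_fast_align.sh &&
    ./scripts/run_giza.sh           # this step might take multiple days
}
```

Checking the three variables up front avoids discovering a missing install location only after preprocessing has already run.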

Results

All results are given in percent, in the format: AlignmentErrorRate (Precision/Recall)
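
For readers unfamiliar with these metrics, the sketch below computes them for a single sentence pair using the standard definitions: with hypothesis links A, sure gold links S and possible gold links P (S counted as part of P), precision is |A∩P|/|A|, recall is |A∩S|/|S|, and the alignment error rate is 1 − (|A∩S| + |A∩P|)/(|A| + |S|). The `aer` function and its argument format (space-separated `i-j` links) are assumptions for illustration; the repository's own evaluation script is not shown here and may differ, for example by aggregating counts over the whole test set.

```shell
# aer HYP SURE POSSIBLE  (hypothetical helper, not part of the repository)
# Each argument is a space-separated list of "i-j" links for one sentence pair.
# Prints: AlignmentErrorRate (Precision/Recall), all in percent.
aer() {
  printf '%s\n%s\n%s\n' "$1" "$2" "$3" | LC_ALL=C awk '
    NR == 1 { for (i = 1; i <= NF; i++) hyp[$i] = 1 }
    NR == 2 { for (i = 1; i <= NF; i++) { sure[$i] = 1; poss[$i] = 1 } }
    NR == 3 { for (i = 1; i <= NF; i++) poss[$i] = 1 }
    END {
      for (l in hyp) {
        a++                          # |A|
        if (l in sure) a_s++         # |A ∩ S|
        if (l in poss) a_p++         # |A ∩ P|
      }
      for (l in sure) s++            # |S|
      if (a == 0 || s == 0) { print "n/a"; exit }
      printf "%.1f%% (%.1f%%/%.1f%%)\n",
        100 * (1 - (a_s + a_p) / (a + s)), 100 * a_p / a, 100 * a_s / s
    }'
}

# Example: three predicted links, two sure gold links, one extra possible link.
aer "0-0 1-1 2-3" "0-0 1-1" "2-2"    # prints: 20.0% (66.7%/100.0%)
```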

German to English

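| Method    | DeEn                | EnDe                | Grow-Diag           | Grow-Diag-Final     |
|-----------|---------------------|---------------------|---------------------|---------------------|
| FastAlign | 28.4% (71.3%/71.8%) | 32.0% (69.7%/66.4%) | 27.0% (84.6%/64.1%) | 27.7% (80.7%/65.5%) |
| Mgiza     | 21.0% (86.2%/72.8%) | 23.1% (86.6%/69.0%) | 21.4% (94.3%/67.2%) | 20.6% (91.3%/70.2%) |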

Romanian to English

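| Method    | RoEn                | EnRo                | Grow-Diag           | Grow-Diag-Final     |
|-----------|---------------------|---------------------|---------------------|---------------------|
| FastAlign | 33.8% (71.8%/61.3%) | 35.5% (70.6%/59.4%) | 32.1% (85.1%/56.5%) | 32.2% (81.4%/58.1%) |
| Mgiza     | 28.7% (82.7%/62.6%) | 32.2% (79.5%/59.1%) | 27.9% (94.0%/58.5%) | 26.4% (90.9%/61.8%) |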

English to French

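| Method    | EnFr                | FrEn                | Grow-Diag           | Grow-Diag-Final     |
|-----------|---------------------|---------------------|---------------------|---------------------|
| FastAlign | 16.4% (80.0%/90.1%) | 15.9% (81.3%/88.7%) | 10.5% (90.8%/87.8%) | 12.1% (87.7%/88.3%) |
| Mgiza     | 8.0% (91.4%/92.9%)  | 9.8% (91.6%/88.3%)  | 5.9% (97.5%/89.7%)  | 6.2% (95.5%/91.6%)  |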

Known Issues

  • Does not work on macOS
  • Tokenization of the Canadian Hansards seems to be off when accents are present in the English text, e.g.: Ms. H é l è ne Alarie, Mr. Andr é Harvey :, Mr. R é al M é nard