RQ1 Human Evaluation Results
June 25, 2021 · View on GitHub
Get Krippendorff's alpha and Kendall’s tau
python eval_reliability.py
Krippendorff's alpha for ordinal metric: 0.8551221815781381
Kendall’s tau (Author#0 and Author#1): KendalltauResult(correlation=0.8251680898238802 > 0.8 , pvalue=5.804877252677888e-23)
Kendall’s tau (Author#0 and Author#2): KendalltauResult(correlation=0.9007287753910033 > 0.8 , pvalue=1.994301058255853e-27)
Kendall’s tau (Author#1 and Author#2): KendalltauResult(correlation=0.8466989681122703 > 0.8 , pvalue=3.1986477435326564e-24)
Get Correlation
python eval_correlation.py --format
| B-Moses | B-Norm | B-CC | |
|---|---|---|---|
| Pearson | 0.2444 | 0.6966 | 0.5715 |
| Spearman | 0.1970 | 0.6229 | 0.5461 |
| Kendall | 0.1698 | 0.4677 | 0.4037 |
python eval_correlation.py
B-Moses
PearsonResult(correlation=0.24435772972227326, pvalue=0.014280876238425467)
SpearmanrResult(correlation=0.19697704093225188, pvalue=0.049495829147941255)
KendalltauResult(correlation=0.1697675971999534, pvalue=0.04930553082987851)
B-Norm
PearsonResult(correlation=0.6965603253000296, pvalue=8.476334482364091e-16)
SpearmanrResult(correlation=0.6228542206450789, pvalue=4.538124117661111e-12)
KendalltauResult(correlation=0.46767293985242286, pvalue=7.230184262098552e-11)
B-CC
PearsonResult(correlation=0.5715389625437758, pvalue=5.287140230888454e-10)
SpearmanrResult(correlation=0.546091524849012, pvalue=4.190203785059649e-09)
KendalltauResult(correlation=0.40371805391990023, pvalue=1.822821167326917e-08)