glue-benchmark

January 13, 2020

A take on how close we can get to a flexible glue-benchmark where it matters.

Why

Apart from measuring progress in NLP research and NLP transfer learning, the GLUE collection offers a good, varied set of low-level NLP capabilities that can be used in a variety of higher-level solutions. For instance, in large text and news corpora, discerning entailment is key to reducing the volume of inputs and to identifying truly new information.
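As a rough illustration of the entailment use case above, the sketch below keeps a text only when nothing already kept entails it. Both `keep_new_information` and `toy_entails` are hypothetical names that are not part of this repository, and `toy_entails` is a crude word-overlap stand-in for a real entailment model (for example one trained on MNLI or RTE).

```python
from typing import Callable, Iterable, List


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an entailment model.

    Treats the hypothesis as entailed when every one of its words
    already appears in the premise. A real system would call a
    trained classifier here instead.
    """
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def keep_new_information(
    texts: Iterable[str],
    entails: Callable[[str, str], bool] = toy_entails,
) -> List[str]:
    """Return the texts that are not entailed by an earlier kept text."""
    kept: List[str] = []
    for text in texts:
        # Drop the text if something we already kept implies it.
        if not any(entails(seen, text) for seen in kept):
            kept.append(text)
    return kept


headlines = [
    "the bank raised rates on monday",
    "the bank raised rates",
    "oil prices fell",
]
filtered = keep_new_information(headlines)
```

Only earlier texts are checked against later ones, so the result depends on input order. Comparing every pair would be more thorough, at quadratic cost.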

GLUE Tasks

| Index | Description | Inputs | Target | Metric | SOTA | 🤗 | Best here |
|---|---|---|---|---|---|---|---|
| CoLA | Linguistic acceptability | Sent1 | Binary | Matthews Correlation | 72% | 49% | 48% |
| SST-2 | Sentiment analysis | Sent1 | Binary | Accuracy | 97.5% | 92% | 91% |
| MRPC | Sentence equivalence | Sent1, Sent2 | Binary | Accuracy | 93% | 87% | 80% |
| STS-B | Meaning similarity | Sent1, Sent2 | Regression | Correlation | 93% | 91.4% | |
| QQP | Quora Question Pairs, question equivalence (binary) | Sent1, Sent2 | Binary | Accuracy | 91% | 88% | 86% |
| MNLI-m | Matched textual entailment | Sent1, Sent2 | entailment, no entailment | Accuracy, F1 | 91% | 84% | 75% |
| MNLI-mm | Same as above, mismatched: trained domains vs test domains | Sent1, Sent2 | entailment, no entailment | Accuracy, F1 | 90.6% | 85% | 76% |
| QNLI | Stanford Question Answering Dataset (SQuAD), determine what is the answer and if the answer is available in the paragraph reference | Question, Paragraph | Binary & Sequence with the answer | Accuracy | 98% | 89% | 83% |
| RTE | Recognising Textual Entailment | Sent1, Sent2 | entailment, contradiction, or neutral | Accuracy, F1 | 91% | 71.4% | 54% |
| WNLI | Winograd Schema, pronoun ambiguity where the answer requires world knowledge and not only grammatical context | Sent1, Sent2 | entailment, contradiction, or neutral | Accuracy, F1 | 94.5% | 43.7% | 56% |
| AX | Diagnostic Main, different entailment relationships of arbitrary size, predominantly for diagnostic purposes | Sent1, Sent2 | entailment, contradiction, or neutral | Accuracy, F1 | 49.4% | n/a | |
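For reference, the metrics named in the table can be computed as in the minimal pure-Python sketch below. It assumes binary labels encoded as 0 and 1, with 1 as the positive class. The function names are illustrative and not taken from this repository. The table only says "Correlation" for STS-B, so the use of Pearson correlation here is an assumption.

```python
import math
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def _confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels, 1 being the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    tp, _, fp, fn = _confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient, 0.0 when undefined."""
    tp, tn, fp, fn = _confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


labels = [1, 1, 0, 0]
preds = [1, 0, 0, 0]
acc = accuracy(labels, preds)
f1 = f1_score(labels, preds)
mcc = matthews_corrcoef(labels, preds)
corr = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Matthews correlation takes all four cells of the confusion matrix into account, which makes it more informative than accuracy when the two classes are unbalanced.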

Not all of the tasks above are necessarily relevant.

Source: gluebenchmark.com