AlphaMonarch-7B-Nous.md

February 14, 2024

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| AlphaMonarch-7B | 45.37 | 77.01 | 78.39 | 50.2 | 62.74 |

AGIEval

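| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| agieval_aqua_rat | 0 | acc | 28.35 | ±2.83 |
| | | acc_norm | 26.38 | ±2.77 |
| agieval_logiqa_en | 0 | acc | 38.25 | ±1.91 |
| | | acc_norm | 38.10 | ±1.90 |
| agieval_lsat_ar | 0 | acc | 23.91 | ±2.82 |
| | | acc_norm | 23.48 | ±2.80 |
| agieval_lsat_lr | 0 | acc | 52.75 | ±2.21 |
| | | acc_norm | 53.53 | ±2.21 |
| agieval_lsat_rc | 0 | acc | 66.91 | ±2.87 |
| | | acc_norm | 67.29 | ±2.87 |
| agieval_sat_en | 0 | acc | 78.64 | ±2.86 |
| | | acc_norm | 78.64 | ±2.86 |
| agieval_sat_en_without_passage | 0 | acc | 45.15 | ±3.48 |
| | | acc_norm | 44.17 | ±3.47 |
| agieval_sat_math | 0 | acc | 33.18 | ±3.18 |
| | | acc_norm | 31.36 | ±3.14 |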

Average: 45.37%

GPT4All

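| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| arc_challenge | 0 | acc | 66.21 | ±1.38 |
| | | acc_norm | 68.17 | ±1.36 |
| arc_easy | 0 | acc | 86.53 | ±0.70 |
| | | acc_norm | 80.81 | ±0.81 |
| boolq | 1 | acc | 87.16 | ±0.59 |
| hellaswag | 0 | acc | 69.58 | ±0.46 |
| | | acc_norm | 87.43 | ±0.33 |
| openbookqa | 0 | acc | 39.20 | ±2.19 |
| | | acc_norm | 49.60 | ±2.24 |
| piqa | 0 | acc | 82.92 | ±0.88 |
| | | acc_norm | 84.82 | ±0.84 |
| winogrande | 0 | acc | 81.06 | ±1.10 |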

Average: 77.01%
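The reported GPT4All average can be reproduced from the table with a simple rule: take `acc_norm` for each task where it is listed, and fall back to `acc` otherwise (boolq and winogrande). This is a minimal sketch based on that observation, not a confirmed description of how the evaluation harness computes the figure; the dictionary layout and the helper name `preferred_score` are illustrative assumptions.

```python
from statistics import mean

# Scores copied from the GPT4All table; tasks without an acc_norm row list acc only.
gpt4all = {
    "arc_challenge": {"acc": 66.21, "acc_norm": 68.17},
    "arc_easy": {"acc": 86.53, "acc_norm": 80.81},
    "boolq": {"acc": 87.16},
    "hellaswag": {"acc": 69.58, "acc_norm": 87.43},
    "openbookqa": {"acc": 39.20, "acc_norm": 49.60},
    "piqa": {"acc": 82.92, "acc_norm": 84.82},
    "winogrande": {"acc": 81.06},
}


def preferred_score(metrics):
    """Hypothetical helper: use acc_norm when present, otherwise acc."""
    return metrics.get("acc_norm", metrics["acc"])


gpt4all_average = round(mean(preferred_score(m) for m in gpt4all.values()), 2)
print(gpt4all_average)  # 77.01
```

The same rule applied to the `acc_norm` rows of the AGIEval table gives 45.37, which matches the average reported there.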

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| truthfulqa_mc | 1 | mc1 | 63.04 | ±1.69 |
| | | mc2 | 78.39 | ±1.37 |

Average: 78.39%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 60.53 | ±3.56 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 61.79 | ±2.53 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 54.26 | ±3.11 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 23.96 | ±2.26 |
| | | exact_str_match | 0.00 | ±0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 33.00 | ±2.10 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.86 | ±1.61 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 59.00 | ±2.84 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 58.00 | ±2.21 |
| bigbench_navigate | 0 | multiple_choice_grade | 56.10 | ±1.57 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.30 | ±1.03 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 55.80 | ±2.35 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 41.78 | ±1.56 |
| bigbench_snarks | 0 | multiple_choice_grade | 72.93 | ±3.31 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 75.86 | ±1.36 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 55.70 | ±1.57 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.44 | ±1.20 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 19.31 | ±0.94 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 59.00 | ±2.84 |

Average: 50.2%

Average score: 62.74%
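The overall score is consistent with an unweighted mean of the four suite averages reported above. The sketch below checks that arithmetic; treating the suites as equally weighted is an assumption that happens to match the published number, and the variable names are illustrative.

```python
from statistics import mean

# Per-suite averages as reported in this document.
suite_averages = {
    "AGIEval": 45.37,
    "GPT4All": 77.01,
    "TruthfulQA": 78.39,
    "Bigbench": 50.2,
}

# Assumption: each suite contributes equally, regardless of its task count.
average_score = round(mean(suite_averages.values()), 2)
print(average_score)  # 62.74
```

Because every suite carries the same weight, the 18 Bigbench tasks count for no more than the single TruthfulQA `mc2` metric.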

Elapsed time: 02:31:13