AlphaMonarch-7B-Nous.md

February 14, 2024

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
AlphaMonarch-7B	45.37	77.01	78.39	50.20	62.74

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	28.35	±	2.83
		acc_norm	26.38	±	2.77
agieval_logiqa_en	0	acc	38.25	±	1.91
		acc_norm	38.10	±	1.90
agieval_lsat_ar	0	acc	23.91	±	2.82
		acc_norm	23.48	±	2.80
agieval_lsat_lr	0	acc	52.75	±	2.21
		acc_norm	53.53	±	2.21
agieval_lsat_rc	0	acc	66.91	±	2.87
		acc_norm	67.29	±	2.87
agieval_sat_en	0	acc	78.64	±	2.86
		acc_norm	78.64	±	2.86
agieval_sat_en_without_passage	0	acc	45.15	±	3.48
		acc_norm	44.17	±	3.47
agieval_sat_math	0	acc	33.18	±	3.18
		acc_norm	31.36	±	3.14

Average: 45.37%
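The 45.37% figure matches the unweighted mean of the eight `acc_norm` values in the AGIEval table. A minimal Python sketch that reproduces it, assuming this is how the suite average was aggregated (the variable names are illustrative and not taken from the evaluation tooling):

```python
# acc_norm values copied from the AGIEval table.
agieval_acc_norm = {
    "agieval_aqua_rat": 26.38,
    "agieval_logiqa_en": 38.10,
    "agieval_lsat_ar": 23.48,
    "agieval_lsat_lr": 53.53,
    "agieval_lsat_rc": 67.29,
    "agieval_sat_en": 78.64,
    "agieval_sat_en_without_passage": 44.17,
    "agieval_sat_math": 31.36,
}

# Assumption: the reported average is the unweighted mean of acc_norm.
agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)
print(f"{agieval_average:.2f}")  # 45.37
```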

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	66.21	±	1.38
		acc_norm	68.17	±	1.36
arc_easy	0	acc	86.53	±	0.70
		acc_norm	80.81	±	0.81
boolq	1	acc	87.16	±	0.59
hellaswag	0	acc	69.58	±	0.46
		acc_norm	87.43	±	0.33
openbookqa	0	acc	39.20	±	2.19
		acc_norm	49.60	±	2.24
piqa	0	acc	82.92	±	0.88
		acc_norm	84.82	±	0.84
winogrande	0	acc	81.06	±	1.10

Average: 77.01%
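The 77.01% figure is consistent with averaging `acc_norm` where a task reports it and falling back to `acc` otherwise (`boolq` and `winogrande` report only `acc`). The sketch below checks that reading against the table; the selection rule is inferred from the numbers rather than documented here, and the variable names are illustrative:

```python
# Metrics copied from the GPT4All table; boolq and winogrande have no acc_norm row.
gpt4all = {
    "arc_challenge": {"acc": 66.21, "acc_norm": 68.17},
    "arc_easy": {"acc": 86.53, "acc_norm": 80.81},
    "boolq": {"acc": 87.16},
    "hellaswag": {"acc": 69.58, "acc_norm": 87.43},
    "openbookqa": {"acc": 39.20, "acc_norm": 49.60},
    "piqa": {"acc": 82.92, "acc_norm": 84.82},
    "winogrande": {"acc": 81.06},
}

# Assumed rule: prefer acc_norm, fall back to acc when acc_norm is absent.
selected = [metrics.get("acc_norm", metrics["acc"]) for metrics in gpt4all.values()]
gpt4all_average = sum(selected) / len(selected)
print(f"{gpt4all_average:.2f}")  # 77.01
```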

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	63.04	±	1.69
		mc2	78.39	±	1.37

Average: 78.39% (the mc2 score)

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	60.53	±	3.56
bigbench_date_understanding	0	multiple_choice_grade	61.79	±	2.53
bigbench_disambiguation_qa	0	multiple_choice_grade	54.26	±	3.11
bigbench_geometric_shapes	0	multiple_choice_grade	23.96	±	2.26
		exact_str_match	0.00	±	0.00
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	33.00	±	2.10
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.86	±	1.61
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	59.00	±	2.84
bigbench_movie_recommendation	0	multiple_choice_grade	58.00	±	2.21
bigbench_navigate	0	multiple_choice_grade	56.10	±	1.57
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	69.30	±	1.03
bigbench_ruin_names	0	multiple_choice_grade	55.80	±	2.35
bigbench_salient_translation_error_detection	0	multiple_choice_grade	41.78	±	1.56
bigbench_snarks	0	multiple_choice_grade	72.93	±	3.31
bigbench_sports_understanding	0	multiple_choice_grade	75.86	±	1.36
bigbench_temporal_sequences	0	multiple_choice_grade	55.70	±	1.57
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	23.44	±	1.20
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	19.31	±	0.94
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	59.00	±	2.84

Average: 50.20%
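The Bigbench average matches the unweighted mean of the 18 `multiple_choice_grade` values, with the `exact_str_match` row for `bigbench_geometric_shapes` left out. A short Python check under that assumption (the list name is illustrative):

```python
# multiple_choice_grade values in table order; exact_str_match (0.00) is excluded.
bigbench_mc_grade = [
    60.53, 61.79, 54.26, 23.96, 33.00, 23.86,
    59.00, 58.00, 56.10, 69.30, 55.80, 41.78,
    72.93, 75.86, 55.70, 23.44, 19.31, 59.00,
]

# Assumption: the reported average is the unweighted mean over the 18 tasks.
bigbench_average = sum(bigbench_mc_grade) / len(bigbench_mc_grade)
print(f"{bigbench_average:.2f}")  # 50.20
```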

Average score: 62.74%
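The overall score is consistent with an unweighted mean of the four suite averages, so each suite counts equally regardless of how many tasks it contains. A minimal sketch of that arithmetic, assuming equal weighting (names are illustrative):

```python
# Per-suite averages as reported in this document.
suite_averages = {
    "AGIEval": 45.37,
    "GPT4All": 77.01,
    "TruthfulQA": 78.39,
    "Bigbench": 50.20,
}

# Assumption: each suite is weighted equally.
overall_average = sum(suite_averages.values()) / len(suite_averages)
print(f"{overall_average:.2f}")  # 62.74
```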

Elapsed time: 02:31:13