More Evaluation Results

December 4, 2023

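| Model | HellaSwag | PIQA | WinoGrande | RACE-Middle | RACE-High | TriviaQA | NaturalQuestions | MMLU | MMLU (LM) | ARC-Easy | ARC-Challenge | GSM8K | HumanEval | MBPP | DROP (EM) | DROP (F1) | OpenBookQA | Pile-test | Pile-test-BPB | BBH | AGIEval | CLUEWSC | CHID | CEval | CMMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2 7B | 75.6 | 78.0 | 69.6 | 60.7 | 45.8 | 63.8 | 25.5 | 45.8 | 44.5 | 69.1 | 49.0 | 15.5 | 14.6 | 21.8 | 34.0 | 39.8 | 57.4 | 1.739 | 0.764 | 38.5 | 22.8 | 64.0 | 37.9 | 33.9 | 32.6 |
| Qwen 7B v2 | 73.7 | 77.5 | 67.3 | 57.0 | 43.1 | 59.6 | 32.3 | 57.3 | 40.9 | 56.5 | 41.5 | 52.1 | 32.3 | 37.2 | 43.4 | 51.7 | 48.6 | 2.025 | 0.756 | 47.3 | 29.3 | 76.5 | 86.6 | 62.3 | 62.6 |
| Baichuan2 7B | 67.9 | 73.6 | 60.2 | 59.8 | 45.1 | 59.1 | 21.3 | 53.4 | 35.5 | 44.6 | 36.8 | 23.4 | 22.0 | 26.0 | 31.6 | 37.1 | 34.8 | 1.842 | 0.781 | 41.6 | 42.7 | 69.6 | 80.4 | 54.2 | 56.2 |
| DeepSeek 7B Base | 75.4 | 79.2 | 70.5 | 63.2 | 46.5 | 59.7 | 22.2 | 48.2 | 42.9 | 67.9 | 48.1 | 17.4 | 26.2 | 39.0 | 34.9 | 41.0 | 55.8 | 1.871 | 0.746 | 39.5 | 26.4 | 73.1 | 89.3 | 45.0 | 47.2 |
| DeepSeek 7B Chat | 68.5 | 77.6 | 66.9 | 65.2 | 50.8 | 57.9 | 32.5 | 49.4 | 42.3 | 71.0 | 49.4 | 62.6 | 48.2 | 35.2 | 37.5 | 49.1 | 54.8 | / | / | 42.3 | 19.3 | 71.9 | 64.9 | 47.0 | 49.7 |
| LLaMA2 70B | 84.0 | 82.0 | 80.4 | 70.1 | 54.3 | 79.5 | 36.1 | 69.0 | 53.5 | 76.5 | 59.5 | 58.4 | 28.7 | 45.6 | 63.6 | 69.2 | 60.4 | 1.526 | 0.671 | 62.9 | 37.2 | 76.5 | 55.5 | 51.4 | 53.1 |
| DeepSeek 67B Base | 84.0 | 83.6 | 79.8 | 69.9 | 50.7 | 78.9 | 36.6 | 71.3 | 54.1 | 76.9 | 59.0 | 63.4 | 42.7 | 57.4 | 61.0 | 67.9 | 60.2 | 1.660 | 0.662 | 68.7 | 41.3 | 81.0 | 92.1 | 66.1 | 70.8 |
| DeepSeek 67B Chat | 75.7 | 82.6 | 76.0 | 70.9 | 56.0 | 81.5 | 47.0 | 71.1 | 55.0 | 81.6 | 64.1 | 84.1 | 73.8 | 61.4 | 59.4 | 71.9 | 63.2 | / | / | 71.7 | 46.4 | 60.0 | 72.6 | 65.2 | 67.8 |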

Math evaluation results of DeepSeek LLM 67B Chat

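| Inference | GSM8k | MATH | MGSM-zh | CMATH | Gaokao-MathCloze | Gaokao-MathQA |
| --- | --- | --- | --- | --- | --- | --- |
| CoT | 84.1% | 32.6% | 74.0% | 80.3% | 16.9% | 20.2% |
| Tool-Integrated Reasoning | 86.7% | 51.1% | 76.4% | 85.4% | 21.2% | 28.2% |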
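
To make the comparison between the two inference modes easier to read, the following minimal sketch computes the absolute gain (in percentage points) of Tool-Integrated Reasoning over CoT for each benchmark. The scores are copied from the rows above; the variable names (`benchmarks`, `cot`, `tool`, `gains`) are illustrative and not part of any released evaluation code.

```python
# Scores copied from the math evaluation rows above (percent accuracy).
benchmarks = ["GSM8k", "MATH", "MGSM-zh", "CMATH", "Gaokao-MathCloze", "Gaokao-MathQA"]
cot = [84.1, 32.6, 74.0, 80.3, 16.9, 20.2]
tool = [86.7, 51.1, 76.4, 85.4, 21.2, 28.2]

# Absolute gain in percentage points; rounding to one decimal hides float noise.
gains = {name: round(t - c, 1) for name, c, t in zip(benchmarks, cot, tool)}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} points")
```

Under this calculation the largest gain is on MATH (18.5 points) and the smallest on MGSM-zh (2.4 points).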

Never Seen Before Exam

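| Model | DeepSeek LLM 67B Chat | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Grok-1 | Claude 2 | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hungarian National High-School Exam | 58 | 36.5 | 32 | 19.5 | 39 | 41 | 59 | 55 | 68 |

| Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | PaLM2 Small | DeepSeek LLM 67B Chat | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Prompt-level Instruction Following | 48.9 | 35.0 | 51.0 | 51.2 | 46.9 | 59.1 | 79.3 |

| Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Phind-CodeLlama-34B-v2 | DeepSeek LLM 67B Chat | DeepSeek Coder 33B | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LeetCode Weekly Contest | 11.1 | 2.38 | 1.58 | 7.9 | 20.6 | 12.6 | 17.5 | 31.7 | 48.4 |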