README.md

May 21, 2026

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Note: For the Chinese version of this README, please refer to README_zh.md.

🔔 News

  • 🏆 [2026-04-10] Our paper has been accepted to the main conference of ACL 2026.
  • 📊 [2025-12-23] We have updated the leaderboard with additional model results and refreshed the related tables and figures.
  • 🎉 [2025-08-07] Our paper is released on arXiv: arXiv:2508.05452.

📚 Benchmark Content and Format

LLMEval-Fair focuses on evaluating professional knowledge capabilities, covering 13 academic disciplines as defined by the Ministry of Education: Philosophy, Economics, Law, Education, Literature, History, Science, Engineering, Agriculture, Medicine, Military Science, Management, and Arts. It includes over 50 sub-disciplines and a total of approximately 200,000 standardized generative question-answering items (we will continue to expand the question bank to 1 million).

Academic Disciplines Coverage

Question sources mainly include undergraduate homework, undergraduate mid-term and final exams, and graduate entrance exams. To prevent large models from being exposed to a significant portion of the evaluation data during pre-training, LLMEval-Fair sources its questions from non-public channels where possible. The data is in PDF and Word formats, which undergo OCR and data cleaning before being formatted. A standardized interface is provided for different question types to enable a fully automated process for the models under test.

Unlike other knowledge benchmarks that use a multiple-choice format, LLMEval-Fair treats all questions as generative knowledge question-answering. It includes a variety of formats such as short answer, calculation, true/false, analysis, and essay questions. Compared to standardized multiple-choice questions, the generative format used in LLMEval-Fair better reflects real-world user needs and the language capabilities of the models.

The full benchmark dataset is now publicly available in data/.

🔬 Methodology

Evaluation Pipeline

Preventing cheating is a key consideration for LLMEval-Fair. Existing public benchmarks suffer from test set leakage, which can lead to unfair practices like "leaderboard hacking" or score inflation. In LLMEval-Fair, each participating system must complete 1,000 questions randomly sampled from the total question bank. For models from the same institution, we ensure that the questions are not repeated in subsequent evaluations. The evaluation is conducted online, with questions sent sequentially in a single round; the next question is only sent after the previous one has been answered, preventing malicious crawling.
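The sampling and sequential-delivery rules above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the integer question IDs, and the `ask_model` callback are hypothetical and are not taken from the LLMEval-Fair codebase.

```python
import random


def sample_unseen(bank_ids, seen_ids, n=1000, seed=None):
    """Draw n question IDs at random, skipping IDs this institution already received."""
    rng = random.Random(seed)
    candidates = [qid for qid in bank_ids if qid not in seen_ids]
    if len(candidates) < n:
        raise ValueError("not enough unseen questions left for this institution")
    drawn = rng.sample(candidates, n)
    seen_ids.update(drawn)  # remember them so later evaluations skip them
    return drawn


def run_single_round(question_ids, ask_model):
    """Send questions one at a time; the next one goes out only after an answer returns."""
    answers = {}
    for qid in question_ids:
        answers[qid] = ask_model(qid)  # blocks until the model under test replies
    return answers


# Hypothetical usage: a small stand-in bank and two evaluations by one institution.
bank = range(10_000)
seen_by_institution = set()
first_eval = sample_unseen(bank, seen_by_institution, seed=0)
second_eval = sample_unseen(bank, seen_by_institution, seed=1)
answers = run_single_round(first_eval[:3], ask_model=lambda qid: f"answer to {qid}")
```

Tracking the seen set per institution rather than per model is what keeps a second model from the same institution from receiving questions the first one already saw.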

This round of evaluation uses an automated scoring method, with GPT-4 Turbo as the current evaluation model. Each question is scored on a scale of 0-3 points. The scoring focuses on the core correctness of the answer and the validity of the explanation, with core correctness being the primary metric. The evaluation prompt used is as follows:

Please evaluate the following response from the LLM regarding a discipline-specific question based on the following criteria. You must score it on a scale of 0, 1, 2 or 3 stars:

Overall Rating:
0 stars indicate wrong answer with a wrong explanation
1 star indicates wrong answer but a partially reasonable explanation
2 stars indicate a correct answer with a partially reasonable explanation
3 stars indicate a correct answer with a reasonable explanation

User: {question}

LLM:{answer_from_llm}

The correct answer to user's question is: {correct_answer}

You must provide your feedback in the following format:
{"Overall Rating":numbers of its stars(int)}

Scoring

To mitigate systematic bias introduced by randomly sampling 1,000 questions, LLMEval-Fair uses both relative scores and absolute scores.

Relative Score Calculation: Given the rapid development of large language model technology, we introduce a relative score to measure the gap between a model and the current state-of-the-art performance. We select the top-performing model on the leaderboard as the SOTA baseline, which is currently Doubao-1.5-Thinking-Pro:

$$R_{\text{SOTA}}^{\text{model}} = \frac{S_{model}}{S_{\text{sotamodel}}} \times 100$$

Absolute Score Calculation: The absolute score represents the model's raw performance on N=1,000 questions. It is calculated by normalizing each question's score (0-3 points) to a 0-100 scale and averaging over the N questions:

$$S_{model} = \frac{1}{N}\sum_{i=1}^{N} \frac{s_i}{s_{max}} \times 100 \quad (1)$$

where $s_i$ is the score for question $i$, and $s_{max}=3$.

Scoring Notes: $S_{model}$ is the absolute score (0-100 scale), $R_{\text{SOTA}}^{\text{model}}$ is the relative score (with the SOTA model as the 100% baseline), and discipline-specific scores use a 10-point scale.
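As a small worked illustration of the two scores, the sketch below assumes that the normalized per-question scores are averaged over N, which keeps the absolute score on the 0-100 scale described in the Scoring Notes. The function names are hypothetical; the leaderboard figures in the usage lines are taken from the Overall Scores table.

```python
def absolute_score(question_scores, s_max=3):
    """Mean of the per-question scores, each normalized from 0..s_max to 0..100."""
    n = len(question_scores)
    return sum(s / s_max for s in question_scores) / n * 100


def relative_score(s_model, s_sota):
    """Absolute score of a model expressed as a percentage of the SOTA model's score."""
    return s_model / s_sota * 100


# Toy input: four questions rated 3, 2, 3 and 0 stars.
toy_absolute = round(absolute_score([3, 2, 3, 0]), 2)   # 8 of 12 points -> 66.67

# Leaderboard check: DeepSeek-R1 (91.23) against Doubao-1.5-Thinking-Pro (93.67).
r1_relative = round(relative_score(91.23, 93.67), 2)    # 97.4, listed as 97.40
```

Because the relative score is a ratio against the current top model, it shifts whenever the SOTA baseline changes, while the absolute score does not.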

๐Ÿ† Current Leaderboard (As of December 2025)

๐Ÿ“‹ Overall Scores

| Model Name | Organization | Access Type | Evaluation Date | Relative Score | Absolute Score |
| --- | --- | --- | --- | --- | --- |
| Doubao-1.5-Thinking-Pro | ByteDance | API | 2025.7.21 | 100.00 | 93.67 |
| DeepSeek-R1 | DeepSeek | API | 2025.7.21 | 97.40 | 91.23 |
| Gemini-2.5-Pro-Preview | Google | API | 2025.7.21 | 97.22 | 91.07 |
| Gemini-2.5-Pro-Preview-Thinking | Google | API | 2025.7.21 | 97.15 | 91.00 |
| DeepSeek-V3 | DeepSeek | API | 2025.7.21 | 96.48 | 90.37 |
| Qwen3-235B | Alibaba Cloud | API | 2025.7.21 | 96.44 | 90.33 |
| Doubao-1.5-Pro-256K | ByteDance | API | 2025.7.21 | 95.69 | 89.63 |
| GLM-4.6 | Zhipu AI | API | 2025.9.30 | 95.26 | 89.23 |
| QwQ-32B | Alibaba Cloud | API | 2025.8.1 | 94.52 | 88.54 |
| Kimi-K2 | Moonshot AI | API | 2025.9.5 | 94.27 | 88.30 |
| GPT-5 | OpenAI | API | 2025.8.7 | 93.84 | 87.90 |
| Claude-Sonnet-4.5-Thinking | Anthropic | API | 2025.9.29 | 93.48 | 87.57 |
| o1-2024-12-17 | OpenAI | API | 2025.7.21 | 93.35 | 87.43 |
| Claude-Sonnet-4.5 | Anthropic | API | 2025.9.29 | 93.31 | 87.40 |
| Gemini-2.5-Flash-Thinking | Google | API | 2025.8.1 | 92.74 | 86.87 |
| DeepSeek-V3.2 | DeepSeek | API | 2025.12.1 | 92.27 | 86.43 |
| Qwen3-32B | Alibaba Cloud | API | 2025.7.21 | 92.21 | 86.37 |
| Claude-Sonnet-4-Thinking | Anthropic | API | 2025.7.21 | 91.03 | 85.27 |
| Claude-Sonnet-4 | Anthropic | API | 2025.7.21 | 91.00 | 85.23 |
| GPT-4o-Search-Preview | OpenAI | API | 2025.7.21 | 89.40 | 83.73 |
| GLM-4-32B | Tsinghua&Zhipu.AI | API | 2025.8.1 | 88.43 | 82.83 |
| GPT-4o-2024-11-20 | OpenAI | API | 2025.7.21 | 88.08 | 82.50 |
| Gemini-1.5-Pro | Google | API | 2025.8.1 | 85.92 | 80.47 |
| Qwen2.5-32B-Instruct | Alibaba Cloud | API | 2025.8.1 | 85.07 | 79.68 |
| o3-Mini | OpenAI | API | 2025.7.21 | 84.13 | 78.80 |
| Qwen-Turbo-1101 | Alibaba Cloud | API | 2025.8.1 | 83.71 | 78.41 |
| Claude-3.5-Sonnet | Anthropic | API | 2025.8.1 | 83.38 | 78.10 |
| o1-Mini-2024-09-12 | OpenAI | API | 2025.8.1 | 78.93 | 73.93 |
| GPT-4 Turbo(gpt-4-1106-preview) | OpenAI | API | 2023.11.18 | 78.56 | 73.6 |
| GPT-4-0125-Preview | OpenAI | API | 2024.1.28 | 76.44 | 71.6 |
| Baidu-4.0 | Baidu | API | 2023.11.1 | 75.09 | 70.33 |
| Yi-34B-Chat | 01.AI | API | 2023.12.1 | 70.17 | 65.70 |
| Baidu-3.5 | Baidu | API | 2023.11.1 | 69.14 | 64.73 |
| ChatGLM-Pro | Tsinghua&Zhipu.AI | API | 2023.11.1 | 69.14 | 64.73 |
| Megrez-3B-Instruct | Megrez | API | 2024.12.16 | 67.01 | 62.77 |
| GPT-4-0613 | OpenAI | API | 2023.9.29 | 66.17 | 61.97 |
| iFlytek-Spark-v3.0 | iFlytek | API | 2023.11.7 | 65.64 | 61.47 |
| Qwen2-7B-Instruct | Alibaba Cloud | API | 2024.6.6 | 65.15 | 61.03 |
| Nanbeige-Plus | NanBeiGe LLM Lab | API | 2023.12.1 | 65.14 | 61.00 |
| Phi-4-Final | Microsoft | API | 2024.12.12 | 63.98 | 59.93 |
| Claude-3-Haiku | Anthropic | API | 2025.8.1 | 62.95 | 58.97 |
| Llama-3.2-90B-Vision-Instruct | Meta | API | 2025.8.1 | 61.74 | 57.83 |
| Llama-3.3-70B | Meta | API | 2025.8.1 | 60.85 | 57.00 |
| Baichuan2-13B-Chat | Baichuan | Weights | 2023.9.29 | 58.31 | 54.6 |
| Gemini-Pro | Google | API | 2024.1.10 | 58.20 | 54.5 |
| Qwen-Plus | Alibaba Cloud | API | 2023.11.1 | 56.60 | 53.0 |
| Qwen-Turbo | Alibaba Cloud | API | 2023.11.1 | 55.78 | 52.23 |
| Nanbeige-16B | NanBeiGe LLM Lab | API | 2023.10.23 | 55.46 | 51.93 |
| GPT-3.5-Turbo | OpenAI | API | 2023.9.29 | 55.42 | 51.9 |
| MiniMax-Abab5 | MiniMax | Weights | 2023.11.1 | 55.33 | 51.83 |
| Mixtral-8x7B-Instruct | Mistral AI | Weights | 2024.1.10 | 51.69 | 48.4 |
| ChatGLM2-6B | Tsinghua&Zhipu.AI | Weights | 2023.9.29 | 42.32 | 39.63 |
| Llama-3.1-8B | Meta | API | 2024.7.23 | 41.24 | 38.63 |
| Ziya-v1.1-13B | IDEA | Weights | 2023.9.29 | 40.18 | 37.63 |
| InternLM-Chat-7B | Shanghai AI Lab&SenseTime | Weights | 2023.9.29 | 38.73 | 36.27 |
| Linly-Chinese-Llama-2-13B-HF | National Engineering Lab | Weights | 2023.10.3 | 37.06 | 34.7 |
| Phi-3-Medium-128K-Instruct | Microsoft | API | 2025.8.1 | 36.94 | 34.60 |
| BELLE-Llama2-13B-Chat-0.4M | LianjiaTech | Weights | 2023.10.1 | 36.28 | 33.97 |
| Llama-2-7B-Chat-HF | Meta | Weights | 2023.9.29 | 25.24 | 23.63 |

📊 Discipline-Specific Performance

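| Model Name | Overall | Engineering | Economics | Education | Law | Literature | Management | Science | History | Medicine | Military |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Doubao-1.5-Thinking-Pro | 93.67 | 9.47 | 9.67 | 9.43 | 9.77 | 8.93 | 9.53 | 9.23 | 9.70 | 8.97 | 8.97 |
| DeepSeek-R1 | 91.23 | 9.47 | 9.43 | 9.27 | 9.37 | 8.83 | 9.37 | 9.03 | 9.53 | 8.50 | 8.43 |
| Gemini-2.5-Pro-Preview | 91.07 | 9.20 | 9.47 | 9.20 | 9.30 | 8.43 | 9.63 | 9.07 | 9.40 | 8.50 | 8.87 |
| Gemini-2.5-Pro-Preview-Thinking | 91.00 | 9.13 | 9.50 | 9.37 | 9.47 | 8.40 | 9.63 | 9.20 | 9.27 | 8.30 | 8.73 |
| DeepSeek-V3 | 90.37 | 9.30 | 9.57 | 8.93 | 9.23 | 8.60 | 9.13 | 8.97 | 9.47 | 8.83 | 8.33 |
| Qwen3-235B | 90.33 | 9.23 | 9.43 | 9.03 | 9.50 | 8.23 | 9.43 | 8.97 | 9.17 | 8.73 | 8.60 |
| Doubao-1.5-Pro-256K | 89.63 | 8.83 | 9.03 | 9.13 | 9.43 | 8.57 | 9.27 | 8.83 | 9.10 | 8.60 | 8.83 |
| GLM-4.6 | 89.23 | 8.80 | 9.27 | 8.70 | 9.23 | 8.40 | 9.63 | 8.90 | 9.30 | 8.43 | 8.57 |
| QwQ-32B | 88.54 | 8.30 | 9.46 | 9.23 | 9.33 | 7.83 | 9.46 | 8.65 | 9.27 | 8.57 | 8.43 |
| Kimi-K2 | 88.30 | 9.23 | 9.17 | 8.80 | 9.00 | 8.40 | 9.17 | 8.77 | 9.13 | 8.53 | 8.10 |
| GPT-5 | 87.90 | 8.83 | 9.37 | 8.90 | 8.87 | 8.10 | 9.10 | 8.90 | 9.03 | 8.50 | 8.30 |
| Claude-Sonnet-4.5-Thinking | 87.57 | 8.90 | 9.17 | 8.80 | 8.97 | 8.00 | 9.23 | 8.90 | 9.00 | 8.27 | 8.33 |
| o1-2024-12-17 | 87.43 | 8.90 | 9.30 | 8.67 | 8.77 | 7.73 | 9.27 | 8.90 | 8.97 | 8.17 | 8.77 |
| Claude-Sonnet-4.5 | 87.40 | 8.80 | 8.97 | 8.93 | 8.73 | 8.37 | 9.10 | 8.97 | 8.93 | 8.13 | 8.47 |
| Gemini-2.5-Flash-Thinking | 86.87 | 8.67 | 9.27 | 8.70 | 9.00 | 7.80 | 8.93 | 8.90 | 9.00 | 8.03 | 8.57 |
| DeepSeek-V3.2 | 86.43 | 8.73 | 9.13 | 8.53 | 8.70 | 7.40 | 9.33 | 8.87 | 9.37 | 8.53 | 7.83 |
| Qwen3-32B | 86.37 | 8.43 | 9.10 | 8.57 | 9.10 | 7.77 | 9.47 | 8.67 | 9.30 | 7.70 | 8.27 |
| Claude-Sonnet-4-Thinking | 85.27 | 8.57 | 9.00 | 8.63 | 8.73 | 7.57 | 9.10 | 8.93 | 8.70 | 7.97 | 8.07 |
| Claude-Sonnet-4 | 85.23 | 8.57 | 8.80 | 8.50 | 8.70 | 7.80 | 9.03 | 8.80 | 8.80 | 8.17 | 8.07 |
| GPT-4o-Search-Preview | 83.73 | 8.27 | 8.77 | 8.43 | 8.67 | 7.77 | 8.80 | 8.20 | 8.73 | 8.27 | 7.83 |
| GLM-4-32B | 82.83 | 7.77 | 8.97 | 8.33 | 8.33 | 7.03 | 9.13 | 8.27 | 8.77 | 8.23 | 8.00 |
| GPT-4o-2024-11-20 | 82.50 | 7.90 | 8.67 | 8.30 | 8.33 | 7.17 | 8.97 | 8.57 | 8.67 | 7.63 | 8.30 |
| Gemini-1.5-Pro | 80.47 | 8.13 | 8.45 | 8.30 | 8.37 | 7.04 | 8.17 | 8.43 | 8.50 | 7.48 | 7.60 |
| Qwen2.5-32B-Instruct | 79.68 | 7.70 | 8.57 | 8.33 | 8.33 | 6.70 | 8.50 | 8.17 | 7.70 | 7.60 | 8.08 |
| o3-Mini | 78.80 | 7.97 | 8.60 | 8.30 | 8.20 | 6.73 | 8.57 | 8.53 | 7.17 | 7.03 | 7.70 |
| Qwen-Turbo-1101 | 78.41 | 7.97 | 8.37 | 8.03 | 8.23 | 6.40 | 8.50 | 8.10 | 7.50 | 7.27 | 8.05 |
| Claude-3.5-Sonnet | 78.10 | 7.97 | 8.53 | 8.27 | 7.93 | 7.03 | 8.50 | 8.00 | 7.57 | 6.70 | 7.60 |
| o1-Mini-2024-09-12 | 73.93 | 7.27 | 8.43 | 7.90 | 7.53 | 6.27 | 8.27 | 8.17 | 6.43 | 6.63 | 7.03 |
| GPT-4 Turbo(gpt-4-1106-preview) | 73.6 | 6.97 | 8.17 | 8.33 | 7.8 | 6.0 | 7.57 | 8.13 | 7.0 | 6.43 | 7.2 |
| GPT-4-0125-Preview | 71.6 | 6.9 | 7.4 | 8.03 | 7.3 | 6.0 | 7.47 | 7.63 | 6.87 | 6.33 | 7.67 |
| Baidu-4.0 | 70.33 | 7.27 | 7.23 | 7.67 | 7.43 | 5.63 | 6.47 | 6.8 | 7.63 | 7.8 | 6.4 |
| Yi-34B-Chat | 65.70 | 5.77 | 6.63 | 7.37 | 7.53 | 5.47 | 5.77 | 5.47 | 7.47 | 6.3 | 7.93 |
| Baidu-3.5 | 64.73 | 6.2 | 6.7 | 7.8 | 6.83 | 5.2 | 5.5 | 6.0 | 7.23 | 6.57 | 6.7 |
| ChatGLM-Pro | 64.73 | 5.9 | 7.07 | 7.03 | 7.9 | 5.43 | 6.33 | 5.0 | 6.67 | 5.97 | 7.43 |
| Megrez-3B-Instruct | 62.77 | 5.80 | 6.77 | 6.80 | 7.13 | 5.40 | 6.87 | 5.70 | 6.53 | 5.70 | 6.07 |
| GPT-4-0613 | 61.97 | 6.5 | 6.73 | 6.6 | 6.73 | 5.43 | 6.1 | 6.47 | 5.3 | 5.2 | 6.9 |
| iFlytek-Spark-v3.0 | 61.47 | 5.77 | 6.5 | 7.27 | 7.3 | 5.7 | 5.9 | 5.03 | 6.5 | 5.23 | 6.27 |
| Qwen2-7B-Instruct | 61.03 | 5.47 | 6.73 | 6.33 | 7.60 | 5.13 | 6.17 | 6.17 | 5.73 | 5.33 | 6.37 |
| Nanbeige-Plus | 61.00 | 5.78 | 5.57 | 6.77 | 7.37 | 5.37 | 5.93 | 5.45 | 6.3 | 5.67 | 6.77 |
| Phi-4-Final | 59.93 | 5.80 | 6.47 | 6.23 | 6.53 | 5.53 | 6.30 | 6.27 | 5.50 | 5.43 | 5.87 |
| Claude-3-Haiku | 58.97 | 5.80 | 6.60 | 6.97 | 6.63 | 4.83 | 5.93 | 6.33 | 4.80 | 5.23 | 5.83 |
| Llama-3.2-90B-Vision-Instruct | 57.83 | 5.63 | 6.33 | 6.20 | 5.80 | 4.73 | 6.10 | 6.57 | 5.03 | 5.27 | 6.17 |
| Llama-3.3-70B | 57.00 | 5.80 | 6.90 | 5.63 | 5.70 | 5.47 | 5.70 | 6.30 | 4.70 | 4.87 | 5.93 |
| Baichuan2-13B-Chat | 54.6 | 4.47 | 5.53 | 7.4 | 6.9 | 4.63 | 4.8 | 4.33 | 6.23 | 4.6 | 5.7 |
| Gemini-Pro | 54.5 | 4.87 | 5.43 | 7.07 | 6.43 | 5.10 | 4.5 | 4.65 | 6.33 | 4.42 | 5.7 |
| Qwen-Plus | 53.0 | 4.4 | 5.1 | 6.53 | 6.53 | 5.0 | 4.77 | 4.87 | 5.17 | 5.13 | 5.5 |
| Qwen-Turbo | 52.23 | 4.1 | 6.07 | 6.63 | 6.43 | 4.43 | 4.53 | 4.97 | 5.27 | 4.37 | 5.43 |
| Nanbeige-16B | 51.93 | 4.37 | 5.3 | 6.5 | 6.3 | 3.97 | 4.7 | 4.07 | 5.9 | 4.73 | 6.1 |
| GPT-3.5-Turbo | 51.9 | 4.97 | 5.37 | 6.4 | 6.47 | 4.43 | 4.67 | 5.43 | 4.2 | 4.37 | 5.6 |
| MiniMax-Abab5 | 51.83 | 3.87 | 5.63 | 6.87 | 6.97 | 4.33 | 4.4 | 2.93 | 6.13 | 4.27 | 6.43 |
| Mixtral-8x7B-Instruct | 48.4 | 4.27 | 5.47 | 6.47 | 6.4 | 3.13 | 4.5 | 5.07 | 3.57 | 4.37 | 5.17 |
| ChatGLM2-6B | 39.63 | 2.33 | 3.77 | 5.97 | 6.13 | 2.83 | 3.83 | 2.6 | 3.8 | 4.0 | 4.37 |
| Llama-3.1-8B | 38.63 | 3.87 | 4.20 | 4.27 | 4.17 | 3.50 | 3.83 | 4.30 | 3.17 | 3.20 | 4.13 |
| Ziya-v1.1-13B | 37.63 | 2.77 | 3.97 | 5.17 | 5.33 | 2.8 | 3.77 | 2.53 | 3.7 | 3.03 | 4.57 |
| InternLM-Chat-7B | 36.27 | 2.63 | 3.67 | 4.87 | 5.57 | 3.17 | 3.33 | 2.33 | 4.03 | 3.13 | 3.53 |
| Linly-Chinese-Llama-2-13B-HF | 34.7 | 2.2 | 3.77 | 4.5 | 5.0 | 2.43 | 3.33 | 2.53 | 3.9 | 2.5 | 4.53 |
| Phi-3-Medium-128K-Instruct | 34.60 | 2.27 | 4.17 | 3.70 | 4.23 | 2.87 | 4.50 | 3.57 | 3.20 | 2.27 | 3.83 |
| BELLE-Llama2-13B-Chat-0.4M | 33.97 | 2.57 | 3.07 | 4.93 | 4.73 | 2.83 | 3.8 | 2.43 | 3.33 | 2.4 | 3.87 |
| Llama-2-7B-Chat-HF | 23.63 | 1.53 | 3.43 | 3.0 | 3.73 | 1.73 | 2.43 | 1.97 | 2.17 | 0.8 | 2.83 |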

Note: Discipline scores are on a 10-point scale
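One way to sanity-check a row of the discipline table: in the rows tried here, the Overall value equals the sum of the ten 10-point discipline scores up to rounding. This is an observation drawn from the published numbers, not a rule stated by the benchmark, and the helper name below is hypothetical.

```python
# (overall, ten discipline scores) copied from three rows of the table above.
ROWS = {
    "Doubao-1.5-Thinking-Pro": (
        93.67, [9.47, 9.67, 9.43, 9.77, 8.93, 9.53, 9.23, 9.70, 8.97, 8.97]),
    "DeepSeek-R1": (
        91.23, [9.47, 9.43, 9.27, 9.37, 8.83, 9.37, 9.03, 9.53, 8.50, 8.43]),
    "Llama-2-7B-Chat-HF": (
        23.63, [1.53, 3.43, 3.0, 3.73, 1.73, 2.43, 1.97, 2.17, 0.8, 2.83]),
}


def overall_gap(overall, discipline_scores):
    """Absolute difference between the Overall column and the discipline sum."""
    return round(abs(overall - sum(discipline_scores)), 2)


gaps = {name: overall_gap(overall, parts) for name, (overall, parts) in ROWS.items()}
for name, gap in gaps.items():
    print(f"{name}: gap {gap:.2f}")
```

A gap of a few hundredths is what per-column rounding would produce; a larger gap in a row would suggest a transcription error.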

The performance distribution over time for the currently ranked models is shown in the figure below:

Model Performance Trends

For more experimental details and analysis, please refer to our paper.

📑 Citation

If you find LLMEval-Fair helpful for your research, please cite our paper:

@misc{zhang2025llmevalfair,
      title         = {{LLMEval-Fair}: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models},
      author        = {Ming Zhang and Yujiong Shen and Jingyi Deng and Yuhui Wang and Huayu Sha and Kexin Tan and Qiyuan Peng and Yue Zhang and Junzhe Wang and Shichun Liu and Yueyuan Huang and Jingqi Tong and Changhao Jiang and Yilong Wu and Zhihao Zhang and Mingqi Wu and Mingxu Chai and Zhiheng Xi and Shihan Dou and Tao Gui and Qi Zhang and Xuanjing Huang},
      year          = {2025},
      eprint        = {2508.05452},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url           = {https://arxiv.org/abs/2508.05452},
      note          = {Accepted at ACL 2026 (Main)}
}

The paper has been accepted to ACL 2026 (Main Conference). The official ACL Anthology BibTeX will be added once the proceedings are released.

📞 Contact Us

This project is open to the public, and we welcome you to participate in our evaluation.

Institutional evaluation requires certification. After registering an account, please contact the administrators for verification and to apply for evaluation permissions.

Unless there are special circumstances, all evaluation results will be added to the leaderboard upon completion.

| Project | Description | Paper | Code |
| --- | --- | --- | --- |
| LLMEval-Logic (arXiv 2026) | Solver-verified Chinese logical-reasoning benchmark | arXiv | GitHub |
| LLMEval-Med (EMNLP 2025 Findings) | Physician-validated clinical benchmark | arXiv | GitHub |
| LLMEval-2 (AAAI 2024) | Phase II: Professional domain evaluation | arXiv | GitHub |
| LLMEval-1 (AAAI 2024) | Phase I: General capability evaluation | arXiv | GitHub |

Full project list & leaderboard: llmeval.com · All datasets: 🤗 llmeval-fdu


LLMEval-Fair | Building the Future of LLM Evaluation