LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Note: For the Chinese version of this README, please refer to README_zh.md.

🔔 News

🏆 [2026-04-10] Our paper has been accepted to the main conference of ACL 2026.
📊 [2025-12-23] We have updated the leaderboard with additional model results and refreshed the related tables and figures.
🎉 [2025-08-07] Our paper is released on arXiv: arXiv:2508.05452.

📚 Benchmark Content and Format

LLMEval-Fair focuses on evaluating professional knowledge capabilities, covering 13 academic disciplines as defined by the Ministry of Education: Philosophy, Economics, Law, Education, Literature, History, Science, Engineering, Agriculture, Medicine, Military Science, Management, and Arts. It includes over 50 sub-disciplines and approximately 200,000 standardized generative question-answering items; we plan to keep expanding the question bank toward 1 million items.

Academic Disciplines Coverage

Questions come mainly from undergraduate homework, undergraduate mid-term and final exams, and graduate entrance exams. To reduce the chance that large models have already seen a significant portion of the evaluation data during pre-training, LLMEval-Fair sources its questions from non-public channels where possible. The source material arrives as PDF and Word documents, which go through OCR and data cleaning before being converted into a standard format. A standardized interface is provided for each question type so that models under test can be evaluated fully automatically.

Unlike other knowledge benchmarks that use a multiple-choice format, LLMEval-Fair treats all questions as generative knowledge question-answering. It includes a variety of formats such as short answer, calculation, true/false, analysis, and essay questions. Compared to standardized multiple-choice questions, the generative format used in LLMEval-Fair better reflects real-world user needs and the language capabilities of the models.

The full benchmark dataset is now publicly available in data/.

🔬 Methodology

Evaluation Pipeline

Preventing cheating is a key consideration for LLMEval-Fair. Existing public benchmarks suffer from test set leakage, which can lead to unfair practices like "leaderboard hacking" or score inflation. In LLMEval-Fair, each participating system must complete 1,000 questions randomly sampled from the total question bank. For models from the same institution, we ensure that the questions are not repeated in subsequent evaluations. The evaluation is conducted online, with questions sent sequentially in a single round; the next question is only sent after the previous one has been answered, preventing malicious crawling.
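
The sampling rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function name `sample_questions`, the `seen_by_institution` dictionary, and the use of plain integer question ids are all hypothetical.

```python
import random


def sample_questions(bank_ids, seen_by_institution, institution, k=1000, seed=None):
    """Draw k question ids that this institution has not been served before.

    bank_ids            -- iterable of all question ids in the bank (hypothetical format)
    seen_by_institution -- dict mapping institution name -> set of ids already served
    """
    seen = seen_by_institution.setdefault(institution, set())
    # Only questions the institution has never received are eligible.
    candidates = [qid for qid in bank_ids if qid not in seen]
    if len(candidates) < k:
        raise ValueError("not enough unseen questions left for this institution")
    rng = random.Random(seed)
    chosen = rng.sample(candidates, k)  # sampling without replacement
    seen.update(chosen)                 # remember them for later evaluations
    return chosen
```

Keeping the served ids per institution rather than per model reflects the rule that questions are not repeated across evaluations of models from the same institution.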

This round of evaluation uses an automated scoring method, with GPT-4 Turbo as the current evaluation model. Each question is scored on a scale of 0-3 points. The scoring focuses on the core correctness of the answer and the validity of the explanation, with core correctness being the primary metric. The evaluation prompt used is as follows:

Please evaluate the following response from the LLM regarding a discipline-specific question based on the following criteria. You must score it on a scale of 0, 1, 2 or 3 stars:

Overall Rating:
0 stars indicate wrong answer with a wrong explanation
1 star indicates wrong answer but a partially reasonable explanation
2 stars indicate a correct answer with a partially reasonable explanation
3 stars indicate a correct answer with a reasonable explanation

User: {question}

LLM:{answer_from_llm}

The correct answer to user's question is: {correct_answer}

You must provide your feedback in the following format:
{"Overall Rating":numbers of its stars(int)}
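
A minimal sketch of how the prompt template might be filled in and how the judge's reply might be parsed. The helper names `build_prompt` and `parse_rating` are hypothetical, and the parsing rule (take the first `"Overall Rating"` value and reject anything outside 0-3) is an assumption rather than the project's documented behavior. Because the template's final line contains literal braces, the sketch substitutes only the three named placeholders instead of using `str.format`.

```python
import re

# Matches only the three placeholders that appear in the evaluation prompt.
PLACEHOLDER_RE = re.compile(r"\{(question|answer_from_llm|correct_answer)\}")
# Tolerates optional whitespace and surrounding text in the judge's reply.
RATING_RE = re.compile(r'"Overall Rating"\s*:\s*(\d+)')


def build_prompt(template, question, answer_from_llm, correct_answer):
    """Fill the placeholders in a single pass, leaving other braces untouched."""
    fields = {
        "question": question,
        "answer_from_llm": answer_from_llm,
        "correct_answer": correct_answer,
    }
    return PLACEHOLDER_RE.sub(lambda m: fields[m.group(1)], template)


def parse_rating(reply):
    """Return the star rating (0-3) from the judge's reply, or None if unusable."""
    match = RATING_RE.search(reply)
    if match is None:
        return None
    stars = int(match.group(1))
    return stars if 0 <= stars <= 3 else None
```

Returning `None` instead of raising lets a caller decide whether to retry the judge or discard the item.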

Scoring

To mitigate systematic bias introduced by randomly sampling 1,000 questions, LLMEval-Fair uses both relative scores and absolute scores.

Relative Score Calculation: Given the rapid development of large language model technology, we introduce a relative score to measure the gap between a model and the current state-of-the-art performance. We select the top-performing model on the leaderboard as the SOTA baseline, which is currently Doubao-1.5-Thinking-Pro:

$R_{\text{SOTA}}^{\text{model}}=\frac{S_{model}}{S_{SOTA}} \times 100$

Absolute Score Calculation: The absolute score represents the model's raw performance on N=1,000 questions. It is calculated by normalizing each question's score (0-3 points) to a 0-100 scale:

$S_{model}=\frac{1}{N}\sum_{i=1}^N{\frac{s_i}{s_{max}} \times 100} \quad (1)$

where $s_i$ is the score for question $i$, and $s_{max}=3$. Averaging over the $N$ questions keeps $S_{model}$ on the 0-100 scale.

Scoring Notes: $S_{model}$ is the absolute score (0-100 scale), $R_{\text{SOTA}}^{\text{model}}$ is the relative score (with the SOTA model as the 100% baseline), and discipline-specific scores use a 10-point scale.
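
The two scores can be computed as in the sketch below. It assumes the absolute score is the mean of the normalized per-question scores, which is what keeps it in the 0-100 range; the function names and the four-question toy input are illustrative only.

```python
def absolute_score(scores, s_max=3):
    """Mean of per-question scores, normalized to a 0-100 scale."""
    if not scores:
        raise ValueError("at least one question score is required")
    return 100 * sum(scores) / (s_max * len(scores))


def relative_score(s_model, s_sota):
    """Model's absolute score as a percentage of the SOTA baseline's."""
    return s_model / s_sota * 100


# Toy input, not real evaluation data: four questions scored 3, 3, 2 and 0.
toy = absolute_score([3, 3, 2, 0])

# Absolute scores taken from the leaderboard: DeepSeek-R1 (91.23) against
# the baseline Doubao-1.5-Thinking-Pro (93.67).
r1_relative = relative_score(91.23, 93.67)
print(round(toy, 2), round(r1_relative, 2))
```

Rounded to two decimals, the second value reproduces the relative score of 97.40 listed for DeepSeek-R1.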

🏆 Current Leaderboard (As of December 2025)

📋 Overall Scores

Model Name	Organization	Access Type	Evaluation Date	Relative Score	Absolute Score
Doubao-1.5-Thinking-Pro	ByteDance	API	2025.7.21	100.00	93.67
DeepSeek-R1	DeepSeek	API	2025.7.21	97.40	91.23
Gemini-2.5-Pro-Preview	Google	API	2025.7.21	97.22	91.07
Gemini-2.5-Pro-Preview-Thinking	Google	API	2025.7.21	97.15	91.00
DeepSeek-V3	DeepSeek	API	2025.7.21	96.48	90.37
Qwen3-235B	Alibaba Cloud	API	2025.7.21	96.44	90.33
Doubao-1.5-Pro-256K	ByteDance	API	2025.7.21	95.69	89.63
GLM-4.6	Zhipu AI	API	2025.9.30	95.26	89.23
QwQ-32B	Alibaba Cloud	API	2025.8.1	94.52	88.54
Kimi-K2	Moonshot AI	API	2025.9.5	94.27	88.30
GPT-5	OpenAI	API	2025.8.7	93.84	87.90
Claude-Sonnet-4.5-Thinking	Anthropic	API	2025.9.29	93.48	87.57
o1-2024-12-17	OpenAI	API	2025.7.21	93.35	87.43
Claude-Sonnet-4.5	Anthropic	API	2025.9.29	93.31	87.40
Gemini-2.5-Flash-Thinking	Google	API	2025.8.1	92.74	86.87
DeepSeek-V3.2	DeepSeek	API	2025.12.1	92.27	86.43
Qwen3-32B	Alibaba Cloud	API	2025.7.21	92.21	86.37
Claude-Sonnet-4-Thinking	Anthropic	API	2025.7.21	91.03	85.27
Claude-Sonnet-4	Anthropic	API	2025.7.21	91.00	85.23
GPT-4o-Search-Preview	OpenAI	API	2025.7.21	89.40	83.73
GLM-4-32B	Tsinghua&Zhipu.AI	API	2025.8.1	88.43	82.83
GPT-4o-2024-11-20	OpenAI	API	2025.7.21	88.08	82.50
Gemini-1.5-Pro	Google	API	2025.8.1	85.92	80.47
Qwen2.5-32B-Instruct	Alibaba Cloud	API	2025.8.1	85.07	79.68
o3-Mini	OpenAI	API	2025.7.21	84.13	78.80
Qwen-Turbo-1101	Alibaba Cloud	API	2025.8.1	83.71	78.41
Claude-3.5-Sonnet	Anthropic	API	2025.8.1	83.38	78.10
o1-Mini-2024-09-12	OpenAI	API	2025.8.1	78.93	73.93
GPT-4 Turbo(gpt-4-1106-preview)	OpenAI	API	2023.11.18	78.56	73.6
GPT-4-0125-Preview	OpenAI	API	2024.1.28	76.44	71.6
Baidu-4.0	Baidu	API	2023.11.1	75.09	70.33
Yi-34B-Chat	01.AI	API	2023.12.1	70.17	65.70
Baidu-3.5	Baidu	API	2023.11.1	69.14	64.73
ChatGLM-Pro	Tsinghua&Zhipu.AI	API	2023.11.1	69.14	64.73
Megrez-3B-Instruct	Megrez	API	2024.12.16	67.01	62.77
GPT-4-0613	OpenAI	API	2023.9.29	66.17	61.97
iFlytek-Spark-v3.0	iFlytek	API	2023.11.7	65.64	61.47
Qwen2-7B-Instruct	Alibaba Cloud	API	2024.6.6	65.15	61.03
Nanbeige-Plus	NanBeiGe LLM Lab	API	2023.12.1	65.14	61.00
Phi-4-Final	Microsoft	API	2024.12.12	63.98	59.93
Claude-3-Haiku	Anthropic	API	2025.8.1	62.95	58.97
Llama-3.2-90B-Vision-Instruct	Meta	API	2025.8.1	61.74	57.83
Llama-3.3-70B	Meta	API	2025.8.1	60.85	57.00
Baichuan2-13B-Chat	Baichuan	Weights	2023.9.29	58.31	54.6
Gemini-Pro	Google	API	2024.1.10	58.20	54.5
Qwen-Plus	Alibaba Cloud	API	2023.11.1	56.60	53.0
Qwen-Turbo	Alibaba Cloud	API	2023.11.1	55.78	52.23
Nanbeige-16B	NanBeiGe LLM Lab	API	2023.10.23	55.46	51.93
GPT-3.5-Turbo	OpenAI	API	2023.9.29	55.42	51.9
MiniMax-Abab5	MiniMax	Weights	2023.11.1	55.33	51.83
Mixtral-8x7B-Instruct	Mistral AI	Weights	2024.1.10	51.69	48.4
ChatGLM2-6B	Tsinghua&Zhipu.AI	Weights	2023.9.29	42.32	39.63
Llama-3.1-8B	Meta	API	2024.7.23	41.24	38.63
Ziya-v1.1-13B	IDEA	Weights	2023.9.29	40.18	37.63
InternLM-Chat-7B	Shanghai AI Lab&SenseTime	Weights	2023.9.29	38.73	36.27
Linly-Chinese-Llama-2-13B-HF	National Engineering Lab	Weights	2023.10.3	37.06	34.7
Phi-3-Medium-128K-Instruct	Microsoft	API	2025.8.1	36.94	34.60
BELLE-Llama2-13B-Chat-0.4M	LianjiaTech	Weights	2023.10.1	36.28	33.97
Llama-2-7B-Chat-HF	Meta	Weights	2023.9.29	25.24	23.63

📊 Discipline-Specific Performance

Model Name	Overall	Engineering	Economics	Education	Law	Literature	Management	Science	History	Medicine	Military
Doubao-1.5-Thinking-Pro	93.67	9.47	9.67	9.43	9.77	8.93	9.53	9.23	9.70	8.97	8.97
DeepSeek-R1	91.23	9.47	9.43	9.27	9.37	8.83	9.37	9.03	9.53	8.50	8.43
Gemini-2.5-Pro-Preview	91.07	9.20	9.47	9.20	9.30	8.43	9.63	9.07	9.40	8.50	8.87
Gemini-2.5-Pro-Preview-Thinking	91.00	9.13	9.50	9.37	9.47	8.40	9.63	9.20	9.27	8.30	8.73
DeepSeek-V3	90.37	9.30	9.57	8.93	9.23	8.60	9.13	8.97	9.47	8.83	8.33
Qwen3-235B	90.33	9.23	9.43	9.03	9.50	8.23	9.43	8.97	9.17	8.73	8.60
Doubao-1.5-Pro-256K	89.63	8.83	9.03	9.13	9.43	8.57	9.27	8.83	9.10	8.60	8.83
GLM-4.6	89.23	8.80	9.27	8.70	9.23	8.40	9.63	8.90	9.30	8.43	8.57
QwQ-32B	88.54	8.30	9.46	9.23	9.33	7.83	9.46	8.65	9.27	8.57	8.43
Kimi-K2	88.30	9.23	9.17	8.80	9.00	8.40	9.17	8.77	9.13	8.53	8.10
GPT-5	87.90	8.83	9.37	8.90	8.87	8.10	9.10	8.90	9.03	8.50	8.30
Claude-Sonnet-4.5-Thinking	87.57	8.90	9.17	8.80	8.97	8.00	9.23	8.90	9.00	8.27	8.33
o1-2024-12-17	87.43	8.90	9.30	8.67	8.77	7.73	9.27	8.90	8.97	8.17	8.77
Claude-Sonnet-4.5	87.40	8.80	8.97	8.93	8.73	8.37	9.10	8.97	8.93	8.13	8.47
Gemini-2.5-Flash-Thinking	86.87	8.67	9.27	8.70	9.00	7.80	8.93	8.90	9.00	8.03	8.57
DeepSeek-V3.2	86.43	8.73	9.13	8.53	8.70	7.40	9.33	8.87	9.37	8.53	7.83
Qwen3-32B	86.37	8.43	9.10	8.57	9.10	7.77	9.47	8.67	9.30	7.70	8.27
Claude-Sonnet-4-Thinking	85.27	8.57	9.00	8.63	8.73	7.57	9.10	8.93	8.70	7.97	8.07
Claude-Sonnet-4	85.23	8.57	8.80	8.50	8.70	7.80	9.03	8.80	8.80	8.17	8.07
GPT-4o-Search-Preview	83.73	8.27	8.77	8.43	8.67	7.77	8.80	8.20	8.73	8.27	7.83
GLM-4-32B	82.83	7.77	8.97	8.33	8.33	7.03	9.13	8.27	8.77	8.23	8.00
GPT-4o-2024-11-20	82.50	7.90	8.67	8.30	8.33	7.17	8.97	8.57	8.67	7.63	8.30
Gemini-1.5-Pro	80.47	8.13	8.45	8.30	8.37	7.04	8.17	8.43	8.50	7.48	7.60
Qwen2.5-32B-Instruct	79.68	7.70	8.57	8.33	8.33	6.70	8.50	8.17	7.70	7.60	8.08
o3-Mini	78.80	7.97	8.60	8.30	8.20	6.73	8.57	8.53	7.17	7.03	7.70
Qwen-Turbo-1101	78.41	7.97	8.37	8.03	8.23	6.40	8.50	8.10	7.50	7.27	8.05
Claude-3.5-Sonnet	78.10	7.97	8.53	8.27	7.93	7.03	8.50	8.00	7.57	6.70	7.60
o1-Mini-2024-09-12	73.93	7.27	8.43	7.90	7.53	6.27	8.27	8.17	6.43	6.63	7.03
GPT-4 Turbo(gpt-4-1106-preview)	73.6	6.97	8.17	8.33	7.8	6.0	7.57	8.13	7.0	6.43	7.2
GPT-4-0125-Preview	71.6	6.9	7.4	8.03	7.3	6.0	7.47	7.63	6.87	6.33	7.67
Baidu-4.0	70.33	7.27	7.23	7.67	7.43	5.63	6.47	6.8	7.63	7.8	6.4
Yi-34B-Chat	65.70	5.77	6.63	7.37	7.53	5.47	5.77	5.47	7.47	6.3	7.93
Baidu-3.5	64.73	6.2	6.7	7.8	6.83	5.2	5.5	6.0	7.23	6.57	6.7
ChatGLM-Pro	64.73	5.9	7.07	7.03	7.9	5.43	6.33	5.0	6.67	5.97	7.43
Megrez-3B-Instruct	62.77	5.80	6.77	6.80	7.13	5.40	6.87	5.70	6.53	5.70	6.07
GPT-4-0613	61.97	6.5	6.73	6.6	6.73	5.43	6.1	6.47	5.3	5.2	6.9
iFlytek-Spark-v3.0	61.47	5.77	6.5	7.27	7.3	5.7	5.9	5.03	6.5	5.23	6.27
Qwen2-7B-Instruct	61.03	5.47	6.73	6.33	7.60	5.13	6.17	6.17	5.73	5.33	6.37
Nanbeige-Plus	61.00	5.78	5.57	6.77	7.37	5.37	5.93	5.45	6.3	5.67	6.77
Phi-4-Final	59.93	5.80	6.47	6.23	6.53	5.53	6.30	6.27	5.50	5.43	5.87
Claude-3-Haiku	58.97	5.80	6.60	6.97	6.63	4.83	5.93	6.33	4.80	5.23	5.83
Llama-3.2-90B-Vision-Instruct	57.83	5.63	6.33	6.20	5.80	4.73	6.10	6.57	5.03	5.27	6.17
Llama-3.3-70B	57.00	5.80	6.90	5.63	5.70	5.47	5.70	6.30	4.70	4.87	5.93
Baichuan2-13B-Chat	54.6	4.47	5.53	7.4	6.9	4.63	4.8	4.33	6.23	4.6	5.7
Gemini-Pro	54.5	4.87	5.43	7.07	6.43	5.10	4.5	4.65	6.33	4.42	5.7
Qwen-Plus	53.0	4.4	5.1	6.53	6.53	5.0	4.77	4.87	5.17	5.13	5.5
Qwen-Turbo	52.23	4.1	6.07	6.63	6.43	4.43	4.53	4.97	5.27	4.37	5.43
Nanbeige-16B	51.93	4.37	5.3	6.5	6.3	3.97	4.7	4.07	5.9	4.73	6.1
GPT-3.5-Turbo	51.9	4.97	5.37	6.4	6.47	4.43	4.67	5.43	4.2	4.37	5.6
MiniMax-Abab5	51.83	3.87	5.63	6.87	6.97	4.33	4.4	2.93	6.13	4.27	6.43
Mixtral-8x7B-Instruct	48.4	4.27	5.47	6.47	6.4	3.13	4.5	5.07	3.57	4.37	5.17
ChatGLM2-6B	39.63	2.33	3.77	5.97	6.13	2.83	3.83	2.6	3.8	4.0	4.37
Llama-3.1-8B	38.63	3.87	4.20	4.27	4.17	3.50	3.83	4.30	3.17	3.20	4.13
Ziya-v1.1-13B	37.63	2.77	3.97	5.17	5.33	2.8	3.77	2.53	3.7	3.03	4.57
InternLM-Chat-7B	36.27	2.63	3.67	4.87	5.57	3.17	3.33	2.33	4.03	3.13	3.53
Linly-Chinese-Llama-2-13B-HF	34.7	2.2	3.77	4.5	5.0	2.43	3.33	2.53	3.9	2.5	4.53
Phi-3-Medium-128K-Instruct	34.60	2.27	4.17	3.70	4.23	2.87	4.50	3.57	3.20	2.27	3.83
BELLE-Llama2-13B-Chat-0.4M	33.97	2.57	3.07	4.93	4.73	2.83	3.8	2.43	3.33	2.4	3.87
Llama-2-7B-Chat-HF	23.63	1.53	3.43	3.0	3.73	1.73	2.43	1.97	2.17	0.8	2.83

Note: Discipline scores are on a 10-point scale
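
In the rows above, the ten discipline scores add up to the overall score, up to rounding in the last digit. The sketch below checks this for the two top rows of the table. The helper `discipline_score` is a hypothetical reading of the 10-point scale (mean normalized score times 10) and is not confirmed by the source.

```python
# Overall score and the ten discipline scores, copied from the table above.
ROWS = {
    "Doubao-1.5-Thinking-Pro": (
        93.67,
        [9.47, 9.67, 9.43, 9.77, 8.93, 9.53, 9.23, 9.70, 8.97, 8.97],
    ),
    "DeepSeek-R1": (
        91.23,
        [9.47, 9.43, 9.27, 9.37, 8.83, 9.37, 9.03, 9.53, 8.50, 8.43],
    ),
}


def discipline_score(scores, s_max=3):
    """Assumed 10-point discipline score: mean normalized score times 10."""
    return 10 * sum(scores) / (s_max * len(scores))


def overall_from_disciplines(parts):
    """Sum the per-discipline scores to recover the 0-100 overall score."""
    return sum(parts)


for name, (overall, parts) in ROWS.items():
    recovered = overall_from_disciplines(parts)
    print(f"{name}: listed {overall:.2f}, summed {recovered:.2f}")
```

Rows whose discipline scores are reported with fewer decimals may differ from the listed overall score by about 0.01.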

The performance distribution over time for the currently ranked models is shown in the figure below:

Model Performance Trends

For more experimental details and analysis, please refer to our paper.

📑 Citation

If you find LLMEval-Fair helpful for your research, please cite our paper:

@misc{zhang2025llmevalfair,
      title         = {{LLMEval-Fair}: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models},
      author        = {Ming Zhang and Yujiong Shen and Jingyi Deng and Yuhui Wang and Huayu Sha and Kexin Tan and Qiyuan Peng and Yue Zhang and Junzhe Wang and Shichun Liu and Yueyuan Huang and Jingqi Tong and Changhao Jiang and Yilong Wu and Zhihao Zhang and Mingqi Wu and Mingxu Chai and Zhiheng Xi and Shihan Dou and Tao Gui and Qi Zhang and Xuanjing Huang},
      year          = {2025},
      eprint        = {2508.05452},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url           = {https://arxiv.org/abs/2508.05452},
      note          = {Accepted at ACL 2026 (Main)}
}

The paper has been accepted to ACL 2026 (Main Conference). The official ACL Anthology BibTeX will be added once the proceedings are released.

📞 Contact Us

This project is open to the public, and we welcome you to participate in our evaluation.

Evaluation on behalf of an institution requires verification. After registering an account, please contact the administrators to verify your affiliation and apply for evaluation permissions.

Unless there are special circumstances, all evaluation results will be added to the leaderboard upon completion.

Website: https://llmeval.com/
Email: mingzhang23@m.fudan.edu.cn
WeChat: zanyingluan

Project	Description	Paper	Code
LLMEval-Logic (arXiv 2026)	Solver-verified Chinese logical-reasoning benchmark	arXiv	GitHub
LLMEval-Med (EMNLP 2025 Findings)	Physician-validated clinical benchmark	arXiv	GitHub
LLMEval-2 (AAAI 2024)	Phase II: Professional domain evaluation	arXiv	GitHub
LLMEval-1 (AAAI 2024)	Phase I: General capability evaluation	arXiv	GitHub

Full project list & leaderboard: llmeval.com · All datasets: 🤗 llmeval-fdu

LLMEval-Fair | Building the Future of LLM Evaluation