
LLMEval-2: Professional Domain Evaluation of Chinese LLMs (Phase II)

Paper AAAI 2024 Dataset Website LLMEval

Note: For the Chinese version of this README, please refer to README_zh.md.

🔔 News

  • 🏆 [2024-03-24] Our paper "LLMEval: A Preliminary Study on How to Evaluate Large Language Models" has been accepted at AAAI 2024.
  • 📊 [2023-07] LLMEval-2 evaluation results released, covering 20 LLMs across 12 academic disciplines.

📚 Overview

LLMEval-2 is the Phase II dataset of the LLMEval paper (AAAI 2024). While LLMEval-1 focused on general capabilities, LLMEval-2 targets professional domain evaluation across 12 academic disciplines with approximately 480 questions (both objective and subjective).

Key Features

  • 12 academic disciplines: domain-specific knowledge test sets constructed by subject experts from external databases
  • Dual question types: ~25–30 objective (multiple-choice) + ~10–15 subjective (open-ended) questions per discipline
  • Practical focus: questions based on real tasks undergraduate and graduate students seek LLM assistance with
  • 20 LLMs evaluated: comprehensive comparison with both human and GPT-4 scoring

📋 Evaluation Criteria

| Question Type | Dimension | Max Score | Description |
| --- | --- | --- | --- |
| Objective | Answer Accuracy | 3 | Is the answer correct? |
| Objective | Explanation Quality | 2 | Does the explanation contain errors? |
| Subjective | Accuracy | 5 | Is the answer content correct? |
| Subjective | Informativeness | 3 | Is sufficient information provided? |
| Subjective | Fluency | 3 | Are grammar and format correct? |
| Subjective | Logic | 3 | Is the reasoning sound? |
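
The rubric can be written down as a small lookup table, which makes the available points per question type explicit (5 for objective, 14 for subjective). The snippet below is a minimal sketch, not the authors' scoring code: the names `RUBRIC`, `max_points`, and `normalized` are hypothetical, and the per-question normalization is an illustrative assumption. It does not reproduce the leaderboard's Total column, whose weighting is not stated in this README.

```python
# Rubric from the Evaluation Criteria table: (question type, dimension) -> max score.
RUBRIC = {
    ("objective", "answer_accuracy"): 3,
    ("objective", "explanation_quality"): 2,
    ("subjective", "accuracy"): 5,
    ("subjective", "informativeness"): 3,
    ("subjective", "fluency"): 3,
    ("subjective", "logic"): 3,
}


def max_points(question_type: str) -> int:
    """Sum of the per-dimension maxima for one question type."""
    return sum(m for (qtype, _), m in RUBRIC.items() if qtype == question_type)


def normalized(question_type: str, scores: dict) -> float:
    """Hypothetical helper: fraction of available points earned on one question.

    Raises ValueError for an unknown question type, a missing or unexpected
    dimension, or a score outside the range allowed by the rubric.
    """
    expected = {dim: m for (qtype, dim), m in RUBRIC.items() if qtype == question_type}
    if not expected:
        raise ValueError(f"unknown question type: {question_type!r}")
    if set(scores) != set(expected):
        raise ValueError(f"expected dimensions {sorted(expected)}, got {sorted(scores)}")
    for dim, value in scores.items():
        if not 0 <= value <= expected[dim]:
            raise ValueError(f"{dim}={value} is outside 0..{expected[dim]}")
    return sum(scores.values()) / sum(expected.values())
```

For example, `normalized("objective", {"answer_accuracy": 3, "explanation_quality": 1})` returns `0.8`, since 4 of the 5 available points were earned.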

๐Ÿ† Leaderboard

ModelObj. AccuracyObj. ExplanationFluencyAccuracyLogicInfo.RankTotal
GPT-42.378 (2.395)1.670 (1.595)2.895 (2.989)4.260 (4.545)2.779 (2.903)2.691 (2.886)1 (1)86.72 (89.54)
GPT-3.52.160 (2.138)1.542 (1.503)2.861 (3.000)3.822 (4.295)2.694 (2.818)2.489 (2.750)2 (2)80.71 (84.69)
Xunfei-Spark2.114 (2.243)1.557 (1.632)2.815 (2.977)3.750 (4.193)2.560 (2.739)2.196 (2.716)3 (5)78.05 (82.26)
Baichuan-13B-Chat2.003 (2.013)1.428 (1.441)2.847 (2.949)3.727 (4.102)2.631 (2.778)2.472 (2.756)4 (6)77.51 (81.82)
MiniMax-Abab51.922 (1.928)1.443 (1.493)2.878 (2.989)3.800 (3.977)2.656 (2.722)2.478 (2.699)5 (7)77.47 (80.64)
NewBing2.197 (2.211)1.583 (1.615)2.796 (2.989)3.608 (3.875)2.558 (2.773)2.061 (2.511)6 (4)77.28 (82.63)
Claude1.923 (2.066)1.463 (1.576)2.680 (2.977)3.597 (4.125)2.613 (2.801)2.414 (2.710)7 (3)75.57 (83.49)
MOSS-Mars1.961 (1.967)1.465 (1.470)2.737 (3.000)3.480 (3.807)2.508 (2.648)2.229 (2.534)8 (9)74.41 (79.21)
Tiangong1.933 (1.961)1.354 (1.500)2.774 (2.983)3.520 (3.807)2.576 (2.682)2.339 (2.523)9 (8)74.36 (79.31)
Ziya-LLaMA-13B1.681 (1.592)1.306 (1.201)2.804 (3.000)3.207 (3.364)2.473 (2.585)2.120 (2.278)10 (13)69.48 (70.92)

Values outside parentheses are human evaluation scores and rankings; values in parentheses are the corresponding GPT-4 automated evaluation scores and rankings. Full results for all 20 models are available in LLMEVAL-2.pdf.
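
Because each leaderboard cell packs two numbers into one string, comparing the two scoring methods requires splitting cells first. The sketch below shows one way to do that for a few rows copied from the table above. `CELL`, `parse_cell`, and `ROWS` are hypothetical names introduced for illustration; this is not part of any released LLMEval tooling.

```python
import re

# Matches cells such as "2.378 (2.395)" or "10 (13)".
CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\(\s*(\d+(?:\.\d+)?)\s*\)\s*$")


def parse_cell(cell: str) -> tuple:
    """Split a 'human (gpt4)' cell into a (human, gpt4) pair of floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"not a 'value (value)' cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


# Rank and Total cells copied from four leaderboard rows.
ROWS = {
    "GPT-4": ("1 (1)", "86.72 (89.54)"),
    "NewBing": ("6 (4)", "77.28 (82.63)"),
    "Claude": ("7 (3)", "75.57 (83.49)"),
    "Ziya-LLaMA-13B": ("10 (13)", "69.48 (70.92)"),
}

for model, (rank_cell, total_cell) in ROWS.items():
    human_rank, gpt4_rank = parse_cell(rank_cell)
    human_total, gpt4_total = parse_cell(total_cell)
    # Positive shift: GPT-4 ranks the model higher than the human annotators did.
    rank_shift = int(human_rank - gpt4_rank)
    total_gap = round(gpt4_total - human_total, 2)
    print(f"{model}: rank shift {rank_shift:+d}, total gap {total_gap:+.2f}")
```

Note that the ranks are positions among all 20 evaluated models, so statistics computed over only the ten rows shown here would describe this excerpt rather than the full study.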

📄 Full Report

The complete evaluation report is available in LLMEVAL-2.pdf.

| Project | Description | Paper | Code |
| --- | --- | --- | --- |
| LLMEval-Logic (arXiv 2026) | Solver-verified Chinese logical-reasoning benchmark | arXiv | GitHub |
| LLMEval-Fair (ACL 2026 Main) | Robust & fair evaluation across 13 disciplines, 200K+ questions | arXiv | GitHub |
| LLMEval-Med (EMNLP 2025 Findings) | Physician-validated clinical benchmark | arXiv | GitHub |
| LLMEval-1 (AAAI 2024) | Phase I: General capability evaluation | arXiv | GitHub |

Full project list & leaderboard: llmeval.com · All datasets: 🤗 llmeval-fdu

๐Ÿ“ Citation

@inproceedings{zhang2024llmeval,
  title     = {LLMEval: A Preliminary Study on How to Evaluate Large Language Models},
  author    = {Yue Zhang and Ming Zhang and Haipeng Yuan and Shichun Liu and Yongyao Shi and Tao Gui and Qi Zhang and Xuanjing Huang},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {38},
  number    = {17},
  pages     = {19615--19622},
  year      = {2024},
  doi       = {10.1609/aaai.v38i17.29934}
}

📞 Contact Us

This project is open to the public, and we welcome you to participate in our evaluation.


LLMEval | Fudan University NLP Lab