README.md
May 21, 2026 ยท View on GitHub
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
๐ News: Our paper has been accepted at EMNLP 2025 Findings!
Note: For the Chinese version of this README, please refer to README_zh.md.
๐ Overview
LLMEval-Med provides a comprehensive, physician-validated benchmark for evaluating Large Language Models (LLMs) on real-world clinical tasks. The dataset covers a wide range of medical scenarios and is designed to facilitate rigorous, standardized assessment of medical LLMs. For details on the benchmark design, evaluation protocol, and baseline results, please refer to our paper. The dataset is also available on Hugging Face.
๐๏ธ Project Structure
.
โโโ dataset/
โ โโโ dataset.json # Medical domain evaluation dataset
โโโ evaluate/
โ โโโ Answer.py # Script for getting model responses
โ โโโ Evaluate.py # Script for scoring model responses (1-5 per question)
โ โโโ Aggregate.py # Script for aggregating scores into Usability Rate / OP
๐พ Dataset Structure
The dataset/dataset.json file contains a test set of 667 medical questions, organized by different categories:
- Medical Knowledge
- Medical Language Understanding
- Medical Reasoning
- Medical Ethics and Safety
- Medical Text Generation
Each question in the test set is a JSON object with the following fields:
- category1: Primary category of the question (e.g., "Medical Knowledge").
- category2: Secondary category, providing more specific grouping.
- scene: Scenario or context for the question.
- round: Round number, used for multi-turn conversations (1 for single-turn).
- problem: The medical question or prompt presented to the model.
- groupCode: Group identifier for the question.
- sanswer: The standard (reference) answer provided by medical experts.
- difficulty: Difficulty level.
- checklist: Key points or criteria for evaluation, ensuring the answer covers essential aspects.
Note:
The scoring prompts for each category (e.g., Medical Knowledge, Medical Language Understanding, Medical Reasoning, Medical Ethics and Safety, Medical Text Generation) are defined directly inevaluate/Evaluate.py.
Each prompt is carefully designed to guide the evaluation process and ensure consistency across different types of questions.
Example:
{
"category1": "Medical Knowledge",
"category2": "Basic Medical Knowledge/Medical Exam",
"scene": "Basic Medical Knowledge/Medical Exam_Traditional Chinese Medicine",
"round": 1,
"problem": "Why is ฮฒ-OH anthraquinone more acidic than ฮฑ-OH anthraquinone?",
"groupCode": 5,
"sanswer": "The stronger acidity of ฮฒ-OH anthraquinone compared to ฮฑ-OH anthraquinone is mainly due to resonance effects, hydrogen bonding, and steric hindrance...",
"difficulty": "Medium",
"checklist": "Core requirements:\n1. Explain the enhanced resonance effect, reduced hydrogen bonding, and steric hindrance for ฮฒ-OH anthraquinone acidity.\n2. Detail how the ฮฒ-OH position stabilizes the anion via resonance, and how the ฮฑ-OH position's intramolecular hydrogen bond reduces acidity.\n\nSecondary requirements:\n1. Emphasize the role of the conjugated system and electron-withdrawing effects."
}
๐ ๏ธ Usage Guide
1. Getting Model Responses
Use evaluate/Answer.py to get responses from your LLM:
python evaluate/Answer.py
Key configurations in Answer.py:
- Set your model path in
model_name - Configure GPU settings in
CUDA_VISIBLE_DEVICES - Adjust output paths in
inputs_dirandoutputs_dir
The script will:
- Load questions from
dataset/dataset.json - Generate responses using the specified model
- Save results in JSON format
- Handle multi-turn conversations using conversation history
- Manage GPU memory efficiently
2. Evaluating Model Performance
Use evaluate/Evaluate.py to assess model responses:
python evaluate/Evaluate.py
Key configurations in Evaluate.py:
- Set your OpenAI API configuration:
client = OpenAI( base_url='<Your API URL>', api_key='<Your API Key>' ) - Configure input/output paths in
inputs_dirandoutputs_dir
The evaluation process:
- Loads model responses
- Evaluates each response using GPT-4
- Assigns scores on a 5-point scale
- Provides detailed feedback for each response
3. Aggregating into Overall Performance (OP)
evaluate/Evaluate.py only produces per-question 1โ5 scores. Use evaluate/Aggregate.py to turn those scores into the per-category Usability Rate and Overall Performance (OP) numbers reported in Table 2 of the paper:
# Single judging run
python evaluate/Aggregate.py path/to/dataset_processed_score.json
# Three judging runs (the paper's protocol): per-question scores are
# averaged across runs before the >=4 usability threshold is applied.
python evaluate/Aggregate.py run1.json run2.json run3.json --out summary.json
A response is counted as usable when its averaged 0โ5 score is โฅ 4 (for MK / MLU / MR / MSE). For MTG the paper additionally maps a 5-dimension human evaluation through the Appendix D piecewise formula (threshold โฅ 5, with Safety = 1 as a hard veto); a helper mtg_score_from_human_eval is provided in Aggregate.py for that case. OP is the sample-count-weighted usability rate across all questions.
๐ Evaluation Metrics
Responses are scored on a 5-point scale based on:
-
Accuracy (5 points)
- Perfect alignment with reference answer
- Meets both core and secondary requirements
- No medical safety risks
-
Good (4 points)
- Correct core information
- Minor non-critical errors
- No medical safety risks
-
Fair (3 points)
- Some key information misalignment
- Partial fulfillment of core requirements
- No medical safety risks
-
Poor (2 points)
- Significant information errors
- Medical safety concerns
- Failure to meet core requirements
-
Unacceptable (1 point)
- Major factual errors
- Serious medical safety risks
- Complete misalignment with requirements
๐ Important Notes
-
GPU Requirements:
- The system supports multi-GPU configurations
- Automatically selects GPU with most available memory
- Implements memory management for efficient processing
-
API Keys:
- Requires OpenAI API access for evaluation
- Configure API keys and endpoints in
Evaluate.py
-
Data Processing:
- Supports batch processing
- Handles conversation history for multi-turn scenarios
- Implements robust error handling
๐ฅ Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
๐ฎ Contact Us
For questions or suggestions, please:
-
Open an issue on GitHub
-
Contact the project maintainers ๏ผ
Ming Zhang: mingzhang23@m.fudan.edu.cn
Yujiong Shen: shenyj22@m.fudan.edu.cn
๐ Citation
If you find this benchmark useful, please cite our paper:
@inproceedings{zhang-etal-2025-llmeval,
title = "{LLME}val-{M}ed: A Real-world Clinical Benchmark for Medical {LLM}s with Physician Validation",
author = "Zhang, Ming and
Shen, Yujiong and
Li, Zelin and
Sha, Huayu and
Hu, Binze and
Wang, Yuhui and
Huang, Chenhao and
Liu, Shichun and
Tong, Jingqi and
Jiang, Changhao and
Chai, Mingxu and
Xi, Zhiheng and
Dou, Shihan and
Gui, Tao and
Zhang, Qi and
Huang, Xuanjing",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.263/",
doi = "10.18653/v1/2025.findings-emnlp.263",
pages = "4888--4914",
ISBN = "979-8-89176-335-7",
abstract = "Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Medicine, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains."
}
๐ Related Projects
| Project | Description | Paper | Code |
|---|---|---|---|
| LLMEval-Logic (arXiv 2026) | Solver-verified Chinese logical-reasoning benchmark | arXiv | GitHub |
| LLMEval-Fair (ACL 2026 Main) | Robust & fair evaluation across 13 disciplines, 200K+ questions | arXiv | GitHub |
| LLMEval-2 (AAAI 2024) | Phase II: Professional domain evaluation | arXiv | GitHub |
| LLMEval-1 (AAAI 2024) | Phase I: General capability evaluation | arXiv | GitHub |
Full project list & leaderboard: llmeval.com ยท All datasets: ๐ค llmeval-fdu
LLMEval | Fudan University NLP Lab