MedAsk Benchmarks
July 4, 2025
A comprehensive suite of medical AI benchmarks for evaluating Large Language Model (LLM) performance on clinical tasks.
Overview
This repository contains multiple medical AI benchmarks developed by MedAsk to evaluate and compare the performance of LLMs on various medical tasks:
🩺 SymptomCheck Bench
An OSCE-style benchmark for evaluating diagnostic accuracy of LLM-based medical agents in symptom assessment conversations. The benchmark simulates medical consultations through a structured four-step process:
- Initialization: Selection of a clinical vignette
- Dialogue: Simulated conversation between an LLM agent and patient
- Diagnosis: Generation of top 5 differential diagnoses
- Evaluation: Automated assessment of diagnostic accuracy
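The evaluation step can be pictured as a top-k check on the ranked differential. The sketch below is a minimal illustration, not the benchmark's own evaluator: the names `VignetteResult`, `hit_at_k`, and `top_k_accuracy` are hypothetical, the sample diagnoses are made up, and normalized string equality is assumed as a stand-in for whatever matching the automated assessment actually performs.

```python
from dataclasses import dataclass


@dataclass
class VignetteResult:
    """One case: the reference diagnosis and the agent's ranked differential."""
    ground_truth: str
    differential: list[str]  # ranked, most likely first


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial spelling differences match."""
    return " ".join(name.lower().split())


def hit_at_k(result: VignetteResult, k: int) -> bool:
    """True if the reference diagnosis appears among the first k entries."""
    target = normalize(result.ground_truth)
    return any(normalize(d) == target for d in result.differential[:k])


def top_k_accuracy(results: list[VignetteResult], k: int) -> float:
    """Fraction of vignettes whose reference diagnosis is in the top k."""
    if not results:
        raise ValueError("no results to score")
    return sum(hit_at_k(r, k) for r in results) / len(results)


# Illustrative data only; these are not vignettes from the benchmark.
results = [
    VignetteResult("Acute appendicitis",
                   ["Gastroenteritis", "acute appendicitis", "Ovarian cyst"]),
    VignetteResult("Migraine",
                   ["Tension headache", "Cluster headache", "Sinusitis"]),
]
top1 = top_k_accuracy(results, k=1)
top5 = top_k_accuracy(results, k=5)
```

Reporting both top-1 and top-5 separates "the agent ranked the right diagnosis first" from "the right diagnosis was somewhere in the differential".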
Blog Post: Introducing SymptomCheck Bench
🚨 Triage Bench
A benchmark for evaluating LLM performance on medical triage classification tasks. Models classify clinical vignettes into three urgency levels:
- Emergency (em): Requires immediate emergency room attention
- Non-Emergency (ne): Needs medical evaluation within a week
- Self-care (sc): Can be managed with self-care and monitoring
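One way to score predictions over these three labels is sketched below. It assumes the `em`, `ne`, and `sc` abbreviations from the list above and an ordering from least to most urgent. The function name `score_triage` and the under-/over-triage breakdown are illustrative additions, not necessarily the metrics the benchmark reports.

```python
from collections import Counter

# Labels as abbreviated above, ordered from least to most urgent (assumed ordering).
URGENCY = {"sc": 0, "ne": 1, "em": 2}


def score_triage(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Score (reference, predicted) label pairs.

    Returns accuracy plus the share of under-triaged cases (predicted less
    urgent than the reference) and over-triaged cases (predicted more urgent).
    """
    if not pairs:
        raise ValueError("no predictions to score")
    counts = Counter()
    for reference, predicted in pairs:
        diff = URGENCY[predicted] - URGENCY[reference]
        if diff == 0:
            counts["correct"] += 1
        elif diff < 0:
            counts["under"] += 1
        else:
            counts["over"] += 1
    n = len(pairs)
    return {
        "accuracy": counts["correct"] / n,
        "under_triage": counts["under"] / n,
        "over_triage": counts["over"] / n,
    }


# Illustrative pairs only; these are not benchmark outputs.
pairs = [("em", "em"), ("ne", "em"), ("sc", "sc"), ("em", "ne")]
scores = score_triage(pairs)
```

Splitting errors by direction is useful because missing an emergency and sending a self-care case to the emergency room are not equally costly mistakes.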
Blog Post: Medical AI Triage Accuracy 2025: MedAsk Beats OpenAI's O3 & GPT-4.5
Published Research & Results
ICD-10 Coding Accuracy
Our ICD-10 coding evaluation reports superior accuracy for MedAsk on medical coding tasks. The results are in the results folder of symptomcheck_bench.
Read more: How MedAsk's Cognitive Architecture Improves ICD-10 Coding Accuracy
Repository Structure
medask-benchmarks/
├── README.md                   # This overview
├── medask/                     # Core supporting functions and LLM clients
│   ├── ummon/                  # LLM client implementations
│   ├── models/                 # Data models for communication
│   └── util/                   # Utility functions
├── symptomcheck_bench/         # Diagnostic accuracy benchmark
│   ├── README.md               # Detailed usage instructions
│   ├── main.py                 # Main evaluation script
│   ├── vignettes/              # Clinical vignettes
│   └── results/                # Evaluation results
└── triage_bench/               # Medical triage benchmark
    ├── README.md               # Detailed usage instructions
    ├── main.py                 # Main evaluation script
    ├── paired_analysis.py      # Statistical comparison tool
    ├── vignettes/              # Clinical vignettes
    ├── results/                # Triage evaluation results
    └── medask_results_jul25/   # July 2025 study results
Quick Start
Installation
# Create and activate environment
conda create -n medask-benchmarks python=3.12
conda activate medask-benchmarks
# Install dependencies
pip install -r requirements/development.txt
pip install -e .
# Set API keys
export KEY_OPENAI="sk-..." # For OpenAI models
export KEY_DEEPSEEK="..." # For DeepSeek models
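Before starting a long run, it can help to confirm that the keys are actually visible to Python. The following is a small sketch under stated assumptions: the variable names `KEY_OPENAI` and `KEY_DEEPSEEK` come from the export lines above, while the `missing_keys` helper and the provider-to-variable mapping are hypothetical and not part of the repository.

```python
import os

# Variable names are taken from the export lines above; the mapping is illustrative.
REQUIRED_KEYS = {"openai": "KEY_OPENAI", "deepseek": "KEY_DEEPSEEK"}


def missing_keys(providers: list[str], env=None) -> list[str]:
    """Return the environment variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


missing = missing_keys(["openai", "deepseek"])
if missing:
    print("Set these variables before running a benchmark:", ", ".join(missing))
```

Only the providers you intend to use need to be passed in, so a run that uses OpenAI models alone does not require a DeepSeek key.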
Running Benchmarks
SymptomCheck Bench:
cd symptomcheck_bench
python main.py --file=avey --doctor_llm=gpt-4o --num_vignettes=5
Triage Bench:
cd triage_bench
python main.py --model gpt-4o --runs 3
For detailed usage instructions, see the README files in each benchmark directory.
Supported Models
- OpenAI: GPT-4o, GPT-4.5, o1, o3 series
- DeepSeek: DeepSeek Chat, DeepSeek Reasoner
- MedAsk: Proprietary medical AI models
Citation
If you use these benchmarks in your research, please cite the associated publications and reference the relevant MedAsk blog posts linked above.