[ICLR'26] LiveClin: A Live Clinical Benchmark

February 27, 2026

📃 Paper • 🤗 Dataset • 💻 Code


  • LiveClin is a contamination-free, biannually updated clinical benchmark for evaluating large vision-language models on realistic, multi-stage clinical case reasoning with medical images and tables.
  • Each case presents a clinical scenario followed by a sequence of multiple-choice questions (MCQs) that mirror the progressive diagnostic workflow a clinician would follow, from initial presentation through diagnosis, treatment, complication management, and follow-up.

Updates 🔔

  • [2026.02.27] Evaluation framework refactored.
  • [2026.02.21] Paper released.
  • [2026.02.15] LiveClin is published!

Project Structure 🏗️

LiveClin/
├── evaluate.py                    # CLI entry-point
├── liveclin/                      # Core package
│   ├── __init__.py                #   EvalConfig dataclass
│   ├── client.py                  #   Async API client (shared connection pool)
│   ├── runner.py                  #   Multi-turn evaluation engine
│   ├── analyzer.py                #   Fine-grained results analysis
│   ├── data.py                    #   HuggingFace download & JSONL loading
│   └── utils.py                   #   Prompt formatting & answer extraction
├── scripts/
│   ├── serve_sglang.py            # SGLang deployment helper
│   └── test_vision.py             # Vision capability smoke test
├── requirements.txt
└── README.md

Main Results 📊

Overall case accuracy, with models grouped by family and ordered from newest to oldest

Bar textures indicate model type and dashed lines represent physician reference levels.

[Figure: LiveClin_result1]

Data Example 🩺

An example simulating the entire clinical pathway

The case progresses from initial assessment to long-term management, with new clinical information and diverse imaging modalities (e.g., X-ray, MRI, pathology, CT) progressively introduced at each key decision point.

[Figure: LiveClin_example]

Quick Start 🚀

1. Install

git clone https://github.com/AQ-MedAI/LiveClin.git
cd LiveClin
pip install -r requirements.txt

2. Evaluate

A single command downloads the dataset (on first run) and runs the full pipeline:

# Remote API: images sent as URLs
python evaluate.py \
    --model gpt-5 \
    --api-base https://api.openai.com/v1 \
    --api-key sk-xxx \
    --image-mode url

For locally-served models (e.g. via SGLang), --api-key can be omitted:

python evaluate.py \
    --model Qwen2.5-VL-7B-Instruct \
    --api-base http://localhost:8000/v1 \
    --image-mode local
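The two --image-mode values can be pictured as two ways of filling the image part of an OpenAI-compatible chat message. The sketch below is illustrative only: build_image_part is a hypothetical helper, and the assumption that local mode inlines files as base64 data URLs is ours, not something the evaluator documents.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_part(ref: str, image_mode: str) -> dict:
    """Return one image content part for an OpenAI-compatible chat message.

    Hypothetical helper, not the function used inside liveclin/.
    """
    if image_mode == "url":
        # The remote API is handed a URL for the image.
        url = ref
    elif image_mode == "local":
        # Assumption: local files are inlined as base64 data URLs.
        mime = mimetypes.guess_type(ref)[0] or "application/octet-stream"
        data = base64.b64encode(Path(ref).read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{data}"
    else:
        raise ValueError(f"unknown image mode: {image_mode!r}")
    return {"type": "image_url", "image_url": {"url": url}}
```

Inlining keeps a locally served model independent of outbound network access, at the cost of larger request bodies.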

The evaluation pipeline will:

  1. Auto-download the dataset from HuggingFace (only the requested config, cached for future runs)
  2. Evaluate all cases concurrently via multi-turn conversation
  3. Print a structured summary to the terminal
  4. Save detailed results with fine-grained analysis to JSON

Example terminal output:

============================================================
  LiveClin Results: GPT-5 (2025_H1)
============================================================
  Question Accuracy:  5179/6605 (78.4%)
  Case Accuracy:      433/1407 (30.8%)
------------------------------------------------------------
  By Chapter (Top-5 Case Accuracy):
    Chapter 4: Endocrine, nutritional ...     ( 79 cases)  C-Acc 45.6%  Q-Acc 83.3%
    Chapter 12: Diseases of the skin ...      ( 40 cases)  C-Acc 45.0%  Q-Acc 81.7%
    ...
  By Chapter (Bottom-5 Case Accuracy):
    Chapter 14: Diseases of the geni...       ( 80 cases)  C-Acc 22.5%  Q-Acc 76.5%
    Chapter 11: Diseases of the dige...       (131 cases)  C-Acc 24.4%  Q-Acc 74.6%
    ...
------------------------------------------------------------
  By Subcategory (Top-5 Case Accuracy):
    Mental disorders due to substance...      ( 10 cases)  C-Acc 60.0%  Q-Acc 89.1%
    Dermatitis and eczema (L20-L30)           ( 10 cases)  C-Acc 60.0%  Q-Acc 84.8%
    ...
  By Subcategory (Bottom-5 Case Accuracy):
    Glomerular diseases (N00-N08)             ( 20 cases)  C-Acc 10.0%  Q-Acc 72.3%
    Renal tubulo-interstitial diseases...     ( 20 cases)  C-Acc 15.0%  Q-Acc 73.1%
    ...
------------------------------------------------------------
  By Rarity:
    Rare         (1181 cases)  Q-Acc 78.5%  C-Acc 31.0%
    Unrare       ( 226 cases)  Q-Acc 78.0%  C-Acc 29.6%
------------------------------------------------------------
  By Clinical Stage:
    Presentation & Assessment      (1618 MCQs)  Q-Acc 77.8%
    Diagnosis & Interpretation     (2168 MCQs)  Q-Acc 75.0%
    Therapeutic Strategy           (1601 MCQs)  Q-Acc 83.6%
    Complication Management        ( 184 MCQs)  Q-Acc 76.1%
    Follow-up                      ( 391 MCQs)  Q-Acc 86.2%
------------------------------------------------------------
  By Question Position:
    Q1     (1407 MCQs)  Q-Acc 78.5%  Err 0.1%
    Q2     (1407 MCQs)  Q-Acc 76.5%  Err 0.4%
    ...
------------------------------------------------------------
  By Image Modality:
    CT                   ( 832 MCQs)  Q-Acc 76.4%
    MRI                  ( 621 MCQs)  Q-Acc 78.2%
    Clinical Photo       ( 504 MCQs)  Q-Acc 74.1%
    ...
------------------------------------------------------------
  By Table Modality:
    Lab Results          (1023 MCQs)  Q-Acc 79.8%
    Medications          ( 412 MCQs)  Q-Acc 82.3%
    ...
============================================================
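The gap between Question Accuracy and Case Accuracy in the output above is what one would expect if a case only counts as correct when every one of its MCQs is answered correctly. That definition is an assumption here, since the README does not state it; the sketch below shows the two metrics under that assumption, with score as a hypothetical helper.

```python
def score(cases):
    """Compute question-level and case-level accuracy.

    `cases` is a list of cases, each a list of booleans (one per MCQ).
    Assumption: a case is correct only if all of its MCQs are correct.
    """
    total_q = sum(len(c) for c in cases)
    correct_q = sum(sum(c) for c in cases)
    correct_cases = sum(1 for c in cases if c and all(c))
    return {
        "question_accuracy": correct_q / total_q if total_q else 0.0,
        "case_accuracy": correct_cases / len(cases) if cases else 0.0,
    }
```

For example, three cases with results [True, True], [True, False], and [False, False] give a question accuracy of 3/6 and a case accuracy of 1/3.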

3. Test Vision (Optional)

Verify the model can perceive images before running a full evaluation:

# Remote API
python scripts/test_vision.py \
    --model gpt-5 \
    --api-base https://api.openai.com/v1 \
    --api-key sk-xxx

# Local deployment (--api-key can be omitted)
python scripts/test_vision.py \
    --model your-model \
    --api-base http://localhost:8000/v1

4. Self-Hosted Models (Optional)

Deploy your own model with SGLang to expose an OpenAI-compatible API:

# Terminal 1: launch the model server
python scripts/serve_sglang.py \
    --model-path /path/to/your-model \
    --tp 2 --dp 4 --port 8000

# Terminal 2: run evaluation
python evaluate.py \
    --model your-model-name \
    --api-base http://localhost:8000/v1 \
    --image-mode local

CLI Reference ⚙️

| Flag | Description | Default |
| --- | --- | --- |
| --model | Model identifier (required) | - |
| --api-base | API base URL (required) | - |
| --api-key | API key (omit for local deployments) | token |
| --image-mode | url or local (required) | - |
| --dataset | Dataset config name | 2025_H1 |
| --concurrency | Max concurrent case evaluations | 100 |
| --output | Output JSON path | auto |
| --resume | Resume and retry failed cases | off |
| --max-retries | Max retries per API call | 5 |
| --temperature | Sampling temperature | 0.0 |
| --max-tokens | Max tokens per response | 16384 |
| --verbose | Print per-MCQ retry details | off |
| --data-dir | Root directory for auto-downloaded data | data |
| --jsonl-path | Override: direct path to JSONL file | - |
| --image-root | Override: direct path to image directory | - |

Data 📦

Auto-Download (Default)

No extra steps needed. On first run, only the requested dataset config (e.g. 2025_H1) is downloaded from HuggingFace and cached locally in data/.

Manual Download

For offline use or shared storage, download the dataset yourself:

# Via git (requires git-lfs)
git lfs install
git clone https://huggingface.co/datasets/AQ-MedAI/LiveClin /path/to/liveclin-data

# Or via Python
python -c "from huggingface_hub import snapshot_download; snapshot_download('AQ-MedAI/LiveClin', repo_type='dataset', local_dir='/path/to/liveclin-data')"

Then point the evaluator to your local copy:

# Set the data root (auto-resolves internal structure)
python evaluate.py ... --data-dir /path/to/liveclin-data

# Or point directly to specific files (highest priority)
python evaluate.py ... --jsonl-path /path/to/2025_H1.jsonl --image-root /path/to/image/

Path priority: --jsonl-path / --image-root > --data-dir > default (data/).
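The priority rule above can be sketched as a small resolver. The function name and the layout under the data root (<data-dir>/<dataset>.jsonl and <data-dir>/image) are hypothetical; the evaluator resolves the real internal structure itself.

```python
from pathlib import Path


def resolve_paths(dataset="2025_H1", data_dir="data", jsonl_path=None, image_root=None):
    """Pick the JSONL file and image directory by priority.

    Explicit overrides win; otherwise paths are derived from `data_dir`.
    The derived layout below is an assumption made for illustration.
    """
    root = Path(data_dir)
    jsonl = Path(jsonl_path) if jsonl_path else root / f"{dataset}.jsonl"
    images = Path(image_root) if image_root else root / "image"
    return jsonl, images
```

Each override is applied independently, so passing only --jsonl-path would still take the image directory from the data root in this sketch.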

Load with datasets

from datasets import load_dataset

ds = load_dataset("AQ-MedAI/LiveClin", "2025_H1", split="test")

case = ds[0]
fp = case["exam_creation"]["final_policy"]
print(fp["scenario"])
for mcq in fp["mcqs"]:
    print(f"[{mcq['stage']}] {mcq['question'][:80]}...")
    print(f"  Answer: {mcq['correct_answer']}")
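Building on the fields used above (exam_creation, final_policy, mcqs, stage), a short tally of MCQs per clinical stage might look like the sketch below. count_stages is a hypothetical helper, not part of the liveclin package.

```python
from collections import Counter


def count_stages(cases):
    """Count MCQs per clinical stage across an iterable of cases."""
    counts = Counter()
    for case in cases:
        for mcq in case["exam_creation"]["final_policy"]["mcqs"]:
            counts[mcq["stage"]] += 1
    return counts
```

Calling count_stages(ds) on the loaded split and printing most_common() gives a quick view of how questions are distributed across stages.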

Retry & Resume 🔄

The framework applies a three-layer retry strategy for robust evaluation under unstable network conditions:

| Layer | Scope | Behavior |
| --- | --- | --- |
| API | Single API call | Retries on timeouts, connection errors, rate limits, and 5xx responses, with exponential backoff |
| MCQ | Single question | If all API retries fail, retries the whole question before abandoning the case |
| Run | --resume flag | Re-runs only failed cases; successfully completed cases are preserved |

# Resume after a run with transient failures
python evaluate.py --model gpt-5 --api-base ... --api-key ... --image-mode url --resume

Output Format 📄

Results are saved as a single JSON file (default: results/<model>_<dataset>.json):

{
  "meta": {
    "model": "gpt-5",
    "dataset": "2025_H1",
    "image_mode": "url",
    "started_at": "...",
    "finished_at": "..."
  },
  "summary": {
    "total_cases": 1407,
    "total_mcqs": 6605,
    "question_accuracy": ...,
    "case_accuracy": ...,
    "by_rarity":          { "rare": {...}, "unrare": {...} },
    "by_chapter":         { "Chapter 2: Neoplasms": {...}, ... },
    "by_subcategory":     { "Chapter 2: Neoplasms": { "Subcategory A": {...}, ... }, ... },
    "by_stage":           { "Presentation & Assessment": {...}, ... },
    "by_position":        { "Q1": {...}, "Q2": {...}, ... },
    "by_image_modality":  { "CT": {...}, "MRI": {...}, ... },
    "by_table_modality":  { "Lab Results": {...}, "Medications": {...}, ... }
  },
  "cases": [...]
}
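A results file with this shape can be inspected with a few lines of Python. The sketch below reads only the top-level keys shown above; load_results and headline are hypothetical helpers, and the contents of the per-dimension entries (the {...} placeholders) are left untouched because their fields are not listed here.

```python
import json


def load_results(path):
    """Load a results JSON file written by the evaluator."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def headline(results):
    """Collect the headline numbers and list the stage names that are present."""
    meta, summary = results["meta"], results["summary"]
    return {
        "model": meta["model"],
        "dataset": meta["dataset"],
        "total_cases": summary["total_cases"],
        "total_mcqs": summary["total_mcqs"],
        "question_accuracy": summary["question_accuracy"],
        "case_accuracy": summary["case_accuracy"],
        "stages": sorted(summary.get("by_stage", {})),
    }
```

Typical use would be headline(load_results(path)), where path is the file passed to --output or the default location under results/.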

Analysis Dimensions

| Dimension | Granularity | Categories |
| --- | --- | --- |
| Rarity | 2 groups | Rare (84%), Non-rare (16%) |
| ICD-10 Chapter | 16 chapters | Disease-system-level breakdown |
| ICD-10 Subcategory | 48 groups | Nested under chapters; fine-grained ICD-10 category breakdown |
| Clinical Stage | 5 categories | Presentation & Assessment, Diagnosis & Interpretation, Therapeutic Strategy, Complication Management, Follow-up |
| Question Position | Q1–Q6 | Accuracy and error rate by position within each case |
| Image Modality | 11 types | X-ray, CT, MRI, Ultrasound, Clinical Photo, Endoscopy, Angiography, PET & SPECT, Pathology, Biosignals, Diagram & Plot |
| Table Modality | 9 types | Lab Results, Medications, Demographics, Monitoring, Literature, Genomics, Pathology & IHC, Procedures, Staging Systems |

Citation 📝

If you find LiveClin useful, please cite:

@misc{wang2026liveclinliveclinicalbenchmark,
      title={LiveClin: A Live Clinical Benchmark without Leakage},
      author={Xidong Wang and Shuqi Guo and Yue Shen and Junying Chen and Jian Wang and Jinjie Gu and Ping Zhang and Lei Liu and Benyou Wang},
      year={2026},
      eprint={2602.16747},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.16747},
}