RareArena

March 13, 2026

A Comprehensive Rare Disease Diagnostic Dataset with nearly 50,000 patients covering more than 4,000 diseases.

News

  • [2026/03] We found that the previous evaluation method assigned a score of 1 (hypernym match) too generously, which inflated the results. To address this issue, we constructed a rare disease hypernym hierarchy based on Orphanet, which enables a stricter criterion for assigning a score of 1. The updated evaluation logic has been implemented in the revised file.
  • [2026/03] To enable cost-effective benchmarking, we have carefully selected a representative subset of cases that ensures coverage of all recorded rare diseases. Specifically, we randomly sampled up to three cases for each Orphanet Disorder type and for each disorder subtype. The resulting benchmark datasets are available at RDS_benchmark and RDC_benchmark.
  • [2026/02] Our paper has been published in The Lancet Digital Health. The full paper can be accessed via this link.
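
The sampling rule described in the benchmark item above (up to three random cases per disorder) can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to build RDS_benchmark or RDC_benchmark: the `orpha_code` field, the `sample_benchmark` function, and the fixed seed are all hypothetical, and the sketch groups cases by a single key rather than handling disorder types and subtypes separately.

```python
import random
from collections import defaultdict


def sample_benchmark(cases, max_per_disease=3, seed=0):
    """Keep at most `max_per_disease` randomly chosen cases per disease.

    `cases` is a list of dicts; the "orpha_code" key is a hypothetical
    field name standing in for whatever identifies the mapped disorder.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the cases by disease identifier.
    by_disease = defaultdict(list)
    for case in cases:
        by_disease[case["orpha_code"]].append(case)

    # Sample without replacement inside each group, so every disease
    # that appears in the input is still covered in the subset.
    subset = []
    for code in sorted(by_disease):
        group = by_disease[code]
        subset.extend(rng.sample(group, min(max_per_disease, len(group))))
    return subset
```

Diseases with three or fewer cases keep all of them, so coverage of every recorded disease is preserved while the largest groups are capped.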

Data Collection

We build our work upon PMC-Patients, a large-scale patient summary dataset sourced from PMC case reports, and we use GPT-4o for all data processing.

Specifically, we first filter PMC-Patients for cases that focus on rare disease diagnoses and extract their ground-truth diagnoses. We then map each diagnosis to the Orphanet database using CODER term embeddings and discard the cases whose diagnosis fails to map. Next, we truncate the cases and rephrase them to avoid diagnosis leakage. Here we consider two task settings:

  • Rare Disease Screening (RDS), where the cases are truncated up to any diagnostic tests, such as whole-genome sequencing for genetic diseases and pathogen detection for rare infections.
  • Rare Disease Confirmation (RDC), where the cases are truncated up to the final diagnosis.

Finally, we remove any cases with potential diagnosis leakage.
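
The Orphanet mapping step described above (matching each extracted diagnosis to an Orphanet term by embedding similarity and dropping cases that fail to map) can be illustrated with a nearest-neighbour lookup. This is only a sketch: the vectors below are toy values rather than CODER embeddings, and the `map_to_orphanet` function and the 0.8 similarity threshold are assumptions, since the actual matching criterion is defined in the pipeline scripts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_to_orphanet(query_vec, orphanet_vecs, threshold=0.8):
    """Return the most similar Orphanet term, or None if nothing is close.

    `orphanet_vecs` maps term name -> embedding vector. The threshold is a
    hypothetical cut-off; returning None marks the case for removal.
    """
    best_term, best_sim = None, -1.0
    for term, vec in orphanet_vecs.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_term, best_sim = term, sim
    return best_term if best_sim >= threshold else None
```

A case whose diagnosis returns `None` here would be filtered out, mirroring the "failing to map" rule in the text.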

To reproduce RareArena, see the dataset_collection directory for all the scripts used in our pipeline.

Evaluation

To evaluate a model on RareArena, follow these three steps:

  1. Generate the top 5 diagnoses using the model. We provide an OpenAI-style script and the naive prompt used in our paper in eval/run.py.

  2. Evaluate the top 5 diagnoses using GPT-4o, since identifying whether the true diagnosis is retrieved is nontrivial due to the presence of synonyms and hypernyms. The script and prompt for GPT-4o are given in eval/eval.py.

  3. Parse the evaluation output and calculate top-1 and top-5 recall using eval/metric.py.
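
To make step 3 concrete, the sketch below shows one way top-1 and top-5 recall could be derived from per-diagnosis scores. It is not the code in eval/metric.py: the `recall_breakdown` function and the input layout (one list of 0/1/2 scores per case, ordered by rank) are assumptions, as is the rule that a case counts with the best score found among its top k predictions.

```python
def recall_breakdown(case_scores, k):
    """Percentage of cases whose best score within the top k is 0, 1 or 2.

    `case_scores` holds one list per case with the judge's score for each
    ranked diagnosis (0 = missing, 1 = hypernym, 2 = synonym).
    """
    if not case_scores:
        raise ValueError("no cases to evaluate")

    counts = {0: 0, 1: 0, 2: 0}
    for scores in case_scores:
        best = max(scores[:k], default=0)  # best match among the top k
        counts[best] += 1

    n = len(case_scores)
    result = {score: 100.0 * count / n for score, count in counts.items()}
    result["total"] = result[1] + result[2]  # hypernym + synonym matches
    return result


# Four toy cases, five ranked predictions each.
toy_scores = [
    [2, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 2, 0, 0],
]
top1 = recall_breakdown(toy_scores, k=1)
top5 = recall_breakdown(toy_scores, k=5)
```

Calling the function with `k=1` and `k=5` gives the two halves of each row in the result tables, with `"total"` being the sum of the score 1 and score 2 percentages.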

Model Performances

Rare Disease Screening Task

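| Model | Top 1 Recall (%): Score = 0 (missing) | Top 1 Recall (%): Score = 1 (hypernyms) | Top 1 Recall (%): Score = 2 (synonyms) | Top 1 Recall (%): Total* | Top 5 Recall (%): Score = 0 (missing) | Top 5 Recall (%): Score = 1 (hypernyms) | Top 5 Recall (%): Score = 2 (synonyms) | Top 5 Recall (%): Total |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 66.95 | 9.93 | 23.13 | 33.05 | 43.14 | 20.26 | 36.61 | 56.86 |
| Llama3.1-70B | 74.44 | 8.29 | 17.27 | 25.56 | 52.00 | 17.56 | 30.45 | 48.00 |
| Qwen2.5-72B | 75.44 | 10.14 | 14.42 | 24.56 | 49.87 | 23.79 | 26.34 | 50.13 |
| Gemma2-9B | 82.09 | 9.75 | 8.16 | 17.91 | 56.01 | 22.90 | 21.09 | 43.99 |
| Phi3-7B | 84.31 | 6.11 | 9.58 | 15.69 | 57.61 | 21.15 | 21.24 | 42.39 |
| Llama3.1-8B | 86.03 | 6.21 | 7.76 | 13.97 | 58.09 | 19.07 | 22.84 | 41.91 |
| Qwen2.5-7B | 86.80 | 7.46 | 5.74 | 13.20 | 55.80 | 29.20 | 15.00 | 44.20 |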

* Total recall is defined as the sum of score 2 and score 1 matches.
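
Score 1 is reserved for hypernyms of the true diagnosis, and the News section notes that an Orphanet-based hypernym hierarchy is now used to apply this score more strictly. One plausible reading of such a check is an ancestor lookup in the hierarchy, sketched below. The `ancestors` and `is_hypernym` helpers, the child-to-parents dictionary layout, and the placeholder disease names are all hypothetical and do not come from the released evaluation code or from Orphanet.

```python
def ancestors(disease, parents):
    """Collect every hypernym reachable from `disease` via parent links."""
    seen = set()
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_hypernym(prediction, truth, parents):
    """True only if `prediction` is a strict ancestor of `truth`."""
    return prediction in ancestors(truth, parents)


# Toy hierarchy with placeholder names (child -> list of parents).
TOY_PARENTS = {
    "Disease A, type 1": ["Disease A"],
    "Disease A": ["Disease group X"],
}
```

Under this reading, a prediction of "Disease A" for a true diagnosis of "Disease A, type 1" would qualify for score 1, while an unrelated broad term outside the ancestor chain would not.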

Rare Disease Confirmation Task

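| Model | Top 1 Recall (%): Score = 0 (missing) | Top 1 Recall (%): Score = 1 (hypernyms) | Top 1 Recall (%): Score = 2 (synonyms) | Top 1 Recall (%): Total | Top 5 Recall (%): Score = 0 (missing) | Top 5 Recall (%): Score = 1 (hypernyms) | Top 5 Recall (%): Score = 2 (synonyms) | Top 5 Recall (%): Total |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 35.76 | 14.51 | 49.72 | 64.24 | 14.08 | 20.23 | 65.69 | 85.92 |
| Llama3.1-70B | 43.94 | 14.41 | 41.66 | 56.06 | 18.43 | 21.12 | 60.45 | 81.57 |
| Qwen2.5-72B | 49.46 | 15.46 | 35.09 | 50.54 | 22.98 | 25.93 | 51.09 | 77.02 |
| Gemma2-9B | 60.22 | 16.09 | 23.69 | 39.78 | 29.44 | 29.70 | 40.86 | 70.56 |
| Phi3-7B | 68.82 | 9.15 | 22.03 | 31.18 | 37.68 | 23.48 | 38.84 | 62.32 |
| Llama3.1-8B | 64.14 | 11.17 | 24.69 | 35.86 | 31.13 | 23.84 | 45.03 | 68.87 |
| Qwen2.5-7B | 71.78 | 12.68 | 15.54 | 28.22 | 35.08 | 34.08 | 30.85 | 64.92 |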

License

RareArena is released under the CC BY-NC-SA 4.0 License.

Acknowledgements

We would like to acknowledge that the RareArena dataset was created and provided by Tsinghua Medicine, Peking Union Medical College, and the Department of Statistics and Data Science at Tsinghua University.

Citation

@article{CHEN2026100953,
  title = {RareArena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis},
  journal = {The Lancet Digital Health},
  pages = {100953},
  year = {2026},
  issn = {2589-7500},
  doi = {https://doi.org/10.1016/j.landig.2025.100953},
  url = {https://www.sciencedirect.com/science/article/pii/S2589750025001359},
  author = {Haichao Chen and Zhengyun Zhao and Songchi Zhou and Shikai Hu and Jinyuan Wang and Ye Jin and Xianghong Jin and Yih Chung Tham and Xiaofei Wang and Weizhi Ma and Honghan Wu and Bin Sheng and Shuyang Zhang and Sheng Yu and Tien Yin Wong}
}