BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
March 10, 2026
Preprint: https://arxiv.org/abs/2505.20321

Overview of text-to-SQL workflow used to evaluate LLMs on BiomedSQL.
Requirements
We use uv for dependency management. To install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then install all dependencies:
uv sync
Environment Setup
BiomedSQL requires extensive use of both open- and closed-source LLMs. The following services are needed to run the full set of experiments:
- AzureOpenAI (with endpoints for GPT-4o, GPT-o3-mini, and GPT-5.2)
- AzureAI (with an endpoint for Meta-Llama-405B-Instruct)
- Gemini (for access to Gemini-2.0-Flash and Gemini-3-Pro)
- OpenAI (for access to the general completions() API for use in the Schema Indexing interaction paradigm)
- Anthropic (for access to Claude-3.7-Sonnet and Claude-4.5-Opus)
- HuggingFace (for access to gated Qwen-2.5-Coder-14B-Instruct, Qwen-2.5-Coder-32B-Instruct, and Llama-70B-Instruct repositories)
See config/sample.env for a complete list of specific information needed from each provider. Once you have filled in the values, move this file to config/.env for seamless use in the current experiment setup.
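Before launching the experiments, it can help to confirm that no credential in config/.env was left blank. The following is a minimal sketch, not part of the repository: it assumes the file uses plain KEY=VALUE lines, and the helper names parse_env and empty_keys are hypothetical.

```python
from pathlib import Path


def parse_env(path):
    """Parse KEY=VALUE lines from a dotenv-style file, skipping blanks and comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so that KEY="" counts as empty.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def empty_keys(path):
    """Return the sorted keys whose values are empty in the given env file."""
    return sorted(key for key, value in parse_env(path).items() if not value)


if __name__ == "__main__":
    env_file = Path("config/.env")  # location named in this README
    if env_file.exists():
        for key in empty_keys(env_file):
            print(f"empty value: {key}")
```

An empty printout only means every listed key has some value; it does not verify that the credentials are valid.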
Benchmark Dataset
Our benchmark dataset and the associated tabular data for the database can be found on HuggingFace.
BigQuery Database Creation
To create the BigQuery database from the tabular data hosted on HuggingFace, run:
uv run python create_database.py
LLM Experiments
To run the isolated SQL generation experiments for BiomedSQL, run:
uv run python run_llm_experiments.py
Note: If you do not have GPUs available, comment out any models with provider: huggingface under the experiment_models section of config/llm_config.yaml before running.
We currently use the following open-source models, listed with the compute needed to run our experiment pipeline as-is:
- meta-llama/Llama-3.1-70B-Instruct (three NVIDIA 80GB A100 GPUs)
- Qwen/Qwen2.5-Coder-32B-Instruct (two NVIDIA 80GB A100 GPUs)
- Qwen/Qwen2.5-Coder-14B-Instruct (two NVIDIA 80GB A100 GPUs)
GPU access varies from user to user. To run our experiments without GPUs, comment out any models specified with provider: huggingface under the experiment_models section of config/llm_config.yaml.
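The effect of commenting out those entries can be illustrated in code. This sketch is not how the repository selects models: it assumes experiment_models is a list of entries that each carry a provider field, and the name field, the azure_openai provider value, and the drop_provider helper are hypothetical. The real config/llm_config.yaml may be laid out differently.

```python
def drop_provider(config, provider="huggingface"):
    """Return a copy of the config without models from the given provider."""
    kept = [
        model
        for model in config.get("experiment_models", [])
        if model.get("provider") != provider
    ]
    return {**config, "experiment_models": kept}


# Hypothetical config contents; the real file may use different keys and values.
example_config = {
    "experiment_models": [
        {"name": "gpt-4o", "provider": "azure_openai"},
        {"name": "Qwen/Qwen2.5-Coder-14B-Instruct", "provider": "huggingface"},
        {"name": "meta-llama/Llama-3.1-70B-Instruct", "provider": "huggingface"},
    ]
}

no_gpu_config = drop_provider(example_config)
print([model["name"] for model in no_gpu_config["experiment_models"]])
# prints ['gpt-4o']
```

The original dictionary is left untouched, so the full model list remains available if GPUs are added later.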
Interaction Paradigm Experiments
To run the interaction paradigm experiments for BiomedSQL, run:
uv run python run_interaction_experiments.py
Generate Results
To generate figures and tables after the experiments are finished, run:
uv run python results.py
Tables will show up in results/ and plots will show up in results/plots/.
Results
On BiomedSQL, GPT-o3-mini is consistently the top-performing model across our experiments. However, even when paired with our custom-built text-to-SQL system (BMSQL), it still falls short of domain-expert-level performance.
| Model name | Execution Accuracy | Response Quality Rate |
|---|---|---|
| GPT-o3-mini-baseline | 53.5% | 73.3% |
| GPT-o3-mini-combo | 59.0% | 77.8% |
| BMSQL-GPT-o3-mini | 62.6% | 84.6% |
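To make the size of these improvements explicit, the short sketch below computes each configuration's gain over the baseline in percentage points. The numbers are copied from the table above; the gains_over helper is hypothetical and only does the subtraction.

```python
# (execution accuracy %, response quality rate %), copied from the table above.
results = {
    "GPT-o3-mini-baseline": (53.5, 73.3),
    "GPT-o3-mini-combo": (59.0, 77.8),
    "BMSQL-GPT-o3-mini": (62.6, 84.6),
}


def gains_over(baseline, table):
    """Return percentage-point gains of every other row relative to the baseline row."""
    base_acc, base_quality = table[baseline]
    return {
        name: (round(acc - base_acc, 1), round(quality - base_quality, 1))
        for name, (acc, quality) in table.items()
        if name != baseline
    }


for name, (d_acc, d_quality) in gains_over("GPT-o3-mini-baseline", results).items():
    print(f"{name}: +{d_acc} execution accuracy, +{d_quality} response quality rate")
# prints:
# GPT-o3-mini-combo: +5.5 execution accuracy, +4.5 response quality rate
# BMSQL-GPT-o3-mini: +9.1 execution accuracy, +11.3 response quality rate
```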
License and Contributing
This repository is under the PolyForm Noncommercial License (Version 1.0.0).
To contribute, simply clone the repository and open a pull request. For any bugs or other fixes, feel free to open an issue.
Relevant Citation
@article{koretsky2025biomedsql,
title = {BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases},
author = {Mathew J. Koretsky and Maya Willey and Adi Asija and Owen Bianchi and Chelsea X. Alvarado and Tanay Nayak and Nicole Kuznetsov and Sungwon Kim and Mike A. Nalls and Daniel Khashabi and Faraz Faghri},
year = {2025},
eprint = {2505.20321},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2505.20321},
code = {https://github.com/NIH-CARD/biomedsql},
}