🛡️TrustSQL🛡️: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring

May 14, 2026

TrustSQL is a benchmark for reliable text-to-SQL modeling. It evaluates whether a model can generate correct SQL for feasible questions and abstain from infeasible questions.

Dataset

Place the TrustSQL dataset under dataset/ in the repository root:

TrustSQL/
  dataset/
    atis/
    advising/
    ehrsql/
    spider/

Download the dataset from Google Drive, for example with gdown:

python3 -m pip install gdown
python3 -m gdown -O trustsql_dataset.zip --fuzzy "https://drive.google.com/file/d/19IpLSc2QncO2273E8z-lvU2ME9wEHIan/view?usp=sharing"
unzip trustsql_dataset.zip -x "__MACOSX/*"

If the unzipped directory has a different name, rename or move it so that dataset/atis, dataset/advising, dataset/ehrsql, and dataset/spider exist directly under the repository root.

Each dataset directory contains schema files, SQLite databases, and split files such as {dataset}_train.json, {dataset}_valid.json, and {dataset}_test.json.
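A quick way to confirm the layout after unzipping is to check that the split files are where the scripts will look for them. The sketch below assumes each split file sits directly inside its dataset directory and follows the {dataset}_{split}.json pattern exactly; the function name `missing_split_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Dataset and split names taken from the layout described above.
DATASETS = ("atis", "advising", "ehrsql", "spider")
SPLITS = ("train", "valid", "test")


def missing_split_files(repo_root="."):
    """Return the expected split files that are not present on disk.

    Assumption: files live at dataset/{name}/{name}_{split}.json.
    """
    root = Path(repo_root) / "dataset"
    expected = [
        root / name / f"{name}_{split}.json"
        for name in DATASETS
        for split in SPLITS
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_split_files()
    if missing:
        print("Missing split files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected split files found.")
```

Run it from the repository root. An empty result means all twelve split files were found.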

The expected split sizes are:

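| Dataset | Train | Valid feasible | Valid infeasible | Test feasible | Test infeasible |
| --- | --- | --- | --- | --- | --- |
| ATIS | 1,114 | 489 | 489 | 476 | 476 |
| Advising | 1,170 | 533 | 533 | 533 | 533 |
| EHRSQL | 4,674 | 931 | 931 | 934 | 934 |
| Spider | 7,000 | 507 | 507 | 527 | 527 |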
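The expected sizes can also be kept as a small lookup table and compared against counts obtained from a local copy. The sketch below only does the comparison: how feasible and infeasible examples are marked inside the JSON files is not described here, so the counts must be supplied by the caller. The names `EXPECTED_SIZES` and `size_mismatches` are hypothetical.

```python
# Expected split sizes, copied from the table above.
EXPECTED_SIZES = {
    "atis": {"train": 1114, "valid_feasible": 489, "valid_infeasible": 489,
             "test_feasible": 476, "test_infeasible": 476},
    "advising": {"train": 1170, "valid_feasible": 533, "valid_infeasible": 533,
                 "test_feasible": 533, "test_infeasible": 533},
    "ehrsql": {"train": 4674, "valid_feasible": 931, "valid_infeasible": 931,
               "test_feasible": 934, "test_infeasible": 934},
    "spider": {"train": 7000, "valid_feasible": 507, "valid_infeasible": 507,
               "test_feasible": 527, "test_infeasible": 527},
}


def size_mismatches(dataset, observed):
    """Return {key: (expected, observed)} for every count that differs.

    `observed` maps the same keys as EXPECTED_SIZES[dataset] to counts
    measured on the local files; missing keys are reported as None.
    """
    expected = EXPECTED_SIZES[dataset]
    return {
        key: (want, observed.get(key))
        for key, want in expected.items()
        if observed.get(key) != want
    }
```

An empty dictionary means the local counts agree with the table.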

dataset/ is intentionally ignored by git because it contains large database files and benchmark data.

Setup

Install Python dependencies:

pip install -r requirements.txt

For OpenAI-based experiments, create gpt/api.json:

{
  "API_KEY": "YOUR_API_KEY"
}

gpt/api.json is ignored by git.
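To check that the key file is readable before launching a run, something like the following can be used. It is a minimal sketch that assumes only the file layout shown above; the helper name `load_api_key` is hypothetical, and the repository's own scripts may read the file differently.

```python
import json
from pathlib import Path


def load_api_key(path="gpt/api.json"):
    """Read the API key from the JSON file and reject the placeholder value."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    key = config.get("API_KEY", "")
    if not key or key == "YOUR_API_KEY":
        raise ValueError(f"Set a real API key in {path}")
    return key
```

Failing early here avoids starting a long experiment with the placeholder key still in place.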

Running OpenAI Baselines

Run SQLPrompt:

bash script/run_sqlprompt.sh

Run the pipeline baseline:

bash script/run_clsprompt.sh
bash script/run_errorprompt.sh

Run SQLPrompt with demonstrations or voting:

bash script/run_sqlprompt_demo.sh
bash script/run_sqlprompt_voting.sh

Run the T5 evaluation example after placing checkpoints at the paths specified in the T5 config files:

bash script/run_t5_eval_example.sh

Evaluation

Evaluate generated outputs with:

bash script/evaluate_sqlprompt_cls+error.sh
bash script/evaluate_sqlprompt_demo.sh
bash script/evaluate_sqlprompt_voting.sh
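The evaluation scripts are the authority on how scores are computed. As an illustration of what penalty-based scoring means in general, the sketch below assumes a scheme in which a correct SQL query on a feasible question or an abstention on an infeasible question earns 1, an abstention on a feasible question earns 0, and any other output costs a penalty c. The function name, the record layout, and the penalty values are assumptions made for this example, not taken from the repository.

```python
def reliability_score(records, penalty):
    """Mean penalty-based score over (feasible, abstained, correct) triples.

    Assumed scheme (not read from the evaluation scripts):
      feasible,   SQL correct  -> +1
      feasible,   abstained    ->  0
      feasible,   SQL wrong    -> -penalty
      infeasible, abstained    -> +1
      infeasible, any SQL      -> -penalty
    """
    if not records:
        raise ValueError("records must not be empty")
    total = 0.0
    for feasible, abstained, correct in records:
        if abstained:
            total += 0.0 if feasible else 1.0
        elif feasible and correct:
            total += 1.0
        else:
            total -= penalty
    return total / len(records)


if __name__ == "__main__":
    sample = [
        (True, False, True),    # feasible, correct SQL
        (True, False, False),   # feasible, wrong SQL
        (True, True, False),    # feasible, abstained
        (False, True, False),   # infeasible, abstained
        (False, False, False),  # infeasible, SQL generated anyway
    ]
    for c in (0, 1, 10):
        print(c, reliability_score(sample, penalty=c))
```

Under this scheme, a larger penalty makes abstaining more attractive than guessing, which is the behavior a reliability benchmark is meant to reward.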

Additional Model Code

This repository also includes model code for T5, SQLCoder, and Llama variants. These model implementations may require external checkpoints or generated outputs that are not committed to git.

TriageSQL v1.3/ is only needed for some T5 Mahalanobis or FeatRMD-style background-distribution experiments. It is a large external artifact and is ignored by git.

To obtain it, download the official TriageSQL dataset from the TriageSQL repository. The TrustSQL T5 background-data scripts use the raw TriageSQL dataset, placed as:

TrustSQL/
  TriageSQL v1.3/
    trainset.json
    devset.json
    testset.json

For example:

python3 -m pip install gdown
python3 -m gdown -O TriageSQL.zip --fuzzy "https://drive.google.com/file/d/1w55CaVEuimUlP-jerOCrVHF1iF0FZYKe/view?usp=sharing"
unzip TriageSQL.zip
mv TriageSQL "TriageSQL v1.3"

If the unzipped directory has a different name, rename it so that TriageSQL v1.3/trainset.json, TriageSQL v1.3/devset.json, and TriageSQL v1.3/testset.json exist directly under the repository root.

Citation

@article{lee2024trustsql,
  title={TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring},
  author={Lee, Gyubok and Chay, Woosog and Cho, Seonhee and Choi, Edward},
  journal={arXiv preprint arXiv:2403.15879},
  year={2024}
}