FusionSQL
May 31, 2026
Text2SQL evaluation, FusionDataset construction, and shift-aware regression for Text-to-SQL.

Citation
@inproceedings{fusionsql,
author = {Trinh Pham and Thanh Tam Nguyen and Viet Huynh and Hongzhi Yin and Quoc Viet Hung Nguyen},
title = {An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data},
booktitle = {ICDE},
publisher = {IEEE},
year = {2026},
}
What is it?
FusionSQL provides:
- A portable evaluator that reports execution accuracy for Spider, Spider 2.0, BIRD, SParC, CoSQL, and WikiSQL.
- A pipeline to construct a synthetic FusionDataset of databases, SQLs, and paraphrased questions.
- Shift descriptors (Frechet-like, Mahalanobis, Sliced-Wasserstein) between a target workload and the training set.
- An MLP regressor that learns to predict execution accuracy for a given base model with minimal MAE.
All metrics and reports here are based on execution accuracy by design.
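As a rough illustration of what execution accuracy means, the sketch below runs each gold and predicted query against the same SQLite database and counts a hit when both return the same multiset of rows. It is a minimal sketch, not the evaluator's actual code: the helper names `run_query` and `execution_accuracy`, the toy `singer` table, and the order-insensitive comparison are all assumptions made for the example.

```python
import sqlite3
from collections import Counter

def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)

def execution_accuracy(conn, pairs):
    """Fraction of (gold_sql, pred_sql) pairs whose results match."""
    hits = 0
    for gold_sql, pred_sql in pairs:
        gold = run_query(conn, gold_sql)
        pred = run_query(conn, pred_sql)
        if gold is not None and gold == pred:
            hits += 1
    return hits / len(pairs) if pairs else 0.0

# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 41), (3, 'Cy', 25);
""")
pairs = [
    # Different SQL text, same rows: counts as correct.
    ("SELECT name FROM singer WHERE age > 28", "SELECT name FROM singer WHERE age >= 30"),
    # Different results (3 vs 2): counts as wrong.
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(id) FROM singer WHERE age > 28"),
    # Prediction fails to execute: counts as wrong.
    ("SELECT name FROM singer", "SELECT nam FROM singer"),
]
acc = execution_accuracy(conn, pairs)
print(f"ExecAcc = {acc:.3f}")
```

Comparing multisets ignores row order, which may be too lenient for queries with ORDER BY; a stricter comparison would keep the row lists as they are.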

Project layout
- fusion_evaluator/
  - data/: dataset loaders and adapters
  - sql/: SQL normalization and parsing (sqlglot)
  - exec/: SQLite execution with caching
  - metrics/: execution
  - evaluator.py: orchestrator
  - cli.py: evaluation entrypoint
- figure/: diagrams
- outputs/: reports and caches
Getting Started
0) Dependencies
- Python 3.10+
- SQLite (comes with Python stdlib sqlite3)
- Recommended: a GPU with CUDA if embedding large datasets
Install Python dependencies:
pip install -r requirements.txt
Torch wheels differ by platform/GPU. If the default install fails or is slow, install a matching build from the official site: PyTorch Install.
1) Datasets and expected layout
- Spider / Spider 2.0 / BIRD / SParC / CoSQL
  - Gold and predictions are JSON/JSONL with fields: question, query (gold) or prediction (pred), and db_id.
  - Databases are under db_root/DBID/DB.sqlite.
- WikiSQL
  - Gold/pred are JSONL; tables file is tables.jsonl (id, header, rows).
  - We materialize one SQLite per table into an output directory.
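To make the WikiSQL step concrete, here is a minimal sketch of turning a tables.jsonl file (records with id, header, rows) into one SQLite file per table. It is not the project's own materializer: the function name `materialize_tables`, the fixed table name `data`, and the `<id>.sqlite` file naming are assumptions for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def materialize_tables(tables_jsonl, out_dir):
    """Write one SQLite file per table record; return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    with open(tables_jsonl, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            table = json.loads(line)
            db_path = out_dir / f"{table['id']}.sqlite"  # assumed naming
            db_path.unlink(missing_ok=True)  # start from a clean file
            # Quote identifiers, since headers may contain spaces or punctuation.
            cols = ", ".join('"' + h.replace('"', '""') + '"' for h in table["header"])
            marks = ", ".join("?" for _ in table["header"])
            conn = sqlite3.connect(db_path)
            with conn:  # commits on success
                conn.execute(f"CREATE TABLE data ({cols})")  # "data" is an assumed name
                conn.executemany(f"INSERT INTO data VALUES ({marks})", table["rows"])
            conn.close()
            created.append(db_path)
    return created

# Tiny stand-in for tables.jsonl (made-up content).
tmp = Path(tempfile.mkdtemp())
tables_file = tmp / "tables.jsonl"
record = {"id": "1-100", "header": ["Player", "Team"], "rows": [["Ann", "Red"], ["Bob", "Blue"]]}
tables_file.write_text(json.dumps(record) + "\n", encoding="utf-8")

paths = materialize_tables(tables_file, tmp / "wikisql")
conn = sqlite3.connect(paths[0])
row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
conn.close()
print(paths[0].name, row_count)
```

Tables with duplicate header names would fail at CREATE TABLE in this sketch; a real materializer would need to deduplicate column names first.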
Download links:
- Spider: Project page
- Spider 2.0: Project page
- BIRD: Project page
- SParC: Project page
- CoSQL: Project page
- WikiSQL: GitHub
Place gold/pred files accordingly and provide --db_root pointing to per-DB folders with DB.sqlite for Spider/Spider2/BIRD/SParC/CoSQL.
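Before running the evaluator, it can help to check that every record has the expected fields and that its database file exists. The sketch below does that under stated assumptions: the helper names are hypothetical, and it assumes each database file is named after its db_id (db_root/<db_id>/<db_id>.sqlite, as in the Spider release); adjust the path rule if your layout differs.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_GOLD = {"question", "query", "db_id"}
REQUIRED_PRED = {"question", "prediction", "db_id"}

def load_records(path):
    """Load a .json (list of objects) or .jsonl (one object per line) file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)

def check_layout(records, required, db_root):
    """Return human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for i, rec in enumerate(records):
        missing = required - set(rec)
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
            continue
        # Assumed naming rule: db_root/<db_id>/<db_id>.sqlite
        db_file = Path(db_root) / rec["db_id"] / f"{rec['db_id']}.sqlite"
        if not db_file.is_file():
            problems.append(f"record {i}: no database at {db_file}")
    return problems

# Small self-contained demo with made-up records.
root = Path(tempfile.mkdtemp())
(root / "db" / "concert").mkdir(parents=True)
(root / "db" / "concert" / "concert.sqlite").touch()
gold_file = root / "gold.jsonl"
gold_file.write_text(
    json.dumps({"question": "How many singers?", "query": "SELECT COUNT(*) FROM singer", "db_id": "concert"}) + "\n"
    + json.dumps({"question": "List pets.", "query": "SELECT * FROM pets", "db_id": "pets"}) + "\n",
    encoding="utf-8",
)
problems = check_layout(load_records(gold_file), REQUIRED_GOLD, root / "db")
for p in problems:
    print(p)
```

Passing REQUIRED_PRED instead of REQUIRED_GOLD applies the same check to a predictions file.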
FusionDataset
Construct a synthetic, diverse dataset from CSV sources:
python -m fusion_evaluator.fusion_dataset.cli \
--sources /path/to/csv_sources ... \
--out_root outputs/fusion_dataset \
--max_tables 1000
Optional LLM-driven question generation and rewrites (provide both to enable):
python -m fusion_evaluator.fusion_dataset.cli \
--sources /path/to/csv_sources \
--out_root outputs/fusion_dataset \
--prompts fusion_evaluator/fusion_dataset/prompts.yaml \
--hf_model Qwen/Qwen2.5-72B-Instruct \
--device cuda --torch_dtype fp16 \
--q_per_sql 4 \
--enable_rewrites --rw_per_cat 2
This will:
- acquire CSVs, filter tables (language, structure, near-dup),
- synthesize relational DBs (SQLite under outputs/fusion_dataset/databases),
- generate SQLs and paraphrased questions with distractors (LLM-backed if provided),
- optionally produce rewritten Q/A pairs for semantic rewriting, numeric condition transforms, and query logic adjustments,
- write outputs/fusion_dataset/fusion_dataset.jsonl.
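The near-dup filter in the first step can be pictured with a simple header-overlap rule. The sketch below is only one plausible way to do it, not the pipeline's actual filter: it assumes tables are compared by the Jaccard similarity of their normalized column names, and both the function names and the 0.8 threshold are made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def drop_near_duplicates(tables, threshold=0.8):
    """Greedily keep tables whose header set is not too similar to a kept one."""
    kept = []
    for name, header in tables:
        cols = {h.strip().lower() for h in header}
        if all(jaccard(cols, kept_cols) < threshold for _, kept_cols in kept):
            kept.append((name, cols))
    return [name for name, _ in kept]

# Hypothetical CSV sources with their header rows.
tables = [
    ("sales_2023.csv", ["Date", "Region", "Amount", "Units"]),
    ("sales_2024.csv", ["date", "region", "amount", "units "]),
    ("employees.csv", ["Name", "Dept", "Salary"]),
]
kept = drop_near_duplicates(tables)
print(kept)
```

Here the second sales file is dropped because its normalized header set is identical to the first one.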
FusionSQL
We embed SQLs (or questions) with a Hugging Face model, compute shift descriptors between a training workload and FusionDataset, and fit an MLP to predict execution accuracy.
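For intuition, the three descriptor families named earlier (Frechet-like, Mahalanobis, Sliced-Wasserstein) can be sketched with NumPy on two embedding matrices of shape (n, d). These are textbook formulations written under our own assumptions; the function names, the ridge term, and the quantile-based slicing are not taken from the FusionSQL code, which may compute its descriptors differently.

```python
import numpy as np

def frechet_distance(src, tgt):
    """Frechet distance between Gaussians fitted to the two embedding sets."""
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False)
    cov_t = np.cov(tgt, rowvar=False)
    # Tr((cov_s cov_t)^(1/2)) is the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_s @ cov_t).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(np.sum((mu_s - mu_t) ** 2) + np.trace(cov_s) + np.trace(cov_t) - 2.0 * tr_sqrt)

def mahalanobis_shift(src, tgt, ridge=1e-6):
    """Mahalanobis distance of the target mean from the source distribution."""
    diff = tgt.mean(axis=0) - src.mean(axis=0)
    cov = np.cov(src, rowvar=False) + ridge * np.eye(src.shape[1])
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def sliced_wasserstein(src, tgt, n_slices=34, n_quantiles=100, seed=0):
    """Average 1-D Wasserstein-1 distance over random unit projections."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_slices, src.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    qs = np.linspace(0.0, 1.0, n_quantiles)
    # Compare quantile functions so the two sets may differ in size.
    q_src = np.quantile(src @ dirs.T, qs, axis=0)
    q_tgt = np.quantile(tgt @ dirs.T, qs, axis=0)
    return float(np.mean(np.abs(q_src - q_tgt)))

# Synthetic stand-ins for embeddings: one unshifted target, one shifted target.
rng = np.random.default_rng(0)
source = rng.normal(size=(400, 8))
target_same = rng.normal(size=(400, 8))
target_shifted = rng.normal(size=(300, 8)) + 1.5

for name, fn in [("frechet", frechet_distance), ("mahalanobis", mahalanobis_shift), ("swd", sliced_wasserstein)]:
    print(name, round(fn(source, target_same), 3), round(fn(source, target_shifted), 3))
```

Each descriptor should come out clearly larger for the shifted target than for the unshifted one, which is the signal a downstream regressor can learn from.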
1) Compute embeddings directly
python -m fusion_evaluator.evaluator_training.cli embed \
--input outputs/fusion_dataset/fusion_dataset.jsonl \
--output outputs/fusion_dataset/fusion_emb.npy \
--model Qwen/Qwen2.5-72B-Instruct \
--field sql \
--device cuda \
--batch_size 8 \
--max_length 256 \
--torch_dtype fp16
You can pass any compatible encoder from Hugging Face. Common choices include:
- Qwen/Qwen2.5-72B-Instruct
- meta-llama/Llama-3.1-70B-Instruct
- deepseek-ai/deepseek-coder-33b-instruct
- XGenerationLab/XiYanSQL-QwenCoder-14B-2502
- cycloneboy/CscSQL-Grpo-Qwen2.5-Coder-7B-Instruct
2) Train the regressor (from precomputed embeddings)
python -m fusion_evaluator.evaluator_training.cli train \
--source_embeddings path/to/source.npy \
--target_embeddings path/to/fusion.npy \
--observed_metric 0.712 \
--slices 34 \
--hybrid_swd --pca_k 10 --rand_r 24 --pca_subsample 8192 \
--out outputs/regressor.joblib
3) End-to-end training with FusionDataset
python -m fusion_evaluator.evaluator_training.pipeline train \
--dataset spider \
--gold path/to/spider_dev_gold.json \
--pred path/to/spider_dev_preds.jsonl \
--db_root path/to/spider/database \
--fusion_jsonl outputs/fusion_dataset/fusion_dataset.jsonl \
--exec_accuracy 0.712 \
--model_name Qwen/Qwen2.5-72B-Instruct \
--hybrid_swd --pca_k 10 --rand_r 24 --pca_subsample 8192 \
--slices 34 \
--out outputs/regressor_spider_qwen.joblib
4) Inference with FusionSQL
python -m fusion_evaluator.evaluator_training.pipeline infer \
--dataset spider \
--gold path/to/spider_dev_gold.json \
--pred path/to/spider_dev_preds.jsonl \
--db_root path/to/spider/database \
--fusion_jsonl outputs/fusion_dataset/fusion_dataset.jsonl \
--model_name Qwen/Qwen2.5-72B-Instruct \
--hybrid_swd --pca_k 10 --rand_r 24 --pca_subsample 8192 \
--slices 34 \
--model outputs/regressor_spider_qwen.joblib
The regressor predicts execution accuracy for the target workload and chosen base model.
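When labels are available for a sanity check, a predicted accuracy can be compared with the measured one using mean absolute error (the MAE mentioned in the overview), expressed in percentage points. The sketch below uses made-up numbers and a hypothetical helper name; it only illustrates the arithmetic.

```python
def mean_absolute_error_pp(predicted, observed):
    """MAE between predicted and observed accuracies, in percentage points."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return 100.0 * sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical predictions vs. measured execution accuracy on three workloads.
predicted = [0.700, 0.650, 0.580]
observed = [0.712, 0.640, 0.600]
mae = mean_absolute_error_pp(predicted, observed)
print(f"MAE = {mae:.1f} pp")
```

For these values the absolute errors are 1.2, 1.0, and 2.0 points, so the MAE is 1.4 percentage points.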
5) Sampling-based shift + true execution accuracy (small example)
This helper script repeatedly samples target subsets (e.g., 500 examples), computes shift descriptors between the training workload and each subset, then estimates true execution accuracy by generating SQL with a model and executing against the databases. It saves the 100 shift vectors and their accuracies, then fits a 3-layer MLP regressor.
Example (BIRD dev):
python -m fusion_evaluator.evaluator_training.shift_sampling_train \
--db_root fusion_evaluator/data/bird/dev/dev_databases \
--source fusion_evaluator/data/spider/sft_spider_train_text2sql.json \
--target fusion_evaluator/data/bird/sft_bird_dev_text2sql.json \
--target_limit 500 \
--num_sets 100 \
--seed 0 \
--device cuda --batch_size 8 --torch_dtype fp16
What it does:
- Builds prompts from question + schema (same format as shift_from_json.py).
- Uses Qwen/Qwen2.5-3B-Instruct to generate SQL.
- Computes execution accuracy by running SQL against SQLite databases under --db_root.
- Samples 100 subsets of size 500 (no replacement per subset).
- Computes 100 shift vectors and their 100 accuracies.
- Trains a 3-layer MLP regressor (256, 128, 64) on these vectors.
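The subset-sampling step can be sketched as follows: each subset is drawn without replacement, while different subsets may overlap. This is an illustration only; the function name is hypothetical and the target-set size of 2000 is a made-up number, with the other arguments mirroring --target_limit, --num_sets, and --seed.

```python
import random

def sample_subsets(n_target, subset_size, num_sets, seed=0):
    """Draw num_sets index subsets, each without replacement, reproducibly."""
    if subset_size > n_target:
        raise ValueError("subset_size cannot exceed the target set size")
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_target), subset_size)) for _ in range(num_sets)]

# 100 subsets of 500 examples from a hypothetical target set of 2000 examples.
subsets = sample_subsets(n_target=2000, subset_size=500, num_sets=100, seed=0)
print(len(subsets), len(subsets[0]))
```

Fixing the seed makes the drawn subsets reproducible across runs, so shift vectors and accuracies can be matched back to the same examples.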
Outputs:
- outputs/shift_samples/shift_samples.npz containing:
  - deltas: (num_sets, 5) shift vectors
  - accuracies: (num_sets,) true execution accuracies
  - sample_indices: (num_sets, target_limit) indices into the target set
- outputs/shift_samples/shift_mlp.joblib trained regressor
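The saved arrays can be inspected with NumPy. The sketch below writes a stand-in .npz with the documented keys and shapes (random values, not real results) and then loads it back; `load_shift_samples` is a hypothetical helper, and for a real run you would point it at outputs/shift_samples/shift_samples.npz instead.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_shift_samples(path):
    """Load deltas, accuracies, and sample_indices and check they line up."""
    with np.load(path) as data:
        deltas = data["deltas"]
        accuracies = data["accuracies"]
        sample_indices = data["sample_indices"]
    if not (len(deltas) == len(accuracies) == len(sample_indices)):
        raise ValueError("arrays disagree on num_sets")
    return deltas, accuracies, sample_indices

# Stand-in file with the documented keys; the values are random placeholders.
tmp_file = Path(tempfile.mkdtemp()) / "shift_samples.npz"
rng = np.random.default_rng(0)
np.savez(
    tmp_file,
    deltas=rng.normal(size=(100, 5)),
    accuracies=rng.uniform(0.3, 0.7, size=100),
    sample_indices=rng.integers(0, 1500, size=(100, 500)),
)

deltas, accuracies, sample_indices = load_shift_samples(tmp_file)
hardest = int(np.argmin(accuracies))  # subset with the lowest accuracy
print(deltas.shape, accuracies.shape, sample_indices.shape, hardest)
```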
Notes:
- For Spider, set --db_root to fusion_evaluator/data/spider/database (or test_database if needed).
- If you want to use a different generation model, set --model.
- To embed with a different model than generation, set --embed_model.
Additional usage (Spider, Spider2, BIRD, SParC, CoSQL, WikiSQL)
# Spider
python -m fusion_evaluator.cli \
--dataset spider \
--gold path/to/dev_gold.json \
--pred path/to/predictions.jsonl \
--db_root path/to/spider/database \
--out outputs/spider_report.json
# Spider 2.0
python -m fusion_evaluator.cli \
--dataset spider2 \
--gold path/to/spider2_gold.json \
--pred path/to/spider2_preds.jsonl \
--db_root path/to/spider2/database \
--out outputs/spider2_report.json
# BIRD
python -m fusion_evaluator.cli \
--dataset bird \
--gold path/to/bird_gold.jsonl \
--pred path/to/bird_preds.jsonl \
--db_root path/to/bird/database \
--out outputs/bird_report.json
# SParC
python -m fusion_evaluator.cli \
--dataset sparc \
--gold path/to/sparc_dev.json \
--pred path/to/preds.jsonl \
--db_root path/to/spider/database \
--out outputs/sparc_report.json
# CoSQL
python -m fusion_evaluator.cli \
--dataset cosql \
--gold path/to/cosql_dev.json \
--pred path/to/preds.jsonl \
--db_root path/to/spider/database \
--out outputs/cosql_report.json
# WikiSQL
python -m fusion_evaluator.cli \
--dataset wikisql \
--gold path/to/wikisql_gold.jsonl \
--pred path/to/wikisql_preds.jsonl \
--wikisql_tables path/to/tables.jsonl \
--wikisql_db_out databases/wikisql \
--out outputs/wikisql_report.json
Output:
- JSON report at --out with summary and per-sample metrics.
- Console table: ExecAcc.
Reported Results
FusionSQL-TL denotes FusionSQL Transfer Learning. FusionSQL-ML denotes FusionSQL Meta-learning.
Each cell reports mean ± 95% CI in percentage points. Best is in bold, second-best is underlined.
| Transfer | Method | Qwen2.5-72B | Llama-3.1-70B | DeepSeek-33B | XiYanSQL-14B | CSC-SQL-7B | Avg. |
|---|---|---|---|---|---|---|---|
| Spider → BIRD | ATC-MC | 13.9 ± 1.1 | 14.6 ± 1.2 | 15.2 ± 1.2 | 17.4 ± 1.4 | 18.3 ± 1.5 | 15.9 ± 1.3 |
| | ATC-NE | 15.0 ± 1.2 | 15.7 ± 1.3 | 16.5 ± 1.3 | 18.6 ± 1.5 | 19.8 ± 1.6 | 17.1 ± 1.4 |
| | DoC (τ=0.8) | 15.5 ± 1.3 | 16.0 ± 1.3 | 17.3 ± 1.4 | 19.2 ± 1.6 | 20.5 ± 1.6 | 17.7 ± 1.4 |
| | DoC (τ=0.9) | 16.7 ± 1.4 | 17.3 ± 1.4 | 18.6 ± 1.5 | 20.3 ± 1.7 | 21.7 ± 1.7 | 18.9 ± 1.5 |
| | PseAutoEval | 11.6 ± 0.9 | 12.2 ± 1.0 | 13.1 ± 1.0 | 15.1 ± 1.2 | 16.3 ± 1.3 | 13.7 ± 1.1 |
| | BugJudge | 14.8 ± 1.2 | 15.4 ± 1.2 | 16.2 ± 1.3 | 18.1 ± 1.4 | 19.0 ± 1.5 | 16.7 ± 1.3 |
| | ArenaCmp | 9.7 ± 0.8 | 10.4 ± 0.9 | 11.2 ± 0.9 | 12.6 ± 1.0 | 13.5 ± 1.1 | 11.5 ± 0.9 |
| | FusionSQL-TL | 3.4 ± 1.2 | 4.0 ± 1.2 | 4.6 ± 1.3 | 5.2 ± 1.4 | 5.6 ± 1.4 | 4.6 ± 1.3 |
| | FusionSQL (Ours) | 3.1 ± 0.5 | 3.7 ± 0.5 | 4.2 ± 0.6 | 4.8 ± 0.7 | 5.1 ± 0.7 | 4.2 ± 0.6 |
| WikiSQL → Spider | ATC-MC | 12.2 ± 1.0 | 13.1 ± 1.1 | 13.8 ± 1.2 | 15.2 ± 1.3 | 16.1 ± 1.4 | 14.1 ± 1.2 |
| | ATC-NE | 13.4 ± 1.1 | 14.0 ± 1.2 | 15.1 ± 1.3 | 16.3 ± 1.4 | 17.5 ± 1.5 | 15.3 ± 1.3 |
| | DoC (τ=0.8) | 14.6 ± 1.2 | 15.3 ± 1.3 | 16.5 ± 1.4 | 17.8 ± 1.5 | 19.0 ± 1.6 | 16.6 ± 1.4 |
| | DoC (τ=0.9) | 15.8 ± 1.3 | 16.4 ± 1.3 | 17.7 ± 1.4 | 19.1 ± 1.6 | 20.3 ± 1.6 | 17.9 ± 1.4 |
| | PseAutoEval | 11.1 ± 0.9 | 11.8 ± 1.0 | 12.6 ± 1.0 | 13.7 ± 1.1 | 14.9 ± 1.2 | 12.8 ± 1.0 |
| | BugJudge | 13.6 ± 1.1 | 14.2 ± 1.1 | 15.1 ± 1.2 | 16.5 ± 1.3 | 17.6 ± 1.4 | 15.4 ± 1.2 |
| | ArenaCmp | 9.2 ± 0.8 | 9.9 ± 0.8 | 10.7 ± 0.9 | 12.0 ± 1.0 | 12.8 ± 1.0 | 10.9 ± 0.9 |
| | FusionSQL-TL | 3.6 ± 1.2 | 4.1 ± 1.2 | 4.7 ± 1.3 | 5.1 ± 1.3 | 5.6 ± 1.4 | 4.6 ± 1.3 |
| | FusionSQL (Ours) | 3.2 ± 0.5 | 3.8 ± 0.5 | 4.3 ± 0.6 | 4.7 ± 0.7 | 5.2 ± 0.6 | 4.2 ± 0.6 |
| SParC → CoSQL (in-domain) | ATC-MC | 6.5 ± 0.6 | 7.2 ± 0.7 | 7.8 ± 0.8 | 8.3 ± 0.8 | 9.0 ± 0.9 | 7.8 ± 0.8 |
| | ATC-NE | 7.1 ± 0.6 | 7.8 ± 0.7 | 8.4 ± 0.7 | 9.0 ± 0.8 | 9.6 ± 0.9 | 8.4 ± 0.7 |
| | DoC (τ=0.8) | 7.7 ± 0.6 | 8.3 ± 0.7 | 8.8 ± 0.7 | 9.3 ± 0.8 | 9.9 ± 0.8 | 8.8 ± 0.7 |
| | DoC (τ=0.9) | 8.8 ± 0.7 | 9.3 ± 0.7 | 9.8 ± 0.8 | 10.4 ± 0.9 | 10.9 ± 0.9 | 9.8 ± 0.8 |
| | PseAutoEval | 5.5 ± 0.5 | 6.1 ± 0.5 | 6.7 ± 0.6 | 7.2 ± 0.6 | 7.8 ± 0.7 | 6.7 ± 0.6 |
| | BugJudge | 6.1 ± 0.6 | 6.7 ± 0.6 | 7.3 ± 0.7 | 7.9 ± 0.7 | 8.4 ± 0.8 | 7.3 ± 0.7 |
| | ArenaCmp | 3.9 ± 0.4 | 4.4 ± 0.4 | 4.9 ± 0.5 | 5.4 ± 0.5 | 5.9 ± 0.5 | 4.9 ± 0.5 |
| | FusionSQL-TL | 1.5 ± 1.2 | 1.7 ± 1.2 | 2.0 ± 1.3 | 2.2 ± 1.3 | 2.4 ± 1.3 | 2.0 ± 1.3 |
| | FusionSQL (Ours) | 1.6 ± 0.3 | 1.8 ± 0.3 | 2.1 ± 0.3 | 2.3 ± 0.4 | 2.5 ± 0.4 | 2.1 ± 0.3 |
| Spider → SynSQL-2.5M | ATC-MC | 10.9 ± 0.9 | 11.7 ± 1.0 | 12.3 ± 1.0 | 13.8 ± 1.1 | 14.7 ± 1.2 | 12.7 ± 1.0 |
| | ATC-NE | 12.1 ± 1.0 | 12.9 ± 1.1 | 13.5 ± 1.1 | 14.9 ± 1.2 | 15.8 ± 1.3 | 13.8 ± 1.1 |
| | DoC (τ=0.8) | 12.9 ± 1.0 | 13.6 ± 1.1 | 14.7 ± 1.2 | 16.0 ± 1.3 | 17.2 ± 1.4 | 14.9 ± 1.2 |
| | DoC (τ=0.9) | 14.1 ± 1.1 | 14.8 ± 1.2 | 15.9 ± 1.3 | 17.2 ± 1.4 | 18.4 ± 1.5 | 16.1 ± 1.3 |
| | PseAutoEval | 9.5 ± 0.8 | 10.1 ± 0.9 | 10.8 ± 0.9 | 12.0 ± 1.0 | 13.1 ± 1.1 | 11.1 ± 0.9 |
| | BugJudge | 12.4 ± 1.0 | 13.2 ± 1.1 | 14.0 ± 1.1 | 15.5 ± 1.2 | 16.6 ± 1.3 | 14.3 ± 1.1 |
| | ArenaCmp | 8.4 ± 0.7 | 9.1 ± 0.8 | 9.8 ± 0.8 | 11.1 ± 0.9 | 11.9 ± 1.0 | 10.1 ± 0.8 |
| | FusionSQL-TL | 3.1 ± 1.2 | 3.5 ± 1.2 | 4.0 ± 1.3 | 4.4 ± 1.3 | 4.9 ± 1.4 | 4.0 ± 1.3 |
| | FusionSQL (Ours) | 2.8 ± 0.4 | 3.2 ± 0.5 | 3.7 ± 0.5 | 4.1 ± 0.6 | 4.5 ± 0.6 | 3.7 ± 0.5 |
| WikiSQL → Spider 2.0 | ATC-MC | 18.0 ± 1.5 | 18.7 ± 1.5 | 19.6 ± 1.6 | 21.0 ± 1.7 | 22.2 ± 1.8 | 19.9 ± 1.6 |
| | ATC-NE | 19.4 ± 1.6 | 20.1 ± 1.7 | 21.3 ± 1.8 | 22.6 ± 1.9 | 23.9 ± 2.0 | 21.5 ± 1.8 |
| | DoC (τ=0.8) | 20.5 ± 1.7 | 21.3 ± 1.8 | 22.7 ± 1.9 | 24.0 ± 2.0 | 25.4 ± 2.1 | 22.8 ± 1.9 |
| | DoC (τ=0.9) | 21.7 ± 1.8 | 22.5 ± 1.9 | 23.9 ± 2.0 | 25.2 ± 2.1 | 26.6 ± 2.2 | 23.9 ± 2.0 |
| | PseAutoEval | 16.3 ± 1.3 | 17.0 ± 1.4 | 17.7 ± 1.4 | 18.8 ± 1.5 | 20.1 ± 1.6 | 18.0 ± 1.4 |
| | BugJudge | 17.3 ± 1.4 | 18.1 ± 1.5 | 19.3 ± 1.6 | 20.7 ± 1.7 | 22.0 ± 1.8 | 19.5 ± 1.6 |
| | ArenaCmp | 12.6 ± 1.0 | 13.4 ± 1.1 | 14.5 ± 1.2 | 15.8 ± 1.3 | 16.9 ± 1.4 | 14.6 ± 1.2 |
| | FusionSQL-TL | 4.5 ± 1.3 | 5.1 ± 1.4 | 5.6 ± 1.4 | 6.1 ± 1.5 | 6.6 ± 1.5 | 5.6 ± 1.4 |
| | FusionSQL (Ours) | 4.2 ± 0.6 | 4.8 ± 0.7 | 5.3 ± 0.7 | 5.8 ± 0.8 | 6.3 ± 0.8 | 5.3 ± 0.7 |
Columns are the unseen model pool. Each cell reports mean ± 95% CI in percentage points. Best is in bold.
| Transfer | Method | CodeLlama-34B | StarCoder2-15B | Mistral-7B | DeepSeek-Coder-6.7B | Phi-3-mini | Avg. |
|---|---|---|---|---|---|---|---|
| Spider → BIRD | BugJudge | 13.8 ± 1.0 | 13.5 ± 1.1 | 14.0 ± 1.0 | 13.9 ± 0.9 | 13.6 ± 1.0 | 13.8 ± 1.0 |
| | ArenaCmp | 11.1 ± 0.8 | 10.8 ± 0.9 | 11.4 ± 0.9 | 11.2 ± 0.9 | 10.9 ± 0.8 | 11.1 ± 0.9 |
| | FusionSQL-ML (Ours) | 6.7 ± 0.5 | 6.5 ± 0.6 | 6.8 ± 0.7 | 6.7 ± 0.6 | 6.6 ± 0.5 | 6.7 ± 0.6 |
| WikiSQL → Spider | BugJudge | 12.7 ± 1.0 | 12.4 ± 1.1 | 12.9 ± 1.0 | 12.8 ± 0.9 | 12.5 ± 1.0 | 12.7 ± 1.0 |
| | ArenaCmp | 10.4 ± 0.8 | 10.1 ± 0.9 | 10.6 ± 1.0 | 10.4 ± 0.9 | 10.2 ± 0.8 | 10.3 ± 0.9 |
| | FusionSQL-ML (Ours) | 6.0 ± 0.4 | 5.8 ± 0.5 | 6.1 ± 0.6 | 6.0 ± 0.5 | 5.9 ± 0.4 | 6.0 ± 0.5 |
| SParC → CoSQL | BugJudge | 11.5 ± 0.8 | 11.3 ± 0.9 | 11.6 ± 1.0 | 11.5 ± 0.9 | 11.2 ± 0.8 | 11.4 ± 0.9 |
| | ArenaCmp | 9.6 ± 0.7 | 9.4 ± 0.8 | 9.7 ± 0.9 | 9.6 ± 0.8 | 9.3 ± 0.7 | 9.5 ± 0.8 |
| | FusionSQL-ML (Ours) | 5.1 ± 0.4 | 4.9 ± 0.5 | 5.1 ± 0.6 | 5.0 ± 0.5 | 4.9 ± 0.4 | 5.0 ± 0.5 |
| Spider → SynSQL-2.5M | BugJudge | 13.3 ± 1.0 | 13.0 ± 1.1 | 13.4 ± 1.0 | 13.2 ± 0.9 | 13.1 ± 1.0 | 13.2 ± 1.0 |
| | ArenaCmp | 10.9 ± 0.8 | 10.6 ± 0.9 | 11.0 ± 1.0 | 10.9 ± 0.9 | 10.7 ± 0.8 | 10.8 ± 0.9 |
| | FusionSQL-ML (Ours) | 6.5 ± 0.5 | 6.3 ± 0.6 | 6.6 ± 0.7 | 6.5 ± 0.6 | 6.4 ± 0.5 | 6.5 ± 0.6 |
| WikiSQL → Spider 2.0 | BugJudge | 14.6 ± 1.0 | 14.2 ± 1.1 | 14.7 ± 1.2 | 14.5 ± 1.1 | 14.3 ± 1.0 | 14.5 ± 1.1 |
| | ArenaCmp | 12.0 ± 0.9 | 11.7 ± 1.0 | 12.1 ± 1.1 | 12.0 ± 1.0 | 11.8 ± 0.9 | 11.9 ± 1.0 |
| | FusionSQL-ML (Ours) | 7.0 ± 0.5 | 6.8 ± 0.6 | 7.1 ± 0.7 | 7.0 ± 0.6 | 6.9 ± 0.5 | 7.0 ± 0.6 |
Each cell reports mean ± 95% CI in percentage points. Best is in bold, second-best is underlined.
| Dataset | Method | ATHENA | ATHENA++ | SQLizer | Avg. |
|---|---|---|---|---|---|
| Spider | BugJudge | 12.0 ± 1.0 | 11.8 ± 0.9 | 12.1 ± 1.0 | 12.0 ± 1.0 |
| | ArenaCmp | 10.8 ± 0.9 | 10.6 ± 0.8 | 10.9 ± 0.9 | 10.8 ± 0.9 |
| | FusionSQL-TL | 14.3 ± 1.1 | 14.1 ± 1.1 | 14.4 ± 1.2 | 14.3 ± 1.2 |
| | FusionSQL-LLM | 12.8 ± 1.1 | 12.6 ± 1.0 | 12.9 ± 1.1 | 12.8 ± 1.1 |
| | FusionSQL | 8.3 ± 0.6 | 8.2 ± 0.7 | 8.4 ± 0.8 | 8.3 ± 0.7 |
| Spider 2.0 | BugJudge | 12.8 ± 1.1 | 12.6 ± 1.0 | 12.9 ± 1.1 | 12.8 ± 1.1 |
| | ArenaCmp | 11.6 ± 0.8 | 11.4 ± 0.9 | 11.7 ± 1.0 | 11.6 ± 0.9 |
| | FusionSQL-TL | 15.1 ± 1.1 | 14.9 ± 1.2 | 15.2 ± 1.3 | 15.1 ± 1.2 |
| | FusionSQL-LLM | 13.6 ± 1.0 | 13.4 ± 1.0 | 13.7 ± 1.2 | 13.6 ± 1.1 |
| | FusionSQL | 9.0 ± 0.6 | 8.9 ± 0.7 | 9.1 ± 0.8 | 9.0 ± 0.7 |
| SynSQL-2.5M | BugJudge | 13.0 ± 1.1 | 12.8 ± 1.1 | 13.1 ± 1.1 | 13.0 ± 1.1 |
| | ArenaCmp | 11.8 ± 0.8 | 11.6 ± 0.9 | 11.9 ± 1.0 | 11.8 ± 0.9 |
| | FusionSQL-TL | 15.3 ± 1.2 | 15.1 ± 1.2 | 15.4 ± 1.3 | 15.3 ± 1.3 |
| | FusionSQL-LLM | 13.7 ± 1.1 | 13.5 ± 1.1 | 13.8 ± 1.2 | 13.7 ± 1.1 |
| | FusionSQL | 9.1 ± 0.6 | 9.0 ± 0.7 | 9.2 ± 0.8 | 9.1 ± 0.7 |
| CoSQL | BugJudge | 11.5 ± 0.8 | 11.3 ± 0.9 | 11.6 ± 1.0 | 11.5 ± 0.9 |
| | ArenaCmp | 10.2 ± 0.8 | 10.0 ± 0.7 | 10.3 ± 0.8 | 10.2 ± 0.8 |
| | FusionSQL-TL | 13.8 ± 1.1 | 13.6 ± 1.1 | 13.9 ± 1.2 | 13.8 ± 1.2 |
| | FusionSQL-LLM | 12.3 ± 1.1 | 12.1 ± 1.0 | 12.4 ± 1.1 | 12.3 ± 1.1 |
| | FusionSQL | 7.9 ± 0.5 | 7.8 ± 0.6 | 8.0 ± 0.7 | 7.9 ± 0.6 |
| BIRD | BugJudge | 13.2 ± 1.1 | 13.0 ± 1.0 | 13.3 ± 1.1 | 13.2 ± 1.1 |
| | ArenaCmp | 12.0 ± 0.9 | 11.8 ± 0.8 | 12.1 ± 0.9 | 12.0 ± 0.9 |
| | FusionSQL-TL | 15.5 ± 1.3 | 15.3 ± 1.2 | 15.6 ± 1.3 | 15.5 ± 1.3 |
| | FusionSQL-LLM | 13.9 ± 1.2 | 13.7 ± 1.1 | 14.0 ± 1.2 | 13.9 ± 1.2 |
| | FusionSQL | 9.2 ± 0.6 | 9.1 ± 0.7 | 9.3 ± 0.8 | 9.2 ± 0.7 |
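For readers who want to produce cells in the same "mean ± 95% CI" style from their own runs, the sketch below uses a normal approximation (1.96 standard errors). This is an assumption on our part: how the intervals reported here were actually computed is not stated, and the per-run values in the example are made up.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and normal-approximation 95% half-width of a sample."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

# Hypothetical per-run errors in percentage points.
errors_pp = [3.0, 3.4, 2.8, 3.6, 3.2]
mean, hw = mean_ci95(errors_pp)
print(f"{mean:.1f} ± {hw:.1f}")
```

With only a handful of runs, a Student t quantile in place of 1.96 would give a wider and more honest interval.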
If you run into issues or need helper scripts for dataset downloads/materialization, open an issue or reach out.