FusionSQL

May 31, 2026

Text2SQL evaluation, FusionDataset construction, and shift-aware regression for Text-to-SQL.

Citation

@inproceedings{fusionsql,
  author       = {Trinh Pham and Thanh Tam Nguyen and Viet Huynh and Hongzhi Yin and Quoc Viet Hung Nguyen},
  title        = {An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data},
  booktitle    = {ICDE},
  publisher    = {IEEE},
  year         = {2026},
}

What is it?

FusionSQL provides:

A portable evaluator that reports execution accuracy for Spider, Spider 2.0, BIRD, SParC, CoSQL, and WikiSQL.
A pipeline to construct a synthetic FusionDataset of databases, SQLs, and paraphrased questions.
Shift descriptors (Fréchet-like, Mahalanobis, Sliced-Wasserstein) between a target workload and the training set.
An MLP regressor that predicts execution accuracy for a given base model and is trained to minimize mean absolute error (MAE).

By design, every metric and report here is based on execution accuracy.
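Execution accuracy counts a prediction as correct when it returns the same result as the gold query on the target database. The sketch below illustrates the idea with the standard sqlite3 module. It is not the code in fusion_evaluator/exec or fusion_evaluator/metrics: the function names, the toy singer table, and the choice to compare rows as an order-insensitive multiset are assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, gold_sql, pred_sql):
    """True when both queries run and return the same rows (order ignored)."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, pred_sql)
    return gold is not None and gold == pred


def execution_accuracy(conn, pairs):
    """Fraction of (gold, prediction) pairs whose results match."""
    hits = [execution_match(conn, gold, pred) for gold, pred in pairs]
    return sum(hits) / len(hits) if hits else 0.0


# Toy database, purely illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer(id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 41), (3, 'Cy', 25);
    """
)

pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT name FROM singer WHERE age > 28",
     "SELECT name FROM singer WHERE age >= 30"),
    # Both run, but the results differ: counted as wrong.
    ("SELECT COUNT(*) FROM singer",
     "SELECT COUNT(id) FROM singer WHERE age > 28"),
    # The prediction fails to execute: counted as wrong.
    ("SELECT name FROM singer", "SELECT nam FROM singer"),
]
acc = execution_accuracy(conn, pairs)
conn.close()
```

The first pair shows why execution accuracy is more forgiving than exact string match: two differently written queries still count as a match when their results agree.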

Figure: FusionSQL framework.

Project layout

fusion_evaluator/
- data/ dataset loaders and adapters
- sql/ SQL normalization and parsing (sqlglot)
- exec/ SQLite execution with caching
- metrics/ execution
- evaluator.py orchestrator
- cli.py evaluation entrypoint
figure/ diagrams
outputs/ reports and caches

Getting Started

0) Dependencies

Python 3.10+
SQLite (comes with Python stdlib sqlite3)
Recommended: a GPU with CUDA if embedding large datasets

Install Python dependencies:

pip install -r requirements.txt

Torch wheels differ by platform/GPU. If the default install fails or is slow, install a matching build from the official site: PyTorch Install.

1) Datasets and expected layout

Spider / Spider 2.0 / BIRD / SParC / CoSQL
- Gold and predictions are JSON/JSONL with fields: question, query (gold) or prediction (pred), and db_id.
- Databases are under db_root/DBID/DB.sqlite.
WikiSQL
- Gold/pred are JSONL; tables file is tables.jsonl (id, header, rows).
- We materialize one SQLite per table into an output directory.
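To make the WikiSQL step concrete, here is a minimal sketch of turning a tables.jsonl file (id, header, rows) into one SQLite file per table. It is an illustration, not the loader shipped in fusion_evaluator/data: the table name data, the col0, col1, ... column naming, and the <id>.sqlite file name are assumptions.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def materialize_table(record, out_dir):
    """Write one WikiSQL table record to <out_dir>/<id>.sqlite and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    db_path = out_dir / f"{record['id']}.sqlite"

    # Positional column names (assumed convention) avoid quoting arbitrary headers.
    cols = [f"col{i}" for i in range(len(record["header"]))]
    col_defs = ", ".join(f'"{c}"' for c in cols)
    placeholders = ", ".join("?" for _ in cols)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS data")
            conn.execute(f"CREATE TABLE data ({col_defs})")
            conn.executemany(f"INSERT INTO data VALUES ({placeholders})", record["rows"])
    finally:
        conn.close()
    return db_path


def materialize_all(tables_jsonl, out_dir):
    """Materialize every non-empty line of a tables.jsonl file."""
    paths = []
    with open(tables_jsonl, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                paths.append(materialize_table(json.loads(line), out_dir))
    return paths


# Tiny made-up example so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "tables.jsonl"
    record = {"id": "1-100", "header": ["Player", "Goals"], "rows": [["Ann", 3], ["Bob", 5]]}
    src.write_text(json.dumps(record) + "\n", encoding="utf-8")

    (db_path,) = materialize_all(src, Path(tmp) / "wikisql")
    conn = sqlite3.connect(db_path)
    total_goals = conn.execute("SELECT SUM(col1) FROM data").fetchone()[0]
    conn.close()
```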

Download links:

Spider: Project page
Spider 2.0: Project page
BIRD: Project page
SParC: Project page
CoSQL: Project page
WikiSQL: GitHub

Place the gold and prediction files as described above. For Spider, Spider 2.0, BIRD, SParC, and CoSQL, set --db_root to the directory that holds the per-database folders, each containing its DB.sqlite.
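As a rough illustration of this layout, the sketch below reads a JSON or JSONL file into a list of records and resolves the SQLite path for a db_id. The helper names are hypothetical, and the sketch assumes that DB.sqlite means <db_id>.sqlite (the usual Spider convention); adjust if your files are named differently.

```python
import json
import tempfile
from pathlib import Path


def load_records(path):
    """Load a JSON array or a JSONL file into a list of dicts."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def sqlite_path(db_root, db_id):
    """Assumed layout: <db_root>/<db_id>/<db_id>.sqlite."""
    return Path(db_root) / db_id / f"{db_id}.sqlite"


# Made-up records that use the documented fields.
question = "How many singers are there?"
with tempfile.TemporaryDirectory() as tmp:
    gold_path = Path(tmp) / "gold.json"
    pred_path = Path(tmp) / "preds.jsonl"
    gold_path.write_text(
        json.dumps([{"question": question,
                     "query": "SELECT COUNT(*) FROM singer",
                     "db_id": "concert_singer"}]),
        encoding="utf-8",
    )
    pred_path.write_text(
        json.dumps({"question": question,
                    "prediction": "SELECT COUNT(*) FROM singer",
                    "db_id": "concert_singer"}) + "\n",
        encoding="utf-8",
    )
    gold = load_records(gold_path)
    preds = load_records(pred_path)

target_db = sqlite_path("path/to/spider/database", gold[0]["db_id"])
```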

FusionDataset

Construct a synthetic, diverse dataset from CSV sources:

python -m fusion_evaluator.fusion_dataset.cli \
  --sources /path/to/csv_sources ... \
  --out_root outputs/fusion_dataset \
  --max_tables 1000

Optional LLM-driven question generation and rewrites (provide both to enable):

python -m fusion_evaluator.fusion_dataset.cli \
  --sources /path/to/csv_sources \
  --out_root outputs/fusion_dataset \
  --prompts fusion_evaluator/fusion_dataset/prompts.yaml \
  --hf_model Qwen/Qwen2.5-72B-Instruct \
  --device cuda --torch_dtype fp16 \
  --q_per_sql 4 \
  --enable_rewrites --rw_per_cat 2

This will:

acquire CSVs, filter tables (language, structure, near-dup),
synthesize relational DBs (SQLite under outputs/fusion_dataset/databases),
generate SQLs and paraphrased questions with distractors (LLM-backed if provided),
optionally produce rewritten Q/A pairs for semantic rewriting, numeric condition transforms, and query logic adjustments,
write outputs/fusion_dataset/fusion_dataset.jsonl.
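The near-duplicate filter in the second step can be pictured as follows. This is only a sketch of one common approach (Jaccard similarity over normalized column names); how the pipeline actually detects near-duplicates is not described here, and the function names, the 0.8 threshold, and the sample tables are hypothetical.

```python
def header_signature(header):
    """Normalize column names so case and stray whitespace do not matter."""
    return frozenset(name.strip().lower() for name in header)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(tables, threshold=0.8):
    """Keep a table only if its header differs enough from every kept table."""
    kept, kept_sigs = [], []
    for table in tables:
        sig = header_signature(table["header"])
        if all(jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(table)
            kept_sigs.append(sig)
    return kept


# Made-up candidate tables: the first two share the same columns.
tables = [
    {"name": "players_2023", "header": ["Name", "Team", "Goals", "Assists"]},
    {"name": "players_2024", "header": ["name ", "team", "goals", "assists"]},
    {"name": "cities", "header": ["City", "Country", "Population"]},
]
kept_names = [t["name"] for t in drop_near_duplicates(tables)]
```

A header-only signature is cheap but coarse; a real filter would likely also look at cell contents.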

FusionSQL

We embed SQLs (or questions) with a Hugging Face model, compute shift descriptors between a training workload and FusionDataset, and fit an MLP to predict execution accuracy.
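To give a feel for what a shift descriptor measures, the NumPy sketch below computes three textbook distances between two embedding matrices: a Fréchet distance between fitted Gaussians, a Mahalanobis distance between the target mean and the source distribution, and a sliced Wasserstein distance over random projections. These are standard definitions, not the exact descriptors computed by fusion_evaluator; the function names, the regularization constant, the quantile grid, and the synthetic data are assumptions.

```python
import numpy as np


def _cov(x, eps):
    """Sample covariance with a small ridge for numerical stability."""
    return np.atleast_2d(np.cov(x, rowvar=False)) + eps * np.eye(x.shape[1])


def frechet_distance(x, y, eps=1e-6):
    """||mu_x - mu_y||^2 + Tr(Sx + Sy - 2 (Sx Sy)^(1/2)) for fitted Gaussians."""
    cov_x, cov_y = _cov(x, eps), _cov(y, eps)
    # Tr((Sx Sy)^(1/2)) equals the sum of square roots of the eigenvalues of Sx Sy.
    eig = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    mean_term = np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2)
    value = mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt
    return float(max(value, 0.0))


def mahalanobis_shift(source, target, eps=1e-6):
    """Distance of the target mean from the source mean, in source covariance units."""
    diff = target.mean(axis=0) - source.mean(axis=0)
    return float(np.sqrt(diff @ np.linalg.solve(_cov(source, eps), diff)))


def sliced_wasserstein(x, y, n_slices=34, seed=0):
    """Average 1-D Wasserstein-1 distance over random unit directions."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_slices, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Comparing quantiles on a shared grid handles unequal sample sizes.
    qs = np.linspace(0.0, 1.0, 101)
    px = np.quantile(x @ dirs.T, qs, axis=0)
    py = np.quantile(y @ dirs.T, qs, axis=0)
    return float(np.mean(np.abs(px - py)))


# Synthetic embeddings: the target is shifted by 0.5 in every dimension.
rng = np.random.default_rng(0)
source = rng.normal(size=(400, 8))
target = rng.normal(loc=0.5, size=(300, 8))

same = (frechet_distance(source, source),
        mahalanobis_shift(source, source),
        sliced_wasserstein(source, source))
shifted = (frechet_distance(source, target),
           mahalanobis_shift(source, target),
           sliced_wasserstein(source, target))
```

All three values are zero (up to rounding) when a set is compared with itself and grow as the target drifts away from the source, which is the signal a regressor can learn from.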

1) Compute embeddings directly

python -m fusion_evaluator.evaluator_training.cli embed \
  --input outputs/fusion_dataset/fusion_dataset.jsonl \
  --output outputs/fusion_dataset/fusion_emb.npy \
  --model Qwen/Qwen2.5-72B-Instruct \
  --field sql \
  --device cuda \
  --batch_size 8 \
  --max_length 256 \
  --torch_dtype fp16

You can pass any compatible encoder from Hugging Face. Common choices include:

Qwen/Qwen2.5-72B-Instruct
meta-llama/Llama-3.1-70B-Instruct
deepseek-ai/deepseek-coder-33b-instruct
XGenerationLab/XiYanSQL-QwenCoder-14B-2502
cycloneboy/CscSQL-Grpo-Qwen2.5-Coder-7B-Instruct
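Most of the models listed are decoder-only LLMs, so their per-token hidden states have to be pooled into one vector per SQL string before descriptors can be computed. Which pooling the embed command uses is not stated here; the sketch below shows masked mean pooling, one common choice, in plain NumPy with made-up hidden states.

```python
import numpy as np


def masked_mean_pool(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim) token hidden states
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    weights = mask[..., None].astype(hidden.dtype)
    summed = (hidden * weights).sum(axis=1)
    # Guard against all-padding rows to avoid division by zero.
    counts = np.clip(weights.sum(axis=1), 1.0, None)
    return summed / counts


# Two sequences padded to length 3; padded positions hold junk values.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[5.0, 6.0], [9.0, 9.0], [9.0, 9.0]],
])
mask = np.array([[1, 1, 0], [1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)
```

With a real model, hidden would be the last hidden state and mask the tokenizer's attention mask.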

2) Train the regressor (from precomputed embeddings)

python -m fusion_evaluator.evaluator_training.cli train \
  --source_embeddings path/to/source.npy \
  --target_embeddings path/to/fusion.npy \
  --observed_metric 0.712 \
  --slices 34 \
  --hybrid_swd --pca_k 10 --rand_r 24 --pca_subsample 8192 \
  --out outputs/regressor.joblib
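One plausible reading of the flags above is that --hybrid_swd mixes 10 PCA directions with 24 random directions for a total of 34 slices, with PCA fitted on at most 8192 pooled rows. That reading is an assumption, not something confirmed by the CLI. Under it, a hybrid sliced Wasserstein distance could be sketched in NumPy as:

```python
import numpy as np


def hybrid_directions(x, y, pca_k=10, rand_r=24, pca_subsample=8192, seed=0):
    """Stack the top principal directions of the pooled data with random unit directions."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    if len(pooled) > pca_subsample:
        keep = rng.choice(len(pooled), size=pca_subsample, replace=False)
        pooled = pooled[keep]
    centered = pooled - pooled.mean(axis=0)
    # Rows of vt are unit-norm principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pca_dirs = vt[:pca_k]
    rand_dirs = rng.normal(size=(rand_r, x.shape[1]))
    rand_dirs /= np.linalg.norm(rand_dirs, axis=1, keepdims=True)
    return np.vstack([pca_dirs, rand_dirs])


def hybrid_swd(x, y, **kwargs):
    """Average 1-D Wasserstein-1 distance over the hybrid set of directions."""
    dirs = hybrid_directions(x, y, **kwargs)
    qs = np.linspace(0.0, 1.0, 101)
    px = np.quantile(x @ dirs.T, qs, axis=0)
    py = np.quantile(y @ dirs.T, qs, axis=0)
    return float(np.mean(np.abs(px - py)))


# Synthetic embeddings: the target differs from the source along one axis.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16))
y = rng.normal(size=(400, 16))
y[:, 0] += 2.0

dirs = hybrid_directions(x, y)
d_same = hybrid_swd(x, x)
d_shift = hybrid_swd(x, y)
```

The PCA directions point where the pooled data varies most, so a shift concentrated in a few dimensions is less likely to be missed than with random directions alone.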

3) End-to-end training with FusionDataset

python -m fusion_evaluator.evaluator_training.pipeline train \
    --dataset spider \
    --gold path/to/spider_dev_gold.json \
    --pred path/to/spider_dev_preds.jsonl \
    --db_root path/to/spider/database \
    --fusion_jsonl outputs/fusion_dataset/fusion_dataset.jsonl \
    --exec_accuracy 0.712 \
    --model_name Qwen/Qwen2.5-72B-Instruct \
    --hybrid_swd --pca_k 10 --rand_r 24 --pca_subsample 8192 \
    --slices 34 \
    --out outputs/regressor_spider_qwen.joblib

4) Inference with FusionSQL

python -m fusion_evaluator.evaluator_training.pipeline infer \
    --dataset spider \
    --gold path/to/spider_dev_gold.json \
    --pred path/to/spider_dev_preds.jsonl \
    --db_root path/to/spider/database \
    --fusion_jsonl outputs/fusion_dataset/fusion_dataset.jsonl \
    --model_name Qwen/Qwen2.5-72B-Instruct \
    --hybrid_swd --pca_k 10 --rand_r 24 --pca_subsample 8192 \
    --slices 34 \
    --model outputs/regressor_spider_qwen.joblib

The regressor predicts execution accuracy for the target workload and chosen base model.

5) Sampling-based shift + true execution accuracy (small example)

This helper script repeatedly samples target subsets (e.g., 500 examples), computes shift descriptors between the training workload and each subset, then estimates true execution accuracy by generating SQL with a model and executing against the databases. It saves the 100 shift vectors and their accuracies, then fits a 3-layer MLP regressor.

Example (BIRD dev):

python -m fusion_evaluator.evaluator_training.shift_sampling_train \
  --db_root fusion_evaluator/data/bird/dev/dev_databases \
  --source fusion_evaluator/data/spider/sft_spider_train_text2sql.json \
  --target fusion_evaluator/data/bird/sft_bird_dev_text2sql.json \
  --target_limit 500 \
  --num_sets 100 \
  --seed 0 \
  --device cuda --batch_size 8 --torch_dtype fp16

What it does:

Builds prompts from question + schema (same format as shift_from_json.py).
Uses Qwen/Qwen2.5-3B-Instruct to generate SQL.
Computes execution accuracy by running SQL against SQLite databases under --db_root.
Samples 100 subsets of size 500, each drawn without replacement.
Computes 100 shift vectors and their 100 accuracies.
Trains a 3-layer MLP regressor (256, 128, 64) on these vectors.
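The sampling step can be sketched as follows. The helper name, the target-set size, and the simulated per-example correctness flags are made up; in the real script the flags come from executing the generated SQL against the databases.

```python
import numpy as np


def sample_subsets(n_items, subset_size, num_sets, seed=0):
    """Draw num_sets index sets, each without replacement within the set."""
    rng = np.random.default_rng(seed)
    return np.stack([
        rng.choice(n_items, size=subset_size, replace=False)
        for _ in range(num_sets)
    ])


# Simulated outcome per target example: True if the generated SQL executed correctly.
rng = np.random.default_rng(0)
correct = rng.random(1500) < 0.6

sample_indices = sample_subsets(len(correct), subset_size=500, num_sets=100, seed=0)
# One execution accuracy per subset; each is later paired with that subset's shift vector.
accuracies = correct[sample_indices].mean(axis=1)
```

Subsets may overlap with each other, but no example appears twice within a subset.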

Outputs:

outputs/shift_samples/shift_samples.npz containing:
- deltas: (num_sets, 5) shift vectors
- accuracies: (num_sets,) true execution accuracies
- sample_indices: (num_sets, target_limit) indices into the target set
outputs/shift_samples/shift_mlp.joblib trained regressor
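A quick way to check the saved arrays is to load the .npz with NumPy and inspect the shapes. The sketch below writes a placeholder file with the documented keys into a temporary directory so that it runs on its own; with real outputs, point np.load at outputs/shift_samples/shift_samples.npz instead.

```python
import os
import tempfile

import numpy as np

num_sets, target_limit = 100, 500

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shift_samples.npz")
    # Placeholder arrays with the documented keys and shapes (contents are dummy zeros).
    np.savez(
        path,
        deltas=np.zeros((num_sets, 5)),
        accuracies=np.zeros(num_sets),
        sample_indices=np.zeros((num_sets, target_limit), dtype=np.int64),
    )
    # np.load returns a lazy archive; the context manager closes the file.
    with np.load(path) as data:
        shapes = {key: data[key].shape for key in data.files}
```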

Notes:

For Spider, set --db_root to fusion_evaluator/data/spider/database (or test_database if needed).
To use a different generation model, set --model.
To embed with a different model than generation, set --embed_model.

Additional usage (Spider, Spider 2.0, BIRD, SParC, CoSQL, WikiSQL)

# Spider
python -m fusion_evaluator.cli \
  --dataset spider \
  --gold path/to/dev_gold.json \
  --pred path/to/predictions.jsonl \
  --db_root path/to/spider/database \
  --out outputs/spider_report.json

# Spider 2.0
python -m fusion_evaluator.cli \
  --dataset spider2 \
  --gold path/to/spider2_gold.json \
  --pred path/to/spider2_preds.jsonl \
  --db_root path/to/spider2/database \
  --out outputs/spider2_report.json

# BIRD
python -m fusion_evaluator.cli \
  --dataset bird \
  --gold path/to/bird_gold.jsonl \
  --pred path/to/bird_preds.jsonl \
  --db_root path/to/bird/database \
  --out outputs/bird_report.json

# SParC
python -m fusion_evaluator.cli \
  --dataset sparc \
  --gold path/to/sparc_dev.json \
  --pred path/to/preds.jsonl \
  --db_root path/to/spider/database \
  --out outputs/sparc_report.json

# CoSQL
python -m fusion_evaluator.cli \
  --dataset cosql \
  --gold path/to/cosql_dev.json \
  --pred path/to/preds.jsonl \
  --db_root path/to/spider/database \
  --out outputs/cosql_report.json

# WikiSQL
python -m fusion_evaluator.cli \
  --dataset wikisql \
  --gold path/to/wikisql_gold.jsonl \
  --pred path/to/wikisql_preds.jsonl \
  --wikisql_tables path/to/tables.jsonl \
  --wikisql_db_out databases/wikisql \
  --out outputs/wikisql_report.json

Output:

JSON report at --out with summary and per-sample metrics.
Console table: ExecAcc.

Reported Results

FusionSQL-TL denotes FusionSQL Transfer Learning. FusionSQL-ML denotes FusionSQL Meta-learning.

Table III. MAE (↓) of dataset-level accuracy estimation for source-target transfers

Each cell reports mean ± 95% CI in percentage points. Best is in bold, second-best is underlined.

Transfer	Method	Qwen2.5-72B	Llama-3.1-70B	DeepSeek-33B	XiYanSQL-14B	CSC-SQL-7B	Avg.
Spider → BIRD	ATC-MC	13.9 ± 1.1	14.6 ± 1.2	15.2 ± 1.2	17.4 ± 1.4	18.3 ± 1.5	15.9 ± 1.3
	ATC-NE	15.0 ± 1.2	15.7 ± 1.3	16.5 ± 1.3	18.6 ± 1.5	19.8 ± 1.6	17.1 ± 1.4
	DoC (τ=0.8)	15.5 ± 1.3	16.0 ± 1.3	17.3 ± 1.4	19.2 ± 1.6	20.5 ± 1.6	17.7 ± 1.4
	DoC (τ=0.9)	16.7 ± 1.4	17.3 ± 1.4	18.6 ± 1.5	20.3 ± 1.7	21.7 ± 1.7	18.9 ± 1.5
	PseAutoEval	11.6 ± 0.9	12.2 ± 1.0	13.1 ± 1.0	15.1 ± 1.2	16.3 ± 1.3	13.7 ± 1.1
	BugJudge	14.8 ± 1.2	15.4 ± 1.2	16.2 ± 1.3	18.1 ± 1.4	19.0 ± 1.5	16.7 ± 1.3
	ArenaCmp	9.7 ± 0.8	10.4 ± 0.9	11.2 ± 0.9	12.6 ± 1.0	13.5 ± 1.1	11.5 ± 0.9
	FusionSQL-TL	3.4 ± 1.2	4.0 ± 1.2	4.6 ± 1.3	5.2 ± 1.4	5.6 ± 1.4	4.6 ± 1.3
	FusionSQL (Ours)	3.1 ± 0.5	3.7 ± 0.5	4.2 ± 0.6	4.8 ± 0.7	5.1 ± 0.7	4.2 ± 0.6
WikiSQL → Spider	ATC-MC	12.2 ± 1.0	13.1 ± 1.1	13.8 ± 1.2	15.2 ± 1.3	16.1 ± 1.4	14.1 ± 1.2
	ATC-NE	13.4 ± 1.1	14.0 ± 1.2	15.1 ± 1.3	16.3 ± 1.4	17.5 ± 1.5	15.3 ± 1.3
	DoC (τ=0.8)	14.6 ± 1.2	15.3 ± 1.3	16.5 ± 1.4	17.8 ± 1.5	19.0 ± 1.6	16.6 ± 1.4
	DoC (τ=0.9)	15.8 ± 1.3	16.4 ± 1.3	17.7 ± 1.4	19.1 ± 1.6	20.3 ± 1.6	17.9 ± 1.4
	PseAutoEval	11.1 ± 0.9	11.8 ± 1.0	12.6 ± 1.0	13.7 ± 1.1	14.9 ± 1.2	12.8 ± 1.0
	BugJudge	13.6 ± 1.1	14.2 ± 1.1	15.1 ± 1.2	16.5 ± 1.3	17.6 ± 1.4	15.4 ± 1.2
	ArenaCmp	9.2 ± 0.8	9.9 ± 0.8	10.7 ± 0.9	12.0 ± 1.0	12.8 ± 1.0	10.9 ± 0.9
	FusionSQL-TL	3.6 ± 1.2	4.1 ± 1.2	4.7 ± 1.3	5.1 ± 1.3	5.6 ± 1.4	4.6 ± 1.3
	FusionSQL (Ours)	3.2 ± 0.5	3.8 ± 0.5	4.3 ± 0.6	4.7 ± 0.7	5.2 ± 0.6	4.2 ± 0.6
SParC → CoSQL (in-domain)	ATC-MC	6.5 ± 0.6	7.2 ± 0.7	7.8 ± 0.8	8.3 ± 0.8	9.0 ± 0.9	7.8 ± 0.8
	ATC-NE	7.1 ± 0.6	7.8 ± 0.7	8.4 ± 0.7	9.0 ± 0.8	9.6 ± 0.9	8.4 ± 0.7
	DoC (τ=0.8)	7.7 ± 0.6	8.3 ± 0.7	8.8 ± 0.7	9.3 ± 0.8	9.9 ± 0.8	8.8 ± 0.7
	DoC (τ=0.9)	8.8 ± 0.7	9.3 ± 0.7	9.8 ± 0.8	10.4 ± 0.9	10.9 ± 0.9	9.8 ± 0.8
	PseAutoEval	5.5 ± 0.5	6.1 ± 0.5	6.7 ± 0.6	7.2 ± 0.6	7.8 ± 0.7	6.7 ± 0.6
	BugJudge	6.1 ± 0.6	6.7 ± 0.6	7.3 ± 0.7	7.9 ± 0.7	8.4 ± 0.8	7.3 ± 0.7
	ArenaCmp	3.9 ± 0.4	4.4 ± 0.4	4.9 ± 0.5	5.4 ± 0.5	5.9 ± 0.5	4.9 ± 0.5
	FusionSQL-TL	1.5 ± 1.2	1.7 ± 1.2	2.0 ± 1.3	2.2 ± 1.3	2.4 ± 1.3	2.0 ± 1.3
	FusionSQL (Ours)	1.6 ± 0.3	1.8 ± 0.3	2.1 ± 0.3	2.3 ± 0.4	2.5 ± 0.4	2.1 ± 0.3
Spider → SynSQL-2.5M	ATC-MC	10.9 ± 0.9	11.7 ± 1.0	12.3 ± 1.0	13.8 ± 1.1	14.7 ± 1.2	12.7 ± 1.0
	ATC-NE	12.1 ± 1.0	12.9 ± 1.1	13.5 ± 1.1	14.9 ± 1.2	15.8 ± 1.3	13.8 ± 1.1
	DoC (τ=0.8)	12.9 ± 1.0	13.6 ± 1.1	14.7 ± 1.2	16.0 ± 1.3	17.2 ± 1.4	14.9 ± 1.2
	DoC (τ=0.9)	14.1 ± 1.1	14.8 ± 1.2	15.9 ± 1.3	17.2 ± 1.4	18.4 ± 1.5	16.1 ± 1.3
	PseAutoEval	9.5 ± 0.8	10.1 ± 0.9	10.8 ± 0.9	12.0 ± 1.0	13.1 ± 1.1	11.1 ± 0.9
	BugJudge	12.4 ± 1.0	13.2 ± 1.1	14.0 ± 1.1	15.5 ± 1.2	16.6 ± 1.3	14.3 ± 1.1
	ArenaCmp	8.4 ± 0.7	9.1 ± 0.8	9.8 ± 0.8	11.1 ± 0.9	11.9 ± 1.0	10.1 ± 0.8
	FusionSQL-TL	3.1 ± 1.2	3.5 ± 1.2	4.0 ± 1.3	4.4 ± 1.3	4.9 ± 1.4	4.0 ± 1.3
	FusionSQL (Ours)	2.8 ± 0.4	3.2 ± 0.5	3.7 ± 0.5	4.1 ± 0.6	4.5 ± 0.6	3.7 ± 0.5
WikiSQL → Spider 2.0	ATC-MC	18.0 ± 1.5	18.7 ± 1.5	19.6 ± 1.6	21.0 ± 1.7	22.2 ± 1.8	19.9 ± 1.6
	ATC-NE	19.4 ± 1.6	20.1 ± 1.7	21.3 ± 1.8	22.6 ± 1.9	23.9 ± 2.0	21.5 ± 1.8
	DoC (τ=0.8)	20.5 ± 1.7	21.3 ± 1.8	22.7 ± 1.9	24.0 ± 2.0	25.4 ± 2.1	22.8 ± 1.9
	DoC (τ=0.9)	21.7 ± 1.8	22.5 ± 1.9	23.9 ± 2.0	25.2 ± 2.1	26.6 ± 2.2	23.9 ± 2.0
	PseAutoEval	16.3 ± 1.3	17.0 ± 1.4	17.7 ± 1.4	18.8 ± 1.5	20.1 ± 1.6	18.0 ± 1.4
	BugJudge	17.3 ± 1.4	18.1 ± 1.5	19.3 ± 1.6	20.7 ± 1.7	22.0 ± 1.8	19.5 ± 1.6
	ArenaCmp	12.6 ± 1.0	13.4 ± 1.1	14.5 ± 1.2	15.8 ± 1.3	16.9 ± 1.4	14.6 ± 1.2
	FusionSQL-TL	4.5 ± 1.3	5.1 ± 1.4	5.6 ± 1.4	6.1 ± 1.5	6.6 ± 1.5	5.6 ± 1.4
	FusionSQL (Ours)	4.2 ± 0.6	4.8 ± 0.7	5.3 ± 0.7	5.8 ± 0.8	6.3 ± 0.8	5.3 ± 0.7
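The cells above report a mean absolute error with a 95% confidence interval, in percentage points. How the interval was computed is not stated here. The sketch below uses a normal approximation (mean ± 1.96 standard errors) purely as an illustration, and the accuracy values are made up.

```python
import math


def mae_with_ci(predicted, observed, z=1.96):
    """Return (MAE, CI half-width) in percentage points for accuracies in [0, 1]."""
    errors = [abs(p - o) * 100.0 for p, o in zip(predicted, observed)]
    n = len(errors)
    mean = sum(errors) / n
    # Sample variance (n - 1 in the denominator), then the standard error of the mean.
    variance = sum((e - mean) ** 2 for e in errors) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


# Hypothetical predicted vs. true execution accuracy on four target sets.
predicted = [0.70, 0.65, 0.72, 0.60]
observed = [0.68, 0.70, 0.71, 0.66]
mae, half_width = mae_with_ci(predicted, observed)
report = f"{mae:.1f} ± {half_width:.1f}"
```

With only a handful of evaluation sets, a bootstrap or a t-based interval would usually be more appropriate than the normal approximation.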

Table IV. MAE (↓) for generalizing FusionSQL to unseen Text2SQL models

Columns are the unseen model pool. Each cell reports mean ± 95% CI in percentage points. Best is in bold.

Transfer	Method	CodeLlama-34B	StarCoder2-15B	Mistral-7B	DeepSeek-Coder-6.7B	Phi-3-mini	Avg.
Spider → BIRD	BugJudge	13.8 ± 1.0	13.5 ± 1.1	14.0 ± 1.0	13.9 ± 0.9	13.6 ± 1.0	13.8 ± 1.0
	ArenaCmp	11.1 ± 0.8	10.8 ± 0.9	11.4 ± 0.9	11.2 ± 0.9	10.9 ± 0.8	11.1 ± 0.9
	FusionSQL-ML (Ours)	6.7 ± 0.5	6.5 ± 0.6	6.8 ± 0.7	6.7 ± 0.6	6.6 ± 0.5	6.7 ± 0.6
WikiSQL → Spider	BugJudge	12.7 ± 1.0	12.4 ± 1.1	12.9 ± 1.0	12.8 ± 0.9	12.5 ± 1.0	12.7 ± 1.0
	ArenaCmp	10.4 ± 0.8	10.1 ± 0.9	10.6 ± 1.0	10.4 ± 0.9	10.2 ± 0.8	10.3 ± 0.9
	FusionSQL-ML (Ours)	6.0 ± 0.4	5.8 ± 0.5	6.1 ± 0.6	6.0 ± 0.5	5.9 ± 0.4	6.0 ± 0.5
SParC → CoSQL	BugJudge	11.5 ± 0.8	11.3 ± 0.9	11.6 ± 1.0	11.5 ± 0.9	11.2 ± 0.8	11.4 ± 0.9
	ArenaCmp	9.6 ± 0.7	9.4 ± 0.8	9.7 ± 0.9	9.6 ± 0.8	9.3 ± 0.7	9.5 ± 0.8
	FusionSQL-ML (Ours)	5.1 ± 0.4	4.9 ± 0.5	5.1 ± 0.6	5.0 ± 0.5	4.9 ± 0.4	5.0 ± 0.5
Spider → SynSQL-2.5M	BugJudge	13.3 ± 1.0	13.0 ± 1.1	13.4 ± 1.0	13.2 ± 0.9	13.1 ± 1.0	13.2 ± 1.0
	ArenaCmp	10.9 ± 0.8	10.6 ± 0.9	11.0 ± 1.0	10.9 ± 0.9	10.7 ± 0.8	10.8 ± 0.9
	FusionSQL-ML (Ours)	6.5 ± 0.5	6.3 ± 0.6	6.6 ± 0.7	6.5 ± 0.6	6.4 ± 0.5	6.5 ± 0.6
WikiSQL → Spider 2.0	BugJudge	14.6 ± 1.0	14.2 ± 1.1	14.7 ± 1.2	14.5 ± 1.1	14.3 ± 1.0	14.5 ± 1.1
	ArenaCmp	12.0 ± 0.9	11.7 ± 1.0	12.1 ± 1.1	12.0 ± 1.0	11.8 ± 0.9	11.9 ± 1.0
	FusionSQL-ML (Ours)	7.0 ± 0.5	6.8 ± 0.6	7.1 ± 0.7	7.0 ± 0.6	6.9 ± 0.5	7.0 ± 0.6

Table VI. MAE (↓) on classic Text2SQL models such as ATHENA++

Each cell reports mean ± 95% CI in percentage points. Best is in bold, second-best is underlined.

Dataset	Method	ATHENA	ATHENA++	SQLizer	Avg.
Spider	BugJudge	12.0 ± 1.0	11.8 ± 0.9	12.1 ± 1.0	12.0 ± 1.0
	ArenaCmp	10.8 ± 0.9	10.6 ± 0.8	10.9 ± 0.9	10.8 ± 0.9
	FusionSQL-TL	14.3 ± 1.1	14.1 ± 1.1	14.4 ± 1.2	14.3 ± 1.2
	FusionSQL-LLM	12.8 ± 1.1	12.6 ± 1.0	12.9 ± 1.1	12.8 ± 1.1
	FusionSQL	8.3 ± 0.6	8.2 ± 0.7	8.4 ± 0.8	8.3 ± 0.7
Spider 2.0	BugJudge	12.8 ± 1.1	12.6 ± 1.0	12.9 ± 1.1	12.8 ± 1.1
	ArenaCmp	11.6 ± 0.8	11.4 ± 0.9	11.7 ± 1.0	11.6 ± 0.9
	FusionSQL-TL	15.1 ± 1.1	14.9 ± 1.2	15.2 ± 1.3	15.1 ± 1.2
	FusionSQL-LLM	13.6 ± 1.0	13.4 ± 1.0	13.7 ± 1.2	13.6 ± 1.1
	FusionSQL	9.0 ± 0.6	8.9 ± 0.7	9.1 ± 0.8	9.0 ± 0.7
SynSQL-2.5M	BugJudge	13.0 ± 1.1	12.8 ± 1.1	13.1 ± 1.1	13.0 ± 1.1
	ArenaCmp	11.8 ± 0.8	11.6 ± 0.9	11.9 ± 1.0	11.8 ± 0.9
	FusionSQL-TL	15.3 ± 1.2	15.1 ± 1.2	15.4 ± 1.3	15.3 ± 1.3
	FusionSQL-LLM	13.7 ± 1.1	13.5 ± 1.1	13.8 ± 1.2	13.7 ± 1.1
	FusionSQL	9.1 ± 0.6	9.0 ± 0.7	9.2 ± 0.8	9.1 ± 0.7
CoSQL	BugJudge	11.5 ± 0.8	11.3 ± 0.9	11.6 ± 1.0	11.5 ± 0.9
	ArenaCmp	10.2 ± 0.8	10.0 ± 0.7	10.3 ± 0.8	10.2 ± 0.8
	FusionSQL-TL	13.8 ± 1.1	13.6 ± 1.1	13.9 ± 1.2	13.8 ± 1.2
	FusionSQL-LLM	12.3 ± 1.1	12.1 ± 1.0	12.4 ± 1.1	12.3 ± 1.1
	FusionSQL	7.9 ± 0.5	7.8 ± 0.6	8.0 ± 0.7	7.9 ± 0.6
BIRD	BugJudge	13.2 ± 1.1	13.0 ± 1.0	13.3 ± 1.1	13.2 ± 1.1
	ArenaCmp	12.0 ± 0.9	11.8 ± 0.8	12.1 ± 0.9	12.0 ± 0.9
	FusionSQL-TL	15.5 ± 1.3	15.3 ± 1.2	15.6 ± 1.3	15.5 ± 1.3
	FusionSQL-LLM	13.9 ± 1.2	13.7 ± 1.1	14.0 ± 1.2	13.9 ± 1.2
	FusionSQL	9.2 ± 0.6	9.1 ± 0.7	9.3 ± 0.8	9.2 ± 0.7

If you run into issues or need helper scripts for dataset downloads/materialization, open an issue or reach out.
