README.md
March 12, 2026 ยท View on GitHub
Prerequisites
Before running data augmentation, make sure the following prerequisites are satisfied:
-
Dataset files are already prepared under
dataset/(frompreprocess/prepare_datasets.py). -
IR files are generated (mandatory for schema linking augmentation):
artifacts/ir/Spider_train.jsonartifacts/ir/BIRD_train.json
You can generate them with:
python preprocess/schema_to_ir.py --bench Spider_train python preprocess/schema_to_ir.py --bench BIRD_train -
Python dependencies are installed from
requirements.txt(includingopenai,transformers,vllm, andfunc-timeout). -
For scripts that call external LLM APIs (
sql_augment.py,compare_augment.py), configure credentials indata_augment/llm.pybefore running.
(1) Schema Linking
Generate schema-linking training data (local + global).
Default command:
python data_augment/schema_linking_augment.py
Default output:
artifacts/schema_link/local_schema_linking.jsonartifacts/schema_link/global_schema_linking_sft.jsonartifacts/schema_link/global_schema_linking_dpo.jsonartifacts/schema_link/config.json
Single benchmark mode writes into a same-name subdirectory:
python data_augment/schema_linking_augment.py --bench BIRD_train- output dir:
artifacts/schema_link/BIRD_train/
If you need intermediate files (such as per-benchmark split files), add --full-output.
# Single benchmark
python data_augment/schema_linking_augment.py --bench BIRD_train
# Custom parameters
python data_augment/schema_linking_augment.py --dpo-ratio 0.4 --num-col-samples 5 --seed 0
# Disable table-deletion negative samples
python data_augment/schema_linking_augment.py --no-table-deletion
# Skip noise augmentation
python data_augment/schema_linking_augment.py --no-noise
# Write full intermediate outputs
python data_augment/schema_linking_augment.py --full-output
(2) SQL Augmentation
Generate multiple reasoning-path SQL variants (Normal / CTE / Subquery) via LLM, then validate them against SQLite databases with iterative self-correction.
Output is written to artifacts/sql_augment.
Input should be *_schema_link.json generated with --full-output.
python data_augment/sql_augment.py --input artifacts/schema_link/Spider_train_schema_link.json
# Custom parameters
python data_augment/sql_augment.py --input artifacts/schema_link/Spider_train_schema_link.json \
--gen-workers 100 --val-workers 8 --query-timeout 60 --max-corrections 3
(3) SQL Selection
Sample SQL candidates, build pairwise correct/incorrect pairs, then annotate each pair with Chain-of-Thought reasoning.
vLLM Sampling
Use an LLM to generate multiple SQL candidates per question, validate them against SQLite, and separate them into correct/incorrect buckets.
Output is written to artifacts/vllm_sample.
Input should be *_schema_link.json generated with --full-output.
python data_augment/vLLM_sample.py \
--model /path/to/pretrained-model \
--input artifacts/schema_link/Spider_train_schema_link.json
Pairwise Data Construction
Sample correct + incorrect SQL pairs.
Output is written to artifacts/pairwise_data.
python data_augment/create_pairwise_data.py \
--input artifacts/vllm_sample/Spider_train_schema_link_structured.json
CoT Annotation
Annotate each pairwise sample with Chain-of-Thought analysis via an LLM.
Output is written to artifacts/cot_star.
python data_augment/compare_augment.py \
--input artifacts/pairwise_data/Spider_train_schema_link_structured_pairwise.json