README.md
March 12, 2026
A modular text-to-SQL project with a full pipeline for:
- Dataset preprocessing
- SQL data augmentation
- Model training
- Multi-stage inference
This repository is organized by pipeline stage. This root README is a navigation guide; for detailed usage and arguments, see the README in each subdirectory.
🚧 Note: This repository is under construction. 🚧
Subdirectories
- Preprocessing: preprocess/README.md
- Data augmentation: data_augment/README.md
- Training: training/README.md
- Inference: infer/README.md
- Artifacts layout: artifacts/README.md
Repository Layout
├── preprocess/      # dataset prep, IR generation, schema input construction
├── data_augment/    # schema linking / SQL augmentation / pairwise CoT annotation
├── training/        # stage-wise training entrypoints and launch scripts
├── infer/           # unified inference pipeline
├── schema_utils/    # IR -> schema rendering utilities
├── value_index/     # value embedding and vector index utilities
├── artifacts/       # intermediate files and generated artifacts
└── dataset/         # benchmark datasets (can be a symlink)
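Before running any stage, it can help to confirm that a checkout matches the layout above. The following is a minimal sketch and is not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory names are copied from the listing.

```python
from pathlib import Path

# Top-level directories copied from the layout listing in this README.
EXPECTED_DIRS = [
    "preprocess", "data_augment", "training", "infer",
    "schema_utils", "value_index", "artifacts", "dataset",
]


def missing_dirs(repo_root: str = ".") -> list[str]:
    """Return the expected top-level directories that are absent under repo_root.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(repo_root)
    # Path.is_dir() follows symlinks, so a symlinked dataset/ counts as present.
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

Run it from the repository root. A missing dataset/ usually just means the benchmark data has not been downloaded or symlinked yet.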
Environment Setup
Because vLLM and the training stack have different dependency constraints, we recommend using separate environments for training and inference.
Training Environment
conda create -n train_env python=3.12
conda activate train_env
pip install -r requirements-train.txt
conda install ninja
MAX_JOBS=64 pip install flash-attn --no-build-isolation
Inference Environment
conda create -n eval_env python=3.12
conda activate eval_env
pip install -r requirements-eval.txt
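With two environments it is easy to launch a script from the wrong one. The sketch below guesses which environment is active by checking whether marker packages are importable. It assumes that flash-attn (import name `flash_attn`) is only installed in train_env and vLLM (import name `vllm`) only in eval_env; the actual contents of the requirements files may differ, and the helper names are hypothetical.

```python
import importlib.util

# Assumed marker packages per environment; adjust to the real requirements files.
ENV_MARKERS = {
    "train_env": ["flash_attn"],
    "eval_env": ["vllm"],
}


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def guess_environment(installed=is_installed) -> list[str]:
    """Return the environment names whose marker packages are all installed.

    The `installed` parameter exists so the check can be replaced in tests.
    """
    return [
        env
        for env, modules in ENV_MARKERS.items()
        if all(installed(m) for m in modules)
    ]


if __name__ == "__main__":
    matches = guess_environment()
    print("Looks like:", ", ".join(matches) if matches else "neither environment")
```

`find_spec` only locates the package, so the check stays fast and does not load CUDA libraries.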
Workflow
- Prepare datasets
python preprocess/prepare_datasets.py --all
- Build IR / index / schema input
python preprocess/schema_to_ir.py --all
python preprocess/build_value_index.py --all --device cuda:0
python preprocess/schema_input.py --bench Spider_dev --device cuda:0
- Generate training data
python data_augment/schema_linking_augment.py
- Train stage models
bash training/train.sh global-sft <lr> <epoch> <model_name_or_path>
bash training/train.sh local-linker <lr> <epoch> <model_name_or_path>
bash training/train.sh generator <lr> <epoch> <model_name_or_path>
bash training/train.sh selector <lr> <epoch> <model_name_or_path>
- Run unified inference
python infer/inference.py --help
# or
bash infer/start_pipeline.sh <BENCHMARK> <SETTING>
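The preprocessing, augmentation, and training steps above can be chained in a small driver. This is a sketch under stated assumptions, not a script shipped with the repository: the function names are hypothetical, the learning rate, epoch count, and model path in the example are placeholders rather than recommended settings, and the driver does not switch conda environments. Inference is left out because it runs in the separate eval_env.

```python
import shlex
import subprocess

# Training stages in the order they are listed in the workflow.
STAGES = ["global-sft", "local-linker", "generator", "selector"]


def build_commands(lr: str, epoch: str, model: str,
                   bench: str = "Spider_dev",
                   device: str = "cuda:0") -> list[list[str]]:
    """Build the workflow commands as argument lists, in README order."""
    cmds = [
        ["python", "preprocess/prepare_datasets.py", "--all"],
        ["python", "preprocess/schema_to_ir.py", "--all"],
        ["python", "preprocess/build_value_index.py", "--all", "--device", device],
        ["python", "preprocess/schema_input.py", "--bench", bench, "--device", device],
        ["python", "data_augment/schema_linking_augment.py"],
    ]
    # One train.sh call per stage: <stage> <lr> <epoch> <model_name_or_path>.
    cmds += [["bash", "training/train.sh", stage, lr, epoch, model] for stage in STAGES]
    return cmds


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder arguments for illustration only.
    run_all(build_commands(lr="1e-5", epoch="1", model="path/to/base_model"))
```

The default is a dry run that only prints the commands, which makes it safe to inspect the sequence before committing GPU time. Pass `dry_run=False` to execute.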