README.md

March 12, 2026

A modular text-to-SQL project with a full pipeline for:

  1. Dataset preprocessing
  2. SQL data augmentation
  3. Model training
  4. Multi-stage inference

This repository is organized by stage, and this root README serves as a navigation guide. For detailed usage and arguments, see the README in each subdirectory.

🚧 Note: This repository is under construction. 🚧


Repository Layout

├── preprocess/      # dataset prep, IR generation, schema input construction
├── data_augment/    # schema linking / SQL augmentation / pairwise CoT annotation
├── training/        # stage-wise training entrypoints and launch scripts
├── infer/           # unified inference pipeline
├── schema_utils/    # IR -> schema rendering utilities
├── value_index/     # value embedding and vector index utilities
├── artifacts/       # intermediate files and generated artifacts
└── dataset/         # benchmark datasets (can be a symlink)
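
A quick sanity check of a checkout against this layout can be sketched as follows. This is a minimal illustration, not part of the repository: the helper name `missing_stage_dirs` is hypothetical, and it assumes only the eight directory names listed above.

```python
from pathlib import Path

# Directory names copied from the layout listing above.
EXPECTED_DIRS = (
    "preprocess",
    "data_augment",
    "training",
    "infer",
    "schema_utils",
    "value_index",
    "artifacts",
    "dataset",
)


def missing_stage_dirs(repo_root="."):
    """Return the expected top-level directories that are absent under repo_root.

    Hypothetical helper for illustration. Path.is_dir() follows symlinks,
    so a symlinked dataset/ counts as present.
    """
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_stage_dirs()
    print("missing:", ", ".join(missing) if missing else "none")
```

Run from the repository root, the script prints which of the listed directories are not present yet, for example before `dataset/` has been linked.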

Environment Setup

Because the vLLM inference stack and the training stack have different dependency constraints, we recommend a separate environment for each.

Training Environment

conda create -n train_env python=3.12
conda activate train_env
pip install -r requirements-train.txt
conda install ninja
MAX_JOBS=64 pip install flash-attn --no-build-isolation

Inference Environment

conda create -n eval_env python=3.12
conda activate eval_env
pip install -r requirements-eval.txt
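
Since the two environments are meant to stay separate, it can help to confirm which stack the active interpreter actually sees. The following is a minimal sketch: it assumes the import names `vllm` and `flash_attn` for the vLLM and flash-attn packages, the helper `installed_stacks` is hypothetical, and it does not inspect the contents of either requirements file.

```python
import importlib.util


def installed_stacks(modules=("vllm", "flash_attn")):
    """Map each top-level module name to True if it can be located here.

    find_spec only locates a top-level module; it does not import it,
    so the check stays cheap even for GPU-heavy packages.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in installed_stacks().items():
        print(f"{name}: {'found' if present else 'not found'}")
```

Running it once inside `train_env` and once inside `eval_env` gives a rough picture of whether the two stacks ended up where they were intended.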

Workflow

  1. Prepare datasets
python preprocess/prepare_datasets.py --all
  2. Build IR / index / schema input
python preprocess/schema_to_ir.py --all
python preprocess/build_value_index.py --all --device cuda:0
python preprocess/schema_input.py --bench Spider_dev --device cuda:0
  3. Generate training data
python data_augment/schema_linking_augment.py
  4. Train stage models
bash training/train.sh global-sft <lr> <epoch> <model_name_or_path>
bash training/train.sh local-linker <lr> <epoch> <model_name_or_path>
bash training/train.sh generator <lr> <epoch> <model_name_or_path>
bash training/train.sh selector <lr> <epoch> <model_name_or_path>
  5. Run unified inference
python infer/inference.py --help
# or
bash infer/start_pipeline.sh <BENCHMARK> <SETTING>
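
The workflow steps above can be chained from a single driver. The sketch below only assembles the commands shown in this section and, by default, prints them instead of running them. The function names, the `dry_run` flag, and the `schema_bench` and `device` parameters are illustrative assumptions; the placeholder values (`<lr>`, `<epoch>`, `<model_name_or_path>`, `<BENCHMARK>`, `<SETTING>`) must be supplied by the caller.

```python
import shlex
import subprocess

# Stage names as passed to training/train.sh in the workflow above.
TRAIN_STAGES = ("global-sft", "local-linker", "generator", "selector")


def build_commands(lr, epoch, model, benchmark, setting,
                   schema_bench="Spider_dev", device="cuda:0"):
    """Return the workflow as a list of argv lists, in execution order."""
    cmds = [
        ["python", "preprocess/prepare_datasets.py", "--all"],
        ["python", "preprocess/schema_to_ir.py", "--all"],
        ["python", "preprocess/build_value_index.py", "--all", "--device", device],
        ["python", "preprocess/schema_input.py", "--bench", schema_bench,
         "--device", device],
        ["python", "data_augment/schema_linking_augment.py"],
    ]
    # One training run per stage, all sharing the same hyperparameters here.
    cmds += [
        ["bash", "training/train.sh", stage, str(lr), str(epoch), model]
        for stage in TRAIN_STAGES
    ]
    cmds.append(["bash", "infer/start_pipeline.sh", benchmark, setting])
    return cmds


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholders are kept literal; replace them before a real run.
    run_pipeline(build_commands("<lr>", "<epoch>", "<model_name_or_path>",
                                "<BENCHMARK>", "<SETTING>"))
```

Sharing one learning rate and epoch count across all four stages is a simplification of this sketch; the per-stage READMEs remain the reference for actual arguments.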