KaSLA: Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL

April 23, 2026 Β· View on GitHub

KaSLA: Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL

Paper

Official implementation of the ICDE 2026 paper "Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation"

Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Qing Li, Xiao Huang

The Hong Kong Polytechnic University Β· City University of Macau Β· Jinan University


πŸ”Ž Overview

Text-to-SQL systems built on Large Language Models (LLMs) critically depend on an accurate schema linking stage β€” the mapping of a natural-language query to the relevant tables and columns of the target database. However:

  • Missing a single relevant schema element makes downstream SQL generation fail.
  • Redundant schema elements dilute the prompt, hurt reasoning, and inflate cost.
  • Traditional metrics (Recall / Precision / F1) do not penalize missing-but-critical elements strongly enough, masking real failures.

KaSLA reformulates schema linking as a 0-1 knapsack optimization problem that explicitly balances relevance against a bounded tolerance of redundancy, guided by both a binary and a probabilistic scoring model. A hierarchical strategy first selects tables, then columns within the selected tables, drastically shrinking the candidate space.

KaSLA pipeline
Figure 1. KaSLA serves as a plug-in schema linker: a binary model and a probabilistic model jointly produce relevance / redundancy scores; a 0-1 knapsack solver then selects the optimal tables and columns subject to a data-driven redundancy tolerance.


✨ Key Contributions

  1. Enhanced metrics Recall⁺ / Precision⁺ / F1⁺. A restricted missing indicator zeroes out the score when any essential schema element is missed, aligning metric outcomes with actual SQL-generation success.
  2. Knapsack-based schema linking. Formulate linking as a 0-1 knapsack: maximize total estimated relevance while keeping total redundancy below a tolerance bound u.
  3. Hierarchical linking strategy. Select tables first, then columns within those tables β€” reducing the combinatorial search space and error propagation.
  4. Data-driven redundancy tolerance. Estimate u by aggregating (via max) the cumulative weights of ground-truth links retrieved from a query-linking pool of similar training queries.

πŸ“¦ Data

Because some files exceed GitHub's 100 MB limit, we host them on Google Drive:

πŸ‘‰ Download: https://drive.google.com/file/d/1vv_6ZdOGGsMZ1P0emPIPn4ZE7A8qKNak/view?usp=sharing

After extraction, place the files under KaSLA/data/ so the tree looks like:

data/
β”œβ”€β”€ bird_dev_full.json
β”œβ”€β”€ spider_dev_full.json
β”œβ”€β”€ SL_ranking_probs/
β”‚   β”œβ”€β”€ SL_recall_bird_train_perT.json
β”‚   β”œβ”€β”€ SL_recall_bird_dev_perT.json
β”‚   β”œβ”€β”€ SL_ideal_bird_train_results.json
β”‚   β”œβ”€β”€ SL_recall_spider_train_perT.json
β”‚   β”œβ”€β”€ SL_recall_spider_dev_perT.json
β”‚   β”œβ”€β”€ SL_ideal_spider_train_results.json
β”‚   β”œβ”€β”€ bird_question_similarities_pool_dev.npy
β”‚   └── spider_question_similarities_pool_dev.npy
└── train_datasets/
    └── ...

The raw benchmarks are publicly available:


πŸ”§ Pipeline & Reproducibility

KaSLA is a plug-in schema linker. The full pipeline is:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Raw BIRD/Spider  β”‚ β†’ β”‚  Binary scoring +    β”‚ β†’ β”‚  KaSLA knapsack  β”‚ β†’ β”‚ Text-to-SQL    β”‚
β”‚  natural-lang Qs  β”‚   β”‚  Probabilistic       β”‚   β”‚  optimizer       β”‚   β”‚ generator      β”‚
β”‚                   β”‚   β”‚  scoring models      β”‚   β”‚                  β”‚   β”‚                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 1: Train the Binary Scoring Model (Generative Linker)

Use src/finetune.py (built on LLaMA-Factory) to fine-tune an LLM as the binary scoring model, then run inference via src/incontextlearning.py to produce generative schema linking predictions β†’ icl-results/.

Step 2: Train the Probabilistic Scoring Model

Follow CodeS to train the schema item classifier. This produces the per-table/column probability rankings β†’ data/SL_ranking_probs/.

Quick path: The pre-computed outputs of Step 1 & 2 are provided in the Google Drive bundle, so you can skip directly to Step 3.

Step 3: Apply KaSLA

python KaSLA.py

KaSLA combines the binary and probabilistic signals via 0-1 knapsack optimization with data-driven redundancy tolerance. Edit the SL_file_path variable at the bottom of KaSLA.py to target a different input file.

Step 4: Evaluate Schema Linking

bash eval_schema-linking.sh

The evaluator reports Recall⁺, Precision⁺ and F1⁺ for both tables and columns.

Step 5: Downstream SQL Generation

Plug the KaSLA schema linking results into a Text-to-SQL generator (e.g., via src/eval_ours.py or src/incontextlearning.py) to generate and evaluate SQL.


πŸ“š Citation

If you find KaSLA useful in your research, please cite:

@inproceedings{yuan2026kasla,
  title     = {Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation},
  author    = {Yuan, Zheng and Chen, Hao and Hong, Zijin and Zhang, Qinggang and
               Huang, Feiran and Li, Qing and Huang, Xiao},
  booktitle = {Proceedings of the 42nd IEEE International Conference on Data Engineering (ICDE)},
  year      = {2026}
}

πŸ™ Acknowledgements

We thank the authors of the following projects:

  • LLaMA-Factory β€” The fine-tuning engine used for training our binary scoring model.
  • CodeS β€” Our probabilistic scoring model is trained following their schema item classifier framework.
  • BIRD and Spider β€” The datasets used for training and evaluation.

πŸ“ License

This project is released under the MIT License.