Falcon: Enterprise-Grade Text-to-SQL Benchmark
January 21, 2026
A Comprehensive Chinese Text-to-SQL Benchmark for Complex, Cross-Domain Analytical Scenarios
Introduction | Dataset Structure | Getting Started | Citation
Introduction
Falcon is a continuously evolving, high-quality benchmark designed to bridge the gap between academic Text-to-SQL datasets and real-world enterprise requirements. Unlike traditional benchmarks, Falcon focuses on MaxCompute/Hive dialects and stresses models with complex SQL patterns and linguistic ambiguities common in production environments.
Key Features
- SQL Complexity: Heavy focus on multi-table joins (77% of samples), nested CTEs, window functions, ranking, and type casting.
- Linguistic Challenges: Includes Chinese fuzzy time expressions, colloquial business jargon, ellipsis, and multi-intent questions.
- Enterprise Scale: Schemas involve denormalized fields, implicit foreign keys, and domain-specific synonyms.
The current release is built on curated public datasets covering Finance, Internet, and Retail domains.
Dataset Structure
To facilitate robust evaluation, the Falcon benchmark is split into a Development Set (with ground truth) and a Test Set (blind).
Repository Layout
FALCON/
├── dev_data/                      # Development Set
│   ├── dev.json                   # Questions, SQL, and Execution Results
│   ├── tables.json                # Schema definitions (PK/FK/Columns)
│   └── dev_databases/             # SQLite/CSV source files for execution
│
├── test_data/                     # Test Set
│   ├── test.json                  # Questions ONLY (Ground truth hidden)
│   ├── tables.json                # Schema definitions
│   └── test_databases/            # SQLite/CSV source files
│
├── simple_agent/                  # [NEW] Lightweight Evaluation Scripts
│   ├── comparator.py              # SQL execution result comparator
│   ├── utils.py                   # Utilities for SQL extraction from LLM response
│   └── simple_benchmark.py        # Main script to run dev/test evaluation
│
├── submission/                    # [NEW] Submission Helpers & Examples
│   ├── example_submission_csv/    # Example CSV files for leaderboard submission
│   ├── example_submission_sql/    # Example SQL files for leaderboard submission
│   └── format_submission.py       # Helper to convert DB-GPT Excel output to Zip
│
└── README.md
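The layout lists utils.py as handling SQL extraction from LLM responses, but its implementation is not shown here. As a rough illustration of what such a step can involve, the sketch below pulls the first fenced SQL block out of a model reply and falls back to the first SELECT/WITH keyword. The function name extract_sql and the fallback rule are assumptions made for this example, not the repository's code.

```python
import re

# Hypothetical helper for illustration; NOT the repository's utils.py.
FENCE = "`" * 3  # the three-backtick marker that models often wrap SQL in
_FENCED_SQL = re.compile(
    re.escape(FENCE) + r"(?:sql)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)

def extract_sql(response: str) -> str:
    """Return the first fenced block, else the text from the first SELECT/WITH."""
    match = _FENCED_SQL.search(response)
    if match:
        return match.group(1).strip().rstrip(";")
    # Fallback (assumed rule): keep everything from the first SQL keyword onward.
    start = re.search(r"\b(WITH|SELECT)\b", response, re.IGNORECASE)
    if start is None:
        return ""
    return response[start.start():].strip().rstrip(";")
```

A reply with no fenced block and no SELECT/WITH keyword yields an empty string, which a caller could treat as a failed generation.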
Data Format Details
1. Development Data (dev_data/dev.json)
Used for few-shot prompting, fine-tuning, or debugging. Contains the natural language question, the ground truth SQL, and the expected execution result.
[
  {
    "question_id": "1",
    "dataset_id": "finance_01",
    "question": "每个性别的平均年龄是多少,按年龄排序?",
    "sql": "SELECT Gender, AVG(Age) FROM customers GROUP BY Gender ORDER BY AVG(Age)",
    "answer": {
      "Gender": ["Female", "Male"],
      "AvgAge": [27.73, 27.84]
    },
    "is_order": "0"
  }
]
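The answer field stores results column by column, so comparing a prediction against it usually means converting both sides to rows first. The sketch below is one possible way to do that, not the logic in comparator.py. It rests on several assumptions: that is_order equal to "1" means row order matters, that column names can be ignored while column order is kept, and that floats are compared after rounding to two decimals.

```python
from collections import Counter

def to_rows(answer: dict, ndigits: int = 2) -> list:
    """Turn a column-oriented answer ({column: [values]}) into a list of row tuples."""
    columns = [
        [round(v, ndigits) if isinstance(v, float) else v for v in values]
        for values in answer.values()
    ]
    return list(zip(*columns))

def answers_match(gold: dict, predicted: dict, is_order: str) -> bool:
    """Compare two answers by value, ignoring column names."""
    gold_rows, pred_rows = to_rows(gold), to_rows(predicted)
    if is_order == "1":  # assumed meaning: row order is significant
        return gold_rows == pred_rows
    # Otherwise compare as multisets so duplicate rows are still counted.
    return Counter(gold_rows) == Counter(pred_rows)
```

Comparing by position rather than by column name lets a prediction match even when its column labels differ from the stored keys, as with AVG(Age) versus AvgAge in the sample record.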
2. Test Data (test_data/test.json)
Used for the official leaderboard. Only the question and schema reference are provided.
Getting Started
We currently provide two methods for evaluating your models on the Falcon benchmark: a lightweight script-based approach and a GUI-based approach via DB-GPT.
Method 1: Simple Agent (Script-based)
The simple_agent directory contains a lightweight evaluation pipeline. You can use simple_benchmark.py to run evaluations on either the development or test sets.
1. Clone the Repository

       git clone https://github.com/eosphoros-ai/Falcon.git
       cd Falcon

2. Setup Environment
   Ensure you have the necessary Python dependencies installed.

       pip install openai pandas tqdm

3. Run Evaluation

   - Development Set: Run the benchmark on the dev set to check performance against ground truth.

         cd simple_agent
         python simple_benchmark.py dev

   - Test Set: Run the benchmark on the test set to generate predictions.

         cd simple_agent
         python simple_benchmark.py test

   Note on Submission: After execution, a submission.zip will be automatically generated. For official leaderboard submission, a trace log (we recommend .jsonl format) is required. Please ensure you manually include your trace log in the final ZIP before submitting.
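Because the trace log has to be added to the generated ZIP by hand, a few lines of Python can do it without unpacking the archive. This is a minimal sketch using the standard zipfile module. The file name trace.jsonl in the usage note and the choice to store the log at the archive root are assumptions; check the submission guidelines for the expected location.

```python
import zipfile
from pathlib import Path

def add_trace_log(zip_path: str, trace_path: str) -> list:
    """Append a trace log to an existing ZIP and return the archive's file names."""
    # Mode "a" appends to the existing archive instead of overwriting it.
    with zipfile.ZipFile(zip_path, mode="a") as archive:
        # Assumption: the log is stored at the archive root under its own name.
        archive.write(trace_path, arcname=Path(trace_path).name)
        return archive.namelist()
```

For example, add_trace_log("submission.zip", "trace.jsonl") would add a hypothetical trace.jsonl and return the updated list of entries.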
Method 2: DB-GPT (GUI-based)
Falcon is fully integrated into DB-GPT, allowing you to evaluate both Models (LLMs) and Agents through a visual interface.
1. Configuration & Execution
   Please refer to the official DB-GPT Evaluation Documentation for detailed steps on how to:
   - Import the Falcon benchmark dataset.
   - Configure your Model or Agent.
   - Run the evaluation pipeline via the "Models Evaluation" module.

2. Format Submission
   DB-GPT will generate an evaluation report in Excel (.xlsx) format. To submit your results to the Falcon leaderboard, you must convert this file into the required ZIP format using our helper script.

       # Run the formatting script
       python submission/format_submission.py --input <path_to_dbgpt_output.xlsx> --output submission.zip

   Note: The generated submission.zip will contain the required result_sql and result_csv folders formatted correctly for the leaderboard.
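Before uploading, it can help to confirm that the archive really contains the two folders named in the note above. The check below is a small sketch with the standard zipfile module. It assumes result_sql and result_csv sit at the top level of the archive, which the text does not state explicitly.

```python
import zipfile

# Folder names come from the note above; their top-level position is an assumption.
REQUIRED_FOLDERS = {"result_sql", "result_csv"}

def missing_folders(zip_path: str) -> set:
    """Return the required top-level folders that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        top_level = {
            name.split("/", 1)[0] for name in archive.namelist() if "/" in name
        }
    return REQUIRED_FOLDERS - top_level
```

An empty set from missing_folders("submission.zip") means both folders were found.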
Submission
Once you have generated your SQL queries (and execution results), please refer to the submission/ directory for format requirements.
- Examples: Check submission/example_submission_csv and submission/example_submission_sql for the expected file structure.
- Guidelines: Please refer to the Falcon Submission Guidelines for detailed rules.
Citation
If you use Falcon in your research or development, please cite our paper:
@article{falcon2025,
title={Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation},
author={Luo, Wenzhen and Guan, Wei and Yao, Yifan and Pan, Yimin and Wang, Feng and Yu, Zhipeng and Wen, Zhe and Chen, Liang and Zhuang, Yihong},
journal={arXiv preprint arXiv:2510.24762},
year={2025},
url={https://arxiv.org/abs/2510.24762}
}
License
This project is licensed under the Apache License, Version 2.0.
See the LICENSE file for the full text.