Relaxed Evaluation Metrics for Text-to-SQL

September 26, 2025

This project provides tools for evaluating text-to-SQL systems beyond binary metrics like Exact Match (EM) and Execution Accuracy (EX). We compute:

Execution Accuracy (EX) – do the predicted and ground-truth results match exactly?
Execution Precision (EXP) – of what the system predicted, how much was correct?
Execution Recall (EXR) – of what should have been predicted, how much was recovered?
F1 Score – a balance between precision and recall.
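
To make the four definitions above concrete, here is a minimal sketch that scores two already-executed result sets at the row level. It is an illustration under stated assumptions, not the implementation in src/metrics/: the function name relaxed_scores is hypothetical, rows are compared as whole tuples with duplicates counted, and the project's own techniques (column matching, partial and semantic cell matching) are not modeled.

```python
from collections import Counter


def relaxed_scores(predicted_rows, truth_rows):
    """Row-level EX, EXP, EXR and F1 for two query results.

    Hypothetical helper for illustration only. Rows are treated as
    tuples in a multiset, so duplicate rows count and row order does not.
    """
    pred = Counter(map(tuple, predicted_rows))
    truth = Counter(map(tuple, truth_rows))

    # Rows present in both results (multiset intersection).
    overlap = sum((pred & truth).values())
    n_pred = sum(pred.values())
    n_truth = sum(truth.values())

    # Assumption: two empty results count as a perfect match;
    # an empty result against a non-empty one scores 0.
    precision = overlap / n_pred if n_pred else float(n_truth == 0)
    recall = overlap / n_truth if n_truth else float(n_pred == 0)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "EX": 1.0 if pred == truth else 0.0,
        "EXP": precision,
        "EXR": recall,
        "F1": f1,
    }


# The prediction returns every expected row plus one extra row.
scores = relaxed_scores(
    predicted_rows=[(1, "a"), (2, "b"), (3, "c")],
    truth_rows=[(1, "a"), (2, "b")],
)
print(scores)
```

In this example EX is 0 because the results differ, while recall is 1 (both expected rows were recovered), precision is 2/3 (one of three predicted rows is extra), and F1 is about 0.8. This is the kind of partial credit that a binary metric cannot express.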

Project Structure
Setup
Usage
Experiments
Results

Project Structure

.
├── data/                        
├── docs/
│   ├── METRICS.md               # documentation for relaxed evaluation metrics
│   └── ...                      
├── scripts/
│   ├── load_dotenv.sh           # helper to load environment variables
│   └── ...                     
├── src/
│   ├── core/                    # core utilities and shared components
│   ├── analysis/                
│   │   ├── metrics/             
│   │   └── ...
│   ├── experiments/            
│   │   ├── metrics/             
│   │   └── ...
│   ├── metrics/                 # evaluation framework
│   │   ├── evaluation.py        # main entry point for running evaluation
│   │   ├── __init__.py
│   │   └── metrics/             
│   │       ├── __init__.py
│   │       ├── execution_accuracy.py
│   │       ├── exact_column_and_exact_cell.py
│   │       ├── exact_column_and_partial_cell.py
│   │       ├── semantic_column_and_exact_cell.py
│   │       ├── semantic_column_and_partial_cell.py
│   │       ├── free_column_and_partial_cell.py
│   │       └── unified_column_and_semantic_row.py
│   └── ...                      
├── LICENSE
├── README.md
├── pyproject.toml
└── uv.lock

Setup

1. Environment Variables

Copy .env.example to .env, edit the values, then load them with:

source scripts/load_dotenv.sh

Usage

CLI Usage

# 1. Configure settings (edit scripts/metrics_config.sh, then source it)
source scripts/metrics_config.sh

# 2. Run evaluation
python src/metrics/evaluation.py \
  --predicted-sql "SELECT ...;" \
  --ground-truth-sql "SELECT ...;"

Python Usage

from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS
from src.core.model_manager import OpenAIModel

config = {
    "evaluation_technique": EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELL,
    "db_params": {"dbms": DBMS.SQLITE, "db_path": "path/to/database.sqlite"},
    "penalize_extra_columns": True,
    "embedding_model": OpenAIModel.TEXT_EMBEDDING_3_SMALL,
    "logs_dir_path": "data/evaluation_outputs/",
}

predicted_sql = "SELECT ...;"
ground_truth_sql = "SELECT ...;"

evaluator = Evaluation(config)
results = evaluator.run_evaluation(predicted_sql, ground_truth_sql, log=True)

Experiments

We provide three experiments showing how relaxed metrics uncover insights hidden by EX.

Experiment 1: Table Shape Sensitivity

source scripts/run_metrics_experiment1.sh

Experiment 2: Controlled Error Sensitivity

1- Single Error Mutants

source scripts/run_metrics_experiment2_1.sh

2- Multi Error Mutants

source scripts/run_metrics_experiment2_2.sh

Experiment 3: System-Level Comparison on Shared Failures

source scripts/run_metrics_experiment3.sh

Results

Add tables, figures, or summary observations here.