README.md

April 4, 2025

The folders cordis, sdss, and oncomx each hold the relevant files (i.e. seed data, synth data, and dev data) for the corresponding dataset. Each folder also contains a tables.json file, which holds a JSON representation of the database schema, including table names, column names, column data types, and primary/foreign key relationships.

The following is an example of the file structure:

dev.json --> the manually generated development dataset
seed.json --> the manually generated seed dataset
synth.json --> the synthetically generated dataset using the seed query templates
tables.json --> a json representation of the schema containing:
- the database name ("db_id"),
- free text table names for NLP pipelines ("table_names"), e.g. "Stellar spectral line indices" vs "spplines"
- original table names ("table_names_original") i.e. the table names as they are in the database
- free text column names for NLP pipelines ("column_names")
- original column names ("column_names_original") i.e. the column names as they are in the database
- column data types ("column_types"): time, text or number
- foreign key relationships ("foreign_keys")
- primary keys ("primary_keys")
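
As a rough illustration of how these fields fit together, here is a minimal Python sketch that groups columns by table and resolves foreign keys. It assumes a Spider-style layout, which this README does not confirm: a top-level list of schema entries, "column_names_original" as [table index, column name] pairs starting with a "*" placeholder, "primary_keys" as column indices, and "foreign_keys" as pairs of column indices. The sample entry, `summarize_schema`, and `foreign_key_pairs` are hypothetical names made up for this example.

```python
import json
from collections import defaultdict

# Hypothetical, abbreviated entry; the Spider-style layout is an assumption,
# and the real tables.json files are larger.
SAMPLE_TABLES_JSON = """
[
  {
    "db_id": "example_db",
    "table_names": ["projects", "project members"],
    "table_names_original": ["projects", "project_members"],
    "column_names": [[-1, "*"], [0, "id"], [0, "title"],
                     [1, "project id"], [1, "member name"]],
    "column_names_original": [[-1, "*"], [0, "id"], [0, "title"],
                              [1, "project_id"], [1, "member_name"]],
    "column_types": ["text", "number", "text", "number", "text"],
    "foreign_keys": [[3, 1]],
    "primary_keys": [1]
  }
]
"""


def summarize_schema(entry):
    """Group original column names, types, and primary-key flags by table."""
    tables = entry["table_names_original"]
    types = entry["column_types"]
    primary = set(entry["primary_keys"])
    summary = defaultdict(list)
    for idx, (table_idx, name) in enumerate(entry["column_names_original"]):
        if table_idx < 0:  # skip the "*" placeholder (assumed convention)
            continue
        summary[tables[table_idx]].append(
            {"column": name, "type": types[idx], "primary_key": idx in primary}
        )
    return dict(summary)


def foreign_key_pairs(entry):
    """Resolve foreign keys from column indices to 'table.column' strings."""
    tables = entry["table_names_original"]
    columns = entry["column_names_original"]

    def qualified(col_idx):
        table_idx, name = columns[col_idx]
        return f"{tables[table_idx]}.{name}"

    return [(qualified(src), qualified(dst)) for src, dst in entry["foreign_keys"]]


schemas = json.loads(SAMPLE_TABLES_JSON)
for entry in schemas:
    print(entry["db_id"], summarize_schema(entry), foreign_key_pairs(entry))
```

To try this on a real file, replace the `json.loads(SAMPLE_TABLES_JSON)` call with `json.load` on an opened tables.json, and adjust the index handling if the actual layout differs from the assumption above.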

The PostgreSQL databases for the three datasets used in this benchmark can be found at the following links: CORDIS, SDSS, OncoMX

Before using pg_restore to import the data, please ensure that the extension pg_trgm is installed.

To install it, execute CREATE EXTENSION pg_trgm; in psql.
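
The order of operations can be sketched in Python as follows. This is only an illustration: the database name and dump file name are hypothetical placeholders, the dump is assumed to be in an archive format that pg_restore accepts, and connection settings are left at their defaults. The sketch uses `CREATE EXTENSION IF NOT EXISTS` so that the step is safe to repeat, and `--no-owner` is an optional choice rather than a requirement of this benchmark.

```python
import shlex


def restore_commands(db_name, dump_path):
    """Return the commands to run, in order, as argument lists.

    db_name and dump_path are placeholders chosen by the caller.
    """
    return [
        # Create an empty target database.
        ["createdb", db_name],
        # Install pg_trgm in the target database before importing the data.
        ["psql", "-d", db_name, "-c", "CREATE EXTENSION IF NOT EXISTS pg_trgm;"],
        # Import the dump into the target database.
        ["pg_restore", "--no-owner", "-d", db_name, dump_path],
    ]


# Print the commands for a hypothetical dump file instead of executing them.
for cmd in restore_commands("cordis", "cordis.dump"):
    print(shlex.join(cmd))
```

To actually execute the steps, each argument list can be passed to `subprocess.run(cmd, check=True)`, provided the PostgreSQL client tools are on the PATH.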

PostgreSQL specification:
- DBMS: PostgreSQL (ver. 9.5.20)
- Case sensitivity: plain=lower, delimited=exact
- Driver: PostgreSQL JDBC Driver (ver. 42.5.0, JDBC4.2)