README.md

April 4, 2025

Each folder (cordis, sdss, oncomx) holds the relevant files (seed data, synth data, and dev data) for the corresponding dataset. Each folder also contains a tables.json file, which holds a JSON representation of the database schema, including table names, column names, column data types, and primary/foreign key relationships.

The following is an example of the file structure:

  • dev.json --> the manually generated development dataset
  • seed.json --> the manually generated seed dataset
  • synth.json --> the synthetically generated dataset using the seed query templates
  • tables.json --> a json representation of the schema containing:
    • the database name ("db_id"),
    • free text table names for NLP pipelines ("table_names") e.g. "Stellar spectral line indices" vs "spplines"
    • original table names ("table_names_original") i.e. the table names as they are in the database
    • free text column names for NLP pipelines ("column_names")
    • original column names ("column_names_original") i.e. the column names as they are in the database
    • column data types ("column_types"): time, text or number
    • foreign key relationships ("foreign_keys")
    • primary keys ("primary_keys")
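
The fields listed above can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: it assumes a Spider-style layout in which each entry of "column_names_original" is a [table_index, column_name] pair, "column_types" is aligned with it position by position, and a placeholder entry with table index -1 may be present. The helper names (load_schemas, summarize_schema) and the miniature schema are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict


def load_schemas(path):
    """Read a tables.json file; accept either one schema object or a list of them."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]


def summarize_schema(schema):
    """Group original column names and their types by original table name."""
    tables = schema["table_names_original"]
    columns = defaultdict(list)
    # Assumption: column_names_original holds [table_index, column_name] pairs
    # and column_types has one entry per column, in the same order.
    for (table_idx, col_name), col_type in zip(
        schema["column_names_original"], schema["column_types"]
    ):
        if table_idx < 0:  # skip a Spider-style placeholder such as [-1, "*"]
            continue
        columns[tables[table_idx]].append((col_name, col_type))
    return {
        "db_id": schema["db_id"],
        "tables": {name: columns[name] for name in tables},
    }


# Hypothetical miniature schema, made up for illustration only.
example_schema = {
    "db_id": "example_db",
    "table_names": ["projects", "project members"],
    "table_names_original": ["projects", "project_members"],
    "column_names": [[-1, "*"], [0, "id"], [0, "title"], [1, "project id"]],
    "column_names_original": [[-1, "*"], [0, "id"], [0, "title"], [1, "project_id"]],
    "column_types": ["text", "number", "text", "number"],
    "foreign_keys": [[3, 1]],
    "primary_keys": [1],
}

summary = summarize_schema(example_schema)
for table, cols in summary["tables"].items():
    print(table, cols)
```

For a real file, something like `summarize_schema(load_schemas("sdss/tables.json")[0])` should work if the layout assumptions above hold; if they do not, the unpacking in the loop is the place to adjust.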

The PostgreSQL databases for the three datasets used in this benchmark can be found at the following links: CORDIS, SDSS, OncoMX

Before using pg_restore to import the data, please ensure that the extension pg_trgm is installed.

To install it, execute CREATE EXTENSION pg_trgm; in psql.
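
The import steps described above could be scripted roughly as follows. This is a sketch under stated assumptions: the database name and dump file name are hypothetical, the target database is assumed not to exist yet (hence the createdb step), and the IF NOT EXISTS clause is added here only so the statement can be rerun safely. The function only builds the commands and prints them; nothing is executed.

```python
import shlex
from typing import List


def restore_commands(db_name: str, dump_path: str) -> List[List[str]]:
    """Build the shell commands for restoring one dump, in the order they should run."""
    return [
        # Assumption: the target database does not exist yet.
        ["createdb", db_name],
        # Install pg_trgm before restoring, as the README requires.
        ["psql", "-d", db_name, "-c", "CREATE EXTENSION IF NOT EXISTS pg_trgm;"],
        # Restore the dump into the prepared database.
        ["pg_restore", "-d", db_name, dump_path],
    ]


# Hypothetical database and file names, for illustration only.
for cmd in restore_commands("cordis", "cordis.dump"):
    print(shlex.join(cmd))
```

To actually run the steps, each command list could be passed to `subprocess.run(cmd, check=True)`, with connection options (host, user, port) added as the local setup requires.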

PostgreSQL specification:

  • DBMS: PostgreSQL (ver. 9.5.20)
  • Case sensitivity: plain=lower, delimited=exact
  • Driver: PostgreSQL JDBC Driver (ver. 42.5.0, JDBC4.2)
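
The case sensitivity setting above (plain=lower, delimited=exact) means that unquoted identifiers are folded to lowercase, while double-quoted identifiers are matched exactly as written. The sketch below mimics that rule in Python to show how a table or column name in a query would be resolved; the function name and the example identifier are hypothetical.

```python
def resolve_identifier(identifier: str) -> str:
    """Return the name PostgreSQL would look up for a plain or delimited identifier."""
    is_delimited = (
        len(identifier) >= 2
        and identifier.startswith('"')
        and identifier.endswith('"')
    )
    if is_delimited:
        # Delimited: keep the exact spelling; a doubled quote stands for one quote.
        return identifier[1:-1].replace('""', '"')
    # Plain: folded to lowercase.
    return identifier.lower()


# Hypothetical identifier, for illustration only.
print(resolve_identifier("SpecObj"))
print(resolve_identifier('"SpecObj"'))
```

In practice this suggests writing queries against the original (lowercase) names from tables.json without quotes, and using double quotes only when a name really contains uppercase or special characters.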