Load dataset from Hugging Face

April 15, 2025 · View on GitHub

The ai.load_dataset function allows you to load datasets from Hugging Face's datasets library directly into your PostgreSQL database.

Example Usage

select ai.load_dataset('squad');

select * from squad limit 10;

Parameters

NameTypeDefaultRequiredDescription
nametext-The name of the dataset on Hugging Face (e.g., 'squad', 'glue', etc.)
config_nametext-The specific configuration of the dataset to load. See Hugging Face documentation for more information.
splittext-The split of the dataset to load (e.g., 'train', 'test', 'validation'). Defaults to all splits.
schema_nametext'public'The PostgreSQL schema where the table will be created
table_nametext-The name of the table to create. If null, will use the dataset name
if_table_existstext'error'Behavior when table exists: 'error' (raise error), 'append' (add rows), 'drop' (drop table and recreate)
field_typesjsonb-Custom PostgreSQL data types for columns as a JSONB dictionary from name to type.
batch_sizeint5000Number of rows to insert in each batch
max_batchesintnullMaximum number of batches to load. Null means load all
kwargsjsonb-Additional arguments passed to the Hugging Face dataset loading function

Returns

Returns the number of rows loaded into the database (bigint).

Using Multiple Transactions

The ai.load_dataset function loads all data in a single transaction. However, to load large dataset, it is sometimes useful to use multiple transactions. For this purpose, we provide the ai.load_dataset_multi_txn procedure. That procedure is similar to ai.load_dataset, but it allows you to specify the number of batches between commits using the commit_every_n_batches parameter.

CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);

Examples

  1. Basic usage - Load the entire 'squad' dataset:
SELECT ai.load_dataset('squad');

The data is loaded into a table named squad.

  1. Load a small subset of the 'squad' dataset:
SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);
  1. Load the entire 'squad' dataset using multiple transactions:
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);
  1. Load specific configuration and split:
SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    split => 'train'
);
  1. Load with custom table name and field types:
SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    table_name => 'mrpc',
    field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb
);
  1. Pre-create the table and load data into it:

CREATE TABLE squad (
    id          TEXT,
    title       TEXT,
    context     TEXT,
    question    TEXT,
    answers     JSONB
);

SELECT ai.load_dataset(
    name => 'squad',
    table_name => 'squad',
    if_table_exists => 'append'
);

Notes

  • The function requires an active internet connection to download datasets from Hugging Face.
  • Large datasets may take significant time to load depending on size and connection speed.
  • The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by field_types.