Load dataset from Hugging Face

April 15, 2025 · View on GitHub

The ai.load_dataset function allows you to load datasets from Hugging Face's datasets library directly into your PostgreSQL database.

Example Usage

select ai.load_dataset('squad');

select * from squad limit 10;

Parameters

Name	Type	Default	Required	Description
name	text	-	✔	The name of the dataset on Hugging Face (e.g., 'squad', 'glue', etc.)
config_name	text	-	✖	The specific configuration of the dataset to load. See Hugging Face documentation for more information.
split	text	-	✖	The split of the dataset to load (e.g., 'train', 'test', 'validation'). Defaults to all splits.
schema_name	text	'public'	✖	The PostgreSQL schema where the table will be created
table_name	text	-	✖	The name of the table to create. If null, will use the dataset name
if_table_exists	text	'error'	✖	Behavior when table exists: 'error' (raise error), 'append' (add rows), 'drop' (drop table and recreate)
field_types	jsonb	-	✖	Custom PostgreSQL data types for columns as a JSONB dictionary from name to type.
batch_size	int	5000	✖	Number of rows to insert in each batch
max_batches	int	null	✖	Maximum number of batches to load. Null means load all
kwargs	jsonb	-	✖	Additional arguments passed to the Hugging Face dataset loading function

Returns

Returns the number of rows loaded into the database (bigint).

The ai.load_dataset function loads all data in a single transaction. However, to load large dataset, it is sometimes useful to use multiple transactions. For this purpose, we provide the ai.load_dataset_multi_txn procedure. That procedure is similar to ai.load_dataset, but it allows you to specify the number of batches between commits using the commit_every_n_batches parameter.

CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);

Examples

Basic usage - Load the entire 'squad' dataset:

SELECT ai.load_dataset('squad');

The data is loaded into a table named squad.

Load a small subset of the 'squad' dataset:

SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);

Load the entire 'squad' dataset using multiple transactions:

CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);

Load specific configuration and split:

SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    split => 'train'
);

Load with custom table name and field types:

SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    table_name => 'mrpc',
    field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb
);

Pre-create the table and load data into it:


CREATE TABLE squad (
    id          TEXT,
    title       TEXT,
    context     TEXT,
    question    TEXT,
    answers     JSONB
);

SELECT ai.load_dataset(
    name => 'squad',
    table_name => 'squad',
    if_table_exists => 'append'
);

Notes

The function requires an active internet connection to download datasets from Hugging Face.
Large datasets may take significant time to load depending on size and connection speed.
The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by field_types.

Example Usage

Parameters

Returns

Using Multiple Transactions

Examples

Notes