Load dataset from Hugging Face
April 15, 2025 · View on GitHub
The ai.load_dataset function allows you to load datasets from Hugging Face's datasets library directly into your PostgreSQL database.
Example Usage
select ai.load_dataset('squad');
select * from squad limit 10;
Parameters
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | text | - | ✔ | The name of the dataset on Hugging Face (e.g., 'squad', 'glue', etc.) |
| config_name | text | - | ✖ | The specific configuration of the dataset to load. See Hugging Face documentation for more information. |
| split | text | - | ✖ | The split of the dataset to load (e.g., 'train', 'test', 'validation'). Defaults to all splits. |
| schema_name | text | 'public' | ✖ | The PostgreSQL schema where the table will be created |
| table_name | text | - | ✖ | The name of the table to create. If null, will use the dataset name |
| if_table_exists | text | 'error' | ✖ | Behavior when table exists: 'error' (raise error), 'append' (add rows), 'drop' (drop table and recreate) |
| field_types | jsonb | - | ✖ | Custom PostgreSQL data types for columns as a JSONB dictionary from name to type. |
| batch_size | int | 5000 | ✖ | Number of rows to insert in each batch |
| max_batches | int | null | ✖ | Maximum number of batches to load. Null means load all |
| kwargs | jsonb | - | ✖ | Additional arguments passed to the Hugging Face dataset loading function |
Returns
Returns the number of rows loaded into the database (bigint).
Using Multiple Transactions
The ai.load_dataset function loads all data in a single transaction. However, to load large dataset, it is sometimes useful to use multiple transactions.
For this purpose, we provide the ai.load_dataset_multi_txn procedure. That procedure is similar to ai.load_dataset, but it allows you to specify the number of batches between commits
using the commit_every_n_batches parameter.
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);
Examples
- Basic usage - Load the entire 'squad' dataset:
SELECT ai.load_dataset('squad');
The data is loaded into a table named squad.
- Load a small subset of the 'squad' dataset:
SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);
- Load the entire 'squad' dataset using multiple transactions:
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);
- Load specific configuration and split:
SELECT ai.load_dataset(
name => 'glue',
config_name => 'mrpc',
split => 'train'
);
- Load with custom table name and field types:
SELECT ai.load_dataset(
name => 'glue',
config_name => 'mrpc',
table_name => 'mrpc',
field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb
);
- Pre-create the table and load data into it:
CREATE TABLE squad (
id TEXT,
title TEXT,
context TEXT,
question TEXT,
answers JSONB
);
SELECT ai.load_dataset(
name => 'squad',
table_name => 'squad',
if_table_exists => 'append'
);
Notes
- The function requires an active internet connection to download datasets from Hugging Face.
- Large datasets may take significant time to load depending on size and connection speed.
- The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by
field_types.