Quick Start

September 11, 2023 ยท View on GitHub

JupySQL allows you to run SQL and plot large datasets in Jupyter via a %sql, %%sql, and %sqlplot magics. JupySQL is compatible with all major databases (e.g., PostgreSQL, MySQL, SQL Server), data warehouses (e.g., Snowflake, BigQuery, Redshift), and embedded engines (SQLite, and DuckDB).

It is a fork of ipython-sql with many bug fixes and a lot of great new features!

+++

Installation

Run this on your terminal (we'll use DuckDB for this example):

pip install jupysql duckdb-engine

Or the following in a Jupyter notebook:

%pip install jupysql duckdb-engine --quiet

You might also install it from conda:

conda install jupysql -c conda-forge

Setup

If you are unfamiliar with Jupyter magics, you can refer to our [FAQ](community/FAQ.md#what-is-a-magic). Also, you can view the documentation and command line arguments of any magic command by running `%magic?` like `%sql?` or `%sqlplot?`.

Load the extension:

%load_ext sql

Let's download some sample .csv data:

from pathlib import Path
from urllib.request import urlretrieve

if not Path("penguins.csv").is_file():
    urlretrieve(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
        "penguins.csv",
    )

Start a DuckDB in-memory database:

%sql duckdb://
You can create as many connections as you want. Pass an `--alias {alias}` to easily
[switch them or close](howto.md#switch-connections) them.

Querying

For short queries, you can write them in a single line via the %sql line magic:

%sql SELECT * FROM penguins.csv LIMIT 3

For longer queries, you can break them down into multiple lines using the %%sql cell magic:

%%sql
SELECT *
FROM penguins.csv
WHERE bill_length_mm > 40
LIMIT 3

Saving queries

%%sql --save not_nulls --no-execute
SELECT *
FROM penguins.csv
WHERE bill_length_mm IS NOT NULL
AND bill_depth_mm IS NOT NULL

Plotting

%sqlplot boxplot --column bill_length_mm bill_depth_mm --table not_nulls
%sqlplot histogram --column bill_length_mm bill_depth_mm --table not_nulls

pandas integration

result = %sql SELECT * FROM penguins.csv
df = result.DataFrame()
df.head()