Quick Start
September 11, 2023 ยท View on GitHub
JupySQL allows you to run SQL and plot large datasets in Jupyter via a %sql, %%sql, and %sqlplot magics. JupySQL is compatible with all major databases (e.g., PostgreSQL, MySQL, SQL Server), data warehouses (e.g., Snowflake, BigQuery, Redshift), and embedded engines (SQLite, and DuckDB).
It is a fork of ipython-sql with many bug fixes and a lot of great new features!
+++
Installation
Run this on your terminal (we'll use DuckDB for this example):
pip install jupysql duckdb-engine
Or the following in a Jupyter notebook:
%pip install jupysql duckdb-engine --quiet
You might also install it from conda:
conda install jupysql -c conda-forge
Setup
If you are unfamiliar with Jupyter magics, you can refer to our [FAQ](community/FAQ.md#what-is-a-magic). Also, you can view the documentation and command line arguments of any magic command by running `%magic?` like `%sql?` or `%sqlplot?`.
Load the extension:
%load_ext sql
Let's download some sample .csv data:
from pathlib import Path
from urllib.request import urlretrieve
if not Path("penguins.csv").is_file():
urlretrieve(
"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
"penguins.csv",
)
Start a DuckDB in-memory database:
%sql duckdb://
You can create as many connections as you want. Pass an `--alias {alias}` to easily
[switch them or close](howto.md#switch-connections) them.
Querying
For short queries, you can write them in a single line via the %sql line magic:
%sql SELECT * FROM penguins.csv LIMIT 3
For longer queries, you can break them down into multiple lines using the %%sql cell magic:
%%sql
SELECT *
FROM penguins.csv
WHERE bill_length_mm > 40
LIMIT 3
Saving queries
%%sql --save not_nulls --no-execute
SELECT *
FROM penguins.csv
WHERE bill_length_mm IS NOT NULL
AND bill_depth_mm IS NOT NULL
Plotting
%sqlplot boxplot --column bill_length_mm bill_depth_mm --table not_nulls
%sqlplot histogram --column bill_length_mm bill_depth_mm --table not_nulls
pandas integration
result = %sql SELECT * FROM penguins.csv
df = result.DataFrame()
df.head()