polarstation
May 30, 2026
Tidy helper functions for Polars, inspired by the R tidyverse.
Installation
pip install polarstation
or with uv:
uv add polarstation
Quick start
import polars as pl
import polarstation # registers extension functions for polars
df = pl.DataFrame({
    "animal": ["dog", "dog", None, "bird", "cow", "bird", "bird"],
    "weight": [12.2, 8.1, 7.5, 0.5, 460, 0.4, None],
}).ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.reorder(by="weight")
)
print(df)
print(df['animal'].dtype)
shape: (7, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ --- ┆ --- │
│ enum ┆ f64 │
╞════════╪════════╡
│ dog ┆ 12.2 │
│ dog ┆ 8.1 │
│ null ┆ 7.5 │
│ bird ┆ 0.5 │
│ cow ┆ 460.0 │
│ bird ┆ 0.4 │
│ bird ┆ null │
└────────┴────────┘
Enum(categories=['bird', 'dog', 'cow'])
ps.with_columns is a drop-in replacement for Polars' with_columns
that also handles additional use cases, such as functions that need
to peek at the full data before they can be evaluated. It works
efficiently on both DataFrame and LazyFrame.
Details
The key idea is FrameExpr — an expression that needs a peek at the
data (schema or a small aggregation) before it resolves into a regular
Polars expression. This unlocks operations like deriving Enum categories
from the data, lumping rare levels, or reordering factor levels by a
summary statistic, while keeping the rest of your pipeline lazy.
How FrameExpr stays efficient
ps.with_columns resolves each FrameExpr in two phases. First it runs
a small aggregation (e.g. unique().sort() to discover categories)
against the current lazy plan — so any preceding .filter() or
.select() is already embedded and Polars’ predicate/projection
pushdown keeps the peek cheap. Then it uses the result to build a
concrete pl.Expr (e.g. .cast(pl.Enum(["a", "b", "c"]))) that goes
back into the lazy plan and executes normally.
# Only the filtered rows are scanned for category discovery;
# the cast itself remains lazy.
lf = pl.scan_parquet("events.parquet")
result = (
    lf.filter(pl.col("country") == "DE")
    .ps.with_columns(pl.col("status").ps_enum.make())
    .filter(pl.col("status") == "active")
    .collect()
)
See the FrameExpr docstring for the full explanation, including when
the peek is larger and notes on parallel evaluation.
Dev Notes
To build the documentation, run:
uv run quarto render
and then, in a separate terminal:
uv run quarto preview
To update the documentation at https://const-ae.github.io/polarstation/, run:
uv run quarto publish gh-pages
To re-render README.md, run:
quarto render README.qmd --to gfm
To upload to PyPI, run:
uv build
uv publish
Acknowledgements
This package stands on the shoulders of several excellent projects:
- The tidyverse team for establishing the tidy data philosophy and the vocabulary that shapes this package’s design.
- Hadley Wickham and the forcats authors for the factor-manipulation functions that directly inspired the ps_enum namespace.
- David Hugh-Jones for santoku, which inspired the ps_chop functions.
- Allison Horst, Alison Hill, and Kristen Gorman for the palmerpenguins dataset used in the examples and walkthrough.
License
MIT