Polars Utils 🐻‍❄️

January 3, 2025

License: MIT

A collection of utilities for exploring and analyzing Polars DataFrames, aimed at making exploratory data analysis (EDA) and data processing easier and more insightful.

Features ✨

1. Join Analysis

Automatically analyze potential join relationships between DataFrames:

  • Identify optimal join keys
  • Detect type mismatches and coercion needs
  • Show match rates and relationship patterns
  • Handle one-to-many and many-to-one relationships

import polars as pl
from polars_utils import register_extensions

register_extensions()

# Create sample DataFrames
df1 = pl.DataFrame({
    "id": [1, 2, 3, 4, None],
    "name": ["A", "B", "C", "D", "E"],
    "value": [10, 20, 30, 40, 50],
    "mixed": ["1", "2", "3", "4", "5"]
})

df2 = pl.DataFrame({
    "id": [1, 2, 3, 3, None],
    "name": ["A", "B", "C", "C", "F"],
    "score": [100, 200, 300, 400, 500],
    "mixed": [1, 2, 3, 4, 5]
})

# Analyze join possibilities
df1.polars_utils.join_analysis(df2)

Output:

                                             Join Analysis Results                                              
โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฆโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฆโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฆโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฆโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฆโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฆโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘ Left Column โ•‘ Right Column โ•‘ Types          โ•‘ Left Match % โ•‘ Right Match % โ•‘ Matched Rows โ•‘ Coercion Applied โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฌโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฌโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฌโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฌโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฌโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฌโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘ mixed       โ•‘ mixed        โ•‘ String โ†” Int64 โ•‘ 100.0%       โ•‘ 100.0%        โ•‘ 5            โ•‘ R โ†’ String       โ•‘
โ•‘ id          โ•‘ mixed        โ•‘ Int64          โ•‘ 80.0%        โ•‘ 80.0%         โ•‘ 4            โ•‘ -                โ•‘
โ•‘ id          โ•‘ id           โ•‘ Int64          โ•‘ 60.0%        โ•‘ 80.0%         โ•‘ 4            โ•‘ -                โ•‘
โ•‘ name        โ•‘ name         โ•‘ String         โ•‘ 60.0%        โ•‘ 80.0%         โ•‘ 4            โ•‘ -                โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

2. Data Quality Analysis

Analyze data quality across your DataFrame:

  • Null value analysis
  • Cardinality measurements
  • Type distribution
  • Value patterns

3. Regex Search

Search for patterns across all columns and values in your DataFrame:

  • Regex pattern matching
  • Match counts and percentages
  • Filter to matching columns only
  • Handles mixed data types automatically

# Search for email patterns across all columns
results = df.polars_utils.regex_search(r".*@.*\.com")

# Only show columns with matches
results = df.polars_utils.regex_search("pattern", matches_only=True)

4. Visual Data Analysis

Create compact visualizations within your DataFrame:

  • Single-line histograms for numeric columns
  • Group-wise distribution visualization
  • Customizable characters and widths
  • Works with both groupby and window operations

import polars as pl
from polars_utils import register_extensions

register_extensions()

# Example 1: Basic distribution by age groups
df = pl.DataFrame({
    "age_group": ["0-18", "19-30", "31-50", "51+"],
    "values": [
        [10, 12, 15, 15, 16, 17, 18],  # Young, clustered
        [20, 21, 21, 25, 25, 25, 29],  # Young adults, right skewed
        [35, 35, 35, 40, 45, 45, 50],  # Middle age, bimodal
        [55, 60, 65, 70, 70, 75, 80],  # Senior, spread out
    ]
})

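# Explode the list column to one row per value, then build one histogram per group.
# (with_columns cannot hold an exploded column because the row count changes.)
result = df.explode("values").group_by("age_group", maintain_order=True).agg(
    pl.col("values").polars_utils.create_histogram(max_width=30).alias("distribution")
)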
print(result)

# Example 2: Sales patterns across weekdays
df_sales = pl.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "sales": [
        [100, 120, 110, 105, 115],  # Monday - consistent
        [150, 155, 145, 160, 140],  # Tuesday - high, stable
        [200, 180, 190, 195, 185],  # Wednesday - peak
        [160, 150, 155, 145, 165],  # Thursday - declining
        [120, 125, 115, 110, 130],  # Friday - low, variable
    ]
})

# Same approach: one row per sale, then one histogram per day.
result = df_sales.explode("sales").group_by("day", maintain_order=True).agg(
    pl.col("sales").polars_utils.create_histogram(max_width=20).alias("distribution")
)
print("\nSales Distribution by Day:")
print(result)

Example Outputs:

  1. Age Group Distribution:
shape: (4, 2)
┌───────────┬──────────────────────────────────────────────────────┐
│ age_group ┆ distribution                                         │
│ ---       ┆ ---                                                  │
│ str       ┆ str                                                  │
╞═══════════╪══════════════════════════════════════════════════════╡
│ 0-18      ┆ ▁▂▃▃█████▇▇                      [10.00, 18.00]      │
│ 19-30     ┆ ▁▂▂▃▃█████▇▅▃▂                   [20.00, 29.00]      │
│ 31-50     ┆ ████▁▁▂▂████▁▁▂▂                 [35.00, 50.00]      │
│ 51+       ┆ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁                  [55.00, 80.00]      │
└───────────┴──────────────────────────────────────────────────────┘
  2. Sales Distribution:
shape: (5, 2)
┌─────┬────────────────────────────────────────────┐
│ day ┆ distribution                               │
│ --- ┆ ---                                        │
│ str ┆ str                                        │
╞═════╪════════════════════════════════════════════╡
│ Mon ┆ ▅█▃▁█▆              [100.00, 120.00]       │
│ Tue ┆ ▁▇█▅███             [140.00, 160.00]       │
│ Wed ┆ ▂▇▆█████            [180.00, 200.00]       │
│ Thu ┆ ▁▆██▇▅▂             [145.00, 165.00]       │
│ Fri ┆ ▂▃█▇▅▁              [110.00, 130.00]       │
└─────┴────────────────────────────────────────────┘

The histograms provide quick visual insights:

  • Age groups show different distribution patterns (clustered, skewed, bimodal)
  • Sales patterns reveal daily trends (peak days, variability)
  • Min/max values help contextualize the distributions
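
As an illustration of the general idea behind a single-line histogram, here is a small pure-Python sketch. It is not how `create_histogram` is implemented: the function name `sparkline_histogram`, the `bins` parameter, and the equal-width binning are assumptions chosen for clarity.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline_histogram(values, bins=8):
    """Hypothetical helper: render values as one line of block characters."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    counts = [0] * bins
    for v in values:
        # Equal-width bins; the maximum value falls into the last bin.
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each bin count to a block height; empty bins use the lowest block.
    bars = "".join(BLOCKS[c * (len(BLOCKS) - 1) // peak] for c in counts)
    return f"{bars} [{lo:.2f}, {hi:.2f}]"


line = sparkline_histogram([10, 12, 15, 15, 16, 17, 18], bins=4)
flat = sparkline_histogram([5, 5, 5], bins=3)
print(line)
print(flat)
```

Appending the `[min, max]` range, as the outputs above do, gives the bars a scale that the block characters alone cannot convey.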

Installation 📦

# The hope is to get this onto PyPI soon.
# pip install polars-utils

# For now, you can install from GitHub
pip install -U git+https://github.com/junghoon-son/polars-utils.git

Use Cases 📊

  • Data Exploration: Quick insights into data relationships and patterns
  • Data Quality: Identify data issues and inconsistencies
  • Join Debugging: Understand and fix join problems
  • Pattern Matching: Find specific patterns across your entire dataset
  • Data Integration: Analyze relationships between different data sources

Contributing 🤝

Contributions are welcome! Please feel free to submit a Pull Request.

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments 🙏

  • Built on top of the amazing Polars DataFrame library
  • Inspired by the need for better data exploration tools

Made with ❤️ for the Polars community