pyjanitor

October 24, 2021

pyjanitor is a Python implementation of the R package janitor. It provides a clean user-friendly API for extending pandas with powerful and readable data-cleaning functions.

Quick start

Why janitor?

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps need to run in a certain sequence to succeed. We take a base data file as the starting point and perform actions on it, such as removing null or empty rows, replacing them with other values, adding, renaming, or removing columns, filtering rows, and so on. More formally, these steps, along with their relationships and dependencies, are commonly referred to as a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and it supports method chaining for a subset of its methods. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more are accomplished via the appropriate pd.DataFrame method calls.
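
As a small illustration of this native chaining, the sketch below drops rows with null values and resets the index in a single expression. The column names and values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
raw = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp": [3.5, np.nan, 31.0]}
)

# Each method returns a new DataFrame, so the calls can be chained.
tidy = (
    raw
    .dropna()                # drop the row whose 'temp' is missing
    .reset_index(drop=True)  # renumber the remaining rows from 0
)
```

Because neither call modifies `raw`, the original data stays available for comparison.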

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.

To accomplish this, actions that would otherwise require imperative-style statements can be replaced with method chains that let one read off the logical order of actions taken. Consider the annotated example below. First, here is a textual description of a data-cleaning pathway:

  1. Create a DataFrame.
  2. Delete one column.
  3. Drop rows with empty values in two particular columns.
  4. Rename another two columns.
  5. Add a new column.

Let's import some libraries and begin with some sample data for this example:

# Libraries
import numpy as np
import pandas as pd
import janitor

# Sample Data curated for this example
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

In pandas code, most users might type something like this:

# The Pandas Way

# 1. Create a pandas DataFrame from the company_sales dictionary
df = pd.DataFrame.from_dict(company_sales)

# 2. Delete a column from the DataFrame. Say 'Company1'
del df['Company1']

# 3. Drop rows that have empty values in columns 'Company2' and 'Company3'
df = df.dropna(subset=['Company2', 'Company3'])

# 4. Rename 'Company2' to 'Amazon' and 'Company3' to 'Facebook'
df = df.rename(
    {
        'Company2': 'Amazon',
        'Company3': 'Facebook',
    },
    axis=1,
)

# 5. Let's add some data for another company. Say 'Google'
df['Google'] = [450.0, 550.0, 800.0]

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

Slightly more advanced users might take advantage of pandas' own method chaining:

df = (
    pd.DataFrame(company_sales)
    .drop(columns="Company1")
    .dropna(subset=["Company2", "Company3"])
    .rename(columns={"Company2": "Amazon", "Company3": "Facebook"})
    .assign(Google=[450.0, 550.0, 800.0])
)

# The output is the same as before, and looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

With pyjanitor, we enable method chaining with method names that are explicitly named verbs, which describe the action taken.

df = (
    pd.DataFrame.from_dict(company_sales)
    .remove_columns(["Company1"])
    .dropna(subset=["Company2", "Company3"])
    .rename_column("Company2", "Amazon")
    .rename_column("Company3", "Facebook")
    .add_column("Google", [450.0, 550.0, 800.0])
)

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

As such, pyjanitor's etymology has a two-fold relationship to "cleanliness". Firstly, it's about extending pandas with convenient data cleaning routines. Secondly, it's about providing a cleaner, method-chaining, verb-based API for common pandas routines.

Installation

pyjanitor is currently installable from PyPI:

pip install pyjanitor

pyjanitor can also be installed with the conda package manager:

conda install pyjanitor -c conda-forge

pyjanitor can be installed with the pipenv environment manager too. This requires enabling prerelease dependencies:

pipenv install --pre pyjanitor

pyjanitor requires Python 3.11+.

Functionality

Current functionality includes:

  • Cleaning column names (multi-indexes are possible!)
  • Removing empty rows and columns
  • Identifying duplicate entries
  • Encoding columns as categorical
  • Splitting your data into features and targets (for machine learning)
  • Adding, removing, and renaming columns
  • Coalescing multiple columns into a single column
  • Date conversions (from MATLAB, Excel, Unix) to Python datetime format
  • Expanding a single column that has delimited, categorical values into dummy-encoded variables
  • Concatenating and deconcatenating columns, based on a delimiter
  • Syntactic sugar for filtering the dataframe based on queries on a column
  • Experimental submodules for finance, biology, chemistry, engineering, and pyspark

API

The idea behind the API is two-fold:

  • Copy the R package function names, but enable Pythonic use with method chaining or pandas piping.
  • Add other utility functions that make it easy to do data cleaning/preprocessing in pandas.

Continuing with the company_sales dataframe previously used:

import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

There are three ways to use the API. The first, and the most strongly recommended, is to use pyjanitor's functions as if they were native to pandas.

import janitor  # upon import, functions are registered as part of pandas.

# This cleans the column names and removes any rows or columns that are entirely empty
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()
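
To make the effect of these two calls concrete, here is a rough plain-pandas approximation of what they do with default settings. This is a sketch, not pyjanitor's implementation: the real functions handle more cases (special characters, multi-indexes, and so on), and the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: messy column names, one all-null row, one all-null column.
messy = pd.DataFrame(
    {
        "Sales Month": ["Jan", None, "Mar"],
        "Company 1": [150.0, np.nan, 300.0],
        "Notes": [np.nan, np.nan, np.nan],
    }
)

approx = (
    messy
    # Roughly clean_names(): lowercase names, underscores instead of spaces.
    .rename(columns=lambda name: name.strip().lower().replace(" ", "_"))
    # Roughly remove_empty(): drop rows, then columns, that are entirely null.
    .dropna(axis="index", how="all")
    .dropna(axis="columns", how="all")
)
```

In this sketch the second row and the "Notes" column disappear, and the remaining columns are named `sales_month` and `company_1`.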

The second is the functional API.

from janitor import clean_names, remove_empty

df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)

The final way is to use the pipe() method:

from janitor import clean_names, remove_empty
df = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)
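
pipe() also forwards extra positional and keyword arguments to the function, which helps when a cleaning step takes parameters. The following is a minimal pandas-only sketch; `scale_column` is a hypothetical helper written for this example and is not part of pyjanitor.

```python
import pandas as pd


def scale_column(df, column_name, factor=1.0):
    """Hypothetical helper: return a copy with one column multiplied by factor."""
    return df.assign(**{column_name: df[column_name] * factor})


sales = pd.DataFrame({"month": ["Jan", "Feb"], "amount": [150.0, 200.0]})

# sales.pipe(func, *args, **kwargs) calls func(sales, *args, **kwargs).
scaled = sales.pipe(scale_column, "amount", factor=2)
```

Any function that takes a dataframe as its first argument and returns a dataframe can be slotted into a chain this way.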

Contributing

Follow the development guide for a full description of the process of contributing to pyjanitor.

Adding new functionality

In keeping with the etymology of pyjanitor, contributing a new function is not difficult at all.

Define a function

First off, you will need to define the function that expresses the data processing/cleaning routine, such that it accepts a dataframe as the first argument, and returns a modified dataframe:

import pandas_flavor as pf

@pf.register_dataframe_method
def my_data_cleaning_function(df, arg1, arg2):  # add further arguments as needed
    # Put the data processing logic here.
    return df

We use pandas_flavor to register the function natively on a pandas.DataFrame.

Add a test case

Secondly, we ask that you contribute a test case, to ensure that the function works as intended. Follow the contribution docs for further details.
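
A test can be as small as building a tiny dataframe, calling the function, and asserting on the result. The sketch below uses a hypothetical `drop_constant_columns` function, written as a plain function so that it runs with pandas alone; the project's actual test layout and fixtures are described in the contribution docs.

```python
import pandas as pd


def drop_constant_columns(df):
    """Hypothetical cleaning function: drop columns holding one repeated value."""
    keep = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[keep]


def test_drop_constant_columns():
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "x", "x"]})
    result = drop_constant_columns(df)
    # The constant column 'b' is removed; 'a' is kept.
    assert list(result.columns) == ["a"]
    # The input dataframe is left unmodified.
    assert list(df.columns) == ["a", "b"]
```

Placed in a file that pytest collects, a function with the `test_` prefix like this one is picked up and run automatically.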

Feature requests

If you have a feature request, please post it as an issue on the GitHub repository issue tracker. Even better, put in a PR for it! We are more than happy to guide you through the codebase so that you can contribute.

Because pyjanitor is currently maintained by volunteers and has no fiscal support, any feature requests will be prioritized according to what maintainers encounter as a need in our day-to-day jobs. Please temper expectations accordingly.

API Policy

pyjanitor only extends or aliases the pandas API (and other dataframe APIs), but will never fix or replace them.

Undesirable pandas behaviour should be reported upstream in the pandas issue tracker. We explicitly do not fix the pandas API. If at some point the pandas devs decide to take something from pyjanitor and internalize it as part of the official pandas API, then we will deprecate it from pyjanitor, while acknowledging the original contributors' contribution as part of the official deprecation record.

Contributors

Thanks go to these wonderful people who have contributed to pyjanitor:

  • Paul Anzel 💻
  • aopisco 💻
  • apatao 🐛 💬
  • Abigail Searfoss 💻
  • ashenafiyb 🐛 💬
  • asmirnov69 🐛 💬
  • BaritoneBeard 🐛 💬
  • Bradley Dice 💻
  • Benjamin R. Jack 💻
  • Yash Bhalla 💻
  • Catherine Devlin 💻
  • cduvallet 💻
  • cjmayers 💻
  • CWen 💻
  • Davide Frazzetto 💻
  • Dal Williams 💻
  • Derek Powell 🐛 💬
  • DollofCuty 💻
  • Dmitry Nortsev 💻
  • Daniel Dsouza 💻
  • David Goltra 💻
  • Econundrums 🐛 💬
  • Niklas Eidhagen 💻
  • Emmanuel Ferdman 🐛 💬
  • Emily Qin 💻
  • ericclessantostv 🐛 💬
  • Eric Ma 💻
  • Eric Thompson 🐛 💬
  • Evan Anderson 🐛 💬
  • Eyal Trabelsi 💻
  • Shubhransh Jagota 🐛 💬
  • Geir Arne Hjelle 🐛 💬
  • Andy Gaworecki 💻
  • Gustavo Camargo 💻
  • gddcunh 💻
  • Glen JOHNSON 💻
  • Hector Munoz 💻
  • John C Vallier 💻
  • Jennifer Watt 🔧
  • Kevin Chen 💻
  • jk3587 💻 🔧
  • Jonny Elliott 💻
  • joranbeasley 🐛 💬
  • JoshuaC3 💻
  • keoghdata 💻
  • Kevin-Smith77 🐛 💬
  • Ang 💻
  • Taewon D. Kim 💻
  • Han 💻
  • Kurtis Pinkney 💻
  • Luca Beltrame 💻
  • Logan Thomas 🔧
  • Lucas 💻
  • mdini 💻
  • MinchinWeb 🐛 💬
  • MollyCroke 🐛 💬
  • Margesh Phirke 🐛 💬
  • Marcelo Albuquerque 💻
  • Nick 💻
  • Vamsi Krishna 🐛 💬
  • OdinTech3 🐛 💬
  • portc13 💻
  • Tom Purucker 💻
  • qtson 💻
  • rahosbach 💻
  • Rajat Mittal 💻
  • Ram Narasimhan 💻
  • Becca Perry 💻
  • Richard Qiu 🐛 💬
  • Ricky Lim 💻
  • Robert Mitchell 🐛 💬
  • Sally Hong 💻 🔧
  • Samuel Oranyeli 💻
  • Sam Walkow 💻
  • Nathaniel Saul 🐛 💬
  • Shan Dou 💻
  • Shantanu Oak 💻
  • smu095 🐛 💬
  • SorenFrohlich 💻
  • Carlos Sousa 🐛 💬
  • StephenSchroed 💻
  • Sam Zuckerman 💻
  • Jeremy Goh 🐛 💬
  • Thomas J. Fan 💻
  • Tom Jemmett 🐛 💬
  • Tom Monks 💻
  • Uday Guntupalli 🐛 💬
  • Vincent Perrollaz 🐛 💬
  • Ji Xu 🐛 💬
  • Zach Barry 💻 📢
  • 40% 💻
  • Zijie (ZJ) Poh 💻

This project follows the all-contributors specification. Contributions of any kind are welcome!

Credits

As of 11/05/2025, the test data for the chemistry submodule is unavailable.