Data Wrangling With pandas
February 16, 2021 ยท View on GitHub
ATTENTION!
These examples are significantly out of date. For an update look at Pandas, check out my free course Python Pandas For Your Grandpa - so easy your grandpa could learn it!
Intro
Note that the examples here use pandas version 0.18.1
Install numpy and pandas (data.table)
pip install numpy
pip install pandas
Import numpy and pandas (data.table)
import numpy as np
import pandas as pd
Build a DataFrame from scratch (data.table)
transactions = pd.DataFrame({
'TransactionID': np.arange(10)+1,
'TransactionDate': pd.to_datetime(['2010-08-21', '2011-05-26', '2011-06-16', '2012-08-26', '2013-06-06',
'2013-12-23', '2013-12-30', '2014-04-24', '2015-04-24', '2016-05-08']).date,
'UserID': [7, 3, 3, 1, 2, 2, 3, np.nan, 7, 3],
'ProductID': [2, 4, 3, 2, 4, 5, 4, 2, 4, 4],
'Quantity': [1, 1, 1, 3, 1, 6, 1, 3, 3, 4]
})
Read data from a CSV file (data.table)
Load transactions (data.table)
transactions = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv')
Meta info (data.table)
Full summary (data.table)
transactions.info()
How many rows? (data.table)
transactions.shape[0]
How many columns? (data.table)
transactions.shape[1]
Get the row names (data.table)
transactions.index.values
Get the column names (data.table)
transactions.columns.values
Change the name of column "Quantity" to "Quant" (data.table)
transactions.rename(columns={'Quantity': 'Quant'}) # use argument inplace=TRUE to keep the changes
Change the name of columns ProductID and UserID to PID and UID respectively (data.table)
transactions.rename(columns={'ProductID': 'PID', 'UserID': 'UID'}) # use argument inplace=TRUE to keep the changes
Ordering the rows of a DataFrame (data.table)
Order the rows of transactions by TransactionID descending (data.table)
transactions.sort_values('TransactionID', ascending=False)
Order the rows of transactions by Quantity ascending, TransactionDate descending (data.table)
transactions.sort_values(['Quantity', 'TransactionDate'], ascending=[True, False])
Ordering the columns of a DataFrame (data.table)
Set the column order of transactions as ProductID, Quantity, TransactionDate, TransactionID, UserID (data.table)
transactions[['ProductID', 'Quantity', 'TransactionDate', 'TransactionID', 'UserID']]
Make UserID the first column of transactions (data.table)
transactions[pd.unique(['UserID'] + transactions.columns.values.tolist()).tolist()]
Extracting arrays from a DataFrame (data.table)
Get the 2nd column (data.table)
transactions[[1]].values[:, 0]
Get the ProductID array (data.table)
transactions.ProductID.values
Get the ProductID array using a variable (data.table)
col = "ProductID"
transactions[[col]].values[:, 0]
Row subsetting (data.table)
Subset rows 1, 3, and 6 (data.table)
transactions.iloc[[0,2,5]]
Subset rows exlcuding 1, 3, and 6 (data.table)
transactions.drop([0,2,5], axis=0)
Subset the first 3 rows (data.table)
transactions[:3]
transactions.head(3)
Subset rows excluding the first 3 rows (data.table)
transactions[3:]
transactions.tail(-3)
Subset the last 2 rows (data.table)
transactions.tail(2)
Subset rows excluding the last 2 rows (data.table)
transactions.tail(-2)
Subset rows where Quantity > 1 (data.table)
transactions[transactions.Quantity > 1]
Subset rows where UserID = 2 (data.table)
transactions[transactions.UserID == 2]
Subset rows where Quantity > 1 and UserID = 2 (data.table)
transactions[(transactions.Quantity > 1) & (transactions.UserID == 2)]
Subset rows where Quantity + UserID is > 3 (data.table)
transactions[transactions.Quantity + transactions.UserID > 3]
Subset rows where an external array, foo, is True (data.table)
foo = np.array([True, False, True, False, True, False, True, False, True, False])
transactions[foo]
Subset rows where an external array, bar, is positive (data.table)
bar = np.array([1, -3, 2, 2, 0, -4, -4, 0, 0, 2])
transactions[bar > 0]
Subset rows where foo is TRUE or bar is negative (data.table)
transactions[foo | (bar < 0)]
Subset the rows where foo is not TRUE and bar is not negative (data.table)
transactions[~foo & (bar >= 0)]
Column subsetting (data.table)
Subset by columns 1 and 3 (data.table)
transactions.iloc[:, [0, 2]]
Subset by columns TransactionID and TransactionDate (data.table)
transactions[['TransactionID', 'TransactionDate']]
Subset rows where TransactionID > 5 and subset columns by TransactionID and TransactionDate (data.table)
transactions.loc[transactions.TransactionID > 5, ['TransactionID', 'TransactionDate']]
Subset columns by a variable list of columm names (data.table)
cols = ["TransactionID", "UserID", "Quantity"]
transactions[cols]
Subset columns excluding a variable list of column names (data.table)
cols = ["TransactionID", "UserID", "Quantity"]
transactions.drop(cols, axis=1)
Inserting and updating values (data.table)
Convert the TransactionDate column to type Date (data.table)
transactions['TransactionDate'] = pd.to_datetime(transactions.TransactionDate)
Insert a new column, Foo = UserID + ProductID (data.table)
transactions['Foo'] = transactions.UserID + transactions.ProductID
Subset rows where TransactionID is even and set Foo = NA (data.table)
transactions.loc[transactions.TransactionID % 2 == 0, 'Foo'] = np.nan
Add 100 to each TransactionID (data.table)
transactions.TransactionID = transactions.TransactionID + 100
transactions.TransactionID = transactions.TransactionID - 100 # revert to original IDs
Insert a column indicating each row number (data.table)
transactions['RowIdx'] = np.arange(transactions.shape[0])
Insert columns indicating the rank of each Quantity, minimum Quantity and maximum Quantity (data.table)
transactions['QuantityRk'] = transactions.Quantity.rank(method='average')
transactions['QuantityMin'] = transactions.Quantity.min()
transactions['QuantityMax'] = transactions.Quantity.max()
Remove column Foo (data.table)
transactions.drop('Foo', axis=1, inplace=True)
Remove multiple columns RowIdx, QuantityRk, and RowIdx (data.table)
transactions.drop(['QuantityRk', 'QuantityMin', 'QuantityMax'], axis=1, inplace=True)
Grouping the rows of a DataFrame (data.table)
Group By + Aggregate (data.table)
Group the transations per user, measuring the number of transactions per user (data.table)
transactions.groupby('UserID').apply(lambda x: pd.Series(dict(
Transactions=x.shape[0],
))).reset_index()
Group the transactions per user, measuring the transactions and average quantity per user (data.table)
transactions.groupby('UserID').apply(lambda x: pd.Series(dict(
Transactions=x.shape[0],
QuantityAvg=x.Quantity.mean()
))).reset_index()
Group the transactions per year of the transaction date, measuring the number of transactions per year (data.table)
# TODO
Group the transactions per (user, transaction-year) pair, measuring the number of transactions per group (data.table)
# TODO
Group the transactions per user, measuring the max quantity each user made for a single transaction and the date of that transaction (data.table)
# TODO
Group the transactions per (user, transaction-year), and then group by transaction-year to get the number of users who made a transaction each year (data.table)
# TODO
Group By + Update (data.table)
Insert a column in transactions indicating the number of transactions per user (data.table)
# TODO
Insert columns in transactions indicating the first transaction date and last transaction date per user (data.table)
# TODO
For each transaction, get the date of the previous transaction made by the same user (data.table)
# TODO
Joining DataFrames (data.table)
Setup (data.table)
# Load datasets from CSV
users = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv')
sessions = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/sessions.csv')
products = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/products.csv')
transactions = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv')
# Convert date columns to Date type
users['Registered'] = pd.to_datetime(users.Registered)
users['Cancelled'] = pd.to_datetime(users.Cancelled)
transactions['TransactionDate'] = pd.to_datetime(transactions.TransactionDate)
Basic Joins (data.table)
Join users to transactions, keeping all rows from transactions and only matching rows from users (left join) (data.table)
transactions.merge(users, how='left', on='UserID')
Which transactions have a UserID not in users? (anti join) (data.table)
transactions[~transactions['UserID'].isin(users['UserID'])]
Join users to transactions, keeping only rows from transactions and users that match via UserID (inner join) (data.table)
transactions.merge(users, how='inner', on='UserID')
Join users to transactions, displaying all matching rows AND all non-matching rows (full outer join) (data.table)
transactions.merge(users, how='outer', on='UserID')
Determine which sessions occured on the same day each user registered (data.table)
pd.merge(left=users, right=sessions, how='inner', left_on=['UserID', 'Registered'], right_on=['UserID', 'SessionDate'])
Build a dataset with every possible (UserID, ProductID) pair (cross join) (data.table)
df1 = pd.DataFrame({'key': np.repeat(1, users.shape[0]), 'UserID': users.UserID})
df2 = pd.DataFrame({'key': np.repeat(1, products.shape[0]), 'ProductID': products.ProductID})
pd.merge(df1, df2,on='key')[['UserID', 'ProductID']]
Determine how much quantity of each product was purchased by each user (data.table)
df1 = pd.DataFrame({'key': np.repeat(1, users.shape[0]), 'UserID': users.UserID})
df2 = pd.DataFrame({'key': np.repeat(1, products.shape[0]), 'ProductID': products.ProductID})
user_products = pd.merge(df1, df2,on='key')[['UserID', 'ProductID']]
pd.merge(user_products, transactions, how='left', on=['UserID', 'ProductID']).groupby(['UserID', 'ProductID']).apply(lambda x: pd.Series(dict(
Quantity=x.Quantity.sum()
))).reset_index().fillna(0)
For each user, get each possible pair of pair transactions (TransactionID1, TransactionID2) (data.table)
pd.merge(transactions, transactions, on='UserID')
Join each user to his/her first occuring transaction in the transactions table (data.table)
pd.merge(users, transactions.groupby('UserID').first().reset_index(), how='left', on='UserID')
Rolling Joins (data.table)
Determine the ID of the last session which occured prior to (and including) the date of each transaction per user (data.table)
# TODO
Determine the ID of the first session which occured after (and including) the date of each transaction per user (data.table)
# TODO
Non-equi joins (data.table)
Determine the first transaction that occured for each user prior to (and including) his/her Cancelled date (data.table)
# TODO
Get all transactions where TransactionDate is after the user's Cancellation Date (data.table)
# TODO
Join + Update (data.table)
Insert the price of each product in the transactions dataset (join + update) (data.table)
# TODO
Insert the number of transactions each user made into the users dataset (data.table)
# TODO
Setting indexes (data.table)
Set the index of Transactions as UserID (data.table)
# TODO
View the index of transactions (data.table)
# TODO
Set the index of users as UserID and join to transactions, matching rows only (inner join) (data.table)
# TODO
Set ProductID as the index of transactions and products without re-ordering the rows, then join matching rows only (data.table)
# TODO
Set each ID column as a secondary join index (data.table)
# TODO
View indices (data.table)
# TODO
Inner join between users, transactions, and products (data.table)
# TODO
Reshaping a DataFrame (data.table)
Setup (data.table)
# Read datasets from CSV
users = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv')
transactions = pd.read_csv('https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv')
# Convert date columns to Date type
users['Registered'] = pd.to_datetime(users.Registered)
users['Cancelled'] = pd.to_datetime(users.Cancelled)
transactions['TransactionDate'] = pd.to_datetime(transactions.TransactionDate)
# Add column TransactionWeekday as Categorical type with categories Sunday through Saturday
transactions['TransactionWeekday'] = pd.Categorical(transactions.TransactionDate.dt.weekday_name, categories=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
Convert data from tall format to wide format (data.table)
One-hot encode Weekday (i.e. convert data from tall to wide, where each possible weekday is a column) (data.table)
# TODO
Convert data from wide format to tall format (data.table)
Build a DataFrame with columns {UserID, ActionType, Date} where ActionType is either "Registered" or "Cancelled" and Date is the corresponding date value (data.table)
# TODO