Cleaning the Home Prices dataset

October 26, 2018 ยท View on GitHub

This dataset was downloaded from DataSF as a csv file named Assessor_Historical_Secured_Property_Tax_Rolls.csv. It contains many columns but only the following are relevant for this project: "Closed Roll Year", "Assessor Neighborhood", and "Assessed Land Value". Assessed land value is treated as the home price since this is the 'valuation of a residence for tax purposes'.

Note: This csv file is not uploaded because Github limits file sizes to 100 MB.

Getting Started

Two Python (version 3.6) modules are used in processing the data:

  • Pandas
  • datetime
# Import Dependencies
import pandas as pd
import datetime

The csv file is loaded into a dataframe.

csv_path = "Assessor_Historical_Secured_Property_Tax_Rolls.csv"
home_df = pd.read_csv(csv_path, encoding='utf-8')

Data cleaning

First, only the relevant columns were retained in the dataframe.

# Create a new dataframe that contains only the relevant columns
home_df2 = home_df[["Closed Roll Year", "Assessor Neighborhood", "Assessed Land Value"]]

Second, the data was grouped based on year and neighbourhood.

# Group the data by neighbourhood
home_gb = home_df2.groupby(["Closed Roll Year", "Assessor Neighborhood"])
home_gb_mean = home_gb.mean()
home_gb2 = home_gb_mean.reset_index() # index (year) is treated as a column

"Closed Roll Year" is a column containing floats. The values need to be converted to integers (to get rid of the decimal place) and then to the datetime format.

# Convert Closed Roll Year values to integers
home_gb2["Closed Roll Year"].astype(int)

# Convert integers to year and put the data into a new column
home_gb2["Closed Roll Year 2"] = pd.to_datetime(home_gb2["Closed Roll Year"].astype(int), 
                                                format = '%Y', errors = 'coerce').dt.year

# Drop the original Closed Roll Year column and replace with the new one.
home_gb2 = home_gb2.drop(["Closed Roll Year"], axis = 1)
home_gb2 = home_gb2.rename(columns = {"Closed Roll Year 2": "Closed Roll Year"})

Output

The first five rows of the resulting dataframe looks like this:

Assessor NeighborhoodAssessed Land ValueClosed Roll Year
0Alamo Square356275.882007
1Anza Vista249759.912007
2Balboa Terrace308412.952007
3Bayview181096.932007
4Bayview Heights128157.542007

It is saved as a csv file in the Data folder, ready for analyses.

# Save file as a csv file
home_gb2.to_csv("combine_updated.csv")