Getting Housing Data From Dirty to Clean
Data Cleaning in SQL: How do you find and populate key missing fields in housing data? Or produce a clean version of the data for analysis? Let’s do this with SQL.
Housing and real estate are sectors built on a foundation of data. Whether you are a government organisation or a private business, being able to analyze such datasets is essential to planning your next building project. This is where data analysts like me come in. More often than not, these large datasets arrive dirty and unformatted, and it takes expertise to clean them and get them ready for extracting the insights that help stakeholders make decisions.
I got my hands on housing data for Nashville, a city in Tennessee. Below is a subset of the dataset. It had missing property addresses, unformatted sale dates and locations, duplicate entries and a lot more issues that needed to be resolved before exporting the clean version for analysis. Let’s have a look at each cleaning action and its output. Follow along on my GitHub to see the code used.
— Change the Date Format
The SaleDate field was originally in DateTime format and needed to be converted to Date format. Standardizing the field makes it easier to read and analyze. You can alter and update the original table to insert the cleaned field into it. A sketch of the query follows, along with a before and after:
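Here’s a minimal sketch of the conversion in SQL Server; the table name NashvilleHousing and the new column name are placeholders, so check my GitHub for the actual code:

```sql
-- Add a new Date-typed column, then fill it from the original SaleDate.
-- Table and new column names here are placeholders.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;

UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate);
```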
— Populate Missing Address Records
The PropertyAddress field has some missing (null) rows of data. We need to populate these to have a robust dataset for analysis, which means we need reference-point data. Looking at the table, we notice that the ParcelID field has no null rows and each unique PropertyAddress has its own ParcelID, as seen below. Therefore, we can write code to find rows where PropertyAddress is null and populate them with the known PropertyAddress for that ParcelID.
By using an inner join of the table to itself on ParcelID and a select statement that fills the null rows with the known PropertyAddress data, we are able to populate the PropertyAddress field and update the table to reflect this. So from 29 null rows in the original PropertyAddress field, zero null rows are returned after executing the code, resolving all the nulls in the address field.
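A sketch of that self-join might look like this; UniqueID is an assumed row identifier that stops a row from matching itself:

```sql
-- Fill null addresses from another row that shares the same ParcelID.
-- UniqueID is an assumed per-row identifier; the table name is a placeholder.
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
    ON  a.ParcelID = b.ParcelID
    AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL;
```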
— Split the Address fields into Individual Columns
Looking at the table in the previous section, we notice that the PropertyAddress field contains both the address and the city in a single column. The same holds for the OwnerAddress field, where address, city and state share one field. We need to separate these into individual columns before any meaningful analysis can be done on the values.
Note that commas are used as the delimiter between the address, city and state in both fields, which gives us two ways of splitting them with SQL queries: using SUBSTRING and CHARINDEX, or using PARSENAME and REPLACE. I used the former to split the PropertyAddress field and the latter on the OwnerAddress field. Recall that you can view the code used on my GitHub. Alter and Update the original table with the output of these queries to insert the separated Address, City and State fields. A sketch of both methods and a subset of the output are below.
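Here are sketches of both methods; the new column names are illustrative rather than the ones in my actual script:

```sql
-- Method 1: SUBSTRING and CHARINDEX on PropertyAddress (address, city).
ALTER TABLE NashvilleHousing
ADD PropertySplitAddress NVARCHAR(255),
    PropertySplitCity    NVARCHAR(255);

UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
    PropertySplitCity    = LTRIM(SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress)));

-- Method 2: PARSENAME and REPLACE on OwnerAddress (address, city, state).
-- PARSENAME splits on periods from right to left, hence the REPLACE.
ALTER TABLE NashvilleHousing
ADD OwnerSplitAddress NVARCHAR(255),
    OwnerSplitCity    NVARCHAR(255),
    OwnerSplitState   NVARCHAR(255);

UPDATE NashvilleHousing
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
    OwnerSplitCity    = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
    OwnerSplitState   = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1);
```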
— Standardize the ‘SoldAsVacant’ Field
This field holds four values: Yes, Y, No and N. Clearly, Y means the same as Yes, indicating that the property was vacant when sold, while N means the same as No, indicating that it was occupied. If we want to analyze this field, say to find all properties that were not vacant when sold, we first need to standardize it to a uniform Yes/No representation to get an accurate result.
I first ran a count to see which of these values were most common in the field. Yes and No dominated, so it made sense to change the Y values to Yes and the N values to No. This was achieved with a CASE statement, after which the field was updated to reflect the output. You can use a DISTINCT or COUNT query to check that the update worked; it should return no Y or N values after the field is updated. A sketch is below.
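A sketch of the count, the CASE update and the verification check:

```sql
-- Count each value to see which representation dominates.
SELECT SoldAsVacant, COUNT(*) AS Occurrences
FROM NashvilleHousing
GROUP BY SoldAsVacant
ORDER BY Occurrences;

-- Convert Y and N to Yes and No; leave existing Yes/No values alone.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                        WHEN SoldAsVacant = 'N' THEN 'No'
                        ELSE SoldAsVacant
                   END;

-- Verify: this should now return only Yes and No.
SELECT DISTINCT SoldAsVacant FROM NashvilleHousing;
```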
— Remove Duplicates
Now that we’ve standardized the fields for analysis, we can check for and remove duplicate entries. It’s not standard practice to delete data from your database, so it’s best to put any removed duplicates in a temp table in case they are needed later. Another option is to make a copy of your table before removing any seemingly duplicate data.
I used window functions to find duplicate values, where two rows contain exactly the same values for some key fields. For example, we can deduce that two or more rows having the exact same ParcelID, PropertyAddress, SalePrice, SaleDate and LegalReference are duplicates; LegalReference is included because it’s unique to each sale transaction. Executing ROW_NUMBER() partitioned by these fields flags the duplicated rows. Putting the output in a CTE and querying for rows that occur more than once then pinpoints only the duplicates for subsequent deletion or extraction from the table, as sketched below.
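A sketch of the window function and CTE; UniqueID is an assumed row identifier used only to order rows within each partition:

```sql
-- Number the rows within each group of key fields; any row_num above 1
-- is a duplicate of another row in the same group.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SalePrice,
                            SaleDate, LegalReference
               ORDER BY UniqueID
           ) AS row_num
    FROM NashvilleHousing
)
SELECT *  -- inspect first; swap the SELECT for a DELETE once verified
FROM RowNumCTE
WHERE row_num > 1;
```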
Now that we’ve cleaned the dataset, you can save your clean data as a view and query it further, or export it as a CSV for analysis outside SQL. Although not recommended for raw data, you can create a copy and go a step further to delete unused columns, like the previous address and date columns that were cleaned. Alter the table and use the DROP COLUMN statement to remove these, if needed; a sketch follows.
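A sketch of both steps, with illustrative view and column names:

```sql
-- Save the cleaned fields as a view for further querying or CSV export.
CREATE VIEW CleanNashvilleHousing AS
SELECT ParcelID, SaleDateConverted, PropertySplitAddress, PropertySplitCity,
       OwnerSplitAddress, OwnerSplitCity, OwnerSplitState,
       SoldAsVacant, SalePrice
FROM NashvilleHousing;
GO

-- On a copy of the table, drop the columns that have been superseded.
ALTER TABLE NashvilleHousing
DROP COLUMN PropertyAddress, OwnerAddress, SaleDate;
```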
I hope you enjoyed walking through how I cleaned a dataset in SQL. Remember to check out my GitHub to walk through the code, and view my portfolio for more data stories.