Although you may be excited to see what insights you can get from the 16,000 records in Tableau, it's important to clean up the data in a tool like OpenRefine. You will be able to easily extract data to create new fields necessary for your analysis, delete insufficient or irrelavant data that would skew your results, and ensure you will produce a quality final project.
1.To begin, you’ll need to download the 16,000 CSV file by clicking "Create New Project." You should then see the data organized as such:
2.You'll notice that there are many businesses listed, so we will want to get rid of those to ensure we will only be analyzing what kinds of housing people have by their kinds of jobs. We can do this by clicking the carrot next to Biz -> Facet -> Custom text facet. You will then click on the data that have a "1" in this field on the left. Then, click the first carrot by All, Edit Rows -> Star Rows -> Remove matching rows. This should remove the majority of the businesses. We'll continue to sieve the data by deleting irrelevant columns such as "Spouse Name," "Biz" Columns, "Widow," and "Company Details." See if you can figure out how to do this! You should end up with a screen like this:
3.Now, we'll start to extract the hidden housing data by adding a column "Residence Type" based on the "Housing" column. In the new "Residence Type," go to Edit Cells -> Transform. Use the following GREL value.replace("h ", "House"). You should end up with the rows shown below. Now repeat this transformation to turn bds into "Beds" and rms into "Rooms."
4.Next we will start using regular expressions, or regex to get rid of the address in the "Residence Type" column. This will take a few steps:
Edit Cells -> Transform
Enter: value.startsWith("House")
Click on true to filter out the non-House rows.
Enter: value.replace(/House.*$/, "House")
This will delete all characters after "House". You should end up with a screen as shown below. Now repeat this for Rooms and Beds.
Now we'll go back to the "Residence" column and remove "h," "bds," and "rms." Can you guess how to do this? Hint: look at the GREL function in the beginning of the step!
5.You may have notice there are some red flags in the data. Some addresses didn't have a housing designation, and will certainly mess up our data visualizations in Tableau if we tried to import the data now. Let's now get rid of those addresses. We will create a custom text facet for entries beginning with the number 1, starring that row, and deleting the row as we did earlier. Repeat this for numbers 2-9. Now create a general text facet to delete those businesses that had fallen through the cracks earlier. Star and delete those rows again. Your screen should now look like this: