## Getting Started With OpenRefine

To work with the spreadsheet of raw data (~16,000 residents) provided to us by Professor Marciano, I first needed to get the addresses into a format that could be loaded into a geocoding tool to determine their latitude and longitude. To accomplish this, I loaded the spreadsheet into OpenRefine and began cleaning up the address field. This took several steps, building on the work we had done in OpenRefine earlier in the semester.

The first step was to remove any lines of data that couldn't be addresses. Several entries had "addresses" listed as text such as "s Tryon" or "near Charlotte," as you can see in the "Housing" column of the image below, neither of which a geocoding tool could resolve. To begin narrowing down the data, I removed all entries that didn't have a number present in the address field. This brought me down to around 12,500 total residents.

![Initial Cleaning](Images/Henry/OpenRefine3.png)

My next step was to filter the data based on the entries that already began with a number. Looking through those entries, I could be confident that most addresses that began with a number were already in the format I needed. I just needed to reformat the entries that didn't begin with a number so that a geocoding tool could read them. This involved multiple cleaning methods, including the removal of preceding characters identifying the type of housing the residents lived in, such as "h" for House, which I could do en masse.

![Further Cleaning](Images/Henry/OpenRefine1.png)

I did need to go in and alter some fields manually, however, as some were uniquely formatted and wouldn't benefit from an expression written to edit multiple fields at once.
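
For readers who want to see the bulk steps in code, the two rules described above (drop entries with no number, then strip a leading housing-type code such as "h") can be sketched in plain Python. This is only an illustration and not the OpenRefine expressions I actually used. The sample values other than "s Tryon" and "near Charlotte" are made up, and the pattern assumes a housing code is a single letter followed by a space and the street number.

```python
import re

# Hypothetical sample values; only "s Tryon" and "near Charlotte"
# come from the spreadsheet examples mentioned in the text.
raw_addresses = [
    "412 N Graham",
    "h 305 E 7th",
    "s Tryon",
    "near Charlotte",
]

# Matches any digit anywhere in the entry.
HAS_DIGIT = re.compile(r"\d")

# Assumption: a housing-type code is one leading letter, then whitespace,
# then the street number (for example "h 305 E 7th").
LEADING_CODE = re.compile(r"^[A-Za-z]\s+(?=\d)")


def clean_address(value):
    """Return a cleaned address, or None if the entry contains no number."""
    value = value.strip()
    if not HAS_DIGIT.search(value):
        return None  # e.g. "s Tryon" cannot be geocoded
    # Remove the leading housing code if one is present.
    return LEADING_CODE.sub("", value)


# Keep only the entries that survived the no-number filter.
cleaned = [c for c in map(clean_address, raw_addresses) if c is not None]
print(cleaned)
```

Entries that don't fit the assumed pattern pass through unchanged, which mirrors the uniquely formatted fields that still needed manual edits.
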
Ultimately, I ended up with a spreadsheet of a little over 12,000 residents whose addresses all resembled real addresses, as shown in the "Housing" column in the above image.

Next Step: [QGIS](QGIS.ipynb)

Return: [Title](9.VisualizingNeighborhoodDemographics.ipynb)