#!/usr/bin/env python # coding: utf-8 # # OpenRefine: Gathering, Manipulating, and Combining Data # # ### Getting started: Gathering the Business Directory Data # # While we already had the full [16,000 record dataset for the General Directory](Data/Charlotte1911_FULLDATASET.csv) available, I did not have a data set of information from the Business Directory. Thus, before I could start cleaning and combining data, the Business Diretory data needed to be collected. Generating a Business Directory dataset meant OCR-ing this 52 page section of the 1911 Chralotte City Directory, a task that required using Abbyy FineReader. # # ![Abbyy FineReader Screenshot](Images/Abby_FineReader_Screenshot.png) # # Given the column-like structure of the Business Directory and its use of headings to delinate different business types, multiple boxes were drawn around each individual heading and then around all of the businesses listed under each heading. In telling Abbyy FineReader's algorithm to read the Business Directory based on these boxes, Abbyy FineReader was able to generate [OCR results](Data/Business_Directory_Pages_complete.pdf) that placed each heading and, for the most part, each business entry on its own line. # # ![Abbyy FineReader Screenshot-Boxes](Images/Abbyy_FineReader_Screenshot_Boxes.png) # # ![Resulting OCR](Images/Resulting_OCR.png) # # *It should be noted that due to both the time constraints of this project and its central focus being on utilizing OpenRefine's capabilities to combine and expand upon a set of data, only fifteen of the 52 pages of the Business Directory were read and little time was spent working through Abbyy FineReader to try and improve the OCR that was produced after the initial reading. This not only explains why this project can only be seen as a case study, but it also acts as a step to go back to in further work to see if spending more time with Abbyy FineReader could improve the work being done in OpenRefine.* # # ### Working in OpenRefine # #### Manipulating the Business Directory Data # # After generating the OCR for the first fifteen pages of the Business Directory, I could begin trying to manipulate the data to eventually add Business Directory information, specifically the business type of a specific company or individual, to the General Directory dataset. In order to get to this end goal, the work needed to be divided up. I thought I could divide the work into the following steps: # 1. The raw OCR needed to be cleaned up both to fix patterns of mistakes that were in the raw OCR but also to get the data from simply a list of businesses and business types to a table that labeled each company and individual with their correct business type. # 2. Then, I could cross populate the General Directory dataset with the cleaned-up Business Directory dataset to add business types to the former dataset. # # To transform the raw Business Directory OCR into a dataset with each business labeled and organized by their business type, the OCR was uploaded into OpenRefine. From here text filters were used to separate out the records that just contained a business type heading or a subheading and then create separate columns to split these headings and subheadings out in to. # # ![OpenRefine Screenshot 1](Images/OpenRefine_Screenshot_1.png) # # While this did separate out the business type headings and subheadings out from actual companies and individuals, an additional step was needed to connect each company and individual with their specific business type. This step was fairly easy to accomplish by filtering the records by business type and then using a common OpenRefine transformation to add the business type entry down to each of the companies and individuals that were listed under that heading. # # ![2OpenRefine](Images/2OpenRefine.png) # # However, for business types that had multiple subheadings that provided more specific information as to a company or individual's business type, this process would not work; OpenRefine did not recognize what it was being asked to do, so no changes to any records were made. As a work around to this issue, it was determined that using manual labor to populate the Business Directory records with the appropriate business type subheading would be the most convienent plan of action given the limited number of records in the dataset, the limitations of OpenRefine in this area, and the limited time constraints to complete the project. Additionally, manipulating this part of the dataset enabled the records that just contained a business type heading to be deleted from the dataset since OpenRefine is not capable of deleting individual rows easily. After completeing both of these steps, reuploading the edited dataset to OpenRefine, and conducting additional transformations to fix patterns of misread characters noticed in the raw OCR output, the Business Directory dataset fit the intially defined goals: a list of companies and individuals labled by their business type. # # ![3OpenRefine](Images/3OpenRefine.png) # # #### Working Across Datasets: Combining the General Directory and Business Directory # # Seemingly, with the Business Directory dataset manipulated and cleaned-up, the Business Directory dataset could be combined, or crossed, with the General Directory dataset, specifically to add the business type headings and subheadings to the General Directory dataset. To complete this compiling or crossing of datasets the names of companies or individuals in the Business Directory dataset needed to be matched to names in the General Directory so that the corresponding business types could be transferred from the Business Directory dataset to the correct name in the General Directory dataset. # # ![4OpenRefine](Images/4OpenRefine.png) # *In prose, this expression translates to from the OpenRefine project titled "Business Directory Data FINAL" match the entries in the "Business Name" column with the entires in the "Name" column of the General Directory dataset; then, for entires that match across these columns, copy the entry in the corresponding "Business Type 1" column of the "Business Directory Data FINAL" dataset into a new column in the General Directory dataset titled "Business Type."* # # The first attempt at connecting the two dataset, however, connected barely 100 business types to their corresponding company or individual names in the General Directory, indicating, initially, that company or individual name was not going to be the method to connect the two datasets. Reexamining the two datasets indicated that addresses were included in both, offering another avenue to connect the datasets. Upon adapting the previous expression to connect the addresses in both datasets rouglhy 200 records were matched and business types brought over into the General Directory dataset. However, many companies and individuals' names that were not included in the Business Directory dataset had a business type connected to them, meaning this method for crossing the two datasets was generating inaccurate data. Upon closer look, many people and companies worked or lived at the same address causing business types to be attached to people that were not a business but had the same address as a business. This indicated that to connect the two datasets an element of the data that was as close to a unique identifier as possible was needed as the element to match the two datasets together and pull over business type information accurately. Given the limited number of fields in both datasets, the closest element to a unique identifier was company or individual names. Thus, these columns needed to be reexamined. # # This reexamination led to the realization that the formatting of the names of companies and individuals in the Business Directory dataset **did not match** the formatting of the same names in the General Directory dataset, in three specific ways: # 1. In the Business Directory dataset, if the entry was the full name (last name, first name, and middle name) of an individual it was formatted as last name first initial middle initial; an individual's full name in the General Directory dataset was formatted as last name first name middle initial. # 2. The Business Directory dataset placed the title "Mrs." or "Miss" after the middle initial of a female individual; these titles had been separated out into a different column in the General Directory dataset. ![6OpenRefine](Images/6OpenRefine.png) # 3. For individuals with the business type of Clergymen, an abriviation of the demoniation of the place of worship they worked at was included in parenthesis after thier middle inital in the Business Directory dataset; no demoination was included in the General Directory dataset. ![5OpenRefine](Images/5OpenRefine.png) # # For OpenRefine to connect datasets, to follow the "cell.cross" command of the previously mentioned expression, the entires being connected **have to match entirely**. Thus, to connect a larger and more accurate number of records between the two datasets, I needed to do more work transforming both datasets. # # It was determined that the first two issues could be resolved by transforming the General Directory dataset. It would be easier to separate out the first and middle initials from a person's full name in the General Directory than trying to map full names onto initials in the Business Directory. Additionally, adding the "Mrs." or "Miss" title to a name could be done simply by utilizing OpenRefine's capability to combine columns. The third issues was a transformation that had to occur in the Business Directory dataset as it was the only dataset that included this information. The formatting, however, made it simple as the name and denomination could be separated along the opening parenthesis. # # These issues were all resolved using a variety of filtering and facets in OpenRefine to isolate the records that fit within one or more of these categories of issues (full name, Mrs./Miss title, Clergymen) before completing transformations, creating new columns, and combining columns to ensure the names of individuals and companies matched in both datasets. # # Upon completing these transformations and manipulations, the initial "cell.cross" expression was used to create a new column in the General Directory dataset.![7OpenRefine](Images/7OpenRefine.png) # # As seen in the preview, this connected company or individual name with the correct business type. Thus, the new column was populated with accurate information indicating a solid connection between the two datasets was made. Ultimately, half of the records in the Business Directory dataset matched records in the General Directory dataset, the highest number of matches of any of the attemps made in this project. As a result, more than [240 records in the General Directory dataset](Data/FINAL_DATA_SET.csv) gained an additional piece of information, creating more nuanced understandings of the makeup of 1911 Charlotte. # # ![8OpenRefine](Images/8OpenRefine.png) # # Next: [Visualizations and Further Opportunities](Visualizations.ipynb)