Before we begin to clean the data, it is important that we become more familiar with the context of the data. Each record describes a graded section of the security grade map seen above, identified by its security grade and area number. The fields of each record have been transcribed and stored in comma-separated values (csv) format.
Using the pandas library, we can create a data frame from the csv. A data frame is a table where every row represents a single unique object, called an entity, and every column represents characteristics of these objects, called attributes.
# loads the pandas library
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # Ignore Pandas future warnings
# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("AD_Data_BaltimoreProject.csv")
df.head(n=3)
Note that we have 26 attributes. We can view all of the attributes and their data types by calling dtypes on our dataframe.
df.dtypes
Form                               object
State                              object
City                               object
Security_Grade                     object
Area_Number                         int64
Terrain_Description                object
Favorable_Influences               object
Detrimental_Influences             object
INHABITANTS_Type                   object
INHABITANTS_Annual_Income          object
INHABITANTS_Foreignborn            object
INHABITANTS_F                     float64
INHABITANTS_Negro                  object
INHABITANTS_N                      object
INHABITANTS_Infiltration           object
INHABITANTS_Relief_Families        object
INHABITANTS_Population_Increase    object
INHABITANTS_Population_Decrease    object
INHABITANTS_Population_Static      object
BUILDINGS_Types                    object
BUILDINGS_Construction             object
BUILDINGS_Age                      object
BUILDINGS_Repair                   object
Ten_Fifteen_Desirability           object
Remarks                            object
Date                               object
dtype: object
The data type object indicates that the value is a string in Python. Since we are not going to use the Form column, we are going to drop it. Notice that on the form INHABITANTS_F is a percentage, so we want to convert it to a decimal value so that it is easier to work with later. What other attribute is a percentage?
# drops a single column; drop returns a new data frame, so we reassign to keep the change
df = df.drop(['Form'], axis=1)
# converts the percentages to decimal form; a missing value (NaN) divided by 100 stays NaN
df['INHABITANTS_F'] = df['INHABITANTS_F'] / 100
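Vectorized division handles missing values for free, since NaN divided by 100 is still NaN. A small sketch with made-up values (not taken from the Baltimore data):

```python
import numpy as np
import pandas as pd

# Hypothetical percentage values, including a missing entry
s = pd.Series([80.0, 25.0, np.nan])

# Dividing the whole column converts percentages to decimals; NaN stays NaN
decimals = s / 100
print(decimals.tolist())
```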
After obtaining our data, we need to clean it to prepare for data analysis. This includes consolidating similar text values and making sure a column of interest contains a single attribute.
The INHABITANTS_Foreignborn variable indicates whether there are foreigners in a neighborhood section. Observe that different forms used 'No' or 'None' to indicate there are no foreigners, while 'Small', 'Very few', and 'Mixture' are various ways of indicating there were foreigners. NaN indicates the value on the form is missing.
# selects rows with indices from 0 to 15 inclusive of the 'INHABITANTS_Foreignborn' column
df.loc[0:15,'INHABITANTS_Foreignborn']
0           No
1         None
2         None
3         None
4         None
5          NaN
6           No
7           No
8           No
9        Small
10    Very few
11          No
12          No
13          No
14          No
15     Mixture
Name: INHABITANTS_Foreignborn, dtype: object
Since columns with categorical values are easier to analyze, we are going to transform this column to indicate whether there are foreigners or not. Note that here 'None' is a string transcribed from the form; it is distinct from Python's None keyword, which indicates a null value. We will alter all the 'None' values to 'No' and every other value that indicates foreigners are present to 'Yes'.
# replaces the values of 'None' with 'No'
df['INHABITANTS_Foreignborn'] = df['INHABITANTS_Foreignborn'].replace('None', 'No')
# replaces all other non-missing values with 'Yes'
for value in df['INHABITANTS_Foreignborn']:
    if value != 'No' and pd.notna(value):
        df['INHABITANTS_Foreignborn'] = df['INHABITANTS_Foreignborn'].replace(value, 'Yes')
df.loc[0:15,'INHABITANTS_Foreignborn']
0      No
1      No
2      No
3      No
4      No
5     NaN
6      No
7      No
8      No
9     Yes
10    Yes
11     No
12     No
13     No
14     No
15    Yes
Name: INHABITANTS_Foreignborn, dtype: object
Detrimental_Influences contains descriptions of a region's undesirable characteristics. Notice in this case that there are multiple variations of 'No' to be replaced.
df.loc[0:15,'Detrimental_Influences']
0                                          None
1                                          None
2                              Distance to City
3                                          None
4                                          None
5                                           NaN
6                                           No.
7                                          None
8     Few streets of property in poor condition
9                                          None
10                                         None
11                      Built on filled ground.
12                                         None
13                                         None
14                   Distance to center of city
15                                         None
Name: Detrimental_Influences, dtype: object
To handle the multiple words to be replaced, we can store the strings in a list. Then we can replace each occurrence in Detrimental_Influences by iterating through the list and using the replace() method.
# declares a list called 'no_words' containing strings that all mean 'No'
no_words = ['No.', 'no', 'none', 'None']
# iterates through no_words and replaces each occurrence in Detrimental_Influences
for word in no_words:
    df['Detrimental_Influences'] = df['Detrimental_Influences'].replace(word, 'No')
df.loc[0:15,'Detrimental_Influences']
0                                            No
1                                            No
2                              Distance to City
3                                            No
4                                            No
5                                           NaN
6                                            No
7                                            No
8     Few streets of property in poor condition
9                                            No
10                                           No
11                      Built on filled ground.
12                                           No
13                                           No
14                   Distance to center of city
15                                           No
Name: Detrimental_Influences, dtype: object
We need to do these same operations for INHABITANTS_Negro, whether residents are African American, and INHABITANTS_Infiltration, which indicates whether new groups were moving into the area. Functions are procedures that perform specific tasks. In Python, we can create our own functions or use built-in ones such as replace(). Parameters are the variables defined in a function definition; arguments are the values passed into the function when it is called.
The replace_values function has three parameters: attribute, new_word, and words_to_replace.
# this function iterates through an array of words to be replaced and replaces each occurence in the attribute
# with the new word
def replace_values(attribute, new_word, words_to_replace):
    for word in words_to_replace:
        df[attribute] = df[attribute].replace(word, new_word)
replace_values('INHABITANTS_Negro', 'No', no_words)
replace_values('INHABITANTS_Infiltration', 'No', no_words)
df.loc[0:15,'INHABITANTS_Infiltration']
0                  No
1                  No
2                  No
3                  No
4                  No
5                 NaN
6                  No
7                  No
8                  No
9                 NaN
10                 No
11                 No
12                 No
13                 No
14                 No
15    People from C-1
Name: INHABITANTS_Infiltration, dtype: object
Note that we do the same procedure for words meaning 'small'. Alternatively, we could have reduced these columns to just two categories, 'Yes' and 'No', and noted that alteration in our data analysis.
# Various words meaning small
small_words = ['Nominal','nominal', 'Small','small','Minimum', 'minimum','minimal', 'Very few']
replace_values('INHABITANTS_Negro', 'Nominal', small_words)
replace_values('INHABITANTS_Infiltration', 'Nominal', small_words)
df.loc[0:15,'INHABITANTS_Infiltration']
0                  No
1                  No
2                  No
3                  No
4                  No
5                 NaN
6                  No
7                  No
8                  No
9                 NaN
10                 No
11                 No
12                 No
13                 No
14                 No
15    People from C-1
Name: INHABITANTS_Infiltration, dtype: object
df.loc[0:15,'INHABITANTS_Negro']
0      No
1      No
2      No
3      No
4      No
5     NaN
6      No
7      No
8      No
9      No
10     No
11     No
12     No
13     No
14     No
15     No
Name: INHABITANTS_Negro, dtype: object
Some of the city values include a corresponding suburb after a dash. We can extract the suburbs and create a new column using str.split().
df.loc[16:30,'City']
16                                 Baltimore
17                       Baltimore - Dundalk
18                                 Baltimore
19                  Baltimore - Metropolitan
20                                 Baltimore
21                  Baltimore - Metropolitan
22                                 Baltimore
23                   Baltimore - Sub. Burton
24    Baltimore - Sub. Mt. Washington Summit
25                Baltimore - Sub. Villahove
26            Baltimore - Sub. Colonial Park
27        Baltimore - Sub. Linthicum Heights
28                                 Baltimore
29                                 Baltimore
30                                 Baltimore
Name: City, dtype: object
A separate data frame is created with two columns produced by str.split, where n=1 indicates the string is split at most once. We create two new attributes in the original data frame, City_clean and Suburb. Then we drop the old City column and rename City_clean back to City.
# creates a new data frame that splits the values in the original data frame on the '-' character
newdf = df["City"].str.split('-', n = 1, expand = True)
newdf.head()
# creates two new columns
df['City_clean'] = newdf[0]
df['Suburb'] = newdf[1]
# removes the old column
df.drop(["City"], axis = 1, inplace = True)
# renames the new column as 'City'; rename returns a new data frame, so we reassign
df = df.rename(columns={"City_clean": "City"})
df.loc[16:30,['Suburb']]
 | Suburb |
---|---|
16 | None |
17 | Dundalk |
18 | None |
19 | Metropolitan |
20 | None |
21 | Metropolitan |
22 | None |
23 | Sub. Burton |
24 | Sub. Mt. Washington Summit |
25 | Sub. Villahove |
26 | Sub. Colonial Park |
27 | Sub. Linthicum Heights |
28 | None |
29 | None |
30 | None |
The BUILDINGS_Age attribute represents the range of the ages of the buildings in a region. Notice the range is given as a string of text. For the purposes of analysis, we would like to extract the upper end of the range and create a column of numeric values.
df.loc[0:15,'BUILDINGS_Age']
0      1 to 10 years
1           12 years
2      1 to 20 years
3           10 years
4      1 to 20 years
5                NaN
6           25 years
7     15 to 25 years
8           15 years
9      5 to 25 years
10          20 years
11     5 to 10 years
12          10 years
13          25 years
14     1 to 20 years
15     6 to 25 years
Name: BUILDINGS_Age, dtype: object
Observe that the numeric value at the end of the range is always the last number, followed by the string 'years'. We use specific string patterns called regular expressions (regex) with str.extract() to create a new attribute called max_building_age.
df['max_building_age'] = df['BUILDINGS_Age'].str.extract(r'(\d+)(?!.*\d)', expand=True)
The regular expression in this example is (\d+)(?!.*\d) and was tested using Pythex, a regex editor. We could have imported re, Python's regular expression library, to perform the same task; however, this notebook consistently uses the pandas str methods.
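As a sketch of the alternative just mentioned, the same pattern can be applied to individual strings with the built-in re module:

```python
import re

# The same last-number pattern used with str.extract():
# (\d+) captures a run of digits that is not followed by any later digit
pattern = re.compile(r'(\d+)(?!.*\d)')

for text in ['1 to 10 years', '12 years', '15 to 25 years']:
    match = pattern.search(text)
    print(text, '->', match.group(1))  # -> 10, 12, 25
```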
df.loc[0:15,['BUILDINGS_Age','max_building_age']]
 | BUILDINGS_Age | max_building_age |
---|---|---|
0 | 1 to 10 years | 10 |
1 | 12 years | 12 |
2 | 1 to 20 years | 20 |
3 | 10 years | 10 |
4 | 1 to 20 years | 20 |
5 | NaN | NaN |
6 | 25 years | 25 |
7 | 15 to 25 years | 25 |
8 | 15 years | 15 |
9 | 5 to 25 years | 25 |
10 | 20 years | 20 |
11 | 5 to 10 years | 10 |
12 | 10 years | 10 |
13 | 25 years | 25 |
14 | 1 to 20 years | 20 |
15 | 6 to 25 years | 25 |
The Date column, the date indicated on the form, combines three attributes in the pattern month day, year. As previously stated, we would like to have a single attribute per column.
df.loc[0:3,['Date']]
 | Date |
---|---|
0 | May 4,1937 |
1 | May 4,1937 |
2 | May 4,1937 |
3 | May 4,1937 |
We can separate this column into three columns, Month, Day, and Year, using str.extract() and the regex patterns (\d\d\d\d) to match the four-digit year, (\d) to match a single-digit day, and [A-Z]\w{0,} to match alphabetic text for the month. Note that regex can also be used to reformat the date column.
# extracts the year
df['Year'] = df['Date'].str.extract(r'(\d\d\d\d)', expand=True)
# extracts the day
df['Day'] = df['Date'].str.extract(r'(\d)', expand=True)
# extracts the month
df['Month'] = df['Date'].str.extract(r'([A-Z]\w{0,})', expand=True)
df.loc[0:3,['Date','Month','Day','Year']]
 | Date | Month | Day | Year |
---|---|---|---|---|
0 | May 4,1937 | May | 4 | 1937 |
1 | May 4,1937 | May | 4 | 1937 |
2 | May 4,1937 | May | 4 | 1937 |
3 | May 4,1937 | May | 4 | 1937 |
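One caveat: the (\d) pattern captures only a single digit, which is fine for the dates shown above but would truncate a two-digit day. A sketch with a hypothetical date (not taken from the data) shows the problem and one possible fix:

```python
import re

# Hypothetical two-digit day to illustrate the limitation
date = 'May 14,1937'

print(re.search(r'(\d)', date).group(1))        # captures only the first digit: '1'
print(re.search(r'(\d{1,2}),', date).group(1))  # captures the full day before the comma: '14'
```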
Similar to BUILDINGS_Age, INHABITANTS_Annual_Income, the range of income of residents in a section, is a string of text, and we would like to extract the numeric value at the end of the range.
df.loc[0:7,['INHABITANTS_Annual_Income']]
 | INHABITANTS_Annual_Income |
---|---|
0 | $3000 - 5,000 |
1 | over $5000 |
2 | 3500 - 7000 |
3 | over $5000 |
4 | $3,500 - $10,000 |
5 | NaN |
6 | over $4000 |
7 | over $5000 |
Observe that some values come after '$', some include '-', and others include ','. Since the pattern to extract the last numeric value varies, we use str.replace(), str.extract(), and regex in a sequence of operations to create a new column max_annual_income.
# replaces anything that is not a digit or ',' with a space
df['max_annual_income'] = df['INHABITANTS_Annual_Income'].str.replace(r'[^\d,]', ' ', regex=True)
df.loc[0:7,['max_annual_income']]
 | max_annual_income |
---|---|
0 | 3000 5,000 |
1 | 5000 |
2 | 3500 7000 |
3 | 5000 |
4 | 3,500 10,000 |
5 | NaN |
6 | 4000 |
7 | 5000 |
We did not replace the ',' delimiter in the previous str.replace() operation because it would have been replaced by a space, splitting numbers such as 5,000 into separate pieces. Now we want to remove the delimiter.
# removes the delimiter by replacing ',' with ''
df['max_annual_income'] = df['max_annual_income'].str.replace(',', '', regex=False)
df.loc[0:7,['max_annual_income']]
 | max_annual_income |
---|---|
0 | 3000 5000 |
1 | 5000 |
2 | 3500 7000 |
3 | 5000 |
4 | 3500 10000 |
5 | NaN |
6 | 4000 |
7 | 5000 |
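To see why the comma had to survive the first replacement, consider one hypothetical value run through both orderings:

```python
import re

# Hypothetical income string in the same format as the data
value = '$3000 - 5,000'

# If ',' were also replaced by a space, the thousands group would split apart
# and the last-number regex would grab only the trailing digits.
broken = re.sub(r'[^\d]', ' ', value)
print(re.search(r'(\d+)(?!.*\d)', broken).group(1))  # -> '000'

# Keeping the comma first, then deleting it, preserves the full number.
kept = re.sub(r'[^\d,]', ' ', value).replace(',', '')
print(re.search(r'(\d+)(?!.*\d)', kept).group(1))    # -> '5000'
```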
Finally, we can extract the last number.
# extracts the last number
df['max_annual_income'] = df['max_annual_income'].str.extract(r'(\d+)(?!.*\d)', expand=True)
df.loc[0:7,['max_annual_income']]
 | max_annual_income |
---|---|
0 | 5000 |
1 | 5000 |
2 | 7000 |
3 | 5000 |
4 | 10000 |
5 | NaN |
6 | 4000 |
7 | 5000 |
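As an aside, the many spellings of 'No' handled earlier could also be normalized in one pass with a single case-insensitive regex replacement; a sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical sample mimicking the Detrimental_Influences values
s = pd.Series(['None', 'No.', 'no', 'Distance to City', 'none'])

# One case-insensitive pattern covers No, No., no, none, and None
cleaned = s.replace(r'(?i)^(no\.?|none)$', 'No', regex=True)
print(cleaned.tolist())  # -> ['No', 'No', 'No', 'Distance to City', 'No']
```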
- Can you identify another way to handle No/No./None and its variations?
- What is the median BUILDINGS_Age?
- What is the median income?

We need to convert our dataframe to another csv to use in the next section of this series. The new csv should be available in your local repository after running the cell below.
df.to_csv(r'clean_baltimore_data.csv')
clean_df = pd.read_csv("clean_baltimore_data.csv")
clean_df.head(n=2)
 | Unnamed: 0 | State | Security_Grade | Area_Number | Terrain_Description | Favorable_Influences | Detrimental_Influences | INHABITANTS_Type | INHABITANTS_Annual_Income | ... | Ten_Fifteen_Desirability | Remarks | Date | City | Suburb | max_building_age | Year | Day | Month | max_annual_income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Maryland | A | 2 | Rolling | Fairly new suburban area of homogeneous charac... | No | Substantial Middle Class | $3000 - 5,000 | ... | Upward | A recent development with much room for expans... | May 4,1937 | Baltimore | NaN | 10.0 | 1937.0 | 4.0 | May | 5000.0 |
1 | 1 | Maryland | A | 1 | Undulating | Very nicely planned residential area of medium... | No | Executives, Professional Men | over $5000 | ... | Upward | Mostly fee properties. A few homes valued at $... | May 4,1937 | Baltimore | NaN | 12.0 | 1937.0 | 4.0 | May | 5000.0 |

2 rows × 32 columns
Continue to the Data Manipulation portion of this module.