Data Manipulation¶

Some aspects of data manipulation, altering data to make it easier to read or use, include sorting and grouping attributes and encoding categorical variables.

In [30]:

# loads the pandas library 
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # Ignore Pandas future warnings

# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("clean_baltimore_data.csv")
df.head(n=2)

Requirement already satisfied: pandas in /opt/conda/lib/python3.8/site-packages (1.2.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.8/site-packages (from pandas) (1.19.5)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

Out[30]:

	Unnamed: 0	Form	State	Security_Grade	Area_Number	Terrain_Description	Favorable_Influences	Detrimental_Influences	INHABITANTS_Type	INHABITANTS_Annual_Income	...	Ten_Fifteen_Desirability	Remarks	Date	City_clean	Suburb	max_building_age	Year	Day	Month	max_annual_income
0	0	NS FORM-8 6-1-37	Maryland	A	2	Rolling	Fairly new suburban area of homogeneous charac...	No	Substantial Middle Class	$3000 - 5,000	...	Upward	A recent development with much room for expans...	May 4,1937	Baltimore	NaN	10.0	1937.0	4.0	May	5000.0
1	1	NS FORM-8 6-1-37	Maryland	A	1	Undulating	Very nicely planned residential area of medium...	No	Executives, Professional Men	over $5000	...	Upward	Mostly fee properties. A few homes valued at $...	May 4,1937	Baltimore	NaN	12.0	1937.0	4.0	May	5000.0

2 rows × 33 columns

1. Indicators¶

Since many of the attributes have multiple values contained within them, we are going to encode them to indicate that a feature exist or not and use Tableau to visualize the frequency of different characteristics by grade. For example, here the Terrain_Description column has two categories of values : flat or rolling.

In [31]:

%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 655px;'><object class='tableauViz' width='1440' height='655' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;sadata' /><param name='name' value='mapping_inequality&#47;Sheet4' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>

Here, this tells us that rolling is generally a characteristic overall of all the grades and it's indicated most frequently in the B-grade. The flat characteristic is not indicated as frequently compared to rolling overall, however, in the C-grade it's about the same. Note that a value could be flat and rolling.

In [32]:

# case normalization 
df['Terrain_Description'] = df['Terrain_Description'].str.lower()

# remove extraneous punctuation 
df['Terrain_Description'] = df['Terrain_Description'].str.replace('[^\w\s]','')
df["Terrain_Description"][0:15]

Out[32]:

0              rolling
1          undulating 
2              rolling
3                level
4          undulating 
5              rolling
6     rolling to hilly
7     rolling to hilly
8           high level
9     rolling to level
10         undulating 
11             rolling
12          high level
13             rolling
14         undulating 
Name: Terrain_Description, dtype: object

Here we identified that there are three categories of all the values:

rolling/hilly/undualating
level/flat
missing

Before we encode, we need to remove punctuation and convert the strings to lowercase so that the ocurrence of rolling will be standardized.

In [33]:

import numpy as np
# inserts a 1 in terrain_rolling whenever the value is 'rolling' or 'undulating' or 'hilly'
df["terrain_rolling"] = np.where((df["Terrain_Description"].str.contains("rolling") | 
                                      df["Terrain_Description"].str.contains("undulating")|
                                     df["Terrain_Description"].str.contains("hilly")) , 1, 0)

In [34]:

df.loc[0:10,['terrain_rolling','Terrain_Description']]

Out[34]:

	terrain_rolling	Terrain_Description
0	1	rolling
1	1	undulating
2	1	rolling
3	0	level
4	1	undulating
5	1	rolling
6	1	rolling to hilly
7	1	rolling to hilly
8	0	high level
9	1	rolling to level
10	1	undulating

The'INHABITANTS_Type' describes what the occupation is like in a different section. We would like to get an idea of what sections indicated certain occupations the most. Using value_counts(), we can see what values frequently occur in this column.

In [35]:

df['INHABITANTS_Type'].value_counts()

Out[35]:

White collar class                         8
Executives, Professional Men               5
Mixture                                    5
Laborers                                   3
Business men                               2
Business people                            2
Clerks, mechanics                          2
Skilled labor clerks                       2
Middle class                               2
Mechanics, clerks                          2
Skilled mechanics, clerks                  2
Industrial employees                       1
Mixture, laboring class                    1
Clerks, skilled mechanics                  1
Officers of Plants (industrial)            1
Skilled labor                              1
Business men, clerks                       1
Skilled mechanics and Junior Executives    1
Substantial Middle Class                   1
 White collar class                        1
Mixture, Jewish                            1
Industrial workers                         1
Mechanics                                  1
Business Men                               1
White collar class, clerks                 1
Professional and Executives                1
Name: INHABITANTS_Type, dtype: int64

Here, are the different groups of inhabitants overall:

white colllar class
mixture/jewish
professional
business/executive
laborers / labor
clerks
mechanics
industrial

Once again, we need to normalize all the text values. Note that the data type of INHABITANTS_Type was incorrect and needs to be converted to type object.

In [36]:

df['INHABITANTS_Type'] = df['INHABITANTS_Type'].astype('object')
df['INHABITANTS_Type'] = df['INHABITANTS_Type'].str.lower()
df['INHABITANTS_Type'] = df['INHABITANTS_Type'].str.replace('[^\w\s]','')
df['INHABITANTS_Type'][0:15]

Out[36]:

0         substantial middle class
1      executives professional men
2      executives professional men
3     professional and executives 
4      executives professional men
5                              NaN
6      executives professional men
7      executives professional men
8                     business men
9                  business people
10              white collar class
11              white collar class
12                       mechanics
13                    middle class
14                    middle class
Name: INHABITANTS_Type, dtype: object

In [37]:

df['white_collar'] = np.where(df['INHABITANTS_Type'].str.contains("white collar class"), 1, 0)

df['mixture_or_jewish'] = np.where((df['INHABITANTS_Type'].str.contains("mixture") |
                                   df['INHABITANTS_Type'].str.contains("jewish")), 1, 0)

df['professional'] = np.where(df['INHABITANTS_Type'].str.contains("professional"), 1, 0)

df['business_or_executive'] = np.where((df['INHABITANTS_Type'].str.contains("business") |
                              df['INHABITANTS_Type'].str.contains("executive")), 1, 0)

df['laborer'] = np.where((df['INHABITANTS_Type'].str.contains("laborer") |
                              df['INHABITANTS_Type'].str.contains("laborers")), 1, 0)

df['clerks'] = np.where(df['INHABITANTS_Type'].str.contains("clerks"), 1, 0)

df['mechanics'] = np.where(df['INHABITANTS_Type'].str.contains("mechanics"), 1, 0)

df['industrial'] = np.where(df['INHABITANTS_Type'].str.contains("industrial"), 1, 0)

In [38]:

%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 655px;'><object class='tableauViz' width='1440' height='655' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;sadata' /><param name='name' value='mapping_inequality&#47;Sheet5' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>

Here we see that grade C has the most variation compared to the other grades, however, within this grade Clerks and Mechanics are indicated the most. Grade D has the least variation with only Mixture/Jewish and Laborers. Most of the INHABITANTS_Type in grade B are indicated as Business/Executive or White collar. Grade A has more variation than grade D, however less than B and C. The frequency for Professional and Business/Executtive in grade A is about equal.

Exercise 3¶

Can you create the terrain_flat variable?

What are some other things you notice about values in INHABITANTS_Type type? What are possible explainations for these outcomes?
Check out this DataCamp article about encoding categorical data in python. What are some other attributes that can be encoded and how would you encode them?

2. Sorting and Grouping¶

The values of Area_Number are out of order and we want these values to be sorted by Security_Grade.

In [39]:

# removes any additional spaces from Security_Grade
df['Security_Grade'] = df['Security_Grade'].str.replace('[\W]','')
# converts 'Area_Number' from type object to type 'numeric'
df['Area_Number'] = pd.to_numeric(df['Area_Number'])
df.loc[0:10,['Security_Grade','Area_Number']]

Out[39]:

	Security_Grade	Area_Number
0	A	2
1	A	1
2	A	3
3	A	4
4	A	5
5	A	6
6	B	1
7	B	2
8	B	3
9	B	4
10	B	5

To do this, we created use the sort_values() function on the original data frame and reset the index. First, the data is sorted and grouped by Security_Grade and then Area_Number is sorted in increasing order.

In [40]:

df = df.sort_values(by=['Security_Grade', 'Area_Number'])
# resets the index starting from 0
df = df.reset_index(drop=True)
# in order to save the new sorted area_number into the dataframe
df['Area_Number'] = df['Area_Number']
df.loc[0:10,['Security_Grade','Area_Number']]

Out[40]:

	Security_Grade	Area_Number
0	A	1
1	A	2
2	A	3
3	A	4
4	A	5
5	A	6
6	B	1
7	B	2
8	B	3
9	B	4
10	B	5

In [41]:

df.to_csv(r'manipulated_baltimore_data.csv')

Next Step¶

Continue to the data analysis and visualization portion of this module.

In [ ]: