Some aspects of data manipulation, altering data to make it easier to read or use, include sorting and grouping attributes and encoding categorical variables.
# loads the pandas library
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # Ignore Pandas future warnings
# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("clean_baltimore_data.csv")
df.head(n=2)
Requirement already satisfied: pandas in /opt/conda/lib/python3.8/site-packages (1.2.2) Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas) (2.8.1) Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.8/site-packages (from pandas) (2021.1) Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.8/site-packages (from pandas) (1.19.5) Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Unnamed: 0 | Form | State | Security_Grade | Area_Number | Terrain_Description | Favorable_Influences | Detrimental_Influences | INHABITANTS_Type | INHABITANTS_Annual_Income | ... | Ten_Fifteen_Desirability | Remarks | Date | City_clean | Suburb | max_building_age | Year | Day | Month | max_annual_income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | NS FORM-8 6-1-37 | Maryland | A | 2 | Rolling | Fairly new suburban area of homogeneous charac... | No | Substantial Middle Class | $3000 - 5,000 | ... | Upward | A recent development with much room for expans... | May 4,1937 | Baltimore | NaN | 10.0 | 1937.0 | 4.0 | May | 5000.0 |
1 | 1 | NS FORM-8 6-1-37 | Maryland | A | 1 | Undulating | Very nicely planned residential area of medium... | No | Executives, Professional Men | over $5000 | ... | Upward | Mostly fee properties. A few homes valued at $... | May 4,1937 | Baltimore | NaN | 12.0 | 1937.0 | 4.0 | May | 5000.0 |
2 rows × 33 columns
Since many of the attributes have multiple values contained within them, we are going to encode them to indicate that a feature exist or not and use Tableau to visualize the frequency of different characteristics by grade. For example, here the Terrain_Description
column has two categories of values : flat or rolling.
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 655px;'><object class='tableauViz' width='1440' height='655' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='/t/sadata' /><param name='name' value='mapping_inequality/Sheet4' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>
Here, this tells us that rolling
is generally a characteristic overall of all the grades and it's indicated most frequently in the B-grade. The flat
characteristic is not indicated as frequently compared to rolling
overall, however, in the C-grade it's about the same. Note that a value could be flat and rolling.
# case normalization
df['Terrain_Description'] = df['Terrain_Description'].str.lower()
# remove extraneous punctuation
df['Terrain_Description'] = df['Terrain_Description'].str.replace('[^\w\s]','')
df["Terrain_Description"][0:15]
0 rolling 1 undulating 2 rolling 3 level 4 undulating 5 rolling 6 rolling to hilly 7 rolling to hilly 8 high level 9 rolling to level 10 undulating 11 rolling 12 high level 13 rolling 14 undulating Name: Terrain_Description, dtype: object
Here we identified that there are three categories of all the values:
Before we encode, we need to remove punctuation and convert the strings to lowercase so that the ocurrence of rolling will be standardized.
import numpy as np
# inserts a 1 in terrain_rolling whenever the value is 'rolling' or 'undulating' or 'hilly'
df["terrain_rolling"] = np.where((df["Terrain_Description"].str.contains("rolling") |
df["Terrain_Description"].str.contains("undulating")|
df["Terrain_Description"].str.contains("hilly")) , 1, 0)
df.loc[0:10,['terrain_rolling','Terrain_Description']]
terrain_rolling | Terrain_Description | |
---|---|---|
0 | 1 | rolling |
1 | 1 | undulating |
2 | 1 | rolling |
3 | 0 | level |
4 | 1 | undulating |
5 | 1 | rolling |
6 | 1 | rolling to hilly |
7 | 1 | rolling to hilly |
8 | 0 | high level |
9 | 1 | rolling to level |
10 | 1 | undulating |
The'INHABITANTS_Type' describes what the occupation is like in a different section. We would like to get an idea of what sections indicated certain occupations the most. Using value_counts()
, we can see what values frequently occur in this column.
df['INHABITANTS_Type'].value_counts()
White collar class 8 Executives, Professional Men 5 Mixture 5 Laborers 3 Business men 2 Business people 2 Clerks, mechanics 2 Skilled labor clerks 2 Middle class 2 Mechanics, clerks 2 Skilled mechanics, clerks 2 Industrial employees 1 Mixture, laboring class 1 Clerks, skilled mechanics 1 Officers of Plants (industrial) 1 Skilled labor 1 Business men, clerks 1 Skilled mechanics and Junior Executives 1 Substantial Middle Class 1 White collar class 1 Mixture, Jewish 1 Industrial workers 1 Mechanics 1 Business Men 1 White collar class, clerks 1 Professional and Executives 1 Name: INHABITANTS_Type, dtype: int64
Here, are the different groups of inhabitants overall:
Once again, we need to normalize all the text values. Note that the data type of INHABITANTS_Type
was incorrect and needs to be converted to type object
.
df['INHABITANTS_Type'] = df['INHABITANTS_Type'].astype('object')
df['INHABITANTS_Type'] = df['INHABITANTS_Type'].str.lower()
df['INHABITANTS_Type'] = df['INHABITANTS_Type'].str.replace('[^\w\s]','')
df['INHABITANTS_Type'][0:15]
0 substantial middle class 1 executives professional men 2 executives professional men 3 professional and executives 4 executives professional men 5 NaN 6 executives professional men 7 executives professional men 8 business men 9 business people 10 white collar class 11 white collar class 12 mechanics 13 middle class 14 middle class Name: INHABITANTS_Type, dtype: object
df['white_collar'] = np.where(df['INHABITANTS_Type'].str.contains("white collar class"), 1, 0)
df['mixture_or_jewish'] = np.where((df['INHABITANTS_Type'].str.contains("mixture") |
df['INHABITANTS_Type'].str.contains("jewish")), 1, 0)
df['professional'] = np.where(df['INHABITANTS_Type'].str.contains("professional"), 1, 0)
df['business_or_executive'] = np.where((df['INHABITANTS_Type'].str.contains("business") |
df['INHABITANTS_Type'].str.contains("executive")), 1, 0)
df['laborer'] = np.where((df['INHABITANTS_Type'].str.contains("laborer") |
df['INHABITANTS_Type'].str.contains("laborers")), 1, 0)
df['clerks'] = np.where(df['INHABITANTS_Type'].str.contains("clerks"), 1, 0)
df['mechanics'] = np.where(df['INHABITANTS_Type'].str.contains("mechanics"), 1, 0)
df['industrial'] = np.where(df['INHABITANTS_Type'].str.contains("industrial"), 1, 0)
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 655px;'><object class='tableauViz' width='1440' height='655' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='/t/sadata' /><param name='name' value='mapping_inequality/Sheet5' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>
Here we see that grade C has the most variation compared to the other grades, however, within this grade Clerks and Mechanics are indicated the most. Grade D has the least variation with only Mixture/Jewish and Laborers. Most of the INHABITANTS_Type
in grade B are indicated as Business/Executive or White collar. Grade A has more variation than grade D, however less than B and C. The frequency for Professional and Business/Executtive in grade A is about equal.
- Can you create the
terrain_flat
variable?
INHABITANTS_Type
type? What are possible explainations for these outcomes?The values of Area_Number
are out of order and we want these values to be sorted by Security_Grade
.
# removes any additional spaces from Security_Grade
df['Security_Grade'] = df['Security_Grade'].str.replace('[\W]','')
# converts 'Area_Number' from type object to type 'numeric'
df['Area_Number'] = pd.to_numeric(df['Area_Number'])
df.loc[0:10,['Security_Grade','Area_Number']]
Security_Grade | Area_Number | |
---|---|---|
0 | A | 2 |
1 | A | 1 |
2 | A | 3 |
3 | A | 4 |
4 | A | 5 |
5 | A | 6 |
6 | B | 1 |
7 | B | 2 |
8 | B | 3 |
9 | B | 4 |
10 | B | 5 |
To do this, we created use the sort_values()
function on the original data frame and reset the index. First, the data is sorted and grouped by Security_Grade
and then Area_Number
is sorted in increasing order.
df = df.sort_values(by=['Security_Grade', 'Area_Number'])
# resets the index starting from 0
df = df.reset_index(drop=True)
# in order to save the new sorted area_number into the dataframe
df['Area_Number'] = df['Area_Number']
df.loc[0:10,['Security_Grade','Area_Number']]
Security_Grade | Area_Number | |
---|---|---|
0 | A | 1 |
1 | A | 2 |
2 | A | 3 |
3 | A | 4 |
4 | A | 5 |
5 | A | 6 |
6 | B | 1 |
7 | B | 2 |
8 | B | 3 |
9 | B | 4 |
10 | B | 5 |
df.to_csv(r'manipulated_baltimore_data.csv')
Continue to the data analysis and visualization portion of this module.