The goal of exploratory data analysis (EDA) is to explore attributes across multiple entities to decide what statistical or machine learning techniques to apply to the data. Visualizations are used to assist in understanding the data.
# loads the pandas library
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # Ignore Pandas future warnings
# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("manipulated_baltimore_data.csv")
df.head(n=3)
| | Unnamed: 0 | Unnamed: 0.1 | Form | State | Security_Grade | Area_Number | Terrain_Description | Favorable_Influences | Detrimental_Influences | INHABITANTS_Type | ... | max_annual_income | terrain_rolling | white_collar | mixture_or_jewish | professional | business_or_executive | laborer | clerks | mechanics | industrial |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | NS FORM-8 6-1-37 | Maryland | A | 1 | undulating | Very nicely planned residential area of medium... | No | executives professional men | ... | 5000.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | NS FORM-8 6-1-37 | Maryland | A | 2 | rolling | Fairly new suburban area of homogeneous charac... | No | substantial middle class | ... | 5000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 2 | NS FORM-8 6-1-37 | Maryland | A | 3 | rolling | Good residential area. Well planned. | Distance to City | executives professional men | ... | 7000.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
3 rows × 43 columns
The .describe() function summarizes a data frame column. Since max_building_age currently has dtype 'object', which in pandas typically indicates string values, we first have to convert this attribute to a numeric type.
df['max_building_age'].describe()
count 46.000000 mean 30.086957 std 16.497577 min 10.000000 25% 20.000000 50% 25.000000 75% 40.000000 max 65.000000 Name: max_building_age, dtype: float64
Once max_building_age is a numeric type, describe() provides summary statistics for this attribute.
# converts max_building_age to numeric type
df["max_building_age"] = pd.to_numeric(df["max_building_age"])
df['max_building_age'].describe()
count 46.000000 mean 30.086957 std 16.497577 min 10.000000 25% 20.000000 50% 25.000000 75% 40.000000 max 65.000000 Name: max_building_age, dtype: float64
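To make the effect of the conversion concrete, here is a minimal sketch on a small made-up column rather than the Baltimore data. The `ages` values and the use of `errors="coerce"` are assumptions for illustration; the notebook itself calls `pd.to_numeric` with its default settings.

```python
import pandas as pd

# Hypothetical toy column stored as strings (dtype 'object'); not the Baltimore data
ages = pd.Series(["10", "25", "unknown", "40"], dtype="object", name="max_building_age")

# On an object column, describe() reports count/unique/top/freq
# instead of the mean and quartiles
object_summary = ages.describe()

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising
ages_numeric = pd.to_numeric(ages, errors="coerce")

# On a numeric column, describe() reports count, mean, std, min, quartiles, and max
numeric_summary = ages_numeric.describe()

print(ages.dtype, "->", ages_numeric.dtype)
print(numeric_summary)
```

With the default `errors="raise"`, the `"unknown"` entry would raise a `ValueError`, which can be preferable when unexpected values should not pass silently.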
We can apply the same operations to max_annual_income.
df['max_annual_income'].describe()
count 46.000000 mean 3139.130435 std 2009.806874 min 1000.000000 25% 1850.000000 50% 2750.000000 75% 4000.000000 max 10000.000000 Name: max_annual_income, dtype: float64
df['max_annual_income'] = pd.to_numeric(df['max_annual_income'])
df['max_annual_income'].describe()
count 46.000000 mean 3139.130435 std 2009.806874 min 1000.000000 25% 1850.000000 50% 2750.000000 75% 4000.000000 max 10000.000000 Name: max_annual_income, dtype: float64
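When several columns need the same treatment, the conversion can be applied to all of them in one step. The sketch below uses a hypothetical toy frame (`toy`) with made-up values; only the two column names come from the notebook, and `errors="coerce"` is an assumption about how unparseable entries should be handled.

```python
import pandas as pd

# Hypothetical toy frame with string-typed columns; values are illustrative only
toy = pd.DataFrame(
    {
        "max_building_age": ["10", "25", "40"],
        "max_annual_income": ["1000", "2750", "n/a"],
    },
    dtype="object",
)

numeric_cols = ["max_building_age", "max_annual_income"]

# DataFrame.apply calls pd.to_numeric once per column and forwards errors="coerce"
converted = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

# describe() on the converted frame summarizes every numeric column side by side
summary = converted.describe()
print(converted.dtypes)
print(summary)
```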
Finally, we create some plots of our data. A scatter plot and a bar chart are shown below.
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1920px; height: 895px;'><object class='tableauViz' width='1920' height='895' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='/t/sadata' /><param name='name' value='mapping_inequality/Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>
- Hover over different points and explore their additional characteristics. Note: INHABITANTS_F/N should be multiplied by 100 to be a percent.
- How does BUILDINGS_Construction vary across the different points?
%%HTML
<script type='text/javascript' src='https://public.tableau.com/javascripts/api/tableau-2.min.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 715px;'><object class='tableauViz' width='1440' height='715' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='/t/sadata' /><param name='name' value='mapping_inequality/Sheet3' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>
- Recall the preparations done to INHABITANTS_Foreignborn. How might they have influenced these outcomes?
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 715px;'><object class='tableauViz' width='1440' height='715' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='/t/sadata' /><param name='name' value='mapping_inequality/Sheet2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>
- What can you learn from this graph?