#!/usr/bin/env python
# coding: utf-8

# ## Exploratory Data Analysis and Visualization
# The goal of __exploratory data analysis (EDA)__ is to explore attributes across multiple entities to decide what statistical or machine learning techniques to apply to the data. Visualizations are used to assist in understanding the data.

# In[15]:


# loads the pandas library
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # Ignore Pandas future warnings

# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("manipulated_baltimore_data.csv")
df.head(n=3)


# The `.describe()` function summarizes a data frame column. The `max_building_age` column currently has dtype `object`, which in pandas usually indicates string values, so `describe()` reports only counts and frequencies. To get numeric summaries, we first have to convert this attribute to a numeric type.

# In[16]:


df['max_building_age'].describe()


# Once `max_building_age` is converted to a numeric type, `describe()` provides __summary statistics__ on this attribute.

# In[17]:


# converts max_building_age to numeric type
df["max_building_age"] = pd.to_numeric(df["max_building_age"])
df['max_building_age'].describe()


# We can apply the same operations to `max_annual_income`.

# In[18]:


df['max_annual_income'].describe()


# In[19]:


df['max_annual_income'] = pd.to_numeric(df['max_annual_income'])
df['max_annual_income'].describe()


# Finally, we create some plots of our data. A __scatter plot__ and a __bar chart__ are shown below.

# In[20]:


get_ipython().run_cell_magic('HTML', '', "

\n")


# ### Exercise 4
# > 1. Hover over different points and explore their additional characteristics. __Note__: `INHABITANTS_F/N` should be multiplied by 100 to be a percent.
# 2. The different points are clustered by grades. Which clusters have the most variation?
# 3. How does `BUILDINGS_Construction` vary across the different points?
# 4. Can you identify a trend overall?

# In[21]:


get_ipython().run_cell_magic('HTML', '', "

\n")


# ### Exercise 5
# > 1. Recall the preparations done to `INHABITANTS_Foreignborn`. How might they have influenced these outcomes?
# 2. What can you learn from this graph?
# 3. What do you learn about the different grades?

# In[22]:


get_ipython().run_cell_magic('HTML', '', "

\n")


# ### Exercise 6
# > 1. What can you learn from this graph?
# 2. What are some explanations for the outcomes?
# 3. What can you learn about the different grades?
# 4. Compared to the previous graph, what are the similarities and differences?

# In[ ]:
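

# The numeric conversion applied earlier to `max_building_age` and
# `max_annual_income` can be tried out on a small example. This is only a
# sketch: the Series below is made up (hypothetical values, not taken from the
# Baltimore data), and it assumes some entries might not parse as numbers.
# Under that assumption, errors="coerce" turns unparseable entries into NaN
# instead of raising an error; the cells above do not use this option.
import pandas as pd

# hypothetical string-typed column, including one entry that is not a number
raw = pd.Series(["12", "47", "n/a", "85"], name="max_building_age")

# converts to numeric type; "n/a" becomes NaN because of errors="coerce"
numeric = pd.to_numeric(raw, errors="coerce")

# describe() skips NaN values when computing the summary statistics
summary = numeric.describe()
print(summary)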