#!/usr/bin/env python
# coding: utf-8

# ## Step 2: A Framework for Unlocking and Linking WWII Japanese American Incarceration Biographical Data - Context Based Data Manipulation and Analysis - Part 2

# The focus of this module is manipulating the geographical data collected in part 1 to explore a variety of structures for visualizing spatial data.
#
# The actions taken in part 1 to locate the place of origin, assembly center, camp relocations, residence at Tule Lake, and final movement for George Kuratomi were repeated for the other 24 selected individuals and aggregated into an Excel spreadsheet, which can be seen below. Following this, the latitude and longitude coordinates were added for all five movements for the 25 individuals. A separate spreadsheet, also included below, was created in Excel to structure the latitude and longitude coordinates in a format that will allow us to map out the paths of each person. ***Note: The paths spreadsheet could be created in Python, but doing so would require substantial restructuring of the data; as a result, and for ease, the data was formatted in an Excel spreadsheet.
#
# To begin working with the geographical data, the spreadsheet(s) should be saved as comma-separated value (.csv) file(s) and then read into the Jupyter notebook following the same process as in part 1. This process can be seen below. **Note: Excel files can also be read in; it's a matter of personal preference.
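# As a hedged sketch of the alternative mentioned in the note above: the paths table could in principle be derived in pandas by sorting each person's movements by order and pairing every point with the next one to form a start-to-end segment. The miniature dataframe below is hypothetical (column names mirror the spreadsheet headers; one person's rows with made-up coordinates), not the project's actual data.

```python
import pandas as pd

# Hypothetical miniature of the stacked movements table; column names
# mirror the spreadsheet headers (name, lat, long, city, order).
stacked = pd.DataFrame({
    'name':  ['kuratomi'] * 3,
    'lat':   [32.7157, 33.3528, 33.0465],
    'long':  [-117.1611, -117.9992, -115.4734],
    'city':  ['san diego', 'santa anita', 'tule lake'],
    'order': [1, 2, 3],
})

# Sort each person's rows by movement order, then shift each group's
# coordinates up by one row so every point sits beside its successor.
stacked = stacked.sort_values(['name', 'order'])
shifted = stacked.groupby('name')[['lat', 'long', 'city']].shift(-1)

paths = stacked.copy()
paths['end_lat'] = shifted['lat']
paths['end_long'] = shifted['long']
paths['end_city'] = shifted['city']

# The last movement of each person has no "next" point, so drop it.
paths = paths.dropna(subset=['end_lat'])
```

Each remaining row now describes one segment (start lat/long to end lat/long), which is the shape a mapping tool needs to draw a path.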
# In[1]:

# Import libraries used for dataframe (table-like) operations and numeric data structure operations
import pandas as pd
import numpy as np

# In[2]:

# The below commands will read your files into your notebook
fullstackeddf = pd.read_csv('python-fullmovements-stacked.csv', dtype=object, na_values=[], keep_default_na=False)
pathsdf = pd.read_csv('python-paths.csv', dtype=object, na_values=[], keep_default_na=False)

# ### Creation of Points

# Spatial data is geographic information about the earth and typically references a specific geospatial area or location. To perform any kind of spatial analysis, a dataset must include latitude and longitude coordinates. Additional elements such as name, city, state, and dates open up other visualizations and provide more context about the dataset.
#
# The headers for this dataset are name, lat, long, city, state, order, dates, fid, and notes, as shown below.

# In[3]:

# The below command shows the first ten rows of the dataset
fullstackeddf.head(10)

# To view all reported movements of one person, we can use pandas' str.contains method to return results for that particular individual. The data is already structured in a way that makes it easy to explore and plot locations for other individuals or groups.

# In[4]:

# The contains method can pull results specific to a name
kuratomi = fullstackeddf[fullstackeddf['name'].str.contains('kuratomi')]
kuratomi

# Similarly, the contains method can also return results for distinct cities, states, orders, and dates.
#
# This is particularly useful for exploring and viewing the data through a different lens, especially if you want to analyze where individuals or groups were on a particular date or at a particular location. In the example table below, the contains method is used to pull rows that contain 'california'.
# When mapped, the result will show points for individuals whose location (point of origin, assigned assembly center, first and/or second incarceration center, or final departure state) was in California.

# In[5]:

# The below command displays values that contain california
california = fullstackeddf[fullstackeddf['state'].str.contains('california')]
california

# ### Creation of Cluster Data

# So far we've only looked at and constructed tables that plot points on a map. An alternative approach for viewing our data is to create tables that show the relative size, or cluster, of given variables.
#
# Clustering data to view relative sizes is useful for surface-level analysis and can give us a better understanding of where large groups were concentrated at each point in time.
#
# To view the number of individuals at each location across all movements, we can apply the value_counts method introduced in part 1 to return counts of unique values. As seen below, the list does include a few states in the cities column, such as California, Pennsylvania, North Dakota, New Mexico, and Hawaii. This was done deliberately for a few of the 25 individuals whose final departure city was unclear due to missing data in the FAR. Additionally, if these cells had been left blank in the spreadsheet, they would be counted as a unique value when value_counts is performed, and in this specific case we do not want that value included.

# In[6]:

# The below command returns count values of the cities
fullstackeddf['city'].value_counts()

# Once the unique value counts are retrieved, they need to be appended (i.e., added) to the table. This can be achieved with pandas' groupby method.
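# To see why blank cells would distort the counts: because the files are read with keep_default_na=False, blanks arrive as empty strings, and value_counts treats '' as its own category unless it is filtered out first. A minimal hypothetical example (the city values below are made up, not rows from the project spreadsheet):

```python
import pandas as pd

# Toy city column with one blank entry, mimicking a row whose final
# departure city is missing (hypothetical values).
city = pd.Series(['poston', 'poston', 'tule lake', ''])

# With keep_default_na=False, the blank survives as '' and
# value_counts counts it as its own category.
counts_all = city.value_counts()

# Filtering out empty strings first keeps the counts meaningful.
counts_clean = city[city != ''].value_counts()
```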
# In[7]:

# The below command groups by city and counts the order values into a new column we titled counts
fullstackeddf['counts'] = fullstackeddf.groupby(['city'])['order'].transform('count')
fullstackeddf.head()

# ### Creation of Paths

# Using the paths dataset, we can spatially view and analyze the movement of a person or group in a unique way. Mapping paths lets us connect the points plotted on the map and visually trace the routes and the distances between locations. We can use the paths data to identify if and where individual paths cross, allowing us to glimpse where individuals might have met or at what point families were separated from one another.
#
# As seen below, the contains method can be used to extract path data for one or more persons. The pandas operator "|" (i.e., "OR") tells the filter to also search for and pull values matching a second condition.

# In[8]:

# The below command will return path results for Kuratomi
kuratomipaths = pathsdf[pathsdf['name'].str.contains('kuratomi')]
kuratomipaths

# In[10]:

# The below filter will allow for searching through the data for two separate values
kuratomiandterada = pathsdf[pathsdf['name'].str.contains('kuratomi') | pathsdf['name'].str.contains('terada')]
kuratomiandterada

# In this second module, I have shown how to use pandas' str.contains method to search and pull values from columns within datasets using the "|" (OR) operator. We used the value_counts method to return the number of individuals located in each city in our dataset, which lets us view the concentration of groups. Additionally, we filtered the paths dataset to view results for George Kuratomi as well as paths for Singer Terada, which will be saved and used in part 3.
#
# In the following module, we will look at how to use the datasets we created outside of the notebook, as well as the data we processed and prepared in this module, to create spatial and graph visualizations.
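# As an aside on the "|" operator: chaining two str.contains calls with "|" is equivalent to a single call using a regex alternation, and when the names are exact, isin avoids pattern matching entirely. A minimal sketch on hypothetical rows (the names below are stand-ins, not the project's paths file):

```python
import pandas as pd

# Toy paths table (hypothetical rows); only the name column matters here.
pathsdf = pd.DataFrame({'name': ['kuratomi', 'terada', 'yamada']})

# A regex alternation inside a single contains call...
both = pathsdf[pathsdf['name'].str.contains('kuratomi|terada')]

# ...or, for exact names, isin avoids pattern matching entirely.
both_exact = pathsdf[pathsdf['name'].isin(['kuratomi', 'terada'])]
```

Both filters select the same rows here; contains matches substrings anywhere in the name, while isin requires an exact match, so choose based on how the name column is formatted.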
# In[ ]:

# The below command lets us save the modified dataframe into a new output csv file.
# This can be useful when using these files for further steps of processing.
kuratomiandterada.to_csv('kuratomiandterada.csv', index=False)

# ## Notebooks

# The below module is organized into a sequential set of Python notebooks that allows us to interact with the collections related to the Framework for Unlocking and Linking WWII Japanese American Incarceration Biographical Data, and to explore, clean, prepare, visualize, and analyze it from a historical-context perspective.
#
# 1. A Framework for Unlocking and Linking WWII Japanese American Incarceration Biographical Data - Data Visualization
# 2. A Framework for Unlocking and Linking WWII Japanese American Incarceration Biographical Data - Context Based Data Manipulation and Analysis - Part 1