#!/usr/bin/env python
# coding: utf-8
# # Major Challenges
#
# Historical data is not infallible: it contains errors! The recorded ages and dates of birth (DOBs) of enslaved people are not trustworthy.
#
#
#
#
# Humans are fallible, especially crowdsourced transcribers.
#
#
#
# **Summary:** Inconsistent data with a lot of nulls, a lot of ambiguity, and many human errors (historical and contemporary).
# # Mapping Computational Thinking into Library and Archival Science Practices (CT-LAS)
# (Data Practices and Systems Thinking Practices), focused on age and sex
#
#
# ## Data Practices
#
# * **Collecting Data** (Maryland State Archives and crowdsourced transcribers)
# * **Creating Data / Manipulating Data** - aggregating ages and using SQL queries to assign age ranges
#     * Used a combination of manual find/replace and SQL queries to reduce every non-integer entry in the original data set’s age information to a whole number, including “about 25,” “abour 25,” and “between 35 and 40.”
#     * When an age range was provided, the lower value was selected.
#
# * **Analyzing / Visualizing Data** - in SQLite (in DBeaver) and Tableau, queried relationships between age, age ranges, sex, and runaway departure dates to begin understanding more about the enslaved population in question.
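#
# The age-cleaning rule described under Creating Data / Manipulating Data can be sketched in Python. This is only an illustration: the project itself used manual find/replace and SQL, and the function name `clean_age` is hypothetical. The sketch assumes every usable entry contains at least one whole number; fractional or month-based ages are not handled.

```python
import re


def clean_age(raw):
    """Reduce a free-text age entry to a whole number, or None if no number is present."""
    if raw is None:
        return None
    # Pull out every run of digits, so qualifiers such as "about" (or typos of it) are ignored.
    numbers = re.findall(r"\d+", str(raw))
    if not numbers:
        return None
    # Keeping the smallest number mirrors the rule of selecting the lower value of a range.
    return min(int(n) for n in numbers)
```

# Entries with no digits at all come back as `None`, so they can still be counted as unknowns rather than silently dropped.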
# ## Systems Thinking Practices
#
# * **Investigating a Complex System as a Whole** and **Understanding the Relationships within a System** - spent the first day getting comfortable with the dataset and understanding the distinct but interconnected information it includes (runaway data, owner data, newspaper data, etc.), all of which is linked.
#     * Began exploring these relationships through initial attempts to identify duplicates.
# * **Thinking in Levels** - ‘zoomed in’ on ages and started by cleaning the data into whole integers from which analysis could begin.
#
# # Using Tableau Dashboard
#
#
# * **Top visualization** is new. It counts every individual in the data set with an age provided, after the ages were cleaned to include only whole integers. The bar showing unknowns was kept in order to illustrate an important aspect of the data: there are many holes!
#
# * **Bottom visualization** is from the first team of students to investigate this data set. Critique: although it is beautiful, the visualization is not actually that helpful, because it is not organized in a visually logical way (example: descending ages). It also includes only the whole integers already present in the data and excludes all other data points instead of cleaning them.
#
#
# * **Age Ranges:**
#
#
#     * Age ranges are more interesting than individual integers for seeing broader trends. Grouping into ranges also helps offset the earlier assumption of choosing the lower value of any age range originally present in the data.
#     * Gender is also worth looking at: many more men than women were running away, though the age range trend was similar for both. Unknowns were again included to illustrate the holes in the data.
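#
# A minimal sketch of the grouping behind the age-range view, assuming ten-year bins (the actual bin width used in the SQL queries and Tableau is not stated here). The names `age_range` and `count_by_range_and_sex` are hypothetical, and records are assumed to be `(age, sex)` pairs with `None` for missing values.

```python
from collections import Counter


def age_range(age, width=10):
    """Label a whole-number age with its range, e.g. 25 -> '20-29'; missing ages stay visible."""
    if age is None:
        return "Unknown"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def count_by_range_and_sex(records, width=10):
    """Count individuals per (age range, sex); records are (age, sex) pairs."""
    counts = Counter()
    for age, sex in records:
        # Unknowns are kept as their own category to show the holes in the data.
        counts[(age_range(age, width), sex or "Unknown")] += 1
    return counts
```

# Because an entry such as “between 35 and 40” lands in the same ten-year bin whichever end is chosen, binning is less sensitive to the lower-value rule than individual integers are (ranges that straddle a bin edge are the exception).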
#
# **Other Fun Aggregations:**
#
# **Top Newspapers:**
# * Isolated the newspapers with the most ads. This was not without struggles, owing to human error (spelling) and data entry (multiple newspapers listed in one row).
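#
# One way to handle both problems is sketched below. It assumes that multiple newspapers in one row are separated by `;` or `/` (the real separators in the data set may differ) and that spelling variants are fixed through a hand-built corrections table. The names `split_newspapers` and `top_newspapers` are hypothetical.

```python
from collections import Counter


def split_newspapers(cell, separators=(";", "/")):
    """Split one cell into normalized newspaper names (lowercase, single spaces)."""
    if not cell:
        return []
    # Rewrite every assumed separator to the first one, then split once.
    for sep in separators[1:]:
        cell = cell.replace(sep, separators[0])
    names = []
    for part in cell.split(separators[0]):
        name = " ".join(part.split()).lower()
        if name:
            names.append(name)
    return names


def top_newspapers(cells, corrections=None, n=5):
    """Count ads per newspaper after applying manual spelling corrections."""
    corrections = corrections or {}
    counts = Counter()
    for cell in cells:
        for name in split_newspapers(cell):
            # Map known misspellings to their corrected form before counting.
            counts[corrections.get(name, name)] += 1
    return counts.most_common(n)
```

# The corrections table still has to be built by hand from the misspellings actually found, so this only makes the manual cleanup repeatable; it does not replace it.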
#
#
# **Departure Date & Sex:**
# * First looked at when people were running away, then at whether men and women followed the same trends. In general they did, with a few minor exceptions. Historical research and other data sets may help explain the trends more deeply; for example, were women running away with men or on their own?
#
# **Rewards by Age:**
# * More could be done with rewards, but there was not enough time and not necessarily enough data. Our data set does not include the reward amount, only whether a reward was advertised.
# * Percentages of rewards offered by age group - generally higher for younger rather than older runaways. Is this because owners wanted their younger slaves back more than the older ones, and therefore did not offer rewards for the latter? Or were a higher number of older slaves caught by law enforcement and held in local jails, then advertised by the sheriff without reward?
# * **Important note:** in the data, “no” only means no reward advertised. It **DOES NOT** distinguish between an ad placed by an owner not offering a reward and an ad placed by a sheriff or other law enforcement advertising the capture and holding of a slave, to be claimed by an owner (therefore no reward provided).
# * Also, dirty data:
#
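# The reward percentages by age group can be sketched as follows. This assumes the reward field holds “yes” or “no” once whitespace and capitalization are normalized (only “no” is confirmed above), and that anything else is dirty data to be skipped. The name `reward_share_by_group` is hypothetical.

```python
from collections import defaultdict


def reward_share_by_group(records):
    """Return {age_group: percent of ads with a reward}; records are (age_group, reward) pairs."""
    totals = defaultdict(int)
    rewarded = defaultdict(int)
    for group, reward in records:
        value = (reward or "").strip().lower()
        if value not in ("yes", "no"):
            continue  # dirty or missing reward value: leave it out of the percentage
        totals[group] += 1
        if value == "yes":
            rewarded[group] += 1
    return {group: 100.0 * rewarded[group] / totals[group] for group in totals}
```

# The percentages inherit the limitation noted above: a “no” from an owner and a “no” from a sheriff's notice are counted identically.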
#
#
# **Finale:**
# There are probably duplicates, and it will take a lot of work to isolate individual enslaved people! This work will require cleaning the data, linking other data sets, and pursuing historical research. Thus, all the great Tableau visualizations are still exploratory, because they likely represent the same individual multiple times.
#
#
#
# One attempt to isolate individuals = **10,992** records retrieved.
#
# Another attempt = **8,516** records retrieved
#
# Sure, more parameters could be added, but that requires trusting them! The same individual could be listed with and without a middle name; the DOB could be different; the owner could have changed over time; the contemporary transcriber could have spelled a name wrong.
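#
# The effect of choosing different parameters can be illustrated with a small sketch that counts distinct combinations of whichever fields are treated as identifying an individual. The function name `count_distinct` and the field names in the usage note are hypothetical, not the data set's actual columns, and this is not the query that produced the two counts above.

```python
def count_distinct(records, key_fields):
    """Count distinct combinations of the chosen fields across a list of dict records."""
    seen = set()
    for record in records:
        # Normalize case and surrounding whitespace so trivial differences do not split a person.
        key = tuple(str(record.get(field) or "").strip().lower() for field in key_fields)
        seen.add(key)
    return len(seen)
```

# Calling it with `["first", "last"]` and then with `["first", "last", "owner"]` on the same records will generally give different totals. A misspelled name or a change of owner makes one person count as two, which is why each added parameter has to be trusted first.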
#
#