The second dataset we will examine was also the second to be created, and it comes from so-called incident cards. The file Cards_Box9.csv was produced from scanned images of index cards taken from Box 9 of the NARA archival series. The scanned images were processed with Optical Character Recognition (OCR) software to produce plain text. This plain text was then parsed into separate fields through a named entity recognition (NER) process.
Here is an example image of an incident card, with some information redacted:
Our first step in exploring this data is to load the file into a Pandas data frame.
# Loading the Data
import pandas as pd
# Read the CSV file into a Pandas data frame:
data_Card = pd.read_csv("Datasets/Cards_Box9.csv")
Now that the data is loaded, the code below will show you the first three rows of index card data, as printed out by the Pandas head() function.
# Show the first three rows
data_Card.head(3)
| | Box # | Not Inmate | Last Name | First Name | Other Name | Date | Year |
|---|---|---|---|---|---|---|---|
| 0 | Box9-0788.jpg | NaN | Enomoto | Masanobu | NaN | 7/21/42 | 1942 |
| 1 | Box9-0692.jpg | NaN | Ebesu | Kikumatsu | NaN | 7/24/42 | 1942 |
| 2 | Box9-0642.jpg | NaN | Doi | Satomi | NaN | 8/6/42 | 1942 |
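Note that the `Box #` column holds the file name of the scanned image rather than a number. As a minimal sketch, assuming every file name follows the `Box9-0788.jpg` pattern seen in the rows above (the full file may contain exceptions), the box and card numbers could be pulled out with a regular expression. The `sample` frame and the `box` and `card` column names are made up for illustration.

```python
import pandas as pd

# Three file names copied from the head(3) output above.
# In the notebook you would use data_Card instead of this sample.
sample = pd.DataFrame({"Box #": ["Box9-0788.jpg", "Box9-0692.jpg", "Box9-0642.jpg"]})

# Named groups become column names in the result.
# Rows that do not match the assumed pattern would come back as NaN.
parts = sample["Box #"].str.extract(r"Box(?P<box>\d+)-(?P<card>\d+)\.jpg")
parts = parts.astype(int)  # safe here because every sample row matches
print(parts)
```

On the real data it would be worth checking for unmatched rows with `parts.isna().any()` before converting to integers.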
Next, we call a few other Pandas data frame functions. For more information about Pandas data frames, see the Pandas documentation.
data_Card.describe() # summary statistics for the numeric columns (here, only Year)
| | Year |
|---|---|
| count | 113.0 |
| mean | 1943.0 |
| std | 0.5 |
| min | 1942.0 |
| 25% | 1943.0 |
| 50% | 1943.0 |
| 75% | 1943.0 |
| max | 1944.0 |
data_Card.ndim # how many dimensions are there?
2
data_Card.shape # how many rows and columns are there? (length in each dimension)
(113, 7)
data_Card.dtypes # what data types are detected in each column?
Box #         object
Not Inmate    object
Last Name     object
First Name    object
Other Name    object
Date          object
Year           int64
dtype: object
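Because `Date` is detected as `object` (plain text), it cannot be sorted or compared as a date until it is converted. One caveat when converting: the two-digit year directive `%y` maps `42` to 2042, not 1942. The sketch below sidesteps this by combining the month and day from `Date` with the four-digit `Year` column. It assumes every date is written as month/day/two-digit year, as in the printed rows, and `Date Parsed` is a hypothetical column name.

```python
import pandas as pd

# Dates and years copied from the rows printed in this section.
sample = pd.DataFrame({
    "Date": ["7/21/42", "7/24/42", "8/6/42", "3/11/44"],
    "Year": [1942, 1942, 1942, 1944],
})

# Drop the two-digit year, then append the four-digit Year column.
month_day = sample["Date"].str.rsplit("/", n=1).str[0]
full_date = month_day + "/" + sample["Year"].astype(str)

# Parse the rebuilt strings, e.g. "7/21/1942", into real datetimes.
sample["Date Parsed"] = pd.to_datetime(full_date, format="%m/%d/%Y")
print(sample.dtypes)
```

If some cards in the full file have malformed dates, passing `errors="coerce"` to `pd.to_datetime` would turn them into `NaT` instead of raising an error.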
You can also summarize the data from individual columns, like this:
data_Card['Year'].value_counts() # Counts for each distinct value in "Year" column
1943    85
1944    14
1942    14
Name: Year, dtype: int64
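value_counts() sorts by frequency, so the years come out of chronological order. As a sketch, the counts can be reordered with sort_index() or turned into proportions with normalize=True. The series below is rebuilt from the counts printed above rather than loaded from the file.

```python
import pandas as pd

# Rebuild the Year column from the printed counts: 14, 85 and 14 cards.
years = pd.Series([1942] * 14 + [1943] * 85 + [1944] * 14, name="Year")

by_year = years.value_counts().sort_index()   # chronological order
shares = years.value_counts(normalize=True)   # fraction of all cards per year
print(by_year)
print(shares.round(2))
```

In the notebook, `years` would simply be `data_Card['Year']`.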
data_Card.tail(3) # tail(3) is the opposite of head(3) and shows the last 3 rows
| | Box # | Not Inmate | Last Name | First Name | Other Name | Date | Year |
|---|---|---|---|---|---|---|---|
| 110 | Box9-1021.jpg | NaN | Fujii | Biichi | NaN | 3/7/44 | 1944 |
| 111 | Box9-1053.jpg | NaN | Fujii | Yasuko | NaN | 3/7/44 | 1944 |
| 112 | Box9-0196.jpg | Y | NaN | NaN | NaN | 3/11/44 | 1944 |
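The last row above has `Y` in `Not Inmate` and no name at all, which suggests that missing values are worth counting. Here is a minimal sketch on the six rows printed in this section. The `sample` frame is illustrative; on the real data, replace it with `data_Card`.

```python
import pandas as pd

# Values copied from the head(3) and tail(3) outputs.
sample = pd.DataFrame({
    "Not Inmate": [None, None, None, None, None, "Y"],
    "Last Name": ["Enomoto", "Ebesu", "Doi", "Fujii", "Fujii", None],
    "Year": [1942, 1942, 1942, 1944, 1944, 1944],
})

missing = sample.isna().sum()                   # missing values per column
flagged = sample[sample["Not Inmate"] == "Y"]   # cards marked "Not Inmate"
print(missing)
print(len(flagged))
```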
To explore the dataset in more depth, we can use Pandas functions and the raw row data to discover the following information:
Activity: Explore this dataset using Pandas data frame functions and identify the information above. Record what you discover in the Markdown cell below.
# Add your code here and create more cells if needed.
(This area provided for you to record your own notes on the dataset.)