Notebook

Notebook 1: NARA 1950 Census Image Downloads¶

We will use AWS CLI commands to access data from the National Archives (NARA). With these commands we'll be able to target and download all of the schedule images (i.e., the scanned census forms) from one or more specific counties within a state. (2) Next, we'll create folders based on the enumeration districts for the selected county and save the images into their respective folders. File and folder names will be based on the JSON files that describe the data housed in NARA. This code will save the schedule images as well as the map images for the associated county. This code can be modified to save images from any county, state combination.

Saving all of the images from NARA will take a significant amount of time. (3) This code includes a loop that will check which files have already been downloaded and will continue wherever it left off.

In summary, this notebook will produce the following output: a folder structure with all of the census data (i.e., scanned images) for a given county/state in the 1950 census. The main folder housing all of the data is called "NARA_downloads_1950" (can be changed) and the sub-folders are named after the enumeration disctrict (in this case, "70-1", "70-2", etc.). Within each folder named after an enumeration disctrict are two folders called "Description Images" (contains the same information in each enumeration disctrict folder) and "Schedule Images" (contains scanned images of the completed census form(s) for that enumeration district). Last, the "NARA_downloads_1950" will also include the json file for the state (in this case, California) and a folder called "Map_Images" with the maps of the selected county/state.

Step 0: Make Sure AWS CLI is Installed¶

Before running any of the code below, make sure you have AWS CLI installed and updated. Here are instructions from Amazon: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html

Step 1: Downloading Census Images¶

This first section of our notebook will download the images for Sacramento, California.

In this first chunk of code we're defining the path to the AWS (Amazon Web Services) and creating a directory (download_dir) for our images.

In [ ]:

# Importing key libraries
import os, requests, json

# Defining key variables
CENSUS_YEAR = '1950'
AWS_SERVER = f"https://nara-{CENSUS_YEAR}-census.s3.us-east-2.amazonaws.com/"

# Change the value of DOWNLOAD_DIR here if you want to call your folder something else
DOWNLOAD_DIR = 'NARA_downloads_1950'

# Create a folder using the value of "DOWNLOAD_DIR" if one does not exist
if not os.path.exists(f"{DOWNLOAD_DIR}"):
    os.mkdir(f"{DOWNLOAD_DIR}")

Next, we'll use the function below to download the JSON file that houses the description for the data of a given state. For example, if we supply "wa", it will download Washington state's JSON file to our previously created directory ("NARA_downloads_1950").

In [ ]:

def download_state_json(state_abbrev):
    # Check to see if the json file for the state we're interested in exists in our folder
    if not os.path.exists(f"{DOWNLOAD_DIR}/{state_abbrev}.json"):
        url = f"{AWS_SERVER}metadata/json/{state_abbrev}.json"
        
        # A "GET" request is the standard web request for fetching a page/file
        print("downloading: "+url)
        r = requests.get(url)
        
        # We will write the GET content to a local JSON file
        with open(f"{DOWNLOAD_DIR}/{state_abbrev}.json", 'wb') as dest: 
            dest.write(r.content)

Here's how we'll call our function.

In [ ]:

download_state_json('ca')  # This takes a few seconds to download

This next function checks to see if we've already got the relevant JSON file in our directory and downloads it if it isn't there. Next, the function loads our JSON data into a variable and returns it as the output so we can begin processing that information.

In [ ]:

def get_state_data(state_abbrev):
    # Assign the json file to a variable called state_file
    state_file = f"{DOWNLOAD_DIR}/{state_abbrev}.json"
    
    # Download the json file if it is missing 
    if not(os.path.exists(state_file)):
        download_state_json(state_abbrev)
        
    # Open and read the json file
    with open(state_file, 'r') as f:
        return json.load(f)

Next we use the state JSON data to find the list of paths for map images, per county. In the 1950 Census JSON file the map image file paths are repeated in each enumeration district for a given county or city, so we only need to pull the file paths from the first enumeration district (at index 0) for a given county or city to download the relevant maps using OS and the AWS Command Line Interface (CLI).

While writing this function, I realized that there is a typo in the JSON metadata file. The file paths for image files should read something like this "...1950census-maps/California/Sacramento/14-a2-025-00459.jpg" but instead they're missing a / between the county name and the image name ("1950census-maps/California/Sacramento14-a2-025-00459.jpg"). The good news is that there is a small number of map images per county so the workaround would be using a command with AWS CLI to download each image in the folder housing the images ("1950census-maps/California/Sacramento"). We can reference the JSON file to see which map images are unique to the city of Sacramento in Sacramento county and manually remove any unnecessary files. In the JSON file, simply find the county/city you're interested in - so 44 in the case of Sacramento, Sacramento - and then look under 'enumeration' > '0' > 'map_image'. The city of Sacramento has 193 enumeration districts for the 1950 census, but since they are all pointing to the same map we can just look at the first district (at index 0 in the JSON file) to check which image files we need to keep. In this case, we only need 14-a2-025-00459.jpg and 14-a2-025-00459b.jpg.

In [ ]:

# This is a fucntion to check and create folders
def chk_create_dir(dest_path):
    # Checking to see if a folder exists and, if not, create one in our destination path
    # Destination path = the path to our DOWNLOAD_DIR folder
    if not os.path.exists(dest_path):
        os.mkdir(f"{dest_path}")

def get_map_images(dir_name, state, county):
    # Creating a destination for the images using our new folder
    dest_path = f"{DOWNLOAD_DIR}/{dir_name}"
    chk_create_dir(dest_path)
    # Using --recursive allows us to download all of the items in this folder
    map_command = f"aws s3 cp s3://nara-1950-census/1950census-maps/{state}/{county}/ {dest_path} --no-sign-request --recursive"
    print(map_command)
    os.system(map_command)

Now we can run our function and get all of the images for the state and county we're interested in.

In [ ]:

get_map_images('Map_Images', 'California', 'Sacramento') # This will take a few seconds

This next function creates a folder structure to house all of our schedule images based on the JSON file we downloaded previously. The final result of this function is that we'll have a folder for each enummeration disctrict in Sacramento with two subfolders for schedule and description images. To run this function, we'll need to know where our county is in the JSON file. To do this, we'll need to find where Sacramento, Sacramento is located and indexed in our JSON file (in this case it is at index 44).

The easiest way to do this would be to visit this page (https://www.archives.gov/developer/1950-census) from the National Archives, scroll down to "Enumeration District Summaries by State" then click on the JSON file for your state. You can then identify the county or counties by using the filter function (example: https://nara-1950-census.s3.us-east-2.amazonaws.com/metadata/json/ca.json).

Note: this can be included in the code in the future to remove this manual step, however there are some unique entries - Sacramento city appears as Sacramento, Sacramento for example.

In [ ]:

def create_ed_fldrs(c_index, json_file):
    with open(f"{DOWNLOAD_DIR}/{json_file}") as f:
        data = json.load(f)
    count_enum = len(data['county/city'][c_index]['enumeration'])
    # Creating empty lists for enunumeration district numbers, image file names, and a range
    ed_num_list = []
    img_file_list = []
    range_list = [*range(count_enum)]
    # Adding numbers and file names to each list respectively
    for num in range(count_enum):
        ed_num = data['county/city'][c_index]['enumeration'][num]['ed']
        ed_num_list.append(ed_num)
        # Creating folders for each enumeration district 
        chk_create_dir(f"{DOWNLOAD_DIR}/{ed_num}")
        chk_create_dir(f"{DOWNLOAD_DIR}/{ed_num}/Schedule_Images")
        chk_create_dir(f"{DOWNLOAD_DIR}/{ed_num}/Description_Images")

In [ ]:

create_ed_fldrs(44, 'ca.json')

Next, we can download all of the schedule image files for the county that we're interested in. Luckily, there aren't any typos in this part of the JSON file so it should be a straightforward process. This function will download all of the files and place them into a folder named after the given enumeration district code as it appears in the JSON file. This function will also check to see if the image files already exist to avoid re-downloading the same image and will return the file name and folder it's located in if it already exists.

In [ ]:

def get_sched_imgs(c_index, json_file):
    # Create & check folders 
    create_ed_fldrs(c_index, json_file)
    
    # Open the json file 
    with open(f"{DOWNLOAD_DIR}/{json_file}") as f:
        data = json.load(f)
        
    # Count how many enumeration districts we have
    count_enum = len(data['county/city'][c_index]['enumeration'])
    
    # Create empty lists for enunumeration district numbers, image file names, and a range
    ed_num_list = []
    img_file_list = []
    range_list = [*range(count_enum)]
    
    # Add numbers and file names to each list respectively
    for num in range(count_enum):
        ed_num = data['county/city'][c_index]['enumeration'][num]['ed']
        ed_num_list.append(ed_num)
        
    # Check which image files exist in each enumeration district folder 
    # and downloads only the files that don't yet exist
    for ed_num, num in zip(ed_num_list, range_list):
        for file in data['county/city'][c_index]['enumeration'][num]['schedule_image']['files']:
            img = f"{DOWNLOAD_DIR}/{ed_num}/Schedule_Images/" + file
            fldr = data['county/city'][c_index]['enumeration'][num]['schedule_image']['folder']
            dest_path = f"{DOWNLOAD_DIR}/{ed_num}/Schedule_Images"
            command = ['aws', 's3', 'cp', f"s3://nara-1950-census/{fldr}/{file}", f"{dest_path}", '--no-sign-request']
            if os.path.exists(img):
                print(f"yes {file} in {ed_num} folder")
            else:
                print(command)
                from subprocess import Popen, PIPE
                process = Popen(command, stdout=PIPE, stderr=PIPE)
                stdout, stderr = process.communicate()
                # print(stdout)

In [ ]:

get_sched_imgs(44, 'ca.json') # This will take a long time

Next, I'm going to modify the above function to create a directory with folders for all of the description images. These should take less time to download as there is only one image per enumeration district. Based on the JSON file it looks like each description image covers two enumeartion districts. I also discovered that there's a similar typo with the file path to these images. They should read ".../1950census-descriptions/California/California-0072/California-0072_0001.jpg" but they are missing the / between the folder name (California-0072) and the image name. I'm using the regular expressions (re) library in Python to get around this issue.

This function will similarly check to see if the file already exists and will list the file name and location if it exists in our directory.

In [ ]:

import re

def get_desc_imgs(c_index, json_file, state):
    # Checks that folders exist & creates them if not 
    create_ed_fldrs(c_index, json_file)
    with open(f"{DOWNLOAD_DIR}/{json_file}") as f:
        data = json.load(f)
    count_enum = len(data['county/city'][c_index]['enumeration'])
    # Creating empty lists for enunumeration district numbers and a range
    ed_num_list = []
    range_list = [*range(count_enum)]
    # Adding numbers and file names to each list respectively
    for num in range(count_enum):
        ed_num = data['county/city'][c_index]['enumeration'][num]['ed']
        ed_num_list.append(ed_num)
    # Downloads the desc image for the given enumeration district
    for ed_num, num in zip(ed_num_list, range_list):
        file = data['county/city'][c_index]['enumeration'][num]['description_image']
        if file is None:
            continue
        folder = re.search(fr"{state}-....", file).group()
        img = re.search(fr"{state}-...._.....jpg", file).group()
        f_path = f"1950census-descriptions/{state}/{folder}/{img}"
        dest_path = f"{DOWNLOAD_DIR}/{ed_num}/Description_Images/" + img
        command = ['aws', 's3', 'cp', f"s3://nara-1950-census/{f_path}", f"{dest_path}", '--no-sign-request']
        if os.path.exists(dest_path):
            print(f"yes {f_path} in {ed_num} folder")
        else:
            print(command)
            from subprocess import Popen, PIPE
            process = Popen(command, stdout=PIPE, stderr=PIPE)
            stdout, stderr = process.communicate()
            print(stdout)

In the following cell you can run the function that will download all of the images of the 1950 Sacramento census. This will take a fairly long time to complete. If it is interrupted you may run it again to resume the downloads from where you left off.

In [ ]:

get_desc_imgs(44, 'ca.json', 'California') # This will take a few seconds