Census Form Image Processing with Computer Vision and Machine Learning

Using image segmentation and handwritten character recognition to detect demographic codes in the 1950 census population schedules.

Author: Gregory Jansen
Contributor: Hebah Emara
Source Available: https://github.com/cases-umd/census-form-image-processing
License:
Census Form Image Processing with Computer Vision and Machine Learning by Gregory Jansen & Hebah Emara is licensed under CC BY-NC-SA 4.0

Introduction

These notebooks provide a guide for sifting through the population schedules of the 1950 US census to find the records that are of interest to researchers. The method lets researchers focus their own transcription labor on the pages that contain relevant information. The notebooks demonstrate a computer vision technique for segmenting population schedules and extracting the individual cell images from the race column. The cell images are then cleaned up and fed into two different neural network models, which identify the handwritten race code in each one. Finally, we created a user interface that allows a researcher to visually review any uncertain results from this process and thereby create a reliable dataset containing only those population schedule pages that pertain to their research.
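The review step at the end of this pipeline can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the notebooks' actual logic: the `CellPrediction` structure, the 0.90 threshold, the placeholder labels, and the rule "flag a cell when the two models disagree or either one is unsure" are all hypothetical choices.

```python
from dataclasses import dataclass

# Hypothetical threshold: predictions below this confidence go to a human.
REVIEW_THRESHOLD = 0.90


@dataclass
class CellPrediction:
    """One model's guess for a single race-column cell (hypothetical structure)."""
    label: str
    confidence: float


def needs_review(a, b, threshold=REVIEW_THRESHOLD):
    """Flag a cell when the two models disagree or either one is unsure."""
    if a.label != b.label:
        return True
    return min(a.confidence, b.confidence) < threshold


def triage_page(cells, threshold=REVIEW_THRESHOLD):
    """Split a page's cells into accepted labels and rows needing review.

    `cells` is a list of (model_a_prediction, model_b_prediction) pairs,
    one pair per row of the schedule.
    """
    accepted = {}
    review = []
    for row, (a, b) in enumerate(cells):
        if needs_review(a, b, threshold):
            review.append(row)
        else:
            accepted[row] = a.label
    return accepted, review


# Placeholder labels "A" and "B" stand in for real handwritten codes.
page = [
    (CellPrediction("A", 0.99), CellPrediction("A", 0.97)),  # agree, confident
    (CellPrediction("B", 0.95), CellPrediction("A", 0.60)),  # disagree
    (CellPrediction("A", 0.70), CellPrediction("A", 0.92)),  # agree, one unsure
]
accepted, review = triage_page(page)
print(accepted)  # {0: 'A'}
print(review)    # [1, 2]
```

Requiring agreement between two independent models is a conservative choice: it sends more cells to the reviewer, but a cell is accepted automatically only when both models independently support the same code.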

On the left, a colorful rendering of long lines detected in a sample document; on the right, a document with a fitted template shown as red lines.
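To give a feel for what "detecting long lines" means, here is a deliberately simplified sketch that uses a row projection profile on a tiny binary grid. This is a swapped-in technique chosen for brevity, not the method the notebooks use (they rely on OpenCV and a fitted template); the function names and the 0.8 coverage threshold are hypothetical.

```python
def long_horizontal_lines(image, min_fraction=0.8):
    """Return row indices whose ink pixels cover at least min_fraction of the width.

    `image` is a list of rows; each pixel is 1 (ink) or 0 (background).
    """
    width = len(image[0])
    return [y for y, row in enumerate(image) if sum(row) >= min_fraction * width]


def group_runs(indices):
    """Merge consecutive row indices into (start, end) bands.

    A ruled line is usually several pixels thick, so adjacent rows belong together.
    """
    bands = []
    for i in indices:
        if bands and i == bands[-1][1] + 1:
            bands[-1] = (bands[-1][0], i)
        else:
            bands.append((i, i))
    return bands


def cell_row_spans(bands):
    """Return the (top, bottom) pixel spans lying between consecutive line bands."""
    return [
        (a_end + 1, b_start - 1)
        for (_, a_end), (b_start, _) in zip(bands, bands[1:])
        if b_start - a_end > 1
    ]


# Toy 8x10 "scan": full rows are ruled lines, sparse rows hold handwriting.
toy = [
    [1] * 10,                        # ruled line (two pixels thick)
    [1] * 10,
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],  # handwriting
    [0] * 10,
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],  # ruled line with a small gap
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # handwriting
    [0] * 10,
    [1] * 10,                        # ruled line
]
bands = group_runs(long_horizontal_lines(toy))
print(bands)                  # [(0, 1), (4, 4), (7, 7)]
print(cell_row_spans(bands))  # [(2, 3), (5, 6)]
```

Real scans are skewed and noisy, so row sums alone are rarely enough; that is one reason to fit a template to the detected lines, as in the right-hand panel of the figure.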

Further Information

You can learn more about this project in the corresponding paper, which was presented at the IEEE BigData 2024 workshop on Computational Archival Science.

Learning Goals

Computational Practices:
- Manipulating data - Computational tools make it possible to efficiently and reliably manipulate large and complex archival holdings. Data manipulation includes sorting, filtering, cleaning, normalizing, and joining disparate datasets. (CTP:manipulating_data)
- Constructing computational models - An important practice… is the ability to create new or extend existing computational models. This requires being able to encode the model features in a way that a computer can interpret. (CTP:computational_models_constructing)
- Assessing different approaches to problem-solving - When there are multiple approaches to solving a problem, or multiple solutions to choose from, it is important to be able to assess the options and make an informed decision about which route to follow. Even if two different approaches produce the same correct result, there are other dimensions that should be considered when choosing a solution or approach such as cost, time, durability, extendibility, reusability, and flexibility. (CTP:problem-solving_approaches)
Archival Practices:
- Develop policies and procedures designed to serve the information needs of various user groups, based on evaluation of institutional mandates and constituencies, the nature of the collections, relevant laws and ethical considerations, and appropriate technologies. (ACA:rsa_evaluate_services)
- Identify facility, equipment, and technological needs and prepare and implement plans to meet those needs. (ACA:map_facility_equipment_tech)

Software and Tools

Running these notebooks requires a number of tools. Most can be installed via pip, and pipenv files have been provided for that purpose. However, OpenCV and the Amazon CLI are installed separately, and these two tools may not be installable on all platforms, such as the MyBinder.org service. OpenCV is a computer vision toolkit, and the Amazon CLI is used to download the US Census images, which are provided as a public AWS dataset courtesy of the US National Archives and Records Administration.

Instructions are provided on the pages above.
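Before opening the notebooks, it can help to confirm that the two separately installed tools are visible from your Python environment. The following is a small sketch, assuming that OpenCV's Python bindings are importable as `cv2` and that the Amazon CLI is on your `PATH` as `aws`; the `check_tools` helper is hypothetical and not part of this module.

```python
import importlib.util
import shutil


def check_tools():
    """Report whether the separately installed tools can be found.

    Assumes OpenCV is importable as `cv2` and the Amazon CLI executable is `aws`.
    """
    return {
        "OpenCV (cv2 module)": importlib.util.find_spec("cv2") is not None,
        "Amazon CLI (aws executable)": shutil.which("aws") is not None,
    }


for name, found in check_tools().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A `MISSING` result on a hosted service such as MyBinder.org is expected if the platform does not allow the tool to be installed; in that case, run the notebooks locally.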

Acquiring or Accessing the Data

This CASE module has a notebook devoted to downloading the necessary US Census images. Please see NARA 1950 Census Image Downloads.

Notebooks
