Census Form Image Processing with Computer Vision and Machine Learning by Gregory Jansen & Hebah Emara is licensed under CC BY-NC-SA 4.0
These notebooks provide a guide for sifting through the population schedules of the 1950 US census to find the records that are of interest to a researcher. This method allows researchers to focus their personal transcription labor on the pages that contain relevant information. The notebooks demonstrate a computer vision technique for segmenting population schedules and extracting the individual cell images from the race column. The cell images are then cleaned up and fed into two different neural network models that identify the handwritten race code in each cell. Finally, a user interface allows a researcher to visually review any uncertain results from this process and thereby create a reliable dataset containing only those population schedule pages that pertain to their research.
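The review step described above can be pictured with a minimal sketch of how two model outputs might be combined to decide which cells need a human look. Everything here is an assumption made for illustration, not the notebooks' actual logic: the `CellPrediction` type, the `triage` function, the placeholder labels "A" and "B", and the 0.9 confidence threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CellPrediction:
    """One model's guess for a single race-column cell (hypothetical type)."""
    label: str         # predicted code; "A"/"B" below are placeholders
    confidence: float  # model confidence in the range 0.0 to 1.0


def triage(pred_a: CellPrediction, pred_b: CellPrediction,
           threshold: float = 0.9) -> tuple[str, bool]:
    """Return (label, needs_review) for one cell.

    Assumed rule: accept a label automatically only when both models
    agree and both are at least `threshold` confident. Otherwise keep
    the more confident guess and flag the cell for visual review.
    """
    agree = pred_a.label == pred_b.label
    both_confident = min(pred_a.confidence, pred_b.confidence) >= threshold
    if agree and both_confident:
        return pred_a.label, False
    best = pred_a if pred_a.confidence >= pred_b.confidence else pred_b
    return best.label, True


if __name__ == "__main__":
    # Models agree with high confidence: no review needed.
    print(triage(CellPrediction("A", 0.95), CellPrediction("A", 0.97)))
    # Models disagree: the cell is flagged for the reviewer.
    print(triage(CellPrediction("A", 0.95), CellPrediction("B", 0.60)))
```

A stricter or looser threshold trades reviewer effort against the risk of accepting a wrong code, so the value would need to be tuned against real results.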
You can learn more about this project in the corresponding paper, which was presented at the IEEE BigData 2024 workshop on Computational Archival Science.
Running these notebooks requires a number of tools. Most can be installed via pip, and pipenv files are provided for that purpose. However, OpenCV and the Amazon CLI must be installed separately, and these two tools may not be installable on some platforms, such as the MyBinder.org service. OpenCV is a computer vision toolkit. The Amazon CLI is used to download the US Census images, which are provided as a public AWS dataset courtesy of the US National Archives and Records Administration.
Installation instructions for both tools are provided on the pages above.
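Before opening the notebooks, it can help to confirm that the two separately installed tools are visible in the current environment. The following is a minimal sketch using only the Python standard library. It assumes that the notebooks use OpenCV through its Python module `cv2` and that the Amazon CLI is available as the `aws` executable on the PATH. The `check_tools` helper is hypothetical and not part of the notebooks.

```python
import importlib.util
import shutil


def check_tools() -> dict[str, bool]:
    """Report whether the separately installed tools can be found.

    Assumptions: OpenCV is used via its Python module `cv2`, and the
    Amazon CLI is installed as the `aws` executable on the PATH.
    """
    return {
        "OpenCV (cv2 module)": importlib.util.find_spec("cv2") is not None,
        "Amazon CLI (aws executable)": shutil.which("aws") is not None,
    }


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A missing entry only means the tool is not visible to this Python environment. It may still be installed elsewhere on the system.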
This CASE module has a notebook devoted to downloading the necessary US Census images. Please see NARA 1950 Census Image Downloads.