by EB, JC, SR Group Archivists de Triomphe
Using Computational Modeling to Analyze U.S. Manufacturing and WWII: Insights from 1940
U.S. Secretary of the Treasury Henry Morgenthau Jr. (HMJr) created a vast set of records spanning 1933 to 1945. One of these is a personal account of official federal press conferences, transcribed with the help of his personal assistant Henrietta Klotz, who then indexed the collection to aid retrieval. The press conferences were held while HMJr was Treasury Secretary and involved other government officials, including then-President Franklin D. Roosevelt, members of the press, and other appropriate parties. The transcripts are physical, bound volumes housed at the Franklin D. Roosevelt Presidential Library and Museum; they are also available digitally through the library's website. The data for this analysis come from Volumes 14-16, which cover the year 1940 (JC), with Volume 14 beginning January 4 and Volume 16 ending December 30. Inspired by the historical context of the dates covered in the dataset, this project aimed to use computational models to better understand how the press conferences reflected the surge in U.S. manufacturing as a result of World War II.
Even after we re-ran optical character recognition (OCR) on the scanned volumes, the text still contained many errors, so we manually cleaned our dataset prior to analysis, focusing on eight selected keywords and their variants: manufacturing, airplane, machine, war, factory/plant, defense, engine, and production.
The analysis demonstrated the prevalence of manufacturing-related key terms throughout the dataset and the increased use of the word “defense” toward the end of 1940. The data trends point to the United States’ involvement in war manufacturing and a possible shift toward defense production prior to joining World War II the following year. This project revealed the ability of computational modeling to assist in the textual analysis of large datasets, allowing researchers to examine the data to study specific topics. Our research team recommends further work with this dataset to explore the key terms in context and implications for improved archival description through digital curation (EB).
1940 was the first full year of World War II and the last full year before the United States joined the Allies. While reviewing the press conference transcripts, we noticed that HMJr frequently discussed the defense industry and his office’s work toward supporting war production efforts. Accordingly, we chose to focus on interpreting these volumes through a manufacturing point of view and formed the following research question: How do the Henry Morgenthau Jr. press conferences reflect the surge in American manufacturing in 1940 as a result of World War II? (EB)
Among the tools and methods we tried for converting the PDFs to a manipulable file through optical character recognition (OCR) are:
Once our PDFs were in a manipulatable format, we then tried a variety of tools for analyzing the text:
We started the project with the goal of transforming the PDF volumes into a dataset we could manipulate. Each of us worked with one assigned volume to gain experience with computational methods. To make the volumes machine-readable, we ran them through the optical character recognition (OCR) tool at DocDrop.org/ocr, which produced PDFs with improved OCR. We then converted each PDF to a .txt file using Convertio. Despite the improved OCR, the resulting text still contained many errors because of the scan quality and the volumes’ typewritten text. Typewritten manuscripts often cause trouble for OCR software because of the variations in letters (Stančić & Trbušić, 2020). Manually cleaning all of the errors would have taken more time than we had for the project. To make the work manageable, we selected key terms for our analysis according to our research question and manually cleaned up only those terms in the documents. We focused on these terms and their accompanying variants: manufacturing/manufacturer, airplane/aircraft/plane, machine, war, factory/plant, engine, defense, and production. With the data cleaned, we uploaded the text files to Voyant Tools to create visualizations. We each worked with our assigned volume separately, but we also combined the text files into one large file containing all the data for our three volumes so that we could analyze the data in its entirety (EB).
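The counting step described above was done in Voyant Tools. As a rough illustration only, the following minimal Python sketch shows how the combined text could be checked for the selected key terms and their variants. The regular expressions, the `KEY_TERMS` mapping, and the function names are assumptions made for this example and are not part of the project's workflow.

```python
import re
from collections import Counter

# Key terms and variants named in the text. The patterns themselves are
# illustrative assumptions; they cover only simple singular/plural forms.
KEY_TERMS = {
    "manufacturing": r"manufactur\w*",
    "airplane": r"(?:air)?planes?|aircraft",
    "machine": r"machines?",
    "war": r"wars?",
    "factory/plant": r"factor(?:y|ies)|plants?",
    "engine": r"engines?",
    "defense": r"defenses?",
    "production": r"production",
}


def combine_volumes(texts):
    """Join the per-volume texts (in chronological order) into one string."""
    return "\n".join(texts)


def count_key_terms(text):
    """Count whole-word, case-insensitive matches for each key term group."""
    counts = Counter()
    for label, pattern in KEY_TERMS.items():
        regex = re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)
        counts[label] = len(regex.findall(text))
    return counts
```

Given the three cleaned volume texts as strings, `count_key_terms(combine_volumes([vol14, vol15, vol16]))` would return one count per term group, where `vol14`, `vol15`, and `vol16` are hypothetical variable names. Any OCR misspelling that was not corrected by hand would simply be missed by these patterns.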
With the datasets in Voyant Tools, we used the software’s wide selection of tools to explore trends in the volumes from a manufacturing perspective. The landing page that appears after loading a dataset into Voyant Tools is somewhat chaotic, but it provides several options for visualizations and manipulations (Figure 1). Ultimately, we chose to focus on the Trends line graph, ScatterPlot, StreamGraph, and Cirrus word cloud tools (Figures 2, 3, 4, 5, & 6). These tools allowed us to analyze the frequency of the selected key terms throughout the volumes (Voyant Tools, n.d.). In doing so, we were able to study changes in the use of terms over time in the individual volumes and across the entirety of the 1940 transcripts. For example, we highlighted changes in the use of the selected key terms within each volume using line graphs and scatter plots (Figures 2, 3, & 4). Additionally, we represented term frequency across all three volumes (Figures 5 & 6) (EB).
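To make the idea behind a frequency-over-time graph concrete, here is a minimal Python sketch that splits a text into equal segments and reports the relative frequency of one term in each segment. It approximates the general technique and is not Voyant Tools' own implementation; the function name, the tokenizer, and the default of 10 segments are assumptions for this example.

```python
import re


def segment_frequencies(text, term_pattern, segments=10):
    """Relative frequency of a term in each of `segments` equal token runs.

    `term_pattern` is a regular expression matched against whole lowercase
    tokens, e.g. r"defenses?". Returns one value per segment.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return []
    # Never ask for more segments than there are tokens.
    segments = min(segments, len(tokens))
    regex = re.compile(term_pattern)
    size = len(tokens) / segments
    freqs = []
    for i in range(segments):
        chunk = tokens[int(i * size): int((i + 1) * size)]
        hits = sum(1 for tok in chunk if regex.fullmatch(tok))
        freqs.append(hits / len(chunk))
    return freqs
```

Because the volumes are in chronological order, plotting the returned list for a term such as “defense” against the segment index gives a rough picture of how its use changes from the start to the end of a volume.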
Figure 1: Voyant Landing Page Featuring Volume 14 of HMJr Press Conference Transcripts. Note: This screenshot shows the five panels of the landing page for Volume 14 of the Morgenthau Press Conference transcripts. Each panel presents different information, such as vocabulary density, readability index, average words per sentence, and phrases containing the specified words; the panels are designed to work together to facilitate the exploration, analysis, and visualization of text data. Changing parameters in one panel affects the other four. Together, the panels show the frequency of words used in this volume (SR).
Figure 2: Line and Bar Graph of Select Key Terms in Volume 14 of HMJr Press Conference Transcripts. Note: This graph illustrates the frequency of selected key terms (our chosen vocabulary) for Volume 14 of the Morgenthau Press Conference transcripts. The graph demonstrates the use of terms throughout this volume, indicating a higher frequency of the terms “engine” and “plant*” at the beginning of the document, with an increase in the terms “war” and “airplane” near the end. This visual allows us to evaluate our specific volume or compare it with other volumes; the user can adapt the vertical or horizontal axis as needed. The frequency of each word is easy to see, and the visual clearly shows the changes over time (SR).
Figure 3: Scatter Plot of Select Key Terms in Volume 15 (spanning May 2 to October 10, 1940) of HMJr Press Conference Transcripts. Note: This graph visually demonstrates the frequency of selected key terms for Volume 15 of the Henry Morgenthau Jr. Press Conference transcripts. The graph reveals the use of selected terms throughout the volume, indicating a higher frequency of the terms “engine_”, “air_”, and “product_” at the beginning of the document, with an increase in the terms “plant_” and “war” near the end. Compared to Volume 14, the term “plant” declines and “engine” surges, as demand for and production of aircraft engines greatly increased and Morgenthau facilitated the transfer of airplane engine manufacturing and management to the heads of GM and Chrysler (JC).
Figure 4: Line and Bar Graph of Select Key Terms in Volume 16 of HMJr Press Conference Transcripts. Note: This graph illustrates the frequency of selected key terms for Volume 16 of the Henry Morgenthau Jr. Press Conference transcripts. The graph demonstrates the use of terms throughout this volume, indicating a higher frequency of the terms “plant”, “air*”, and “machine” at the beginning of the document, with an increase in the term “defense” at the end (EB).
Figure 5: Streamgraph of Selected Key Terms in Volumes 14-16 of HMJr Press Conference Transcripts. Note: This stream graph presents a visualization of the changes in frequency of selected key terms across all three Volumes in this study in chronological order. The graph demonstrates a steady use of the word “air*” along with a later increase in the use of the word “defense” (EB).
Figure 6: Cirrus Word Cloud of Select Key Terms in Volumes 14-16 of HMJr Press Conference Transcripts. Note: Word cloud of the most frequently occurring terms throughout Volumes 14, 15, and 16 of the Henry Morgenthau Jr. Press Conferences, spanning the year 1940. In addition to the selected key terms relating to manufacturing, other terms that appeared frequently in the press conferences throughout 1940 are included. When all of the most-used words are included, the selected key terms engine, air, plane, plant, war, and defense are prominently featured, followed by tool, machine, and production. (For analysis, we omitted articles and pronouns, such as a, the, and you. We also opted to leave out words that address someone, such as secretary and president, as well as those that pertain to HMJr’s role, such as treasur, money, fund, and financ, because those would naturally have a high rate of use. In fact, the terms “secretar_” and “treasur_” are the most prevalent, with “president_”, “money”, and “fund_” appearing as frequently as the selected key terms “engine_” and “air_”, followed by “financ_” and the other selected key terms “plane_”, “plant_”, “war”, and “defense_”) (JC).
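The filtering described in the note to Figure 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the configuration used in Voyant Tools: the `STOPWORDS` set is a tiny stand-in for a full stopword list, and `ROLE_STEMS` is built from the stems named in the note.

```python
import re
from collections import Counter

# Small illustrative stopword set (assumption); a real list would be longer.
STOPWORDS = {"a", "an", "the", "you", "i", "we", "it", "he",
             "of", "to", "and", "in", "is", "that"}

# Stems for forms of address and for HMJr's role, as named in the note.
ROLE_STEMS = ("secretar", "president", "treasur", "money", "fund", "financ")


def top_terms(text, n=10):
    """Return the n most frequent tokens after removing stopwords and role words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [
        tok for tok in tokens
        if tok not in STOPWORDS and not tok.startswith(ROLE_STEMS)
    ]
    return Counter(kept).most_common(n)
```

Prefix matching is a blunt instrument: the stem “fund” also removes unrelated words such as “fundamental”, so the excluded tokens are worth reviewing before the counts are interpreted.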
The HMJr press conferences are physical records created on a typewriter and later digitally scanned to PDF files. Consequently, they are not optimized for OCR and contain numerous errors. Even after we converted the PDFs to .txt files for computational analysis and manually corrected errors in the selected key terms, chosen based on frequency of use, the data still contained too many errors to amend in the time allotted for this project. Errors in the dataset can translate to missed datapoints in our analysis, which may skew our findings. Furthermore, by focusing our corrective efforts solely on errors in select manufacturing terminology, we may have inadvertently missed larger trends. Additionally, without consulting whole sentences or sections of the volumes, we have no information on the context surrounding our key terms or the circumstances of each term's appearance or frequency of use. Finally, HMJr recounted the press conferences afterward so that his assistant Henrietta Klotz could transcribe them, meaning the records are neither direct recordings nor real-time transcriptions and rely on memory or sparse notes from a single perspective. While we can presume that the records are broadly accurate, they cannot be considered precise. Accordingly, the phrasing or intent, especially when ascribed to other persons, will inevitably be at least partially affected by inherent biases.
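One way to recover some of the missing context noted above is a keyword-in-context (KWIC) listing, which shows the words on either side of each occurrence of a term. The Python sketch below is a minimal illustration of that technique; it was not part of this project, and the function name and default window size are assumptions.

```python
import re


def keyword_in_context(text, term_pattern, window=5):
    """Return (left context, matched token, right context) for each hit.

    `term_pattern` is a regular expression matched, case-insensitively,
    against each whitespace-separated token after stripping punctuation.
    """
    tokens = re.findall(r"\S+", text)
    regex = re.compile(term_pattern, re.IGNORECASE)
    hits = []
    for i, tok in enumerate(tokens):
        word = tok.strip(".,;:!?\"'()")
        if regex.fullmatch(word):
            left = " ".join(tokens[max(0, i - window): i])
            right = " ".join(tokens[i + 1: i + 1 + window])
            hits.append((left, tok, right))
    return hits
```

Reading even a few of these context lines for a term such as “defense” would show whether a spike in frequency reflects discussion of production, financing, or something else entirely.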
The overall ethics and values derived from computational data modeling in an archival setting include professional relationships, authenticity, history and memory, and access and use (SAA, 2020). To foster professional relationships, we approached this analysis collaboratively and professionally, engendering rapport and cooperation with fellow archivists throughout the project. While no work is completely unbiased or neutral, we strove for authenticity by making every effort to document and analyze the data transparently and objectively in the time frame allotted. A core archival value is history and memory; the press conferences are unique in that no other U.S. cabinet member kept such extensive records during this period. We interpreted these primary source materials through data modeling and visualizations to provide insights into this historic time. To promote access and use, we used open-source tools and shared our findings openly with the archival community and the public (JC).
This project explored key components of computational modeling through experimentation. Information retrieval techniques helped us produce statistics and models, some of which were more user-friendly than others. Voyant Tools, the tool we used, provided models that suited our project expectations and timeframe and that may be useful to future research. It also allowed us to gather relevant information for analysis now and in the future.
Because we had no control over the query optimization or the algorithms used, we chose the best available tool for preparing the information for further analysis and research. Machine learning and AI algorithms can be helpful in archival research when complete textual analysis is possible. However, that was not the case for this project because of the errors in the OCR, so our content analysis was limited to term frequency across all three volumes, which limited our understanding of the context. We also found that many of the available tools presented challenges. For our purposes, the accuracy of digital curation for a repository’s collection depended heavily on the type of manuscript and the program applied. Within those limits, we were pleased with Voyant Tools’ results and visualizations. This approach may be used in other disciplines or extended to other data on World War II economics or politics from that timeframe.
Working with the transcriptions of the HMJr Press Conferences has given us a new methodology for gleaning information about historical events without reading the records in their entirety. Digital curation and analysis using new tools have given us new insight into these records. Computational thinking complements and combines mathematics and engineering with computer science (Wing, 2006), and we applied this thinking to archival concepts. We concluded that by focusing on specific terms for our analysis, we may have inadvertently missed important details or context in the transcripts. However, computational methods added to our understanding of Henry Morgenthau Jr.’s focus on a variety of topics. As the Great Depression and economic changes took over the United States prior to its entry into World War II, HMJr contributed to the rise in U.S. manufacturing to support the defense industry and the war effort taking place in 1940. These documents offer a way to use computational analysis to interpret events such as his support for Franklin D. Roosevelt’s programs and the growth of the United States during a volatile and shifting period (SR).
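Since OCR errors were the main limit on the analysis, a simple string-similarity pass could help flag likely misreadings of the key terms for manual review. The Python sketch below uses the standard library's `difflib.get_close_matches`; it is an assumption-laden illustration rather than a step taken in this project, and the `KEY_TERMS` list and the 0.8 similarity cutoff are choices made for this example.

```python
import difflib
import re

# Base forms of the selected key terms (variants omitted for brevity).
KEY_TERMS = ["manufacturing", "airplane", "machine", "war", "factory",
             "plant", "defense", "engine", "production"]


def likely_ocr_variants(text, cutoff=0.8):
    """Map tokens that closely resemble a key term to that term.

    Exact matches are skipped. Results are candidates only: legitimate
    words can also score above the cutoff and need human review.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    suspects = {}
    for tok in tokens:
        if tok in KEY_TERMS:
            continue
        match = difflib.get_close_matches(tok, KEY_TERMS, n=1, cutoff=cutoff)
        if match:
            suspects[tok] = match[0]
    return suspects
```

A pass like this narrows the manual cleanup to a short list of suspect tokens, but it does not replace that review: at this cutoff the ordinary word “plane” is similar enough to “plant” to be flagged.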
Learning Outcomes
As we explored and studied a variety of models, theories, and visualizations in this project, we learned many valuable lessons. Our investigations and evaluations using an appropriate and useful computational tool resulted in productive collaboration, experimentation with computational modeling, and the application of visualization to glean historical information. These include:
The primary objective for this analysis satisfies Student Learning Outcome 4 of the graduate program at the University of Missouri for a degree in Master of Library and Information Science: Graduates will be able to assess community needs, formulate plans to respond to users of information agencies, and instruct users in using informational resources (SISLT, 2024). Other objectives include professional skills development and application of archival principles, ethics, and values through computational data modeling (JC).