by AO, MR, BS Group Dexterous Data-Analysts
The Morgenthau Press Conferences collection comprises two series: the index catalog cards (Series 1) and the corresponding transcripts of Morgenthau’s press conferences (Series 2), both available as PDF files. Each index card contains the general topic name, the subtopics related to that topic, the transcript volume and page numbers where the general topic and subtopics are discussed, and the dates on which the press conferences took place. The transcripts are organized chronologically from 15 November 1933 to 23 July 1945, for a total of 27 volumes. The index cards are organized by topic in alphabetical order, starting with “Advisory Council of National Sales Managers” and ending with “Zinc.” Within each topic, the subtopics are arranged chronologically.
Our group chose to analyze the index cards for the press conferences that discuss Great Britain. The three main questions we asked and answered through our dataset were: “During what years was Great Britain talked about the most and least?”, “What subtopics about Great Britain had the longest conversations?”, and “What was the most discussed subtopic in relation to Great Britain?” The key patterns we found were that Great Britain was mentioned the most in 1940 and the least in 1939, and that stabilization was both the subtopic with the longest conversations and the subtopic discussed most often.
Our findings can contribute to the story of the United States’ relationship with Great Britain during World War II. Because the two countries were allies in the war, the discussions about Great Britain can add insight into Great Britain’s role in World War II and into how the two countries worked together to fight the Axis powers. (MR)
Henry Morgenthau, Jr. served as Secretary of the Treasury from January 1934 to July 1945. The Morgenthau Press Conference Index, comprising Series 1 in the collection, contains 2725 individual index cards available in seven PDF files. Asked to select and evaluate a subset of the collection, our group chose to work with the fifteen index cards that covered the subject of Great Britain. These fifteen cards were contained in the third and fourth PDF files of the series. While looking through the index cards we noticed several cards relating to various countries. We chose to focus on Great Britain because it was an ally of the United States during World War II, which occurred during the period of these press conferences, and because we felt there were enough index cards in this subset to support pattern analysis. The data also spanned multiple years of the collection, which interested the group because it would let us see trends over time in the press conferences regarding Great Britain. We were curious which topics Morgenthau discussed in relation to Great Britain, how often these topics were discussed, and how much discussion took place about each individual topic. (BS)
To begin working with our dataset we first had to extract the data from the original PDF files. The files had been through an Optical Character Recognition (OCR) process, but the quality was poor: many of the image scans were dark, so not all of the text was captured by the OCR. Text was copied from the PDF files into a Google Doc file and then divided among the group so that each member of the team was responsible for cleaning five cards. The missing text was then added by hand to create a Google Doc file with all of the text from the index cards. The text from our file was then placed into a Microsoft Excel spreadsheet. The group identified and labeled six columns of data for the project, then saved the spreadsheet as a comma-separated values (CSV) file for use in OpenRefine. The CSV file was shared through our group’s files folder in Canvas to make sure everyone had access to the same data for opening in OpenRefine. (BS)
Once the CSV file was imported into OpenRefine, there were still noticeable problems with the data that needed to be cleaned up prior to analysis. The “value.replace” transformation was used to remove text that represented keystrokes and formatting, such as tabs, page breaks, underlining, and spacing. A transformation was also used to remove leading and trailing whitespace. To split dates into separate columns for month, day, and year, the “split into several columns” function was used.
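The same three cleanup steps (replace, trim, split) can be illustrated outside OpenRefine. The following is a minimal Python sketch, not the procedure the group actually ran: the list of residue markers other than "\tab", the function names, and the "month/day/year" date layout are all assumptions made for illustration.

```python
import re

# Hypothetical residue markers. The report only names "\tab" explicitly;
# the other two are placeholders for page-break and underline residue.
RESIDUE_MARKERS = ["\\tab", "\\page", "\\ul"]


def clean_cell(value: str) -> str:
    """Remove residue markers, collapse repeated whitespace, and trim the ends."""
    for marker in RESIDUE_MARKERS:
        value = value.replace(marker, " ")
    value = re.sub(r"\s+", " ", value)
    return value.strip()


def split_date(value: str, sep: str = "/") -> tuple[str, str, str]:
    """Split a cleaned 'month/day/year' string into its three parts.

    The separator and field order are assumptions; raises ValueError
    if the cell does not contain exactly three parts.
    """
    month, day, year = (part.strip() for part in clean_cell(value).split(sep))
    return month, day, year


print(clean_cell("\\tab Stabilization  \\ul fund "))
print(split_date(" 11/15/1933\\tab"))
```

Keeping the cleanup in one small function makes it easy to apply the same rules to every column, which is roughly what applying one transformation to a whole column does in OpenRefine.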
Once the data was set into its proper columns, we used facets to check that the values in each column followed a consistent format and to finish the data cleanup. For example, some of the rows in the month column were showing up as numeric values and others as text values. We used the facets to change them all to the same numeric format. We also noticed that some characters in the text were missing because of the OCR; the facets identified the rows with these issues so we could rename them all to the same values. After using facets for data cleanup, we used them to gather the information needed to answer our questions about the dataset and create our visualizations. The information from the facets for each question was captured in an Excel spreadsheet so that visualizations could be created. (BS)
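A text facet is essentially a count of each distinct value in a column, which is why it exposes mixed formats so quickly. The sketch below imitates that idea in Python and shows one possible way to bring month values into a single numeric format. It is an illustration under assumptions: the sample values are made up, and the helper names normalize_month and text_facet are hypothetical.

```python
from collections import Counter

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def normalize_month(value: str) -> int:
    """Return the month as a number, accepting digits, names, or abbreviations."""
    text = value.strip().lower().rstrip(".")
    if text.isdigit():
        month = int(text)
    else:
        # Accept full names and abbreviations of at least three letters.
        matches = [
            number for name, number in MONTHS.items()
            if len(text) >= 3 and name.startswith(text)
        ]
        if len(matches) != 1:
            raise ValueError(f"unrecognized month: {value!r}")
        month = matches[0]
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {value!r}")
    return month


def text_facet(values):
    """Count each distinct value, similar in spirit to a text facet."""
    return Counter(values)


# Made-up sample column, not the group's data.
raw_months = ["11", "Nov.", "November", "7", "07"]
print(text_facet(raw_months))                               # five distinct spellings
print(text_facet(normalize_month(m) for m in raw_months))   # two distinct months
```

Comparing the facet before and after normalization is a quick check that no inconsistent spellings remain in the column.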
The first question we addressed was how many times Morgenthau mentioned Great Britain during the press conferences, and whether the amount of discussion changed over time. To create this visualization, MR collected the number of times Great Britain was mentioned each year and used Microsoft Excel to create the graph in Figure 1 below. Great Britain was discussed the most in 1940 and the least in 1939. Because Great Britain entered World War II in September 1939, it makes sense that many of Morgenthau’s press conferences during 1940 focused on Great Britain.
Figure 1: Mentions of Great Britain in Press Conferences. Note: This graph demonstrates how many times Great Britain was mentioned in Morgenthau’s press conferences each year. From the data collected, Great Britain was mentioned the most in 1940 and the least in 1939.
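The per-year tally behind a chart like Figure 1 can be sketched in a few lines of Python. This is only an illustration of the counting step: the column name "year", the helper names, and the sample rows are assumptions, and the rows shown are made up rather than taken from the group's dataset.

```python
from collections import Counter


def mentions_per_year(rows):
    """Count how many rows (index-card entries) fall in each year."""
    return Counter(int(row["year"]) for row in rows)


def most_and_least(counts):
    """Return (year with most mentions, year with fewest mentions).

    Years with no entries at all do not appear in the counts, and ties
    are resolved by the order in which years were first seen.
    """
    ordered = counts.most_common()
    return ordered[0][0], ordered[-1][0]


# Made-up rows for illustration only.
sample_rows = [
    {"year": "1939"}, {"year": "1940"}, {"year": "1940"},
    {"year": "1941"}, {"year": "1941"}, {"year": "1940"},
]
counts = mentions_per_year(sample_rows)
print(sorted(counts.items()))
print(most_and_least(counts))
```

Because years without any entries never appear in the tally, "least mentioned" here means the least among years that have at least one entry.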
The second question we posed for visualization was which topics under the subject of Great Britain had the longest conversations. To find this information we hand-counted the corresponding transcript pages to see how long Morgenthau discussed each subject, and AO entered this information into an Excel spreadsheet to create the graph shown in Figure 2 below. Stabilization was the topic Morgenthau discussed at greatest length in his press conferences regarding Great Britain. “Stabilization” was the process of the U.S. government trying to control inflation during WWII.
The third question we posed was which subject areas were most discussed regarding Great Britain. By inputting the number of times each Great Britain subject area was discussed into an Excel spreadsheet, BS made the chart shown in Figure 3 below. It shows how much of Morgenthau’s discussion of Great Britain between 1937 and 1943 each subject area makes up. The larger the subject area on the chart, the more times it was discussed. Stabilization was the area most discussed during the war years, which is consistent with Figure 2, where stabilization also has the most pages of discussion. (BS)
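The proportions in a chart like Figure 3 come from dividing each subject area's count by the total number of discussions. A minimal Python sketch of that arithmetic follows; the function name, the subject labels other than stabilization, and the counts are all hypothetical and do not reproduce the group's figures.

```python
def subject_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-subject discussion counts into percentages of the whole."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no discussions to summarize")
    return {subject: 100 * n / total for subject, n in counts.items()}


# Made-up counts for illustration only.
sample_counts = {"Stabilization": 6, "Subject B": 3, "Subject C": 3}
for subject, share in subject_shares(sample_counts).items():
    print(f"{subject}: {share:.1f}%")
```

Working with percentages rather than raw counts makes it easier to compare the chart with the page counts behind Figure 2, since both can then be read as shares of a whole.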
Balancing validity, objectivity, honesty, openness, accountability, and stewardship when transcribing our chosen dataset was extremely important to our team. Given the time period of these index cards, some of the information in the transcripts themselves can be extremely sensitive.
In this study, our team focused on the validity of the data we were transcribing. Because the Optical Character Recognition (OCR) on our cards was low in quality, our team decided to manually transcribe each of our cards. This introduced the possibility that something would be transcribed incorrectly or that a clerical error would occur. We each did our own data interpretations as well as faceting (translations, to humanists). To ensure the validity of the text we were transcribing, we copied and pasted what we could of the OCR text and typed the rest out by hand, being careful not to miss anything along the way. Throughout, honesty has been a key element of the data, results, methods, and procedures.
Openness played a key role in the research as well, as the group wanted to make sure that all the data collected and analyzed is available in a context that is understandable and clear. Since we too want the data to be an open resource for all, the group stayed accountable for any actions taken when transcribing and evaluating the data. Recognizing that the data has long-term value, we as stewards strove for the most efficient use of the data as we organized it and asked questions. (AO)
Using some of the Optical Character Recognition (OCR) text available on the index cards, the group cleaned the data by hand for use in OpenRefine. Our group soon discovered that OpenRefine is a wonderful tool, but not for data that is still messy or dirty. Because we did most of the index transcriptions ourselves and moved them from a Google Doc to an Excel file and then to a CSV file, some of the data came through very messily and were very hard to clean up in OpenRefine. If we were to continue this project, our prime focus would be on improving the OCR to the point where text can be easily copied and pasted from the index card PDF into another document. Unfortunately, we had many issues with stray characters from our transcription being carried into OpenRefine, for example backslash codes such as “\tab” representing the Tab key. We hope that improving the OCR would take those unnecessary characters out of the equation, making the data easier to manipulate and facet in the software.
For the future, our group’s computational model can also contribute to studies about the United States’ relationship with Great Britain during World War II. As allies, the United States and Great Britain worked closely together on a myriad of issues and established a strong opposition to the Axis powers. The data found in the Morgenthau press conferences can be used to further examine the two countries’ relationship and how they worked together to ultimately win World War II. (AO)