by KD, MM, and HS (Group Hedy Lamarr)
Our team completed a data modeling exploration within Series 1 of the Morgenthau Press Conferences collection, published by the Franklin D. Roosevelt Presidential Library and Museum. While Series 2 comprises a 27-volume recollection of significant events of the Roosevelt presidency, Series 1 contains an index of seven alphabetically arranged sections, giving the user document-level access to various subjects. We specifically analyzed the second section’s 400-page index card file of alphabetical subjects from “Civilian Defense Office” to “Financing, Government” (HS, KD). Our guest lecturer’s article (Carter, Gondek, Underwood, Randby, and Marciano 2022) states that Isabella Diamond indexed Morgenthau’s press conferences in 1936. An experienced librarian working full-time, Diamond, along with Morgenthau’s primary personal secretary Henrietta Klotz, significantly “influenced” the priority topics Morgenthau pursued and the directions that would come to define Roosevelt’s presidential legacy (Gondek 2021).
To explore the correlations between finance within the government and currency exchange through keywords such as “currency,” “money,” “coin,” “silver,” “fiscal,” and “finance,” the group first attempted to extract OCR output from the original PDF from the FDR Library using macOS Preview (available on all Apple computers). Diamond both created a custom schema in 1936 and developed the microfilm in the 1940s for the set of press conferences, the contents of which vary and are dated from 1933 to 1945 in our portion (HS, KD). Consistent with Carter’s article, and an instance of good reproducibility, it became apparent that the files digitized in 2014 are still minimally optimized for OCR, producing mostly garbled and incomplete text. OCR was therefore applied to the original document via DocDrop, and the result was converted via Convertio (by KD), following the instructions we provide below.
To OCR a PDF:
The new OCR PDF was then uploaded to the PDF to CSV converter on the Convertio website. After applying the conversion, the CSV was downloaded and ready to be uploaded into OpenRefine for data cleanup.
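For anyone who prefers to script this step rather than use the web tools, the sketch below shows a rough equivalent of the DocDrop-plus-Convertio workflow in Python. It assumes the Tesseract engine and the pdf2image and pytesseract libraries are installed; the file names are illustrative, not the ones we actually used.

# Sketch: OCR a PDF and write the raw text to a CSV, one line of text per row.
# Assumes Tesseract is installed, plus: pip install pdf2image pytesseract
import csv
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("morgenthau_index_cards.pdf")  # hypothetical file name

with open("ocr_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["page", "text"])
    for page_number, page_image in enumerate(pages, start=1):
        text = pytesseract.image_to_string(page_image)
        for line in text.splitlines():
            writer.writerow([page_number, line])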
When the dataset from the CSV file was initially imported into OpenRefine, each line of text from every card was added as a separate row, including blank rows, as shown in Figure 1.
To clean up the data, the corresponding rows for each card were linked together to represent a single record. The subject heading of each card was split into a separate column titled “Headings.” The contents of each card were kept in a second column titled “Original Data,” and the dates were split into a third column titled “Date,” as shown in Figure 2.
Below are the steps taken in OpenRefine.
To Get the Headings Column:
Source: https://stackoverflow.com/questions/46489840/openrefine-transpose-rows-into-columns-based-on-text
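The exact OpenRefine recipe came from the thread linked above; as a rough equivalent, the sketch below shows the same idea in Python with pandas, under our own assumption that a blank row ends a card and that the first non-blank line of each card is its subject heading.

# Sketch: rebuild one record per card and carry its heading into a "Headings"
# column, mirroring the fill-down/transpose approach used in OpenRefine.
# Assumption (ours): blank rows separate cards; the first non-blank line of a
# card is its heading, and the rest is its contents.
import pandas as pd

lines = pd.read_csv("ocr_output.csv")["text"].fillna("")

records, heading, contents = [], None, []
for line in lines:
    line = str(line).strip()
    if not line:                      # blank row: close out the current card
        if heading is not None:
            records.append({"Headings": heading,
                            "Original Data": " ".join(contents)})
        heading, contents = None, []
    elif heading is None:             # first line of a new card = heading
        heading = line
    else:
        contents.append(line)
if heading is not None:               # flush the final card
    records.append({"Headings": heading, "Original Data": " ".join(contents)})

cards = pd.DataFrame(records)
cards.to_csv("cards_with_headings.csv", index=False)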
To Split Other Data Into Columns:
Source: https://groups.google.com/g/openrefine/c/ZjU9vt0BWkc
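Again, the actual split was done in OpenRefine following the thread above; a rough Python equivalent is sketched below. The date pattern is our own guess at how dates appear on the cards and would need adjusting against the real card images.

# Sketch: pull date strings out of each card's contents into a separate "Date"
# column, leaving the remainder in "Original Data".
# Assumption (ours): dates look like "6/14/38" or "June 14, 1938"; this regex
# is illustrative, not the exact expression used in OpenRefine.
import re
import pandas as pd

DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4}"          # e.g. 6/14/38
                     r"|[A-Z][a-z]+\.? \d{1,2},? \d{4})\b")  # e.g. June 14, 1938

def split_dates(text):
    text = str(text)
    dates = DATE_RE.findall(text)
    remainder = DATE_RE.sub("", text).strip()
    return pd.Series({"Original Data": remainder, "Date": "; ".join(dates)})

cards = pd.read_csv("cards_with_headings.csv")
result = cards["Original Data"].apply(split_dates)
cards["Original Data"] = result["Original Data"]
cards["Date"] = result["Date"]
cards.to_csv("cards_cleaned.csv", index=False)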
After this point, the dataset was ready to be exported as a CSV for creating a visualization in Microsoft Excel. (KD)
Excel Chart Instructions:
We used the CSV from OpenRefine to model and create a visualization (in Excel) of our dataset representing the distribution of dates by month, filtered by finance-related keywords from the cards. As seen in our scatter plot in Figure 3, the most active month in the dataset was June 1938, with 33 cards listing that date. There were also a few outliers: one card from December 1926, one card from November 1929, one card from February 1957, and two cards from December 1988. These are most likely the result of OCR and conversion errors. Most of the dates are from the late 1930s and early 1940s. (KD)
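The chart itself was made in Excel; purely as an illustration of the same grouping step, here is a minimal Python sketch that counts cards per month from the cleaned CSV (file and column names are the hypothetical ones from the earlier sketches).

# Sketch: count cards per month and plot the distribution, as in the Excel chart.
# Assumes a cleaned CSV with a parseable "Date" column (hypothetical file name).
import pandas as pd
import matplotlib.pyplot as plt

cards = pd.read_csv("cards_cleaned.csv")
dates = pd.to_datetime(cards["Date"], errors="coerce").dropna()

cards_per_month = dates.dt.to_period("M").value_counts().sort_index()

plt.scatter(cards_per_month.index.to_timestamp(), cards_per_month.values)
plt.xlabel("Month")
plt.ylabel("Number of cards")
plt.title("Distribution of card dates by month")
plt.show()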
With our Excel file, we carefully corrected the (DocDrop) OCR’d data to match the original images and added the book numbers, which had not carried over into OpenRefine. The corrected sheet was then sorted by date and imported into Zoho’s analytics function. Zoho was used to model the information and apply filters that narrowed the results to only the relevant records, namely those containing key strings in their heading, their contents, or both. Some key terms were “finan,” “fisc,” “mone,” “coin,” “debt,” and “silver.” The keyword-filtered and year-sorted dataset was then exported as an Excel sheet (with SB) and also used with Zoho’s Reports feature to create scatter, pie (shown), and line graph visualizations. The Reports-generated charts visualize the distribution of years across mentions of financial matters in the selected range of index cards.
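The filtering itself was done with Zoho’s built-in filters; for completeness, a minimal Python equivalent of the same substring filter is sketched below, using the hypothetical file and column names from the earlier sketches.

# Sketch: keep only cards whose heading or contents contain a finance-related
# key string, mirroring the filters applied in Zoho.
import pandas as pd

KEY_TERMS = ["finan", "fisc", "mone", "coin", "debt", "silver"]
pattern = "|".join(KEY_TERMS)

cards = pd.read_csv("cards_cleaned.csv")  # hypothetical cleaned export
mask = (cards["Headings"].str.contains(pattern, case=False, na=False)
        | cards["Original Data"].str.contains(pattern, case=False, na=False))

finance_cards = cards[mask].sort_values("Date")
finance_cards.to_excel("finance_cards.xlsx", index=False)  # requires openpyxl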
A good portion of the OCR output from DocDrop remains imperfectly optimized. DocDrop did not recognize all of the book and page numbers, making it too challenging to create the planned "Source" column in OpenRefine. In the future, all of the OCR errors, along with any typos and misspellings on the original cards, should be manually corrected. Such corrections will increase the quality and future usability of the records (Underwood et al. 2017). As they note, this type of work takes an entire project team to complete. Once the OCR has been fully corrected, the data can be further manipulated in OpenRefine, enabling the creation of more reliable visualizations. (KD)