by E, L, M Group McErLe Incorporated
The 27 volumes of the Press Conferences of Henry Morgenthau, Jr., Series 2, reside in the Franklin D. Roosevelt Presidential Library & Museum and cover the years 1933 to 1945, when Morgenthau served as Secretary of the Treasury. We chose to analyze Volume 25, which runs 303 pages and spans November 4, 1943 to June 29, 1944. Within that scope, our primary focus was Italy and associated economic impacts, and our secondary focus was to explore relationships between Italy and a few key countries in order to identify Morgenthau’s main foreign affairs interests. The most extensive entries in the volume’s table of contents (TOC) (15 pages) are the terms “Financing” and “Military Currency,” with the documents tracking the financial health of several Western European countries. At the outset, Italy appeared to be the most frequently named nation in Volume 25, with the Mediterranean country’s military currency issues receiving extensive attention along with a TOC entry about German atrocities against Italians. (Mc)
Given how much attention the TOC gave to economic and Italian matters, our initial objective in manipulating and analyzing the one-volume dataset was to determine the relative prominence of Italy in Morgenthau’s financial briefings. In the process we identified a group-selected sample of terms and found their recur rates: Italy, money, currency, finance, stabilization, economy, economic, and reconstruction. Each member began with the same set of terms but was not limited to them in exploration. We took a higher recur rate for a term to imply higher significance in Morgenthau’s view. (Er)
Adobe Acrobat’s Export PDF tool did produce a CSV file for our initial data pull from the Volume 25 PDF, but the data were illegible in comparison to the original. For the subsequent attempt, we downloaded the free seven-day trial of Adobe Photoshop 2022 to adjust the lighting of the document and make it more easily readable by conversion software. After a quick brightness adjustment, the original PDF was split and batched into several documents of ten pages each. Pushing the PDF documents through OCRSpace and subsequently through convertcsv.com yielded surprisingly little searchable data. Troubleshooting revealed that our first four tools were still recognizing only a minimal amount of the content as searchable, or extractable, data. At this juncture, the files were brought back into Photoshop and manipulated in the layer function for clarity, as well as for brightness and contrast. Our brighter and cleaner files met with some success via the free OCR converter at OCRSpace to enable searching, although some early attempts did not yield very useful results: running the documents through OCR Engine1 produced files in which only stray letters had been recognized, rendering them practically impossible to search. The progressive versions of the OCR Engine gave better results, but the best files came from using OCR Engine5, which OCRSpace describes as “especially strong with text on complex backgrounds/low contrast.” However, OCRSpace was most useful for indicating what kind of data malfunction accounted for its sparser and inconsistent searchable data. One of us had even more success running the ten-page batches through a Mac app called TEXTIFY, then converting them into CSV files that could be manipulated (Er, Mc). TEXTIFY output can be saved in different formats such as text, Word, Pages, Numbers, Excel, and CSV. Saving the batches this way and eventually pulling them all back into one file made it easier to run them through OpenRefine or any other program.
The CSV was then manipulated further to reduce it to a single column, as the converted text never produced any sort of headers. (Le)
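The consolidation step described above can be sketched in Python. This is only an illustration under stated assumptions, not the workflow actually used (which relied on TEXTIFY and manual manipulation): it assumes each ten-page batch is already available as plain text, and the function names merge_batches and rows_to_csv, as well as the sample strings, are hypothetical.

```python
import csv
import io


def merge_batches(batch_texts):
    """Flatten per-batch OCR text into rows for a single, headerless column."""
    rows = []
    for text in batch_texts:
        for line in text.splitlines():
            line = line.strip()
            if line:  # drop blank lines left behind by the OCR step
                rows.append([line])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; fields containing commas are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample text standing in for two ten-page batches.
batches = [
    "First line of batch one\n\n  second line, with a comma  ",
    "Only line of batch two",
]
merged = merge_batches(batches)
print(rows_to_csv(merged))
```

Keeping every line in one unnamed column mirrors the headerless layout described above and leaves any further splitting to a later tool such as OpenRefine.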
OpenRefine then allowed separating and searching for some of the most common subjects gathered from the TOC, beginning with a run of terms like “finance,” “money,” and “currency.” Isolating these concepts in the text then led to additional search terms such as “stabilization” and “reconstruction.” The Figure 1 bar graph below shows those initial search results, a concept later greatly expanded upon. (Mc)
Figure 1: Frequency of appearance of terms selected from Morgenthau Volume 25.
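For readers who want to reproduce this kind of tally outside OpenRefine, the following Python sketch counts the rows in which each search term appears. It assumes a case-insensitive substring match, similar in spirit to a text filter; the function name count_matching_rows and the sample rows are hypothetical and are not drawn from Volume 25.

```python
def count_matching_rows(rows, terms):
    """Return, for each term, the number of rows that contain it.

    Matching is a case-insensitive substring test, so "finance" will not
    match "financing" but "economic" will match "economics".
    """
    counts = {}
    for term in terms:
        needle = term.lower()
        counts[term] = sum(1 for row in rows if needle in row.lower())
    return counts


# Hypothetical rows standing in for the single-column dataset.
sample_rows = [
    "The currency in Italy is under review.",
    "Stabilization of the currency continues.",
    "No comment on finance today.",
    "Italy again.",
]
term_counts = count_matching_rows(sample_rows, ["Italy", "currency", "finance"])
print(term_counts)
```

Counting rows rather than individual occurrences is a design choice here: a row that mentions a term twice is still counted once.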
In OpenRefine we started with Italy and found that eighteen rows were populated. Counts were then taken for each of the twelve total search terms listed below. To broaden the comparison, some other major players were added (France, Britain, Germany, Russia, and China) using the same search terms. A chart correlating countries to terms is shown in Figure 2.
Figure 2: Recurrence of economic-related terms by country associated with World War II. Legend: Countries read left to right with the assigned colors: Italy (blue), France (green), Britain (gray), Germany (yellow), Russia (red), and China (fuchsia), respectively.
A corresponding page chart for the countries is illustrated in Figure 3. (Le)
Figure 3: Countries and their corresponding page numbers in Morgenthau Volume 25 with key shown based on Italy data.
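One possible way to build a country-by-term table like the one charted in Figure 2 is sketched below in Python. The source does not state exactly how a country was paired with a term, so this sketch assumes that a pairing is counted whenever both appear in the same row; the function name cooccurrence_counts and the sample rows are hypothetical.

```python
def cooccurrence_counts(rows, countries, terms):
    """Count rows in which a country and a term both appear (case-insensitive)."""
    table = {country: {term: 0 for term in terms} for country in countries}
    for row in rows:
        lowered = row.lower()
        for country in countries:
            if country.lower() not in lowered:
                continue  # this row does not mention the country
            for term in terms:
                if term.lower() in lowered:
                    table[country][term] += 1
    return table


# Hypothetical rows; assumption: same-row appearance counts as a pairing.
sample_rows = [
    "Italy and the military currency",
    "Currency stabilization in China",
    "France, Italy, and finance",
    "Nothing relevant here",
]
table = cooccurrence_counts(
    sample_rows,
    countries=["Italy", "France", "China"],
    terms=["currency", "finance", "stabilization"],
)
for country, row_counts in table.items():
    print(country, row_counts)
```

The nested dictionary maps directly onto a grouped bar chart, with one group per term and one bar per country.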
Our 30 ten-page batches were merged back into one manipulable CSV with the same functionality. The next item of business in OpenRefine was to split the cells into smaller segments of text. That task was completed using the “split multi-valued cells” function, splitting at transitions from numbers to letters rather than by separator. Figure 4 exhibits our dataset once the data split was complete. From here, the text filter function was used successfully. (Er)
Figure 4: Screenshot of data after the “split multi-valued cells” function was performed in OpenRefine.
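The idea of splitting at a number-to-letter transition can be approximated in Python with a zero-width regular expression. This is a sketch of the general technique only and is not a reproduction of OpenRefine’s internal behavior; the function name split_number_to_letter and the sample string are hypothetical. It relies on Python 3.7 or later, where re.split accepts patterns that match an empty string.

```python
import re

# Zero-width position that follows a digit and precedes a letter.
_NUMBER_TO_LETTER = re.compile(r"(?<=[0-9])(?=[A-Za-z])")


def split_number_to_letter(cell):
    """Split a cell wherever a digit is immediately followed by a letter."""
    return [part for part in _NUMBER_TO_LETTER.split(cell) if part]


# Hypothetical cell in which page numbers run straight into the text.
segments = split_number_to_letter("1943Mr. Secretary 25Thank you")
print(segments)
```

Because the pattern consumes no characters, nothing is lost in the split: joining the segments reproduces the original cell.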
The progression of our data modeling was simply to find the various terms (variables) and plot them on graphs. The graphs represent frequency, recurrence, and inter-connectivity between variables, and they were accomplished using a combination of computer science, statistics, and mathematics. Several tools from the Data Visualization Catalogue were explored, but unfortunately many required creating a sign-in before we could assess what they were capable of.
Ultimately, the secondary data from OpenRefine (instances of recurrence of agreed-upon terms) were keyed into Excel to create the charts for visualization. Figure 5 illustrates the number of appearances of single terms throughout the volume, as well as the percentage rate at which each term appears in the document as a whole. Stabilization is by far the most frequent term in this data selection, at approximately 75 appearances, or 4.5%. As observed previously, the presence of other European and Asian countries in the volume indicates the scope of WWII and the Holocaust. Combined terms do indicate a higher instance of narrative on Italy and its official currency; however, in the context of the overall volume such specificity is negligible, as seen in Figure 6. (Er)
Figure 5: Recur rate of single terms by number of appearances and percentage of the document.
Figure 6: Recur rate of combined terms by number of appearances and by percentage of the overall document.
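The percentage form of the recur rate shown in Figures 5 and 6 is a simple ratio, sketched below in Python. The source does not state which denominator was used (for example, total rows or total pages), so total_units is an assumption, and the numbers in the example are hypothetical rather than taken from the volume.

```python
def recur_rate(appearances, total_units):
    """Return the share of units in which a term appears, as a percentage."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return round(100 * appearances / total_units, 1)


# Hypothetical figures: a term found in 9 of 200 units.
example_rate = recur_rate(9, 200)
print(f"{example_rate}%")
```

Whatever unit is chosen, it needs to be the same for every term so that the percentages remain comparable across the charts.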
There is much to gain from the Franklin D. Roosevelt Presidential Library’s embrace of digitization and open access, as evidenced by the ease with which one can query entire volumes from the Morgenthau collection. A shift has occurred since the 1970s, when a scholar “accused the Library’s staff of withholding documents so that it could publish them first” (Danielson 1989: 53). The Library and many other institutions continually live up to the Society of American Archivists’ core value of promoting and providing “the widest possible accessibility of materials, while respecting legal and ethical access restrictions including public statutes, cultural protections, donor contracts, and privacy requirements” (SAA, 2020).
The trouble we had turning PDFs into searchable files shows that the quest for access does not end once an item is located online. Presenting such data as searchable text allows both patrons and professionals to explore the granular details of a collection without losing time combing through pages of documents. Data manipulation and data modeling similarly allow people to view many holdings at a macro scale, thus practicing a level of access that accords with the ethical standards set forth by SAA. (Mc)
Our analysis of the spoken text revealed a number of interesting connections between financial topics and Italy, as well as a lack of such information about other European countries. Further studies of Volume 25 might explore the historical implications of Italy’s prominence in these press conferences: the concurrent battles of Anzio and Monte Cassino could have contributed to frequent mentions of the country. Additional data manipulations should first correct occasional spelling mistakes in the converted text that could affect subsequent searches. Correcting these errors en masse through OpenRefine could be a painstaking but worthwhile endeavor that ensures these documents will be accessible to diverse and curious users. (Mc)
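As one illustration of how such spelling corrections might be proposed before they are reviewed by hand, the Python sketch below uses the standard library’s difflib to match misspelled tokens against a list of expected terms. This is an assumption about a possible approach, not a description of OpenRefine’s clustering features; the function name suggest_corrections, the cutoff value, and the sample tokens are hypothetical.

```python
import difflib


def suggest_corrections(tokens, vocabulary, cutoff=0.85):
    """Map each unrecognized token to its closest vocabulary term, if any."""
    suggestions = {}
    for token in tokens:
        if token in vocabulary:
            continue  # already spelled as expected
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if matches:
            suggestions[token] = matches[0]
    return suggestions


# Hypothetical OCR output checked against a few of the agreed-upon terms.
vocabulary = ["stabilization", "currency", "Italy"]
tokens = ["stabilizaton", "currency", "Itally", "Anzio"]
corrections = suggest_corrections(tokens, vocabulary)
print(corrections)
```

A fairly high cutoff keeps unrelated words such as place names from being rewritten, at the cost of missing more heavily garbled tokens, so the suggestions would still need human review.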