One of the early steps in taking in a digital accession is to perform some kind of appraisal, including an inventory or survey of the materials. The survey informs your next steps for caring for the deposit, including formulating an appropriate preservation plan. Ideally this work is aided by an inventory and an interview provided by the donor, but often a donor leaves no inventory at all. When an accession is relatively small, we can inventory it by simply browsing the contents and making notes on the formats and directory structure of the files. However, when the number of files climbs into the hundreds, thousands, or tens of thousands, we need to reach for more powerful tools. This notebook demonstrates, in Python code, some methods for building an inventory of a set of files, along with code to summarize and visualize accession contents. Many of the same principles at work in these examples underlie larger preservation tools and environments, such as BitCurator or Preservica, but you may find these techniques useful as complementary tools.
The inventory and summary may reveal file formats that your institution has not encountered before. Such formats may be unique to the donor's discipline or trade, or to a particular researcher's practice. It is important to understand how these files were produced and the intellectual works within them, but first you have to find them. Folder structures are also sometimes standardized for a particular tool or practice, laid out according to a pattern that serves the workflow in which they were produced or managed. Standardized folder structures are often used for websites, geographic databases, audiovisual productions, and software projects.
There are other issues you will also want to look for, but they are beyond the scope of this notebook.
The first step is to build a basic file inventory for a sample accession, using Python code. The code below collects information about every file within the samples folder, including all of its subfolders. The details about each file are appended to a comma-separated values (CSV) text file. You can point the inventory code at any accession folder to build an inventory CSV file.
The inventory code records this information for each file: name, path within the accession, size in bytes, modified timestamp, year modified, file extension, mimetype category, mimetype, and the first two folder levels.
import os
from os.path import join, getsize, getmtime
import datetime
import csv
import mimetypes

accession_dir = './samples/'

# function to identify file extension
def get_extension(name):
    result = None
    i = name.rfind('.')
    if i > 0:
        result = name[i+1:]
    return result
file_count = 0
with open('inventory.csv', 'w', newline='') as csvfile:
    mywriter = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    mywriter.writerow(['name', 'path', 'bytes', 'modified', 'mod_year', 'extension', 'category', 'mimetype', 'folder1', 'folder2'])
    # This is where the "crawl" of files starts, using the Python os.walk() function
    for folder, subfolders, files in os.walk(accession_dir):
        # Walk gives us the folder name (folder), subfolders list, and file names list for each folder in our accession.
        for name in files:
            file_count = file_count + 1  # Count this file
            fullpath = join(folder, name)  # The full path to a file is made by joining folder name and file name.
            mod_dt = datetime.datetime.fromtimestamp(getmtime(fullpath))  # modified timestamp is converted into a Python date
            (mime, encoding) = mimetypes.guess_type(fullpath)  # Python mimetypes module will guess a mimetype
            if mime is None:
                mime = "application/octet-stream"  # This is a generic "stream of bytes" mimetype
            category = mime.split('/')[0]  # The high-level mimetype is the half before the slash '/', such as 'text' or 'image'
            folder_path = folder[len(accession_dir):]  # We trim off the accession folder path to get just the path within the accession.
            path_segments = folder_path.split('/')
            folder1 = path_segments[0]
            folder2 = path_segments[1] if len(path_segments) > 1 else '.'
            mywriter.writerow([name, folder_path,  # This writes a line to the CSV file
                               getsize(fullpath),
                               mod_dt.isoformat(),
                               mod_dt.year,
                               get_extension(name),
                               category,
                               mime,
                               folder1,
                               folder2])
print("Inventory done: " + accession_dir)
print("Inventory file count:" + str(file_count))
Inventory done: ./samples/
Inventory file count:656
After you execute the file inventory code above, you will see a new CSV file in your notebook directory. If you like, you can open that inventory CSV file in Jupyter, which will display it in tabular form. The next block of code will display the first few lines of the file as raw text.
with open('inventory.csv', 'r') as csvfile:
    lines = ''
    for x in range(0, 6):
        lines = lines + csvfile.readline()
print(lines)
name,path,bytes,modified,mod_year,extension,category,mimetype,folder1,folder2
fade.gif,Agency History,828,2015-02-17T17:11:48,2015,gif,image,image/gif,Agency History,.
hts-log.txt,Agency History,4489,2015-02-17T17:11:48,2015,txt,text,text/plain,Agency History,.
History of BOP.whtt,Agency History,0,2015-02-17T17:11:48,2015,whtt,application,application/octet-stream,Agency History,.
index.html,Agency History,5220,2015-02-17T17:11:48,2015,html,text,text/html,Agency History,.
backblue.gif,Agency History,4243,2015-02-17T17:11:48,2015,gif,image,image/gif,Agency History,.
You can see above that the first line of the CSV file is made up of column names. The data then follows, row by row.
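If you want to work with the rows directly in Python, the standard library's csv.DictReader maps each row to the header names. A minimal sketch (the column names come from our inventory file):

import csv

with open('inventory.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)  # uses the header row as dictionary keys
    row_count = 0
    zero_byte = 0
    for row in reader:
        row_count += 1
        if int(row['bytes']) == 0:
            zero_byte += 1  # zero-byte files can signal a failed or incomplete transfer
print('rows:', row_count, 'zero-byte files:', zero_byte)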
Next we will load the same CSV data into a Python data science tool called Pandas. This means that Python will read your file into a Pandas "data frame" object that we will use going forward. Data frame objects hold tabular data for further processing and display. We will first print the same first few rows of data to give you a sense of how it is organized and to confirm that it is working properly. We will also print the shape attribute of the data frame, which gives us its exact dimensions (rows and columns).
import pandas as pd
import numpy as np
df = pd.read_csv('inventory.csv')
display(df.head())
print('dimensions: '+ str(df.shape))
|    | name | path | bytes | modified | mod_year | extension | category | mimetype | folder1 | folder2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fade.gif | Agency History | 828 | 2015-02-17T17:11:48 | 2015 | gif | image | image/gif | Agency History | . |
| 1 | hts-log.txt | Agency History | 4489 | 2015-02-17T17:11:48 | 2015 | txt | text | text/plain | Agency History | . |
| 2 | History of BOP.whtt | Agency History | 0 | 2015-02-17T17:11:48 | 2015 | whtt | application | application/octet-stream | Agency History | . |
| 3 | index.html | Agency History | 5220 | 2015-02-17T17:11:48 | 2015 | html | text | text/html | Agency History | . |
| 4 | backblue.gif | Agency History | 4243 | 2015-02-17T17:11:48 | 2015 | gif | image | image/gif | Agency History | . |
dimensions: (656, 10)
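With the data frame loaded, you can also answer quick aggregate questions before visualizing anything. A short sketch using standard Pandas calls (these aggregates are an addition, not part of the inventory code above):

# Total size of the accession in bytes
print('total bytes:', df['bytes'].sum())

# Number of files in each high-level mimetype category
print(df['category'].value_counts())

# The five largest files often deserve individual attention
display(df.nlargest(5, 'bytes')[['name', 'path', 'bytes']])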
The next step is to use some powerful Python tools to show us the inventory in a variety of ways. These tools are not installed by default in Jupyter environments, so you will first have to run the following code block to install them.
!pip install plotly jupyter-dash
Requirement already satisfied: plotly in /opt/conda/lib/python3.8/site-packages (5.5.0)
Requirement already satisfied: jupyter-dash in /opt/conda/lib/python3.8/site-packages (0.4.0)
[remaining "Requirement already satisfied" lines for their dependencies trimmed]
Our next step is to visualize the file format information in aggregate, so that we can make preservation decisions. Here we look at a summary of the media types using the IANA MIME type standard. MIME types have a type and a subtype, separated by a slash (/) character:
type/subtype
For example, these are some common MIME types:
image/jpeg
text/plain
application/pdf
Sometimes MIME types are longer and more complex. Here you can see the difference between older MS Word documents and the more recent .docx documents:
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
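You can experiment with how Python maps filenames to MIME types using the same mimetypes module the inventory code relies on. A small sketch (the filenames are illustrative only):

import mimetypes

for name in ['photo.jpeg', 'notes.txt', 'report.pdf']:
    mime, encoding = mimetypes.guess_type(name)  # returns a (mimetype, encoding) pair
    mtype, subtype = mime.split('/')             # split into type and subtype
    print(name, '-> type:', mtype, ' subtype:', subtype)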
In the visualization below, the area of each block reflects the number of files of each type and subtype. The color of each block represents the number of bytes stored in each type and subtype. This kind of visualization is called a treemap; treemaps were first developed in the early 1990s by Ben Shneiderman in the Human-Computer Interaction Lab at the University of Maryland's iSchool.
These treemaps are interactive: detailed values are presented when you hover your pointer over a block, and clicking on a block zooms in.
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Look over the treemap above and notice which types of files are the most common (larger boxes). What conclusions can you draw about this accession based on the common file types? What does the treemap suggest is significant about the PDF files in this accession?
A quick look at the same visualization for file extension (the part of the filename after the last dot) shows why we use mimetypes. File extensions are not always informative, whereas at least mimetypes give us a high-level category for each format.
Python's mimetypes guessing module is useful for a quick analysis, but there are more sophisticated file format identification tools that you will find helpful. Tools such as DROID and Siegfried (recommended) can identify detailed sub-formats for the files in your accessions, such as which version of PDF was used to produce a document. These tools require a fully functional computer with administrator access to install, so they are beyond the scope of this notebook demonstration, but they will identify many more specific formats than Python.
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files by File Extension (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'extension'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
The treemap above shows the distribution of files, by area, and bytes, by color, across file extensions. How does it differ from the treemap showing the same distributions across mimetypes? What information is missing?
This treemap will help you identify folders that may hold a lot of content or have a special structure, whether many files or very large files. Run the visualization code below and note the way the treemap encodes both of these metrics visually.
Can you identify the folders that contain the most data in terms of bytes? Which folders contain many smaller files?
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files per Folder (Depth of Two) (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'folder1', 'folder2'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
People organize files in a variety of ways. Sometimes their folder and sub-folder names are entirely indicative of the content. In other cases, folder names and structures are constrained by the technical considerations of the software tools that either produce or consume the files. As you look at the distribution of files, by area, and bytes, by color, across the top two folder levels, what can you see?
What folder contained the largest files?
Why are there many folders called "www.bop.gov"?
Which folders are named to reflect a person's choices and which seem to be named due to a technical consideration?
As mentioned at the start of this notebook, folder structures are sometimes part of a production workflow for a particular tool or discipline, such as GIS, audiovisual production, or websites. One way to uncover such structures in larger accessions is to determine the most common folder and file names. Usually folders that are patterned by a workflow will reuse the same folder names or file names. These folder and file name "slots" determine the purpose of the folder or file within the larger workflow.
In the code below we take the "name" column from our data frame and convert the individual names into a shared set of unique categories (.astype("category")). Instead of separate text values, the names are now coded as category numbers, so each occurrence of the same name has the same code. Pandas can then add up the occurrences of each code and give us value counts (.value_counts()). Finally, we call the head() function to get just the 25 most frequent names.
display(df["name"].astype("category").value_counts().head(n=25))
index.html               45
backblue.gif              7
fade.gif                  7
new.txt                   6
new.lst                   6
readme.txt                6
new.zip                   6
winprofile.ini            6
doit.log                  6
cookies.txt               6
left_repeat.jpg           5
matters-over.gif          5
business-over.gif         5
business.gif              5
locator.gif               5
careers-over.gif          5
careers.gif               5
locator-over.gif          5
blue_cover.gif            5
yellow_bar_middle.gif     5
left_curve.gif            5
arrow.gif                 5
hts-log.txt               5
home.gif                  5
home-over.gif             5
Name: name, dtype: int64
As we look at this list of the most frequent file names in this accession, we can clearly see that the most common file name is index.html, a web page file, with 45 occurrences. Next there are some GIF files; these are usually web graphics. Then we have some files called "new.txt", "new.lst", and "readme.txt". Clues like these make digital appraisal something of a detective story. You might dive deeper into these files by reading them or by researching the file names on the Internet to see what workflows may have produced them.
Depending upon the operating system, files will have a modified date and/or a created date. Usually the modified date is the more significant of the two, because it indicates the last edit to a file and may indicate when it was completed or no longer relevant. When transferring files to our own storage, we have to be careful to either capture or preserve these dates as metadata.
NOTE: Because this notebook is copied from a Git repository, the sample files included here do not have their original modified dates. Since we cannot provide accurately dated files via Git, we provide some alternative inventory CSV files that were prepared separately.
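If you need to examine timestamps directly, os.stat exposes them. A minimal sketch; note that st_ctime means metadata-change time on Unix but creation time on Windows, while macOS and BSD expose a separate st_birthtime:

import os
import datetime

st = os.stat('inventory.csv')  # any file path will do
modified = datetime.datetime.fromtimestamp(st.st_mtime)  # last content modification
changed = datetime.datetime.fromtimestamp(st.st_ctime)   # metadata change (Unix) or creation (Windows)
print('modified:', modified.isoformat())
print('ctime:', changed.isoformat())
# On macOS/BSD, a true creation time is available as st.st_birthtime

The treemap below groups the sample files by their modified year, with the caveat above about these particular dates.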
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files per Year (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'mod_year'],
                 color='mod_year')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
If your initial inventory reveals a sub-folder that is of particular interest, you can run the same visualizations against a specific subfolder. Let's pick a sub-folder and show the mimetype distribution of that subfolder by itself.
The first step is to subset our inventory dataframe, to include only the desired sub-folder.
subfolder_df = df[df['folder1'] == 'Agency History']
subfolder_df.tail()
|     | name | path | bytes | modified | mod_year | extension | category | mimetype | folder1 | folder2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 121 | new.zip | Agency History/hts-cache | 359130 | 2015-02-17T17:11:41 | 2015 | zip | application | application/zip | Agency History | hts-cache |
| 122 | new.lst | Agency History/hts-cache | 5357 | 2015-02-17T17:11:41 | 2015 | lst | application | application/octet-stream | Agency History | hts-cache |
| 123 | new.txt | Agency History/hts-cache | 29763 | 2015-02-17T17:11:41 | 2015 | txt | text | text/plain | Agency History | hts-cache |
| 124 | readme.txt | Agency History/hts-cache | 612 | 2015-02-17T17:11:41 | 2015 | txt | text | text/plain | Agency History | hts-cache |
| 125 | doit.log | Agency History/hts-cache | 702 | 2015-02-17T17:11:41 | 2015 | log | application | application/octet-stream | Agency History | hts-cache |
Next we use the same treemap code as our prior mimetype visualization, but supply the new subfolder dataframe.
fig = px.treemap(subfolder_df,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Pick a different top-level folder and then run a visualization that is limited to that folder. First subset the dataframe as shown above and then use that dataframe to run one of the prior visualizations.
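As a sketch of the pattern, assuming a hypothetical top-level folder named 'Annual Reports' (substitute a real name from df['folder1'].unique()):

# 'Annual Reports' is a hypothetical folder name; pick one from your own inventory
other_df = df[df['folder1'] == 'Annual Reports']

fig = px.treemap(other_df,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()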
A second and larger set of sample files was inventoried previously; those files are not included with this notebook. However, the inventory file is supplied, and you can use it to visualize a larger deposit. Below we will load this inventory and visualize it with the MIME type chart.
import pandas as pd
import numpy as np
df173 = pd.read_csv('rg-173-inventory.csv')
print(df173.shape)
df173.head()
(10413, 10)
|    | name | path | bytes | modified | mod_year | extension | category | mimetype | folder1 | folder2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | spmjc108.txt | Copps/2001 | 21483 | 2016-04-06T19:23:18 | 2016 | txt | text | text/plain | Copps | 2001 |
| 1 | spmjc101.txt | Copps/2001 | 1996 | 2016-04-06T19:23:16 | 2016 | txt | text | text/plain | Copps | 2001 |
| 2 | spmjc106.txt | Copps/2001 | 15814 | 2016-04-06T19:23:17 | 2016 | txt | text | text/plain | Copps | 2001 |
| 3 | spmjc107.html | Copps/2001 | 16333 | 2016-04-06T19:23:17 | 2016 | html | text | text/html | Copps | 2001 |
| 4 | spmjc103.doc | Copps/2001 | 35840 | 2016-04-06T19:23:17 | 2016 | doc | application | application/msword | Copps | 2001 |
You can see above that this inventory describes 10,413 files, a more challenging quantity for manual appraisal. Before plotting, we can get a quick sense of scale, and then run our visualizations on this data.
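A quick check of scale, reusing the same Pandas aggregate calls shown earlier (a sketch, not part of the original inventory workflow):

print('total bytes:', df173['bytes'].sum())        # overall size of the deposit
print(df173['extension'].value_counts().head(10))  # the ten most common file extensions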
import plotly.express as px

fig = px.treemap(df173,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
We can run the folder treemap on this larger inventory as well.

import plotly.express as px

fig = px.treemap(df173,
                 title="Total Files per Folder (Depth of Two) (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'folder1', 'folder2'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Let's reuse our code from the previous example to look over the common file names.
display(df173["name"].astype("category").value_counts().head(n=25))
[5]DocumentSummaryInformation    873
[5]SummaryInformation            873
WordDocument                     870
[1]CompObj                       870
1Table                           870
Data                             320
Welcome.html                       4
2001.gif                           4
Current User                       2
PowerPoint Document                2
Pictures                           2
stwek012.html                      2
st970807.html                      2
spsn903.txt                        2
st970807.txt                       2
st970811.html                      2
st970811.txt                       2
2000.gif                           2
1999.gif                           2
1998.gif                           2
spsn903.wp                         2
stgt139.doc                        2
080296.txt                         2
spjhq706.html                      2
stmkp102.txt                       2
Name: name, dtype: int64