One of the early steps in taking in a digital accession is to perform some kind of appraisal, including an inventory or survey of the materials. The survey informs your next steps for caring for the deposit, including formulating an appropriate preservation plan. Ideally this work is aided by an inventory and an interview provided by the donor, but often a donor leaves no inventory at all. When an accession is relatively small, we can inventory it by simply browsing the contents and making notes on the formats and directory structure of the files. However, when the number of files climbs into the hundreds, thousands, or tens of thousands, we need to reach for more powerful tools. This notebook demonstrates, in Python code, some methods for building an inventory of a set of files, along with code to summarize and visualize accession contents. Many of the same principles at work in these examples underlie larger preservation tools and environments, such as BitCurator or Preservica, but you may find these techniques useful as complementary tools.
The inventory and summary may reveal file formats that your institution has not encountered before. Such formats may be unique to the donor's discipline or trade, or to a particular researcher's practice. It is important to understand how these files were produced and the intellectual works within them, but first you have to find them. Folder structures are also sometimes standardized for a particular tool or practice, laid out according to a pattern that serves the workflow in which they were produced or managed. Standardized folder structures are often used for websites, geographic databases, audiovisual productions, and software projects.
There are other issues you will also want to look for, but they are beyond the scope of this notebook.
The first step is to build a basic file inventory for a sample accession, using Python code. The code below collects information about every file within the samples folder, including all of its subfolders. The details about each file are appended to a comma-separated values (CSV) text file. You can point the inventory code at any accession folder to build an inventory CSV file.
The inventory code records this information for each file: name, path within the accession, size in bytes, modified timestamp, year modified, file extension, mimetype category, mimetype, and the first two folder levels.
import os
from os.path import join, getsize, getmtime
import datetime
import csv
import mimetypes

accession_dir = './samples/'

# function to identify file extension
def get_extension(name):
    result = None
    i = name.rfind('.')
    if i > 0:
        result = name[i+1:]
    return result
file_count = 0
with open('inventory.csv', 'w', newline='') as csvfile:
    mywriter = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    mywriter.writerow(['name', 'path', 'bytes', 'modified', 'mod_year', 'extension', 'category', 'mimetype', 'folder1', 'folder2'])
    # This is where the "crawl" of files starts, using the Python os.walk() function
    for folder, subfolders, files in os.walk(accession_dir):
        # Walk gives us the folder name (folder), subfolders list, and file names list for each folder in our accession.
        for name in files:
            file_count = file_count + 1  # Count this file
            fullpath = join(folder, name)  # The full path to a file is made by joining folder name and file name.
            mod_dt = datetime.datetime.fromtimestamp(getmtime(fullpath))  # modified timestamp is converted into a Python date
            (mime, encoding) = mimetypes.guess_type(fullpath)  # Python mimetypes module will guess a mimetype
            if mime is None:
                mime = "application/octet-stream"  # This is a generic "stream of bytes" mimetype
            category = mime.split('/')[0]  # The high-level mimetype is the half before the slash '/', such as 'text' or 'image'
            folder_path = folder[len(accession_dir):]  # We trim off the accession folder path to get just the path within the accession.
            path_segments = folder_path.split('/')
            folder1 = path_segments[0]
            folder2 = path_segments[1] if len(path_segments) > 1 else '.'
            mywriter.writerow([name, folder_path,  # This writes a line to the CSV file
                               getsize(fullpath),
                               mod_dt.isoformat(),
                               mod_dt.year,
                               get_extension(name),
                               category,
                               mime,
                               folder1,
                               folder2])
print("Inventory done: " + accession_dir)
print("Inventory file count:" + str(file_count))
Inventory done: ./samples/
Inventory file count:656
After you execute the file inventory code above, you will see a new CSV file in your notebook directory. If you like, you can open that inventory CSV file in Jupyter, which will display it in tabular form. The next block of code will display the first few lines of the file as raw text.
with open('inventory.csv', 'r') as csvfile:
    lines = ''
    for x in range(0, 6):
        lines = lines + csvfile.readline()
print(lines)
name,path,bytes,modified,mod_year,extension,category,mimetype,folder1,folder2
fade.gif,Agency History,828,2015-02-17T17:11:48,2015,gif,image,image/gif,Agency History,.
hts-log.txt,Agency History,4489,2015-02-17T17:11:48,2015,txt,text,text/plain,Agency History,.
History of BOP.whtt,Agency History,0,2015-02-17T17:11:48,2015,whtt,application,application/octet-stream,Agency History,.
index.html,Agency History,5220,2015-02-17T17:11:48,2015,html,text,text/html,Agency History,.
backblue.gif,Agency History,4243,2015-02-17T17:11:48,2015,gif,image,image/gif,Agency History,.
You can see above that the first line of the CSV file is made up of column names. The data then follows, row by row.
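If you want to work with the rows directly in Python, the standard library's csv.DictReader maps each row to the header names. A minimal sketch (the column names come from our inventory file):

import csv

with open('inventory.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)  # uses the header row as dictionary keys
    row_count = 0
    zero_byte = 0
    for row in reader:
        row_count += 1
        if int(row['bytes']) == 0:
            zero_byte += 1  # zero-byte files can signal a failed or incomplete transfer
print('rows:', row_count, 'zero-byte files:', zero_byte)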
Next we will load the same CSV data into a Python data science tool called Pandas. This means that Python will read your file into a Pandas "data frame" object that we will use going forward. Data frame objects hold tabular data for further processing and display. We will first print the same first few rows of data to give you a sense of how it is organized and to confirm that it is working properly. We will also print the shape attribute of the data frame, which gives us its exact dimensions (rows and columns).
import pandas as pd
import numpy as np
df = pd.read_csv('inventory.csv')
display(df.head())
print('dimensions: '+ str(df.shape))
|    | name | path | bytes | modified | mod_year | extension | category | mimetype | folder1 | folder2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fade.gif | Agency History | 828 | 2015-02-17T17:11:48 | 2015 | gif | image | image/gif | Agency History | . |
| 1 | hts-log.txt | Agency History | 4489 | 2015-02-17T17:11:48 | 2015 | txt | text | text/plain | Agency History | . |
| 2 | History of BOP.whtt | Agency History | 0 | 2015-02-17T17:11:48 | 2015 | whtt | application | application/octet-stream | Agency History | . |
| 3 | index.html | Agency History | 5220 | 2015-02-17T17:11:48 | 2015 | html | text | text/html | Agency History | . |
| 4 | backblue.gif | Agency History | 4243 | 2015-02-17T17:11:48 | 2015 | gif | image | image/gif | Agency History | . |
dimensions: (656, 10)
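With the data frame loaded, you can also answer quick aggregate questions before visualizing anything. A short sketch using standard Pandas calls (these aggregates are an addition, not part of the inventory code above):

# Total size of the accession in bytes
print('total bytes:', df['bytes'].sum())

# Number of files in each high-level mimetype category
print(df['category'].value_counts())

# The five largest files often deserve individual attention
display(df.nlargest(5, 'bytes')[['name', 'path', 'bytes']])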
The next step is to use some powerful Python tools to show us the inventory in a variety of ways. These tools are not installed by default in Jupyter environments, so you will first have to run the following code block to install them.
!pip install plotly jupyter-dash
Requirement already satisfied: plotly in /opt/conda/lib/python3.8/site-packages (5.5.0)
Requirement already satisfied: jupyter-dash in /opt/conda/lib/python3.8/site-packages (0.4.0)
[remaining "Requirement already satisfied" lines for their dependencies trimmed]
Our next step is to visualize the file format information in aggregate, so that we can make preservation decisions. Here we look at a summary of the media types using the IANA MIME type standard. MIME types have a type and a subtype, separated by a slash (/) character:
type/subtype
For example, these are some common MIME types:
image/jpeg
text/plain
application/pdf
Sometimes MIME types are longer and more complex. Here you can see the difference between older MS Word documents and the more recent .docx documents:
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
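You can experiment with how Python maps filenames to MIME types using the same mimetypes module the inventory code relies on. A small sketch (the filenames are illustrative only):

import mimetypes

for name in ['photo.jpeg', 'notes.txt', 'report.pdf']:
    mime, encoding = mimetypes.guess_type(name)  # returns a (mimetype, encoding) pair
    mtype, subtype = mime.split('/')             # split into type and subtype
    print(name, '-> type:', mtype, ' subtype:', subtype)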
In the visualization below, the area of each block reflects the number of files of each type and subtype. The color of each block represents the number of bytes stored in each type and subtype. This kind of visualization is called a treemap; treemaps were first developed in the early 1990s by Ben Shneiderman in the Human-Computer Interaction Lab at the University of Maryland's iSchool.
These treemaps are interactive: detailed values are presented when you hover your pointer over a block, and clicking on a block zooms in.
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Look over the treemap above and notice which types of files are the most common (larger boxes). What conclusions can you draw about this accession based on the common file types? What does the treemap suggest is significant about the PDF files in this accession?
A quick look at the same visualization for file extension (the part of the filename after the last dot) shows why we use mimetypes. File extensions are not always informative, whereas at least mimetypes give us a high-level category for each format.
Python's mimetypes guessing module is useful for a quick analysis, but there are more sophisticated file format identification tools that you will find helpful. Tools such as DROID and Siegfried (recommended) can identify detailed sub-formats for the files in your accessions, such as which version of PDF was used to produce a document. These tools require a fully functional computer with administrator access to install, so they are beyond the scope of this notebook demonstration, but they will identify many more specific formats than Python.
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files by File Extension (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'extension'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
The treemap above shows the distribution of files, by area, and bytes, by color, across file extensions. How does it differ from the treemap showing the same distributions across mimetypes? What information is missing?
This treemap will help you identify folders that may hold a lot of content or have a special structure, whether many files or very large files. Run the visualization code below and note the way the treemap encodes both of these metrics visually.
Can you identify the folders that contain the most data in terms of bytes? Which folders contain many smaller files?
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files per Folder (Depth of Two) (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'folder1', 'folder2'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
People organize files in a variety of ways. Sometimes their folder and sub-folder names are entirely indicative of the content. In other cases, folder names and structures are constrained by the technical considerations of the software tools that either produce or consume the files. As you look at the distribution of files, by area, and bytes, by color, across the top two folder levels, what can you see?
What folder contained the largest files?
Why are there many folders called "www.bop.gov"?
Which folders are named to reflect a person's choices and which seem to be named due to a technical consideration?
As mentioned at the start of this notebook, folder structures are sometimes part of a production workflow for a particular tool or discipline, such as GIS, audiovisual production, or websites. One way to uncover such structures in larger accessions is to determine the most common folder and file names. Usually folders that are patterned by a workflow will reuse the same folder names or file names. These folder and file name "slots" determine the purpose of the folder or file within the larger workflow.
In the code below we take the "name" column from our data frame and convert the individual names into a shared set of unique categories (.astype("category")). Instead of separate text values, the names are now coded as category numbers, so each occurrence of the same name has the same code. Pandas can then add up the occurrences of each code and give us value counts (.value_counts()). Finally, we call the head() function to get just the 25 most frequent names.
display(df["name"].astype("category").value_counts().head(n=25))
index.html               45
backblue.gif              7
fade.gif                  7
new.txt                   6
new.lst                   6
readme.txt                6
new.zip                   6
winprofile.ini            6
doit.log                  6
cookies.txt               6
left_repeat.jpg           5
matters-over.gif          5
business-over.gif         5
business.gif              5
locator.gif               5
careers-over.gif          5
careers.gif               5
locator-over.gif          5
blue_cover.gif            5
yellow_bar_middle.gif     5
left_curve.gif            5
arrow.gif                 5
hts-log.txt               5
home.gif                  5
home-over.gif             5
Name: name, dtype: int64
As we look at this list of the most frequent file names in this accession, we can clearly see that the most common file name is index.html, a web page file, with 45 occurrences. Next there are some GIF files; these are usually web graphics. Then we have some files called "new.txt", "new.lst", and "readme.txt". Clues like these make digital appraisal something of a detective story. You might dive deeper into these files by reading them or by researching the file names on the Internet to see what workflows may have produced them.
Depending upon the operating system, files will have a modified date and/or a created date. Usually the modified date is the more significant of the two, because it indicates the last edit to a file and may indicate when it was completed or no longer relevant. When transferring files to our own storage, we have to be careful to either capture or preserve these dates as metadata.
NOTE: Because this notebook is copied from a Git repository, the sample files included here do not have their original modified dates. Since we cannot provide accurately dated files via Git, we provide some alternative inventory CSV files that were prepared separately.
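If you need to examine timestamps directly, os.stat exposes them. A minimal sketch; note that st_ctime means metadata-change time on Unix but creation time on Windows, while macOS and BSD expose a separate st_birthtime:

import os
import datetime

st = os.stat('inventory.csv')  # any file path will do
modified = datetime.datetime.fromtimestamp(st.st_mtime)  # last content modification
changed = datetime.datetime.fromtimestamp(st.st_ctime)   # metadata change (Unix) or creation (Windows)
print('modified:', modified.isoformat())
print('ctime:', changed.isoformat())
# On macOS/BSD, a true creation time is available as st.st_birthtime

The treemap below groups the sample files by their modified year, with the caveat above about these particular dates.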
import plotly.express as px

fig = px.treemap(df,
                 title="Total Files per Year (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'mod_year'],
                 color='mod_year')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
If your initial inventory reveals a sub-folder that is of particular interest, you can run the same visualizations against a specific subfolder. Let's pick a sub-folder and show the mimetype distribution of that subfolder by itself.
The first step is to subset our inventory dataframe, to include only the desired sub-folder.
subfolder_df = df[df['folder1'] == 'Agency History']
subfolder_df.tail()
|     | name | path | bytes | modified | mod_year | extension | category | mimetype | folder1 | folder2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 121 | new.zip | Agency History/hts-cache | 359130 | 2015-02-17T17:11:41 | 2015 | zip | application | application/zip | Agency History | hts-cache |
| 122 | new.lst | Agency History/hts-cache | 5357 | 2015-02-17T17:11:41 | 2015 | lst | application | application/octet-stream | Agency History | hts-cache |
| 123 | new.txt | Agency History/hts-cache | 29763 | 2015-02-17T17:11:41 | 2015 | txt | text | text/plain | Agency History | hts-cache |
| 124 | readme.txt | Agency History/hts-cache | 612 | 2015-02-17T17:11:41 | 2015 | txt | text | text/plain | Agency History | hts-cache |
| 125 | doit.log | Agency History/hts-cache | 702 | 2015-02-17T17:11:41 | 2015 | log | application | application/octet-stream | Agency History | hts-cache |
Next we use the same treemap code as our prior mimetype visualization, but supply the new subfolder dataframe.
fig = px.treemap(subfolder_df,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Pick a different top-level folder and then run a visualization that is limited to that folder. First subset the dataframe as shown above and then use that dataframe to run one of the prior visualizations.
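As a sketch of the pattern, assuming a hypothetical top-level folder named 'Annual Reports' (substitute a real name from df['folder1'].unique()):

# 'Annual Reports' is a hypothetical folder name; pick one from your own inventory
other_df = df[df['folder1'] == 'Annual Reports']

fig = px.treemap(other_df,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()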
A second and larger set of sample files was inventoried previously; those files are not included with this notebook. However, the inventory file is supplied, and you can use it to visualize a larger deposit. Below we will load this inventory and visualize it with the MIME type chart.
import pandas as pd
import numpy as np
df173 = pd.read_csv('rg-173-inventory.csv')
print(df173.shape)
df173.head()
(10413, 10)
|    | name | path | bytes | modified | mod_year | extension | category | mimetype | folder1 | folder2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | spmjc108.txt | Copps/2001 | 21483 | 2016-04-06T19:23:18 | 2016 | txt | text | text/plain | Copps | 2001 |
| 1 | spmjc101.txt | Copps/2001 | 1996 | 2016-04-06T19:23:16 | 2016 | txt | text | text/plain | Copps | 2001 |
| 2 | spmjc106.txt | Copps/2001 | 15814 | 2016-04-06T19:23:17 | 2016 | txt | text | text/plain | Copps | 2001 |
| 3 | spmjc107.html | Copps/2001 | 16333 | 2016-04-06T19:23:17 | 2016 | html | text | text/html | Copps | 2001 |
| 4 | spmjc103.doc | Copps/2001 | 35840 | 2016-04-06T19:23:17 | 2016 | doc | application | application/msword | Copps | 2001 |
You can see above that this inventory describes 10,413 files, a more challenging quantity for manual appraisal. Before plotting, we can get a quick sense of scale, and then run our visualizations on this data.
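A quick check of scale, reusing the same Pandas aggregate calls shown earlier (a sketch, not part of the original inventory workflow):

print('total bytes:', df173['bytes'].sum())        # overall size of the deposit
print(df173['extension'].value_counts().head(10))  # the ten most common file extensions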
import plotly.express as px

fig = px.treemap(df173,
                 title="Total Files Stored by Mimetype (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'category', 'mimetype'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
We can run the folder treemap on this larger inventory as well.

import plotly.express as px

fig = px.treemap(df173,
                 title="Total Files per Folder (Depth of Two) (Colored by Bytes Stored)",
                 path=[px.Constant("all"), 'folder1', 'folder2'],
                 color='bytes')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Let's reuse our code from the previous example to look over the common file names.
display(df173["name"].astype("category").value_counts().head(n=25))
[5]DocumentSummaryInformation    873
[5]SummaryInformation            873
WordDocument                     870
[1]CompObj                       870
1Table                           870
Data                             320
Welcome.html                       4
2001.gif                           4
Current User                       2
PowerPoint Document                2
Pictures                           2
stwek012.html                      2
st970807.html                      2
spsn903.txt                        2
st970807.txt                       2
st970811.html                      2
st970811.txt                       2
2000.gif                           2
1999.gif                           2
1998.gif                           2
spsn903.wp                         2
stgt139.doc                        2
080296.txt                         2
spjhq706.html                      2
stmkp102.txt                       2
Name: name, dtype: int64