These notebooks showcase a workflow for natural language processing (NLP) of text in various formats from a born-digital archive. Some of these techniques may also apply to OCR text from analog sources.
Learn more about this project at the website
These notebooks will familiarize the reader with standard NLP practices for extracting topics and other data from text records.
The notebooks are written in Python 3 and use an NLP library called spaCy to process text.
To use spaCy in this Jupyter environment, you must first download and install a language model, which is a model of the language you are processing. These examples use a small English model. Execute the code block below and wait for the hourglass in your browser tab to disappear, which indicates that the task is done.
import sys
!{sys.executable} -m spacy download en_core_web_sm
Collecting en_core_web_sm==2.2.5
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
|████████████████████████████████| 12.0 MB 2.0 MB/s eta 0:00:01
Requirement already satisfied: spacy>=2.2.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from en_core_web_sm==2.2.5) (2.2.4)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (4.44.1)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.4.1)
Requirement already satisfied: numpy>=1.15.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.18.2)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.1.3)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.0)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.23.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.0.3)
Requirement already satisfied: setuptools in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (46.1.3)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.6.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)
Requirement already satisfied: thinc==7.4.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (7.4.0)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.6.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.25.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2019.11.28)
Requirement already satisfied: zipp>=0.5 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.1.0)
Building wheels for collected packages: en-core-web-sm
Building wheel for en-core-web-sm (setup.py) ... done
Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-py3-none-any.whl size=12011738 sha256=71b0462de38b85f8316cda10a461572d1db5ea4889a6e3e37429e3b73fa0ff6e
Stored in directory: /tmp/pip-ephem-wheel-cache-l_vwk4ot/wheels/b5/94/56/596daa677d7e91038cbddfcf32b591d0c915a1b3a3e3d3c79d
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
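If you return to this notebook later, you may want to check whether the model is already present before downloading it again. The following is a minimal sketch, assuming the model was installed as an ordinary Python package (which the pip output above suggests). The helper name `model_is_installed` is hypothetical and is not part of spaCy.

```python
import importlib.util


def model_is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when no importable package has that name.
    return importlib.util.find_spec(package_name) is not None


# "en_core_web_sm" is the model name used in the download cell above.
if model_is_installed("en_core_web_sm"):
    print("Model found; the download step can be skipped.")
else:
    print("Model not found; run the download cell above.")
```

If the check reports that the model is found, `spacy.load('en_core_web_sm')` should work as the installer message indicates.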
These example notebooks are configured to process document files in the data directory. However, you are free to supply your own interesting data in any of the following formats:
You may upload your files to a new folder and then point the notebooks at the new folder instead. You may wish to run the example code on the example data before taking this step.
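Before pointing a notebook at a new folder, it can help to confirm which files the folder actually contains. The sketch below is an illustration only: `list_documents` and the folder name `my_documents` are hypothetical, and how each notebook locates its data directory is not shown here. It uses a temporary folder so that it runs without touching your own files.

```python
import tempfile
from pathlib import Path


def list_documents(folder):
    """Return all regular files under `folder` (recursively), sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"No such folder: {root}")
    return sorted(p for p in root.rglob("*") if p.is_file())


# A throwaway folder stands in for the folder you uploaded your files to.
with tempfile.TemporaryDirectory() as tmp:
    my_folder = Path(tmp) / "my_documents"  # hypothetical folder name
    my_folder.mkdir()
    (my_folder / "letter.txt").write_text("Dear colleague, ...", encoding="utf-8")
    (my_folder / "memo.txt").write_text("Meeting moved to Friday.", encoding="utf-8")

    names = [p.name for p in list_documents(my_folder)]
    print(names)
```

To inspect your own upload, call `list_documents` with the path of your folder in place of the temporary one.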