#!/usr/bin/env python
# coding: utf-8

# # NLP for Born Digital Archives
# ### SpaCy Language Processing
# * Contributors: Emily Higgs, Gregory Jansen, Richard Marciano, & Dev Pradhan
# * Source Available: https://github.com/cases-umd/bdarchives-nlp
# * License: [Creative Commons - Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)
# * [Lesson Plan for Instructors](./lesson-plan.ipynb) (TBA)
#
# ## Introduction
# These notebooks showcase a workflow for performing natural language processing (NLP) on the various text
# formats found in a born-digital archive. Many of these techniques can also be applied to OCR text from
# analog sources.
#
# ### Further Information
# Learn more about this project at the [website](http://researchwebsite.example.org).
#
# ## Objectives
# These notebooks will familiarize the reader with standard NLP practices for extracting topics and other
# data from text records.
#
# ## Learning Goals
# * Computational Practices:
#   * [Collecting Data](#collecting_data)
#   * [Modeling Data](#modeling_data)
# * Archival Practices:
#   * ??
#
# # Software and Tools
# These notebooks are written in Python 3 and use an NLP library called SpaCy to process text.
#
# * [SpaCy](https://spacy.io/): Industrial-Strength Natural Language Processing (in Python)
#
# In order to use SpaCy in this Jupyter environment, you must first download a language model for the
# language you are processing; these examples use the small English model, `en_core_web_sm`. Execute the
# code block below and wait for the hourglass icon in your browser tab to disappear, which indicates the
# download has finished.

# In[1]:

import sys

# Download the small English language model for SpaCy.
get_ipython().system('{sys.executable} -m spacy download en_core_web_sm')


# # Acquiring Data
# These example notebooks are configured to process the document files in the `data` directory. However,
# you are free to supply your own data in any of the following formats:
# * Plain text documents
# * Word documents (.doc or .docx)
# * PDF documents
#
# You may upload your files to a new folder and then point the notebooks at that folder instead. You may
# wish to run the example code on the example data before taking this step.

# # Notebooks
# 1. [Data Overview and Exploration](notebook-1.ipynb)
# 1. [Data Cleaning and Preparation](notebook-2.ipynb)
# 1. [Computation and Transformation](notebook-3.ipynb)
# 1. [Visualization and Conclusion](notebook-4.ipynb)
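# Before opening the notebooks, you can optionally confirm that the model downloaded correctly. The cell
# below is a minimal sketch: it assumes the `en_core_web_sm` model installed above, loads it, and prints
# the tokens and named entities SpaCy finds in a short sample sentence (the sentence itself is only an
# illustration, not part of the example data).

# In[ ]:

import spacy

# Load the small English model downloaded earlier in this notebook.
nlp = spacy.load("en_core_web_sm")

# Process a short sample sentence.
doc = nlp("The archives at the University of Maryland were processed in 2019.")

# Print each token with its part-of-speech tag.
for token in doc:
    print(token.text, token.pos_)

# Print the named entities recognized in the text.
for ent in doc.ents:
    print(ent.text, ent.label_)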
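# You can also check which files a folder contains before pointing the notebooks at it. The cell below is
# a small sketch, not the exact filter used by the later notebooks: it assumes the default `data` directory
# and an illustrative list of the extensions mentioned above; change the `folder` variable if you upload
# your files elsewhere.

# In[ ]:

from pathlib import Path

# Folder containing the documents to process; replace with your own folder if you uploaded one.
folder = Path("data")

# File types discussed above: plain text, Word, and PDF documents.
extensions = {".txt", ".doc", ".docx", ".pdf"}

# Print every matching file, searching subfolders as well.
for path in sorted(folder.rglob("*")):
    if path.suffix.lower() in extensions:
        print(path)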