#!/usr/bin/env python
# coding: utf-8

# # Text Processing of Archival Data Sources
# ### Using spaCy NLP and Regular Expressions
# * Contributors: Emily Higgs, Gregory Jansen, Richard Marciano, & Dev Pradhan
# * Source Available: https://github.com/cases-umd/bdarchives-nlp
# * License: [Creative Commons - Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)
# * [Lesson Plan for Instructors](./lesson-plan.ipynb) (TBA)
#
# ## Introduction
# These notebooks showcase a workflow for text processing of a "data sheet" from
# archival source material.
#
# Describe what the image shows
#
# ### Further Information
# Learn more about this project at the [website](http://researchwebsite.example.org)
#
# ## Objectives
# These notebooks will familiarize the reader with standard NLP practices for extracting topics and
# other data from text records.
#
# ## Learning Goals
# * Computational Practices:
#     * [Collecting Data](#collecting_data)
#     * [Modeling Data](#modeling_Data)
# * Archival Practices:
#     * ??

# # Software and Tools
# These notebooks are written in Python 3 and use an NLP module called spaCy to process text.
#
# * [spaCy](https://spacy.io/): Industrial Strength Natural Language Processing (in Python)
#
# To use spaCy in this Jupyter environment, you must first download a language model.
# This is a model of the language that you are processing. We use a small English model
# (`en_core_web_sm`) for these examples. Execute the code block below and wait until the hourglass
# icon in your browser tab disappears, which indicates that the download has finished.

# In[1]:


import sys

# Download the small English model with the same Python interpreter that runs this kernel.
get_ipython().system('{sys.executable} -m spacy download en_core_web_sm')


# # Acquiring Data
# These example notebooks are configured to process plain text files in the `data` directory. However, you are free
# to supply your own data: upload your files to a new folder, then point the notebooks at the new folder instead.
# You may wish to run the example code on the example data before taking this step.
#
# # Notebooks
# 1. [TODO Data Overview and Exploration](data_overview.ipynb)
# 1. [spaCy Text Processing for Named Entities](nlp_ner.ipynb)
# 1. [Text Parsing with Regular Expressions](regex.ipynb)
# 1. [TODO Visualization and Conclusion](notebook-4.ipynb)

# In[ ]:
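

# The "Acquiring Data" step can be sketched in code. This is a minimal illustration, not the
# notebooks' actual loading code: the helper name `list_text_files` and the `*.txt` pattern are
# assumptions, and only the `data` directory name comes from the text above.

```python
from pathlib import Path


def list_text_files(data_dir="data", pattern="*.txt"):
    """Return the plain text files in data_dir, sorted by path.

    Hypothetical helper: its name and the *.txt pattern are assumptions
    made for illustration, not part of the original notebooks.
    """
    folder = Path(data_dir)
    if not folder.is_dir():
        # A missing folder yields an empty list rather than an error.
        return []
    return sorted(p for p in folder.glob(pattern) if p.is_file())


# Change "data" to the name of your own folder to process your own files.
for path in list_text_files("data"):
    # Undecodable bytes are replaced so that one bad file does not stop the loop.
    text = path.read_text(encoding="utf-8", errors="replace")
    print(f"{path.name}: {len(text)} characters")
```

# To try your own files, pass the new folder name to the helper, for example
# `list_text_files("my_data")`, where `my_data` is a placeholder name.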