#!/usr/bin/env python
# coding: utf-8

# # Text Processing of Archival Data Sources
# ### Using SpaCy NLP and Regular Expressions
# * Contributors: Emily Higgs, Gregory Jansen, Richard Marciano, & Dev Pradhan
# * Source Available: https://github.com/cases-umd/bdarchives-nlp
# * License: [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)
# * [Lesson Plan for Instructors](./lesson-plan.ipynb) (TBA)
# 
# ## Introduction
# These notebooks showcase a workflow for performing text processing of a "data sheet" from
# archival source material.
# 
# <img src="project_image.png" alt="Describe what the image shows" title="Image title" height="200" width="200">
# 
# ### Further Information
# Learn more about this project at the [website](http://researchwebsite.example.org)
# 
# ## Objectives
# These notebooks will familiarize the reader with standard NLP practices for extracting topics and
# other data from text records.
# 
# ## Learning Goals
# * Computational Practices:
#  * [Collecting Data](#collecting_data)
#  * [Modeling Data](#modeling_Data)
# * Archival Practices:
#  * ??

# # Software and Tools
# These notebooks are written in Python 3 and use an NLP module called SpaCy to process text.
# 
# * [SpaCy](https://spacy.io/): Industrial Strength Natural Language Processing (in Python)
# 
# In order to use SpaCy in this Jupyter environment, you must first download the language model.
# This is a model of the language that you are processing. We use a small English model
# (`en_core_web_sm`) for these examples. Execute the code block below and wait until the hourglass
# shown in your browser tab disappears, which indicates that the download has finished.

# In[1]:


import sys
# Download the small English model using the same Python interpreter that runs this kernel.
get_ipython().system('{sys.executable} -m spacy download en_core_web_sm')
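
# Once the download has finished, a quick check like the one below can confirm that the model
# loads. This is a minimal sketch and not part of the original workflow: the helper name
# `model_is_installed` and the sample sentence are hypothetical, chosen only for illustration.
# It assumes the downloaded model is installed as a regular Python package named `en_core_web_sm`.

# In[ ]:


import importlib.util


def model_is_installed(name="en_core_web_sm"):
    """Return True if a package called `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if model_is_installed():
    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Illustrative sentence only; it is not taken from the archival data.
    doc = nlp("The archivist mailed the records from Baltimore to Chicago in 1942.")
    # Print each named entity together with the label the model assigned to it.
    for ent in doc.ents:
        print(ent.text, ent.label_)
else:
    print("en_core_web_sm was not found; re-run the download cell above.")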


# # Acquiring Data
# These example notebooks are configured to process plain text files in the `data` directory. However, you are free
# to supply your own interesting data. You may upload your files to a new folder and then point the notebooks at the new folder instead.
# You may wish to run the example code on the example data before taking this step.
# 
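
# As a rough sketch of how a notebook might be pointed at a folder, the cell below lists the
# plain text files it finds there. The helper name `list_text_files`, the `DATA_DIR` variable,
# and the `.txt` extension are assumptions made for illustration; adjust them to match your files.

# In[ ]:


from pathlib import Path


def list_text_files(folder="data"):
    """Return the .txt files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        # A missing folder yields an empty list instead of raising an error.
        return []
    return sorted(p for p in root.glob("*.txt") if p.is_file())


DATA_DIR = "data"  # hypothetical setting; change it to the folder holding your own files
for path in list_text_files(DATA_DIR):
    print(path.name, path.stat().st_size, "bytes")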

# # Notebooks
# 1. [TODO Data Overview and Exploration](data_overview.ipynb)
# 1. [SpaCy Text Processing for Named Entities](nlp_ner.ipynb)
# 1. [Text Parsing with Regular Expressions](regex.ipynb)
# 1. [TODO Visualization and Conclusion](notebook-4.ipynb)

# In[ ]: