#!/usr/bin/env python
# coding: utf-8

# # NLP for Born Digital Archives
# ### SpaCy Language Processing
# * Contributors: Emily Higgs, Gregory Jansen, Richard Marciano, & Dev Pradhan
# * Source Available: https://github.com/cases-umd/bdarchives-nlp
# * License: [Creative Commons - Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)
# * [Lesson Plan for Instructors](./lesson-plan.ipynb) (TBA)
#
# ## Introduction
# These notebooks showcase a workflow for performing natural language processing (NLP) on the various text
# formats found in a born-digital archive. Many of these techniques can also be applied to OCR text from
# analog sources.
#
# ### Further Information
# Learn more about this project at the [website](http://researchwebsite.example.org).
#
# ## Objectives
# These notebooks will familiarize the reader with standard NLP practices for extracting topics and other
# data from text records.
#
# ## Learning Goals
# * Computational Practices:
#   * [Collecting Data](#collecting_data)
#   * [Modeling Data](#modeling_data)
# * Archival Practices:
#   * ??
#
# # Software and Tools
# These notebooks are written in Python 3 and use an NLP library called SpaCy to process text.
#
# * [SpaCy](https://spacy.io/): Industrial-Strength Natural Language Processing (in Python)
#
# In order to use SpaCy in this Jupyter environment, you must first download a language model for the
# language you are processing; these examples use the small English model, `en_core_web_sm`. Execute the
# code block below and wait for the hourglass icon in your browser tab to disappear, which indicates the
# download has finished.

# In[1]:

import sys

# Download the small English language model for SpaCy.
get_ipython().system('{sys.executable} -m spacy download en_core_web_sm')


# # Acquiring Data
# These example notebooks are configured to process the document files in the `data` directory. However,
# you are free to supply your own data in any of the following formats:
# * Plain text documents
# * Word documents (.doc or .docx)
# * PDF documents
#
# You may upload your files to a new folder and then point the notebooks at that folder instead. You may
# wish to run the example code on the example data before taking this step.

# # Notebooks
# 1. [Data Overview and Exploration](notebook-1.ipynb)
# 1. [Data Cleaning and Preparation](notebook-2.ipynb)
# 1. [Computation and Transformation](notebook-3.ipynb)
# 1. [Visualization and Conclusion](notebook-4.ipynb)
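# Before opening the notebooks, you can optionally confirm that the model downloaded correctly. The cell
# below is a minimal sketch: it assumes the `en_core_web_sm` model installed above, loads it, and prints
# the tokens and named entities SpaCy finds in a short sample sentence (the sentence itself is only an
# illustration, not part of the example data).

# In[ ]:

import spacy

# Load the small English model downloaded earlier in this notebook.
nlp = spacy.load("en_core_web_sm")

# Process a short sample sentence.
doc = nlp("The archives at the University of Maryland were processed in 2019.")

# Print each token with its part-of-speech tag.
for token in doc:
    print(token.text, token.pos_)

# Print the named entities recognized in the text.
for ent in doc.ents:
    print(ent.text, ent.label_)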
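# You can also check which files a folder contains before pointing the notebooks at it. The cell below is
# a small sketch, not the exact filter used by the later notebooks: it assumes the default `data` directory
# and an illustrative list of the extensions mentioned above; change the `folder` variable if you upload
# your files elsewhere.

# In[ ]:

from pathlib import Path

# Folder containing the documents to process; replace with your own folder if you uploaded one.
folder = Path("data")

# File types discussed above: plain text, Word, and PDF documents.
extensions = {".txt", ".doc", ".docx", ".pdf"}

# Print every matching file, searching subfolders as well.
for path in sorted(folder.rglob("*")):
    if path.suffix.lower() in extensions:
        print(path)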