Although the search process the team used to retrieve the Tweet ID list was rigorous, some materials in the list may be unrelated. It is possible that some posts match a hashtag but discuss subjects unrelated to the historical events of the #RickyRenuncia movement of summer 2019.
This notebook presents a minimal IPython graphical user interface (GUI) where participants, researchers and members of the original team could interact with content and classify it.
Twitter API Credentials
The user will need to have created the twitter_secrets.py file based on twitter_secrets_example.py and set the variables to their own API credentials. See the Twitter API Credentials section.
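For reference, the file can be as small as a single assignment. A minimal sketch; the token value is a placeholder, not a real credential:
# twitter_secrets.py -- minimal sketch; replace the placeholder with your token
C_BEARER_TOKEN = "<your-app-bearer-token>"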
OPTIONAL
Google API Credentials
A google.oauth2.service_account.Credentials object is required to interact with the Google Translate API and automatically display translations of text. This should help non-Spanish speakers interact with content in Spanish.
The user will need to have created/edited the google_translate_keys.json file following the Google API Credentials section. This is optional, but it offers the user automatic translation of tweet text into English (or another language).
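For reference, loading such a key file directly looks roughly like the sketch below; the notebook's prepare_google_credentials helper presumably wraps similar logic, so treat this as an assumption rather than its actual implementation.
from google.oauth2 import service_account

# Sketch of loading a service-account key file directly (an assumption,
# not the actual implementation of prepare_google_credentials).
def load_translate_credentials(path="./google_translate_keys.json"):
    try:
        return service_account.Credentials.from_service_account_file(path)
    except FileNotFoundError:
        return None  # no credentials: translations are simply unavailable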
The cell below imports the libraries used throughout the notebook: ipywidgets and IPython.display power the GUI, json, os, pickle and random cover serialization and sampling, and the tweet_requester package implements the classifier itself.
import ipywidgets as widgets
from IPython.core.display import display, HTML, update_display
import json, os, pickle
from random import seed, randint
# from tweet_requester.analysis import TweetJLAnalyzer, TweetAnalyzer
from tweet_requester.display import TweetInteractiveClassifier, \
JsonLInteractiveClassifier, TSess, prepare_google_credentials, PROCESSING_STAGES
from twitter_secrets import C_BEARER_TOKEN
# from twitter_secrets_testing import C_BEARER_TOKEN
JL_DATA="./tweetsRickyRenuncia-final.jsonl"
CACHE_DIR="tweet_cache"
SQLite_DB="tweets.db"
A [dataset](https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt) of the tweets related to the investigation is publicly available online. This list will be used as the basis for the research.
import requests
from os.path import isfile, isdir
tweet_list_url = "https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt"
# Download dataset if not present
if not isfile(JL_DATA):
response = requests.get(tweet_list_url)
with open(JL_DATA, 'wb') as handler:
handler.write(response.content)
print(f"Data downloaded at {JL_DATA}.")
else:
print(f"Data available at {JL_DATA}.")
Data available at ./tweetsRickyRenuncia-final.jsonl.
The code below uses a combination of Python logic and terminal commands to download the compressed database and cache, extracting them only if needed. Terminal commands can be identified by the ! symbol at the beginning of the line. The commands wget, tar and gzip are Bash commands available outside of Python.
A version of the database and cache is shared publicly through the #RickyRenuncia Project Scalar Book. The database and cache should make the experience faster, since previously retrieved content does not need to be requested again from the Twitter API.
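If wget, gzip or tar are unavailable in your environment, the same steps can be done with the Python standard library. A minimal sketch; the helper names are ours, not part of the notebook:
import gzip, shutil, tarfile
import requests

# Pure-Python equivalents of the shell commands used below
# (illustrative helpers, not from the original notebook).
def download(url, dest):
    with requests.get(url, stream=True) as response:
        with open(dest, "wb") as handler:
            shutil.copyfileobj(response.raw, handler)

def gunzip(src, dest):
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)

def untar(src, path="."):
    with tarfile.open(src) as tar:
        tar.extractall(path=path)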
from os.path import isfile, isdir
tmp_file = "tweet_cache.tar.gz"
tweets_db_url = "https://libarchivist.com/rrp/rickyrenuncia//tweets.db.gz"
tweet_cache_url = "https://libarchivist.com/rrp/rickyrenuncia//tweet_cache.tar.gz"
# Download database if not present
if not isfile("tweets.db.gz"):
!wget "$tweets_db_url"
else:
print("Database already available")
# Extract database if not present
if not isfile("tweets.db"):
!gzip -kd "tweets.db.gz"
else:
print("Database already extracted.")
# Download dataset if not present
if not isfile(tmp_file):
!wget "$tweet_cache_url"
else:
print("Compressed Cache already available!")
# Extract the cache if not present
if not isdir(CACHE_DIR):
!tar -xf $tmp_file
print("Cache extracted!")
else:
print("Cache already extracted!")
Database already available
Database already extracted.
Compressed Cache already available!
Cache already extracted!
The TSess object stores configuration and controls the connection used to retrieve content from the Twitter API. It is this object that requires your Twitter credentials to create a connection.
Twitter API Credentials are required to create the session.
tweet_session = TSess(
C_BEARER_TOKEN,
compression_level=5,
sleep_time=3, # Minimal sleep between requests to avoid hitting rate limits
cache_dir=CACHE_DIR,
hash_split=True
)
The session even includes rate limiting for requests. For bearer-token app authentication the limit is 300 tweet lookups every 15 minutes (900 seconds), in other words 3 seconds per tweet. Read more at "Rate limits | Docs | Twitter Developer Platform".
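The sleep_time=3 argument above matches that budget. Conceptually the throttling resembles the sketch below, which is an illustration and not TSess's actual code:
import time

MIN_SLEEP = 900 / 300  # 3 seconds per lookup under the app-auth limit

def rate_limited_lookups(lookup, tweet_ids):
    # Sleep-based throttling: stay under 300 requests per 15 minutes.
    results = []
    for tweet_id in tweet_ids:
        results.append(lookup(tweet_id))
        time.sleep(MIN_SLEEP)
    return results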
After following the optional instructions from the Google API Credentials section, run the code below. If the user did not acquire any credentials, the code will default to no credentials.
google_credentials = prepare_google_credentials(
    credentials_file="./google_translate_keys.json"
)
A JsonLInteractiveClassifier object handles interactions with a local SQLite database, with the Twitter API (through a TSess session), and with a GUI for early data curation.
In terms of the data integration process, capturing the data from the API falls under Extract; the GUI for additional metadata and the SQLite database fall under Transform, as information is stored in a different format for easier analysis; any methods used to visualize the data or access attributes with less effort fall under Load.
In that sense, the JsonLInteractiveClassifier handles multiple stages of the ETL.
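A rough mapping of the notebook's components onto those stages; the assignments below simply paraphrase the text above:
# Illustrative ETL mapping (a paraphrase, not part of the original notebook)
etl_roles = {
    "Extract":   "TSess retrieves tweet JSON from the Twitter API",
    "Transform": "the GUI records curation metadata into the SQLite database",
    "Load":      "display methods and SQL queries expose the results",
}
for stage, role in etl_roles.items():
    print(f"{stage:>9}: {role}")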
classifier = JsonLInteractiveClassifier(
tweet_ids_file=JL_DATA,
session=tweet_session, mute=True,
google_credentials=google_credentials,
sqlite_db=SQLite_DB, pre_initialized=True)
# On a first run without the pre-built database, presumably pass
# pre_initialized=False so the SQLite database is built from scratch:
# classifier = JsonLInteractiveClassifier(
# tweet_ids_file=JL_DATA,
# session=tweet_session, mute=True,
# google_credentials=google_credentials,
# sqlite_db=SQLite_DB, pre_initialized=False)
The JsonLInteractiveClassifier object includes a built-in SQLite3 database connection in the attribute db. It is possible to directly access the database after a connection is made.
The code below displays the tables and columns of the relational database.
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
# SQL command: list all tables, then show each table's columns
cursor.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
tweet_schema = cursor.fetchall()
for table in tweet_schema:
print("Table:",table[0])
# Get Column Names
cursor.execute(f"""SELECT * FROM {table[0]} LIMIT 1""")
column_names = list(map(lambda x: x[0],cursor.description))
print("Columns:", column_names, "\n\n")
# Close the cursor
cursor.close()
The database is composed of 3 tables: tweet, tweet_detail, and tweet_traduction.
Table tweet includes only 2 columns: tweet_id, a string holding the unique tweet ID, and state, an integer representing the processing stage of the tweet.
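A plausible definition for the tweet table, inferred from those two columns rather than taken verbatim from the source:
CREATE TABLE tweet (
    tweet_id TEXT,
    state INTEGER,
    PRIMARY KEY("tweet_id"))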
Table tweet_detail includes 6 columns. Below you can see the SQL command used to create the table.
CREATE TABLE tweet_detail (
tweet_id TEXT,
has_media INTEGER,
description TEXT,
is_meme INTEGER,
language TEXT,
has_slang INTEGER,
PRIMARY KEY("tweet_id"))
tweet_traduction includes 3 columns: tweet_id, target_language_code and traduction. This table works as a cache for storing Google translations of a tweet's text in one or more languages.
It is possible to directly interact with the built-in database by accessing the SQLite3 database connection. Below, the user interacts with the tweet table to get totals per processing stage using a GROUP BY SQL command.
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
state_name = PROCESSING_STAGES(state).name
print("{:>20} : {:<12}".format(state_name, count))
Although it would be possible to store the state of the tweets as text fields in the database, that would increase storage use and reduce efficiency. For that reason our team used an enumerator class called PROCESSING_STAGES to identify each stage with an integer value and a name.
The code below prints a table of the values and names used to represent each stage. In the database they are stored as numeric values.
print("{:>5} | {:20} | {:}".format("VALUE", "NAME", "OBJECT"))
for stage in PROCESSING_STAGES:
print("{:5} | {:20} | {:}".format(stage.value, stage.name, stage))
The display_accepted method displays accepted tweets in pages. Pages can hold an arbitrary number of tweets, and a page can be selected by changing the page parameter.
Try it! Change the page number. We recommend 5, 12 and 22.
# Change the page number to see different tweets from the database
PAGE=3
classifier.display_accepted(page=PAGE, per_page=2)
Now it's your turn!
Using the method StartEvaluations you can process some of the unprocessed tweets. This displays a tweet embedding on screen and asks some questions whose answers are stored as metadata on the tweet.
By default StartEvaluations only processes tweets from the PREPROCESSED stage. Using the preprocess_batch method, some tweets can be preprocessed before the evaluation begins.
classifier.preprocess_batch(n=20)
Preprocessed 2006
classifier.StartEvaluations()
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
state_name = PROCESSING_STAGES(state).name
print("{:>20} : {:<12}".format(state_name, count))
How many tweets are still unprocessed? Compare the state totals above with the ones printed before the evaluations.
Using the example above, run 2 different queries to visualize:
- the total of tweets with slang among the processed, and
- the total of tweets with multimedia among the processed and preprocessed.
Observe the column names of the tables tweet and tweet_detail, shown earlier by the code that iterates through all tables and prints their columns. Consider:
- The column state is an integer and matches the PROCESSING_STAGES values seen before.
- The columns has_media and has_slang are integers but can be treated as booleans.
Replace any values between <VALUE-DESCRIPTION> to make the commands work:
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
###### SQL Command 1
# Total of tweets with slang among the processed.
# cursor.execute("""
# SELECT <ENTER-GROUPING-COLUMN>, count(*)
# FROM <TABLE-NAME> a
# INNER JOIN <DETAIL-TABLE-NAME> b
# ON a.tweet_id=b.tweet_id
# WHERE a.state=<STATE-VALUE-FOR-FINALIZED>
# GROUP BY <ENTER-GROUPING-COLUMN>;
# """)
cursor.execute("""
SELECT has_slang, count(*)
FROM tweet a
INNER JOIN tweet_detail b
ON a.tweet_id=b.tweet_id
WHERE a.state in (2,6)
GROUP BY has_slang;
""")
slang_rows = cursor.fetchall()
# Print Results to screen
print("{:>15} : {:<12}".format("Slang?", "TOTAL"))
for slang, count in slang_rows:
if slang:
state="Has Slang"
else:
state="No Slang"
print("{:>15} : {:<12}".format(state, count))
# SQL Command 2
# Total of tweets with multimedia among the processed and preprocessed.
# cursor.execute("""
# SELECT
# <ENTER-GROUPING-COLUMN>,
# count(*)
# FROM
# <TABLE-NAME> a
# INNER JOIN
# <DETAIL-TABLE-NAME> b
# ON
# a.tweet_id=b.tweet_id
# WHERE
# a.state in (<STATE-VALUE-FOR-FINALIZED>, <STATE-VALUE-FOR-PREPROCESSED>)
# GROUP BY
# <ENTER-GROUPING-COLUMN>
# ORDER BY
# <ENTER-GROUPING-COLUMN> DESC;
# """)
cursor.execute("""
SELECT
has_media,
count(*)
FROM
tweet a
INNER JOIN
tweet_detail b
ON
a.tweet_id=b.tweet_id
WHERE
a.state in (2,6)
GROUP BY
has_media
ORDER BY
has_media DESC;
""")
multimedia_rows = cursor.fetchall()
# Print Results to screen
print("{:>12} : {:<12}".format("HAS MEDIA?", "TOTAL"))
for has_media, count in multimedia_rows:
if has_media:
state="Yes"
else:
state="No"
print("{:>12} : {:<12}".format(state, count))
Inspired by the commands above, write a command that lets you visualize the languages of the tweets.
To complete this exercise you will need to answer these questions: Which table contains the language column? Which column should you group by?
Remember to output the results.
# SQL Command 3
# Total of tweets by language
cursor.execute("""
SELECT <ENTER-GROUPING-COLUMN>, count(*)
FROM <DETAIL-TABLE-NAME>
GROUP BY <ENTER-GROUPING-COLUMN>;
""")
language_rows = cursor.fetchall()
# Print Results to screen
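One possible solution, assuming the language column lives in the tweet_detail table shown earlier:
cursor.execute("""
SELECT language, count(*)
FROM tweet_detail
GROUP BY language;
""")
language_rows = cursor.fetchall()
print("{:>10} : {:<12}".format("LANGUAGE", "TOTAL"))
for language, count in language_rows:
    print("{:>10} : {:<12}".format(str(language), count))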
print(type(classifier))
The classifier is of type JsonLInteractiveClassifier. Two of its attributes are original_filename, the file containing the Tweet ID list, and sqlite_filename, the file holding the SQLite database.
print(classifier.original_filename)
print(classifier.sqlite_filename)
# Close the cursor
cursor.close()
Proceed to Media Rating to continue the experience and learn how to use pandas to interact with an SQLite database.