Although the search process the team used to retrieve the Tweet ID list was rigorous, some materials in the list may be unrelated. It is possible that some posts match a hashtag but discuss subjects unrelated to the historical events of the #RickyRenuncia movement of summer 2019.
This notebook presents a minimal IPython graphical user interface (GUI) where participants, researchers and members of the original team could interact with content and classify it.
Twitter API Credentials
The user will need to have created the twitter_secrets.py file based on twitter_secrets_example.py and set the variables to their own API credentials. See the Twitter API Credentials section.
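For reference, the file can be as small as a single assignment. A minimal sketch; the token value is a placeholder, not a real credential:
# twitter_secrets.py -- minimal sketch; replace the placeholder with your token
C_BEARER_TOKEN = "<your-app-bearer-token>"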
OPTIONAL
Google API Credentials
A google.oauth2.service_account.Credentials object is required to interact with the Google Translate API and automatically display translations of text. This should help non-Spanish speakers interact with content in Spanish.
The user will need to have created/edited the google_translate_keys.json file following the Google API Credentials section. This is optional, but it offers the user automatic translation of tweet text into English (or another language).
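For reference, loading such a key file directly looks roughly like the sketch below; the notebook's prepare_google_credentials helper presumably wraps similar logic, so treat this as an assumption rather than its actual implementation.
from google.oauth2 import service_account

# Sketch of loading a service-account key file directly (an assumption,
# not the actual implementation of prepare_google_credentials).
def load_translate_credentials(path="./google_translate_keys.json"):
    try:
        return service_account.Credentials.from_service_account_file(path)
    except FileNotFoundError:
        return None  # no credentials: translations are simply unavailable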
The cell below imports the libraries used throughout the notebook: ipywidgets and IPython.display power the GUI, json, os, pickle and random cover serialization and sampling, and the tweet_requester package implements the classifier itself.
import ipywidgets as widgets
from IPython.core.display import display, HTML, update_display
import json, os, pickle
from random import seed, randint
# from tweet_requester.analysis import TweetJLAnalyzer, TweetAnalyzer
from tweet_requester.display import TweetInteractiveClassifier, \
JsonLInteractiveClassifier, TSess, prepare_google_credentials, PROCESSING_STAGES
from twitter_secrets import C_BEARER_TOKEN
# from twitter_secrets_testing import C_BEARER_TOKEN
JL_DATA="./tweetsRickyRenuncia-final.jsonl"
CACHE_DIR="tweet_cache"
SQLite_DB="tweets.db"
A [dataset](https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt) of the tweets related to the investigation is publicly available online. This list will be used as the basis for the research.
import requests
from os.path import isfile, isdir
tweet_list_url = "https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt"
# Download dataset if not present
if not isfile(JL_DATA):
response = requests.get(tweet_list_url)
with open(JL_DATA, 'wb') as handler:
handler.write(response.content)
print(f"Data downloaded at {JL_DATA}.")
else:
print(f"Data available at {JL_DATA}.")
Data available at ./tweetsRickyRenuncia-final.jsonl.
The code below uses a combination of Python logic and terminal commands to download the compressed database and cache, extracting them only if needed. Terminal commands can be identified by the ! symbol at the beginning of the line. The commands wget, tar and gzip are Bash commands available outside of Python.
A version of the database and cache is shared publicly through the #RickyRenuncia Project Scalar Book. The database and cache should make the experience faster, since previously retrieved content does not need to be requested again from the Twitter API.
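If wget, gzip or tar are unavailable in your environment, the same steps can be done with the Python standard library. A minimal sketch; the helper names are ours, not part of the notebook:
import gzip, shutil, tarfile
import requests

# Pure-Python equivalents of the shell commands used below
# (illustrative helpers, not from the original notebook).
def download(url, dest):
    with requests.get(url, stream=True) as response:
        with open(dest, "wb") as handler:
            shutil.copyfileobj(response.raw, handler)

def gunzip(src, dest):
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)

def untar(src, path="."):
    with tarfile.open(src) as tar:
        tar.extractall(path=path)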
from os.path import isfile, isdir
tmp_file = "tweet_cache.tar.gz"
tweets_db_url = "https://libarchivist.com/rrp/rickyrenuncia//tweets.db.gz"
tweet_cache_url = "https://libarchivist.com/rrp/rickyrenuncia//tweet_cache.tar.gz"
# Download database if not present
if not isfile("tweets.db.gz"):
!wget "$tweets_db_url"
else:
print("Database already available")
# Extract database if not present
if not isfile("tweets.db"):
!gzip -kd "tweets.db.gz"
else:
print("Database already extracted.")
# Download dataset if not present
if not isfile(tmp_file):
!wget "$tweet_cache_url"
else:
print("Compressed Cache already available!")
# Extract the cache if not present
if not isdir(CACHE_DIR):
!tar -xf $tmp_file
print("Cache extracted!")
else:
print("Cache already extracted!")
Database already available
Database already extracted.
Compressed Cache already available!
Cache already extracted!
The TSess object stores configuration and controls the connection used to retrieve content from the Twitter API. It is this object that requires your Twitter credentials to create a connection.
Twitter API Credentials are required to create the session.
tweet_session = TSess(
C_BEARER_TOKEN,
compression_level=5,
sleep_time=3, # Minimal sleep between requests to avoid hitting rate limits
cache_dir=CACHE_DIR,
hash_split=True
)
The session even includes rate limiting for requests. For bearer-token app authentication the limit is 300 tweet lookups every 15 minutes (900 seconds), in other words 3 seconds per tweet. Read more at "Rate limits | Docs | Twitter Developer Platform".
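The sleep_time=3 argument above matches that budget. Conceptually the throttling resembles the sketch below, which is an illustration and not TSess's actual code:
import time

MIN_SLEEP = 900 / 300  # 3 seconds per lookup under the app-auth limit

def rate_limited_lookups(lookup, tweet_ids):
    # Sleep-based throttling: stay under 300 requests per 15 minutes.
    results = []
    for tweet_id in tweet_ids:
        results.append(lookup(tweet_id))
        time.sleep(MIN_SLEEP)
    return results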
After following the optional instructions from the Google API Credentials section, run the code below. If the user did not acquire any credentials, the code will default to no credentials.
google_credentials = prepare_google_credentials(
    credentials_file="./google_translate_keys.json"
)
A JsonLInteractiveClassifier object handles interactions with a local SQLite database, with the Twitter API (through a TSess session), and with a GUI for early data curation.
In terms of the data integration process, capturing the data from the API falls under Extract; the GUI for additional metadata and the SQLite database fall under Transform, as information is stored in a different format for easier analysis; any methods used to visualize the data or access attributes with less effort fall under Load.
In that sense, the JsonLInteractiveClassifier handles multiple stages of the ETL.
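A rough mapping of the notebook's components onto those stages; the assignments below simply paraphrase the text above:
# Illustrative ETL mapping (a paraphrase, not part of the original notebook)
etl_roles = {
    "Extract":   "TSess retrieves tweet JSON from the Twitter API",
    "Transform": "the GUI records curation metadata into the SQLite database",
    "Load":      "display methods and SQL queries expose the results",
}
for stage, role in etl_roles.items():
    print(f"{stage:>9}: {role}")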
classifier = JsonLInteractiveClassifier(
tweet_ids_file=JL_DATA,
session=tweet_session, mute=True,
google_credentials=google_credentials,
sqlite_db=SQLite_DB, pre_initialized=True)
# On a first run without the pre-built database, presumably pass
# pre_initialized=False so the SQLite database is built from scratch:
# classifier = JsonLInteractiveClassifier(
# tweet_ids_file=JL_DATA,
# session=tweet_session, mute=True,
# google_credentials=google_credentials,
# sqlite_db=SQLite_DB, pre_initialized=False)
The JsonLInteractiveClassifier object includes a built-in SQLite3 database connection in the attribute db. It is possible to directly access the database after a connection is made.
The code below displays the tables and columns of the relational database.
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
# SQL command: list all tables, then show each table's columns
cursor.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
tweet_schema = cursor.fetchall()
for table in tweet_schema:
print("Table:",table[0])
# Get Column Names
cursor.execute(f"""SELECT * FROM {table[0]} LIMIT 1""")
column_names = list(map(lambda x: x[0],cursor.description))
print("Columns:", column_names, "\n\n")
# Close the cursor
cursor.close()
The database is composed of 3 tables: tweet, tweet_detail, and tweet_traduction.
Table tweet includes only 2 columns: tweet_id, a string holding the unique tweet ID, and state, an integer representing the processing stage of the tweet.
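A plausible definition for the tweet table, inferred from those two columns rather than taken verbatim from the source:
CREATE TABLE tweet (
    tweet_id TEXT,
    state INTEGER,
    PRIMARY KEY("tweet_id"))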
Table tweet_detail includes 6 columns. Below you can see the SQL command used to create the table.
CREATE TABLE tweet_detail (
tweet_id TEXT,
has_media INTEGER,
description TEXT,
is_meme INTEGER,
language TEXT,
has_slang INTEGER,
PRIMARY KEY("tweet_id"))
tweet_traduction includes 3 columns: tweet_id, target_language_code and traduction. This table works as a cache for storing Google translations of a tweet's text in one or more languages.
It is possible to directly interact with the built-in database by accessing the SQLite3 database connection. Below, the user interacts with the tweet table to get totals per processing stage using a GROUP BY SQL command.
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
state_name = PROCESSING_STAGES(state).name
print("{:>20} : {:<12}".format(state_name, count))
Although it would be possible to store the state of the tweets as text fields in the database, that would increase storage use and reduce efficiency. For that reason our team used an enumerator class called PROCESSING_STAGES to identify each stage with an integer value and a name.
The code below prints a table of the values and names used to represent each stage. In the database they are stored as numeric values.
print("{:>5} | {:20} | {:}".format("VALUE", "NAME", "OBJECT"))
for stage in PROCESSING_STAGES:
print("{:5} | {:20} | {:}".format(stage.value, stage.name, stage))
The display_accepted method displays accepted tweets in pages. Pages can hold an arbitrary number of tweets, and a page can be selected by changing the page parameter.
Try it! Change the page number. We recommend 5, 12 and 22.
# Change the page number to see different tweets from the database
PAGE=3
classifier.display_accepted(page=PAGE, per_page=2)
Now it's your turn!
Using the method StartEvaluations you can process some of the unprocessed tweets. This displays a tweet embedding on screen and asks some questions whose answers are stored as metadata on the tweet.
By default StartEvaluations only processes tweets from the PREPROCESSED stage. Using the preprocess_batch method, some tweets can be preprocessed before the evaluation begins.
classifier.preprocess_batch(n=20)
Preprocessed 2006
classifier.StartEvaluations()
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
state_name = PROCESSING_STAGES(state).name
print("{:>20} : {:<12}".format(state_name, count))
How many tweets are still unprocessed? Compare the state totals above with the ones printed before the evaluations.
Using the example above, run 2 different queries to visualize:
- the total of tweets with slang among the processed, and
- the total of tweets with multimedia among the processed and preprocessed.
Observe the column names of the tables tweet and tweet_detail, shown earlier by the code that iterates through all tables and prints their columns. Consider:
- The column state is an integer and matches the PROCESSING_STAGES values seen before.
- The columns has_media and has_slang are integers but can be treated as booleans.
Replace any values between <VALUE-DESCRIPTION> to make the commands work:
# Connect to the database
classifier.connect()
# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()
###### SQL Command 1
# Total of tweets with slang among the processed.
# cursor.execute("""
# SELECT <ENTER-GROUPING-COLUMN>, count(*)
# FROM <TABLE-NAME> a
# INNER JOIN <DETAIL-TABLE-NAME> b
# ON a.tweet_id=b.tweet_id
# WHERE a.state=<STATE-VALUE-FOR-FINALIZED>
# GROUP BY <ENTER-GROUPING-COLUMN>;
# """)
cursor.execute("""
SELECT has_slang, count(*)
FROM tweet a
INNER JOIN tweet_detail b
ON a.tweet_id=b.tweet_id
WHERE a.state in (2,6)
GROUP BY has_slang;
""")
slang_rows = cursor.fetchall()
# Print Results to screen
print("{:>15} : {:<12}".format("Slang?", "TOTAL"))
for slang, count in slang_rows:
if slang:
state="Has Slang"
else:
state="No Slang"
print("{:>15} : {:<12}".format(state, count))
# SQL Command 2
# Total of tweets with multimedia among the processed and preprocessed.
# cursor.execute("""
# SELECT
# <ENTER-GROUPING-COLUMN>,
# count(*)
# FROM
# <TABLE-NAME> a
# INNER JOIN
# <DETAIL-TABLE-NAME> b
# ON
# a.tweet_id=b.tweet_id
# WHERE
# a.state in (<STATE-VALUE-FOR-FINALIZED>, <STATE-VALUE-FOR-PREPROCESSED>)
# GROUP BY
# <ENTER-GROUPING-COLUMN>
# ORDER BY
# <ENTER-GROUPING-COLUMN> DESC;
# """)
cursor.execute("""
SELECT
has_media,
count(*)
FROM
tweet a
INNER JOIN
tweet_detail b
ON
a.tweet_id=b.tweet_id
WHERE
a.state in (2,6)
GROUP BY
has_media
ORDER BY
has_media DESC;
""")
multimedia_rows = cursor.fetchall()
# Print Results to screen
print("{:>12} : {:<12}".format("HAS MEDIA?", "TOTAL"))
for has_media, count in multimedia_rows:
if has_media:
state="Yes"
else:
state="No"
print("{:>12} : {:<12}".format(state, count))
Inspired by the commands above, write a command that lets you visualize the languages of the tweets.
To complete this exercise you will need to answer these questions: Which table contains the language column? Which column should you group by?
Remember to output the results.
# SQL Command 3
# Total of tweets by language
cursor.execute("""
SELECT <ENTER-GROUPING-COLUMN>, count(*)
FROM <DETAIL-TABLE-NAME>
GROUP BY <ENTER-GROUPING-COLUMN>;
""")
language_rows = cursor.fetchall()
# Print Results to screen
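One possible solution, assuming the language column lives in the tweet_detail table shown earlier:
cursor.execute("""
SELECT language, count(*)
FROM tweet_detail
GROUP BY language;
""")
language_rows = cursor.fetchall()
print("{:>10} : {:<12}".format("LANGUAGE", "TOTAL"))
for language, count in language_rows:
    print("{:>10} : {:<12}".format(str(language), count))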
print(type(classifier))
The classifier is of type JsonLInteractiveClassifier. Two of its attributes are original_filename, the file containing the Tweet ID list, and sqlite_filename, the file holding the SQLite database.
print(classifier.original_filename)
print(classifier.sqlite_filename)
# Close the cursor
cursor.close()
Proceed to Media Rating to continue the experience and learn how to use pandas to interact with an SQLite database.