The team of the RickyRenuncia Project managed multiple acquisition procedures to preserve records of the incidents that occurred during the summer of 2019 related to the resignation of former governor Ricardo Rosselló Nevares.
The team collected artifacts and banners used during the demonstrations. Whenever possible, the artifacts were accompanied by audio interviews and/or photographs of the demonstrators who produced and used them. Through social media and online word of mouth, the team also contacted the community requesting imagery and content related to the activities of that summer.
In order to have a broad view of the many activities and demonstrations around the globe, one of the team members, Joel Blanco, decided to capture records of tweet activity on the web. This data was captured live during the days of the incident and requires processing and analysis to provide a valid interpretation of the information acquired.
A cleaned version of this dataset occupies over 7 gigabytes but fits into 777 megabytes when compressed using gzip. Full-text data can generally be compressed easily. Below we calculate the benefit of compressing this specific dataset.
Before using this notebook, the user will need to initialize environment variables as specified in Developer_Registration.ipynb.
# Calculate the storage benefits of compression
# Observations
original_size_G = 7
final_size_M = 777
# Unit transformation
giga_to_mega_rate = 1024.0
original_size_M = original_size_G * giga_to_mega_rate
# Calculate percent change
new_size_to_old_size = final_size_M / original_size_M
new_size_percent = new_size_to_old_size * 100.0
space_freed_percent = 100 - new_size_percent
print(
    "The storage was reduced to {:.1f}%.\nAfter compression, {:.1f}% of the originally occupied space was freed."
    .format(new_size_percent, space_freed_percent)
)
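The 7 GB to 777 MB figure above was measured on the full dataset. As a minimal sketch (the sample string below is invented for illustration), the same kind of measurement can be reproduced with Python's standard gzip module:

```python
import gzip

def gzip_ratio(text: str) -> float:
    """Return the gzip-compressed size as a fraction of the original size."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)

# Highly repetitive JSON-like text, as in a tweet dump, compresses very well.
sample = '{"text": "#RickyRenuncia", "retweet_count": 0}\n' * 1000
ratio = gzip_ratio(sample)
print(f"Compressed to {ratio * 100:.2f}% of the original size")
```

Real tweet objects are less repetitive than this toy sample, which is why the full dataset compresses to roughly 11% rather than under 1%.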
The benefits can be very large, especially for long-term storage.
It is important to understand the type of data that is collected from a social media API (application programming interface). The file Data/Joel/tweetsRickyRenuncia-final.jsonl is in the jsonl format. If you are familiar with json files, this format is a composition of multiple json strings, each on a new line; the 'l' stands for line (jsonl = json-l = json-lines).
This dataset was collected from Twitter in 2019. The Twitter API recently went through an update; however, this data uses the previous API conventions. We will use Python's json library to parse a random line from the source data to help you visualize the structure of this data. Observe that some of the content is readily available (the text field), while other parts are harder to parse (the media URL).
The full list of tweet ids is available here.
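As a minimal sketch of the format (the two tweet records below are invented for illustration, not taken from the dataset), each line of a .jsonl file is an independent JSON document parsed with json.loads:

```python
import json

# Two hypothetical tweet records, one JSON object per line (the JSON Lines convention).
jsonl_text = (
    '{"id_str": "1", "text": "Marcha hoy #RickyRenuncia"}\n'
    '{"id_str": "2", "text": "En vivo desde San Juan"}\n'
)

# Skipping blank lines mirrors how the real file is counted below.
tweets = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
for tweet in tweets:
    print(tweet["id_str"], "->", tweet["text"])
```

Because each line stands alone, the file can be streamed line by line without loading all 7 GB into memory, which is exactly how the sampling code below works.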
Below we show how try/except and while loops can be used to loop through the data until a post with images is found.
dir_path = os.getcwd()
print(dir_path)
#os.chdir("/home/torrien/")
#dir_path = os.getcwd()
#print(dir_path)
#print(os.listdir())
JL_DATA="/home/rickyrenuncia/tweetsRickyRenuncia-final.jsonl"
# Get the SAMPLE_SIZE
SAMPLE_SIZE = 0.
with open(JL_DATA, "r") as data_handler:
    for line in data_handler:
        if line != "\n":
            SAMPLE_SIZE += 1.
print(f"Sample Size: {int(SAMPLE_SIZE)}\n\n")
# Get a random integer to skip before taking single sample
# Try seeds 1 and 16 or any you want to test
seed(1)
skip_lines=randint(0,int(SAMPLE_SIZE-1))
# Reopen file using the with-open-as style and print out a single sample
with open(JL_DATA, 'r') as data_handler:
    # Use next() to skip a line; the for loop allows skipping multiple lines.
    for _ in range(skip_lines):
        next(data_handler)
    while True:
        # Loop until a tweet with media is found.
        try:
            # Capture the next line of raw JSON.
            raw_data = data_handler.readline()
            # Verify the json has any 'media_url_https' keys.
            if 'media_url_https' not in raw_data:
                continue
            data = json.loads(raw_data)
        except:
            break
        try:
            i = 0
            while True:
                try:
                    media_url = data['retweeted_status']['entities']['media'][i]['media_url_https']
                except:
                    i += 1
                    if i > 10:
                        media_url = "Could not quickly find a tweet with media."
                        raise  # Pass the error to the previous try/except.
                    continue
                break
        except:
            continue
        print("Text:", data['text'])
        # The tweet URL is a Twitter convention: both the tweet ID and the user's screen_name are required to access the status.
        print("Tweet URL using user's screen_name:", f"https://twitter.com/{data['user']['screen_name']}/status/{data['id_str']}")
        print("Tweet URL using user's ID         :", f"https://twitter.com/{data['user']['id_str']}/status/{data['id_str']}")
        print("Media:", media_url)
        # print(f"In reply to: {json.dumps(data['retweeted_status'], indent=1)}")
        print("\n")
        # The indent and sort_keys arguments of json.dumps "prettify" the output. Still not pretty.
        # print("Raw Data:")
        # print("#"*50)
        # print(json.dumps(data, indent=4, sort_keys=True))
        # print("#"*50)
        break
As data analysts we need to understand the data before we can set goals.
SAMPLE_SIZE = 1113758
data = TweetJLAnalyzer(JL_DATA, reset=True, local_media=False, cache_size=2000)
size=getsizeof(data)
print(str(size))
print(str(size/1024.0))
most_retweeted_media = data.get_most_retweeted_media(40)
print("Amount found: ", len(most_retweeted_media))
for rt_count, m_id, m in most_retweeted_media[15:21]:
    print(m)
    print("*"*20 + "\n" + str(rt_count) + " - " + str(m_id) + "\n" + "*"*20 + "\n\n")
most_retweeted_posts = data.get_most_retweeted(100,has_media=True)
# Save popular posts
with open("100_most_retweeted_posts.pickle", 'wb') as handler:
    pickle.dump(most_retweeted_posts, handler)
# Recall popular posts
with open("100_most_retweeted_posts.pickle", 'rb') as handler:
    most_retweeted_posts = pickle.load(handler)
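The save/recall pattern above lets expensive query results survive a kernel restart. As a sketch with plain in-memory data (the tuples below are invented stand-ins for the real (retweet_count, tweet_id, key) entries):

```python
import pickle

# Hypothetical (retweet_count, tweet_id, key) tuples standing in for real results.
most_retweeted_sample = [
    (5120, "1150943952616468486", "abc"),
    (309, "1150900000000000000", "def"),
]

# dumps/loads round-trip through bytes; the file-based dump/load used
# above behaves the same way through a 'wb'/'rb' file handle.
blob = pickle.dumps(most_retweeted_sample)
restored = pickle.loads(blob)
assert restored == most_retweeted_sample
```

Pickle preserves the exact Python objects (tuples of ints and strings here), so the recalled list can be used without any re-parsing.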
import random
print("Amount found: ", len(most_retweeted_posts))
for rt_count, tweet_id, key in random.sample(most_retweeted_posts[11:21], 10):
    tweet = data.fetch_by_id(tweet_id)
    text = tweet.data["text"].lower()
    if "renuncia" in text or "puerto rico" in text or "ricky" in text or "rosell" in text:
        print(tweet)
        print("*"*20 + "\n" + str(rt_count) + " - " + str(tweet_id) + " - " + str(key) + "\n" + "*"*20 + "\n\n")
    else:
        # print(tweet.data["text"])
        print(tweet)
        print("*"*10 + "\n" + str(rt_count) + " - " + str(tweet_id) + " - " + str(key) + "\n\n")
# randint(0,SAMPLE_SIZE-6)
# print(data.head(5, 40, sep="\n" + "*"*100 + "\n\n"))
#RickyRenuncia
#RickyVeteYa
print(data.head(5, randint(0,SAMPLE_SIZE-6), sep="\n" + "*"*100 + "\n\n"))
print(data.head(2, sep="\n*************\n"))
print(type(data.retweet_cache))
print(str(data.retweet_cache.keys())[:400])
print(str(data.retweet_cache)[:400])
print(data.retweet_cache[0][0])
print(str(data.quoteOf)[:400])
print(str(data.retweetOf)[:400])
print(str(data.retweet_cache)[:400])
retweet_counts = list(data.retweet_cache.keys())
retweet_counts.sort(reverse=True)
quote_counts = list(data.quote_cache.keys())
quote_counts.sort(reverse=True)
print(str(retweet_counts)[:400])
print(str(quote_counts)[:400])
sample_t = data.fetch_by_position(112)
print(json.dumps(sample_t.data, indent=4))
# Find a video tweet
SAMPLE_SIZE = 1113758
count = 0
media_ids=[]
with open(JL_DATA, 'r') as data_file:
    for _ in range(SAMPLE_SIZE):
        count += 1
        if count % 200000 == 0:
            print(f"Done with: {count}")
        tweet = TweetAnalyzer(data_file.readline())
        if tweet.hasMedia:
            # print("HasMedia", tweet.hasMedia)
            if len(tweet.media) > 0:
                for m in tweet.media:
                    if m.mtype().lower() != "photo" and m.id not in media_ids:
                        media_ids.append(m.id)
                        print(m.id, m.mtype(), m.url())
                        # print(m.data)
            else:
                print("Length 0??")
                try:
                    print(tweet.data["entities"]["media"])
                except:
                    print("No Media at HERE")
                try:
                    print(tweet.data["retweeted_status"]["entities"]["media"])
                except:
                    print("No Media at RETWEET_STATUS")
                print(json.dumps(tweet.data))
                break
print(f"DONE: {count}")
def h(p, q):
    return (p, q)

interact(h, p=10, q=fixed(20))
def f(x):
    return x

interact(f, x=IntSlider(min=0, max=30, step=1, value=15))
@interact(x=(0.0, 20.0, 0.5))
def h(x=5.5):
    return x

@interact(x=(8, 20))
def aTitle(x=12):
    display(HTML(f"<h1 style='font-size:{x}px'>Hello!</h1>"))
import ipywidgets as widgets
from IPython.display import display
button = widgets.Button(description="Click Me!")
output = widgets.Output()
display(button, output)
output.my_n = 0
def on_button_clicked(b):
    with output:
        output.clear_output()
        output.my_n += 1
        print(f"Button clicked. {output.my_n}")
button.on_click(on_button_clicked)
# Adding Required Libraries
import ipywidgets as widgets
from IPython.core.display import display, HTML, update_display
import json, os, pickle
from random import seed, randint
from tweet_rehydrate.analysis import TweetJLAnalyzer, TweetAnalyzer, getsizeof
from tweet_rehydrate.display import TweetInteractiveClassifier, JsonLInteractiveClassifier, TSess, prepare_google_credentials
from twitter_secrets import C_API_KEY, C_API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET, C_BEARER_TOKEN
JL_DATA="/home/rickyrenuncia/tweetsRickyRenuncia-final.jsonl"
tweet_session = TSess(
    C_BEARER_TOKEN,
    compression_level=5,
    sleep_time=3,
    cache_dir="./.tweet_cache_split/",
    hash_split=True
)
google_credentials = prepare_google_credentials(credentials_file="./google_translate_keys.json")
# jl_display = JsonLInteractiveClassifier(
# tweet_ids_file="tweetsRickyRenuncia-final.txt",
# session=tweet_session, mute=False)
# Flier Boletin Promocion
# 30 de Abril
jl_display = JsonLInteractiveClassifier(
    tweet_ids_file="tweetsRickyRenuncia-final.txt",
    session=tweet_session, pre_initialized=True,
    sqlite_db=".tweetsRickyRenuncia-final.txt.db")
Loading Tweet...
Si ven # como Ric**SeQueda o de esa índole: NO LE DEN REPLY, NO LE DEN RETWEET, NO LE DEN QUOTE. Esto ayuda a impulsar su tag. Lxs que apoyan a Ricky saben que le responderíamos enojadxs a sus tweets, promocionando su popularidad. NO PERMITAN QUE ESTO PASE, PORQUE #RICKYRENUNCIA
— Lain (@agridvlce) July 17, 2019
jl_display.display_another()
test_tweet = TweetInteractiveClassifier(tweet_id="1150943952616468486", session=tweet_session)
test_tweet.url()
'https://twitter.com/any_user/status/1150943952616468486'
# output = widgets.Output()
html = test_tweet.oEmbeded()
# print(html)
# with output:
display(HTML(html))
More than 20,000 took the streets today -the fourth consecutive day- demanding that @ricardorossello step down or that the House put begin with the impeachment proceedings. #RickyRenuncia pic.twitter.com/wMSHJAgKIF
— M Rodriguez Banchs (@mrbanchs) July 16, 2019
print(test_tweet.text())
print(test_tweet.hasMedia)
print(test_tweet.hasLocalMedia)
print(test_tweet.data.keys())
print(test_tweet.data.get("entities", {}))
print(test_tweet.data.get("extended_entities", {}))
More than 20,000 took the streets today -the fourth consecutive day- demanding that @ricardorossello step down or that the House put begin with the impeachment proceedings. #RickyRenuncia https://t.co/wMSHJAgKIF True True dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang']) {'hashtags': [{'text': 'RickyRenuncia', 'indices': [173, 187]}], 'symbols': [], 'user_mentions': [{'screen_name': 'ricardorossello', 'name': 'Ricardo Rosselló', 'id': 80013913, 'id_str': '80013913', 'indices': [84, 100]}], 'urls': [], 'media': [{'id': 1150943327073816581, 'id_str': '1150943327073816581', 'indices': [188, 211], 'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/1150943327073816581/pu/img/NEJsf-B609d2c93F.jpg', 'media_url_https': 'https://pbs.twimg.com/ext_tw_video_thumb/1150943327073816581/pu/img/NEJsf-B609d2c93F.jpg', 'url': 'https://t.co/wMSHJAgKIF', 'display_url': 'pic.twitter.com/wMSHJAgKIF', 'expanded_url': 'https://twitter.com/mrbanchs/status/1150943952616468486/video/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 383, 'h': 680, 'resize': 'fit'}, 'large': {'w': 720, 'h': 1280, 'resize': 'fit'}, 'medium': {'w': 675, 'h': 1200, 'resize': 'fit'}}}]} {'media': [{'id': 1150943327073816581, 'id_str': '1150943327073816581', 'indices': [188, 211], 'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/1150943327073816581/pu/img/NEJsf-B609d2c93F.jpg', 'media_url_https': 'https://pbs.twimg.com/ext_tw_video_thumb/1150943327073816581/pu/img/NEJsf-B609d2c93F.jpg', 'url': 'https://t.co/wMSHJAgKIF', 'display_url': 'pic.twitter.com/wMSHJAgKIF', 'expanded_url': 
'https://twitter.com/mrbanchs/status/1150943952616468486/video/1', 'type': 'video', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 383, 'h': 680, 'resize': 'fit'}, 'large': {'w': 720, 'h': 1280, 'resize': 'fit'}, 'medium': {'w': 675, 'h': 1200, 'resize': 'fit'}}, 'video_info': {'aspect_ratio': [9, 16], 'duration_millis': 16430, 'variants': [{'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/ext_tw_video/1150943327073816581/pu/vid/720x1280/bwitkY71BNkxtVNl.mp4?tag=10'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/ext_tw_video/1150943327073816581/pu/pl/ZqRRTPZyiioDzUOF.m3u8?tag=10'}, {'bitrate': 632000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/ext_tw_video/1150943327073816581/pu/vid/320x568/mFbp7Wn5ejQw23LD.mp4?tag=10'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/ext_tw_video/1150943327073816581/pu/vid/360x640/4BaXjVBo5F_Kl46e.mp4?tag=10'}]}, 'additional_media_info': {'monetizable': False}}]}
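The extended_entities output above lists several video variants at different bitrates. As a sketch of how one might pick the best mp4 (the variant dicts below mirror the structure of the dump above, with shortened, hypothetical URLs):

```python
# Variants shaped like the extended_entities dump above (URLs are placeholders).
variants = [
    {"bitrate": 2176000, "content_type": "video/mp4", "url": "https://video.twimg.com/720x1280.mp4"},
    {"content_type": "application/x-mpegURL", "url": "https://video.twimg.com/pl.m3u8"},
    {"bitrate": 632000, "content_type": "video/mp4", "url": "https://video.twimg.com/320x568.mp4"},
    {"bitrate": 832000, "content_type": "video/mp4", "url": "https://video.twimg.com/360x640.mp4"},
]

# Keep only direct mp4 files (the m3u8 playlist entry has no bitrate),
# then take the highest-bitrate one.
mp4s = [v for v in variants if v["content_type"] == "video/mp4"]
best = max(mp4s, key=lambda v: v.get("bitrate", 0))
print(best["url"])
```

This is the selection a preservation pipeline would typically make when downloading a single rendition of each video.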
(test_tweet.url(), test_tweet.isRetweet, test_tweet.retweeted_status.url())
test_tweet.display()
test_tweet.data.keys()
dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])
# Example lookup; "full_text" is one of the keys listed above.
test_tweet.data["full_text"]