This notebook is meant to follow Evaluating Content; the minimal requirements for this notebook are met by running that notebook first.
The SQLite3 database created in that notebook will be accessed through the Pandas library as an alternative means of working with the data.
Create big data visualizations using the Pandas, Seaborn, and Matplotlib packages. Interact with data from an SQLite3 database using Pandas.
# Load Packages
# Enable Matplotlib Jupyter Widget Backend
%matplotlib widget
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
from shutil import copyfile
# Copy Database
DB_FILE = "./pandas_tweets.db"
copyfile("tweets.db", DB_FILE)
# Connect to database copy
connection = sqlite3.connect(DB_FILE)
Pandas includes a function, pd.read_sql_query, that generates a DataFrame given an SQL query and a database connection.
sqlite_master Table

Let's try it by getting the table names from the database copy, using the connection and this query:
SELECT name, sql
FROM sqlite_master
WHERE type='table';
The table sqlite_master is part of the SQLite3 structure and can be used to get information about the structure of the database. This query in particular retrieves the tables and the commands used to create them.
tables = pd.read_sql_query(
"""SELECT name, sql
FROM sqlite_master
WHERE type='table';""",
connection
)
tables.head(15)
A complete table can also be retrieved with the similar function pd.read_sql_table, which takes a table name and a connection. Note that pd.read_sql_table requires an SQLAlchemy connectable rather than a raw sqlite3 connection, which is why the cell below uses pd.read_sql_query instead.
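As a self-contained sketch of the query-based approach (using a throwaway in-memory database with illustrative table and column names, rather than the real tweets.db):

```python
import sqlite3
import pandas as pd

# Toy stand-in for the tweets database; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tweets (tweet_id INTEGER, language TEXT)")
conn.executemany("INSERT INTO demo_tweets VALUES (?, ?)", [(1, "es"), (2, "en")])
conn.commit()

# pd.read_sql_query accepts a raw sqlite3 connection directly.
demo = pd.read_sql_query("SELECT * FROM demo_tweets;", conn)
print(demo.shape)  # (2, 2)
```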
auto_detail = pd.read_sql_query("SELECT * FROM tweet_auto_detail;", connection)
auto_detail.loc[auto_detail.has_media == 0, "has_media_label"] = "No Media"
auto_detail.loc[auto_detail.has_media == 1, "has_media_label"] = "Media"
auto_detail.head(10)
auto_detail["datePublished"].describe()
A transformation is required to change these float64 values to dates and times. The function pd.Timestamp accepts float timestamps and transforms them into readable dates. The apply method performs a transformation on the values in a DataFrame column; the transformation must accept a single input and return a single output.
The function timestamp2DateTimeBySegment will be our transformation on this occasion.
def timestamp2DateTimeBySegment(ts: float):
    """Floor a float timestamp to the start of its day-long segment."""
    segment_size = 3600.0 * 24  # seconds in one day
    return pd.Timestamp(int(ts / segment_size) * segment_size, unit='s', tz='UTC')
auto_detail["datePublished_DT"] = auto_detail["datePublished"].apply(timestamp2DateTimeBySegment)
print("Original Float Timestamps:")
print(auto_detail["datePublished"].describe(datetime_is_numeric=True))
print("\nTransformed into DateTime by the hour:")
print(auto_detail["datePublished_DT"].describe(datetime_is_numeric=True))
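The flooring behavior can be illustrated with a self-contained sketch (the epoch values below are arbitrary examples, not taken from the data set):

```python
import pandas as pd

SEGMENT = 3600.0 * 24  # seconds in one day

def floor_to_day(ts: float) -> pd.Timestamp:
    """Floor a float epoch timestamp to the start of its UTC day."""
    return pd.Timestamp(int(ts / SEGMENT) * SEGMENT, unit="s", tz="UTC")

# Two arbitrary timestamps that fall within the same UTC day
a = floor_to_day(1_562_908_000.0)
b = floor_to_day(1_562_889_700.0)
print(a == b)  # True: both floor to midnight of the same UTC day
```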
What is the earliest date of a tweet captured in this data set?
# Answer here with code
# Example: the latest date would be
print(auto_detail["datePublished_DT"].max())
# Response:
Histograms are a great tool for visualizing frequency over a single variable.
The code below produces a histogram of datePublished_DT.
import numpy as np
# Create a Mask to limit date range
start_date = "2019-07-11"
end_date = "2019-08-01"
mask=(auto_detail["datePublished_DT"] >= start_date) & (auto_detail["datePublished_DT"] < end_date)
# Take sample
sample_dates = auto_detail[mask]
#Generate histogram bin limits with numpy
edges=np.histogram_bin_edges(sample_dates["datePublished"])
edges=pd.DataFrame(edges, columns=("datePublished",))
edges["datePublished_DT"] = edges["datePublished"].apply(timestamp2DateTimeBySegment)
# Visualize a histogram
sample_dates.hist(column="datePublished_DT", xrot=25, bins=edges["datePublished_DT"], figsize=(7,7), backend="matplotlib")
plt.title("Histogram: Date Published")
plt.ylabel("Frequency")
plt.xlabel("Date")
It is possible to generate multiple histograms for different groups of the data. In particular, we can see the distribution of the number of tweets over time for each language.
On this occasion the column is set to datePublished_DT, and we also use the by parameter to set the grouping column.
auto_detail.hist(column='datePublished_DT', by='language', bins=45, figsize=(8,9), sharex=True, sharey=True);
From the histograms above it is very clear that most activity was in Spanish, followed by English. This is one of the reasons the team integrated the Google Translation API into the visualizations.
The column language is currently of type object, as these values are stored as separate strings. Pandas offers a different data type, category, that reduces memory usage. Categories are particularly useful when a few values are repeated many times.
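A quick self-contained sketch of the memory savings, using a synthetic column with only a few distinct values (the strings and row count are illustrative):

```python
import pandas as pd

# A repetitive column like `language`: a few distinct values, many rows.
langs = pd.Series(["es", "en", "pt"] * 10_000, dtype="object")
as_cat = langs.astype("category")

obj_bytes = langs.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
# category stores each distinct string once plus small integer codes
print(obj_bytes > cat_bytes)  # True
```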
auto_detail.dtypes
auto_detail['language']=auto_detail['language'].astype('category')
print(auto_detail['language'].describe())
We can verify that the datatype has changed from object to category.
auto_detail[['language']].dtypes
To get the totals per language we can use a group-by statement with a count operation, as shown below.
count_by_lang = auto_detail[["language", "tweet_id"]].groupby(
["language"]
).count()
count_by_lang=count_by_lang.rename(
columns = {"tweet_id":"Total"}, inplace = False)
count_by_lang.transpose()
Most user activity appears to be in Spanish followed by English.
A. How many tweets have multimedia?
B. Transform the has_media column into a categorical column to enhance efficiency:
# Using Group By & Count
# Display how many tweets have media and how many don't
# Transform the column here
# Print dtype of the columns to display the updated format
auto_detail.dtypes
Tweets can be ranked by two readily available metrics: Retweet Count and Favorite Count. Other metrics, such as Comment Count and Quote Count, require access to APIs beyond the standard API; however, they could be used in a similar way to the ones shown below.
We are particularly interested in discovering relevant multimedia shared through social media. This will require filtering our dataset down to only the tweets with multimedia.
The easiest way to filter a DataFrame on a categorical column is to use the .loc accessor, which selects only the rows that satisfy a given condition.
The sort_values method arranges the DataFrame in ascending or descending order; read more in the documentation.
Using these pandas methods allows us to quickly identify popular tweets with multimedia.
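The filter-sort-paginate pattern can be sketched on toy data (column names match the real table, but the values are made up). The head(N * page).tail(N) chain keeps only the page-th group of N rows:

```python
import pandas as pd

# Toy frame mimicking the relevant columns; values are illustrative.
df = pd.DataFrame({
    "has_media":    [1, 0, 1, 1, 0],
    "retweetCount": [5, 90, 42, 7, 3],
})

page, N = 1, 2
top = (df.loc[df["has_media"] == 1]          # keep only tweets with media
         .sort_values("retweetCount", ascending=False)
         .head(N * page)                     # first page * N rows...
         .tail(N))                           # ...then keep the current page of N
print(top["retweetCount"].tolist())  # [42, 7]
```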
# Get Top 20 Tweets with Media by Retweet Count
from tweet_requester.display import prettyPrintDataFrame
# page skips to the next N
page=1
N = 20
top_N_retweet = auto_detail.loc[auto_detail["has_media"]==1].sort_values("retweetCount", ascending=False).head(N*page).tail(N)[["retweetCount", "tweet_id", "url"]]
prettyPrintDataFrame(top_N_retweet, max_column=60)
# Get Top 20 Tweets with Media by Favorite(❤️) Count
# page skips to the next N
page=1
N = 20
top_N_favorite = auto_detail.loc[auto_detail["has_media"]==1].sort_values("favoriteCount", ascending=False).head(N*page).tail(N)[["favoriteCount", "tweet_id", "url"]]
prettyPrintDataFrame(top_N_favorite, max_column=60)
It is possible to create multivariate relationship visualizations in both 2D and 3D using the matplotlib library.
This plot displays the correlation between retweets and favorites, and also shows how most of the shares were concentrated in time.
import matplotlib.dates as mdates
fig4 = plt.figure(figsize=(8,6))
ax = Axes3D(fig4, auto_add_to_figure=False)
fig4.add_axes(ax)
# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("husl", 2).as_hex())
dates = auto_detail["datePublished_DT"]
sc = ax.scatter(
auto_detail["retweetCount"], # X
auto_detail["favoriteCount"], # Y
auto_detail["datePublished"], # Z
s=40,
c=auto_detail[["has_media"]],
cmap=cmap,
alpha=1,
marker="o"
)
ax.set_xlabel("Retweet Count")
ax.set_ylabel("Favorite Count")
ax.set_zlabel("Date Published")
# ax.zaxis.set_major_locator(mdates.DayLocator(interval=1))
# Legend
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1., 1), loc=2)
plt.show()
The FacetGrid is particularly useful for demonstrating different patterns across categories.
The grid of graphs below shows that languages other than Spanish and English did not see as much user activity in terms of retweets and favorites.
groups = sns.FacetGrid(auto_detail, col="has_media", row="language", hue="datePublished_DT", legend_out=False)
groups.map(sns.scatterplot, "favoriteCount", "retweetCount")
The Evaluating Content notebook and this notebook, Media_Rating, offer an example of how multiple notebooks can be used in a curatorial and analytical environment. In particular, the notebooks demonstrate that custom web interfaces can be developed inside Jupyter Notebooks to process data at different stages and generate consistent reports.