This notebook is meant to follow Evaluating Content; the minimal requirements for this notebook are met by running that notebook first.
The SQLite3 database created in that notebook will be accessed through the Pandas library as an alternative means of working with the data.
Create big data visualizations using the Pandas, Seaborn, and Matplotlib packages. Interact with data from an SQLite3 database using Pandas.
# Load Packages
# Enable Matplotlib Jupyter Widget Backend
%matplotlib widget
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
from shutil import copyfile
# Copy Database
DB_FILE = "./pandas_tweets.db"
copyfile("tweets.db", DB_FILE)
# Connect to database copy
connection = sqlite3.connect(DB_FILE)
Pandas includes a function, pd.read_sql_query, that generates a DataFrame given an SQL query and a database connection.
sqlite_master Table

Let's try it by getting the table names from the database copy, using the connection and this query:
SELECT name, sql
FROM sqlite_master
WHERE type='table';
The table sqlite_master is part of the SQLite3 structure and can be used to get information about the structure of the database. This query in particular retrieves the tables and the commands used to create them.
tables = pd.read_sql_query(
"""SELECT name, sql
FROM sqlite_master
WHERE type='table';""",
connection
)
tables.head(15)
A complete table can also be retrieved with the similar function pd.read_sql_table, which takes a table name and a connection. Note that pd.read_sql_table requires an SQLAlchemy connectable rather than a raw sqlite3 connection, which is why the cell below uses pd.read_sql_query instead.
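As a self-contained sketch of the query-based approach (using a throwaway in-memory database with illustrative table and column names, rather than the real tweets.db):

```python
import sqlite3
import pandas as pd

# Toy stand-in for the tweets database; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tweets (tweet_id INTEGER, language TEXT)")
conn.executemany("INSERT INTO demo_tweets VALUES (?, ?)", [(1, "es"), (2, "en")])
conn.commit()

# pd.read_sql_query accepts a raw sqlite3 connection directly.
demo = pd.read_sql_query("SELECT * FROM demo_tweets;", conn)
print(demo.shape)  # (2, 2)
```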
auto_detail = pd.read_sql_query("SELECT * FROM tweet_auto_detail;", connection)
auto_detail.loc[auto_detail.has_media == 0, "has_media_label"] = "No Media"
auto_detail.loc[auto_detail.has_media == 1, "has_media_label"] = "Media"
auto_detail.head(10)
auto_detail["datePublished"].describe()
A transformation is required to change these float64 values to dates and times. The function pd.Timestamp accepts float timestamps and transforms them into readable dates. The apply method performs a transformation on the values in a DataFrame column; the transformation must accept a single input and return a single output.
The function timestamp2DateTimeBySegment will be our transformation on this occasion.
def timestamp2DateTimeBySegment(ts: float):
    """Floor a float timestamp to the start of its day-long segment."""
    segment_size = 3600.0 * 24  # seconds in one day
    return pd.Timestamp(int(ts / segment_size) * segment_size, unit='s', tz='UTC')
auto_detail["datePublished_DT"] = auto_detail["datePublished"].apply(timestamp2DateTimeBySegment)
print("Original Float Timestamps:")
print(auto_detail["datePublished"].describe(datetime_is_numeric=True))
print("\nTransformed into DateTime by the hour:")
print(auto_detail["datePublished_DT"].describe(datetime_is_numeric=True))
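The flooring behavior can be illustrated with a self-contained sketch (the epoch values below are arbitrary examples, not taken from the data set):

```python
import pandas as pd

SEGMENT = 3600.0 * 24  # seconds in one day

def floor_to_day(ts: float) -> pd.Timestamp:
    """Floor a float epoch timestamp to the start of its UTC day."""
    return pd.Timestamp(int(ts / SEGMENT) * SEGMENT, unit="s", tz="UTC")

# Two arbitrary timestamps that fall within the same UTC day
a = floor_to_day(1_562_908_000.0)
b = floor_to_day(1_562_889_700.0)
print(a == b)  # True: both floor to midnight of the same UTC day
```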
What is the earliest date of a tweet captured in this data set?
# Answer here with code
# Example: the latest date would be
print(auto_detail["datePublished_DT"].max())
# Response:
Histograms are a great tool for visualizing frequency over a single variable.
The code below produces a histogram of datePublished_DT.
import numpy as np
# Create a Mask to limit date range
start_date = "2019-07-11"
end_date = "2019-08-01"
mask=(auto_detail["datePublished_DT"] >= start_date) & (auto_detail["datePublished_DT"] < end_date)
# Take sample
sample_dates = auto_detail[mask]
#Generate histogram bin limits with numpy
edges=np.histogram_bin_edges(sample_dates["datePublished"])
edges=pd.DataFrame(edges, columns=("datePublished",))
edges["datePublished_DT"] = edges["datePublished"].apply(timestamp2DateTimeBySegment)
# Visualize a histogram
sample_dates.hist(column="datePublished_DT", xrot=25, bins=edges["datePublished_DT"], figsize=(7,7), backend="matplotlib")
plt.title("Histogram: Date Published")
plt.ylabel("Frequency")
plt.xlabel("Date")
It is possible to generate multiple histograms for different groups of the data. In particular, we can see the distribution of the number of tweets over time for each language.
On this occasion the column is set to datePublished_DT, and we also use the by parameter to set the grouping column.
auto_detail.hist(column='datePublished_DT', by='language', bins=45, figsize=(8,9), sharex=True, sharey=True);
From the histograms above it is very clear that most activity was in Spanish, followed by English. This is one of the reasons the team integrated the Google Translation API into the visualizations.
The column language is currently of type object, as these values are stored as separate strings. Pandas offers a different data type, category, that reduces memory usage. Categories are particularly useful when a few values are repeated many times.
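A quick self-contained sketch of the memory savings, using a synthetic column with only a few distinct values (the strings and row count are illustrative):

```python
import pandas as pd

# A repetitive column like `language`: a few distinct values, many rows.
langs = pd.Series(["es", "en", "pt"] * 10_000, dtype="object")
as_cat = langs.astype("category")

obj_bytes = langs.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
# category stores each distinct string once plus small integer codes
print(obj_bytes > cat_bytes)  # True
```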
auto_detail.dtypes
auto_detail['language']=auto_detail['language'].astype('category')
print(auto_detail['language'].describe())
We can verify that the datatype has changed from object to category.
auto_detail[['language']].dtypes
To get the totals per language we can use a group-by statement with a count operation, as shown below.
count_by_lang = auto_detail[["language", "tweet_id"]].groupby(
["language"]
).count()
count_by_lang=count_by_lang.rename(
columns = {"tweet_id":"Total"}, inplace = False)
count_by_lang.transpose()
Most user activity appears to be in Spanish followed by English.
A. How many tweets have multimedia?
B. Transform the has_media column into a categorical column to enhance efficiency:
# Using Group By & Count
# Display how many tweets have media and how many don't
# Transform the column here
# Print dtype of the columns to display the updated format
auto_detail.dtypes
Tweets can be ranked by two readily available metrics: Retweet Count and Favorite Count. Other metrics, such as Comment Count and Quote Count, require access to APIs beyond the standard API; however, they could be used in a similar way to the ones shown below.
We are particularly interested in discovering relevant multimedia shared through social media. This will require filtering our dataset down to only the tweets with multimedia.
The easiest way to filter a DataFrame on a categorical column is to use the .loc accessor, which selects only the rows that satisfy a given condition.
The sort_values method arranges the DataFrame in ascending or descending order; read more in the documentation.
Using these pandas methods allows us to quickly identify popular tweets with multimedia.
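The filter-sort-paginate pattern can be sketched on toy data (column names match the real table, but the values are made up). The head(N * page).tail(N) chain keeps only the page-th group of N rows:

```python
import pandas as pd

# Toy frame mimicking the relevant columns; values are illustrative.
df = pd.DataFrame({
    "has_media":    [1, 0, 1, 1, 0],
    "retweetCount": [5, 90, 42, 7, 3],
})

page, N = 1, 2
top = (df.loc[df["has_media"] == 1]          # keep only tweets with media
         .sort_values("retweetCount", ascending=False)
         .head(N * page)                     # first page * N rows...
         .tail(N))                           # ...then keep the current page of N
print(top["retweetCount"].tolist())  # [42, 7]
```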
# Get Top 20 Tweets with Media by Retweet Count
from tweet_requester.display import prettyPrintDataFrame
# page skips to the next N
page=1
N = 20
top_N_retweet = auto_detail.loc[auto_detail["has_media"]==1].sort_values("retweetCount", ascending=False).head(N*page).tail(N)[["retweetCount", "tweet_id", "url"]]
prettyPrintDataFrame(top_N_retweet, max_column=60)
# Get Top 20 Tweets with Media by Favorite(❤️) Count
# page skips to the next N
page=1
N = 20
top_N_favorite = auto_detail.loc[auto_detail["has_media"]==1].sort_values("favoriteCount", ascending=False).head(N*page).tail(N)[["favoriteCount", "tweet_id", "url"]]
prettyPrintDataFrame(top_N_favorite, max_column=60)
It is possible to create multivariate relationship visualizations in both 2D and 3D using the matplotlib library.
This plot displays the correlation between retweets and favorites, and also shows how most of the shares were concentrated in time.
import matplotlib.dates as mdates
fig4 = plt.figure(figsize=(8,6))
ax = Axes3D(fig4, auto_add_to_figure=False)
fig4.add_axes(ax)
# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("husl", 2).as_hex())
dates = auto_detail["datePublished_DT"]
sc = ax.scatter(
auto_detail["retweetCount"], # X
auto_detail["favoriteCount"], # Y
auto_detail["datePublished"], # Z
s=40,
c=auto_detail[["has_media"]],
cmap=cmap,
alpha=1,
marker="o"
)
ax.set_xlabel("Retweet Count")
ax.set_ylabel("Favorite Count")
ax.set_zlabel("Date Published")
# ax.zaxis.set_major_locator(mdates.DayLocator(interval=1))
# Legend
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1., 1), loc=2)
plt.show()
The FacetGrid is particularly useful for demonstrating different patterns across categories.
The grid of graphs below shows that languages other than Spanish and English did not see as much user activity in terms of retweets and favorites.
groups = sns.FacetGrid(auto_detail, col="has_media", row="language", hue="datePublished_DT", legend_out=False)
groups.map(sns.scatterplot, "favoriteCount", "retweetCount")
The Evaluating Content notebook and this notebook, Media_Rating, offer an example of how multiple notebooks can be used in a curatorial and analytical environment. In particular, the notebooks demonstrate that custom web interfaces can be developed inside Jupyter Notebooks to process data at different stages and generate consistent reports.