During the July 2019 protests demanding the resignation of the Governor of Puerto Rico, Joel Blanco-Rivera began capturing tweets with twarc, a command line tool and Python library for archiving Twitter data. He selected the tool after reading about its use by members of Documenting the Now to archive Twitter data from other events. The goal was to capture tweets about the protests that used the hashtag #RickyRenuncia, which became the protest slogan on social media.
This module uses the experience of archiving #RickyRenuncia Twitter data to explain how to capture Twitter data with twarc, generate reports, and export the data to CSV and GeoJSON files. Through the module, you will be able to create and curate your own dataset.
The main purpose of this module is to provide information about installing and running twarc. To use twarc, you will need to install and set up the tool on your computer.
Twarc is a command line tool and Python library for archiving Twitter data. It uses the Twitter API to search for and capture tweets according to the parameters given in the search command. The captured tweets are saved as a JSON file. Twarc was developed by Ed Summers of the Maryland Institute for Technology in the Humanities. It is also part of Documenting the Now, a project that creates tools for social media archiving to chronicle significant events, prioritizing ethical practices in the collection and preservation of social media content (https://www.docnow.io).
There are two things to do before installing Twarc. First, you need to register an application at https://developer.twitter.com. Second, you need to install Python if you don't have it already. After these two steps, follow the installation instructions at https://twarc-project.readthedocs.io/en/latest, which include directions for both Mac and Windows. For this tutorial, twarc was used on a Mac and the commands were run in the Terminal application. The program is stored in a "twarc-master" folder. Once twarc is set up, we can begin collecting tweets.
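As a quick reference, a minimal sketch of one way to install and configure twarc with pip, following the twarc documentation (your setup may differ if you downloaded the twarc-master folder directly from GitHub, as was done for this tutorial):

pip install twarc
twarc configure   # prompts for the keys and tokens of the application you registered at developer.twitter.com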
Open Terminal and change the directory to where the folder twarc-master is located:
from IPython.display import Image
print('Changing the directory in Terminal')
local_image = Image(filename='./images/twarc-rickyrenuncia-1.gif')
local_image
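For example, assuming the twarc-master folder sits in your home directory (the path is illustrative):

cd ~/twarc-master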
The basic command for capturing tweets is "twarc search" followed by a search query. In this example, the query searches for tweets with #RickyRenuncia and saves them to a JSON file:
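A sketch of the command, assuming you want to save the results to a file named tweetsRickyRenuncia1.json (the file name is illustrative):

twarc search '#RickyRenuncia' > tweetsRickyRenuncia1.json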
The JSON file is saved in the twarc-master folder.
Since the Twitter API only returns tweets from the past seven to nine days, it is very likely that you will end up with multiple JSON files from the same search parameters. In this case, you can combine all the files and save them into one JSON file.
from IPython.display import Image
print('Combine multiple files')
local_image = Image(filename='./images/twarc-rickyrenuncia-2.gif')
local_image
The above clip shows the use of the cat command, where you can list the names of all the files you want to combine. You can also use a wildcard instead of typing each file name, as shown in the clip: the shell matches every file that begins with tweetsRickyRenuncia and cat saves them into one JSON file. Make sure that all the files are stored in the same folder.
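A sketch of both variants, using the illustrative file names from above:

# list each file explicitly
cat tweetsRickyRenuncia1.json tweetsRickyRenuncia2.json > rickyrenuncia-combined.json
# or use a wildcard so the shell matches every file starting with tweetsRickyRenuncia
cat tweetsRickyRenuncia*.json > rickyrenuncia-combined.json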
Since there might be overlap between the tweets stored in the individual files, the combined JSON file will have duplicates. Here you can use one of the utilities that come with twarc, stored in the utils folder. To eliminate duplicates, use the deduplicate.py script.
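A sketch of how this can look, assuming the combined file from the previous step (file names are illustrative):

python utils/deduplicate.py rickyrenuncia-combined.json > rickyrenuncia-deduped.json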
Identify a topic or event and create your own dataset of tweets using Twarc.
Once you have a dataset, you can use a number of tools to create reports and visualizations, which we will cover later. You can also convert the JSON file to other formats, such as a CSV table or a GeoJSON file. Converting to other formats opens the door to additional data analysis and visualization strategies.
To convert your dataset into a CSV table, use the json2csv.py script, located in the utils folder. The command is:
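A sketch, assuming the deduplicated file from the previous step (file names are illustrative):

python utils/json2csv.py rickyrenuncia-deduped.json > rickyrenuncia.csv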
from IPython.display import Image
print('Map with pins of tweets by location')
local_image = Image(filename='./images/mapaTweetsRickyRenuncia1.png')
local_image
To create a GeoJSON file, use the geojson.py script, also located in the utils folder:
It is important to mention that not all the collected tweets have coordinates. Indeed, it is very likely that only a small percentage will include geodata. Therefore, the map example above represents only a portion of all the collected tweets.
Convert your dataset to CSV and GeoJSON. What types of data analysis, representation, or visualization can you do with these converted files?
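A sketch, again using the illustrative file names from above:

python utils/geojson.py rickyrenuncia-deduped.json > rickyrenuncia.geojson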
Twarc-report contains a number of utilities for generating reports and visualizations from a dataset. You can download the files and read the installation instructions at https://github.com/pbinkley/twarc-report. It is important to place the twarc-report folder inside the twarc-master folder.
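One way to put the code in place is to clone the repository into the twarc-master folder (the path is illustrative; see the twarc-report README for its dependencies):

cd ~/twarc-master
git clone https://github.com/pbinkley/twarc-report.git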
from IPython.display import Image
print('Change directory')
local_image = Image(filename='./images/twarc-rickyrenuncia-3.gif')
local_image
The command to generate a report is reportprofile.py:
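A sketch of running it from inside the twarc-report folder, assuming the dataset file sits one level up (names and paths are illustrative; check the twarc-report README for the exact options):

cd twarc-report
python reportprofile.py ../rickyrenuncia-deduped.json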
For more information on the utilities of Twarc-report visit https://github.com/pbinkley/twarc-report.
Create a report of your dataset.
In 2020, Twitter released version 2 of its API, which included the "Twitter API for Academic Research," or "Academic API." This service gives users access to Twitter's full archive. One of the advantages of this track is that you are not limited to the seven-to-nine-day window when capturing data with twarc. The developers released version 2 of twarc, updating the tool to make it compatible with the Academic API.
If you would like to capture data using the Academic API, there are two important things to do. First, you need to apply for an "Academic project" with Twitter. You can read more about the requirements and how to apply for an Academic API here.
Second, you need to upgrade twarc. Follow the instructions here. In the following video, Ed Summers provides a quick demo of using twarc2 to access the full Twitter archive with the Academic API:
from IPython.display import YouTubeVideo
# quick demo of how to use twarc2 to talk to the Twitter V2 API and access the full historical archive using the Academic Research Product Track.
# Video credit: Ed Summers.
YouTubeVideo('t1kT719vxlQ')
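As a quick reference, a hedged sketch of what the upgrade and a full-archive search can look like with twarc2 (the query and file name are illustrative; the --archive flag requires Academic API access):

pip install --upgrade twarc
twarc2 configure   # enter the keys for your Academic project
twarc2 search --archive "#RickyRenuncia" tweetsRickyRenuncia-archive.jsonl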
Continue to Evaluating Content for the next part of the experience.