Capturing Twitter data with Twarc

This module explains how to capture Twitter data with Twarc, generate reports, and export the data to CSV and GeoJSON files.

About Twarc

Twarc is a command-line tool and Python library for archiving Twitter data. It uses the Twitter API to search for and capture tweets according to the parameters given on the command line. The captured tweets are saved as a JSON file. It was developed by Ed Summers from the Maryland Institute for Technology in the Humanities. It is also part of Documenting the Now, a project that creates tools for social media archiving to chronicle significant events, prioritizing ethical practices in the collection and preservation of social media content (https://www.docnow.io).

Installing Twarc

There are two things to do before installing Twarc. First, you need to register an application at https://developer.twitter.com. Second, you need to install Python if you don't have it installed yet. After these two steps, follow the installation instructions at https://twarc-project.readthedocs.io/en/latest.

Capturing and saving tweets

Once Twarc is set up, we can begin collecting tweets. For this tutorial, Twarc was run on a Mac, and the program is stored in a "twarc-master" folder. Open Terminal and change the directory to the location of the twarc-master folder:

In [1]:

from IPython.display import Image

print ('Changing the directory in Terminal')

local_image = Image (filename='./images/twarc-rickyrenuncia-1.gif')

local_image

Changing the directory in Terminal

Out[1]:

<IPython.core.display.Image object>

The basic command for capturing tweets is "twarc search" followed by the search query. In this example, the query searches for tweets containing #RickyRenuncia and saves the results to a JSONL file, which holds one JSON object (one tweet) per line:

twarc search '#RickyRenuncia' > tweets_rickyrenuncia_20210208.jsonl

The JSON file is saved in the twarc-master folder.
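Once the search has finished, it can help to open the file and check what was captured. The following is a minimal sketch, not part of Twarc itself: the helper names read_tweets and tweet_text are hypothetical, and the code assumes each line holds one tweet object with either a "full_text" or a "text" field.

```python
import json


def read_tweets(path):
    """Yield one tweet (a dict) for every non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def tweet_text(tweet):
    """Return the text of a tweet.

    Assumption: extended tweets store their text under "full_text",
    while older or truncated tweets use "text".
    """
    return tweet.get("full_text") or tweet.get("text", "")
```

For example, sum(1 for _ in read_tweets("tweets_rickyrenuncia_20210208.jsonl")) counts the tweets in the file created above.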

Since the Twitter API only returns tweets from the past seven to nine days, you will very likely end up with multiple JSON files from the same search parameters if you collect over a longer period. In this case, you can combine all the files and save them into one JSON file.

In [18]:

from IPython.display import Image

print ('Combine multiple files')

local_image = Image (filename='./images/twarc-rickyrenuncia-2.gif')

local_image

Combine multiple files

Out[18]:

<IPython.core.display.Image object>

The above clip shows the use of the cat command, where you can list the names of all the files you want to combine. You can also give only the beginning of the file name (typically followed by the * wildcard), as shown in the clip. The shell then matches all the files whose names begin with tweetsRickyRenuncia, and cat writes their contents into one JSON file. Make sure that all the files are stored in the same directory.
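The same combination step can also be done from Python, which is convenient inside a notebook. This is only a sketch under stated assumptions: combine_jsonl is a hypothetical helper, not a Twarc utility, and the file name pattern is whatever matches your own files.

```python
import glob
import os


def combine_jsonl(pattern, output_path):
    """Concatenate every file matching `pattern` into `output_path`.

    Returns the sorted list of input files that were combined.
    """
    output_abs = os.path.abspath(output_path)
    # Skip the output file itself in case it already exists and matches the pattern.
    paths = sorted(
        p for p in glob.glob(pattern) if os.path.abspath(p) != output_abs
    )
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if line.strip():
                        # Normalize line endings so every tweet ends with one newline.
                        out.write(line.rstrip("\n") + "\n")
    return paths
```

A call such as combine_jsonl("tweetsRickyRenuncia_*.jsonl", "tweetsRickyRenuncia.jsonl") would mirror the cat command; the pattern here is illustrative.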

Since there might be overlap in the tweets stored in the individual files, the combined JSON file may contain duplicates. To remove them, you can use one of the Twarc utilities, which are stored in the utils folder. To eliminate duplicates, run the deduplicate.py script:

./utils/deduplicate.py tweetsRickyRenuncia.jsonl > tweetsRickyRenuncia_deduplicate.jsonl
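To illustrate the idea behind deduplication, the sketch below keeps only the first occurrence of each tweet ID. It is not the code of the deduplicate.py utility, which may behave differently; the function name is hypothetical, and it assumes every tweet object carries an "id_str" field.

```python
import json


def deduplicate(lines):
    """Yield each tweet once, keeping the first occurrence of every ID.

    `lines` is any iterable of JSON strings, such as an open JSONL file.
    Assumption: each tweet has an "id_str" field.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        tweet_id = tweet["id_str"]
        if tweet_id in seen:
            continue  # already emitted this tweet
        seen.add(tweet_id)
        yield tweet
```

Because it works on any iterable of lines, the function can be fed an open file directly and its results written back out one JSON object per line.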

GeoJSON format

GeoJSON is an open standard format used to represent geographical features. A GeoJSON file of the tweets can be used for data visualization, like the following example:

In [3]:

from IPython.display import Image

print ('Map of geotagged #RickyRenuncia tweets')

local_image = Image (filename='./images/mapaTweetsRickyRenuncia1.png')

local_image

Map of geotagged #RickyRenuncia tweets

Out[3]:

To create a GeoJSON file, use the geojson.py utility:

./utils/geojson.py tweetsRickyRenuncia_deduplicate.jsonl > tweetsRickyRenuncia.geojson

It is important to mention that not all the collected tweets have coordinates. Indeed, it is very probable that only a small percentage will include geo data. Therefore, the map example above represents only part of the collected tweets.
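To see how many tweets actually carry coordinates, and what a GeoJSON point feature looks like, consider the following sketch. It does not attempt to reproduce what geojson.py does. The function names are hypothetical, and the code assumes that a geotagged tweet stores a GeoJSON-style point under its "coordinates" field, as [longitude, latitude].

```python
def tweets_to_geojson(tweets):
    """Build a GeoJSON FeatureCollection from tweets that have exact coordinates."""
    features = []
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            continue  # most tweets have no exact coordinates
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": point["coordinates"]},
            "properties": {
                "id": tweet.get("id_str"),
                "text": tweet.get("full_text") or tweet.get("text", ""),
            },
        })
    return {"type": "FeatureCollection", "features": features}


def geotagged_share(tweets):
    """Return the fraction of tweets (0.0 to 1.0) that have exact coordinates."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    tagged = sum(1 for tweet in tweets if tweet.get("coordinates"))
    return tagged / len(tweets)
```

Checking geotagged_share on the deduplicated tweets before mapping them gives a sense of how much of the dataset the map really covers.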

Twarc-report

Twarc-report contains a number of utilities to generate reports about the dataset and for visualizations. You can download the file and read the installation instructions here: https://github.com/pbinkley/twarc-report. It is important to place the twarc-report folder within the twarc-master folder.

In [2]:

from IPython.display import Image

print ('Change directory')

local_image = Image (filename='./images/twarc-rickyrenuncia-3.gif')

local_image

Change directory

Out[2]:

<IPython.core.display.Image object>

The script that generates a report is reportprofile.py.

For more information on the utilities of Twarc-report visit https://github.com/pbinkley/twarc-report.
