There are two things to do before installing twarc. First, register an application at https://developer.twitter.com to obtain the API credentials that twarc needs. Second, install Python if you do not already have it. After these two steps, follow the installation instructions at https://twarc-project.readthedocs.io/en/latest.
Once twarc is set up, we can begin collecting tweets. For this tutorial, twarc was used on a Mac, and the program is stored in a folder named "twarc-master". Open Terminal and change to the directory where the twarc-master folder is located:
from IPython.display import Image
print('Changing the directory in Terminal')
local_image = Image(filename='./images/twarc-rickyrenuncia-1.gif')
local_image
Changing the directory in Terminal
The basic command for capturing tweets is "twarc search" followed by the search query. In this example, the query searches for tweets with #RickyRenuncia and the results are saved as a JSON file.
The JSON file is saved in the twarc-master folder.
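For readers who prefer to launch the search from a notebook cell, here is a minimal sketch that runs the same "twarc search" command through Python's subprocess module and writes its output to a file. It assumes twarc is installed, is on your PATH, and is already configured with your API credentials. The function names and the output file name tweetsRickyRenuncia.json are illustrative choices, not part of twarc.

```python
import subprocess


def build_search_command(query):
    """Return the argument list for a basic twarc search (illustrative helper)."""
    return ["twarc", "search", query]


def run_search(query, out_path):
    """Run twarc search and write whatever it prints to out_path."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        # check=True raises CalledProcessError if twarc exits with an error
        subprocess.run(build_search_command(query), stdout=out_file, check=True)


# Example call (needs network access and configured credentials):
# run_search("#RickyRenuncia", "tweetsRickyRenuncia.json")
```

Passing the query as a list element instead of a shell string avoids having to quote the # character by hand.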
Because the Twitter search API only returns tweets from the past seven to nine days, you will quite possibly end up with multiple JSON files from the same search parameters. In this case, you can combine all the files and save them into one JSON file.
from IPython.display import Image
print('Combine multiple files')
local_image = Image(filename='./images/twarc-rickyrenuncia-2.gif')
local_image
Combine multiple files
The clip above shows the cat command, which takes the names of all the files you want to combine. Instead of listing every file, you can shorten the file name to a common prefix, as shown in the clip. The shell then matches every file whose name begins with tweetsRickyRenuncia, and their contents are saved into one JSON file. Make sure that all the files are stored in the same directory.
Because the tweets stored in the individual files may overlap, the combined JSON file will probably contain duplicates. To eliminate them, use deduplicate, one of the twarc utilities stored in the /utils folder.
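The same concatenation can be sketched in Python, which may be convenient if you are already working in a notebook. This is only an illustration of what cat does with a file-name prefix. It assumes each file holds one tweet per line, and the function name combine_files and the output name combined.json are hypothetical. Unlike cat, the sketch adds a line break when a file does not end with one, so that tweets from different files never run together.

```python
import glob
import os


def combine_files(pattern, out_path):
    """Concatenate every file matching pattern into out_path, similar to cat."""
    out_abs = os.path.abspath(out_path)
    # Sort for a stable order and skip the output file if it matches the pattern
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    with open(out_path, "w", encoding="utf-8") as out_file:
        for path in paths:
            with open(path, encoding="utf-8") as in_file:
                for line in in_file:
                    # Keep each tweet on its own line
                    out_file.write(line if line.endswith("\n") else line + "\n")
    return paths


# Example call:
# combine_files("tweetsRickyRenuncia*", "combined.json")
```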
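To make the idea of deduplication concrete, here is a minimal sketch that keeps only the first copy of each tweet, using the tweet's "id" field as the key. It is not the code of the twarc utility. It assumes the combined file holds one tweet per line, and the function and file names are hypothetical.

```python
import json


def deduplicate(in_path, out_path):
    """Copy in_path to out_path, keeping the first tweet seen for each id."""
    seen_ids = set()
    kept = 0
    with open(in_path, encoding="utf-8") as in_file, \
            open(out_path, "w", encoding="utf-8") as out_file:
        for line in in_file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            tweet_id = json.loads(line)["id"]
            if tweet_id in seen_ids:
                continue  # already written once, skip the duplicate
            seen_ids.add(tweet_id)
            out_file.write(line + "\n")
            kept += 1
    return kept


# Example call:
# deduplicate("combined.json", "deduplicated.json")
```

The function returns the number of tweets kept, which gives a quick check on how much the individual files overlapped.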
GeoJSON is an open standard format used to represent geographical features. A GeoJSON file of the tweets can be used for data visualization, like the following example:
from IPython.display import Image
print('Map of tweets with #RickyRenuncia')
local_image = Image(filename='./images/mapaTweetsRickyRenuncia1.png')
local_image
Map of tweets with #RickyRenuncia
To create a GeoJSON file, use the geojson.py utility.
Note that not all the collected tweets have coordinates. In fact, it is very likely that only a small percentage include geo data. The map above therefore represents only part of the collected tweets.
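To see how many tweets in a dataset can actually be mapped, the following sketch builds a GeoJSON FeatureCollection from the tweets that carry a "coordinates" field and also returns the total number of tweets read. It assumes one tweet per line in the format of the Twitter v1.1 API, where "coordinates" is either null or a GeoJSON Point. It does not reproduce geojson.py, which may handle more cases, and the function name is hypothetical.

```python
import json


def tweets_to_geojson(in_path):
    """Return (FeatureCollection, total tweets read) for a line-oriented tweet file."""
    features = []
    total = 0
    with open(in_path, encoding="utf-8") as in_file:
        for line in in_file:
            line = line.strip()
            if not line:
                continue
            total += 1
            tweet = json.loads(line)
            point = tweet.get("coordinates")  # None when there is no exact location
            if not point:
                continue
            features.append({
                "type": "Feature",
                "geometry": point,  # already a Point: [longitude, latitude]
                "properties": {
                    "id": tweet.get("id_str"),
                    "text": tweet.get("full_text") or tweet.get("text"),
                },
            })
    return {"type": "FeatureCollection", "features": features}, total


# Example call:
# collection, total = tweets_to_geojson("deduplicated.json")
# print(len(collection["features"]), "of", total, "tweets have coordinates")
```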
Twarc-report contains a number of utilities for generating reports and visualizations from the dataset. You can download it and read the installation instructions at https://github.com/pbinkley/twarc-report. It is important to place the twarc-report folder inside the twarc-master folder.
from IPython.display import Image
print('Change directory')
local_image = Image(filename='./images/twarc-rickyrenuncia-3.gif')
local_image
Change directory
The command to generate a report is reportprofile.py.
For more information on the utilities of Twarc-report visit https://github.com/pbinkley/twarc-report.