Community Data Science Workshops (Fall 2014)/Day 2 Twitter project

Page Moved

All material related to the Community Data Science Workshops has been moved from the OpenHatch wiki to a new dedicated wiki, and this page is no longer being updated here. Please visit the new version of the page on the Community Data Science Collective wiki.

Building a Dataset using the Twitter API

In this project, we will explore a few ways to gather data using the Twitter API. Once we've done that, we will extend this code to create our own datasets of tweets that we might be able to use to ask and answer questions in the final session.

Goals

Get set up to build datasets with the Twitter API
Have fun collecting different types of tweets using a variety of ways to search
Practice reading and extending other people's code
Create a few collections of tweets that you can use for research in the final session

Prerequisite

To participate in the Twitter afternoon session, you must have registered with Twitter as a developer before the session by following the Twitter authentication setup instructions. If you did not do this, or if you tried but did not succeed, please attend one of the other two sessions instead.

Download and test the Twitter project

If you are confused by these steps, go back and refresh your memory with the Friday Nov 7th Tutorial.

(Estimated time: 10 minutes)

Potential exercises

Who are my followers?

1) Use sample 2 to get your followers.

2) For each of your followers, get *their* followers (investigate time.sleep to throttle your requests).

3) Identify which of your followers follows the most of your other followers.

4) How many handles follow you but none of your followers?

5) Repeat this for the people you follow, rather than the people who follow you.
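
Once the follower lists are collected, steps 3 and 4 come down to counting. The following is a minimal sketch, not part of the sample code. It assumes you write your own `fetch` function (for example by adapting sample 2) that returns the followers of one handle; `collect_followers`, `count_followed_followers`, `most_connected`, `following_none`, and the `pause` parameter are all hypothetical names chosen for illustration.

```python
import time


def collect_followers(my_followers, fetch, pause=60):
    """Call fetch(handle) for each follower, sleeping between calls.

    `fetch` is a hypothetical function you supply yourself; it should
    return a collection of handles (or ids) that follow `handle`.
    """
    followers_of = {}
    for i, handle in enumerate(my_followers):
        if i > 0:
            time.sleep(pause)  # throttle so we do not hit the rate limit
        followers_of[handle] = set(fetch(handle))
    return followers_of


def count_followed_followers(followers_of):
    """For each of my followers, count how many of my followers they follow."""
    counts = {handle: 0 for handle in followers_of}
    for their_followers in followers_of.values():
        for handle in their_followers:
            if handle in counts:  # ignore accounts that do not follow me
                counts[handle] += 1
    return counts


def most_connected(counts):
    """Step 3: the follower who follows the most of my followers."""
    return max(counts, key=counts.get)


def following_none(counts):
    """Step 4: followers who follow none of my other followers."""
    return sorted(handle for handle, n in counts.items() if n == 0)
```

The pause length is a guess; pick a value that matches the rate limit you actually run into. For step 5, the same counting works if `fetch` returns the accounts a handle follows instead of its followers.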

Topics and Trends

1) Use sample 3 to produce a list of 1000 tweets about a topic.

2) Look at those tweets. How does Twitter interpret a two-word query like "data science"?

3) Eliminate retweets [hint: look at the tweet object!]

4) For each original tweet, list the number of times you see it retweeted.

5) Get a list of the URLs that are associated with your topic.
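
Steps 3 to 5 can be sketched with a few helper functions. This is only an illustration, assuming each tweet is available as a dictionary shaped like Twitter's JSON, where a retweet carries the original tweet under the "retweeted_status" key and links appear under "entities" and "urls". If your sample gives you tweet objects instead, you would need to read the same fields as attributes or convert the objects to dictionaries first. The function names are hypothetical.

```python
from collections import Counter


def split_retweets(tweets):
    """Step 3: separate original tweets from retweets."""
    originals = [t for t in tweets if "retweeted_status" not in t]
    retweets = [t for t in tweets if "retweeted_status" in t]
    return originals, retweets


def count_retweets_seen(tweets):
    """Step 4: count how often each original tweet id appears retweeted
    in our own sample (not Twitter's global retweet count)."""
    return Counter(
        t["retweeted_status"]["id"] for t in tweets if "retweeted_status" in t
    )


def collect_urls(tweets):
    """Step 5: distinct expanded URLs, in order of first appearance."""
    seen = []
    for t in tweets:
        for url in t.get("entities", {}).get("urls", []):
            expanded = url.get("expanded_url")
            if expanded and expanded not in seen:
                seen.append(expanded)
    return seen
```

Note that `count_retweets_seen` may report ids for originals that are not themselves in your 1000 tweets, since a search can return a retweet without returning the tweet it repeats.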

Geolocation

1) Alter the streaming algorithm to include a "locations" filter. You need to use the order sw_lng, sw_lat, ne_lng, ne_lat for the four coordinates.

2) What are people tweeting about in Times Square today?

2.5) Bonus points: set up a bounding box around Times Square and around NYC as a whole. Can you find words that are more likely to appear in Times Square?

3) UW is playing Arizona in football today. Set up a bounding box around the Arizona stadium and around UW. Can you identify tweets about football? Who tweets more about the game?

You can use d = api.search(geocode='37.781157,-122.398720,1mi') to do a static geo search.
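
Getting the coordinate order wrong is an easy mistake, so it can help to build each bounding box with a small helper. The sketch below is an assumption-laden illustration: `make_box`, `in_box`, and `TIMES_SQUARE` are hypothetical names, the Times Square corners are rough hand-picked values, and tweets are assumed to be dictionaries shaped like Twitter's JSON, where "coordinates" is either empty or a GeoJSON point stored as [longitude, latitude].

```python
def make_box(sw_lng, sw_lat, ne_lng, ne_lat):
    """Return a bounding box as a flat list: sw_lng, sw_lat, ne_lng, ne_lat."""
    if sw_lng > ne_lng or sw_lat > ne_lat:
        raise ValueError("south-west corner must be below and left of north-east")
    return [sw_lng, sw_lat, ne_lng, ne_lat]


def in_box(tweet, box):
    """True if the tweet has an exact point that falls inside the box."""
    point = tweet.get("coordinates")
    if not point:
        return False  # no exact location attached to this tweet
    lng, lat = point["coordinates"]  # GeoJSON order: longitude first
    sw_lng, sw_lat, ne_lng, ne_lat = box
    return sw_lng <= lng <= ne_lng and sw_lat <= lat <= ne_lat


# Approximate box around Times Square; adjust the corners as needed.
TIMES_SQUARE = make_box(-73.99, 40.755, -73.98, 40.76)
```

A list like `TIMES_SQUARE` is what you would hand to the "locations" filter in exercise 1. For the bonus question, one option is to stream with the larger NYC box and use `in_box` to separate the Times Square tweets from the rest before comparing word counts.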