
WebSci ’17 Tutorial Notes: Analyzing Geolocated Data with Twitter

September 22nd, 2017


Prof. Bruno Gonçalves, New York University



09:00-10:20 theory session

10:30-12:00 practical session

Theory Session:

GPS-enabled smartphones provide precise geographic locations

January 2017 global digital snapshot

Social MEowDia Explained: different behaviors on different social media


Anatomy of a tweet: short text (Twitter started as a messaging system), hashtags, number of times shared, timestamp, location (from the phone’s GPS), and background information (metadata)

Main fields: text content, user, geo, URL, etc.

Geolocated Tweets:

Follows a user’s geo info over time

GPS Coordinates vs World Population

Smartphone ownership: highest among adults and among people with higher education or income levels (survey results)

Market Penetration: larger user base in countries with higher GDP

Age Distribution

Demographics: ICWSM ’11, 375 (2011)

Language and Geography: different languages have different geographic distributions, for example Spanish and English in NYC

Multilayer Network:

Retweets: information layer

Followers: social layer

Link Function: ICWSM ’11, 89 (2011)

Clusters: retweets ≈ agreement; mentions ≈ discussion

Retweets and mentions have very different meanings

The Strength of Ties: chains of ties

Interviews were used to find out how individuals learned about opportunities

Mostly from acquaintances or friends of friends

The study argued that the degree of overlap of two individuals’ social networks varies directly with the strength of their tie to one another.

Neighborhood Overlap
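
Neighborhood overlap can be made concrete with a small example. The sketch below assumes a common definition from the network literature (the notes do not give a formula): the number of neighbors two connected users share, divided by the number of users who are neighbors of at least one of them, not counting the two users themselves. The toy graph and the function name are hypothetical.

```python
def neighborhood_overlap(adj, u, v):
    """Shared neighbors of u and v divided by all of their other neighbors."""
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)

# Toy undirected graph stored as a dict of neighbor sets (made-up data).
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}

strong = neighborhood_overlap(adj, "a", "b")  # a and b share both c and d
bridge = neighborhood_overlap(adj, "d", "e")  # e has no other neighbors
```

A tie inside a tight group gets a high overlap, while a tie that acts as a bridge to an otherwise unconnected user gets an overlap of zero.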

Network Structures: arrows are retweets; clusters are different friendship communities; dots are users; some users serve as bridges between communities.

Links: internal, between groups, intermediary, etc.



Geographic location

Twitter follower distance

Locality: measures the percentage of a user’s friends who live in the same country.
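
As a minimal sketch of the locality measure, assuming each user and friend has already been assigned a country code (the function name and the sample data are made up for illustration):

```python
def locality(user_country, friend_countries):
    """Fraction of friends located in the same country as the user."""
    if not friend_countries:
        return 0.0
    same = sum(1 for c in friend_countries if c == user_country)
    return same / len(friend_countries)

# A user in the US with four friends, two of them also in the US.
value = locality("US", ["US", "US", "BR", "ES"])
```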

Co-occurrences and social ties

Geotagged Flickr Photos

Divide the world into a grid and count the number of cells in which two individuals were both present within a given time interval

Measure: the likelihood that two users who shared photos in the same grid cell within a period of time become friends
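
The grid-counting step can be sketched as follows. This is an illustration only: the one-degree cell size, the one-day time window, the function names, and the sample visits are all assumptions, not values from the tutorial.

```python
import math

def cell(lon, lat, size_deg=1.0):
    """Map a coordinate to the index of the grid cell that contains it."""
    return (math.floor(lon / size_deg), math.floor(lat / size_deg))

def cooccurrence_cells(visits_a, visits_b, size_deg=1.0, max_gap=86400):
    """Count distinct cells where both users appeared within max_gap seconds."""
    seen = {}
    for lon, lat, t in visits_a:
        seen.setdefault(cell(lon, lat, size_deg), []).append(t)
    shared = set()
    for lon, lat, t in visits_b:
        c = cell(lon, lat, size_deg)
        if any(abs(t - ta) <= max_gap for ta in seen.get(c, [])):
            shared.add(c)
    return len(shared)

# Each visit is (lon, lat, unix time in seconds); made-up data.
alice = [(-73.99, 40.73, 0), (2.35, 48.85, 500000)]
bob = [(-73.95, 40.78, 3600), (2.30, 48.86, 900000)]
n = cooccurrence_cells(alice, bob)
```

Here both users were in the same New York cell within an hour of each other, but their Paris visits were several days apart, so only one cell counts.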

Mobility: school/work, home, vacation, moving to a different city or country

Airline Flights: in Europe within 24 hours

Commuting: train, subway, bus, etc.

Realistic Epidemic Spreading

Human Mobility: Statistical Model

Privacy (Sci. Rep. 3, 1376 (2013))

How many indicators do we need to uniquely identify a person?

Mobility and Social Network (PLoS ONE 9, e92196 (2014))

Geo-Social Properties: a matrix of social behavior over distance, covering probability of a link, reciprocity, clustering, and triangle disparity

Geo-Social Model:

Starting position of user u

Either visit a random neighbor or jump to a new location

New position of u

Model fitting: probability of visiting an old friend vs. meeting a new friend
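
The movement step can be illustrated with a toy sketch. This is not the published model: the single parameter p_visit, the uniform choice of friend and location, and all names and data below are assumptions made only to show the "visit a neighbor or jump" branching.

```python
import random

def step(user, position, friends, locations, p_visit, rng):
    """Move one user: visit a random friend's location or jump at random."""
    if friends[user] and rng.random() < p_visit:
        friend = rng.choice(sorted(friends[user]))
        return position[friend]
    return rng.choice(locations)

rng = random.Random(0)  # fixed seed so the run is repeatable
locations = ["NYC", "Madrid", "Lisbon", "Troy"]
friends = {"u": {"v", "w"}, "v": {"u"}, "w": {"u"}}
position = {"u": "Troy", "v": "NYC", "w": "Madrid"}

# New position of u after one step.
position["u"] = step("u", position, friends, locations, p_visit=0.7, rng=rng)
```

Fitting such a model would then amount to choosing the probability that best reproduces the observed geo-social properties.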

Human Diffusion: how people move around on the map (J. R. Soc. Interface 12, 20150473 (2015))

Residents and Tourists

City Communities

Practical Session:

Environment Requirements: Anaconda and Python

Registering an Application

API basics

The Twitter module provides the OAuth interface; we just need to provide the right credentials.

Best to keep the credentials in a dict and parametrize our calls with the dict key, which makes it easy to switch accounts.

.Twitter(auth=auth) takes an OAuth instance as its auth argument and returns a Twitter object.

Authenticating with the API

In the remainder of this course, the accounts dict will live inside the file.
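
A minimal sketch of the authentication step, assuming the `twitter` package (Python Twitter Tools) is installed. The account name "social", the dict keys, and the placeholder values are hypothetical; real credentials come from the application registered with Twitter.

```python
# Hypothetical placeholder credentials, keyed by an account name.
accounts = {
    "social": {
        "api_key": "API_KEY",
        "api_secret": "API_SECRET",
        "token": "TOKEN",
        "token_secret": "TOKEN_SECRET",
    },
}

def make_client(accounts, name):
    """Build a Twitter object for the account stored under `name`."""
    import twitter  # third-party package, imported only when needed

    app = accounts[name]
    # OAuth takes the user token pair first, then the application key pair.
    auth = twitter.OAuth(
        app["token"], app["token_secret"], app["api_key"], app["api_secret"]
    )
    return twitter.Twitter(auth=auth)

# Usage (needs valid credentials): api = make_client(accounts, "social")
```

Switching accounts then only requires changing the key passed to make_client.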

4 basic types of objects: tweets, users, entities, places.

Searching for Tweets

.search.tweets(query, count)

  • query is the content to search for
  • count is the maximum number of results to return (from most recent tweets)

Returns a dict with a list of tweets under the ‘statuses’ key
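
A short sketch of a search call, assuming `api` is an authenticated Twitter object from the `twitter` package. With that package the query is passed as the keyword argument q. The helper name is hypothetical.

```python
def search_texts(api, query, count=10):
    """Return the text of the most recent tweets matching `query`."""
    results = api.search.tweets(q=query, count=count)
    return [status["text"] for status in results["statuses"]]

# Usage (needs an authenticated client): search_texts(api, "websci", count=5)
```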

Social Connections

.friends.ids() and .followers.ids() return a list of up to 5000 of a user’s friends or followers for a given screen_name or user_id.

The result is a dict containing multiple fields.
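
Among the fields of the result are ids and next_cursor, which allow paging through long friend lists. The sketch below assumes `api` is an authenticated Twitter object; the helper name and the max_pages cap are hypothetical, and rate limits are not handled.

```python
def all_friend_ids(api, screen_name, max_pages=3):
    """Collect friend ids page by page, following the cursor."""
    ids, cursor = [], -1  # a cursor of -1 asks for the first page
    for _ in range(max_pages):
        page = api.friends.ids(screen_name=screen_name, cursor=cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if cursor == 0:  # 0 means there are no more pages
            break
    return ids
```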

User Timeline

.statuses.user_timeline() returns a set of tweets posted by a single user.

Important options:

  • include_rts = ‘true’ to include retweets
  • count = 200 is the maximum number of tweets to return in each call
  • trim_user = ‘true’ to leave out the user information
  • max_id = 1234 to include only tweets with an id lower than or equal to 1234

Each call returns at most 200 tweets; multiple calls can retrieve up to 3200 of a user’s most recent tweets
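
Retrieving more than one batch means calling the timeline repeatedly and moving max_id backwards. A hedged sketch, assuming `api` is an authenticated Twitter object; include_rts is the API parameter that controls whether retweets are included, and the helper name is hypothetical. Rate limits are not handled.

```python
def fetch_timeline(api, screen_name, limit=3200):
    """Page backwards through a user's timeline using max_id."""
    tweets = []
    params = {"screen_name": screen_name, "count": 200,
              "include_rts": "true", "trim_user": "true"}
    while len(tweets) < limit:
        batch = api.statuses.user_timeline(**params)
        if not batch:
            break  # nothing older is available
        tweets.extend(batch)
        # Next call: only tweets strictly older than the oldest one seen.
        params["max_id"] = min(t["id"] for t in batch) - 1
    return tweets[:limit]
```

Subtracting one from the smallest id avoids fetching the oldest tweet of each batch twice.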

Social Interaction

Data processing builds on the user timeline

NetworkX:

High-productivity software for complex networks

Comes with Anaconda

Simple Python interface

Four different types of graphs

  • Graph—undirected graph
  • DiGraph—directed graph
  • MultiGraph—multi-edged graph
  • MultiDiGraph—multi-edged directed graph

Similar interface for all graphs

Nodes can be any type of Python object

Growing a graph: add nodes, edges, etc.

Graph Properties

  • .nodes() returns the nodes of the graph
  • .edges() returns the edges of the graph
  • .degree() returns the degree of each node, keyed by node; .in_degree() and .out_degree() do the same for a DiGraph
  • .is_connected()
  • .is_weakly_connected() / .is_strongly_connected()
  • .connected_components()
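
A small sketch of these calls on a toy graph, assuming NetworkX is installed and imported as nx. Wrapping results in list() and dict() keeps the example independent of the NetworkX version, since newer releases return views rather than plain lists and dicts. The connectivity checks are module-level functions that take the graph as an argument.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("a", "b")  # adding an edge also adds its nodes
G.add_edge("b", "c")
G.add_node("d")       # an isolated node

nodes = list(G.nodes())
degrees = dict(G.degree())
connected = nx.is_connected(G)
components = [sorted(c) for c in nx.connected_components(G)]

D = nx.DiGraph()
D.add_edge("a", "b")  # a directed edge from a to b
in_deg = dict(D.in_degree())
out_deg = dict(D.out_degree())
weak = nx.is_weakly_connected(D)
strong = nx.is_strongly_connected(D)
```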

Snowball Sampling:

Commonly used in Social Science and Computer Science

  • Start with a single node
  • Get its friends list
  • For each friend, get their friends list
  • Repeat for a fixed number of layers or until enough nodes are collected

Generates a graph with a single connected component
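
The steps above can be sketched as a breadth-first crawl. To keep the example runnable offline, the friend lookup is passed in as a function; with the Twitter API it would wrap a friends list call. The function name, the max_nodes cap, and the toy data are hypothetical.

```python
from collections import deque

def snowball(seed, get_friends, layers=2, max_nodes=1000):
    """Breadth-first crawl: return the edges found within `layers` hops."""
    depth = {seed: 0}
    edges = []
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == layers:
            continue  # nodes in the last layer are not expanded
        for friend in get_friends(node):
            if friend not in depth:
                if len(depth) >= max_nodes:
                    continue  # sample is full, skip unseen nodes
                depth[friend] = depth[node] + 1
                queue.append(friend)
            edges.append((node, friend))
    return edges

# Toy "who follows whom" data standing in for API calls.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"], "e": []}
edges = snowball("a", lambda n: toy.get(n, []), layers=2)
```

Because every node is reached from the seed, the sampled nodes all belong to one component when edge direction is ignored.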

Streaming Geocoded Data:

The streaming API provides real-time data, subject to filters

Use a TwitterStream object instead of a Twitter object

  • .statuses.filter(track = q) will return tweets that match the query q in real time
  • returns a generator that you can iterate over
  • .statuses.filter(locations = bb) will return tweets that occur within the bounding box bb in real time

bb is a comma-separated pair of lon/lat coordinates: the south-west corner followed by the north-east corner.
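
A hedged sketch of the location filter, assuming `stream` is a TwitterStream object from the `twitter` package created with valid credentials. The helper names and the limit parameter are hypothetical; the box used is a rough one around New York City.

```python
def bbox_string(west, south, east, north):
    """Format a bounding box as 'lon,lat,lon,lat', south-west corner first."""
    return ",".join(str(v) for v in (west, south, east, north))

def geotagged(stream, bb, limit=10):
    """Yield up to `limit` tweets with exact coordinates from the stream."""
    found = 0
    for tweet in stream.statuses.filter(locations=bb):
        if tweet.get("coordinates"):  # skip tweets without a GPS point
            yield tweet
            found += 1
            if found >= limit:
                break

nyc = bbox_string(-74, 40, -73, 41)
# Usage (needs credentials): for tweet in geotagged(stream, nyc): ...
```

Tweets matched only through their place have no exact coordinates, which is why the generator checks the coordinates field before yielding.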


Shapefiles:

Open specification developed by ESRI, still the current leader in commercial GIS software

Shapefiles aren’t actual files, but rather a set of files sharing the same name with different extensions.

The actual set of files changes depending on the contents, but 3 files are usually present:

  • .shp: also commonly referred to as the shapefile, contains the geometric information
  • .dbf: a simple database containing the feature attribute table
  • .shx: an index of the positions of the records in the .shp file



Pyshp defines utility functions to load and manipulate shapefiles programmatically.

The shapefile module handles the most common operations:

  • .Reader(filename) returns a reader object
  • reader.records() / reader.iterRecords()
  • reader.shapes() / reader.iterShapes()
  • reader.shapeRecords() / reader.iterShapeRecords()

shape objects contain several fields, for example:

  • bbox: the lower-left and upper-right x, y coordinates (lon/lat)
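
As a small illustration of the bbox field, the sketch below combines the bounding boxes of all shapes into a single one. It assumes pyshp is installed and that the shapes are polygons or polylines, since point shapes carry no bbox. The helper names are hypothetical, and no real file path is implied.

```python
def load_reader(path):
    """Open a shapefile with pyshp (third-party, imported only when needed)."""
    import shapefile

    return shapefile.Reader(path)

def overall_bbox(reader):
    """Combine the per-shape bboxes into one [xmin, ymin, xmax, ymax]."""
    boxes = [shape.bbox for shape in reader.iterShapes()]
    return [
        min(b[0] for b in boxes), min(b[1] for b in boxes),
        max(b[2] for b in boxes), max(b[3] for b in boxes),
    ]

# Usage (needs a shapefile on disk): overall_bbox(load_reader(path))
```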

Simple shapefile plot:


Shapely defines geometric objects under shapely.geometry:

                   Point, Polygon, MultiPolygon, shape()

And common operations:

                   .crosses, .contains, etc.

Shape objects provide useful fields to query a shape’s properties:

                    .centroid, .area, .bounds, etc.
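
A minimal sketch of these operations, assuming Shapely is installed. The rectangle is a made-up stand-in for a polygon that would normally come from a shapefile, and the two points are invented.

```python
from shapely.geometry import Point, Polygon

# A made-up rectangle roughly around Manhattan, as (lon, lat) pairs.
area = Polygon([(-74.03, 40.70), (-73.91, 40.70),
                (-73.91, 40.88), (-74.03, 40.88)])

points = [Point(-73.99, 40.73), Point(-73.70, 40.75)]

# Keep only the points that fall inside the polygon.
inside = [p for p in points if area.contains(p)]

bounds = area.bounds      # (minx, miny, maxx, maxy)
centroid = area.centroid  # a Point at the center of the rectangle
```

The same contains test is the core of filtering tweet coordinates against the polygons of a shapefile.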

Filter Points with a Shapefile:

Twitter Places:

Twitter defines a “coordinates” field in tweets

There is also a “place” field that we glossed over

The place object also contains geographic information, but at a coarser resolution than the coordinates field

Each place has a unique place_id, a bounding_box, and some geographical information such as country and full_name.

Places can be of several different types: admin, city, neighborhood, poi

Place attribute keys: street_address, phone, postal_code, region, iso3, twitter, url, app:id, etc.
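
The two location sources can be combined as in the sketch below, which works on a tweet already parsed into a dict. It prefers the exact coordinates and falls back to the center of the place bounding box. The function name is hypothetical, and averaging the corners is just one simple way to pick a representative point.

```python
def tweet_location(tweet):
    """Return (lon, lat, source), preferring exact coordinates over place."""
    if tweet.get("coordinates"):
        lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order
        return lon, lat, "coordinates"
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        # The bounding box is a polygon: a list holding one ring of corners.
        corners = place["bounding_box"]["coordinates"][0]
        lon = sum(c[0] for c in corners) / len(corners)
        lat = sum(c[1] for c in corners) / len(corners)
        return lon, lat, "place"
    return None
```

Keeping track of the source matters because a place-based location can be as coarse as a whole country.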

Filter Points and Places:

