WebSci ’17 Tutorial Notes: Analyzing Geolocated Data with Twitter
Speaker:
Prof. Bruno Gonçalves, New York University
(http://www.bgoncalves.com/)
Schedule
09:00 - 10:20 Theory session
10:30 - 12:00 Practical session
Theory Session:
GPS-enabled smartphone: provides precise geographic locations
January 2017 global digital snapshot
Social MEowDia Explained: different behaviors on different social media
Twitter:
Anatomy of a tweet: short text (Twitter started as a messaging system), hashtags, number of times shared, timestamp, location (comes from your GPS system), background info (metadata)
Metadata:
Text-content, User, Geo, URL, etc.
Geolocated Tweets:
Follows a user’s geo info over time
GPS Coordinates vs World Population
Smartphone ownership: highest among adults, higher education/income levels (results from survey)
Market Penetration: larger user group in higher GDP countries
Age Distribution
Demographics: ICWSM’11 375(2011)
Language and Geography: different languages show different geographic distributions, for example the Spanish and English distributions in NYC
Multilayer Network:
Retweet- information layers
|
Mention
|
Follower- social layers
Link Function: ICWSM’11, 89 (2011)
Cluster—retweets ~= agreement; mention ~= discussion
Retweets and mention have very different meanings
The Strength of Ties: chains of ties
Interviews were used to find out how individuals learned about opportunities
Mostly from acquaintances or friends of friends
The study argued that the degree of overlap of two individuals’ social networks varies directly with the strength of their tie to one another.
Neighborhood Overlap
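A minimal sketch of how neighborhood overlap can be computed. The notes do not give a formula, so this assumes one common definition: the number of neighbors shared by both endpoints of a tie divided by the number of nodes that are neighbors of at least one endpoint (the endpoints themselves excluded). The adjacency dict and the user names are made up for illustration.

```python
def neighborhood_overlap(adj, a, b):
    """Shared neighbors of a and b divided by neighbors of either one.

    adj maps each node to the set of its neighbors; a and b themselves
    are excluded from both neighborhoods (assumed definition).
    """
    na = adj[a] - {a, b}
    nb = adj[b] - {a, b}
    union = na | nb
    if not union:
        return 0.0
    return len(na & nb) / len(union)


# Hypothetical undirected friendship network.
adj = {
    "ana": {"bob", "cat", "dan"},
    "bob": {"ana", "cat", "eve"},
    "cat": {"ana", "bob"},
    "dan": {"ana"},
    "eve": {"bob", "fay"},
    "fay": {"eve"},
}

strong_tie = neighborhood_overlap(adj, "ana", "bob")  # they share "cat"
bridge_tie = neighborhood_overlap(adj, "bob", "eve")  # no shared neighbors
print(strong_tie, bridge_tie)
```

Under this definition a tie with zero overlap, like bob and eve here, behaves as a bridge between otherwise separate groups.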
Network Structures: arrows are retweets; clusters are different friendship communities; dots are users; some users serve as bridges between communities.
Links: internal, between groups, intermediary, etc.
Groups
Geography
Retweet- information layers
|
Mention
|
Follower- social layers
|
Geographic location
Twitter follower distance
Locality: measures the percentage of a user’s friends who live in the same country.
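A small sketch of the locality measure as described, assuming each friend has already been assigned a country code. Skipping friends with an unknown country is an assumption made here, not something stated in the session.

```python
def locality(user_country, friend_countries):
    """Percentage of friends living in the same country as the user.

    Friends whose country is unknown (None) are skipped, which is an
    assumption of this sketch. Returns None if no friend has a country.
    """
    known = [c for c in friend_countries if c is not None]
    if not known:
        return None
    same = sum(1 for c in known if c == user_country)
    return 100.0 * same / len(known)


# Hypothetical user in Portugal with five friends, one of unknown country.
example_locality = locality("PT", ["PT", "PT", "US", None, "ES"])
print(example_locality)
```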
Co-occurrences and social ties
Geotagged Flickr Photos
Divide the world into a grid and count the number of cells in which two individuals were both present within a given time interval
Measure: sharing photos in the same grid cell within a period of time is related to the likelihood of becoming friends
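The grid counting idea can be sketched as follows. The cell size in degrees, the time window in seconds, and the (timestamp, lon, lat) event format are all assumptions chosen for illustration; the original study may have used different choices.

```python
import math
from collections import defaultdict


def cell_of(lon, lat, cell_deg):
    """Map a lon/lat point to the integer index of its grid cell."""
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))


def cooccurrence_cells(events_a, events_b, cell_deg=1.0, max_gap_s=86400):
    """Count distinct cells where both users appear within max_gap_s seconds.

    Each event is a (timestamp_seconds, lon, lat) tuple.
    """
    times_b = defaultdict(list)
    for t, lon, lat in events_b:
        times_b[cell_of(lon, lat, cell_deg)].append(t)

    shared = set()
    for t, lon, lat in events_a:
        cell = cell_of(lon, lat, cell_deg)
        if any(abs(t - tb) <= max_gap_s for tb in times_b.get(cell, ())):
            shared.add(cell)
    return len(shared)


# Hypothetical photo events for two users.
user_a = [(0, -73.98, 40.75), (100000, 2.35, 48.85), (500000, -0.12, 51.50)]
user_b = [(3600, -73.50, 40.20), (900000, 2.10, 48.90), (510000, -0.90, 51.10)]
shared_cells = cooccurrence_cells(user_a, user_b)
print(shared_cells)
```

In this toy data the two users share three cells spatially, but only two of those visits fall within one day of each other.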
Mobility: school/work—home—vacation—move to different city/country
Airline Flights: in Europe within 24h
Commuting: train, subway, bus, etc.
Realistic Epidemic Spreading
Human Mobility: Statistical Model
Privacy (Sci Rep 3, 1376(2013))
How many indicators do we need to uniquely identify a person?
Mobility and Social Network (PLoS One 9, E92196 (2014))
Geo-Social Properties- Matrix of social behavior over distance: Probability of a link, reciprocity, Clustering, Triangle disparity
Geo-Social Model:
Starting position of user u
Visit a random neighbor or jump to a new location
New position of u
Model fitting: probability of visiting an old friend vs. meeting a new friend
Human Diffusion: how people move around on the map (J. R. Soc. Interface 12, 20150473 (2015))
Residents and Tourists
City Communities
Practical Session:
https://github.com/bmtgoncalves/WebSci17
Environment requirements: Anaconda and Python
Registering an Application
API basics
The twitter module provides the OAuth interface; we just need to provide the right credentials.
Best to keep the credentials in a dict and parametrize our calls with the dict key, which makes it easy to switch accounts.
.Twitter(auth) takes an OAuth instance as an argument and returns a Twitter object.
Authenticating with the API
In the remainder of this course, the accounts dict will live inside the twitter_accounts.py file.
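A sketch of what the credentials dict and authentication step might look like, assuming the Python Twitter Tools package (imported as twitter). The account name "social" and the four key names are illustrative and may differ from the ones in twitter_accounts.py; the placeholder strings must be replaced with the values from your registered application.

```python
# Hypothetical contents of twitter_accounts.py: one entry per application.
accounts = {
    "social": {
        "api_key": "YOUR_API_KEY",
        "api_secret": "YOUR_API_SECRET",
        "token": "YOUR_ACCESS_TOKEN",
        "token_secret": "YOUR_ACCESS_TOKEN_SECRET",
    },
}


def get_api(name, accounts=accounts):
    """Build an authenticated Twitter object for the named account."""
    # Imported here so the credentials dict can be loaded on its own.
    import twitter

    app = accounts[name]
    auth = twitter.OAuth(
        app["token"],
        app["token_secret"],
        app["api_key"],
        app["api_secret"],
    )
    return twitter.Twitter(auth=auth)


# Switching accounts is then just a matter of changing the key:
# twitter_api = get_api("social")
```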
4 basic types of objects: tweets, users, entities, places.
Searching for Tweets
.search.tweets(query, count) https://dev.twitter.com/docs/api/1.1/get/search/tweets
- query is the content to search for
- count is the maximum number of results to return (from most recent tweets)
Returns a dict with a list of tweets under the ‘statuses’ key
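A short sketch of handling the search result. The response below is a hand-made stand-in trimmed to a few fields, not real data; with a live client the same dict shape would come from the commented call, where twitter_api is assumed to be an authenticated Twitter object.

```python
def tweet_texts(results):
    """Pull the text of every tweet out of a search response dict."""
    return [status["text"] for status in results.get("statuses", [])]


# Hypothetical response, trimmed to the fields used here.
fake_results = {
    "statuses": [
        {"id": 2, "text": "hello #websci17"},
        {"id": 1, "text": "geolocated data"},
    ],
    "search_metadata": {"count": 2},
}
texts = tweet_texts(fake_results)
print(texts)

# With a live client:
# results = twitter_api.search.tweets(q="websci17", count=100)
# texts = tweet_texts(results)
```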
Social Connections
.friends.ids() and .followers.ids() return a list of up to 5000 of a user’s friends or followers for a given screen_name or user_id.
The result is a dict containing multiple fields.
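When a user has more connections than fit in one response, the result dict carries a next_cursor field that can be passed back to request the next page. A minimal sketch of that loop follows; the fake endpoint stands in for twitter_api.friends.ids so the logic can be run offline, and rate limits are not handled.

```python
def fetch_all_ids(endpoint, **params):
    """Follow next_cursor until the API reports there are no more pages."""
    ids = []
    cursor = -1  # -1 asks for the first page
    while cursor != 0:  # a next_cursor of 0 marks the last page
        page = endpoint(cursor=cursor, **params)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids


def fake_friends_ids(cursor=-1, **params):
    """Hypothetical two-page response used instead of a live call."""
    pages = {
        -1: {"ids": [1, 2, 3], "next_cursor": 7},
        7: {"ids": [4, 5], "next_cursor": 0},
    }
    return pages[cursor]


all_ids = fetch_all_ids(fake_friends_ids, screen_name="example_user")
print(all_ids)

# With a live client:
# all_ids = fetch_all_ids(twitter_api.friends.ids, screen_name="example_user")
```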
User Timeline
.statuses.user_timeline() returns a set of tweets posted by a single user.
Important options:
include_rts = ‘true’ to include retweets
count = 200 is the max # of tweets to return in each call
trim_user = ‘true’ to not include the user information
max_id = 1234 to include only tweets with an id lower than or equal to 1234
Returns at most 200 tweets in each call; up to 3200 of a user’s most recent tweets can be retrieved with multiple calls
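The multiple-call pattern can be sketched as a loop that lowers max_id after each page. The fake timeline function below stands in for twitter_api.statuses.user_timeline so the paging logic can be checked offline; the user name and the 450 tweet ids are made up.

```python
def fetch_timeline(fetch_page, limit=3200, **params):
    """Page backwards through a timeline, 200 tweets at a time."""
    tweets = []
    max_id = None
    while len(tweets) < limit:
        if max_id is not None:
            params["max_id"] = max_id
        page = fetch_page(count=200, **params)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return tweets[:limit]


def fake_user_timeline(count=200, max_id=None, **params):
    """Hypothetical timeline of 450 tweets, newest first."""
    all_ids = range(450, 0, -1)
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:count]]


timeline = fetch_timeline(fake_user_timeline, screen_name="example_user")
print(len(timeline))

# With a live client:
# timeline = fetch_timeline(twitter_api.statuses.user_timeline,
#                           screen_name="example_user", include_rts="true")
```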
Social Interaction
Data processing extended from user timeline
NetworkX–networkx_demo.py
High-productivity software for complex networks
Comes with Anaconda
Simple python interface
Four different types of graphs
- Graph—undirected graph
- DiGraph—directed graph
- MultiGraph—multi-edged graph
- MultiDiGraph—multi-edged directed graph
Similar interface for all graphs
Nodes can be any type of python object
Growing graph—add nodes, edges, etc.
Graph Properties
- .nodes() returns a list of nodes
- .edges() returns a list of edges
- .degree() returns a dict with each node’s degree; .in_degree()/.out_degree() for DiGraph
- .is_connected()
- .is_weakly/strongly_connected()
- .connected_components()
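A brief sketch of the properties listed above on a toy graph; this is not the content of networkx_demo.py. It assumes a NetworkX installation in which the connectivity checks are module-level functions (nx.is_connected(G) and so on) and wraps degree results in dict() so the code does not depend on whether a dict or a view is returned.

```python
import networkx as nx

# Undirected toy graph with three components: a-b-c, d-e, and isolated f.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("d", "e")])
G.add_node("f")

nodes = sorted(G.nodes())
degrees = dict(G.degree())
connected = nx.is_connected(G)
components = sorted(sorted(c) for c in nx.connected_components(G))
print(nodes, degrees, connected, components)

# Directed version of the a-b-c chain.
D = nx.DiGraph()
D.add_edges_from([("a", "b"), ("b", "c")])

in_degrees = dict(D.in_degree())
out_degrees = dict(D.out_degree())
weak = nx.is_weakly_connected(D)
strong = nx.is_strongly_connected(D)
print(in_degrees, out_degrees, weak, strong)
```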
Snowball Sampling–snowball.py
Commonly used in Social Science and Computer Science
- Start with a single node
- Get friends list
- For each friend get the friend list
- Repeat for a fixed number of layers or until enough
Generates a connected graph (a single connected component)
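The steps above can be sketched as a breadth-first expansion. This is not the code in snowball.py: get_friends is a hypothetical callable standing in for an API lookup, and the friends dict is made-up data so the sampling logic can be run offline.

```python
def snowball(seed, get_friends, layers=2, max_nodes=1000):
    """Expand outward from seed for a fixed number of layers.

    Returns the set of sampled users and the set of (user, friend) edges.
    Expansion also stops adding users once max_nodes have been collected.
    """
    seen = {seed}
    edges = set()
    frontier = [seed]
    for _ in range(layers):
        next_frontier = []
        for user in frontier:
            for friend in get_friends(user):
                if friend not in seen:
                    if len(seen) >= max_nodes:
                        continue  # sample is full, skip new users
                    seen.add(friend)
                    next_frontier.append(friend)
                edges.add((user, friend))
        frontier = next_frontier
    return seen, edges


# Hypothetical friend lists.
friends = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["e"], "e": []}
sample_nodes, sample_edges = snowball("a", lambda u: friends.get(u, []), layers=2)
print(sorted(sample_nodes), sorted(sample_edges))
```

Because every user is reached through someone already in the sample, the result is connected when edge direction is ignored.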
Streaming Geocoded data–twitter_location.py
The Streaming API provides real-time data, subject to filters
Use a TwitterStream object instead of a Twitter object
- .statuses.filter(track = q) will return tweets that match the query q in real time
- returns a generator that you can iterate over
- .statuses.filter(locations = bb) will return tweets that occur within the bounding box bb in real time
bb is a comma-separated pair of lon/lat coordinates (southwest corner first, then northeast).
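A sketch of building the bounding box string and reading coordinates from streamed tweets; this is not the content of twitter_location.py. The New York box values are approximate and only illustrative, and the live part is left commented because it needs credentials (auth is assumed to be a twitter.OAuth instance).

```python
def bbox_param(west, south, east, north):
    """Format a bounding box as 'lon,lat,lon,lat', southwest corner first."""
    return ",".join(str(v) for v in (west, south, east, north))


def point_of(message):
    """Return (lon, lat) for a geolocated tweet, or None otherwise.

    Streams can also deliver non-tweet messages, and tweets matched only
    through their place have no exact coordinates, hence the checks.
    """
    geo = message.get("coordinates")
    if not geo:
        return None
    lon, lat = geo["coordinates"]
    return lon, lat


# Rough, illustrative box around New York City.
nyc = bbox_param(-74.26, 40.48, -73.70, 40.92)
print(nyc)

sample = {"text": "hi", "coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]}}
print(point_of(sample))

# With live credentials:
# import twitter
# stream = twitter.TwitterStream(auth=auth)
# for message in stream.statuses.filter(locations=nyc):
#     print(point_of(message))
```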
Shapefiles
Open specification developed by ESRI, still the current leader in commercial GIS software
Shapefiles aren’t actual files, but rather a set of files sharing the same name with different extensions.
The actual set of files changes depending on the contents, but 3 files are usually present:
- .shp: also commonly referred to as "the shapefile"; contains the geometric info
- .dbf: a simple database containing the feature attribute table
- .shx: an index of the geometry records in the .shp file
QGIS
Pyshp–shapefile_load.py
Pyshp defines utility functions to load and manipulate shapefiles programmatically.
The shapefile module handles the most common operations:
- .Reader(filename) returns a Reader object
- reader.records()/iterRecords()
- reader.shapes()/iterShapes()
- reader.shapeRecords()/iterShapeRecords()
shape objects contain several fields:
bbox: lower-left and upper-right x, y coordinates (lon/lat)
…
Simple shapefile plot–plot_shapefile.py
Shapely–shapefile_shape_properties.py
Shapely defines geometric objects under shapely.geometry
Point, Polygon, MultiPolygon, shape()
And common operations
.crosses, .contains, etc.
shape objects provide useful fields to query a shape’s properties:
.centroid, .area, .bounds, etc.
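A small sketch of these objects and properties on a hand-made square, assuming Shapely is installed; it is not the content of shapefile_shape_properties.py. The last lines show shape() building a geometry from a GeoJSON-like dict, which is the usual route for converting shapes loaded from a shapefile.

```python
from shapely.geometry import Point, Polygon, shape

# A 2 x 2 square with its lower-left corner at the origin.
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])

area = square.area
bounds = square.bounds  # (min x, min y, max x, max y)
center = square.centroid
print(area, bounds, center.x, center.y)

# Point-in-polygon tests, the basis for filtering tweets by region.
inside = square.contains(Point(1, 1))
outside = square.contains(Point(3, 1))
print(inside, outside)

# shape() accepts a GeoJSON-like mapping and returns the matching geometry.
point = shape({"type": "Point", "coordinates": (0.5, 0.5)})
print(square.contains(point))
```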
Filter Points with a shapefile–shapefile_filter.py
Twitter Places–shapefile_filter_places.py
Twitter defines a “coordinates” field in tweets
There is also a place field that we glossed over
The place object also contains geographic info, but at a coarser resolution than the coordinates field
Each place has a unique place_id, a bounding_box and some geographical information such as country and full_name.
Places can be of several different types: admin, city, neighborhood, poi
Place Attributes: Key, street_address, phone, post_code, region, iso3, twitter, URL, App:id, etc.
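A sketch of reducing a place to a single representative point by taking the center of its bounding box. The place dict below is invented and trimmed; it assumes the bounding_box is a GeoJSON-style polygon whose first ring lists [lon, lat] corners. Using the box center is a simplification chosen here, since a place carries no exact coordinates.

```python
def place_bounds(place):
    """Return (west, south, east, north) of a place's bounding box."""
    ring = place["bounding_box"]["coordinates"][0]
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return min(lons), min(lats), max(lons), max(lats)


def place_center(place):
    """Return the (lon, lat) center of a place's bounding box."""
    west, south, east, north = place_bounds(place)
    return (west + east) / 2.0, (south + north) / 2.0


# Hypothetical place object, trimmed to the fields used here.
example_place = {
    "id": "hypothetical_id",
    "place_type": "city",
    "full_name": "Example City, EX",
    "country": "Exampleland",
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [
            [[-74.0, 40.0], [-72.0, 40.0], [-72.0, 42.0], [-74.0, 42.0]]
        ],
    },
}
print(place_bounds(example_place), place_center(example_place))
```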
Filter points and places–plot_shapefile_points.py
Aggregation–shapefile_filter_aggregate.py