WebSci ’17 Tutorial Note– Analyzing Geolocated Data with Twitter

September 22nd, 2017


Prof. Bruno Gonçalves, New York University



09:00 -10:20 theory session

10:30 -12:00 practical session

Theory Session:

GPS-enabled smartphone: provides precise geographic locations

Jan,17 global digital snapshot

Social MEowDia Explained- different behaviors on different social media


Anatomy of a tweet: short (start as a message system), hashtag, how many times shared, timestamp, location (comes from your GPS system), background info—metadata,


Text-content, User, Geo, URL, etc.

Geolocated Tweets:

Follows a user’s geo info over time

GPS Coordinates vs World Population

Smartphone ownership—highest among adults, higher education/ income levels (results from survey)

Market Penetration: larger user group in higher GDP countries

Age Distribution

Demographics: ICWSM’11 375(2011)

Language and Geography: different languages show different distributions among geographic location, for example, Spanish and English distributions in NYC

Multilayer Network:

Retweet- information layers




Follower- social layers

Link Function–ICWSM’ 11, 89 (2011)

Cluster—retweets ~= agreement; mention ~= discussion

Retweets and mention have very different meanings

The Strength of Ties: chains of ties

Interviews to find out how individuals found out about opportunities

Mostly from acquaintance or friend of friends

It argued that the degree of overlap of two individual’s social networks varies directly with the strength of their tie to one another.

Neighborhood Overlap

Network Structures: arrows-retweets; cluster-different friendship communities; dots- users; people/user serves as a bridge between communities.

Links: internal, between groups, intermediary, etc.



Retweet- information layers




 Follower- social layers


Geographic location

Twitter follower distance

Locality: measures percentage of a user’s friend who lives in the same country.

Co-occurrences and social ties

Geotagged Flickr Photos

Divide the world into a grid, count number of cells on which two individuals were within a given interval

Measures: share photo within a period of time in the same grid – likelihood of becoming friends

Mobility: school/work—home—vacation—move to different city/country

Airline Flights: in Europe within 24h

Commuting: train, subway, bus, etc.

Realistic Epidemic Spreading

Human Mobility: Statistical Model

Privacy (Sci Rep 3, 1376(2013))

How many indicators we need to identify a unique person.

Mobility and Social Network (PLoS One 9, E92196 (2014))

Geo-Social Properties- Matrix of social behavior over distance: Probability of a link, reciprocity, Clustering, Triangle disparity

Geo-Social Model:

Starting position of user u

Visit a random neighbor                    jump to a new location

New position of u

Model fitting: probability of visiting old friend vs meeting new friend

Human Diffusion: how people are moving around on map (J.R.Sco. Interface 12, 20150473 (2015))

Residents and Tourists

City Communities

Practical Session:


Environment Requirement: anaconda & python

Registering an Application

API basics

The Twitter module provides the OAuth interface, we just need to provide the right credentials.

Best to keep the credentials in a dict and parametrize our calls with the dict keyswitch accounts.

.Twitter(auth) takes an OAuth instance as an argument and returns a Twitter object.

Authenticating with the API

In the remainder of this course, the accounts dict will live inside the twitter_accounts.py file.

4 basic types of objects: tweets, users, entities, places.

Searching for Tweets

.search.tweets(query, count)  https://dev.twitter.com/docs/api/1.1/get/search/tweets

  • query is the content to search for
  • count is the maximum number of results to return (from most recent tweets)

returns dict with a list of ‘statuses’

Social Connections

.friends.ids() and .followers.ids() returns a list of up to 500 of a user’s friends or followers for a given screen_name or user_id.

Results is a dict containing multi-fields.

User Timeline

.statuses.user_timeline() returns a set of tweets posted by a single user.

Important options:
include_rst = ‘true’ to include retweet

Count = 200 is max # of tweets to return in each call

Trim_user = ‘true’ to not include the user information

Max_id = 1234 to include only tweets with an id lower than 1234

Return at most 200 tweets in each call, can get all of a user’s tweets up to 3200 with multiple calls

Social Interaction

Data processing extended from user timeline


High productive software for complex network

Come with anaconda

Simple python interface

Four different types of graphs

  • Graph—undirected graph
  • DiGraph—directed graph
  • MultiGraph—multi-edged graph
  • MultiDiGraph—multi-edged directed graph

Similar interface for all graphs

Nodes can be any type of python object

Growing graph—add nodes, edges, etc.

Graph Properties

  • .nodes() return a list nodes
  • .edges()
  • .degree() return a dict with each node degree .in_degree()/ .out_degree() for DiGraph
  • .is_connected()
  • .is_weakly/strongly_connected()
  • .connected_components()

Snowball Sampling–snowball.py

Commonly used in Social Science and Computer Science

  • Start with a single node
  • Get friends list
  • For each friend get the friend list
  • Repeat for a fixed number of layers or until enough

Generates a connected component graph

Streaming Geocoded data–twitter_location.py

The streaming api provides real time data, subject to filter

Use TwitterStream instead of Twitter object

  • .status.filter(track = 1) while return tweets that matches the query q in real time
  • return generator that you can iterate over
  • .status.filter(locations = bb) will return tweets that occur within the bounding box bb in real time

bb is a comma separated pair of lon/lat coordinates.


Open specification developed by ESRI, still the current leader in the commercial GIS software

Shapefiles aren’t actual files

But actually a set of files sharing the same name but with different extensions.

The actual set of files changes depending on the contents, but 3 files are usually present:

  • .shp—also commonly referred to as the shapefile contains geometric info
  • .dbf—a simple database containing the feature attribute table
  • .shx—a spatial index



Pyshp defines utility functions to load and manipulate shapefiles programmatically.

The shapefile module handles the most common operations:

  • .reader(filename) return a reader object
  • reader.records()/iterRecords()
  • reader.shapes()/iterShapes()
  • reader.shapeRecords()/iterShapeRecords()

shape objects contain several fields:
bbox lower left and upper right x,y coordinates (long/lat)

Simple shapefile plot–plot_shapefile.py


Shaplely defines geometric object under shapely.geometry

                   Points, polygon, multip-polygon, shapes()

And common operations

                   .crosses, .contains, etc..

shape object provides useful field to query a shapes properties:

                    .centroid, .area, .bounds, etc..

Filter Points with a shapefile–shapefile_filter.py

Twitter Places–shapefile_filter_places.py

Twitter defines a “coordinates” filed in tweets

There is also a place field that we glossed over

The place object contains also geographic info, but at a courser resolution than the coordinated filed

Each place has a unique place_id, a bouding_box and some geographical information such as country and full_name.

Places can be of several different types: admin, city, neighborhood, poi

Place Attributes: Key, street_address, phone, post_code, region, ios3, twitter, URL, App:id, etc.

Filter points and places–plot_shapefile_points.py   


VN:F [1.9.22_1171]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Author: Categories: tetherless world Tags:

Get Off Your Twitter

August 25th, 2017

Web Science, more so than many other disciplines of Computer Science, has a special focus on its humanist qualities – no surprise in that the Web is ultimately an instrument for human expression and cooperation. Naturally, lots of current research in Web Science centers on people and their patterns of behavior, making social media a potent source of data for this line of work.


Accordingly, much time has been devoted to analyzing social networks – perhaps to a fault. Much of the ACM’s Web Science ‘17 conference centered on social media; more specifically, Twitter. While it may sound harsh, the reality is that many of the papers presented at WebSci’17 could be reduced to the following pattern:

  1. There’s Lots of Political Polarization
  2. We Want to Explore the Political Landscape
  3. We Scraped Twitter
  4. We Ran (Sentiment Analysis/Mention Extraction/etc.)
  5. and We Found Out Something Interesting About the Political Landscape

Of the 57 submissions included in the WebSci’17 proceedings, 17 mention ‘Twitter’ or ‘tweet’ in the abstract or title; that’s about 3 out of every 10 submissions, including posters. By comparison, only seven mention Facebook, with some submissions mentioning both.


This isn’t to demean the quality or importance of such work; there’s a lot to be gained from using Twitter to understand the current political climate, as well as loosely quantifying cultural dynamics and understanding social networks. However, this isn’t the only topic in Web Science worth exploring, and Twitter certainly shouldn’t be the ultimate arbitrator of that discussion. While Twitter provides a potent means for understanding popular sentiment via a well-controlled dataset, it is still only a single service that attracts a certain type of user and is better for pithy sloganeering than it is for deep critical analysis, or any other form of expression that can’t be captured in 140 characters.


One of my fellow conference-goers also noticed this trend. During a talk on his submission to WebSci’17, Holge Holtzmann, a researcher from Germany working with Web archives, offered a truism that succinctly captures what I’m saying here: that Twitter ought not to be the only data source researchers are using when doing Web Science.


In fact, I would argue that Mr. Holtzmann’s focus, Web archives, could provide a much richer basis for testing our cultural hypotheses. While more old school, Web archives capture a much, much larger and more representative span of the Web from it’s inception to the dawn of social media than Twitter could ever hope to.


The winner for Best Paper speaks directly to the new possibilities offered by working with more diverse datasets. Applying a deep learning approach to Web archives, the authors examined the evolution of front-end Web design over the past two decades. Admittedly, I wasn’t blown away by their results; they claimed that their model had generated new Web pages in the style of different eras, but didn’t show an example, which was underwhelming. But that’s beside the point; the point is that this is a unique task which couldn’t be accomplished by leaning exclusively on Twitter or any other social media platform.


While I remain critical of the hyper-focus of the Web Science community on social media sites – and especially Twitter – as a seed for its work, I do admire the willingness to wade into cultural and other human-centric issues. This is a rare trait in technological disciplines in general, but especially fields of Computer Science; you’re far more likely to read about gains in deep reinforcement learning than you are to read about accommodating cultural differences in Web use (though these don’t necessarily exclude each other). To that point, the need to provide greater accessibility to the Web for disadvantaged groups and to preserve rapidly-disappearing Web content were widely noted, leaving me optimistic for the future of the field as a way of empowering everyone on the Web.


Now time to just wean ourselves off Twitter a bit…

VN:F [1.9.22_1171]
Rating: 9.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

WebSci ’17

August 14th, 2017

The Web Science Conference was hosted by Rensselaer Polytechnic Institute this year. The Tetherless World Constellation was heavily involved in organizing the event and ensuring the conference ran smoothly.The venue for the conference was the Franklin Plaza in downtown Troy. It was a great venue, with a beautiful rooftop.

On 25th June, there were a set of workshops organized for the attendees. I was a student volunteer at the “Algorithm Mediated Online Information Access (AMOIA)” workshop. We started the day off with a set of talks. The common theme for these talks were to reduce the bias in services we use online. We then spent the next few hours in a discussion on the “Role of recommendation algorithms in online hoaxes and fake news.”

Prof. Peter Fox and Prof Deborah McGuinness, who were the Main Conference Chairs, kicked off the Conference on 26th June. Steffen Staab gave his keynote talk on “The Web We Want“.  After the keynote talk, we jumped right into a series of talks. A few topics caught my attention during each session. Venkata Rama Kiran Garimella’s talk on “The Effect of Collective Attention on Controversial Debates on Social Media” was very interesting, as was the talk on “Recommendations for groups in location-based social networks” by Fred Ayala. We ended the talks with a Panel disscussion on “The ethics of doing Web Science”. After the panel discussions, we headed to the roof for some dinner and the Web Science Poster Session. There were plenty of Posters at the session. Congrui Li and Spencer Norris from TWC presented their work at the poster session.


27th of June was the day of the conference I was most looking forward to, since they had a session on “Networks : Structure, Identifiers, Search”. I found all the talk presented here very fascinating and useful. Particularly the talk “Herirachichal Change Point Detection” and “Adaptive Edge Probing” by Yu Wang and Sucheta Soundarajan respectively. I plan to use the work they presented in one of my current research projects. At the end of the day on 27th June, the award for the papers and posters were presented. Helena Webb won the best paper award. She presented her work on “The ethical challenges of publishing Twitter data for research dissemination”. Venkata Garimella won the best student paper award. Tetherless’ own Spencer Norris won the best poster award.

On 28th June, we started the day of by giving a set of talks on the topic chosen for the Hackthon, “Network Analysis for Non-Social Data”. Here I presented my work on how Network Analysis techniques can be leveraged and applied in the field of Earth Science. After these talk, the hackathon presentations were made by the participants. At lunch , Ahmed Eliesh from TWC won first place in the Hackathon. After lunch, we had the last 2 sessions at WebSci ’17. In these talks, Shawn Jones’ talk present Yasmin Alnomany’s work on “Generating Stories from Archived Collections” and Helena Webb’s best paper winning talk on “The ethical challenges of publishing Twitter data for research dissemination” piqued my interest.

Overall, attending the web science conference was a very valuable experience for me. There was plenty to learn, lots of networking opportunities and a generally jovial atmosphere around the conference. Here’s Looking forward to the next year’s conference in Amsterdam.



VN:F [1.9.22_1171]
Rating: 8.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

Do we have a magic flute for K-12 Web Science?

October 27th, 2015

In early July of 2015, Tetherless World Constellation (TWC) opened its door for four young men of the 2015 summer program of Rensselaer Research Experience for High School Students. The program covered a period of four weeks and each student was asked to choose a small and focused topic for research experience. They each were also asked to prepared a poster and present it in public at the end of the program.

Web Science was the discipline chosen by the four high school students at TWC. Before their arrival several professors, research scientists and graduate students formed a mentoring group, and officially I was assigned the task to mentor two of the four students. Such a fresh experience! And then a question came up was: do we have a curriculum of Web Science for High School Students? And for a period of four weeks? We do have excellent textbooks for Semantic Web, Data Science, and more, but most of them are not for high school students. Also the ‘research centric’ feature of the summer program indicated that we should not focus only on teaching but perhaps needed to spend more time on advising a small research project.

My simple plan was, for week 1 we focused on basic concepts, for weeks 2 and 3 the students were assigned a specific topic taken from an existing project, and for week 4 we focused on result analysis, wrap up and poster preparation. A google doc was used to record the basic concepts, technical resources and assignments we introduced and discussed in week 1. I thought those materials could be a little bit more for the students, but to my surprise they took them up really fast, which gave me the confidence to assign them research topics from ongoing projects. One of the students was asked to do statistical analysis of records on the Deep Carbon Observatory Data Portal, and presented the results in interactive visualizations. The other student worked on the visualization of geologic time and connections to Web resources such as Wikipedia. Technologies used were RDF database, SPARQL query, JavaScript, D3.js and JSON data format.

Hope the short program has evoked the students’ interest to explore more and deeper in Web Science. Some of them will soon graduate from high school and go to universities. Wish them good luck!

VN:F [1.9.22_1171]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

Historic launch of the Global Partnership for Sustainable Development Data

September 30th, 2015

An information email in early September from Simon Hodson, the CODATA Executive Director, attracted my deep interest. His email was about the high-level political launch for the Global Partnership for Sustainable Development Data. I was interested because I have worked on Open Data in the past few years and the experience shows that Open Data much more comprehensive than a sole technical issue. I was excited to see that there will be such an event initiated by political partners and focusing on social impacts. And thanks to the support from the CODATA Early Career Data Professionals Working Group, which made it possible for me to head to New York City to attend the forum in person on September 28th.

The forum was held in the Jade Room of the Waldorf Astoria hotel, and lasted for three hours from 2 to 5PM, with a tight but well-organized schedule of about 10 lightning talks, four panels and about 30 commitment introductions from the partners. The panels and lightning talks focused on why open data is needed, how to make data open and, especially, what and the value of open data for The 17 Global Goals for Sustainable Development and the social impact that the data can generate. I was happy to see that the successful stories of open geospatial data were mentioned several times in the lightening talks and the panels. For example, delegates from the World Resources Institute presented the Global Forest Watch-Fires (GFW-Fires), which provides near-real time information from various resources that can enable people to take prompt response before the fire be out of control. During the partner introductions, I heard more exciting news about the actions that the stakeholders in governments, academia, industry and non-profit organizations are going to take actions to support the joint efforts of the Global Partnership for Sustainable Development Data. For example, the Children’s Investment Fund Foundation will invest $20m to improve data on coverage of nutrition interventions and other key indicators by 2020 in several countries; the DigitalGlobe commits to provide three countries with evaluation licenses to their BaseMap service as well as training sessions for human resources; the Planet Labs commits $60 million in geospatial imagery to support the global community; and the William and flora Hewlett Foundation is proposing to commit about $3m to the start-up support of the secretariat for a Global Partnership for Sustainable Development Data. A list of the current partners is accessible on the partnership’s website.

The Global Partnership for Sustainable Development Data has a long-term vision for the year 2030: A world in which everyone is able to engage in solving the world’s greatest problems by (1) Effectively Using Data and (2) Fostering Trust and Accountability in the Sharing of Data. The pioneering partners in this effort have already committed to deliver more than 100 data driven projects worldwide to pave the pathway for the vision 2030. For the first year, the partnership will work together to achieve these goals: (1) Improve the Effective Use of Data, (2) Fill Key Data Gaps, (3) Expand Data Literacy and Capacity, (4) Increase Openness and leverage of Existing Data, and (5) Mobilize Political Will and Resources.

The forum was chaired by Prof. Sanjeev Khagram, with over 200 attendees from various backgrounds. During the reception time after the forum, I had a brief chat with Prof. Khagram about CODATA and also the Early Career Data Professionals Working Group, as well as the potential collaborations. He informed me that the partnership is open and invites broad participation to address the sustainable development goals. Prof. Khagram also mentioned that a bigger event, the World Data Forum, will take place in 2016. I also had the opportunity to catch up with Dr. Bob Chen from CIESIN, Columbia University about recent activities. It seems that ‘climate change’ is the topic of focus for several conferences in the year 2015, such as the International Scientific Conference, the Research Data Alliance Sixth Plenary Meeting and the United Nations Climate Change Conference, and Paris is the city for all these three events.

The report A World That Counts: Mobilising The Data Revolution for Sustainable Development, prepared by the United Nation Secretary-General’s Independent Expert Advisory Group on a Data Revolution for Sustainable Development, provides more background information about the Global Partnership for Sustainable Development Data.

VN:F [1.9.22_1171]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)