Archive for the ‘Web Science’ Category

Get Off Your Twitter

August 25th, 2017

Web Science, more so than many other disciplines of Computer Science, has a special focus on its humanist qualities – no surprise in that the Web is ultimately an instrument for human expression and cooperation. Naturally, lots of current research in Web Science centers on people and their patterns of behavior, making social media a potent source of data for this line of work.


Accordingly, much time has been devoted to analyzing social networks – perhaps to a fault. Much of the ACM’s Web Science ‘17 conference centered on social media; more specifically, Twitter. While it may sound harsh, the reality is that many of the papers presented at WebSci’17 could be reduced to the following pattern:

  1. There’s Lots of Political Polarization
  2. We Want to Explore the Political Landscape
  3. We Scraped Twitter
  4. We Ran (Sentiment Analysis/Mention Extraction/etc.)
  5. and We Found Out Something Interesting About the Political Landscape

Of the 57 submissions included in the WebSci’17 proceedings, 17 mention ‘Twitter’ or ‘tweet’ in the abstract or title; that’s about 3 out of every 10 submissions, including posters. By comparison, only seven mention Facebook, with some submissions mentioning both.


This isn’t to demean the quality or importance of such work; there’s a lot to be gained from using Twitter to understand the current political climate, as well as loosely quantifying cultural dynamics and understanding social networks. However, this isn’t the only topic in Web Science worth exploring, and Twitter certainly shouldn’t be the ultimate arbitrator of that discussion. While Twitter provides a potent means for understanding popular sentiment via a well-controlled dataset, it is still only a single service that attracts a certain type of user and is better for pithy sloganeering than it is for deep critical analysis, or any other form of expression that can’t be captured in 140 characters.


One of my fellow conference-goers also noticed this trend. During a talk on his submission to WebSci’17, Holge Holtzmann, a researcher from Germany working with Web archives, offered a truism that succinctly captures what I’m saying here: that Twitter ought not to be the only data source researchers are using when doing Web Science.


In fact, I would argue that Mr. Holtzmann’s focus, Web archives, could provide a much richer basis for testing our cultural hypotheses. While more old school, Web archives capture a much, much larger and more representative span of the Web from it’s inception to the dawn of social media than Twitter could ever hope to.


The winner for Best Paper speaks directly to the new possibilities offered by working with more diverse datasets. Applying a deep learning approach to Web archives, the authors examined the evolution of front-end Web design over the past two decades. Admittedly, I wasn’t blown away by their results; they claimed that their model had generated new Web pages in the style of different eras, but didn’t show an example, which was underwhelming. But that’s beside the point; the point is that this is a unique task which couldn’t be accomplished by leaning exclusively on Twitter or any other social media platform.


While I remain critical of the hyper-focus of the Web Science community on social media sites – and especially Twitter – as a seed for its work, I do admire the willingness to wade into cultural and other human-centric issues. This is a rare trait in technological disciplines in general, but especially fields of Computer Science; you’re far more likely to read about gains in deep reinforcement learning than you are to read about accommodating cultural differences in Web use (though these don’t necessarily exclude each other). To that point, the need to provide greater accessibility to the Web for disadvantaged groups and to preserve rapidly-disappearing Web content were widely noted, leaving me optimistic for the future of the field as a way of empowering everyone on the Web.


Now time to just wean ourselves off Twitter a bit…

VN:F [1.9.22_1171]
Rating: 9.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

WebSci ’17

August 14th, 2017

The Web Science Conference was hosted by Rensselaer Polytechnic Institute this year. The Tetherless World Constellation was heavily involved in organizing the event and ensuring the conference ran smoothly.The venue for the conference was the Franklin Plaza in downtown Troy. It was a great venue, with a beautiful rooftop.

On 25th June, there were a set of workshops organized for the attendees. I was a student volunteer at the “Algorithm Mediated Online Information Access (AMOIA)” workshop. We started the day off with a set of talks. The common theme for these talks were to reduce the bias in services we use online. We then spent the next few hours in a discussion on the “Role of recommendation algorithms in online hoaxes and fake news.”

Prof. Peter Fox and Prof Deborah McGuinness, who were the Main Conference Chairs, kicked off the Conference on 26th June. Steffen Staab gave his keynote talk on “The Web We Want“.  After the keynote talk, we jumped right into a series of talks. A few topics caught my attention during each session. Venkata Rama Kiran Garimella’s talk on “The Effect of Collective Attention on Controversial Debates on Social Media” was very interesting, as was the talk on “Recommendations for groups in location-based social networks” by Fred Ayala. We ended the talks with a Panel disscussion on “The ethics of doing Web Science”. After the panel discussions, we headed to the roof for some dinner and the Web Science Poster Session. There were plenty of Posters at the session. Congrui Li and Spencer Norris from TWC presented their work at the poster session.


27th of June was the day of the conference I was most looking forward to, since they had a session on “Networks : Structure, Identifiers, Search”. I found all the talk presented here very fascinating and useful. Particularly the talk “Herirachichal Change Point Detection” and “Adaptive Edge Probing” by Yu Wang and Sucheta Soundarajan respectively. I plan to use the work they presented in one of my current research projects. At the end of the day on 27th June, the award for the papers and posters were presented. Helena Webb won the best paper award. She presented her work on “The ethical challenges of publishing Twitter data for research dissemination”. Venkata Garimella won the best student paper award. Tetherless’ own Spencer Norris won the best poster award.

On 28th June, we started the day of by giving a set of talks on the topic chosen for the Hackthon, “Network Analysis for Non-Social Data”. Here I presented my work on how Network Analysis techniques can be leveraged and applied in the field of Earth Science. After these talk, the hackathon presentations were made by the participants. At lunch , Ahmed Eliesh from TWC won first place in the Hackathon. After lunch, we had the last 2 sessions at WebSci ’17. In these talks, Shawn Jones’ talk present Yasmin Alnomany’s work on “Generating Stories from Archived Collections” and Helena Webb’s best paper winning talk on “The ethical challenges of publishing Twitter data for research dissemination” piqued my interest.

Overall, attending the web science conference was a very valuable experience for me. There was plenty to learn, lots of networking opportunities and a generally jovial atmosphere around the conference. Here’s Looking forward to the next year’s conference in Amsterdam.



VN:F [1.9.22_1171]
Rating: 8.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

Do we have a magic flute for K-12 Web Science?

October 27th, 2015

In early July of 2015, Tetherless World Constellation (TWC) opened its door for four young men of the 2015 summer program of Rensselaer Research Experience for High School Students. The program covered a period of four weeks and each student was asked to choose a small and focused topic for research experience. They each were also asked to prepared a poster and present it in public at the end of the program.

Web Science was the discipline chosen by the four high school students at TWC. Before their arrival several professors, research scientists and graduate students formed a mentoring group, and officially I was assigned the task to mentor two of the four students. Such a fresh experience! And then a question came up was: do we have a curriculum of Web Science for High School Students? And for a period of four weeks? We do have excellent textbooks for Semantic Web, Data Science, and more, but most of them are not for high school students. Also the ‘research centric’ feature of the summer program indicated that we should not focus only on teaching but perhaps needed to spend more time on advising a small research project.

My simple plan was, for week 1 we focused on basic concepts, for weeks 2 and 3 the students were assigned a specific topic taken from an existing project, and for week 4 we focused on result analysis, wrap up and poster preparation. A google doc was used to record the basic concepts, technical resources and assignments we introduced and discussed in week 1. I thought those materials could be a little bit more for the students, but to my surprise they took them up really fast, which gave me the confidence to assign them research topics from ongoing projects. One of the students was asked to do statistical analysis of records on the Deep Carbon Observatory Data Portal, and presented the results in interactive visualizations. The other student worked on the visualization of geologic time and connections to Web resources such as Wikipedia. Technologies used were RDF database, SPARQL query, JavaScript, D3.js and JSON data format.

Hope the short program has evoked the students’ interest to explore more and deeper in Web Science. Some of them will soon graduate from high school and go to universities. Wish them good luck!

VN:F [1.9.22_1171]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

Open Science in an Open World

December 21st, 2014

I began to think about a blog for this topic after I read a few papers about Open Codes and Open Data published in Nature and Nature Geoscience in November 2014. Later on I also noticed that the editorial office of Nature Geoscience made a cluster of articles themed on Transparency in Science (, which really created an excellent context for further discussion of Open Science.

A few weeks later I attended the American Geophysical Union (AGU) Fall Meeting at San Francisco, CA. That is used to be a giant meeting with more than 20,000 attendees. My personal focus is presentations, workshops and social activities in the group of Earth and Space Science Informatics. To summarize the seven-day meeting experience with a few keywords, I would choose: Data Rescue, Open Access, Gap between Geo and Info, Semantics, Community of Practice, Bottom-up, and Linking. Putting my AGU meeting experience together with thoughts after reading the Nature and Nature Geoscience papers, now it is time for me to finish a blog.

Besides incentives for data sharing and open source policies of scholarly journals, we can extend the discussion of software and data publication, reuse, citation and attribution by shedding more light on both technological and social aspects of an environment for open science.

Open science can be considered as a socio-technical system. One part of the system is a way to track where everything goes and another is a design of appropriate incentives. The emerging technological infrastructure for data publication adopts an approach analogous to paper publication and has been facilitated by community standards for dataset description and exchange, such as DataCite (, Open Archives Initiative-Object Reuse and Exchange ( and the Data Catalog Vocabulary ( Software publication, in a simple way, may use a similar approach, which calls for community efforts on standards for code curation, description and exchange, such as the Working towards Sustainable Software for Science ( Simply minting Digital Object Identifiers to codes in a repository makes software publication no difference from data publication (See also: . Attention is required for code quality, metadata, license, version and derivation, as well as metrics to evaluate the value and/or impact of a software publication.

Metrics underpin the design of incentives for open science. An extended set of metrics – called altmetrics – was developed for evaluating research impact and has already been adopted by leading publishers such as Nature Publishing Group ( Factors counted in altmetrics include how many times a publication has been viewed, discussed, saved and cited. It was very interesting to read some news about funders’ attention to altmetrics ( on my flight back from the AGU meeting – from the 12/11/2014 issue of Nature which I picked from the NPG booth at the AGU meeting exhibition hall. For a software publication the metrics might also count how often the code is run, the use of code fragments, and derivations from the code. A software citation indexing service – similar to the Data Citation Index ( of Thomson Reuters – can be developed to track citations among software, datasets and literature and to facilitate software search and access.

Open science would help everyone – including the authors – but it can be laborious and boring to give all the fiddly details. Fortunately fiddly details are what computers are good at. Advances in technology are enabling the categorization, identification and annotation of various entities, processes and agents in research as well as the linking and tracing among them. In our 06/2014 Nature Climate Change article we discussed the issue of provenance of global change research ( Those works on provenance capture and tracing further extend the scope of metrics development. Yet, incorporating those metrics in incentive design requires the science community to find an appropriate way to use them in research assessment. A recent progress is that NSF renamed Publications section as Products in the biographical sketch of funding applicants and allowed datasets and software to be listed ( To fully establish the technological infrastructure and incentive metrics for open science, more community efforts are still needed.

VN:F [1.9.22_1171]
Rating: 8.5/10 (6 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

Data Management – Serendipity in Academic Career

November 11th, 2014

A few days ago I began to think about the topic for a blog and the first reflection in my mind was ‘data management’ and then a Chinese poem sentence ‘无心插柳柳成荫’ followed. I went to Google for an English translation of that sentence and the result was ‘Serendipitiously’. Interesting, I never saw that word before and I had to use a dictionary to find that ‘serendipity’ means unintentional positive outcomes, which expresses the meaning of that Chinese sentence quite well. So, I regard data management as serendipity in my academic career. I think that’s because I was trained as a geoinformatics researcher through my education in China and the Netherlands, how it comes that most of my current time is being spent on data management?

One clue I could see is that I have been working on ontologies, vocabularies and conceptual models for geoscience data services, which is relevant to data management. Another more relevant clue is a symposium ‘Data Management in Research: A Challenging Issue’ organized at University of Twente campus in 2011 spring. Dr. David Rossiter, Ms. Marga Koelen, I and a few other ITC colleagues attend the event. That symposium highlighted both technical and social/cultural issues faced by the 3TU.Datacentrum (, a data repository for the three technological universities in the Netherlands. It is very interesting to see that several topics of my current work had already discussed in that symposium, whereas I paid almost no attention because I was completely focused on my vocabulary work at that time. Since now I am working on data management, I would like to introduce a few concepts relevant to it and the current social and technical trends.

Data management, in simple words, means what you will do with your datasets during and after a research. Conventionally, we treat paper as the ‘first class’ product of research and many scientists pay less attention to data management. This may lower the efficiency of research activities and hinder communications among research groups in different institutions. There is even a rumor that 80% of a scientist’s time is spent on data discovery, retrieval and assimilation, and only 20% of time is for data analysis and scientific discovery. An ideal situation is that reverse the allocation of time, but that requires efforts on both a technical infrastructure for data publication and a set of appropriate incentives to the data authors.

After coming to United States the first data repository caused my attention was the California Digital Library (CDL) (, which is similar to the services offered by 3TU.Datacentrum. I like the technical architecture CDL work not only because they provide a place for depositing datasets but also, and more importantly, they provide a series of tools and services ( to allow users to draft data manage plans to address funding agency requirements, to mint unique and persistent identifiers to published datasets, and to improve the visibility of the published datasets. The word data publication is derived from paper publication. By documenting metadata, minting unique identifiers (e.g., Digital Object Identifiers (DOIs)), and archiving copies of datasets into a repository, we can make a piece of published dataset similar to a piece of published paper. The identifier and metadata make the dataset citable, just like what we do with published papers. A global initiative, the DataCite, had been working on standards of metadata schema and identifier for datasets, and is increasing endorsed by data repositories across the word, including both CDL and 3TU.Datacentrum. A technological infrastructure for data publication is emerging, and now people begin to talk about the cultural change to treat data as ‘first class’ product of research.

Though funding agencies already require data management plans in funding proposals, such as the requirements of National Science Foundation in US and the Horizon 2020 in EU (A Google search with key word ‘data management’ and the name of the funding agency will help find the agency’s guidelines), The science community still has a long way to go to give data publication the same attention as what they do with paper publication. Various community efforts have been take to promote data publication and citation. The FORCE11 published the Joint Declaration of Data Citation Principles ( in 2013 to promote good research practice of citing datasets. Earlier than that, in 2012, the Federation of Earth Science Information Partners published Data Citation Guidelines for Data Providers and Archives (, which offers more practical details on how a piece of published dataset should be cited. In 2013, the Research Data Alliance ( was launched to build the social and technical bridges that enable open sharing of data, which enhances existing efforts, such as CODATA (, to promote data management and sharing.


To promote data citation, a number of publishers have launched so called data journals in recent years, such as Scientific Data ( of Nature Publishing Group, Geoscience Data Journal ( of Wiley, and Data in Brief ( of Elsevier. Such a data journal often has a number of affiliated and certified data repositories. A data paper allows the authors to describe a piece of dataset published in a repository. A data paper itself is a journal paper, so it is citable, and the dataset is also citable because there are associated metadata and identifier in the data repository. This makes data citation flexible (and perhaps confusing): you can cite a dataset by either citing the identifier of the associated data paper, or the identifier of the dataset itself, or both. More interestingly, a paper can cite a dataset, a dataset can cite a dataset, and a dataset can also cite paper (e.g., because the dataset may be derived from tables in a paper). The Data Citation Index ( launched by Thomson Reuters provides services to index the world’s leading data repositories, connect datasets to related literature indexed in the Web of Science database and to search and access data across subjects and regions.

Although there is such huge progress on data publication and citation, we are not yet there to fully treat data as ‘first class’ products of research. A recent good news is that, in 2013, the National Science Foundation renamed Publications section in biographical sketch of funding applicants as Products and allowed datasets and software to be listed there ( However, this is still just a small step. We hope more similar incentives appear in academia. For instance, even we have the Data Citation Index, are we ready to mix the data citation and paper citation to generate the H-index of a scientist? And even there is such an H-index, are we ready to use it in research assessment?

Data management involves so many social and technological issues, which make it quite different from pure technical questions in geoinformatics research. This is an enjoyable work and in the next step I may spend more time on data analysis, for which I may introduce a few ideas in another blog.

VN:F [1.9.22_1171]
Rating: 9.0/10 (2 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)