Archive for the ‘personal ramblings’ Category

Get Off Your Twitter

August 25th, 2017

Web Science, more so than many other disciplines of Computer Science, has a special focus on its humanist qualities – no surprise in that the Web is ultimately an instrument for human expression and cooperation. Naturally, lots of current research in Web Science centers on people and their patterns of behavior, making social media a potent source of data for this line of work.


Accordingly, much time has been devoted to analyzing social networks – perhaps to a fault. Much of the ACM’s Web Science ‘17 conference centered on social media; more specifically, Twitter. While it may sound harsh, the reality is that many of the papers presented at WebSci’17 could be reduced to the following pattern:

  1. There’s Lots of Political Polarization
  2. We Want to Explore the Political Landscape
  3. We Scraped Twitter
  4. We Ran (Sentiment Analysis/Mention Extraction/etc.)
  5. and We Found Out Something Interesting About the Political Landscape
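To make the pattern concrete, here is a minimal, purely illustrative sketch of steps 3–5 in Python. Everything in it is hypothetical: the tweets, handles, leanings, and the tiny sentiment lexicon are invented for illustration, and real work of this kind would use an actual scraper and a trained sentiment model rather than naive word counting.

```python
# Toy version of the "scrape Twitter, run sentiment analysis, say something
# about the political landscape" pipeline. The tweets and lexicon below are
# entirely made up for illustration.

import re
from collections import defaultdict

POSITIVE = {"great", "win", "love", "support"}
NEGATIVE = {"disaster", "corrupt", "sad", "fail"}

def sentiment(text):
    """Crude lexicon score: (# positive words) - (# negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mentions(text):
    """Extract @-handles mentioned in a tweet."""
    return re.findall(r"@(\w+)", text)

# Hypothetical pre-scraped tweets, tagged with an inferred political leaning.
tweets = [
    ("left",  "Love the new policy, great win! @senator_a"),
    ("right", "Total disaster, sad! @senator_b"),
    ("right", "We support @senator_b, great speech"),
]

# Step 5: aggregate sentiment by leaning.
by_leaning = defaultdict(list)
for leaning, text in tweets:
    by_leaning[leaning].append(sentiment(text))

avg = {k: sum(v) / len(v) for k, v in by_leaning.items()}
```

The point of the sketch is how short the distance is from raw tweets to a headline number, which is part of why this recipe is so popular.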

Of the 57 submissions included in the WebSci’17 proceedings, 17 mention ‘Twitter’ or ‘tweet’ in the abstract or title; that’s about 3 out of every 10 submissions, including posters. By comparison, only seven mention Facebook, with some submissions mentioning both.


This isn’t to demean the quality or importance of such work; there’s a lot to be gained from using Twitter to understand the current political climate, as well as loosely quantifying cultural dynamics and understanding social networks. However, this isn’t the only topic in Web Science worth exploring, and Twitter certainly shouldn’t be the ultimate arbiter of that discussion. While Twitter provides a potent means for understanding popular sentiment via a well-controlled dataset, it is still only a single service that attracts a certain type of user and is better suited to pithy sloganeering than to deep critical analysis, or any other form of expression that can’t be captured in 140 characters.


One of my fellow conference-goers also noticed this trend. During a talk on his submission to WebSci’17, Holge Holtzmann, a researcher from Germany working with Web archives, offered a truism that succinctly captures what I’m saying here: that Twitter ought not to be the only data source researchers are using when doing Web Science.


In fact, I would argue that Mr. Holtzmann’s focus, Web archives, could provide a much richer basis for testing our cultural hypotheses. While admittedly more old-school, Web archives capture a much larger and more representative span of the Web, from its inception to the dawn of social media, than Twitter could ever hope to match.


The winner for Best Paper speaks directly to the new possibilities offered by working with more diverse datasets. Applying a deep learning approach to Web archives, the authors examined the evolution of front-end Web design over the past two decades. Admittedly, I wasn’t blown away by their results: they claimed that their model had generated new Web pages in the style of different eras, but didn’t show a single example. But that’s beside the point; the point is that this is a unique task which couldn’t be accomplished by leaning exclusively on Twitter or any other social media platform.


While I remain critical of the hyper-focus of the Web Science community on social media sites – and especially Twitter – as a seed for its work, I do admire the willingness to wade into cultural and other human-centric issues. This is a rare trait in technological disciplines in general, and especially in fields of Computer Science; you’re far more likely to read about gains in deep reinforcement learning than about accommodating cultural differences in Web use (though the two don’t necessarily exclude each other). To that point, the need to make the Web more accessible to disadvantaged groups and to preserve rapidly disappearing Web content was widely noted, leaving me optimistic about the future of the field as a way of empowering everyone on the Web.


Now time to just wean ourselves off Twitter a bit…


My report on Open Government Data camp 2011

November 2nd, 2011

A few days ago I (Alvaro Graves) participated in the Open Government Data Camp 2011 in Warsaw, Poland, where people from different groups, organizations, and governments met to discuss issues related to Open Data at the government level. Here are some of the most important issues raised in these talks, in my opinion.

The current state of OGD

David Eaves, an activist who advises the city of Vancouver, Canada on Open Data issues, gave a keynote in which he described his views on the current state of the Open Data movement. First, it is striking that the success stories are no longer just a few; there are dozens (perhaps hundreds), at the national, regional, and local levels. Similarly, the term Open Government Data is becoming increasingly popular, which is good because it lets us stop explaining the ‘what’ and start focusing on the ‘how’.

Another interesting point is how the Open Government Data movement has already passed an inflection point: it is no longer seen as people making demands from the outside, but as people increasingly being invited to work on these initiatives from within government. For many, this change in perspective can be confusing, and it may create concerns about Open Data being absorbed into a bureaucratic system that makes Open Data initiatives impossible to implement. However, it is clear that for these changes to occur, the movement cannot refuse to collaborate with governments.

Local initiatives, by locals

A talk that I really liked was by Ton Zylstra, who lives in Enschede, the Netherlands, a city of only 150,000 inhabitants. He wanted an Open Data initiative there, but it was difficult to convince the authorities, so he and a group of people decided to start working on their own. Inviting a handful of hackers to a bar, they created their first application, which used data from Twitter, Foursquare, and the venues of a local festival. Eventually they convinced the municipal government that the default option for local data ought to be open.

From this experience, Ton drew several important lessons. You have to create something concrete, no matter how small: something that requires little funding (the first beers at the bar were free) and little time (no more than a couple of weeks). It does not matter whether it is original; there are great ideas out there that deserve to be copied and are very useful for the local community.

How Open Data died

Another very interesting keynote was by Chris Taggart, founder of OpenCorporates, who warned of the risks the Open Data movement is facing today. His main concern is how little impact Open Data has on society. For example, he mentioned that so far no one’s business depends on Open Data (this is not strictly true; there are a few out there, but I concede they are rare examples). In general, making data available is not enough; it needs to be used, whether in applications, by data journalists, or elsewhere. It is also fundamental to link different Open Data sites to each other (something quite uncommon in the movement), so that people can find more information. Finally, I liked his idea that if Open Data does not cause problems for incumbents, then it is not working.

Redefining what is public

Finally, another talk I found interesting presented the idea of Dave Rasiej, founder of Personal Democracy, and Nigel Shadbolt, professor at the University of Southampton, to redefine “the public” in terms of data that “is available on the Web in machine-processable formats.” That is, uploading a bunch of PDFs with scanned tables does not make that information public, because it is not easily accessible. This initiative raises the bar for what counts as public data, especially when compared to the FOIA (Freedom of Information Act), which allows you to request information from the government. Note that this applies to all information, as Rasiej so vehemently described it.

So… what did you talk about at OGDCamp?

In my case, I presented a system for publishing Linked Data called LODSPeaKr, which can be used for the rapid publication of government data and to create applications based on Linked Data. In the near future I will be writing more about this framework, but for now you can see my presentation here.


Unanticipated consequences: Saving

April 14th, 2011

I had a bizarre dream last night, one of those surreal shockers. The details aren’t important, but on waking up I realized that the dream’s theme was unanticipated consequences, and I knew I needed to write this post.

To set some context: I went to bed upset last night. I was upset at two things: one was an article on TechCrunch entitled “Five Open Questions For Data.gov Before We #SaveTheData,” the other was my own response to the article. I hope I can respond to the first and apologize for the second. I want to make one thing clear, however, before I start: I am a strong supporter of Data.gov. I think it is a great experiment in democracy resulting from bold leadership, and if it dies in the current budget cutting it will be an enduring embarrassment for the USA and a major loss to government transparency.

The article I was upset about was written by Kate Ray (@kraykray), an amazingly bright and articulate young woman who has made several very impressive videos and online articles that I am a fan of. She recently was one of the co-founders of “NerdCollider,” a website designed to bring intelligent discussion to interesting issues — an idea I support. I was proud to be an early contributor to one of their discussions, which asked “What would you change about Data.gov to get more people to care?”

In the TechCrunch blog post I mentioned above, Kate takes several quotes from this discussion and reflects on their import: is Data.gov taking some of the key issues into account? As a good reporter, Kate’s OpEd is actually quite objective; she reports on several comments made by people, including me, about issues the site faces in its effort to share government data. TechCrunch is a very influential site, and the article title has been tweeted and retweeted hundreds of times to hundreds of thousands of potential readers (congrats to Kate on this viral uptake), raising awareness of Congress’ narrow-minded goal of killing the project, which I guess is a good thing. Unfortunately, the choice of the word “Before” in “… Before we #savethedata” has a negative implication, and I’m hoping that doesn’t kill off the positive efforts that the #savethedata meme was designed to promote.

In her article, Kate brings up important issues, but what she doesn’t make clear is that most of the people she quotes are in fact strong supporters of the Open Government movement and fans of Data.gov. The seeming criticisms were actually constructive responses to the question of how we could get more people to care (a positive), not a list of what was wrong with the site and had to be fixed before it became useful. It’s already very useful, but like any new effort, there’s always room for improvement. However, those changes will never happen if the site is forced to go dark!

As I said, Kate’s article has been phenomenally well tweeted; in fact, if you look at #savethedata the stream is so filled with pointers to this article that one can no longer easily find the link to the petition created by the Sunlight Foundation to help stop the budget cuts, and that petition is where the #savethedata meme started (thanks @EllnMiller). Kate also doesn’t point to the great HuffPost article by @bethnoveck explaining why cutting the funding to this and other e-government sites would threaten American jobs, which was also being retweeted around the #savethedata meme.

So I hope this article doesn’t have the unanticipated consequence of helping cause the death of Data.gov by killing off awareness of its importance or losing the momentum on the petition that could save it.

But, as Arlo Guthrie used to say, “that’s not what I came here to talk about tonight…”

In my response to Kate’s article, I referred to her making factual errors. This is a horrible thing to accuse a young journalist of, and I was being unfair. The errors I wanted to point out were not in Kate’s piece, but in the chart chosen to go along with it. It appears to show a flatline in interest in Data.gov, using third-party figures (as Kate told me later in a separate tweet) on “unique visits.” I don’t know where that service gets its data, but the visitor numbers reported on Data.gov itself on a daily basis seem to show a much larger count with a more positive trend (over 180,000 visits in March). It’s unclear why there is this discrepancy (I suspect it lies in how the service figures uniqueness for sites it doesn’t control), but it is clear it isn’t Kate’s fault. She also cites the number of downloads in her article as 1.5M since Oct 2010, which is the number reported on Data.gov, but as of last week the site broke 2M downloads, and the number is trending up.

Anyway, I’m digressing again (occupational hazard of a college professor). The key point is that the errors are not Kate’s and that she was reflecting on what she found.

I also was upset that she quoted me out of context. In my NerdCollider response I made it clear I was supporting Data.gov and offering some constructive solutions to the question of how we could make the site better. As the quote appears in her piece, it looks like I’m saying the data is poorly organized on the site, but what I was actually saying is that given the incredible richness of the data sets available (Data.gov hosted over 300,000 datasets at last count!), we have to explore new ways to search for data. It’s a wonderful problem to have! But I did say what she quoted, and as she pointed out to me, correctly, one of the good things about NerdCollider is that the full context of the quotes is there to be cited. She’s right.

So just as I hope Kate’s piece doesn’t have the unanticipated consequence of hurting Data.gov, I hope my admittedly intemperate response doesn’t have the unanticipated consequence of hurting the reputation of this young potential online media star.

@kraykray – I apologize.


Why the term ‘data publication’?

December 14th, 2010

Over the last six months I have been present in at least ten distinct discussions around topics such as data publication, data citation, and data attribution. At first I was engaged in the topics, but very quickly I kept pausing and asking myself: what’s the use case (duh!)? What I was hearing was coming from ‘data people’ (yes, I am one of them). What I wanted to hear was: “I want to be cited for the datasets I spend a lot of time and intellectual effort collecting, calibrating and analyzing”, or “… really, I want to get credit for that as much as for the one or two publications I might get”. I’ve heard this; in fact, I’ve said it myself many times.

So what’s the problem? Well, when a researcher wants credit and citation for a piece of work, they prepare and publish a paper, a body of intellectual work. Our communities and disciplines have spent many centuries developing this approach. So, if what I really want is credit and citation for my data, why do I need to publish it? At present, many people are getting such credit, but in an informal way, such as narrative-level acknowledgement in the text of a paper, and not formally (Parsons, Duerr and Minster 2010, EOS). That’s as good as no acknowledgement unless someone sees it and records it somewhere. The mechanism for paper citation is now well established: I cite your paper in my paper, and your citation count increases and gets reported. If you are up for promotion or tenure or review and that count is taken into account, you get credit. It is the identification of the artifact that counts, not the fact that it is published. In short, the capability that is needed is a way to identify your data contribution and a way to record it (and thus count it). Identification and reference, that’s it.

Note that I am not writing about ‘publication data’, i.e. the data that is the foundation for figures, tables, and other descriptions in a published paper. I am all for that data being made available as part of the publication, but that is another story. I am addressing just regular data (collections/sets).

For now I am suggesting that there are other models for making data available, and one of them is the software release cycle/process. Alpha, pre-beta, beta, release candidate, release, revision, documentation, feedback, bug fixes… that is much more like the process that the data I know of actually goes through. Now, this may not be the right approach, but I think we should explore it, and others. I’m no longer in favour of just adopting a model (marriage) of convenience (publishing). We are savvy enough to take a step back and implement a model that meets the needs of the data scientists who deserve it most. Yes, there’s more to be said. Tag, you’re it.
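The release-cycle idea above can be sketched in code. This is purely my own illustration, not an established standard: the class, the stage names, and the `hdl:`-style identifier are all hypothetical. The key point it tries to capture is that a dataset gets a persistent identifier up front, so it is identifiable (and thus citable and countable) at every stage of its life cycle, not only after a formal "publication" event.

```python
# Sketch of a software-release-style life cycle for a dataset. All names
# here are invented for illustration.

from dataclasses import dataclass, field

STAGES = ["alpha", "beta", "release-candidate", "release", "revision"]

@dataclass
class DatasetRelease:
    identifier: str               # persistent ID used for citation (hypothetical form)
    stage: str = "alpha"
    history: list = field(default_factory=list)

    def advance(self):
        """Move to the next release stage, recording the transition."""
        i = STAGES.index(self.stage)
        if i + 1 < len(STAGES):
            self.history.append(self.stage)
            self.stage = STAGES[i + 1]
        return self.stage

    def citation(self):
        """A citable reference: the persistent identifier plus the current stage."""
        return f"{self.identifier} ({self.stage})"

ds = DatasetRelease("hdl:10.9999/example-dataset")  # hypothetical identifier
ds.advance()  # alpha -> beta
```

Because the identifier never changes, a citation made at the beta stage still resolves after the release stage; only the stage label (and recorded history) evolves.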


Some thoughts on the Google/Verizon deal

August 5th, 2010

As a long-time Web technologist, one of the many creators of the Semantic Web technology recently being put into use by Google, and a government expert on the Internet and Web, I find myself worried about the reported deal emerging between Google and Verizon. If, as reported, it would truly allow the differential handling of packets based on pay, then it would clearly be a threat to the net as we know it, and a potential disaster for the small start-ups and freelance Web developers that are so important to our technology’s ecosystem.

It is certainly within Google’s right to make money, and to use the Web technologies that were freely donated to the world by people like myself or, far more importantly, Tim Berners-Lee (a strong proponent of net neutrality). However, allowing preferential packet routing provides a means for the control and exploitation of these technologies that goes beyond their original intent.

The social effects are also quite worrying. I don’t see how a deal like this can avoid increasing the width of the digital divide between those who can afford enhanced service and those who cannot. It also seems likely to have different impacts in some societies than in others, making Web behaviors even less predictable, and more susceptible to government control, than they are today.

Within the US, practice has maintained net neutrality where legislators have been remiss and where the courts have rightly been unwilling to impose policy in the absence of legislation. The Google-Verizon deal has been reported by some Google fans as Google reaching a compromise with Verizon that might otherwise allow the latter to impose its own models, and by some others as Google clearly violating its own “do no evil” motto. Either way, it is a worrisome deal that is likely to set the precedent for many others, and to scare legislators away from doing what they should. As a candidate, President Obama committed to net neutrality, stating that he would be second to no one as a proponent of a free and open Internet. The Democratic Congress has not rallied to the President’s side on this, nor have the Republicans rallied to their stated goal of providing a fair playing field for startup industry. Large companies acting to set their own rules are likely to keep these gun-shy legislators from acting in a year when so many are fighting for re-election.

So I find myself joining those who are calling on Google, minimally, for more transparency into what is happening, and preferably to continue its own opposition to preferential charging. In 2006 Google urged Americans to “take action to preserve Internet freedom.” Today its policy blog is surprisingly silent on the reported negotiations. My motivation to call on Google for at least a response comes from their own call: it is an action I take to preserve Internet freedom.

Google is a corporation dependent on the Internet for communication, on start-ups for continued innovation, and on academic researchers to keep a flow of new Web technologies transitioning into practice. I hope it will heed its own call to action and do the right thing.
