Data.gov – it’s useful, but also could be better.
The “Nerd Collider” Web site invited me to be a “power nerd” and respond to the question “What would you change about Data.gov to get more people to care?” The whole discussion, including my response, can be found here. However, I hope people won’t mind my reprinting my response here, as the TWC blog gets aggregated to some important Linked Data/Semantic Web sites.
I was puzzling over how I wanted to respond until I saw the blog in the Guardian – http://www.guardian.co.uk/news/datablog/2011/apr/05/data-gov-crisis-obama – which also presents this flat line as a failure and cites, by contrast, the number of hits the Guardian.com website gets. This is such a massive apples vs. oranges error that I figure I should start there.
So, primarily, let’s think about what visits to a web page are about. For the Guardian, they are lots of people coming to read the different articles each day. However, for data.gov, there isn’t a lot of repeat traffic – the data feeds are updated on a relatively slow basis, and once you’ve downloaded some, you don’t have to go back for weeks or months until the next update. Further, for some of the rapidly changing data, like the earthquake data, there are RSS feeds, so once they are set up, one doesn’t return to the site. So my question is, are we looking at the right number?
In fact, the answer is no. If you want to see the real use of data.gov, take a look at the chart at http://www.data.gov/metric/visitorstats/monthlyredirecttrend – the total number of dataset downloads since 2009 is well over 1,000,000, and in February of this year (the most recent data available) there were over 100,000 downloads. So the 10k number appears to be tracking the wrong thing – the data is being downloaded, and that implies it is being used!
Could we do better? Yes, very much so. Here are the things I’m interested in seeing (and in working with the data.gov team to make available):
1 – Searching for data on the site is tough. Keyword search is not a good way to look for data (for lots of reasons), and thus we need better ways. Doing this really well is a research task I’ve got some PhD students working on, but even doing better than what is there now requires better metadata and a better approach. There is already work afoot at data.gov (assuming funding continues) to improve this significantly.
2 – Tools for using the data, and particularly for mashing it up, need to be more easily used and more widely available. My group makes a lot of info and tools available at http://logd.tw.rpi.edu – but a lot more is needed. This is where the developer community could really help.
3 – Tools to support community efforts (see the comment by Danielle Gould to this effect) are crucial – she says it better than I can so go read that.
4 – There are efforts by data.gov to create communities – these are hard to get going, but could be of great value in the long run. I suggest people look at these on the data.gov communities site, and think about how they could be improved to bring more use – I know the data.gov leadership team would love to get some good comments about that.
5 – We need to find ways to turn the data release into a “conversation” between government and users. I have discussed this with Vivek Kundra numerous times and he is a strong proponent (and we have thought about writing a paper on the subject if time ever allows). The British data.gov.uk site has some interesting ideas along this line, based on OpenStreetMap and similar projects, but I think one could do better. This is the real opportunity for “government 2.0” – a chance for citizens not just to comment on legislation, but to help make sure the data that informs the policy decisions is the best it can be.
So, to summarize, there are things we can do to improve data.gov, many of which are getting done. However, the numbers in the graph above are misleading, and don’t really reflect the true usage of data.gov per se, let alone that of the other sites, like the LOGD site I mention above, that are powered by data.gov.