Archive

Posts Tagged ‘Open data’

Open Science in an Open World

December 21st, 2014

I began to think about a blog post on this topic after I read a few papers about open code and open data published in Nature and Nature Geoscience in November 2014. Later on I also noticed that the editorial office of Nature Geoscience had assembled a collection of articles themed on Transparency in Science (http://www.nature.com/ngeo/focus/transparency-in-science/index.html), which created an excellent context for further discussion of Open Science.

A few weeks later I attended the American Geophysical Union (AGU) Fall Meeting in San Francisco, CA. It is typically a giant meeting, with more than 20,000 attendees. My personal focus was the presentations, workshops and social activities of the Earth and Space Science Informatics group. To summarize the seven-day meeting experience in a few keywords, I would choose: Data Rescue, Open Access, Gap between Geo and Info, Semantics, Community of Practice, Bottom-up, and Linking. Putting my AGU meeting experience together with my thoughts after reading the Nature and Nature Geoscience papers, it is now time for me to finish this blog post.

Besides incentives for data sharing and the open-source policies of scholarly journals, we can extend the discussion of software and data publication, reuse, citation and attribution by shedding more light on both the technological and the social aspects of an environment for open science.

Open science can be considered a socio-technical system. One part of the system is a way to track where everything goes; another is the design of appropriate incentives. The emerging technological infrastructure for data publication adopts an approach analogous to paper publication and has been facilitated by community standards for dataset description and exchange, such as DataCite (http://www.datacite.org), Open Archives Initiative-Object Reuse and Exchange (http://www.openarchives.org/ore) and the Data Catalog Vocabulary (http://www.w3.org/TR/vocab-dcat). Software publication, in a simple way, may use a similar approach, which calls for community efforts on standards for code curation, description and exchange, such as those of Working towards Sustainable Software for Science (http://wssspe.researchcomputing.org.uk). Simply minting Digital Object Identifiers for code in a repository makes software publication no different from data publication (See also: http://www.sciforge-project.org/2014/05/19/10-non-trivial-things-github-friends-can-do-for-science/). Attention is required for code quality, metadata, license, version and derivation, as well as metrics to evaluate the value and/or impact of a software publication.
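To give a concrete flavor of what a machine-readable dataset description looks like, here is a minimal sketch in Python using rdflib and the Data Catalog Vocabulary mentioned above. The dataset title, identifier, DOI and URLs are hypothetical placeholders, not a real record.

# Minimal sketch: describing a dataset with the W3C Data Catalog Vocabulary (DCAT)
# via rdflib. All identifiers and values below are hypothetical examples.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = URIRef("http://example.org/dataset/global-temperature-anomalies")
distribution = URIRef("http://example.org/dataset/global-temperature-anomalies/csv")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Global temperature anomalies (example)")))
g.add((dataset, DCTERMS.identifier, Literal("doi:10.0000/example-doi")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/files/anomalies.csv")))
g.add((distribution, DCAT.mediaType, Literal("text/csv")))

print(g.serialize(format="turtle"))

A comparable description for a software release could reuse the same pattern, adding fields for version, license and derivation as discussed above.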

Metrics underpin the design of incentives for open science. An extended set of metrics – called altmetrics – was developed for evaluating research impact and has already been adopted by leading publishers such as Nature Publishing Group (http://www.nature.com/press_releases/article-metrics.html). Factors counted in altmetrics include how many times a publication has been viewed, discussed, saved and cited. It was very interesting to read some news about funders’ attention to altmetrics (http://www.nature.com/news/funders-drawn-to-alternative-metrics-1.16524) on my flight back from the AGU meeting – in the 12/11/2014 issue of Nature, which I had picked up from the NPG booth in the AGU exhibition hall. For a software publication the metrics might also count how often the code is run, the use of code fragments, and derivations from the code. A software citation indexing service – similar to the Data Citation Index (http://wokinfo.com//products_tools/multidisciplinary/dci/) of Thomson Reuters – could be developed to track citations among software, datasets and literature and to facilitate software search and access.
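To make the idea concrete, the toy sketch below combines the kinds of counts mentioned above (views, saves, discussions, citations and, for software, runs) into a single illustrative score. The weights and counts are invented for illustration only and do not reproduce any real altmetrics provider's formula.

# Toy sketch of a composite usage score for a publication or software release.
# Weights and counts are hypothetical; real altmetrics use richer models.
WEIGHTS = {"views": 0.1, "saves": 0.5, "discussions": 1.0, "citations": 3.0, "runs": 0.2}

def composite_score(counts):
    """Weighted sum of the usage events recorded for one research product."""
    return sum(WEIGHTS.get(event, 0.0) * n for event, n in counts.items())

software_release = {"views": 420, "saves": 35, "discussions": 12, "citations": 4, "runs": 900}
print(composite_score(software_release))  # hypothetical numbers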

Open science would help everyone – including the authors – but it can be laborious and boring to record all the fiddly details. Fortunately, fiddly details are what computers are good at. Advances in technology are enabling the categorization, identification and annotation of the various entities, processes and agents in research, as well as the linking and tracing among them. In our June 2014 Nature Climate Change article we discussed the issue of provenance in global change research (http://www.nature.com/nclimate/journal/v4/n6/full/nclimate2141.html). Such work on provenance capture and tracing further extends the scope of metrics development. Yet incorporating those metrics in incentive design requires the science community to find an appropriate way to use them in research assessment. One recent step forward is that the NSF renamed the Publications section of funding applicants’ biographical sketches to Products and allowed datasets and software to be listed (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). To fully establish the technological infrastructure and incentive metrics for open science, more community efforts are still needed.
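As a small illustration of what capturing such provenance can look like in practice, here is a sketch that uses rdflib and the W3C PROV-O vocabulary to record that a (hypothetical) figure was generated by an analysis run that used a source dataset. The resource URIs and names are placeholders.

# Sketch: recording simple provenance (entity, activity, agent) with the
# W3C PROV-O vocabulary via rdflib. All URIs below are hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, PROV

EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)

figure = EX["figure-3"]           # the research output
analysis = EX["analysis-run-42"]  # the activity that produced it
dataset = EX["source-dataset"]    # the input it used
author = EX["researcher-jane"]    # hypothetical agent

g.add((figure, RDF.type, PROV.Entity))
g.add((analysis, RDF.type, PROV.Activity))
g.add((dataset, RDF.type, PROV.Entity))
g.add((author, RDF.type, PROV.Agent))

g.add((figure, PROV.wasGeneratedBy, analysis))
g.add((analysis, PROV.used, dataset))
g.add((analysis, PROV.wasAssociatedWith, author))
g.add((figure, PROV.wasDerivedFrom, dataset))

print(g.serialize(format="turtle"))

Traces like this, accumulated across analyses, are what would let the metrics discussed above count downstream reuse rather than just downloads.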


Data.gov – it’s useful, but it could also be better.

April 5th, 2011

The “Nerd Collider” Web site invited me to be a “power nerd” and respond to the question “What would you change about Data.gov to get more people to care?” The whole discussion, including my response, can be found here. However, I hope people won’t mind my reprinting my response here, as the TWC blog is aggregated by some important Linked Data/Semantic Web sites.

My response:

I was puzzling over how I wanted to respond until I saw the blog in the Guardian – http://www.guardian.co.uk/news/datablog/2011/apr/05/data-gov-crisis-obama – which also reads this flat line as a failure and contrasts it with the number of hits the Guardian.com website gets. This is such a massive apples-vs.-oranges comparison that I figure I should start there.

So, primarily, let’s think about what visits to a web page are about — for the Guardian, they are lots of people coming to read the different articles each day. However, for data.gov, there isn’t a lot of repeat traffic – the data feeds are updated on a relatively slow basis, and once you’ve downloaded some data, you don’t have to go back for weeks or months until the next update. Further, for some of the rapidly changing data, like the earthquake data, there are RSS feeds, so once those are set up, one doesn’t return to the site at all. So my question is: are we looking at the right number?

In fact, the answer is no — if you want to see the real use of data.gov, take a look at the chart at http://www.data.gov/metric/visitorstats/monthlyredirecttrend — the total number of dataset downloads since 2009 is well over 1,000,000, and in February of this year (the most recent data available) there were over 100,000 downloads — so the 10k visit number appears to be tracking the wrong thing: the data is being downloaded, and that implies it is being used!

Could we do better? Yes, very much so. Here are the things I’m interested in seeing (and working with the data.gov team to make available):

1 – Searching for data on the site is tough — keyword search is not a good way to look for data (for lots of reasons), and thus we need better ways. Doing this really well is a research task I’ve got some PhD students working on, but even doing better than what is there now requires better metadata and a better approach. There is already work afoot at data.gov (assuming funding continues) to improve this significantly.

2 – Tools for using the data, and particularly for mashing it up, need to be easier to use and more widely available (a small sketch of such a mash-up follows this list). My group makes a lot of info and tools available at http://logd.tw.rpi.edu – but a lot more is needed. This is where the developer community could really help.

3 – Tools to support community efforts (see the comment by Danielle Gould to this effect) are crucial – she says it better than I can so go read that.

4 – There are efforts by data.gov to create communities – these are hard to get going, but could be of great value in the long run. I suggest people look at these on the data.gov communities site and think about how they could be improved to drive more use – I know the data.gov leadership team would love to get some good comments about that.

5 – We need to find ways to turn data releases into a “conversation” between government and users. I have discussed this with Vivek Kundra numerous times and he is a strong proponent (and we have thought about writing a paper on the subject if time ever allows). The British data.gov.uk site has some interesting ideas along this line, based on OpenStreetMap and similar projects, but I think one could do better. This is the real opportunity for “government 2.0” – a chance for citizens not just to comment on legislation, but to help make sure the data that informs policy decisions is the best it can be.
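Following up on point 2 above, here is a minimal sketch of the kind of mash-up those tools aim to make easy: joining two downloaded data.gov CSV files on a shared state column with pandas. The file names and column names are assumptions for illustration, not actual data.gov datasets.

# Minimal data mash-up sketch with pandas: join two hypothetical CSV files
# downloaded from data.gov on a shared "state" column, then derive a new measure.
import pandas as pd

spending = pd.read_csv("federal_spending_by_state.csv")   # hypothetical download
population = pd.read_csv("population_by_state.csv")       # hypothetical download

merged = spending.merge(population, on="state", how="inner")
merged["spending_per_capita"] = merged["total_spending"] / merged["population"]

print(merged.sort_values("spending_per_capita", ascending=False).head(10))

Making steps like these one-click easy, with the provenance of each source preserved, is exactly where developer tooling around data.gov could add value.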

So, to summarize, there are things we can do to improve data.gov, and many of them are getting done. However, the numbers in the graph above are misleading: they don’t really reflect the true usage of data.gov itself, let alone that of other sites, such as the LOGD site I mention above, which are powered by data.gov.
