Archive for the ‘linked data’ Category

Open Science in an Open World

December 21st, 2014

I began to think about a blog post on this topic after I read a few papers about open code and open data published in Nature and Nature Geoscience in November 2014. Later I also noticed that the editorial office of Nature Geoscience had assembled a cluster of articles on Transparency in Science (http://www.nature.com/ngeo/focus/transparency-in-science/index.html), which created an excellent context for further discussion of Open Science.

A few weeks later I attended the American Geophysical Union (AGU) Fall Meeting in San Francisco, CA. It is a giant meeting, with more than 20,000 attendees. My personal focus was the presentations, workshops and social activities of the Earth and Space Science Informatics group. To summarize the seven-day meeting experience in a few keywords, I would choose: Data Rescue, Open Access, the Gap between Geo and Info, Semantics, Community of Practice, Bottom-up, and Linking. Putting my AGU meeting experience together with my thoughts on the Nature and Nature Geoscience papers, it is now time for me to finish a blog post.

Besides incentives for data sharing and the open source policies of scholarly journals, we can extend the discussion of software and data publication, reuse, citation and attribution by shedding more light on both the technological and the social aspects of an environment for open science.

Open science can be considered a socio-technical system. One part of the system is a way to track where everything goes; another is a design of appropriate incentives. The emerging technological infrastructure for data publication adopts an approach analogous to paper publication and has been facilitated by community standards for dataset description and exchange, such as DataCite (http://www.datacite.org), Open Archives Initiative-Object Reuse and Exchange (http://www.openarchives.org/ore) and the Data Catalog Vocabulary (http://www.w3.org/TR/vocab-dcat). Software publication may, in a simple way, use a similar approach, which calls for community efforts on standards for code curation, description and exchange, such as Working towards Sustainable Software for Science (http://wssspe.researchcomputing.org.uk). Simply minting Digital Object Identifiers for code in a repository makes software publication no different from data publication (see also: http://www.sciforge-project.org/2014/05/19/10-non-trivial-things-github-friends-can-do-for-science/). Attention is required to code quality, metadata, licenses, versions and derivation, as well as to metrics for evaluating the value and/or impact of a software publication.
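To give a feel for what such a standardized dataset description looks like, here is a minimal DCAT-style record serialized as JSON-LD. All identifiers and values below (the DOI, title, and URLs) are made up for illustration; real records carry far richer metadata.

```python
import json

# A minimal DCAT dataset description as JSON-LD.
# Every identifier and value here is illustrative, not a real record.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://doi.org/10.0000/example-dataset",  # hypothetical DOI
    "@type": "dcat:Dataset",
    "dct:title": "Example surface temperature observations",
    "dct:license": "http://creativecommons.org/licenses/by/4.0/",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://example.org/data/temps.csv",
        "dcat:mediaType": "text/csv",
    },
}

print(json.dumps(record, indent=2))
```

A software publication record could follow the same pattern, with extra fields for version, derivation, and runtime requirements.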

Metrics underpin the design of incentives for open science. An extended set of metrics – called altmetrics – has been developed for evaluating research impact and has already been adopted by leading publishers such as Nature Publishing Group (http://www.nature.com/press_releases/article-metrics.html). Factors counted in altmetrics include how many times a publication has been viewed, discussed, saved and cited. It was very interesting to read news about funders’ attention to altmetrics (http://www.nature.com/news/funders-drawn-to-alternative-metrics-1.16524) on my flight back from the AGU meeting – in the 12/11/2014 issue of Nature, which I had picked up at the NPG booth in the AGU exhibition hall. For a software publication the metrics might also count how often the code is run, the use of code fragments, and derivations from the code. A software citation indexing service – similar to Thomson Reuters’ Data Citation Index (http://wokinfo.com//products_tools/multidisciplinary/dci/) – could be developed to track citations among software, datasets and literature and to facilitate software search and access.
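As a toy sketch of how such a composite score might be computed: the event types below follow the factors mentioned above (views, discussions, saves, citations), but the weights are entirely hypothetical – real altmetrics providers use their own, more sophisticated weightings.

```python
# Hypothetical weights for a toy composite altmetric score.
WEIGHTS = {"view": 0.1, "discussion": 1.0, "save": 0.5, "citation": 3.0}

def altmetric_score(events):
    """events: dict mapping event type -> count.
    Unknown event types (e.g. software 'run' counts) score 0 here;
    a software-aware metric would need weights for them too."""
    return sum(WEIGHTS.get(kind, 0.0) * count for kind, count in events.items())

paper = {"view": 1200, "discussion": 15, "save": 40, "citation": 7}
software = {"view": 300, "run": 5000, "citation": 2}

print(round(altmetric_score(paper), 2))     # -> 176.0
print(round(altmetric_score(software), 2))  # -> 36.0 ('run' is unweighted)
```

The point of the sketch is only that the moment software-specific events like runs and code reuse enter the event stream, the weighting scheme itself becomes a community decision.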

Open science would help everyone – including the authors – but it can be laborious and boring to record all the fiddly details. Fortunately, fiddly details are what computers are good at. Advances in technology are enabling the categorization, identification and annotation of the various entities, processes and agents in research, as well as the linking and tracing among them. In our June 2014 Nature Climate Change article we discussed the issue of provenance in global change research (http://www.nature.com/nclimate/journal/v4/n6/full/nclimate2141.html). Work on provenance capture and tracing further extends the scope of metrics development. Yet incorporating those metrics into incentive design requires the science community to find an appropriate way to use them in research assessment. One recent advance is that the NSF renamed the Publications section of funding applicants’ biographical sketches to Products and allowed datasets and software to be listed (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). To fully establish the technological infrastructure and incentive metrics for open science, more community effort is still needed.
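As a rough sketch of what machine-traceable provenance can look like, here is a minimal, PROV-flavored record and a walker that traces a research product back to its sources. All entity names are hypothetical; the W3C PROV model expresses the same relations with a much richer vocabulary.

```python
# A minimal, PROV-flavored provenance graph (hypothetical names).
# A figure was derived from a dataset, by an analysis activity that
# used the dataset and a piece of software.
provenance = {
    "figure:global-temp-trend": {
        "prov:wasDerivedFrom": ["dataset:ghcn-monthly-v3"],
        "prov:wasGeneratedBy": "activity:trend-analysis-2014-11",
    },
    "activity:trend-analysis-2014-11": {
        "prov:used": ["dataset:ghcn-monthly-v3", "software:trendtool-1.2"],
        "prov:wasAssociatedWith": "agent:example-researcher",
    },
}

def trace(entity, graph, relation="prov:wasDerivedFrom"):
    """Walk derivation links back to the original sources."""
    sources = graph.get(entity, {}).get(relation, [])
    result = list(sources)
    for s in sources:
        result.extend(trace(s, graph, relation))
    return result

print(trace("figure:global-temp-trend", provenance))
# -> ['dataset:ghcn-monthly-v3']
```

Once records like these are captured routinely, counting how often a dataset or a piece of software appears in derivation chains becomes one more candidate metric.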


FAKE open access publications nowadays, and my suggestion

September 24th, 2013

Open Access nowadays is such a *FAKE* idea. It is the author’s paper, not the publisher’s. Currently, what a reader pays for is the typesetting according to the publisher’s format. An author can post his own manuscript (not the publisher’s PDF) anywhere online for access. Yet now an author pays hundreds to a publisher for Open Access to his own paper. I URGE publishers to provide a *FREE* function that allows an author to register a link to the author-made version of a paper on the landing page of the published paper’s DOI. That would be *TRUE* Open Access. What most readers need is the meaning of a paper, not the typesetting. If one does care about the typesetting, one can pay for a subscription to get the publisher’s version. University and institutional libraries should build facilities and functionalities that let employees register and upload author-made versions of publications – to improve the visibility and accessibility of the institution’s own academic work.


My report on Open Government Data camp 2011

November 2nd, 2011

A few days ago I (Alvaro Graves) participated in the Open Government Data Camp 2011 in Warsaw, Poland, where people from different groups, organizations and governments met to discuss issues related to Open Data at the government level. Here are some of the most important issues raised in these talks, in my opinion.

The current state of OGD

David Eaves, an activist who advises the city of Vancouver, Canada on Open Data issues, gave a keynote in which he described his views on the current state of the Open Data movement. First, it is striking that the success stories are no longer just a few (such as Data.gov or Data.gov.uk); there are now dozens (perhaps hundreds), at the national, regional and local levels. Similarly, the term Open Government Data is becoming increasingly popular, which is good because it lets us stop explaining the ‘what’ and start focusing on the ‘how’.

Another interesting point is that the Open Government Data movement has already passed an inflection point: it is no longer seen as people making demands from the outside, but is increasingly being invited to help work on these initiatives from within government. For many, this change in perspective can be confusing, and it may create concerns about Open Data being absorbed into a bureaucracy that makes it impossible to implement Open Data initiatives. However, it is clear that for these changes to occur, the movement cannot refuse to collaborate with governments.

Local initiatives, by locals

A talk that I really liked was by Ton Zylstra, who lives in the city of Enschede, the Netherlands, which has only 150,000 inhabitants. He wanted an Open Data initiative there, but it was difficult to convince the authorities, so he and a group of people decided to start working on their own. Inviting a handful of hackers to a bar, they created their first application, which used data from Twitter, Foursquare, and the venues of a local festival. Eventually they convinced the municipal government that the default option for local data ought to be open.

From this experience, Ton drew several important lessons. You have to create something concrete, no matter how small: something that requires little funding (the first beers at the bar were free) and a short timeframe (no more than a couple of weeks). It does not matter whether it is original; there are great ideas out there that deserve to be copied and are very useful for the local community.

How Open Data could die

Another very interesting keynote was by Chris Taggart, founder of OpenCorporates, who warned of the risks the Open Data movement faces today. His main concern is the lack of impact Open Data has had on society. For example, he mentioned that so far no one’s business depends on Open Data (this is not strictly true – there are a few out there – but I concede they are rare examples). In general, making data available is not enough; it needs to be used, whether in applications, by data journalists, or elsewhere. It is also fundamental to link different Open Data sites to one another (something quite uncommon in the movement), so that people can find more information. Finally, I liked his idea that if Open Data does not cause problems for incumbents, then it is not working.

Redefining what is public

Finally, another talk that I found interesting presented the idea of Dave Rasiej, founder of Personal Democracy, and Nigel Shadbolt, professor at the University of Southampton, to redefine “the public” in terms of data that “is available on the Web in machine-processable formats.” That is, uploading a bunch of PDFs of scanned tables does not make that information public, because it is not easily accessible. This initiative raises the bar for what public data is, especially when compared to FOIA (the Freedom of Information Act), which allows you to request information from the government. Note that this applies to all information, as Rasiej so vehemently described it.

So… what did you talk about at OGDCamp?

In my case, I presented a system for publishing Linked Data called LODSPeaKr, which can be used for rapid publication of government data and for creating applications based on Linked Data. In the near future I will write more about this framework, but for now you can see my presentation here.


Unanticipated consequences: Saving data.gov

April 14th, 2011

I had a bizarre dream last night, one of those surreal shockers. The details aren’t important, but on waking I realized that the dream’s theme was all about unanticipated consequences, and that I needed to write this post.

To set some context: I went to bed upset last night. I was upset at two things: one was an article on TechCrunch entitled “Five Open Questions For Data.gov Before We #SaveTheData”; the other was my own response to the article. I hope here to respond to the first and apologize for the second. I want to make one thing clear, however, before I start – I am a strong supporter of http://data.gov. I think it is a great experiment in democracy resulting from bold leadership, and if it dies in the current round of budget cutting it will be an enduring embarrassment for the USA and a major loss for government transparency.

The article I was upset about was written by Kate Ray (@kraykray), an amazingly bright and articulate young woman who has made several very impressive videos and online articles that I am a fan of. She was recently one of the co-founders of “NerdCollider,” a website designed to bring intelligent discussion to interesting issues — an idea I support. I was proud to be an early contributor to one of their discussions, which asked “What would you change about Data.gov to get more people to care?”

In the TechCrunch post I mentioned above, Kate takes several quotes from this discussion and reflects on their import — is data.gov taking some of the key issues into account? As a good reporter, Kate’s op-ed is actually quite objective – she reports on several comments made by people, including me, about issues the site has in its effort to share government data. TechCrunch is a very influential site; the article title has been tweeted and retweeted hundreds of times to hundreds of thousands of potential readers (congrats to Kate on this viral uptake), raising awareness of Congress’ narrow-minded goal of killing the project, which I guess is a good thing. Unfortunately, the choice of the word “Before” in “… Before we #savethedata” has a negative implication, and I hope it doesn’t kill off the positive efforts the #savethedata meme was designed to promote.

In her article, Kate brings up important issues, but what she doesn’t make clear is that most of the people she quotes are in fact strong supporters of the Open Government movement and fans of Data.gov. The seeming criticisms were actually constructive responses to the question of how we could get more people to care (a positive), not claims about what must be fixed before the site is useful. It’s already very useful, but like any new effort, there’s always room for improvement. However, those changes will never happen if the site is forced to go dark!

As I said, Kate’s article has been phenomenally well tweeted; in fact, if you look at #savethedata, the stream is so filled with pointers to this article that one can no longer easily find the link to the petition created by the Sunlight Foundation to help stop the budget cuts — that petition is where the #savethedata meme started (thanks @EllnMiller). Kate also doesn’t point to the great HuffPost article by @bethnoveck explaining why cutting the funding for this and other e-government sites will threaten American jobs, which was also being retweeted around the #savethedata meme.

So I hope this article doesn’t have the unanticipated consequence of helping cause the death of data.gov by killing off awareness of its importance or losing the momentum of the petition that could save it.

But, as Arlo Guthrie used to say, “that’s not what I came here to talk about tonight…”

In my response to Kate’s article, I referred to her making factual errors. That is a horrible thing to accuse a young journalist of, and I was being unfair. The errors I wanted to point out were not in Kate’s piece but in the chart chosen to accompany it. It appears to show a flatline in interest in data.gov, using figures (as Kate told me later in a separate tweet) from compete.com on “unique visits.” I don’t know where compete.com gets its data, but the number of visitors tracked on the data.gov site itself – reported there on a daily basis – seems to be much larger, with a more positive trend (over 180,000 visits in March). It’s unclear why there is this discrepancy (I suspect it lies in how compete.com figures uniqueness for sites it doesn’t control), but it is clearly not Kate’s fault. She also cites the number of downloads as 1.5M since Oct 2010, which is the number reported on data.gov; but as of last week the site broke 2M downloads, and the number is trending up.

Anyway, I’m digressing again (an occupational hazard of a college professor) — the key point is that the errors are not Kate’s and that she was reflecting on what she found.

I was also upset that she quoted me out of context – in my NerdCollider response I made it clear I was supporting data.gov and offering some constructive solutions to the question of how we could make the site better. As the quote appears in her piece, it looks like I’m saying the data is poorly organized on the site — but what I was actually saying is that, given the incredible richness of the data sets available (data.gov hosted over 300,000 datasets at last count!), we have to explore new ways to search for data — it’s a wonderful problem to have! But I did say what she quoted, and as she correctly pointed out to me, one of the good things about NerdCollider is that the full context of the quotes is there to be cited. She’s right.

So just as I hope Kate’s piece doesn’t have the unanticipated consequence of hurting data.gov, I hope my admittedly intemperate response doesn’t have the unanticipated consequence of hurting the reputation of this young potential online media star.

@kraykray – I apologize.


Data.gov – it’s useful, but it could also be better.

April 5th, 2011

The “Nerd Collider” website invited me to be a “power nerd” and respond to the question “What would you change about Data.gov to get more people to care?” The whole discussion, including my response, can be found here. However, I hope people won’t mind my reprinting my response here, as the TWC blog is aggregated to some important Linked Data/Semantic Web sites.

My response:

I was puzzling over how I wanted to respond until I saw the blog post in the Guardian – http://www.guardian.co.uk/news/datablog/2011/apr/05/data-gov-crisis-obama – which also presents this flat line as a failure and contrasts it with the number of hits the Guardian website gets. This is such a massive apples-vs.-oranges error that I figured I should start there.

So, primarily, let’s think about what visits to a web page mean. For the Guardian, they are lots of people coming to read the different articles each day. For data.gov, however, there isn’t a lot of repeat traffic – the data feeds are updated relatively slowly, and once you’ve downloaded some data, you don’t have to go back for weeks or months until the next update. Further, for some of the rapidly changing data, like the earthquake data, there are RSS feeds, so once they are set up, one doesn’t return to the site at all. So my question is: are we looking at the right number?

In fact, the answer is no — if you want to see the real use of data.gov, take a look at the chart at http://www.data.gov/metric/visitorstats/monthlyredirecttrend — the total number of dataset downloads since 2009 is well over 1,000,000, and in February of this year (the most recent data available) there were over 100,000 downloads. So the 10k number appears to be tracking the wrong thing – the data is being downloaded, and that implies it is being used!!
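The RSS point above is worth making concrete: a data consumer sets up a feed reader once and then never shows up in the visit counts again. A minimal sketch of such a consumer (the feed XML below is a made-up sample, not actual data.gov output):

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 sample standing in for an earthquake data feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Recent Earthquakes (sample)</title>
  <item><title>M 4.2 - offshore</title>
        <link>https://example.gov/quake/001</link></item>
  <item><title>M 3.1 - inland</title>
        <link>https://example.gov/quake/002</link></item>
</channel></rss>"""

def feed_items(xml_text):
    """Return (title, link) pairs for each item in an RSS feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

# A consumer polls the feed and acts on new items without ever
# loading the site's web pages, so it leaves no trace in visit stats.
for title, link in feed_items(SAMPLE_FEED):
    print(title, link)
```

In practice the consumer would fetch the feed URL on a schedule and remember which items it has already seen; the point is that none of that activity registers as page visits.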

Could we do better? Yes, very much so. Here are the things I’m interested in seeing (and working with the data.gov team to make available):

1 – Searching for data on the site is tough — keyword search is not a good way to look for data (for lots of reasons), so we need better approaches. Doing this really well is a research task I’ve got some PhD students working on, but even doing better than what is there now requires better metadata and a better approach. Work is already afoot at data.gov (assuming funding continues) to improve this significantly.

2 – Tools for using the data, and particularly for mashing it up, need to be easier to use and more widely available. My group makes a lot of info and tools available at http://logd.tw.rpi.edu – but a lot more is needed. This is where the developer community could really help.

3 – Tools to support community efforts (see Danielle Gould’s comment to this effect) are crucial – she says it better than I can, so go read that.

4 – There are efforts by data.gov to create communities. These are hard to get going but could be of great value in the long run. I suggest people look at the data.gov communities site and think about how it could be improved to attract more use – I know the data.gov leadership team would love to get some good comments about that.

5 – We need to find ways to turn data release into a “conversation” between government and users. I have discussed this with Vivek Kundra numerous times, and he is a strong proponent (we have thought about writing a paper on the subject if time ever allows). The British data.gov.uk site has some interesting ideas along this line, based on OpenStreetMap and similar projects, but I think one could do better. This is the real opportunity for “government 2.0” – a chance for citizens not just to comment on legislation, but to help make sure the data that informs policy decisions is the best it can be.

So, to summarize, there are things we can do to improve data.gov, many of which are getting done. However, the numbers in the graph above are misleading and don’t reflect the true usage of data.gov itself, let alone that of other sites, like the LOGD site I mention above, which are powered by data.gov.
