GeoData 2011 – Experiences
GeoData 2011 was a great platform for me to learn from and get to know data scientists in academia and various other organizations. The workshop focused on current practices and future directions of the data life cycle, data integration and data citation. It was a very important resource for my research on data citation and was also a perfect follow-up to the data science course which I took last term. The workshop’s highlight was the three breakout sessions on data life cycle, integration and citation. Each of these breakout sessions was preceded by thought-provoking talks by experts in the respective areas.
The first day focused on aspects of the data life cycle. Prof. Peter Fox gave a talk on the various stages of a data life cycle and presented a couple of data life cycle models. The talk was well received. Some of the slides he presented, especially the “data-knowledge-information” diagram and the “pyramid of data and its audiences”, were referenced at various points in the breakout session I attended. While most members reached a consensus on a life cycle model similar to the one suggested by Prof. Fox, some members suggested adding a “disposal” stage and a “data definition” stage. There were also suggestions to have two separate value streams for data. Gaps in the life cycle were investigated. Many participants highlighted the need for incentives for better data management.
The second day started with reports from the data life cycle breakouts. Following the reports, Jim Barret gave a talk on “GeoSpatial Integration”. He suggested a systematic collaborative effort towards data integration and proposed building and publishing a national supply chain plan for data. Rich Signell, USGS, chaired our breakout. In line with Prof. Fox’s metaphor of “dead fruit lying on the ground”, our team highlighted successful data integration efforts at OGC and UniData and discussed the importance of communicating those standards to the community. Rich Signell also pointed out the huge demand for people trained in producing quality data.
The Data Citation breakout was my personal favorite. Mark Parsons from NSIDC presented his hypothesis that “~80% of citation scenarios for geospatial data can be addressed with basic citations”. He gave us a homework exercise to come up with citations for three use cases. In my breakout group, we split up into small three-person subgroups. I teamed up with Rich Signell and Ben Lewis. I presented a use case featuring an EPA data set (how do we cite data from http://www.epa.gov/cgi-bin/htmSQL/mxplorer/query_daily.hsql?poll=42101&msaorcountyName=1&msaorcountyValue=1 ?). Rich Signell showed one of his use cases: a data set on a THREDDS server, which had the same scientific content available in different file formats. Signell raised questions about the granularity of citation. Our group also discussed the possibility of using SHA-1 hash values as identifiers, to avoid giving a central authority control over identifiers. Personally, I feel a DOI or Handle-like identifier would be the best option, as it would act as both an identifier and a locator with the benefits of persistence. Bruce Barkstrom asked questions about the lineage of data citations. As presented in one of the breakout reports, if we use information in a map built from 100 datasets, do we cite the map or the 100 datasets? Many research questions were raised in the breakout. The breakout session was an excellent opportunity for me to come up with additional use cases and get an idea of the issues around data citation.
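To make the SHA-1 idea a little more concrete, here is a minimal sketch in Python. It is only my own illustration, not something presented at the workshop: the function name `content_identifier` and the sample records are hypothetical. It shows that anyone can derive the identifier from the bytes alone, with no registry involved, and also that the same content serialized in two formats gets two different identifiers, which is the granularity question Signell raised.

```python
import hashlib


def content_identifier(data: bytes) -> str:
    """Return the SHA-1 hex digest of the given bytes (40 hex characters)."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical sample: the same two observations serialized in two formats.
as_csv = b"station,value\nA,1.5\nB,2.0\n"
as_json = b'[{"station": "A", "value": 1.5}, {"station": "B", "value": 2.0}]'

csv_id = content_identifier(as_csv)
json_id = content_identifier(as_json)

# Same bytes always give the same identifier; different bytes do not,
# even when the scientific content is identical.
print("CSV :", csv_id)
print("JSON:", json_id)
print("Same identifier?", csv_id == json_id)
```

A hash of this kind identifies an exact byte sequence but says nothing about where to find it, which is part of why I lean towards a DOI or Handle-like identifier that can also act as a locator.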
Apart from the workshop’s central theme, I also had semantic web and provenance related discussions with several participants. I met people who were very interested in provenance concepts. Provenance was a hot topic of discussion in both the data life cycle and data citation breakouts. I had discussions about PML and the Inference Web with folks from ORNL and Harvard. They were very excited about it and showed a lot of interest. There seems to be a huge demand for tools that facilitate provenance collection and integration. People are showing a lot of interest in tools like csv2rdf4lod, which has built-in provenance support. There was also a lot of interest in using semantic web technologies for GeoInformatics.
The workshop has given me a lot of insight into data citation, the data life cycle and data integration. I hope to make the best use of this experience in my efforts to come up with proper data citation methods. People need incentives to produce quality data, and data citations could be the answer.