The GeoData2011 workshop was a tremendous opportunity to hear about the state of the art in the data lifecycle, data integration, and data citation, and to participate in dialogs that will define the path forward in each of these areas. It was humbling to be surrounded by the combined centuries of experience in geoscience and the data pipeline: virtually every community in geoscience was represented, along with organizations specializing in every stage of the data lifecycle. Amid all that breadth, a few key takeaways emerged from each of the workshop foci (lifecycle, integration, and citation).
The data lifecycle is a hard thing to define. Our own Prof. Peter Fox gave the workshop a starting point with a simple, three-level model involving acquisition, curation, and preservation. Of course, the data lifecycle is by no means a simple entity, and there is likely no one-size-fits-all framework or abstraction for every instantiation. Some participants thought there needed to be a distinction between the “original intent” data cycle and subsequent cycles. Others viewed the data lifecycle as an endless spiral of acquisition, curation, and preservation. One of the breakout sessions divided the simple lifecycle further, into more granular stages such as collection planning, processing, and migration. Even with these varied viewpoints on how to define a data lifecycle, the breakout sessions all pointed to metadata as the primary target for improvement. Identifying the points where metadata must be captured, building better tools that automate metadata capture within data collection instruments, and educating scientists on the importance of metadata all emerged as critical paths forward. As a Semantic Web group, we can proudly say that we are good at dealing with metadata. Still, we can improve in certain areas. To start, I think our community, more readily than the broad geoscience community, can nail down a data lifecycle abstraction for acquiring, curating, and preserving Semantic Web data. We can also do a better job of capturing metadata throughout our own data pipeline, and tools like csv2rdf4lod deserve credit for doing exactly this.
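The "capture metadata as early as possible" point can be made concrete with a small sketch. This is not csv2rdf4lod itself, just a hypothetical illustration (all names and URLs invented) of recording provenance metadata at acquisition time, when the source, instrument, and retrieval time are still known, rather than trying to reconstruct them later in the lifecycle:

```python
import datetime
import hashlib
import json

def acquire(record_bytes, source_url, instrument):
    """Capture a data record together with its acquisition metadata.

    Hypothetical sketch: real tools such as csv2rdf4lod capture similar
    provenance, typically with richer vocabularies.
    """
    return {
        "data": record_bytes.decode("utf-8"),
        "metadata": {
            "source": source_url,          # where the data came from
            "instrument": instrument,      # what collected it
            "retrieved": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
            # A checksum lets later lifecycle stages (curation,
            # preservation) verify the record is unchanged.
            "sha256": hashlib.sha256(record_bytes).hexdigest(),
        },
    }

record = acquire(b"station,temp\nALB,12.4\n",
                 "http://example.org/obs.csv",   # invented source URL
                 "thermometer-42")               # invented instrument id
print(json.dumps(record["metadata"], indent=2))
```

The point of the sketch is only that the metadata is a by-product of acquisition itself, not a separate step someone must remember to perform afterward.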
For me, the data integration sessions may have been the most interesting part of the workshop. In our group at TWC, data integration is a task that many of us perform on a daily basis: transforming data from different formats to RDF to enable interoperability, and applying community vocabularies and constructing vocabulary mappings to enable a consistent view over data. However, most of the data integration tasks we perform are in service of a specific goal or use case; the goals that most of the GeoData participants had in mind were much more ambitious. There was no use case or specific domain; rather, the workshop focused on enabling data integration across the broad, multidisciplinary domain of geoscience. The participants were primed with a talk by Jim Barrett from Enterprise Planning Solutions, where he mentioned the need to move data integration up the value chain, from the use side to the supply side of data. I think there were mixed feelings on the extent to which data integration should be moved up the value chain. Most recognized that there is generally a tradeoff between the ability to integrate data and the ability to capture everything in the original data acquisition. The breakout session I participated in for this topic had a few interesting suggestions, namely that each role in the data lifecycle (e.g., producer, archiver, user) needs to maximize “integratability,” distinct from the callout to move integration up the pipeline. It was also mentioned that identifying limitations of data transformations (i.e., what has enabled integration) and constraints on data transformations (i.e., what can enable integration) is important, and that there are constructs in the ISO 19115 standard for doing this (MD_Usage and MD_Constraint). There is tremendous potential to apply semantics in this area, through vocabularies and reasoning capabilities, to notify users of the limitations of the products they are using, and to provide warnings before exceeding constraints.
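To make the vocabulary-mapping and constraint ideas a bit more concrete, here is a minimal, hypothetical sketch: a producer's column names are mapped to community vocabulary terms, and the resulting "triples" carry a usage limitation in the spirit of ISO 19115's MD_Usage/MD_Constraint so that a downstream consumer can be warned. All URIs, column names, and the limitation text are invented for illustration; this is not how any particular tool implements it.

```python
# Hypothetical mapping from a producer's column names to community
# vocabulary URIs (the URIs are invented for illustration).
COLUMN_MAP = {
    "temp_c": "http://example.org/vocab#airTemperature",
    "sal": "http://example.org/vocab#salinity",
}

# A usage limitation attached to the dataset, loosely modeled on the
# role ISO 19115's MD_Usage plays: recording known limitations so
# consumers know what the data can and cannot support.
USAGE_LIMITATION = "Calibrated for coastal waters only"

def row_to_triples(subject_uri, row):
    """Translate one data row into (subject, predicate, object) triples,
    skipping columns that have no vocabulary mapping."""
    triples = []
    for column, value in row.items():
        predicate = COLUMN_MAP.get(column)
        if predicate is None:
            continue  # unmapped column: leave for curation, don't guess
        triples.append((subject_uri, predicate, value))
    # Attach the limitation so it travels with the integrated data.
    triples.append((subject_uri,
                    "http://example.org/vocab#usageLimitation",
                    USAGE_LIMITATION))
    return triples

triples = row_to_triples("http://example.org/obs/1",
                         {"temp_c": "12.4", "sal": "35.1", "qc_flag": "ok"})
for t in triples:
    print(t)
```

The design choice worth noting is that the limitation is emitted alongside the data triples rather than kept in a side document, which is what would let a semantics-aware client raise a warning before a user exceeds the stated constraints.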
The last major focus of the workshop was data citation. Before attending GeoData2011, I appreciated the significance of data citation, but only after the workshop did I realize that it is truly within the grasp of the scientific community. Mark Parsons from the National Snow and Ice Data Center presented some ongoing work in data citation, such as DataCite and the Dataverse Network project, as well as his own theories on data citation. He hypothesized that 80% of available data can be cited as is, without the need for any special data citation platforms, and set the breakout groups on the task of writing data citations and identifying gaps. Some particularly tricky datasets were identified that might need alternative approaches, including taxonomic data (which changes frequently) and hydrographic data (which is often compiled from many individual cruises into a homogeneous database). What I found most interesting was Parsons' suggestion that we cite data in exactly the same way as we make other citations in our publications; that we need to treat data citations as equal in importance to journal and in-proceedings references. Data citation is critical to the work we are doing at TWC. Almost all of us in the lab work with someone else's data. As such, when we publish on what we've done, or even when we post visualizations and mashups of Data.gov datasets, we need to include references to the original data that are just like the references we'd put in any of our publications. In our LOGD site, we should be making appropriate data citations on the pages we create for converted datasets. Making these simple changes to the way we do science, and educating students, scientists, and even publishers, is the only way to make progress in data citation.
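Parsons' point that most data can be cited with existing conventions can be sketched in a few lines. The pattern below loosely follows the style DataCite recommends (creators, year, title, publisher, persistent identifier); the helper function and the example dataset are invented for illustration, not an official DataCite API:

```python
def cite_dataset(creators, year, title, version, publisher, identifier):
    """Format a dataset citation, loosely following the DataCite-style
    pattern: Creator (Year): Title, Version. Publisher. Identifier."""
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}, {version}. {publisher}. {identifier}"

# Invented example dataset, for illustration only.
print(cite_dataset(
    creators=["Doe, J.", "Roe, R."],
    year=2011,
    title="Coastal Water Temperature Observations",
    version="Version 2.0",
    publisher="Example Data Center",
    identifier="doi:10.0000/example",
))
```

The triviality of the sketch is the point: for the 80% of data Parsons had in mind, citing a dataset needs nothing more exotic than the fields we already collect for journal references, plus a persistent identifier.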
So those are my takeaways. The GeoData2011 workshop was an excellent opportunity to learn about the state of the art and the path forward in the data lifecycle, data integration, and data citation. In short: let's identify the data lifecycle for Semantic Web data, keep building tools that automatically capture metadata, and add appropriate citations to the integrated datasets, visualizations, and mashups that we create. I look forward to applying what I absorbed from the many interesting dialogs that took place. In fact, I will be looking into the ESIP Discovery Cluster in the coming weeks to see where my work on S2S and Semantic Web services can be applied to improve their discovery services (especially their OpenSearch conventions).