AGU Fall Meeting 2013 Sessions and Submissions



  • Marshall
    • Ontology development for provenance tracing in National Climate Assessment of the US Global Change Research Program - to Session IN008
    • A justification for semantic training in data curation frameworks development - to Session ED030

What to Submit

  • ECOOP (Stace and Massimo)
  • DCO
  • Provenance in software (Patrick, OPeNDAP)
  • CMSPV (Linyun)
  • RDESC (probably not, according to Jesse et al.)
  • GCIS-IMSAP (Marshall)
  • SemantAqua and SemantEco (Evan in IN025 or IN011)
  • DataONE summer internship on annotator extensions (Katie, Patrice, Brendan, Evan, Tim, Deborah)
  • Peter - two invitations (see below; GC045 and IN021) - submitted.
  • Deborah - I am not doing any alone this year - but collaborating on 3
  • John?

What Sessions might we want to submit to?

  • IN032 Semantically Enabling Annotation, Discovery, Access, and Integration of Scientific Data.
    • Conveners: Brian Wilson, Xiaogang (Marshall) Ma, Tom Narock, Peter Fox
    • Abstract: Data providers are now publishing more metadata in more integrable forms, e.g. Atom/RSS 'casts', as Linked Open Data, or as ISO Metadata records. The availability of data collections and granules advertised via Web technologies (datacasts), of web services (service casts), and of geophysical events with relevant datasets and images (event casts) provides a great opportunity. The challenge now is to overlay semantics on this metadata world to support rich annotation, discovery/access, and semantic integration while building bridges between older metadata technologies and newer semantic RDF and OWL to support SPARQL querying and inference. We seek contributions on vocabularies, tools, approaches, and experience.
  • Session IN010 entitled Data Scientists Come of Age
    • Conveners: Peter Fox, Benjamin Branch, Ruth Duerr, Lesley Wyborn
    • Abstract: It is hard to click a Web page and not see mention of data scientists: in some they are even called sexy. In the past, most geoscientists did not have the skills to effectively manage, curate, preserve, and analyze complex volumes of digital data, whilst data professionals did not understand the science. Today the required skills lie in the domain of data scientists: people who know the special needs of science data AND have domain expertise in data structures, formats, vocabularies, ontologies, etc. This session seeks expositions from practicing data scientists to tell THEIR story – credentials, knowledge and skills; technical and scientific needs; and incentives and rewards that are important. We seek ways to make data science routine.
  • (IN025) Models, Principles and Best Practices for Scientific Data Portal Development
    • Conveners: Linyun Fu, Margaret Glasscoe, Christine Laney
    • Session Description: Research areas like space physics, meteorology and geology are inherently based on large amounts of experimental and/or observational data. Data portals are indispensable to the conduct of research in these areas. While the scientists know a lot about the data they collect, they are not necessarily good Web engineers and/or designers. This session brings together general models, principles and best practices specifically related to building geoportals to help scientists as well as Web engineers and designers better understand, develop and work with data portals.
  • IN011. Data Stewardship: in Theory and in Practice
    • Conveners: Cyndy Chandler, Deborah McGuinness, Dawn Wright, Lesley Wyborn
    • Description: Data stewardship is vital to science of today and tomorrow. Yet stewardship and its many roles are not consistently defined, conceptualized, or implemented even within the same discipline or organization. There is a gap between theory and practice. This session begins to bridge that gap by examining roles, perspectives, and attributes of the overall stewardship enterprise from proposal through preservation. We explore theoretical and practical approaches for understanding complex issues like:
      • tracking provenance, added value, and credit through the data lifecycle
      • defining elements of data quality
      • scaling of complex processes
      • scientist perspectives and norms
    We seek to reexamine worldviews, explore alternatives, and evolve data stewardship.
  • IN024. Leveraging Architectures and Open Standard Data Services to Broker End-to-End Earth and Space Science Analytical Workflows
    • Conveners: Lesley Wyborn, Robert Woodcock, George Percivall, Mark Gahegan
    • Description: As cyberinfrastructures grow in capacity, barriers to using such systems become higher and there is now increasing demand on developing reusable and repeatable workflows. The objective of this session is to share examples of how complete workflows — from data source through results publication — are constructed and transparently published. Are there common architectural patterns that can be leveraged? Are there identifiable, re-usable services for provenance and analysis service interfaces that could become open standards? What are the design principles for determining processing service granularity and representation in provenance records?


ECO-OP by Stace

Session: IN032. Semantically Enabling Annotation, Discovery, Access, and Integration of Scientific Data

Provenance for actionable data products and indicators in marine ecosystem assessments

Stace Beaulieu, Andrew Maffei, Peter Fox, Patrick West, Massimo Di Stefano, Jonathan Hare, and Michael Fogarty

Ecosystem-based management of Large Marine Ecosystems (LMEs) involves the sharing of data and information products among a diverse set of stakeholders – from environmental and fisheries scientists to policy makers, commercial entities, nonprofits, and the public. Often the data products that are shared have resulted from a number of processing steps and may also have involved the combination of a number of data sources. The traceability from an actionable data product or indicator back to its original data source(s) is important not just for trust and understanding of each final data product, but also to compare with similar data products produced by the different stakeholder groups. For a data product to be traceable, its provenance, i.e., lineage or history, must be recorded and preferably machine-readable. We are collaborating on a use case to develop a software framework for the bi-annual Ecosystem Status Report (ESR) for the U.S. Northeast Shelf LME. The ESR presents indicators of ecosystem status including climate forcing, primary and secondary production, anthropogenic factors, and integrated ecosystem measures. Our software framework retrieves data, conducts standard analyses, provides iterative and interactive visualization, and generates final graphics for the ESR. The specific process for each data and information product is updated in a metadata template, including data source, code versioning, attribution, and related contextual information suitable for traceability, repeatability, explanation, verification, and validation. Here we present the use of standard metadata for provenance for data products in the ESR, in particular the W3C provenance (PROV) family of specifications, including the PROV-O ontology which maps the PROV data model to RDF. We are also exploring extensions to PROV-O in development (e.g., PROV-ES for Earth Science Data Systems, D-PROV for workflow structure). 
To associate data products in the ESR to domain-specific ontologies we are also exploring the Global Change Information System ontology, BCO-DMO Ocean Data Ontology, and other relevant published ontologies (e.g., Integrated Ocean Observing System ontology). We are also using the mapping of ISO 19115-2 Lineage to PROV-O and comparing both strategies for traceability of marine ecosystem indicators. The use of standard metadata for provenance for data products in the ESR will enable the transparency, and ultimately reproducibility, endorsed in the recent NOAA Information Quality Guidelines. Semantically enabling not only the provenance but also the data products will yield a better understanding of the connected web of relationships between marine ecosystem and ocean health assessments conducted by different stakeholder groups.
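The lineage links described above can be sketched with a few PROV-O triples. The sketch below builds them in plain Python and serializes them as N-Triples; the entity and activity names are hypothetical placeholders, and a real implementation would more likely use an RDF library such as rdflib.

```python
# Minimal sketch of PROV-O lineage for an ESR indicator, serialized as
# N-Triples. The dataset, indicator, and activity names are hypothetical.
PROV = "http://www.w3.org/ns/prov#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/esr/"

triples = [
    # typing: the final product, its source, and the processing step
    (EX + "sst-indicator", RDF + "type", PROV + "Entity"),
    (EX + "sst-source-dataset", RDF + "type", PROV + "Entity"),
    (EX + "sst-analysis", RDF + "type", PROV + "Activity"),
    # lineage: the indicator is derived from the source dataset
    # via the analysis activity
    (EX + "sst-indicator", PROV + "wasDerivedFrom", EX + "sst-source-dataset"),
    (EX + "sst-indicator", PROV + "wasGeneratedBy", EX + "sst-analysis"),
    (EX + "sst-analysis", PROV + "used", EX + "sst-source-dataset"),
]

def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

print(to_ntriples(triples))
```

A record like this, attached to each data product, is what makes the traceability from indicator back to source dataset machine-readable.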


Provenance Capture in Data Access and Data Manipulation Software

Authors: Patrick West (1), Peter Fox (1), Deborah McGuinness (1), James Gallagher (2), Dan Holloway (2), Nathan Potter (2)

(1) Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY
(2) OPeNDAP, Narragansett, RI

In IN011

There is an increasing need to trace back the origins of data products, whether they are images or charts in a report, data obtained from a sensor on an instrument, or a generated dataset referenced in a research paper, a government report on the environment, or a publication or poster presentation. Yet most software applications that perform data access and manipulation keep only a limited history of the data, i.e. its provenance. Imagine the following scenario: a figure in a report being drafted for a government agency shows multiple graphs and plots related to global climate. The graphs and plots are generated using an algorithm from an iPython Notebook, developed by a researcher using a particular data portal, and the algorithm pulls data from four data sets on that portal. That data is aggregated over the time dimension, constrained to a few parameters, accessed using a particular piece of data access software, and converted from one datatype to another. All the processing on the data sets was conducted by three different researchers from a public university, on a project funded by the same government agency requesting the report, with one Principal Investigator and two Co-Investigators. In this scenario, today we're lucky to get a blob of text under the figure that says a couple of things about it, with a reference to a publication written a few years ago. Data citation, data publishing information, licensing information, and provenance are all lacking in the derived data products.

What we really want is to be able to trace the figure all the way back to the original datasets, including what was done to those datasets, and to see information about the researchers, the project, the agency funding, the award, and the organizations collaborating on the project. In this paper we discuss the need for such information and trace-back features, as well as new technologies and standards that can help us become better data stewards. Specifically, we will discuss the recently published W3C PROV Recommendation, and existing and new features in the OPeNDAP software stack that help incorporate citation, licensing, and provenance information and provide the ability to click through to retrieve it.



IN024. Leveraging Architectures and Open Standard Data Services to Broker End-to-End Earth and Space Science Analytical Workflows

An open source approach to enable the reproducibility of scientific workflows in the ocean sciences

Massimo Di Stefano [1,2], Peter Fox [1,2], Patrick West [1], Jon Hare [3], Stace Beaulieu [2], and Andrew Maffei [2]

[1] Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY; [2] Woods Hole Oceanographic Institution, Woods Hole, MA; [3] Northeast Fisheries Science Center, NOAA, Woods Hole, MA

Every scientist should be able to rerun data analyses conducted by his or her team and regenerate the figures in a paper. However, all too often the correct version of a script goes missing, or the original raw data is filtered by hand and the filtering process is undocumented, or there is lack of collaboration and communication among scientists working in a team.

Here we present three different use cases in the ocean sciences in which end-to-end workflows are tracked. The main tool deployed to address these use cases is a web application (the IPython Notebook) that can work with very diverse and heterogeneous data and information sources. It provides an effective way to share and track changes to the source code used to generate data products and associated metadata, as well as to track the overall workflow provenance, allowing versioned reproducibility of a data product. The use cases selected for this work are:

1) A partial reproduction of the Ecosystem Status Report (ESR) for the Northeast U.S. Continental Shelf Large Marine Ecosystem. Our goal with this use case is to enable not just the traceability but also the reproducibility of this biannual report, keeping track of all the processes behind the generation and validation of time-series and spatial data and information products. An end-to-end workflow with code versioning is developed so that indicators in the report may be traced back to the source datasets.

2) Real-time generation of web pages to visualize one of the environmental indicators from the Ecosystem Advisory for the Northeast Shelf Large Marine Ecosystem website.

3) Data and visualization integration for ocean climate forecasting. In this use case we focus on a workflow that describes how to provide access to online data sources, such as model output in NetCDF format, and how to make use of multicore processing to generate video animations from time series of gridded data.

For each use case we show how complete workflows - from data source through results publication - are constructed and transparently published via the IPython Notebook. Our current work in development includes the incorporation of the W3C PROV provenance standard into the metadata of the JavaScript Object Notation (JSON) file of each Notebook. We are sharing our design principles for the granularity of these linked data provenance records with others in NOAA and NASA data communities. We conclude by reporting on end-user experience and satisfaction with these new capabilities.
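A minimal sketch of the idea of carrying provenance inside the Notebook's JSON file follows; the `prov` metadata key and its field names are our illustrative assumptions, not a finalized schema, and the notebook document is stripped down to its metadata.

```python
import json

# A stripped-down notebook document; real .ipynb files also carry the cells.
notebook = {"metadata": {"name": "esr_indicators"}}

# Attach a small PROV-style record under a custom metadata key.
# The key "prov" and its fields are illustrative assumptions.
notebook["metadata"]["prov"] = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "entity": "esr:sst-indicator-figure",
    "prov:wasGeneratedBy": "esr:notebook-run-2013-08-01",
    "prov:wasDerivedFrom": ["esr:sst-source-dataset"],
}

# Round-trip through JSON: the provenance record travels with the
# notebook file itself, so sharing the notebook shares the lineage.
serialized = json.dumps(notebook, indent=1)
loaded = json.loads(serialized)
print(loaded["metadata"]["prov"]["prov:wasDerivedFrom"])
```

Because the record lives in the notebook's own metadata, any tool that can read the JSON file can recover the lineage without consulting an external store.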


Drop Box



ToolMatch by Nancy Hoebelheinrich

Discovering accessibility, display, and manipulation of data in a data portal

Authors: Nancy Hoebelheinrich (1), Patrick West (2), Peter Fox (2), Chris Lynnes (3)

(1) Knowledge Motifs LLC, San Mateo, CA
(2) Tetherless World Constellation, Rensselaer Polytechnic Institute
(3) Goddard Space Flight Center

In IN032

Science data products are becoming ever easier to access, with more and more data and scientific community portals coming online all the time. But what can one do with a data product once it has been found? Can I visualize it as a map, plot, or graph? Can I import the data into a particular data manipulation tool like MatLab, IDL, or the iPython Notebook? How is the dataset accessible, and what kinds of data products can be generated from it? ToolMatch is a crowd-sourced approach (ontological model, information model, RDF Schema) that allows data providers, tool providers, and portal developers to enable user discovery of what can be done with a science data product, or conversely, which science data products are usable within a given tool.

Example queries include "I need data for carbon dioxide (CO2) concentrations, a climate change indicator, for the summer of 2012, that can be accessed via OPeNDAP Hyrax and plotted as a time series" and "I need data with measurements of atmospheric aerosol optical depth sliced along latitude and longitude, returned as NetCDF data, and accessible in MatLab."
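The matching behind such queries can be sketched as a simple capability lookup. The tool names, formats, and capability fields below are hypothetical stand-ins for the ToolMatch vocabulary, not its actual terms:

```python
# Toy matcher in the spirit of ToolMatch: which tools can consume a given
# data product, optionally with a requested kind of rendering?
tools = {
    "MatLab": {"accepts": {"netcdf", "csv"}, "renders": {"plot", "map"}},
    "Hyrax": {"accepts": {"netcdf", "hdf"}, "renders": {"timeseries"}},
}
dataset = {"name": "AOD-L3", "formats": {"netcdf"}}

def tools_for(dataset, want_render=None):
    """Return tools that accept one of the dataset's formats and,
    optionally, support a requested rendering."""
    hits = []
    for name, caps in tools.items():
        if dataset["formats"] & caps["accepts"]:
            if want_render is None or want_render in caps["renders"]:
                hits.append(name)
    return sorted(hits)

print(tools_for(dataset))                        # → ['Hyrax', 'MatLab']
print(tools_for(dataset, want_render="timeseries"))  # → ['Hyrax']
```

In the actual system this matching would be expressed over RDF descriptions of tools and datasets and answered with SPARQL rather than an in-memory lookup.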

This contribution outlines the progress of the ToolMatch development, plans
for utilizing its capabilities, and efforts to leverage and enhance the use
of ToolMatch in various portals.


  • GC045: Big Data: Pushing the Frontiers of Environmental Science.

In an era of advancing technologies and analytical tools, Big Data is altering the way scientists study and analyze information. While poised to change the fields of ecology and other sciences, defining the technical and cultural aspects of Big Data, and integrating analytical approaches that can extract information across multiple spatio-temporal scales and heterogeneous data sets, remain challenging. This session welcomes contributions addressing: the ability of Big Data to push environmental science to new frontiers; the recognition of large and interoperable data sets to function as collaborative platforms for discovery; data and information flow; and defining how scientists are equipped with tools to maximize the use and integration of large data sets.

Invited presentation: How Environmental Informatics is Preparing Us for the Big Data Era

Peter Fox TWC/RPI

With increasing attention to how environmental researchers present and explain results based on interpretation of increasingly diverse and heterogeneous data and information sources, and a renewed emphasis on good data practices, informatics practitioners have responded to this challenge with maturing informatics-based approaches. These approaches include, but are not limited to: use case development; information modeling and architectures; elaborating vocabularies; mediating interfaces to data and related services on the Web; and traceable provenance. The Big Data era, broadly defined, presents numerous challenges to both individuals and research teams. In environmental science especially, sub-fields that were data-poor are becoming data-rich (in volume, type, and mode), while some that were largely model/simulation driven are now dramatically shifting to data-driven, or at least to data-model assimilation, approaches. These paradigm shifts make it very hard for researchers used to one mode of doing science to shift to another, let alone produce products of their work that are usable or understandable by non-specialists. At the same time, it is at these frontiers where much of the exciting environmental science needs to be performed and appreciated. In this contribution, key informatics approaches, i.e. methods rather than specific technologies, that have been successfully applied to several environmental applications will be presented and discussed. Conclusions and future directions will also be outlined and discussed.

  • IN021. Information Model Driven Architectural Components for Science Data Repositories and Archives

Research and development across the space science communities have resulted in a wealth of architectural components for building data repositories and archives. Of special interest are open source components that allow science data providers and users to directly participate in the development of data repositories using information model driven methodologies. This session invites papers on Information Models, Ontologies, open source software components, related technologies, and case studies where model driven approaches are being used to meet the expectations of modern scientists for science data discovery, access and use.

Progress in Open-World, Integrative, Collaborative Science Data Platforms.

Peter Fox and the DCO-DS team.

As collaborative (or network) science spreads into more Earth and space science fields, both the participants and their funders have expressed a very strong desire for highly functional data and information capabilities that are a) easy to use, b) integrated in a variety of ways, c) able to leverage prior investments and keep pace with rapid technical change, and d) not expensive or time-consuming to build or maintain. In response, and based on our accumulated experience over the last decade and a maturing of several key technical approaches, we have adapted, extended, and integrated several open source applications and frameworks that handle major portions of the functionality for these platforms. At minimum, these functions include: an object-type repository, collaboration tools, an ability to identify and manage all key entities in the platform, and an integrated portal to manage diverse content and applications, with varied access levels and privacy options.

At a conceptual level, science networks (even small ones) deal with people and the many intellectual artifacts produced or consumed in research, organizational, and/or outreach activities, as well as the relations among them. Increasingly these networks are modeled as knowledge networks, i.e. graphs with named and typed relations among the 'nodes'. Nodes can be people, organizations, datasets, events, presentations, publications, videos, meetings, reports, groups, and more. In this heterogeneous ecosystem, it is also important to use a set of common informatics approaches to co-design and co-evolve the needed science data platforms based on what real people want to use them for.

In this contribution, we present our methods and results for information modeling, adapting, integrating, and evolving a networked data science and information architecture based on several open source technologies: Drupal, VIVO, the Comprehensive Knowledge Archive Network (CKAN), and the Global Handle System (GHS). In particular, we present the instantiation of this data platform for the Deep Carbon Observatory, including its key functional and non-functional attributes and how the smart mediation among the components is modeled and managed, and we discuss its general applicability.