DCO-DS Narrative


1. What is the main issue, problem or subject and why is it important?


The Deep Carbon Observatory (DCO), with four Science Directorates (Deep Energy, Deep Life, Extreme Physics and Chemistry, and Reservoirs and Fluxes; https://dco.gl.ciw.edu/ - Appendices B-E), is a bold decadal initiative with an ultimate aim: to transform the scientific and public understanding of Carbon in the complex deep Earth system. In extreme physical, chemical, and biological environments, DCO researchers seek discoveries concerning Carbon-bearing fluids and materials, microbial life, where Carbon resides and in what quantities, and the nature of its fluxes.


Over the decadal lifetime of DCO, DCO-Data Science (hereafter DCO-DS) will assemble the “Deep Earth Computer”, creating a fundamental change in the conduct of Carbon-related research that rests upon a 21st century data science platform (elements of such a platform are discussed below) and a series of aggregate data holdings that have never existed before. The platform will also coexist with, and fundamentally enhance, key community (academic, commercial and agency) data resources already in existence. The Deep Earth Computer will embody “in memory” the current state, and parts of the past state, of information and knowledge of the deep Earth, based on rich data generated and assembled by DCO and its aligned and allied activities starting, but not stopping, with Carbon. En route, DCO will inevitably stimulate changes in the conduct of research across several Earth science disciplines, enabled by routine worldwide collaboration among DCO scientists and the communities they are part of.


In the first years of the DCO-DS, as a step towards the creation of the Deep Earth Computer, significant data science efforts will be unified to implement a Deep Carbon Virtual Observatory (DCVO) to help the DCO attain its four over-arching decadal science goals. The DCVO will be a collaborative, scalable education and research environment for searching, accessing, integrating, and analyzing distributed observational, experimental, and model databases. The Data Science challenges in creating the DCVO come from the need to support the complex data requirements that result from work across the DCO.


For success, the DCO-DS will need to:

  • Develop data methodologies to handle a global network of volcano monitoring stations with high sample throughput;
  • Create tools to process new streams of sensor data returned in near-real time for science and public engagement;
  • Rapidly model new data that results from probing the deep interior with a combination of drilling and experimental innovations;
  • Encode and store the simulation results of computations in new temperature and pressure regimes of Carbon physics/chemistry/biology;
  • Capture data coming from new specialized laboratory instrumentation;
  • Integrate data from different disciplines, such as genomic sequence data and the increasingly complex metadata of life in extreme environments; and
  • Innovate new data tools that are quickly and easily accessible to a wide audience, including tools to visualize and analyze large data holdings.

Operationally, the DCVO will need to find the right balance of data/model holdings, portals and client software that researchers can use with minimal effort or interference, making it as if all the materials were available on a local computer in the user’s preferred language: i.e. appearing local and integrated. DCO-DS will develop the DCVO by mid-decade as a basis for the essential data infrastructure required by DCO and the community over the long term. Conceptually, the DCVO is the first incarnation of the data science platform introduced above, leading to the Deep Earth Computer.


In pursuit of DCO decadal science goals, and as a result of data science discussions among DCO scientists and the DCO Secretariat, six candidate data outcomes for the DCO at the end of its decade have emerged (and more may arise as the DCO effort evolves). DCO-DS is important since it will play a key role in facilitating the creation of each of the following: A Virtual Mineral Laboratory, The Global Census of Deep Life, The Global Census of Deep Fluids, The Global Volcano Observatory, The Global State of High Pressure and Temperature Carbon (and related) Materials, and A Global Inventory of Diamonds with Inclusions (GIDI).


Underlying these major new capabilities will be an Earth Materials Data Infrastructure (EMDI) spanning all DCO Science Directorates. The genesis of EMDI will be in the data infrastructure elements of DCO-DS: the Deep Carbon Virtual Observatory, leveraging of community resources, the Science Network, and Visualization and Discovery (1.i - 1.iv below), which together comprise ~90% of the DCO-DS effort (personnel and funds) in this project period.


The era of DCO will be remembered as a time of fundamental change. This shift in the conduct of science will require a transition to data and software infrastructures that facilitate networked science scaling from traditional small investigator/student efforts all the way to large international and multi-disciplinary teams, all on the same data infrastructure. Many of the initial activities for DCO-DS involve data and information structuring guidance, framework adaptation (explained more in following sections), network and collaboration stimulation, integration and support of the goals and priorities of DCO Engagement (led by the University of Rhode Island in a separately submitted AP Sloan proposal), as well as addressing some key and immediate data needs to put a version of this data infrastructure in place as soon as possible. Ultimately, the DCO-DS platform will facilitate a mode of collaboration in which data generation and use are as valuable as scientific publication. Additional peer norms (e.g. credit for data production) will emerge for a new generation of researchers who will take what DCO is doing for granted. The perpetual value of the data and information products that DCO will generate or inspire is likely to be incalculable, and probably not yet foreseen.


Delivering on such bold claims requires upfront attention to 1) immediate research value, 2) persistent, easy-to-access and easy-to-use science infrastructure that adds to scientific progress, and 3) infrastructure that does not detract or distract from that progress. Enacting these three attributes requires experience, skill and excellence in the relatively new field of Data Science, underpinned by modern informatics. Finally, for the most part the participants of DCO, the people, are new to data science concepts, methods, and techniques. We do not underestimate the accommodations that will be required for socio-cultural adaptation within DCO, and thus the science network of DCO figures prominently in the DCO-DS.


Key Elements of the Data Science Effort


Given DCO’s intensive data and computational needs, each of the activities embedded in the Science Directorates and instrument development initiatives of the Deep Carbon Observatory must adapt or adopt data science and data management solutions to fulfill both their decadal strategic objectives and their day-to-day tasks. As such, the Data Science effort for DCO will continually assess in detail the data science and data management needs in each DCO activity and for the DCO as a whole, using a combination of informatics methods: use case development, requirements analysis, inventories and interviews. To ensure a balanced allocation of DCO-DS resources to advance DCO Directorate goals, the priorities for DCO-DS will be established by the Data Science Advisory Committee on a regular basis.


We now turn to the description of six key elements of a DCO-DS platform (with the expected average percentage of data science personnel effort/funds for the project period in parentheses): i) the Deep Carbon Virtual Observatory (50%), ii) leverage and enhancement of existing community data resources (20%), iii) the Deep Carbon Observatory Science Network (a virtual organization) (10%), iv) visualization and scientific exploration (10%), v) data as a first-class object, and vi) Data Science education and working activities (v-vi comprise the remaining 10%). In subsequent sections we then describe the cross-cutting functions of the various Data Science Teams that will work on these component activities (Section 2).


i. The Deep Carbon Virtual Observatory


Modern approaches to science, to account for the reality of distributed and heterogeneous data, have introduced the concept of presenting discovery, access and use of those data through a virtual view of the holdings, in strong preference to attempting to bring the data and metadata sources together in one place (the “if we only had one database for X” fallacy). This concept arose formally in Astronomy (ca. 2000) but has spread significantly since. Hence, the notion of an observatory, as used in the astronomical sense, is metaphorical, and in many instances the term virtual laboratory, or virtual repository, may be more applicable. Figure 1 provides a schematic block architecture for the multi-layered DCVO. The lower layer represents a variety of existing or required DCO data sources. The second layer integrates the scientific processing tools and applications in present use by DCO researchers, and also includes discovery and search services, product generation, and the integration of data. Most importantly, these integrated data products can be fed into the top layer, which produces visual and analytic applications and other aggregate analyses. All of these capabilities will be assessed in relation to DCO project needs. Development is via an evolutionary and iterative approach, which is needed to accommodate both the common and the diverse data science capabilities identified to date (see Appendices: B - E).


RPI’s Tetherless World Constellation (TWC) has developed numerous successful eScience applications using a methodology we have formalized from studying science communities and determining requirements for supporting large-scale eScience efforts, aimed at trained scientists who needed to work in interdisciplinary settings [Fox et al. 2009, 2011]. We have successfully deployed and refined this methodology in focused scientific communities – typically with user communities of trained scientists numbering up to the low thousands. The longest lived of the deployments of this methodology has been in place for six years and is an interdisciplinary virtual observatory focused around solar, solar-terrestrial, aeronomy and space physics topics – the Virtual Solar-Terrestrial Observatory [for evaluation, successes and implications of such data infrastructures, see McGuinness et al. 2007, Fox et al. 2009]. The methodology has been reused in a wide array of topic areas including volcanology, plate tectonics, and atmospheric responses to volcanic eruption [Fox et al. 2007], coronal imaging data available from observatories such as the Mauna Loa Solar Observatory [Fox et al. 2009], arctic sea ice [Parsons et al. 2012], satellite and ground-based aerosol data [Leptoukh et al. 2011], biological and chemical oceanography data [Rozell et al. 2010], and more. Further, as of July 2012, TWC has been awarded funding from the USGS to develop energy.data.gov and ocean.data.gov for the inter-agency Coastal and Marine Spatial Planning program, and from the U.S. Global Change Research Program to develop climate.data.gov supporting the National Climate Assessment.


[Figure 1 schematic: Deep Carbon Virtual Observatory and Interoperability]


Figure 1. Conceptual diagram for potential components of the DCVO.


ii. Leverage and Enhancement of Community Data Resources


Generating, assembling and analyzing the libraries of new and complex data created by the DCO will require that we actively manage the inherent complexity to allow integration of information and knowledge across multiple scales spanning traditional disciplinary boundaries. Curation of these data is also a critical element of such assembly. DCO-DS will build or further develop strong relationships, direct resources, and in some cases create joint funding opportunities with DCO-identified community resources. Some initial ones include EarthChem, PetDB (geochemical data) and SESAR (sample registry) at Lamont-Doherty Earth Observatory/ Columbia University (http://earthchem.org), MetPetDB at RPI (http://metpetdb.rpi.edu), genomic data with taxonomic counts, sequences, and a variety of other products in Visualization and Analysis of Microbial Population Structures [VAMPS] at the Marine Biological Laboratory (http://vamps.mbl.edu), and many “islands” of thermochemical databases, such as the Library of Experimental Phase Relations [LEPR] with Mark Ghiorso (http://lepr.ofm-research.org/index.php), Raman spectra, X-ray diffraction and chemistry data for minerals [RRUFF] with Bob Downs (http://rruff.info), and the mineral and locality database [MINDAT] with Jolyon Ralph (http://mindat.org); the last three data sources are under-resourced and are considered at-risk in the context of the DCO community. Strategic investments are required across this spectrum. Not only will DCO-DS leverage these resources, it will actively engage and enhance them, working with the responsible individuals.


This is a significant undertaking, toward which DCO-DS will dedicate resources. For example, bringing V-CAFÉ (the Volcanic Carbon Atmospheric Flux Experiment) into a networked, standards-based sensor environment, sending data to the Web in near-real time for research and education use, and understanding and accommodating the impacts on other worldwide networks devoted to volcanic monitoring and observation (e.g. the Network for Observation of Volcanic and Atmospheric Change), is both a community and a scientific undertaking. Thus, a significant and important aspect of the DCO-DS effort is targeting “boundary activities” directed at the task of establishing data infrastructure interfaces (data, metadata and services) that allow all community resources to be a part of the DCVO, as well as allowing DCO data to flow into these locations such that it remains known, attributed to DCO, and discoverable. DCO-DS specifically budgets funds (20% in the first phase) each year for interacting with these key boundary activities.


Figure 2 presents an early schematic of how data flows for DCO can be conceived to co-exist with community resources. Early on, the DCO-DS Community Team will work with the DCO-DS Data Infrastructure and Data Management Teams (i.e. this denotes a significant activity for DCO-DS) on boundary activities to target key (and ready) opportunities for leverage and enhancement. For example, enhancing DCO-DS and community datasets with RPI’s dataset markup, recently accepted into the Schema.org standard supported by the major search engines, would make DCO datasets far more visible to Web search: schema.org markup applied to scientific datasets exposes dataset metadata directly to search engines, improving “search engine optimization” so that DCO datasets are more easily found and the results more clearly displayed. A sketch of what such markup could look like follows.
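As an illustration only, the minimal sketch below describes a hypothetical dataset with the schema.org Dataset vocabulary, serialized from Python as JSON-LD for readability (the markup accepted by Schema.org was microdata; the same properties can equally be expressed as microdata attributes in a dataset's HTML landing page). All names, URLs and identifiers are placeholders, not actual DCO resources.

```python
import json

# Hedged sketch: a hypothetical DCO dataset described with the schema.org
# Dataset vocabulary and serialized as JSON-LD for embedding in a landing
# page. Names, URLs and identifiers are illustrative placeholders only.
dataset_markup = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example volcanic CO2 flux time series",        # illustrative title
    "description": "Near-real-time CO2 emission measurements "
                   "from a monitored volcano (example record).",
    "creator": {"@type": "Organization", "name": "Deep Carbon Observatory"},
    "keywords": ["deep carbon", "volcanic gas", "CO2 flux"],
    "url": "http://example.org/dco/datasets/co2-flux-0001",  # placeholder URL
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://example.org/dco/datasets/co2-flux-0001.csv",
        "encodingFormat": "text/csv",
    },
}

if __name__ == "__main__":
    # Emit the JSON-LD block that would be embedded in the dataset page.
    print(json.dumps(dataset_markup, indent=2))
```

The design intent is simply that the metadata a repository already holds (title, creator, access URL, format) is surfaced in a form the search engines index, without changing how the data themselves are stored.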


[Figure 2 schematic: Deep Carbon Virtual Observatory Data Flows]


Figure 2. Data flow schematic indicating an interoperable arrangement for DCO-generated data to sensibly land in authoritative repositories or commercial applications such as search engines. DCO-DS will develop and fund these interface specifications and their adaptation by the community.


iii. The DCO Science Network


Past research has shown that the identification and facilitation of interactions within large-scale “scientific networks” can be integral to enhancing the impact of large-scale scientific teams [Borgman et al. 2008; Falk-Krzesinski et al. 2010]. To create such a model for the DCO, the DCO-DS Community Team will work with the Secretariat to identify both representative and leading people and projects among the Science Directorates (in addition to those named in Appendices B-E). Representative, because a model of the present “virtual” social network (notably influenced by the DCO Directorate structure) will be developed by analyzing and enhancing the rich scientific exchanges already underway in DCO (emails, documents, workshops and conferences). The Community Team will work first with these targeted projects and particular individuals (people as well as key data resources; Section ii), then more generally, to characterize roles and relationships within the DCO Science Network and to identify key liaisons within each project team. Since the nature and intensity of the networking around data science and data management proposed here may represent a cultural change for many teams, DCO-DS will work with the Secretariat and Engagement team to clearly articulate the value proposition of the platforms and infrastructure, as well as educate people in their use, where needed.


iv. Visualization and Scientific Exploration


A critical aspect of data-intensive science is the need for different types of users, whether they be scientists themselves, funders of science, or the concerned public, to be able to apply visualizations to understand and discover the relations among and between the data. Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration and discovery tool used throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, hold a key to lowering the cost of visualization generation and allowing greater integration into the scientific process.


The Data Science Visualization Team will engage DCO researchers to bring innovative visualization and analytics approaches (especially new tools, but also adaptation of existing tools) into every phase of DCO science research [cf. Fox and Hendler 2011]. The lack of development environments for interdisciplinary research conducted on large-scale datasets hampers research at every stage, and the problems have a common cause: the lack of better visualization tools.


When DCO’s large, heterogeneous and distributed data are added to the equation, further frustration, at the least, ensues. As a result, using existing platforms, the programmers of 21st century interactive visualizations, let alone the scientists, are reduced to working in the same fashion and with the same tools as 20th century database programmers. Over the past two years, in conjunction with RPI’s Experimental Media and Performing Arts Center, the tools of digital artists have been brought to bear on the aforementioned exploratory data analysis and visualization challenges. In particular, we will bring tools like Field [Downie et al. 2011], an open-source, extensible visualization framework, to DCO’s large-scale, Web-based scientific data analysis and visualization.


v. Data as a first class science object


As noted in Appendices: B-E, DCO data types and sources are numerous and include, but are not limited to:

  • 3-D seismic imagery (some from commercial sources);
  • geophysical samples (existing and new), some hydrated;
  • historical reanalysis (e.g. from programs such as the Integrated Ocean Drilling Program; IODP);
  • new cores and their biological, chemical and geophysical analyses;
  • inclusions in diamonds and other minerals (leading to a global inclusion database);
  • samples of Carbon-bearing surface rocks (from eruptions);
  • contextual data such as location, age, petrologic, chemical and biological settings, temperature, pressure, density, etc.;
  • related measurements (e.g. radon);
  • diverse mineralogical data;
  • reaction rates at new pressures and temperatures;
  • volcano gas emission data from a network flowing into a near-real time Web-accessible environment;
  • spectra from new instrumentation;
  • Carbon and related (O-H-N) isotopic analyses;
  • chemical and physical characteristics between organisms and minerals (biotic experiments) and compositional data; and
  • microbial sequence data and metadata.


When data is a first-class object, it may itself be the subject of discourse. For example, in EPC, collaborators using scientific visualizations may interrogate the data origins of particular results, and indeed visualizations may be revised while unambiguously tracking the various sources and the conditions under which they were produced. An early contribution of the Data Infrastructure Team will therefore be the establishment of an identification infrastructure for DCO scientific data objects that will facilitate the long-term management, discovery and exploitation of the full range of artifacts (reports, papers, visualizations, tables, presentations, data products, interpretative analyses, etc.) produced by DCO researchers. Integrated with this naming infrastructure will be a rich metadata management system enabling data integration, federation and the construction of higher-level applications and visualizations in the DCVO (see Fig. 1). Web services will allow users or applications to query, using the identifier, for key metadata, access means, derivative products, provenance records, related artifacts of all types, etc., and will distinguish citations attributable to DCO; a sketch of such a lookup by identifier is given below.
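As an illustration only, the sketch below shows how an application might retrieve the metadata record registered for a DCO object identifier from such a web service. The endpoint URL, identifier form, and response field names are assumptions for the purpose of the example, not a defined DCO interface.

```python
import json
import urllib.parse
import urllib.request

# Hedged sketch: fetch the metadata record registered for a DCO object
# identifier from a hypothetical DCO metadata web service. The endpoint,
# identifier form, and field names are illustrative placeholders only.
METADATA_ENDPOINT = "http://example.org/dco/api/objects"  # placeholder URL


def get_object_metadata(dco_id: str) -> dict:
    """Return the metadata (e.g. provenance, access means, related
    artifacts) held for the given DCO identifier."""
    url = "{}/{}".format(METADATA_ENDPOINT, urllib.parse.quote(dco_id, safe=""))
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    record = get_object_metadata("dco-example-0001")  # illustrative identifier
    print(record.get("title"), record.get("provenance"))
```

The point of the sketch is the pattern: every artifact is reachable by its identifier alone, and the provenance and access information travel with it rather than living only in a paper.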


vi. Data Science Education and Data Management Training


Attention to Data Science is becoming ubiquitous. Surrounding this attention is a proliferation of studies, reports, conferences and workshops on data, Data Science and the workforce. The explicit naming of RPI's Data Science, Informatics and eScience curriculum offerings (e.g. in the Research Data Workforce Summit; Appendix N) is a strong indication of worldwide recognition. RPI already has a multi-faculty, multi-course, interdisciplinary Data Science Research Center (DSRC – http://www.dsrc.rpi.edu/), education activities and engaged faculty. RPI's Data Science curriculum offerings are unique. Further, RPI's informatics offerings are firmly embedded in its science and engineering programs, unlike the large majority of such educational offerings elsewhere, which sit in schools of information or library science (bioinformatics may be the only exception). Our educational hypothesis is that a Data Scientist can be educated to become a unique contributor at the interface of science disciplines. Our unique contribution to DCO will be to provide workshop training and webinars specifically on Data Science for DCO researchers so they can be embedded in DCO science projects, and particularly to facilitate the development of young DCO data scientists.


2. Partnering with the DCO Secretariat, Science Directorates, DCO Engagement and Broader Community


The six key elements of DCO-DS introduced above feature a series of Activities and Functions that we discuss in this section, along with the Data Science Teams and their roles. While many science needs are known for the present DCO project, some will come and go over the DCO lifetime. The DCO-DS fundamental approach (see Methodology in Section 4 and Appendix F) is: capture the science goal, assess, and then adopt, adapt and (only if needed) develop. Elements such as Data Infrastructure, Data Management, Community, and Visualization are discussed below and map directly to the named Data Science Teams. DCO-DS will work with the Secretariat and through the Data Science Advisory Committee to ensure that the explicit needs of the four Directorates, and of DCO as a whole, are met. The Community Team will bootstrap a DCO-DS community and then leverage that social/technical network to build profiles or templates of the data science and data management needs inherent to each project embedded within DCO Directorates (including Instrumentation pilots via the Secretariat), and for the DCO as a whole. The Data Science Data Infrastructure and Data Management Teams will model those needs using a combination of informatics methods including use case development, requirements analysis, inventories and interviews.


Data Science Needs Analysis and Prioritization: The first application of the DCO Science Network (see Section 1.iii above) will be to characterize, aggregate and disseminate the specific data science and data management requirements of projects across DCO. Identification of and interaction with key node projects in the network, especially the handful of projects likely to have representative needs, will be through the respective Directorate liaisons on the Data Science Advisory Committee, and will focus on building the investigative instruments that will later be formalized and facilitated through Web-based tools: use case development and collection, requirements analysis, inventories and interviews with project teams. We recognize that the most effective standards of practice emerge from the community to which they apply, and we intend the technical contributions of this activity to be similarly community-focused and community-driven.


Community Engagement: The Data Science Community Team will coordinate the recruiting of DCO project liaisons (to complement the Directorate liaisons on the Data Science Advisory Committee) to form an evolving DCO-DS community of practice and to serve as the primary link between their projects’ data science and data management functions and the DCO-DS activities. The Community Team, working in cooperation with DCO Engagement, will create an active community/social network of practitioners supporting the overall DCO Engagement in-reach effort. To foster community, the Community Team will coordinate best-practices working sessions and Coordinated Data Analysis/Exploration Workshops (aimed at providing hands-on training) focused on data handling, data integration and visualization for DCO researchers. At least one workshop per year will be held in conjunction with DCO All-Program, Scientific Steering Committee, or Communication and Engagement Workshops, existing community workshops (e.g. the IGSN workshop at the recent Goldschmidt Conference, AGU, GSA, MSA), or as DCO topic-specific working sessions.


Boundary Activities: As previously noted in 1.ii above, these activities will be prominent in the early phases of DCO-DS. We have identified three immediate efforts that will be funded by the $90K/year allocation in the DCO-DS budget. These are: within the Reservoirs and Fluxes (R&F) Directorate, stimulating the volcano gas emissions database effort as a collaboration among the new DCO project DECADE, the Smithsonian Global Volcanism Program (GVP; http://www.volcano.si.edu), EarthChem and DCO-DS; within Deep Life (DL), adapting and augmenting VAMPS capability for the ever-growing computational and data access needs of microbial sequence data; and, as a forthcoming cross-Directorate (DE, EPC, R&F) effort, initiating the database of diamonds with inclusions, which also offers an opportunity to leverage sample registries in the community. Another cross-Directorate effort will be evolving the interface capabilities of EarthChem, PetDB, and SESAR (sample registration) to support broad DCO data flows (discussed earlier in Figure 2).


Visualization: The Data Science Visualization Team will work closely with all DCO participants to develop and design innovative ways to manage, present and explore DCO data, information, and assets, and to showcase and illustrate scientific results. In particular, the Visualization Team will work with the specific DCO science groups most in need of new means and tools, to ensure utilization of the most relevant emerging frameworks and platforms to support DCO collaborations. We will also develop visualization products and tools to enhance the understanding of DCO’s findings and rapidly disseminate them to the scientific community and public in a professional way. As these tools mature it is expected that DCO participants will move to a self-serve model for product generation, one that meshes with the DCO collaboration and dissemination infrastructure. The Data Science Visualization Team will also work closely with DCO Engagement to showcase and illustrate scientific results to a general audience, leveraging DCO science visualizations.


Bibliographic Infrastructure: The Data Science Data Infrastructure Team will work with the DCO Engagement Team to implement a DCO-wide Bibliographic Infrastructure [VIVO or “VIVO-like”], incorporating the digital object/name resolution infrastructure, an extensible metadata model, efficient ingest workflows, and semantic discovery capabilities to track and manage an ever-growing list of DCO publications and their associated digital objects (data, with provenance). These distributed services will enable updating, searching, sorting, importing, and exporting of bibliographic entries utilizing keyword, date, author, title, publication, Digital Object Identifiers (DOIs), Handle identifiers (Appendix M) or other identifiers (e.g. HTTP URIs), and other appropriate fields; a sketch of such a lookup is given below. Not only will the platform serve the DCO community, it will prove to be a valuable resource for other researchers, students, the public, the press, and those interested in the scientific results of this decade-long program. DCO management will use this infrastructure for reporting, and the science community for easily accessing and keeping track of their (and their colleagues’) publications; it will also be a resource for DCO Engagement.
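As an illustration only, the sketch below looks up a publication by DOI in a VIVO-like semantic store via SPARQL. The endpoint URL is a placeholder, and the ontology terms (bibo:doi, rdfs:label) are assumptions about how such a store could be modeled, not a specification of the DCO bibliographic infrastructure; the DOI used is one cited in this proposal (Fox and Hendler 2011).

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hedged sketch: query a VIVO-like triple store for a publication by DOI.
# The endpoint URL is a placeholder and the ontology terms are assumptions.
ENDPOINT = "http://example.org/dco/vivo/sparql"  # placeholder endpoint

QUERY = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?pub ?title WHERE {
  ?pub bibo:doi "10.1126/science.1197654" ;
       rdfs:label ?title .
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Print each matching publication URI and its title.
for binding in results["results"]["bindings"]:
    print(binding["pub"]["value"], "-", binding["title"]["value"])
```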


File and Information Sharing: The Data Science Data Infrastructure Team will recommend and implement an appropriate Web-based file- and information-sharing approach to meet DCO digital object and information repository needs. This approach, part of the DCVO described earlier, will be designed to fulfill DCO science requirements and will be available to all DCO researchers (with login). It will be used for managing information, resources, and digital objects associated with research needs in data science and data management as well as communications and management functions, with a consistent identifier/resolution infrastructure and metadata model, defined ingest workflows, citation and attribution mechanisms, metric and reporting capabilities, and semantic discovery tools.


Data Science Outreach and Education: In order to facilitate network building across DCO, the Data Science Community Team will establish a Data Science Resource Center that will provide educational resources and a discussion forum, enabling the DCO-DS Community to establish its own definitional norms and best practices as it grows together. Various members of the Data Science Teams will act as facilitators and “first responders,” but as DCO-DS best practices become socialized, i.e. part of the way researchers do their work, the community itself will respond more quickly. Experience from other networked science communities indicates that members continue to participate well after their funded involvement ends.


Coordination: The DCO-DS will work jointly with the Engagement Team to enhance communications among DCO researchers and ensure that their scientific data is accessible and public under appropriate data-access policies. DCO-DS and Engagement will work closely with DCO Executive Committee peers to ensure there is ample discussion of objectives, enhanced creative thought in how to present scientific findings, and, in short, an effective, collaborative working team with shared objectives and goals.


Visual Materials Repository: The Data Science Data Management Team will partner with the Engagement team to design and develop a platform capable of handling metadata including date, time, location and crediting information, tied to the DCO identifier resolution infrastructure, and including use restrictions, resolution details, and accurate content descriptions. The Data Science team will provide advice and counsel on the most efficient platform for the Visual Materials Repository, taking into account workflows based on prototypes, best practices for data visualization consumption, archiving, and version management protocols.


3. What is the state of the research on this question?


In advancing DCO-DS in an integrated manner, we combine a) informatics methods and tools, including the Semantic Web, b) data science analytic and visualization tools and techniques, c) life cycle data and information management best practices, d) extensive and documented experience in software-based data infrastructure deployments that enable both individual and collaborative science and education, and e) knowledge of how science networks (or virtual organizations) on the scale and heterogeneity of DCO can be stimulated, and of what data and social/community infrastructure supports such a network. There are many related efforts in each of these individual areas (e.g. Illinois and Indiana for informatics, George Mason University for data science, and the University of New Mexico, University of Colorado Boulder, Lamont-Doherty Earth Observatory, and many others for data management), but no other group to our knowledge has the requisite expertise and deep, integrative research skill in all five areas required for the ambitious, large-scale inter- and multi-disciplinary science effort that DCO-DS is.


4. What is the research methodology?


Data science combines aspects of informatics, data management, library science, computer science and physical science, using supporting cyberinfrastructure and information technology (Appendix F). Data science is changing the way all of these disciplines work, individually and collaboratively. The research data life cycle needs and requirements analysis for existing DCO projects will continue, conducted through interviews structured around three broad phases of the data life cycle (acquisition, curation and preservation, since these typically involve different people/roles) and placed in the context of emergent best practices for data management plans covering connected portions of the data life cycle. For just one example, see the MIT DDI Alliance view (detail in Appendix L).


For effective return on DCO investments, a clear understanding of the relative importance of, and resource investments along, the life cycle is needed; e.g. emphasis on immediate research benefit close to the acquisition stage in comparison to the preservation stages. The goal is to determine which data management approaches/principles are shared and which differ across the DCO and its Directorates if they are to effectively fulfill their objectives. As a positive consequence, new DCO projects will be informed by the current range of data science and data management plan options, and will be asked to respond to these topics in their proposals. We will add to online tools such as the Data Management Plan generator (https://dmp.cdlib.org/), which provides data management plan templates for NSF programs and a few foundations. The Data Science Data Management Team will add a template for AP Sloan/DCO purposes (first 6 months) and work with prospective DCO projects to include plans in their proposals and to retrofit plans for existing projects.


The core challenge in creating a data management infrastructure is implementing support for the creation, management and dissemination of the life cycle metadata associated with the research data itself. Each DCO project must be equipped with tools to properly generate and administer life cycle metadata, and to disseminate that metadata in order to participate in the Deep Carbon Virtual Observatory. This will facilitate immediate research use, dissemination, collaboration and re-use of data resources. The ongoing needs of new DCO projects will be assessed using TWC’s Use Case development methodology (see Appendix F), a formal approach to data and information systems (i.e. informatics) development that is essential as DCO-DS progresses toward the goal of a Deep Earth Computer.


5. What will be the output from the research project?


During the three years, this project will have the following outcomes/deliverables, one per key element:

  1. The Deep Carbon Virtual Observatory that allows discovery, dissemination and use of DCO data and is used by the DCO community and other researchers.
  2. Community Data Resource agreements and interface specifications in place with a number of organizations as prioritized by the Data Science Advisory Committee.
  3. A DCO Science Network that routinely collaborates in research on-line.
  4. Visualization infrastructure and methodologies in use for DCO research and Engagement.
  5. Data from DCO projects becoming as valuable as the research papers written about them.
  6. Data Science and Data Management education and training will have an impact on the way DCO researchers, particularly those in their early career, conduct their science and share data.

These overall outcomes for this period of the project are intended to maximize the return on investment in the generation of DCO data, information and knowledge products in the pursuit of DCO decadal science, data science and engagement goals.


Appendix - References

  • Benedict, J.L., McGuinness, D.L., & Fox, P. 2007, A Semantic Web-based Methodology for Building Conceptual Models of Scientific Information, EOS Trans. AGU, 88(52), Fall Meeting Suppl., Abstract IN53A-0950.
  • Borgman, C. et al. 2008, Fostering Learning in the Networked World, Report of the NSF Cyber Learning Task Force.
  • Data Quality Screening Service (DQSS), http://tw.rpi.edu/web/project/DQSS
  • Downie, M., Kaiser, P., Enloe, D., Fox, P., Hendler, J., Ameres, E., Goebel, J. 2011, Evolving a Rapid Prototyping Environment for Visually and Analytically Exploring Large-Scale Linked Open Data, IEEE Large Data Analysis and Visualization, in press.
  • EC 2010, Riding the wave. How Europe can gain from the rising tide of scientific data, Final report of the High Level Expert Group on Scientific Data, European Commission, Brussels, 2010, 36pp.
  • H. J. Falk-Krzesinski, Katy Börner et al. 2010, Advancing the Science of Team Science, Clinical and Translational Science, 3 (5), pp 263–266. DOI: 10.1111/j.1752-8062.2010.00223.x
  • Field: http://www.openendedgroup.com/field/wiki
  • P. Fox, D. McGuinness, R. Raskin, A. K. Sinha 2007, A Volcano Erupts: Semantically Mediated Integration of Heterogeneous Volcanic and Atmospheric Data, ACM Proceedings of the CyberInfrastructure: Information Management in eScience (CIMS). DOI: 10.1145/1317353.1317355
  • P. Fox, D. McGuinness, L. Cinquini, P. West, J. Garcia, and J. Benedict, 2009, Ontology-supported Scientific Data Frameworks: The Virtual Solar-Terrestrial Observatory Experience, Computers and Geosciences, special issue on Geoscience Knowledge Representation for Cyberinfrastructure, 35, #4, 724-738.
  • Fox, P., Hendler, J. 2009, Semantic eScience: Encoding Meaning in Next-Generation Digitally Enhanced Science, in The Fourth Paradigm: Data Intensive Scientific Discovery, Eds. Tony Hey, Stewart Tansley and Kristin Tolle, Microsoft External Research, pp. 145-150.
  • Peter Fox and James Hendler, 2011, Changing the Equation on Scientific Data Visualization, Science, Vol. 331 no. 6018 pp. 705-708, DOI: 10.1126/science.1197654 online at http://www.sciencemag.org/content/331/6018/705.full and http://escience.rpi.edu/publications/
  • Fox, P., McGuinness, D.L. and the VSTO team 2011, Semantic Cyberinfrastructure: the VSTO experience, in Geoinformatics: Cyberinfrastructure for the Solid Earth Sciences, eds. G.R. Keller and C. Baru, Cambridge University Press, Cambridge, UK, pp. 21-36.
  • Fox, P. 2011, The Rise of Informatics as a Research Domain, in Proceedings of the Water Information Research and Development Alliance, CSIRO Publications, pp. 125-132.
  • Fox, P. and Harris, R. 2012, ICSU and the Challenges of Data and Information Management for International Science, Data Science Journal, in press.
  • Fox, P. et al. 2012, An Open-World Iterative Methodology for the Development of Semantically-enabled Applications (in preparation).
  • GRDI 2011, Global Research Data Infrastructures: The GRDI2020 Vision, project funded by the European Commission, 7th Framework Programme for Research and Technological Development, www.grdi2020.eu
  • ICSU SCCID 2011, Interim Report of the International Council for Science Strategic Coordinating Committee on Information and Data, available from www.icsu.org
  • Integrated Ecosystem Assessment Interoperability Initiative (ECOOP), http://tw.rpi.edu/web/project/ECOOP
  • Leptoukh, G. et al. 2010, Towards Consistent Characterization of Quality and Uncertainty in Multi-sensor Aerosol Level 3 Satellite Data. EOS Trans, A21-03 presented at 2010 Fall Meeting, AGU, San Francisco, Calif., 13-17 Dec.
  • Linked Open Government Data at http://logd.tw.rpi.edu
  • D. McGuinness, P. Fox, L. Cinquini, P. West, J. Garcia, J. L. Benedict, and D. Middleton, The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research, AI Magazine, Vol. 29, 2007.
  • Multi-sensor Data Synergy Advisor (MDSA), http://tw.rpi.edu/web/project/MDSA
  • NSF 2011a, A Report Of The National Science Foundation Advisory Committee For Cyberinfrastructure Task Force On Cyberlearning And Workforce Development, NSF, April 2011, http://www.nsf.gov/od/oci/taskforces/
  • NSF 2011b, A Report Of The National Science Foundation Advisory Committee For Cyberinfrastructure Task Force On Data And Visualization, NSF, March 2011, http://www.nsf.gov/od/oci/taskforces/
  • Parade: Partnership for Accessing Data in Europe 2009, Strategy for a European Data Infrastructure, white paper.
  • Parsons, M. and Fox, P. 2012, Is Data Publication The Right Metaphor? Data Science Journal, under review.
  • Parsons M, Khalsa S, Pulsifer P, Duerr R, Fox P, McGuinness DL, McCusker JP 2012, The Many Dimensions of Sea Ice - A Beginning Ontology, AAG, in press.
  • RDWS 2010, Report on the Research Data Workforce Summit, Chicago, Dec 6, 2010, https://www.ideals.illinois.edu/bitstream/handle/2142/25830/RDWS_Report_Final.pdf
  • Rozell E., Maffei A.R., Beaulieu S.E., Fox P. 2010, A Framework for Integrating Oceanographic Data Repositories. EOS Trans, IN23A-1349 presented at 2010 Fall Meeting, AGU, San Francisco, Calif., 13-17 Dec.
  • Science 2010, Special Online Collection: Dealing with Data, http://www.sciencemag.org/site/special/data/
  • Semantic eScience Framework, http://tw.rpi.edu/web/project/SeSF
  • Virtual Solar-Terrestrial Observatory (VSTO), http://tw.rpi.edu/web/project/VSTO
  • VIVO; http://vivo.cornell.edu
  • D.N. Williams, R. Ananthakrishnan, et al., The Earth System Grid: Enabling Access to Multi-Model Climate Simulation Data. In: Bulletin of the American Meteorological Society, 90(2): 195-205, 2009.

Appendix F: RPI/TWC Informatics Methodology Approach and Method


Please refer to TWC Semantic Web Methodology


Appendix G. The Tetherless World Constellation at RPI


The Tetherless World Constellation (TWC; http://tw.rpi.edu) at RPI explores the research and engineering principles that underlie the Web, uses them to enhance the Web's reach beyond the desktop and laptop computer, and develops new technologies and languages that expand the capabilities of the Web. The faculty and staff use powerful scientific, mathematical and visualization techniques from many disciplines to explore the modeling of the Web from network- and information-centric views. TWC goals include making the next generation Web natural to use while being responsive to the growing variety of policy and social needs, whether in the area of privacy, intellectual property, general compliance, or provenance. The semantic eScience/data science research theme of TWC focuses on semantic data frameworks, next generation virtual observatories, semantic data and knowledge integration, and knowledge provenance for science.


Within TWC, PI Fox’s research utilizes state-of-the-art modeling techniques, Internet-based technologies, including the Semantic Web, and applies them to large-scale distributed scientific repositories addressing the full life-cycle of data and information within specific science and engineering disciplines as well as among disciplines. Fox currently leads several federally funded data science activities that span the spectrum from computer and information science research to geoscience/ environmental applications.


Appendix L: MIT Data Documentation Initiative Alliance – Data Life Cycle




Figure L. The DDI life cycle depiction.


The Data Documentation Initiative (DDI; Fig. L), an international standard originating in 1995 for describing data from the social, behavioral, and economic sciences, serves as an effective template for modeling the research data life cycle. DDI metadata accompanies and enables the conceptualization, collection, processing, distribution, discovery, analysis, re-purposing, and archiving of research data. See http://libraries.mit.edu/guides/subjects/data-management/cycle.html


Appendix M: Data as a First-Class Object for Science


Every item of scientific data has a history and a context from which it is produced, regardless of whether it is the result of a measurement, derived by applying models to measured data, or synthesized from first principles. Indeed the very practice of science over the centuries has depended upon the unambiguous joining and identification of data and context. Throughout most of that history the mechanism for such binding has been the scientific article, and to this day in many of the sciences datasets are referenced by the journal articles that introduce them and document their creation. This model limits every facet of science, from propagation to discovery to consumption of data, including discourse about the specific data, the processes that produced the data, and the limitations and special considerations of those data.


The scientific process depends upon treating each data element as a first-class object, meaning not only that every data element is unambiguously identified, but that the infrastructure enables metadata about each data element to be considered as part of the consumption of that data. If the data is experimental, the metadata may include instrumentation and configuration parameters; if the data is computational, the source data and applied models are specified.


In regard to the need to identify objects, a dual DOI/Handle (http://handle.net) approach is needed:

  • For (formally) published articles and their associated data, we will support the well-established DOI name resolution standard in our infrastructure. This will include encouraging and facilitating the use of DataCite (http://datacite.org), a special case of DOI for use with co-published scientific datasets.
  • For pre-publication and non-published artifacts (essentially, everything that has not been formally published), we will operate our own Handle Server, registered as part of the Global Handle System. In essence, we will have DOI-like DCO Identifiers.

There will be a DCO identifier proxy, similar to those operated by the Corporation for National Research Initiatives (CNRI; http://www.cnri.reston.va.us/), that will allow DCO identifiers to be treated as resolvable HTTP URIs; a sketch of such proxy-based resolution is given below.
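For illustration only, the sketch below resolves an identifier through an HTTP proxy. The DOI proxy (doi.org) is a real public resolver and the DOI shown is one cited in this proposal; the DCO proxy URL in the commented line is purely a placeholder for the DCO-operated resolver described above.

```python
import urllib.request

# Hedged sketch: resolve an identifier via an HTTP proxy service and
# report the landing-page URL it redirects to.
def resolve(proxy_url: str) -> str:
    """Follow the proxy's HTTP redirects and return the final URL."""
    request = urllib.request.Request(proxy_url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.geturl()

if __name__ == "__main__":
    # A DOI cited in this proposal (Fox and Hendler 2011, Science),
    # resolved through the public DOI proxy.
    print(resolve("https://doi.org/10.1126/science.1197654"))
    # A hypothetical DCO identifier resolved through a DCO-operated proxy
    # (placeholder URL; such a proxy does not yet exist):
    # print(resolve("http://handle.example.org/11121/dco-example-0001"))
```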


Appendix N: Background on Data/ Data Science Education and Training


Numerous reports worldwide have highlighted that significant attention must be directed to Data Science education, training and workforce considerations. Examples include: “Train a new generation of data scientists, and broaden public understanding” from an EU Expert Group [EC 2010], “…the nation faces a critical need for a competent and creative workforce in science, technology, engineering and mathematics (STEM)...” [NSF 2011a], "We note two possible approaches to addressing the challenge of this transformation: revolutionary (paradigmatic shifts and systemic structural reform) and evolutionary (such as adding data mining courses to computational science education or simply transferring textbook organized content into digital textbooks).” [NSF 2011b], and “The training programs that NSF establishes around such a data infrastructure initiative will create a new generation of data scientists, data curators, and data archivists that is equipped to meet the challenges and jobs of the future." [NSF 2011c]. These direct and high-level assessments point to the almost complete absence of academically rigorous and paradigm-shifting Data Science education programs, and call for such programs to be established.


Further context on Data Science education needs, highlighting RPI's prominence, arises from two sources. First, the interim report of the International Council for Science Strategic Coordinating Committee on Information and Data [ICSU SCCID 2011] features this excerpt from section 4.2.4, Data scientists and professionals: "An unfortunate state in the recognition of data science, is that there is a lack of appreciation of the need for a set of professional knowledge in skill in key areas, many of which have not been emphasized to date, e.g. professional approaches to the management of data over its lifecycle. As such, the effort required to be a data scientists is not valued sufficiently by the remainder of the scientific community." SCCID Recommendation 6 reads: “We recommend the development of education at university level in the new and vital field of data science. The curriculum included in appendix N can be used as a starting point for curriculum development.” The SCCID report's Appendix N is entitled “Example curriculum for data science” and explicitly uses the “Curriculum for Data Science taught at Rensselaer Polytechnic Institute, USA”.


Second, the Research Data Workforce Summit (held in Dec 2010) report states (in the “New Profession” section) [RDWS 2010]: "The iSchools’ programs are designed for training information professionals in data curation and data management, and RPI and George Mason are focused on training in data science, informatics, and data management for students in the sciences and beyond."