DCO-DS Technology Infrastructure


This page lists the technology components that we have evaluated, and chosen to use, for the DCO-DS project.

Dataset Registration and Deposit

The DCO infrastructure will not store all of the data generated within the DCO community. However, for smaller projects, for projects just getting started, or for any project looking for a place to store data, we will provide a data storage mechanism, and we will do so using CKAN.

As their website states: "CKAN is a powerful data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data." See more at http://ckan.org/.

At first, we weren't sure that CKAN would be the right mechanism for this (see http://tw.rpi.edu/web/project/DCO-DS/Meetings/20130410-DS#ckan). But its flexibility and its integration with other application platforms (VIVO, Drupal, and more) soon convinced us that CKAN was the right application for the job. We also learned from other projects and other students within our lab that CKAN can do what we need it to do, and how they use it.

In addition to providing this service for the community, along with VIVO integration for data registration, assignment of DCO-IDs, and metadata entry and access, we are also developing data management plans that the community can use to help with these complex issues.

Persistent Naming Infrastructure

Every object in the DCO portal, whether a person, organization, document, dataset, image, or photo, has a unique handle associated with it. These handles can be used with a handle proxy to locate the object within the DCO system. For example, if you go to http://hdl.handle.net and enter the ID 11121/4317-8058-4791-8747-CC, you will be directed to the dataset "Noble gas isotope abundances in terrestrial fluids".

The Deep Carbon Observatory has registered its own prefix within the Handle System, 11121. That prefix, combined with a unique identifier, makes up the DCO-ID.

DCO also runs its own handle proxy, which the system uses to resolve handles. Given the same DCO-ID, you can browse to http://dx.deepcarbon.net/11121/4317-8058-4791-8747-CC and be taken to that dataset page.

We use the following technologies to support our unique, persistent naming scheme:

  • Handle System: Handle.net
  • DCO-ID Handle System prefix: 11121
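As a concrete illustration, here is a minimal Python sketch of how a DCO-ID is formed from the registered prefix and a suffix, and how it becomes a resolvable URL through a handle proxy. The helper names are illustrative, not part of any DCO codebase.

```python
# Sketch: building resolution URLs for a DCO-ID (prefix 11121).
# The handle suffix below is the example from the text; the helper
# functions are hypothetical, for illustration only.

DCO_PREFIX = "11121"

def dco_handle(suffix: str) -> str:
    """Return the full handle (DCO-ID) for a DCO object."""
    return f"{DCO_PREFIX}/{suffix}"

def proxy_url(handle: str, proxy: str = "http://dx.deepcarbon.net") -> str:
    """Return a resolvable URL for a handle via the given proxy."""
    return f"{proxy}/{handle}"

handle = dco_handle("4317-8058-4791-8747-CC")
print(proxy_url(handle))                           # DCO's own proxy
print(proxy_url(handle, "http://hdl.handle.net"))  # the global Handle.net proxy
```

Either URL should resolve to the same dataset page, since both proxies look the handle up in the same Handle System.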

VIVO

VIVO is an open source semantic web application originally developed and implemented at Cornell. When installed and populated with researcher interests, activities, and accomplishments, it enables the discovery of research and scholarship across disciplines at that institution and beyond. VIVO supports browsing and a search function which returns faceted results for rapid retrieval of desired information. Content in any local VIVO installation may be maintained manually, brought into VIVO in automated ways from local systems of record, such as HR, grants, course, and faculty activity databases, or from database providers such as publication aggregators and funding agencies.

At the heart of VIVO is the VIVO core ontology, which includes VIVO-specific terms but also incorporates existing ontologies such as FOAF (Friend of a Friend), the BIBO ontology for representing documents and citations, SKOS, Dublin Core Elements and Terms, and others.

In addition to the ontologies provided with VIVO, we can add our own. In our case, we've created the DCO Ontology, which extends the existing core ontologies along with the other ontologies we will be using: the newly created PROV-O, the O&M (Observations and Measurements) ontology, and the VSTO (Virtual Solar-Terrestrial Observatory) ontology, which was also created in the Tetherless World Constellation. We are also looking at the BCO-DMO ontology from our work with the Woods Hole Oceanographic Institution, and at our own Tetherless World ontology.

DCO Community Portal

Based on Drupal Commons (Drupal 6), the DCO community portal is meant to be a place where all members of the DCO community can find information about the different communities and groups within DCO, contribute content, participate in discussions, comment on posts, subscribe to the DCO newsletter as well as any others created in the DCO community, and maintain their profile information. It is meant to be the rallying point for everything DCO related.

We have made quite a few modifications to this installation of Drupal Commons. Specifically, we have implemented our own theme (available in our subversion repository) for a unique style. This was developed in collaboration with the DCO Engagement team at the University of Rhode Island.

We have also implemented our own module to take advantage of the many hooks available within the Drupal framework. These hooks allow us to better integrate our Drupal instance with our VIVO and CKAN installations, tying the community portal together with the systems that collect and make available all the pieces of information provided within the DCO community: documents, images, events, data collections, metadata about all of these resources, and more.

Some of the featured Drupal Modules that we have installed include:

Faceted/Hierarchical Browsing

To allow users to browse various concepts, such as datasets, publications, and projects, we use the facetview2 faceted search infrastructure, a pure JavaScript frontend for ElasticSearch. ElasticSearch is implemented on top of Apache Lucene and uses an inverted index to maximize performance and provide fast text-based searches. Information from the VIVO knowledge store is ingested into ElasticSearch once every hour.

Why ElasticSearch instead of SOLR? We chose ElasticSearch over SOLR for many reasons, the most important being that ElasticSearch allows nested objects in stored documents, whereas SOLR stores a flattened document. For example, a publication has a list of authors and was published in a peer-reviewed scientific journal. In ElasticSearch the top-level document is the publication, with nested objects representing the authors and the journal. Each nested object carries additional information about its concept, such as the name and associated organization for each author, or the publisher and editor of the journal. In SOLR you don't get nested objects, just a flattened structure with no way to tie the objects together.
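A minimal sketch, in Python, of the difference: the nested document keeps each author's fields together, while the flattened version loses the pairing. All field names and values here are illustrative, not the actual DCO index mapping.

```python
import json

# A publication document with nested author and journal objects, as
# ElasticSearch can store it. All names and values are made up.
publication = {
    "title": "An Example DCO Publication",
    "authors": [  # nested objects: each author's fields stay together
        {"name": "A. Researcher", "organization": "Example University"},
        {"name": "B. Scientist", "organization": "Another Institute"},
    ],
    "journal": {  # nested object for the venue
        "name": "Example Journal of Geochemistry",
        "publisher": "Example Press",
    },
}

# A SOLR-style flattened document collapses the authors into parallel
# arrays, so nothing ties a name to its organization:
flat = {
    "title": publication["title"],
    "author_name": [a["name"] for a in publication["authors"]],
    "author_org": [a["organization"] for a in publication["authors"]],
}

print(json.dumps(flat, indent=2))
```

In the nested form, a query can ask for "an author named A. Researcher *at* Example University" and match only documents where both fields occur in the same author object; the flattened form cannot express that constraint.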

The previous version, built on the S2S faceted browsing infrastructure, relied on dynamic SPARQL queries against the VIVO knowledge store. This proved very slow, because text-based queries against a triple store are expensive. S2S was initially designed and developed by Eric Rozell, a former graduate student at the Tetherless World Constellation, as part of an internship at the Woods Hole Oceanographic Institution. S2S originally stood for Ship 2 Shore, but the framework's design was general enough that anyone could use the browser to browse anything, whether the queries behind individual widgets ran against a triple store, a relational database, or a text file. Any kind of widget can be created and used, as long as it is semantically represented. More information on S2S.

For the DCO-DS project we have created multiple browsers.

Visualizing Semantic Information in Drupal

Members of the DCO-DS team, for a different project, have implemented a Drupal module (currently available only for Drupal 6) to query an endpoint, retrieve semantically enriched information about entities, and display it in a meaningful and visually pleasing way. The module has a modular design, so its functionality can be used in Drupal pages, MediaWiki pages, simple web pages, or other content management systems. For the Deep Carbon Observatory project we have installed the module in the Drupal portal referenced above.

The twsparql module is called from within a Drupal page using a sparql tag, like so:

<sparql endpoint="http://udco.tw.rpi.edu/vivo/admin/sparqlquery?resultFormat=RS_XML&rdfResultFormat=RDF%2FXML" query="http://tw.rpi.edu/queries/dco/dco_proj_dates.rq" xslt="http://joshwood.info/dcosparql/xml-to-html.xsl"/>

The endpoint can be specified as an attribute of the sparql tag; if it is not specified, the configured default is used. The query is loaded from the file named by the query attribute; the query file can live on the local file system or be retrieved from the Web. Parameters to the query can also be specified in the tag. For example, if we are interested in a specific entity, we can specify its URI. The query is run against the endpoint, and an RDF/XML document is returned. That document is transformed using the XSLT file named by the xslt attribute; again, this file can be stored locally and configured in Drupal, or specified by a URL on the Web. The resulting HTML is then displayed in the page.
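The first step of that pipeline, submitting the query to the VIVO endpoint, can be sketched in Python. The endpoint URL is taken from the tag example above; the query text is a stand-in, and the parameter names mirror the tag's attributes but are otherwise assumptions about the endpoint's interface.

```python
from urllib.parse import urlencode

# Sketch of what twsparql does first: build the request that submits a
# SPARQL query to the configured VIVO endpoint. The query below is a
# placeholder, not one of the project's actual query files.
endpoint = "http://udco.tw.rpi.edu/vivo/admin/sparqlquery"
query = "SELECT ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

params = urlencode({
    "query": query,
    "resultFormat": "RS_XML",       # result serialization, as in the tag
    "rdfResultFormat": "RDF/XML",   # urlencode escapes the slash as %2F
})
request_url = f"{endpoint}?{params}"
print(request_url)
# The module would then fetch this URL, receive an RDF/XML document, and
# apply the XSLT file named in the tag's xslt attribute to produce HTML.
```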

Examples of this can be found here:

DCO Semantic Web Services

  • SPARQL Endpoint - Anyone can access the DCO Data Portal SPARQL endpoint provided by VIVO. Using SPARQL, this endpoint lets people query any and all of the information stored in the DCO Data Portal.

https://info.deepcarbon.net/vivo/admin/sparqlquery

For example, to see all of the triples associated with the DCO Data Science Team, run this query:

SELECT ?p ?o WHERE { <http://info.deepcarbon.net/individual/d5cb52d9b-cb37-4710-ba50-c47f70fff7b9> ?p ?o . }
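A sketch, using only Python's standard library, of parsing the kind of SPARQL XML results such a query returns. The sample response below is hand-written stand-in data, not actual portal output.

```python
import xml.etree.ElementTree as ET

# A hand-written stand-in for a SPARQL XML results document; the binding
# values are illustrative, not real data from the DCO endpoint.
SAMPLE = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="p"/><variable name="o"/></head>
  <results>
    <result>
      <binding name="p"><uri>http://www.w3.org/2000/01/rdf-schema#label</uri></binding>
      <binding name="o"><literal>DCO Data Science Team</literal></binding>
    </result>
  </results>
</sparql>"""

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Each <result> becomes a dict mapping variable name -> bound value.
rows = []
for result in ET.fromstring(SAMPLE).findall(".//sr:result", NS):
    row = {b.get("name"): b[0].text for b in result.findall("sr:binding", NS)}
    rows.append(row)

print(rows)
```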

Visualization

  • Field
  • d3.js
    • d3.js is a powerful and effective data visualization library. Because it is based on JavaScript, it supports a high degree of interactivity. The many examples on its website make implementation and coding easier, and it supports visualization types ranging from simple bar charts to complex data representations.
    • Some of the visualizations created using d3.js for DCO can be found at the following links
  • Google charts
    • The Google Charts package with R can be used to create visualizations, but we found it limited in capability: beyond the basic visualizations, there was no support for creating new ones.
  • Shiny (RStudio)
    • Shiny comes with a robust set of data analysis capabilities, since it lets one use packages from the R environment. Its drawbacks are that support for creating new visualizations is limited at this point in time, and that there is little scope for making the visualizations interactive.

Single Sign-On

With numerous components making up the DCO Portal, we needed a way for DCO members to have a single username and password, log in once, and use the different components without having to log in to each one. To do this we chose Shibboleth with an LDAP back end to store users' login information. For LDAP we use the Apache Directory Server, and we manage user information with Apache Directory Studio, although LDAP commands can also be entered on the server directly.

During registration the user enters information that will eventually be stored in the DCO Data Portal, including a username and password. Once the user clicks Submit, they are sent an email to confirm their request. Once confirmed, a DCO user administrator can accept or deny the request. If accepted, all the information entered during registration is stored in the DCO Data Portal, and the user's login information is pushed to the local LDAP repository.
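The final provisioning step, pushing an accepted user's login information into LDAP, might produce a directory entry like the following. This sketch assumes a standard inetOrgPerson entry rendered as LDIF, with an illustrative directory suffix; the actual DCO directory schema may differ.

```python
# Sketch of the LDAP entry provisioning might push for an accepted user.
# The DN suffix (ou=users,dc=deepcarbon,dc=net) and attribute choices are
# assumptions, not the actual DCO directory layout.

def user_ldif(uid: str, first: str, last: str, mail: str) -> str:
    """Render a standard inetOrgPerson entry as an LDIF record."""
    return "\n".join([
        f"dn: uid={uid},ou=users,dc=deepcarbon,dc=net",
        "objectClass: inetOrgPerson",
        f"uid: {uid}",
        f"givenName: {first}",
        f"sn: {last}",
        f"cn: {first} {last}",
        f"mail: {mail}",
        # userPassword would be set (hashed) from the registration form
    ])

print(user_ldif("jdoe", "Jane", "Doe", "jdoe@example.org"))
```

Apache Directory Studio can import such an LDIF record directly, which is one way the "pushed to the local LDAP repository" step could be carried out.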

Using the Shibboleth system, users can then log in once to any component of the system and remain logged in to all of the components for the duration of the login session.

DCO Data Management


Related Portals

  • GeoSamples
    • Related to Diamonds
    • Run by EarthChem/IEDA
    • Uses IGSN (sample number)
  • PetDB
    • Petrology DataBase
    • Also run by EarthChem/IEDA
    • Also uses IGSN

Research Data Management Resources