Archive for December, 2011

Characterizing quality for science data products

December 30th, 2011

Characterizing quality for a science data product is hard. We have been working on this issue in our Multi-Sensor Data Synergy Advisor (MDSA) project with Greg Leptoukh and Chris Lynnes from the NASA Goddard Space Flight Center (GSFC). The following is my opinion on what product quality means and how it can be characterized. This work was presented as a poster at the AGU FM 2011 meeting.

Science product quality is hard to define, characterize, and act upon. Product quality reflects a comparison against standard products of a similar kind, but it also reflects the fitness-for-use of the product for the end user. Users weigh quality characteristics (e.g. accuracy, completeness, coverage, consistency, representativeness) based on their intended use for the data, and therefore the quality of a product can differ based on different users’ needs and interests. Despite the subjective nature of quality assertions, and their sensitivity to users’ fitness-for-use, most quality information is provided by the product producer, and the subjective criteria used to determine quality are opaque, if available at all.

If users are given product quality information at all, this information usually comes in one of two forms:

  • technical reports in which extensive statistical analysis is reported on very specific characteristics of the product
  • subjective and unexplained statements such as ‘good’, ‘marginal’, or ‘bad’

The former is information overload that the user cannot quickly assess; the latter lacks nearly all of the information a user needs to make their own subjective quality assessment.

Is there a similar scenario in everyday life where users are presented with quality information that they can readily understand and act upon?

There is, and you see it every day in the supermarket.

A common application of information used to make subjective quality assessments

Nutrition Facts labels provide per-serving nutrition information (e.g. the amounts of Total Fat, Total Carbohydrates, and Protein) and show how the listed amounts per serving compare to a reference daily diet.

The comparison to a standard 2,000-calorie diet gives the user a simple tool for assessing the usefulness of a food item in their unique diet. Quality assertions, such as whether this food is ‘good’ or ‘bad’ for the consumer’s diet, are left to the consumer – but are relatively easy to make with the available information.
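The arithmetic behind the label’s “% Daily Value” column is simple enough to sketch. In this minimal example the reference amounts are approximate figures for a 2,000-calorie diet, not values taken from any particular label:

```python
# Percent-daily-value arithmetic behind a Nutrition Facts label.
# Reference amounts are approximate figures for a 2,000-calorie diet.
DAILY_REFERENCE = {
    "total_fat_g": 65,
    "total_carbohydrate_g": 300,
    "protein_g": 50,
}

def percent_daily_value(nutrient, amount_per_serving):
    """Return the per-serving amount as a percentage of the daily reference."""
    return round(100 * amount_per_serving / DAILY_REFERENCE[nutrient])

# A serving with 13 g of total fat covers 20% of the daily reference.
print(percent_daily_value("total_fat_g", 13))
```

The consumer’s ‘good’/‘bad’ judgment stays with the consumer; the label just supplies the normalized comparison.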

A ‘quality facts’ label for a scientific data product, showing computed values for community-recognized quality indicators, would go a long way towards enabling a nutrition label-like presentation of quality that is easy for science users to consume and act upon.

An early mockup of a presentation of quality information for a science data product

We have begun working on mockups of what such a presentation of quality could look like, and have constructed a basic quality model that would allow us to express in RDF the information that would be used to construct a quality facts label.
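As a sketch of the idea, the label’s content can be captured as subject–predicate–object triples. The vocabulary URI, property names, and values below are hypothetical placeholders, not terms from our actual quality model (see the primer for those):

```python
# RDF-style triples behind a hypothetical 'quality facts' label.
# The vocabulary URI, property names, and values are placeholders,
# not terms from our published quality model.
Q = "http://example.org/quality#"  # hypothetical quality vocabulary

def triples_for_label(product_uri, indicators):
    """Yield (subject, predicate, object) triples for a quality facts label."""
    for name, value in indicators.items():
        yield (product_uri, Q + name, value)

product = "http://example.org/data/aerosol-product"  # hypothetical product URI
indicators = {"completeness": 0.92, "spatialCoverage": 0.78}
triples = list(triples_for_label(product, indicators))
```

A label renderer would then only need to walk these triples to lay out one row per community-recognized indicator.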

Our quality model primer presents our high-level quality model and its application to an aerosol satellite data product in detail.

Our poster presentation was a hit at AGU, where we received a great deal of positive feedback. The nutrition label-like presentation is immediately familiar, and it supports the metaphor of science users ‘shopping’ for the best data product to fit their needs.

We still have a long way to go on developing our presentation, but the feedback from discussions at AGU tells me that our message resonated with our intended audience.


S2S Feedback at AGU Fall Meeting 2011

December 19th, 2011

The AGU Fall Meeting 2011 was a busy meeting and, as usual, the Tetherless World Constellation (TWC) received quite a bit of attention in terms of best practices and tool support for Semantic eScience. I gave two poster presentations. The first, in the Semantic, Linked Data, and Drupal-based Solutions for Science (IN31B) poster session, was about creating linked data for AGU abstracts. The second was in IN31A, a session about the Real Use of Open Standards and Technologies, though it became apparent that I was more interested in talking about it as an IN31B poster. That poster was on S2S, and it drew a range of feedback, which I discuss in this blog: enthusiasts who wanted to implement it, skeptics who felt it was not an “interoperable” solution, and faceted-browse developers who wanted to know why S2S needed so much complexity.

Addressing the first type of feedback is not difficult. I want everyone to be able to deploy an S2S interface for their data. However, I often have to hold myself back, because I know that the software is not to a point that it can be easily reused without a significant amount of hand-holding on my part. The basic problems are documentation and complexity of installation. While the documentation problem can be easily fixed, the problem of installation will remain until the S2S back-end architecture is updated. The back-end architecture depends on a triplestore deployed on one of TWC’s machines for indexing metadata about S2S services. I plan to move the back-end to a linked data crawler approach next spring, removing the dependencies on TWC triplestores and enabling wider installation.

The second type of feedback was more interesting to address. It’s always good to hear constructive criticism about a project. The argument was that, because S2S uses its own vocabulary to describe, among other things, Web services, “widgets”, and parameters, it is not interoperable, since existing tools will not understand those vocabularies. I have two primary defenses. The first is that S2S allows you to define virtually any term so that it can be used by both old and new tools. For instance, S2S allows you to define each of the OpenSearch vocabulary terms, including “results”, “searchTerms”, “startIndex”, and “count”. Each of these has in fact been implemented by our OpenSearch services for S2S, so when a traditional OpenSearch tool finds an S2S OpenSearch service, it should still be able to use it. The second defense is: if you do not agree with the S2S vocabulary, find a vocabulary with as much tool support as S2S for developing faceted browse or advanced search interfaces. At the time the S2S project started, we found no vocabularies for describing the “extensibility” aspects of OpenSearch (i.e., the fact that URIs can be used in place of any of the OpenSearch terms). So we defined those vocabularies ourselves, designed specifically for S2S’s purpose. I’d be happy to collaborate with anyone who has a broader or different purpose than S2S to extend the vocabulary to their needs, or to map S2S terms to their terms.
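To make the template mechanics concrete, here is a minimal sketch of filling an OpenSearch-style URL template that uses the standard terms named above. The endpoint and parameter order are hypothetical; only the `{searchTerms}`, `{startIndex}`, and `{count}` template parameters come from the OpenSearch specification:

```python
import re

# An OpenSearch-style URL template using the standard parameter names;
# the endpoint itself is a hypothetical example.
TEMPLATE = "http://example.org/search?q={searchTerms}&start={startIndex}&n={count}"

def fill_template(template, **params):
    """Substitute OpenSearch template parameters with concrete values."""
    return re.sub(r"\{(\w+)\}", lambda m: str(params[m.group(1)]), template)

url = fill_template(TEMPLATE, searchTerms="aerosol", startIndex=0, count=10)
print(url)
```

Any client that understands the standard terms can fill the template this way, whether or not it knows anything about S2S.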

The last type of feedback asked why the S2S framework has so much complexity. I’m not sure there is one good response to that inquiry; I think the complexity is useful when you look at the big picture for S2S. For one, S2S was never explicitly designed to be a framework for faceted browsing interfaces. Rather, it was designed for building configurable user interfaces, with a heavy emphasis on reusability of user interface components. Faceted browsing became the focus because we had two use cases that were best implemented with faceted browse. Another complexity issue was the number of queries made by an S2S faceted browser compared to something like Apache Solr. For instance, a browser with 6 facets could require 7 queries to populate the browser with data in S2S (1 per facet plus 1 for the results), whereas in Solr a single query can return all facets and facet values. The design decision in S2S was that a data manager may need to query a remote source to determine what its facet values are. Alternatively, the data manager may have a single input that they do not wish to facet (say, for performance reasons). In either case, we designed S2S to be as flexible as possible, which in some cases means it takes a little more effort to set up compared to something more rigid, such as Apache Solr.
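The query-count difference can be sketched as follows. The facet names are hypothetical, and the Solr request is shown only schematically using the standard `facet.field` parameter:

```python
# Illustration of the query-count difference: one query per facet (plus one
# for the results) in S2S, versus a single Solr-style faceting request.
# The facet names are hypothetical.
facets = ["instrument", "platform", "parameter", "region", "year", "format"]

# Per-facet approach: each facet (possibly backed by a remote service) is
# queried independently, plus one query for the result set.
per_facet_queries = [f"values-of:{f}" for f in facets] + ["results"]

# Solr-style approach: one request carries every facet field.
solr_query = "q=*:*&facet=true&" + "&".join(f"facet.field={f}" for f in facets)

print(len(per_facet_queries))  # 7 queries vs. a single request
```

The per-facet shape costs more round trips, but it is what lets each facet live behind a different (possibly remote) service.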


Is Data Publication the right metaphor?

December 15th, 2011


TWC Undergrads Visualize Linked Open Corporate Data

December 1st, 2011

Two undergraduate members of the Tetherless World team, Alexei Bulazel and Bharath Santosh, recently wrote great summaries of their work creating visualizations based on linked open corporate data aggregated through the ORGPedia project. In this post I’ll include snippets from their posts; I encourage you to check out their full posts and the demos they link to!

First, a bit of context (from the ORGPedia site):

ORGPedia: The Open Organizational Data Project, led by NYLS Professor and former United States Deputy CTO Beth Noveck (Project Lead) and TWC Senior Constellation Professor Jim Hendler (Tech Lead) explores how to create the legal, policy and technology framework for a data exchange to facilitate efficient comparison of organizational data across regulatory schemes as well as public reuse and annotation of that data. By designing a universal exchange rather than a new numbering scheme, OrgPedia aims to achieve goals like improving corporate transparency and efficiency, organizational performance, risk management, and data-driven regulatory policy–without having to wait until legislation is enacted for a single, legal entity identifier.

To date, TWC’s contribution to ORGPedia has been to aggregate data from a variety of sources, develop an experimental site to serve as a platform for integrating the data and prototyping ORGPedia concepts, and develop data visualizations and mashups that demonstrate the potential of an open system of canonical identifiers for corporate entities. Led by TWC Ph.D. student Xian Li, undergrads Alexei Bulazel and Bharath Santosh teamed together to create interesting visualizations based on the data aggregated.

Bharath first describes a visualization he created that, using our aggregated data, allows users to analyze various financial properties of financial sectors in the US:

The visualization itself is built with Google Motion Charts, part of Google’s Visualization API. It is an interactive, multidimensional graph of a dataset of sectors and the means of various financial properties across each sector’s companies. The data shown above is represented in millions of USD. The Motion Chart allows for really neat temporal analysis of data in various forms. Clicking the play button shows the change in properties from 2008 to 2011. There are also three different styles in which you can view the data: bubbles (shown above), bar charts, and line graphs. These can be switched in the top corner.

The dataset behind the visualization was created in R. I wrote a SPARQL query that would access ORGPedia’s datasets and pull out each US sector along with the companies, and their stock tickers, within it. Then I took these companies, pulled in their income statements from Google Finance, and went through each sector averaging various properties from the sector’s companies’ financial statements. The data manipulation in R took some getting used to, but now it’s very easy for me to transform data frames, matrices, and other objects in R. After the dataset was created and cleaned of non-existent values, it’s just a matter of defining properties of the Motion Chart and running it. It generates an HTML file with the graph and data represented in JavaScript. All the data processing and manipulation takes around 15 minutes, mostly due to the large amount of data to be downloaded.
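The aggregation step Bharath describes (done in R in the original pipeline) can be sketched in Python. The sectors, companies, and figures below are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

# Python sketch of the per-sector averaging step; the original pipeline
# used R and a SPARQL query. Each record: (sector, company, revenue in
# millions USD); the values here are made up.
records = [
    ("Energy", "AcmeOil", 120.0),
    ("Energy", "BetaGas", 80.0),
    ("Tech", "GammaSoft", 300.0),
]

def sector_means(rows):
    """Average a financial property across each sector's companies."""
    by_sector = defaultdict(list)
    for sector, _company, value in rows:
        by_sector[sector].append(value)
    return {sector: mean(values) for sector, values in by_sector.items()}

print(sector_means(records))  # {'Energy': 100.0, 'Tech': 300.0}
```

The resulting sector-by-year table of means is exactly the shape a Motion Chart consumes.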

Bharath then goes on to describe the compelling visualization he and Alexei created of the “social network” of corporate board members:

…The visualization gathers data about board members of various companies in the US and shows the members in a force graph that reveals which board members are on multiple boards (Board Members Network):

The graph visualization is done using the D3 visualization toolkit’s force-directed graph. Each node represents a board member. The clustered colored nodes are a group of members on the same board. The multicolored nodes represent board members who are on multiple boards. Mousing over a node shows you their name and the companies they work for. Clicking a node takes you to that member’s page. The graph shows many interesting relationships between various companies and board members – notably Steven S. Reinemund, who serves on five different boards.

On his blog, Alexei provides additional detail about the work they did to prepare the data for the visualization:

The project involved creating an interactive graph visualization of connections between members of corporate boards (the final product can be found here). Given a list of a few hundred stock tickers and access to the LittleSis API, the goal was ultimately to produce a JSON file of board members that could be used by the D3.js force-directed graph framework. I started by looking up each ticker symbol, yielding a JSON file with a unique ID number for each company. My script then queried the API for the actual company page associated with that ID and stored the names, company associations, and URIs of each board member. Finally, a JSON file for the D3.js graph was output describing the ~2800 board members and the links between each of them.

While I had used Python a bit for command line scripting, I hadn’t really dug into it before this project. The work gave me a better taste for the language and its capabilities. I made extensive use of the “urllib” library for accessing web content, and worked with opening up the data in JSON files. Bharath helped me with the syntax of the program and some of the graph construction. While I was aware of Python’s reputation for ease of use and high-level abstraction, working with it let me experience this abstraction first hand, and I was very impressed. The ease with which complex multistep operations could be completed let me focus more on the flow of the data through the process than on the specifics of handling it. The project also gave me a bit more hands-on experience with JSON.
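The data-preparation step Alexei describes can be sketched as follows. The names, board memberships, and linking rule below are simplified placeholders, not the actual LittleSis data or the students’ script:

```python
import json

# Sketch of turning board memberships into the nodes/links JSON a D3.js
# force-directed graph expects. Names and memberships are made up.
memberships = {
    "A. Smith": ["AcmeCorp"],
    "B. Jones": ["AcmeCorp", "BetaInc"],
    "C. Lee": ["BetaInc"],
}

def to_d3_graph(board_members):
    """Link any two members who share at least one board."""
    nodes = [{"name": name, "boards": boards}
             for name, boards in board_members.items()]
    index = {n["name"]: i for i, n in enumerate(nodes)}
    links = []
    names = list(board_members)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if set(board_members[a]) & set(board_members[b]):
                links.append({"source": index[a], "target": index[b]})
    return {"nodes": nodes, "links": links}

graph = to_d3_graph(memberships)
print(json.dumps(graph, indent=2))
```

Members who sit on multiple boards end up with multiple links, which is what produces the multicolored bridging nodes in the force graph.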

The reader is encouraged to read both Alexei‘s and Bharath‘s blogs for more details on these great contributions by a couple of our TWC undergrads!
