
S2S Feedback at AGU Fall Meeting 2011

December 19th, 2011

The AGU Fall Meeting 2011 was a busy one and, as usual, the Tetherless World Constellation (TWC) received quite a bit of attention for its best practices and tool support for Semantic eScience. I gave two poster presentations. The first, in the Semantic, Linked Data, and Drupal-based Solutions for Science poster session (IN31B), was about creating linked data for AGU abstracts. The second was in IN31A, a session about the Real Use of Open Standards and Technologies, though it became apparent that I was more interested in talking about it as an IN31B poster. That poster was on S2S, and it drew a range of feedback, which I discuss in this blog: enthusiasts who wanted to implement it, skeptics who felt it was not an “interoperable” solution, and faceted browse developers who wanted to know why S2S needed so much complexity.

Addressing the first type of feedback is not difficult. I want everyone to be able to deploy an S2S interface for their data. However, I often have to hold myself back, because I know the software is not yet at a point where it can be easily reused without a significant amount of hand-holding on my part. The basic problems are documentation and complexity of installation. While the documentation problem can be fixed easily, the installation problem will remain until the S2S back-end architecture is updated. The back-end currently depends on a triplestore deployed on one of TWC’s machines for indexing metadata about S2S services. I plan to move the back-end to a linked data crawler approach next spring, removing the dependency on TWC triplestores and enabling wider installation.

The second type of feedback was more interesting to address. It’s always good to hear constructive criticism about a project. The argument was that, because S2S uses its own vocabulary to describe, among other things, Web services, “widgets”, and parameters, it is not interoperable, since existing tools will not understand those vocabularies. I have two primary defenses. The first is that S2S allows you to define virtually any term so that it can be used by both old and new tools. For instance, S2S allows you to define each of the OpenSearch vocabulary terms, including “results”, “searchTerms”, “startIndex”, and “count”. Each of these has in fact been implemented by our OpenSearch services for S2S, so when a traditional OpenSearch tool finds an S2S OpenSearch service, it should still be able to use it. The second defense is: if you do not agree with the S2S vocabulary, find a vocabulary with as much tool support as S2S for developing faceted browse or advanced search interfaces. At the time the S2S project started, we found no vocabularies for describing the “extensibility” aspects of OpenSearch (i.e., the fact that URIs can be used in place of any of the OpenSearch terms), so we defined those vocabularies ourselves, designing them specifically for S2S’s purpose. I’d be happy to collaborate with anyone who has a broader or different purpose than S2S to extend the vocabulary to their needs, or to map S2S terms to theirs.
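To make the OpenSearch point concrete, here is a minimal sketch of URL-template expansion in Python. The endpoint and the “s2s:instrument” extension parameter are hypothetical, and this is illustrative rather than the actual S2S implementation; the standard terms (“searchTerms”, “startIndex”, “count”) come from the OpenSearch spec, which allows namespaced, URI-backed parameters to appear anywhere a standard term can.

```python
# A minimal sketch of OpenSearch URL-template expansion.  The endpoint
# and the "s2s:instrument" extension parameter are hypothetical; the
# standard terms come from the OpenSearch specification.
import re

TEMPLATE = ("http://example.org/search?"
            "q={searchTerms}&start={startIndex}&n={count}"
            "&instrument={s2s:instrument?}")  # trailing '?' marks optional

def expand(template, values):
    """Fill an OpenSearch URL template with parameter values."""
    def substitute(match):
        name = match.group(1)
        optional = name.endswith("?")
        key = name.rstrip("?")
        if key in values:
            return values[key]
        if optional:
            return ""  # optional parameters may simply be omitted
        raise KeyError("missing required parameter: " + key)
    return re.sub(r"\{([^}]+)\}", substitute, template)

# An old OpenSearch tool binds only the standard terms...
print(expand(TEMPLATE, {"searchTerms": "aerosols",
                        "startIndex": "1", "count": "10"}))
# ...while an S2S-aware client can also bind the extension parameter.
print(expand(TEMPLATE, {"searchTerms": "aerosols",
                        "startIndex": "1", "count": "10",
                        "s2s:instrument": "lidar"}))
```

Because the extension parameter is optional, a traditional OpenSearch client that knows nothing about it can still use the service, which is exactly the interoperability argument above.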

The last type of feedback asked why the S2S framework has so much complexity. I’m not sure there is one good response to that inquiry, but I think the complexity is useful when you look at the big picture for S2S. For one, S2S was never explicitly designed to be a framework for faceted browsing interfaces. Rather, it was designed for developing configurable user interfaces, with a heavy emphasis on reusability of user interface components. Faceted browsing became the focus because we had two use cases that were best implemented with faceted browse. Another complexity issue was the number of queries made by an S2S faceted browser compared to something like Apache Solr. For instance, in S2S a browser with 6 facets could require 7 queries to populate the browser with data (1 per facet plus 1 for the results), whereas in Solr a single query can return all facets and facet values. The design decision in S2S reflects the fact that a data manager may need to query a remote source to determine what its facet values are. Alternatively, the data manager may have a single input that they do not wish to facet (say, for performance reasons). In either case, we designed S2S to be as flexible as possible, which in some cases means it takes a little more effort to set up compared to something more rigid, such as Apache Solr.
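The query-count difference is easy to see in a sketch. In the following, the endpoints and facet names are hypothetical (this is not the actual S2S API); only the facet=true&facet.field=... request syntax in the second function is real Solr syntax.

```python
# A sketch of the two query patterns discussed above.  The endpoints and
# facet names are hypothetical; only Solr's facet parameters are real.

FACETS = ["platform", "instrument", "parameter", "region", "project", "year"]

def s2s_style_queries(base_url, facets):
    """S2S pattern: one query per facet (each facet may live behind a
    different remote service) plus one query for the result set,
    i.e., len(facets) + 1 requests in total."""
    queries = ["%s?facet=%s" % (base_url, f) for f in facets]
    queries.append("%s?results=true" % base_url)
    return queries

def solr_style_query(base_url, facets):
    """Solr pattern: a single request returns the results along with
    every facet and its values."""
    fields = "&".join("facet.field=%s" % f for f in facets)
    return "%s/select?q=*:*&facet=true&%s" % (base_url, fields)

print(len(s2s_style_queries("http://example.org/s2s", FACETS)))       # 7 requests
print(solr_style_query("http://localhost:8983/solr/catalog", FACETS))  # 1 request
```

The extra requests are the price of allowing each facet to be backed by a different, possibly remote, source rather than a single local index.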


IOGDC Presentation @ I-Semantics 2011, Triplification Challenge

September 26th, 2011

The I-Semantics 2011 Conference, co-located with I-KNOW, was held Sept. 7–9, 2011 in Graz, Austria. The conference covered a range of topics, including Web-scale recommendation systems, information visualization, semantic content engineering, Web science, the social Web, Semantic Web applications, and more. Given the broad scope of the conference, I decided to target the talks most compatible with our research agendas at TWC. Namely, I looked for work related to Linked Open Data, work applicable to our Semantic eScience Framework project, and some natural language processing and machine learning work that I am personally interested in.

Linked Data was a major theme of I-Semantics.  I saw an interesting talk on a RESTful architecture for both reading and writing Linked Data.  The architecture placed some restrictions on how the data could be structured and queried, defining ontological concepts of “records” and “layers” used to annotate the data, which, when aggregated together, essentially form named graphs.  They also made interesting use of the HTTP Range header to retrieve partial records.  Another interesting presentation was a vocabulary for creating linked data versions of calls for papers for scientific publications.  I think we should look into this vocabulary at TWC, particularly for our website, as it may be a good solution for keeping people up to date on relevant submission deadlines for publication.  Much of the other work on Linked Data was in the Triplification Challenge session.  We presented our own work on the International Open Government Data Catalog (now IOGD Search, or IOGDS) at this session, which I discuss in the final paragraph.  Other submissions included a “trip planner”, which used LOD resources as well as the Open Provenance Model to annotate tourism-related information on the Web, and an interesting application for annotating online media and performing semantic search over those annotations.  Our primary competitor in the Triplification Challenge (for the Open Government Track) was the work on Open Data Albania.  The authors gave a demo of their website, which was based on CKAN.  The most interesting part of the presentation, to me, was that they automatically convert datasets published in their catalog into Google Data Tables, which are compatible with various Google Viz tools, such as the Google MotionChart.
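As an aside, the partial-record retrieval they described might look something like the sketch below. The endpoint and the “records” range unit are hypothetical on my part (standard HTTP only defines a “bytes” unit, so retrieving records this way is a custom extension of the header).

```python
# A hedged sketch of partial-record retrieval via the HTTP Range header,
# as described in the talk.  The endpoint and the "records" range unit
# are hypothetical; standard HTTP only defines the "bytes" unit.
import urllib.request

req = urllib.request.Request("http://example.org/ld/record-set")
req.add_header("Accept", "text/turtle")
req.add_header("Range", "records=0-9")  # ask for the first ten records

with urllib.request.urlopen(req) as resp:
    # A server implementing the described architecture would answer
    # 206 Partial Content with a matching Content-Range header.
    print(resp.status, resp.headers.get("Content-Range"))
```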

There were a number of interesting presentations covering research of interest to the Semantic eScience Framework project, and I discuss two of them here.  The first was a presentation on a knowledge federation framework for biomedical applications.  What interested me most were the parallels between the design of this framework and the design of S2S.  The framework, called Coeus, uses a “connector” (referred to as an “adapter” in S2S) to attach data sources in various formats (e.g., CSV, XML, RDB, RDF) to the framework.  It then simplifies the application development process by reducing the effort required to aggregate multiple “connected” resources.  The other interesting research I saw was in the poster session on Thursday afternoon.  One of the posters was on ontology modularization, and the authors had an interesting view of the structure of modular ontologies.  In the past, we have investigated “three-layer” modularization architectures, such as this one, for VSTO and SeSF.  This work was a variation on the “three-layer” architecture, where the layers were separated not by levels of expressivity but by levels of abstraction.  The purpose of the more abstract ontologies is to provide a frame from which domain experts can rapidly and easily build their applications.  I have been in contact with the authors, and they are interested to hear whether we apply this architecture in our SeSF ontology development.  They will also be presenting this work at ISWC 2011.

The best paper award for the conference went to Pablo Mendes and the DBpedia Spotlight team.  I was very interested in this work because I am working on a project for the Federation of Earth Science Information Partners (ESIP) that extracts entities from American Geophysical Union abstracts using DBpedia Spotlight.  The presentation discussed the general functionality of Spotlight, some of the immediate changes coming in upcoming releases, and the future direction of the project.

The last part I want to discuss is our own presentation in the Triplification Challenge.  There were two tracks for the Challenge, an Open Track and an Open Government Data Track; we competed in the latter.  The talk went extremely well (we won), and a number of interesting comments and questions followed.  One person asked how we keep our data up to date, which is extremely relevant to IOGDS.  I believe that, at the time of the presentation, some of the catalogs had been converted more than 3 months prior, which meant we were likely missing a lot of updates.  Another discussion concerned IOGDS involvement with the CKAN community, which would be a step towards keeping IOGDS up to date.  Lastly, there was a question about the degree to which the project performs semantic search; while IOGDS does perform free text search over most (all?) of the literal values in the catalog, I explained that building and demonstrating an open government ontology is a topic of importance to the project.
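For readers curious what free text search over literals in an RDF catalog can look like, here is a hedged sketch using a SPARQL regex filter.  The endpoint URL is hypothetical, and this is not necessarily how IOGDS implements its search; a production system would more likely rely on a dedicated text index.

```python
# A hedged sketch of free text search over literal values in an RDF
# catalog via a SPARQL regex filter.  The endpoint is hypothetical and
# this is not necessarily how IOGDS implements its search.
import urllib.parse
import urllib.request

QUERY = """
SELECT DISTINCT ?dataset ?value WHERE {
  ?dataset ?property ?value .
  FILTER(isLiteral(?value) && regex(str(?value), "water quality", "i"))
}
LIMIT 25
"""

# The "output" parameter name varies by SPARQL endpoint implementation.
url = ("http://example.org/sparql?" +
       urllib.parse.urlencode({"query": QUERY, "output": "csv"}))
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))
```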


GeoData2011 Takeaways

March 6th, 2011

The GeoData2011 workshop was a tremendous opportunity to hear about the state of the art in the data lifecycle, data integration, and data citation, and to participate in dialogs that will define the path forward in each of these areas.  It was humbling to be surrounded by combined centuries of experience in geoscience and the data pipeline.  There were members of virtually every community in geoscience, and organizations specializing in every stage of the data lifecycle.  With that in mind, here are the key takeaways I collected from each of the workshop foci (lifecycle, integration, and citation).

The data lifecycle is a hard thing to define.  Our own Prof. Peter Fox gave the workshop a starting point with a simple, three-stage model involving acquisition, curation, and preservation.  Of course, the data lifecycle is by no means a simple entity, and it is likely that no one-size-fits-all framework or abstraction covers every instantiation.  Some participants thought there needed to be a distinction between the “original intent” data cycle and further cycles.  Others viewed the data lifecycle as an endless spiral of acquisition, curation, and preservation.  One of the breakout sessions divided the simple lifecycle further, into more granular stages such as collection planning, processing, and migration.  Even with these varied viewpoints on how to define the data lifecycle, the breakout sessions all pointed to metadata as the primary target for improving it.  The need to identify points where metadata must be captured, to build better tools that automate the capture of metadata within data collection instruments, and to educate scientists on the importance of metadata all emerged as critical paths to improving the data lifecycle.  As a Semantic Web group, we can proudly say that we are good at dealing with metadata.  That being said, I still think we can improve in certain areas.  To start, I think we can nail down a data lifecycle abstraction for acquiring, curating, and preserving Semantic Web data more readily than the broader geoscience community can for its data.  We can also do a better job of capturing metadata throughout our data pipeline, and tools like csv2rdf4lod should be celebrated for doing this.

For me, the data integration sessions may have been the most interesting part of the workshop.  In our group at TWC, data integration is a task many of us perform daily: transforming data from different formats to RDF to enable interoperability, and applying community vocabularies and constructing vocabulary mappings to enable a consistent view over data.  However, most of the data integration tasks we perform are in service of a certain goal or a specific use case; the goals most of the GeoData participants had in mind were much more ambitious.  There was no use case or specific domain; rather, the workshop focused on enabling data integration across the broad, multidisciplinary domain of geoscience.  The participants were primed with a talk by Jim Barrett from Enterprise Planning Solutions, in which he argued for the need to move data integration up the value chain, from the use side to the supply side of data.  I think there were mixed feelings on the extent to which data integration can be moved up the value chain.  Most recognized that there is generally a tradeoff between the ability to integrate data and the ability to capture everything in the original data acquisition.  The breakout session I participated in for this topic produced a few interesting suggestions, namely that each role in the data lifecycle (e.g., producer, archiver, user) needs to maximize “integratability,” which is distinct from the call to move integration up the pipeline.  It was also mentioned that it is important to identify the limitations of data transformations (i.e., what has enabled integration) and the constraints on data transformations (i.e., what can enable integration), and that there are constructs in the ISO 19115 standard for doing this (MD_Usage and MD_Constraints).  There is tremendous potential to apply semantics in this area, through vocabularies and reasoning capabilities, to notify users of the limitations of the products they are using and to provide warnings before constraints are exceeded.

The last major focus of the workshop was data citation.  Before attending GeoData2011, I recognized the significance of data citation, but only after the workshop did I realize that it is truly within the grasp of the scientific community.  Mark Parsons from the National Snow and Ice Data Center presented some ongoing work in data citation, such as DataCite and the Dataverse Network project, as well as his own theories on the subject.  He hypothesized that 80% of available data can be cited as is, without the need for any special data citation platforms, and set the breakout groups on the task of writing data citations and identifying gaps.  Some particularly tricky datasets were identified that might need alternative approaches, including taxonomic data (which changes frequently) and hydrographic data (which is often compiled from many individual cruises into a homogeneous database).  What I found most interesting was Parsons’s suggestion that we cite data in exactly the same way we make other citations in our publications; that is, we need to treat data citations as being just as important as journal and in-proceedings references.  Data citation is critical to the work we are doing at TWC, since almost all of us in the lab work with someone else’s data.  As such, when we publish on what we’ve done, or even when we post visualizations and mashups of Data.gov datasets, we need to include references to the original data just like the references we’d put in any of our publications.  On our LOGD site, we should be making appropriate data citations on the pages we create for converted datasets.  Making these simple changes to the way we do science, and educating students, scientists, and even publishers, is the only way to make progress in data citation.

So those are my takeaways.  The GeoData2011 workshop was an excellent opportunity to learn about the state of the art and the path forward in the data lifecycle, data integration, and data citation.  In short, let’s identify the data lifecycle for Semantic Web data, keep building tools that automatically capture metadata, and add appropriate citations for the integrated datasets, visualizations, and mashups that we create.  I look forward to applying what I absorbed from the many interesting dialogs that occurred.  In fact, I will be looking into the ESIP Discovery Cluster in the coming weeks to see where my work on S2S and Semantic Web services can be applied to improve their discovery services (especially their OpenSearch conventions).


Reflections of an AGU “First-Timer”

January 10th, 2011

In the past few weeks, and during the conference itself, I collected a lot of opinions about the American Geophysical Union (AGU) Fall Meeting.  Some said it was their most important week of the year; others said it was unlike any other conference they attend, a conference on steroids if you will; and I think there was a general consensus that the event was downright exhausting.  I can’t comment on the second point, as this was the first conference I’ve attended; however, I agree that this was easily the most important week for my research to date, and, yes, I am exhausted.

The week was exciting.  I got to meet a lot of people interested in eScience and informatics.  The Earth and Space Science Informatics sessions offered broad coverage of research topics, spanning nearly every aspect of data management and curation.  There were presentations on applications of data transfer formats, including netCDF and HDF, and on conformance to metadata standards, such as ISO 19115.  Beyond simple data and metadata representations, many were interested in interoperability techniques, including OPeNDAP for application-level interoperability of data, and also vocabulary-level interoperability, which I found to be a much more difficult concept for data curators to find value in.  Data curators do not yet see the importance of vocabulary interoperability because the “killer app” that would utilize integrated vocabularies lags far behind the initiatives towards interoperability.  I find this surprising, as the “killer app” many are looking for is a simple portal for spatio-temporal data search.  The major hindrances to the development of such a portal are not only metadata interoperability (through standards and vocabularies) but also search efficiency.  Technologies like Apache Solr are emerging for distributed indexing and keyword query over the metadata of deep web holdings, but spatio-temporal indices are being included in these technologies as an afterthought.
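To illustrate what I mean, here is a hedged sketch of a Solr request that bolts a spatial filter onto an otherwise keyword-centric query.  The core and field names are hypothetical, though the {!geofilt} filter-query syntax is real Solr functionality.

```python
# A hedged sketch of a keyword-centric Solr query with a spatial filter
# bolted on.  The core name ("metadata") and field names are
# hypothetical; the {!geofilt} filter-query syntax is real Solr syntax.
import urllib.parse

params = {
    "q": "sea surface temperature",  # keyword search: Solr's core strength
    # Spatial filtering arrives as an add-on filter query rather than a
    # first-class part of the search model: points within 50 km of Troy, NY.
    "fq": "{!geofilt pt=42.73,-73.68 sfield=location d=50}",
    "rows": "10",
}
print("http://localhost:8983/solr/metadata/select?" +
      urllib.parse.urlencode(params))
```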

Beyond the data/metadata/vocabulary research presented at the conference, there were numerous presentations on the importance of standardized web services, and many applications demonstrating the use of standards developed by the Open Geospatial Consortium, Amazon, and the Open Archives Initiative, to name a few.  Across the many presentations, I don’t think there was a clear indication of which web service standard is most popular, nor any indication that the diverse set of available standards will converge.  Notably, Dan Pilone and his collaborators from NASA’s Earth Observing System (EOS) Clearinghouse (ECHO) put together a nice set of posters comparing the utility of web service standards from the perspectives of both the developer and the end-user.  All of this work was particularly interesting for my research, which is starting to converge towards application- (or service-)level semantics.

My poster presentation was more in line with those focused on the application and integration of web services.  I presented S2S, an application-level framework for building customized user dashboards for data search and visualization that supports techniques such as advanced search and faceted browse.  The poster was well received, and I gathered both positive feedback and, to an extent, skepticism.  I spent the afternoon explaining the benefits to be reaped from application-level semantics, describing the successful application of the Semantic Web Methodology and Technology Development Process, and, for the skeptics, justifying the complexities of an application ontology.  Compiling the input from the conversations around the poster, I realized the need to incorporate a data abstraction into the application ontology, as well as the need to further abstract the service concept to support a broader range of web services.  I think such abstractions will help dissolve the interest in converging on a single web service standard, and could serve as a foundation for a “killer app” that sparks further interest in vocabulary interoperability and application-level semantics.

The week was exhausting.  I find that I need to work on my ability to absorb information: even just days after the conference, while putting together this blog post, I feel fuzzy on the details of the many conversations I took part in.  One of the problems I faced is that I am not great at taking notes and participating in a discussion concurrently.  In addition, I found it hard to find the chance to sit down and reflect on the conversations I’d had, both because some conversations would span an hour or more and because I was often moving directly from one conversation to the next (too many things to do in too little time!).  At future conferences, I plan to work on my ability to jot down notes in the midst of discussions, and I will be wary of transitioning to new conversations without taking a break to reflect.  If any of these reflections on my experiences have piqued your interest, feel free to shoot me an email (or find me in the lab, TWCers!).
