Archive for December, 2010

Fall 2010 TWC Undergraduate Research Summary

December 20th, 2010

The Fall 2010 semester marked the beginning of the Tetherless World Constellation’s undergraduate research program at Rensselaer Polytechnic Institute (RPI). Although TWC has enjoyed significant contributions from RPI undergrads since its inception, this term we stepped up our game by more “formally” incorporating a group of undergrads into TWC’s research programs, establishing regular meetings for the group, and, with input from the students, beginning to outfit their own space in RPI’s Winslow Building.

Patrick West, my fellow TWC undergrad research coordinator, and I asked the students to blog about their work throughout the semester; at the end of the term, we asked them to post summary descriptions of their work and their thoughts about the fledgling TWC undergrad research program itself. We’ve provided short summaries and links to those blogs below…

  • Cameron Helm began the term coming up to speed on SPARQL and RDF, experimented with several of the public TWC endpoints, and then worked with Phillip on basic visualizations. He then slashed his way through the tutorials on TWC’s LOGD Portal, eventually creating impressive visualizations such as this earthquake map. Cameron is very interested in the subject of data visualization and looks to do more work in this area in the future.
  • After a short TWC learning period, Dan Souza began helping doctoral candidate Evan Patton create an Android version of the Mobile Wine Agent application, with all the amazing visualization and data integration required, including Twitter and Facebook integration. Mid-semester Dan also responded to the call to help with the “crash” development of the Android/iPhone TalkTracker app, in time for ISWC 2010 in early November. Dan continues to work with Evan and others toward early 2011 releases of Android, iPhone/iPod Touch and iPad versions of the Mobile Wine Agent.
  • David Molik reports that he learned web coding skills, ontology creation, and server installation and administration. David contributed to the development and operation of a test site for the new, semantic-web-savvy website of the Biological and Chemical Oceanography Data Management Office (BCO-DMO) at the Woods Hole Oceanographic Institution.
  • Jay Chamberlin spent much of his time working on the OPeNDAP Project, an open source server for distributing scientific data stored in various formats. His involvement included everything from learning his way around the OPeNDAP server, to working with infrastructure such as TWC’s LDAP services, to helping migrate documentation from the previous wiki to the new Drupal site, to actually implementing required changes to the OPeNDAP code base.
  • Phillip Ng worked on a wide variety of projects this fall, starting with basic visualizations, helping with ISWC applications, and including iPad development for the Mobile Wine Agent. Phillip’s blog is fascinating to read as he works his way through the challenges of creating applications, including his multi-part series on implementing the social media features.
  • Alexei Bulazel began working with Dominic DiFranzo on a health-related mashup using Data.gov datasets and is now working on a research paper with David on “human flesh search engine” techniques, a topic that top thinkers including Tetherless World Senior Constellation Professor Jim Hendler have explored in recent talks. Note: For more background on this phenomenon, see e.g. “China’s Cyberposse,” NY Times (03 Mar 2010).

Many of these students will be continuing on with these or other projects at TWC in 2011; we also expect several new students to be joining the group. The entire team at the Tetherless World Constellation thanks them for their efforts and many important contributions this fall, and looks forward to being amazed by their continued great work in the coming year!

John S. Erickson, Ph.D.


Why the term ‘data publication’?

December 14th, 2010

Over the last six months I have been present in at least ten distinct discussions around topics such as data publication, data citation and data attribution. At first I was engaged in the topics, but very quickly I kept pausing and asking myself: what’s the use case (duh!)? What I was hearing was coming from ‘data people’ (yes, I am one of them). What I wanted to hear was: “I want to be cited for the datasets I spend a lot of time and intellectual effort collecting, calibrating and analyzing”, or “… really I want to get credit for that as much as for the one or two publications I might get”. I’ve heard this; in fact I’ve said it myself many times.

So what’s the problem? Well, when a researcher wants credit and citation for a piece of work, they prepare and publish a paper, a body of intellectual work. Our communities and disciplines have spent many centuries developing this approach. So, if what I really want is credit and citation for my data, why do I need to publish it? At present, many people are getting such credit, but in an informal way, such as a narrative-level acknowledgement in the text of a paper, not a formal citation (Parsons, Duerr and Minster 2010 EOS). That’s as good as no acknowledgement unless someone sees it and records it somewhere. The mechanism for paper citation is now well established: I cite your paper in my paper, and your citation count increases and gets reported. If you are up for promotion or tenure or review and that count is taken into account, you get credit. It’s the identification of the artifact that counts, not the fact that it is published. In short, the capability that is needed is: a way to identify your data contribution and a way to record it (and thus count it). Identification and reference, that’s it.

Note that I am not writing about the publication of ‘publication data’, i.e. the data that is the foundation for figures, tables, and other descriptions in a published paper. I am all for that data being made available as part of the publication, but that is another story. I am addressing just regular data (collections/sets).
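To make “identification and reference” concrete, here is a minimal sketch of the kind of record that capability implies. This is my own illustration; the field names are invented for the example and do not follow any particular metadata standard.

```python
# Minimal sketch of what "identify and record" might require for a dataset.
# Field names are invented for illustration, not any standard's schema.
dataset_record = {
    "identifier": "doi:10.xxxx/example-dataset",  # persistent, resolvable ID
    "creators": ["A. Researcher", "B. Colleague"],
    "title": "Example Observational Dataset",
    "version": "2.1",
    "released": "2010-11-15",
}

# A "citation" is then just a reference to the identifier, recordable
# and countable exactly like a paper citation.
citation = {"citing_artifact": "doi:10.yyyy/some-paper",
            "cited_artifact": dataset_record["identifier"]}
print(citation)
```

The point is how little is actually required: a persistent identifier, plus somewhere to record references to it.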

For now, I am suggesting that there are other models for making data available, and one of them is the software release cycle/process. Alpha, pre-beta, beta, release candidate, release, revision, documentation, feedback, bug fixes… that is much closer to the data processes I know. Now, this may not be the right approach, but I think we should explore it, and others. I’m no longer in favour of just adopting a model (marriage) of convenience (publishing). We are savvy enough to take a step back and implement a model that meets the needs of the data scientists who deserve it most. Yes, there’s more to be said. Tag, you’re it.


Food+Tech Hackathon

December 9th, 2010

On December 4th, developers, designers, entrepreneurs, and general food enthusiasts came together at the Food+Tech Hackathon to develop and explore applications to help evolve the food and information technology community. The event, which was part of the International Open Data Hackathon, was in New York City and was organized by Danielle Gould from Food + Tech Connect, Marc Alt from Open Source Cities, and Tian He from Gojee.

Evan Patton and I had a chance to come down and help out with the day’s hacking. I kicked off the event with a lecture on Open Data and the Semantic Web. I gave some background on the Open Data movement in the last few years, discussed some of the current challenges in open data, and talked about how Semantic Web technologies can help address these challenges.

Evan helped explain some of his work on publishing USDA nutrition data on semanticdiet.com and discussed the Wine Agent’s food ontology and recommendation features with participants. Semantic Diet uses semantic web technologies to bring together nutrition data, recipes contributed by users and crawled from the web, and personal dietary needs. Having these data organized and encoded using semantic technologies allowed groups to query and reason about food data, and even link it into their own hackathon ideas.
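For example, a hackathon team could pull such data into their own application with just a few lines of Python. The endpoint URL and vocabulary below are hypothetical placeholders; this sketches the idea rather than Semantic Diet’s actual schema:

```python
# Hedged sketch: querying a SPARQL endpoint for nutrition data.
# The endpoint URL and the property/class names are hypothetical
# placeholders, not Semantic Diet's actual schema.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://semanticdiet.com/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX food: <http://example.org/food#>
    SELECT ?item ?calories WHERE {
        ?item a food:FoodItem ;
              food:caloriesPer100g ?calories .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], row["calories"]["value"])
```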

During the hackathon there were thirteen groups working on everything from applications to help people eat more sustainably to projects that allowed people to understand price fluctuations in food products over time. We were thrilled to see some of the groups using semantic data and technologies provided by Semantic Diet and TWC’s Linked Open Government Data project. Evan and I spent most of our time educating and assisting teams on using semantic technologies and data. It was great to see so many people enthusiastic about semantics and thinking about how they could use open data to start a project or improve existing ones.

All in all, I feel the hackathon was a huge success. At the end of the day we had many applications and projects with the potential to really move forward and make a real impact in the community. Evan and I would like to thank the sponsors and organizers of the first ever Food+Tech Hackathon and hope to help with and participate in many more.

Links to other great blog posts on the Food+Tech Hackathon:


Suggestions to the Supercomputing Community

December 4th, 2010

As mentioned in my last blog post, I recently participated in a birds-of-a-feather (BOF) on semantic graph/database processing at Supercomputing 2010 (SC10).  My general research interest is in high-performance computing (HPC) for the semantic web, so this BOF was a great fit.  At the BOF, I very briefly made three suggestions to HPC researchers; in this blog post, I expand on and explain these suggestions.  I welcome feedback, particularly from those in the semantic web community who have something to share with the supercomputing community.

1. There is a need for good benchmarks from an HPC perspective.

By “good,” I primarily mean that the datasets and queries need to be realistic.  In other words, the data should reflect data that occurs in the real world, and queries should reflect queries that would be posed by actual users or systems.  By “HPC perspective,” I mean that a benchmark needs to test strong scaling (change in run time for a fixed total dataset size and a varying number of processors) and weak scaling (change in run time for a fixed dataset size per processor and a varying number of processors).
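To make those two measurements concrete, here is a small sketch of my own showing how strong- and weak-scaling efficiency fall out of raw wall-clock timings; all of the numbers are invented for the example:

```python
# Sketch: computing strong- and weak-scaling efficiency from wall-clock
# timings. All timing numbers below are invented for illustration.

def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: ideal time on p processors is t1/p."""
    speedup = t1 / tp
    return speedup / p

def weak_scaling_efficiency(t1, tp):
    """Fixed problem size per processor: ideal time stays constant at t1."""
    return t1 / tp

# Hypothetical timings (seconds) for a query over a fixed-size dataset.
strong_times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}
for p, tp in strong_times.items():
    eff = strong_scaling_efficiency(strong_times[1], tp, p)
    print(f"strong, p={p}: efficiency={eff:.2f}")

# Hypothetical timings where each processor holds the same data volume.
weak_times = {1: 100.0, 2: 105.0, 4: 118.0, 8: 140.0}
for p, tp in weak_times.items():
    eff = weak_scaling_efficiency(weak_times[1], tp)
    print(f"weak,   p={p}: efficiency={eff:.2f}")
```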

The Lehigh University Benchmark (LUBM) [1] has arguably been the most widely used benchmark, likely because it is one of the earliest benchmarks to provide a data generator and a standard set of queries.  It is targeted towards inferencing.  However, LUBM datasets are not only synthetic, they are quite unrealistic.  In addition to a uniform distribution of data, they suffer from other inadequacies, like few links between universities and the use of a single, nonsensical phone number for every person (“xxx-xxx-xxxx”).  Therefore, LUBM datasets do not provide a realistic data distribution and thus cannot test the ability of systems to handle realistic selectivity and skew.
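To illustrate the skew point, here is a small sketch of my own (not part of LUBM or any benchmark) contrasting a uniform subject distribution with a Zipf-like one; real-world RDF data tends to look much more like the latter:

```python
# Sketch: uniform vs. skewed (Zipf-like) distribution of triples per
# subject. Invented numbers; real RDF datasets typically show heavy skew.
import numpy as np

rng = np.random.default_rng(42)
n_triples, n_subjects = 100_000, 1_000

# Uniform: every subject is roughly equally likely (LUBM-style).
uniform = rng.integers(0, n_subjects, size=n_triples)

# Zipf: a few "celebrity" subjects dominate (closer to real-world data).
zipf = rng.zipf(a=2.0, size=n_triples) % n_subjects

for name, sample in [("uniform", uniform), ("zipf", zipf)]:
    counts = np.bincount(sample, minlength=n_subjects)
    top10 = np.sort(counts)[-10:].sum() / n_triples
    print(f"{name}: top-10 subjects hold {top10:.1%} of all triples")
```

A join or query plan tuned on the uniform data will badly mispredict intermediate result sizes on the skewed data, which is exactly what a benchmark should be stressing.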

There is also the Berlin SPARQL Benchmark (BSBM) [2], but it is “built around an e-commerce use case” and “illustrates the search and navigation pattern of a consumer looking for a product” [3].  From an HPC perspective, we will likely be more concerned with the overall run time of queries or reasoning processes (or whatever other interesting processes) than with handling interaction with users.

Finally, there is SP2Bench [4].  This is perhaps the most useful benchmark for SPARQL query evaluation.  It provides a data generator that mimics statistical properties of DBLP data, and it provides a set of sensible queries.  Therefore, its dataset is more realistic than LUBM’s, and it is focused on SPARQL querying (whereas LUBM focuses on reasoning).

However, there is still a need for a good reasoning benchmark from an HPC perspective.  It’s difficult to be more specific than that because providing such a benchmark is still very much an open research topic.  Clearly there needs to be an ontology that uses features from various reasoning standards (e.g., RDFS, OWL) and a corresponding data generator.  There should also be some way to verify the validity of inferences based on certain entailments.  Again, this is very much an open research topic, which is why I made the suggestion but have few answers myself.

2. Consider existing reasoning standards as starting points.

This may be the more controversial of my suggestions, but there is good reason for it.  Recent history indicates that the reasoning standards continue to iteratively evolve based on the needs of the community.

Consider RDFS (by which I mean RDFS entailment as defined in RDF Semantics).  First of all, it is technically undecidable [5], but in a way that is trivial and easily overcome.  Secondly, few systems (in my experience) completely support inferences based on literal generalization, XML literals, and container-membership properties.  Other rules, like “everything is a resource,” are generally trivial and uninteresting.  More commonly, implementations align with a fragment of RDFS that I call RDFS Muñoz [6] (originally termed the ρdf fragment), which essentially boils down to domains, ranges, subclasses, and subproperties. Perhaps Muñoz said it best:

“Efficient processing of any kind of data relies on a compromise between the size of the data and the expressiveness of the language describing it. As we already pointed out, in the RDF case the size of the data to be processed will be enormous, as current developments show …. Hence, a program to make RDF processing scalable has to consider necessarily the compromise between complexity and expressiveness. Such a program amounts essentially to look for fragments of RDF with good behavior with respect to complexity of processing.” [6]
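To make the ρdf fragment concrete, here is a minimal, naive forward-chaining sketch of my own. It covers only the subclass, subproperty, domain, and range rules, and it is in no way an efficient implementation:

```python
# Naive fixpoint computation for the rho-df (RDFS Munoz) fragment:
# subClassOf/subPropertyOf transitivity, type propagation, property
# inheritance, and domain/range typing. Toy code, not optimized.

SUBCLASS, SUBPROP, TYPE, DOMAIN, RANGE = (
    "rdfs:subClassOf", "rdfs:subPropertyOf",
    "rdf:type", "rdfs:domain", "rdfs:range")

def rhodf_closure(triples):
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # Transitivity of subClassOf / subPropertyOf.
                if p in (SUBCLASS, SUBPROP) and p2 == p and o == s2:
                    new.add((s, p, o2))
                # rdf:type propagation along subClassOf.
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # Triple propagation along subPropertyOf.
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
                # Domain/range typing: (p rdfs:domain C) => s rdf:type C.
                if p2 == DOMAIN and p == s2:
                    new.add((s, TYPE, o2))
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
        if new <= closure:
            return closure
        closure |= new

g = {("ex:Student", SUBCLASS, "ex:Person"),
     ("ex:enrolledIn", DOMAIN, "ex:Student"),
     ("ex:alice", "ex:enrolledIn", "ex:cs101")}
for t in sorted(rhodf_closure(g)):
    print(t)
```

Even this toy version makes the appeal of the fragment clear: a handful of cheap rules covering the schema features most real data actually uses.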

Consider also OWL 1.  How many scalable systems completely support one of the OWL 1 fragments (Lite, DL, Full)?  I cannot say for sure, but my impression from experience and feedback from others is that higher expressivity can often be too expensive in terms of performance, especially as dataset size scales.  Perhaps it is for this reason that OWL Horst [7] (originally termed the pD* fragment) has gained popularity as (arguably) the most widely supported OWL fragment.

Now there is OWL 2.  OWL 2 RL (a fragment of OWL 2) is “inspired by description logic programs and pD* [OWL Horst]” [8].  The SAOR paper from ISWC 2010 [9] has already shown a subset of OWL 2 RL rules for which closure can be efficiently produced in parallel.
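As a rough illustration of why certain rule subsets parallelize so well (this is my own toy sketch, not the SAOR algorithm): rules whose bodies match a single assertional triple, given a replicated schema, can be applied to disjoint partitions of the data with no communication between workers:

```python
# Toy sketch: embarrassingly parallel application of single-antecedent
# rules (here, just domain typing) over partitioned triples, with the
# schema replicated to every worker. Not the SAOR algorithm itself.
from multiprocessing import Pool

SCHEMA_DOMAINS = {"ex:enrolledIn": "ex:Student",
                  "ex:teaches": "ex:Professor"}  # replicated schema

def apply_domain_rule(partition):
    # Each worker sees only its partition; no communication needed.
    inferred = set()
    for s, p, o in partition:
        if p in SCHEMA_DOMAINS:
            inferred.add((s, "rdf:type", SCHEMA_DOMAINS[p]))
    return inferred

if __name__ == "__main__":
    triples = [("ex:alice", "ex:enrolledIn", "ex:cs101"),
               ("ex:bob", "ex:teaches", "ex:cs101"),
               ("ex:carol", "ex:enrolledIn", "ex:bio200")]
    partitions = [triples[0::2], triples[1::2]]  # naive round-robin split
    with Pool(2) as pool:
        results = pool.map(apply_domain_rule, partitions)
    for inferred in results:
        print(inferred)
```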

So my point is this. Reasoning standards capture well-defined and understood fragments, but research and practice continue to explore subfragments that are suitable for certain problems, and as the subfragments become stable and gain popularity, they inspire future standards. It is an iterative process, so it is not necessary to become obsessed with fully complying with existing standards (unless that is actually necessary to meet your use case). It is probably more interesting to search for fragments of the standards that fit certain HPC paradigms.

3. Review the literature to reconsider approaches that were once considered less viable.

This suggestion seems obvious.  As an example, I recently did a literature review of parallel join processing, and one thing I noticed is that a majority of the literature is focused on shared-nothing architectures.  In 1992, DeWitt and Gray stated:

“A consensus on parallel and distributed database system architecture has emerged.  This architecture is based on a shared-nothing hardware design ….” [10]

However, in 1996, Norman, Zurek, and Thanisch directly opposed (or reversed) the claim of DeWitt and Gray saying:

“We argue that shared-nothingness is no longer the consensus hardware architecture and that hardware resource sharing is a poor basis for categorising parallel DBMS software architectures if one wishes to compare the performance characteristics of parallel DBMS products.” [11]

The popularity of the shared-nothing paradigm was probably further fueled by the advent of inexpensive supercomputing by way of Beowulf clusters and Networks of Workstations (around the mid-1990s).  However, many modern supercomputers provide shared-disk and shared-memory paradigms.  The Blue Gene/L in our Computational Center for Nanotechnology Innovations (CCNI) is networked with a General Parallel File System (GPFS).  Making use of GPFS, the Blue Gene/L could be considered shared-disk in a programmatic sense.  The Cray XMT uses large shared memory.  Rahm points out that a major advantage of shared-disk is its potential for truly dynamic load balancing [12], so let’s look back at some of the shared-disk and shared-memory research that has been done [12-15].

All of that just to say, a review of the literature is in order.  Potential sources of inspiration include parallel databases, parallel graph algorithms, deductive databases, and graph databases.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute

[1] Guo, Pan, Heflin.  LUBM: A benchmark for OWL knowledge base systems.  JWS 2005.
[2] Bizer, Schultz.  The Berlin SPARQL Benchmark.  IJSWIS 2009.
[3] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
[4] Schmidt, Hornung, Lausen, Pinkel.  SP2Bench: A SPARQL Performance Benchmark.  ICDE 2009.
[5] Weaver.  Redefining the RDFS Closure to be Decidable.  RDF Next Steps 2010.
[6] Muñoz, Pérez, Gutierrez.  Simple and Efficient Minimal RDFS.  JWS 2009.
[7] ter Horst.  Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary.  JWS 2005.
[8] http://www.w3.org/TR/owl2-profiles/#OWL_2_RL
[9] Hogan, Pan, Polleres, Decker. SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. ISWC 2010.
[10] DeWitt, Gray.  Parallel Database Systems: The Future of High Performance Database Systems.  Communications of the ACM 1992.
[11] Norman, Zurek, Thanisch.  Much Ado About Shared-Nothing.  SIGMOD Record 1996.
[12] Rahm.  Parallel Query Processing in Shared Disk Database Systems.  SIGMOD Record 1993.
[13] Lu, Tan.  Dynamic and Load-balanced Task-Oriented Database Query Processing in Parallel Systems.  EDBT 1992.
[14] Märtens.  Skew-Insensitive Join Processing in Shared-Disk Database Systems.  IADT 1998.
[15] Moon, On, Cho.  Performance of Dynamic Load Balanced Join Algorithms in Shared Disk Parallel Database Systems.  Workshop on Future Trends of Distributed Computing Systems 1999.
