Archive

Archive for the ‘tetherless world’ Category

Scaling Up at the Tetherless World Constellation in 2009

November 20th, 2009

Since this is my first post to the Tetherless World blog, perhaps a brief introduction is in order.  I’m Jesse Weaver, one of Jim Hendler’s Ph.D. students in the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).  My general research interest is in high-performance computing for the semantic web.  Specifically, I have been looking at employing parallelism on cluster architectures for rule-based reasoning and RDF query.  Since joining TWC in Fall 2008, I have been working with colleagues toward this end, and it is that work that I would like to share in this blog post.

Jim and I recently published a paper at ISWC 2009 entitled Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples.  Since the time that paper was accepted (and as presented at ISWC), we have actually scaled to billions of triples.  We show in this paper that the RDFS rules can be applied to independent partitions of data to produce the RDFS closure for all of the data, as long as each partition has the ontologies.  In parallel computing terms, the RDFS closure can be computed in an embarrassingly parallel fashion.  “Embarrassingly parallel” is a technical term from parallel computing describing a computation that can be divided into completely independent parts.  Such computations are considered ideal for parallelism because there is no need for communication between processes and hence there is essentially no overhead for parallelization.  Peter Patel-Schneider had some good questions and comments after the presentation.  I have made my responses publicly available in a brief note.

Gregory Todd Williams and I published a paper at SSWS 2009 entitled Scalable RDF query processing on clusters and supercomputers.  This paper shows how parallel hash joins can be used on high-performance clusters to efficiently query large RDF datasets.  It seemed to get a lot of attention at the SSWS workshop as well as stir up a little bit of controversy.  The interesting thing about our approach is that no global indexes are created.  Each process in the cluster gets a portion of the data and indexes it locally, but no global indexes are maintained (e.g., we do not globally dictionary encode RDF terms).  This allows us to load data extremely quickly with some cost to query time.  In many cases, though, the decrease in loading time outweighs the added cost in query time.  (The added cost in query time comes from communicating full string values instead of global IDs during the parallel hash join.)  This allows for exploratory querying and easy handling of dynamically changing data.  Whereas many previous query systems depend heavily on global indexes (for which loading can take on the order of hours or days), we can load large datasets on the order of seconds and minutes.  Therefore, if the data changes, it can just be reloaded instead of updating indexes.

Finally, Greg, Medha Atre, Jim, and I submitted a paper to the Billion Triples Challenge (BTC), which we won!

Greg and Jesse accept the award for 1st place at the 2009 Billion Triples Challenge

We composed together three systems for our submission.  First, we created a simple upper ontology of 31 triples for our domain of interest, linking established concepts of Person to our concept of Person (by subclass), and we did the same for many relevant properties (name, email, etc.) (by subproperty).  Then, we used the aforementioned parallel materialization work to produce inferences on the BTC dataset, inferring triples that use our terms from the upper ontology.  Using the aforementioned work on scalable query, we then extracted only our triples of interest.  This reduced dataset is almost 800K triples, much more manageable than the original 900M triples, and it can now be used by existing tools without much concern of dataset size.  As a finishing touch, we compressed the reduced dataset down into a BitMat RDF data structure, resulting in a final disk space of 8 MB for the triples and 25 MB for the dictionary encoding.  Simple basic graph pattern queries can be executed against the BitMat.  The entire process took roughly 22 minutes.  See more about the submission at our BTC website which contains the datasets and some statistics about the datasets.

Jesse presents at the 2009 Billion Triples Challenge

That being said, the future holds much work to be done for scalability in the semantic web domain.

At present, I have been looking at formalizing a more general notion of “abox partitioning” for the purpose of classifying rules that fit such a paradigm, and then explore its application to OWL2RL.  Some parts of OWL2RL—like symmetric properties and inverse properties—clearly fit in the inferencing scheme from the parallel materialization paper.  However, many of the much desired features—like inverse functional properties and owl:sameAs—do not.  For such rules, parallel hash joins may be needed, or perhaps a more clever partitioning scheme.

We could also improve loading time of these systems (and perhaps communication time during parallel hash joins) by using an RDF syntax that is less verbose than N-Triples, but not as complex as Turtle.  (Remember, we are concerned about parallel I/O.)  To that end, we are exploring defining a subset of Turtle that would be helpful for I/O purposes without trading off the inherent simplicity of N-Triples (one triple per line).

We would also like to start employing more memory-efficient RDF storage data structures (like BitMat or Parliament) directly in our systems.  This is particularly important for the Blue Gene/L architecture which has at most 1 GB of memory per node.

And speaking of the Blue Gene/L, I have been doing all my work at RPI’s fabulous Computational Center for Nanotechnology Innovations (CCNI).  The CCNI is really a great computation facility having parallel file systems, high performance clusters, large SMP machines, and—of course—a Blue Gene/L.  Such a resource is a great enabler for our research.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~weavej3/

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: Jesse Weaver Categories: tetherless world Tags:

Parliament, storage density, and napkin math

November 8th, 2009

What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine (any errors are presumably with my understanding of the described system).

The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.

The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:

field type
subject 4 byte Offset into Resource Table
predicate 4 byte Offset into Resource Table
object 4 byte Offset into Resource Table
next subject 4 byte Offset into Statement List
next predicate 4 byte Offset into Statement List
next object 4 byte Offset into Statement List
Statement List Record

The resource table is an array of fixed width records for each resource containing three offsets into the statement list for the first statement with the specific resource in subject, predicate, and object position, respectively, three counts representing the length of these three lists, and the resoure value. The storage requirements for a resource table record’s fields are:

Resource Table Record
field type
first subject 4 byte Offset into Statement List
first predicate 4 byte Offset into Statement List
first object 4 byte Offset into Statement List
subject count 4 byte Integer
predicate count 4 byte Integer
object count 4 byte Integer
value 8 byte pointer to n byte resource value

Puting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:

(24 bytes * |triples|) + ((32+n) bytes * |nodes|)

where n is the average resource value size.

To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.

Using these numbers, the statement list and resource table for 100M triples would take (24*100M) + ((32+109)*34144466) bytes = 7214369706 bytes = 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, with a storage estimate of 288GB.

I’d like to try implementing this and using it on our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit since we’d like high storage density but haven’t spent any effort on that front yet (and our current code is awful in this respect) and our parallel query answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It’ll be interesting to see if the numbers actually pan out, and if the query algorithm can handle anything close to that much data.

Gregory Todd Williams

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: greg Categories: Semantic Web, tetherless world Tags:

OWL 2 Reference Card released

October 18th, 2009

We’re pleased to announce the OWL 2 Reference Card [1]. The Card is meant to be a “cheat sheet” of OWL 2 features printable on a single piece of paper (on both sides). It is based on the OWL 2 Quick Reference Guide [1], which is now a Proposed Recommendation [2] in the OWL 2 Web Ontology Language document set.

Background: OWL 2 [4] is an extension to OWL 1 with a few new functionalities. Some of the new features are syntactic sugar (e.g., disjoint union of classes) while others offer new expressivity, including:

* keys;
* property chains;
* richer datatypes, data ranges;
* qualified cardinality restrictions;
* asymmetric, reflexive, and disjoint properties; and
* enhanced annotation capabilities

Comments and suggestions to the Card are welcome (please send to public-owl-comments@w3.org)

[1] http://www.w3.org/2007/OWL/refcard

[2] http://www.w3.org/2007/OWL/wiki/Quick_Reference_Guide

[3] http://www.w3.org/TR/2009/PR-owl2-quick-reference-20090922/

[4] http://www.w3.org/TR/owl2-overview/

Jie Bao

VN:F [1.2.0_562]
Rating: 10.0/10 (1 vote cast)
Author: Jie Bao Categories: tetherless world Tags:

Graduation day - in haiku

May 18th, 2009

For those interested, I kept a twitter log of my daughter’s graduation — but did it in haiku to make it more interesting — slightly edited and cleaned up - here’s what I’ve got.

Graduation in Haiku

Parents wait and wait
Students anxious in the hall
Commencement is nigh

Pomp and Circumstance
Way too many caps and gowns
precess slowly by

Chairman of trustees
welcomes all to the event
when will he be done?

President makes jokes
then he gets more serious
“we need your money!”

Now they proceed to
honorary doctorates
lots of famous folks

(someone sends a text
it’s a message from my kid:
“Get me out of here!”)

juris causus doc
no one has heard of this guy
must be a donor

Commencement address
starts with funny anecdote
so what else is new?

Speaker’s really good!
Words of hope and cheer - too bad:
grads are all asleep

Cannot see my kid
the talks all go on too long
will it ever end?

A problem with this:
seeing your kid get her degree
means watching the rest

The Prez says the words
the crowd stands up, cheers real loud
kids now have B.A.s

Alma mater sung
speeches made, degrees conferred
graduation’s done

VN:F [1.2.0_562]
Rating: 8.0/10 (2 votes cast)
Author: hendler Categories: tetherless world Tags:

A little Semantics Goes a Long way — the background

April 21st, 2009

For whatever reason, I’ve recently been asked several time about my quote “A little semantics goes a long way” - I sent a long email answer to one of these, and then realized I could make it into a web page to point people at should this come up again — so for those of you looking for a little light background reading, and interested, enjoy.  http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html

VN:F [1.2.0_562]
Rating: 8.5/10 (2 votes cast)