Scaling Up at the Tetherless World Constellation in 2009
Since this is my first post to the Tetherless World blog, perhaps a brief introduction is in order. I'm Jesse Weaver, one of Jim Hendler's Ph.D. students in the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI). My general research interest is in high-performance computing for the semantic web. Specifically, I have been looking at employing parallelism on cluster architectures for rule-based reasoning and RDF query. Since joining TWC in Fall 2008, I have been working with colleagues toward this end, and it is that work that I would like to share in this blog post.
Jim and I recently published a paper at ISWC 2009 entitled Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples. Since that paper was accepted (and as presented at ISWC), we have actually scaled to billions of triples. We show in this paper that the RDFS rules can be applied to independent partitions of the data to produce the RDFS closure for all of the data, as long as each partition also contains the ontologies. In parallel computing terms, the RDFS closure can be computed in an embarrassingly parallel fashion. "Embarrassingly parallel" is a technical term from parallel computing describing a computation that can be divided into completely independent parts. Such computations are considered ideal for parallelism because there is no need for communication between processes, and hence there is essentially no overhead for parallelization. Peter Patel-Schneider had some good questions and comments after the presentation. I have made my responses publicly available in a brief note.
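To make the partitioning idea concrete, here is a toy Python sketch (not our actual cluster implementation) of the scheme: abox triples are split into independent partitions, the ontology (tbox) is replicated to each partition, and a couple of RDFS rules are applied to each partition in isolation. The example data and names are invented for illustration.

```python
from itertools import chain

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"

def rdfs_closure(partition, tbox):
    """Apply two RDFS rules to one partition until fixpoint.
    Each partition sees only its own abox triples plus the full tbox,
    so no communication between partitions is ever needed."""
    triples = set(partition) | set(tbox)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in triples:
            # rdfs7: (s p o), (p rdfs:subPropertyOf q) => (s q o)
            for (p2, r, q) in triples:
                if r == SUBPROP and p2 == p:
                    new.add((s, q, o))
            # rdfs9: (s rdf:type C), (C rdfs:subClassOf D) => (s rdf:type D)
            if p == RDF_TYPE:
                for (c, r, d) in triples:
                    if r == SUBCLASS and c == o:
                        new.add((s, RDF_TYPE, d))
        if not new <= triples:
            triples |= new
            changed = True
    return triples

# Hypothetical data: the small tbox is replicated to every partition.
tbox = {("ex:Student", SUBCLASS, "ex:Person")}
partitions = [
    {("ex:alice", RDF_TYPE, "ex:Student")},
    {("ex:bob", RDF_TYPE, "ex:Student")},
]
closure = set(chain.from_iterable(rdfs_closure(p, tbox) for p in partitions))
assert ("ex:alice", RDF_TYPE, "ex:Person") in closure
```

Because the partitions never exchange triples, the per-partition closures can be computed on separate processes and simply unioned at the end.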
Gregory Todd Williams and I published a paper at SSWS 2009 entitled Scalable RDF query processing on clusters and supercomputers. This paper shows how parallel hash joins can be used on high-performance clusters to efficiently query large RDF datasets. It seemed to get a lot of attention at the SSWS workshop, as well as to stir up a little bit of controversy. The interesting thing about our approach is that no global indexes are created. Each process in the cluster gets a portion of the data and indexes it locally, but no global indexes are maintained (e.g., we do not globally dictionary-encode RDF terms). This allows us to load data extremely quickly at some cost to query time. In many cases, though, the decrease in loading time outweighs the added cost in query time. (The added cost in query time comes from communicating full string values instead of global IDs during the parallel hash join.) This allows for exploratory querying and easy handling of dynamically changing data. Whereas many previous query systems depend heavily on global indexes (for which loading can take on the order of hours or days), we can load large datasets on the order of seconds to minutes. Therefore, if the data changes, it can simply be reloaded instead of updating indexes.
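The core mechanism can be sketched in a few lines of Python. This is a sequential simulation of a parallel hash join, not the MPI-based system from the paper: both inputs are repartitioned by hashing the join variable's full string value (no global dictionary encoding assumed), and each bucket is then joined locally. The solution mappings are hypothetical.

```python
from collections import defaultdict

def hash_partition(bindings, join_var, n_workers):
    """Route each solution to a worker by hashing the join variable's
    full string value (this is where full strings, not global IDs,
    would travel over the network in the real system)."""
    buckets = [[] for _ in range(n_workers)]
    for b in bindings:
        buckets[hash(b[join_var]) % n_workers].append(b)
    return buckets

def local_join(left, right, join_var):
    """Ordinary in-memory hash join on one worker's bucket."""
    index = defaultdict(list)
    for b in left:
        index[b[join_var]].append(b)
    out = []
    for b in right:
        for match in index[b[join_var]]:
            merged = dict(match)
            merged.update(b)
            out.append(merged)
    return out

def parallel_hash_join(left, right, join_var, n_workers=4):
    """Simulate the parallel join sequentially: repartition both sides
    on the join variable, then join matching buckets pairwise."""
    lbuckets = hash_partition(left, join_var, n_workers)
    rbuckets = hash_partition(right, join_var, n_workers)
    result = []
    for lb, rb in zip(lbuckets, rbuckets):
        result.extend(local_join(lb, rb, join_var))
    return result

# Hypothetical intermediate solutions for ?person from two triple patterns.
left = [{"person": "ex:alice", "name": '"Alice"'}]
right = [{"person": "ex:alice", "email": "mailto:alice@example.org"}]
print(parallel_hash_join(left, right, "person"))
```

Since identical strings always hash to the same bucket, matching solutions are guaranteed to land on the same worker, and no worker ever needs a global view of the data.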
We composed three systems for our Billion Triples Challenge (BTC) submission. First, we created a simple upper ontology of 31 triples for our domain of interest, linking established concepts of Person to our concept of Person (by subclass), and we did the same for many relevant properties (name, email, etc.) (by subproperty). Then, we used the aforementioned parallel materialization work to produce inferences on the BTC dataset, inferring triples that use the terms from our upper ontology. Using the aforementioned work on scalable query, we then extracted only our triples of interest. This reduced dataset is almost 800K triples, much more manageable than the original 900M triples, and it can now be used by existing tools without much concern for dataset size. As a finishing touch, we compressed the reduced dataset into a BitMat RDF data structure, resulting in a final disk footprint of 8 MB for the triples and 25 MB for the dictionary encoding. Simple basic graph pattern queries can be executed against the BitMat. The entire process took roughly 22 minutes. See more about the submission at our BTC website, which contains the datasets and some statistics about them.
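As a rough illustration of the dictionary-encoding idea behind that triples/dictionary split (this is a toy sketch, not the actual BitMat structure), each distinct term is mapped to a small integer exactly once, so triples can be stored as compact ID tuples while the dictionary is kept separately for decoding:

```python
# Toy dictionary encoder: terms map to integer IDs; triples become ID tuples.
# The id_to_term list plays the role of the separately-stored dictionary.
class Dictionary:
    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, i):
        return self.id_to_term[i]

d = Dictionary()
encoded = [tuple(d.encode(t) for t in triple) for triple in [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
]]
print(encoded)                   # ID tuples instead of repeated strings
print(d.decode(encoded[1][0]))   # terms are recoverable from the dictionary
```

Repeated terms (here rdf:type and ex:Person) are stored only once in the dictionary, which is why the dictionary and the encoded triples can end up with very different sizes on disk.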
That said, much work remains to be done on scalability in the semantic web domain.
At present, I have been looking at formalizing a more general notion of "abox partitioning" for the purpose of classifying rules that fit such a paradigm, and then exploring its application to OWL2RL. Some parts of OWL2RL, like symmetric properties and inverse properties, clearly fit the inferencing scheme from the parallel materialization paper. However, many of the much-desired features, like inverse functional properties and owl:sameAs, do not. For such rules, parallel hash joins may be needed, or perhaps a more clever partitioning scheme.
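To illustrate the distinction, here is a toy classification of two OWL2RL rules (the rule encoding is invented for this sketch): a rule fits the abox-partitioning scheme when it matches at most one abox triple, with all of its other antecedents coming from the replicated tbox. Symmetric properties pass this test; inverse functional properties do not, because they must pair up two abox triples that may live in different partitions.

```python
# A rule is "abox-partition-safe" in this sketch if it matches at most one
# abox triple; all remaining antecedents come from the (replicated) tbox.
# Rule = (abox_patterns, tbox_patterns, conclusion) -- hypothetical encoding.

def partition_safe(rule):
    abox_patterns, _tbox_patterns, _head = rule
    return len(abox_patterns) <= 1

# prp-symp: (p a owl:SymmetricProperty) [tbox], (x p y) [abox] => (y p x)
symmetric = ([("?x", "?p", "?y")],
             [("?p", "rdf:type", "owl:SymmetricProperty")],
             ("?y", "?p", "?x"))

# prp-ifp: (p a owl:InverseFunctionalProperty) [tbox], (x p y), (z p y)
# [two abox triples, possibly in different partitions] => (x owl:sameAs z)
inverse_functional = ([("?x", "?p", "?y"), ("?z", "?p", "?y")],
                      [("?p", "rdf:type", "owl:InverseFunctionalProperty")],
                      ("?x", "owl:sameAs", "?z"))

assert partition_safe(symmetric)
assert not partition_safe(inverse_functional)
```

Rules that fail this test are exactly the ones where a join across partitions, and hence communication, seems unavoidable without a smarter partitioning scheme.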
We could also improve the loading time of these systems (and perhaps communication time during parallel hash joins) by using an RDF syntax that is less verbose than N-Triples but not as complex as Turtle. (Remember, we are concerned about parallel I/O.) To that end, we are exploring the definition of a subset of Turtle that would be helpful for I/O purposes without trading off the inherent simplicity of N-Triples (one triple per line).
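As a hypothetical example of what such a subset might look like (this is my sketch, not a settled design), suppose we allow Turtle-style prefixed names to shorten lines but keep exactly one triple per line, so that each line remains independently parsable, which is what parallel I/O needs when each process seeks to a byte offset and scans to the next newline:

```python
import re

# Hypothetical restricted-Turtle line format: one triple per line, terms are
# either full IRIs in angle brackets or prefixed names from a known table.
PREFIXES = {
    "ex": "http://example.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

TERM = re.compile(r"<([^>]*)>|(\w+):(\w+)")

def expand(token):
    m = TERM.fullmatch(token)
    if m.group(1) is not None:                  # full IRI in angle brackets
        return m.group(1)
    return PREFIXES[m.group(2)] + m.group(3)    # prefixed name

def parse_line(line):
    s, p, o = line.rstrip(" .\n").split()       # exactly three terms per line
    return (expand(s), expand(p), expand(o))

print(parse_line("ex:alice rdf:type ex:Person .\n"))
```

Because no statement spans multiple lines and no state (beyond the prefix table) is carried between lines, any process can start parsing at any line boundary, which is precisely the property that full Turtle gives up.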
We would also like to start employing more memory-efficient RDF storage data structures (like BitMat or Parliament) directly in our systems. This is particularly important for the Blue Gene/L architecture, which has at most 1 GB of memory per node.
And speaking of the Blue Gene/L, I have been doing all of my work at RPI's fabulous Computational Center for Nanotechnology Innovations (CCNI). The CCNI is a great computational facility, with parallel file systems, high-performance clusters, large SMP machines, and, of course, a Blue Gene/L. Such a resource is a great enabler for our research.
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute