Scaling Up at the Tetherless World Constellation in 2009

November 20th, 2009

Since this is my first post to the Tetherless World blog, perhaps a brief introduction is in order.  I’m Jesse Weaver, one of Jim Hendler’s Ph.D. students in the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).  My general research interest is in high-performance computing for the semantic web.  Specifically, I have been looking at employing parallelism on cluster architectures for rule-based reasoning and RDF query.  Since joining TWC in Fall 2008, I have been working with colleagues toward this end, and it is that work that I would like to share in this blog post.

Jim and I recently published a paper at ISWC 2009 entitled Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples.  Since the time that paper was accepted (and as presented at ISWC), we have actually scaled to billions of triples.  We show in this paper that the RDFS rules can be applied to independent partitions of data to produce the RDFS closure for all of the data, as long as each partition has the ontologies.  In parallel computing terms, the RDFS closure can be computed in an embarrassingly parallel fashion.  “Embarrassingly parallel” is a technical term from parallel computing describing a computation that can be divided into completely independent parts.  Such computations are considered ideal for parallelism because there is no need for communication between processes and hence there is essentially no overhead for parallelization.  Peter Patel-Schneider had some good questions and comments after the presentation.  I have made my responses publicly available in a brief note.

Gregory Todd Williams and I published a paper at SSWS 2009 entitled Scalable RDF query processing on clusters and supercomputers.  This paper shows how parallel hash joins can be used on high-performance clusters to efficiently query large RDF datasets.  It seemed to get a lot of attention at the SSWS workshop as well as stir up a little bit of controversy.  The interesting thing about our approach is that no global indexes are created.  Each process in the cluster gets a portion of the data and indexes it locally, but no global indexes are maintained (e.g., we do not globally dictionary encode RDF terms).  This allows us to load data extremely quickly with some cost to query time.  In many cases, though, the decrease in loading time outweighs the added cost in query time.  (The added cost in query time comes from communicating full string values instead of global IDs during the parallel hash join.)  This allows for exploratory querying and easy handling of dynamically changing data.  Whereas many previous query systems depend heavily on global indexes (for which loading can take on the order of hours or days), we can load large datasets on the order of seconds and minutes.  Therefore, if the data changes, it can just be reloaded instead of updating indexes.

Finally, Greg, Medha Atre, Jim, and I submitted a paper to the Billion Triples Challenge (BTC), which we won!

Greg and Jesse accept the award for 1st place at the 2009 Billion Triples Challenge

We composed together three systems for our submission.  First, we created a simple upper ontology of 31 triples for our domain of interest, linking established concepts of Person to our concept of Person (by subclass), and we did the same for many relevant properties (name, email, etc.) (by subproperty).  Then, we used the aforementioned parallel materialization work to produce inferences on the BTC dataset, inferring triples that use our terms from the upper ontology.  Using the aforementioned work on scalable query, we then extracted only our triples of interest.  This reduced dataset is almost 800K triples, much more manageable than the original 900M triples, and it can now be used by existing tools without much concern of dataset size.  As a finishing touch, we compressed the reduced dataset down into a BitMat RDF data structure, resulting in a final disk space of 8 MB for the triples and 25 MB for the dictionary encoding.  Simple basic graph pattern queries can be executed against the BitMat.  The entire process took roughly 22 minutes.  See more about the submission at our BTC website which contains the datasets and some statistics about the datasets.

Jesse presents at the 2009 Billion Triples Challenge

That being said, the future holds much work to be done for scalability in the semantic web domain.

At present, I have been looking at formalizing a more general notion of “abox partitioning” for the purpose of classifying rules that fit such a paradigm, and then explore its application to OWL2RL.  Some parts of OWL2RL—like symmetric properties and inverse properties—clearly fit in the inferencing scheme from the parallel materialization paper.  However, many of the much desired features—like inverse functional properties and owl:sameAs—do not.  For such rules, parallel hash joins may be needed, or perhaps a more clever partitioning scheme.

We could also improve loading time of these systems (and perhaps communication time during parallel hash joins) by using an RDF syntax that is less verbose than N-Triples, but not as complex as Turtle.  (Remember, we are concerned about parallel I/O.)  To that end, we are exploring defining a subset of Turtle that would be helpful for I/O purposes without trading off the inherent simplicity of N-Triples (one triple per line).

We would also like to start employing more memory-efficient RDF storage data structures (like BitMat or Parliament) directly in our systems.  This is particularly important for the Blue Gene/L architecture which has at most 1 GB of memory per node.

And speaking of the Blue Gene/L, I have been doing all my work at RPI’s fabulous Computational Center for Nanotechnology Innovations (CCNI).  The CCNI is really a great computation facility having parallel file systems, high performance clusters, large SMP machines, and—of course—a Blue Gene/L.  Such a resource is a great enabler for our research.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~weavej3/

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: Jesse Weaver Categories: tetherless world Tags:

Parliament, storage density, and napkin math

November 8th, 2009

What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine (any errors are presumably with my understanding of the described system).

The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.

The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:

field type
subject 4 byte Offset into Resource Table
predicate 4 byte Offset into Resource Table
object 4 byte Offset into Resource Table
next subject 4 byte Offset into Statement List
next predicate 4 byte Offset into Statement List
next object 4 byte Offset into Statement List
Statement List Record

The resource table is an array of fixed width records for each resource containing three offsets into the statement list for the first statement with the specific resource in subject, predicate, and object position, respectively, three counts representing the length of these three lists, and the resoure value. The storage requirements for a resource table record’s fields are:

Resource Table Record
field type
first subject 4 byte Offset into Statement List
first predicate 4 byte Offset into Statement List
first object 4 byte Offset into Statement List
subject count 4 byte Integer
predicate count 4 byte Integer
object count 4 byte Integer
value 8 byte pointer to n byte resource value

Puting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:

(24 bytes * |triples|) + ((32+n) bytes * |nodes|)

where n is the average resource value size.

To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.

Using these numbers, the statement list and resource table for 100M triples would take (24*100M) + ((32+109)*34144466) bytes = 7214369706 bytes = 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, with a storage estimate of 288GB.

I’d like to try implementing this and using it on our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit since we’d like high storage density but haven’t spent any effort on that front yet (and our current code is awful in this respect) and our parallel query answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It’ll be interesting to see if the numbers actually pan out, and if the query algorithm can handle anything close to that much data.

Gregory Todd Williams

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: greg Categories: Semantic Web, tetherless world Tags:

Web Accessibility in an Educational Context

October 25th, 2009

I’m currently at the annual meeting for the Human Factors and Ergonomics Society and a few hours ago I was attending a set of presentations by the Internet TG on Input and Output on the Web. One of the talks that caught my attention was by Wayne Shebilske of Wright State University on the use of screen readers to help impaired students complete tasks within educational programs such as WebCT, student status management software, and the like. There seem to be many problems with such tools as Dr. Shebilske pointed out based on his study. I think this could be a prime candidate for SW technologies to step in to improve the end user experience. I foresee using such tools along with technologies like RDFa to give the user a better representation of the content being displayed and improving the overall quality of life for these individuals.

Evan

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: Evan Categories: Semantic Web Tags: ,

Probing the SPARQL endpoint of data.gov.uk

October 23rd, 2009

We just ran across the preview SPARQL endpoint for UK’s Data.gov (powered by Talis) following Harry Metcalfe’s blog . In order to understand what data is hosted by the triple store, we use a series of SPARQL queries to probe the content in data.gov. We leverage a web service http://data-gov.tw.rpi.edu/ws/sparqlproxy.php to convert SPARQL/XMl result into HTML and JSON.

First, let’s do some warm up exercises

Q: show me some triples!

SPARQL:

SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5

Result:

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://www.gazettes-online.co.uk/ontology#hasPublicationDate> “2007-05-02″^^http://www.w3.org/2001/XMLSchema#date .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://xmlns.com/foaf/0.1/page> <http://www.london-gazette.co.uk/issues/58316/pages/6359> .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://purl.org/dc/terms/modified> “2007-05-02″^^http://www.w3.org/2001/XMLSchema#date .

A: ok, it has some gazette dataset about lond (see http://www.london-gazette.co.uk/), and it uses FOAF and DC vocabulary.

Q: show me some classes and their instances?

SPARQL:

SELECT DISTINCT ?c WHERE { [] a ?c. } LIMIT 5

Result:

<http://www.gazettes-online.co.uk/ontology/transport#RoadTrafficActsNotice>

<http://www.gazettes-online.co.uk/ontology#Notice>

<http://xmlns.com/foaf/0.1/Document>

<http://www.gazettes-online.co.uk/ontology#Issue>

<http://www.gazettes-online.co.uk/ontology#Edition>

A: same observation as above.

Q: Does it host any named graphs?

SPARQL:

SELECT ?g WHERE {GRAPH ?g { ?s ?p ?o } } LIMIT 10

Result: “0″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph found, and there is only one big default graph

Now let’s run several expensive aggregation queries (note aggregation queries are not part of the current SPARQL specification)

Q: How many triples?

SPARQL:

SELECT count(*) WHERE {?s ?p ?o}

Result: “5529380″^^<http://www.w3.org/2001/XMLSchema#integer>

A: alright! Aggregation query is support, and there are 5 million triples. Note “count” is a non-standard aggregation function, it may be support differently be different SPARQL endpoints.

Q: How many graphs (and the number of triples in each graph)?

SPARQL:

SELECT ?g count(*) WHERE {GRAPH ?g { ?s ?p ?o } }

Result: “0″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph.

Q: How many populated classes?

SPARQL:

SELECT count(distinct ?c) WHERE {[] a ?c}

Result: “99″^^<http://www.w3.org/2001/XMLSchema#integer>

A: there are 99 different classes having direct instances in this triple store.

Q: How many populated properties?

SPARQL:

SELECT count(distinct ?p) WHERE {[] ?p ?o}

Result: “86″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: There are 86 unique properties being used as predicate in this dataset. Each property has used by 64K (5529380/86) triples in average. There must be some very popular properties, and we will do that survey later.

Q: How many typed individuals?

SPARQL:

SELECT count(distinct ?s) WHERE {?s a ?c}

Result: 504 Gateway Time-out

A: Opps, really expensive. Let’s try something else

Q: How many defined classes?

SPARQL:

SELECT count(distinct ?s) WHERE {{?s a <http://www.w3.org/2002/07/owl#Class> } UNION {?s a <http://www.w3.org/2000/01/rdf-schema#Class>}}

Result: 0

A: no class defined, and the triple store is full of individuals

Q: How many individuals (again)?

SPARQL:

SELECT count(*) WHERE {[] a ?c}

Result: 995694

A: There are nearly 1 millions of typed individuals, so we can easily see every invidual has 5 (5M/1M) triples in average.

Now, let’s do some knowledge discovery

Q: Show me the 3 most/least used classes in this dataset?

SPARQL:

1) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by desc(?count) limit 3

2) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by ?count limit 3

Result:

1) most used classes

<http://www.gazettes-online.co.uk/ontology#Notice> “156452″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/2006/vcard/ns#Address> “106934″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/person#Person> “87798″ ^^<http://www.w3.org/2001/XMLSchema#integer>

2) least used classes

<http://www.gazettes-online.co.uk/ontology/transport#CycleTracksNotice> “1″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/transport#PortsNotice> “1″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#AdministrationOrder> “2″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: Again, the SPAQL query is “safe” because Talis support these SPARQL extensions (c.f. http://n2.talis.com/wiki/SPARQL_Extensions). The class that has the most number of instances in this dataset is http://www.gazettes-online.co.uk/ontology#Notice (which has 156,452 instances.)

Q: what about property usage (top 5)?

SPARQL:

1) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by DESC(?count) limit 5

2) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by ?count limit 5

Result:

1) 5 most used properties:

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> “995694″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://purl.org/dc/dcam/memberOf> “312476″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/1999/02/22-rdf-syntax-ns#value> “310170″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasPublicationDate> “181940″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasNoticeCode> “181335″ ^^<http://www.w3.org/2001/XMLSchema#integer>

2) 5 least used properties:

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade> “1″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasAuthorisingOrganisation> “2″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/personal-legal#isForNextOfKinOf> “2″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#orderAdministrator> “3″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#authorisingOrganisation> “49″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: Some properties are really used (e.g., http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade has been used only once) while some are heavily used (like rdf:type, which is the most frequently used predicate).

Conclusion

We need to stop probing now. A number of complex queries ended up with a timeout error because (i) “LIMIT” only control the final results, so that we cannot just get statistical results on 1000 triples; (ii) “GROUP BY” may produce too many intermediate results, (iii) the statistics queries does not leverage the index structure of triple store, or index structure are not designed for handling such queries, and (iv) many other issues.

Updates

presented by Li Ding and Zhengning Shangguan

VN:F [1.2.0_562]
Rating: 10.0/10 (2 votes cast)
Author: li Categories: Semantic Web, linked data Tags:

OWL 2 Reference Card released

October 18th, 2009

We’re pleased to announce the OWL 2 Reference Card [1]. The Card is meant to be a “cheat sheet” of OWL 2 features printable on a single piece of paper (on both sides). It is based on the OWL 2 Quick Reference Guide [1], which is now a Proposed Recommendation [2] in the OWL 2 Web Ontology Language document set.

Background: OWL 2 [4] is an extension to OWL 1 with a few new functionalities. Some of the new features are syntactic sugar (e.g., disjoint union of classes) while others offer new expressivity, including:

* keys;
* property chains;
* richer datatypes, data ranges;
* qualified cardinality restrictions;
* asymmetric, reflexive, and disjoint properties; and
* enhanced annotation capabilities

Comments and suggestions to the Card are welcome (please send to public-owl-comments@w3.org)

[1] http://www.w3.org/2007/OWL/refcard

[2] http://www.w3.org/2007/OWL/wiki/Quick_Reference_Guide

[3] http://www.w3.org/TR/2009/PR-owl2-quick-reference-20090922/

[4] http://www.w3.org/TR/owl2-overview/

Jie Bao

VN:F [1.2.0_562]
Rating: 10.0/10 (1 vote cast)
Author: Jie Bao Categories: tetherless world Tags: