Archive for November, 2009

The RPI Data-gov Wiki: Current and Future Status

November 24th, 2009

This blog post is being written in response to some questions asked of late about our work on turning the datasets from http://data.gov (and also some other government datasets) into linked data formats (RDF) and making them available to the community. In essence, the criticism has been that although we've made a good start, there's still a lot more to do. We couldn't agree more.

Our http://data-gov.tw.rpi.edu wiki has been made available to help the public build, access, and reuse linked government data. In order to get the data out quickly, we took some simple steps to start, showing how powerful it could be just to republish the data in RDF. However, we are now also working to bring in more linking between the data, more datasets from other government data sites, and more semantics relating the datasets.

Our first step was to release a raw RDF version of the data.gov datasets and build some quick demos to show how easy it was (see http://data-gov.tw.rpi.edu/wiki/Demos). The benefit of this was that we could easily dump data from other formats into RDF, merge it in simple ways, query it using SPARQL, and put the results straight into visualizations using "off the shelf" Web tools such as Google Visualization and MIT's Exhibit. In this step we followed a "minimalism" principle: minimize human and development effort and keep the process simple and replicable. Thus we did not try to do a lot of analysis of the data, did not add triples for things such as provenance, and did not link the datasets directly. Rather, the linking in our early demos came from obvious connections such as shared locations or temporal-spatial overlaps.
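To make that pattern concrete, here is a minimal sketch (not our actual demo code) of the convert-merge-query workflow using rdflib; the file names and the ex: property URIs are hypothetical stand-ins for two converted data.gov datasets that happen to share a state column.

from rdflib import Graph

g = Graph()
g.parse("dataset_92.rdf")   # hypothetical: RDF conversion of one data.gov dataset
g.parse("dataset_401.rdf")  # hypothetical: a second converted dataset

# Join the two datasets on a shared (hypothetical) state property and select
# values that a Web widget such as Exhibit or Google Visualization can chart.
results = g.query("""
    PREFIX ex: <http://example.org/datagov/vocab#>
    SELECT ?state ?unemployment ?spending WHERE {
        ?row1 ex:state ?state ; ex:unemploymentRate ?unemployment .
        ?row2 ex:state ?state ; ex:federalSpending ?spending .
    }
""")
for state, unemployment, spending in results:
    print(state, unemployment, spending)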

The second step, which is ongoing right now, is to improve data quality by cleaning the data and enriching its semantics. We are improving the human (and machine) versions of our data.gov catalog (http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog), which is important for people outside government who want to use the data. For example:

1. The catalog aggregates metadata from data.gov (http://www.data.gov/details/92) and metadata about our curated RDF version. The aggregated metadata is published in both human-readable and machine-readable (RDF) forms.

2. Every dataset has a dereferenceable URI of its own, with links to the raw data, to linked RDF datasets broken into chunks small enough for linked data browsers such as Tabulator, and to the complete converted RDF data documents.

3. We use the definitions from data.gov (their dataset 92 metadata dictionary, as it were) for the metadata of each file, but we also add some DC and FOAF terms and a couple of small home-brewed properties (like the number of triples) in an ontology called DGTWC.

4. We are also now linking to more online "enhanced" datasets that include (we have only done this for a few so far) normalized triples extracted from the raw data and links from entities (such as locations, organizations, and persons) in the government datasets to the linked data cloud (DBpedia and GeoNames so far, with much more coming soon). We are also exploring the use of VoID for adding more data descriptions (and IRIs) to the datasets.

5. We are also working on linking the datasets by common properties. This is harder than most people think, because you cannot just assume that the same name means the same relation: properties with the same name can have different ranges, values, or even semantics (and we have found examples of all of the above). So soon, for each such property, you will find something like this:
geo:lat a rdf:Property .
191:latitude rdfs:subPropertyOf geo:lat .
398:latitude rdfs:subPropertyOf geo:lat .

and we have a Semantic Wiki page for each property, so you can find all the subproperty relations and, eventually, a place where people can add information about what some of the more obscure properties mean, or where semantic relations such as owl:sameAs can be introduced when these subproperties are known to be the same.
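As a small, hypothetical sketch of how those declarations get used (the graph file name is a placeholder, and the property URIs follow the snippet above), a client can collect geo:lat together with every declared subproperty of it and then treat all of them as latitude:

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

g = Graph()
g.parse("datagov_properties.rdf")  # hypothetical: property declarations plus data

# geo:lat itself, plus every property declared rdfs:subPropertyOf geo:lat
# (e.g. the dataset-191 and dataset-398 latitude columns above).
lat_properties = {GEO.lat}
lat_properties.update(g.subjects(RDFS.subPropertyOf, GEO.lat))

# Any triple using one of these predicates carries a latitude value.
for prop in lat_properties:
    for entity, value in g.subject_objects(prop):
        print(entity, value)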

To summarize: our first step, which we continue to pursue, is to transform the data in ways that let other people start adding value. Our second goal, which we are now working on, is enhanced metadata and added semantics, including what is needed for more linking.

In our research role we are also working on next-generation technologies for doing more exciting things (think "read/write Web of data"), but we are trying to keep that separate from the work at http://data-gov.tw.rpi.edu, which is aimed at showing that Semantic Web technologies are really the only game in town for the sort of large-scale, open, distributed data use that is needed for linked data to really take off.

And if you feel there is stuff missing, let us know (via the "contact us" page), or even better, dig in yourself: all our stuff is open (see http://code.google.com/p/data-gov-wiki/), free, and easy to use. All we ask is that you do great stuff and help make the Web of data grow.

Jim Hendler, Li Ding, and the RPI data-gov research team.


Scaling Up at the Tetherless World Constellation in 2009

November 20th, 2009

Since this is my first post to the Tetherless World blog, perhaps a brief introduction is in order.  I’m Jesse Weaver, one of Jim Hendler‘s Ph.D. students in the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).  My general research interest is in high-performance computing for the semantic web.  Specifically, I have been looking at employing parallelism on cluster architectures for rule-based reasoning and RDF query.  Since joining TWC in Fall 2008, I have been working with colleagues toward this end, and it is that work that I would like to share in this blog post.

Jim and I recently published a paper at ISWC 2009 entitled Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples.  Since the time that paper was accepted (and as presented at ISWC), we have actually scaled to billions of triples.  We show in this paper that the RDFS rules can be applied to independent partitions of data to produce the RDFS closure for all of the data, as long as each partition has the ontologies.  In parallel computing terms, the RDFS closure can be computed in an embarrassingly parallel fashion.  “Embarrassingly parallel” is a technical term from parallel computing describing a computation that can be divided into completely independent parts.  Such computations are considered ideal for parallelism because there is no need for communication between processes and hence there is essentially no overhead for parallelization.  Peter Patel-Schneider had some good questions and comments after the presentation.  I have made my responses publicly available in a brief note.
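For readers who want the flavor of the approach, here is a toy sketch (my own simplification, not the code from the paper): split the instance triples into partitions, give every partition a copy of the ontology triples, and run an RDFS rule in each partition with no communication between processes. Only the subclass rule (rdfs9) is shown, and the data is made up.

from multiprocessing import Pool

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Schema (ontology) triples, replicated to every partition.
ONTOLOGY = [("ex:Student", SUBCLASS, "foaf:Person")]

# Instance (abox) triples, split arbitrarily across partitions.
PARTITIONS = [
    [("ex:alice", RDF_TYPE, "ex:Student")],
    [("ex:bob", RDF_TYPE, "ex:Student")],
]

def close_partition(abox):
    """Apply rdfs9 to one partition until no new triples appear."""
    triples = set(abox) | set(ONTOLOGY)
    subclass = {(s, o) for s, p, o in triples if p == SUBCLASS}
    changed = True
    while changed:
        inferred = {(x, RDF_TYPE, sup)
                    for x, p, cls in triples if p == RDF_TYPE
                    for sub, sup in subclass if sub == cls}
        changed = not inferred <= triples
        triples |= inferred
    return triples

if __name__ == "__main__":
    with Pool() as pool:
        closed = pool.map(close_partition, PARTITIONS)
    # The union of the per-partition closures is the closure of the whole
    # dataset (for the finite RDFS fragment treated in the paper).
    for triple in sorted(set().union(*closed)):
        print(triple)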

Gregory Todd Williams and I published a paper at SSWS 2009 entitled Scalable RDF query processing on clusters and supercomputers.  This paper shows how parallel hash joins can be used on high-performance clusters to efficiently query large RDF datasets.  It seemed to get a lot of attention at the SSWS workshop as well as stir up a little bit of controversy.  The interesting thing about our approach is that no global indexes are created.  Each process in the cluster gets a portion of the data and indexes it locally, but no global indexes are maintained (e.g., we do not globally dictionary encode RDF terms).  This allows us to load data extremely quickly with some cost to query time.  In many cases, though, the decrease in loading time outweighs the added cost in query time.  (The added cost in query time comes from communicating full string values instead of global IDs during the parallel hash join.)  This allows for exploratory querying and easy handling of dynamically changing data.  Whereas many previous query systems depend heavily on global indexes (for which loading can take on the order of hours or days), we can load large datasets on the order of seconds and minutes.  Therefore, if the data changes, it can just be reloaded instead of updating indexes.
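Here is a rough sketch of the idea, under my own simplifying assumptions rather than the paper's actual code: solutions to two triple patterns are routed to workers by hashing the full string value of the shared variable (no global dictionary encoding), and each worker then performs an ordinary hash join on just the bindings it received.

N_WORKERS = 4

def route(bindings, join_var):
    """Partition a list of solution mappings by hash of the join value."""
    buckets = [[] for _ in range(N_WORKERS)]
    for b in bindings:
        buckets[hash(b[join_var]) % N_WORKERS].append(b)
    return buckets

def local_join(left, right, join_var):
    """Hash join of two binding lists that share join_var."""
    table = {}
    for b in left:
        table.setdefault(b[join_var], []).append(b)
    out = []
    for b in right:
        for match in table.get(b[join_var], []):
            out.append({**match, **b})
    return out

# Toy solutions for two triple patterns sharing ?person.
pattern1 = [{"?person": "ex:alice", "?name": '"Alice"'},
            {"?person": "ex:bob", "?name": '"Bob"'}]
pattern2 = [{"?person": "ex:alice", "?email": "mailto:alice@example.org"}]

# Route both sides, then join each bucket independently (in the real system
# each bucket would live on a different cluster node).
left_buckets = route(pattern1, "?person")
right_buckets = route(pattern2, "?person")
for l, r in zip(left_buckets, right_buckets):
    for solution in local_join(l, r, "?person"):
        print(solution)

Because the routing key is the full lexical value of the term, fresh data can be loaded and joined without any coordination around a global dictionary, which is where the fast load times come from.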

Finally, Greg, Medha Atre, Jim, and I submitted a paper to the Billion Triples Challenge (BTC), which we won!

Greg and Jesse accept the award for 1st place at the 2009 Billion Triples Challenge

We combined three systems for our submission. First, we created a simple upper ontology of 31 triples for our domain of interest, linking established Person concepts to our own Person concept (by subclass) and doing the same for many relevant properties (name, email, etc.) (by subproperty). Then, we used the aforementioned parallel materialization work to produce inferences on the BTC dataset, inferring triples that use our terms from the upper ontology. Using the aforementioned work on scalable query, we then extracted only our triples of interest. This reduced dataset is almost 800K triples, much more manageable than the original 900M triples, and it can now be used by existing tools without much concern about dataset size. As a finishing touch, we compressed the reduced dataset into a BitMat RDF data structure, resulting in a final disk footprint of 8 MB for the triples and 25 MB for the dictionary encoding. Simple basic graph pattern queries can be executed against the BitMat. The entire process took roughly 22 minutes. See more about the submission at our BTC website, which contains the datasets and some statistics about them.
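The upper ontology itself is not reproduced here, but the pattern it follows looks roughly like the sketch below (the btc2009 namespace and the particular triples are illustrative only): established vocabularies are mapped onto our own terms with rdfs:subClassOf and rdfs:subPropertyOf, so that forward inference rewrites matching BTC data into our vocabulary.

from rdflib import Graph, Namespace
from rdflib.namespace import FOAF, RDFS

BTC = Namespace("http://example.org/btc2009#")  # hypothetical namespace

onto = Graph()
onto.add((FOAF.Person, RDFS.subClassOf, BTC.Person))   # established class -> our class
onto.add((FOAF.name, RDFS.subPropertyOf, BTC.name))    # established property -> our property
onto.add((FOAF.mbox, RDFS.subPropertyOf, BTC.email))

print(onto.serialize(format="turtle"))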

Jesse presents at the 2009 Billion Triples Challenge

That said, much work remains to be done on scalability in the semantic web domain.

At present, I have been looking at formalizing a more general notion of "abox partitioning" for the purpose of classifying rules that fit such a paradigm, and then exploring its application to OWL2RL.  Some parts of OWL2RL—like symmetric properties and inverse properties—clearly fit the inferencing scheme from the parallel materialization paper.  However, many of the most-desired features—like inverse functional properties and owl:sameAs—do not.  For such rules, parallel hash joins may be needed, or perhaps a more clever partitioning scheme.
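As a rough illustration of the distinction (my own framing, with made-up property URIs): a rule whose body touches only one instance triple, plus schema triples that every partition already has, can fire correctly inside any single partition, while a rule whose body needs two instance triples may find them split across partitions.

SYMMETRIC = {"ex:marriedTo"}    # declared symmetric in the replicated schema
INV_FUNCTIONAL = {"ex:email"}   # declared inverse functional in the replicated schema

def symmetric_rule(partition):
    """prp-symp: one abox premise, so it is safe to run per partition."""
    return {(o, p, s) for s, p, o in partition if p in SYMMETRIC}

def inv_functional_rule(partition):
    """prp-ifp: needs two abox triples with the same object; if those two
    triples land in different partitions, a purely local run misses the
    owl:sameAs conclusion, so partitioning alone is not enough here."""
    by_object = {}
    for s, p, o in partition:
        if p in INV_FUNCTIONAL:
            by_object.setdefault(o, set()).add(s)
    return {(a, "owl:sameAs", b)
            for subjects in by_object.values()
            for a in subjects for b in subjects if a != b}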

We could also improve loading time of these systems (and perhaps communication time during parallel hash joins) by using an RDF syntax that is less verbose than N-Triples, but not as complex as Turtle.  (Remember, we are concerned about parallel I/O.)  To that end, we are exploring defining a subset of Turtle that would be helpful for I/O purposes without trading off the inherent simplicity of N-Triples (one triple per line).

We would also like to start employing more memory-efficient RDF storage data structures (like BitMat or Parliament) directly in our systems.  This is particularly important for the Blue Gene/L architecture which has at most 1 GB of memory per node.

And speaking of the Blue Gene/L, I have been doing all my work at RPI’s fabulous Computational Center for Nanotechnology Innovations (CCNI).  The CCNI is a great computation facility, with parallel file systems, high-performance clusters, large SMP machines, and—of course—a Blue Gene/L.  Such a resource is a great enabler for our research.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~weavej3/


Parliament, storage density, and napkin math

November 8th, 2009

What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine (any errors are presumably in my understanding of the described system).

The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.

The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:

Statement List Record

field            type
subject          4-byte offset into Resource Table
predicate        4-byte offset into Resource Table
object           4-byte offset into Resource Table
next subject     4-byte offset into Statement List
next predicate   4-byte offset into Statement List
next object      4-byte offset into Statement List

The resource table is an array of fixed-width records, one per resource, containing three offsets into the statement list (pointing to the first statement with that resource in subject, predicate, and object position, respectively), three counts giving the lengths of those three lists, and the resource value. The storage requirements for a resource table record’s fields are:

Resource Table Record

field            type
first subject    4-byte offset into Statement List
first predicate  4-byte offset into Statement List
first object     4-byte offset into Statement List
subject count    4-byte integer
predicate count  4-byte integer
object count     4-byte integer
value            8-byte pointer to n-byte resource value
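As a quick sanity check on the field sizes assumed above, the two record layouts come out to 24 bytes per statement record and 32 bytes per resource record (not counting the out-of-line resource value):

import struct

STATEMENT_RECORD = "<iii iii"   # s, p, o IDs plus next-s, next-p, next-o offsets
RESOURCE_RECORD = "<iii iii q"  # first-s/p/o offsets, three counts, value pointer

print(struct.calcsize(STATEMENT_RECORD))  # 24
print(struct.calcsize(RESOURCE_RECORD))   # 32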

Putting aside the storage cost of the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:

(24 bytes * |triples|) + ((32+n) bytes * |nodes|)

where n is the average resource value size.

To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.

Using these numbers, the statement list and resource table for 100M triples would take (24*100M) + ((32+109)*34144466) bytes = 7214369706 bytes = 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, with a storage estimate of 288GB.
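For anyone who wants to redo the arithmetic, here is the same napkin math written out in a few lines of Python; all of the constants are the rough figures quoted above, so the output is an estimate, not a measurement.

STATEMENT_BYTES = 24          # per statement-list record
RESOURCE_FIXED_BYTES = 32     # per resource-table record, excluding the value
AVG_VALUE_BYTES = 109         # average serialized node size in the BTC subset

def estimate(triples, nodes):
    return STATEMENT_BYTES * triples + (RESOURCE_FIXED_BYTES + AVG_VALUE_BYTES) * nodes

# 100M-triple BTC subset with ~34M unique nodes.
print(estimate(100_000_000, 34_144_466) / 2**30, "GB")   # ~6.7

# 32-bit offsets cap a single store at ~4B statements; scale the node count
# at the same ~341-nodes-per-thousand-triples ratio observed in the subset.
print(estimate(4_000_000_000, int(4_000_000_000 * 0.341)) / 1e9, "GB")  # ~288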

I’d like to try implementing this and using it in our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit: we want high storage density but haven’t spent any effort on that front yet (our current code is awful in this respect), and our parallel query-answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It will be interesting to see whether the numbers actually pan out, and whether the query algorithm can handle anything close to that much data.

Gregory Todd Williams
