Parliament, storage density, and napkin math

November 8th, 2009

What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine (any errors are presumably with my understanding of the described system).

The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.

The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:

Statement List Record

field           type
--------------  ---------------------------------
subject         4-byte offset into Resource Table
predicate       4-byte offset into Resource Table
object          4-byte offset into Resource Table
next subject    4-byte offset into Statement List
next predicate  4-byte offset into Statement List
next object     4-byte offset into Statement List
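The "next" fields turn the flat statement array into per-resource linked lists: all statements sharing a subject are chained together through their "next subject" offsets. Here is an illustrative Python sketch (with made-up data and names, not Parliament's actual code) of walking one such chain:

```python
# Illustrative sketch: statements sharing a subject form a linked list
# threaded through the statement records via their "next subject" offsets.

NIL = 0xFFFFFFFF  # assumed end-of-list sentinel for a 4-byte offset

# A statement record: (subject, predicate, object,
#                      next_subj, next_pred, next_obj),
# where the next_* fields are indexes into the statement list.
statements = [
    (1, 10, 2, 2,   NIL, NIL),  # index 0: first statement with subject 1
    (3, 10, 2, NIL, NIL, NIL),  # index 1
    (1, 11, 4, NIL, NIL, NIL),  # index 2: next statement with subject 1
]

# Per-resource list heads; here we model only the "first subject" offset.
first_subject = {1: 0}

def statements_with_subject(rid):
    """Walk the next-subject chain starting from the resource table entry."""
    i = first_subject.get(rid, NIL)
    while i != NIL:
        s, p, o, next_subj, _, _ = statements[i]
        yield (s, p, o)
        i = next_subj

print(list(statements_with_subject(1)))  # → [(1, 10, 2), (1, 11, 4)]
```

This gives index-free adjacency for the three access patterns (by subject, predicate, or object) without any sorted structure over the statements.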

The resource table is an array of fixed-width records, one per resource, containing three offsets into the statement list (pointing to the first statement with that resource in subject, predicate, and object position, respectively), three counts giving the lengths of those three lists, and the resource value. The storage requirements for a resource table record’s fields are:

Resource Table Record

field            type
---------------  ----------------------------------------
first subject    4-byte offset into Statement List
first predicate  4-byte offset into Statement List
first object     4-byte offset into Statement List
subject count    4-byte integer
predicate count  4-byte integer
object count     4-byte integer
value            8-byte pointer to n-byte resource value
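As a sanity check on the sizes above, the two fixed-width layouts can be sketched with Python's struct module (the field packing and endianness here are assumptions for illustration):

```python
import struct

# Record layouts under the 4-byte-offset assumption above.
# "<" = little-endian, no padding; "I" = 4-byte unsigned int,
# "Q" = 8-byte field (the value pointer).
statement_record = struct.Struct("<6I")   # s, p, o + three next-* offsets
resource_record  = struct.Struct("<6IQ")  # three first-* offsets, three counts, value pointer

print(statement_record.size)  # → 24
print(resource_record.size)   # → 32
```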

Putting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:

(24 bytes * |triples|) + ((32+n) bytes * |nodes|)

where n is the average resource value size.

To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.

Using these numbers, the statement list and resource table for 100M triples would take (24 × 100M) + ((32 + 109) × 34,144,466) bytes = 7,214,369,706 bytes ≈ 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, with a storage estimate of 288GB.
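The arithmetic is easy to reproduce; a few lines of Python using the BTC figures above (100M triples, 34,144,466 nodes, ~109 bytes per node, ~341 nodes per thousand triples):

```python
# Reproducing the napkin math with the record sizes derived above.
STMT_REC = 24        # statement list record size (bytes)
RES_REC_FIXED = 32   # resource table record size, excluding the value itself

def storage_bytes(triples, nodes, avg_value_size):
    return STMT_REC * triples + (RES_REC_FIXED + avg_value_size) * nodes

# BTC subset: 100M triples, ~34M nodes, ~109 bytes per node value.
btc = storage_bytes(100_000_000, 34_144_466, 109)
print(btc)  # → 7214369706 (~6.7GB)

# 32-bit-offset ceiling: ~4B statements, scaling node count at the
# observed ratio of ~341 unique nodes per thousand triples.
limit = storage_bytes(4_000_000_000, int(4_000_000_000 * 0.341), 109)
print(round(limit / 1e9))  # ~288 GB
```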

I’d like to try implementing this and using it in our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit: we’d like high storage density but haven’t spent any effort on that front yet (our current code is awful in this respect), and our parallel query-answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It’ll be interesting to see whether the numbers actually pan out, and whether the query algorithm can handle anything close to that much data.

Gregory Todd Williams


Foaf Groups and Bloom Filters

January 29th, 2008

Bloom filters seem to be popping up in the foaf world quite a bit recently. Henry Story has been posting lots of good thoughts on this (see “My Bloomin’ Friends” from this past August as well as more recent postings on foaf-dev). Henry used an example Bloom Filter vocabulary in his discussion that looked something like this:

[] a :Bloom ;
   :hashNum 4 ;
   :length 1000 ;
   :base64String "..." .

I asked him if he had developed an actual vocabulary for this, or if he knew of any similar work, but didn’t find any published work.

I’d like to flesh this idea out a bit and get a usable vocabulary out of it. I think what Henry came up with looks good, but think it probably needs another predicate to identify the algorithm used to generate the filter. This would allow the vocabulary to be used across languages and environments which might rely on different hashing algorithms and allow systems to determine if they can interact through an arbitrary filter.

Here’s what a simple filter might look like as emitted by the LOAF tools:

[] a :Bloom ;
   :algorithm <> ;
   :hashNum 4 ;
   :length 1000 ;
   :base64String "..." .

The idea here is that the value of the :algorithm predicate would indicate what algorithm was used to compute the filter (the value stashed in :base64String). Each :algorithm might imply its own set of parameters or allow for additional algorithm-specific predicates; based on the LOAF example, such additional information would include the underlying hashing algorithm for each hash (such as “sha1;salt=a”, “sha1;salt=b”, …). Of course, the :algorithm value should really be defined by the person who’s actually defining the algorithm (so my defining one for the LOAF algorithm in their namespace is bad form, but hopefully the idea is clear).
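To make the algorithm-identification point concrete, here is a minimal Python sketch of a filter in the LOAF style (:hashNum, :length, :base64String). The hashing scheme used here, SHA-1 with per-hash one-letter salts ("a", "b", …), is only an assumption modeled on the LOAF example, not a published :algorithm definition:

```python
import base64
import hashlib

# A minimal Bloom filter sketch with LOAF-style parameters. The hash
# scheme (SHA-1 with salts "a", "b", ...) is an illustrative assumption;
# a real :algorithm value would pin this down precisely.

class Bloom:
    def __init__(self, length=1000, hash_num=4):
        self.length = length        # :length  — number of bits
        self.hash_num = hash_num    # :hashNum — hashes per value
        self.bits = bytearray((length + 7) // 8)

    def _positions(self, value):
        for i in range(self.hash_num):
            salt = chr(ord("a") + i).encode()
            digest = hashlib.sha1(salt + value.encode()).digest()
            yield int.from_bytes(digest, "big") % self.length

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

    def base64_string(self):
        """The value that would be published as :base64String."""
        return base64.b64encode(bytes(self.bits)).decode()

f = Bloom()
f.add("mailto:alice@example.org")
print("mailto:alice@example.org" in f)  # → True
```

Two agents can only interoperate through the published :base64String if they agree on every detail of _positions — which is exactly why the vocabulary needs an :algorithm predicate.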

Bringing this back to Henry’s example, a FOAF file might then do something like this (modulo the appropriate OpenID login stuff):

<public> rdfs:seeAlso <protected> .
<protected> :readableBy [
  a foaf:Group ;
  bloom:filter [
    a bloom:Bloom ;
    bloom:algorithm <> ;
    bloom:hashNum 3 ;
    bloom:length 400 ;
    bloom:base64String "AIhDCURG/nWpcj1Q6OlJ4xV3Xsg9k4IiBgO2zye+PYLhRzwdok9BcTQdKGDCZDpoHgiN"
  ]
] .

Once they log in, this would allow anyone in the (hidden) group to retrieve more data from <protected> than was available at <public>. I’m not sure how it would play with TAG guidelines, but this could also be used to provide varying levels of data at <public> depending on whether the client has logged in and what credentials they might have.

Of course, I have my own reasons for playing around with Bloom filters and RDF having to do with SPARQL, but I’ll save that for another post.
