Parliament, storage density, and napkin math
What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine. Any errors are presumably in my understanding of the described system.
The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.
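As a minimal sketch of the resource dictionary idea (ignoring the on-disk B-tree that Parliament actually uses), the mapping just assigns each distinct node value a sequential integer ID; the class and method names here are mine, not Parliament's:

```python
# Hypothetical in-memory resource dictionary: node value -> integer ID.
# Parliament stores this mapping in a BerkeleyDB B-tree; a plain dict
# illustrates the same contract.
class ResourceDictionary:
    def __init__(self):
        self._ids = {}

    def lookup_or_add(self, value):
        # Return the existing ID for this value, or assign the next one.
        return self._ids.setdefault(value, len(self._ids))

d = ResourceDictionary()
assert d.lookup_or_add("<http://example.org/a>") == 0
assert d.lookup_or_add('"some literal"') == 1
# Looking up a value already seen returns the same ID.
assert d.lookup_or_add("<http://example.org/a>") == 0
```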
The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:
| Field | Storage |
|---|---|
| subject | 4-byte offset into Resource Table |
| predicate | 4-byte offset into Resource Table |
| object | 4-byte offset into Resource Table |
| next subject | 4-byte offset into Statement List |
| next predicate | 4-byte offset into Statement List |
| next object | 4-byte offset into Statement List |
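Under these assumptions the record packs into exactly 24 bytes, which is easy to sanity-check with Python's `struct` module (the field order here is my guess, not Parliament's actual layout):

```python
import struct

# Hypothetical packing of a statement list record: six 4-byte
# little-endian unsigned ints (subject, predicate, object,
# next_subject, next_predicate, next_object).
STATEMENT_FORMAT = "<6I"
assert struct.calcsize(STATEMENT_FORMAT) == 24

# Round-trip one record to show the layout.
record = struct.pack(STATEMENT_FORMAT, 7, 3, 9, 0, 0, 0)
s, p, o, next_s, next_p, next_o = struct.unpack(STATEMENT_FORMAT, record)
assert (s, p, o) == (7, 3, 9)
```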
The resource table is an array of fixed-width records, one per resource. Each record contains three offsets into the statement list (for the first statement with that resource in subject, predicate, and object position, respectively), three counts giving the lengths of those three lists, and the resource value. The storage requirements for a resource table record's fields are:
| Field | Storage |
|---|---|
| first subject | 4-byte offset into Statement List |
| first predicate | 4-byte offset into Statement List |
| first object | 4-byte offset into Statement List |
| subject count | 4-byte integer |
| predicate count | 4-byte integer |
| object count | 4-byte integer |
| value | 8-byte pointer to n-byte resource value |
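That record packs into 32 bytes (plus the out-of-line value), and the first/next offsets together form per-resource linked lists threaded through the statement list. A toy sketch of walking one such list, with statements modeled as tuples and my own function name (not Parliament's API):

```python
import struct

# Hypothetical packing: six 4-byte counts/offsets plus an 8-byte
# pointer-sized slot for the value reference -> 32 bytes.
RESOURCE_FORMAT = "<6IQ"
assert struct.calcsize(RESOURCE_FORMAT) == 32

def statements_with_subject(resource_table, statements, rid):
    """Enumerate all statements whose subject is resource `rid` by
    following the next_subject links, bounded by subject_count.
    statements[i] = (subject, predicate, object,
                     next_subject, next_predicate, next_object)."""
    offset = resource_table[rid]["first_subject"]
    out = []
    for _ in range(resource_table[rid]["subject_count"]):
        s, p, o, next_s, _, _ = statements[offset]
        out.append((s, p, o))
        offset = next_s
    return out

# Toy data: two statements share subject 0; the first links to the second.
statements = [(0, 1, 2, 1, 0, 0), (0, 3, 4, 0, 0, 0)]
resource_table = {0: {"first_subject": 0, "subject_count": 2}}
assert statements_with_subject(resource_table, statements, 0) == [(0, 1, 2), (0, 3, 4)]
```

Note that no sentinel value is needed to terminate the list, since the stored counts bound the traversal.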
Putting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:
(24 bytes * |triples|) + ((32+n) bytes * |nodes|)
where n is the average resource value size.
To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.
Using these numbers, the statement list and resource table for 100M triples would take (24 × 100M) + ((32+109) × 34,144,466) bytes = 7,214,369,706 bytes = 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, which at the same node ratios gives a storage estimate of roughly 288GB.
I’d like to try implementing this and using it in our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit: we’d like high storage density but haven’t spent any effort on that front yet (and our current code is awful in this respect), and our parallel query answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It’ll be interesting to see if the numbers actually pan out, and if the query algorithm can handle anything close to that much data.
Gregory Todd Williams