Home > Semantic Web, tetherless world > Parliament, storage density, and napkin math

Parliament, storage density, and napkin math

November 8th, 2009

What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine (any errors are presumably with my understanding of the described system).

The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.

The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:

field type
subject 4 byte Offset into Resource Table
predicate 4 byte Offset into Resource Table
object 4 byte Offset into Resource Table
next subject 4 byte Offset into Statement List
next predicate 4 byte Offset into Statement List
next object 4 byte Offset into Statement List
Statement List Record

The resource table is an array of fixed width records for each resource containing three offsets into the statement list for the first statement with the specific resource in subject, predicate, and object position, respectively, three counts representing the length of these three lists, and the resoure value. The storage requirements for a resource table record’s fields are:

Resource Table Record
field type
first subject 4 byte Offset into Statement List
first predicate 4 byte Offset into Statement List
first object 4 byte Offset into Statement List
subject count 4 byte Integer
predicate count 4 byte Integer
object count 4 byte Integer
value 8 byte pointer to n byte resource value

Puting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:

(24 bytes * |triples|) + ((32+n) bytes * |nodes|)

where n is the average resource value size.

To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.

Using these numbers, the statement list and resource table for 100M triples would take (24*100M) + ((32+109)*34144466) bytes = 7214369706 bytes = 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, with a storage estimate of 288GB.

I’d like to try implementing this and using it on our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit since we’d like high storage density but haven’t spent any effort on that front yet (and our current code is awful in this respect) and our parallel query answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It’ll be interesting to see if the numbers actually pan out, and if the query algorithm can handle anything close to that much data.

Gregory Todd Williams

VN:F [1.9.22_1171]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Author: Categories: Semantic Web, tetherless world Tags:
  1. November 30th, 2009 at 07:14 | #1

    is the parliament storage somewhat related to linked lists

    VA:F [1.9.22_1171]
    Rating: 0.0/5 (0 votes cast)
    VA:F [1.9.22_1171]
    Rating: 0 (from 0 votes)
  2. December 3rd, 2009 at 20:51 | #2

    Hi Greg, the proposed tables are an interesting representation. I wonder what kinds of SPARQL queries this representation would support well, and what queries it could make slow – any thoughts?

    In the post, you choose to ignore the size of the resource dictionary, but I’d say that you can only ignore a small part of it, which would be the overhead of the hashing or the tree; the main size would likely be the size of the ID (4B) plus the size of the string plus maybe some kind of data type indicator. Can you say how many distinct resources there were in the 100M triples, and maybe the average length of the string representation?

    VA:F [1.9.22_1171]
    Rating: 0.0/5 (0 votes cast)
    VA:F [1.9.22_1171]
    Rating: 0 (from 0 votes)
  3. December 3rd, 2009 at 20:59 | #3

    Ah, it’s kinda in there – I got confused by the switch from “resources” to “nodes” and by your mention of n3, because I was expecting some words about the average size of the value of the node/resource; and a removal of duplicate uses of the same value in the N3.

    Maybe you could consider calling your “resource dictionary” a “node dictionary”, if you use “node value” as the main part of the dictionary?

    VA:F [1.9.22_1171]
    Rating: 0.0/5 (0 votes cast)
    VA:F [1.9.22_1171]
    Rating: 0 (from 0 votes)
  4. December 5th, 2009 at 10:15 | #4

    Hi, somehow my second comment doesn’t seem to have made it – I noticed after I submitted the one above that there is indeed an estimation of the size of the resource dictionary, as “the node’s representation in memory” [sic]. Perhaps you should call the dictionary a “node dictionary” if you use the word “node” so much more than the word “resource” in the prose? Anyway, since n-triples is so redundant, do you have an estimate on an average “node value” size (the length of the required string representation)?

    VA:F [1.9.22_1171]
    Rating: 0.0/5 (0 votes cast)
    VA:F [1.9.22_1171]
    Rating: 0 (from 0 votes)
  1. No trackbacks yet.