Archive

Archive for the ‘Semantic Web’ Category

Parliament, storage density, and napkin math

November 8th, 2009

What follows is some napkin math on the storage requirements of the Parliament triplestore (as described in their SSWS paper) and its implications for our clustered RDF query engine (any errors are presumably with my understanding of the described system).

The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.

The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record’s fields are:

field type
subject 4 byte Offset into Resource Table
predicate 4 byte Offset into Resource Table
object 4 byte Offset into Resource Table
next subject 4 byte Offset into Statement List
next predicate 4 byte Offset into Statement List
next object 4 byte Offset into Statement List
Statement List Record

The resource table is an array of fixed width records for each resource containing three offsets into the statement list for the first statement with the specific resource in subject, predicate, and object position, respectively, three counts representing the length of these three lists, and the resoure value. The storage requirements for a resource table record’s fields are:

Resource Table Record
field type
first subject 4 byte Offset into Statement List
first predicate 4 byte Offset into Statement List
first object 4 byte Offset into Statement List
subject count 4 byte Integer
predicate count 4 byte Integer
object count 4 byte Integer
value 8 byte pointer to n byte resource value

Puting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:

(24 bytes * |triples|) + ((32+n) bytes * |nodes|)

where n is the average resource value size.

To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I’ll assume here that the N-Triples serialization gives a rough approximation of the node’s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.

Using these numbers, the statement list and resource table for 100M triples would take (24*100M) + ((32+109)*34144466) bytes = 7214369706 bytes = 6.7GB. The upper limit on a single triplestore with these structures (using 32-bit offsets) is 4B statements, with a storage estimate of 288GB.

I’d like to try implementing this and using it on our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit since we’d like high storage density but haven’t spent any effort on that front yet (and our current code is awful in this respect) and our parallel query answering code doesn’t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we’d be able to get over 100 billion triples loaded onto the Blue Gene/L. It’ll be interesting to see if the numbers actually pan out, and if the query algorithm can handle anything close to that much data.

Gregory Todd Williams

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: greg Categories: Semantic Web, tetherless world Tags:

Web Accessibility in an Educational Context

October 25th, 2009

I’m currently at the annual meeting for the Human Factors and Ergonomics Society and a few hours ago I was attending a set of presentations by the Internet TG on Input and Output on the Web. One of the talks that caught my attention was by Wayne Shebilske of Wright State University on the use of screen readers to help impaired students complete tasks within educational programs such as WebCT, student status management software, and the like. There seem to be many problems with such tools as Dr. Shebilske pointed out based on his study. I think this could be a prime candidate for SW technologies to step in to improve the end user experience. I foresee using such tools along with technologies like RDFa to give the user a better representation of the content being displayed and improving the overall quality of life for these individuals.

Evan

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: Evan Categories: Semantic Web Tags: ,

Probing the SPARQL endpoint of data.gov.uk

October 23rd, 2009

We just ran across the preview SPARQL endpoint for UK’s Data.gov (powered by Talis) following Harry Metcalfe’s blog . In order to understand what data is hosted by the triple store, we use a series of SPARQL queries to probe the content in data.gov. We leverage a web service http://data-gov.tw.rpi.edu/ws/sparqlproxy.php to convert SPARQL/XMl result into HTML and JSON.

First, let’s do some warm up exercises

Q: show me some triples!

SPARQL:

SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5

Result:

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://www.gazettes-online.co.uk/ontology#hasPublicationDate> “2007-05-02″^^http://www.w3.org/2001/XMLSchema#date .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://xmlns.com/foaf/0.1/page> <http://www.london-gazette.co.uk/issues/58316/pages/6359> .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://purl.org/dc/terms/modified> “2007-05-02″^^http://www.w3.org/2001/XMLSchema#date .

A: ok, it has some gazette dataset about lond (see http://www.london-gazette.co.uk/), and it uses FOAF and DC vocabulary.

Q: show me some classes and their instances?

SPARQL:

SELECT DISTINCT ?c WHERE { [] a ?c. } LIMIT 5

Result:

<http://www.gazettes-online.co.uk/ontology/transport#RoadTrafficActsNotice>

<http://www.gazettes-online.co.uk/ontology#Notice>

<http://xmlns.com/foaf/0.1/Document>

<http://www.gazettes-online.co.uk/ontology#Issue>

<http://www.gazettes-online.co.uk/ontology#Edition>

A: same observation as above.

Q: Does it host any named graphs?

SPARQL:

SELECT ?g WHERE {GRAPH ?g { ?s ?p ?o } } LIMIT 10

Result: “0″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph found, and there is only one big default graph

Now let’s run several expensive aggregation queries (note aggregation queries are not part of the current SPARQL specification)

Q: How many triples?

SPARQL:

SELECT count(*) WHERE {?s ?p ?o}

Result: “5529380″^^<http://www.w3.org/2001/XMLSchema#integer>

A: alright! Aggregation query is support, and there are 5 million triples. Note “count” is a non-standard aggregation function, it may be support differently be different SPARQL endpoints.

Q: How many graphs (and the number of triples in each graph)?

SPARQL:

SELECT ?g count(*) WHERE {GRAPH ?g { ?s ?p ?o } }

Result: “0″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph.

Q: How many populated classes?

SPARQL:

SELECT count(distinct ?c) WHERE {[] a ?c}

Result: “99″^^<http://www.w3.org/2001/XMLSchema#integer>

A: there are 99 different classes having direct instances in this triple store.

Q: How many populated properties?

SPARQL:

SELECT count(distinct ?p) WHERE {[] ?p ?o}

Result: “86″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: There are 86 unique properties being used as predicate in this dataset. Each property has used by 64K (5529380/86) triples in average. There must be some very popular properties, and we will do that survey later.

Q: How many typed individuals?

SPARQL:

SELECT count(distinct ?s) WHERE {?s a ?c}

Result: 504 Gateway Time-out

A: Opps, really expensive. Let’s try something else

Q: How many defined classes?

SPARQL:

SELECT count(distinct ?s) WHERE {{?s a <http://www.w3.org/2002/07/owl#Class> } UNION {?s a <http://www.w3.org/2000/01/rdf-schema#Class>}}

Result: 0

A: no class defined, and the triple store is full of individuals

Q: How many individuals (again)?

SPARQL:

SELECT count(*) WHERE {[] a ?c}

Result: 995694

A: There are nearly 1 millions of typed individuals, so we can easily see every invidual has 5 (5M/1M) triples in average.

Now, let’s do some knowledge discovery

Q: Show me the 3 most/least used classes in this dataset?

SPARQL:

1) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by desc(?count) limit 3

2) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by ?count limit 3

Result:

1) most used classes

<http://www.gazettes-online.co.uk/ontology#Notice> “156452″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/2006/vcard/ns#Address> “106934″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/person#Person> “87798″ ^^<http://www.w3.org/2001/XMLSchema#integer>

2) least used classes

<http://www.gazettes-online.co.uk/ontology/transport#CycleTracksNotice> “1″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/transport#PortsNotice> “1″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#AdministrationOrder> “2″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: Again, the SPAQL query is “safe” because Talis support these SPARQL extensions (c.f. http://n2.talis.com/wiki/SPARQL_Extensions). The class that has the most number of instances in this dataset is http://www.gazettes-online.co.uk/ontology#Notice (which has 156,452 instances.)

Q: what about property usage (top 5)?

SPARQL:

1) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by DESC(?count) limit 5

2) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by ?count limit 5

Result:

1) 5 most used properties:

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> “995694″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://purl.org/dc/dcam/memberOf> “312476″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/1999/02/22-rdf-syntax-ns#value> “310170″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasPublicationDate> “181940″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasNoticeCode> “181335″ ^^<http://www.w3.org/2001/XMLSchema#integer>

2) 5 least used properties:

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade> “1″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasAuthorisingOrganisation> “2″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/personal-legal#isForNextOfKinOf> “2″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#orderAdministrator> “3″ ^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#authorisingOrganisation> “49″ ^^<http://www.w3.org/2001/XMLSchema#integer>

A: Some properties are really used (e.g., http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade has been used only once) while some are heavily used (like rdf:type, which is the most frequently used predicate).

Conclusion

We need to stop probing now. A number of complex queries ended up with a timeout error because (i) “LIMIT” only control the final results, so that we cannot just get statistical results on 1000 triples; (ii) “GROUP BY” may produce too many intermediate results, (iii) the statistics queries does not leverage the index structure of triple store, or index structure are not designed for handling such queries, and (iv) many other issues.

Updates

presented by Li Ding and Zhengning Shangguan

VN:F [1.2.0_562]
Rating: 10.0/10 (2 votes cast)
Author: li Categories: Semantic Web, linked data Tags:

My Personal (unofficial) Semantic Web FAQ — a pointer

September 1st, 2009

The joy of multiple blog sites is having to post pointers to one blog entry from another.

My blog at nature.com now has an entry entitled “The Semantic Web: My personal (unofficial) FAQ” which lives at http://network.nature.com/people/jhendler/blog/2009/08/03/the-semantic-web-my-personal-unofficial-faq. Comments, and especially your suggestions for Qs and As are more than welcome there or here (or anywhere else for that matter)

Cheers,

Jim H.

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)

Current Issues in data.gov

July 31st, 2009

While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:

  • Duplicated Datasets- Some datasets are part of another dataset, e.g. Dataset 140 (2005 Toxics Release Inventory data for the state of California (Environmental Protection Agency)) is a subset of Dataset 191 (2005 Toxics Release Inventory National data file of all US States and Territories (Environmental Protection Agency)).
  • Formatting Issues - The format of some datasets is not friendly to machine processing. Not all datasets offer CSV format data, and parsing table data from them requires non-trivial efforts. Example: Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)). Some websites, meanwhile, have no data at all: Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)), for example, tells you how to order data from the government.
  • screen shot of the text file from dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases) by US Bureau of Reclamation

  • Access Point Issues - The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where only partial data can be returned by a web-based query. Example: Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)) and Dataset 96 (National Water Information System (NWIS) (US Geological Survey)).

    screen shot of the query interface for accessing dataset 330 (Local Area Unemployment Statistics) by US Bureau of Labor Statistics

For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .

Sarah Magidson, Li Ding, Dominic DiFranzo, and Jim Hendler

VN:F [1.2.0_562]
Rating: 0.0/10 (0 votes cast)
Author: li Categories: linked data Tags: