Archive

Author Archive

Two Misconceptions about the Semantic Web

November 18th, 2011

I recently presented at the Semantic Graph Database Processing BOF at SC2011, and I had the opportunity to discuss with others the needs for high-performance computing in web-scale computation and the benefits of Linked Data and ontologies on the World Wide Web. There was one participant there who was adamantly opposed to the semantic web.  (I think his exact quotes outside of the presentation were something like “I do not believe in the semantic web” and “only the semantic web cares about the semantic web”).  As I tried to make my case with him, it became increasingly clear to me that this person had a few misconceptions about the semantic web. I want to address those misconceptions here.

Before I continue, though, allow me to disclaim a bit. I am not a representative of the entire semantic web community, although I do consider myself a member of it. Additionally, I am not officially associated with the W3C. I write this blog entry simply in the capacity of a semantic web enthusiast (henceforth, semwebber), and not even as a member of the Tetherless World Constellation. I invite, nay, urge other semwebbers to contribute comments to this blog post in any capacity (agree, disagree, amend, etc.).

1. “One ontology to rule them all”

To my knowledge, nobody has ever claimed that there should be “one ontology to rule them all.” Instead, what is regularly promoted is ontology reuse and/or integration. For example, the FOAF ontology is widely used in the semantic web to describe persons; why create your own ontology when you can reuse a well-established one? Integration of ontologies allows for conciliation of perspectives, causing data that use these ontologies to become meaningfully related. Admittedly, there are some rather large, comprehensive ontologies out there, and there are some very popular and pervasive ones, too. However, there is no standard or recommendation that requires publishers of RDF data to comply with any particular ontology. You could even ignore the RDF vocabulary if you so please (yes, even rdf:type).

The primary purpose of an ontology (in my view) is to attach explicit semantics to your data. Just as the participant had stated (although he meant it in contrast to the semantic web), there are many ontologies. They compete in the ecosystem of the World Wide Web and evolve accordingly (or become extinct).

2. “Triples all the way down”

(First, let me say, this is not an affront to Planet RDF.)

This is a bit of a pet peeve of mine, and perhaps what I say here will offend some semwebbers (I hope not). The semantic web (in my view) is not about “triples all the way down.” What do I mean by that? Let me explain.

RDF brings primarily two things to the table when it comes to publishing and integrating data on the web: names in the form of URIs, and a simple data model that is flexible enough for (arguably) nearly any kind of data. (I would like to add a third, meaningful links, but I will avoid that for now.) So when data is published to the web, publishing it as RDF allows you: (1) to identify the things in your data across the World Wide Web, and (2) to structurally (and possibly semantically) integrate your data with other data on the World Wide Web. (I emphasize “World Wide” here to bring to attention the vast scope of publication, identification, and integration that is being achieved.) Fantastic.

Does this mean that everything can be efficiently (or rather, ideally) represented in RDF? No. Then why would you ever want to handle triples? You probably don’t. Let me explain.

RDF is meant to solve the problem of meaningfully publishing data (not just documents) on the World Wide Web. Beyond that, do what you want. More specifically, when you crawl and/or aggregate data from the World Wide Web, you don’t have to keep the RDF data as triples in your system. It is no longer on the global stage of the World Wide Web; rather, it is now in your system where you are king. So optimize away! Store it or process it however you like! Relational databases? Sure! Rewrite URIs as shorter terms? Whatever floats your boat! Ignore the explicit semantics and treat it like an unlabeled graph? I wouldn’t recommend it, but you’re the king! Do whatever it takes to meet your use case, and if your use case has something to do with RDF data, then fine, leave it as triples if you want. My point is, it’s not necessarily “RDF all the way down,” but it is “RDF at the top” where “top” is the place of publication, the World Wide Web. The universal naming mechanism of URIs and the generic data model enables data publishers to get data out there in a way that can be explicitly understood by machines (for example, when I say “Beast is furry,” am I talking about Mark Zuckerberg’s dog or the fictional X-Man Dr. Henry Philip “Hank” McCoy?), but as the creator of that machine, it’s up to you how to utilize those explicit semantics.

Beast, Mark Zuckerberg's DogBeast, the fictional X-Man (They both look furry to me.)

To be clear, though, I am promoting RDF as a way to publish structured, semantic data as opposed to not publishing structured, semantic data.  In the future, it is conceivable that there may exist other good ways to publish structured, semantic data, but RDF exists today and is widely used.

So I will leave it at that. Again, I invite comments, rebuttals, accolades, disparagements, etc.

Jesse Weaver

VN:F [1.9.13_1145]
Rating: 9.4/10 (5 votes cast)
VN:F [1.9.13_1145]
Rating: +2 (from 2 votes)
Author: Categories: tetherless world Tags:

Introducing Sterno, another RDF syntax (Really?)

July 3rd, 2011

In this blog post, I want to introduce a RDF syntax called Sterno. (Oh no… not another one… right? Please, read on.) Sterno is an extension of the N-triples syntax and a subset of the Turtle syntax aimed at improving compression over N-triples while also preserving the simplicity of N-triples. But what could possibly warrant defining yet another RDF syntax?

After winning the 2009 Billion Triple Challenge, Greg and I realized that a fair amount of time in our system was spent transferring data from disk. At that time, our system read N-triples documents because their simple syntax was amenable to parallel I/O, but N-triples documents are often very verbose. Turtle, however, introduces many features which improve compression, and N-triples is a syntactic subset of Turtle. So the idea arose, how much of Turtle (i.e., which features of Turtle) should we use to extend the N-triples syntax in order to improve parallel I/O? The details of our investigation into the matter can be found in our paper entitled “Reducing I/O Load in Parallel RDF Systems via Data Compression,” published at the 1st Workshop on High-Performance Computing for the Semantic Web (HPCSW2011). (We also compare the use of Sterno “compression” of RDF data with LZO compression for parallel I/O. The HPCSW proceedings can be found here for those who are interested.)

Admittedly, a RDF syntax designed for parallel I/O would seem to have a limited audience, but it turns out that Sterno may be of more general use. Sterno’s simplicity may be desirable for a multitude of purposes simply because it is easier to support than Turtle (that is, easier to produce and parse), particularly for use on the command-line. Note that Sterno is not meant to replace or compete with any other RDF syntax; instead, it simply gives a name and definition to a useful middle ground between N-triples and Turtle.

Sterno is normatively described as an extension of the N-triples syntax. In other words, the Sterno syntax subsumes the N-triples syntax, and the Sterno syntax is defined as the N-triples syntax with the addition of the following Turtle features:

  • UTF-8 Encoding: A Sterno document is a Unicode character string encoded in UTF-8.
  • Prefix declarations and QNames: A Sterno document allows for prefix declarations and QNames, but all prefix declarations must occur at the beginning of the document before any actual triples.
  • Implicit datatypes for xsd:integer, xsd:double, xsd:decimal, and xsd:boolean. For example, "1"^^xsd:integer may simply appear in the document as 1.
  • The a keyword may be used to replace rdf:type whenever it occurs in the predicate position of a triple.
  • The empty collection () may be used to replace rdf:nil whenever it occurs in the subject or object position of a triple.
  • An anonymous blank node [] may be used, although its usefulness is severely limited in Sterno.
  • Blank node labels may be as complex as in Turtle. That is, we do not maintain the restriction in N-triples that blank node labels be only word characters. (E.g., _:blank-node is valid in Turtle and Sterno, but not in N-triples.)

An actual grammar for the Sterno syntax can be found in the extended version of our HPCSW paper. All this may be a bit too much to think about in one's head, so following is a contrived example in N-triples, Sterno, and Turtle. (For a more realistic example, see my FOAF profile in N-triples, Sterno, and Turtle.)

N-triples:

<file:///foaf.rdf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<file:///foaf.rdf#me> <http://xmlns.com/foaf/0.1/nick> "Andr\u00E9" .
<file:///foaf.rdf#me> <http://xmlns.com/foaf/0.1/age> "40"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:list <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#List> .
_:list <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> "line1\n\tline2 \"quoted string\" " .
_:list <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:contrived <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
# What a contrived triple.

Sterno:

@prefix mine: <file:///foaf.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
mine:me a foaf:Person .
mine:me foaf:nick "André" .
mine:me foaf:age 40 .
_:list a rdf:List .
_:list rdf:first "line1\n\tline2 \"quoted string\" " .
_:list rdf:rest () .
[] a <http://www.w3.org/2002/07/owl#Thing> .
# What a contrived triple.

Turtle (with base URI <file:///foaf.rdf>):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#me> a foaf:Person ; foaf:nick "André" ; foaf:age 40 .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
( """line1
line2 "quoted string" """ ) a rdf:List .
[]a<http://www.w3.org/2002/07/owl#Thing> . # What a contrived triple.

Put roughly, the Sterno syntax maintains the simplicity of N-triples that each line contain at most one triple, and there must be whitespace between the RDF terms of a triple. Therefore, although it is not as concise as Turtle (e.g., property lists and object lists are not adopted), it is easier to parse and generate.

Feedback welcome, even encouraged.

(Why the name “Sterno”? The name “Sterno” originated as an abbreviation for sternotherus, a genus of aquatic turtle, the most common species of which typically grows to only 7.5-14 centimeters. The name is chosen to reflect that the Sterno syntax is a small, syntactic subset of the Turtle syntax. Additionally, it is an acronym meaning “Simple, TErse Rdf… NOthing else.”)

VN:F [1.9.13_1145]
Rating: 10.0/10 (2 votes cast)
VN:F [1.9.13_1145]
Rating: +1 (from 1 vote)
Author: Categories: tetherless world Tags:

HPCSW Workshop Website

February 27th, 2011

As one of the organizers for the 1st Workshop on High-Performance Computing for the Semantic Web (HPCSW), I had been hosting the HPCSW website from my home directory at RPI. However, the servers on which the home directory was hosted have been suffering from instability in the recent month. Even now, the server has gone down, with the workshop paper submission deadline less than a week away. To solve the problem, I have found a more stable place in our lab to host the website, and the purpose of this brief blog post is to try and improve awareness of this change. I have requested that the ESWC workshop chairs change the link on the ESWC website, and I hope that will happen soon. I apologize for any inconvenience this may have caused and hope that it will not diminish interest in the workshop. (In case of any future problems, I have also created a sort of “backup” version of the website at Google Sites. This version is just to make sure pertinent information about the workshop is always available… or at least as available as Google sites.)

Please feel free to email me if you have any questions. For your convenience, I have included basic information about the workshop below.

Call for Papers

Over the last several years, there has been an increase of research in parallel semantic web data processing (SWDP) in the semantic web community as well as burgeoning interest in the high-performance computing (HPC) community. As specific examples, use of high-performance computing won the 2009 Billion Triple Challenge, and a parallel inference engine won the 2010 IEEE SCALE Challenge. The goal of the High-Performance Computing for the Semantic Web (HPCSW) workshop is to facilitate synergy between the HPC and semantic web communities as well as between academia and industry to further scalability of SWDP.

HPCSW is a full-day workshop that will begin with presentations of technical papers submitted for publication and will end with a discussion among (but not limited to) invited participants. Proceedings will be published online, and a selection of revised papers will be published in a joint Lecture Notes in Computer Science post-proceedings volume. Papers should present work in which HPC in some form (e.g., parallelism, supercomputers, FPGAs, etc.) is employed to improve SWDP (any kind of processing of semantic web data). Broad topics for papers include (but are not limited to):

* Parallelizing SWDP.
* Exploiting HPC architectures for SWDP.
* Employing parallel graph algorithms for SWDP.
* Benchmarks for SWDP from a HPC perspective.

The following questions will guide (but not limit) the discussion:

* How important is it to have high-performant applications for the semantic web?
* What are the boundaries of HPC for SWDP?
* What SWDP lends itself to HPC and/or parallel processing?
* What pre-existing work in HPC can be leveraged for SWDP?
* What are the tradeoffs between commodity HPC and specialized HPC for SWDP? (e.g., MapReduce vs. “hand-coded”, commodity clusters vs. supercomputers)

Important Dates

* Paper Submission Deadline: 4 March 2011 (23:59 Hawaii Time)
* Acceptance Notification: 1 April 2011
* Camera Ready Papers Due: 15 April 2011
* Workshop Day (full day): 29 May 2011

Submission

Papers must be submitted by 23:59 Hawaii Time on Friday, March 4, 2011 at the following address: http://www.easychair.org/conferences/?conf=hpcsw2011.

Position/short papers should not exceed five (5) pages in length, and full papers should not exceed twelve (12) pages in length. All papers must be formatted according to the information for LNCS authors. Papers must be submitted in PDF (Adobe’s Portable Document Format) format. More information about the Springer’s Lecture Notes in Computer Science (LNCS) is available on the Springer LNCS Web site.

VN:F [1.9.13_1145]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.13_1145]
Rating: +2 (from 2 votes)
Author: Categories: tetherless world Tags:

Suggestions to the Supercomputing Community

December 4th, 2010

As mentioned in my last blog post, I recently participated in a birds-of-a-feather (BOF) on semantic graph/database processing at Supercomputing 2010 (SC10).  My general research interest is in high-performance computing (HPC) for the semantic web, so this BOF was a great fit.  At the BOF, I very briefly made three suggestions to HPC researchers; in this blog post, I expand on and explain these suggestions.  I welcome feedback, particularly from those in the semantic web community who have something to share with the supercomputing community.

1. There is a need for good benchmarks from a HPC perspective.

By “good,” I primarily mean that the datasets and queries need to be realistic.  In other words, the data should reflect data that occurs in the real world, and queries should reflect queries that would be posed by actual users or systems.  By “HPC perspective,” I mean that it needs to test strong scaling (change in time for fixed total dataset size and varying number of processors) and weak scaling (change in time for fixed dataset size per processor and varying number of processors).

The Lehigh University Benchmark (LUBM) [1] has arguably been the most widely used benchmark likely because it is one of the earliest benchmarks that provide a data generator and a standard set of queries.  It is targeted towards inferencing. However, LUBM datasets are not only synthetic, but they are quite unrealistic.  In addition to uniform distribution of data, it suffers from other inadequacies like few links between universities and the use of a single, nonsensical phone number for every person (“xxx-xxx-xxxx”).  Therefore, LUBM datasets do not provide a realistic data distribution and thus cannot test the ability of systems to handle realistic selectivity and skew.

There is also the Berlin SPARQL Benchmark (BSBM) [2], but it is “built around an e-commerce use case” and “illustrates the search and navigation pattern of a consumer looking for a product” [3].  From a HPC perspective, we will likely be more concerned with overall run-time of queries or reasoning processes (or whatever other interesting processes) rather than handling interaction with users.

Finally, there is SP2Bench [4].  This is perhaps the most useful benchmark for SPARQL benchmarking.  It provides a data generator that mimics statistical properties of DBLP data, and it provides a set of sensible queries.  Therefore, the dataset is more realistic than LUBM, and it is focused on SPARQL query (whereas LUBM focuses on reasoning).

However, there is still a need for a good reasoning benchmark from a HPC perspective.  It’s difficult to be more specific than that because providing such a benchmark is still very much an open research topic.  Clearly there needs to be an ontology that uses features from various reasoning standards (e.g., RDFS, OWL) and a corresponding data generator.  There should also be some way to verify validity of inferences based on certain entailments.  Again, this is very much an open research topic which is why I made the suggestion but have few answers myself.

2. Consider existing reasoning standards as starting points.

This may be the more controversial of my suggestions, but there is good reason for it.  Recent history indicates that the reasoning standards continue to iteratively evolve based on the needs of the community.

Consider RDFS (by which I mean RDFS entailment as defined in RDF Semantics).  First of all, it is technically undecidable [5], but in a way that is trivial and easily overcome.  Secondly, few systems (in my experience) completely support inferences based on literal generalization, XML literals, and container-membership properties.  Other rules, like “everything is a resource,” are generally trivial and uninteresting.  More commonly, implementations align with a fragment of RDFS that I call RDFS Muñoz [6] (originally termed the ρdf fragment), which essentially boils down to domains, ranges, subclasses, and subproperties. Perhaps Muñoz said it best:

“Efficient processing of any kind of data relies on a compromise between the size of the data and the expressiveness of the language describing it. As we already pointed out, in the RDF case the size of the data to be processed will be enormous, as current developments show …. Hence, a program to make RDF processing scalable has to consider necessarily the compromise between complexity and expressiveness. Such a program amounts essentially to look for fragments of RDF with good behavior with respect to complexity of processing.” [6]

Consider also OWL 1.  How many scalable systems completely support one of the OWL 1 fragments (Lite, DL, Full)?  I cannot say for sure, but my impression from experience and feedback from others is that the cost for higher expressivity can often be too expensive in terms of performance, especially as you scale dataset size.  Perhaps it is for this reason that OWL Horst [7] (originally termed the pD* fragment) has gained popularity as (arguably) the most widely supported OWL fragment.

Now there is OWL 2OWL 2 RL (a fragment of OWL 2) is “inspired by description logic programs and pD* [OWL Horst]” [8]. The SAOR paper from ISWC 2010 [9] has already shown a subset of OWL 2 RL rules for which closure can be efficiently produced in parallel.

So my point is this. Reasoning standards capture well-defined and understood fragments, but research and practice continue to explore subfragments that are suitable for certain problems, and as the subfragments become stable and gain popularity, they inspire future standards. It is an iterative process, so it is not necessary to become obsessed with fully complying with existing standards (unless that is actually necessary to meet your use case). It is probably more interesting to search for fragments of the standards that fit certain HPC paradigms.

3. Review the literature to reconsider approaches that were once considered less viable.

This suggestion seems obvious.  As an example, I recently did a literature review of parallel join processing, and one thing I noticed is that a majority of the literature is focused on shared-nothing architectures.  In 1992, DeWitt and Gray stated:

“A consensus on parallel and distributed database system architecture has emerged.  This architecture is based on a shared-nothing hardware design ….” [10]

However, in 1996, Norman, Zurek, and Thanisch directly opposed (or reversed) the claim of DeWitt and Gray saying:

“We argue that shared-nothingness is no longer the consensus hardware architecture and that hardware resource sharing is a poor basis for categorising parallel DBMS software architectures if one wishes to compare the performance characteristics of parallel DBMS products.” [11]

The popularity of the shared-nothing paradigm was probably further fueled by the advent of inexpensive supercomputing by way of Beowulf clusters and Networks of Workstations (around the mid 90′s).  However, many modern supercomputers provide shared-disk and shared-memory paradigms.  The Blue Gene/L in our Computational Center for Nanotechnology Innovations (CCNI) is networked with a General Parallel File System (GPFS).  Making use of GPFS, the Blue Gene/L could be considered shared-disk in a programmatic sense.  The Cray XMT uses large shared-memory. Rahm points out that a major advantage of shared-disk is its potential for truly dynamic load-balancing [11], so lets look back at some of the shared-disk and shared-memory research that has been done [12-15].

All of that just to say, a review of literature is in order. Potential sources of inspiration include parallel databases, parallel graph algorithms, deductive databases, and graph databases.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute

[1] Guo, Pan, Heflin.  LUBM: A benchmark for OWL knowledge base systems.  JWS 2005.
[2] Bizer, Schultz.  The Berlin SPARQL Benchmark.  IJSWIS 2009.
[3] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
[4] Schmidt, Hornung, Lausen, Pinkel.  SP2Bench: A SPARQL Performance Benchmark.  ICDE 2009.
[5] Weaver.  Redefining the RDFS Closure to be Decidable.  RDF Next Steps 2010.
[6] Muñoz, Pérez, Gutierrez.  Simple and Efficient Minimal RDFS.  JWS 2009.
[7] ter Horst.  Completeness, decidability and complexity of entailment for RDF Schema and a semantic extensions involving the OWL vocabulary.  JWS 2005.
[8] http://www.w3.org/TR/owl2-profiles/#OWL_2_RL
[9] Hogan, Pan, Polleres, Decker. SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. ISWC 2010.
[10] DeWitt, Gray.  Parallel Database Systems: The Future of High Performance Database Systems.  Communications of the ACM 1992.
[11] Norman, Zurek, Thanisch.  Much Ado About Shared-Nothing.  SIGMOD Record 1996.
[12] Rahm.  Parallel Query Processing in Shared Disk Database Systems.  SIGMOD Record 1993.
[13] Lu, Tan.  Dynamic and Load-balanced Task-Oriented Database Query Processing in Parallel Systems.  EDBT 1992.
[14] Märtens.  Skew-Insensitive Join Processing in Shared-Disk Database Systems.  IADT 1998.
[15] Moon, On, Cho.  Performance of Dynamic Load Balanced Join Algorithms in Shared Disk Parallel Database Systems.  Workshop on Future Trends of Distributed Computing Systems 1999.

VN:F [1.9.13_1145]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.13_1145]
Rating: 0 (from 0 votes)
Author: Categories: Semantic Web, tetherless world Tags:

Supercomputing 2010

November 30th, 2010

In this blog post, I wish to briefly summarize the aspects of the Supercomputing 2010 conference that may be of use to the semantic web community.

1. Technical Papers on Graph Algorithms

The first paper was Fast PGAS Implementation of Distributed Graph Algorithms by Cong, Almasi, and Saraswat. PGAS stands for Partitioned Global Address Space. They claim to “present the first fast PGAS implementation of graph algorithms for the connected components and minimum spanning tree problems” [1]. They performed an evaluation on two random graphs each with 100 million vertices, one with 400 million edges and the other with one billion edges.

The second paper was Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory by Pearce, Gokhale, and Amato. This was the only paper of the three that included as part of its motivation something about the web. They mention “WWW graphs” as an example of real-world, large graphs, and by “WWW graph” they mean webpages as nodes and links between webpages as edges. Also, social networks were mentioned as another example of real-world, large graphs. Unlike the first paper, this paper focuses on a shared-memory paradigm. They “present a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory” [2]. The “semi-external memory” refers to the use of solid-state memory devices. They evaluated the approach on synthetic, “scale-free graphs” generated using the RMAT [3] graph generator. The graphs had roughly between 33 million and one billion unique edges. They also evaluated Connected Components on WWW graphs ranging from roughly one billion to eight billion edges.

The third paper was Scalable Graph Exploration on Multicore Processors by Agarwal et al. They “investigate the challenges involved in exploring very large graph by designing a breadth-first search (BFS) algorithm for advanced multi-core processors …” [4]. They evaluate on RMAT [3] graphs of up to a billion edges. They make some very interesting claims that in some cases, their BFS algorithm running on a 4-socket Nehalem EX performs on par or even exceeds previously reported performance on the Cray XMT, Cray MTA-2, and Blue Gene/L.

The problem is clear: graph algorithms require random data access (in the sense of spacial locality of the bytes) which inhibits performance. The common solution is to have some abstraction of shared memory (whether it is actually shared memory or not). The focus is on conventional graph algorithms like BFS and connected components, and the state-of-the-art seems to handle on the order of billions of edges.

2. The Graph 500 BOF

The Graph 500 BOF was also of interest, although it was really the Graph 9. The effort was motivated by dissatisfaction that the Top 500 list uses LINPACK as its benchmark, and FLOPS are not the best indication of the usefulness of a system for attacking graph-related problems. For this first version of Graph 500, BFS was the benchmark, and systems were ranked first by size of the input graph and then by edge traversals per second. The list was dominated by Cray machines with a Cray XT4 coming in second place with 232 vertices and 5.22 billion edge traversals per second. However, the one BlueGene machine, a BlueGene/P, took first place achieving both the largest graph of 236 vertices and the highest edge traversal rate of 6.6 billion edge traversals per second.

3. The Semantic Graph/Database Processing BOF

Finally, there was the semantic graph/database processing BOF hosted by David Haglin of PNNL. Cliff Joslyn of PNNL presented their submission to the 2010 Billion Triple Challenge, and David Mizell of Cray, Inc. presented his work in SPARQL querying on the Cray XMT. There was also a presentation by Franz Inc. about AllegroGraph. I presented my relevant existing research as well as gave mention of related works. I also gave a few suggestions to interested HPC researchers, of which I will discuss more in another blog post.

One thing that was clear is that there was diverse interest in semantic web at the BOF. Some were interested in merely utilizing RDF for representing graphs while others were interesting in supporting analysis and/or query of such RDF graphs. I ran into only one person who seemed interested in reasoning. Another comment was about how SPARQL does not directly support graph-like operations or algorithms but serves primarily as a data access language. As an example, how would one find clustered components using SPARQL? It could be done, but perhaps would be cumbersome. The supercomputing community seems to handle RDF data primarily as a graph data structure rather than as sets of triples.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute

[1] Cong, Almasi, Saraswat. Fast PGAS Implementations of Distribute Graph Algorithms. SC 2010.
[2] Pearce, Gokhale, Amato. Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory. SC 2010.
[3] Chakrabarti, Zhan, Faloutsos. R-MAT: A recursive model for graph mining. ICDM 2004.
[4] Agarwal, Petrini, Pasetto, Bader. Scalable Graph Exploration on Multicore Processors. SC 2010.

VN:F [1.9.13_1145]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.13_1145]
Rating: 0 (from 0 votes)
Author: Categories: tetherless world Tags: