Jesse Weaver RDF Management Approaches
Presentation given at CSCI 6966 Advanced Semantic Web (Fall 2008) - Lesson 6
Presentation Slides: File:ASW-SP2BenchExp-JesseWeaver-20081002.ppt
- Speaker: Jesse Weaver
- Title: An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario
- Authors: Michael Schmidt, Thomas Hornung, Norbert Küchlin, Georg Lausen, Christoph Pinkel
- Conference: ISWC 2008
- URL: http://www.informatik.uni-freiburg.de/~mschmidt/docs/sp2b_exp.pdf
- Date of Presentation: 2008/10/02
Questions
| ID | Question | Name | Answer |
|---|---|---|---|
| Jesse Weaver RDF Management Approaches Gregory Todd Williams 1 | The paper asserts in section 3 that "Sesame constitutes a query engine that, like the other three scenarios, relies on a physical DB backend." However, the next paragraph says that MonetDB, the column store that Sesame is compared against "is memory-based". Have I understood this correctly? Wouldn't such a setup immediately (and massively) disadvantage Sesame compared to the three relational implementations (TR, VP, and RS)? | Gregory Todd Williams | |
| Jesse Weaver RDF Management Approaches Gregory Todd Williams 2 | In discussing query Q2 (and again for subsequent queries), it is noted that MonetDB chooses inefficient query execution plans "that mostly use fetch joins, involving merge joins only in a few cases." Based on this, the conclusion goes on to say "relational optimizers may have problems to cope with the specific challenges that arise in the context of RDF." Are we to assume that MonetDB's optimizer performs as well as (or better than) other relational optimizers (including those from industrial implementations such as Oracle and Sybase)? Since the paper purportedly attempts to evaluate storage schemes for RDF, wouldn't a poor optimizer be an orthogonal issue, and shouldn't a proper evaluation of the storage schemes be based on the best QEP available for each query? | Gregory Todd Williams | |
| Jesse Weaver RDF Management Approaches Joshua Shinavier 1 | Query 9 of SP2Bench is said to be impossible to evaluate in the purely relational case, on account of an unbound predicate. This being the case, how was query 10 evaluated, which also contains an unbound predicate that (unlike in query 3) cannot be resolved by first evaluating a FILTER expression? I must admit that I don't see how the member2, member3, and member4 variables in query 7 are resolved, either. | Joshua Shinavier | |
| Jesse Weaver RDF Management Approaches Joshua Taylor 1 | In the penultimate paragraph of 2 The SP2Bench Scenario we read, "The table also lists the number #prop. of distinct properties. This value x+y splits into x "standard" attribute properties and y bag membership properties rdf:_1, …, rdf:_y, where y depends on the maximum-sized reference list in the data." The authors also make the point that difficulties are introduced as this number increases. In 3.3 The Purely Relational Scheme it becomes clear that they are capable of representing the references without using containers. It seems that using a container for a reference list rather than relating the paper to the referenced work with dcterms:references (as is done for authors with dc:creator) introduces unnecessary difficulties. Have you any thoughts about why they did this, and whether it alters performance and evaluation? | Joshua A. Taylor | |
| Jesse Weaver RDF Management Approaches Joshua Taylor 2 | In 4 Experimental Results the authors write, "As our primary interest is the basic performance of the approaches (rather than caching or learning strategies), we performed cold runs, i.e. destroyed the database in-between each two consecutive runs, and always restarted it before evaluating a query." Is this realistic? Clearly it means that caching and learning strategies will not influence the result, but if those are typical of database systems, would it not make for a better evaluation if they had cold runs in addition to "warm" runs where caching and learning strategies could affect performance? If the differences were not significant, then the authors would have shown that, and if the differences were significant, and warm runs are the more typical use, then the comparison would be more useful. | Joshua A. Taylor | |
| Jesse Weaver RDF Management Approaches Joshua Taylor 3 | In the Conclusion, the authors point out that storing the data in a relational database whose schema is based on the ontology at hand performed better than any of the other RDF stores. This is not particularly surprising, as it is the most specialized, but least flexible, representation. I wonder if a hybrid approach in which database tables are constructed based on formal ontology descriptions (e.g., in RDFS or OWL), and triples using this vocabulary are stored in said specialized tables, but other triples are stored using a more general approach (but within the same database) would be practical/useful, and how such a system would fare in this evaluation. What do you think? | Joshua A. Taylor | |
| Jesse Weaver RDF Management Approaches Shangguan | Actually, the same question as the second one proposed by Greg occurred to me when I was reading the paper. It seems that some of the results are somewhat biased in some comparison cases. E.g., in some test cases, such as Q2, it would be more convincing if the authors presented the results obtained when using better QEPs in MonetDB. It would be even more convincing, as Greg said, to carry out a comparison based on different QEPs to see in which circumstances VP & SP can do a better job, and in which cases they cannot. | Zhenning Shangguan | |
| RDF Management Approaches Ankesh | This question digresses from the main focus of this paper. It is more from a database point of view. Could we map rdf:Bag to a collection in the purely relational scheme (e.g. a varray or nested table in Oracle; I am not aware how this is done in column stores)? For example, in the reference table, there could be one row for each paper, relating it to a collection of papers. Could this help improve the relational scheme in terms of query answering efficiency? Personal thoughts: This would make joins from the collection difficult, but it would reduce the number of distinct rows. For example, we could separately keep publication_author(publication, list of authors). The authors mention that the number of authors contributing to a paper is slowly increasing. | Ankesh Khandelwal | |
| Schmidt2008experimental question 1 by lebo | The only place the ratios of usr, sys, and total response times are mentioned is in the discussion for Q1 ("Return the year of publication of 'Journal 1 (1940)'"), where the authors state, "The gap between total and usr+sys for 25M indicates that much time is spent in waiting for data being read from or written to disk". The choice to use different vertical scales in Figures 1-3 leads to an investigation of these ratios while obscuring a natural consideration of the more important issue: the relative response times between triple store approaches. Regardless, the ratio usr/sys falls within one of three categories: minority/majority, all/none, and none/none, and the ratio's category transitions from none/none, to all/none, to minority/majority as the data size increases within a condition. | Tim Lebo | |
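
Several of the questions above turn on how a variable in predicate position is handled by the different storage schemes. The following is a minimal sketch, assuming a toy dataset and hypothetical table names (`triples`, `vp_0`, `vp_1`); it is not the schema or the queries used in the paper. It illustrates why an unbound predicate is a plain column lookup in a triple table but requires a union over every property table under vertical partitioning.

```python
import sqlite3

# Toy triples; the data and all table names are illustrative assumptions,
# not taken from the paper or from SP2Bench.
TRIPLES = [
    ("paper1", "dc:creator", "alice"),
    ("paper1", "dcterms:references", "paper2"),
    ("paper2", "dc:creator", "bob"),
    ("paper2", "dcterms:references", "paper1"),
]

def build(conn):
    # Triple-table scheme: one table with three columns.
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    # Vertically partitioned scheme: one two-column table per predicate.
    tables = {}
    for i, pred in enumerate(sorted({p for _, p, _ in TRIPLES})):
        name = f"vp_{i}"
        tables[pred] = name
        conn.execute(f"CREATE TABLE {name} (s TEXT, o TEXT)")
    for s, p, o in TRIPLES:
        conn.execute(f"INSERT INTO {tables[p]} VALUES (?, ?)", (s, o))
    return tables

def predicates_triple_table(conn, subject):
    # Pattern "<subject> ?p ?o": the predicate is just another column.
    rows = conn.execute("SELECT p, o FROM triples WHERE s = ?", (subject,))
    return sorted(rows.fetchall())

def predicates_vertical(conn, tables, subject):
    # Same pattern: every predicate table has to be visited and unioned.
    parts = [f"SELECT '{pred}' AS p, o FROM {name} WHERE s = ?"
             for pred, name in tables.items()]
    sql = " UNION ALL ".join(parts)
    rows = conn.execute(sql, (subject,) * len(parts))
    return sorted(rows.fetchall())

conn = sqlite3.connect(":memory:")
tables = build(conn)
tt = predicates_triple_table(conn, "paper1")
vp = predicates_vertical(conn, tables, "paper1")
```

Both functions return the same rows. The difference is that the number of subqueries in the vertically partitioned case grows with the number of distinct properties, which, as one of the questions notes, includes one bag membership property per position in the longest reference list.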
Attendees
Facts about Jesse Weaver RDF Management Approaches
| Property | Value |
|---|---|
| A | Presentation, and Presentation attended by Tim Lebo |
| Conference | ISWC 2008 |
| Date | 2 October 2008 |
| Given at | CSCI 6966 Advanced Semantic Web (Fall 2008) - Lesson 6 |
| Paper has author | Michael Schmidt, Thomas Hornung, Norbert Küchlin, Georg Lausen, and Christoph Pinkel |
| Speaker | Jesse Weaver |
| Title of paper | An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario |
| Url | http://www.informatik.uni-freiburg.de/~mschmidt/docs/sp2b_exp.pdf |

