<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>The Tetherless World Weblog</title>
	<atom:link href="http://tw.rpi.edu/weblog/feed/" rel="self" type="application/rss+xml" />
	<link>http://tw.rpi.edu/weblog</link>
	<description>Everything about Web Science</description>
	<pubDate>Fri, 20 Nov 2009 22:15:50 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Scaling Up at the Tetherless World Constellation in 2009</title>
		<link>http://tw.rpi.edu/weblog/2009/11/20/scaling-up-at-twc-2009/</link>
		<comments>http://tw.rpi.edu/weblog/2009/11/20/scaling-up-at-twc-2009/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 22:15:50 +0000</pubDate>
		<dc:creator>Jesse Weaver</dc:creator>
		
		<category><![CDATA[tetherless world]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=219</guid>
		<description><![CDATA[Since this is my first post to the Tetherless World blog, perhaps a brief introduction is in order.  I&#8217;m Jesse Weaver, one of Jim Hendler&#8217;s Ph.D. students in the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).  My general research interest is in high-performance computing for the semantic web.  Specifically, I have been looking [...]]]></description>
			<content:encoded><![CDATA[<p>Since this is my first post to the Tetherless World blog, perhaps a brief introduction is in order.  I&#8217;m <a href="http://www.cs.rpi.edu/~weavej3/">Jesse Weaver</a>, one of <a href="http://www.cs.rpi.edu/~hendler/">Jim Hendler</a>&#8217;s Ph.D. students in the <a href="http://tw.rpi.edu/">Tetherless World Constellation</a> (TWC) at <a href="http://www.rpi.edu/">Rensselaer Polytechnic Institute</a> (RPI).  My general research interest is in high-performance computing for the semantic web.  Specifically, I have been looking at employing parallelism on cluster architectures for rule-based reasoning and RDF query.  Since joining TWC in Fall 2008, I have been working with colleagues toward this end, and it is that work that I would like to share in this blog post.</p>
<p>Jim and I recently published a paper at ISWC 2009 entitled <em><a href="http://www.cs.rpi.edu/~weavej3/#ParallelRDFS">Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples</a></em>.  Since the time that paper was accepted (and as presented at ISWC), we have actually scaled to billions of triples.  We show in this paper that the RDFS rules can be applied to independent partitions of data to produce the RDFS closure for all of the data, as long as each partition has the ontologies.  In parallel computing terms, the RDFS closure can be computed in an embarrassingly parallel fashion.  &#8220;Embarrassingly parallel&#8221; is a technical term from parallel computing describing a computation that can be divided into completely independent parts.  Such computations are considered ideal for parallelism because there is no need for communication between processes and hence there is essentially no overhead for parallelization.  <a href="http://ect.bell-labs.com/who/pfps/">Peter Patel-Schneider</a> had some good questions and comments after the presentation.  I have made my responses publicly available in a brief <a href="http://www.cs.rpi.edu/~weavej3/papers/iswc2009-notes.txt">note</a>.</p>
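<p>To make the partitioning idea concrete, here is a minimal sketch in Python (not the code from the paper). It covers only two RDFS rules, subproperty (rdfs7) and subclass (rdfs9) inheritance, and the names <code>close_partition</code> and <code>transitive_closure</code>, as well as the <code>ex:</code> terms, are made up for illustration. Each partition is closed using nothing but its own instance triples and a full copy of the ontology.</p>

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def transitive_closure(pairs):
    """Map each child to the set of all its ancestors, given (child, parent) pairs."""
    parents = {}
    for child, parent in pairs:
        parents.setdefault(child, set()).add(parent)
    closure = {}
    for start in parents:
        seen, stack = set(), list(parents[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        closure[start] = seen
    return closure


def close_partition(instance_triples, ontology):
    """Apply rdfs7 and rdfs9 to one partition; needs no data from other partitions."""
    super_classes = transitive_closure(
        {(s, o) for s, p, o in ontology if p == SUBCLASS})
    super_props = transitive_closure(
        {(s, o) for s, p, o in ontology if p == SUBPROP})
    inferred = set(instance_triples)
    for s, p, o in instance_triples:
        for q in super_props.get(p, ()):        # rdfs7: inherit via subproperty
            inferred.add((s, q, o))
        if p == RDF_TYPE:
            for c in super_classes.get(o, ()):  # rdfs9: inherit via subclass
                inferred.add((s, RDF_TYPE, c))
    return inferred


# Hypothetical ontology and data, for illustration only.
ontology = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:advisor", SUBPROP, "ex:knows"),
}
data = [
    ("ex:jesse", RDF_TYPE, "ex:Student"),
    ("ex:jesse", "ex:advisor", "ex:jim"),
    ("ex:greg", RDF_TYPE, "ex:Student"),
]
partitions = [data[:1], data[1:]]

# Close each partition independently, then take the union of the results.
per_partition = set().union(
    *(close_partition(part, ontology) for part in partitions))
whole = close_partition(data, ontology)
```

<p>In this toy case the union of the per-partition results equals the closure computed over all of the data at once, which is the property that lets each process work without talking to the others.</p>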
<p><a href="http://www.cs.rpi.edu/~willig4/">Gregory Todd Williams</a> and I published a paper at SSWS 2009 entitled <em><a href="http://www.cs.rpi.edu/~weavej3/#ScalableRDFQuery">Scalable RDF query processing on clusters and supercomputers</a></em>.  This paper shows how parallel hash joins can be used on high-performance clusters to efficiently query large RDF datasets.  It seemed to get a lot of attention at the SSWS workshop as well as stir up a little bit of controversy.  The interesting thing about our approach is that no global indexes are created.  Each process in the cluster gets a portion of the data and indexes it locally, but no global indexes are maintained (e.g., we do not globally dictionary encode RDF terms).  This allows us to load data extremely quickly with some cost to query time.  In many cases, though, the decrease in loading time outweighs the added cost in query time.  (The added cost in query time comes from communicating full string values instead of global IDs during the parallel hash join.)  This allows for exploratory querying and easy handling of dynamically changing data.  Whereas many previous query systems depend heavily on global indexes (for which loading can take on the order of hours or days), we can load large datasets on the order of seconds and minutes.  Therefore, if the data changes, it can just be reloaded instead of updating indexes.</p>
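<p>For readers unfamiliar with parallel hash joins, the following Python sketch simulates one on a single machine. It is an illustration under simplifying assumptions, not our implementation: the "processes" are just lists, the exchange step is a function call rather than message passing, and the names <code>owner</code>, <code>redistribute</code>, and <code>parallel_hash_join</code> are invented here. Note that rows are routed by hashing the full string of the join value, since no global IDs exist.</p>

```python
import zlib
from collections import defaultdict


def owner(term, num_procs):
    """Pick the process responsible for a join value by hashing its full string."""
    return zlib.crc32(term.encode("utf-8")) % num_procs


def redistribute(local_rows, key, num_procs):
    """Simulate the exchange step: send every row to the owner of its join value."""
    inboxes = [[] for _ in range(num_procs)]
    for rows in local_rows:
        for row in rows:
            inboxes[owner(row[key], num_procs)].append(row)
    return inboxes


def local_hash_join(left, right, key):
    """Ordinary in-memory hash join on one process."""
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**left_row, **right_row}
            for right_row in right
            for left_row in table.get(right_row[key], [])]


def parallel_hash_join(left_parts, right_parts, key):
    """Join two distributed relations; each list element is one process's rows."""
    n = len(left_parts)
    left_in = redistribute(left_parts, key, n)
    right_in = redistribute(right_parts, key, n)
    results = []
    for rank in range(n):  # each rank joins only what it received
        results.extend(local_hash_join(left_in[rank], right_in[rank], key))
    return results


# Hypothetical variable bindings for two triple patterns sharing ?p.
left_parts = [[{"?p": "ex:jesse", "?name": "Jesse"}],
              [{"?p": "ex:greg", "?name": "Greg"}]]
right_parts = [[{"?p": "ex:greg", "?mbox": "mailto:greg@example.org"}],
               [{"?p": "ex:jesse", "?mbox": "mailto:jesse@example.org"}]]
joined = parallel_hash_join(left_parts, right_parts, "?p")
```

<p>Matching rows start out on different processes, yet after redistribution they meet on the same one, so every join can be finished locally.</p>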
<p>Finally, Greg, <a href="http://www.cs.rpi.edu/~atrem/">Medha Atre</a>, Jim, and I submitted a paper to the <a href="http://www.cs.vu.nl/~pmika/swc/submissions2009.html">Billion Triples Challenge</a> (BTC), which we won!</p>
<div style="text-align: center;"><img src="http://www.cs.rpi.edu/~weavej3/btc2009/Btc-winner-2009.png" alt="Greg and Jesse accept the award for 1st place at the 2009 Billion Triples Challenge" width="300" height="225" /></div>
<p>We combined three systems for our submission.  First, we created a simple <a href="http://www.cs.rpi.edu/~weavej3/btc2009/#upper">upper ontology</a> of 31 triples for our domain of interest, linking established concepts of Person to our concept of Person (by subclass), and doing the same for many relevant properties such as name and email (by subproperty).  Then, we used the aforementioned parallel materialization work to produce inferences on the BTC dataset, inferring triples that use our terms from the upper ontology.  Using the aforementioned work on scalable query, we then extracted only our triples of interest.  This <a href="http://www.cs.rpi.edu/~weavej3/btc2009/#reduced">reduced dataset</a> is almost 800K triples, much more manageable than the original 900M triples, and it can now be used by existing tools without much concern about dataset size.  As a finishing touch, we compressed the reduced dataset down into a <a href="http://www.cs.rpi.edu/~atrem/#Research">BitMat</a> RDF data structure, resulting in a final disk footprint of 8 MB for the triples and 25 MB for the dictionary encoding.  Simple basic graph pattern queries can be executed against the BitMat.  The entire process took roughly 22 minutes.  See more about the submission at <a href="http://www.cs.rpi.edu/~weavej3/btc2009/">our BTC website</a>, which contains the datasets and some statistics about them.</p>
<div style="text-align: center;"><a title="Jesse's BTC Presentation by kasei, on Flickr" href="http://www.flickr.com/photos/kasei/4055714142/"><img src="http://farm3.static.flickr.com/2466/4055714142_de5795153f.jpg" alt="Jesse presents at the 2009 Billion Triples Challenge" width="333" height="222" /></a></div>
<p>That said, much work remains to be done on scalability in the semantic web domain.</p>
<p>At present, I am working on formalizing a more general notion of &#8220;abox partitioning&#8221; in order to classify rules that fit such a paradigm, and then exploring its application to OWL 2 RL.  Some parts of OWL 2 RL&#8212;like symmetric properties and inverse properties&#8212;clearly fit in the inferencing scheme from the parallel materialization paper.  However, many of the much-desired features&#8212;like inverse functional properties and owl:sameAs&#8212;do not.  For such rules, parallel hash joins may be needed, or perhaps a more clever partitioning scheme.</p>
<p>We could also improve loading time of these systems (and perhaps communication time during parallel hash joins) by using an RDF syntax that is less verbose than <a href="http://www.w3.org/TR/rdf-testcases/#ntriples">N-Triples</a>, but not as complex as <a href="http://www.w3.org/TeamSubmission/turtle/">Turtle</a>.  (Remember, we are concerned about <strong>parallel</strong> I/O.)  To that end, we are exploring defining a subset of Turtle that would be helpful for I/O purposes without trading off the inherent simplicity of N-Triples (one triple per line).</p>
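<p>One way a one-statement-per-line syntax can help with parallel I/O is sketched below in Python. This is an assumed scheme for illustration, not a description of our loader: each process takes an equal byte range of the file and reads only the lines that begin inside its range, so no coordination is needed to find statement boundaries. The function name <code>read_lines_in_range</code> and the stand-in input lines are hypothetical.</p>

```python
import os
import tempfile


def read_lines_in_range(path, rank, num_procs):
    """Return the lines whose first byte falls in this rank's share of the file."""
    size = os.path.getsize(path)
    start = size * rank // num_procs
    end = size * (rank + 1) // num_procs
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and finish that line: it belongs to the
            # previous rank unless a new line starts exactly at `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
    return lines


# Stand-in statements, one per line (simplified; not strict N-Triples).
triples = ['ex:s%d ex:p "v%d" .' % (i, i) for i in range(10)]
with tempfile.NamedTemporaryFile("w", suffix=".nt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("\n".join(triples) + "\n")

# Simulate three ranks reading their shares one after another.
recovered = [line
             for rank in range(3)
             for line in read_lines_in_range(tmp.name, rank, 3)]
os.remove(tmp.name)
```

<p>Every line is read by exactly one rank. A syntax in which a statement can span lines or depend on earlier prefix declarations would not split this cleanly.</p>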
<p>We would also like to start employing more memory-efficient RDF storage data structures (like BitMat or Parliament) directly in our systems.  This is particularly important for the Blue Gene/L architecture which has at most 1 GB of memory per node.</p>
<p>And speaking of the Blue Gene/L, I have been doing all my work at RPI&#8217;s fabulous <a href="http://www.rpi.edu/research/ccni/">Computational Center for Nanotechnology Innovations</a> (CCNI).  The CCNI is a great computational facility with parallel file systems, high-performance clusters, large SMP machines, and&#8212;of course&#8212;a Blue Gene/L.  Such a resource is a great enabler for our research.</p>
<p>Jesse Weaver<br />
Ph.D. Student, Patroon Fellow<br />
Tetherless World Constellation<br />
Rensselaer Polytechnic Institute<br />
<a href="http://www.cs.rpi.edu/~weavej3/">http://www.cs.rpi.edu/~weavej3/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/11/20/scaling-up-at-twc-2009/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Parliament, storage density, and napkin math</title>
		<link>http://tw.rpi.edu/weblog/2009/11/08/parliament-storage-density-and-napkin-math/</link>
		<comments>http://tw.rpi.edu/weblog/2009/11/08/parliament-storage-density-and-napkin-math/#comments</comments>
		<pubDate>Sun, 08 Nov 2009 08:53:41 +0000</pubDate>
		<dc:creator>greg</dc:creator>
		
		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[tetherless world]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=185</guid>
		<description><![CDATA[What follows is some napkin math on the storage requirements of the <a href="http://parliament.semwebcentral.org/">Parliament</a> triplestore (as described in their <a href="http://www.cse.lehigh.edu/~yug2/ssws09/home.htm">SSWS</a> paper) and its implications for our clustered RDF query engine.]]></description>
			<content:encoded><![CDATA[<p>What follows is some napkin math on the storage requirements of the <a href="http://parliament.semwebcentral.org/">Parliament</a> triplestore (as described in their <a href="http://www.cse.lehigh.edu/~yug2/ssws09/home.htm">SSWS</a> paper) and its implications for our clustered RDF query engine (any errors are presumably with my understanding of the described system).</p>
<p>The triple store has three data structures: a resource dictionary, statement list, and resource table. The resource dictionary is a mapping from node values (literals/IRIs/blank node identifiers) to an integer identifier and can be implemented with a search tree or hash (Parliament uses a BerkeleyDB B-tree). The statement list and resource table are the two interesting structures.</p>
<p>The statement list is an array of fixed width records containing an ID for the subject, predicate, and object in the triple, and corresponding offsets to the next triple record with the same subject, predicate, and object, respectively. Making some assumptions about the storage sizes involved (related to my specific needs), a statement list record&#8217;s fields are:</p>
<table summary="Field types and their storage requirements in the statement list">
<caption>Statement List Record</caption>
<tr>
<th>field</th>
<th>type</th>
</tr>
<tr>
<td>subject</td>
<td>4 byte Offset into Resource Table</td>
</tr>
<tr>
<td>predicate</td>
<td>4 byte Offset into Resource Table</td>
</tr>
<tr>
<td>object</td>
<td>4 byte Offset into Resource Table</td>
</tr>
<tr>
<td>next subject</td>
<td>4 byte Offset into Statement List</td>
</tr>
<tr>
<td>next predicate</td>
<td>4 byte Offset into Statement List</td>
</tr>
<tr>
<td>next object</td>
<td>4 byte Offset into Statement List</td>
</tr>
</table>
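<p>As a quick sanity check on the record size, here is how such a record could be packed with Python&#8217;s <code>struct</code> module. This is only a sketch: the field order follows the table above, and the name <code>STATEMENT</code> and the sample values are made up.</p>

```python
import struct

# Six unsigned 32-bit fields with standard sizes and no padding:
# subject, predicate, object, next subject, next predicate, next object.
STATEMENT = struct.Struct("=6I")

# Arbitrary example offsets, purely illustrative.
record = STATEMENT.pack(7, 2, 9, 12, 0, 31)
fields = STATEMENT.unpack(record)
```

<p>With these assumptions each statement occupies <code>STATEMENT.size</code> bytes, which is the 24 bytes used in the estimate below.</p>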
<p>The resource table is an array of fixed width records, one per resource, each containing three offsets into the statement list for the first statement with the specific resource in subject, predicate, and object position, respectively, three counts representing the lengths of these three lists, and the resource value. The storage requirements for a resource table record&#8217;s fields are:</p>
<table summary="Field types and their storage requirements in the resource table">
<caption>Resource Table Record</caption>
<tr>
<th>field</th>
<th>type</th>
</tr>
<tr>
<td>first subject</td>
<td>4 byte Offset into Statement List</td>
</tr>
<tr>
<td>first predicate</td>
<td>4 byte Offset into Statement List</td>
</tr>
<tr>
<td>first object</td>
<td>4 byte Offset into Statement List</td>
</tr>
<tr>
<td>subject count</td>
<td>4 byte Integer</td>
</tr>
<tr>
<td>predicate count</td>
<td>4 byte Integer</td>
</tr>
<tr>
<td>object count</td>
<td>4 byte Integer</td>
</tr>
<tr>
<td>value</td>
<td>8 byte pointer to n byte resource value</td>
</tr>
</table>
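<p>To show how the two structures work together, here is a toy in-memory model in Python. It reflects my reading of the design rather than Parliament&#8217;s actual code: the class <code>TinyStore</code>, the end-of-chain marker <code>NIL</code>, and the choice to link each new statement at the head of its chains are all assumptions, and Python lists stand in for the fixed width records.</p>

```python
NIL = 0xFFFFFFFF  # assumed end-of-chain marker; not specified in the description


class TinyStore:
    def __init__(self):
        self.ids = {}         # resource dictionary: value -> resource id
        self.resources = []   # rows: [first_s, first_p, first_o, n_s, n_p, n_o, value]
        self.statements = []  # rows: [s, p, o, next_s, next_p, next_o]

    def _resource(self, value):
        """Look up a resource id, creating a resource table row if needed."""
        if value not in self.ids:
            self.ids[value] = len(self.resources)
            self.resources.append([NIL, NIL, NIL, 0, 0, 0, value])
        return self.ids[value]

    def add(self, s, p, o):
        terms = [self._resource(v) for v in (s, p, o)]
        index = len(self.statements)
        row = terms + [NIL, NIL, NIL]
        for pos, rid in enumerate(terms):
            res = self.resources[rid]
            row[3 + pos] = res[pos]  # link to the previous head of this chain
            res[pos] = index         # the new statement becomes the head
            res[3 + pos] += 1        # bump the chain length
        self.statements.append(row)

    def match(self, value, pos):
        """Yield triples with value at position pos (0=subject, 1=predicate, 2=object)."""
        rid = self.ids.get(value)
        if rid is None:
            return
        index = self.resources[rid][pos]
        while index != NIL:
            row = self.statements[index]
            yield tuple(self.resources[t][6] for t in row[:3])
            index = row[3 + pos]


# Hypothetical data, for illustration only.
store = TinyStore()
store.add("ex:jesse", "foaf:knows", "ex:greg")
store.add("ex:jesse", "foaf:name", '"Jesse"')
store.add("ex:greg", "foaf:knows", "ex:jesse")
by_subject = list(store.match("ex:jesse", 0))
by_predicate = list(store.match("foaf:knows", 1))
```

<p>Finding all statements for a resource in a given position is a walk along one linked chain, with no sorted index involved, and the counts in the resource table tell a query planner how long each chain is before walking it.</p>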
<p>Putting aside the storage cost for the resource dictionary (which would obviously depend on the specific structure used, whether tree or hash), the storage requirements for the statement list and resource table are then:</p>
<p>(24 bytes * |triples|) + ((32+n) bytes * |nodes|)</p>
<p>where n is the average resource value size.</p>
<p>To get some real world numbers, I looked at a subset of the Billion Triples Challenge dataset (BTC chunks 000-009). This subset has 100M triples, ~34M nodes, and occupies 3.5GB with nodes serialized using N-Triples syntax (I&#8217;ll assume here that the N-Triples serialization gives a rough approximation of the node&#8217;s representation in memory). This averages out to ~341 unique nodes per thousand triples and ~109 bytes per node.</p>
<p>Using these numbers, the statement list and resource table for 100M triples would take (24*100M) + ((32+109)*34144466) bytes = 7214369706 bytes = 6.7GB (counting 1GB as 2<sup>30</sup> bytes). The upper limit on a single triplestore with these structures (using 32-bit offsets) is about 4.3 billion (2<sup>32</sup>) statements, with a storage estimate of 288GB at the same node density.</p>
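<p>The arithmetic can be replayed with a few lines of Python. This is just the formula above turned into a function; the name <code>storage_bytes</code> is made up, and the upper-limit case assumes the node-to-triple ratio of the BTC sample carries over to larger datasets.</p>

```python
STATEMENT_BYTES = 24   # six 4-byte fields per statement list record
RESOURCE_BYTES = 32    # six 4-byte fields plus an 8-byte pointer per resource
GIB = 2 ** 30


def storage_bytes(triples, nodes, avg_value_bytes):
    """Statement list plus resource table size, excluding the resource dictionary."""
    return STATEMENT_BYTES * triples + (RESOURCE_BYTES + avg_value_bytes) * nodes


# The BTC subset: 100M triples, 34144466 nodes, ~109 bytes per node value.
sample = storage_bytes(100_000_000, 34_144_466, 109)
sample_gib = sample / GIB

# The 32-bit offset limit, assuming the same node density as the sample.
max_triples = 2 ** 32
nodes_per_triple = 34_144_466 / 100_000_000
upper = storage_bytes(max_triples, max_triples * nodes_per_triple, 109)
upper_gib = upper / GIB
```

<p>Here <code>sample</code> is 7214369706 bytes, about 6.7 when divided by 2<sup>30</sup>, and <code>upper_gib</code> lands between 288 and 289.</p>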
<p>I&#8217;d like to try implementing this and using it on our clustered RDF query engine (part of the system that won the Billion Triples Challenge last month). It seems like a very natural fit since we&#8217;d like high storage density but haven&#8217;t spent any effort on that front yet (and our current code is awful in this respect) and our parallel query answering code doesn&#8217;t need the sorting that a traditional tree-based index would provide. Some more (very rough) napkin math suggests that we&#8217;d be able to get over 100 billion triples loaded onto the Blue Gene/L. It&#8217;ll be interesting to see if the numbers actually pan out, and if the query algorithm can handle anything close to that much data.</p>
<p>Gregory Todd Williams</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/11/08/parliament-storage-density-and-napkin-math/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Web Accessibility in an Educational Context</title>
		<link>http://tw.rpi.edu/weblog/2009/10/25/web-accessibility-in-an-educational-context/</link>
		<comments>http://tw.rpi.edu/weblog/2009/10/25/web-accessibility-in-an-educational-context/#comments</comments>
		<pubDate>Sun, 25 Oct 2009 15:32:44 +0000</pubDate>
		<dc:creator>Evan</dc:creator>
		
		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[RDFa]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=165</guid>
		<description><![CDATA[I&#8217;m currently at the annual meeting for the Human Factors and Ergonomics Society and a few hours ago I was attending a set of presentations by the Internet TG on Input and Output on the Web. One of the talks that caught my attention was by Wayne Shebilske of Wright State University on the use [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m currently at the annual meeting for the Human Factors and Ergonomics Society and a few hours ago I was attending a set of presentations by the Internet TG on Input and Output on the Web. One of the talks that caught my attention was by Wayne Shebilske of Wright State University on the use of screen readers to help impaired students complete tasks within educational programs such as WebCT, student status management software, and the like. As Dr. Shebilske pointed out based on his study, there seem to be many problems with such tools. I think this could be a prime candidate for Semantic Web (SW) technologies to step in and improve the end-user experience. I foresee using such tools along with technologies like RDFa to give the user a better representation of the content being displayed, improving the overall quality of life for these individuals.</p>
<p>Evan</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/10/25/web-accessibility-in-an-educational-context/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Probing the SPARQL endpoint of data.gov.uk</title>
		<link>http://tw.rpi.edu/weblog/2009/10/23/probing-the-sparql-endpoint-of-datagovuk/</link>
		<comments>http://tw.rpi.edu/weblog/2009/10/23/probing-the-sparql-endpoint-of-datagovuk/#comments</comments>
		<pubDate>Fri, 23 Oct 2009 04:13:10 +0000</pubDate>
		<dc:creator>li</dc:creator>
		
		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[linked data]]></category>

		<category><![CDATA[data.gov]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=167</guid>
		<description><![CDATA[ 
We just ran across the preview SPARQL endpoint for UK&#8217;s Data.gov (powered by Talis) following Harry Metcalfe&#8217;s blog. In order to understand what data is hosted by the triple store, we use a series of SPARQL queries to probe the content in data.gov. We leverage a web service http://data-gov.tw.rpi.edu/ws/sparqlproxy.php to convert SPARQL/XML result [...]]]></description>
			<content:encoded><![CDATA[<p><!--  /* Font Definitions */  @font-face 	{font-family:SimSun; 	panose-1:2 1 6 0 3 1 1 1 1 1; 	mso-font-alt:宋体; 	mso-font-charset:134; 	mso-generic-font-family:auto; 	mso-font-pitch:variable; 	mso-font-signature:3 680460288 22 0 262145 0;} @font-face 	{font-family:"Cambria Math"; 	panose-1:2 4 5 3 5 4 6 3 2 4; 	mso-font-charset:0; 	mso-generic-font-family:roman; 	mso-font-pitch:variable; 	mso-font-signature:-1610611985 1107304683 0 0 415 0;} @font-face 	{font-family:Calibri; 	panose-1:2 15 5 2 2 2 4 3 2 4; 	mso-font-charset:0; 	mso-generic-font-family:swiss; 	mso-font-pitch:variable; 	mso-font-signature:-520092929 1073786111 9 0 415 0;} @font-face 	{font-family:"\@SimSun"; 	panose-1:2 1 6 0 3 1 1 1 1 1; 	mso-font-charset:134; 	mso-generic-font-family:auto; 	mso-font-pitch:variable; 	mso-font-signature:3 680460288 22 0 
262145 0;}  /* Style Definitions */  p.MsoNormal, li.MsoNormal, div.MsoNormal 	{mso-style-unhide:no; 	mso-style-qformat:yes; 	mso-style-parent:""; 	margin-top:0in; 	margin-right:0in; 	margin-bottom:10.0pt; 	margin-left:0in; 	line-height:115%; 	mso-pagination:widow-orphan; 	font-size:11.0pt; 	font-family:"Calibri","sans-serif"; 	mso-ascii-font-family:Calibri; 	mso-ascii-theme-font:minor-latin; 	mso-fareast-font-family:SimSun; 	mso-fareast-theme-font:minor-fareast; 	mso-hansi-font-family:Calibri; 	mso-hansi-theme-font:minor-latin; 	mso-bidi-font-family:"Times New Roman"; 	mso-bidi-theme-font:minor-bidi;} .MsoChpDefault 	{mso-style-type:export-only; 	mso-default-props:yes; 	mso-ascii-font-family:Calibri; 	mso-ascii-theme-font:minor-latin; 	mso-fareast-font-family:SimSun; 	mso-fareast-theme-font:minor-fareast; 	mso-hansi-font-family:Calibri; 	mso-hansi-theme-font:minor-latin; 	mso-bidi-font-family:"Times New Roman"; 	mso-bidi-theme-font:minor-bidi;} .MsoPapDefault 	{mso-style-type:export-only; 	margin-bottom:10.0pt; 	line-height:115%;} @page Section1 	{size:8.5in 11.0in; 	margin:1.0in 1.0in 1.0in 1.0in; 	mso-header-margin:.5in; 	mso-footer-margin:.5in; 	mso-paper-source:0;} div.Section1 	{page:Section1;} --><!--[if gte mso 10]> <mce:style><!   /* Style Definitions */  table.MsoNormalTable 	{mso-style-name:"Table Normal"; 	mso-tstyle-rowband-size:0; 	mso-tstyle-colband-size:0; 	mso-style-noshow:yes; 	mso-style-priority:99; 	mso-style-qformat:yes; 	mso-style-parent:""; 	mso-padding-alt:0in 5.4pt 0in 5.4pt; 	mso-para-margin-top:0in; 	mso-para-margin-right:0in; 	mso-para-margin-bottom:10.0pt; 	mso-para-margin-left:0in; 	line-height:115%; 	mso-pagination:widow-orphan; 	font-size:11.0pt; 	font-family:"Calibri","sans-serif"; 	mso-ascii-font-family:Calibri; 	mso-ascii-theme-font:minor-latin; 	mso-hansi-font-family:Calibri; 	mso-hansi-theme-font:minor-latin; 	mso-bidi-font-family:"Times New Roman"; 	mso-bidi-theme-font:minor-bidi;} --> <!--[endif]--></p>
<p class="MsoNormal">We just ran across the preview <a href="http://services.data.gov.uk/sparql">SPARQL endpoint for the UK&#8217;s Data.gov</a> (powered by Talis) via <a href="http://thedextrousweb.com/2009/10/the-wraps-come-off-data-gov-uk/">Harry Metcalfe&#8217;s blog</a>. To understand what data the triple store hosts, we use a series of SPARQL queries to probe the content of data.gov.uk. We use a web service, http://data-gov.tw.rpi.edu/ws/sparqlproxy.php, to convert SPARQL/XML results into HTML and JSON.</p>
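<p class="MsoNormal">For readers who want to send the queries below themselves, here is a minimal sketch of how a query can be packed into a request URL. It assumes the endpoint follows the W3C SPARQL Protocol, where the query text travels in a URL-encoded <code>query</code> parameter; the helper name <code>build_query_url</code> is ours, and the sketch does not cover the sparqlproxy.php service, whose parameters are not described here.</p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint URL taken from the post; the "query" parameter name comes from the
# W3C SPARQL Protocol. Whether the endpoint is still online is not assumed.
ENDPOINT = "http://services.data.gov.uk/sparql"


def build_query_url(sparql, endpoint=ENDPOINT):
    """Return a GET URL that carries the SPARQL query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


url = build_query_url("SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5")
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

<p class="MsoNormal">The sketch only builds the URL; fetching it and choosing a result format are left to whichever HTTP client you prefer.</p>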
<p class="MsoNormal"><strong>First, let&#8217;s do some warm up exercises</strong></p>
<p class="MsoNormal">Q: show me some triples!</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5</p>
</blockquote>
<p class="MsoNormal">Result:</p>
<p class="MsoNormal">&lt;http://www.london-gazette.co.uk/id/issues/58316/notices/240663&gt; &lt;http://www.gazettes-online.co.uk/ontology#hasPublicationDate&gt; &#8220;2007-05-02&#8243;^^http://www.w3.org/2001/XMLSchema#date .</p>
<p class="MsoNormal">&lt;http://www.london-gazette.co.uk/id/issues/58316/notices/240663&gt; &lt;http://xmlns.com/foaf/0.1/page&gt; &lt;http://www.london-gazette.co.uk/issues/58316/pages/6359&gt; .</p>
<p class="MsoNormal">&lt;http://www.london-gazette.co.uk/id/issues/58316/notices/240663&gt; &lt;http://purl.org/dc/terms/modified&gt; &#8220;2007-05-02&#8243;^^http://www.w3.org/2001/XMLSchema#date .</p>
<p class="MsoNormal">A: OK, it holds a gazette dataset about London (see http://www.london-gazette.co.uk/), and it uses the FOAF and DC vocabularies.</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: show me some classes that have instances?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT DISTINCT ?c WHERE { [] a ?c. } LIMIT 5</p>
</blockquote>
<p class="MsoNormal">Result:</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/transport#RoadTrafficActsNotice&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#Notice&gt;</p>
<p class="MsoNormal">&lt;http://xmlns.com/foaf/0.1/Document&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#Issue&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#Edition&gt;</p>
<p class="MsoNormal">A: same observation as above.</p>
<p class="MsoNormal">
<p class="MsoNormal">
<p class="MsoNormal">Q: Does it host any named graphs?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT ?g WHERE {GRAPH ?g { ?s ?p ?o } } LIMIT 10</p>
</blockquote>
<p class="MsoNormal">Result: &#8220;0&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: no named graphs found; there is only one big default graph.</p>
<p class="MsoNormal">
<p>Now let&#8217;s run several expensive aggregation queries (note aggregation queries are not part of the current SPARQL specification)</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many triples?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT count(*) WHERE {?s ?p ?o}</p>
</blockquote>
<p class="MsoNormal">Result: &#8220;5529380&#8243;^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: alright! Aggregation queries are supported, and there are about 5.5 million triples. Note that &#8220;count&#8221; is a non-standard aggregation function, so it may be supported differently by different SPARQL endpoints.</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many graphs (and the number of triples in each graph)?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT ?g count(*) WHERE {GRAPH ?g { ?s ?p ?o } }</p>
</blockquote>
<p class="MsoNormal">Result: &#8220;0&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: no named graph.</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many populated classes?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT count(distinct ?c) WHERE {[] a ?c}</p>
</blockquote>
<p class="MsoNormal">Result: &#8220;99&#8243;^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: there are 99 different classes having direct instances in this triple store.</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many populated properties?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT count(distinct ?p) WHERE {[] ?p ?o}</p>
</blockquote>
<p class="MsoNormal">Result: &#8220;86&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: There are 86 unique properties used as predicates in this dataset. On average, each property is used by about 64K (5529380/86) triples. There must be some very popular properties, and we will survey them later.</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many typed individuals?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT count(distinct ?s) WHERE {?s a ?c}</p>
</blockquote>
<p class="MsoNormal">Result: 504 Gateway Time-out</p>
<p class="MsoNormal">A: Oops, really expensive. Let&#8217;s try something else.</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many defined classes?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT count(distinct ?s) WHERE {{?s a &lt;http://www.w3.org/2002/07/owl#Class&gt; } UNION {?s a &lt;http://www.w3.org/2000/01/rdf-schema#Class&gt;}}</p>
</blockquote>
<p class="MsoNormal">Result: 0</p>
<p class="MsoNormal">A: no class defined, and the triple store is full of individuals</p>
<p class="MsoNormal">
<p class="MsoNormal">Q: How many individuals (again)?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">SELECT count(*) WHERE {[] a ?c}</p>
</blockquote>
<p class="MsoNormal">Result: 995694</p>
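<p class="MsoNormal">A: There are nearly 1 million rdf:type triples, which we take as the number of typed individuals (an individual with several types is counted once per type). On average, then, each individual has about 5.6 (5529380/995694) triples.</p>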
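<p class="MsoNormal">The two back-of-the-envelope averages quoted so far can be checked with a few lines of Python. This is only a sketch over the counts the endpoint reported above; the constant and function names are ours, and the per-individual figure treats each rdf:type triple as one individual, which is an approximation.</p>

```python
# Counts reported by the endpoint in the queries above.
TOTAL_TRIPLES = 5529380
DISTINCT_PROPERTIES = 86
TYPE_TRIPLES = 995694  # result of: SELECT count(*) WHERE {[] a ?c}


def average(total, groups):
    """Plain arithmetic mean: total items divided by the number of groups."""
    return total / groups


triples_per_property = average(TOTAL_TRIPLES, DISTINCT_PROPERTIES)
triples_per_individual = average(TOTAL_TRIPLES, TYPE_TRIPLES)

print(round(triples_per_property))       # roughly 64 thousand
print(round(triples_per_individual, 1))  # between 5 and 6
```

<p class="MsoNormal">A mean hides the skew, of course: a handful of predicates such as rdf:type account for a large share of the triples.</p>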
<p class="MsoNormal">
<p class="MsoNormal">
<p class="MsoNormal"><strong>Now, let&#8217;s do some knowledge discovery</strong></p>
<p class="MsoNormal">Q: Show me the 3 most/least used classes in this dataset?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">1) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by desc(?count) limit 3</p>
<p class="MsoNormal">2) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by ?count limit 3</p>
</blockquote>
<p class="MsoNormal">Result:</p>
<p class="MsoNormal">1) most used classes</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#Notice&gt;   &#8220;156452&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.w3.org/2006/vcard/ns#Address&gt;  &#8220;106934&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/person#Person&gt;  &#8220;87798&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">2) least used classes</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/transport#CycleTracksNotice&gt;   &#8220;1&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/transport#PortsNotice&gt;  &#8220;1&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/corp-insolvency#AdministrationOrder&gt;  &#8220;2&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: Again, the SPARQL query is &#8220;safe&#8221; because Talis supports these SPARQL extensions (see http://n2.talis.com/wiki/SPARQL_Extensions). The class with the most instances in this dataset is http://www.gazettes-online.co.uk/ontology#Notice, which has 156,452 instances.</p>
<p class="MsoNormal">
<p class="MsoNormal">
<p class="MsoNormal">Q: what about property usage (top 5)?</p>
<p class="MsoNormal">SPARQL:</p>
<blockquote>
<p class="MsoNormal">1) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by DESC(?count) limit 5</p>
<p class="MsoNormal">2) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by ?count limit 5</p>
</blockquote>
<p class="MsoNormal">Result:</p>
<p class="MsoNormal">1) 5 most used properties:</p>
<p class="MsoNormal">&lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt;   &#8220;995694&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://purl.org/dc/dcam/memberOf&gt;  &#8220;312476&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#value&gt;  &#8220;310170&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#hasPublicationDate&gt;  &#8220;181940&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#hasNoticeCode&gt;  &#8220;181335&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">2) 5 least used properties:</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade&gt;   &#8220;1&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#hasAuthorisingOrganisation&gt;  &#8220;2&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/personal-legal#isForNextOfKinOf&gt;  &#8220;2&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology/corp-insolvency#orderAdministrator&gt;  &#8220;3&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">&lt;http://www.gazettes-online.co.uk/ontology#authorisingOrganisation&gt;  &#8220;49&#8243; ^^&lt;http://www.w3.org/2001/XMLSchema#integer&gt;</p>
<p class="MsoNormal">A: Some properties are rarely used (e.g., http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade has been used only once), while others are heavily used (like rdf:type, which is the most frequently used predicate).</p>
<p class="MsoNormal">
<p class="MsoNormal"><strong>Conclusion</strong></p>
<p class="MsoNormal">We need to stop probing now. A number of complex queries ended with a timeout error because (i) &#8220;LIMIT&#8221; only controls the final results, so we cannot simply compute statistics over, say, 1000 triples; (ii) &#8220;GROUP BY&#8221; may produce too many intermediate results; (iii) the statistics queries do not leverage the index structure of the triple store, or the index structures are not designed for such queries; and (iv) many other issues.</p>
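<p class="MsoNormal">Point (i) can be illustrated with a toy model. The sketch below mimics a &#8220;group by ... order by ... limit&#8221; query over a tiny, made-up in-memory list of triples; it is not how Talis or any real triple store evaluates queries, only a picture of why the limit cannot reduce the amount of data scanned.</p>

```python
from collections import Counter

# Toy in-memory "triple store"; the data is made up for illustration.
TRIPLES = [
    ("n1", "rdf:type", "Notice"),
    ("n1", "dc:modified", "2007-05-02"),
    ("n2", "rdf:type", "Notice"),
    ("n2", "foaf:page", "p1"),
    ("n3", "rdf:type", "Issue"),
]


def top_properties(triples, limit):
    """Mimic: select ?p (count(?s) AS ?count) group by ?p order by desc(?count) limit N."""
    scanned = 0
    counts = Counter()
    for _s, p, _o in triples:       # grouping has to visit every matching triple
        counts[p] += 1
        scanned += 1
    ranked = counts.most_common()   # ordering needs every group's final count
    return ranked[:limit], scanned  # LIMIT trims only the finished result


top, scanned = top_properties(TRIPLES, limit=1)
print(top, scanned)
```

<p class="MsoNormal">Even with a limit of 1, the function touches all five triples before it can answer, which is the behaviour we ran into at 5.5 million triples.</p>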
<p class="MsoNormal"><span style="color: #ff0000;"><strong>Updates</strong></span></p>
<ul>
<li>We also delivered a web service that runs this probe job; see <a href="http://data-gov.tw.rpi.edu/ws/sparqlprobe.php">http://data-gov.tw.rpi.edu/ws/sparqlprobe.php</a>. This service focuses on the default graph, and we welcome input on measuring named graphs.</li>
</ul>
<p class="MsoNormal">presented by Li Ding and Zhengning Shangguan</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/10/23/probing-the-sparql-endpoint-of-datagovuk/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OWL 2 Reference Card released</title>
		<link>http://tw.rpi.edu/weblog/2009/10/18/owl-2-reference-card-released/</link>
		<comments>http://tw.rpi.edu/weblog/2009/10/18/owl-2-reference-card-released/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 04:39:22 +0000</pubDate>
		<dc:creator>Jie Bao</dc:creator>
		
		<category><![CDATA[tetherless world]]></category>

		<category><![CDATA[owl]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=160</guid>
		<description><![CDATA[We&#8217;re pleased to announce the OWL 2 Reference Card [1]. The Card is meant to be a &#8220;cheat sheet&#8221; of OWL 2 features printable on a single piece of paper (on both sides). It is based on the OWL 2 Quick Reference Guide [2], which is now a Proposed Recommendation [3] in the OWL 2 [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re pleased to announce the OWL 2 Reference Card [1]. The Card is meant to be a &#8220;cheat sheet&#8221; of OWL 2 features printable on a single piece of paper (on both sides). It is based on the OWL 2 Quick Reference Guide [2], which is now a Proposed Recommendation [3] in the OWL 2 Web Ontology Language document set.</p>
<p>Background: OWL 2 [4] is an extension of OWL 1 with some new functionality. Some of the new features are syntactic sugar (e.g., disjoint union of classes) while others offer new expressivity, including:</p>
<p>* keys;<br />
* property chains;<br />
* richer datatypes, data ranges;<br />
* qualified cardinality restrictions;<br />
* asymmetric, reflexive, and disjoint properties; and<br />
* enhanced annotation capabilities</p>
<p>Comments and suggestions on the Card are welcome (please send them to public-owl-comments@w3.org).</p>
<p>[1] http://www.w3.org/2007/OWL/refcard</p>
<p>[2] http://www.w3.org/2007/OWL/wiki/Quick_Reference_Guide</p>
<p>[3] http://www.w3.org/TR/2009/PR-owl2-quick-reference-20090922/</p>
<p>[4] http://www.w3.org/TR/owl2-overview/</p>
<p>Jie Bao</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/10/18/owl-2-reference-card-released/feed/</wfw:commentRss>
		</item>
		<item>
		<title>My Personal (unofficial) Semantic Web FAQ &#8212; a pointer</title>
		<link>http://tw.rpi.edu/weblog/2009/09/01/my-personal-unofficial-semantic-web-faq-a-pointer/</link>
		<comments>http://tw.rpi.edu/weblog/2009/09/01/my-personal-unofficial-semantic-web-faq-a-pointer/#comments</comments>
		<pubDate>Wed, 02 Sep 2009 00:32:59 +0000</pubDate>
		<dc:creator>hendler</dc:creator>
		
		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[Web Science]]></category>

		<category><![CDATA[linked data]]></category>

		<category><![CDATA[personal ramblings]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=156</guid>
		<description><![CDATA[The joy of multiple blog sites is having to post pointers to one blog entry from another.
My blog at nature.com now has an entry entitled &#8220;The Semantic Web: My personal (unofficial) FAQ&#8221; which lives at http://network.nature.com/people/jhendler/blog/2009/08/03/the-semantic-web-my-personal-unofficial-faq. Comments, and especially your suggestions for Qs and As are more than welcome there or here (or anywhere else [...]]]></description>
			<content:encoded><![CDATA[<p>The joy of multiple blog sites is having to post pointers to one blog entry from another.</p>
<p>My blog at nature.com now has an entry entitled &#8220;The Semantic Web: My personal (unofficial) FAQ&#8221; which lives at <a title="Nature Blog SW FAQ" href="http://network.nature.com/people/jhendler/blog/2009/08/03/the-semantic-web-my-personal-unofficial-faq" target="_blank">http://network.nature.com/people/jhendler/blog/2009/08/03/the-semantic-web-my-personal-unofficial-faq</a>. Comments, and especially your suggestions for Qs and As are more than welcome there or here (or anywhere else for that matter)</p>
<p>Cheers,</p>
<p>Jim H.</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/09/01/my-personal-unofficial-semantic-web-faq-a-pointer/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Current Issues in data.gov</title>
		<link>http://tw.rpi.edu/weblog/2009/07/31/current-issues-in-datagov/</link>
		<comments>http://tw.rpi.edu/weblog/2009/07/31/current-issues-in-datagov/#comments</comments>
		<pubDate>Fri, 31 Jul 2009 22:54:34 +0000</pubDate>
		<dc:creator>li</dc:creator>
		
		<category><![CDATA[linked data]]></category>

		<category><![CDATA[data.gov]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=137</guid>
		<description><![CDATA[While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:

  Duplicated Datasets- Some datasets are part of another dataset, e.g. Dataset 140 (2005 Toxics Release Inventory data for the state of California (Environmental Protection Agency)) is a subset of Dataset 191 [...]]]></description>
			<content:encoded><![CDATA[<p>While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:</p>
<ul>
<li> <a href="http://data-gov.tw.rpi.edu/w/index.php?title=Current_Issues_in_data.gov&amp;action=purge#Duplicated_Datasets"> Duplicated Datasets</a>- Some datasets are part of another dataset, e.g. <a title="Dataset 140" href="http://data-gov.tw.rpi.edu/wiki/Dataset_140">Dataset 140</a> <span>(2005 Toxics Release Inventory data for the state of California (<a title="Environmental Protection Agency" href="http://data-gov.tw.rpi.edu/wiki/Environmental_Protection_Agency">Environmental Protection Agency</a>))</span> is a subset of <a title="Dataset 191" href="http://data-gov.tw.rpi.edu/wiki/Dataset_191">Dataset 191</a> <span> (2005 Toxics Release Inventory National data file of all US States and Territories (<a title="Environmental Protection Agency" href="http://data-gov.tw.rpi.edu/wiki/Environmental_Protection_Agency">Environmental Protection Agency</a>))</span>.</li>
<li> <a href="http://data-gov.tw.rpi.edu/w/index.php?title=Current_Issues_in_data.gov&amp;action=purge#Formatting_Issues"> Formatting Issues</a> - The format of some datasets is not friendly to machine processing. Not all datasets offer CSV format data, and parsing table data from them requires non-trivial efforts. Example: <a title="Dataset 37" href="http://data-gov.tw.rpi.edu/wiki/Dataset_37">Dataset 37</a> <span> (Lower Colorado River Daily Average Water Elevations and Releases (<a title="US Bureau of Reclamation" href="http://data-gov.tw.rpi.edu/wiki/US_Bureau_of_Reclamation">US Bureau of Reclamation</a>))</span>. Some websites, meanwhile, have no data at all: <a title="Dataset 335" href="http://data-gov.tw.rpi.edu/wiki/Dataset_335">Dataset 335</a> <span>(National Longitudinal Surveys (<a title="US Bureau of Labor Statistics" href="http://data-gov.tw.rpi.edu/wiki/US_Bureau_of_Labor_Statistics">US Bureau of Labor Statistics</a>))</span>, for example, tells you how to order data from the government.</li>
<div class="wp-caption aligncenter" style="width: 464px"><a href="http://data-gov.tw.rpi.edu/w/images/e/ee/Dset37.png"><img title="screen shot of the text file from dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases) by US Bureau of Reclamation" src="http://data-gov.tw.rpi.edu/w/images/e/ee/Dset37.png" alt="" width="454" height="191" /></a><p class="wp-caption-text">screen shot of the text file from dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases) by US Bureau of Reclamation</p></div>
<li> <a href="http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov#Access_Point_Issues"> </a><a href="http://data-gov.tw.rpi.edu/w/index.php?title=Current_Issues_in_data.gov&amp;action=purge#Access_Point_Issues">Access Point Issues</a> - The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where only partial data can be returned by a web-based query. Example: <a title="Dataset 330" href="http://data-gov.tw.rpi.edu/wiki/Dataset_330">Dataset 330</a> <span>(Local Area Unemployment Statistics (<a title="US Bureau of Labor Statistics" href="http://data-gov.tw.rpi.edu/wiki/US_Bureau_of_Labor_Statistics">US Bureau of Labor Statistics</a>))</span> and <a title="Dataset 96" href="http://data-gov.tw.rpi.edu/wiki/Dataset_96">Dataset 96</a> <span>(National Water Information System (NWIS) (<a title="US Geological Survey" href="http://data-gov.tw.rpi.edu/wiki/US_Geological_Survey">US Geological Survey</a>))</span>.
<p><div class="wp-caption aligncenter" style="width: 464px"><a href="http://data-gov.tw.rpi.edu/w/images/7/74/Bls-access-point.png"><img title="screen shot of the query interface for accessing dataset 330 (Local Area Unemployment Statistics) by 	US Bureau of Labor Statistics" src="http://data-gov.tw.rpi.edu/w/images/7/74/Bls-access-point.png" alt="" width="454" height="317" /></a><p class="wp-caption-text">screen shot of the query interface for accessing dataset 330 (Local Area Unemployment Statistics) by 	US Bureau of Labor Statistics</p></div></li>
</ul>
<p>For more details, please visit <a class="external free" title="http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov" rel="nofollow" href="http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov">http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov</a> .</p>
<p>Sarah Magidson, Li Ding, Dominic DiFranzo, and Jim Hendler</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/07/31/current-issues-in-datagov/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Data.gov Datasets Translated in RDF!</title>
		<link>http://tw.rpi.edu/weblog/2009/07/22/datagov-datasets-translated-in-rdf/</link>
		<comments>http://tw.rpi.edu/weblog/2009/07/22/datagov-datasets-translated-in-rdf/#comments</comments>
		<pubDate>Thu, 23 Jul 2009 00:45:21 +0000</pubDate>
		<dc:creator>li</dc:creator>
		
		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[linked data]]></category>

		<category><![CDATA[data.gov]]></category>

		<category><![CDATA[RDF]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=125</guid>
		<description><![CDATA[We have created 16 RDF datasets covering 187 of the datasets published at data.gov (171 EPA datasets are subsets of three larger EPA datasets). The original datasets were published by EPA, US Census Bureau, USGS and Office of Management and Budget in CSV compatible format, and they contributed 13,532,250 table entries. The translated RDF datasets [...]]]></description>
			<content:encoded><![CDATA[<p>We have created 16 RDF datasets covering 187 of the datasets published at data.gov (171 EPA datasets are subsets of three larger EPA datasets). The original datasets were published by <a class="mw-redirect" title="EPA" href="http://data-gov.tw.rpi.edu/wiki/EPA">EPA</a>, <a title="US Census Bureau" href="http://data-gov.tw.rpi.edu/wiki/US_Census_Bureau">US Census Bureau</a>, <a class="mw-redirect" title="USGS" href="http://data-gov.tw.rpi.edu/wiki/USGS">USGS</a> and <a title="Office of Management and Budget" href="http://data-gov.tw.rpi.edu/wiki/Office_of_Management_and_Budget">Office of Management and Budget</a> in CSV-compatible format, and they contributed 13,532,250 table entries. The translated RDF datasets include a total of 2,927,398,352 triples involving 2,526 properties.</p>
<p>We publish the RDF data in two alternative ways: (i) a collection of linked partition files in RDF/XML for users to browse the dataset and dereference the URIs using semantic web browsers, and (ii) one big N-TRIPLE file (data.nt) concatenating the partition files for machines, especially triple stores, to download and import. The largest dataset is <a title="Dataset 91" href="http://data-gov.tw.rpi.edu/wiki/Dataset_91">Dataset_91</a>, which contributed 2.11 billion triples.</p>
<p>To access the RDF datasets, users may go to <a title="Data.gov Catalog" href="http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog">Data.gov_Catalog</a> with the following options:</p>
<ul>
<li> follow links in the &#8220;rdf(index file)&#8221; column to access the index file in RDF/XML which contains the property list, statistics, and links of the RDF dataset. e.g. <a class="external free" title="http://data-gov.tw.rpi.edu/raw/401/index.rdf" rel="nofollow" href="http://data-gov.tw.rpi.edu/raw/401/index.rdf">http://data-gov.tw.rpi.edu/raw/401/index.rdf</a></li>
<li> follow links in the &#8220;rdf(partition files)&#8221; column to start an RDF browser (e.g. <a title="http://dig.csail.mit.edu/2005/ajar/release/tabulator/0.8/tab.html" rel="nofollow" href="http://dig.csail.mit.edu/2005/ajar/release/tabulator/0.8/tab.html" target="_blank">tabulator</a>) to surf the RDF/XML partition files. e.g. <a class="external free" title="http://data-gov.tw.rpi.edu/raw/401/link00001.rdf" rel="nofollow" href="http://data-gov.tw.rpi.edu/raw/401/link00001.rdf">http://data-gov.tw.rpi.edu/raw/401/link00001.rdf</a></li>
<li> follow links in the &#8220;rdf(complete file)&#8221; column to download the complete RDF dataset in N-TRIPLE format (gzipped). e.g. <a class="external free" title="http://data-gov.tw.rpi.edu/raw/401/data-401.nt.gz" rel="nofollow" href="http://data-gov.tw.rpi.edu/raw/401/data-401.nt.gz">http://data-gov.tw.rpi.edu/raw/401/data-401.nt.gz</a></li>
<li> follow links in the &#8220;url(data.gov)&#8221; column to see the original metadata at data.gov</li>
<li> follow links in the &#8220;wiki page&#8221; column to see enhanced metadata about data.gov datasets</li>
</ul>
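<p>The three example links above suggest a simple URL layout per dataset. The sketch below builds those URLs from a dataset id; note that only the links for dataset 401 appear in this post, so the assumption that other ids follow the same pattern, and the helper name <code>dataset_urls</code>, are ours.</p>

```python
# URL pattern inferred from the three example links for dataset 401 above;
# that it holds for other dataset ids is an assumption.
BASE = "http://data-gov.tw.rpi.edu/raw"


def dataset_urls(dataset_id, partition=1):
    """Return the index, partition, and complete-file URLs for one dataset."""
    root = "{}/{}".format(BASE, dataset_id)
    return {
        "index": "{}/index.rdf".format(root),
        "partition": "{}/link{:05d}.rdf".format(root, partition),
        "complete": "{}/data-{}.nt.gz".format(root, dataset_id),
    }


urls = dataset_urls(401)
for name, url in urls.items():
    print(name, url)
```

<p>The Data.gov_Catalog page remains the authoritative list of links; treat the generated URLs as a convenience, not a guarantee.</p>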
<p>More datasets are coming, so please stay tuned and come back to <a class="external free" title="http://data-gov.tw.rpi.edu/" rel="nofollow" href="http://data-gov.tw.rpi.edu/">http://data-gov.tw.rpi.edu/</a>.</p>
<p>Further reading:</p>
<ul>
<li> To learn how we managed the translation, please go to <a title="Generating RDF from data.gov" href="http://data-gov.tw.rpi.edu/wiki/Generating_RDF_from_data.gov">Generating RDF from data.gov</a></li>
<li> To learn more about the translation statistics, please go to <a title="What's in data.gov" href="http://data-gov.tw.rpi.edu/wiki/What%27s_in_data.gov">What&#8217;s in data.gov</a></li>
</ul>
<p>Li Ding, Dominic DiFranzo, Sarah Magidson, and Jim Hendler</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/07/22/datagov-datasets-translated-in-rdf/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tilting at the NSF windmill</title>
		<link>http://tw.rpi.edu/weblog/2009/07/13/121/</link>
		<comments>http://tw.rpi.edu/weblog/2009/07/13/121/#comments</comments>
		<pubDate>Mon, 13 Jul 2009 18:05:49 +0000</pubDate>
		<dc:creator>hendler</dc:creator>
		
		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[Web Science]]></category>

		<category><![CDATA[personal ramblings]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=121</guid>
		<description><![CDATA[Colleagues - one of my blog entries at Nature seems to have hit a nerve - been zinging around the &#8220;twittersphere&#8221; and I&#8217;ve received a number of responses in private not just commiserating, but agreeing with the major points.  I want to make it clear that this is solely my own opinion, and it has [...]]]></description>
			<content:encoded><![CDATA[<p>Colleagues - one of my blog entries at Nature seems to have hit a nerve - it has been zinging around the &#8220;twittersphere&#8221; and I&#8217;ve received a number of responses in private not just commiserating, but agreeing with the major points.  I want to make it clear that this is solely my own opinion, and it has not been carefully researched, but given that so many US Semantic Web researchers have shared the frustration that I express here, I thought I&#8217;d share it on planetRDF as well.  (Europeans, believe it or not, on this side of the ocean it is hard to get funding for Semantic Web research - you have no idea how lucky you are!)</p>
<p>-Jim H</p>
<p>from blog entry: &#8220;Why NSF cannot fund high-risk, high-reward research&#8221;</p>
<div class="post entry-content">
<p>I just got turned down for a grant. That’s nothing new, you win some and you lose some, and every senior professor has gotten used to that over time. This time, however, I cannot find it in myself to just say “oh well” and let it go at that. This time, I think I need to go public, because I think what happened shows an endemic problem with the US National Science Foundation and, I hope, points out some things they could do to fix it.</p>
<p><a title="Why NSF cannot fund high-risk, high reward research" href="http://network.nature.com/people/jhendler/blog/2009/07/12/why-nsf-cannot-fund-high-risk-high-reward-research" target="_blank">Click here for the blog entry at Nature.com</a></p></div>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/07/13/121/feed/</wfw:commentRss>
		</item>
		<item>
		<title>What&#8217;s in data.gov</title>
		<link>http://tw.rpi.edu/weblog/2009/06/25/whats-in-datagov/</link>
		<comments>http://tw.rpi.edu/weblog/2009/06/25/whats-in-datagov/#comments</comments>
		<pubDate>Thu, 25 Jun 2009 14:05:24 +0000</pubDate>
		<dc:creator>li</dc:creator>
		
		<category><![CDATA[linked data]]></category>

		<category><![CDATA[data.gov]]></category>

		<guid isPermaLink="false">http://tw.rpi.edu/weblog/?p=93</guid>
		<description><![CDATA[A recent article by Tim Berners-Lee, &#8220;Putting Government Data online&#8221;, has attracted significant interest to the datasets published at the US data.gov website.  As Berners-Lee discusses the Semantic Web techniques that can be used to get those data into RDF space (something we are now working on), we would like to share our initial investigation [...]]]></description>
			<content:encoded><![CDATA[<p>A recent article by Tim Berners-Lee, &#8220;<a href="http://www.w3.org/DesignIssues/GovData.html">Putting Government Data online</a>&#8221;, has attracted significant interest to the datasets published at the <a href="http://data.gov/">US data.gov website</a>.  As Berners-Lee discusses the Semantic Web techniques that can be used to get those data into RDF space (something we are now working on), we would like to share our initial investigation of the contents of these government datasets.</p>
<p><span style="color: #ff0000;">updates:</span></p>
<p><span style="color: #ff0000;">* we have now published 5 billion triples from hundreds of datasets at http://data.gov; see http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog</span></p>
<p><strong>I. Translate dataset into RDF<br />
</strong></p>
<p>The catalog of the datasets in data.gov, <a class="external free" title="http://www.data.gov/details/92" rel="nofollow" href="http://www.data.gov/details/92">http://www.data.gov/details/92</a>, is published in CSV format as part of data.gov. We converted it into RDF using simple CSV parsing. We kept the translation minimal: (i) the properties are created directly from the column names; (ii) each table row is mapped to an instance of <a href="http://inference-web.org/2.0/pml-provenance.owl#Dataset">pmlp:Dataset</a>; (iii) each non-header cell is mapped to a literal - we don&#8217;t create new URIs at this point. The output of our work is published on the TW website at:</p>
<p><a href="http://data-gov.tw.rpi.edu/raw/92/data-92.rdf">http://data-gov.tw.rpi.edu/raw/92/data-92.rdf</a></p>
<p>(We are now starting to do more integration work, such as extracting multiple objects from single tables and linking into the linked open data cloud, and will publish a new version when that is done - the purpose of this first work was simply to make the catalog more available to the RDF community.)</p>
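<p>To make rules (i)-(iii) concrete, here is a minimal sketch of that kind of mapping. It is not our actual converter: only the pmlp:Dataset class URI comes from this post, while the <code>example.org</code> namespaces, the function name, and the sample CSV row are made up for illustration.</p>

```python
import csv
import io

# Assumed namespaces: only the pmlp:Dataset class URI comes from the post;
# the property and row base URIs are placeholders for illustration.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PML_DATASET = "http://inference-web.org/2.0/pml-provenance.owl#Dataset"
PROP_BASE = "http://example.org/vocab/"
ROW_BASE = "http://example.org/row/"


def csv_to_triples(text):
    """Map CSV text to a list of (subject, predicate, object) tuples."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # (i) one property per column name, with spaces replaced by underscores
    props = [PROP_BASE + name.strip().replace(" ", "_") for name in header]
    triples = []
    for number, row in enumerate(reader, start=1):
        subject = ROW_BASE + str(number)
        # (ii) each row becomes an instance of pmlp:Dataset
        triples.append((subject, RDF_TYPE, PML_DATASET))
        # (iii) each cell stays a plain literal; no new URIs are created
        for prop, cell in zip(props, row):
            triples.append((subject, prop, cell))
    return triples


# Made-up sample input with hypothetical column names.
sample = "Title,Agency\nToxics Release Inventory,EPA\n"
for triple in csv_to_triples(sample):
    print(triple)
```

<p>Keeping every cell as a literal means the output can be produced in a single pass over the file, at the cost of leaving any linking between rows for a later step.</p>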
<p><strong>II. Browse and query the RDF graph<br />
</strong></p>
<p>As an example, we can browse the dataset in <a href="http://dig.csail.mit.edu/2005/ajar/ajaw/tab?uri=http://data-gov.tw.rpi.edu/raw/92/data-92.rdf">tabulator</a>, and then use a SPARQL web service to query the dataset. Here, we use <a href="http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql">a SPARQL query</a> to list datasets published in CSV format:</p>
<p><a href="http://onto.rpi.edu/sw4j/sparql?queryURL=http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql">http://onto.rpi.edu/sw4j/sparql?queryURL=http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql</a></p>
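<p>The request above passes the stored query to the service by reference, through the <code>queryURL</code> parameter. A small sketch of building such a request URL follows. Both URLs are the ones given in this post, the function name is hypothetical, and the sketch only constructs the URL without assuming the service is reachable.</p>

```python
from urllib.parse import urlencode

# Both URLs are taken from the post; the helper itself is illustrative.
SERVICE = "http://onto.rpi.edu/sw4j/sparql"
QUERY_FILE = "http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql"


def service_url(query_url, service=SERVICE):
    """Build a request URL that passes a stored SPARQL query by reference."""
    return service + "?" + urlencode({"queryURL": query_url})


print(service_url(QUERY_FILE))
```

<p>Note that <code>urlencode</code> percent-encodes the inner URL, so the result differs textually from the link above even though it carries the same parameter value.</p>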
<p><strong>III. Observations on the RDF graph<br />
</strong></p>
<p>Using this service, we can answer some basic questions about the data.gov datasets:</p>
<p>1. How many datasets are published, and how many among them can be easily converted into RDF?</p>
<p>There are 332 datasets, which can be partitioned by type: raw data catalog (301) and tool catalog (31).</p>
<p>Not all of the datasets have a link to downloadable data, because some offer only browseable data via their own websites, and others publish datasets in multiple formats. As of today, the online static files associated with the datasets are distributed as follows: 204 datasets offer a CSV format dump, 10 datasets offer an XML format dump, and 21 datasets offer an XLS format dump.</p>
<p>2. How are the datasets categorized?</p>
<table style="border-collapse: collapse; width: 295pt;" border="1" cellspacing="0" cellpadding="0" width="393"><col style="width: 247pt;" width="329"></col> <col style="width: 48pt;" width="64"></col>
<tbody>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="height: 12.75pt; width: 247pt;" width="329" height="17"><strong>Category</strong></td>
<td class="xl24" style="border-left: medium none; width: 48pt;" width="64" align="right"><strong>Number of datasets</strong></td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="height: 12.75pt; width: 247pt;" width="329" height="17">Geography   and Environment</td>
<td class="xl24" style="border-left: medium none; width: 48pt;" width="64" align="right">227</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Labor Force,   Employment, and Earnings</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">30</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Social   Insurance and Human Services</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">30</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Health and   Nutrition</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">11</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Law   Enforcement, Courts, and Prisons</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">7</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Population</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">4</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Other</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">3</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Prices</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">3</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Business   Enterprise</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Education</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Energy and   Utilities</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Federal   Government Finances and Employment</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Income,   Expenditures, Poverty, and Wealth</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Science and   Technology</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Transportation</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">2</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">Construction   and Housing</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">1</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">International   Statistics</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">1</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td class="xl24" style="border-top: medium none; height: 12.75pt;" height="17">National   Security and Veterans Affairs</td>
<td class="xl24" style="border-top: medium none; border-left: medium none;" align="right">1</td>
</tr>
</tbody>
</table>
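<p>As a quick sanity check, the category counts in the table can be added up and compared with the 332 datasets reported under question 1. The snippet below is only an illustrative tally of the numbers shown in the table.</p>

```python
# Counts copied from the category table.
category_counts = {
    "Geography and Environment": 227,
    "Labor Force, Employment, and Earnings": 30,
    "Social Insurance and Human Services": 30,
    "Health and Nutrition": 11,
    "Law Enforcement, Courts, and Prisons": 7,
    "Population": 4,
    "Other": 3,
    "Prices": 3,
    "Business Enterprise": 2,
    "Education": 2,
    "Energy and Utilities": 2,
    "Federal Government Finances and Employment": 2,
    "Income, Expenditures, Poverty, and Wealth": 2,
    "Science and Technology": 2,
    "Transportation": 2,
    "Construction and Housing": 1,
    "International Statistics": 1,
    "National Security and Veterans Affairs": 1,
}

total = sum(category_counts.values())
print(total)  # 332, matching the dataset count reported earlier

# Share of the largest category in the whole catalog.
share = category_counts["Geography and Environment"] / total
print(f"{share:.1%}")  # 68.4%
```

<p>The tally shows that a single category, Geography and Environment, accounts for roughly two thirds of the catalog.</p>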
<p>3. What are some of the key items in the dataset?</p>
<p><a href="http://data-gov.tw.rpi.edu/wiki/File:Data-gov-tagcloud.png"><img class="alignnone" title="TagCloud of the Titles of data.gov Datasets (June 2009)" src="http://data-gov.tw.rpi.edu/w/images/8/81/Data-gov-tagcloud.png" alt="" width="530" height="285" /></a></p>
<p>4. What are the sources of the datasets?</p>
<p>The majority of the datasets are published by the EPA, and they contain environmental data partitioned by US state for three individual years.  The rest come from other government agencies - the distribution is as follows:</p>
<p><img class="alignnone" title="sources for datasets at data.gov" src="http://data-gov.tw.rpi.edu/w/images/2/20/Data-gov-sources.png" alt="" width="529" height="325" /></p>
<p><strong>IV. Getting Datasets linked</strong></p>
<p>Although the datasets are not explicitly linked, we see a number of opportunities for connecting these datasets to others (and into the Linked Open Data datasets):</p>
<ul>
<li>A large percentage of files have some sort of geo-tagging, so they can be linked to <a href="http://dbpedia.org/">DBpedia</a> or <a href="http://www.geonames.org/">GeoNames</a> (and then presented via map services).</li>
<li>Some datasets are subsets of other datasets, e.g. EPA data &#8220;2005 Toxics Release Inventory data for the state of Georgia&#8221; is a subset of &#8220;2005 Toxics Release Inventory National data file of all US States and Territories&#8221;, making for easier &#8220;internal&#8221; linking of the datasets.</li>
<li>A number of the datasets contain temporal information, e.g. IRS&#8217;s &#8220;Tax Year 1992 Private Foundations Study&#8221;, &#8230;, &#8220;Tax Year 2005 Private Foundations Study&#8221;, which provides an opportunity for mashups using timelines and such.</li>
</ul>
<p><strong>V. Conclusions</strong></p>
<p>We are committed to getting more of the data.gov data online soon (in RDF), and then investigating data integration and knowledge discovery. In order to get our datasets linked to the linked data cloud, we will use SPARQL for extracting entities and our Semantic MediaWiki as a platform to capture the owl:sameAs mappings.  Scalable dataset publishing is also challenging, as some of these are very large datasets, e.g. &#8220;2005-2007 American Community Survey Three-Year PUMS Population File&#8221; has a 1.1 GB zipped CSV file.  Moreover, some datasets are not directly available in one file but only via a web service.  Our current plan is to make RDF documents available for download soon, and to work on bringing more of these datasets into live, SPARQLable forms as we can.</p>
<p>Li Ding, Dominic DiFranzo and Jim Hendler</p>
]]></content:encoded>
			<wfw:commentRss>http://tw.rpi.edu/weblog/2009/06/25/whats-in-datagov/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
