Probing the SPARQL endpoint of data.gov.uk

October 23rd, 2009

We just ran across the preview SPARQL endpoint for the UK’s data.gov.uk (powered by Talis), following Harry Metcalfe’s blog. To understand what data the triple store hosts, we use a series of SPARQL queries to probe its content. We use a web service, http://data-gov.tw.rpi.edu/ws/sparqlproxy.php, to convert SPARQL/XML results into HTML and JSON.
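As a rough illustration of what such a conversion involves (this is our own minimal sketch, not the code behind sparqlproxy.php), the Python snippet below flattens a document in the standard SPARQL Query Results XML Format into JSON. The sample XML and its example.org URI are made up for the example.

```python
import json
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
SPARQL_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# A hand-written sample response (not actual endpoint output).
SAMPLE_XML = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="s"/><variable name="o"/></head>
  <results>
    <result>
      <binding name="s"><uri>http://example.org/notice/1</uri></binding>
      <binding name="o"><literal datatype="http://www.w3.org/2001/XMLSchema#date">2007-05-02</literal></binding>
    </result>
  </results>
</sparql>"""


def sparql_xml_to_rows(xml_text):
    """Turn a SPARQL/XML result document into a list of {variable: value} dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", SPARQL_NS):
        row = {}
        for binding in result.findall("sr:binding", SPARQL_NS):
            # Each binding wraps exactly one term element: uri, literal or bnode.
            term = binding[0]
            row[binding.get("name")] = term.text
        rows.append(row)
    return rows


rows = sparql_xml_to_rows(SAMPLE_XML)
print(json.dumps(rows, indent=2))
```

This sketch keeps only the value of each term; datatypes and language tags are dropped, which a real converter would probably want to preserve.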

First, let’s do some warm-up exercises

Q: show me some triples!

SPARQL:

SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5

Result:

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://www.gazettes-online.co.uk/ontology#hasPublicationDate> "2007-05-02"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://xmlns.com/foaf/0.1/page> <http://www.london-gazette.co.uk/issues/58316/pages/6359> .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://purl.org/dc/terms/modified> "2007-05-02"^^<http://www.w3.org/2001/XMLSchema#date> .

A: OK, it has some gazette data about London (see http://www.london-gazette.co.uk/), and it uses the FOAF and DC vocabularies.
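To send a query like this one from a script, the query string has to be URL-encoded into a request. Here is a minimal sketch, assuming the endpoint accepts the standard SPARQL protocol "query" parameter over HTTP GET. The endpoint address below is a placeholder, not the real one.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"


def build_query_url(endpoint, query):
    """Encode a query as a SPARQL-protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, "SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5")
print(url)

# Decoding the URL again gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```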

Q: show me some classes and their instances?

SPARQL:

SELECT DISTINCT ?c WHERE { [] a ?c. } LIMIT 5

Result:

<http://www.gazettes-online.co.uk/ontology/transport#RoadTrafficActsNotice>

<http://www.gazettes-online.co.uk/ontology#Notice>

<http://xmlns.com/foaf/0.1/Document>

<http://www.gazettes-online.co.uk/ontology#Issue>

<http://www.gazettes-online.co.uk/ontology#Edition>

A: same observation as above.

Q: Does it host any named graphs?

SPARQL:

SELECT ?g WHERE {GRAPH ?g { ?s ?p ?o } } LIMIT 10

Result: "0"^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph found, and there is only one big default graph

Now let’s run several expensive aggregation queries (note aggregation queries are not part of the current SPARQL specification)

Q: How many triples?

SPARQL:

SELECT count(*) WHERE {?s ?p ?o}

Result: "5529380"^^<http://www.w3.org/2001/XMLSchema#integer>

A: Alright! Aggregation queries are supported, and there are about 5.5 million triples. Note that “count” is a non-standard aggregation function, so it may be supported differently by different SPARQL endpoints.

Q: How many graphs (and the number of triples in each graph)?

SPARQL:

SELECT ?g count(*) WHERE {GRAPH ?g { ?s ?p ?o } }

Result: "0"^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph.

Q: How many populated classes?

SPARQL:

SELECT count(distinct ?c) WHERE {[] a ?c}

Result: "99"^^<http://www.w3.org/2001/XMLSchema#integer>

A: there are 99 different classes having direct instances in this triple store.

Q: How many populated properties?

SPARQL:

SELECT count(distinct ?p) WHERE {[] ?p ?o}

Result: "86"^^<http://www.w3.org/2001/XMLSchema#integer>

A: There are 86 unique properties used as predicates in this dataset. Each property is used by about 64K (5529380/86) triples on average. There must be some very popular properties, and we will do that survey later.

Q: How many typed individuals?

SPARQL:

SELECT count(distinct ?s) WHERE {?s a ?c}

Result: 504 Gateway Time-out

A: Oops, really expensive. Let’s try something else

Q: How many defined classes?

SPARQL:

SELECT count(distinct ?s) WHERE {{?s a <http://www.w3.org/2002/07/owl#Class> } UNION {?s a <http://www.w3.org/2000/01/rdf-schema#Class>}}

Result: 0

A: no class defined, and the triple store is full of individuals

Q: How many individuals (again)?

SPARQL:

SELECT count(*) WHERE {[] a ?c}

Result: 995694

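A: There are nearly 1 million typed individuals (strictly speaking, rdf:type triples, so an individual with several types is counted more than once), so we can easily see that every individual has roughly 5 (5M/1M) triples on average.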
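Redoing the two back-of-the-envelope averages with the exact counts reported above gives slightly more precise figures. The snippet only repeats that arithmetic; the variable names are ours.

```python
total_triples = 5529380      # result of SELECT count(*) WHERE {?s ?p ?o}
distinct_properties = 86     # result of SELECT count(distinct ?p) WHERE {[] ?p ?o}
type_triples = 995694        # result of SELECT count(*) WHERE {[] a ?c}

# Average number of triples per property and per typed individual.
triples_per_property = total_triples / distinct_properties
triples_per_individual = total_triples / type_triples

print(f"{triples_per_property:,.0f} triples per property on average")
print(f"{triples_per_individual:.2f} triples per typed individual on average")
```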

Now, let’s do some knowledge discovery

Q: Show me the 3 most/least used classes in this dataset?

SPARQL:

1) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by desc(?count) limit 3

2) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by ?count limit 3

Result:

1) most used classes

<http://www.gazettes-online.co.uk/ontology#Notice> "156452"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/2006/vcard/ns#Address> "106934"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/person#Person> "87798"^^<http://www.w3.org/2001/XMLSchema#integer>

2) least used classes

<http://www.gazettes-online.co.uk/ontology/transport#CycleTracksNotice> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/transport#PortsNotice> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#AdministrationOrder> "2"^^<http://www.w3.org/2001/XMLSchema#integer>

A: Again, the SPARQL query is “safe” because Talis supports these SPARQL extensions (cf. http://n2.talis.com/wiki/SPARQL_Extensions). The class with the most instances in this dataset is http://www.gazettes-online.co.uk/ontology#Notice (which has 156,452 instances).
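For readers less familiar with these extensions, here is a small Python analogy of what the two queries compute: group the matches of {?s a ?c} by class, count them, sort, and keep the first few. The (subject, class) pairs are made up, and this says nothing about how the triple store evaluates the query internally.

```python
from collections import Counter

# Made-up (subject, class) pairs standing in for the matches of {?s a ?c}.
TYPE_PAIRS = [
    ("n1", "Notice"), ("n2", "Notice"), ("n3", "Notice"),
    ("a1", "Address"), ("a2", "Address"),
    ("p1", "Person"),
]


def class_usage(pairs):
    """Count instances per class, like GROUP BY ?c with count(?s)."""
    return Counter(cls for _subject, cls in pairs)


counts = class_usage(TYPE_PAIRS)

# Like ORDER BY DESC(?count) LIMIT 2
most_used = counts.most_common(2)

# Like ORDER BY ?count LIMIT 2
least_used = sorted(counts.items(), key=lambda item: item[1])[:2]

print(most_used)
print(least_used)
```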

Q: what about property usage (top 5)?

SPARQL:

1) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by DESC(?count) limit 5

2) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by ?count limit 5

Result:

1) 5 most used properties:

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "995694"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://purl.org/dc/dcam/memberOf> "312476"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "310170"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasPublicationDate> "181940"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasNoticeCode> "181335"^^<http://www.w3.org/2001/XMLSchema#integer>

2) 5 least used properties:

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasAuthorisingOrganisation> "2"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/personal-legal#isForNextOfKinOf> "2"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#orderAdministrator> "3"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#authorisingOrganisation> "49"^^<http://www.w3.org/2001/XMLSchema#integer>

A: Some properties are rarely used (e.g., http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade has been used only once) while some are heavily used (like rdf:type, which is the most frequently used predicate).
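A quick check of how skewed the usage is, using only the counts reported above. The short property names in the snippet are our abbreviations of the full URIs.

```python
TOTAL_TRIPLES = 5529380   # from the count(*) query earlier in this post

# Counts of the five most used properties, as reported above.
TOP5_COUNTS = {
    "rdf:type": 995694,
    "dcam:memberOf": 312476,
    "rdf:value": 310170,
    "hasPublicationDate": 181940,
    "hasNoticeCode": 181335,
}

top5_total = sum(TOP5_COUNTS.values())
share = top5_total / TOTAL_TRIPLES

print(f"top 5 properties: {top5_total} triples, {share:.1%} of the dataset")
```

So 5 of the 86 properties account for roughly a third of all triples, which fits the earlier guess that a few properties are very popular.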

Conclusion

We need to stop probing now. A number of complex queries ended with a timeout error because (i) “LIMIT” only controls the final results, so we cannot simply compute statistics over, say, 1000 triples; (ii) “GROUP BY” may produce too many intermediate results; (iii) the statistics queries do not leverage the index structure of the triple store, or the index structures are not designed for handling such queries; and (iv) many other issues.
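Point (i) can be illustrated with a toy model in Python. This is only an analogy under our own assumptions (a list standing in for the store, a generator standing in for the scan), not a description of how the Talis store works: a plain query can stop after a few matches, while a count has to consume every match before “LIMIT” is applied to its single result row.

```python
from itertools import islice


def scan(triples, scanned):
    """Yield triples one by one, recording how many the toy store had to read."""
    for triple in triples:
        scanned.append(triple)
        yield triple


# A toy store with 1000 made-up triples.
TRIPLES = [("s%d" % i, "p", "o") for i in range(1000)]

# Like SELECT ?s ?p ?o ... LIMIT 5: the scan can stop after five matches.
read_plain = []
first_five = list(islice(scan(TRIPLES, read_plain), 5))

# Like SELECT count(*) ... LIMIT 5: every match is read before LIMIT applies.
read_count = []
result_rows = [sum(1 for _ in scan(TRIPLES, read_count))][:5]

print(len(read_plain), len(read_count), result_rows)
```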

Updates

presented by Li Ding and Zhengning Shangguan

Author: li Categories: Semantic Web, linked data Tags:

Current Issues in data.gov

July 31st, 2009

While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:

  • Duplicated Datasets - Some datasets are part of another dataset, e.g. Dataset 140 (2005 Toxics Release Inventory data for the state of California (Environmental Protection Agency)) is a subset of Dataset 191 (2005 Toxics Release Inventory National data file of all US States and Territories (Environmental Protection Agency)).
  • Formatting Issues - The format of some datasets is not friendly to machine processing. Not all datasets offer CSV format data, and parsing table data from them requires non-trivial efforts. Example: Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)). Some websites, meanwhile, have no data at all: Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)), for example, tells you how to order data from the government.

    screen shot of the text file from dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases) by US Bureau of Reclamation

  • Access Point Issues - The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where only partial data can be returned by a web-based query. Example: Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)) and Dataset 96 (National Water Information System (NWIS) (US Geological Survey)).

    screen shot of the query interface for accessing dataset 330 (Local Area Unemployment Statistics) by US Bureau of Labor Statistics

For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .

Sarah Magidson, Li Ding, Dominic DiFranzo, and Jim Hendler

Author: li Categories: linked data Tags:

Data.gov Datasets Translated in RDF!

July 22nd, 2009

We have created 16 RDF datasets covering 187 of the datasets published at data.gov (171 EPA datasets are subsets of three larger EPA datasets). The original datasets were published by EPA, US Census Bureau, USGS and Office of Management and Budget in CSV-compatible formats, and they contributed 13,532,250 table entries. The translated RDF datasets include a total of 2,927,398,352 triples involving 2,526 properties.

We publish the RDF data in two alternative ways: (i) a collection of linked partition files in RDF/XML for users to browse the dataset and dereference the URIs using semantic web browsers, and (ii) one big N-TRIPLE file (data.nt) concatenating the partition files for machines, especially triple stores, to download and import. The largest dataset is Dataset_91, which contributed 2.11 billion triples.

To access the RDF datasets, users may go to Data.gov_Catalog with the following options:

  • follow links in the “rdf(index file)” column to access the index file in RDF/XML which contains the property list, statistics, and links of the RDF dataset. e.g. http://data-gov.tw.rpi.edu/raw/401/index.rdf
  • follow links in the “rdf(partition files)” column to start an RDF browser (e.g. tabulator) to surf the RDF/XML partition files. e.g. http://data-gov.tw.rpi.edu/raw/401/link00001.rdf
  • follow links in the “rdf(complete file)” column to download the complete RDF dataset in N-TRIPLE format (gzipped). e.g. http://data-gov.tw.rpi.edu/raw/401/data-401.nt.gz
  • follow links in the “url(data.gov)” column to see the original metadata at data.gov
  • follow links in the “wiki page” column to see enhanced metadata about data.gov datasets
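Once a complete file has been downloaded, it can be streamed without unpacking it to disk first. Here is a minimal sketch that counts the statements in a gzipped N-TRIPLE file, assuming one statement per line. The two sample triples and their example.org URIs are made up and gzipped in memory so that the snippet runs without a download.

```python
import gzip
import io

# Two made-up triples in N-TRIPLE syntax, gzipped in memory to mimic a data-*.nt.gz file.
SAMPLE_NT = (
    '<http://example.org/d/1> <http://example.org/p/state> "CA" .\n'
    '<http://example.org/d/1> <http://example.org/p/year> "2005" .\n'
)
SAMPLE_GZ = gzip.compress(SAMPLE_NT.encode("utf-8"))


def count_statements(fileobj):
    """Count non-blank, non-comment lines of a gzipped N-TRIPLE stream."""
    count = 0
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line and not line.startswith("#"):
                count += 1
    return count


print(count_statements(io.BytesIO(SAMPLE_GZ)))
```

For a real download, the same function could be given a file opened in binary mode in place of the in-memory buffer.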

More datasets are coming, so please stay tuned and come back to http://data-gov.tw.rpi.edu/.

Further reading:

Li Ding, Dominic DiFranzo, Sarah Magidson, and Jim Hendler

Author: li Categories: Semantic Web, linked data Tags:

What’s in data.gov

June 25th, 2009

A recent article by Tim Berners-Lee, “Putting Government Data online”, has attracted significant interest in the datasets published at the US data.gov website. As Berners-Lee discusses the Semantic Web techniques that can be used to get those data into RDF space (something we are now working on), we would like to share our initial investigation of the contents of these government datasets.

updates:

* we have now published 5 billion triples from hundreds of datasets at http://data.gov. See http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog

I. Translate dataset into RDF

The catalog of the datasets in data.gov, http://www.data.gov/details/92, is published in CSV format as part of data.gov. We converted it into RDF using simple CSV parsing. We kept the translation minimal: (i) the properties are directly created from the column names; (ii) each table row is mapped to an instance of pmlp:Dataset; (iii) all non-header cells are mapped to literals - we don’t create new URIs at this point. The output of our work is published on the TW website at:

http://data-gov.tw.rpi.edu/raw/92/data-92.rdf

(We are now starting to do more integration work, extracting multiple objects from single tables, linking into the linked open data cloud, etc., and will publish a new version when that is done - the purpose of this first work was simply to make the catalog more available to the RDF community.)
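The three translation rules can be sketched in a few lines of Python. This is an illustration of the idea rather than the converter we actually ran: the base URI, the property namespace and the class URI standing in for pmlp:Dataset are all placeholders, the sample CSV is made up, and the output is written as N-TRIPLE lines for brevity.

```python
import csv
import io
import re

# All three URIs below are placeholders, not the ones used in data-92.rdf.
BASE = "http://example.org/data-gov/92/"
PROP_NS = "http://example.org/data-gov/vocab/"
DATASET_CLASS = "http://example.org/pml#Dataset"   # stands in for pmlp:Dataset
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# A made-up two-column catalog with a single row.
SAMPLE_CSV = "Title,Agency\nToxics Release Inventory,EPA\n"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for a literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """(i) column name -> property, (ii) row -> Dataset instance, (iii) cell -> literal."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        subject = f"<{BASE}row{number}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{DATASET_CLASS}> .")
        for column, cell in row.items():
            # Derive a property name directly from the column header.
            prop = PROP_NS + re.sub(r"\W+", "_", column.strip()).lower()
            lines.append(f'{subject} <{prop}> "{escape_literal(cell or "")}" .')
    return lines


for line in csv_to_ntriples(SAMPLE_CSV):
    print(line)
```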

II. Browse and query the RDF graph

We can browse the dataset in tabulator, and then use a SPARQL web service to query it. For example, we use a SPARQL query to list the datasets published in CSV format:

http://onto.rpi.edu/sw4j/sparql?queryURL=http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql

III. Observations on the RDF graph

Using this service we can answer some basic questions about the data.gov datasets:

1. How many datasets are published, and how many among them can be easily converted into RDF?

There are 332 datasets, which can be partitioned by type: raw data catalog (301) and tool catalog (31).

Not all of the datasets have a link to downloadable data because some offer only browseable data via their own websites. Others publish datasets in multiple formats. As of today, the online static files associated with the datasets are distributed as follows: 204 datasets offer a CSV format dump, 10 datasets offer an XML format dump, and 21 datasets offer an XLS format dump.

2. How are the datasets categorized?

Category | Number of datasets
Geography and Environment | 227
Labor Force, Employment, and Earnings | 30
Social Insurance and Human Services | 30
Health and Nutrition | 11
Law Enforcement, Courts, and Prisons | 7
Population | 4
Other | 3
Prices | 3
Business Enterprise | 2
Education | 2
Energy and Utilities | 2
Federal Government Finances and Employment | 2
Income, Expenditures, Poverty, and Wealth | 2
Science and Technology | 2
Transportation | 2
Construction and Housing | 1
International Statistics | 1
National Security and Veterans Affairs | 1

3. What are some of the key items in the dataset?

4. What are the  sources of the datasets?

The majority of the datasets are published by the EPA, and they contain environmental data partitioned by the states of the US in three individual years. Others come from other government agencies - the distribution is as follows:

IV. Getting Datasets linked

Although the datasets are not explicitly linked, we see a number of opportunities for connecting these datasets to others (and into the Linked Open Data datasets):

  • A large percentage of files have some sort of geo-tagging, thus they can be linked to DBpedia or Geo-names (and then presented via Map services).
  • Some datasets are subsets of other datasets, e.g. EPA data “2005 Toxics Release Inventory data for the state of Georgia” is a subset of  “2005 Toxics Release Inventory National data file of all US States and Territories” making for easier “internal” linking of the datasets.
  • A number of the datasets contain temporal information, e.g. IRS’s “Tax Year 1992 Private Foundations Study”,…”Tax Year 2005 Private Foundations Study” which provides an opportunity for mashups using timelines and such.
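As a sketch of the first kind of link, the snippet below emits owl:sameAs statements from a small lookup table of US states. Everything in it is an assumption made for illustration: the local state URIs are hypothetical, and the DBpedia targets are hand-picked candidates that would need to be reviewed before being published.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hand-picked candidate targets; a real mapping would need review.
STATE_TO_DBPEDIA = {
    "California": "http://dbpedia.org/resource/California",
    "Georgia": "http://dbpedia.org/resource/Georgia_(U.S._state)",
}


def same_as_links(local_uris_by_state):
    """Return N-TRIPLE lines linking local state URIs to their DBpedia candidates."""
    lines = []
    for state, local_uri in sorted(local_uris_by_state.items()):
        target = STATE_TO_DBPEDIA.get(state)
        if target is not None:          # skip states without a mapping
            lines.append(f"<{local_uri}> <{OWL_SAME_AS}> <{target}> .")
    return lines


links = same_as_links({
    "California": "http://example.org/state/CA",   # hypothetical local URIs
    "Texas": "http://example.org/state/TX",
})
print("\n".join(links))
```

A lookup table is used instead of building the target URI from the state name, because names such as Georgia are ambiguous.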

V. Conclusions

We are committed to getting more of the data.gov data online soon (in RDF), and then investigating data integration and knowledge discovery. In order to get our datasets linked to the linked data cloud, we will use SPARQL for extracting entities and our Semantic MediaWiki as a platform to capture the owl:sameAs mappings. Scalable dataset publishing is also challenging as some of these are very large datasets, e.g. “2005-2007 American Community Survey Three-Year PUMS Population File” has a 1.1 GB zipped CSV file. Moreover, some datasets are not directly available in one file but via a web service. Our current plan is to produce RDF documents available for download soon, and to work on bringing more of these datasets into live, SPARQLable forms as we can.

Li Ding, Dominic DiFranzo and Jim Hendler

Author: li Categories: linked data Tags:

Musing the Future of Semantic Wikis

March 5th, 2009

Just finished the last Ontolog mini-series on Semantic Wiki, and I would like to contribute my two cents:

1. We should differentiate work on semantic wiki

  • Semantic wiki - we download it, and use it. It is a wiki with some semantic capabilities.
  • Semantic wiki engineering - we develop conventional web applications (e.g. CMS, portal) using a semantic wiki as the underlying platform. But keep in mind that most of the effort goes into application design and development, not development of the wiki itself.

2. We should worry more about the reality and the end users

  • Similar to the case of “we got trained by Google search”, it will take a long time to get users to spend more time adding semantic annotations than editing text
  • It seems we are still organizing content using web pages, but is there any other intuitive way? In particular, how can we enable “one edit on data can affect all pages importing the data”?
  • Users are impatient, so speed, ease and minimalism are always critical. BTW, “easy” is hard to define, and the end users want “easy enough”, not “easier”.

Now I can list the highlights of that great event

Mark:

  • UI is the key problem, and we are expecting “zero-training”
  • knowledge engineers still cannot be replaced by social systems, and how do we achieve network effects of semantics

Rudi:

  • “keep it simple” is more important than “more power” and
  • sharing data across wiki boundary
  • semantic wiki can progress from CMS to Knowledge management system

Schaffer

  • survey of semantic wiki systems and application areas
  • semantic wiki can be used as a Web engineering platform, a testbed of Semantic Web

Solbrig

  • the identity features of semantic wiki
  • summaries of several semantic wiki instances
  • some features are removed from some semantic wiki: page, text (but is that a good way to go?)

Voelkel/Kroetsch

  • semantic MediaWiki,
  • key technologies include extensions, UI design, rule support, best practices/design patterns, etc.

Dean/Yim

  • domain applications of semantic wiki: biomedical vocabulary, campus information, math

Ding/Bao

  • there are several issues that affect all semantic wiki based applications:
    • interoperability - a wiki should not be an island of information
    • collaboration - does semantic wiki provide enough collaboration support, e.g. privacy protection?
    • usability - it is the key for any web application to survive and be deployed
    • methodology - we need best practices

Discussions on the future of semantic wiki

  • how to compete/collaborate with online documents, e.g. Google Docs
  • wiki will vanish because it is general-purpose
  • wiki should be easy, fast, minimal

To find more details and download presentation slides, please go to http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2009_03_05

Best,
Li

Author: li Categories: Semantic Web Tags: