Posts Tagged ‘data.gov’

Three principles for building government dataset catalog vocabulary

April 23rd, 2010

There is ongoing interest in vocabularies for publishing government datasets, with a number of proposals such as DERI's dcat, Sunlight Labs' guidelines, and RPI's proposed Data-gov Vocabulary. Based on our experience with the data.gov catalog data, we have found the following principles useful for consolidating the vocabulary-building process and potentially bringing consensus:

1. modular vocabulary with a minimal core
  • keep the core vocabulary small and stable; include only a small set of frequently used (or required) terms
  • allow extensions contributed by anyone. Extensions should be connected to the core ontology, and it should be possible to promote them to core status later.
2. choice of terms
  • make it easy for curators to produce metadata using a term, e.g. do they need to specify data quality?
  • make the expected range of a term clear, e.g. should curators use "New York" or dbpedia:New_York for spatial coverage? Does the term require a controlled vocabulary? A validator would be very helpful here (see the sketch after this list)
  • make the expected use of a term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search, or faceted browsing?
  • try to reuse terms from existing popular vocabularies
  • identify the required, recommended, and optional terms
3. best practices for actual usage
  • we certainly want the metadata to be part of linked data, but that is not the end: we would like to see the linked data actually being used by users who don't know much about the Semantic Web
  • we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, microformats, Atom, JSON, XML Schema, OData
  • we should build use cases, tools, and demos that exhibit the use of the vocabulary and promote adoption
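As a rough illustration of principles 1 and 2, the Turtle sketch below shows a minimal core term, an extension term connected to it, and a catalog entry that uses a URI (rather than a plain string) for spatial coverage. All names here (dgv:, dgv-ext:, the dataset URI) are hypothetical placeholders, not terms from any of the proposals above:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dgv:     <http://example.org/gov-dataset-vocab/core#> .  # hypothetical core namespace
@prefix dgv-ext: <http://example.org/gov-dataset-vocab/ext#> .   # hypothetical extension namespace

# a core term: kept small and stable, reusing a popular vocabulary (DC) where possible
dgv:spatial rdfs:subPropertyOf dcterms:spatial ;
    rdfs:comment "Expected range: a URI such as dbpedia:New_York, not the literal string." .

# an extension term, connected to the core ontology so it can be promoted later
dgv-ext:boundingBox rdfs:subPropertyOf dgv:spatial .

# a catalog entry using the core term with a URI value
<http://example.org/catalog/dataset/92>
    dgv:spatial <http://dbpedia.org/resource/New_York> .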

Comments are welcome.

Li Ding @ RPI


A Guided Tour into the Data-gov Wiki

December 9th, 2009

We recently revised the Data-gov Wiki demos and published a guided tour to help web users better understand and use the projects published at the Data-gov Wiki. We also expect the article to serve as a tutorial that meets the increasing number of requests from web developers who want to integrate semantic technologies with existing web technologies. Here are some highlights:

  • it lists pointers to datasets converted from data.gov and other data sources
  • it lists a couple of simple demos using key technologies, such as the Google Visualization API, MIT Exhibit, SPARQL (and extended features), SparqlProxy, and Triple Store Usage. All source code and services are included and replicable; you may see the source code at this link. (A minimal example query appears after this list.)
  • it further lists advanced demos showing how government datasets are linked, e.g. by sharing properties, reusing URI and literal identifiers, and overlapping in time and location
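To give a flavor of the simple demos, a minimal query against one of the converted datasets looks roughly like the following; the property URIs below are placeholders, since the actual ones come from each dataset's converted schema. SparqlProxy can wrap such a query and return JSON or HTML that can then be fed into Google Visualization or Exhibit:

SELECT ?state ?value
WHERE {
  ?row <http://example.org/dataset/state> ?state ;   # placeholder property URIs
       <http://example.org/dataset/value> ?value .
}
LIMIT 20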

Comments are welcome and can be reported at http://code.google.com/p/data-gov-wiki/issues/list. We are incrementally improving the article, so please come back and subscribe to our announcement RSS feeds.

Li Ding and the Data-gov Wiki team


The RPI Data-gov Wiki: Current and Future Status

November 24th, 2009

This blog post is being written in response to some questions asked of late about our work on turning the datasets from http://data.gov (and also some other government datasets) into a linked data format (RDF) and making this available to the community. In essence, the criticisms have been that although we've made a good start, there's still a lot more to do. We couldn't agree more.

Our http://data-gov.tw.rpi.edu wiki has been made available to help the public build, access, and reuse linked government data. In order to get the data out quickly, we took some simple steps to start, showing how powerful it could be just to republish the data in RDF. However, we are also now working to bring in more linking between the data, more datasets from other government data sites, and more semantics relating the datasets.

Our first step was to produce a raw RDF version of the data.gov datasets and build some quick demos to show how easy it was (see http://data-gov.tw.rpi.edu/wiki/Demos). The benefit of this was that we could easily convert data from other formats into RDF, merge the data in simple ways, query the data using SPARQL, and put the results right into visualizations using "off the shelf" web tools such as Google Visualization and MIT's Exhibit. In this step, we followed a "minimalism" principle: minimize human and development effort and keep the process simple and replicable. Thus we did not try to do a lot of analysis of the data, didn't add triples for things such as provenance, and didn't link the datasets directly. Rather, the linking in our early demos came from obvious connections such as shared locations or temporal-spatial overlaps.
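To make the "obvious linking" concrete: if two converted datasets both carry, say, a state name, a single SPARQL query can already join them with no extra modeling. A sketch, with placeholder property URIs standing in for the real converted properties:

SELECT ?state ?unemploymentRate ?releaseAmount
WHERE {
  # records from one converted dataset (placeholder URIs)
  ?r1 <http://example.org/ds330/state> ?state ;
      <http://example.org/ds330/rate> ?unemploymentRate .
  # records from another converted dataset, joined on the shared state value
  ?r2 <http://example.org/ds191/state> ?state ;
      <http://example.org/ds191/amount> ?releaseAmount .
}
LIMIT 10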

The second step, which is ongoing right now, is to improve data quality by cleaning the data and enriching our semantics. We are improving the human (and machine) versions of our data.gov catalog (http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog), which is important for non-government people to use our data. For example:

1. The catalog aggregates metadata from data.gov (http://www.data.gov/details/92) and metadata about our curated RDF version. The aggregated metadata is published in both human-readable and machine-readable (RDF) forms.

2. Every dataset has a dereferenceable URI for itself, and links to the raw data, to linked RDF datasets in chunks small enough for linked data browsers such as Tabulator, and to the complete converted RDF data documents.

3. We use the definitions from data.gov (their dataset 92 metadata dictionary, as it were) for the metadata of each file, but we also add some DC and FOAF terms and a couple of small home-brewed ones (like the number of triples) in an ontology called DGTWC (see the sketches after this list).

4. We are now also linking to more online "enhanced" datasets that include (we've only done it for a few so far) normalized triples extracted from the raw data and links from entities (such as locations, organizations, and persons) in government datasets to the linked data cloud (DBpedia and GeoNames so far, with much more coming soon). We are also exploring the use of VoID for adding more data descriptions (and IRIs) to the datasets.

5. We are also working on linking the datasets by common properties. This is harder than most people think, because you cannot just assume that the same name means the same relation: properties with the same name can have different ranges, values, or even semantics (and we have found examples of all of the above). So soon, for each property, you'll find something like this:
geo:lat a rdf:Property .
191:latitude rdfs:subPropertyOf geo:lat .
398:latitude rdfs:subPropertyOf geo:lat .

and we have a Semantic Wiki page for each property, so you can find all the subproperty relations and, eventually, a place where people can add information about what some of the more obscure properties mean, or where semantic relations such as owl:sameAs can be introduced when these subproperties are known to be the same.
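To make items 3 through 5 concrete, here is a rough Turtle sketch of the kind of catalog metadata and entity links described above. The exact DGTWC terms and dataset namespaces differ; the dgtwc: and ds191: URIs below are placeholders for illustration:

@prefix dc:    <http://purl.org/dc/terms/> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix dgtwc: <http://example.org/dgtwc#> .   # placeholder for the DGTWC namespace
@prefix ds191: <http://example.org/ds191#> .   # placeholder for dataset 191's namespace

# items 1 and 3: aggregated metadata mixing DC, FOAF, and home-brewed terms
<http://example.org/catalog/dataset/191>
    dc:title "2005 Toxics Release Inventory National data file of all US States and Territories" ;
    foaf:page <http://www.data.gov/details/191> ;
    dgtwc:number_of_triples 1000000 .          # home-brewed term; value illustrative

# item 4: linking an extracted entity to the linked data cloud
ds191:California owl:sameAs <http://dbpedia.org/resource/California> .

And with the subproperty assertions from item 5 in place, one query can pull latitude values out of both datasets at once. Without RDFS inference at the endpoint, an explicit UNION achieves the same effect, as in this sketch:

PREFIX ds191: <http://example.org/ds191#>   # placeholder namespaces, as above
PREFIX ds398: <http://example.org/ds398#>

SELECT ?s ?lat
WHERE {
  { ?s ds191:latitude ?lat }
  UNION
  { ?s ds398:latitude ?lat }
}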

So to summarize: our first step, which we continue to pursue, is to transform data in ways that let other people start to add value. Our second goal, which we're now working on, is enhanced metadata and adding more semantics, including what is needed for more linking.

We’re also, in our research role, working on next generation technologies for really doing more exciting things (think “read write web of data”) but we’re trying to keep that separate from the work at http://data-gov.tw.rpi.edu, which is aimed at helping to show that Semantic Web technologies are really the only game in town for the sort of large-scale, open, distributed data use that is needed for linked data to really take off.

And if you feel there is stuff missing, let us know (via "contact us"), or even better: all our stuff is open (see http://code.google.com/p/data-gov-wiki/), free, and easily used. All we ask is that you do great stuff and help make the Web of data grow.

Jim Hendler, Li Ding, and the RPI data-gov research team.


Probing the SPARQL endpoint of data.gov.uk

October 23rd, 2009

We just ran across the preview SPARQL endpoint for the UK's data.gov.uk (powered by Talis), following Harry Metcalfe's blog. To understand what data is hosted by the triple store, we used a series of SPARQL queries to probe the content of data.gov.uk. We leveraged the web service http://data-gov.tw.rpi.edu/ws/sparqlproxy.php to convert SPARQL/XML results into HTML and JSON.
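A call to the proxy takes roughly the following shape; the parameter names shown are assumptions for illustration, so consult the service page for the actual interface:

http://data-gov.tw.rpi.edu/ws/sparqlproxy.php?query=URL_ENCODED_SPARQL&service-uri=ENDPOINT_URI&output=json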

First, let’s do some warm up exercises

Q: show me some triples!

SPARQL:

SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 5

Result:

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://www.gazettes-online.co.uk/ontology#hasPublicationDate> "2007-05-02"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://xmlns.com/foaf/0.1/page> <http://www.london-gazette.co.uk/issues/58316/pages/6359> .

<http://www.london-gazette.co.uk/id/issues/58316/notices/240663> <http://purl.org/dc/terms/modified> "2007-05-02"^^<http://www.w3.org/2001/XMLSchema#date> .

A: OK, it has a gazette dataset about London (see http://www.london-gazette.co.uk/), and it uses the FOAF and DC vocabularies.

Q: show me some classes that have instances!

SPARQL:

SELECT DISTINCT ?c WHERE { [] a ?c. } LIMIT 5

Result:

<http://www.gazettes-online.co.uk/ontology/transport#RoadTrafficActsNotice>

<http://www.gazettes-online.co.uk/ontology#Notice>

<http://xmlns.com/foaf/0.1/Document>

<http://www.gazettes-online.co.uk/ontology#Issue>

<http://www.gazettes-online.co.uk/ontology#Edition>

A: same observation as above.

Q: Does it host any named graphs?

SPARQL:

SELECT ?g WHERE {GRAPH ?g { ?s ?p ?o } } LIMIT 10

Result: "0"^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graphs found; there is only one big default graph.

Now let’s run several expensive aggregation queries (note aggregation queries are not part of the current SPARQL specification)

Q: How many triples?

SPARQL:

SELECT count(*) WHERE {?s ?p ?o}

Result: "5529380"^^<http://www.w3.org/2001/XMLSchema#integer>

A: Alright! Aggregation queries are supported, and there are about 5.5 million triples. Note that "count" is a non-standard aggregation function; it may be supported differently by different SPARQL endpoints.

Q: How many graphs (and the number of triples in each graph)?

SPARQL:

SELECT ?g count(*) WHERE {GRAPH ?g { ?s ?p ?o } }

Result: "0"^^<http://www.w3.org/2001/XMLSchema#integer>

A: no named graph.

Q: How many populated classes?

SPARQL:

SELECT count(distinct ?c) WHERE {[] a ?c}

Result: "99"^^<http://www.w3.org/2001/XMLSchema#integer>

A: there are 99 different classes having direct instances in this triple store.

Q: How many populated properties?

SPARQL:

SELECT count(distinct ?p) WHERE {[] ?p ?o}

Result: "86"^^<http://www.w3.org/2001/XMLSchema#integer>

A: There are 86 unique properties used as predicates in this dataset, so on average each property is used by about 64K (5,529,380/86) triples. There must be some very popular properties; we will do that survey below.

Q: How many typed individuals?

SPARQL:

SELECT count(distinct ?s) WHERE {?s a ?c}

Result: 504 Gateway Time-out

A: Oops, that was really expensive. Let's try something else.

Q: How many defined classes?

SPARQL:

SELECT count(distinct ?s) WHERE {{?s a <http://www.w3.org/2002/07/owl#Class> } UNION {?s a <http://www.w3.org/2000/01/rdf-schema#Class>}}

Result: 0

A: no classes are defined, and the triple store is full of individuals.

Q: How many individuals (again)?

SPARQL:

SELECT count(*) WHERE {[] a ?c}

Result: 995694

A: There are nearly 1 million typed individuals, so we can easily see that each individual has about 5.5 (5.5M/1M) triples on average.

Now, let’s do some knowledge discovery

Q: Show me the 3 most/least used classes in this dataset?

SPARQL:

1) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by desc(?count) limit 3

2) select ?c ( count(?s) AS ?count ) where {?s a ?c} group by ?c order by ?count limit 3

Result:

1) most used classes

<http://www.gazettes-online.co.uk/ontology#Notice> "156452"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/2006/vcard/ns#Address> "106934"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/person#Person> "87798"^^<http://www.w3.org/2001/XMLSchema#integer>

2) least used classes

<http://www.gazettes-online.co.uk/ontology/transport#CycleTracksNotice> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/transport#PortsNotice> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#AdministrationOrder> "2"^^<http://www.w3.org/2001/XMLSchema#integer>

A: Again, this SPARQL query is "safe" because Talis supports these SPARQL extensions (cf. http://n2.talis.com/wiki/SPARQL_Extensions). The class with the most instances in this dataset is http://www.gazettes-online.co.uk/ontology#Notice (156,452 instances).

Q: what about property usage (top 5)?

SPARQL:

1) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by DESC(?count) limit 5

2) select ?p ( count(?s) AS ?count ) where {?s ?p ?o} group by ?p order by ?count limit 5

Result:

1) 5 most used properties:

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "995694"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://purl.org/dc/dcam/memberOf> "312476"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "310170"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasPublicationDate> "181940"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasNoticeCode> "181335"^^<http://www.w3.org/2001/XMLSchema#integer>

2) 5 least used properties:

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#hasAuthorisingOrganisation> "2"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/personal-legal#isForNextOfKinOf> "2"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology/corp-insolvency#orderAdministrator> "3"^^<http://www.w3.org/2001/XMLSchema#integer>

<http://www.gazettes-online.co.uk/ontology#authorisingOrganisation> "49"^^<http://www.w3.org/2001/XMLSchema#integer>

A: Some properties are rarely used (e.g., http://www.gazettes-online.co.uk/ontology/corp-insolvency#dateAdministrationOrderMade has been used only once) while others are heavily used (like rdf:type, which is the most frequently used predicate).

Conclusion

We need to stop probing now. A number of complex queries ended up with a timeout error because (i) "LIMIT" only controls the final results, so we cannot compute statistics over just 1,000 triples; (ii) "GROUP BY" may produce too many intermediate results; (iii) the statistical queries do not leverage the triple store's index structures, or the index structures are not designed for handling such queries; and (iv) many other issues.
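To illustrate point (i): in a query like the sketch below, the LIMIT is applied only to the final result rows, after the whole store has been scanned and grouped, so it does not reduce the work the endpoint must do:

SELECT ?p ( count(?s) AS ?count )
WHERE { ?s ?p ?o }   # still evaluated over all 5.5M triples
GROUP BY ?p
LIMIT 1000           # trims only the final result rows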


Presented by Li Ding and Zhengning Shangguan


Current Issues in data.gov

July 31st, 2009

While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:

  • Duplicated Datasets – Some datasets are subsets of other datasets, e.g. Dataset 140 (2005 Toxics Release Inventory data for the state of California (Environmental Protection Agency)) is a subset of Dataset 191 (2005 Toxics Release Inventory National data file of all US States and Territories (Environmental Protection Agency)).
  • Formatting Issues – The format of some datasets is not friendly to machine processing. Not all datasets offer data in CSV format, and parsing tabular data from the others requires non-trivial effort. Example: Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)). Some websites, meanwhile, have no data at all: Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)), for example, tells you how to order data from the government.
    [Screenshot: the text file from Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases), US Bureau of Reclamation]

  • Access Point Issues – The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where only partial data can be returned by a web-based query. Examples: Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)) and Dataset 96 (National Water Information System (NWIS) (US Geological Survey)).

    [Screenshot: the query interface for accessing Dataset 330 (Local Area Unemployment Statistics), US Bureau of Labor Statistics]

For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .

Sarah Magidson, Li Ding, Dominic DiFranzo, and Jim Hendler
