This blog post is written in response to some questions asked of late about our work on turning the datasets from http://data.gov (and some other government datasets) into linked-data formats (RDF) and making them available to the community. In essence, the criticism has been that although we’ve made a good start, there’s still a lot more to do. We couldn’t agree more.
Our http://data-gov.tw.rpi.edu Wiki has been made available to help the public build, access, and reuse linked government data. To get the data out quickly, we took some simple steps to start, showing how powerful it could be just to republish the data in RDF. However, we are now also working to bring in more linking between the data, more datasets from other government data sites, and more semantics relating the datasets.
Our first step was to release a raw RDF version of the data.gov datasets and to build some quick demos showing how easy that was (see http://data-gov.tw.rpi.edu/wiki/Demos). The benefit was that we could easily dump data from other formats into RDF, merge it in simple ways, query it using SPARQL, and put the results right into visualizations using “off the shelf” Web tools such as Google Visualization and MIT’s Exhibit. In this step we followed a “minimalism” principle: minimize human and development effort, and keep the process simple and replicable. Thus we did not try to do a lot of analysis of the data, did not add triples for things such as provenance, and did not link the datasets directly. Rather, the linking in our early demos came from obvious connections such as shared locations or temporal-spatial overlaps.
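To give a flavor of how simple those demos were, here is a minimal SPARQL sketch of the kind of query that drives them. The d92: prefix and its property names below are illustrative stand-ins, not the actual vocabulary our converter produces:

    # Illustrative only: the d92: prefix and property names are hypothetical.
    PREFIX d92: <http://example.org/vocab/dataset92#>
    SELECT ?state ?amount
    WHERE {
      ?entry d92:state ?state ;
             d92:amount ?amount .
    }
    ORDER BY DESC(?amount)

A result table like this can be handed more or less directly to Google Visualization or Exhibit, which is what made the minimalist approach so quick to demo.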
The second step, which is ongoing right now, is to improve data quality by cleaning the data and enriching our semantics. We are improving the human (and machine) versions of the data.gov catalog (http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog), which is important for helping people outside government use the data. For example:
1. The catalog aggregates metadata from data.gov (http://www.data.gov/details/92) and metadata about our curated RDF version. The aggregated metadata is published in both human-readable and machine-readable (RDF) forms.
2. Every dataset has a dereferenceable URI of its own, with links to the raw data, to linked RDF data in chunks small enough for linked-data browsers such as Tabulator, and to the complete converted RDF documents.
3. We use the definitions from data.gov (their “dataset 92” metadata dictionary, as it were) for the metadata of each file, but we also add some DC and FOAF terms plus a couple of small home-brews (like the number of triples) in an ontology called DGTWC (a sketch of a catalog entry appears after this list).
4. We are also linking to more online “enhanced” datasets (we’ve only done a few so far) that include normalized triples extracted from the raw data and links from entities (such as locations, organizations, and persons) in government datasets to the linked data cloud (DBpedia and GeoNames so far, with much more coming soon). We are also exploring the use of VoID for adding richer dataset descriptions (and IRIs); see the VoID sketch after this list.
5. We are also working on linking the datasets by common properties. This is harder than most people think, because you cannot simply assume that the same name means the same relation: identically named properties can have different ranges, values, or even semantics (and we have found examples of all of the above). So soon, for each property, you will find something like this:
geo:lat a rdf:Property .
191:latitude rdfs:subPropertyOf geo:lat .
398:latitude rdfs:subPropertyOf geo:lat .
We also have a Semantic Wiki page for each property, where you can find all the subproperty relations and, eventually, where people will be able to add information about what some of the more obscure properties mean, or introduce semantic relations such as “owl:sameAs” when subproperties are known to be the same (see the example below).
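To make items 1–3 above concrete, here is a rough Turtle sketch of what a catalog entry might look like. The dgtwc: namespace URI, the dataset URI, the foaf page, and the triple count are illustrative placeholders, not the actual DGTWC terms:

    @prefix dc:    <http://purl.org/dc/terms/> .
    @prefix foaf:  <http://xmlns.com/foaf/0.1/> .
    @prefix dgtwc: <http://example.org/dgtwc#> .      # hypothetical namespace

    <http://example.org/dataset/92>                   # placeholder dataset URI
        dc:title  "Dataset 92" ;
        dc:source <http://www.data.gov/details/92> ;
        foaf:page <http://example.org/dataset/92.html> ;
        dgtwc:number_of_triples 1000000 .             # home-brew term; count is illustrative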
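For item 4, VoID gives us a vocabulary for describing a dataset and its links into the cloud. A description of an enhanced dataset and its DBpedia links might look something like this (again, all the URIs here are placeholders):

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    <http://example.org/dataset/92-enhanced> a void:Dataset .

    <http://example.org/linkset/92-dbpedia> a void:Linkset ;
        void:subjectsTarget <http://example.org/dataset/92-enhanced> ;
        void:objectsTarget  <http://example.org/dataset/dbpedia> ;
        void:linkPredicate  owl:sameAs .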
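Finally, for item 5: once someone has verified that two of those latitude subproperties really do mean the same thing, a single extra triple on the wiki can record it (using the same dataset prefixes as in the example above):

    191:latitude owl:sameAs 398:latitude .

Until such an assertion is made, the weaker rdfs:subPropertyOf links already let a query over geo:lat pick up both properties, without committing to their being identical.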
So, to summarize: our first step, which continues, is to transform the data in ways that let other people start to add value. Our second goal, which we’re now working on, is enhanced metadata and more semantics, including what is needed for more linking.
We’re also, in our research role, working on next generation technologies for really doing more exciting things (think “read write web of data”) but we’re trying to keep that separate from the work at http://data-gov.tw.rpi.edu, which is aimed at helping to show that Semantic Web technologies are really the only game in town for the sort of large-scale, open, distributed data use that is needed for linked data to really take off.
And if you feel there is stuff missing, let us know (via “contact us”). Or, even better, do it yourself: all our stuff is open (see http://code.google.com/p/data-gov-wiki/), free, and easy to use. All we ask is that you do great stuff and help make the Web of data grow.
Jim Hendler, Li Ding, and the RPI data-gov research team.