Archive for July, 2009

Current Issues in data.gov

July 31st, 2009

While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:

  • Duplicated Datasets – Some datasets are subsets of other datasets, e.g. Dataset 140 (2005 Toxics Release Inventory data for the state of California (Environmental Protection Agency)) is a subset of Dataset 191 (2005 Toxics Release Inventory National data file of all US States and Territories (Environmental Protection Agency)).
  • Formatting Issues – The format of some datasets is not friendly to machine processing. Not all datasets offer CSV data, and parsing tabular data from the other formats requires non-trivial effort. Example: Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)). Some access points, meanwhile, offer no data at all: Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)), for example, only tells you how to order data from the government.

    screen shot of the text file from dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases) by US Bureau of Reclamation

  • Access Point Issues – The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where a web-based query returns only part of the data. Examples: Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)) and Dataset 96 (National Water Information System (NWIS) (US Geological Survey)).

    screen shot of the query interface for accessing dataset 330 (Local Area Unemployment Statistics) by US Bureau of Labor Statistics
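
The "Duplicated Datasets" issue can be detected mechanically once two datasets are available as CSV. The following is only a minimal sketch of one way to do it, not the procedure we actually used: the column names and rows are made up for illustration, and real Toxics Release Inventory files have many more columns.

```python
import csv
import io


def rows(csv_text):
    """Return the set of data rows in a CSV text, ignoring row order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(sorted(row.items())) for row in reader}


def is_subset(candidate_csv, full_csv):
    """True if every row of candidate_csv also appears in full_csv."""
    return rows(candidate_csv) <= rows(full_csv)


# Hypothetical sample data standing in for a national file and a state file.
national = """facility,state,chemical,pounds
A,CA,Lead,10
B,NY,Mercury,5
C,CA,Benzene,7
"""

california = """facility,state,chemical,pounds
A,CA,Lead,10
C,CA,Benzene,7
"""

print(is_subset(california, national))
```

An exact row comparison like this only works when both files share the same columns and value formatting. Otherwise the rows would need to be normalized first.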

For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .

Sarah Magidson, Li Ding, Dominic DiFranzo, and Jim Hendler

Categories: linked data

Data.gov Datasets Translated in RDF!

July 22nd, 2009

We have created 16 RDF datasets covering 187 of the datasets published at data.gov (171 EPA datasets are subsets of three larger EPA datasets). The original datasets were published by the EPA, the US Census Bureau, the USGS and the Office of Management and Budget in CSV-compatible formats, and they contributed 13,532,250 table entries. The translated RDF datasets include a total of 2,927,398,352 triples involving 2,526 properties.
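
The figures above can be related to each other with a little arithmetic. This sketch reads the numbers one plausible way (the 171 subset datasets are not translated separately, which leaves 16); the 2.11 billion figure for the largest dataset is the rounded value quoted in this post.

```python
covered_datasets = 187          # data.gov datasets covered
subset_datasets = 171           # EPA datasets that are subsets of larger ones
table_entries = 13_532_250      # table entries in the original data
total_triples = 2_927_398_352   # triples in the translated RDF
largest_triples = 2.11e9        # rounded size of the largest RDF dataset

# Datasets left once the subsets are folded into their parent datasets.
distinct_datasets = covered_datasets - subset_datasets

# Average number of triples produced per table entry (about 216).
triples_per_entry = total_triples / table_entries

# Share of all triples that comes from the largest dataset (about 72%).
largest_share = largest_triples / total_triples

print(distinct_datasets)
print(round(triples_per_entry, 1))
print(f"{largest_share:.0%}")
```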

We publish the RDF data in two alternative ways: (i) a collection of linked partition files in RDF/XML, so that users can browse the dataset and dereference the URIs using semantic web browsers, and (ii) one big N-Triples file (data.nt) concatenating the partition files, so that machines, especially triple stores, can download and import it. The largest dataset is Dataset_91, which contributed 2.11 billion triples.
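
Because N-Triples puts one triple on each line, the big file can be processed as a stream without loading it into memory. The following is a minimal sketch that counts triples and distinct properties in a gzipped N-Triples stream. The sample triples and the example.org URIs are made up, and the whitespace-based split is a shortcut rather than a full N-Triples parser.

```python
import gzip
import io


def summarize_ntriples(lines):
    """Count triples and collect distinct predicates from N-Triples lines."""
    triples = 0
    predicates = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace, so two splits
        # are enough; the object (possibly a literal with spaces) stays whole.
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # not a complete triple
        triples += 1
        predicates.add(parts[1])
    return triples, predicates


# Hypothetical sample standing in for a downloaded .nt.gz file.
sample = (
    '<http://example.org/entry1> <http://example.org/vocab/state> "CA" .\n'
    '<http://example.org/entry1> <http://example.org/vocab/chemical> "Lead compounds" .\n'
    '<http://example.org/entry2> <http://example.org/vocab/state> "NY" .\n'
)
compressed = gzip.compress(sample.encode("utf-8"))

# gzip.open accepts a file object as well as a filename.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    count, properties = summarize_ntriples(handle)

print(count, len(properties))
```

For a real download, `io.BytesIO(compressed)` would be replaced by the path of the local .nt.gz file.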

To access the RDF datasets, users may go to Data.gov_Catalog, which offers the following options:

  • follow links in the “rdf(index file)” column to access the index file in RDF/XML, which contains the property list, statistics, and links of the RDF dataset, e.g. http://data-gov.tw.rpi.edu/raw/401/index.rdf
  • follow links in the “rdf(partition files)” column to start an RDF browser (e.g. tabulator) to surf the RDF/XML partition files, e.g. http://data-gov.tw.rpi.edu/raw/401/link00001.rdf
  • follow links in the “rdf(complete file)” column to download the complete RDF dataset in N-Triples format (gzipped), e.g. http://data-gov.tw.rpi.edu/raw/401/data-401.nt.gz
  • follow links in the “url(data.gov)” column to see the original metadata at data.gov
  • follow links in the “wiki page” column to see enhanced metadata about data.gov datasets
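
For scripted access, the three example URLs above suggest a simple naming pattern. The sketch below builds the URLs for a dataset from that pattern. Note that the pattern is inferred only from the Dataset 401 examples and is an assumption; the links in the catalog remain the authoritative source, and `dataset_urls` is a hypothetical helper, not part of the site.

```python
BASE = "http://data-gov.tw.rpi.edu/raw"


def dataset_urls(dataset_id, partition=1):
    """Build RDF file URLs for a dataset, assuming the Dataset 401 pattern."""
    root = f"{BASE}/{dataset_id}"
    return {
        # Index file in RDF/XML: property list, statistics, links.
        "index": f"{root}/index.rdf",
        # One RDF/XML partition file; the number is zero-padded to 5 digits.
        "partition": f"{root}/link{partition:05d}.rdf",
        # Complete dataset as gzipped N-Triples.
        "complete": f"{root}/data-{dataset_id}.nt.gz",
    }


for name, url in dataset_urls(401).items():
    print(name, url)
```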

More datasets are coming, so please stay tuned and come back to http://data-gov.tw.rpi.edu/.

Li Ding, Dominic DiFranzo, Sarah Magidson, and Jim Hendler

Categories: linked data, Semantic Web

Tilting at the NSF windmill

July 13th, 2009

Colleagues – one of my blog entries at Nature seems to have hit a nerve. It has been zinging around the “twittersphere”, and I’ve received a number of private responses not just commiserating, but agreeing with the major points. I want to make it clear that this is solely my own opinion, and it has not been carefully researched, but given that so many US Semantic Web researchers have shared the frustration that I express here, I thought I’d share it on planetRDF as well. (Europeans, believe it or not, on this side of the ocean it is hard to get funding for Semantic Web research. You have no idea how lucky you are!)

-Jim H

From the blog entry “Why NSF cannot fund high-risk, high-reward research”:

I just got turned down for a grant. That’s nothing new, you win some and you lose some, and every senior professor has gotten used to that over time. This time, however, I cannot find it in myself to just say “oh well” and let it go at that. This time, I think I need to go public, because I think what happened shows an endemic problem with the US National Science Foundation and, I hope, points out some things they could do to fix it.

Click here for the blog entry at Nature.com

Categories: personal ramblings, Semantic Web, Web Science