While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:
- Duplicated Datasets – Some datasets are subsets of other datasets, e.g. Dataset 140 (2005 Toxics Release Inventory data for the state of California (Environmental Protection Agency)) is a subset of Dataset 191 (2005 Toxics Release Inventory National data file of all US States and Territories (Environmental Protection Agency)).
- Formatting Issues – Some datasets are published in formats that are not friendly to machine processing. Not all datasets are offered as CSV, and extracting tabular data from the other formats requires non-trivial effort. Example: Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)). Some websites, meanwhile, have no data at all: Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)), for example, only explains how to order the data from the government.
- Access Point Issues – The access points for some datasets do not lead to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where a web-based query returns only part of the data. Examples: Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)) and Dataset 96 (National Water Information System (NWIS) (US Geological Survey)).
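To illustrate the duplication issue, a subset relationship like the one between Dataset 140 and Dataset 191 could in principle be checked mechanically once both datasets are available as CSV. The following is only a minimal sketch and not necessarily how the issue was found: the function names and the sample rows are hypothetical, and it assumes both files use the same header and the same value formatting.

```python
import csv
import io


def rows_of(csv_text):
    """Return (header, set of data rows) parsed from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        # Empty input: no header and no rows.
        return (), set()
    return tuple(header), {tuple(row) for row in reader}


def is_row_subset(candidate_csv, full_csv):
    """True if both files share a header and every candidate row is in the full file."""
    cand_header, cand_rows = rows_of(candidate_csv)
    full_header, full_rows = rows_of(full_csv)
    return cand_header == full_header and cand_rows <= full_rows


# Made-up sample rows, standing in for a national file and a state extract.
national = "state,facility,chemical\nCA,Plant A,Lead\nNY,Plant B,Mercury\n"
california = "state,facility,chemical\nCA,Plant A,Lead\n"

print(is_row_subset(california, national))
```

An exact row comparison like this is deliberately strict: differences in column order, whitespace, or number formatting would hide a real subset relationship, so real data would likely need normalization first.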
For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .
Sarah Magidson, Li Ding, Dominic DiFranzo, and Jim Hendler