Tw:Hackathon09
From Semantic Portal Wiki
What is it
A hackathon is a "coding marathon": a full two-day event where we create applications. This event is sponsored by the Sunlight Foundation's Sunlight Labs.
Where and when
The hackathon will be on the weekend of December 12th to 13th. We will host the event at the Winslow building (I assume we can ask permission to allow other people to enter during weekends).
Participants
- Alvaro Graves
- Dominic DiFranzo
- Tim Lebo
Project ideas
Linking data-gov data to Linked Data Cloud: There are several datasets (such as geonames and dbpedia) that can be linked from data-gov data. The project is to link the RDF datasets.
Meeting time
Sat
10:00
What does TW data-gov have?
described at http://data-gov.tw.rpi.edu/wiki/Generating_RDF_from_data.gov
- http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog shows listing
- 1MB per file
- each dataset has a little meta index.rdf that describes the dataset (num triples, num properties, data.gov url)
- "linked" index file pointing to all fragments.
LOD
- geonames
What things in TW data-gov to link:
- States
RDFized recovery.gov data
new triple store - sam moving away from plato
10:30
google takes address and gives lat/long [1]
Example: Dataset 770 - http://data-gov.tw.rpi.edu/wiki/Dataset_770 shows metadata.
- infobox has info from dataset 92 AND from index.rdf (which is created after conversion)
- http://www.data.gov/details/92 -> http://data-gov.tw.rpi.edu/vocab/Dataset_92
- anything with 92 prepended came from the .csv
- http://www.data.gov/download/92/csv row 752, col C =
- can get RDFFeed
- 770 uses 774 namespace.
- csv of all datasets, what namespaces to use (Li)
TODO: how to know which datasets share same properties? (because they get merged)
TODO: for the page of a property, show all datasets that use it.
TODO: how to query for all datasets that TW hosts?
"link" files are the directory structure pointing to all of the 1 MB fragments for RDF.
TODO: spend some time making tools to help a developer find a "good" property to link to the cloud, instead of relying on previous experience and/or luck to stumble into something that would work.
- Li has a tool that, on a property's wiki page, lists datasets that exhibit similarly-named properties (after the xx/ prefix).
- one place that it does it, but not all places: http://data-gov.tw.rpi.edu/wiki/Property:92/data_gov_data_category_type
11:15
coffee!
12:45
Looking at the "other half": DBPedia, Freebase, geonames, US census data. recovery.gov has congressional district.
host for tw's rdfization http://data-gov.tw.rpi.edu/joseki/sparql/tdb-datagov
zip codes for congressional districts in freebase: http://www.freebase.com/app/queryeditor?q=%5B%7B%20%22type%22%3A%20%22%2Fuser%2Frobert%2Fus_congress%2Fcongressional_district%22%2C%20%22name%22%3A%20null%2C%20%22id%22%3A%20null%2C%20%22district_number%22%3A%20null%20%7D%5D
http://docs.jquery.com/Main_Page has reasonable documentation for jQuery (but not if you're trying to use cross-domain ajax)
13:45
Li discussions.
todo: for zip code, show all info about it
primary joins:
- geolocation (lat/long, address)
- government agency
- literals, e.g. "AK"
- geo and timeline
- recovery?
- state has a value over multiple years. (what is trend)
- library data set
knowing what's there.
objective: page to accept dataset URI and provide a list of predicates and a few distinct values. extension would be to query sparql endpoint for all named graphs and provide list for all of them. The idea is to allow a user to find the right predicates to link to LOD.
5:00
unable to get SPARQL query responses using jQuery's get, getJSON, or ajax functions. Functions would work for a canned flickr URL. Perhaps a jsonp callback issue? perhaps a mime issue?
http://data-gov.tw.rpi.edu/joseki/sparql/tdb-datagov was impossibly slow and eventually returned empty page.
will try to ask http://docs.jquery.com/Discussion about issue.
Sun
10:00
Alvaro explained the script/php workaround for the Cross-Host Restriction on XMLHttpRequest (XHR). Is this what JSONP does?
[2] recommended http://www.ibm.com/developerworks/library/wa-aj-jsonp1/, which Tim is reading.
10:15
Discussed jQuery.
11:00
Tim finally got JSONP to work. The IBM article above cites http://www.geonames.org/postalCodeLookupJSON?postalcode=10504&country=US&callback=? as an example. Alvaro suggests "HTTP Fox", a Firefox extension. Firebug has http://getfirebug.com/wiki/index.php/Command_Line_API.
11:30
Tim cleaning JSONP call in javascript using {} params (for readability) http://twitter.com/datagovwiki
12:30
Tim: is there any good URI prefix maintenance?
Li: Richard Cyganiak's prefix.cc (ISWC)
Li: http://swoogle.umbc.edu/ keeps track of the abbreviations that documents use for namespaces. Example for rdfs [3]
Tim: Can I get a reasonable QName from a URI that I don't recognize?
Li: use swoogle:
term search: uri:"http://xmlns.com/foaf/0.1/Person" in search results: RDF version
1:00
Can't use proxy to get distinct named graphs, must query endpoint directly.
15:15
Tim populating table for named graphs from sam. Event handler added to handle predicates request. Need to use http://docs.jquery.com/Tutorials:How_jQuery_Works#Callback_with_arguments
20:30
assert a range of a property
suggest freebase ID for object
21:00
Allow the declaration of a column in a particular data set as having a range of rdfs:Resource (instead of the presumed rdfs:Literal). When converting csv2rdf, mint a URI for the value by prepending a URI namespace and assert an rdfs:label with the original value. e.g.
http://data-gov.tw.rpi.edu/vocab/p/10002/agency_name rdfs:range rdfs:Resource .
http://data-gov.tw.rpi.edu/vocab/p/10002/entry44444 dg10002:agency_name "Corps of Engineers" .
==>
http://data-gov.tw.rpi.edu/vocab/p/10002/entry44444 dg10002:agency_name dg10002:Corps_of_Engineers .
dg10002:Corps_of_Engineers rdfs:label "Corps of Engineers" .
dg10002:Corps_of_Engineers rdfs:seeAlso twwiki:Corps_of_Engineers .
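The promotion from literal to resource can be sketched in Javascript. This is a minimal sketch and not the csv2rdf converter itself: the function name promoteLiteral and its local-name rule (runs of non-alphanumeric characters become underscores) are assumptions for illustration, as is reading the dg10002: prefix as http://data-gov.tw.rpi.edu/vocab/p/10002/.

```javascript
var RDFS_LABEL = 'http://www.w3.org/2000/01/rdf-schema#label';

// Hypothetical helper: replace a literal cell value with a minted URI
// and keep the original string as an rdfs:label on that URI.
// The local-name rule is an assumption, not the converter's actual rule.
function promoteLiteral(namespace, subject, predicate, value) {
  var localName = value.trim().replace(/[^A-Za-z0-9]+/g, '_');
  var minted = namespace + localName;
  return [
    '<' + subject + '> <' + predicate + '> <' + minted + '> .',
    // JSON.stringify is used as a simple stand-in for literal quoting.
    '<' + minted + '> <' + RDFS_LABEL + '> ' + JSON.stringify(value) + ' .'
  ];
}

var ns = 'http://data-gov.tw.rpi.edu/vocab/p/10002/';
promoteLiteral(ns, ns + 'entry44444', ns + 'agency_name', 'Corps of Engineers')
  .forEach(function (line) { console.log(line); });
```

The rdfs:seeAlso link to the wiki is left out because it needs a lookup that a string rule cannot provide.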
Blog entry
Tim
This weekend we participated in the Great American Hackathon. We started Saturday by reviewing the work that Tetherless World (TW) has done over the past few months to convert the raw Data.gov data into the RDF format. The process they used is described at http://data-gov.tw.rpi.edu/wiki/Generating_RDF_from_data.gov and the results are listed at http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog.
Linked Data
We set out to link this relatively new data to elements already established in the Linked Data cloud. The Linked Data cloud, described at http://linkeddata.org/, is a variety of data sources maintained by independent individuals and organizations. A few examples include publication sites such as ACM, IEEE, PubMed, and CiteSeer. More general data sources include Freebase, Open Calais, and DBPedia (an RDF version of Wikipedia), while for this hackathon we were particularly interested in Geonames, US Census Data, and Gov-track.
The disparate data sets in the Linked Data cloud can be joined together because their maintainers adhere to a few basic principles. First, they use URIs to name their data elements (think, "web address"). Next, they provide web services that return information about a data element when its URI is "dereferenced". Dereferencing a URI is much like typing it into a web browser and trying to access a web page using HTTP. Finally, when returning information about a URI, they include pointers to data elements in other Linked Data sites. The more associations that are drawn among data sets, the more structured information you can find out about a topic of interest. Many data sites also provide a query interface so that you can get the subset of information that you want with less one-by-one "crawling".
TW's data.gov RDF
To link elements between TW's data.gov RDF and Linked Data, we needed to find appropriate starting points on both sides. As mentioned earlier, RDF versions of the data.gov data sets are listed at http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog. Each data set has a wiki page that provides some metadata. For example, http://data-gov.tw.rpi.edu/wiki/Dataset_10 is a wiki page that describes the data derived from http://www.data.gov/details/10. "Dgtwc:uses property" lists properties like "10/acothers" that were created from the columns of the original csv (the "10" prefix distinguishes the column from any column in another dataset that may share the same title). http://data-gov.tw.rpi.edu/vocab/Dataset_10 (note the "vocab") is the URI for the dataset, and the URIs of the other data sets follow the same pattern. Any descriptions of http://data-gov.tw.rpi.edu/vocab/Dataset_10 starting with "92" (like "92/category" = Energy and Utilities) come from http://www.data.gov/details/92, while the rest (like "Dgtwc:number of entries") are created during the RDF conversion process. You can grab the RDF data as one big gzip (http://data-gov.tw.rpi.edu/raw/10/data-10.nt.gz) or as a series of 1MB chunks listed in a "link" file (http://data-gov.tw.rpi.edu/raw/10/link00001.rdf).
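The addresses mentioned for dataset 10 can be collected in one small helper. This is only a sketch: the function name datasetLocations is made up, and it assumes that every dataset follows the same URL patterns as the dataset 10 examples.

```javascript
// Hypothetical helper: build the addresses quoted in the text for one dataset.
// The patterns are copied from the dataset 10 examples; that they hold for
// every dataset is an assumption.
function datasetLocations(id) {
  var base = 'http://data-gov.tw.rpi.edu';
  return {
    source:        'http://www.data.gov/details/' + id,            // original data.gov page
    wikiPage:      base + '/wiki/Dataset_' + id,                   // human-readable metadata
    uri:           base + '/vocab/Dataset_' + id,                  // URI naming the dataset
    dump:          base + '/raw/' + id + '/data-' + id + '.nt.gz', // one big gzip
    firstLinkFile: base + '/raw/' + id + '/link00001.rdf'          // index of the 1MB chunks
  };
}

console.log(datasetLocations(10).dump);
```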
Data sets in Linked Data
After having a look at the RDF version of the data.gov sets, we looked at "the other half."
Geonames (http://www.geonames.org/ontology/) provides a web service that offers "children", "neighbors", and "nearby" features through a web interface (if you know that 3017382 is France). You can also download all of their csv files at [4] (401MB with some duplicates), or an RDF version at [5] (290MB expands to 6.2GB of 13,807,686 lines, one xml document per line). They offer up an ontology for the RDF in two components (note, Full imports Lite). The ontology contains only a few classes, but a fair number of instances. The main owl:Class is geonames:Feature, whose members are wgs:SpatialThings like France that are described in the database dumps. geonames:Features are categorized using the geonames:featureCode and geonames:featureClass properties, where geonames:Codes are grouped by geonames:Classes. geonames:Features are also described with a geonames:postalCode property, which provides a good place to link with the data.gov RDF. The instances in the ontology are enumerations of potential geonames:Codes (e.g., http://www.geonames.org/ontology#S.BLDG - "a structure built for permanent use, as a house, factory, etc.") and geonames:Classes (e.g., http://www.geonames.org/ontology#S - "spot, building, farm, ...").
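The grouping of codes by classes is visible in the two example URIs: the code S.BLDG carries its class S before the dot. A minimal sketch of reading the class off a code URI follows; the function name featureClassOf is made up, and treating the text before the dot as the class is an assumption based only on these two examples.

```javascript
var GEONAMES_NS = 'http://www.geonames.org/ontology#';

// Hypothetical helper: derive the feature class URI from a feature code URI.
// Assumes the local name has the shape CLASS.CODE, as in S.BLDG.
function featureClassOf(codeUri) {
  if (codeUri.indexOf(GEONAMES_NS) !== 0) return null;   // not a geonames ontology URI
  var local = codeUri.slice(GEONAMES_NS.length);
  var dot = local.indexOf('.');
  return dot > 0 ? GEONAMES_NS + local.slice(0, dot) : null;
}

console.log(featureClassOf('http://www.geonames.org/ontology#S.BLDG'));
```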
The Linked Data, RDF version of the US Census Data is described very well at http://www.rdfabout.com/demo/census/.
Gov-track (http://www.govtrack.us/) has an experimental RDF version that you can download or query as a SPARQL endpoint. They describe it at http://www.govtrack.us/developers/rdf.xpd.
Freebase is a relatively large part of the Linked Data cloud and does a good job of reconciling with another large part of the cloud: DBPedia. So, linking to either of these is a good start for getting a "net effect." Freebase can be searched for a wealth of topics (e.g., "connecticut 3rd district"). Pages can be viewed in a "viewer mode" (e.g., http://www.freebase.com/view/en/connecticuts_3rd_congressional_district/) or an "edit mode" (http://www.freebase.com/edit/topic/en/connecticuts_3rd_congressional_district). You can get to the edit mode by clicking the orange "Edit and show details" button at the bottom of the view pages. There is a link for RDF describing the page's topic at the bottom of every view page, while edit mode additionally offers "explore mode" and JSON versions. Freebase accepts MQL queries via HTTP at http://www.freebase.com/api/service/mqlread, which is good when you know the query that you want to submit. Until then, you can use the Query Editor to develop the query that you want. It offers convenient tab-completion for the properties and entities as you type the query. Freebase Suggest (http://code.google.com/p/freebase-suggest/) can be used as a component in your own web pages to accept a query string from a user and return a Freebase entity. It uses jQuery and is easy to use if you are familiar with Javascript and jQuery.
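An MQL query is a JSON template in which null marks a value to be filled in. The sketch below rebuilds the Query Editor link recorded in Saturday's notes (congressional districts) from such a template. The function name queryEditorUrl is made up, and the q parameter is taken from that recorded link rather than from Freebase documentation.

```javascript
// The MQL template from the Saturday notes: ask for the name, id, and
// district number of every congressional district of this type.
var districtQuery = [{
  type: '/user/robert/us_congress/congressional_district',
  name: null,
  id: null,
  district_number: null
}];

// Hypothetical helper: encode an MQL template into a Query Editor link.
function queryEditorUrl(mql) {
  return 'http://www.freebase.com/app/queryeditor?q=' +
         encodeURIComponent(JSON.stringify(mql));
}

console.log(queryEditorUrl(districtQuery));
```

Opening the link in the Query Editor is a convenient way to refine the template before submitting it to mqlread.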
Survey-NG
An essential aspect of using and reusing Linked Data is being able to navigate it quickly to get a sense of what is available. One way to do this is with documentation, but interfaces to work with the data directly can be much more beneficial. We can see this with the variety of documentation and interfaces for the Linked Data sets that we just described. For the hackathon, we created Survey-NG, a "direct interface" to help developers familiarize themselves with the existing data.gov RDF that Tetherless World provides.
The page starts with listing SPARQL endpoints that can be queried:
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-1.png
Pressing the "not loaded" cell will query the SPARQL endpoint for some named graphs, each of which contains a data set. Here, we are requesting three.
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-2.png
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-3.png
Clicking on a blue named graph will list all properties that are used in the data set.
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-4.png
Clicking on a property will list up to three distinct values. Clicking on the same property will list an additional three distinct values for the property.
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-5.png
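One way to fetch three more distinct values per click is to page through the results with LIMIT and OFFSET. The query text Survey-NG actually sends is not shown in this post, so the builder below is a sketch: the name sampleValuesQuery and the paging scheme are assumptions, and the example URIs are placeholders.

```javascript
// Hypothetical query builder: the n-th click on a property asks for the
// n-th page of three distinct values from the chosen named graph.
function sampleValuesQuery(namedGraph, property, clicksSoFar) {
  var pageSize = 3;
  return 'select distinct ?o ' +
         'where { graph <' + namedGraph + '> { ?s <' + property + '> ?o . } } ' +
         'limit ' + pageSize + ' offset ' + (clicksSoFar * pageSize);
}

// First click (page 0), then a second click (page 1); placeholder URIs.
console.log(sampleValuesQuery('http://example.org/graph', 'http://example.org/prop', 0));
console.log(sampleValuesQuery('http://example.org/graph', 'http://example.org/prop', 1));
```

Without an ORDER BY clause an endpoint is free to return rows in a different order on each request, so pages could overlap; adding one makes the paging stable at some cost in query time.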
In the case illustrated above, there is only one distinct value. When any sample values are being shown, a Freebase Suggest box appears. Values can be copied and pasted into it to search Freebase for an existing entity. Pressing the "Show more sample values for all predicates." cell will list up to three distinct values for all properties in the data set. Pressing it again will provide three additional distinct values for each property. After requesting sample values for all properties, we can skim through to find candidates for linking.
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-6.png
http://data-gov.tw.rpi.edu/vocab/p/10002/agency_name and http://data-gov.tw.rpi.edu/vocab/p/10002/bureau_name look like good candidates, so we can request additional values for just these properties by clicking on them.
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-7.png
Copying the text "Corps of Engineers" and pasting it into the Freebase Suggest box above, we get a list of suggestions.
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-8.png
Scrolling to a suggestion will show additional information about it:
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-9.png
And selecting it will output some RDF associating the Freebase entity with the string label that you entered (Note, this is currently not working, as the Freebase label is appearing and not the value a user pasted in. We still need to handle some issue involving Javascript events/jQuery/Freebase Suggest/Freebase API).
http://www.rpi.edu/~lebot/survey-ng/img/survey-ng-10.png
This tool would now allow for a "join" between the data.gov RDF and Freebase, based on the string value in the data.gov RDF, the triples asserted by Survey-NG, and whatever Freebase has stored away.
@prefix dgtw: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf> .

# Survey-NG's current RDF output, which incorrectly uses Freebase's label and not the entered label.
<http://rdf.freebase.com/ns/en/united_states_army_corps_of_engineers> dgtw:joinLabel "United States Army Corps of Engineers" .

# The objective Survey-NG output, based on the current example. This would allow for linking between data.gov RDF and Freebase.
<http://rdf.freebase.com/ns/en/united_states_army_corps_of_engineers> dgtw:joinLabel "Corps of Engineers" .
A SPARQL query could then be used to gather Freebase's descriptions of the data.gov entities:
prefix dgtw: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf>
select ?fbProperty ?fbValue
where {
  # dg: Data.gov
  ?dgEntity ?property ?joinLabel .
  # fb: Freebase
  ?fbEntity dgtw:joinLabel ?joinLabel ;
            ?fbProperty ?fbValue .
}
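The join that the query expresses can also be spelled out over plain Javascript arrays, which may help when no single endpoint holds both sides. This is a sketch only: the function name joinOnLabel is made up, and the Freebase triple in the sample data is invented to show the shape of the result.

```javascript
// Hypothetical in-memory version of the join: match data.gov string values
// against joinLabel assertions, then collect what Freebase says about the
// matched entity.
function joinOnLabel(dgTriples, joinLabels, fbTriples) {
  var rows = [];
  dgTriples.forEach(function (dg) {
    joinLabels.forEach(function (jl) {
      if (jl.label !== dg.o) return;              // ?joinLabel must be equal on both sides
      fbTriples.forEach(function (fb) {
        if (fb.s === jl.entity) {
          rows.push({ dgEntity: dg.s, fbProperty: fb.p, fbValue: fb.o });
        }
      });
    });
  });
  return rows;
}

var fbCorps = 'http://rdf.freebase.com/ns/en/united_states_army_corps_of_engineers';
var rows = joinOnLabel(
  [{ s: 'http://data-gov.tw.rpi.edu/vocab/p/10002/entry44444',
     p: 'http://data-gov.tw.rpi.edu/vocab/p/10002/agency_name',
     o: 'Corps of Engineers' }],
  [{ entity: fbCorps, label: 'Corps of Engineers' }],
  [{ s: fbCorps, p: 'example:someProperty', o: 'example value' }]  // invented sample triple
);
console.log(rows);
```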
Cross-Host Restriction on XMLHttpRequest (XHR) and JSONP
Tim spent all of Saturday fighting Javascript's Cross-Host Restriction, and eventually got results from TW's SPARQL endpoint an hour into Sunday's session. Hopefully the code snippet below can help others in the future. It uses jQuery to submit the request; jQuery handles inserting the <script> element (with the request URL as its src) into the DOM. When the response is received, the function listed in the last argument is called (handleResponse_PredicatesInNG) with a single argument containing the responseData.
The trick is tucking +'?callback=?', into the URL. jQuery replaces the second question mark with its own value before sending the HTTP request.
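// ng is the URI of the named graph whose predicates we want.
var query = 'select distinct ?p ' +
            'where { ' +
            '  graph <' + ng + '> { ' +
            '    ?s ?p ?o . ' +
            '  } ' +
            '}';
// The 'callback=?' in the URL makes jQuery send the request as JSONP.
$.getJSON(endPointURL + '?callback=?',
          { 'query'  : query,
            'output' : 'json'
          },
          handleResponse_PredicatesInNG
);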
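To make the mechanics less magical, here is a rough sketch of the two ends of the exchange without jQuery. Both function names are made up. jsonpUrl shows approximately what the callback=? substitution produces, and predicatesFrom shows one way a handler such as handleResponse_PredicatesInNG could read the response, assuming the endpoint returns the standard SPARQL JSON results layout (the handler's real body is not shown in this post).

```javascript
// Approximately what a JSONP library does with 'callback=?': it puts the
// name of a generated global function in place of the second question mark
// and appends the encoded parameters.
function jsonpUrl(endPointURL, params, callbackName) {
  var pairs = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  });
  return endPointURL + '?callback=' + callbackName + '&' + pairs.join('&');
}

// Pull the values bound to ?p out of a SPARQL JSON result:
// { head: { vars: [...] }, results: { bindings: [ { p: { type, value } }, ... ] } }
function predicatesFrom(responseData) {
  return responseData.results.bindings.map(function (binding) {
    return binding.p.value;
  });
}

console.log(jsonpUrl('http://example.org/sparql',          // placeholder endpoint
                     { query: 'select ?p', output: 'json' },
                     'cb1'));
```

The server wraps its JSON in a call to the named function, so the browser runs the response as a script instead of refusing it as cross-host data.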
Conclusions
Survey-NG limitations:
- Requires data.gov data sets to be hosted on a SPARQL endpoint. TW only hosts a few of them. To get around this, a vocabulary could be created to describe the information that Survey-NG needs. The descriptions could then be computed a priori, when the data set is being created.
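Such a precomputed description could be as simple as a table from each predicate to a few distinct sample values, built while the triples are being generated. The sketch below shows that computation only; the function name summarize and the shape of its output are assumptions, and no vocabulary for publishing the result is proposed here.

```javascript
// Hypothetical summary step: for each predicate, keep the first few
// distinct object values seen while the data set is being converted.
function summarize(triples, maxSamples) {
  var summary = {};
  triples.forEach(function (t) {
    var samples = summary[t.p] || (summary[t.p] = []);
    if (samples.length < maxSamples && samples.indexOf(t.o) === -1) {
      samples.push(t.o);
    }
  });
  return summary;
}

console.log(summarize([
  { s: 'entry1', p: 'agency_name', o: 'Corps of Engineers' },
  { s: 'entry2', p: 'agency_name', o: 'Corps of Engineers' },   // duplicate, skipped
  { s: 'entry1', p: 'state', o: 'AK' }
], 3));
```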
Survey-NG next steps:
- Better generic URI abbreviation handling for the properties column.
- Showing the full URI is precise but cluttering. Finding a way to preserve the precision while reducing the cluttering would be nice.
- Avoid need to copy/paste
- Allow user to click on value, show freebase-suggest in-line and accept selection.
- Deal with Freebase's lack of SPARQL endpoint.
- The SPARQL query above would not work as is because Freebase only provides an MQL endpoint.
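For the URI abbreviation step, a longest-match lookup against a prefix table is one simple starting point. In this sketch the function name abbreviate is made up and the prefix table is supplied by hand; it could equally be filled from a service such as prefix.cc. The dg10002 entry assumes that prefix stands for http://data-gov.tw.rpi.edu/vocab/p/10002/.

```javascript
// Hypothetical abbreviation helper: replace the longest matching namespace
// with its prefix, and fall back to the full URI when nothing matches so
// that no precision is lost.
function abbreviate(uri, prefixes) {
  var best = null;
  Object.keys(prefixes).forEach(function (prefix) {
    var ns = prefixes[prefix];
    if (uri.indexOf(ns) === 0 &&
        (best === null || ns.length > prefixes[best].length)) {
      best = prefix;
    }
  });
  return best === null ? uri : best + ':' + uri.slice(prefixes[best].length);
}

var prefixes = {
  foaf: 'http://xmlns.com/foaf/0.1/',
  rdfs: 'http://www.w3.org/2000/01/rdf-schema#',
  dg10002: 'http://data-gov.tw.rpi.edu/vocab/p/10002/'   // assumed expansion
};
console.log(abbreviate('http://data-gov.tw.rpi.edu/vocab/p/10002/agency_name', prefixes));
```

Showing the full URI in a tooltip next to the abbreviated form would keep the precision while reducing the clutter.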
Suggestions for TW's data.gov RDF efforts:
- Make the dataset metadata SPARQL-queryable.
- although one could grab the RDF feed from the wiki, there is a chicken-and-egg problem because one needs to know which datasets exist before one can request their descriptions.
- scraping http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog is doable but not a good start to one's day.
- Provide an overview of which data.gov RDF datasets reuse predicates (e.g., 770 uses 774 namespace). Since predicate reuse tends to be an exception and not a rule, knowing these overlaps would help developers focus their attention when deciding which data.gov RDF datasets to work with. Predicate reuse is specified in the csv file that is input to the batch RDFization process, so it should already be available in a structured format. If the suggestion above were satisfied, then this overview could be done with a SPARQL query.
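Until the metadata is queryable, the overview could be computed directly from the dataset-to-namespace assignments. The layout of the csv is not described here, so the sketch below assumes it has already been read into an object mapping each dataset number to the namespace its predicates use; the function name sharedNamespaces and that input shape are hypothetical.

```javascript
// Hypothetical overview: group datasets by the predicate namespace they
// use and keep only the namespaces used by more than one dataset.
function sharedNamespaces(namespaceOf) {
  var users = {};
  Object.keys(namespaceOf).forEach(function (dataset) {
    var ns = namespaceOf[dataset];
    (users[ns] = users[ns] || []).push(dataset);
  });
  var shared = {};
  Object.keys(users).forEach(function (ns) {
    if (users[ns].length > 1) shared[ns] = users[ns];
  });
  return shared;
}

// Dataset 770 reuses the 774 namespace; dataset 10 keeps its own.
console.log(sharedNamespaces({ '10': '10', '770': '774', '774': '774' }));
```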

