Posts Tagged ‘linked data’

Fall 2010 TWC Undergraduate Research Summary

December 20th, 2010

The Fall 2010 semester marked the beginning of the Tetherless World Constellation’s undergraduate research program at Rensselaer Polytechnic Institute (RPI). Although TWC has enjoyed significant contributions from RPI undergrads since its inception, this term we stepped up our game by more “formally” incorporating a group of undergrads into TWC’s research programs, establishing regular meetings for the group, and, with input from the students, beginning to outfit their own space in RPI’s Winslow Building.

Patrick West, my fellow TWC undergrad research coordinator, and I asked the students to blog about their work throughout the semester; at the end of term, we asked them to post summary descriptions of their work and their thoughts about the fledgling TWC undergrad research program itself. We’ve provided short summaries and links to those blogs below…

  • Cameron Helm began the term coming up to speed on SPARQL and RDF, experimented with several of the public TWC endpoints, and then worked with Phillip on basic visualizations. He then slashed his way through the tutorials on TWC’s LOGD Portal, eventually creating impressive visualizations such as this earthquake map. Cameron is very interested in the subject of data visualization and looks to do more work in this area in the future.
  • After a short TWC learning period, Dan Souza began helping doctoral candidate Evan Patton create an Android version of the Mobile Wine Agent application, with all the amazing visualization and data integration required, including Twitter and Facebook integration. Mid-semester Dan also responded to the call to help with the “crash” development of the Android/iPhone TalkTracker app, in time for ISWC 2010 in early November. Dan continues to work with Evan and others for early 2011 releases of Android, iPhone/iPod Touch and iPad versions of the Mobile Wine Agent.
  • David Molik reports that he learned web coding skills, ontology creation, and server installation and administration. David contributed to the development and operation of a test site for the new, semantic-web-savvy website for the Biological and Chemical Oceanography Data Management Office (BCO-DMO) of the Woods Hole Oceanographic Institution.
  • Jay Chamberlin spent much of his time working on the OPeNDAP Project, an open source server to distribute scientific data that is stored in various formats. His involvement included everything from learning his way around the OPeNDAP server, to working with infrastructure such as TWC’s LDAP services, to helping migrate documentation from the previous Wiki to the new Drupal site, to actually implementing required changes to the OPeNDAP code base.
  • Phillip Ng worked on a wide variety of projects this fall, starting with basic visualizations, helping with ISWC applications, and including iPad development for the Mobile Wine Agent. Phillip’s blog is fascinating to read as he works his way through the challenges of creating applications, including his multi-part series on implementing the social media features.
  • Alexei Bulazel began working with Dominic DiFranzo on a health-related mashup using Data.gov datasets and is now working on a research paper with David on “human flesh search engine” techniques, a topic that top thinkers including Tetherless World Senior Constellation Professor Jim Hendler have explored in recent talks. Note: For more background on this phenomenon, see e.g. China’s Cyberposse, NY Times (03 Mar 2010)

Many of these students will be continuing on with these or other projects at TWC in 2011; we also expect several new students to be joining the group. The entire team at the Tetherless World Constellation thanks them for their efforts and many important contributions this fall, and looks forward to being amazed by their continued great work in the coming year!

John S. Erickson, Ph.D.


Linked Data and the Semantic Web (Nature Blog)

June 3rd, 2010

I have written a blog post about the Linked Open Government Project:

(intro)
This entry is a backgrounder, rather than a technical piece – the goal is to introduce some new work that my RPI laboratory has been doing aimed at using Semantic Web technologies to help the US government in their data sharing efforts at the Data.gov site. Since similar work is going on on the British Data.gov.uk website, led by my colleagues Tim Berners-Lee and Nigel Shadbolt, (with some “friendly rivalry” between the two), I thought it might be worth providing some background, and pointers to this work.

(Update: DOH! Here is the link: http://blogs.nature.com/jhendler/2010/06/01/linked-open-government-data-and-the-semantic-web)


Three principles for building government dataset catalog vocabulary

April 23rd, 2010

There is ongoing interest in vocabularies for government dataset publishing, and there are a number of proposals, such as DERI’s dcat, Sunlight Labs’ guidelines and RPI’s proposal for a Data-gov Vocabulary. Based on our experience with data.gov catalog data, we have found the following principles useful for consolidating the vocabulary-building process and potentially bringing consensus:

1. modular vocabulary with minimal core
  • keep the core vocabulary small and stable, and include only a small set of frequently used (or required) terms
  • allow extensions contributed by anyone. Extensions should be connected to the core ontology and should be eligible for promotion to core status later.
2. choice of term
  • make it easy for curators to produce metadata using the term, e.g. do they need to specify data quality?
  • make the expected range of the term clear, e.g. should they use “New York” or “dbpedia:New_York” for spatial coverage? Does it require a controlled vocabulary? A validator would be very helpful.
  • make the expected use of the term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search or faceted browsing?
  • try to reuse a term from existing popular vocabulary
  • identify the required, recommended, and optional terms
3. best practices for actual usage
  • we certainly want the metadata to be part of linked data, but that is not the end. We would like to see the linked data actually being used by users who don’t know much about the semantic web.
  • we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, Microformats, Atom, JSON, XML Schema, OData
  • we should build use cases, tools and demos that exhibit the use of the vocabulary to promote adoption
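
As an illustration of the “required, recommended, and optional terms” idea and of the validator mentioned above, here is a minimal sketch in Python. The term names and the rule that spatial coverage should be a URI-like identifier are assumptions made for this example only; they are not taken from dcat, the Sunlight Labs guidelines, or the Data-gov Vocabulary.

```python
# Hypothetical term lists: none of the proposals is quoted here.
REQUIRED = {"title", "agency"}
RECOMMENDED = {"description", "spatial_coverage"}
OPTIONAL = {"data_quality"}

# Assumed range rule: spatial coverage should be an identifier, not free text.
URI_PREFIXES = ("http://", "https://", "dbpedia:")


def validate(record):
    """Return (errors, warnings) for one dataset metadata record (a dict)."""
    present = set(record)
    errors = [f"missing required term: {t}" for t in sorted(REQUIRED - present)]
    warnings = [f"missing recommended term: {t}" for t in sorted(RECOMMENDED - present)]

    # Terms outside the core are allowed (extensions), but worth flagging.
    known = REQUIRED | RECOMMENDED | OPTIONAL
    warnings += [f"extension term: {t}" for t in sorted(present - known)]

    spatial = record.get("spatial_coverage")
    if spatial is not None and not str(spatial).startswith(URI_PREFIXES):
        errors.append(f"spatial_coverage should be a URI, got: {spatial!r}")
    return errors, warnings


if __name__ == "__main__":
    errs, warns = validate({"title": "Earthquakes", "spatial_coverage": "New York"})
    print(errs)
    print(warns)
```

A curator who writes “New York” instead of “dbpedia:New_York” gets an error, while an unknown term only produces a warning, which keeps the core strict and the extensions open.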

Comments are welcome.

Li Ding @ RPI


RPI Hackathon: Linking government data

December 9th, 2009

This is an invitation to participate in the RPI Hackathon 2009 for linking government data. For more detailed information check our wiki.

Part of the work done here in the Tetherless World Constellation consists of translating the government datasets available from data.gov into RDF. This effort has produced billions of triples from (at the time of writing this post) more than 130 datasets. This data can be used in multiple ways: it can be queried from a SPARQL endpoint, used in visualizations such as maps, or combined with other datasets (whether from data.gov or other sources) to support correlation, clustering or other types of analysis.

However, we think that the data is more interesting and useful when it is linked: for example, a system can answer a specific query and also suggest other sources of information that may be relevant to the user. So while we keep translating datasets, we would also like to link them to the Linked Data cloud, and to do that we are asking for your help.
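
To give a flavor of what “linking” can mean in practice, here is a minimal Python sketch that emits owl:sameAs statements (as N-Triples lines) for entities whose labels match a known external resource. The lookup table, the example.org URIs and the simple label matching are assumptions for illustration; real linking needs much more careful disambiguation.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical, hand-written lookup table: normalized label -> external URI.
KNOWN_PLACES = {
    "new york": "http://dbpedia.org/resource/New_York",
    "california": "http://dbpedia.org/resource/California",
}


def normalize(label):
    """Lowercase the label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def link_entities(entities, lookup=KNOWN_PLACES):
    """Yield one N-Triples line per entity whose label is in the lookup table.

    `entities` is an iterable of (local_uri, label) pairs.
    """
    for uri, label in entities:
        target = lookup.get(normalize(label))
        if target:
            yield f"<{uri}> <{OWL_SAME_AS}> <{target}> ."


if __name__ == "__main__":
    sample = [
        ("http://example.org/state/ny", "New York"),   # hypothetical local URI
        ("http://example.org/state/zz", "Atlantis"),   # no match, no link
    ]
    for line in link_entities(sample):
        print(line)
```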

On December 12th and 13th we will host a Hackathon (i.e., an event where people gather together to work on a specific computational problem). This event is part of the Great American Hackathon promoted by Sunlight Labs. We will host this event at the Winslow Building, RPI, in Troy, NY. It will run from 10 AM to 5 PM, but if you have only a few spare hours, you are also welcome! As I mentioned above, our main goal is to link the available data to the Linked Data cloud, but if you have other ideas to develop using one or more of the datasets, please join us too! The only requirement is to bring your computer and register by email to gravea3[@]rpi.edu or difrad[@]rpi.edu. Because we know big brains need energy, food and beverages will be provided. Even if you can’t attend physically, you can help us by working online.

Everyone is invited to participate. If you have any comments, questions, etc. please don’t hesitate to contact me at gravea3[@]rpi.edu or check the announcement in data-gov.

Alvaro Graves and the Data-gov team.


The RPI Data-gov Wiki: Current and Future Status

November 24th, 2009

This blog post is being written in response to some questions being asked of late about our work on turning the datasets from http://data.gov (and also some other govt datasets) into linked data formats (RDF) and making this available to the community. In essence, the criticisms have been that although we’ve made a good start, there’s still a lot more to do. We couldn’t agree more.

Our http://data-gov.tw.rpi.edu Wiki has been made available to help the public build, access and reuse linked government data. In order to get the data out quickly, we took some simple steps to start, showing how powerful it could be just to republish the data in RDF. However, we are also working now to bring in more linking between the data, more datasets from other govt data sites, and more semantics in relating the datasets.

Our first step was to bring out a raw RDF version of the data.gov datasets and build some quick demos to show how easy it was (see http://data-gov.tw.rpi.edu/wiki/Demos). The benefit of this was that we could easily dump from other formats into RDF, merge the data in simple ways, and then query the data using SPARQL and put the results right into visualizations using “off the shelf” Web tools such as Google Visualization and MIT’s Exhibit. In this step, we followed the “minimalism” principle: minimize human/development effort and keep the process simple and replicable. Thus we did not try to do a lot of analysis of the data, didn’t add triples for things such as provenance, and didn’t link the datasets directly. Rather, the linking in our early demos came from obvious linking such as same locations or temporal-spatial overlaps.
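
To make the “dump from other formats into RDF” idea concrete, here is a minimal Python sketch of a raw, minimalist conversion from CSV to N-Triples: one subject per row, one predicate per column, every value kept as a plain literal. The namespace, the entry naming and the column-to-predicate rule are assumptions for this example and are not necessarily what our converter does.

```python
import csv
import io

BASE = "http://example.org/dataset92/"  # hypothetical namespace


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(text, base=BASE):
    """Yield one N-Triples line per non-empty cell of the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    for i, row in enumerate(reader, start=1):
        subject = f"<{base}entry{i}>"
        for column, value in row.items():
            if column is None or not value:
                continue  # skip overflow cells and empty or missing values
            # Simplification: assumes column names are safe to use in an IRI
            # once lowercased and with spaces replaced by underscores.
            predicate = f"<{base}{column.strip().lower().replace(' ', '_')}>"
            yield f'{subject} {predicate} "{escape_literal(value)}" .'


if __name__ == "__main__":
    sample = "State,Magnitude\nNY,2.5\nCA,\n"
    for line in csv_to_ntriples(sample):
        print(line)
```

Nothing here interprets the data (no datatypes, no provenance, no links), which is exactly the point of the minimal first step: it is cheap to run and easy to replicate.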

The second step, which is ongoing right now, is to improve data quality by cleaning the data and enriching our semantics. We are improving our human (and machine) versions of the data.gov catalog (http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog), which is important for non-government people who want to use our data. For example:

1. The catalog aggregates metadata from data.gov (http://www.data.gov/details/92) and metadata about our curated RDF version. The aggregated metadata is published in both human-readable and machine-readable (RDF) forms.

2. Every dataset has a dereferenceable URI for itself, and links to the raw data, to linked RDF datasets in chunks that are small enough for linked data browsers such as Tabulator, and to the complete converted RDF data documents.

3. We use the definitions from data.gov (their dataset92 metadata dictionary as it were) for the metadata of each file, but we also add some DC, FOAF  and a couple of small home brews (like number of triples) in an ontology called DGTWC.

4. We are now also linking to more online “enhanced” datasets that include (we’ve only done it for a few so far) normalized triples extracted from the raw data and links from entities (such as locations, organizations, persons) in government datasets to the linked data cloud (DBpedia and Geonames so far, much more coming soon). We are also exploring the use of VoID for adding more data descriptions (and IRIs) to the datasets.

5. We are also working on linking the datasets by common properties. This is harder than most people think, because you cannot just assume that the same name means the same relation: properties can have different ranges, values or even semantics (and we have found examples of all of the above). So soon you’ll find that for each property there is something like this:
geo:lat a rdf:Property .
191:latitude rdfs:subPropertyOf geo:lat .
398:latitude rdfs:subPropertyOf geo:lat .

and we have a Semantic Wiki page for each property, so you can find all the subproperty relations and, eventually, a place where people can add information about what some of the more obscure properties mean, or where semantic relations such as “owl:sameAs” can be introduced when these subproperties are known to be the same.
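
To show why the subproperty declarations are useful, here is a minimal Python sketch that treats triples as plain tuples of strings, collects every property declared (directly or transitively) as a subproperty of geo:lat, and then gathers the values across datasets. The entry names and coordinate values are made up for the example, and a real application would use an RDF store with RDFS inference rather than this hand-rolled lookup.

```python
SUBPROP = "rdfs:subPropertyOf"


def subproperties_of(triples, prop):
    """Return `prop` plus every property declared, transitively, beneath it."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBPROP and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def values_for(triples, prop):
    """Return sorted (subject, value) pairs for `prop` and its subproperties."""
    props = subproperties_of(triples, prop)
    return sorted((s, o) for s, p, o in triples if p in props)


if __name__ == "__main__":
    triples = [
        ("191:latitude", SUBPROP, "geo:lat"),
        ("398:latitude", SUBPROP, "geo:lat"),
        ("191:entry1", "191:latitude", "42.73"),   # made-up sample data
        ("398:entry7", "398:latitude", "40.71"),   # made-up sample data
    ]
    print(values_for(triples, "geo:lat"))
```

A query for geo:lat then returns latitudes from both datasets, while each dataset keeps its own property, so differences in range or meaning can still be recorded per property.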

So to summarize, our first step, and we continue to do it, is to transform data in ways that other people can start to add value. Our second goal, which we’re now working on, is enhanced metadata and adding more semantics, including what is needed for more linking.

We’re also, in our research role, working on next generation technologies for really doing more exciting things (think “read write web of data”) but we’re trying to keep that separate from the work at http://data-gov.tw.rpi.edu, which is aimed at helping to show that Semantic Web technologies are really the only game in town for the sort of large-scale, open, distributed data use that is needed for linked data to really take off.

And if you feel there is stuff missing, let us know (via contact us). Or, even better, all our stuff is open (see http://code.google.com/p/data-gov-wiki/), free and easily used; all we ask is that you do great stuff and help make the Web of data grow.

Jim Hendler, Li Ding, and the RPI data-gov research team.
