————————————————————————————————
Title: Open Government Knowledge: AI Opportunities and Challenges
When: 4-6 November 2011
Where: Westin Arlington Gateway in Arlington, Virginia, USA
Homepage: http://tw.rpi.edu/ogk2011
Program (PDF): http://tw.rpi.edu/media/latest/ogk2011.pdf
————————————————————————————————
Please join us to meet governmental and business thought leaders in
US open government data activities and discuss the challenges. The
symposium features Friday (Nov 4) as governmental day, with speakers on
Data.gov, openEi.org, and open government data activities at NIH/NCI and NASA, and
Saturday (Nov 5) as R&D day, with speakers from industry such as Google
and Microsoft, as well as international researchers.
This symposium will explore how AI technologies such as the Semantic Web,
information extraction, statistical analysis and machine learning, can be used
to make the valuable knowledge embedded in open government data more
explicit, accessible and reusable.
Co-Chairs
* Li Ding, Qualcomm (Previously RPI)
* Tim Finin, UMBC
* Lalana Kagal, MIT
* Deborah McGuinness, RPI
Check out the tag clouds below. Can you see why they are interesting? Please compare the two tag clouds generated from the same text (a corpus made of the titles of about 2000 data.gov datasets) and see why they differ.
Novel Multi-word TagCloud | Conventional Single-word TagCloud
Highlights
- Meaningful Visualization. As you may see from the captions, the first one is a “MultiWord TagCloud” while the other is the conventional single-word TagCloud. The former joins individual words into popular multi-word phrases. With the former tag cloud, I get a better overview of what data was published at data.gov.
- Automated Process. The MultiWord TagCloud was not created by human users, but automatically generated by a computer program, powered by the Microsoft Web N-gram service. We can generate such a tag cloud for any existing text document.
- Cloud+Crowd. Broadly, this demo shows the value of the crowd and the cloud that I mentioned in my earlier blog post: big data can now be tackled by the crowd (text from the entire Web) and the cloud (the high-performance computational Web N-gram service).
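As a rough illustration of the multi-word idea, the sketch below greedily joins adjacent words into a phrase whenever the pair occurs often enough. It is not the demo's code: local bigram counts stand in for the Web N-gram service (whose API is not shown here), and the function name, the threshold, and the sample titles are all made up for the example.

```python
from collections import Counter

def multiword_tags(titles, min_pair_count=2):
    """Count tags, merging adjacent words that co-occur often into one phrase."""
    tokenized = [title.lower().split() for title in titles]
    # Assumption: local bigram counts replace scores from an n-gram service.
    pair_counts = Counter(
        (a, b) for words in tokenized for a, b in zip(words, words[1:])
    )
    tags = Counter()
    for words in tokenized:
        phrase = []
        for word in words:
            if phrase and pair_counts[(phrase[-1], word)] >= min_pair_count:
                phrase.append(word)  # frequent pair: extend the current phrase
            else:
                if phrase:
                    tags[" ".join(phrase)] += 1
                phrase = [word]  # start a new phrase
        if phrase:
            tags[" ".join(phrase)] += 1
    return tags

# Made-up titles, only to exercise the function.
sample_titles = [
    "Toxics Release Inventory 2005",
    "Toxics Release Inventory 2006",
    "Air Quality 2005",
    "Air Quality Index",
]
tags = multiword_tags(sample_titles)
for tag, count in tags.most_common(3):
    print(count, tag)
```

A single-word cloud would count "toxics", "release" and "inventory" separately; here they surface as one phrase, which is the difference between the two clouds.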
Behind the Scenes
WWW2010 has been really inspiring: it made me a productive “engineer” although I came as a researcher. Today I picked up Microsoft Visual Studio and wrote my first C# program. I was an excellent C++ programmer back in my college days (I wrote tons of code using Visual C++ 4.0 more than 10 years ago). However, today is not about me being a programmer, but about announcing something really cool! I would also like to thank Evelyne and Paul, researchers from Microsoft Research, for their great support. My demo on data.gov data is powered by the Microsoft Web N-gram Service.
Cheers,
Li Ding @ RPI, April 29, 2010
I recently built a proof-of-concept demo that brings Facebook data (public data only) into LOD using their recently announced Graph API. The demo is available at http://sam.tw.rpi.edu/ws/face_lod.html.
It is fairly straightforward to convert the JSON object into RDF and make the URI dereferenceable. Now the data are linkable, but not yet linked to other LOD data.
I did see some issues when assigning RDF properties. Here is an example JSON object from http://graph.facebook.com/cocacola:
{
"id": "40796308305",
"name": "Coca-Cola",
"picture": "http://profile.ak.fbcdn.net/object3/1853/100/s40796308305_2334.jpg",
"link": "http://www.facebook.com/coca-cola",
"category": "Consumer_products",
"username": "coca-cola",
"products": "Coca-Cola is the most popular and biggest-selling soft drink in history, as well as the best-known product in the world.\n\nCreated in Atlanta, Georgia, by Dr. John S. Pemberton, Coca-Cola was first offered as a fountain beverage by mixing Coca-Cola syrup with carbonated water. Coca-Cola was introduced in 1886, patented in 1887, registered as a trademark in 1893 and by 1895 it was being sold in every state and territory in the United States. In 1899, The Coca-Cola Company began franchised bottling operations in the United States.\n\nCoca-Cola might owe its origins to the United States, but its popularity has made it truly universal. Today, you can find Coca-Cola in virtually every part of the world.",
"fan_count": 5425800
}
1. The JSON from Facebook does not use the exact Open Graph Protocol terms. Below is the mapping:
name => og:title
category => og:type
picture => og:image
link => og:url
2. We can reuse FOAF and DCTerms to cover some terms used in Facebook data. Below is the mapping:
picture => foaf:depiction
name => foaf:name
from => dcterms:source
id => dcterms:identifier
created_time => dcterms:created
updated_time => dcterms:modified
category => dcterms:type
link => foaf:homepage
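To make the conversion concrete, here is a minimal sketch that applies a subset of the two mappings above and emits N-Triples. It is not the demo's code. The subject URI scheme (the Graph API URL plus the id), the og namespace URI, and the function names are assumptions made for illustration.

```python
import json

NS = {
    "og": "http://ogp.me/ns#",  # assumed namespace URI for the og: prefix
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# JSON key -> (properties, value is a URI?), following the mappings above.
MAPPING = {
    "name": (["og:title", "foaf:name"], False),
    "category": (["og:type", "dcterms:type"], False),
    "picture": (["og:image", "foaf:depiction"], True),
    "link": (["og:url", "foaf:homepage"], True),
    "id": (["dcterms:identifier"], False),
}

def expand(curie):
    """Turn a prefixed name such as foaf:name into a full URI."""
    prefix, local = curie.split(":", 1)
    return NS[prefix] + local

def to_ntriples(obj, base="http://graph.facebook.com/"):
    """Convert one Graph API object into a list of N-Triples lines."""
    subject = f"<{base}{obj['id']}>"  # hypothetical URI scheme
    lines = []
    for key, (props, is_uri) in MAPPING.items():
        if key not in obj:
            continue
        value = obj[key]
        # json.dumps yields a double-quoted string with backslash escapes,
        # which is good enough as an N-Triples literal for this sketch.
        node = f"<{value}>" if is_uri else json.dumps(str(value), ensure_ascii=False)
        for prop in props:
            lines.append(f"{subject} <{expand(prop)}> {node} .")
    return lines

page = json.loads(
    '{"id": "40796308305", "name": "Coca-Cola",'
    ' "link": "http://www.facebook.com/coca-cola"}'
)
triples = to_ntriples(page)
print("\n".join(triples))
```

Keys without a mapping (for example fan_count or products) are simply skipped here; covering them would need either new terms or further vocabulary reuse.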
Li Ding @ RPI, April 28, 2010
A sameas network is a network of URIs that are inter-connected by the owl:sameAs relation. It is an interesting kind of network because it is not a conventional social network, but rather a socially contributed directed graph connecting “equivalent” identities.
Our recent study [1] crawls sameas networks following linked data principles: starting from a given seed URI, we dereference the URI and recursively fetch the URIs linked by owl:sameAs. We used a fairly small seed set of New York Times URIs (100 people, 100 locations and 100 organizations) and got 300 sameas networks. Please come to the WebSci Poster Session today (April 26, 2010) to see more discussion.
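The crawl described above can be sketched as a breadth-first traversal. This is an illustration under assumptions, not the crawler used in the study: `fetch_sameas` is a hypothetical caller-supplied function that dereferences a URI and returns the URIs it links to via owl:sameAs, and the toy data below is made up so the example runs without network access.

```python
from collections import deque

def crawl_sameas(seed, fetch_sameas, max_uris=1000):
    """Collect the sameas network reachable from `seed`.

    Returns the set of URIs visited and the set of (source, target) arcs.
    """
    seen = {seed}
    arcs = set()
    queue = deque([seed])
    while queue:
        uri = queue.popleft()
        for target in fetch_sameas(uri):
            arcs.add((uri, target))
            # Only enqueue new URIs, and stop growing at the size cap.
            if target not in seen and len(seen) < max_uris:
                seen.add(target)
                queue.append(target)
    return seen, arcs

# Made-up stand-in for the Web: URI -> owl:sameAs targets found when dereferencing it.
toy_web = {
    "nyt:allen": ["dbpedia:Paul_Allen", "freebase:allen"],
    "dbpedia:Paul_Allen": ["nyt:allen"],
    "freebase:allen": ["dbpedia:Paul_Allen", "dbpedia:Paul_Allen's_House"],
}
uris, arcs = crawl_sameas("nyt:allen", lambda uri: toy_web.get(uri, []))
print(len(uris), "URIs,", len(arcs), "arcs")
```

URIs that cannot be dereferenced just return no links in this sketch, so they become leaves of the network.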
Results
The average size of a sameas network is 22 URIs, and one of the largest networks has 58 URIs connected by 1249 sameas arcs. Not all URIs are dereferenceable, and the dereferenceable ones may be described by anywhere from 1 to over a thousand triples.
Following are some interesting breaking observations, as confirmed in several plotted sample sameas networks (they are breaking because they have not even been printed in our poster yet).
- New York Times (NYT) and DBpedia have different preferences on mutual sameas relations. It is interesting to see that NYT connects its numerical URI to a non-numeric URI in Freebase.
- Many DBpedia URIs were connected not within DBpedia, but by Freebase. Within DBpedia, the “dbpprop:redirect” property was used to connect equivalent URIs.
- Wrong links were introduced by Freebase: dbpedia:Paul_Allen was linked to dbpedia:Paul_Allen’s_House.

Paul Allen and his House (People NYT)

Arctic (Location NYT)
Discussion
- A lot of URIs do not carry information or merely redirect (see my paper), so it would be useful to skip these URIs to reduce the cost of linked data exploration. We could further reduce the cost of loading sameAs URIs.
- The quality of sameas links is a big concern; the legitimate use of Freebase sameas relations is debatable.
Comments from Tim Berners-Lee: let’s leverage semantics. We can look into the semantic annotations (e.g. rdf:type) of the URIs being described to automatically detect potentially bad data integration. Paul Allen’s House will then be knocked out, with its type being “house”.
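One simple reading of this suggestion is to flag a sameas arc when both endpoints have known types and share none of them. The sketch below assumes the rdf:type values have already been collected into a dictionary; the function name, type labels, and toy data are made up for illustration.

```python
def flag_type_conflicts(arcs, types):
    """Return sameas arcs whose endpoints have known but disjoint rdf:type sets."""
    flagged = []
    for source, target in arcs:
        source_types = types.get(source, set())
        target_types = types.get(target, set())
        # Unknown types are not evidence of a conflict, so skip those arcs.
        if source_types and target_types and source_types.isdisjoint(target_types):
            flagged.append((source, target))
    return flagged

# Made-up example data.
sample_arcs = [
    ("dbpedia:Paul_Allen", "freebase:paul_allen"),
    ("dbpedia:Paul_Allen", "dbpedia:Paul_Allen's_House"),
]
sample_types = {
    "dbpedia:Paul_Allen": {"Person"},
    "freebase:paul_allen": {"Person"},
    "dbpedia:Paul_Allen's_House": {"House"},
}
conflicts = flag_type_conflicts(sample_arcs, sample_types)
print(conflicts)
```

In practice different datasets use different type URIs for the same concept, so a real check would first need some alignment between type vocabularies; plain set disjointness would otherwise flag many correct links.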
[1] Ding, Li; Shinavier, Joshua; Finin, Tim; and McGuinness, Deborah L. (2010) An Empirical Study of owl:sameAs Use in Linked Data. In: Proceedings of WebSci10: Extending the Frontiers of Society On-Line, April 26-27, 2010, Raleigh, NC, US. http://journal.webscience.org/403/
Li Ding @ RPI April 26, 2010
There is ongoing interest in vocabularies for government dataset publishing. There are a number of proposals, such as DERI dcat, Sunlight Labs’ guidelines and RPI’s proposal on the Data-gov Vocabulary. Based on our experience with data.gov catalog data, we found the following principles useful for consolidating the vocabulary building process and potentially bringing consensus:
- 1. modular vocabulary with a minimal core
  - keep the core vocabulary small and stable; only include a small set of frequently used (or required) terms
  - allow extensions contributed by anyone. Extensions should be connected to the core ontology, and it should be possible to promote them to core status later.
- 2. choice of terms
  - make it easy for curators to produce metadata using the term, e.g. do they need to specify data quality?
  - make the expected range of the term clear, e.g. should they use “New York” or “dbpedia:New_York” for spatial coverage? Does it require a controlled vocabulary? A validator would be very helpful.
  - make the expected use of the term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search or facet browsing?
  - try to reuse terms from existing popular vocabularies
  - identify the required, recommended, and optional terms
- 3. best practices for actual usage
  - we certainly want the metadata to be part of linked data, but that is not the end. We would like to see the linked data actually being used by users who don’t know much about the semantic web.
  - we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, Microformat, ATOM, JSON, XML Schema, OData
  - we should build use cases, tools and demos that exhibit the use of the vocabulary to promote adoption
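To illustrate the kind of validator mentioned under principle 2, here is a minimal sketch that checks a metadata record against a tiny core vocabulary. The terms, their required/recommended/optional status, and the range checks are all hypothetical; they are not taken from any of the proposals named above.

```python
import re

URI = re.compile(r"^https?://\S+$")

# Hypothetical core vocabulary: term -> (status, check on the value).
CORE = {
    "title": ("required", lambda v: isinstance(v, str) and v.strip() != ""),
    "spatial": ("recommended", lambda v: isinstance(v, str) and URI.match(v) is not None),
    "keyword": ("optional", lambda v: isinstance(v, str)),
}

def validate(record):
    """Return a list of human-readable problems found in a metadata record."""
    problems = []
    for term, (status, check) in CORE.items():
        if term not in record:
            if status == "required":
                problems.append(f"missing required term: {term}")
            continue
        if not check(record[term]):
            problems.append(f"value out of range for {term}: {record[term]!r}")
    # Terms outside the core are reported, not rejected: they may be extensions.
    for term in sorted(record.keys() - CORE.keys()):
        problems.append(f"unknown term (extension?): {term}")
    return problems

print(validate({"spatial": "New York"}))
print(validate({"title": "Air Quality", "spatial": "http://dbpedia.org/resource/New_York"}))
```

Here the range of "spatial" is assumed to be a URI, so the plain string "New York" is reported while a DBpedia URI passes; a vocabulary could just as well decide the other way, as long as the expectation is stated.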
Comments are welcome.
Li Ding @ RPI