Archive for April, 2010

Multi-Word TagCloud on Web N-gram Now

April 29th, 2010

Check out the tag clouds below. Can you see why they are interesting? Both were generated from the same text (a corpus made of the titles of about 2000 datasets); compare them and see how they differ.

A Multi-word TagCloud produced from 2000 US gov dataset titles

Novel Multi-word TagCloud

Conventional Single-word TagCloud


  • Meaningful Visualization. As the captions show, the first one is a “MultiWord TagCloud” while the other is a conventional single-word TagCloud. The former joins individual words into popular multi-word phrases. With the former tag cloud, I get a better overview of what data was published at
  • Automated Process. The MultiWord TagCloud was not created by human users, but generated automatically by a computer program, powered by the Microsoft Web N-gram service. We can generate such a tag cloud for any existing text document.
  • Cloud+Crowd. Broadly, this demo shows the value of the crowd and the cloud that I mentioned in my earlier blog post: big data can now be tackled by the crowd (text from the entire Web) and the cloud (the high-performance computational Web N-gram service).
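
The post does not say how words are joined into phrases. As a rough illustration only, here is a minimal Python sketch that merges adjacent words when the pair occurs often enough in the titles themselves. It uses local bigram counts instead of the Web N-gram service, and the function name `multiword_tags`, the `min_count` threshold, and the sample titles are all made up for this example.

```python
from collections import Counter

def multiword_tags(titles, min_count=2):
    """Count tags, greedily merging adjacent words that form a frequent bigram.

    Assumption: local bigram counts stand in for the phrase statistics
    that the demo obtains from the Web N-gram service.
    """
    tokenized = [title.lower().split() for title in titles]

    # Count every adjacent word pair across all titles.
    bigrams = Counter()
    for words in tokenized:
        bigrams.update(zip(words, words[1:]))

    tags = Counter()
    for words in tokenized:
        i = 0
        while i < len(words):
            pair = tuple(words[i:i + 2])
            if len(pair) == 2 and bigrams[pair] >= min_count:
                tags[" ".join(pair)] += 1   # emit the phrase as one tag
                i += 2
            else:
                tags[words[i]] += 1         # fall back to the single word
                i += 1
    return tags

# Made-up sample titles, for illustration only.
sample = ["air quality data 2008", "air quality index by county", "water quality"]
tags = multiword_tags(sample)
```

With these sample titles, "air quality" becomes a single tag with weight 2, while "quality" on its own is counted only once (from "water quality").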

Behind the Scenes

WWW2010 is really inspiring: it is making me a productive “engineer” although I came as a researcher. Today I picked up Microsoft Visual Studio and wrote my first C# program. I was an excellent C++ programmer back in my college days (I wrote a ton of code using Visual C++ 4.0 10+ years ago). However, today is not about me being a programmer, but rather about announcing something that is really cool! I would also like to thank the researchers Evelyne and Paul from Microsoft Research for their great support. My demo on data is powered by Microsoft Web N-gram Service.


Li Ding @ RPI,  April 29, 2010

Categories: cloud computing, linked data

Putting open Facebook data into Linked Data Cloud

April 28th, 2010

I recently built a proof-of-concept demo on getting Facebook data (public data only) into LOD using their recently announced Graph API. The demo is available at

It is fairly straightforward to convert the JSON object into RDF and make the URI dereferenceable. Now the data are linkable, but not yet linked to other LOD data.

I did see some issues when I was assigning RDF properties. Here is an example JSON from

{
   "id": "40796308305",
   "name": "Coca-Cola",
   "picture": "",
   "link": "",
   "category": "Consumer_products",
   "username": "coca-cola",
   "products": "Coca-Cola is the most popular and biggest-selling soft drink in history, as well as the best-known product in the world.\n\nCreated in Atlanta, Georgia, by Dr. John S. Pemberton, Coca-Cola was first offered as a fountain beverage by mixing Coca-Cola syrup with carbonated water. Coca-Cola was introduced in 1886, patented in 1887, registered as a trademark in 1893 and by 1895 it was being sold in every state and territory in the United States. In 1899, The Coca-Cola Company began franchised bottling operations in the United States.\n\nCoca-Cola might owe its origins to the United States, but its popularity has made it truly universal. Today, you can find Coca-Cola in virtually every part of the world.",
   "fan_count": 5425800
}

1. The JSON from Facebook does not use the exact Open Graph Protocol terms. Below is the mapping:

   name  => og:title
   category => og:type
   picture  => og:image
   link  => og:url

2. We can reuse FOAF and DCTerms to cover some terms used in Facebook data. Below is the mapping:

  picture => foaf:depiction
  name => foaf:name
  from => dcterms:source
  id => dcterms:identifier
  created_time => dcterms:created
  updated_time => dcterms:modified
  category => dcterms:type
  link => foaf:homepage
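
To make the two mappings concrete, here is a minimal Python sketch that applies them to a Graph API JSON object and prints N-Triples lines. It is not the code behind the demo. The base URI `http://example.org/fb/` and the helper names are made up, the `og` namespace shown is the one published at ogp.me today (it may differ from the one in use when this was written), and nested values such as `from` are simply skipped.

```python
import json

# Namespace URIs. FOAF and DC Terms are the standard ones; the og
# namespace is an assumption (the one currently published at ogp.me).
NS = {
    "og": "http://ogp.me/ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# JSON key -> prefixed properties, combining the two mappings above.
MAPPING = {
    "name": ["og:title", "foaf:name"],
    "category": ["og:type", "dcterms:type"],
    "picture": ["og:image", "foaf:depiction"],
    "link": ["og:url", "foaf:homepage"],
    "from": ["dcterms:source"],
    "id": ["dcterms:identifier"],
    "created_time": ["dcterms:created"],
    "updated_time": ["dcterms:modified"],
}

IRI_KEYS = {"picture", "link"}      # values emitted as IRIs, not literals
BASE = "http://example.org/fb/"     # hypothetical base for subject URIs

def expand(curie):
    """Turn a prefixed name such as foaf:name into an N-Triples IRI."""
    prefix, local = curie.split(":", 1)
    return "<" + NS[prefix] + local + ">"

def to_ntriples(obj, base=BASE):
    """Return one N-Triples line per mapped, non-empty string value."""
    subject = "<" + base + str(obj["id"]) + ">"
    triples = []
    for key, props in MAPPING.items():
        value = obj.get(key)
        if not isinstance(value, str) or value == "":
            continue  # skip missing, empty, and nested values
        # json.dumps gives a quoted, escaped string usable as a literal.
        node = "<" + value + ">" if key in IRI_KEYS else json.dumps(value)
        for prop in props:
            triples.append("%s %s %s ." % (subject, expand(prop), node))
    return triples

# A trimmed version of the example object above.
page = {"id": "40796308305", "name": "Coca-Cola", "picture": "",
        "category": "Consumer_products", "fan_count": 5425800}
for line in to_ntriples(page):
    print(line)
```

Keys without a mapping (here `fan_count`) are dropped, which is one of the places where a real converter would need extra vocabulary.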

Li Ding@RPI April 28, 2010

Categories: linked data

Sameas Network

April 26th, 2010

A sameas network is a network of URIs that are interconnected by the owl:sameAs relation. It is an interesting network because it is not a conventional social network, but rather a socially contributed directed graph connecting “equivalent” identities.

Our recent study [1] crawls sameas networks following linked data principles: starting from a given seed URI, we dereference the URI and recursively fetch the URIs linked by owl:sameAs. We used a fairly small seed set of New York Times URIs (100 people, 100 locations and 100 organizations) and got 300 sameas networks. Please come to the WebSci Poster Session today (April 26, 2010) to see more discussion.
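
The crawl described above can be sketched as a breadth-first traversal. This is a minimal Python illustration, not the crawler used in the study: `fetch_sameas` is a hypothetical callback that dereferences a URI and returns the URIs it links to via owl:sameAs, and the toy dictionary stands in for real HTTP lookups.

```python
from collections import deque

def crawl_sameas(seed, fetch_sameas, max_uris=1000):
    """Collect the sameas network reachable from a seed URI.

    fetch_sameas(uri) is assumed to dereference the URI and return an
    iterable of owl:sameAs targets found in its description.
    Returns (set of URIs, set of directed (source, target) arcs).
    """
    seen = {seed}
    arcs = set()
    queue = deque([seed])
    while queue:
        uri = queue.popleft()
        for target in fetch_sameas(uri):
            if target not in seen:
                if len(seen) >= max_uris:
                    continue            # size cap reached, ignore new URIs
                seen.add(target)
                queue.append(target)
            arcs.add((uri, target))
    return seen, arcs

# Toy data standing in for dereferenced descriptions (made-up URIs).
toy = {
    "nyt:1": ["dbpedia:A", "freebase:a"],
    "dbpedia:A": ["nyt:1"],
    "freebase:a": ["dbpedia:A", "dbpedia:A_House"],
}
uris, arcs = crawl_sameas("nyt:1", lambda u: toy.get(u, []))
```

Because arcs are recorded per direction, a mutual link such as `nyt:1` and `dbpedia:A` contributes two arcs, which is how a network can have many more arcs than URIs.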


The average size of a sameas network is 22 URIs, and one of the largest networks has 58 URIs with 1249 sameas arcs. Not all URIs are dereferenceable, and the dereferenceable ones may be described by anywhere from 1 to over a thousand triples.

Following are some interesting breaking observations, as confirmed in several plotted sample sameas networks (they are breaking because they have not even been printed in our poster yet).

  • New York Times (NYT) and DBpedia have different preferences on mutual sameas relations. It is interesting to see that NYT connects its numerical URI to a non-numeric URI in Freebase.
  • Many DBpedia URIs were connected not within DBpedia, but by Freebase. In DBpedia, the “dbpprop:redirect” property was used to connect equivalent URIs.
  • Wrong links were introduced by Freebase: dbpedia:Paul_Allen was linked to dbpedia:Paul_Allen’s_House.

Paul Allen and his House (People NYT)

Arctic (Location NYT)


  • A lot of URIs do not carry information or just redirect (see my paper), so it would be useful to skip these URIs to reduce the cost of linked data exploration. We can further reduce the cost of loading sameas URIs.
  • The quality of sameas links is a big concern; the legitimate use of Freebase sameas relations is debatable.

Comments from Tim Berners-Lee: let’s leverage semantics. We can look into the semantic annotations (e.g. rdf:type) of the URIs being described to automatically infer potentially bad data integration. Paul Allen’s House would then be knocked out because its type is “house”.
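
One way to read this suggestion, as a minimal Python sketch: flag a sameas arc when both ends have known rdf:type values and share none of them. The function name, the short URIs, and the type labels are made up for illustration, and a real check would also need to account for subclass relations between types.

```python
def suspicious_sameas(arcs, types):
    """Return sameas arcs whose two ends have known but disjoint types.

    arcs:  iterable of (source_uri, target_uri) pairs
    types: dict mapping a URI to the set of its rdf:type values
    Arcs with an untyped end are not flagged, since nothing can be inferred.
    """
    flagged = []
    for source, target in arcs:
        source_types = types.get(source, set())
        target_types = types.get(target, set())
        if source_types and target_types and source_types.isdisjoint(target_types):
            flagged.append((source, target))
    return flagged

# Made-up type annotations for the Paul Allen example.
types = {
    "nyt:paul_allen": {"Person"},
    "dbpedia:Paul_Allen": {"Person"},
    "dbpedia:Paul_Allen's_House": {"House"},
}
arcs = [
    ("nyt:paul_allen", "dbpedia:Paul_Allen"),
    ("dbpedia:Paul_Allen", "dbpedia:Paul_Allen's_House"),
]
bad = suspicious_sameas(arcs, types)
```

Here only the arc from the person to the house is flagged; the NYT to DBpedia arc passes because both ends are typed as a person.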

[1] Ding, Li; Shinavier, Joshua; Finin, Tim; McGuinness, Deborah L. (2010). An Empirical Study of owl:sameAs Use in Linked Data. In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26-27th, 2010, Raleigh, NC, US.

Li Ding  @ RPI April 26, 2010

Categories: linked data

Three principles for building government dataset catalog vocabulary

April 23rd, 2010

There is ongoing interest in vocabularies for government dataset publishing. There are a number of proposals, such as DERI dcat, Sunlight Labs’ guidelines and RPI’s proposal on the Data-gov Vocabulary. Based on our experience with catalog data, we found the following principles useful for consolidating the vocabulary building process and potentially bringing consensus:

1. modular vocabulary with minimal core
  • keep the core vocabulary small and stable, only include a small set of frequently used (or required) terms
  • allow extensions contributed by anyone. Extensions should be connected to the core ontology and be possible to be promoted to core status later.
2. choice of term
  • make it easy for curators to produce metadata using the term, e.g. do they need to specify data quality?
  • make the expected range of the term clear, e.g. should they use “New York” or “dbpedia:New_York” for spatial coverage? Does it require a controlled vocabulary? A validator would be very helpful
  • make the expected use of the term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search or facet browsing?
  • try to reuse a term from existing popular vocabulary
  • identify the required, recommended, and optional terms
3. best practices for actual usage
  • we certainly want the metadata to be part of linked data, but that is not the end. We would like to see the linked data actually being used by users who don’t know much about the semantic web.
  • we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, Microformat, ATOM, JSON, XML Schema, OData
  • we should build use cases, tools and demos to exhibit the use of vocabulary to promote adoption
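
As a small illustration of the validator idea in principle 2, here is a Python sketch that checks a metadata record against required and recommended terms and against an expected range for spatial coverage. The split into required and recommended terms and the rule that `dcterms:spatial` must be an HTTP URI are assumptions made up for this example, not part of any of the proposals named above.

```python
# Hypothetical term classification, for illustration only.
REQUIRED = {"dcterms:title", "dcterms:identifier"}
RECOMMENDED = {"dcterms:description", "dcterms:spatial"}

def validate(record):
    """Return a list of human-readable problems found in a metadata record.

    record is a dict mapping prefixed terms to string values.
    """
    problems = []
    present = set(record)
    for term in sorted(REQUIRED - present):
        problems.append("missing required term: " + term)
    for term in sorted(RECOMMENDED - present):
        problems.append("missing recommended term: " + term)
    # Range check: assume spatial coverage must be given as a URI,
    # not as a plain string such as "New York".
    spatial = record.get("dcterms:spatial")
    if spatial is not None and not spatial.startswith(("http://", "https://")):
        problems.append("dcterms:spatial should be a URI, got: " + spatial)
    return problems

report = validate({"dcterms:title": "Sample dataset", "dcterms:spatial": "New York"})
```

For this sample record the report lists the missing identifier, the missing description, and the plain-string spatial value.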

Comments are welcome.

Li Ding @ RPI

Categories: tetherless world
