Archive for April, 2010

Multi-Word TagCloud on Web N-gram Now

April 29th, 2010

Check out the tag cloud below. Can you see why it is interesting? Compare the two tag clouds generated from the same text (a corpus built from the titles of about 2,000 datasets) and see how they differ.

A Multi-word TagCloud produced from 2000 US gov dataset titles

Conventional Single-word TagCloud


  • Meaningful Visualization. As the captions indicate, the first is a “Multi-word TagCloud” while the other is a conventional single-word TagCloud. The former joins individual words into popular multi-word phrases. With the multi-word tag cloud, I can get a much better overview of what data was published.
  • Automated Process. The Multi-word TagCloud was not created by human users but generated automatically by a computer program, powered by the Microsoft Web N-gram service. We can generate such a tag cloud for any existing text document.
  • Cloud+Crowd. Broadly, this demo shows the value of the crowd and the cloud I mentioned in my earlier blog post: big data can now be tackled by the crowd (text from the entire Web) and the cloud (the high-performance computational Web N-gram service).
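
The phrase-joining idea can be sketched as follows. This is a hypothetical reconstruction, not the demo's actual algorithm: it uses within-corpus bigram statistics as a stand-in for the conditional probabilities that the Web N-gram service would supply, and the greedy merging with a fixed threshold is my assumption.

```python
from collections import Counter

def multiword_phrases(titles, threshold=0.6):
    """Greedily merge adjacent words into phrases when the bigram is
    frequent relative to its first word; in the real demo the Web
    N-gram service supplies these conditional probabilities."""
    unigrams, bigrams = Counter(), Counter()
    for title in titles:
        words = title.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    phrases = Counter()
    for title in titles:
        words = title.lower().split()
        i = 0
        while i < len(words):
            j = i
            # extend the phrase while P(next word | current word) is high
            while (j + 1 < len(words)
                   and bigrams[(words[j], words[j + 1])] / unigrams[words[j]] >= threshold):
                j += 1
            phrases[" ".join(words[i:j + 1])] += 1
            i = j + 1
    return phrases
```

Feeding in dataset titles such as “toxic release inventory 2008” and “toxic release inventory 2009” yields the phrase “toxic release inventory” with count 2, rather than three disconnected single-word tags.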

Behind the Scenes

WWW2010 is really inspiring, making me a productive “engineer” even though I came as a researcher. Today I picked up Microsoft Visual Studio and wrote my first C# program. I was an excellent C++ programmer back in my college days (I wrote tons of code using Visual C++ 4.0 more than ten years ago). However, today is not about me being a programmer, but rather about announcing something really cool! I would also like to thank the researchers Evelyne and Paul from Microsoft Research for their great support. My demo on data is powered by the Microsoft Web N-gram Service.


Li Ding @ RPI,  April 29, 2010


Putting open Facebook data into Linked Data Cloud

April 28th, 2010

I recently built a proof-of-concept demo that gets Facebook data (public data only) into the LOD cloud via their recently announced Graph API. The demo is available at

It is fairly straightforward to convert the JSON objects into RDF and make the URIs dereferenceable. The data are now linkable, but not yet linked to other LOD data.

I did see some issues when I was assigning RDF properties. Here is an example JSON object from

   {
     "id": "40796308305",
     "name": "Coca-Cola",
     "picture": "",
     "link": "",
     "category": "Consumer_products",
     "username": "coca-cola",
     "products": "Coca-Cola is the most popular and biggest-selling soft drink in history, as well as the best-known product in the world.\n\nCreated in Atlanta, Georgia, by Dr. John S. Pemberton, Coca-Cola was first offered as a fountain beverage by mixing Coca-Cola syrup with carbonated water. Coca-Cola was introduced in 1886, patented in 1887, registered as a trademark in 1893 and by 1895 it was being sold in every state and territory in the United States. In 1899, The Coca-Cola Company began franchised bottling operations in the United States.\n\nCoca-Cola might owe its origins to the United States, but its popularity has made it truly universal. Today, you can find Coca-Cola in virtually every part of the world.",
     "fan_count": 5425800
   }

1. The JSON file from Facebook does not use the exact Open Graph Protocol terms; below is the mapping:

   name  => og:title
   category => og:type
   picture  => og:image
   link  => og:url

2. We can reuse FOAF and DCTerms to cover some terms used in the Facebook data; below is the mapping:

  picture => foaf:depiction
  name => foaf:name
  from => dcterms:source
  id => dcterms:identifier
  created_time => dcterms:created
  updated_time => dcterms:modified
  category => dcterms:type
  link => foaf:homepage
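
The FOAF/DCTerms mapping above can be sketched as a small conversion routine. This is a hypothetical reconstruction, not the demo's code: the `base` URI prefix is made up (the real demo mints dereferenceable URIs), and all values are emitted as plain literals.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/terms/"

# Graph API keys -> FOAF/DCTerms properties, per the mapping above
MAPPING = {
    "name": FOAF + "name",
    "picture": FOAF + "depiction",
    "link": FOAF + "homepage",
    "id": DC + "identifier",
    "category": DC + "type",
    "created_time": DC + "created",
    "updated_time": DC + "modified",
}

def facebook_json_to_ntriples(doc, base="http://example.org/fb/"):
    """Turn one decoded Graph API JSON object into N-Triples lines.
    `base` is a placeholder prefix, not the demo's actual namespace."""
    subject = "<%s%s>" % (base, doc["id"])
    lines = []
    for key, value in doc.items():
        prop = MAPPING.get(key)
        if prop is None:
            continue  # keys outside the mapping are skipped in this sketch
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append('%s <%s> "%s" .' % (subject, prop, literal))
    return lines
```

Running this over the Coca-Cola object above produces one triple per mapped key, e.g. a foaf:name triple for "Coca-Cola".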

Li Ding@RPI April 28, 2010


Sameas Network

April 26th, 2010

A sameas network is a network of URIs inter-connected by the owl:sameAs relation. It is an interesting network because it is not a conventional social network, but rather a socially contributed directed graph connecting “equivalent” identities.

Our recent study [1] crawls sameas networks following linked data principles: starting from a given seed URI, we dereference the URI and recursively fetch the URIs linked by owl:sameAs. We used a fairly small seed set of New York Times URIs (100 people, 100 locations and 100 organizations) and got 300 sameas networks. Please come to the WebSci Poster Session today (April 26, 2010) to see more discussion.
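
The crawling procedure can be sketched as a breadth-first traversal. This is a simplified sketch, not the study's implementation: `dereference` stands in for the HTTP GET plus RDF-parsing step, and the cap on visited URIs is an assumed safeguard.

```python
from collections import deque

def crawl_sameas(seed, dereference, max_uris=1000):
    """Breadth-first crawl of a sameas network following linked data
    principles. `dereference(uri)` must return the owl:sameAs targets
    asserted in the document behind that URI."""
    seen, arcs = {seed}, []
    queue = deque([seed])
    while queue and len(seen) < max_uris:
        uri = queue.popleft()
        for target in dereference(uri):
            arcs.append((uri, target))  # sameas arcs are directed
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, arcs
```

Because the arcs are directed, the crawl naturally exposes asymmetries such as NYT linking to DBpedia without a link back.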


The average size of a sameas network is 22, and one of the largest networks has 58 URIs connected by 1,249 sameas arcs. Not all URIs are dereferenceable, and the dereferenceable ones may be described by anywhere from one to over a thousand triples.

Following are some interesting breaking observations, confirmed in several plotted sample sameas networks (they are “breaking” because they have not even been printed in our poster yet).

  • New York Times (NYT) and DBpedia have different preferences on mutual sameas relations. It is interesting to see that NYT connects its numerical URIs to non-numeric URIs in Freebase.
  • Many DBpedia URIs were connected not within DBpedia but via Freebase. Within DBpedia, the “dbpprop:redirect” property was used to connect equivalent URIs.
  • Wrong links were introduced via Freebase: dbpedia:Paul_Allen was linked to dbpedia:Paul_Allen’s_House.

Paul Allen and his House (People NYT)

Arctic (Location NYT)


  • A lot of URIs do not carry information or merely redirect (see my paper), so it would be useful to skip these URIs and thereby reduce the cost of linked data exploration and of loading sameas URIs.
  • The quality of sameas links is a big concern; the legitimacy of the Freebase sameas relations is debatable.

Comment from Tim Berners-Lee: let’s leverage semantics. We can look at the semantic annotations (e.g. rdf:type) of the URIs being described to automatically infer potentially bad data integration. Paul Allen’s House would then be knocked out because its type is “house”.
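
Tim's suggestion can be sketched as a simple type-disjointness check. The URIs and type names below are illustrative only, and a real implementation would consult ontology-level class disjointness rather than raw set intersection.

```python
def suspicious_sameas(arcs, types):
    """Flag sameas arcs whose two endpoints have known rdf:type sets
    that do not overlap, as in the Paul Allen vs. Paul Allen's House
    example. `types` maps a URI to its set of type names; URIs with
    no type information are left unflagged."""
    flagged = []
    for a, b in arcs:
        if types.get(a) and types.get(b) and not (types[a] & types[b]):
            flagged.append((a, b))
    return flagged
```

A sameas arc from a URI typed as a person to one typed as a building would be flagged for human review rather than blindly merged.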

[1] Ding, Li, Shinavier, Joshua, Finin, Tim and McGuinness, Deborah L. (2010) An Empirical Study of owl:sameAs Use in Linked Data. In: Proceedings of WebSci10: Extending the Frontiers of Society On-Line, April 26-27, 2010, Raleigh, NC, US.

Li Ding  @ RPI April 26, 2010


Three principles for building government dataset catalog vocabulary

April 23rd, 2010

There is some ongoing interest in vocabularies for government dataset publishing, with a number of proposals such as DERI's dcat, Sunlight Labs' guidelines and RPI's proposal for a Data-gov Vocabulary. Based on our experience with catalog data, we found the following principles useful for consolidating the vocabulary-building process and potentially bringing consensus:

1. Modular vocabulary with a minimal core
  • keep the core vocabulary small and stable; include only a small set of frequently used (or required) terms
  • allow extensions contributed by anyone. Extensions should be connected to the core ontology and should be eligible for promotion to core status later.
2. Choice of terms
  • make it easy for curators to produce metadata using a term, e.g. do they need to specify data quality?
  • make the expected range of a term clear, e.g. should curators use “New York” or “dbpedia:New_York” for spatial coverage? Does it require a controlled vocabulary? A validator would be very helpful.
  • make the expected use of a term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search or facet browsing?
  • try to reuse terms from existing popular vocabularies
  • identify the required, recommended and optional terms
3. Best practices for actual usage
  • we certainly want the metadata to be part of linked data, but that is not the end. We would like to see the linked data actually being used by users who don’t know much about the semantic web.
  • we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, Microformat, ATOM, JSON, XML Schema, OData
  • we should build use cases, tools and demos that exhibit the use of the vocabulary, to promote adoption
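
To make the principles concrete, here is a hypothetical catalog record in Turtle that keeps the core minimal by reusing DCTerms, uses a URI rather than a plain string for spatial coverage, and attaches one extension term. The `dgx:` namespace and its property are made up for illustration; they are not taken from any of the proposals above.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dgx:     <http://example.org/dgx#> .      # hypothetical extension namespace
@prefix ex:      <http://example.org/dataset/> .

ex:dataset-92
    # core: a few required terms, reused from a popular vocabulary
    dcterms:title      "Toxic Release Inventory 2008" ;
    dcterms:identifier "92" ;
    dcterms:modified   "2010-04-23"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dcterms:spatial    <http://dbpedia.org/resource/New_York> ;  # URI, not the literal "New York"
    # extension: a contributed term, connected to the core record
    dgx:numberOfTriples 131000 .
```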

Comments are welcome.

Li Ding @ RPI


Big Data for the Cloud and the Crowd

April 1st, 2010

Researchers have long been starving for big data to improve their research. Nowadays big data is no longer a dream but something real on the Web: an increasing amount of data is becoming available for public access from research communities, individuals, government agencies and others. So what does such big data mean to web users, and how can we best use it? Following are some potential benefits of big data.

“Make sense of what has been known”. Scientific research grows progressively, and scientific discoveries are founded on knowledge established in the past. To avoid reinventing the wheel, we should preserve what we have learned as part of big data and make it available to ongoing research. Currently, keyword search, such as Google Scholar, has successfully helped researchers retrieve previous research work. Beyond that, well-organized knowledge about past research is needed to give users a systematic and accurate way to access past work. With better knowledge of what has been done, users can better identify promising research directions and approach new discoveries.

“Support hypothesis generation and testing”. With big data in hand (or publicly accessible), not only scientists but also general users can start thinking more about hypotheses, including theoretical models and pop-science questions. A humble use of big data would be an interactive application that lets users conveniently aggregate distributed big data and then invent or evaluate their hypotheses against it. One step forward would be applying powerful AI technology (especially statistical methods) to big data to help users identify similar or unique data and hypotheses, prioritize potentially interesting candidate hypotheses, and even come up with new hypotheses.

“Support persistence and accountability”. If big data are going to be the foundation for massive scientific research and public use, reliable data availability is needed by all applications that depend on the data. Meanwhile, without effective accountability mechanisms over the distributed and shared big data, conclusions derived from the big data may not be trusted.

To realize these benefits, the emerging field of Web Science seems very promising, as it brings many interesting opportunities for dealing with big data:

“Linked Data” [1]. Big data is not merely a massive collection of information islands bounded by their physical locations; its value can be greatly increased if the data are effectively linked (or networked). Similar to the hyperlinks on the Web, it is very important to turn implicit inter-data connections into declarative ones and make the links available as part of big data: a person’s medical records can be linked across different clinics and hospitals, demographic state statistics (e.g. livestock and gross income tax) can be linked across different government agencies [2], and information about a disease can be linked to entries in GenBank.

“Social Machine” [3]. Big data should also interact with human society. Crowd sourcing, such as Wikipedia and Web rating systems, has been seen to add huge value to the knowledge on the Web. However, that is not yet the ultimate vision. We can combine the power of machines and humans to build the social machine: cloud computing, such as Google search and Microsoft’s recently announced Web n-gram service, offers great computing power for processing massive data, while crowd sourcing, such as Wikipedia, can distribute the cost of solving hard problems across massive human intelligence on the Web and supply high-quality results. The social machine also supports interactive problem solving: there is a feedback loop between the cloud and the crowd, and consumers can feed comments and enhancements back to the publisher.

“Knowledge Provenance” [4,5]. Big data are often integrated when being used. Declarative knowledge provenance (e.g. an audit trace) is the foundation of transparency in distributed data processing. Computations on provenance data are key to accountability, e.g. a policy framework to assure proper use of digital information and trust mechanisms to assure the credibility of reused data.


[1] Tim Berners-Lee, Linked Data, 2007.

[2] Li Ding, Dominic Difranzo, Alvaro Graves, James Michaelis, Xian Li, Deborah L. McGuinness and Jim Hendler, Data-gov Wiki: Towards Linking Government Data, in Proceedings of the AAAI Spring Symposium on Linked Data Meets Artificial Intelligence, 2010.

[3] J. Hendler and T. Berners-Lee, From the Semantic Web to Social Machines: A Research Challenge for AI on the World Wide Web, Artificial Intelligence, 2009.

[4] Deborah L. McGuinness, Li Ding, Paulo Pinheiro da Silva and Cynthia Chang, PML 2: A Modular Explanation Interlingua, in Proceedings of the AAAI'07 Workshop on Explanation-Aware Computing, 2007.

[5] Li Ding, Provenance and Search Issues in RDF Data Warehouse, in Proceedings of SemGrail Workshop, 2007.

Li Ding,  April 1, 2010
