Archive for the ‘Web Science’ Category

Rankings, Google and Semantic Web

November 12th, 2008

Over the last few centuries, humankind has experienced an exponential increase in the amount of information available. This is even more noticeable on the Web, which puts information within reach in just a few clicks: far more than we as humans can process and assimilate.

Thus we need to discriminate among the gargantuan amount of information available to find what we are looking for (or the closest thing to it). The traditional Information Retrieval approach is based on keyword search. However, it is difficult to tell which of several (potentially billions of) pages has the most useful information for what we are looking for. To do that, we need to discriminate, and the general idea of discriminating rests on the concept of ranking(*): an order (whether partial or total) over some entities, based on a set of criteria.

This is a good way of handling information because we don’t have the resources (time, memory, etc.) to navigate through all the available data. And that is exactly what Google does: we ask “I need to find some pages that contain keywords X, Y and Z” and Google answers “Look, according to my algorithm and the data I have, here is a list ordered starting from what I think is the most relevant page for your query”.
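To make the idea of ranking concrete, here is a minimal sketch in Python. The scoring criterion (counting keyword occurrences) and the names `score` and `rank` are my own illustrative assumptions; this is certainly not how Google computes relevance. Note that the scores alone give only a partial order (pages can tie), and the tie-break on the page name is what turns it into a total order.

```python
from collections import Counter

def score(page_text: str, keywords: list[str]) -> int:
    """Toy relevance criterion: how often the query keywords occur in the page."""
    words = Counter(page_text.lower().split())
    return sum(words[k.lower()] for k in keywords)

def rank(pages: dict[str, str], keywords: list[str]) -> list[str]:
    """Order page names from most to least relevant; ties are broken by name."""
    return sorted(pages, key=lambda name: (-score(pages[name], keywords), name))

# Hypothetical mini-corpus, just for illustration.
pages = {
    "a.html": "semantic web and ranking on the web",
    "b.html": "cooking recipes",
    "c.html": "ranking ranking ranking",
}
ranking = rank(pages, ["ranking", "web"])
print(ranking)  # ['a.html', 'c.html', 'b.html']
```

Here `a.html` and `c.html` both score 3, so the criterion by itself cannot separate them; only the extra tie-break decides which one is shown first.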

The Semantic Web brings similar challenges, but in this case we are not talking about pages and links, but about any entity (people, cars, webpages, ontologies) related by different predicates (people have friends, cars have parts, webpages have authors, ontologies describe other entities, and so on). Thus the problem is far more complex, since there is more information available.

Also, there are other questions we can ask: Which ontology should I choose for a certain task, given dozens of possible candidates? When using that ontology, if I have a SPARQL query that returns 1e6 results, are they all equally interesting to me? If not, which ones should be shown first?
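The "which ones to show first" question can be sketched as a top-k selection over the result rows. The snippet below is only an illustration under my own assumptions: the rows are plain dictionaries standing in for query result bindings, the result set is scaled down to 100,000 rows, and the `interest` function (number of friends) is a made-up criterion. What counts as "interesting" is exactly the open problem.

```python
import heapq
from typing import Callable, Iterable

Row = dict[str, str]  # one result row: variable name -> value

def top_k(rows: Iterable[Row], interest: Callable[[Row], float], k: int) -> list[Row]:
    """Keep only the k most interesting rows without sorting the whole result set."""
    return heapq.nlargest(k, rows, key=interest)

# Hypothetical result set, generated lazily so it is never fully held in memory.
rows = ({"person": f"p{i}", "friends": str(i % 7)} for i in range(100_000))

best = top_k(rows, interest=lambda r: int(r["friends"]), k=3)
print(best)
```

Using `heapq.nlargest` means only k rows are kept around at any time, which matters when the user will never look past the first page of results anyway.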

The idea of opening your data, sharing it and mashing it up makes everything more complex: it is not enough to have millions of answers; as a user I want the ones best suited to me (whatever that means).

Alvaro Graves

(*) Linguistic thought: It is interesting to me, as a native Spanish speaker, that there is no translation for “ranking”: does this mean that the concept didn’t exist in the Spanish-speaking world?

Author: agraves Categories: Semantic Web, Web Science Tags:

Notes for _Freebase: An Open, Writable Database of the World’s Information_ (ISWC 2008 Keynote)

October 29th, 2008

The ISWC 2008 keynote was presented by John Giannandrea (Metaweb Technologies Inc).

The Semantic Web is based on a graph database, which is not natively supported by relational databases or column stores. (More accurately, the graph database has been brought back by the Semantic Web community, although it was already considered quite promising in the database community ten years ago.)

Ontology creation is a social process, and both Freebase and semantic wikis are tools that enable users to create ontological vocabulary without worrying too much about building a comprehensive ontology. With such an open-ended ontology, an effective query language is very important. Interestingly enough, the query languages of Freebase and Semantic Wiki share a similar flavor: they envision the Semantic Web as an instance store, where the where-clause simply describes a filter for instances and the select-clause focuses on retrieving properties of the resulting instances.
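The "instance store" flavor can be sketched in a few lines of Python. To be clear, this is neither Freebase's query language nor Semantic Wiki's query syntax; the `query` function, the property names and the sample data are all my own assumptions, meant only to show the split between a where-clause that filters instances and a select-clause that picks properties.

```python
from typing import Any

Instance = dict[str, Any]  # one instance: property name -> value

def query(store: list[Instance], where: dict[str, Any], select: list[str]) -> list[dict[str, Any]]:
    """where: property/value pairs every matching instance must have.
    select: the properties to return for each matching instance."""
    matches = (inst for inst in store
               if all(inst.get(prop) == value for prop, value in where.items()))
    return [{prop: inst.get(prop) for prop in select} for inst in matches]

# Hypothetical open-ended store: instances need not share the same properties.
store = [
    {"type": "film", "name": "Blade Runner", "year": 1982},
    {"type": "film", "name": "Alien", "year": 1979},
    {"type": "person", "name": "Ridley Scott"},
]

result = query(store, where={"type": "film", "year": 1982}, select=["name"])
print(result)  # [{'name': 'Blade Runner'}]
```

Nothing here requires a comprehensive ontology up front: an instance that lacks a property simply fails the filter (or returns `None` for that property in the selection).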

Here are some facts about Freebase:

* Scale of Freebase: 156,000,000 assertions made; 1370 published types; 75 domains. (Well, it is easy to see that most published types are well populated.)

* Freebase’s view of the Semantic Web

Yes: graph model, identity, web based.

No: no description logic; schema, not ontology; a writable database!

* Freebase is not a formal system like Cyc, OWL, SUMO, True Knowledge, or Halo; nor is it Google Base.

* An industrial view of the relation between audience and complexity (inverse)

Google > Wikipedia > Del.icio.us > NY Times > dbpedia > cyc, OWL2

(Well, people in industry only care about and learn what is needed to achieve their goals. They care more about functions, adoption and profits, and they are less picky about soundness and completeness.)

Freebase deals with an “identifier” web. While one thing may have quite a few names, the names collaboratively contribute to the semantics too. (Yes, identity is a key problem for web applications.)

Greetings from ISWC 2008 by Li Ding

Author: li Categories: Semantic Web, Web Science Tags:

What leads to interoperability? Lessons learned from Dublin Core and DOI

July 15th, 2008

Interoperability is a desired feature when people access Web content, and there is still a long way to go towards this dream. In general, interoperability on the Web can be abstracted as many users communicating with one another to share information. Two extremes are obvious: (i) achieving one language for all, at the cost that only minimal information can be exchanged, and (ii) achieving a language for each pair, so that each pair can exchange a maximal amount of information. These two extremes may converge when the users are homogeneous, i.e. from the same community and hosting similar information.

While the simplicity and flexibility of Dublin Core (DC) have attracted many followers, they also lead to limited interoperability among DC applications. A comment in [2] made an interesting analogy: “Dublin Core applications are like snowflakes - no two are exactly the same”. For example, dc:date neither restricts the range of its value (which leaves no room for quality validation) nor offers clear enough semantics for the property (it works more like a legal document that needs lawyers’ interpretation). Other researchers [1,3] have criticized DC on the grounds that such limited interoperability may restrict automated metadata processing and thus make DC useless.

Digital Object Identifier (DOI), on the other hand, has a fast-growing instance data space in the publishing industry. Unlike DC, DOI requires more agreements, including (i) more mandatory properties; (ii) more restrictions on the values of properties; and (iii) a federated metadata registration mechanism. These features ensure better-structured and more interoperable DOI instance data.

From the above study, we may raise the following hypotheses:
1. simplicity and flexibility can lower adoption cost, but they should be carefully enforced to avoid damaging interoperability
2. restrictions (e.g. the range of property value) can ensure data quality and thus promote interoperability
3. making more information interoperable among systems is preferable to making all systems interoperate
4. interoperable metadata should support non-trivial automated data integration, such as reference resolution.
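Hypothesis 2 can be illustrated with a small validation sketch in Python. The rules below (a list of mandatory properties, and a `date` property whose value must be an ISO 8601 calendar date) are my own assumptions chosen for illustration; they are not the actual DC or DOI specifications.

```python
from datetime import date

def is_iso_date(value: str) -> bool:
    """True if the value parses as an ISO 8601 calendar date such as 2008-07-15."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True

def validate(record: dict[str, str], mandatory: list[str]) -> list[str]:
    """Return the list of problems found in a metadata record (empty means valid)."""
    problems = [f"missing mandatory property: {prop}"
                for prop in mandatory if prop not in record]
    if "date" in record and not is_iso_date(record["date"]):
        problems.append(f"date outside the allowed range of values: {record['date']!r}")
    return problems

# Hypothetical records: one with a restricted date value, one with free text.
strict = {"title": "Dublin Core: an obituary", "date": "2004-01-01"}
loose = {"date": "sometime in July"}

print(validate(strict, mandatory=["title", "date"]))  # []
print(validate(loose, mandatory=["title", "date"]))
```

Without a restricted range, both records would be equally "valid", and a consuming system would have no automatic way to tell that the second date is unusable.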

Further readings
[1] Beall, J. (2004), “Dublin Core: an obituary”, Library Hi Tech News, Vol. 21, No. 8, pp. 40-1.
[2] Hurst-Wahl, J. (2007), “Dublin Core?” (the comments are more interesting than the blog post), accessed on July 15, 2008.
[3] Cho, A. (2008), “Dublin Core is Dead, Long Live MODS”, accessed on July 15, 2008.

Li Ding

Author: li Categories: Web Science Tags:

Fellowship of the (Semantic) Web: The Two Towers

May 25th, 2008

By popular request (okay, a couple of people asked for it), I have put my talk from Semantic Technologies 2008 online. Warning: it’s about a 22 MB PDF (lots of gratuitous images to keep things fun).

Enjoy.

Jim H.

Author: hendler Categories: AI, Semantic Web, Web Science, tetherless world Tags:

Research challenges from TWINE

May 21st, 2008

An interesting interview (source) by John Breslin revealed some interesting technology features behind Twine: privacy, data integration, and data storage. I have mixed feelings about the fact that no existing triple/quad store is used and that Twine developed its own. How do current Semantic Web technologies fit into enterprise-level, small-group-level, and person-level applications, and which triple store solution is ready to support such applications? The eight-element tuple is designed for efficiency, but will it become a common model for other social Semantic Web sites? As for privacy, are there any new benefits or new challenges brought by Semantic Web technologies, or are we still using the (user, group) access control mechanisms widely used in Web 2.0? Finally, data integration would be a very interesting challenge: do we have reasonably good automatic entity disambiguation tools? How can we use “collective intelligence” to complement the automated tools? And how should the integration results be presented to end users without causing too much surprise? In general, the deployment of Twine is promising, and it will produce more interesting and practical challenges for the research community.

Initially Radar had their own triple store, an LGPL one from the CALO project. They found that it didn’t scale towards web-scale applications, and it didn’t have the levels of transaction control you’d need from an enterprise application. They decided to go for a SQL database (PostgreSQL) with WebDAV. However, relational databases weren’t optimised for the “shape” of data that they were putting into it, so it needed to be tweaked. They’ve had no performance issues so far, but they may move to a federated model next year.

….Twine uses an eight-element tuple store (subject-predicate-object, provenance, time stamp, confidence value, and other statistics about the triple or item itself). They can do predicate inferencing across statements, access control, etc. …

… The key “secret sauce” is that everything in Twine is generated from an ontology. The entire site - user interface elements, sidebar, navbar, buttons, etc. - come from an application ontology…

Q: The first one was about privacy. What if you add something and then later you decide that you want to delete it - is it really deleted or does Twine keep it around?

A: Nova answered that currently, it is not really deleted, it goes into a non-visible triple. But they will be doing that (really deleting it) soon.

Q: As one imports information from various places, what exactly is there in Twine that will prevent a person having to merge any duplicate objects?

A: Nova said there is limited duplication detection at the moment, but this will be improved in a few months. Most people submit similar bookmarks and it is reasonably straightforward to identify these, e.g. when the same item is arrived at through different paths on a website and has different URLs.

Q: Why does Twine use tuple storage: why is it not using a quad?

A: Nova said it’s faster in their system, so for performance reasons they decided to avoid reification.
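To picture what a wider-than-triple statement buys you, here is a rough Python sketch. It models only the elements named in the interview (subject, predicate, object, provenance, time stamp, confidence); the `stats` dictionary is my own stand-in for the "other statistics" that the interview does not spell out, and the field types, the 0.0 to 1.0 confidence scale and the `trusted` helper are assumptions too. This is not Twine's actual storage layout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Statement:
    subject: str
    predicate: str
    object: str
    provenance: str      # who or what asserted the statement
    timestamp: datetime  # when it was asserted
    confidence: float    # assumed scale: 0.0 (no trust) to 1.0 (full trust)
    stats: dict[str, Any] = field(default_factory=dict)  # placeholder for unnamed elements

def trusted(statements: list[Statement], min_confidence: float) -> list[Statement]:
    """Filter on per-statement metadata directly, with no reification needed."""
    return [s for s in statements if s.confidence >= min_confidence]

now = datetime.now(timezone.utc)
# Hypothetical statements with made-up identifiers.
statements = [
    Statement("ex:post1", "ex:about", "ex:SemanticWeb", "ex:userA", now, 0.9),
    Statement("ex:post1", "ex:about", "ex:Cooking", "ex:autoTagger", now, 0.3),
]
print([s.object for s in trusted(statements, 0.5)])  # ['ex:SemanticWeb']
```

With plain triples, saying "this statement came from the auto-tagger with confidence 0.3" would take several extra statements about the statement itself; keeping that metadata in the same tuple is presumably the performance argument Nova is making.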

Li

Author: li Categories: Semantic Web, Web Science Tags: