We’ve come a long way, maybe… (preprint for an editorial for IEEE Intelligent Systems)

The various 50th anniversary events for AI in America that happened a couple of years ago threatened to make me think about the fact that, shall we say, I’m no longer eligible for young researcher awards. Luckily, I kept very busy running events and special issues, so I managed to avoid thinking about it. However, there are some recent events that have caused me to realize that I’ve been doing AI for a reasonably long time, and to reflect on some aspects of the progress that has been made in the more than thirty years since my first publication in the field.

Recent events

The first of these events was a recent panel entitled “Artificial Intelligence Theory and Practice: Hard Challenges and Opportunities Ahead” which was chaired by Eric Horvitz, President of the Association for the Advancement of Artificial Intelligence, at the Microsoft Faculty Summit. There were seven of us on the panel representing a reasonable range of AI fields, and I think it was a pretty interesting discussion (at the time of this writing the video of the panel has not yet been posted to the Web, but it should be by the time you’re reading this, so search for it, I think you’ll enjoy watching it). Scarily, at some point I realized that I was the person on the panel who’d been doing AI the longest (although edging out a couple of others whose hair was as white as mine because I started in my freshman year of college). I also realized that I’ve been working on more-or-less the same problems for my whole career, but strangely at different times I’ve been seen as anything from a mainstream AI researcher to, more recently, doing research outside the mainstream of the field. That’s one theme I’ll return to.

A second theme arises from the use of the term “AI-Complete Problem” on the panel – with several people giving examples of what they thought some were. The odd part is that, that age thing again, I remember when the term was first introduced, and people really meant it to mean a problem on which, to demonstrate significant capabilities, you would need to solve the whole range of problems in AI – ranging from vision and robotics to language and planning. Nowadays it seems to me that the term was being used for “very hard problem,” which is something very different. I’ll return to this as well.

Another event was a meeting of the investigators from one of the large DARPA-funded projects that I’m involved in. For those outside the US, or lucky enough to be funded from other sources, you may not know that in recent years DARPA has pushed for AI researchers to form teams with industrial partners, and try to solve hard “go/no-go” problems. In the case of this particular project, in the second year we had to run our system on a set of problems, run human subjects on the same set of problems, and show our system outperformed the humans. The problems were in a relatively complicated domain, and DARPA set some pretty difficult ground rules, the main one being that we couldn’t build in a lot of domain knowledge to solve the problem. Rather, we had to use a set of learning technologies (mostly related to explanation-based learning) to accomplish the task. Amazingly, we passed.

Tying it together

So what is it that brings these three themes together? My answer is that all three of them relate to a narrowing of the goals of the AI field over the past decade or so.

To start with, in my career a large part of my work has always been focused on scaling of knowledge-based inferencing in one way or another. The reason is not that I think this is critically important for applied AI, although I do, but rather because one of the things that is clear to me when I interact with my computer is that we each have very different kinds of memory. My computer never really forgets anything I tell it (well, there was that hard disk crash a couple of years back, but that isn’t what I mean), while my memory is pretty porous. However, with the exception of some kinds of purely statistical inference, my computer also doesn’t seem able to put things together the way I would. I may forget the details of some particular restaurant I ate in, or the name of my middle school math teacher, but I sure can integrate a lot of other information about restaurants and middle school math in ways that my computer still can’t. This is similar to the theme of my earlier letter “Computers play chess, humans play go” [[insert ref]], but in this case I’m emphasizing that we humans do something really amazing with our memories, that computer models still don’t come near. We also seem to be the entities with the most sophisticated symbolic reasoning capabilities that we know of, a reason why I’m as yet unconvinced that all the success with probabilistic models is getting us nearer to understanding human intelligence.

And speaking of human intelligence, I think that takes us to the second theme. The original concept of “AI Complete” included the idea that solving the problems would teach us something about intelligence writ large. That is, while there were always engineering goals in AI, and one of the reasons this magazine started (back when it was called IEEE Expert) was to reflect that, there tended to be a general feeling in the field that the goal of AI included an understanding of intelligence. Not necessarily human intelligence in the sense of cognitive modeling, but just as we know a lot more about how birds fly from having learned the aerodynamics of making planes fly faster, looking at the difference between computers and humans solving problems situated in the real world and needing a lot of knowledge, seemed like a way to learn more about humans and thought. I also remember that I used to hear at early AI conferences, but rarely if ever hear any more, that looking at problems that humans were much better at than computers was a good way to get inspiration as to what AI problems to attack – the challenge made them compelling, and steps towards solution could, again, help us understand more about what intelligence was.

And what about the success of the DARPA project? How could that tie into the narrowing of AI? I guess my reaction came from the fact that of all the researchers in the room, I was the only one who was really surprised that we were able to outperform the humans on this test. The point is not that the others were so blasé, but rather that in today’s AI milieu we are used to seeing AI programs outperform humans, whether it be at searching for information, predicting traffic patterns, beating the world’s chess champion, or solving narrowly defined problems. My surprise in this case came from the fact that the problem we were solving was actually pretty hard, but not artificially so due to data overload, a bad interface or the like.

In the problem we attacked, there are many plausible solutions, but only a few good ones, and the restriction was that there was only a small amount of training data (in fact, one expert trace) for the computer to learn preferences from. This was the kind of problem that we once said we’d be able to get AI to do “someday,” and I was very glad to see, that at least for this particular case (and with an investment of a very large amount of time and money), that day had come. However, I was also a bit chagrined to realize how long it’s been since the last time I had that sort of thrill. While I’ve joined others in the field of AI in celebrating our successes, many of which I have pointed out in these pages over the past few years, this experience made me realize how rare it is that I feel the thrill I felt in the early days of my AI career when, I have to admit, it was a pretty amazing accomplishment when we got the computer to do anything that seemed “smart.”

The consequences

So why do these things bother me? After all, this seeming change in our direction has enabled AI to accomplish significant engineering advances and to become a stronger and better-understood technology. In fact, on the panel I mentioned earlier, one of my younger colleagues expressed how proud she was that when she first took AI it seemed to be pretty ad hoc, and now when she teaches it the course is full of much stronger theoretical material. My rejoinder, grumpy old man that I’ve become, was that when I first took AI, I was hooked for life on the first day, when Roger Schank, who was teaching the course, said something to the effect that almost nothing he would teach us was proven to be correct, and that any one of us was just as likely to come up with a key insight as the people whose work we’d be studying. I found this exhilarating; my colleague seems to have found it a weakness.

But here’s the thing! Despite all the major AI successes of the past decade, and the great strides we have made, what Roger said is still true! When it comes to really understanding the amazing symbolic processor that is the human mind, we still know very little. While I don’t mind that we have more techniques to teach our students, I think it is important that we don’t become enthralled by what Marvin Minsky has referred to as “physics envy.” We should admit, gloriously and deliberately, that when it comes to understanding intelligence our field is still in its very early days. The challenge remains, and it is one of the greatest intellectual challenges of our, nay of all, times: to understand thought, consciousness and intelligence. The best and brightest students go where the most exciting problems are, and we’ve got one of the all-time winners! Let’s not forget that fact.

So as my days as Editor-in-Chief wane, and you read this, my penultimate letter, I hope you will remember that although the focus of this magazine includes “systems,” with an emphasis on bringing AI theories into practice, it also includes “intelligent” and that’s something that we as a field mustn’t ignore. We’ve made a lot of progress, but the journey is far from over, and the original goal is still far over the horizon.

Happy Sailing,

Jim Hendler

Cuil, Semantic Search

Last week, Cuil.com caught my eye. It made a very good impression on me in just 5 seconds (BTW, 10 seconds is the longest any website gets to survive with me). First, I tried my name, as many people probably do. It didn’t disappoint me: it hit my pages quite precisely. I also love the grid-based layout. A few minutes later, I found its “Explore by Category” option. It looks like Cuil has some sort of ontology hierarchy for web pages.

A few “google” results suggest that Cuil may use some clustering technique to build such hierarchies. It is interesting to consider whether such hierarchies will indeed improve the search experience. When I search for “Semantic Web”, Cuil recommends that I browse “Ontology (computer Science)” and some of its subcategories; it also suggests that I look at “James Hendler”’s homepage. I would say that this will be very useful for exploring.

Building metadata using machine learning technology is a cool thing. On the other hand, I believe that human intervention is also critical. When Wikipedia knowledge is used in clustering, I expect some gain in recall or precision. As “Ontology (computer Science)” is a Wikipedia page, I guess that Cuil may have already used Wikipedia information in its results.

Also don’t forget the “network effect”. I have kept a prefix-based, syntactic Gmail label hierarchy for a while. I would really like to share part of the hierarchy with my friends, so that when I send a mail labeled “party”, they don’t need to relabel it. If millions of users can share their small hierarchies (not only on Gmail, but also on Flickr, YouTube, Twine, etc.), each connected somehow to the hierarchies of friends and family, eventually we will have a very large network of ontologies, which may improve search much more than we can now. Just a random thought.
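To make the label-sharing idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the "/" separator, the label names, and the functions build_tree and merge_trees are hypothetical, and nothing here reflects how Gmail or any other service actually stores or shares labels. It only shows that prefix-style labels can be read as a small tree, and that two people's trees can be merged into one.

```python
from typing import Dict, Iterable

# A hierarchy is a nested dict: each key is a label part, each value its subtree.
Tree = Dict[str, "Tree"]


def build_tree(labels: Iterable[str], sep: str = "/") -> Tree:
    """Turn prefix-style labels such as 'social/party' into a nested dict."""
    root: Tree = {}
    for label in labels:
        node = root
        for part in label.split(sep):
            node = node.setdefault(part, {})
    return root


def merge_trees(mine: Tree, shared: Tree) -> Tree:
    """Return a new tree containing every branch from both hierarchies."""
    # Deep-copy my own tree first so neither input is modified.
    merged: Tree = {key: merge_trees(sub, {}) for key, sub in mine.items()}
    for key, sub in shared.items():
        merged[key] = merge_trees(merged.get(key, {}), sub)
    return merged


# Hypothetical labels from two users.
my_labels = ["social/party", "work/tw/meeting"]
friend_labels = ["social/party/birthday", "social/family"]

combined = merge_trees(build_tree(my_labels), build_tree(friend_labels))
```

After the merge, "party" appears once under "social" and carries the friend's "birthday" sub-label, so a mail labeled on one side would not need relabeling on the other. A real system would also have to reconcile labels that mean the same thing but are spelled differently, which this sketch ignores.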

P.S. I found one interesting thing. Cuil caches my wiki page at Iowa State University. However, that page should have been offline since no later than May 2008, while Cuil officially went online only on July 28, 2008. It seems its crawler has been alive for a while.

Jie Bao

Captcha, Turing Test, and Semantic Web

On the web nobody knows you are a dog, …… or a human. That’s why there are programs on the web that check that a visitor is a human (and not a bot, a dog, a cat, or ……). The most popular ones are captchas. They are based on a simple assumption: no OCR agent so far is as smart as a human. To me, this looks like a super-simplified Turing test: an AI program has “real” intelligence, as a human has, if another human who asks both the same questions can’t tell which is the AI and which is the human.

I can’t help wondering what test we will use to identify a human on the web once OCR agents get smart enough to pass the captcha test (I strongly believe that day is not far away). Math? That will be easy for a good program. Scrabble? Maybe, but not that secure. Ask for a Shakespeare sonnet? Or the year World War II ended? That looks more likely to succeed. But there are two issues.

First, an agent may have access to a knowledge base. With projects like DBpedia, human knowledge is being “KBized” (turned into knowledge bases) at a speed never seen before in history. A query such as “the end year of world war II” may be answered by a semantic web agent fairly quickly. I can imagine that someday we will have to design increasingly hard questions (like art things) to identify a human and fight spam.

The other issue is that a human may have NO access to a knowledge base. Many, many people in the world do not know “the end year of world war II”, even if they are knowledgeable about other things. They may not even know where to find such knowledge. They can also get bored when constantly asked such captcha questions and quit; technically, that means they failed the test and thus are not “human”. As captchas become increasingly hard (like art things), more and more people may fail for one reason or another (including boredom). That will also lead to the failure of the identification system.

Will the semantic web help spammers by enabling smart agents? :) Maybe; let’s wait and see.

Jie Bao

Towards Webtop

by Jie Bao

Some of our Tetherless World researchers including me have just written a short paper to sell the idea of constructing a “webtop” using semantic technologies. In short, a webtop is a desktop on the web, that does similar jobs such as managing files, doing word processing, managing contacts, scheduling tasks, emailing, etc. Please see some examples of webtops with pretty GUIs.

Almost a decade ago, the concept of the “network computer” was hot for a while. At that time, a network computer meant a low-end computer with limited storage and computational capacity that relied on the network for its power. The webtop idea reminds me of the network computer: while they differ in many aspects, they share the idea of empowering users with networked infrastructure. Ten years ago, this vision was tested with physical computers and largely failed; today, with the advance of technology, it is being revived by allowing users to create virtual computers that exist only on the websphere. I have many reasons to believe that this time it will not only survive, but also prevail.

One reason comes from my personal experience. About two years ago, I stopped installing many pieces of software that had been with me for years: Encarta was replaced by Wikipedia.com, Outlook by Gmail, MS Street by Google Maps, MS Word by writing in a wiki, and PowerPoint by online LaTeX writing with the Beamer package, among a long list of other things. The browser is the application I stay in for more than 80% of the time I’m on my computers. I have a strong need to organize all these online applications and data; simple bookmarking is barely a solution. I need something that can organize them, give me quick access to them, and, last but not least, look pretty and neat. A webtop does exactly those things.

How do semantic technologies help in providing a webtop? Actually, long before the term “ontology” became popular, users were already creating ontologies on a daily basis: classifying email, creating file folder trees, grouping contacts, or naming a photo “Wedding picture at Troy”. All those efforts create relations between things or attach a “meaning” to an entity. With semantic technologies, those relations and annotations can be made explicit so that data can be more easily managed and queried. For example, I may ask to “find all 2005 photos of my friends”, or “show all meetings (even if they are not called meeting, such as “briefing”) in the past month”. A webtop based on semantic technologies will make such an ability available to any application running on top of it.
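The two example queries above can be sketched with a toy in-memory triple list in plain Python. This is only an illustration under stated assumptions: the data, the predicate names (type, subClassOf, year, depicts, friendOf) and the helper functions are all made up for this sketch, and a real webtop would more likely use RDF and a query language such as SPARQL. The date filter ("in the past month") is left out to keep it short.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical annotations a webtop might collect from ordinary user actions.
TRIPLES: List[Triple] = [
    ("briefing", "subClassOf", "meeting"),
    ("standup", "subClassOf", "briefing"),
    ("event1", "type", "meeting"),
    ("event2", "type", "briefing"),
    ("event3", "type", "standup"),
    ("event4", "type", "party"),
    ("photo1", "type", "photo"),
    ("photo1", "year", "2005"),
    ("photo1", "depicts", "alice"),
    ("photo2", "type", "photo"),
    ("photo2", "year", "2006"),
    ("photo2", "depicts", "alice"),
    ("photo3", "type", "photo"),
    ("photo3", "year", "2005"),
    ("photo3", "depicts", "stranger"),
    ("me", "friendOf", "alice"),
]


def objects(subject: str, predicate: str) -> Set[str]:
    """All o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate: str, obj: str) -> Set[str]:
    """All s such that (s, predicate, obj) is in the store."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def subclasses_of(cls: str) -> Set[str]:
    """The class itself plus everything below it in the subClassOf hierarchy."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child in subjects("subClassOf", current):
            if child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls: str) -> Set[str]:
    """Everything typed as cls or as any of its subclasses."""
    result: Set[str] = set()
    for sub in subclasses_of(cls):
        result |= subjects("type", sub)
    return result


def photos_of_friends(year: str) -> Set[str]:
    """Photos from the given year that depict at least one of my friends."""
    friends = objects("me", "friendOf")
    return {
        photo
        for photo in instances_of("photo")
        if year in objects(photo, "year") and objects(photo, "depicts") & friends
    }


all_meetings = sorted(instances_of("meeting"))
friend_photos_2005 = sorted(photos_of_friends("2005"))
```

Here all_meetings includes the events typed as "briefing" and "standup" even though neither is called a meeting, because the subClassOf relation was made explicit. That is the whole point: the application asking the question does not need to know every name users have invented for a meeting.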

There have been controversies about the semantic web ever since the term was coined. I think this is partly because the semantic web community as a whole has failed to provide enough end-user-friendly tools that do something helpful in daily life. I wish to see more tools to help with daily web activities: semantic email, semantic blogs, semantic calendars, semantic abstracts of news (a little more than RSS), tagging files (pictures, mp3s,…) with a taxonomy, etc. Even more important, to survive, such an application should never ask users to learn RDF or anything that takes more than 3 minutes to understand. Bring such applications together, and you have a webtop. I believe something like this is one of the killer apps the community has long been waiting for.

OWL or OLD?

I just noticed the “OWL 2 Web Ontology Language: Requirements” document from the OWL Working Group. Interestingly, while the “W” in OWL stands for “Web”, I didn’t see any use case from web applications in the usual sense. As the leading requirements come from the need for domain knowledge bases, I would suggest that the new language, instead of OWL 2, be named Ontology Language of Domains (OLD). Just kidding. OWL claims to be needed by common web users, but such users are surprisingly under-represented in the specification process. We have already seen many specially designed, highly expressive but narrowly applied languages in the old KR schools. Do we need to invent yet another one here, again?

Jie

Human and the Semantic Web

“The Semantic Web is mainly serving machine agents” has been the dominant view in my mind for many years. Now human users may also want to explore the big mass of RDF data, and not just for debugging purposes. Semantic Web user interaction is becoming an important part of the Semantic Web layer cake and an important research direction at ISWC (see the SWUI workshops).

As a “web of data”, the Semantic Web, boosted by Linked Data efforts, presents web users with a maze of RDF graphs with billions of arcs (triples). To explore the maze, below are some HTML browser approaches I came across:

An alternative approach is the graphical browser, which seems to be more intuitive to end users. An interesting blog post, Large-scale RDF Graph Visualization Tools, covers a handful of useful resources, including some I had never encountered, and even links to 28 visualization software packages. Of course, the list misses some RDF visualization browsers such as FOAFnaut, Welkin, and self visualization. Notably, scalability still troubles most of the visualization approaches because of memory limits: in my last experience, Otter had a hard time processing a graph with over 10,000 nodes.

There are still many user interaction issues beyond the browsers (e.g. search engines, semantic wiki), and a well-designed UI component is probably the key to the Killer-App of the Semantic Web.

Li Ding

What leads to interoperability? Lessons learned from Dublin Core and DOI

Interoperability is a desired feature when people access Web content, but there is a long way to go towards this dream. In general, interoperability on the Web can be abstracted as many users communicating with one another to share information. Two extremes are obvious: (i) achieving one language for all, at the cost that only minimal information can be exchanged, and (ii) achieving a language for each pair, so that each pair can exchange the maximum amount of information. These two extremes may converge when the users are homogeneous, i.e. from the same community and hosting similar information.

While the simplicity and flexibility of Dublin Core (DC) have attracted many followers, they also lead to limited interoperability among DC applications. A comment in [2] made an interesting analogy: “Dublin Core applications are like snowflakes - no two are exactly the same”. For example, dc:date neither restricts the range of its value (which leaves no room for quality validation) nor offers clear enough semantics for the property (it works more like a legal document that needs lawyers’ interpretation). Other researchers [1,3] have criticized DC on the grounds that such limited interoperability may restrict automated metadata processing and thus make DC useless.

Digital Object Identifier (DOI), on the other hand, has a fast-growing instance data space in the publishing industry. Unlike DC, DOI requires more agreements, including (i) more mandatory properties, (ii) more restrictions on the values of properties, and (iii) a federated metadata registration mechanism. These features ensure better structured and more interoperable DOI instance data.

From the above study, we may raise the following hypotheses:
1. simplicity and flexibility can lower adoption cost, but they should be carefully enforced to avoid damaging interoperability
2. restrictions (e.g. the range of property value) can ensure data quality and thus promote interoperability
3. making more information interoperable among systems is preferable to making all systems interoperate
4. interoperable metadata should support non-trivial automated data integration, such as and reference resolution.
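Hypothesis 2 can be illustrated with a small Python sketch. It assumes, purely for illustration, that a metadata profile restricts a date property to full calendar dates written as YYYY-MM-DD; this restriction, the sample values, and the function parse_restricted_date are assumptions of the sketch and are not mandated by Dublin Core or DOI. With the restriction in place, a consumer can validate and compare values mechanically; without it, each free-form value needs human interpretation.

```python
import re
from datetime import date
from typing import List, Optional

# Assumed restriction: exactly four digits, two digits, two digits.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_restricted_date(value: str) -> Optional[date]:
    """Return a date if the value is a real YYYY-MM-DD calendar date, else None."""
    match = ISO_DATE.fullmatch(value.strip())
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Well-formed but impossible, e.g. February 30.
        return None


# Hypothetical values of the kind an unrestricted date field may contain.
records: List[str] = [
    "2008-07-15",
    "July 15, 2008",
    "15/07/08",
    "2008-02-30",
    "circa 2008",
]

valid = [r for r in records if parse_restricted_date(r) is not None]
rejected = [r for r in records if parse_restricted_date(r) is None]
```

Only the first record passes. The other four may be perfectly meaningful to a human reader, which is exactly the trade-off in hypothesis 1: the restriction raises the cost of producing metadata but makes automated processing possible.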

Further readings
[1] Beall, J. (2004), “Dublin Core: an obituary”, Library Hi Tech News, Vol. 21, No. 8, pp. 40-1.
[2] Hurst-Wahl, J. (2007), “Dublin Core?” (the comment is more interesting than the blog post), accessed on July 15, 2008.
[3] Cho, A. (2008), “Dublin Core is Dead, Long Live MODS”, accessed on July 15, 2008.

Li Ding

OWL Mobile: Ontology Browser for iPhone/iTouch

The Tetherless World invites users of Apple’s iPhone and iPod Touch to try out our new ontology browser, OWL Mobile.

OWL Mobile is powered by Jena and Pellet, operating remotely, to provide the speed and battery performance that mobile device users expect from their applications. Load one or more ontologies through the Load Ontologies tab. Supply a URL to a custom ontology or use the list of past ontologies. Once you’ve loaded an ontology, use the “Classes”, “Properties”, and “Individuals” tabs to browse through the ontology. Clicking on an item will expand it and give additional information about that particular object. Links which point to other members of the ontology will switch to the appropriate URI when clicked. External links such as web pages, email addresses, and phone numbers will open the appropriate application on the iPhone (phone numbers won’t work on the iTouch) when activated.

Point Safari to http://onto.rpi.edu/demo/owlmobile2/ to try the application. Feel free to bookmark it or add it to the home screen for easy access.

Evan Patton

Author, author (for Planet RDF)

Alright, this is an odd blog post since it’s really directed at the bloggers who are aggregated on Planet RDF, and this was the only way I could figure out how to get it there.

Like many of the other blogs on this site, the Tetherless World blog has a number of different authors who write our pieces.  On PlanetRDF, however, if we don’t sign the blog, you cannot tell who wrote it (except by guessing) unless you link over to the original blog site.  This is odd as the RSS feeds from most of these blogs use some form of author field.

I did a little mousing around, and best I can tell there seem to be a bunch of different ways the different blogs report the author, with the RSS author element not being the most used.  I wonder if either we the bloggers, or PlanetRDF’s keepers, might not think about fixing this somehow?

At the very least, perhaps we could use the low tech solution and sign our blogs.

cheers

The masked blogger

Grandma Gone Surfing

Debbie Heisler has just sent me a link, “Internet overhaul wins approval”. One of the proposals mentioned that caught my eye is that domain names written in Asian, Arabic or other scripts will be supported.

Although it may not be a new idea (for example, 3721.com, now part of Yahoo!, has provided a service supporting URLs in Chinese for years), having local names in scripts other than Roman characters is absolutely a good move. About 10 years ago, I was asked to teach one of my father’s colleagues how to use computers; it was a hard job because she didn’t know how to use a keyboard, which in turn was because she didn’t know the characters “A”, “B”, “C”. My mom is better off: she is now a daily web surfer and she knows Roman characters, but she can never remember English words like “Google”, not to mention google.com. What she does now is set a hub page as her browser’s homepage, with a Google link on it (in Chinese, of course). She uses baidu.com, a Chinese counterpart of Google, more frequently than Google, partly because the name “Bai Du”, which literally means “a hundred times”, is much easier for her to remember (on the other hand, Google’s local name “Guge” is almost meaningless).

We people in academia are so used to our education (both linguistic and technical) that we sometimes take many things for granted. Two weeks ago at the Tetherless World Grand Opening, Wendy Hall, the ACM President-elect, mentioned that on her recent visit to China for the WWW 2008 conference, she was surprised to learn that there is such a huge part of the web that is only in Chinese. “Chinese may be the most popular language on the web in the future”, she said. This may or may not come true, but I agree that web technologies should be easier to use and should pay even more attention to internationalization.

However, “ease” means different things to different people. When my mom learned to use a mouse, she had to use both hands to control it :) and she did not give up only because she wanted to use computers to communicate with me. Last weekend, I tried to teach my father-in-law to use computers, and he also had a hard time controlling the mouse: regular computer users have an _instinct_ to locally reposition the mouse, so we never feel “the line is too short”, but he has no such instinct.

I’m a little off the topic. But what I want to say is that computers should be designed not only for the youth, but also for seniors; not only for English-speaking people, but also for the other 3/4 of people in the world; not only for geeks, but also for grandmas.

As to the Semantic Web, we should also always keep our “users” in mind. Who is going to use the semantic web? What things are at the top of the list we should support? I have long been thinking about this question: as most of our daily web activities are emailing, blogging, calendaring, searching, etc., why are there still no end-user-oriented semantic tools to help us with such activities? For example, I have tried many “semantic search engines”, e.g., Swoogle, SWSE and Sindice, and none of them can be considered end-user oriented: I cannot explain most of their results in RDF to my mom, to give just one example. Google is a killer app, as my mom can use it even if she cannot spell “Google” itself. We will need something like that.

Jie Bao

Patadata!

By a series of interesting coincidences in life, I have recently found myself in contact with Andrew Hugill, who is, among many other things, the Director of the Institute of Creative Technologies at DeMontfort University in Leicester, UK. Andrew sent me a copy of a CD of his music called “Pataphysical Piano” which I have truly enjoyed, and recommend to those interested in new directions in music. That, however, is not the intent of this blog (although I’m sure he won’t mind a few extra sales).

Rather, I was curious about the term “pataphysics” and was pleased to see a Wikipedia entry on the subject show up in the first page of the 45,000 or so Google finds. The original definition was “the science of imaginary solutions, which symbolically attributes the properties of objects, described by their virtuality, to their lineaments”, which didn’t shed much light. However, it was later stated to be the principles that rest on “the truth of contradictions and exceptions.” This latter, for those of you who know me, is way too good to believe, as I believe a crucial aspect of the Semantic Web is that we will have to learn to live with the truth of contradictions and exceptions, and that is the main argument I’ve been having with the forces of neatness, many of whom have clustered in the OWL 2 WG.

The philosopher, playwright and general polymath Alfred Jarry, who coined the term pataphysics, stated that it was “as far from metaphysics as metaphysics extends from regular reality.” I am happy to report that Googling for the term “patadata” I only find 10 hits, none of which uses it to mean “data interpreted through the truth of contradictions and exceptions” which is “as far from metadata as metadata extends from a databased representation of reality.” So consider this term now to be coined with exactly that meaning, and I happily join the ranks of previous pataphysicists as I continue my study of the functions and properties of “patadata markup”, a long paper on which I will publish as soon as I work out a few more details.

Yours patalogically,
Jim Hendler

p.s. Interestingly, one of the notable aspects of pataphysics throughout the past century has been a mix of seriousness and parody, often indistinguishably entwined. I hope to continue that tradition with this blog post and my future writings on the subject.

Fellowship of the (Semantic) Web: The Two Towers

By popular request (okay, a couple of people asked for it), I have put my Talk from Semantic Technologies 2008 online - warning, it’s about 22M pdf (lots of gratuitous images to keep things fun)

Enjoy.

Jim H.

Earthquake, google, and more

Below is an email I sent to the group today.

———————-

Dear TW friends

You must already know about the huge earthquake that occurred on May 12 in
Sichuan, China. It has caused enormous loss of life: more than 50,000
confirmed deaths, around 30,000 missing, plus about 300,000 people
injured as of today. The whole nation, as well as Chinese all over the
world including me, is in deep sorrow over the tragedy.

In memory of the earthquake victims, at 14:28 on May 19 (Beijing
Time), exactly one week after the earthquake, the Chinese public held a
moment of silence. People stood silent for three minutes while air
defense, police and fire sirens, and the horns of vehicles, vessels
and trains sounded.

Google China released a traffic curve for the three minutes [1]. At
the deepest point, it dropped to 10% of the normal traffic. At the
time, millions of people stopped their work on computers, stood up and
lowered their heads to observe. The curve clearly conveys a message of
national unity of the Chinese people in a time of calamity. I’m proud
to be a part of the people.

The Web is playing an important role in the earthquake relief this time.
Messages and information are exchanged on the web much faster than
by traditional means, which helps the rescue work. For example, when a girl
heard that army helicopters couldn’t find a landing site around her
home town, she immediately posted a good location on the internet, and
it was replicated thousands of times across many sites in just a few
hours, until it reached the army command. For another example, when
all communication avenues to the outside world were cut off, the
first message from the isolated area came from the website [2] of the
local government, which was revived by backup power and link; based on
reports from the website, it was decided to use airdrops instead of
land rescue for some areas, since otherwise it would have been too late.

This can still be improved. With the Semantic Web, such information
could be propagated not by human forwarding but by software agents, in
just seconds, to the handheld device of a helicopter pilot. In
earthquake relief, every second saved in knowledge aggregation and
propagation means more hope for lives. I hope this dream of a tetherless
world can come true as soon as possible.

Thank you for reading this.

Jie

[1] http://googlechinablog.com/2008/05/blog-post_22.html
[2] http://www.abazhou.gov.cn

Research challenges from TWINE

An interview (source) by John Breslin revealed some interesting technology features behind Twine: privacy, data integration, and data storage. I have mixed feelings about the fact that no existing triple/quad store is used and that Twine has developed its own. How do the current semantic web technologies fit enterprise-level, small-group-level, and person-level applications, and which triple store solution is ready to support such applications? The eight-element tuple is designed for efficiency, but will it become a common model for other social semantic web sites? As for privacy, do semantic web technologies bring any new benefits or new challenges, or are we still using the (user, group) access control mechanisms widely used in Web 2.0? Finally, data integration would be a very interesting challenge: do we have reasonably good automatic entity disambiguation tools? How can “collective intelligence” complement the automated tools? And how can the integration results be presented to end users without causing too much surprise? In general, the deployment of Twine is promising, and it will produce more interesting and practical challenges for the research community.

Initially Radar had their own triple store, an LGPL one from the CALO project. They found that it didn’t scale towards web-scale applications, and it didn’t have the levels of transaction control you’d need from an enterprise application. They decided to go for a SQL database (PostgreSQL) with WebDAV. However, relational databases weren’t optimised for the “shape” of data that they were putting into it, so it needed to be tweaked. They’ve had no performance issues so far, but they may move to a federated model next year.

….Twine uses an eight-element tuple store (subject-predicate-object, provenance, time stamp, confidence value, and other statistics about the triple or item itself). They can do predicate inferencing across statements, access control, etc. …

… The key “secret sauce” is that everything in Twine is generated from an ontology. The entire site - user interface elements, sidebar, navbar, buttons, etc. - come from an application ontology…

Q: The first one was about privacy. What if you add something and then later you decide that you want to delete it - is it really deleted or does Twine keep it around?

A: Nova answered that currently, it is not really deleted, it goes into a non-visible triple. But they will be doing that (really deleting it) soon.

Q: As one imports information from various places, what exactly is there in Twine that will prevent a person having to merge any duplicate objects?

A: Nova said there is limited duplication detection at the moment, but this will be improved in a few months. Most people submit similar bookmarks and it is reasonably straightforward to identify these, e.g. when the same item is arrived at through different paths on a website and has different URLs.

Q: Why does Twine use tuple storage: why is it not using a quad?

A: Nova said it’s faster in their system, so for performance reasons they decided to avoid reification.
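To make the excerpts above a little more concrete, here is a minimal Python sketch of what a statement with extra per-triple metadata and a “non-visible” delete might look like. This is only my illustration, not Twine’s actual design: the interview names six of the eight elements, so the `stats` dictionary simply stands in for the unnamed ones, and the `visible` flag, the `TupleStore` class, and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Statement:
    # The six elements named in the interview.
    subject: str
    predicate: str
    obj: str
    provenance: str
    timestamp: float
    confidence: float
    # Assumption: stands in for the "other statistics" the interview does not name.
    stats: Dict[str, float] = field(default_factory=dict)
    # Assumption: a flag modelling the "non-visible" state of a deleted item.
    visible: bool = True


class TupleStore:
    """Hypothetical in-memory store; a real system would sit on a database."""

    def __init__(self) -> None:
        self._statements: List[Statement] = []

    def add(self, statement: Statement) -> None:
        self._statements.append(statement)

    def soft_delete(self, subject: str) -> int:
        """Hide every visible statement about `subject`; nothing is removed."""
        hidden = 0
        for st in self._statements:
            if st.subject == subject and st.visible:
                st.visible = False
                hidden += 1
        return hidden

    def query(self, subject: Optional[str] = None,
              min_confidence: float = 0.0) -> List[Statement]:
        """Return visible statements, optionally filtered by subject and confidence."""
        return [
            st for st in self._statements
            if st.visible
            and (subject is None or st.subject == subject)
            and st.confidence >= min_confidence
        ]

    def __len__(self) -> int:
        # Counts hidden statements too, since a soft delete keeps them around.
        return len(self._statements)


store = TupleStore()
store.add(Statement("ex:bookmark1", "ex:tag", "semantic web", "ex:userA", 1.0, 0.9))
store.add(Statement("ex:bookmark2", "ex:tag", "rdf", "ex:userB", 2.0, 0.4))
store.soft_delete("ex:bookmark1")
```

Keeping provenance, time, and confidence on the statement itself means a filter such as `min_confidence` needs no reification, which is the trade-off the last answer above points at.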

Li

Towards RDFS 3.0 (or OWL 2 R Full)

Summary — there is a new “profile” of OWL Full that might be of great interest to the RDF/Data Web community — read on:

To those who follow W3C happenings, you know that I’ve had some problems with, and resigned from, the new OWL Working Group. The problems have mainly been related to the philosophy of what this is all about, more than the details of specific language features, and maybe I’ll blog about that some other time. However, in this entry I want to say something positive about one small piece of what the working group has done, and direct the RDF community to take a look at it– I believe it may be close to something we’ve needed for a long time.

In the “OWL 2 Web Ontology Language: Profiles” document (http://www.w3.org/TR/2008/WD-owl2-profiles-20080411/) the group has created a new set of OWL profiles (formerly called fragments), so instead of OWL Lite, DL, and Full, we now have (probably to be renamed at a later date) OWL 2 Full and a number of profiles: OWL 2 DL, OWL 2 EL++, OWL 2 DL-Lite, OWL 2 R DL, and OWL 2 R Full (there are also the unnamed RDF equivalents of EL++ and OWL DL-Lite, but the group refuses to acknowledge that, a primary reason for my leaving, but that’s another story again).

Anyway, it is to the last of these “OWL 2 R Full” that I would like to direct the attention of the RDF community — it is a bit hard to tell from the relatively cryptic document, but this fragment is an extension to RDFS that adds a small amount of useful OWL vocabulary, without requiring commitment to some of the strong restrictions needed for the various DL dialects. The specification includes an axiomatic specification of the language (i.e. rules) and starting to circulate, but not in the OWL group’s document, is an N3 version of the language making it very easy to see the relation to RDF. A couple of the larger members of the Working Group have stated that they will support this language (I’m not sure whether in public or not, so I’ll let them speak for themselves) which bodes well.

For those people looking at the “Data Web” or at “Web 3.0” applications, I think this profile of OWL may be worth looking at, as it may well be a good target of opportunity for further RDF development (and it would definitely be improved by some comments from serious Web 3.0 application developers). In the famous Semantic Web layercake, this profile (which I would like to see renamed RDFS 3.0) would be able to sit under the Rules and Ontology fragments, where RDFS is now, without derailing RDF(S) into the peculiarities of description logics, yet allowing some useful constructs to be added. For example, FOAF, DOAP and others of the most used RDF-based ontologies would be within (or close to) this new profile.

So if you’re not interested in, or are studiously ignoring, the OWL drafts, let me suggest you take a look at Table 2 of section 4 of the Profiles document (and section 4.2.3 if you want to see the rules). I also suggest that one does not have to understand anything else in that section (much of which seems to me to be written for those with PhDs in AI or similar background) to be able to see there’s something useful in here.
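For readers who would rather see the flavor of “RDFS plus a little OWL, expressed as rules” than read the tables, here is a rough Python sketch of forward chaining over triples. It is my own illustration, not the working group’s rule set: it covers only three rules (subclass typing, inverse properties, transitive properties), the exact rules in the draft may differ, and everything in the `ex:` namespace is made up for the example.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
INVERSE = "owl:inverseOf"
TRANSITIVE = "owl:TransitiveProperty"


def closure(triples: Iterable[Triple]) -> Set[Triple]:
    """Apply three simple rules repeatedly until no new triple appears."""
    facts = set(triples)
    while True:
        new: Set[Triple] = set()
        for (s, p, o) in facts:
            if p == SUBCLASS:
                # (x type s) and (s subClassOf o) => (x type o)
                for (x, p2, c) in facts:
                    if p2 == RDF_TYPE and c == s:
                        new.add((x, RDF_TYPE, o))
            elif p == INVERSE:
                # (s inverseOf o): every s-link implies the reverse o-link, and vice versa
                for (x, p2, y) in facts:
                    if p2 == s:
                        new.add((y, o, x))
                    if p2 == o:
                        new.add((y, s, x))
            elif p == RDF_TYPE and o == TRANSITIVE:
                # (x s y) and (y s z) => (x s z) when s is transitive
                for (x, p2, y) in facts:
                    if p2 == s:
                        for (y2, p3, z) in facts:
                            if p3 == s and y2 == y:
                                new.add((x, s, z))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical data: a tiny RDFS-style vocabulary plus two OWL terms.
data = [
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:hasPart", INVERSE, "ex:partOf"),
    ("ex:partOf", RDF_TYPE, TRANSITIVE),
    ("ex:wheel", "ex:partOf", "ex:car"),
    ("ex:car", "ex:partOf", "ex:fleet"),
]
inferred = closure(data)
```

The point of the sketch is that nothing here needs a description logic reasoner: the vocabulary is ordinary RDF triples, and the extra OWL terms just switch on a few more rules alongside the RDFS ones.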

So take a look at OWL 2 R Full - the name is awful, but the language might be a really powerful new tool on the RDF Web.

-Jim Hendler

p.s. Let me also suggest taking a look at the public email by Michael Schneider at http://lists.w3.org/Archives/Public/public-owl-wg/2008Apr/0171.html – one of the few RDF proponents in the working group, he gives a great example of using OWL R Full in an RDFS context…

wiki bots - one key to the success of wikipedia

Wikipedia has gained its great success under massive human administration. One amazing feature of Wikipedia is the fairly fast response of its administrators: they continuously track changes and keep the content in good shape.

One interesting helper in Wikipedia is the wiki bot, which can automate repetitive, tedious jobs (source: types of wiki bots):

  • tagging/categorizing wiki articles, e.g. alai bot
  • importing content from outside wiki, e.g. history of wikibots
  • detecting spam and vandalism in wiki pages, e.g. voAbotII
  • checking for spelling errors, e.g. cmdObot

I just went to Wikipedia and randomly selected 10 pages: 7 of them had been edited by bots (found by searching for “bot” in the editors’ names), and the remaining 3 had fairly short histories (no more than 20 revisions). Dodge, Nebraska is one of my favorite examples of a page that was massively edited by bots.
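The check I did by hand can be sketched in a few lines of Python. This is a rough illustration only: the page titles and editor names below are made up, and a real version would first have to fetch each page’s revision history. Note that matching “bot” in a name is a crude heuristic, since it will also flag a human editor whose name happens to contain those letters.

```python
from typing import Dict, List


def looks_like_bot(editor: str) -> bool:
    """Crude heuristic from the text: the editor's name contains 'bot'."""
    return "bot" in editor.lower()


def pages_touched_by_bots(histories: Dict[str, List[str]]) -> List[str]:
    """Return the titles whose revision history has at least one bot-like editor."""
    return [
        title
        for title, editors in histories.items()
        if any(looks_like_bot(editor) for editor in editors)
    ]


# Hypothetical revision histories: page title -> editor name of each revision.
sample = {
    "Page A": ["Alice", "ExampleBot", "Bob"],
    "Page B": ["Carol", "Dave"],
    "Page C": ["TagBot II", "Erin", "SpellCheckBot"],
}

bot_pages = pages_touched_by_bots(sample)
share = len(bot_pages) / len(sample)
```

With ten real histories in place of `sample`, `share` would be the 7-out-of-10 figure above.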

OWL Experiences and Directions Meeting

The OWL Experiences and Directions meeting concluded today. Info on it is up at: http://www.webont.org/owled/2008dc/ .

One talk I found interesting was on a controlled English interface. It is too early to consider it robust, but it is interesting to explore as one kind of interface to our applications - http://www.webont.org/owled/2008dc/papers/owled2008dc_paper_5.pdf

A Questionnaire for OWL Experience

A lot of interesting experiences with OWL and new OWL features were intensively discussed at OWL: Experiences and Directions (OWLED 2008). But (potential) users, for adoption purposes, still need some clarification on the lessons learned from the past. Therefore, I’m hoping the following questionnaire will be answered by the OWL community.

  1. OWL constructs
    1. What have been used?
    2. What are still missing?
  2. OWL inference
    1. What inference has been used to solve problems?
    2. What other inference is used together with OWL inference, e.g. SPARQL, SWRL?
  3. OWL user experience
    1. How hard is it to build/reuse an OWL ontology?
    2. How hard is it to build/reuse OWL instance data?
    3. How does OWL help web users, and how does the Web impact OWL?

semantic technology in AAAI Spring Symposia 2008

Semantic technology turns out to be a hot topic among the workshops in the AAAI Spring Symposia 2008. Here are some contexts where semantic technologies are needed:

  • knowledge representation
    • ontology, semantic web {all}
    • interoperability, interdisciplinary {ss01,ss05,ss07}
    • scientific knowledge; artificial characters {ss04,ss05,ss07}
    • business process; scientific workflow; provenance {ss01,ss05,ss07}
  • computation and reasoning
    • social web; network computation; collective intelligence {ss04,ss05,ss06,ss07}
    • rules, workflow analysis {ss01,ss05}
    • question answering, intelligent system, agent {ss02,ss03,ss04}
    • semantic enabled user interface, explanation {ss05,ss07}

Below is the list of workshops:

  • ss01 AI Meets Business Rules and Process Management
  • ss02 Architectures for Intelligent Theory-Based Agents
  • ss03 Creative Intelligent Systems
  • ss04 Emotion, Personality, and Social Behavior
  • ss05 Semantic Scientific Knowledge Integration
  • ss06 Social Information Processing
  • ss07 Symbiotic Relationships between Semantic Web and Knowledge Engineering
  • ss08 Using AI to Motivate Greater Participation in Computer Science

Thoughts on the Billion Triple Challenge

The following is an email that I sent out today about the Semantic Web Challenge at this year’s ISWC. If you are interested in this and have not yet joined the group billiontriples@yahoogroups.com, let me encourage you to do so. I’d also welcome email (or blog comments, although they weren’t working right last time I posted from here) if you have any thoughts: in the next week or so Peter Mika and I need to move this from random thoughts to something starting to resemble competition rules!

-Jim H

p.s. Oh yeah, I forgot, if you are missing the context, Peter Mika and I are co-chairing the ISWC 2008 Semantic Web Challenge …

(Email sent to billiontriples@yahoogroups.com):

All- Peter feels that we now have the collection and distribution of the triples underway, which means he gets to make me do some work finally… My role at the moment is to figure out what we would like the challenge part of the challenge to be. Here are some thoughts, I welcome feedback:

We see four, very non-disjoint audiences for the challenge (in fact, Peter, me, and most of the people on this list are in at least several categories): Triple store developers, linked data technology developers, Semantic Web researchers interested in scalable reasoning, ontology-based research groups

Here are some of my thoughts with respect to these

A - Triple Store Developers: We do not want this to be a “triple store shootout” in the sense of who can process a query fastest or such. We don’t see that competition as being all that useful at a time when people are still very much in development mode. Rather, we would like the outcome of this event to be a realization in the outside world that triple-stores can and do handle these sorts of numbers (the DB folks still say “triple stores break at a million triples” at conferences I go to - I have no idea where they get that, but let’s push it up a few orders of magnitude!!). So at the moment my thinking on this area is that we would like to give you folks bragging rights for being able to support systems other people develop (i.e. any of you who host this data and make it available via SPARQL should be listed as “winners” in some way). I also think that if some interesting, large, and complex SPARQL queries are developed against this dataset (say including filters and optionals), then those would become useful benchmarks, so we would like to find a way to encourage the sharing of these (maybe for a future date when a benchmarking shootout would be more appropriate).

B - Linked data technology developers: We write a lot about the Semantic Web as being the Web of linked data, but to date, in practice, most of that data is either within an enterprise or locked in a particular application. We are purposely designing this dataset to be very heterogeneous, but with many connections between pieces, so it should be a great dataset for showing off tools that can exploit the dataweb. In this area we are thinking of having some goals like “visualize (or browse) the dataweb”, data mining of this sort of data, etc. It seems to us this is a ripe area for a challenge.

C - SW researchers interested in scalable reasoning: The data set we are developing will include a (large) number of triples tied to FOAF, DOAP and other “small o” ontologies. We also have a lot of data that will be made available that was crawled from microformats (where the “semantics” are well specified). This is thus an ideal proving ground for the “a little semantics goes a long way” philosophy, and thus this also seems like an appropriate challenge area.

D - Ontology research: Big A-Box, you got it! Show us something.

So, I think we will have the “competition” be fairly unspecified - we will identify several areas of interest from the above and work out how to tie that into an “announceable” competition.

I welcome, NEED, your feedback on this -Jim H.