Author Archive

A little, tiny semantics in action — from Google

February 20th, 2009

I just read about Google’s Canonical Link Tag. It’s a small application of the “rel” attribute, which RDFa also builds on. It is not a big thing, but I’m happy that it comes from Google, which has seemed quite remote from semantic web technologies.

http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

“Last week Google, Yahoo, and Microsoft announced support for a new link element to clean up duplicate urls on sites. The syntax is pretty simple: An ugly url such as http://www.example.com/page.html?sid=asdf314159265 can specify in the HEAD part of the document the following:

<link rel="canonical" href="http://example.com/page.html"/>

That tells search engines that the preferred location of this url (the “canonical” location, in search engine speak) is http://example.com/page.html instead of http://www.example.com/page.html?sid=asdf314159265 .”
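As a rough illustration of how a crawler might read this hint, here is a minimal sketch using Python's standard html.parser module. The class and function names (CanonicalLinkParser, find_canonical) and the sample page are made up for this example; nothing here reflects how Google or any other search engine actually processes the element.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Hypothetical page served at the "ugly" URL from the quote above.
page = """<html><head>
<title>Example</title>
<link rel="canonical" href="http://example.com/page.html"/>
</head><body></body></html>"""

print(find_canonical(page))
```

A page without the element simply yields None, so a caller can fall back to the URL it fetched.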
Author: Categories: Semantic Web Tags:

AAAI Fall Symposium on Automated Scientific Discovery

November 11th, 2008

Day One

I (Joshua Taylor) am now back in Troy after spending the weekend (Friday through Sunday) in Arlington at the AAAI Fall Symposium on Automated Scientific Discovery. Presented papers, keynote address, and slides will become available on the supplementary symposium page over the next week or so.

After opening remarks by symposium chairs Selmer Bringsjord and Andrew Shilliday, Doug Lenat gave an opening keynote address entitled Looking Both Ways. Doug has a long history in automated discovery, from his 1976 PhD thesis on Automated Mathematician (AM) and later Eurisko, to present-day work within Cycorp. Doug explained a great deal about the techniques behind AM and Eurisko, and also talked about some of the criticisms that these systems received. He then spoke about the development of Cyc. We learned just how much Cyc has evolved: it has moved from frame-based systems and description logics to much more expressive formalisms, it has left behind a theoretically desirable global consistency for more pragmatic and cognitively plausible locally consistent microtheories, and it now has enough (manually encoded) knowledge that high-level machine learning and automated knowledge acquisition are possible. He stressed that one of the factors making such knowledge acquisition and learning possible is the widespread adoption of the World Wide Web, and I noted that some of his high-level diagrams included SQL and SPARQL. (While I knew about the OpenCyc project, I only just now became aware that a great deal of OpenCyc is Semantic Web friendly.)

Andrew Shilliday followed Doug’s keynote with a history of automated scientific discovery, particularly as it relates to his own research and upcoming thesis. He also described the Elisa system for assisting users in discovery in scientific and mathematical domains.

After lunch, Alexandre Linhares discussed Douglas Hofstadter’s notion of Fluid Concepts and an implementation thereof.

Siemion Fajtlowicz spoke about the development of Graffiti, a well-known system for conjecture generation within the domain of graph theory. Siemion also discussed more recent work connecting graph theory conjectures and molecular structure conjectures.

Jean-Gabriel Ganascia presented A Reconstruction of Some of Claude Bernard’s Scientific Steps, which also documents the development of Cybernard. In a manner that I am particularly fond of, Jean-Gabriel, in order to automate scientific discovery, takes as a starting point Claude Bernard, a human who both made many scientific discoveries and documented just what he did. One of the interesting aspects of this work is that it involves developing ontologies of the scientific process of discovery, as well as of the scientific concepts with which Bernard worked. The importance of ontology evolution was also stressed, for as scientific knowledge increases, scientific conceptualization must also change.

After Jean-Gabriel, Susan Epstein discussed Knowledge Representation in Automated Scientific Discovery. She discussed how concepts in automated scientific discovery are often expressed as sets, and that as a result, the conjectures that are generated are usually those that can be expressed in set-theoretic terms. For instance (and I’m choosing an example that I remember from Doug Lenat’s talk), the conjecture that perfect squares greater than 1 have at least three divisors (every such square n has as divisors at least 1, n, and the root of n) could be made based upon the observation that the set of these perfect squares is a subset of the set of numbers with at least three divisors. She proposed a representation, different from sets, that uses testers and generators, which are, respectively, predicates for determining whether a candidate is an example of a concept and functions that produce examples of a concept.
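To make the tester/generator idea a little more concrete, here is a minimal sketch in Python. It is my own illustration, not Epstein's representation or Lenat's AM: the Concept type, the function names, and the sampling-based check are all assumptions made for this example. The squares concept starts at 4, because 1 has only a single divisor and would be a counterexample.

```python
from itertools import count, islice
from math import isqrt
from typing import Callable, Iterator, NamedTuple


class Concept(NamedTuple):
    """A concept given by a tester (predicate) and a generator of examples."""
    tester: Callable[[int], bool]
    generator: Callable[[], Iterator[int]]


def is_square_above_one(n: int) -> bool:
    return n > 1 and isqrt(n) ** 2 == n


def squares_above_one() -> Iterator[int]:
    # 4, 9, 16, ... (1 is skipped because it has only one divisor)
    return (k * k for k in count(2))


def has_three_divisors(n: int) -> bool:
    # Naive divisor count; fine for the small numbers sampled here.
    return sum(1 for d in range(1, n + 1) if n % d == 0) >= 3


def numbers_with_three_divisors() -> Iterator[int]:
    return (n for n in count(1) if has_three_divisors(n))


def looks_like_subset(sub: Concept, sup: Concept, samples: int = 50) -> bool:
    """Empirical support for 'every sub is a sup': run generated examples of
    sub through sup's tester. Passing is evidence for a conjecture, not a proof."""
    return all(sup.tester(x) for x in islice(sub.generator(), samples))


squares = Concept(is_square_above_one, squares_above_one)
three_divisors = Concept(has_three_divisors, numbers_with_three_divisors)

print(looks_like_subset(squares, three_divisors))   # conjecture survives sampling
print(looks_like_subset(three_divisors, squares))   # converse is refuted by 6
```

The point of the design is that neither concept is ever materialized as a set: the generator supplies candidates and the tester judges them, so the same check works for infinite concepts.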

Epstein’s talk concluded with an example of student discovery and conceptual refinement for the game Pong Hau K’i (or Umulkono, 우물고노, in Korea). For us as AI researchers, seeing the progression of formalizations that students went through in diagramming the state space of a game was quite interesting. (A colleague and I played the game on paper during the plenary session. I lost.)

Day Two

Alan Bundy started the day with a keynote called Why Ontology Evolution is Essential in Modeling Scientific Discovery. The title is a good overview, and some of the comments made about Jean-Gabriel’s work apply here too. The importance of providing for ontologies that can change and evolve along with scientific conceptualization must not be underestimated. Examples were drawn from physics and astronomy, particularly the discovery that heat and temperature are not equivalent, and the precession of the perihelion of Mercury. Michael Chan’s later talk would also touch upon their work on ontology evolution and repair systems.

David Jensen spoke about Automatic Identification of Quasi-Experimental Designs for Discovering Causal Knowledge. I think that this work is important, particularly for science performed using Semantic Web technologies, where, ideally, data collection could be automated rather than planned. From his abstract, “[Quasi-Experimental Designs] are a family of methods for exploiting fortuitous situations in observational data that emulate control and randomization.” For instance, although a dataset as a whole may have significant bias or sampling issues, subsets of the data may actually suggest another experiment, and one with better experimental method. Jensen discussed an example in which two groups of researchers reached different conclusions about links between early sexual activity and juvenile delinquency. The first group of researchers used the dataset as a whole, while the second examined twins in the dataset, a subset of the observational data, but one in which it was practically guaranteed that most variables would be identical (e.g., twins in the same household, same type of family life, &c.).

The posted schedule had Konstantine Arkoudas speaking next, but he and I switched places, so I gave the next presentation, Discovery Using Heterogeneous Combined Logics. The title was a bit misleading (though it matched the extended abstract that I submitted), as I actually spoke about how specialized reasoners might be invoked on decidable subproblems of an overall goal. In particular, I showed the Dreadsbury Mansion Mystery and the typical first-order logic formalization that students at RPI will generate for it. Of course, with an arbitrary FOL formalization, there aren’t many guarantees about what an automated reasoning system can do with it. I then showed that there’s a natural translation into the description logic ALBO, which has a decision procedure. I have not yet done any work in automating this process, but neither does it seem impossible to automatically recognize such a reduction. This is all preliminary work, so I was grateful for references from Alan Bundy and Simon Colton to related work.

After lunch, Michael Chan presented more work on ontology repair. It seems that their research group has a framework in which ontology repair plans are components. Chan discussed one called Inconstancy, in addition to Where’s My Stuff? that Alan Bundy had mentioned earlier.

Selmer Bringsjord showed his continuing work on an automated discovery of Gödel’s first incompleteness result.

Day Three

Day three began with a keynote from Simon Colton called Joined-Up Reasoning for Automated Scientific Discovery: A Position Statement and Research Agenda. Simon discussed HR, his system for automated theory formation. Doug Lenat had to leave the symposium early, which was unfortunate, as Simon’s presentation made some comparisons between AM and HR. Simon also made some connections between scientific discovery and artistic creativity. By this time, the symposium was drawing to a close, and so there was not as much time as we would have liked, but the presentation slides, and perhaps an audio recording, should be available on the supplementary symposium site relatively soon.

The symposium ended with a practical discussion of funding, publications, and possible future work and collaborations.

//JT

Author: Categories: AI, Uncategorized Tags:

Why Bother…

October 28th, 2008

From Talis: “Jim Hendler at the INSEMTIVE 2008 Workshop”

“that people will (and do) create metadata when there are obvious and immediate benefits in them doing so. No-one really consciously sits down to share or create metadata: they sit down to do a specific task and metadata drops out as a side-effect.”

I couldn’t agree more. I once tried to tag all my blog posts. After a few weeks I was bored, because there was no clear, immediate benefit in doing so. Now I only tag things when I have to, for example to give my friends a list of posts on the same topic.

The only tagging system that has worked consistently for me is Gmail labeling: I organize mails related to the same task (like writing a paper) on a daily basis, because it is very useful, and immediately useful. Even so, I label only a tiny fraction of all my emails.

I have seen too many people whose desktops are full of files and who are too lazy to organize them, and I am one of them. Every year I have to set aside a day or two to reorganize my hard disk and dig out the hidden treasures of my “Downloads” folder. I believe that for the semantic web to be successful, creating an ontology should be at least as easy and as useful as organizing files on a hard disk.

In fact, people are creating metadata, or even ontologies, every day: every email sorted, every contact saved on a cell phone, every folder created, every calendar item, every wiki post, … We just need to make them explicit and, most of all, do so without bothering the user to click even one more button.

Jie Bao

Author: Categories: Uncategorized Tags:

Cuil, Semantic Search

August 13th, 2008

Last week, Cuil.com caught my eye. It made a very good impression on me in just 5 seconds (BTW, 10 seconds is the most any website gets to win me over). First I tried, as many people probably do, my own name. It didn’t disappoint me: it hit my pages quite precisely. I also love the grid-based layout. A few minutes later, I found its “Explore by Category” option. It looks like Cuil has some sort of ontology hierarchy for web pages.

A few “google” results reveal that Cuil may use some clustering technique to build such hierarchies. It is interesting to consider whether such hierarchies will indeed improve the search experience. When I search for “Semantic Web”, Cuil recommends that I browse “Ontology (computer Science)” and some of its subcategories; it also suggests that I look at the homepage of “James Hendler”. I would say that this will be very useful for exploring.

Building metadata using machine learning technology is a cool thing. On the other hand, I believe that human intervention is also critical. When Wikipedia knowledge is used in clustering, I expect some gain in recall or precision. As “Ontology (computer Science)” is a Wikipedia page, I guess that Cuil may already have used Wikipedia information in its results.

Also, don’t forget the “network effect”. For a while now I have kept a prefix-based, syntactic Gmail label hierarchy. I would really like to share part of the hierarchy with my friends, so that when I send a mail labeled “party”, they don’t need to relabel it. If millions of users could share their small hierarchies (not only on Gmail, but also on Flickr, YouTube, Twine, etc.), each connected somehow to the hierarchies of friends and family, eventually we would have a very large network of ontologies, which might improve search far beyond what we can do now. Just a random thought.
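A small sketch of what sharing such hierarchies could look like, in Python. Everything here is assumed for illustration: the “/” separator, the label names, and the build_hierarchy and merge helpers are hypothetical, and none of it touches Gmail's actual API.

```python
from typing import Dict, Iterable

# A hierarchy is a nested dict: label part -> sub-hierarchy.
Tree = Dict[str, "Tree"]


def build_hierarchy(labels: Iterable[str], sep: str = "/") -> Tree:
    """Turn flat, prefix-based labels such as 'party/birthday' into a tree."""
    root: Tree = {}
    for label in labels:
        node = root
        for part in label.split(sep):
            node = node.setdefault(part, {})
    return root


def merge(mine: Tree, theirs: Tree) -> Tree:
    """Combine two hierarchies into a new tree, leaving both inputs untouched."""
    merged: Tree = {key: merge(sub, {}) for key, sub in mine.items()}
    for key, sub in theirs.items():
        merged[key] = merge(merged.get(key, {}), sub)
    return merged


# Hypothetical labels from two users.
my_labels = build_hierarchy(["party/birthday", "paper/draft"])
friend_labels = build_hierarchy(["party/halloween", "travel"])

shared = merge(my_labels, friend_labels)
print(shared)
```

Merging by matching label names is the simplest possible choice. Two users who mean different things by “party” would need some mapping step that this sketch leaves out.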

P.S. I found one interesting thing. Cuil caches my wiki page at Iowa State University. However, that page should have been offline since May 2008 at the latest, while Cuil officially went online only on July 28, 2008. It seems its crawler had been alive for a while.

Jie Bao

Author: Categories: Uncategorized Tags:

Captcha, Turing Test, and Semantic Web

August 6th, 2008

On the web nobody knows you are a dog, …… or a human. That’s why there are programs on the web to tell a human apart from bots (or dogs or cats or ……). The most popular ones are captchas. They are based on a simple assumption: no OCR agent so far is as smart as a human. To me, a captcha looks like a super-simplified Turing test: an AI program has “real” intelligence, as a human has, if, when both are asked the same question, another human can’t tell which is the AI and which is the human.

I can’t help wondering what test we will use to identify a human on the web once OCR agents get smart enough to pass the captcha test (I strongly believe that day is not far away). Math? That will be easy for a good program. Scrabble? Maybe, but not that secure. Ask for a Shakespeare sonnet? Or the year World War II ended? That looks more likely to succeed. But… there are two issues.

First, an agent may have access to a knowledge base. With projects like DBpedia, human knowledge is being “KBized” at a speed never seen before in history. A query such as “the end year of World War II” may be answered by a semantic web agent fairly quickly. I can imagine that someday we will have to design increasingly hard questions (about art, say) to identify a human and fight spam.

The other issue is that a human may have NO access to a knowledge base. Many, many people in the world do not know “the end year of World War II”, even if they are knowledgeable about other things. They may not even know where to find such knowledge. They may also get bored when they are constantly asked such captcha questions, and quit; technically, that means they failed the test and thus are not “human”. As captchas become increasingly hard (about art, say), more and more people may fail for one reason or another (including boredom). That will also lead to the failure of the identification system.

Will the semantic web end up helping spammers by making smart agents possible? :) Maybe. Let’s wait and see.

Jie Bao

Author: Categories: Uncategorized Tags: