Archive for the ‘AI’ Category

Data.gov – it’s useful, but also could be better.

April 5th, 2011

The “Nerd Collider” Web site invited me to be a “power nerd” and respond to the question “What would you change about Data.gov to get more people to care?”  The whole discussion including my response can be found here.  However, I hope people won’t mind my reprinting my response here, as the TWC blog gets aggregated to some important Linked Data/Semantic Web sites.

My response:

I was puzzling over how I wanted to respond until I saw the blog in the Guardian – http://www.guardian.co.uk/news/datablog/2011/apr/05/data-gov-crisis-obama – which also reflects this flat line as a failure, and poses, by contrast, the number of hits the Guardian.com website gets. This is such a massive apples vs. oranges error that I figure I should start there.

So, primarily, let’s think about what visits to a web page are about. For the Guardian, they are lots of people coming to read the different articles each day. However, for data.gov, there isn’t a lot of repeat traffic – the data feeds are updated on a relatively slow basis, and once you’ve downloaded some, you don’t have to go back for weeks or months until the next update. Further, for some of the rapidly changing data, like the earthquake data, there are RSS feeds, so once they are set up, one doesn’t return to the site. So my question is, are we looking at the right number?

In fact, the answer is no. If you want to see the real use of data.gov, take a look at the chart at http://www.data.gov/metric/visitorstats/monthlyredirecttrend – the total number of dataset downloads since 2009 is well over 1,000,000, and in February of this year (the most recent data available) there were over 100,000 downloads. So the 10k number appears to be tracking the wrong thing – the data is being downloaded, and that implies it is being used!

Could we do better? Yes, very much so. Here are things I’m interested in seeing (and am working with the data.gov team to make available):

1 – Searching for data on the site is tough. Keyword search is not a good way to look for data (for lots of reasons), and thus we need better ways. Doing this really well is a research task I’ve got some PhD students working on, but doing better than what is there now requires better metadata and a better approach. There is already work afoot at data.gov (assuming funding continues) to improve this significantly.

2 – Tools for using the data, and particularly for mashing it up, need to be more easily used and more widely available. My group makes a lot of info and tools available at http://logd.tw.rpi.edu – but a lot more is needed. This is where the developer community could really help.

3 – Tools to support community efforts (see the comment by Danielle Gould to this effect) are crucial – she says it better than I can so go read that.

4 – There are efforts by data.gov to create communities – these are hard to get going, but could be of great value in the long run. I suggest people look at these on the data.gov communities site, and think about how they could be improved to bring more use – I know the data.gov leadership team would love to get some good comments about that.

5 – We need to find ways to turn the data release into a “conversation” between government and users. I have discussed this with Vivek Kundra numerous times and he is a strong proponent (and we have thought about writing a paper on the subject if time ever allows). The British data.gov.uk site has some interesting ideas along this line, based on OpenStreetMap and similar projects, but I think one could do better. This is the real opportunity for “government 2.0” – a chance for citizens not just to comment on legislation, but to help make sure the data that informs the policy decisions is the best it can be.

So, to summarize, there are things we can do to improve data.gov, many of which are getting done. However, the numbers in the graph above are misleading, and don’t really reflect the true usage of data.gov per se, let alone that of other sites, like the LOGD site I mention above, which are powered by data.gov.


Wolfram|Alpha vs? Open-linked data

May 5th, 2009

I saw a webinar demo of Wolfram|Alpha given by Stephen Wolfram today — the system is very impressive — (you can read my blog about it on the Nature Network)

One thought struck me – we’re trying to do open-linked data, while what Wolfram has is an impressive engine that uses the “closed” knowledge that lets them do curation, testing and computation.  Sort of like open social networks vs. the many specialized ones — still, he shows what is possible with data and a good computational engine, and I think that will be generally good for all of us “data on the web” types.

(Jim H.)

Author: Categories: AI, Semantic Web Tags:

URL daily (Radical translation)

December 11th, 2008

url: http://en.wikipedia.org/wiki/Radical_translation

“Radical translation is a term invented by American philosopher W. V. O. Quine to describe the situation in which a linguist is attempting to translate a completely unknown language, which is unrelated to his own, and is therefore forced to rely solely on the observed behavior of its speakers in relation to their environment.”

“Quine tells a story (Quine 1960) to illustrate his point, in which an explorer is trying to puzzle out the meaning of the word “gavagai”. He observes that the word is used in the presence of rabbits, but is unable to determine whether it means ‘undetached rabbit part’, or ‘fusion of all rabbits’, or ‘temporal stage of a rabbit’, or the universal ‘rabbithood’.”

“Radical translation” carries a criticism of strong AI similar to that of the Chinese room argument by John Searle:

“…(Searle 1980), which attempts to show that a symbol-processing machine like a computer can never be properly described as having a “mind” or “understanding”, regardless of how intelligently it may behave.”

While language translation is by itself very interesting work, I wonder what it was like when Chinese was translated into English for the first time. Here are some examples:

1. Proper names of real world entities, such as elephant (象), can be easily translated.

source: http://en.wikipedia.org/wiki/Elephant

2. Fictional figures such as the dragon (龙) carry different meanings.

source: http://en.wikipedia.org/wiki/Dragon
source: http://en.wikipedia.org/wiki/European_dragon

3. Non-accessible things, such as the philosophical term Tao (道), cause more difficulty because they do not have a clear-cut definition even in their native language.

4. Another example is the term china, which is also used to refer to high-quality porcelain or ceramic ware, originally made in China. This sense is a good example of radical translation, where Quine’s “rabbit” is replaced by porcelain and “gavagai” is replaced by “china”.

source: http://en.wikipedia.org/wiki/Image:Ming-Schale1.jpg

The above philosophical arguments and real world translation examples lead to the following thoughts on the social norms:

1. Meaning is rather a matter of Quine’s ontological commitment, where the definition is socially agreed upon.

2. While understanding and translation may be done by one person, the correctness of these actions is evaluated by social peers.

3. It is worth reading Searle’s The Construction of Social Reality (1995) (Wikipedia provides a nice summary).

Li Ding, 2008-12-11

Author: Categories: AI, Semantic Web, Web Science Tags:

AAAI Fall Symposium on Automated Scientific Discovery

November 11th, 2008

Day One

I (Joshua Taylor) am now back in Troy after spending the weekend (Friday through Sunday) in Arlington at the AAAI Fall Symposium on Automated Scientific Discovery. Presented papers, keynote address, and slides will become available on the supplementary symposium page over the next week or so.

After opening remarks by symposium chairs Selmer Bringsjord and Andrew Shilliday, Doug Lenat gave an opening keynote address entitled Looking Both Ways. Doug has a long history in automated discovery, from his 1976 PhD thesis on Automated Mathematician (AM) and later Eurisko, to present day work within Cycorp. Doug explained a great deal about the techniques behind AM and Eurisko, and also talked about some of the criticisms that these systems received. He then spoke about the development of Cyc. We learned just how much Cyc has evolved, moving from frame-based systems and description logics to much more expressive formalisms, leaving behind a theoretically desirable global consistency for more pragmatic and cognitively plausible locally consistent microtheories, and how Cyc now has enough (manually-encoded) knowledge that high-level machine learning and automated knowledge acquisition are possible. He stressed that one of the factors making such knowledge acquisition and learning possible is the widespread adoption of the World Wide Web, and I noted that some of his high-level diagrams included SQL and SPARQL. (While I knew about the OpenCyc project, I only just now became aware that a great deal of OpenCyc is Semantic Web friendly.)

Andrew Shilliday followed Doug’s keynote with a history of automated scientific discovery, particularly as it relates to his own research and upcoming thesis. He also described the Elisa system for assisting users in discovery in scientific and mathematical domains.

After lunch, Alexandre Linhares discussed Douglas Hofstadter’s notion of Fluid Concepts and an implementation thereof.

Siemion Fajtlowicz spoke about development of Graffiti, a well-known system for conjecture generation within the domain of graph theory. Siemion also discussed more recent work connecting graph theory conjectures and molecular structure conjectures.

Jean-Gabriel Ganascia presented A Reconstruction of Some of Claude Bernard’s Scientific Steps, which also documents the development of Cybernard. In a manner that I am particularly fond of, Jean-Gabriel, in order to automate scientific discovery, takes as a starting point Claude Bernard, a human who both made many scientific discoveries and documented just what he did. One of the interesting aspects of this work is that it involves developing ontologies of the scientific process of discovery, as well as of the scientific concepts with which Bernard worked. The importance of ontology evolution was also stressed, for as scientific knowledge increases, scientific conceptualization must also change.

After Jean-Gabriel, Susan Epstein discussed Knowledge Representation in Automated Scientific Discovery. She discussed how concepts in automated scientific discovery are often expressed as sets, and that as a result, the conjectures that are generated are usually those that can be expressed in set theoretic terms. For instance (and I’m choosing an example that I remember from Doug Lenat’s talk), the conjecture that perfect squares have at least three divisors (every perfect square n has as divisors at least 1, n, and the root of n), could be made based upon the observation that the set of perfect squares is a subset of the set of numbers with at least three divisors. She proposed a representation, different from sets, that uses testers and generators, which are, respectively, predicates for determining whether an example is an example of a concept and functions that produce examples of a concept.
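To make the tester/generator idea concrete, here is a minimal sketch in Python. It is my own illustration, not code from Epstein’s talk or from any of the systems discussed; all function names are hypothetical. A concept is represented by a tester (a predicate) and a generator (a function yielding examples), and a conjecture of the form “every A is a B” is checked by feeding examples from A’s generator to B’s tester. I restrict the concept to perfect squares greater than 1, since 1 has only a single divisor.

```python
from itertools import count, islice
from math import isqrt


def is_perfect_square(n):
    """Tester: True when n is a perfect square greater than 1."""
    return n > 1 and isqrt(n) ** 2 == n


def perfect_squares():
    """Generator: yields 4, 9, 16, ... (perfect squares greater than 1)."""
    for k in count(2):
        yield k * k


def divisor_count(n):
    """Count the positive divisors of n by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def has_at_least_three_divisors(n):
    """Tester for the concept 'numbers with at least three divisors'."""
    return divisor_count(n) >= 3


def conjecture_holds(generator, tester, sample_size=50):
    """Empirically check 'every generated example passes the tester'.

    This only samples finitely many examples, so it can refute a
    conjecture or lend it support, but it cannot prove it.
    """
    return all(tester(x) for x in islice(generator(), sample_size))


if __name__ == "__main__":
    print(conjecture_holds(perfect_squares, has_at_least_three_divisors))
```

Note that nothing here requires either concept to be stored as an explicit set: the generator supplies candidates and the tester judges them, which is the appeal of the representation as I understood it.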

Epstein’s talk concluded with an example of student discovery and conceptual refinement for the game Pong Hau K’i (or Umulkono, 우물고노, in Korea). As AI researchers, seeing the progression of formalizations that students went through in diagramming the state space of a game was quite interesting. (A colleague and I played the game on paper during the plenary session. I lost.)

Day Two

Alan Bundy started the day with a keynote called Why Ontology Evolution is Essential in Modeling Scientific Discovery. The title is a good overview, and some of the comments made about Jean-Gabriel’s work apply here too. The importance of providing for ontologies which can change and evolve along with scientific conceptualization must not be underestimated. Examples were drawn from physics and astronomy, particularly the discovery that heat and temperature are not equivalent, and the precession of the perihelion of Mercury. Michael Chan’s later talk would also touch upon their work on ontology evolution and repair systems.

David Jensen spoke about Automatic Identification of Quasi-Experimental Designs for Discovering Causal Knowledge. I think that this work is important, particularly for science performed using Semantic Web technologies, where, ideally, data collection could be automated rather than planned. From his abstract, “[Quasi-Experimental Designs] are a family of methods for exploiting fortuitous situations in observational data that emulate control and randomization.” For instance, although a dataset as a whole may have significant bias or sampling issues, subsets of the data may actually suggest a different conclusion, and with better experimental method. Jensen discussed an example in which two groups of researchers reached different conclusions about links between early sexual activity and juvenile delinquency. The first group of researchers used the dataset as a whole, while the second examined twins in the dataset, a subset of the observational data, but one in which it was practically guaranteed that most variables would be identical (e.g., twins in the same household, same type of family life, &c.).

The posted schedule had Konstantine Arkoudas speaking next, but he and I switched places, so I gave the next presentation, Discovery Using Heterogeneous Combined Logics. The title was a bit misleading (though it matched the extended abstract that I submitted) as I actually spoke about how specialized reasoners might be invoked on decidable subproblems of an overall goal. In particular, I showed the Dreadsbury Mansion Mystery and the typical first-order logic formalization that students at RPI will generate for it. Of course, with an arbitrary FOL formalization, there aren’t many guarantees about what an automated reasoning system can do with it. I then showed that there’s a natural translation into the description logic ALBO, which has a decision procedure. I have not yet done any work in automating this process, but neither does it seem impossible to automatically recognize such a reduction. This is all preliminary work, so I was grateful for references from Alan Bundy and Simon Colton to related work.

After lunch, Michael Chan presented more work on ontology repair. It seems that their research group has a framework in which ontology repair plans are components. Chan discussed one called Inconstancy, in addition to Where’s My Stuff? that Alan Bundy had mentioned earlier.

Selmer Bringsjord showed his continuing work on an automated discovery of Gödel’s first incompleteness result.

Day Three

Day three began with a keynote from Simon Colton called Joined-Up Reasoning for Automated Scientific Discovery: A Position Statement and Research Agenda. Simon discussed his HR system. Doug Lenat had to leave the symposium early, which is unfortunate, as Simon’s presentation made some comparisons between AM and HR. Simon also made some connections between scientific discovery and artistic creativity. By this time, the symposium was drawing to a close, and so there was not as much time as we would have liked, but the presentation slides, and perhaps an audio recording, should be available on the supplementary symposium site relatively soon.

The symposium ended with a practical discussion of funding, publications, and possible future work and collaborations.

//JT

Author: Categories: AI, Uncategorized Tags:

Fellowship of the (Semantic) Web: The Two Towers

May 25th, 2008

By popular request (okay, a couple of people asked for it), I have put my talk from Semantic Technologies 2008 online – warning, it’s about a 22 MB PDF (lots of gratuitous images to keep things fun).

Enjoy.

Jim H.

Author: Categories: AI, Semantic Web, tetherless world, Web Science Tags: