Archive for December, 2013

Miscellaneous notes from my AGU meeting

December 21st, 2013

I don’t remember exactly how many times I have been to the Bay Area, mostly for summer internships. Foggy, windy San Francisco is nothing new to me, but the AGU meeting was an adventure in which I could never predict what I would come across. I knew from lab-mates that AGU is a huge event, with over 20,000 attendees from numerous domains, but it was not until I registered at Moscone Center West that I realized its overwhelming diversity. To be honest, when I talked to other students at the student breakfast, I could name their backgrounds, such as biological chemistry, but their problems and approaches were never easy for me to follow. Initially I tried to show appreciation as a sign of politeness, but I never felt any spontaneous response coming out of the conversation. I told myself I needed to get something out of these 7 days of conversations, something useful for both parties.

Therefore, I decided to take charge of the conversations by asking questions like: I am working on A, which tries to solve general problems such as B and C; would that benefit what you are working on? Basically, I treated most conversations as a survey of whether something I build would make their lives much easier. One reason comes from my observation that domain scientists, the category most attendees fall into, use information technologies such as MATLAB, R or Excel but are not satisfied with them. Most scientists, as general users, might not know exactly what they want, since otherwise they would just build the tool themselves! In most cases they are looking for some magical tool that does things better, just as, when the automobile was first invented, most people thought they needed faster horses. Working from this observation, I had several great conversations with a couple of scientists about the problems I am working on, and these conversations really helped me a lot!

Data integration:
One topic people are talking about is data integration. Essentially, this is the problem of ontology and vocabulary matching, where many efforts have been made to align heterogeneous schemas and the corresponding keywords in those schemas. In the project I am working on with Cyndy Chandler and Adam Shepherd from WHOI, domain data scientists from the US, EU and Australia try to align part of their research data with entries in a commonly accepted vocabulary, the NERC vocabulary, which looks something like http://vocab.nerc.ac.uk/collection/L06/current/. Most of the work so far is done by manual alignment, so I am here to apply some natural language processing tricks to align the terms based on their metadata, such as descriptions and definitions. Some folks are working on this already, though not as a completely automatic process: they try to provide a guideline tool for scientists who are not sure which terms to use when devising their dataset. They learn the “recommended” terms by feeding in a large set of vocabularies and clustering them with LDA. However, I really doubt whether it would work, as LDA is an unsupervised learning method, which means we can’t specify which data should go into the same category. That in turn means we disregard an important aspect of the dataset, which is not a good basis for an appropriate model. Besides, I have further concerns about the applicability of this work, as most scientists are familiar with which vocabularies to use, and those who are not might not be domain scientists. Usually, computer scientists work under the supervision of domain scientists while creating these datasets, so it is still not necessary for them to use this vocabulary guideline service. Anyway, bringing some AI techniques to this domain is always a good thing to try.
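To make the idea of aligning terms by their descriptions concrete, here is a minimal sketch in Python. It is not the method used in the project: it simply compares bag-of-words vectors with cosine similarity, and the labels and descriptions in `vocabulary` and `local_terms` are made up for illustration rather than copied from the NERC collection.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(local_terms, vocabulary):
    """Map each local term to the vocabulary entry with the most similar description."""
    vocab_vectors = {key: Counter(tokenize(desc)) for key, desc in vocabulary.items()}
    result = {}
    for name, desc in local_terms.items():
        vec = Counter(tokenize(desc))
        best = max(vocab_vectors, key=lambda k: cosine(vec, vocab_vectors[k]))
        result[name] = (best, cosine(vec, vocab_vectors[best]))
    return result


# Hypothetical vocabulary entries (label -> definition), for illustration only.
vocabulary = {
    "research vessel": "a ship equipped to carry out scientific research at sea",
    "mooring": "a fixed platform anchored to the sea floor carrying instruments",
}

# Hypothetical terms from a local dataset (name -> description).
local_terms = {
    "R/V": "ship used for scientific research cruises at sea",
    "buoy_site": "instruments on a platform anchored to the sea floor",
}

matches = align(local_terms, vocabulary)
for name, (label, score) in matches.items():
    print(f"{name} -> {label} (similarity {score:.2f})")
```

In practice the similarity score would only be a suggestion for a human curator to confirm, which is closer to the semi-automatic guideline tools mentioned above than to a fully automatic alignment.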

Data exploration and data portals:
Another trend is that many people are concerned with building data portals. A lot of them feature faceted search, map-based search and filtering. When I saw a number of posters from George Mason, I realized that they are doing the same thing as S2S, except that their facets and user interfaces are generated from hard-coded queries and XML files. They are not using an ontology as the guideline for the hierarchy of the user interface. But when I briefed them on our S2S work, they became very interested and eager to try it out, provided the source code and enough documentation for the S2S framework are available. I showed them our work on DCO, which is not perfect but already articulates the idea. They even suggested creating a user community for the S2S framework, so that whenever they have a question or would like to contribute a widget, there is a place where they can discuss it and commit the code. Besides George Mason, people from other schools raised the same suggestion.

Big data analysis:
Lots of people are talking about big data analysis, ranging from NASA to small institutes. What I really expected from the talks was the technical side: the scale of their datasets and how they solve their problems using parallelism, redundancy and so on. However, most of the talks discussed the architecture of their systems and its key components without much detail on what the data looks like, what problems they came across and how those problems were solved. This might be because the audience at this meeting mostly has a domain background and doesn’t really know or care about those technical details.

Service Computing:
The best thing I found about this meeting is that there are several people working towards the same goal as I am, but with slightly different approaches. That is good because it shows that what I am interested in is meaningful, and that there is still room for me since we are not doing exactly the same thing. One professor I talked with is Prof. Jia Zhang from Carnegie Mellon. She is working on helping scientists reuse data, practices and algorithms, so as to prevent reinventing the wheel and, more importantly, to speed up the adoption of someone else’s work and save time and effort. Moreover, she has also developed a service recommendation system for scientists. The system is able to suggest specific algorithms based on the input metadata and the goal. The algorithm workflow is worked out using path-finding algorithms. However, they have not yet provided a web-based platform to execute the workflow, and they are not using a reasoning engine to find the path. I talked about some of my work, ideas and thoughts, and about how I would approach the problem using semantic web technologies. We were both happy after the conversation because, although the goal is the same, the approaches are slightly different, so each is a good source of reference and collaboration for the other.
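As a rough illustration of what working out a workflow by path-finding can look like, here is a minimal Python sketch. It is my own toy example, not Prof. Zhang’s system: the service names and data types in `SERVICES` are hypothetical, and the search is a plain breadth-first search from the input type to the goal type.

```python
from collections import deque

# Hypothetical service catalogue: service name -> (input type, output type).
SERVICES = {
    "read_netcdf": ("netcdf_file", "grid"),
    "regrid": ("grid", "regular_grid"),
    "monthly_mean": ("regular_grid", "climatology"),
    "plot_map": ("climatology", "figure"),
    "to_csv": ("grid", "table"),
}


def find_workflow(source_type, goal_type, services=SERVICES):
    """Breadth-first search for the shortest chain of services
    that turns source_type into goal_type; returns None if no chain exists."""
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        current, path = queue.popleft()
        if current == goal_type:
            return path
        for name, (inp, out) in services.items():
            if inp == current and out not in seen:
                seen.add(out)
                queue.append((out, path + [name]))
    return None


print(find_workflow("netcdf_file", "figure"))
print(find_workflow("table", "figure"))
```

A reasoning engine would replace the exact string match on types with inference over an ontology, for example accepting a service whose input is a superclass of the data at hand.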

Looking forward:

After the meeting, I have a clearer picture of the significance of my work, possible directions and potential collaborators. One regret is that I didn’t give a presentation on any of my work, so this is a really good catalyst for me to get down to work and contribute some of it to this community in the coming year.


ODIP workshop notes – part 1 – a unique viewpoint of data citation

December 20th, 2013

During the last two weeks I’ve been on this exciting journey to attend ODIP workshop #2 plus AGU fall meeting 2013. I’ve been talking with both researchers and managers (definitions will come in a later blog) and taking notes of both what we’ve been talking about and what I’ve been thinking of. This is the first time I feel that I would like to write something so much. I’ll split my stories and thoughts into multiple parts to make each of them really coherent.

So here is the first story. It’s about the process of producing scientific time series data, presented by Justin Buck from BODC at ODIP workshop #2. I cannot find his slides now, so I recreated one of the plots in his slides from memory as follows.

[Figure: justin-buck-data-producing-process]

Most of the data are recorded at almost the same time as they are observed, shown as blue crosses in the above figure. Some data are missed at observation time, so they need to be filled in later if possible, shown as green crosses. Finally, corrections are made to the data for various reasons, as the red cross indicates.

Justin presented this data producing process in the context of data citation. He then continued to point out three kinds of data citations based on the ever-changing nature of the data to cite:

  1. cite a time slice, which includes data recording, adding and editing logs within a certain period of time along the time series;
  2. cite a snapshot, which is the “as is” data at a certain time point;
  3. cite the continuum, which includes every change made to the data set up till a certain time.
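To check my own understanding of the three kinds of citation, I wrote a small Python sketch. Everything in it is an assumption of mine rather than something from Justin’s slides: the `Change` record, the toy `LOG`, and in particular my reading that a time slice selects log entries by their position along the time series, while a snapshot and the continuum are cut off by the time a change was logged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    logged_at: int    # when the change was written to the log
    observed_at: int  # position on the time series the value belongs to
    value: float
    kind: str         # "record", "add" (late fill-in) or "edit" (correction)


# Toy log: three values recorded on time, one filled in later, one corrected.
LOG = [
    Change(1, 1, 10.0, "record"),
    Change(2, 2, 11.0, "record"),
    Change(4, 4, 13.0, "record"),
    Change(5, 3, 12.0, "add"),
    Change(6, 2, 11.5, "edit"),
]


def time_slice(log, start, end):
    """All recording, adding and editing entries for observations in [start, end]."""
    return [c for c in log if start <= c.observed_at <= end]


def snapshot(log, at):
    """The "as is" data at time `at`: the latest logged value per observation."""
    state = {}
    for c in sorted(log, key=lambda c: c.logged_at):
        if c.logged_at <= at:
            state[c.observed_at] = c.value
    return dict(sorted(state.items()))


def continuum(log, until):
    """Every change made to the data set up to and including time `until`."""
    return [c for c in log if c.logged_at <= until]


print(snapshot(LOG, 4))
print(snapshot(LOG, 6))
```

The two snapshots differ even though they cover the same series, which is exactly why a citation needs to say which of the three things it points to.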

Very interesting viewpoint of data citation.

 


What is ontology?

December 19th, 2013

After five days at the American Geophysical Union 2013 Fall Meeting discussing Earth and space science informatics, the blog topic on my mind is an introduction to ontology for researchers in Earth and environmental sciences and beyond.

To attract your interest, I would say that ontology is the invisible hand behind anything. (It took me a few minutes to think about whether I should add an ‘an’ before the ‘ontology’ here. For reasons see below.)

First let’s see the etymology of the word ‘ontology’. According to Wiktionary (http://en.wiktionary.org/wiki/ontology), ontology is ‘originally Latin ontologia (1606, Ogdoas Scholastica, by Jacob Lorhard (Lorhardus)), from Ancient Greek ὤν (ōn, “on”), present participle of εἰμί (eimi, “being, existing, essence”) + λόγος (logos, “account”).’

Second, let’s see the definition of the word. It is interesting to see that Wiktionary claims that in philosophy the word ‘ontology’ can be either uncountable or countable. For the former, ontology is defined by Wiktionary as ‘The branch of metaphysics that addresses the nature or essential characteristics of being and of things that exist; the study of being.’ This definition is more or less the same as the one given by the Oxford English Dictionary, ‘The science or study of being; that branch of metaphysics concerned with the nature or essence of being or existence.’ That Oxford definition was used in my PhD defense (http://www.slideshare.net/MarshallXMa/ontology-spectrum-for-geological-data-interoperability-phddefence). For the countable ‘ontology’, Wiktionary defines it as ‘The theory of a particular philosopher or school of thought concerning the fundamental types of entity in the universe.’ I have not done any work relevant to that definition yet, but I just found that Oxford also has a similar definition: ‘As a count noun: a theory or conception relating to the nature of being.’

The word metaphysics is mentioned in the definition of ontology as an uncountable noun. Nowadays, when people talk about metaphysics, they often refer to Aristotle (384 – 322 BCE). If you (especially those who are working towards a Doctor of PHILOSOPHY ;-)) are interested in his work, you can read his two most famous books, 1) Politics: A Treatise on Government and 2) The Ethics of Aristotle, on the Gutenberg website (http://www.gutenberg.org/ebooks/author/2747). The story does not stop here. A famous Chinese book, the I Ching (or the Book of Changes, c. 450 – 250 BCE), also touches on metaphysics, for instance in a sentence which is my personal favorite: ‘What is above form is called Tao; what is within form is called tool.’

The philosophical meaning of the word ontology is the background; in most cases in the domain of Earth and space science informatics we care more about another meaning of the word: ontology as a countable noun in computer science. Before discussing the definition of ontology as a computer science word, let’s first see how hot this word has been in recent years. I did a few searches with the topic ‘ontology’ in isiknowledge.com (on Dec 19, 2013), which showed that there are about 44884 publications for all years, and the publication numbers for separate periods are 1470/1945–1995, 1498/1995–2000, ~7901/2000–2005, ~24528/2005–2010, and ~16891/2010–2013. When I refined the results by limiting them to the research area ‘Computer Science’, the results were: ~22251/all years, 114/1945–1995, 673/1995–2000, ~5095/2000–2005, ~14316/2005–2010, and ~5971/2010–2013. There are also a large number of publications that applied informatics but were filtered out by the keyword ‘Computer Science’. We can read many things from those results; one is that the amount of work using the computer science ‘ontology’ has been increasing significantly since 2000.

For the definition of the computer science word ‘ontology’, many people have cited the publications of T.R. Gruber (1993, 1995, see: http://dx.doi.org/10.1006/knac.1993.1008 and http://dx.doi.org/10.1006/ijhc.1995.1081): ‘An ontology is an explicit specification of a conceptualization’. The mid-1990s were the golden age for discussing the definition of ontology. N. Guarino (1997, see: http://dx.doi.org/10.1006/ijhc.1996.0091) wrote a nice review of the definition of ‘ontology’, in which I think one key point he discussed was the ‘shared conceptualization’ feature of an ontology. So in my PhD dissertation (Ma, 2011, see: http://www.itc.nl/library/papers_2011/phd/ma.pdf) I tried to re-address the definition of the computer science ‘ontology’: ‘Ontologies in computer science are defined as shared conceptualizations of domain knowledge (Gruber, 1995; Guarino, 1997b)…’

Third, after seeing the definition of ontology, let’s focus on how to put a computer science ‘ontology’ into practice, especially in the domain of Earth and space science informatics. The early 2000s were the golden age for that work. McGuinness (2003, see: http://www-ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-%28with-citation%29.htm) gave a wonderful discussion of the ontology spectrum. McGuinness also added a footnote to that spectrum figure: ‘This spectrum arose out of a conversation in preparation for an ontology panel at AAAI ’99. The panelists (Gruninger, Lehman, McGuinness, Ushold, and Welty), chosen because of their years of experience in ontologies found that they encountered many forms of specifications that different people termed ontologies. McGuinness refined the picture to the one included here.’ When I was doing my PhD I read this note and tried to find other publications by the panelists listed by McGuinness, and I did find a few that also discussed the ontology spectrum, for example:
Welty, C., 2002. Ontology-driven conceptual modeling. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (Eds.), Advanced Information Systems Engineering, Lecture Notes in Computer Science, vol. 2348. Springer-Verlag, Berlin & Heidelberg, Germany, pp. 3-3. Lecture slides available at: http://www.cs.toronto.edu/caise02/cwelty.pdf
Obrst, L., 2003. Ontologies for semantically interoperable systems. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, New Orleans, LA, USA, 366-369.
Uschold, M., Gruninger, M., 2004. Ontologies and semantics for seamless connectivity. SIGMOD Record 33 (4), 58–64.
Borgo, S., Guarino, N., Vieu, L., 2005. Formal ontology for semanticists. In: Lecture notes of the 17th European Summer School in Logic, Language and Information (ESSLLI 2005), Edinburgh, Scotland, 12pp. http://www.loa-cnr.it/Tutorials/ESSLLI1.pdf

OS1
An ontology spectrum (from McGuinness 2003)

To help myself understand the ontology spectrum better, I redrew the diagram (see below) in my PhD dissertation. Very recently (Dec 03, 2013) Jim McCusker, a PhD student with McGuinness, gave a thorough explanation of the spectrum in his blog (see: http://info.5amsolutions.com/blog/bid/154967/6-Points-Along-the-Ontology-Spectrum).

OS2
Ontology spectrum (adapted from Borgo et al., 2005; McGuinness, 2003; Obrst, 2003; Uschold and Gruninger, 2004; Welty, 2002). Texts in italics explain a typical relationship in each ontology type (from Ma 2011)

Finally, I would like to share a few examples for different types of ontologies following the spectrum:

Catalog/Glossary:
Neuendorf, K.K.E., Mehl, J.J.P., Jackson, J.A., 2005. Glossary of Geology, 5th edition. American Geological Institute: Alexandria, VA, USA, p. 800. See latest version at: http://www.agiweb.org/pubs/glossary/

Taxonomy:
BGS Rock Classification Scheme, see: https://www.bgs.ac.uk/bgsrcs/

Thesaurus:
AQSIQ, 1988. GB/T 9649-1988 The Terminology Classification Codes of Geology and Mineral Resources. General Administration of Quality Supervision, Inspection and Quarantine of P.R. China (AQSIQ). Standards Press of China, Beijing, China. 1937 pp.

Conceptual Schema:
NADM Steering Committee, 2004. NADM Conceptual Model 1.0—A conceptual model for geologic map information: U.S. Geological Survey Open-File Report 2004-1334, North American Geologic Map Data Model (NADM) Steering Committee, Reston, VA, USA, 58 pp. See: http://pubs.usgs.gov/of/2004/1334

Ontologies encoded in RDF format:
Semantic Web for Earth and Environmental Terminology (SWEET). See: http://sweet.jpl.nasa.gov/

Now a short wrap-up of what ontology is:
For fun: the invisible hand behind anything;
In philosophy: (uncountable) the science or study of being; that branch of metaphysics concerned with the nature or essence of being or existence; (countable) a theory or conception relating to the nature of being;
In computer science: shared conceptualization of domain knowledge.

To put ontologies (computer science) into practice, keep in mind an ontology spectrum of increasingly rich semantics: catalog/glossary -> taxonomy -> thesaurus -> conceptual schema -> formal constraints.
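For readers who prefer code to diagrams, the spectrum can be sketched in Python as the same few rock terms carrying more and more structure at each step. This is only an illustration under my own assumptions: the definitions are simplified, the `grain_size_mm` attribute and the positivity rule are invented for the example, and real ontologies would be encoded in a language such as RDF or OWL rather than in Python dictionaries.

```python
# Catalog/glossary: terms with human-readable definitions only.
glossary = {
    "rock": "naturally occurring solid aggregate of minerals",
    "igneous rock": "rock formed by the cooling of magma or lava",
    "granite": "coarse-grained igneous rock rich in quartz and feldspar",
}

# Taxonomy: adds a strict is-a hierarchy (term -> broader term).
broader = {"granite": "igneous rock", "igneous rock": "rock"}

# Thesaurus: adds non-hierarchical links such as related terms.
related = {"granite": ["rhyolite"]}

# Conceptual schema: adds attributes and their expected value types
# (grain_size_mm is a hypothetical attribute for this example).
schema = {"granite": {"grain_size_mm": float}}


def is_a(term, ancestor):
    """Follow the taxonomy upwards to test whether term is a kind of ancestor."""
    while term in broader:
        term = broader[term]
        if term == ancestor:
            return True
    return False


def validate(term, record):
    """Formal constraints: machine-checkable rules on top of the schema.
    Here: every declared attribute has the right type and is positive."""
    for attr, expected_type in schema.get(term, {}).items():
        value = record.get(attr)
        if not isinstance(value, expected_type) or value <= 0:
            return False
    return True


print(is_a("granite", "rock"))
print(validate("granite", {"grain_size_mm": 2.0}))
```

Each step keeps everything from the previous one and adds something a machine can act on, which is the sense in which the meaning gets richer along the spectrum.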
