Author Archive

Data Management – Serendipity in Academic Career

November 11th, 2014

A few days ago I began to think about a topic for a blog post, and the first thing that came to mind was ‘data management’, followed by a line from a Chinese poem: ‘无心插柳柳成荫’ (literally, a willow twig planted without intent grows into shade). I went to Google for an English translation of that line and the result was ‘Serendipitously’. Interesting: I had never seen that word before, and I had to use a dictionary to find that ‘serendipity’ means an unintended positive outcome, which expresses the meaning of the Chinese line quite well. So, I regard data management as serendipity in my academic career. I say so because I was trained as a geoinformatics researcher through my education in China and the Netherlands, so how did it come about that most of my time is now spent on data management?

One clue I can see is that I have been working on ontologies, vocabularies and conceptual models for geoscience data services, which is relevant to data management. Another, more relevant clue is a symposium, ‘Data Management in Research: A Challenging Issue’, held on the University of Twente campus in spring 2011. Dr. David Rossiter, Ms. Marga Koelen, a few other ITC colleagues and I attended the event. That symposium highlighted both technical and social/cultural issues faced by the 3TU.Datacentrum (http://datacentrum.3tu.nl/en/home/), a data repository for the three technological universities in the Netherlands. It is very interesting to see that several topics of my current work had already been discussed in that symposium, whereas I paid almost no attention because I was completely focused on my vocabulary work at that time. Since I am now working on data management, I would like to introduce a few concepts relevant to it and the current social and technical trends.

Data management, in simple words, means what you do with your datasets during and after a research project. Conventionally, we treat the paper as the ‘first class’ product of research, and many scientists pay less attention to data management. This may lower the efficiency of research activities and hinder communication among research groups in different institutions. There is even a rumor that 80% of a scientist’s time is spent on data discovery, retrieval and assimilation, and only 20% on data analysis and scientific discovery. Ideally we would reverse that allocation of time, but doing so requires effort on both a technical infrastructure for data publication and a set of appropriate incentives for data authors.

After I came to the United States, the first data repository that caught my attention was the California Digital Library (CDL) (http://www.cdlib.org/), which is similar to the services offered by 3TU.Datacentrum. I like CDL’s technical architecture not only because it provides a place for depositing datasets but also, and more importantly, because it provides a series of tools and services (http://www.cdlib.org/uc3/) that allow users to draft data management plans to address funding agency requirements, to mint unique and persistent identifiers for published datasets, and to improve the visibility of the published datasets. The term data publication is derived from paper publication. By documenting metadata, minting unique identifiers (e.g., Digital Object Identifiers (DOIs)), and archiving copies of datasets in a repository, we can make a published dataset similar to a published paper. The identifier and metadata make the dataset citable, just as we do with published papers. A global initiative, DataCite, has been working on standards for metadata schemas and identifiers for datasets, and is increasingly endorsed by data repositories across the world, including both CDL and 3TU.Datacentrum. A technological infrastructure for data publication is emerging, and now people are beginning to talk about the cultural change needed to treat data as a ‘first class’ product of research.
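To make the idea of a citable dataset a bit more concrete, here is a minimal Python sketch that assembles a citation string from a handful of metadata fields. The field names, the example record and the DOI are all made up for illustration, and the output only loosely follows the creator / year / title / publisher / identifier pattern seen in data citation guidelines; it should not be read as the official format of DataCite or of any repository.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal metadata for a published dataset (hypothetical field names)."""
    creators: list[str]
    year: int
    title: str
    publisher: str  # typically the data repository
    doi: str        # persistent identifier minted at publication time


def format_citation(record: DatasetRecord) -> str:
    """Build a human-readable citation from the metadata fields."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Entirely fictional record; the DOI is a placeholder, not a real identifier.
example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2014,
    title="Example geologic map dataset",
    publisher="Example Data Repository",
    doi="10.1234/example.5678",
)

print(format_citation(example))
```

The point of the sketch is only that, once the metadata and the identifier exist, citing a dataset becomes as mechanical as citing a paper.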

Though funding agencies already require data management plans in funding proposals, such as the requirements of the National Science Foundation in the US and Horizon 2020 in the EU (a Google search with the key word ‘data management’ and the name of the funding agency will help find the agency’s guidelines), the science community still has a long way to go to give data publication the same attention as it gives to paper publication. Various community efforts have been made to promote data publication and citation. FORCE11 published the Joint Declaration of Data Citation Principles (https://www.force11.org/datacitation) in 2013 to promote the good research practice of citing datasets. Earlier than that, in 2012, the Federation of Earth Science Information Partners published Data Citation Guidelines for Data Providers and Archives (http://commons.esipfed.org/node/308), which offers more practical details on how a published dataset should be cited. In 2013, the Research Data Alliance (https://rd-alliance.org/) was launched to build the social and technical bridges that enable open sharing of data, which enhances existing efforts, such as CODATA (http://www.codata.org/), to promote data management and sharing.

To promote data citation, a number of publishers have launched so-called data journals in recent years, such as Scientific Data (http://www.nature.com/sdata/) of Nature Publishing Group, Geoscience Data Journal (http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%292049-6060) of Wiley, and Data in Brief (http://www.journals.elsevier.com/data-in-brief/) of Elsevier. Such a data journal often has a number of affiliated and certified data repositories. A data paper allows the authors to describe a dataset published in a repository. A data paper itself is a journal paper, so it is citable, and the dataset is also citable because it has associated metadata and an identifier in the data repository. This makes data citation flexible (and perhaps confusing): you can cite a dataset by citing the identifier of the associated data paper, the identifier of the dataset itself, or both. More interestingly, a paper can cite a dataset, a dataset can cite a dataset, and a dataset can also cite a paper (e.g., because the dataset may be derived from tables in a paper). The Data Citation Index (http://wokinfo.com/products_tools/multidisciplinary/dci/) launched by Thomson Reuters provides services to index the world’s leading data repositories, to connect datasets to related literature indexed in the Web of Science database, and to search and access data across subjects and regions.

Although there has been such huge progress on data publication and citation, we are not yet at the point of fully treating data as ‘first class’ products of research. One recent piece of good news is that, in 2013, the National Science Foundation renamed the Publications section in the biographical sketch of funding applicants as Products and allowed datasets and software to be listed there (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). However, this is still just a small step. We hope more incentives of this kind will appear in academia. For instance, even though we have the Data Citation Index, are we ready to mix data citations and paper citations to generate the H-index of a scientist? And even if there is such an H-index, are we ready to use it in research assessment?
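As a small illustration of what mixing the two kinds of citations would mean in practice, the Python sketch below computes an H-index from paper citations alone and then from papers and datasets together. The citation counts are invented for the example, and pooling the two lists is just one possible way to combine them, not an established metric.

```python
def h_index(citation_counts):
    """Largest h such that at least h items have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Invented numbers for one hypothetical scientist.
paper_citations = [25, 12, 8, 3, 1]
dataset_citations = [9, 4, 0]

papers_only = h_index(paper_citations)
combined = h_index(paper_citations + dataset_citations)

print("H-index from papers only:", papers_only)
print("H-index from papers and datasets:", combined)
```

With these made-up numbers a single well-cited dataset is enough to raise the index, which is exactly why the question of whether such a mixed index should be used in research assessment is not trivial.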

Data management involves so many social and technological issues, which makes it quite different from the purely technical questions in geoinformatics research. This is enjoyable work, and as a next step I may spend more time on data analysis, about which I may share a few ideas in another blog post.


A layer cake of spatial data, and in a jigsaw puzzle style

September 4th, 2014

During a lunch at the GeoData 2014 workshop in Boulder, CO, USA, in June 2014, people sitting around the table began to chat about data sharing, data formats and interoperability, all topics relevant to geoscience data; after all, inter-agency data interoperability was the central topic of that workshop. When someone raised the topic of comparing data sharing policies in the USA with those in Europe and China, a few people (those who know me) looked at me and began to smile. Yes, I am confident in saying that I have some comments on geoscience data sharing in Europe.

Before I came to the USA I spent about four and a half years in the Netherlands working on a PhD degree on geoscience data interoperability. When I look back, it seems very interesting, because I knew nothing about what was happening in data sharing in Europe before I headed to ITC. But the world is a really small circle. In the second year of my PhD study, I got in contact with a colleague in the Commission for Management and Application of Geoscience Information of the International Union of Geological Sciences, who worked at the Geological Survey of the Netherlands in Utrecht. I visited him several times, and from him I also came to know about the giant data sharing initiative of the EU, the Infrastructure for Spatial Information in Europe (INSPIRE).

Initially, what attracted me was some technical details in INSPIRE, especially those surrounding the work on vocabulary modeling and web map services. INSPIRE covers 34 data themes, among which geology is my favorite because geological data was the topic of my PhD work at ITC. And I really appreciated the data specification working group of the Geology theme in INSPIRE, as colleagues in that group offered me so many fresh technical ideas. Then, in my fourth ITC year, when I began to prepare my PhD dissertation and the defense, a guideline, ‘Don’t get lost in details, look at the big picture’, inspired me to review INSPIRE from another angle and discuss my ideas with advisors and colleagues at ITC.

I forgot to mention that many such discussions happened during coffee breaks or lunch breaks at ITC (well, there is no such culture in the USA). And then, one day, during such a coffee break chat, a view came into my mind – a jigsaw puzzle layer cake – a nice analogy for the INSPIRE initiative: the 34 data themes represent 34 layers and the 27 EU nations (in 2011) represent 27 puzzle pieces. The data specifications and implementation rules of INSPIRE are the recipes for making cakes, and the public agencies in EU nations are the cake cooks.

A 'jigsaw puzzle layer cake view' of the EU INSPIRE initiative
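For readers who think more easily in data structures than in cakes, the view can be sketched in a few lines of Python: every combination of a data theme (layer) and a nation (puzzle piece) is one cell to be filled. The theme and nation labels below are placeholders rather than the real INSPIRE theme names or country codes, and the boolean ‘baked’ flag is a simplification invented for this sketch.

```python
from itertools import product

N_THEMES = 34   # INSPIRE data themes: the layers of the cake
N_NATIONS = 27  # EU member nations in 2011: the puzzle pieces

# Placeholder labels, not the official theme names or country codes.
themes = [f"theme_{i:02d}" for i in range(1, N_THEMES + 1)]
nations = [f"nation_{i:02d}" for i in range(1, N_NATIONS + 1)]

# One entry per (layer, piece); False means the piece is not yet baked.
cake = {piece: False for piece in product(themes, nations)}


def layer_complete(cake, theme):
    """True when every nation has delivered its piece for the given theme."""
    return all(done for (t, _), done in cake.items() if t == theme)


# Pretend every nation has published its data for the first theme.
for nation in nations:
    cake[("theme_01", nation)] = True

print(len(cake), "pieces in total")
print("theme_01 complete:", layer_complete(cake, "theme_01"))
print("theme_02 complete:", layer_complete(cake, "theme_02"))
```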

This ‘cake’ view sounds like a jest, but I took it seriously, and I know that people in GIScience are used to describing data as a layer cake. I drafted a manuscript to describe my view immediately after that coffee break chat, but, contrary to my plan, the short article was not published until four years later – actually, just one month before the lunch table meeting at GeoData 2014, and the EU has 28 nations now (Croatia joined in 2013). The article is accessible at http://onlinelibrary.wiley.com/doi/10.1002/2014EO190006/abstract.

The INSPIRE initiative is a combination of bottom-up and top-down approaches. The bottom-up approach is reflected in the work on drafting data specifications and constructing the technical infrastructure, which represents the consensus of experts from the EU nations. The top-down approach is reflected in the formally issued EU directive for INSPIRE, which makes it a de jure initiative; that is, EU member nations are required to comply with the INSPIRE data specifications and implementation rules when building their national spatial data infrastructures.

The USA has a different administrative system compared with the EU. That, more or less, is also reflected in its geoscience data sharing policies and technologies. However, people here also build such data cakes. How can the USA benefit from the EU experience, and what suggestions can it provide based on its own work? I do not have a single answer now, but I hope I will have some comments in a few years. Fortunately, similar to my encounter with the colleague at the Geological Survey of the Netherlands, I have now also come to know colleagues at NASA, USGS, NOAA, EPA, USGCRP, and more, who are showing me the picture of geoscience data issues in the USA.

Categories: Data Science, tetherless world

Geoscience in the Web era – a few facets

July 30th, 2014

In mid-July 2014 I attended the DCO summer school at Big Sky Resort, MT, with a 2-day field trip to Yellowstone National Park (YNP) – a nice experience – the venue was wonderful, and so were the topics covered by the curriculum. But what impressed me the most was seeing how the Web brings changes to geoscience work as well as to geoscientists.

We had three excellent field trip guides, Lisa Morgan, Pat Shanks and Bill Inskeep. They prepared and distributed an 82-page YNP field trip guide! Of course they first shared it online through Dropbox. What also impressed me is that when I showed my golden spike information portal to Lisa, she in turn showed me a few apps on her iPhone with state geologic map services – useful gadgets for field work. But our field trip experience in YNP showed that a paper map is still necessary, as it is bigger, provides an overview of a wider area, and needs no battery.

The YNP itself has a virtual observatory website called Yellowstone Volcano Observatory, hosted by USGS and University of Utah. The portal provides “timely monitoring and hazard assessment of volcanic, hydrothermal, and earthquake activity in the Yellowstone Plateau region.” Featured information includes publications, online mapping services, and also images, videos and webcams about YNP.

I was happy to see that Katie Pratt and I were accompanied by many other summer school participants when we were tweeting on Twitter. Search the hashtag #DCOSS14 and you will find how active the participants were on Twitter during the summer school. I was even a little surprised to see that Donato Giovannelli (@d_giovannelli) helped answer a question about Twitter’s impact on citations by pasting the link to a paper, a few seconds after I gave a short introduction to Altmetric.com and its use by Nature Publishing Group, Springer and Wiley.

My role at the summer school was two-fold: participant and lecturer. I gave a presentation titled ‘Why data science matters and what we can do with it’, in which I addressed four sub-topics: data management and publication, interoperability of data, provenance of research, and the era of Science 2.0. The slides are accessible on SlideShare.

Categories: tetherless world

Conceptual model of a workshop

June 24th, 2014

In June 2014 I helped organize two workshops, the DCO Data Science Day 2014 and GeoData 2014. The experience was unique and I think it is worth writing down some notes for future events. I hope they will also be useful to other people who are planning to organize a workshop or small conference.
The list of models below follows the idea of an ontology spectrum.

Model 1 (via Bruce Caron, easy and impressive): people, coffee, beer + shaking well.

Model 2 (following the context model of 5W1H): date, topic, location, people, agenda, logistics.

Model 3 (things to do – result of a brainstorm):
0 website;
1 date;
2 central topic, purpose, output;
3 topic of sessions, preferred topic of invited talks, topic of panels, topic of breakouts;
4 meeting rooms, hotel, visa application support;
5 organizing committee, meeting chair, session chair, invited speaker, breakout moderator, note taker, technical assistant, workshop report writer;
6 handouts pack (agenda, badge, logistics memo);
7 logistics: announcement, wifi, power strips, emergency contact, projector, whiteboard and marker, remote access facility, alcohol service permission, travel support, travel agency, dietary requirement, morning and afternoon break, lunch, dinner, reception, local transportation, reimbursement method.

Model 4 (following a timeline):
0 Science: topic, purpose;
1 Finance: meeting budget;
2 Planning: meeting proposal, organizing committee, logistics administrator, organizing meetings, date, location, announcement;
3 Agenda: topic of sessions, preferred topic of invited talks, topic of panels, topic of breakouts, meeting rooms, meeting chair, session chair, invited speaker, breakout moderator, note taker, technical assistant;
4 Logistics: handouts pack (agenda, badge, reimbursement form, logistics memo), emergency contact, wifi, power strips, projector, whiteboard and marker, remote access facility, travel support, travel agency, hotel, visa application support, dietary requirement, morning and afternoon break, lunch, dinner, reception, alcohol service permission, local transportation, reimbursement method;
5 Output: online virtual community of attendees, workshop summary and recommendations, workshop report writer.

Model 5 (an ontology? ;-))
Should be something like:
twc:Workshop a prov:Activity.
twc:SessionChair a prov:Role.
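To play with Model 5 without any RDF tooling, the two statements can be held as plain subject–predicate–object tuples in Python, as in the rough sketch below. The `prov:` namespace IRI is the W3C PROV one, but the `twc:` IRI is a made-up placeholder (the real one is not given here), and the tiny query helper is only an illustration, not a replacement for a proper RDF library.

```python
PROV = "http://www.w3.org/ns/prov#"
TWC = "http://example.org/twc#"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The two statements of Model 5; "a" in the snippet above stands for rdf:type.
triples = {
    (TWC + "Workshop", RDF_TYPE, PROV + "Activity"),
    (TWC + "SessionChair", RDF_TYPE, PROV + "Role"),
}


def subjects_of_type(graph, type_iri):
    """Return all subjects declared to be of the given type, sorted."""
    return sorted(s for (s, p, o) in graph if p == RDF_TYPE and o == type_iri)


def to_ntriples(graph):
    """Serialize the graph as N-Triples lines (IRIs only, no literals)."""
    return [f"<{s}> <{p}> <{o}> ." for (s, p, o) in sorted(graph)]


print(subjects_of_type(triples, PROV + "Activity"))
print("\n".join(to_ntriples(triples)))
```

Extending the set with more statements, for instance for the roles listed in Models 3 and 4, would move this toy one step further along the spectrum.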

Comments and additions are welcome!

Categories: tetherless world

What is ontology?

December 19th, 2013

The blog topic on my mind, after five days at the American Geophysical Union 2013 Fall Meeting discussing Earth and space science informatics, is an introduction to ontology for researchers in Earth and environmental sciences and beyond.

To attract your interest, I would say that ontology is the invisible hand behind anything. (It took me a few minutes to think about whether I should add an ‘an’ before the ‘ontology’ here. For reasons see below.)

First let’s see the etymology of the word ‘ontology’. According to Wiktionary (http://en.wiktionary.org/wiki/ontology), ontology is ‘originally Latin ontologia (1606, Ogdoas Scholastica, by Jacob Lorhard (Lorhardus)), from Ancient Greek ὤν (ōn, “on”), present participle of εἰμί (eimi, “being, existing, essence”) + λόγος (logos, “account”).’

Second, let’s see the definition of the word. It is interesting to see that Wiktionary claims that in philosophy the word ‘ontology’ can be either uncountable or countable. For the former, ontology is defined by Wiktionary as ‘The branch of metaphysics that addresses the nature or essential characteristics of being and of things that exist; the study of being.’ This definition is more or less the same as the one given by the Oxford English Dictionary, ‘The science or study of being; that branch of metaphysics concerned with the nature or essence of being or existence.’ That Oxford definition was used in my PhD defense (http://www.slideshare.net/MarshallXMa/ontology-spectrum-for-geological-data-interoperability-phddefence). For the countable ‘ontology’, Wiktionary defines it as ‘The theory of a particular philosopher or school of thought concerning the fundamental types of entity in the universe.’ I have not done any work relevant to that definition yet, but I just found that Oxford also has a similar definition: ‘As a count noun: a theory or conception relating to the nature of being.’

The word metaphysics is mentioned in the definition of ontology as an uncountable noun. Nowadays, when people talk about metaphysics they often refer to Aristotle (384 – 322 BCE). If you (especially those who are working toward a Doctor of PHILOSOPHY ;-)) are interested in his work, you can read two of his most famous books, 1) Politics: A Treatise on Government and 2) The Ethics of Aristotle, on the Gutenberg website (http://www.gutenberg.org/ebooks/author/2747). The story does not stop here. In a famous Chinese book, I Ching (or the Book of Changes, c. 450 – 250 BCE), there are also topics about metaphysics, such as a sentence which is my personal favorite: ‘What is above form is called Tao; what is within form is called tool.’

The philosophical meaning of the word ontology is the background; in most cases in the domain of Earth and space science informatics we care more about another meaning of the word: ontology as a countable noun in computer science. Before discussing the definition of ontology as a computer science word, let’s first see how hot this word has been in recent years. I did a few searches with the topic ‘ontology’ in isiknowledge.com (on Dec 19, 2013), which showed that there are about 44884 publications for all years, and the publication numbers for separate periods are 1470/1945–1995, 1498/1995–2000, ~7901/2000–2005, ~24528/2005–2010, and ~16891/2010–2013. When I refined the results by limiting them to the research area ‘Computer Science’, the results were: ~22251/all years, 114/1945–1995, 673/1995–2000, ~5095/2000–2005, ~14316/2005–2010, and ~5971/2010–2013. There are also a large number of publications in applied informatics that were filtered out by the keyword ‘Computer Science’. We can read many things from those results; one is that work on the computer science ‘ontology’ has been increasing significantly since 2000.
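The search counts above are easier to compare once they are put side by side. The short Python sketch below simply tabulates the numbers quoted in the paragraph and derives the share of ‘Computer Science’ records in each period. The period labels are copied as given, so boundary years may be counted twice and the last period is shorter than the others; the shares are therefore only a rough indication.

```python
# Counts quoted above (topic 'ontology', isiknowledge.com, Dec 19, 2013).
all_areas = {
    "1945-1995": 1470,
    "1995-2000": 1498,
    "2000-2005": 7901,
    "2005-2010": 24528,
    "2010-2013": 16891,
}
computer_science = {
    "1945-1995": 114,
    "1995-2000": 673,
    "2000-2005": 5095,
    "2005-2010": 14316,
    "2010-2013": 5971,
}

# Share of records in the 'Computer Science' research area, per period.
cs_share = {
    period: computer_science[period] / all_areas[period]
    for period in all_areas
}

for period, total in all_areas.items():
    print(f"{period}: {total:>6} total, "
          f"{computer_science[period]:>6} CS, "
          f"{cs_share[period]:.0%} CS share")

busiest_cs_period = max(computer_science, key=computer_science.get)
print("Period with most CS records:", busiest_cs_period)
```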

For the definition of the computer science word ‘ontology’, many people have cited the publications of T.R. Gruber (1993, 1995, see: http://dx.doi.org/10.1006/knac.1993.1008 and http://dx.doi.org/10.1006/ijhc.1995.1081): ‘An ontology is an explicit specification of a conceptualization’. The mid-1990s were the golden age for discussing the definition of ontology. N. Guarino (1997, see: http://dx.doi.org/10.1006/ijhc.1996.0091) wrote a nice review of the definition of ‘ontology’, in which I think one key point he discussed was the ‘shared conceptualization’ feature of an ontology. So in my PhD dissertation (Ma, 2011, see: http://www.itc.nl/library/papers_2011/phd/ma.pdf) I tried to re-address the definition of the computer science ‘ontology’: ‘Ontologies in computer science are defined as shared conceptualizations of domain knowledge (Gruber, 1995; Guarino, 1997b)…’

Third, after seeing the definition of ontology, let’s focus on how to put a computer science ‘ontology’ into practice, especially in the domain of Earth and space science informatics. The early 2000s were the golden age for that work. McGuinness (2003, see: http://www-ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-%28with-citation%29.htm) gave a wonderful discussion of the ontology spectrum. McGuinness also added a footnote to that spectrum figure: ‘This spectrum arose out of a conversation in preparation for an ontology panel at AAAI ’99. The panelists (Gruninger, Lehman, McGuinness, Ushold, and Welty), chosen because of their years of experience in ontologies found that they encountered many forms of specifications that different people termed ontologies. McGuinness refined the picture to the one included here.’ When I was doing my PhD I read this note and tried to find a few other publications by the panelists listed by McGuinness, and I did find a few that also discussed the ontology spectrum, for example:
Welty, C., 2002. Ontology-driven conceptual modeling. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (Eds.), Advanced Information Systems Engineering, Lecture Notes in Computer Science, vol. 2348. Springer-Verlag, Berlin & Heidelberg, Germany, pp. 3-3. Lecture slides available at: http://www.cs.toronto.edu/caise02/cwelty.pdf
Obrst, L., 2003. Ontologies for semantically interoperable systems. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, New Orleans, LA, USA, 366-369.
Uschold, M., Gruninger, M., 2004. Ontologies and semantics for seamless connectivity. SIGMOD Record 33 (4), 58–64.
Borgo, S., Guarino, N., Vieu, L., 2005. Formal ontology for semanticists. In: Lecture notes of the 17th European Summer School in Logic, Language and Information (ESSLLI 2005), Edinburgh, Scotland, 12pp. http://www.loa-cnr.it/Tutorials/ESSLLI1.pdf

An ontology spectrum (from McGuinness 2003)

To help myself understand the ontology spectrum better, I redrew the diagram (see below) in my PhD dissertation. Very recently (Dec 03, 2013) Jim McCusker, a PhD student with McGuinness, gave a thorough explanation of the spectrum in his blog (see: http://info.5amsolutions.com/blog/bid/154967/6-Points-Along-the-Ontology-Spectrum).

Ontology spectrum (adapted from Borgo et al., 2005; McGuinness, 2003; Obrst, 2003; Uschold and Gruninger, 2004; Welty, 2002). Text in italics explains a typical relationship in each ontology type (from Ma 2011)

Finally, I would like to share a few examples for different types of ontologies following the spectrum:

Catalog/Glossary:
Neuendorf, K.K.E., Mehl, J.J.P., Jackson, J.A., 2005. Glossary of Geology, 5th edition. American Geological Institute: Alexandria, VA, USA, p. 800. See latest version at: http://www.agiweb.org/pubs/glossary/

Taxonomy:
BGS Rock Classification Scheme, see: https://www.bgs.ac.uk/bgsrcs/

Thesaurus:
AQSIQ, 1988. GB/T 9649-1988 The Terminology Classification Codes of Geology and Mineral Resources. General Administration of Quality Supervision, Inspection and Quarantine of P.R. China (AQSIQ). Standards Press of China, Beijing, China. 1937 pp.

Conceptual Schema:
NADM Steering Committee, 2004. NADM Conceptual Model 1.0—A conceptual model for geologic map information: U.S. Geological Survey Open-File Report 2004-1334, North American Geologic Map Data Model (NADM) Steering Committee, Reston, VA, USA, 58 pp. See: http://pubs.usgs.gov/of/2004/1334

Ontologies encoded in RDF format:
Semantic Web for Earth and Environmental Terminology (SWEET). See: http://sweet.jpl.nasa.gov/

Now a short wrap-up of what ontology is:
For fun: the invisible hand behind anything;
In philosophy: (uncountable) the science or study of being; that branch of metaphysics concerned with the nature or essence of being or existence; (countable) a theory or conception relating to the nature of being;
In computer science: shared conceptualization of domain knowledge.

To put ontologies (computer science) into practice, keep in mind an ontology spectrum of increasingly rich meaning: catalog/glossary -> taxonomy -> thesaurus -> conceptual schema -> formal constraints.
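The ordering in the spectrum can be written down as a small ordered enumeration, which makes it possible to compare resources by where they sit on it. The Python sketch below does this for the examples listed above. The class and function names are my own for this illustration, and placing SWEET at the ‘formal constraints’ end is an assumption made for the sketch, since the example list only describes it as an ontology encoded in RDF.

```python
from enum import IntEnum


class SpectrumLevel(IntEnum):
    """Positions along the ontology spectrum, from least to most expressive."""
    CATALOG_GLOSSARY = 1
    TAXONOMY = 2
    THESAURUS = 3
    CONCEPTUAL_SCHEMA = 4
    FORMAL_CONSTRAINTS = 5


# Examples from the list above, mapped to a level on the spectrum.
examples = {
    "Glossary of Geology": SpectrumLevel.CATALOG_GLOSSARY,
    "BGS Rock Classification Scheme": SpectrumLevel.TAXONOMY,
    "GB/T 9649-1988": SpectrumLevel.THESAURUS,
    "NADM Conceptual Model 1.0": SpectrumLevel.CONCEPTUAL_SCHEMA,
    "SWEET": SpectrumLevel.FORMAL_CONSTRAINTS,  # assumed placement
}


def richer_than(a, b):
    """True if resource a sits further along the spectrum than resource b."""
    return examples[a] > examples[b]


ordered = sorted(examples, key=examples.get)
for name in ordered:
    print(f"{examples[name].value}. {examples[name].name:<20} {name}")

print(richer_than("SWEET", "GB/T 9649-1988"))
```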
