
Data and Semantics — Topics of Interest at ESIP 2015 Summer Meeting

July 27th, 2015

The ESIP 2015 Summer Meeting was held in Pacific Grove, CA, during the week of July 14-17. Pacific Grove is a beautiful place, with its coastline, sandy beaches, and sunsets. What excited me even more were the science and technical topics covered in the meeting sessions, as well as the opportunity to catch up with friends in the ESIP community. Excellent topics + a scenic place + friends = a wonderful meeting. Many thanks to the meeting organizers!

The theme of this summer meeting was “The Federation of Earth Science Information Partners & Community Resilience: Coming Together.” Though my focus was on sessions relevant to the Semantic Web and data stewardship, I was able to see the topic of ‘resilience’ in various presented works. It was nice to see that the ESIP community has an ontology portal. It is built on the BioPortal infrastructure and focuses on collecting ontologies and vocabularies in the field of Earth sciences. With more submissions from the community in the future, the portal has great potential for geo-semantics research, similar to what BioPortal does for bioinformatics. Another important topic was reviewing progress and discussing directions for the future. Prof. Peter Fox from RPI offered a short overview. The ESIP Semantic Web cluster is nine years old, and it is nice to see that the cluster has helped improve the visibility of Semantic Web methods and technologies in the broad field of geoinformatics. A key feature supporting the success of the Semantic Web is that it is an open world that keeps evolving and updating.

There were several topics or projects of interest that I recorded during the meeting:

(1) schema.org: It recently released version 2.0 and introduced a new mechanism for extension. There are now two types of extensions: reviewed/hosted extensions and external extensions. The former (say, e1) gets its own chunk of the schema.org namespace, e1.schema.org, and all items in that extension are created and maintained by their own creators. The latter allows a third party to create extensions specific to an application. Extensions for location and time might be a topic for the Earth science community in the near future.
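To make the extension discussion a bit more concrete, here is a minimal sketch of schema.org markup for a dataset with location and time, generated from Python. The property names come from the core schema.org vocabulary; the values are invented, and any Earth-science-specific terms would presumably live under an extension namespace (e.g., a hosted e1.schema.org) rather than in the core.

```python
# A minimal, hypothetical example of schema.org Dataset markup (JSON-LD)
# with spatial and temporal coverage; all values are invented.
import json

dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Monthly sea surface temperature, 2014",
    "description": "Gridded monthly SST fields (illustrative only).",
    "temporalCoverage": "2014-01-01/2014-12-31",  # ISO 8601 interval
    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            # bounding box: "south west north east" lat/lon corners
            "box": "36.0 -122.5 37.0 -121.5",
        },
    },
}
print(json.dumps(dataset, indent=2))
```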

(2) GCIS Ontology: GCIS is a nice project that incorporates several state-of-the-art Semantic Web methods and technologies. The provenance representation in GCIS means it is not just a static knowledge representation; it is more about what the facts are, what people believe, and why. In the ontology engineering for GCIS we also see the collaboration between geoscientists and computer scientists: the conceptual model came first, as a product that geoscientists can understand, before it was bound to logic and an ontology encoding grammar. The process can be seen as within the scope of semiology. We can do a good job with syntax and semantics, but very often we struggle with pragmatics.
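As a toy illustration of that workflow (not the actual GCIS ontology), the sketch below takes a conceptual statement a geoscientist could agree with – “a report has chapters” – and binds it to an OWL encoding using the rdflib Python library; all names and the namespace are invented.

```python
# A toy sketch, NOT the actual GCIS ontology: binding the conceptual
# statement "a Report has a Chapter" to an RDF/OWL encoding.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/report-model#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

g.add((EX.Report, RDF.type, OWL.Class))
g.add((EX.Chapter, RDF.type, OWL.Class))
g.add((EX.hasChapter, RDF.type, OWL.ObjectProperty))
g.add((EX.hasChapter, RDFS.domain, EX.Report))
g.add((EX.hasChapter, RDFS.range, EX.Chapter))

print(g.serialize(format="turtle"))  # rdflib >= 6 returns a string
```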

(3) PROV-ES: Provenance of scientific findings is receiving increasing attention, and the Earth science community has taken a lead in capturing it. The World Wide Web Consortium (W3C) PROV standard provides a foundation for the Earth science community to adopt and extend. The Provenance – Earth Science (PROV-ES) Working Group was initiated in 2013; it has primarily focused on extending the PROV standard and has tested the outputs with sample projects. In the PROV-ES hackathon at the summer meeting, Hook Hua and Gerald Manipon showed more technical details of PROV-ES, especially its encodings, discovery, and visualization.
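For readers new to PROV, the sketch below shows the basic flavor of a provenance record – an entity, the activity that generated it, and the source it used – using plain W3C PROV terms and invented identifiers; it does not cover the PROV-ES extensions themselves.

```python
# A minimal W3C PROV record built with rdflib; identifiers are invented.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

# A derived data product, the processing run that generated it,
# and the raw granule that the run used.
g.add((EX.sst_anomaly_map, RDF.type, PROV.Entity))
g.add((EX.regridding_run_42, RDF.type, PROV.Activity))
g.add((EX.raw_sst_granule, RDF.type, PROV.Entity))
g.add((EX.sst_anomaly_map, PROV.wasGeneratedBy, EX.regridding_run_42))
g.add((EX.regridding_run_42, PROV.used, EX.raw_sst_granule))

print(g.serialize(format="turtle"))
```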

(4) Entity linking: Jin Guang Zheng and I had a poster about our ESIP 2014 Testbed project. The topic is linking entity mentions in documents and datasets to entities in the Web of Data. Entity recognition and linking is valuable when working with datasets collected from multiple sources, and detecting and linking entity mentions in datasets can be facilitated by using knowledge bases on the Web, such as ontologies and vocabularies. In this work we built a web-based entity linking and wikification service for datasets. Our current demo system uses DBpedia as the knowledge base, and we have been collecting geoscience ontologies and vocabularies. A potential future collaboration is to use the ESIP ontology portal as the knowledge base. Discussion with colleagues during the poster session showed that this work may also benefit work on dark data, such as pattern recognition and knowledge discovery from legacy literature.
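Our demo is a custom service, but the general idea of linking text mentions to DBpedia entities can be sketched with the public DBpedia Spotlight web service; the endpoint, parameters, and response keys below follow Spotlight’s documented REST API, though they should be treated as assumptions to verify against the current documentation.

```python
# A hedged sketch of entity linking via the public DBpedia Spotlight API.
import requests

def link_entities(text, confidence=0.5):
    """Return (surface form, DBpedia URI) pairs found in the text."""
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])  # absent if nothing found
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

print(link_entities("Sea surface temperature observations from NOAA"))
```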

(5) Big Earth Data Initiative: This is an inter-agency coordination effort for geo-data interoperability in the US. I will quote part of the original session description, as it shows the relationships among a few entities and organizations that were mentioned: ‘The US Group on Earth Observations (USGEO) Data Management Working Group (DMWG) is an inter-agency body established under the auspices of the White House National Science and Technology Council (NSTC). DMWG members have been drafting an “Earth Observations Common Framework” (EOCF) with recommended approaches for supporting and improving discoverability, accessibility, and usability for federally held earth observation data. The recommendations will guide work done under the Big Earth Data Initiative (BEDI), which provided funding to some agencies for improving those data attributes.’ It will be nice to see more outputs from this effort and to compare the work with similar efforts in Europe, such as INSPIRE, as well as the global initiative GEOSS.


GYA, CODATA-ECDP and Open Science

June 7th, 2015

During May 25-29, 2015, the Global Young Academy (GYA) held its 5th International Conference for Young Scientists and its Annual General Meeting in Montebello, Quebec, Canada. I attended the public day of the conference on May 27 as a delegate of the CODATA Early Career Data Professionals Working Group (ECDP).

The GYA was founded in 2010, and its objective is to be the voice of young scientists around the world. Members are chosen for their demonstrated excellence in scientific achievement and commitment to service. Currently there are 200 members from 58 countries, representing all major world regions. Most GYA members attended the conference in Montebello, together with about 40 guests from other institutions, including Prof. Gordon McBean, president of the International Council for Science, and Prof. Howard Alper, former co-chair of IAP: the Global Network of Science Academies.

The GYA issued a position statement on Open Science in 2012, which calls for scientific results and data to be made freely available to scientists around the world, and advocates ways forward that will transform scientific research into a truly global endeavor. Dr. Sabina Leonelli from the University of Exeter, UK, is one of the lead authors of the position statement, and also a lead of the GYA Open Science Working Group. A major objective of my attendance at the GYA conference was to discuss future opportunities for collaboration between CODATA-ECDP and the GYA. Besides Sabina, I also met Dr. Abdullah Tariq, another lead of the GYA Open Science WG, and several other members of the GYA executive committee.

The discussion was fruitful. We discussed the possibility of an interest group on Global Open Science within CODATA; having a few members join both organizations; proposing sessions on the diverse conditions under which open data works around the world, perhaps for the next CODATA/RDA meeting in Paris or later meetings of that type; collaborating on business models for data centers; and reaching out to other organizations and working groups on open data and/or open science.

The GYA is an active group formed and organized by young people, and I was happy to see that Open Science is one of the four core activities the GYA is currently promoting. I recommend that ECDP and CODATA members see more details about the GYA on its website and propose future collaborations to promote topics of common interest in open data and open science.


Open Science in an Open World

December 21st, 2014

I began to think about a blog post on this topic after I read a few papers about open code and open data published in Nature and Nature Geoscience in November 2014. Later on I also noticed that the editorial office of Nature Geoscience had compiled a cluster of articles themed on Transparency in Science (http://www.nature.com/ngeo/focus/transparency-in-science/index.html), which created an excellent context for further discussion of Open Science.

A few weeks later I attended the American Geophysical Union (AGU) Fall Meeting in San Francisco, CA. It is a giant meeting, with more than 20,000 attendees. My personal focus was on the presentations, workshops, and social activities of the Earth and Space Science Informatics group. To summarize the seven-day meeting experience with a few keywords, I would choose: Data Rescue, Open Access, Gap between Geo and Info, Semantics, Community of Practice, Bottom-up, and Linking. Putting my AGU meeting experience together with my thoughts after reading the Nature and Nature Geoscience papers, it was time for me to finish this blog post.

Besides incentives for data sharing and the open source policies of scholarly journals, we can extend the discussion of software and data publication, reuse, citation, and attribution by shedding more light on both the technological and the social aspects of an environment for open science.

Open science can be considered a socio-technical system. One part of the system is a way to track where everything goes; another is a design of appropriate incentives. The emerging technological infrastructure for data publication adopts an approach analogous to paper publication and has been facilitated by community standards for dataset description and exchange, such as DataCite (http://www.datacite.org), Open Archives Initiative-Object Reuse and Exchange (http://www.openarchives.org/ore) and the Data Catalog Vocabulary (http://www.w3.org/TR/vocab-dcat). Software publication, in a simple way, may use a similar approach, which calls for community efforts on standards for code curation, description and exchange, such as Working towards Sustainable Software for Science (WSSSPE, http://wssspe.researchcomputing.org.uk). Simply minting Digital Object Identifiers for code in a repository makes software publication no different from data publication (see also: http://www.sciforge-project.org/2014/05/19/10-non-trivial-things-github-friends-can-do-for-science/). Attention is required for code quality, metadata, license, version and derivation, as well as metrics to evaluate the value and/or impact of a software publication.
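To illustrate what publishing software ‘like data’ might involve, the sketch below assembles a minimal metadata record whose fields are modeled loosely on the DataCite Metadata Schema, including the license, version, and derivation links mentioned above; the values and exact field spellings are illustrative rather than an official DataCite payload.

```python
# An illustrative metadata record for a software release, with fields
# modeled loosely on the DataCite Metadata Schema; values are invented.
import json

record = {
    "identifier": {"identifier": "10.1234/example.toolkit.v1-2",
                   "identifierType": "DOI"},
    "creators": [{"creatorName": "Doe, Jane"}],
    "titles": [{"title": "Example regridding toolkit"}],
    "publisher": "Example Software Repository",
    "publicationYear": "2014",
    "resourceType": {"resourceTypeGeneral": "Software"},
    "version": "1.2.0",
    "rightsList": [{"rights": "Apache License 2.0"}],  # license for reuse
    "relatedIdentifiers": [
        # derivation/version links, as discussed in the text above
        {"relatedIdentifier": "10.1234/example.toolkit.v1-1",
         "relationType": "IsNewVersionOf",
         "relatedIdentifierType": "DOI"},
    ],
}
print(json.dumps(record, indent=2))
```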

Metrics underpin the design of incentives for open science. An extended set of metrics – called altmetrics – was developed for evaluating research impact and has already been adopted by leading publishers such as Nature Publishing Group (http://www.nature.com/press_releases/article-metrics.html). Factors counted in altmetrics include how many times a publication has been viewed, discussed, saved and cited. It was very interesting to read news about funders’ attention to altmetrics (http://www.nature.com/news/funders-drawn-to-alternative-metrics-1.16524) on my flight back from the AGU meeting – in the 12/11/2014 issue of Nature, which I had picked up from the NPG booth in the AGU exhibit hall. For a software publication the metrics might also count how often the code is run, the use of code fragments, and derivations from the code. A software citation indexing service – similar to the Data Citation Index (http://wokinfo.com//products_tools/multidisciplinary/dci/) of Thomson Reuters – could be developed to track citations among software, datasets and literature and to facilitate software search and access.

Open science would help everyone – including the authors – but it can be laborious and boring to record all the fiddly details. Fortunately, fiddly details are what computers are good at. Advances in technology are enabling the categorization, identification and annotation of the various entities, processes and agents in research, as well as the linking and tracing among them. In our June 2014 Nature Climate Change article we discussed the issue of provenance in global change research (http://www.nature.com/nclimate/journal/v4/n6/full/nclimate2141.html). Such work on provenance capture and tracing further extends the scope of metrics development. Yet incorporating those metrics in incentive design requires the science community to find an appropriate way to use them in research assessment. One recent step forward is that the NSF renamed the Publications section of funding applicants’ biographical sketches to Products and allowed datasets and software to be listed (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). To fully establish the technological infrastructure and incentive metrics for open science, more community efforts are still needed.


Data Management – Serendipity in Academic Career

November 11th, 2014

A few days ago I began to think about a topic for a blog post, and the first thing that came to mind was ‘data management’, followed by a line from a Chinese poem, ‘无心插柳柳成荫’ (roughly, a willow planted with no intention grows into shade). I went to Google for an English translation of that line, and the result was ‘serendipitously’. Interesting – I had never seen that word before, and I had to use a dictionary to learn that ‘serendipity’ means an unintentional positive outcome, which expresses the meaning of the Chinese line quite well. So I regard data management as serendipity in my academic career: I was trained as a geoinformatics researcher through my education in China and the Netherlands, so how does it come about that most of my time is now spent on data management?

One clue I can see is that I have been working on ontologies, vocabularies and conceptual models for geoscience data services, which is relevant to data management. Another, more direct clue is a symposium, ‘Data Management in Research: A Challenging Issue’, organized on the University of Twente campus in spring 2011. Dr. David Rossiter, Ms. Marga Koelen, I and a few other ITC colleagues attended the event. The symposium highlighted both the technical and the social/cultural issues faced by 3TU.Datacentrum (http://datacentrum.3tu.nl/en/home/), a data repository for the three technological universities in the Netherlands. It is very interesting to see that several topics of my current work were already discussed in that symposium, though I paid almost no attention at the time because I was completely focused on my vocabulary work. Now that I am working on data management, I would like to introduce a few relevant concepts and the current social and technical trends.

Data management, in simple words, means what you do with your datasets during and after a research project. Conventionally, we treat papers as the ‘first-class’ product of research, and many scientists pay less attention to data management. This can lower the efficiency of research activities and hinder communication among research groups in different institutions. There is even a rumor that 80% of a scientist’s time is spent on data discovery, retrieval and assimilation, and only 20% on data analysis and scientific discovery. An ideal situation would reverse that allocation of time, but that requires efforts on both a technical infrastructure for data publication and a set of appropriate incentives for data authors.

After coming to the United States, the first data repository that caught my attention was the California Digital Library (CDL) (http://www.cdlib.org/), which is similar to the services offered by 3TU.Datacentrum. I like the CDL’s technical work not only because it provides a place for depositing datasets but also, and more importantly, because it provides a series of tools and services (http://www.cdlib.org/uc3/) that allow users to draft data management plans that address funding agency requirements, to mint unique and persistent identifiers for published datasets, and to improve the visibility of the published datasets. The term ‘data publication’ is derived from paper publication. By documenting metadata, minting unique identifiers (e.g., Digital Object Identifiers (DOIs)), and archiving copies of datasets in a repository, we can make a published dataset resemble a published paper. The identifier and metadata make the dataset citable, just like a published paper. A global initiative, DataCite, has been working on standards for dataset metadata schemas and identifiers, and it is increasingly endorsed by data repositories across the world, including both the CDL and 3TU.Datacentrum. A technological infrastructure for data publication is emerging, and people are now beginning to talk about the cultural change needed to treat data as a ‘first-class’ product of research.
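As a small illustration of why an identifier plus metadata makes a dataset citable, here is a sketch that renders a human-readable citation from minimal metadata; the layout loosely follows common guidance such as the ESIP data citation guidelines, and all values are invented.

```python
# A sketch: render a human-readable data citation from minimal metadata.
def format_data_citation(creators, year, title, version, repository, doi):
    return (f"{creators} ({year}). {title}, version {version}. "
            f"{repository}. doi:{doi}")

print(format_data_citation(
    creators="Doe, J., and R. Roe",
    year=2014,
    title="Global river discharge observations",
    version="2.0",
    repository="Example Data Repository",
    doi="10.1234/example.data",
))
```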

Though funding agencies already require data management plans in funding proposals – for example, the National Science Foundation in the US and Horizon 2020 in the EU (a Google search with the keyword ‘data management’ and the name of the funding agency will find the agency’s guidelines) – the science community still has a long way to go to give data publication the same attention as paper publication. Various community efforts have been made to promote data publication and citation. FORCE11 published the Joint Declaration of Data Citation Principles (https://www.force11.org/datacitation) in 2013 to promote the good research practice of citing datasets. Earlier, in 2012, the Federation of Earth Science Information Partners published Data Citation Guidelines for Data Providers and Archives (http://commons.esipfed.org/node/308), which offers more practical details on how a published dataset should be cited. In 2013, the Research Data Alliance (https://rd-alliance.org/) was launched to build the social and technical bridges that enable open sharing of data, which strengthens existing efforts, such as CODATA (http://www.codata.org/), to promote data management and sharing.


To promote data citation, a number of publishers have launched so-called data journals in recent years, such as Scientific Data (http://www.nature.com/sdata/) from Nature Publishing Group, the Geoscience Data Journal (http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%292049-6060) from Wiley, and Data in Brief (http://www.journals.elsevier.com/data-in-brief/) from Elsevier. Such a data journal often has a number of affiliated and certified data repositories. A data paper allows the authors to describe a dataset published in a repository. The data paper itself is a journal paper, so it is citable, and the dataset is also citable because it has associated metadata and an identifier in the data repository. This makes data citation flexible (and perhaps confusing): you can cite a dataset by citing the identifier of the associated data paper, the identifier of the dataset itself, or both. More interestingly, a paper can cite a dataset, a dataset can cite a dataset, and a dataset can also cite a paper (e.g., because the dataset may be derived from tables in the paper). The Data Citation Index (http://wokinfo.com/products_tools/multidisciplinary/dci/) launched by Thomson Reuters provides services to index the world’s leading data repositories, to connect datasets to related literature indexed in the Web of Science database, and to enable search of and access to data across subjects and regions.

Despite this huge progress on data publication and citation, we are not yet at the point of fully treating data as a ‘first-class’ product of research. A recent piece of good news is that, in 2013, the National Science Foundation renamed the Publications section of funding applicants’ biographical sketches to Products and allowed datasets and software to be listed there (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). However, this is still just a small step, and we hope more such incentives appear in academia. For instance, even though we have the Data Citation Index, are we ready to mix data citations and paper citations when generating the H-index of a scientist? And even if there were such an H-index, are we ready to use it in research assessment?

Data management involves many social and technological issues, which makes it quite different from the purely technical questions in geoinformatics research. It is enjoyable work, and as a next step I may spend more time on data analysis, for which I may introduce a few ideas in another blog post.


A layer cake of spatial data, in a jigsaw puzzle style

September 4th, 2014

During a lunch at the GeoData 2014 workshop (Boulder, CO, USA, June 2014), people sitting around the table began to chat about data sharing, data formats, and interoperability – all topics relevant to geoscience data – and indeed, inter-agency data interoperability was the central topic of that workshop. When someone raised the topic of comparing data sharing policies in the USA with those in Europe and China, a few people (those who know me) looked at me and began to smile. Yes, I am confident in saying that I have some comments on geoscience data sharing in Europe.

Before I came to the USA I spent about four and a half years in the Netherlands working on a PhD degree in geoscience data interoperability. Looking back, it seems very interesting, because I knew nothing about what was happening in data sharing in Europe before I headed to ITC. But the world is really a small circle. In the second year of my PhD study, I got in contact with a colleague in the Commission for the Management and Application of Geoscience Information of the International Union of Geological Sciences, who worked at the Geological Survey of the Netherlands in Utrecht. I visited him several times, and from him I also came to know about the giant data sharing initiative of the EU, the Infrastructure for Spatial Information in Europe (INSPIRE).

Initially, what attracted me were some technical details in INSPIRE, especially those surrounding the work on vocabulary modeling and web map services. INSPIRE covers 34 data themes, among which geology is my favorite because geological data was the topic of my PhD work at ITC. And I really appreciated the data specification working group of the Geology theme in INSPIRE, as colleagues in that group offered me so many fresh technical ideas. Then, in my fourth ITC year, when I began to prepare my PhD dissertation and defense, the guideline ‘Don’t get lost in details; look at the big picture’ inspired me to review INSPIRE from another angle and discuss my ideas with advisors and colleagues at ITC.

I forgot to mention that many such discussions happened during coffee or lunch breaks at ITC (well, there is no such culture in the USA). And then, one day, during such a coffee break chat, an image came to mind – a jigsaw puzzle layer cake – a nice analog for the INSPIRE initiative: the 34 data themes represent 34 layers, and the 27 EU nations (in 2011) represent 27 puzzle pieces. The data specifications and implementation rules of INSPIRE are the recipes for making the cakes, and the public agencies in the EU nations are the cooks.

A 'jigsaw puzzle layer cake view' of the EU INSPIRE initiative

This ‘cake’ view sounds like a jest, but I took it seriously, and I know that in GIScience people have long described spatial data as layer cakes. I drafted a manuscript describing my view immediately after that coffee break chat, but unexpectedly the short article was not published until four years later – actually, just one month before that lunch table chat at GeoData 2014 – and the EU now has 28 nations (Croatia joined in 2013). The article is accessible at http://onlinelibrary.wiley.com/doi/10.1002/2014EO190006/abstract.

The INSPIRE initiative is a combination of bottom-up and top-down approaches. The bottom-up approach is reflected in the drafting of the data specifications and the construction of the technical infrastructure, which represent the consensus of experts from the EU nations. The top-down approach is reflected in the formally issued EU directive for INSPIRE, which makes it a de jure initiative; that is, EU member nations are required to comply with the INSPIRE data specifications and implementation rules when building their national spatial data infrastructures.

The USA has a different administrative system compared with the EU. That, more or less, is also reflected in its geoscience data sharing policies and technologies. However, people here also build such data cakes. What can the USA take from the EU experience, and what suggestions can it offer based on its own work? I do not have a single answer now, but I hope I will have some comments a few years from now. Fortunately, similar to my encounter with the colleague at the Geological Survey of the Netherlands, I have now come to know colleagues at NASA, USGS, NOAA, EPA, USGCRP, and more, who are showing me the picture of geoscience data issues in the USA.
