Data Management – Serendipity in Academic Career
A few days ago I began to think about the topic for a blog and the first reflection in my mind was ‘data management’ and then a Chinese poem sentence ‘无心插柳柳成荫’ followed. I went to Google for an English translation of that sentence and the result was ‘Serendipitiously’. Interesting, I never saw that word before and I had to use a dictionary to find that ‘serendipity’ means unintentional positive outcomes, which expresses the meaning of that Chinese sentence quite well. So, I regard data management as serendipity in my academic career. I think that’s because I was trained as a geoinformatics researcher through my education in China and the Netherlands, how it comes that most of my current time is being spent on data management?
One clue I could see is that I have been working on ontologies, vocabularies and conceptual models for geoscience data services, which is relevant to data management. Another more relevant clue is a symposium ‘Data Management in Research: A Challenging Issue’ organized at University of Twente campus in 2011 spring. Dr. David Rossiter, Ms. Marga Koelen, I and a few other ITC colleagues attend the event. That symposium highlighted both technical and social/cultural issues faced by the 3TU.Datacentrum (http://datacentrum.3tu.nl/en/home/), a data repository for the three technological universities in the Netherlands. It is very interesting to see that several topics of my current work had already discussed in that symposium, whereas I paid almost no attention because I was completely focused on my vocabulary work at that time. Since now I am working on data management, I would like to introduce a few concepts relevant to it and the current social and technical trends.
Data management, in simple words, means what you will do with your datasets during and after a research. Conventionally, we treat paper as the ‘first class’ product of research and many scientists pay less attention to data management. This may lower the efficiency of research activities and hinder communications among research groups in different institutions. There is even a rumor that 80% of a scientist’s time is spent on data discovery, retrieval and assimilation, and only 20% of time is for data analysis and scientific discovery. An ideal situation is that reverse the allocation of time, but that requires efforts on both a technical infrastructure for data publication and a set of appropriate incentives to the data authors.
After coming to United States the first data repository caused my attention was the California Digital Library (CDL) (http://www.cdlib.org/), which is similar to the services offered by 3TU.Datacentrum. I like the technical architecture CDL work not only because they provide a place for depositing datasets but also, and more importantly, they provide a series of tools and services (http://www.cdlib.org/uc3/) to allow users to draft data manage plans to address funding agency requirements, to mint unique and persistent identifiers to published datasets, and to improve the visibility of the published datasets. The word data publication is derived from paper publication. By documenting metadata, minting unique identifiers (e.g., Digital Object Identifiers (DOIs)), and archiving copies of datasets into a repository, we can make a piece of published dataset similar to a piece of published paper. The identifier and metadata make the dataset citable, just like what we do with published papers. A global initiative, the DataCite, had been working on standards of metadata schema and identifier for datasets, and is increasing endorsed by data repositories across the word, including both CDL and 3TU.Datacentrum. A technological infrastructure for data publication is emerging, and now people begin to talk about the cultural change to treat data as ‘first class’ product of research.
Though funding agencies already require data management plans in funding proposals, such as the requirements of National Science Foundation in US and the Horizon 2020 in EU (A Google search with key word ‘data management’ and the name of the funding agency will help find the agency’s guidelines), The science community still has a long way to go to give data publication the same attention as what they do with paper publication. Various community efforts have been take to promote data publication and citation. The FORCE11 published the Joint Declaration of Data Citation Principles (https://www.force11.org/datacitation) in 2013 to promote good research practice of citing datasets. Earlier than that, in 2012, the Federation of Earth Science Information Partners published Data Citation Guidelines for Data Providers and Archives (http://commons.esipfed.org/node/308), which offers more practical details on how a piece of published dataset should be cited. In 2013, the Research Data Alliance (https://rd-alliance.org/) was launched to build the social and technical bridges that enable open sharing of data, which enhances existing efforts, such as CODATA (http://www.codata.org/), to promote data management and sharing.
To promote data citation, a number of publishers have launched so called data journals in recent years, such as Scientific Data (http://www.nature.com/sdata/) of Nature Publishing Group, Geoscience Data Journal (http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%292049-6060) of Wiley, and Data in Brief (http://www.journals.elsevier.com/data-in-brief/) of Elsevier. Such a data journal often has a number of affiliated and certified data repositories. A data paper allows the authors to describe a piece of dataset published in a repository. A data paper itself is a journal paper, so it is citable, and the dataset is also citable because there are associated metadata and identifier in the data repository. This makes data citation flexible (and perhaps confusing): you can cite a dataset by either citing the identifier of the associated data paper, or the identifier of the dataset itself, or both. More interestingly, a paper can cite a dataset, a dataset can cite a dataset, and a dataset can also cite paper (e.g., because the dataset may be derived from tables in a paper). The Data Citation Index (http://wokinfo.com/products_tools/multidisciplinary/dci/) launched by Thomson Reuters provides services to index the world’s leading data repositories, connect datasets to related literature indexed in the Web of Science database and to search and access data across subjects and regions.
Although there is such huge progress on data publication and citation, we are not yet there to fully treat data as ‘first class’ products of research. A recent good news is that, in 2013, the National Science Foundation renamed Publications section in biographical sketch of funding applicants as Products and allowed datasets and software to be listed there (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). However, this is still just a small step. We hope more similar incentives appear in academia. For instance, even we have the Data Citation Index, are we ready to mix the data citation and paper citation to generate the H-index of a scientist? And even there is such an H-index, are we ready to use it in research assessment?
Data management involves so many social and technological issues, which make it quite different from pure technical questions in geoinformatics research. This is an enjoyable work and in the next step I may spend more time on data analysis, for which I may introduce a few ideas in another blog.