DCO-DS participation at Research Data Alliance Plenary 5 meeting

April 30th, 2015

In early March I attended the Research Data Alliance Fifth Plenary and “Adoption Day” event to present our plans for adopting DataTypes and Persistent Identifier Types in the DCO Data Portal. This was the first plenary following the publication of the data type and persistent identifier type outputs, and the RDA community was interested in seeing how early adopters were faring.

At the Adoption Day event I gave a short presentation on our plan for representing DataTypes in the DCO Data Portal knowledge base. Most of the other adopter presentations were limited to organizational requirements or high-level architecture around data types or persistent identifiers – our presentation stood out because we presented details on ‘how’ we intended to implement RDA outputs rather than just ‘why’. I think our attention to technical details was appreciated; judging from the presentations, few other groups seemed to be very far into their adoption process.

My main takeaways from the conference were the following:
– we are ahead of the curve on adopting the RDA data type and persistent identifier outputs
– we are viewed as leaders on how to implement data types; people are paying attention to what we are doing
– the chair of the DataType WG was very happy that we were thinking about how data types made sense within the context of our existing infrastructure rather than looking to the WG’s reference implementation as the sole way to implement the output
– the DataType WG reference repository is more a proof-of-concept than a production system
– the data type community is interested in the topic of federating repositories but is not ready to do much on that yet

Overall I think we are well positioned to be a leader on data types. Our work to date was very well received, and many members of the DataType WG will be very interested in what more we have to show next September at the Sixth Plenary.

Good work team and let’s keep up the good work!

Author: Categories: Blog, tetherless world Tags: ,

Another AGU and we all get wet from the rain in San Fran…

January 10th, 2015

The 2014 Meeting of the American Geophysical Union in the wet city of San Francisco has not yet faded from memory. Unfortunately, it may be remembered as the “year of the RFID mess” rather than for the great science progress. However, let’s start with the positive. Rensselaer’s Tetherless World was well represented – see what we did at http://tw.rpi.edu/web/event/AGU/FM/2014/Participation – by Patrick, Stephan, Marshall, Evan and Paulo (representing others including Linyun and Han) in talks, posters covering both research and project progress, and the academic booth (go RPI!). This year, we presented in Informatics (IN) and Education (ED) sessions with talks and many posters. Just on a logistics note, I was very pleased to have the exhibit hall adjoined to one of the poster halls this year. This made moving between them, without missing one or the other, much easier. I hope that continues.

It was another excellent year for Informatics; I’ve misplaced the stats but suffice to say there were increasing numbers of abstracts, great student contributions and a sea of new faces. A continuing treat is the Leptoukh Lecture (honouring Greg L, whom I still miss very much). This year, Dr. Bryan Lawrence (working in the UK, but actually a Kiwi) gave a tour de force lecture on computation and data aspects of climate science. The attendance was excellent, clearly pulling in a wide cross-section of attendees from well beyond the IN folks. Thanks Bryan.

This year also saw the changeover in Informatics leadership, with Kerstin Lehnert taking over from Michael Piasecki as President – thanks Michael for your leadership and efforts over the last two years. Ruth Duerr (NSIDC) came in as President-Elect and Anne Wilson (CU/LASP) as secretary. Diversity rules in Informatics!!!

In regard to IN poster sessions, we saw an increase in the flash mob approach. What is that, you ask? It is where, at an appointed time during the poster session, the session convener arranges for all poster presenters to be present. Having advertised beforehand by Twitter, email and general coercion, they gather poster attendees around each poster (in order, down the row). The presenter has 5 minutes to present their poster and then the mob moves on. It has proven to be a very effective way of engaging both attendees and presenters. If the session organiser has pre-planned it, the sequencing can also be very effective. After every poster has been presented, many attendees stay to quiz the presenters of the specific posters they were interested in. The one aspect that makes this style hard is the general noise level in the poster hall. Poster presenters need to “speak up” and project their voice: not all are prepared for that but it is very good practice!

I am author / co-author on quite a few presentations each year. This year I had two posters (both invited) as lead. You can see them via the link above. “Sixth generation of data and information architectures” and “Anatomy and Physiology of Data Science” drew quite a lot of interest. But I must say, I did enjoy getting to stand with Mark Parsons at our poster “Why Data Citation Misses the Point” (I will add that to the website) and elaborate on our premise. Interestingly, we had a lot of agreement with the work – we’d hoped to provoke arguments (!! as usual !!). Now to find time to write that up.

I want to acknowledge the excellent presentation of other works I was co-author on. The TWCers noted above are indeed skilled and knowledgeable researchers and practitioners. I know that but it is always excellent to have peers approach me to tell me that and how impressed they are with both the work and the people!

And the RFID issue – just go here and see for yourselves: http://petitions.moveon.org/sign/say-no-to-rfid-tracking.fb47

See all of you next December.


Author: Categories: Data Science, Semantic Web, tetherless world Tags:

American Geophysical Union Informatics

December 22nd, 2014

A held post from last year – just releasing it….

I’ve lost count of how many AGU meetings I’ve been to, except for knowing that this was my 11th consecutive year at the Fall Meeting in recent times. I am thinking about this since I also received my 25th year AGU pin this year. Ouch. To say there was a lot going on at AGU is like saying it gets busy around the shops during the holidays. So, it was an average year for me in terms of length of day and tiredness, etc.

Each year, I have at least one stand out memory. This year it began with the number of colleagues from solar and space physics that I bumped into (and remembered and they remembered me) and had very interesting and relevant conversations with (about software, and data, and science). Next in line was Simon Cox’s Leptoukh lecture – an excellent tour de force to demonstrate what taking a few steps back and conceiving a core observations and measurement model can do to impact a significant number of application fields. Well done mate.

The RPI Tetherless World contributions were (again) very strong – to see what I mean – take a look at: http://tw.rpi.edu/web/event/AguFallMeeting2013. My appreciation goes to Patrick, Marshall, Yu, Linyun, Evan and Massimo for your efforts – the booth, the posters, the talks – (and Jin, Han, John, and others left back on the ranch) – all provided an excellent showcase of our (RPI/TWC) collective work.

Now, on to the science – informatics to be specific, in all its discipline-specific forms – the Special Focus Group ESSI (essi.agu.org) is thriving with increasingly diverse participants and new faces appearing each year. As far as topics go, I’ll spare you all a word cloud but “Data” was the word. The other word was, well, “Big” – in the sense that a number of sessions succumbed to Big Data (or at least the Era of Big Data, one phrase I prefer)… and in more than just informatics: Union, Education, Global Environmental Change, … Thus, I’m okay with that.

With a meeting that big, science highlights are hard to capture in a shortish blog. Having been around long enough, it is normal (or even required) for me to be critical of certain aspects of the meeting’s logistics/organization as they affect people and the efficacy of the scientific exchange itself. I have shared those concerns, as well as the positive aspects, with the appropriate people/committees. If any of you wish to pass your observations (positive and otherwise) to me, I will pass them on. One thing I cannot let pass is the new AGU data policy that was approved and pre-released during the meeting. I am sure that there will be some noise about this in the days to come.


Open Science in an Open World

December 21st, 2014

I began to think about a blog for this topic after I read a few papers about Open Codes and Open Data published in Nature and Nature Geoscience in November 2014. Later on I also noticed that the editorial office of Nature Geoscience made a cluster of articles themed on Transparency in Science (http://www.nature.com/ngeo/focus/transparency-in-science/index.html), which really created an excellent context for further discussion of Open Science.

A few weeks later I attended the American Geophysical Union (AGU) Fall Meeting in San Francisco, CA. As usual, it was a giant meeting with more than 20,000 attendees. My personal focus was on presentations, workshops and social activities in the Earth and Space Science Informatics group. To summarize the seven-day meeting experience in a few keywords, I would choose: Data Rescue, Open Access, Gap between Geo and Info, Semantics, Community of Practice, Bottom-up, and Linking. Putting my AGU meeting experience together with my thoughts after reading the Nature and Nature Geoscience papers, it is now time for me to finish this blog post.

Besides incentives for data sharing and open source policies of scholarly journals, we can extend the discussion of software and data publication, reuse, citation and attribution by shedding more light on both technological and social aspects of an environment for open science.

Open science can be considered a socio-technical system. One part of the system is a way to track where everything goes and another is a design of appropriate incentives. The emerging technological infrastructure for data publication adopts an approach analogous to paper publication and has been facilitated by community standards for dataset description and exchange, such as DataCite (http://www.datacite.org), Open Archives Initiative-Object Reuse and Exchange (http://www.openarchives.org/ore) and the Data Catalog Vocabulary (http://www.w3.org/TR/vocab-dcat). Software publication may, put simply, use a similar approach, which calls for community efforts on standards for code curation, description and exchange, such as the Working towards Sustainable Software for Science effort (http://wssspe.researchcomputing.org.uk). Simply minting Digital Object Identifiers for code in a repository makes software publication no different from data publication (see also: http://www.sciforge-project.org/2014/05/19/10-non-trivial-things-github-friends-can-do-for-science/). Attention is also required to code quality, metadata, license, version and derivation, as well as to metrics for evaluating the value and/or impact of a software publication.

Metrics underpin the design of incentives for open science. An extended set of metrics – called altmetrics – was developed for evaluating research impact and has already been adopted by leading publishers such as Nature Publishing Group (http://www.nature.com/press_releases/article-metrics.html). Factors counted in altmetrics include how many times a publication has been viewed, discussed, saved and cited. It was very interesting to read some news about funders’ attention to altmetrics (http://www.nature.com/news/funders-drawn-to-alternative-metrics-1.16524) on my flight back from the AGU meeting – in the 12/11/2014 issue of Nature, which I picked up from the NPG booth in the AGU meeting exhibition hall. For a software publication the metrics might also count how often the code is run, the use of code fragments, and derivations from the code. A software citation indexing service – similar to the Data Citation Index (http://wokinfo.com//products_tools/multidisciplinary/dci/) of Thomson Reuters – could be developed to track citations among software, datasets and literature and to facilitate software search and access.

Open science would help everyone – including the authors – but it can be laborious and boring to give all the fiddly details. Fortunately, fiddly details are what computers are good at. Advances in technology are enabling the categorization, identification and annotation of various entities, processes and agents in research, as well as the linking and tracing among them. In our June 2014 Nature Climate Change article we discussed the issue of provenance of global change research (http://www.nature.com/nclimate/journal/v4/n6/full/nclimate2141.html). Such work on provenance capture and tracing further extends the scope of metrics development. Yet incorporating those metrics in incentive design requires the science community to find an appropriate way to use them in research assessment. One recent sign of progress is that NSF renamed the Publications section of the biographical sketch of funding applicants to Products and allowed datasets and software to be listed (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). To fully establish the technological infrastructure and incentive metrics for open science, more community efforts are still needed.


Data Management – Serendipity in Academic Career

November 11th, 2014

A few days ago I began to think about a topic for a blog post. The first thing that came to mind was ‘data management’, followed by a line from a Chinese poem, ‘无心插柳柳成荫’ (literally, ‘a willow planted without intention grows to give shade’). I went to Google for an English translation of that line and the result was ‘Serendipitously’. Interestingly, I had never seen that word before and had to use a dictionary to find that ‘serendipity’ means an unintentional positive outcome, which expresses the meaning of the Chinese line quite well. So, I regard data management as serendipity in my academic career. I say that because I was trained as a geoinformatics researcher through my education in China and the Netherlands, so how did it come about that most of my current time is spent on data management?

One clue I can see is that I have been working on ontologies, vocabularies and conceptual models for geoscience data services, which is relevant to data management. Another, more relevant clue is a symposium, ‘Data Management in Research: A Challenging Issue’, organized on the University of Twente campus in spring 2011. Dr. David Rossiter, Ms. Marga Koelen, a few other ITC colleagues and I attended the event. That symposium highlighted both technical and social/cultural issues faced by the 3TU.Datacentrum (http://datacentrum.3tu.nl/en/home/), a data repository for the three technological universities in the Netherlands. It is very interesting to see that several topics of my current work had already been discussed in that symposium, whereas I paid almost no attention because I was completely focused on my vocabulary work at that time. Since I am now working on data management, I would like to introduce a few concepts relevant to it and the current social and technical trends.

Data management, in simple words, means what you will do with your datasets during and after a research project. Conventionally, we treat papers as the ‘first class’ product of research and many scientists pay less attention to data management. This may lower the efficiency of research activities and hinder communication among research groups in different institutions. There is even a rumor that 80% of a scientist’s time is spent on data discovery, retrieval and assimilation, and only 20% on data analysis and scientific discovery. Ideally that allocation of time would be reversed, but that requires efforts on both a technical infrastructure for data publication and a set of appropriate incentives for the data authors.

After coming to the United States, the first data repository that caught my attention was the California Digital Library (CDL) (http://www.cdlib.org/), which offers services similar to those of 3TU.Datacentrum. I like CDL’s technical architecture because they not only provide a place for depositing datasets but also, and more importantly, provide a series of tools and services (http://www.cdlib.org/uc3/) that allow users to draft data management plans to address funding agency requirements, to mint unique and persistent identifiers for published datasets, and to improve the visibility of the published datasets. The term data publication is derived from paper publication. By documenting metadata, minting unique identifiers (e.g., Digital Object Identifiers (DOIs)), and archiving copies of datasets in a repository, we can make a published dataset similar to a published paper. The identifier and metadata make the dataset citable, just as with published papers. A global initiative, DataCite, has been working on standards for metadata schemas and identifiers for datasets, and is increasingly endorsed by data repositories across the world, including both CDL and 3TU.Datacentrum. A technological infrastructure for data publication is emerging, and people are now beginning to talk about the cultural change needed to treat data as a ‘first class’ product of research.
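
To make the idea of a citable dataset a little more concrete, here is a minimal sketch in Python of turning a metadata record into a citation string. It assumes a simplified record with only creators, year, title, publisher and DOI, loosely following the Creator (PublicationYear). Title. Publisher. Identifier pattern that DataCite recommends. The class name, function name, repository name and DOI below are all made up for illustration; none of them come from CDL or 3TU.Datacentrum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical, simplified metadata record for a published dataset."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Build a human-readable citation from the metadata fields."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}). {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Illustrative values only: this DOI and repository do not exist.
example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2014,
    title="Example Stream Gauge Readings",
    publisher="Example Data Repository",
    doi="10.1234/example.5678",
)

print(format_citation(example))
```

The point of the sketch is simply that once the metadata and identifier exist, producing a citation for a dataset is as mechanical as producing one for a paper.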

Funding agencies already require data management plans in funding proposals, for example the National Science Foundation in the US and Horizon 2020 in the EU (a Google search with the keyword ‘data management’ and the name of the funding agency will help find the agency’s guidelines). Yet the science community still has a long way to go before it gives data publication the same attention as paper publication. Various community efforts have been undertaken to promote data publication and citation. FORCE11 published the Joint Declaration of Data Citation Principles (https://www.force11.org/datacitation) in 2013 to promote the good research practice of citing datasets. Earlier than that, in 2012, the Federation of Earth Science Information Partners published Data Citation Guidelines for Data Providers and Archives (http://commons.esipfed.org/node/308), which offers more practical details on how a published dataset should be cited. In 2013, the Research Data Alliance (https://rd-alliance.org/) was launched to build the social and technical bridges that enable open sharing of data, complementing existing efforts, such as CODATA (http://www.codata.org/), to promote data management and sharing.


To promote data citation, a number of publishers have launched so-called data journals in recent years, such as Scientific Data (http://www.nature.com/sdata/) of Nature Publishing Group, Geoscience Data Journal (http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%292049-6060) of Wiley, and Data in Brief (http://www.journals.elsevier.com/data-in-brief/) of Elsevier. Such a data journal often has a number of affiliated and certified data repositories. A data paper allows the authors to describe a dataset published in a repository. A data paper is itself a journal paper, so it is citable, and the dataset is also citable because it has associated metadata and an identifier in the data repository. This makes data citation flexible (and perhaps confusing): you can cite a dataset by citing the identifier of the associated data paper, the identifier of the dataset itself, or both. More interestingly, a paper can cite a dataset, a dataset can cite a dataset, and a dataset can also cite a paper (e.g., because the dataset may be derived from tables in a paper). The Data Citation Index (http://wokinfo.com/products_tools/multidisciplinary/dci/) launched by Thomson Reuters provides services to index the world’s leading data repositories, to connect datasets to related literature indexed in the Web of Science database, and to search and access data across subjects and regions.
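
The flexibility described above (citing a dataset through its data paper, through its own identifier, or both) has a practical consequence for anyone counting citations. The following is a rough sketch in Python, not a description of how the Data Citation Index actually works: it assumes we hold a list of (citing, cited) identifier pairs and a lookup table from data paper identifiers to the datasets they describe. All identifiers and names here are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical lookup: data paper identifier -> identifier of the dataset it describes.
DATA_PAPER_TO_DATASET: Dict[str, str] = {
    "10.1234/datapaper.1": "10.1234/dataset.1",
}


def count_citations(
    citations: List[Tuple[str, str]],
    aliases: Dict[str, str],
) -> Counter:
    """Count citations per cited item, crediting a dataset for citations of its data paper."""
    counts: Counter = Counter()
    for _citing, cited in citations:
        # Resolve a data paper identifier to its dataset; leave other identifiers as they are.
        target = aliases.get(cited, cited)
        counts[target] += 1
    return counts


# Made-up links covering the cases in the text: paper -> dataset,
# paper -> data paper, dataset -> dataset, and dataset -> paper.
links = [
    ("10.1234/paper.A", "10.1234/dataset.1"),
    ("10.1234/paper.B", "10.1234/datapaper.1"),
    ("10.1234/dataset.2", "10.1234/dataset.1"),
    ("10.1234/dataset.1", "10.1234/paper.C"),
]

totals = count_citations(links, DATA_PAPER_TO_DATASET)
print(totals["10.1234/dataset.1"], totals["10.1234/paper.C"])
```

Whether a citation of the data paper should really be credited to the dataset is a policy choice rather than a technical one; the sketch just shows that the choice has to be made explicitly somewhere.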

Although there has been huge progress on data publication and citation, we are not yet at the point of fully treating data as a ‘first class’ product of research. One recent piece of good news is that, in 2013, the National Science Foundation renamed the Publications section of the biographical sketch of funding applicants to Products and allowed datasets and software to be listed there (http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp). However, this is still just a small step. We hope more incentives like this appear in academia. For instance, even though we have the Data Citation Index, are we ready to mix data citations and paper citations to generate the H-index of a scientist? And even if there were such an H-index, are we ready to use it in research assessment?
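
As a small worked example of the question above, here is a sketch in Python of how an H-index changes when dataset citations are mixed in with paper citations. The H-index is the largest number h such that h of a scientist’s products each have at least h citations. The citation counts are invented for illustration, and treating datasets exactly like papers is an assumption of this sketch, not an established practice.

```python
from typing import Iterable


def h_index(citation_counts: Iterable[int]) -> int:
    """Return the largest h such that h items each have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Invented citation counts for one hypothetical scientist.
paper_citations = [12, 7, 3, 1]
dataset_citations = [9, 4]

papers_only = h_index(paper_citations)
combined = h_index(paper_citations + dataset_citations)

print(papers_only, combined)  # prints: 3 4
```

In this made-up case, two reasonably well cited datasets lift the index from 3 to 4, which is exactly the kind of effect a research assessment would have to decide whether to accept.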

Data management involves many social and technological issues, which makes it quite different from the purely technical questions in geoinformatics research. It is enjoyable work. As a next step I may spend more time on data analysis, and I may introduce a few ideas about that in another blog post.
