
Why the term ‘data publication’?

December 14th, 2010

Over the last 6 months I have been present in at least 10 distinct discussions around topics such as data publication, data citation and data attribution. At first I was engaged in the topics, but very quickly I kept pausing and asking myself: what's the use case (duh!)? What I was hearing was coming from 'data people' (yes, I am one of them). What I wanted to hear was: "I want to be cited for the datasets I spend a lot of time and intellectual effort collecting, calibrating and analyzing", or "… really I want to get credit for that as much as the one or two publications I might get". I've heard this; in fact I've said it myself many times. So what's the problem? Well, when a researcher wants credit and citation for a piece of work, they prepare and publish a paper, yes, a body of intellectual work. Our communities and disciplines have spent many centuries developing this approach. So, if what I really want is credit and citation for my data, why do I need to publish it? At present, many people are getting such credit, but in an informal way, such as narrative-level acknowledgement in the text of the paper, not a formal one (Parsons, Duerr and Minster 2010 EOS). That's as good as no acknowledgement unless someone sees it and records it somewhere. The mechanism for paper citation is now well established: I cite your paper in my paper, and your citation count increases and gets reported. If you are up for promotion or tenure or review and that count is taken into account, you get credit. It's the identification of the artifact that counts, not the fact that it is published. In short, the capability that is needed is: a way to identify your data contribution and a way to record it (and thus count it). Identification and reference, that's it. Now, I am not writing about 'publication data publication', i.e. the data that is the foundation for figures, tables, and other descriptions in a published paper. I am all for that data being made available as a part of the publication. That is another story. Here I am addressing just regular data (collections/sets).

For now I am suggesting that there are other models to make data available to start with, and one of them is the software release cycle/process. Alpha, pre-beta, beta, release candidate, release, revision, documentation, feedback, bug fixes… this is much closer to the process for data that I know of. Now, this may not be the right approach, but I think we should explore it, and others. I'm no longer in favour of just adopting a model (marriage) of convenience (publishing). We are savvy enough to take a step back and implement a model that meets the needs of the data scientists who deserve it most. Yes, there's more to be said. Tag, you're it.
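To make the analogy concrete, here is a minimal sketch (in Python, purely illustrative, with hypothetical names and a made-up identifier scheme) of what a dataset moving through a software-style release cycle might look like, with the two capabilities the post argues for: identification and reference.

```python
from dataclasses import dataclass

# Hypothetical release stages, borrowed from the software release cycle.
STAGES = ["alpha", "pre-beta", "beta", "release-candidate", "release", "revision"]

@dataclass
class DatasetRelease:
    """A citable dataset release: an identifier, a version, and a stage."""
    identifier: str   # e.g. a DOI-like ID (illustrative, not a real DOI)
    version: str      # dataset version, semver-style
    stage: str = "alpha"

    def promote(self) -> str:
        """Move the dataset to the next stage in the cycle, if any remain."""
        i = STAGES.index(self.stage)
        if i < len(STAGES) - 1:
            self.stage = STAGES[i + 1]
        return self.stage

    def citation(self) -> str:
        """A minimal citation string: identify and reference, that's it."""
        return f"{self.identifier} v{self.version} ({self.stage})"

ds = DatasetRelease(identifier="example:10.9999/demo-data", version="1.0.0")
ds.promote()          # alpha -> pre-beta
print(ds.citation())  # example:10.9999/demo-data v1.0.0 (pre-beta)
```

The point of the sketch is not the class itself but the separation of concerns: the identifier stays stable across stages, so a citation can reference the data at any point in its life cycle, not just after a "publication" event.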

  1. December 17th, 2010 at 02:49 | #1

    And that would explain why the ESIP stewardship cluster has been working on Identifier standards for Earth Science data!

  2. Mark
    July 21st, 2011 at 12:50 | #2

    I think the software release model is good for capturing data set versions, which need to be tracked well in conjunction with any identifier/locator/reference scheme. I'm not so sure the s/w model fully captures the archival requirements implied by "publication". Then there's "peer review," which is implied by publication whether the review happened or not. Software is tested and audited, not peer-reviewed; is that sufficient for data? I suppose open source s/w could be seen to be peer-reviewed in a way. A similar open annotation approach could be helpful for data.
