Corner cases and the PLOS data policy


God, people, this is not that hard.. If you are working in academia, in the US, you are very likely to be funded by taxpayer money. The data you produce is enabled by taxpayers – all (most) of us.. So why is it such a hard friggin’ concept that you should be required to share your data freely, upon publication? Had you asked most scientists this question – ‘Should you release data on publication?’ – most would agree. Unfortunately we live in a fucking world where talk is cheap, and when somebody actually tries to enforce this tenet, even if the tenet is not water-tight, everybody backs down.. everybody thinks of reasons why they could not possibly ever do it..

This is basically what happened.. Today (or was it yesterday) PLOS released a policy statement that essentially requires people to deposit data – read the post here. Now you would have thought that the viking apocalypse itself had occurred by the magnitude of the shit-storm response, including a myopic blog post by DrugMonkey and a laughable article in ‘The Scientist’. I get it that there are issues, especially with the sharing of massive amounts of data, but look – these are corner cases. Look at the last issue of your favorite journal.. How many papers contain data that could not be shared on Dryad/Figshare/NCBI/SRA or whatever? The number is VERY small.. That is not to say that there are no exceptions, but again, these are corner cases..

Another ‘major’ objection is the definition of ‘raw data’, which is what is to be released.. Again, these are corner cases.. What proportion of PLOS papers critically depend on 2839642983 hours of high-res video, or some other more obscure data type? A few probably, but not many. There is a grey area here – I see that.. Do I submit raw output from fancy machine X, or its slightly more useful compiled format? Whatever, people.. whatever.. Do what you’ve always done, and see where it takes you.. How about this: make an honest effort to make the data accessible and useful to others, and chances are you’re probably good to go. Many people do this currently, and for them this policy change should be no problem..

There are other objections – one type is the ‘my raw data are so damn special that nobody can ever make sense of them’, while another is ‘I use special software and stuff, so they are probably not useful to anybody else’. I call BS on both of these arguments. Maybe you have the world’s most complicated data, but why not release them and not worry about whether or not people find them useful – that is not your concern (though it should be). Remember, the policy is not ‘make the data available so that everybody can use it easily and with minimal effort’, but instead to ‘share’. This does require extra effort, but data curation is part of the job.

Look, change is hard, and there will be challenges in implementing this policy. PLOS is clearly trying to blaze their own trails, something which they have done previously with great success. Do I think this is a perfect policy? Absolutely not. Do I think it’s better than what we have now? Yes. How many of us have been hindered by ‘data not available’? I know I have, and I don’t think I’m unique in that regard.

So, I challenge you, dear reader: go to any of the PLOS journals and look at the last month of publications.. How many of these contain data so super-special or large that they could not be posted? Who knows, maybe I’m wrong.

  • Ian Dworkin
    • Matt MacManes

      nice post!

  • CAYdenberg

    Thank you. I wrote almost this exact same thing on DrugMonkey yesterday.

    Some of the criticism seems to stem from the idea that curating data in a format that “anyone” can understand is too hard. I can understand this to an extent: there are lots of different types of data and no “standard” format for lots of them. My view is that if data is exposed the standards will develop organically as a conversation between authors, the journal, and data users. But if data continues to remain hidden then the development of standards will never happen.

    • Matt MacManes

      excellent point!

  • Tim Vines

    Thanks for writing this – I completely agree. For what it’s worth, Molecular Ecology has been operating a draconian data sharing policy for a few years now, and as far as we can tell, the sky has not fallen.

    • Matt MacManes

      Thanks Tim, I really appreciate your comment. Just like we look at open access with an ‘obviously that’s the right thing to do’ mentality, I hope soon we will feel the same about data sharing! The implementation of these policies is sometimes not easy, but it will be worked out eventually.

      • Tim Vines

        We’re going to look back in twenty years and laugh at this. The data are an absolutely integral part of the paper and not having them publicly available makes about as much sense as saying “The Tables and Figures are available from the authors upon request”.

  • Where did people leave their brains? Science generally requires data as proof. How could you disagree with making that *proving* data public, given modern web technology?

  • Margolis Lab

    Not corner cases. We have to record about 500 GB per mouse per session. 12 sessions per mouse, 10 mice per group, we’re talking 60 TB of data (high speed video from multiple cameras, ephys, etc.). Those are conservative numbers from one study only. Systems neuroscience labs do this routinely and things are going more and more in this direction with emphasis on looking at correlation between brain activity and behavior. Raw data of this sort is not feasible to host / store. Some intermediate form of data would have to suffice otherwise a definite #PLOSfail

    • How exactly is this not a corner case? What fraction of PLoS papers are “systems neuroscience”?

    • aeonsim

      Those numbers are big but they hardly make it a uniquely large problem. They’re within the range of numerous groups working in genomics with next-generation datasets. Sure, you’ll need to sit down and talk to PLOS or someone else, but that doesn’t make it impossible. Annoying perhaps, but it’s good science, and only a small percentage of the groups working in any field are going to have such large datasets.

    • I would suggest that only a handful of people might want “raw video” but it would be both reasonable and desirable to report data in a form more detailed than the summary means and variances that are often provided in tables.

  • There are issues of course, particularly with anonymising human data. But the library/information sciences/documentation world has long been at work on this: the UK Data Archive has produced best practice guidelines, and so on.
    So philosophically, it has to be asked, “If you don’t want to share data, can you really, in the 21st century, continue to describe yourself as a ‘scientist’?”
    More here: http://juliusbeezer.blogspot.fr/2013/02/lecture-notes-on-internet-method-n1743.html


  • I totally agree with what you are saying, and publisher and funder mandates are the things that need to push this. It still amazes me when people can publish stuff like microarrays in some journals without putting that data in GEO. I’m sure there were times when people thought putting that data in a public repository was a ridiculously excessive burden on research.

    I was at a talk a couple of years ago with someone who was presenting a great, interesting data set that they had published in Science. Someone asked how they could check what was going on with their gene, and they said they wouldn’t consider putting their data out there, since it would be “too easy” for other people to mine that data. That data, which had already been looked over and analysed in Science. This kind of behavior leads to a balkanization of science, where you can only get the information if you are part of a select clique.

