I’m trying (still) to get my manuscript on transcriptome assembly best practices published. I’m sorry to say that it is not going that well. It seems that there is a disconnect between editors (who mostly hate it) and people actually assembling transcriptomes (who find it very valuable). How to reconcile these differences? Here is the latest rejection, from an unnamed editor who ‘reviewed’ the manuscript and rejected it.
While manuscripts on best practices are generally able to be considered for publication, this manuscript is too small of an advance to be considered further at BMC Genomics. Your recommendations are either trivial (“sequence 1 or more tissues”), obvious (“remove adapters”) or flawed (suggesting 20-40 million reads as adequate, unqualified by considerations of sequencing technology, transcriptome complexity or read length, is problematic). I am sure you found it helpful for the three data sets you used, anecdotally, but to consider guidelines useful for a wider audience you would have to test them on a far larger number and probably consider other assemblers as well.
I’d like to take this opportunity to break down his comments.
- …recommendations are trivial (“sequence 1 or more tissues”)
- The actual recommendation is this: “Summary Statement: Sequence 1 or more tissues from 1 individual…” This is an important recommendation! Said more verbosely: assemble 1 individual, but as many tissues as are required by the experimental design. People often try to make assemblies from between 1 and n individuals sampled from a population. This causes all sorts of problems with de Bruijn graph based assemblers, centered on polymorphism: every heterozygous site between individuals adds alternative paths to the graph (a toy sketch after this list illustrates this). The recommendation is not trivial – multi-individual assemblies are very common in the published literature. Also, I have data to support the hypothesis that sequencing 1 individual results in a cleaner assembly than does sequencing, for instance, 10 (Table 1).
- …recommendations are obvious (“remove adapters”)
- The actual recommendation: “Visualize your read data. Error correct reads using bfc for low to moderately sized datasets and RCorrector for higher coverage datasets. Remove adapters, and employ gentle quality filtering using PHRED < 2 as a threshold.” The point here is to make a set of recommendations that people can follow step by step (a pipeline sketch after this list walks through them). Adapter trimming is obviously an important step. I was not claiming this to be groundbreaking, only complete. Write a protocol that is not explicit with regard to each step and see what happens – I dare you.
- …recommendations are flawed (suggesting 20-40 million reads as adequate, unqualified by considerations of sequencing technology, transcriptome complexity or read length)
- Actual rec: “Summary Statement: Sequence 1 or more tissues from 1 individual to a depth of between 20 million and 40 million 100bp or longer paired-end reads”
- Fair point about sample size – I can demonstrate this effect in more datasets, but to call this anecdote is highly offensive. These are randomly chosen datasets, assembled in many different ways. Ask for more replication, but don’t call it anecdote. I would note that each added dataset represents hundreds of hours of work over 10+ assemblies, so adding datasets is not trivial. So, how many datasets would be appropriate – 10 more? 100 more? Dear editor, how many additional assemblies should I do?
- Actually, the rec. is qualified. I say in this section that it relates to Illumina only, and I specify 100bp paired-end reads and vertebrate transcriptomes. So, I qualify it in exactly the way the editor claims I did not. I have to wonder: did he even read the paper? Could the rec. be different for 250bp reads or 36bp reads? Maybe. But given how the work scales, it’s not feasible to test every combination (the subsampling sketch after this list shows how one such test is set up). Besides, most people are using reads of approximately 100bp, so this length seems reasonable.
- …probably consider other assemblers
- I justified my choice of assembler – Trinity. It has been shown, over and over again, to be (one of) the most accurate, and a survey of the literature suggests it is by far the most widely used. Why benchmark other assemblers that are known to be less accurate, or that see very little use?
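Since “sequence 1 individual” apparently needs defending, here is a toy sketch – my illustration, not code from the manuscript – of why pooling individuals complicates de Bruijn graph assembly. A single heterozygous SNP between two haplotypes turns an unambiguous path into a bubble of alternative routes, and real pooled datasets contain thousands of such sites.

```python
# Toy example: a single SNP between two pooled individuals creates a
# "bubble" in the de Bruijn graph. Illustration only, not from the paper.
from collections import defaultdict

def build_graph(reads, k=5):  # real assemblers use much larger k
    """de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

hap1 = "ACGTTAGCATTA"  # individual 1
hap2 = "ACGTTGGCATTA"  # individual 2, one heterozygous site (A -> G)

single = build_graph([hap1])        # one individual
pooled = build_graph([hap1, hap2])  # pooled individuals

print("branch points, 1 individual:",
      [n for n, out in single.items() if len(out) > 1])  # []
print("branch points, pooled:",
      [n for n, out in pooled.items() if len(out) > 1])  # ['CGTT']
```

With one individual the toy graph is a single unbranched path; with two, the assembler has to resolve the bubble, which is one source of chimeric or fragmented transcripts.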
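And since “remove adapters” is apparently too obvious to write down, here is what an explicit, step-by-step version of the pre-processing recommendation looks like as a minimal Python driver. Tool choices follow the recommendation above (Rcorrector for error correction; Trimmomatic is one common choice for adapter removal and gentle trimming), but the exact flags, file names, and adapters.fa are illustrative assumptions – check each tool’s documentation before running anything.

```python
# Minimal pre-processing driver. Flags, file names, and adapters.fa are
# illustrative assumptions -- consult each tool's docs before running.
import subprocess

THREADS = "16"

def run(cmd):
    """Echo and execute one step, aborting the pipeline on failure."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 1: error correction (Rcorrector shown; bfc would be the analogous
# choice for low- to moderately-sized datasets).
run(["run_rcorrector.pl", "-t", THREADS,
     "-1", "reads_1.fq", "-2", "reads_2.fq"])

# Step 2: adapter removal plus *gentle* quality trimming at PHRED < 2,
# per the recommendation (aggressive trimming hurts assemblies).
run(["trimmomatic", "PE", "-threads", THREADS,
     "reads_1.cor.fq", "reads_2.cor.fq",          # Rcorrector output
     "out_1P.fq", "out_1U.fq", "out_2P.fq", "out_2U.fq",
     "ILLUMINACLIP:adapters.fa:2:30:10",          # adapter removal
     "LEADING:2", "TRAILING:2", "MINLEN:25"])     # PHRED-2 threshold
```

The point stands: every step is spelled out, so two people following the protocol end up running the same pipeline.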
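On depth, here is a sketch of how a 20–40 million read recommendation can be tested: subsample a deep library at several depths and assemble each subset independently. seqtk is shown as the subsampler; the seed and file names are placeholders of mine, not the manuscript’s.

```python
# Subsample one deep paired-end library at several depths with seqtk,
# then assemble each subset. Seed and file names are placeholders.
import subprocess

DEPTHS = [10_000_000, 20_000_000, 40_000_000]
SEED = "42"  # identical seed for both mates keeps pairs in sync

for depth in DEPTHS:
    for mate in ("1", "2"):
        out = f"sub{depth // 1_000_000}M_{mate}.fq"
        with open(out, "w") as fh:
            subprocess.run(
                ["seqtk", "sample", f"-s{SEED}",
                 f"reads_{mate}.fq", str(depth)],
                stdout=fh, check=True)
    # each pair then feeds its own Trinity run, and assembly metrics
    # are compared across depths to find the saturation point
```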
FYI, because preprints rock, you can read the manuscript here: http://biorxiv.org/content/early/2016/02/18/035642 and the protocol here: http://oyster-river-protocol.readthedocs.org/en/latest/, and tell me what you think.
Do you think this manuscript has merit? Are you an editor someplace and want to suggest a potential home?