Shot down…


I’m trying (still) to get my manuscript on transcriptome assembly best practices published. I’m sorry to say that it is not going that well. It seems that there is a disconnect between editors (who mostly hate it) and people actually assembling transcriptomes (who find it very valuable). How to reconcile these differences? Here is the latest rejection, from an unnamed editor who ‘reviewed’ the manuscript and rejected it:

While manuscripts on best practices are generally able to be considered for publication, this manuscript is too small of an advance to be considered further at BMC Genomics. Your recommendations are either trivial (“sequence 1 or more tissues”), obvious (“remove adapters”) or flawed (suggesting 20-40 million reads as adequate, unqualified by considerations of sequencing technology, transcriptome complexity or read length, is problematic). I am sure you found it helpful for the three data sets you used, anecdotally, but to consider guidelines useful for a wider audience you would have to test them on a far larger number and probably consider other assemblers as well.

I’d like to take this opportunity to break down his comments.

  • …recommendations are trivial (“sequence 1 or more tissues”)
    • The actual recommendation is this: “Summary Statement: Sequence 1 or more tissues from 1 individual…” This is an important recommendation! Said more verbosely: assemble 1 individual, but as many tissues as are required by the experimental design. People often try to make assemblies consisting of between 1 and n individuals sampled from a population. This causes all sorts of problems with de Bruijn graph based assemblers – centered on polymorphism. The recommendation is not trivial – it is very common to see multi-individual assemblies in the published literature. Also, I have data to support the hypothesis that sequencing 1 individual results in a cleaner assembly than does sequencing, for instance, 10 (Table 1).
  • …recommendation is obvious (“remove adapters”)
    • The actual recommendation: “Visualize your read data. Error correct reads using bfc for low to moderately sized datasets and RCorrector for higher coverage datasets. Remove adapters, and employ gentle quality filtering using PHRED < 2 as a threshold”. The point here is to make a set of recommendations that people can follow step by step. Adapter trimming is obviously an important step. I was not claiming this to be groundbreaking, only complete. Make a protocol that is not explicit with regard to each step and see what happens – I dare you.
  • …recommendation is flawed (suggesting 20-40 million reads as adequate, unqualified by considerations of sequencing technology, transcriptome complexity or read length)
    • Actual rec: “Summary Statement: Sequence 1 or more tissues from 1 individual to a depth of between 20 million and 40 million 100bp or longer paired-end reads”
    • Fair point about sample size – I can demonstrate this effect in more datasets, but to call this anecdote is highly offensive. These are randomly chosen datasets, assembled in many different ways. Ask for more replication, but don’t call it anecdote. I would note that each dataset added represents hundreds of hours of work over 10+ assemblies, so it’s not trivial to add datasets. So, how many datasets would be appropriate – 10 more? 100 more? Dear editor, how many additional assemblies should I do?
    • Actually, the recommendation is qualified. I say in this section that it relates to Illumina only, and I specify read length (100bp paired-end) and vertebrate transcriptomes. So, I qualify it in exactly the way the editor claims I did not. I have to wonder, did he even read the paper? Could the recommendation be different for 250bp reads or 36bp reads? Maybe. But given how the work scales, it’s not feasible to test all of these combinations. Besides, most people are using reads of approximately 100bp, so this length seems reasonable.
  • …probably consider other assemblers
    • I justified my choice of assembler – Trinity. It has been shown over and over again to be (one of) the most accurate, and a survey of the literature suggests it is by far the most widely used. Why benchmark, yet again, other assemblers that are known to be less accurate or are rarely used?
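To make the preprocessing recommendation above concrete, here is a minimal Python sketch (my illustration, not code from the manuscript or protocol) of what ‘gentle’ quality filtering at PHRED < 2 means, assuming standard Phred+33 FASTQ quality encoding; in the actual protocol this step would be performed by a dedicated read-trimming tool:

```python
def trim_read(seq, quals, threshold=2, offset=33):
    """Trim low-quality bases (PHRED < threshold) from both ends of a read.

    `seq` is the base string and `quals` the matching FASTQ quality
    string; Phred+33 encoding is assumed. With threshold=2, this is
    'gentle' filtering: only PHRED 0-1 bases (essentially
    no-confidence calls) are removed, preserving coverage for assembly.
    """
    phred = [ord(c) - offset for c in quals]
    # advance past low-quality bases at the 5' end
    start = 0
    while start < len(phred) and phred[start] < threshold:
        start += 1
    # back off low-quality bases at the 3' end
    end = len(phred)
    while end > start and phred[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]
```

The point of the low threshold is that aggressive quality trimming throws away usable coverage; trimming only the no-confidence calls keeps the reads long while still removing the worst data.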

FYI, because preprints rock, you can read the manuscript here: http://biorxiv.org/content/early/2016/02/18/035642 and the protocol here: http://oyster-river-protocol.readthedocs.org/en/latest/ – tell me what you think.

Do you think this manuscript has merit? Are you an editor someplace and want to suggest a potential home?

  • F. Olivier Hébert

    I work on a small parasitic transcriptome that I had to de novo assemble and the choice of individuals was critical in my capacity to fully assemble “true transcripts” and messing around with multiple-sample assemblies made me realize that 1 individual would give me better and more accurate results (LOTS of chimeras with multiple samples). It took me several weeks to find this out… I would have liked to have some sort of guidelines on how to do this in the first place. This manuscript does EXACTLY this job and I think a lot of people need it. People have been using RNAseq for the past 8 years or so and still, nobody knows how to properly proceed. Thanks for your work Matt, I hope this gets published soon!

  • Melissa DeBiasse

    I am working on a project to test the effect of ocean acidification (OA) on an encrusting sponge/coral species interaction – the coral and sponge are in competition for space on reefs. The coral builds a calcium carbonate structure and the sponge overgrows the coral and bioerodes the structure away. I have RNAseq data from many experimental treatments and time points and am interested in testing differential gene expression among them. I understand that assembling a transcriptome from more than one individual complicates the process. But if I ultimately want to map gene expression reads back to that transcriptome, doesn’t it make sense to assemble a “transcriptome” from individuals across the experimental treatments in order to capture the full complement of genes that may be expressed at different times/under different conditions? Does your advice to assemble a transcriptome from one individual apply for questions of differential gene expression? Thank you for the clarification!

    • Matt MacManes

      The rec here is to include 1 individual per treatment group, assuming you expect new isoforms / genes in treatments.

      • Melissa DeBiasse

        Hi Matt, thank you for the reply! Would you worry about missing intraspecific variation? Or is this just an inevitable tradeoff required to make the assemblies run correctly?

  • Ben Sutherland

    MacManes shares his valuable expertise in transcriptome assembly by providing a roadmap for the complex process of generating a reference transcriptome. This is a vital first step in many transcriptomics experiments for non-model species. With well-organized information collected for the first time in one place, this manuscript provides (much needed) best practices, and ways for the user to evaluate his/her own assemblies through the sharing of the code used in the analyses.

    The manuscript focuses mainly on the various key parameters around an assembly (e.g. # individuals, # reads, digital normalization) and does not attempt to compare Trinity to other assemblers, as this has already been done and is cited within the document. This does not limit the utility of the document, but rather keeps one aspect controlled while other aspects are tested (e.g. number of individuals used for assembly).

    Some remaining questions/suggestions:
    Regarding the description of using two individuals for generating the reference transcriptome if the individuals come from different conditions (e.g. one male, one female; at L132), this makes sense, but then what about the problem of splitting alleles into two redundant transcripts due to the polymorphism? Is this not an issue, or is there a way to solve this issue? And if the alleles get split into different transcripts, how will this affect further mapping for gene quantification due to the redundancy within the reference transcriptome?
    Further to this, I would be very interested to hear how this specific feature affects the biological interpretation of a differential expression (DE) analysis. For example, regarding the use of the sub-sampled 10 individuals assembly compared to the full 1 individual assembly: how would this affect identified DE genes? In the analysis based on a single individual, are genes not identified as DE if they lack expression in the reference individual? Are there problems introduced by polymorphisms resulting in the mapping of reads to two different redundant ‘transcripts’ that are actually the same gene, but split by polymorphism from the 10 individual subsampled assembly? How does DE analysis behave in these cases? Although these cases can be somewhat experiment specific, even some further discussion about these problems would be much appreciated!

    I certainly will be using this manuscript to inform my own work.