New Paper: On the optimal trimming of high-throughput mRNA sequence data

I’ve just finished work on a new paper, On the optimal trimming of high-throughput mRNA sequence data, which is ~~available as a preprint on bioRxiv: http://biorxiv.org/content/early/2013/12/23/000422. The paper has been submitted to Frontiers in Bioinformatics and Computational Biology for potential inclusion~~ published in a special issue dealing with the Quality Control of NGS data (http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00013/abstract).

I began work on this paper a few months ago, not intending to write it up for publication, but because I was interested in understanding trimming as it relates to transcriptome assembly. Indeed, when you look in the NGS literature, you see people typically trimming at Phred=20. This is by far the most common trimming level, but it was unclear to me why, as it seems very harsh. Back in my early NGS days, I too had trimmed at 20, thinking about that old saying ‘garbage in, garbage out’. I’m much less sure now.

When you take a critical look at Phred20 trimming, it just looks bad. Here is a plot of an Illumina dataset error profile, with the red dotted line indicating the Phred20 trimming threshold. It is plainly evident that trimming this aggressively may result in a loss of a lot of data.

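[Figure “Rplot”: error profile of an Illumina dataset, with the Phred20 trimming threshold shown as a red dotted line]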
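To put numbers on these thresholds, recall that a Phred score Q corresponds to a per-base error probability of 10^(-Q/10). The snippet below is only an illustrative sketch of that standard conversion (the function name is made up for this example and does not come from the paper):

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Compare the thresholds discussed in this post.
for q in (2, 5, 20):
    p = phred_to_error_probability(q)
    print(f"Phred {q:>2}: error probability {p:.3f}")
```

So a Phred=20 cutoff throws away every base with more than a 1% estimated chance of being wrong, while a Phred=5 cutoff only removes bases with roughly a 1-in-3 chance or worse.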

So anyway, I got to thinking: can I demonstrate that P20 trimming is bad for transcriptome assembly, and if it is, what level might be better? It turns out that this is reasonably complex. There are issues with coverage: optimal trimming might be related to depth of coverage. Also, it could be dataset specific: maybe each dataset, with the specifics of its error profile, has a unique optimal point (this would suck!). The optimal trimming point could depend on the assembler used, or even the specific trimming program. Last, individual researchers may have different goals, and therefore be more or less tolerant of different types of errors. The take-home message from all this: there was no way to reasonably test all these variables, so I needed to make some decisions about which ones I wanted to test.

Further complicating matters is how best to evaluate the ‘goodness’ of a transcriptome assembly, an issue I’ve worried a lot about. For genome assembly, this is much easier: N50 makes sense, and there are people developing tools to assay genome assemblies (e.g. REAPR). These things don’t exist for transcriptomes, so it’s hard to know what constitutes a good assembly versus a bad one. The paper used a few metrics, but I’m not convinced they are the right ones.

So you’ll have to read the paper if you want the full details, and please comment on the manuscript either here or over at bioRxiv. What I’ll say is that you should probably stop trimming your RNAseq reads at P20. Instead, try P2 or P5. Also, one of the most interesting and puzzling details about the paper is why trimming did not have MORE effect. I would have thought that trimming would have resulted in a more profound improvement, but it really doesn’t.
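To make the difference between P20 and P2/P5 concrete, here is a toy sketch that trims low-quality bases from the 3' end of a single made-up read at each threshold. It assumes Phred+33 quality encoding and a deliberately naive rule (cut trailing bases while they fall below the threshold). Real trimming programs use more elaborate schemes such as sliding windows, so this is not the method used in the paper, and the function names and the read are hypothetical.

```python
def decode_phred33(quality_string):
    """Decode a Phred+33 encoded quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(sequence, quality_string, threshold):
    """Drop trailing bases whose quality is below `threshold` (naive rule, for illustration)."""
    scores = decode_phred33(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return sequence[:end]


# Hypothetical read: eight Q40 bases followed by Q20, Q10, Q5 and Q2.
read = "ACGTACGTACGT"
qual = "IIIIIIII5+&#"

for threshold in (2, 5, 20):
    kept = trim_3prime(read, qual, threshold)
    print(f"P{threshold:<2} keeps {len(kept):>2} of {len(read)} bases: {kept}")
```

On this toy read, P2 keeps all 12 bases, P5 keeps 11, and P20 keeps only 9; with a quality profile that decays toward the 3' end, the harsher threshold discards noticeably more sequence.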

Does this hold for genome assembly or metatranscriptomics or whatever else? I don’t really know, and I certainly don’t have an official recommendation. My impression, however, is that trimming at Phred=20 is always*** bad, and that a gentler trimming strategy is better. Now, go read (and comment on) the full paper: http://biorxiv.org/content/early/2013/12/23/000422

One final note: this is the first publication sporting my new affiliation and role as a PI. The paper is freely available as a preprint, and Frontiers In… is an OA journal. This is how the MacManes Lab will roll! Since moving to UNH, the issues of access have been more ‘in my face’ than they were at Berkeley (being Berkeley, they had an online subscription to seemingly every journal). Here, I cannot see many papers that would be of interest because they are behind a paywall. While the OA fees are sometimes expensive, especially for a new PI, I have to be the change I want to see in the scientific community. Must walk the walk.

*** The one exception may be when the primary purpose of the experiment is to identify SNPs. Maybe then very strict trimming is beneficial.

Richard Smith

Matt, thanks for this thoughtful paper, and especially for making your process so open. I have been thinking about and writing software for transcriptome assembly optimisation for the last year, and I have some thoughts on the paper. If you don’t mind, I’m going to reanalyse some of your assemblies using a broader range of metrics, and will post a response to your paper (either as a blog post or a preprint, depending on the strength of the conclusions). I just wanted to drop a comment to say thanks. Methods matter a lot. Many researchers are throwing money at sequencing without giving enough care to the analysis, and if we’re throwing away perfectly useful reads then people should be made aware of it. I like that you’ve challenged a pretty basic assumption: that being strict is a good thing. Also, nice work being one of the first preprints on bioRxiv 🙂
- Richard Smith
  
  Oh, and if your lab continues like this, it’s gonna be one to watch!
  - Matt MacManes
    
    please do!
- Matt MacManes
  
  Approve
  - Mbandi SK
    
    We have submitted a paper in which we address the concept of aggressive trimming and artefact removal, and provide a perspective on how to gauge trimming in the context of de novo transcriptome assembly. I am happy that our conclusions are similar to yours. In particular, we describe a metric (HSP ratio) based on seed alignment of transfrag-derived proteins to known proteins from UniProt. For highly fragmented assemblies, the HSP ratios are generally low. The metric was motivated by the observation that de novo transcriptome assemblies generated with different parameters quite often produce unequal numbers of BLAST hits or functional annotations. I am currently making the changes recommended by reviewers, and I’ll put up a Bitbucket repository with all the scripts used therein.
    P.S.: I posted a tweet in relation to this when I decided to write our manuscript.