New Paper: On the optimal trimming of high-throughput mRNA sequence data

I've just finished work on a new paper, On the optimal trimming of high-throughput mRNA sequence data, which is ~~available as a preprint on bioRxiv: http://biorxiv.org/content/early/2013/12/23/000422. The paper has been submitted to Frontiers in Bioinformatics and Computational Biology for potential inclusion~~ published in a special issue dealing with the Quality Control of NGS data (http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00013/abstract).

I began work on this paper a few months ago, not intending to write it up for publication, but because I was interested in understanding trimming as it relates to transcriptome assembly. When you look in the NGS literature, you see people typically trimming at Phred=20. This is by far the most common trimming level, but it was unclear to me why, as it seems very harsh. Back in my early NGS days, I too had trimmed at 20, thinking about that old saying 'garbage in, garbage out'. I'm much less sure now.
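For anyone who doesn't think in Phred scores every day, here is a minimal sketch of what the numbers mean. It uses the standard definition Q = -10·log10(p), where p is the probability that a base call is wrong; the function name is just something I made up for illustration.

```python
def phred_to_error_prob(q):
    """Probability that a base call is wrong, given Phred score q.

    Standard definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10).
    (Hypothetical helper name, for illustration only.)
    """
    return 10 ** (-q / 10)


# Compare the common threshold (20) with gentler ones (2 and 5).
for q in (2, 5, 20):
    print(f"Phred {q}: error probability {phred_to_error_prob(q):.3f}")
```

So a Phred 20 cutoff discards any base with more than a 1% chance of being wrong, while Phred 5 and Phred 2 tolerate roughly a 32% and 63% chance, respectively.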

When you take a critical look at Phred20 trimming, it just looks bad. Here is a plot of an Illumina dataset error profile, with the red dotted line indicating the Phred20 trimming threshold. It is plain that trimming this aggressively can throw away a lot of data.

[Figure: "Rplot". Illumina dataset error profile; the red dotted line marks the Phred20 trimming threshold.]

So anyway, I got to thinking: can I demonstrate that P20 trimming is bad for transcriptome assembly, and if it is, what level might be better? It turns out that this is reasonably complex. There are issues with coverage: optimal trimming might be related to depth of coverage. It could also be dataset specific: maybe each dataset, with the specifics of its error profile, has a unique optimal point (this would suck!). The optimal trimming point could depend on the assembler used, or even on the specific trimming program. Last, individual researchers may have different goals, and therefore be more or less tolerant of different types of errors. The take-home message from all this: there was no way to reasonably test all these variables, so I needed to make some decisions about which ones I wanted to test.

Further complicating matters is how best to evaluate the 'goodness' of a transcriptome assembly, an issue I've worried a lot about. For genome assembly, this is much easier: N50 makes sense, and there are people developing tools to assay genome assemblies (e.g. REAPR). These things don't exist for transcriptomes, so it's hard to know what constitutes a good assembly versus a bad one. The paper used a few metrics, but I'm not convinced they are the right ones.

So you'll have to read the paper if you want the full details, and please comment on the manuscript either here or over at bioRxiv. What I'll say is that you should probably stop trimming your RNAseq reads at P20. Instead, try P2 or P5. Also, one of the most interesting and puzzling details in the paper is why trimming did not have MORE effect. I would have thought that trimming would result in a more profound improvement, but it really doesn't.
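To make the difference between thresholds concrete, here is a toy sketch of 3' end trimming. To be clear, this is not the method used in the paper: it assumes Phred+33 encoded quality strings, it simply removes bases from the 3' end while they fall below the threshold, and the function name and example read are made up. Real trimming programs generally use more elaborate rules.

```python
def trim_3prime(seq, qual, threshold, offset=33):
    """Toy 3' trimmer (hypothetical, for illustration only).

    Removes bases from the 3' end while their Phred score is below
    `threshold`. Assumes `qual` is a Phred+33 encoded quality string.
    """
    scores = [ord(c) - offset for c in qual]
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read whose quality decays toward the 3' end.
read = "ACGTACGTAC"
qual = "IIIIII5+&#"  # Phred+33: I=40, 5=20, +=10, &=5, #=2

for threshold in (2, 5, 20):
    trimmed, _ = trim_3prime(read, qual, threshold)
    print(f"P{threshold}: kept {len(trimmed)} of {len(read)} bases")
```

On this made-up read, P2 keeps all 10 bases, P5 keeps 9, and P20 keeps only 7, which is the kind of data loss the plot above is hinting at.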

Does this hold for genome assembly or metatranscriptomics or whatever else? I don't really know, and I certainly don't have an official recommendation. My impression, however, is that trimming at Phred=20 is always*** bad, and that a more gentle trimming strategy is better. Now, go read (and comment on) the full paper: http://biorxiv.org/content/early/2013/12/23/000422

One final note: this is the first publication sporting my new affiliation and role as a PI. The paper is freely available as a preprint, and Frontiers In… is an OA journal. This is how the MacManes Lab will roll! Since moving to UNH, the issues of access have been more 'in my face' than they were at Berkeley (being Berkeley, they had an online subscription to seemingly every journal). Here, I cannot see many papers that would be of interest, because they are behind a paywall. While the OA fees are sometimes expensive, especially for a new PI, I have to be the change I want to see in the scientific community. Must walk the walk.

*** The one exception may be when the primary purpose of the experiment is to identify SNPs. Maybe then very strict trimming is beneficial.
