Is trimming beneficial for RNA-Seq?

Posted on December 28, 2013; updated March 19, 2016, by macmanes

I was pointed to a new paper in PLOS ONE: An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. Their central thesis seems to be this:

“…trimming is beneficial in RNA-Seq, SNP identification and genome assembly procedures, with the best effects evident for intermediate quality thresholds (Q between 20 and 30).”

This is a topic about which I have thought a lot, as I’ve recently written up a manuscript on the same topic: On the optimal trimming of high-throughput mRNAseq data (see also companion Blog post). I show that anything more than VERY gentle trimming is harmful to de novo assembly and transcriptome characterization. My findings seem to be in conflict with those presented by Giorgi and colleagues. I’ll tell you up front that I think I’m right, at least for the RNAseq part of their paper.

With regard to RNAseq, they show that the percentage of reads mapped to the reference increases with moderate trimming (red bars), then decreases with more aggressive (>Q30) trimming. Note that I don’t think that better mapping is necessarily equivalent to better RNAseq results, but I’ll save that issue for later.

[Figure: journal.pone.0085024.g001]

Anyway, I don’t think we really care about the percentage of reads mapped correctly; we care about the total number of reads correctly mapped. Surely, 99% mapping of a 1M read dataset is much worse than 80% mapping of a 100M read dataset. This is basically what they show: trimming reduces the size of the dataset (blue bars), but increases the mapping rate (red bars). No big deal there.
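The arithmetic behind that comparison can be sketched in a few lines of Python. The function name `mapped_reads` is made up for illustration, and the inputs are the hypothetical 1M and 100M read datasets from the paragraph above, not numbers from the paper.

```python
def mapped_reads(total_reads: int, mapping_rate: float) -> int:
    """Absolute number of mapped reads: dataset size times fraction mapped."""
    return round(total_reads * mapping_rate)


# Hypothetical datasets from the text, not data from the paper.
small = mapped_reads(1_000_000, 0.99)    # 99% of a 1M read dataset
large = mapped_reads(100_000_000, 0.80)  # 80% of a 100M read dataset

print(f"99% of 1M reads:   {small:,} mapped")
print(f"80% of 100M reads: {large:,} mapped")
```

The dataset with the lower mapping rate still yields roughly 80 times more mapped reads, which is why a percentage alone says little about how much usable data survives.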

Again, what we really care about is the absolute number of reads mapped correctly, and when you look at that, trimming, particularly at their ‘best’ trimming thresholds, looks anything but beneficial for RNAseq. Here is their data plotted using the info contained in their supplementary table S1. See what happens to the number of reads mapped as the trimming threshold increases?

[Figure: rebuttal]

This shows that trimming past Q5 (Q10 for fastX) results in a reduction in the absolute number of reads mapping, and the reduction is really profound at the trimming levels they report as best! At Q30, only 10% of the reads map, as compared to 72% of the reads in the untrimmed dataset. I’m not going to spend the time to determine if this reduction is meaningful to the downstream RNAseq analyses (though the authors of the paper should have), but I’m going to suggest that this amount of reduction would be very detrimental to any RNAseq experiment, not beneficial, as the authors claim.

So, is trimming beneficial to RNAseq? The answer is no, at least beyond very gentle trimming.
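To make the effect of the threshold concrete, here is a deliberately simple sketch of 3' end quality trimming in Python. It is not the algorithm of any trimmer evaluated in the paper; the function `trim_3prime`, the Phred+33 offset, and the example read are all assumptions chosen for illustration.

```python
def trim_3prime(sequence: str, quality_string: str, threshold: int,
                phred_offset: int = 33) -> tuple[str, str]:
    """Remove bases from the 3' end while their quality is below threshold.

    Simplified illustration only; real trimmers use more elaborate rules
    (sliding windows, running sums, and so on).
    """
    keep = len(sequence)
    while keep > 0 and ord(quality_string[keep - 1]) - phred_offset < threshold:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


# Made-up read whose quality decays toward the 3' end.
# Scores (Phred+33): I=40, F=37, 5=20, +=10, #=2
seq = "ACGTACGTAC"
qual = "IIIIIIF5+#"

for q in (5, 20, 30):
    trimmed, _ = trim_3prime(seq, qual, q)
    print(f"Q{q}: {len(trimmed)} of {len(seq)} bases kept")
```

Raising the threshold from Q5 to Q30 removes progressively more of this toy read. On a real dataset, reads that become too short are typically discarded, which is one way the absolute number of mappable reads can drop.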


Dave Bridges

What about when the general quality of reads is relatively low? Do you still hold that trimming would be less than useful then?
- Matt MacManes
  
  Actually yes, aggressive trimming should still be avoided. See the pink dotted lines in all figures: http://biorxiv.org/content/biorxiv/early/2013/12/23/000422.full.pdf
  
  This represents a lower quality dataset: the numbers are different, but the patterns are similar.
Robert King

I agree with what you say. In my experience of SNP calling and mapping, anything more than minimal trimming increases false positives, and the increase in mapping reads is immaterial. The authors seem to be missing the big picture. If data quality is awful then maybe a bit more trimming is warranted, but junk data is likely to give junk results.
Tim Roth

I do believe trimming is beneficial for RNASeq, at least under the parameters the authors measured. As they say, the decision to trim RNASeq reads is usually a tradeoff between the percentage of mapping reads and the number of surviving reads. The idea of directly measuring the correctness of mapping (i.e. the number of reads that align correctly) is challenging; however, it’s really hard to define “true positives” in such a scenario, where we start with uncertainty in the genome sequence itself and in the divergence of the sequenced organism from the reference, and then add various flavours of uncertainty at the level of contaminant reads (environmental, pathogens/symbionts, human operator-derived), library preparation biases and plain sequencing errors. The only really unbiased way to assess this is through the “mappability” of the surviving reads: a higher percentage means the trimmer increases the “quality”, so defined, of the surviving population. And once more, it’s a matter of tradeoff between quality and quantity (size of the post-trimming population of reads), as with everything in life, I must add.
- Matt MacManes
  
  Thanks Tim for the comment! Tradeoff yes, I just think that the recommended Q20-30 threshold is not supported at all. Nothing in their paper suggests that the outcomes are improved. Mapping percentage is meaningless.
Simone Scalabrin

Thanks for your comments Matt.

I am Simone Scalabrin, one of the authors of the article. I’d like to add my point to what Federico already wrote.

In our paper we do not impose our faith, we just provide different datasets and objective trimming effects (as supplementary table 1 that you pointed out) and our personal opinion on that. Different reviewers agreed with us.

The RNA-seq dataset from which your criticism stems is of extremely low quality. You can easily see that from how the trimmers behave on it. We believed that was the most interesting dataset to evaluate. You can also have a look at the Arabidopsis dataset. In general, for reads we produce internally, with Q20 we trim between 1 and 2% of the read, but low quality experiments such as the one we present are not that rare, at least with Illumina runs.

A few comments on what you wrote about this dataset: first of all, we are not talking about a dataset of 100M reads trimmed down to 1M reads. And if we had such a bad dataset we would certainly throw it away, or keep at most the 0.99M reads (99%) mapping after trimming rather than using 80M reads (80%) with likely quite a few unreliable mappings. I believe quality values provided by Illumina are quite reliable, and a Q as low as 5 or even 2 is extreme: that is a 32% and 63% probability, respectively, that the single base is wrong.
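The 32% and 63% figures follow from the standard Phred definition, P = 10^(-Q/10). A minimal Python sketch (the function name is chosen here for illustration):

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to the base-call error probability."""
    return 10 ** (-q / 10)


for q in (2, 5, 20, 30):
    print(f"Q{q}: {phred_to_error_probability(q):.1%} chance the base is wrong")
```

This prints about 63.1% for Q2, 31.6% for Q5, 1.0% for Q20 and 0.1% for Q30. The recommended Q20 to Q30 range therefore corresponds to per-base error probabilities between 1 in 100 and 1 in 1000.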

Second, we do not show the percentage of reads mapped correctly. We show the percentage of reads mapping after trimming. For most tools, this percentage stabilizes to about 90% at Q>=20. We used real datasets (simulated reads are highly biased and useless) and cannot say how many reads are placed correctly. Taking your idea of using the raw reads just because the absolute number of mapped reads is higher moves the question to the aligner and how reliable it is on clipping.

Third, we proposed very general guidelines and tried to avoid a clear recipe. Please read the final Discussion, including the sentence “Therefore, it is up to the researcher to select the best trade-off between read loss and dataset quality”. With trimming you decrease specificity in favor of sensitivity on the later step of mapping. In particular, we wrote this sentence thinking about the different scenarios that may happen, while you write that the reduction caused by trimming is “detrimental to any RNA-seq experiment”. This hasn’t been proved, either by our paper or by yours.

Have a nice 2014,
one of your co-authors in the assemblathon2 paper 😉
- Matt MacManes
  
  Hi Simone- thanks for your comments!
  
  You: “Second, we do not show the percentage of reads mapped correctly. We show the percentage of reads mapping after trimming”
  
  Me: Right, had you shown that the number of reads mapping correctly was increased with trimming, that would have been evidence of a positive effect. As it stands, you show that absolute numbers decrease, and with aggressive trimming, they decrease dramatically. This is not likely to have a positive effect on RNAseq, unless you are assuming that the difference between the numbers of reads mapping is mostly erroneous read mapping. One thing that would be interesting to know is how many reads mapped with 0 mismatches.
  
  You: “This hasn’t been proved, neither from our paper nor from yours”
  
  Me: Actually, I think I do provide evidence that aggressive trimming does harm RNAseq- at least one of the common goals of RNAseq, which is to discover transcript sequences. Obviously I have not evaluated every dataset ever, but 2 random datasets show similar patterns.
  
  How about this: send me 2 small read datasets, one with adapter trimming only, one with adapter trimming + Q30 trimming. I’ll assemble them and apply the same metrics I did in my paper, and see what happens. I’ll send you 2 datasets, and you replicate your work. We’ll write something up (quick and dirty) and post it on bioRxiv.
  - Simone Scalabrin
    
    We are both right and both wrong; what mainly matters is what you need, either sensitivity or specificity.
    
    When I say “this hasn’t been proved” I mean test all possible RNA-seq applications (de novo assembly, read mapping and RPKM count, etc) and for each of these applications you can find subtle differences, e.g. study just paralogous or all genes. You did a very good job on discovery of transcripts!
    
    Last, if we want to discuss our paper, what about using the two RNA-seq datasets we used? They are not that big. Or just use the human dataset, the nasty one. I guess we can discuss this in private via email, ok? And there we can also choose which tool to use for adapter removal and trimming!
    
    Best,
    Simone
Mbeyagala Emmanuel

Matt, thanks for the discussions on trimming. Is there an effect of read filtering on RNASeq?
If so, is there an optimal filtering?
- Matt MacManes
  
  What do you mean by filtering? Removing PCR duplicates?
  - Mbeyagala Emmanuel
    
    I am using fastx_toolkit to do a quality filter of the reads before they are trimmed.
    The filtering is for removing low quality reads.
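For readers unfamiliar with this kind of read-level filter, the sketch below shows the general idea in Python: a read is kept only if a minimum percentage of its bases reach a minimum quality. This mirrors the kind of rule that fastx_toolkit's `fastq_quality_filter` exposes through its `-q` and `-p` options, but it is not that tool's code. The function name, the default values, and the Phred+33 offset are assumptions made for illustration.

```python
def passes_quality_filter(quality_string: str, min_quality: int = 20,
                          min_percent: float = 80.0,
                          phred_offset: int = 33) -> bool:
    """Return True if at least min_percent of bases have quality >= min_quality.

    Illustrative defaults only; they are not recommendations.
    """
    scores = [ord(ch) - phred_offset for ch in quality_string]
    if not scores:
        return False  # an empty read cannot pass
    good = sum(score >= min_quality for score in scores)
    return 100.0 * good / len(scores) >= min_percent


# Made-up quality strings (Phred+33): 'I' = Q40, '#' = Q2
print(passes_quality_filter("IIIIIIII"))  # all bases high quality
print(passes_quality_filter("IIII####"))  # only half the bases reach Q20
```

Unlike trimming, which shortens reads, this kind of filter discards whole reads, so it shrinks the dataset in a different way, and its effect on RNAseq would need to be evaluated separately.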