I find myself continually frustrated**. Reading transcriptome assembly paper after paper, all of them quality trimming mindlessly at a Phred score 20 or 30. Why do people do this??!?!?! I just don\’t understand, given there is nary a shred of evidence to suggest its benefit, and a substantial amount of evidence to suggest the contrary (http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00013/abstract, http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00017/abstract), that quality trimming, especially the type of quality trimming often employed (trimming at Phred >10) is baaaaad for assemblies – content and contiguity alike..
So.. I decided to attack this problem from a different angle – from that of kmer distributions. Most people that will read this are intimately familiar with the typical transcriptome kmer distribution. In short, super abundance of unique kmers with a long tail. These unique kmers are presumed to be erroneous, largely owing to sequencing error. So, the types of quality trimming we like to think we\’re doing should target these unique kmers – reducing their abundance.. The good news is, that quality trimming does reduce the number of unique kmers..
What however, would you think of trimming if I told you that it was removing non-unique kmers – those very likely to be \’biologically real\’? And I\’m not talking about a small number of non-unique kmers either.
I decided to do a simple analysis. I downloaded 11 datasets from the ENA. All were single end datasets from a Hiseq 2000. They varied in size from 4M reads to about 100M. The datasets were trimmed using Trimmomatic, and kmers were counted with Jellyfish. Full details at the bottom of the post.
The figure I\’m showing you is a plot of non-unique kmer loss. Specifically, the proportion of kmers of a given frequency (freq. 2 to 20) that are lost with quality trimming of Phred=20 versus the raw data. What you can see is that the most common type of quality trimming results in a loss of non-unique kmers by an average of close to 10%. These non-unique kmers are likely \’real\’, and may be the reason why harsh trimming is associated with poorer assemblies. For instance, there is a reduction in the number of kmers observed twice by between 9%, and 30% in these datasets.
So, in short, quality trimming gets rid of unique kmers like we\’d hope it would, but id also removes non-unique kmers and this my friends is baaaad! Please stop quality trimming so stringently, now.
\”Full Methods\”
Datasets used: SRR536828, SRR575841, SRR656612, SRR776537, SRR830399, SRR867201, SRR900790, SRR891361, SRR776566,SRR536829
Counting kmers in raw read files
<br /> for i in <code>MARKDOWN_HASHfbca6a3e2dd57da113940bbbb4d3d5b9MARKDOWN_HASH</code>; do jellyfish count -m 25 -s 200M -t 12 -C -o $i.jf $i done
Trimming, which resulted in ~1-98% of reads, except for SRR900790, where 29% were removed. I think this variation in % trimming is pretty typical of more recent Illumina datasets.
for i in <code>MARKDOWN_HASHfbca6a3e2dd57da113940bbbb4d3d5b9MARKDOWN_HASH</code>; do java -Xmx30g -jar /share/trinityrnaseq_r20140717/trinity-plugins/Trimmomatic-0.32/trimmomatic-0.32.jar SE -threads 12 $i $i.Phred20.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 LEADING:20 TRAILING:20 MINLEN:25 done
Make histograms
for i in <code>MARKDOWN_HASH3ad4421bb0225b8d1cea09db216b6a4dMARKDOWN_HASH</code>; do jellyfish histo $i -o $i.histo done
code to compile the individual histograms into the file that I used to make the graph. The final file is here in a public github gist: https://gist.github.com/macmanes/c5860e391e83a330bf89
paste <(head -100 SRR900790.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR891361.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR867201.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR830399.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR776566.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR776537.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR656612.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR575841.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR575726.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR536829.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR536828.fastq.jf.histo | awk \'{print $2}\') <(head -100 SRR900790.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR891361.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR867201.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR830399.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR776566.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR776537.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR656612.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR575841.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR575726.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR536829.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') <(head -100 SRR536828.fastq.Phred20.fq.jf.histo | awk \'{print $2}\') > histo.txt
R code to make the plot
boxplot( as.numeric(1-(histo[2,12:22] / histo[2,1:11])), as.numeric(1-(histo[3,12:22] / histo[3,1:11])), as.numeric(1-(histo[4,12:22] / histo[4,1:11])), as.numeric(1-(histo[5,12:22] / histo[5,1:11])), as.numeric(1-(histo[6,12:22] / histo[6,1:11])), as.numeric(1-(histo[7,12:22] / histo[7,1:11])), as.numeric(1-(histo[8,12:22] / histo[8,1:11])), as.numeric(1-(histo[9,12:22] / histo[9,1:11])), as.numeric(1-(histo[10,12:22] / histo[10,1:11])), as.numeric(1-(histo[11,12:22] / histo[11,1:11])), as.numeric(1-(histo[12,12:22] / histo[12,1:11])), as.numeric(1-(histo[13,12:22] / histo[13,1:11])), as.numeric(1-(histo[14,12:22] / histo[14,1:11])), as.numeric(1-(histo[15,12:22] / histo[15,1:11])), as.numeric(1-(histo[16,12:22] / histo[16,1:11])), as.numeric(1-(histo[17,12:22] / histo[17,1:11])), as.numeric(1-(histo[18,12:22] / histo[18,1:11])), as.numeric(1-(histo[19,12:22] / histo[19,1:11])), as.numeric(1-(histo[20,12:22] / histo[20,1:11])), col=\'light blue\', frame.plot=F, xaxt=\'n\', main=\'Percent loss of non-unique kmers n No qual. trim versus trimming at Phred=20\') axis(1, at=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), labels=c(\'kmer_freq=2\', "kmer_freq=3", "kmer_freq=4", "kmer_freq=5", "kmer_freq=6", "kmer_freq=7", "kmer_freq=8", "kmer_freq=9", "kmer_freq=10", "kmer_freq=11", "kmer_freq=12", "kmer_freq=13", "kmer_freq=14", "kmer_freq=15", "kmer_freq=16", "kmer_freq=17", "kmer_freq=18", "kmer_freq=19", "kmer_freq=20"), las=2 )
** Not really, but this sounds better..