Improving transcriptome assembly through error correction of high-throughput sequence reads

Posted on Posted in Bioinformatics, Illumina

I am writing this blog post in support of a paper that I have just submitted to arXiv: Improving transcriptome assembly through error correction of high-throughput sequence reads. My goal is not to talk about the nuts and bolts of the paper so much as it is to ramble about its motivation and the writing process.

First, a little bit about me, as this is my 1st paper with my postdoctoral advisor, Mike Eisen. In short, I am a evolutionary biologist by training, having done my PhD on the relationship between mating systems and immunogenes in wild rodents. My postdoc work focuses on adaptation to desert life in rodents- I work on Peromyscus rodents in the Southern California deserts, combining field work and genomics. My overarching goals include the ability to operate in multiple domains– genomics, field biology, evolutionary biology to better understand basic questions– the links between genotype and phenotype, adaptation, etc… OK, enough.. on the the paper.


The study of functional genomics–particularly in non-model organisms has been dramatically improved over the last few years by use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure–the de novo construction of a reference transcriptome must be completed as a prerequisite to further analyses. The accurate reference is critically important as all downstream steps, including estimating transcript abundance are critically dependent on the construction of an accurate reference. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on, and while they have shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show via use of a simulated dataset, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, by reducing assembly error by nearly 50%, and therefore should be applied to all datasets.

For the past couple of years, I have had an interest in better understanding the dynamics of de novo transcriptome assembly.. I had mostly selfish/practical reasons for wanting to understand–a large amount of my work depends on getting these assemblies ‘right’..  It was quickly evident that much of the computational research is directed at assembly itself, and very little on the pre- and post-assembly processes.. We know these things are important, but often an understanding of their effects is lacking…

How error correction of sequencing reads affects assembly accuracy has been one of the specific ideas I’ve been interested in thinking about for the past several months. The idea of simulating RNAseq reads, applying various error corrections, then understanding their effects is logical– so much so that I was really surprised that this has not been done before. So off I went..

I wrote this paper over the coarse of a couple of weeks. It is a short and simple paper, and was quite easy to write. Of note, about 75% of the paper was written on the playground in the UC Berkeley University Village, while (loosely) providing supervision for my 2 youngest daughters.  How is that for work-life balance!

The read data will be available on Figshare, and I owe thanks to those guys for lifting the upload limit- the read file is 2.6Gb with .bz2 compression, so not huge, but not small either. The winning (AllPathsLG corrected) assembly is there as well.

This type of work is inspired, in a very real sense, by C. Titus Brown, who is quickly becoming to be the go-to guy for understanding the nuts and bolts of genome assembly (and also got tenure based on his klout score HA!). His post and paper on The challenges of mRNAseq analysis is the type of stuff that I aspire to…

Anyway, I’d be really interested in hearing what you all think of the paper, so read, enjoy, commentand get to error correcting those reads!


UPDATE 1: The paper has made it to Haldane’s Sieve: and

Question about Structural Variants: These seem to be, like always, poorly reconstructed from de novo assemblies.. No worse with error corrected reads, but no better. If fact, contigs that are ‘full of errors’ are almost always those with complicated structural variation…

Real reads would behave differently than simulated reads. This certainly could be the case (but I don’t think so).. What I can say is that the 1st iteration of this experiment was done using reads from from the SRA– Homo.. The reason I did not use that experiment in the end is that it was hard to tell error from polymorphism relative to the reference.. but the patterns were the same. Fewer differences in error corrected reads relative to the the reference than in raw reads.. So, I do not think that the results I see are a result of the artificial nature of the simulated reads.