After months of work, my new manuscript "An opinionated guide to the proper care and feeding of your transcriptome" is out on bioRxiv!! Accompanying it is a living document: the version-controlled and continually updated set of current best practices, the "Oyster River Protocol for Transcriptome Assembly". Together, these documents provide evidence-based guidelines for de novo transcriptome assembly and a clear how-to guide for implementing them. Here is the abstract:
Characterizing transcriptomes in both model and non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary, and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. Each step may be accomplished in one of several different ways, using different software packages, each producing different results. This analytical complexity begs the question: which method(s) are optimal? Using reference and non-reference based evaluative methods, I propose a set of guidelines that aim to standardize and facilitate the process of transcriptome assembly. These recommendations include the generation of between 20 million and 40 million sequencing reads from a single individual where possible, error correction of reads, gentle quality trimming, assembly filtering using Transrate and/or gene expression, annotation using dammit, and appropriate reporting. These recommendations have been extensively benchmarked and applied to publicly available transcriptomes, resulting in improvements in both content and contiguity. To facilitate the implementation of the proposed standardized methods, I have released a set of version-controlled, open-source code, The Oyster River Protocol for Transcriptome Assembly, available at http://oyster-river-protocol.rtfd.org/.
The paper, in brief, evaluated the different steps for transcriptome assembly, and provides recommendations.
- Check the quality of your reads, and make an archival copy
- Error correct using BFC if you have less than 20 million paired-end reads, or RCorrector if you have more.
- Aggressively remove adapters, and gently quality trim (Phred = 2)
- Assemble using Trinity.
- Evaluate the quality of your assembly.
- Filter based on Transrate and/or gene expression
- Annotate your transcriptome using dammit
- Report appropriately (e.g., not N50)
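To make the reporting point concrete, here is a minimal Python sketch (with made-up contig lengths, not data from the paper) of why N50 alone is a poor summary statistic for a transcriptome: simply discarding short contigs inflates N50 without adding any information to the assembly.

```python
def n50(lengths):
    """N50: the length of the contig at which contigs of that length
    or longer contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0

# A toy assembly: two long contigs plus many short (possibly real) transcripts.
full = [1000, 500] + [100] * 20
trimmed = [1000, 500]  # same assembly with every short contig thrown away

print(n50(full))     # -> 100
print(n50(trimmed))  # -> 1000, "better" N50 from deleting content
```

Deleting twenty contigs raises N50 ten-fold while the assembly has strictly less content, which is exactly why the protocol pushes evaluation with Transrate and BUSCO instead.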
The secret sauce of the protocol is really the filtering done after assembly. I've already published on quality trimming and error correction (here and here), so I'm hoping these steps are already becoming part of standard workflows (ahem, ***). Filtering, however… people talk about filtering assemblies, and it's like the Wild West out there, with everyone doing their own thing. This protocol formalizes these procedures in an objective, evaluative fashion, basically by using Transrate and BUSCO to guide the filtering process. Because I evaluate before and after each filtering step, I know when I can push a little harder and when I've filtered too stringently. The appropriate filtering of assemblies is critical.
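As a toy illustration of the "Transrate and/or gene expression" filter, here is a minimal Python sketch. The data layout, field names, and cutoffs are all hypothetical placeholders; in a real workflow the per-contig scores would come from Transrate's contig output and the TPM values from a quantifier such as Salmon or Kallisto, with BUSCO run before and after to check that the filter hasn't discarded real genes.

```python
def filter_contigs(contigs, score_cutoff=0.1, tpm_cutoff=1.0):
    """Keep a contig if it passes EITHER the (hypothetical) Transrate
    contig-score cutoff OR the expression (TPM) cutoff; only contigs
    failing both are discarded. Cutoff values here are illustrative."""
    return {name: v for name, v in contigs.items()
            if v["score"] >= score_cutoff or v["tpm"] >= tpm_cutoff}

# Hypothetical per-contig metrics for a tiny assembly.
assembly = {
    "contig1": {"score": 0.45, "tpm": 12.0},  # passes both
    "contig2": {"score": 0.02, "tpm": 5.0},   # low score, but expressed
    "contig3": {"score": 0.30, "tpm": 0.0},   # unexpressed, but well supported
    "contig4": {"score": 0.01, "tpm": 0.2},   # fails both -> dropped
}

kept = filter_contigs(assembly)
print(sorted(kept))  # -> ['contig1', 'contig2', 'contig3']
```

The union ("and/or") logic is the conservative choice: a contig is only removed when two independent lines of evidence, read-mapping support and expression, both fail it.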
*** The backstory of this manuscript: I wrote it because I (along with several others, including Titus Brown and Richard Smith-Unna) was frustrated with the state of the science of transcriptome assembly. You don't have to look very deeply into the literature (basically pick any transcriptome assembly paper at random) to find examples of people doing really terrible things to their transcriptomes: quality trimming half the read dataset away, assembling with an antiquated assembler, filtering improperly, evaluating assemblies with N50… It's clear many people aren't really thinking about these methods much. I said above that filtering was the Wild West of assembly, but actually it's the whole darn thing. Everybody is doing their own thing at each step, and much of it is really sub-optimal. Methods matter, and this protocol is my initial attempt at getting people to think about methods, or at least to do what I tell them to do (I've thought a lot about the methods).
Like I said above, the protocol is a living document; in other words, I plan to keep it updated, and will be seeking NSF support to do so. Everybody knows this science is moving very rapidly, and what is considered a best practice right now might be different in six months or a year. If you have things you think I should try, open an issue here: https://github.com/macmanes-lab/Oyster_River_Protocol/issues or ping me on Gitter.