I’ve been working on a head to head comparison of the de novo transcriptome assemblers Trinity v.2.2.0 versus the new kids on the block – Shannon and BinPacker. The tl;dr version of the post is that these new assemblers are very good, and should be considered for new assembly projects, with just a couple of caveats.
Shannon is based on an information-theoretic approach to assembly that seeks to establish both necessary and sufficient conditions for optimal assembly as well as algorithms for achieving optimal assembly.
BinPacker models the transcriptome assembly problem as tracking a set of trajectories of items with their sizes representing coverage of their corresponding isoforms by solving a series of bin-packing problems.
Trinity – well, probably no need to describe this one again, as people are generally familiar with this.
Anyway, I used the standard mouse dataset that Trinity uses for benchmarking. The reads can be downloaded with:
curl -LO https://sourceforge.net/projects/trinityrnaseq/files/misc/MouseRNASEQ/mouse_SS_rnaseq.50M.fastqs.tgz
I assembled these reads using BinPacker and Shannon. The Trinity assembly was provided to me by Ben Fulton/Brian Haas. I evaluated each assembly with BUSCO and TransRate. BinPacker was version 1.0, downloaded on 3/4/16 from https://github.com/macmanes-lab/BinPacker (which is a fork of the most recent version on SourceForge). The version of Shannon I used was from the develop branch, specifically at this commit: https://github.com/sreeramkannan/Shannon/commit/428c3106289ce5b658f17f64879e23bbc59d5ad3
BinPacker, run in strand-specific mode
/share/BinPacker/BinPacker -q -d -s fq -p pair -m RF -k 25 -g 200 -o binpacker_mouse \
-l /mouse/trin_mouse/mouse.all.Left.fq \
-r /mouse/trin_mouse/mouse.all.Right.fq
Shannon, non-strand-specific assembly
python /share/Shannon/shannon.py -p 20 -o shannon_trin_noSS --left /mouse/trin_mouse/mouse.all.Left.fq --right /mouse/trin_mouse/mouse.all.Right.fq
Shannon, strand-specific assembly
python /share/Shannon/shannon.py -p 20 -o shannon_trin_SS --left /mouse/trin_mouse/mouse.all.Left.fq --right /mouse/trin_mouse/mouse.all.Right.fq --ss
I’ll post the raw data tables below, but here is the compiled version. (BinPacker and Trinity were run in strand-specific mode.)
Run Time
Trinity = BinPacker = Shannon SS < Shannon non-SS
For ~50M reads, we’re talking about 8 hours for the first three, and about 24 hours for Shannon non-SS. I’ve been talking a lot with Sreeram, the Shannon developer, about this. The SS mode is already very fast, and I’m not sure why the non-SS mode is so much slower.
BUSCO Complete
Shannon non-SS > Trinity = BinPacker = Shannon SS
Here, we’re talking about 69%-73%. All of them are pretty good.
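For anyone who wants to recompute these percentages, here is a minimal Python sketch. The counts are copied from the raw BUSCO tables further down in this post; the function name `busco_percentages` is a hypothetical helper for illustration, not part of BUSCO itself.

```python
# Counts copied from the raw BUSCO tables in this post:
# (complete single-copy, complete duplicated, fragmented, missing)
busco_counts = {
    "BinPacker": (1504, 660, 114, 745),
    "Trinity": (1389, 724, 148, 762),
    "Shannon SS": (546, 1553, 161, 763),
    "Shannon non-SS": (769, 1441, 116, 697),
}
TOTAL_GROUPS = 3023  # "Total BUSCO groups searched" in every table


def busco_percentages(single, duplicated, fragmented, missing, total=TOTAL_GROUPS):
    """Return (percent complete, percent duplicated) out of all groups searched."""
    if single + duplicated + fragmented + missing != total:
        raise ValueError("BUSCO categories do not add up to the total")
    complete = 100.0 * (single + duplicated) / total
    dup = 100.0 * duplicated / total
    return complete, dup


for name, counts in busco_counts.items():
    complete, dup = busco_percentages(*counts)
    print(f"{name}: C={complete:.1f}% D={dup:.1f}%")
```

Truncated to whole percentages, these values reproduce the C and D figures in the BUSCO summary lines below.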
TransRate Score
BinPacker > Trinity > Shannon SS > Shannon non-SS
TransRate Optimized Score
Trinity > Shannon SS > BinPacker > Shannon non-SS
Number of reconstructed ‘transcripts’
BinPacker <<< Trinity <<< Shannon SS < Shannon non-SS
Summary
The main issue I have with the new assemblers is their scalability. Neither BinPacker nor Shannon can really handle large datasets at the moment. Any more than 50-100M reads and they seem to choke. This is an issue that both development teams are aware of and are actively working on. Aside from this, both could use checkpoints and better parallelization (i.e., speed!). Signal-to-noise ratio is an issue for Shannon, as is the reconstruction of duplicates (see the BUSCO duplicated fraction: 47-51% for Shannon versus 21-23% for BinPacker and Trinity), and it is too bad that the BUSCO percent complete is lower for the Shannon SS assembly (69%) than for the non-SS assembly (73%).
It occurs to me now more than ever that the ‘best’ assembly is likely to result from merging a bunch of different assemblies. The new tool, transfuse, applied to these 3 assemblies may effectively pull out the best from each, resulting in a better assembly than any of the individual ones. This analysis is running; stay tuned for another blog post ASAP.
BinPacker Data
abyss-fac BinPacker.fa
n n:500 L50 min N80 N50 N20 E-size max sum name
39356 30183 6814 500 1530 3060 5247 3564 17514 64.75e6 BinPacker.fa
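As a reminder of what the N50 and L50 columns mean, here is a minimal Python sketch of the usual definition, restricted to contigs of at least 500 bp (the "min" column above). The function `n50` and the toy contig lengths are made up for illustration; this is not abyss-fac's own code, and its exact tie-breaking may differ.

```python
def n50(lengths, min_len=500):
    """Return (N50, L50) over contigs at least min_len long.

    N50 is the length of the contig at which the running total, taken
    from longest to shortest, first reaches half of the summed length.
    L50 is how many contigs it took to get there.
    """
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    half = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # nothing passed the length filter


# Hypothetical contig lengths; the 300 bp contig is filtered out.
toy_lengths = [300, 500, 800, 1200, 2500, 4000]
print(n50(toy_lengths))
```

On the toy input the kept contigs sum to 9000 bp, and the two longest (4000 + 2500) already exceed half of that, so the N50 is 2500 and the L50 is 2.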
Summarized benchmarks in BUSCO notation:
C:71%[D:21%],F:3.7%,M:24%,n:3023
Representing:
1504 Complete Single-copy BUSCOs
660 Complete Duplicated BUSCOs
114 Fragmented BUSCOs
745 Missing BUSCOs
3023 Total BUSCO groups searched
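If you want to collect the one-line BUSCO summaries from several assemblies into a table, the notation is easy to parse. A minimal Python sketch follows; the regular expression and the function name `parse_busco_summary` are assumptions written against the summary lines shown in this post, not an official BUSCO parser.

```python
import re

# Written against lines like "C:71%[D:21%],F:3.7%,M:24%,n:3023"
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return a dict with percentages C, D, F, M (floats) and n (int)."""
    match = BUSCO_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


print(parse_busco_summary("C:71%[D:21%],F:3.7%,M:24%,n:3023"))
```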
[ INFO] 2016-03-08 11:18:17 : fragments 52645238
[ INFO] 2016-03-08 11:18:17 : fragments mapped 43605273
[ INFO] 2016-03-08 11:18:17 : p fragments mapped 0.83
[ INFO] 2016-03-08 11:18:17 : good mappings 38104840
[ INFO] 2016-03-08 11:18:17 : p good mapping 0.72
[ INFO] 2016-03-08 11:18:17 : bad mappings 5500433
[ INFO] 2016-03-08 11:18:17 : potential bridges 20876
[ INFO] 2016-03-08 11:18:17 : bases uncovered 1011715
[ INFO] 2016-03-08 11:18:17 : p bases uncovered 0.01
[ INFO] 2016-03-08 11:18:17 : contigs uncovbase 24873
[ INFO] 2016-03-08 11:18:17 : p contigs uncovbase 0.63
[ INFO] 2016-03-08 11:18:17 : contigs uncovered 385
[ INFO] 2016-03-08 11:18:17 : p contigs uncovered 0.01
[ INFO] 2016-03-08 11:18:17 : contigs lowcovered 22996
[ INFO] 2016-03-08 11:18:17 : p contigs lowcovered 0.58
[ INFO] 2016-03-08 11:18:17 : contigs segmented 3325
[ INFO] 2016-03-08 11:18:17 : p contigs segmented 0.08
[ INFO] 2016-03-08 11:18:17 : Read metrics done in 1580 seconds
[ INFO] 2016-03-08 11:18:17 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-08 11:18:17 : TRANSRATE ASSEMBLY SCORE 0.2836
[ INFO] 2016-03-08 11:18:17 : -----------------------------------
[ INFO] 2016-03-08 11:18:17 : TRANSRATE OPTIMAL SCORE 0.3465
[ INFO] 2016-03-08 11:18:17 : TRANSRATE OPTIMAL CUTOFF 0.2549
[ INFO] 2016-03-08 11:18:18 : good contigs 31583
[ INFO] 2016-03-08 11:18:18 : p good contigs 0.8
Trinity Data
abyss-fac Trinity.fasta
n n:500 L50 min N80 N50 N20 E-size max sum name
80922 36311 8127 500 1246 2573 4592 3125 15366 67.12e6 Trinity.fasta
#Summarized BUSCO benchmarking for file: Trinity.fasta
#BUSCO was run in mode: trans
Summarized benchmarks in BUSCO notation:
C:69%[D:23%],F:4.8%,M:25%,n:3023
Representing:
1389 Complete Single-copy BUSCOs
724 Complete Duplicated BUSCOs
148 Fragmented BUSCOs
762 Missing BUSCOs
3023 Total BUSCO groups searched
[ INFO] 2016-03-09 10:04:52 : fragments 52645238
[ INFO] 2016-03-09 10:04:52 : fragments mapped 43243512
[ INFO] 2016-03-09 10:04:52 : p fragments mapped 0.82
[ INFO] 2016-03-09 10:04:52 : good mappings 37043549
[ INFO] 2016-03-09 10:04:52 : p good mapping 0.7
[ INFO] 2016-03-09 10:04:52 : bad mappings 6199963
[ INFO] 2016-03-09 10:04:52 : potential bridges 39139
[ INFO] 2016-03-09 10:04:52 : bases uncovered 5012605
[ INFO] 2016-03-09 10:04:52 : p bases uncovered 0.06
[ INFO] 2016-03-09 10:04:52 : contigs uncovbase 46089
[ INFO] 2016-03-09 10:04:52 : p contigs uncovbase 0.57
[ INFO] 2016-03-09 10:04:52 : contigs uncovered 4457
[ INFO] 2016-03-09 10:04:52 : p contigs uncovered 0.06
[ INFO] 2016-03-09 10:04:52 : contigs lowcovered 58918
[ INFO] 2016-03-09 10:04:52 : p contigs lowcovered 0.73
[ INFO] 2016-03-09 10:04:52 : contigs segmented 3940
[ INFO] 2016-03-09 10:04:52 : p contigs segmented 0.05
[ INFO] 2016-03-09 10:04:52 : Read metrics done in 1768 seconds
[ INFO] 2016-03-09 10:04:52 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-09 10:04:53 : TRANSRATE ASSEMBLY SCORE 0.1241
[ INFO] 2016-03-09 10:04:53 : -----------------------------------
[ INFO] 2016-03-09 10:04:53 : TRANSRATE OPTIMAL SCORE 0.3793
[ INFO] 2016-03-09 10:04:53 : TRANSRATE OPTIMAL CUTOFF 0.4422
[ INFO] 2016-03-09 10:04:53 : good contigs 34896
[ INFO] 2016-03-09 10:04:53 : p good contigs 0.43
Shannon SS Data
abyss-fac shannon.fasta
n n:500 L50 min N80 N50 N20 E-size max sum name
141089 95941 23199 500 1646 3067 5203 3565 22986 218.6e6 shannon.fasta
#BUSCO was run in mode: trans
Summarized benchmarks in BUSCO notation:
C:69%[D:51%],F:5.3%,M:25%,n:3023
Representing:
546 Complete Single-copy BUSCOs
1553 Complete Duplicated BUSCOs
161 Fragmented BUSCOs
763 Missing BUSCOs
3023 Total BUSCO groups searched
[ INFO] 2016-03-16 10:15:02 : -----------------------------------
[ INFO] 2016-03-16 10:15:02 : fragments 52645238
[ INFO] 2016-03-16 10:15:02 : fragments mapped 43275563
[ INFO] 2016-03-16 10:15:02 : p fragments mapped 0.82
[ INFO] 2016-03-16 10:15:02 : good mappings 37376795
[ INFO] 2016-03-16 10:15:02 : p good mapping 0.71
[ INFO] 2016-03-16 10:15:02 : bad mappings 5898768
[ INFO] 2016-03-16 10:15:02 : potential bridges 42051
[ INFO] 2016-03-16 10:15:02 : bases uncovered 79375764
[ INFO] 2016-03-16 10:15:02 : p bases uncovered 0.34
[ INFO] 2016-03-16 10:15:02 : contigs uncovbase 109024
[ INFO] 2016-03-16 10:15:02 : p contigs uncovbase 0.77
[ INFO] 2016-03-16 10:15:02 : contigs uncovered 37726
[ INFO] 2016-03-16 10:15:02 : p contigs uncovered 0.27
[ INFO] 2016-03-16 10:15:02 : contigs lowcovered 113452
[ INFO] 2016-03-16 10:15:02 : p contigs lowcovered 0.8
[ INFO] 2016-03-16 10:15:02 : contigs segmented 8526
[ INFO] 2016-03-16 10:15:02 : p contigs segmented 0.06
[ INFO] 2016-03-16 10:15:02 : Read metrics done in 2874 seconds
[ INFO] 2016-03-16 10:15:02 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-16 10:15:02 : TRANSRATE ASSEMBLY SCORE 0.0875
[ INFO] 2016-03-16 10:15:02 : -----------------------------------
[ INFO] 2016-03-16 10:15:02 : TRANSRATE OPTIMAL SCORE 0.36
[ INFO] 2016-03-16 10:15:02 : TRANSRATE OPTIMAL CUTOFF 0.3962
[ INFO] 2016-03-16 10:15:03 : good contigs 58011
[ INFO] 2016-03-16 10:15:03 : p good contigs 0.41
Shannon non-SS Data
abyss-fac shannon.fa
n n:500 L50 min N80 N50 N20 E-size max sum name
178823 136990 32463 500 1875 3477 5960 4097 23409 350.6e6 shannon.fa
#BUSCO was run in mode: trans
Summarized benchmarks in BUSCO notation:
C:73%[D:47%],F:3.8%,M:23%,n:3023
Representing:
769 Complete Single-copy BUSCOs
1441 Complete Duplicated BUSCOs
116 Fragmented BUSCOs
697 Missing BUSCOs
3023 Total BUSCO groups searched
[ INFO] 2016-03-19 12:37:02 : fragments 52645238
[ INFO] 2016-03-19 12:37:02 : fragments mapped 43506801
[ INFO] 2016-03-19 12:37:02 : p fragments mapped 0.83
[ INFO] 2016-03-19 12:37:02 : good mappings 37476940
[ INFO] 2016-03-19 12:37:02 : p good mapping 0.71
[ INFO] 2016-03-19 12:37:02 : bad mappings 6029861
[ INFO] 2016-03-19 12:37:02 : potential bridges 39894
[ INFO] 2016-03-19 12:37:02 : bases uncovered 168476247
[ INFO] 2016-03-19 12:37:02 : p bases uncovered 0.46
[ INFO] 2016-03-19 12:37:02 : contigs uncovbase 156980
[ INFO] 2016-03-19 12:37:02 : p contigs uncovbase 0.88
[ INFO] 2016-03-19 12:37:02 : contigs uncovered 75138
[ INFO] 2016-03-19 12:37:02 : p contigs uncovered 0.42
[ INFO] 2016-03-19 12:37:02 : contigs lowcovered 154536
[ INFO] 2016-03-19 12:37:02 : p contigs lowcovered 0.86
[ INFO] 2016-03-19 12:37:02 : contigs segmented 8918
[ INFO] 2016-03-19 12:37:02 : p contigs segmented 0.05
[ INFO] 2016-03-19 12:37:02 : Read metrics done in 1873 seconds
[ INFO] 2016-03-19 12:37:02 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-19 12:37:02 : TRANSRATE ASSEMBLY SCORE 0.0573
[ INFO] 2016-03-19 12:37:02 : -----------------------------------
[ INFO] 2016-03-19 12:37:02 : TRANSRATE OPTIMAL SCORE 0.3274
[ INFO] 2016-03-19 12:37:02 : TRANSRATE OPTIMAL CUTOFF 0.4245
[ INFO] 2016-03-19 12:37:02 : good contigs 54453
[ INFO] 2016-03-19 12:37:02 : p good contigs 0.3
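To compile the TransRate logs above into something comparable, the "name value" lines can be pulled out with a small script. This is a minimal Python sketch written against the log format shown in this post; the regular expression and the function name `parse_transrate_log` are assumptions for illustration, and TransRate also writes its own summary files that may be a better source.

```python
import re

# Matches lines such as "[ INFO] 2016-03-08 11:18:17 : p good mapping 0.72"
# and skips lines that do not end in a number (separators, status messages).
LINE_RE = re.compile(
    r"^\[\s*INFO\]\s+\S+\s+\S+\s+:\s+(.+?)\s+([-+]?\d+(?:\.\d+)?)\s*$"
)


def parse_transrate_log(text):
    """Return {metric name: value} for every 'name value' log line."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


# A few lines copied from the BinPacker log above.
sample = """\
[ INFO] 2016-03-08 11:18:17 : fragments 52645238
[ INFO] 2016-03-08 11:18:17 : p good mapping 0.72
[ INFO] 2016-03-08 11:18:17 : Read metrics done in 1580 seconds
[ INFO] 2016-03-08 11:18:17 : TRANSRATE ASSEMBLY SCORE 0.2836
[ INFO] 2016-03-08 11:18:17 : -----------------------------------
[ INFO] 2016-03-08 11:18:17 : TRANSRATE OPTIMAL SCORE 0.3465
"""
metrics = parse_transrate_log(sample)
print(metrics["TRANSRATE ASSEMBLY SCORE"], metrics["TRANSRATE OPTIMAL SCORE"])
```

Running this over each assembly's log gives one dictionary per assembler, which can then be sorted on whichever metric you care about.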