Optimizing genome assemblies

Posted on Posted in Uncategorized

I am working on a genome assembly- this is a human sized vertebrate genome (Peromyscus eremicus). I work on this species in the context of adaptation to deserts (e.g. extreme heat and aridity), and in the future on its behavior. Aside from the biological goals, I’m interested in showing that large genome assembly projects don’t need huge consortiums or massive budgets. To meet the needs of many projects (e.g. genome for use in studies of (comparative) functional biology), these assemblies can be done with modest budgets by a single lab. This is a work in progress, so we’ll see if I’m right..

The data consist of 5 PE lanes and 1 lane from 2 MP libraries (5kb and 7kb insert) of read length 150nt. All of these are from rapid runs on a HiSeq 2500. In addition to this I have 3x coverage PacBio data fom the new P5-C3 chemistries (read N50=8800nt, max=38kb). As you can see, this is very modest. I’ll do some more sequencing, but we’ll see how far this gets me.

FYI, I’m using a CRAY supercomputer for unitig construction. I’ve split up my input files into 145 equally sized chunks, ran with 1536 nodes — ABySS absolutely flies like this. Unitig construction done in about 75 minutes!!! I use XSEDE Blacklight for the contig construction and scaffolding.

So the question for today is of parameter optimization. I’m using ABySS, and in the process of optimizing k at the unitig stage. The issue I have, and everybody doing genome assembly has, is what parameter to optimize.  See the table below, red number are the ‘optimum’ for a given parameter:

Screen Shot 2014-02-09 at 10.23.07 AM

Optimize contiguity is the goal, but unfortunately (cue Assemblathon 2), different measures of contiguity suggest different values of k.  NG50 suggests k=105, N50 suggests k=75, and finally E-size (from GAGE paper) suggests k=91… Further, given all of the contiguity stats are relatively similar, at least between the larger values for k, maybe I should decide based on the size of the assembly. For ABySS, if a given kmer (or contig) is absent from the unitig assembly, it’s absent from all other stages, so maybe maximizing the size of the assembly (within reason) is something to consider. You can always get rid of contigs later on in the process, I suppose..

So, some outstanding questions:

  1. What k is best? I think I’ll go with 105, thought the kmers 91,93,95 look good, too.
  2. How is the k that maximizes unitig construction related to the k that maximizes contig assembly (and later, scaffolding)? Are these going to be different? If so, how different?
  • Chris

    Interesting post. Any reason you’re starting with k=61? I’d imagine for read lengths of [uncorrected] 150bp length reads, you might want to look in the 40s-50s. I’m aware that there appears to be a clear distribution in each category (which is actually a bit surprising). In my experience, I’ve had trouble with ABySS inflating N50 due to poor handling of low complexity repeats.

  • Rayan

    1) k=105 looks best: highest ng50, highest sum, largest max unitig length. The N80, N50, N20 metrics cannot be compared here, as the assembly sizes are very different.
    2) I’d say that, for a single-k assembler, the k that maximizes unitigs quality is likely the one that will produce the best assembly after scaffolding.

    • Matt MacManes

      Hi Rayan, Good to have my thoughts confirmed. I’ll go with 105! I’m going to do some tests with varying k in the contigging and scaffolding steps.. I’ll report back.