PALADIN: software for rapid functional characterization of metagenomes

At long last, I'd like to introduce PALADIN, software created by a group of UNH researchers: Anthony Westbrook (the lead developer), Jordan Ramsdell, Taruna Aggarwal, Louisa Normington, Dan Bergeron, Kelley Thomas, and myself. It allows for rapid functional characterization of meta-communities using a mapping approach **. It is available at https://github.com/twestbrookunh/paladin, where you will find details on how to install and run it. We'd really love for people to try it out and tell us what works, what sucks, etc. We're in the process of putting together the manuscript, which will be available as a preprint ASAP.

** Here are (some of) the details, in no specific order.

  • PALADIN maps in amino acid space. We capitalize on the fact that the majority of a bacterial genome consists of coding sequence. We smartly find ORFs in reads, translate them to amino acids, and map them to an amino acid reference. This allows us to characterize the functional profile of the metasample, at the cost of taxonomic resolution.
    • This amino acid reference can be 1. Swiss-Prot (preferred, default), 2. A genome with GTF, 3. A transcriptome or other file that consists purely of coding sequences.
  • When running with the UniProt database as a reference, PALADIN will produce a SAM file and a really neat functional report that includes count data, UniProtKB ID, organism, gene name, GO features, pathway info, and a few other things. Our hope is that going from reads to genes in basically one rapid step will enable a novel set of downstream analyses.
  • PALADIN is based heavily on the BWA-MEM algorithm, though with lots of changes related to mapping in amino acid space.
  • Don't trust the taxonomic assignments too much. We know that by mapping in amino acid space, we lose a lot of taxonomic sensitivity. Our philosophical position is that characterizing the taxonomic profile of a metasample has some value, but relatively little when the researcher aims to understand the functional profile of that sample. We're much more interested in function.
  • PALADIN does not use paired-end information, so interleave/cat/merge your PE reads. For functional profiling, the PE info does not seem to add much value, though we'd like to hear if people feel otherwise.
  • PALADIN works best with longer Illumina reads (for instance, 250 bp), but will work with shorter reads.
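
To make the "find ORFs, translate, map" idea above concrete, here is a minimal Python sketch of naive six-frame translation of a single read. This is not PALADIN's actual ORF detection, which is more involved; the function names and the `min_len` cutoff are made up for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built in TCAG order (an assumption: PALADIN may
# handle alternative codes or ambiguity differently).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def candidate_peptides(read, min_len=20):
    """Translate all six frames and keep stop-free stretches >= min_len.

    min_len is a hypothetical cutoff chosen for illustration only.
    """
    read = read.upper()
    peptides = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            # Stop codons translate to '*', so splitting on '*' yields
            # the open stretches within this frame.
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_len:
                    peptides.append(segment)
    return peptides
```

Each peptide returned by `candidate_peptides` would then be a candidate for mapping against the amino acid reference. A short read can yield several candidates across frames, which is part of why picking out the true ORF is hard.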

Future Work

  • ORF detection: This is still relatively slow and insensitive. A lot of development on how to detect true ORFs has already happened, but there is a lot of work yet to be done.
  • Downstream analysis: We'd love help here!! What do you do once you have a list of genes, the GO terms, and pathway info? We're working on neat ways to compare 2+ samples mapped with PALADIN. Are there genes that are unique to one sample or the other? Do abundances change? Basically, how are the functional profiles different?
  • More generally, https://github.com/twestbrookunh/paladin/issues
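
As one possible starting point for the sample comparison questions above, here is a rough Python sketch. It assumes you have already pulled per-protein counts out of each sample's report into a plain dictionary of ID to count; the parsing step is left out because the report layout is not described here. The function names, the pseudocount, and the IDs in the usage example are all made up for illustration.

```python
import math


def relative_abundance(counts):
    """Scale raw counts so they sum to 1, dropping zero-count entries."""
    present = {k: v for k, v in counts.items() if v > 0}
    total = sum(present.values())
    return {k: v / total for k, v in present.items()}


def compare_profiles(counts_a, counts_b, pseudo=1e-6):
    """Compare two samples' count profiles.

    Returns the IDs seen only in A, the IDs seen only in B, and the log2
    fold change (B over A) of relative abundance for the shared IDs.
    The pseudocount is an arbitrary illustrative choice.
    """
    ra = relative_abundance(counts_a)
    rb = relative_abundance(counts_b)
    only_a = sorted(set(ra) - set(rb))
    only_b = sorted(set(rb) - set(ra))
    log2fc = {
        k: math.log2((rb[k] + pseudo) / (ra[k] + pseudo))
        for k in set(ra) & set(rb)
    }
    return only_a, only_b, log2fc


# Usage with invented IDs and counts:
sample_a = {"PROT_A": 10, "PROT_B": 30}
sample_b = {"PROT_B": 30, "PROT_C": 10}
only_a, only_b, log2fc = compare_profiles(sample_a, sample_b)
```

Normalizing to relative abundance keeps samples with different sequencing depths comparable, but this is only a sketch; a real comparison would want replicates and a proper statistical test.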

Please leave comments here, file an issue on GitHub, ask questions on Twitter, etc.