I\’ve been working on a project where I need to identify orthogroups – We used to use OrthoMCL for this, but it\’s very slooow. When I read a few months ago about OrthoFinder (http://www.genomebiology.com/2015/16/1/157, https://github.com/davidemms/OrthoFinder) I was pretty stoked cause it was faster and just as accurate or more. I\’ve run into a few issue, and have some solutions, so might as well post them here. Most of the problems have been in going from the huge list of Orthogroups that OrthoFinder spits out, to single copy orthologs, which is what we want for building trees and testing for selection. So, this is my current workflow. The key to this working, and I think for easy OrthoFinder use is to name your files, and the seqeunces within this file something useful.
So, in a folder pep_files
, I make sure all the files are named Genus_species.pep
and in a seperate folder cds_files
I have all the files names Genus_species.cds
. I go into each folder and rename the sequences within. Note this only works when there is 1:1 correspondence between your pep and cds files. This is the case for Maker produced files, but other scenarios may need some different naming convention.
for i in $(ls *pep); do F=<code>MARKDOWN_HASHf394e3c052bb32c38644cee2484e9fb1MARKDOWN_HASH</code>; awk \'/^&gt;/{$0="&gt;\'$F\'_"++i}1\' $i &gt; $F.fasta; done
and
for i in $(ls *cds); do F=<code>MARKDOWN_HASHf115f33dff07a82bdde74b68b3bfa288MARKDOWN_HASH</code>; awk \'/^&gt;/{$0="&gt;\'$F\'_"++i}1\' $i &gt; $F.fasta; done
This renames all the sequences like >Balearica_regulorum_N
where N is an integer from 1 to however many sequences there are in the file.
Then run Orthofinder – the more cores the better
python orthofinder.py -f directory_containing_fasta_pep_files -t number_of_processes
When it is done, in the Results_Date folder
there will be a large file, OrthologousGroups.txt
which has all the Orthogroups. The issue with this file is that there are orthologs AND paralogs all together, and you probably don\’t want the paralogs, but instead the single copy orthologs. If you have named the files and seqeunces like I suggest, then this is pretty easy to figure out. Well, I\’m defining single copy orthologs as cases where there is a single sequence identified in all the species. I suppose there could be some paralogy contamination in there – imagine if a gene were duplicated, then one of the copies was subsequently lost leaving only the duplicated version. Anyway, I hope that is not too common or else another more complicated solution is warranted.
Copy this script – you will have to change the value for NSPECIES
to reflect the number of species you have, and also pay attention to the language about disambiguation, which I describe in the preamble of the script.
</pre> <pre>#!/bin/bash <h1>To run, you must be in the <code>MARKDOWN_HASHfd69c5cf902969e6fb71d043085ddee6MARKDOWN_HASH</code> folder produced by OrthoFinder</h1> # <h1>bash ./sco.sh</h1> # <h1>This script takes the output from OrthoFinder, contained in the OrthologousGroups.txt file, and pulls out the single copy orthologes</h1> <h1>The only thing you have to change is the number of species in your OrthoFinder run, so, change NSPECIES from 2 to whatever number is appropriate</h1> <h1>The only little issue is that you have to have your species names in some cogent way, so that each species is uniquely recognizable</h1> <h1>My protein names are like <code>MARKDOWN_HASHe27c657ff9e4be409e973ded606edf3dMARKDOWN_HASH</code>, where N is an integer.</h1> <h1>To disambiguate <code>MARKDOWN_HASH99d0eec2460ef7583da6862f1dbe64dbMARKDOWN_HASH</code> from <code>MARKDOWN_HASH56373a50c21feefe947b349e3b93aadaMARKDOWN_HASH</code> I need 12 characters, which is why <code>MARKDOWN_HASH6cdc8129c74a40260aacdfd8f87d2fe4MARKDOWN_HASH</code> is 12 and not something else.</h1> <h1>look at <code>MARKDOWN_HASH337e76bddfe58264cd5bb7b66b659ca4MARKDOWN_HASH</code> to figure this out.</h1> NSPECIES=47 #change this to however many species you are comparing. TRUNC=12 <h2>#</h2> <h3>No Editing below here</h3> <h2>#</h2> min=$(expr $NSPECIES / 2 + 2) equal=$(expr $NSPECIES + 1) input=OrthologousGroups.txt END=$(wc -l $input | awk \'{print $1}\') START=1 LIMIT=$(expr $NSPECIES + 2) rm SCOs.txt 2> /dev/null for i in $(eval echo "{$START..$END}") ; do sed -n \'\'$i\'p\' $input | awk \'!$var\' var="$LIMIT" | awk \'!array[$0]++\' | tr -s \' \' \\n | awk \'{print substr($0,0,tr)}\' tr="$TRUNC" | sort -u | wc -l > 1.txt; sed -n \'\'$i\'p\' $input | awk \'!$var\' var="$LIMIT" | awk \'!array[$0]++\' | tr -s \' \' \\n | awk \'{print substr($0,0,tr)}\' tr="$TRUNC" | wc -l > 2.txt; if [ $(cat 1.txt) -eq $(cat 2.txt) ] && [ $(cat 1.txt) -gt "$min" ]; then sed -n \'\'$i\'p\' $input >> SCOs.txt; fi ; done rm 1.txt 2.txt</pre> <pre>
known issues:
- still need to go from the single copy orthologs to fasta files.
In the end, you\’ll have a file with the single copy orthologs present in at least 50% of the species, in a file named SCOs.txt
Hope this is helpful and please let me know if something messed up.