sed and awk for genomics

Posted in Bioinformatics, Illumina

In my continuing quest to conquer fastQ, fastA, SAM & BAM files, I have accumulated several useful ‘tools’. Many of them are included in other software packages (e.g. SAMtools), but for some tasks, especially file management and conversion, no standard toolkit exists, and researchers end up scripting their own solutions.

For me, sed and awk, along with a few other standard *nix tools, have been extremely useful, and I think a few of them will be generally useful to others too. In hopes of helping, I’m going to start a list here. The initial list is short (these are the ones I used in the past week or so, fresh in my mind), but I plan to add and update it as I discover new, better ways of handling these files. Most of these were inspired by things I’ve seen on the internet, so if you have something you want to share, please add it in the comments!

#Create an unwrapped fasta file (GNU sed). You should add ambiguous nucleotide codes to the character class if you have them; also, a fasta defline ending in any of ‘ACTGNn-’ will trip the script up, so use caution…

sed -i ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' test.fa
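To sanity-check the one-liner, here is a throwaway example (toy.fa is a made-up file; the `-i` is dropped so the result goes to stdout instead of editing in place):

```shell
# A wrapped two-record FASTA
printf '>seq1\nACGTAC\nGTACGT\n>seq2\nTTTT\n' > toy.fa

# Joins the wrapped sequence lines; deflines are untouched because
# they never end in a plain nucleotide character
sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' toy.fa
# >seq1
# ACGTACGTACGT
# >seq2
# TTTT
```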

#Add /1 to the end of each fastQ defline; good for assemblers that need those to tell left and right files apart. Note that you should change [0-9] to whatever follows your @ (this is machine specific).

sed -i 's_^@[0-9]:.*_&/1_g' left.fq
sed -i 's_^@[0-9]:.*_&/2_g' right.fq
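A quick check on a single made-up record whose defline starts with @<digit>: (again dropping `-i` so the output prints to stdout):

```shell
printf '@7:1101:2179:2155 1:N:0:3\nACGT\n+\nIIII\n' > toy.fq
# Only the defline matches ^@[0-9]:, so /1 is appended there and nowhere else
sed 's_^@[0-9]:.*_&/1_g' toy.fq
# @7:1101:2179:2155 1:N:0:3/1
# ACGT
# +
# IIII
```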

#Deinterleave a shuffled fastq file

sed -n '/2:N/{N;N;N;p;}' shuff.fastq > right.fq
sed -n '/1:N/{N;N;N;p;}' shuff.fastq > left.fq
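If your deflines don’t carry the Casava-style 1:N/2:N flags, a position-based alternative is a sketch like this (it assumes strict 4-line records alternating left read then right read):

```shell
# Lines 1-4 of every 8 go to left.fq, lines 5-8 to right.fq
awk '{ if ((NR-1) % 8 < 4) print > "left.fq"; else print > "right.fq" }' shuff.fastq
```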

#Get the sum of a column (column 2, $2, in this case)

awk '{add +=$2} END {print add}'
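For example, on two made-up rows:

```shell
# 2 + 4 = 6
printf 'a 2\nb 4\n' | awk '{add +=$2} END {print add}'
# prints 6
```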

#Get the (population) SD of a column (column 2, $2, in this case)

awk '{sum+=$2; array[NR]=$2} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2);} print sqrt(sumsq/NR)}'
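To check the arithmetic on known values (with every row counted), these eight values have mean 5 and a population SD of exactly 2:

```shell
printf 'x 2\nx 4\nx 4\nx 4\nx 5\nx 5\nx 7\nx 9\n' |
  awk '{sum+=$2; array[NR]=$2} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2);} print sqrt(sumsq/NR)}'
# prints 2
```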

#Print the mean of a column (column 2, $2, in this case)

awk '{sum+=$2} END { print "Average = ",sum/NR}'
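And a quick check on made-up data:

```shell
# The mean of 2 and 3 is 2.5
printf 'a 2\nb 3\n' | awk '{sum+=$2} END { print "Average = ",sum/NR}'
```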

#Remove duplicate entries in column 10, and print the deduplicated spreadsheet as unique10.txt

sort -k10 test.txt | awk '!a[$10]++' > unique10.txt
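On a made-up three-row, ten-column table, only the first row per duplicate key in column 10 survives:

```shell
# Rows 1 and 3 share the key X in column 10
printf 'a1 a2 a3 a4 a5 a6 a7 a8 a9 X\nb1 b2 b3 b4 b5 b6 b7 b8 b9 Y\nc1 c2 c3 c4 c5 c6 c7 c8 c9 X\n' > test10.txt
sort -k10 test10.txt | awk '!a[$10]++'
# a1 a2 a3 a4 a5 a6 a7 a8 a9 X
# b1 b2 b3 b4 b5 b6 b7 b8 b9 Y
```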

#Retain entries in a spreadsheet where column 3 is at least 300 (imagine culling by sequence length)

awk '300>$3{next}1' table.txt > morethan300.txt
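Note the pattern keeps values equal to 300 as well; for example:

```shell
# Rows below 300 in column 3 are skipped; 300 and above pass through
printf 's1 x 299\ns2 x 300\ns3 x 512\n' | awk '300>$3{next}1'
# s2 x 300
# s3 x 512
```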

#Extract fastQ from a bunch of BAM files; requires Picard (sorry, no sed or awk in this one)

for i in *.bam;
do F=`basename $i .bam`;
java -Xmx9g -jar ~/software/SamFormatConverter.jar MAX_RECORDS_IN_RAM=1000000 INPUT=$i OUTPUT=/dev/stdout | java -Xmx9g -jar ~/software/SamToFastq.jar MAX_RECORDS_IN_RAM=1000000 INPUT=/dev/stdin FASTQ=$F.1.fq SECOND_END_FASTQ=$F.2.fq;
done