I hate it when GREP doesn’t work like I want it to!


OK, so I have been fighting with GREP all afternoon, and I'm about to kick it in the teeth! Somehow, there are more lines in my output file than in my query file. This is driving me crazy because I can't figure out why!

Here is the simple enough command:

cat query.txt | sort -k1 | awk '{print $1}' | grep -wf - subject.txt > out.txt

>head query.txt
comp10000_c0_seq1 0
comp10002_c0_seq1 0
comp10003_c0_seq1 0
comp10004_c0_seq1 0
comp10005_c0_seq1 0
comp10007_c0_seq1 0
comp1000_c0_seq1 0
comp10011_c0_seq1 0
comp10013_c0_seq1 0
comp10014_c0_seq1 0

>head subject.txt
comp10000_c0_seq1 comp1898_c0_seq2 100.00 5407 0 0 1 5407 1 5407 0.0 9985
comp10002_c0_seq1 comp8374_c0_seq1 100.00 754 0 0 1 754 1 754 0.0 1393
comp10003_c0_seq1 comp8423_c0_seq1 100.00 4387 0 0 1 4387 1 4387 0.0 8102
comp10004_c0_seq1 comp8084_c0_seq1 100.00 3036 0 0 1 3036 1 3036 0.0 5607
comp10005_c0_seq1 comp8387_c0_seq1 100.00 2122 0 0 1 2122 1 2122 0.0 3919
comp10007_c0_seq1 comp8168_c0_seq1 100.00 1141 0 0 1 1141 1 1141 0.0 2108
comp1000_c0_seq1 comp23962_c0_seq1 100.00 326 0 0 1 326 1 326 2e-172 603
comp10011_c0_seq1 comp2125_c0_seq1 100.00 333 0 0 1 333 718 386 3e-176 616
comp10013_c0_seq1 comp8442_c0_seq1 100.00 2745 0 0 1 2745 1 2745 0.0 5070
comp10014_c0_seq1 comp8362_c0_seq1 100.00 1335 0 0 1 1335 1 1335 0.0 2466

>head out.txt
comp10000_c0_seq1 comp1898_c0_seq2 100.00 5407 0 0 1 5407 1 5407 0.0 9985
comp10002_c0_seq1 comp8374_c0_seq1 100.00 754 0 0 1 754 1 754 0.0 1393
comp10003_c0_seq1 comp8423_c0_seq1 100.00 4387 0 0 1 4387 1 4387 0.0 8102
comp10004_c0_seq1 comp8084_c0_seq1 100.00 3036 0 0 1 3036 1 3036 0.0 5607
comp10005_c0_seq1 comp8387_c0_seq1 100.00 2122 0 0 1 2122 1 2122 0.0 3919
comp10007_c0_seq1 comp8168_c0_seq1 100.00 1141 0 0 1 1141 1 1141 0.0 2108
comp1000_c0_seq1 comp23962_c0_seq1 100.00 326 0 0 1 326 1 326 2e-172 603
comp10011_c0_seq1 comp2125_c0_seq1 100.00 333 0 0 1 333 718 386 3e-176 616
comp10013_c0_seq1 comp8442_c0_seq1 100.00 2745 0 0 1 2745 1 2745 0.0 5070
comp10014_c0_seq1 comp8362_c0_seq1 100.00 1335 0 0 1 1335 1 1335 0.0 2466

>wc -l query.txt subject.txt out.txt
22885 query.txt
23560 subject.txt
23560 out.txt

So in theory, the IDs in query are a subset of the IDs in subject, so there should be no more than 22885 hits in the output file. And since the -w option makes GREP match whole words only, partial ID matches shouldn't be adding extra hits.

Nevertheless, I scanned these files for duplicates, and found none…


cat query.txt | sort -k1 | awk '!a[$1]++' | wc -l
22885
cat subject.txt | sort -k1 | awk '!a[$1]++' | wc -l
23560
cat out.txt | sort -k1 | awk '!a[$1]++' | wc -l
23560

No duplicates…
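For what it's worth, here is a small sketch of what that check does, on made-up toy data (the file demo.txt and its IDs are hypothetical, not from my real files). The awk '!seen[$1]++' idiom keeps only the first line for each column-1 value, so if the count it reports equals the plain line count, column 1 has no repeats. As a cross-check, cut | sort | uniq -d lists any repeated IDs by name.

```shell
# Toy file (hypothetical): id1 appears twice in column 1.
tmp_dup=$(mktemp -d)
printf 'id1 x\nid2 y\nid1 z\n' > "$tmp_dup/demo.txt"

# Number of distinct column-1 values: awk keeps the first line per ID.
distinct=$(awk '!seen[$1]++' "$tmp_dup/demo.txt" | wc -l)

# Total number of lines, for comparison.
total=$(wc -l < "$tmp_dup/demo.txt")

# Names of any IDs that occur more than once in column 1.
dups=$(cut -d' ' -f1 "$tmp_dup/demo.txt" | sort | uniq -d)

echo "distinct=$distinct total=$total duplicated: $dups"
```

If distinct and total are equal and the last command prints nothing, column 1 really has no duplicates. Note that this only looks at column 1; it says nothing about IDs that show up in other columns.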

So I’m stumped..
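
One thing that might be worth checking (this is a guess, not something I have confirmed on the real files): grep -w matches a whole word anywhere on the line, not only in column 1, and subject.txt also has comp IDs in column 2. So a query ID could pull in a subject line whose own column-1 ID is not in query.txt. The sketch below shows that effect on made-up toy data (the compA/compB/compC/compD IDs and temp files are hypothetical), and then an awk lookup that compares column 1 only.

```shell
# Toy data (hypothetical): one query ID, three subject lines.
tmp_grep=$(mktemp -d)
printf 'compA_c0_seq1 0\n' > "$tmp_grep/query.txt"
printf '%s\n' \
  'compA_c0_seq1 compB_c0_seq1 100.00' \
  'compB_c0_seq1 compA_c0_seq1 100.00' \
  'compC_c0_seq1 compD_c0_seq1 100.00' > "$tmp_grep/subject.txt"

# Same shape as my pipeline: column-1 IDs become the grep patterns.
# grep -w matches the ID as a whole word in ANY column.
awk '{print $1}' "$tmp_grep/query.txt" \
  | grep -wf - "$tmp_grep/subject.txt" > "$tmp_grep/out.txt"
grep_hits=$(wc -l < "$tmp_grep/out.txt")

# Alternative: load the query IDs into an awk array, then keep only
# subject lines whose column 1 is one of those IDs.
awk 'NR==FNR {ids[$1]; next} $1 in ids' \
  "$tmp_grep/query.txt" "$tmp_grep/subject.txt" > "$tmp_grep/col1.txt"
col1_hits=$(wc -l < "$tmp_grep/col1.txt")

echo "grep -w hits: $grep_hits, column-1 hits: $col1_hits"
```

On the toy data, grep -w returns two lines from a single query ID, because the second subject line carries that ID in column 2, while the awk version returns only the one line where the ID is in column 1. Whether that is what is happening with my real files I can't say without running the same comparison on them.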