I hate it when GREP doesn’t work like I want it to!

OK, so I have been fighting with GREP all afternoon, and I’m about to kick it in the teeth! Somehow, there are more lines in my output file than in my query file. This is driving me crazy because I can’t figure out why!

Here is the simple enough command:
sort -k1 query.txt | awk '{print $1}' | grep -wf - subject.txt > out.txt


>head query.txt

comp10000_c0_seq1 0

comp10002_c0_seq1 0

comp10003_c0_seq1 0

comp10004_c0_seq1 0

comp10005_c0_seq1 0

comp10007_c0_seq1 0

comp1000_c0_seq1 0

comp10011_c0_seq1 0

comp10013_c0_seq1 0

comp10014_c0_seq1 0
>head subject.txt

comp10000_c0_seq1 comp1898_c0_seq2 100.00 5407 0 0 1 5407 1 5407 0.0 9985

comp10002_c0_seq1 comp8374_c0_seq1 100.00 754 0 0 1 754 1 754 0.0 1393

comp10003_c0_seq1 comp8423_c0_seq1 100.00 4387 0 0 1 4387 1 4387 0.0 8102

comp10004_c0_seq1 comp8084_c0_seq1 100.00 3036 0 0 1 3036 1 3036 0.0 5607

comp10005_c0_seq1 comp8387_c0_seq1 100.00 2122 0 0 1 2122 1 2122 0.0 3919

comp10007_c0_seq1 comp8168_c0_seq1 100.00 1141 0 0 1 1141 1 1141 0.0 2108

comp1000_c0_seq1 comp23962_c0_seq1 100.00 326 0 0 1 326 1 326 2e-172 603

comp10011_c0_seq1 comp2125_c0_seq1 100.00 333 0 0 1 333 718 386 3e-176 616

comp10013_c0_seq1 comp8442_c0_seq1 100.00 2745 0 0 1 2745 1 2745 0.0 5070

comp10014_c0_seq1 comp8362_c0_seq1 100.00 1335 0 0 1 1335 1 1335 0.0 2466
>head out.txt

comp10000_c0_seq1 comp1898_c0_seq2 100.00 5407 0 0 1 5407 1 5407 0.0 9985

comp10002_c0_seq1 comp8374_c0_seq1 100.00 754 0 0 1 754 1 754 0.0 1393

comp10003_c0_seq1 comp8423_c0_seq1 100.00 4387 0 0 1 4387 1 4387 0.0 8102

comp10004_c0_seq1 comp8084_c0_seq1 100.00 3036 0 0 1 3036 1 3036 0.0 5607

comp10005_c0_seq1 comp8387_c0_seq1 100.00 2122 0 0 1 2122 1 2122 0.0 3919

comp10007_c0_seq1 comp8168_c0_seq1 100.00 1141 0 0 1 1141 1 1141 0.0 2108

comp1000_c0_seq1 comp23962_c0_seq1 100.00 326 0 0 1 326 1 326 2e-172 603

comp10011_c0_seq1 comp2125_c0_seq1 100.00 333 0 0 1 333 718 386 3e-176 616

comp10013_c0_seq1 comp8442_c0_seq1 100.00 2745 0 0 1 2745 1 2745 0.0 5070

comp10014_c0_seq1 comp8362_c0_seq1 100.00 1335 0 0 1 1335 1 1335 0.0 2466

>wc -l query.txt subject.txt out.txt
22885 query.txt
23560 subject.txt
23560 out.txt

So in theory, query is a subset of subject, so I expected no more than 22885 hits in the outfile. I also expected the -w option in GREP to rule out duplicate matches.
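One thing worth checking: grep prints every line of subject.txt that matches any pattern, so the ceiling on out.txt is the size of subject.txt (23560), not the size of query.txt. With -w, an ID also counts as a hit when it appears as a whole word anywhere on the line, including the second column. The sketch below assumes the goal was to keep only the subject lines whose first column is listed in query.txt. The function name filter_by_first_column is made up for illustration.

```shell
# Hypothetical helper: print the lines of the subject file whose FIRST
# whitespace-separated field is an ID found in column 1 of the query file.
#   $1 = query file, $2 = subject file
filter_by_first_column() {
    awk '
        # First file (query): remember each non-empty ID, then skip to next line.
        NR == FNR { if (NF) ids[$1] = 1; next }
        # Second file (subject): print the line only if column 1 is a known ID.
        $1 in ids
    ' "$1" "$2"
}

# Assumed usage with the files from this post:
#   filter_by_first_column query.txt subject.txt > out.txt
```

Even with this filter the output can have more lines than query.txt if an ID sits in the first column of several subject lines, so the line count alone does not show that anything went wrong.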

Nevertheless, I scanned these files for duplicates, and found none…

No duplicates…
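A check for identical lines would not catch the case where the same ID starts several different lines. As a rough sketch, assuming the ID is always the first whitespace-separated field, the two hypothetical helpers below count the distinct IDs in a file and list the IDs that occur on more than one line.

```shell
# Hypothetical helper: number of distinct IDs in column 1 of file $1
# (blank lines are ignored).
count_distinct_ids() {
    awk 'NF { print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: print "count id" for every column-1 ID that appears
# on more than one line of file $1, most frequent first.
repeated_ids() {
    awk 'NF { n[$1]++ }
         END { for (id in n) if (n[id] > 1) print n[id], id }' "$1" |
        sort -k1,1nr -k2,2
}

# Assumed usage with the files from this post:
#   count_distinct_ids subject.txt
#   repeated_ids subject.txt | head
```

If count_distinct_ids reports fewer IDs than wc -l reports lines for subject.txt, some IDs span several lines, which would account for an output longer than the query list.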

So I’m stumped.
