find non-matching lines of two files in bash

I'm still new to bash and I've found similar questions to mine, but I still can't solve my problem. I have two files with 2 columns each, separated by a space.
file 1:
1 AGCATTTTTCAAACGAAAGATTTACTACCGATGTGT
2 TGCTCACCAACAAAAACAGGCGTCTCAGCAGCAGCA
3 GATCGAACCGGCTGCCTACTGCGTGTAAAGCCGCCC
4 CCGACACAGAGAACATTAGAATACTCAGAGCCATNN
5 TAAGCCTGAGCCTAAACCTAAGCCTAAACATAAGAA
6 AGCAGAGAAGAGATGAGTTGTCGAGTGAGGCGTAAG
7 AACGTTGAAAAATTATCCCGTCAACAGTCTCCAGAA
8 GCCAGAGAGTAAAATATTGGGTGAAGCCAGAGAGTA
9 TGCTCACCAACAAAAACAGGCGTCTCAGCAGCAGCA
file 2:
1 AGCATTTTTCAAACGAAAGATTTACTACCGATGTGT
2 TGCTCACCAACAAAAACAGGCGTCTCAGCAGCAGCA
3 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
4 CCGACACAGAGAACATTAGAATACTCAGAGCCATNN
5 TAAGCCTGAGCCTAAACCTAAGCCTAAACATAAGAA
6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
7 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
8 GCCAGAGAGTAAAATATTGGGTGAAGCCAGAGAGTA
9 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
I'd like to compare only the second columns of each file, line by line, and output a third file with only the non-matching lines.
output:
3 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
7 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
9 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

You can use awk:
awk 'NR==FNR{a[$2];next} !($2 in a)' file1 file2
3 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
7 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
9 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Explanation:
NR == FNR { # While processing the first file
a[$2] # just push the second field in an array
next # move to next record of first file
}
!($2 in a) # print lines from file2 if array a doesn't contain their second field
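If you need a strict positional comparison (line n of file1 against line n of file2) rather than set membership, a minimal sketch using paste, assuming both files have the same number of lines and contain no tabs:
paste file1 file2 | awk '$2 != $4 { print $3, $4 }' > file3
Here $2 and $4 are the sequences from file1 and file2 respectively, and file3 is just an example name for the output file.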

grep -vf file1 file2
Output:
3 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
7 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
9 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
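Note that -f takes each whole line of file1 as a pattern and matches it anywhere within the lines of file2, which happens to be good enough for this data. A slightly stricter variation (my own sketch, not part of the original answer) treats the patterns as fixed strings and requires whole-line matches:
grep -vxFf file1 file2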

You could use diff for this. diff will print out the differences between two files.
/test>diff file1 file2
3c3
< 3 GATCGAACCGGCTGCCTACTGCGTGTAAAGCCGCCC
---
> 3 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
6,7c6,7
< 6 AGCAGAGAAGAGATGAGTTGTCGAGTGAGGCGTAAG
< 7 AACGTTGAAAAATTATCCCGTCAACAGTCTCCAGAA
---
> 6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> 7 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
9c9
< 9 TGCTCACCAACAAAAACAGGCGTCTCAGCAGCAGCA
---
> 9 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Grepping for just differences from the second file:
/test>diff file1 file2 | grep ">"
> 3 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> 6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> 7 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> 9 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
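To get rid of the leading "> " marker and write the third file the question asks for, one possible follow-up (file3 is just an example name):
diff file1 file2 | sed -n 's/^> //p' > file3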

Related

sed with a comment in macOS gives me an error

I can't find any resources about this issue.
macOS and Ubuntu both give me the same result, as below.
>>> seq 10 | sed '
3p
6d
'
1
2
3
3
4
5
7
8
9
10
but when I insert a comment, sed on macOS gives me an error.
>>> seq 10 | sed '
3p # print 3rd line
6d # print 6th line
'
sed: 2: "
3p # print 3rd line
6 ...": extra characters at the end of p command
Is a comment not supported in macOS? Or did I make some mistake?
Please let me know, thank you.
Any time you use sed for more than s/old/new/ you're using the wrong tool and probably using non-portable constructs. Just use awk for portability, clarity, efficiency, robustness, etc. This will work using any awk in any shell on every UNIX box:
$ seq 10 | awk '
NR == 3 { print }
NR == 6 { print }
{ print }
'
1
2
3
3
4
5
6
6
7
8
9
10
No comments required because the code is clear. You can add comments if you like of course:
$ seq 10 | awk '
NR == 3 { print } # print 3rd line
NR == 6 { print } # print 6th line
{ print } # print all lines
'
1
2
3
3
4
5
6
6
7
8
9
10
or if you wanted to delete instead of print the 6th line:
$ seq 10 | awk '
NR == 3 { print } # print 3rd line
NR == 6 { next } # delete 6th line
{ print } # print all lines
'
1
2
3
3
4
5
7
8
9
10
and you can make the code a bit less clear by relying on default behavior if you prefer brevity:
$ seq 10 | awk '
NR == 3 # print 3rd line
NR == 6 # print 6th line
1 # print all lines
'
1
2
3
3
4
5
6
6
7
8
9
10
$ seq 10 | awk '
NR == 3 # print 3rd line
NR != 6 # print all lines except the 6th
'
1
2
3
3
4
5
7
8
9
10
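As for the original question: as far as I know, BSD/macOS sed only recognizes # comments at the start of a script line, while GNU sed also accepts a comment after a command, which is why the inline comments work on Ubuntu but error out on macOS. If you do want to stay with sed, a sketch that should be portable (assuming your sed accepts full-line comments, which POSIX requires) is to put each comment on its own line:
seq 10 | sed '
# print the 3rd line a second time
3p
# delete the 6th line
6d
'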

Combine if and NR in awk

I've been trying to figure out this silly thing with awk for the last few hours, but no luck so far.
I understand how to print every second line, for example:
awk 'NR%2' file
and I also understand how to print lines of a column-based file if one column is within a specific range, for example:
awk '{if ($1 > 'yourvalue') print}' file
What I don't quite get is how to combine the two.
In practice, if I have a file organized as:
1 3 6 8
2 8 4 5
3 9 8 7
4 7 3 5
5 7 3 6
6 2 4 6
7 1 4 7
8 3 2 1
9 7 5 3
10 4 5 6
11 8 2 5
how can I get, for example:
1 3 6 8
3 9 8 7
5 7 3 6
7 1 4 7
8 3 2 1
9 7 5 3
10 4 5 6
11 8 2 5
so print every other line where column 1 is smaller than 7, and print all of the lines where it is 7 or greater.
I tried to combine everything in one single line but I always get errors.
You can reverse the second condition and combine the two with an OR:
awk 'NR%2 || $1>=7' file
1 3 6 8
3 9 8 7
5 7 3 6
7 1 4 7
8 3 2 1
9 7 5 3
10 4 5 6
11 8 2 5
You can combine conditions using && (and) and || (or).
You can use parentheses for nesting conditions.
For example:
awk 'cond1 && (cond2 || cond3)' file
This:
awk '{if ($1 > 7) print}' file
... is equivalent to this:
awk '$1 > 7 { print }' file
... because you can write conditions outside of the {...} to use as filters.
... which is equivalent to:
awk '$1 > 7' file
... because the default action is to print.
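Putting those equivalences together with the accepted one-liner, these two commands do the same thing (shown only as an illustration):
awk '{ if (NR%2 || $1 >= 7) print }' file
awk 'NR%2 || $1 >= 7' file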

Paste every two lines in a file together as one line BASH

My colleague has given me a file in which half of the lines consist of 8 columns of info and the other half consist of the 9th column of info. The two always appear next to each other, e.g.
1 2 3 4 5 6 7 8
1.1
2 3 4 5 6 7 8 9
1.2
...
a b c d e f g h
abcd
I know how to paste every two lines as one and print them out in Python. But I was wondering if it's possible to do that even more conveniently in BASH?
Thanks guys!
You could use sed or awk, as other answers have mentioned. Those answers are all good.
You could also do this easily in pure shell.
$ while read line1; do read line2; echo "$line1 $line2"; done < input.txt
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
Note that whitespace is not preserved.
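A slightly more careful sketch of the same loop (my variation): IFS= and read -r keep leading/trailing whitespace and backslashes intact, and the && stops the loop cleanly if there is no second line left to read:
while IFS= read -r line1 && IFS= read -r line2; do
  echo "$line1 $line2"
done < input.txt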
There's another tool available on most unix-like systems called paste:
$ paste - - < input.txt
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
In this case there is a bigger gap in the first line because paste separates its columns with a tab by default, and a trailing space on the first line of input.txt pushes the separating tab out to the next tab stop. You can read paste's man page for options to control this.
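For example, paste's -d option sets the delimiter list, so to join with a single space instead of a tab:
paste -d' ' - - < input.txt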
Another awk
awk '{f=$0;getline;print f,$0}' file
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
And just for the fun of it, a gnu awk:
awk -v RS="[0-9][.][0-9]" '{$1=$1;print $0,RT}' file
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
Here the record separator (RS) is set to a pattern that matches the value on the second line of each pair.
RT then contains the text that actually matched that separator.
try:
awk '{printf "%s%s",$0,(NR%2?FS:RS)}' file
or:
awk 'NR%2{printf "%s ",$0;next}7' file
test:
kent$ echo "1 2 3 4 5 6 7 8
1.1
2 3 4 5 6 7 8 9
1.2"|awk '{printf "%s%s",$0,(NR%2?FS:RS)}'
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
kent$ echo "1 2 3 4 5 6 7 8
1.1
2 3 4 5 6 7 8 9
1.2"|awk 'NR%2{printf "%s ",$0;next}7'
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
You can use sed:
sed 'N;s/\n/ /' file
or awk:
awk 'NF==1{print $0;next}{printf "%s ",$0}' file

Search replace string in a file based on column in other file

If we have the first file like below:
(a.txt)
1 asm
2 assert
3 bio
4 Bootasm
5 bootmain
6 buf
7 cat
8 console
9 defs
10 echo
and the second like:
(b.txt)
bio cat BIO bootasm
bio defs cat
Bio console
bio BiO
bIo assert
bootasm asm
bootasm echo
bootasm console
bootmain buf
bootmain bio
bootmain bootmain
bootmain defs
cat cat
cat assert
cat assert
and we want the output to be like this:
3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2
We read the second column of the first file; for each of those words, we search every column of every line in the second file and, if it occurs there, we replace it with the number from the first column of the first file. I managed to do it only for the first column; I couldn't do it for the rest.
Here is the command I used:
awk 'NR==FNR{a[$2]=$1;next}{$1=a[$1];}1' a.txt b.txt
3 cat bio bootasm
3 defs cat
3 console
3 bio
3 assert
4 asm
4 echo
4 console
5 buf
5 bio
5 bootmain
5 defs
7 cat
7 assert
7 assert
How should I do it for the other columns?
Thank you
awk 'NR==FNR{h[$2]=$1;next} {for (i=1; i<=NF;i++) $i=h[$i];}1' a.txt b.txt
NR is the global record number (by default, the line number) across all files. FNR is the line number within the current file. The NR==FNR block specifies what to do while the global line number equals the per-file line number, which is only true while processing the first file, i.e., a.txt. The next statement in that block skips the rest of the code, so the for loop only runs for the second file, i.e., b.txt.
First, we process the first file to store the word ids in an associative array: NR==FNR{h[$2]=$1;next}. After that, we can use these ids to map the words in the second file. The for loop (for (i=1; i<=NF;i++) $i=h[$i];) iterates over all columns and sets each column to a number instead of the string, so $i=h[$i] replaces the word in the ith column with its id. Finally, the 1 at the end of the script causes all lines to be printed out.
Produces:
3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2
To make the script case-insensitive, add tolower calls into the array indices:
awk 'NR==FNR{h[tolower($2)]=$1;next} {for (i=1; i<=NF;i++) $i=h[tolower($i)];}1' a.txt b.txt
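One caveat worth noting (my observation, not part of the original answer): a word that does not appear in a.txt at all is replaced by an empty field. A sketch that leaves unknown words untouched instead:
awk 'NR==FNR{h[tolower($2)]=$1;next} {for (i=1; i<=NF; i++) if (tolower($i) in h) $i=h[tolower($i)]} 1' a.txt b.txt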
Divide and conquer! A bit archaic, but it does the job =)
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$1];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 1
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$2];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 2
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$3];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 3
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$4];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 4
paste 1 2 3 4 | tr '\t' ' '
gives:
3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2
In this case I just changed which column is looked up each time and pasted the results together, with a bit of editing in between.
{
cat a.txt; echo "--EndA--";cat b.txt
} | sed -n '1 h
1 !H
$ {
x
: loop
s/^ *\([[:digit:]]\{1,\}\) *\([^[:cntrl:]]*\)\(\n\)\(.*\)\2/\1 \2\3\4\1/
t loop
s/^ *[[:digit:]]\{1,\} *[^[:cntrl:]]*\n//
t loop
s/^[[:space:]]*--EndA--\n//
p
}
'
"--EndA--" could be something else if chance that it will present in one of the file (a.txt mainly)

Split specific column(s)

I have this kind of records:
1 2 12345
2 4 98231
...
I need to split the third column into sub-columns to get this (separated by a single space, for example):
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Can anybody offer me a nice solution in sed, awk, etc.? Thanks!
EDIT: the size of the original third column may vary record by record.
Awk
% echo '1 2 12345
2 4 98231
...' | awk '{
gsub(/./, "& ", $3)
print
}
'
1 2 1 2 3 4 5
2 4 9 8 2 3 1
...
[Tested with GNU Awk 3.1.7]
This takes every character (/./) in the third column ($3) and replaces (gsub()) it with itself followed by a space ("& ") before printing the entire line.
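If the column to split is not always the third one, the same idea works with the column number passed in as an awk variable (a small generalization on my part, not from the original answer):
awk -v col=3 '{ gsub(/./, "& ", $col) } 1' file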
Sed solution:
sed -e 's/\([0-9]\)/\1 /g' -e 's/ \+/ /g'
The first sed expression replaces every digit with the same digit followed by a space. The second expression replaces every block of spaces with a single space, thus handling the double spaces introduced by the previous expression. Note that \+ is a GNU extension, so with non-GNU seds you need a different way of writing "one or more spaces", as in the sketch below.
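A version of the same idea that sticks to plain BRE syntax, so it should also work with non-GNU seds (a space followed by "space, asterisk" matches one or more spaces):
sed -e 's/\([0-9]\)/\1 /g' -e 's/  */ /g' file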
Using awk substr and printf:
[srikanth#myhost ~]$ cat records.log
1 2 12345 6 7
2 4 98231 8 0
[srikanth#myhost ~]$ awk '{ len=length($3); for(i=1; i<=NF; i++) { if(i==3) { for(j = 1; j <= len; j++){ printf substr($3,j,1) " "; } } else { printf $i " "; } } printf("\n"); }' records.log
1 2 1 2 3 4 5 6 7
2 4 9 8 2 3 1 8 0
You can use this for records with more than three columns as well.
Using perl:
perl -pe 's/([0-9])(?! )/\1 /g' INPUT_FILE
Test:
[jaypal:~/Temp] cat tmp
1 2 12345
2 4 98231
[jaypal:~/Temp] perl -pe 's/([0-9])(?! )/\1 /g' tmp
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Using gnu sed:
sed 's/[0-9]/& /3g' INPUT_FILE
Test:
[jaypal:~/Temp] sed 's/[0-9]/& /3g' tmp
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Using gnu awk:
gawk '{print $1,$2,gensub(/./,"& ","G", $NF)}' INPUT_FILE
Test:
[jaypal:~/Temp] gawk '{print $1,$2,gensub(/./,"& ","G", $NF)}' tmp
1 2 1 2 3 4 5
2 4 9 8 2 3 1
If you don't care about spaces, this is a succinct version:
sed 's/[0-9]/& /g'
but if you need to collapse the doubled spaces, just chain another expression (a space followed by "space, asterisk", i.e. one or more spaces):
sed 's/[0-9]/& /g;s/  */ /g'
Note this is compatible with the original sed, thus will run on any UNIX-like system.
$ awk -F '' '$1=$1' data.txt | tr -s ' '
1 2 1 2 3 4 5
2 4 9 8 2 3 1
This might work for you:
echo -e "1 2 12345\n2 4 98231" | sed 's/\B\s*/ /g'
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Most probably GNU sed only.
