find identical keys with different values from two text files - bash

I have two files which have data in a format like this:
cat File1.txt
A: 1
B: 2
C: 3
D: 4
E: 5
cat File2.txt
A: 10
B: 2
C: 30
D: 4
F: 6
I was wondering how I could print the diff for common keys like:
A: 1, 10
C: 3, 30

You could try
awk -F":" 'NR==FNR{a[$1]=$2} FNR!=NR && a[$1] && a[$1]!=$2{print $1":"a[$1]","$2}' File1.txt File2.txt
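For readers who want the idiom unpacked, here is a commented sketch of the same two-file pattern. It assumes the `key: value` layout from the question and unique keys per file. It uses a `($1 in a)` membership test rather than testing the stored value for truth, so the comparison does not depend on whether that value evaluates as true. The scratch directory and the `result` variable are only there to make the sketch self-contained.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'A: 1' 'B: 2' 'C: 3' 'D: 4' 'E: 5' > "$dir/File1.txt"
printf '%s\n' 'A: 10' 'B: 2' 'C: 30' 'D: 4' 'F: 6' > "$dir/File2.txt"

result=$(awk -F': ' '
    NR == FNR { a[$1] = $2; next }   # first file: remember the value for each key
    ($1 in a) && a[$1] != $2 {       # second file: key is known and value differs
        print $1 ": " a[$1] ", " $2
    }
' "$dir/File1.txt" "$dir/File2.txt")

printf '%s\n' "$result"
rm -rf "$dir"
```

`NR == FNR` is only true while the first file is being read, which is what separates the "store" phase from the "compare" phase.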

As it seems there are no duplicates in the file, this should do:
$ awk '{if(($1 in a)&&$2!=a[$1])print $1,a[$1] ", " $2;else a[$1]=$2}' file1 file2
Output:
A: 1, 10
C: 3, 30
Explained:
$ awk '{
if(($1 in a) && $2!=a[$1]) # if $1 already seen and $2 not equal to previous
print $1,a[$1] ", " $2 # output
else
a[$1]=$2 # else store the value as seen for the first time
}' file1 file2

$ cat tst.awk
BEGIN { OFS=", " }
NR==FNR {
a[$1] = $2
next
}
($1 in a) && (a[$1] != $2) {
print $0, a[$1]
}
$ awk -f tst.awk file2 file1
A: 1, 10
C: 3, 30

Related

counting occurrence of character

I have a file that looks like this
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
I need to count the number of times A, B and D occur. Individually, I would do it like this:
awk '{if($1~/A/) print $0 }' < test.txt | wc
awk '{if($1~/B/) print $0 }' < test.txt | wc
awk '{if($1~/D/) print $0 }' < test.txt | wc
How can I combine these so that I can count A, B and D with a single one-liner instead of three separate commands?
For specific line format (where the needed char is before _):
$ awk -F"_" '{ seen[substr($1, length($1))]++ }END{ for(k in seen) print k, seen[k] }' file
A 3
B 2
D 2
Counting occurrences is generally done by keeping a counter. So any single one of the OP's awk lines, such as
awk '{if($1~/A/) print $0}' < test.txt | wc
can be rewritten as
awk '($1~/A/){c++}END{print c}' test.txt
for multiple cases, you can now do:
awk '($1~/A/){c["A"]++}
($1~/B/){c["B"]++}
($1~/D/){c["D"]++}
END{for(i in c) print i,c[i]}' test.txt
Now you can even clean this up a bit more:
awk '{c["A"]+=($1~/A/)}
{c["B"]+=($1~/B/)}
{c["D"]+=($1~/D/)}
END{for(i in c) print i,c[i]}' test.txt
which you can clean up further as:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=($1~a[i])}
END{for(i in c) print i,c[i]}' test.txt
But these cases only count how many lines contain the letter, not how many times the letter occurs. Since gsub returns the number of substitutions it makes, it can be used to count the occurrences themselves:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=gsub(a[i],"",$1)}
END{for(i in c) print i,c[i]}' test.txt
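To make the distinction concrete, the sketch below runs both approaches on a hypothetical input in which one line contains the letter A twice (`chrAA_p1` is invented for illustration and is not in the original data). The letters are printed in a fixed order by looping over the index instead of using `for (i in a)`, whose order is unspecified.

```shell
#!/bin/sh
# Hypothetical input: the first line contains "A" twice.
input=$(printf '%s\n' 'chrAA_p1' 'chr1A_p2' 'chr10B_p1')

# Lines containing each letter (a match counts once per line).
lines=$(printf '%s\n' "$input" | awk '
    BEGIN { n = split("A B D", a) }
    { for (i = 1; i <= n; i++) c[a[i]] += ($1 ~ a[i]) }
    END { for (i = 1; i <= n; i++) print a[i], c[a[i]] + 0 }')

# Total occurrences (gsub returns the number of replacements made).
occurrences=$(printf '%s\n' "$input" | awk '
    BEGIN { n = split("A B D", a) }
    { for (i = 1; i <= n; i++) c[a[i]] += gsub(a[i], "", $1) }
    END { for (i = 1; i <= n; i++) print a[i], c[a[i]] + 0 }')

printf 'lines:\n%s\noccurrences:\n%s\n' "$lines" "$occurrences"
```

On this input the two counts differ only for A, because that is the only letter repeated within a line.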
Perl to the rescue!
perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
-n reads the input line by line
-l removes newlines from input and adds them to output
a hash table %seen is used to keep the number of occurrences of each symbol. Each time it's matched it's captured and the corresponding field in the hash is incremented.
END is run when the file ends. It outputs all the keys of the hash, i.e. the matched characters, each followed by the number of occurrences.
datafile:
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
script.awk
BEGIN {
arr["A"]=0
arr["B"]=0
arr["D"]=0
}
/A/ { arr["A"]++ }
/B/ { arr["B"]++ }
/D/ { arr["D"]++ }
END {
printf "A: %s, B: %s, D: %s\n", arr["A"], arr["B"], arr["D"]
}
execution:
awk -f script.awk datafile
result:
A: 3, B: 2, D: 2

awk - Compare columns from two files and replace text in first file

I have two files. The first has 1 column and the second has 3 columns. I want to compare the first columns of both files. If there is a match, replace columns 2 and 3 with specific values; if not, print the line unchanged.
File 1:
$ cat file1
26
28
30
File 2:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,r,1510139756
27,a,0
28,r,1510244156
29,a,0
30,r,1510157364
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Desired output:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
I am using gawk to do this (it's inside a shell script and I am using Solaris) but I can't get the output right. It only prints the lines that match:
fuente="file2"
gawk -v fuente="$fuente" 'FNR==NR{a[FNR]=$1; next}{print $1,$2="a",$3="0" }' $fuente file1 > file3
The output I got:
$ cat file3
26 a 0
28 a 0
30 a 0
awk one-liner:
awk 'NR==FNR{ a[$1]; next }$1 in a{ $2="a"; $3=0 }1' file1 FS=',' OFS=',' file2
The output:
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
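The one-liner relies on `FS=',' OFS=','` being command-line variable assignments: awk applies them when it reaches them in the argument list, so they take effect for file2 only. A commented sketch of the same logic follows, run against an abbreviated file2 (four of the question's rows) in a scratch directory; the array name `ids` is chosen here for readability.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 26 28 30 > "$dir/file1"
printf '%s\n' '25,a,0' '26,r,1510139756' '27,a,0' '33,r,1510276164' > "$dir/file2"

result=$(awk '
    NR == FNR { ids[$1]; next }       # file1: record each id as an array key
    $1 in ids { $2 = "a"; $3 = 0 }    # file2: reset columns 2 and 3 on a match
    { print }                         # print every file2 line, changed or not
' "$dir/file1" FS=',' OFS=',' "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Only the row whose first column appears in file1 (26) is rewritten; row 33 keeps its `r` and timestamp.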
Spread out for clarity and saved as fuente.awk; call it like so:
awk -F, -v fuente=file1 -f fuente.awk file2 # -F sets the input field separator (FS)
BEGIN {
OFS="," # set OFS to make printing easier
while ((getline x < fuente) > 0) # parenthesized so the > 0 test applies to getline's result; read file into array
{
a[++i]=x # stuff indexed array
}
}
{ # For each line in file2
for (k=1 ; k<=i ; k++) # Loop over array (elements in file1)
{
if (($1==a[k]) && (! flag))
{
print($1,"a",0) # Found print new line
flag=1 # print only once
}
}
if (! flag) # Not found
{
print($0) # print original
}
flag=0 # reset flag
}
END { }

awk: two files are queried

I have two files
file1:
>string1<TAB>Name1
>string2<TAB>Name2
>string3<TAB>Name3
file2:
>string1<TAB>sequence1
>string2<TAB>sequence2
I want to use awk to compare column 1 of respective files. If both files share a column 1 value I want to print column 2 of file1 followed by column 2 of file2. For example, for the above files my expected output is:
Name1<TAB>sequence1
Name2<TAB>sequence2
this is my code:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$2], $2 }' file1 file2 >out
But the only thing I get is an empty first column followed by the sequence.
Where is the error here?
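Your assignment is not right: a[$1] = $1 stores the key itself instead of the name in column 2, and a[$2] looks up file2's second column, which is never a key, so the first output field is empty. Store $2 under $1 and look it up with $1: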
$ awk 'BEGIN {FS=OFS="\t"}
NR==FNR {a[$1]=$2; next}
$1 in a {print a[$1],$2}' file1 file2
Name1 sequence1
Name2 sequence2
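A self-contained check of the corrected lookup is sketched below. It uses printf to write real tab characters, on the assumption that the `<TAB>` placeholders in the question stand for literal tabs; the scratch directory and the array name `name` are only for illustration.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '>string1\tName1\n>string2\tName2\n>string3\tName3\n' > "$dir/file1"
printf '>string1\tsequence1\n>string2\tsequence2\n' > "$dir/file2"

result=$(awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR { name[$1] = $2; next }    # file1: map column 1 to the name in column 2
    $1 in name { print name[$1], $2 }    # file2: look up by column 1, print name and sequence
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

The `>string3` entry produces no output because it has no counterpart in file2.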

find difference and similarities between two text files using awk

I have two files:
file 1
1
2
34:rt
4
file 2
1
2
34:rt
7
I want to display rows that are in file 2 but not in file 1, and vice versa, as well as the values that appear in both text files. So the expected result should look like:
1 in both
2 in both
34:rt in both
4 in file 1
7 in file 2
This is what I have so far but I am not sure if this is the right structure:
awk '
FNR == NR {
a[$0]++;
next;
}
!($0 in a) {
// print not in file 1
}
($0 in a) {
for (i = 0; i <= NR; i++) {
if (a[i] == $0) {
// print same in both
}
}
delete a[$0] # deletes entries which are processed
}
END {
for (rest in a) {
// print not in file 2
}
}' $PWD/file1 $PWD/file2
Any suggestions?
If the order is not relevant then you can do:
awk '
NR==FNR { a[$0]++; next }
{
print $0, ($0 in a ? "in both" : "in file2");
delete a[$0]
}
END {
for(x in a) print x, "in file1"
}' file1 file2
1 in both
2 in both
34:rt in both
7 in file2
4 in file1
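If the order does matter, one possible variant is sketched here. It reads file2 first so that file1's lines can be classified as they stream by, then appends the file2-only lines in their original order. It assumes neither file contains duplicate lines; the array names (`in2`, `in1`, `order`) are made up for this sketch.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 1 2 34:rt 4 > "$dir/file1"
printf '%s\n' 1 2 34:rt 7 > "$dir/file2"

# Note the argument order: file2 is read first, then file1.
result=$(awk '
    NR == FNR { in2[$0]; order[++n] = $0; next }    # remember file2 lines and their order
    { print $0, (($0 in in2) ? "in both" : "in file 1"); in1[$0] }
    END {
        for (i = 1; i <= n; i++)                    # file2-only lines, original order
            if (!(order[i] in in1)) print order[i], "in file 2"
    }
' "$dir/file2" "$dir/file1")

printf '%s\n' "$result"
rm -rf "$dir"
```

The trade-off is that all of file2 is held in memory, once as a set and once as an ordered list.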
Or using comm as suggested by choroba in comments (comm expects sorted input, and --output-delimiter is a GNU extension):
comm --output-delimiter="|" file1 file2 |
awk -F'|' '{print (NF==3 ? $NF " in both" : NF==2 ? $NF " in file2" : $NF " in file1")}'
1 in both
2 in both
34:rt in both
4 in file1
7 in file2

I need to merge two files and create a new file

input 1
1 10611 2 122 C:0.983607 G:0.0163934
input 2
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07
Here the 1st and 2nd columns of both files match, the value before ':' in the 5th column of the first file equals the 4th column of the second file, and the value before ':' in the 6th column of the first file equals the 5th column of the second file. The output is created from this match; the input and output lines should make the idea clear. Both files are .gz files.
output
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
Here's one way using awk:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2
Result:
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
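The one-liner is dense, so here is the same logic spread out with comments. This is a sketch on a shortened second input line (the INFO field is cut down to two entries for readability); the array names `ref`, `alt` and `freq` are renamed from `a`, `b` and `c` for clarity.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' '1 10611 2 122 C:0.983607 G:0.0163934' > "$dir/input1"
printf '%s\n' '1 10611 rs146752890 C G 100 PASS AC=184;AF=0.08' > "$dir/input2"

result=$(awk '
    FNR == NR {
        split($5, ref, ":")    # ref[1] = allele, ref[2] = frequency
        split($6, alt, ":")    # alt[1] = allele, alt[2] = frequency
        # Composite key: awk joins the four subscripts with SUBSEP.
        freq[$1, $2, ref[1], alt[1]] = "REF=" ref[2] ";ALT=" alt[2] ";"
        next
    }
    # Second file: same four-part key, built from columns 1, 2, 4 and 5.
    ($1, $2, $4, $5) in freq { print $0 ";" freq[$1, $2, $4, $5] }
' "$dir/input1" "$dir/input2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Because the alleles are part of the key, a line in the second file only matches when chromosome, position and both alleles agree.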
So for compressed files, try:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz
EDIT:
From the comments below, try this:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' file1 file2
Result:
1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;
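This should work (assuming you have enough disk space to store the expanded .gz files). Note that it joins on columns 1 and 2 only and that the sed patterns are hard-coded for the alleles C and G in the sample line:
zcat 1 | awk '{print $1"_"$2,$0}' | sort -k1,1 > new1
zcat 2 | awk '{print $1"_"$2,$0}' | sort -k1,1 > new2
join -1 1 -2 1 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7" new1 new2 | sed 's/ C:/;REF=/' | sed 's/ G:/;ALT=/' > output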
