merge 2 files based on 1,2,3 column matching in 2 files & print specific columns from 2 files. also retain non-pairing rows from both files in output [duplicate] - bash

file1
rs12345 G C
rs78901 A T
file2
3 22745180 rs12345 G C,G
12 67182999 rs78901 A G,T
desired output
3 22745180 rs12345 G C
12 67182999 rs78901 A T
I tried
awk 'NR==FNR {h[$1] = $3; next} {print $1,$2,$3,h[$2]}' file1 file2
output generated
3 22745180 rs12345
I want to print the first 4 columns of file2 and the 3rd column of file1 as the 5th column of the output.

You may use this awk:
awk 'FNR == NR {map[$1,$2] = $3; next} ($3,$4) in map {$NF = map[$3,$4]} 1' file1 file2 | column -t
3 22745180 rs12345 G C
12 67182999 rs78901 A T
A more readable version:
awk '
FNR == NR {
map[$1,$2] = $3
next
}
($3,$4) in map {
$NF = map[$3,$4]
}
1' file1 file2 | column -t
column -t is used only to align the output in columns.
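The title also asks to retain non-pairing rows from both files. The command above already passes unmatched file2 rows through unchanged; what follows is a minimal sketch of one way to also emit unmatched file1 rows, assuming it is acceptable to append them after the file2 rows. The rows rs55555 and rs99999 are made-up test data, and the arrays row and seen are names introduced here for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Sample data from the question plus one made-up unmatched row per file.
printf '%s\n' 'rs12345 G C' 'rs78901 A T' 'rs55555 C A' > file1
printf '%s\n' '3 22745180 rs12345 G C,G' '12 67182999 rs78901 A G,T' '7 100 rs99999 T C,G' > file2

out=$(awk '
FNR == NR {                 # first file: remember the allele and the whole row
    map[$1,$2] = $3
    row[$1,$2] = $0
    next
}
($3,$4) in map {            # matched row of file2: replace the last field
    $NF = map[$3,$4]
    seen[$3,$4] = 1
}
1                           # print every file2 row, matched or not
END {                       # then print the file1 rows that never matched
    for (k in row) if (!(k in seen)) print row[k]
}' file1 file2)

printf '%s\n' "$out"
```

With this sample the unmatched file2 row (rs99999) is printed as is and the unmatched file1 row (rs55555) comes last. If several file1 rows are unmatched, their relative order is unspecified because awk's for (k in row) does not guarantee any order.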

If the rows of both files are in the same order and pair up one to one, as in the sample you presented, this will work:
paste file1 file2 | awk '{print $4,$5,$6,$7,$3}'

Related

Vlookup using awk command

I have two files in my linux server.
File 1
9190784
9197256
9170546
9184139
9196854
File 2
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
I want to write a script or a one-line bash command using awk that matches each entry in File 1 against Column1 of File 2 and prints Column1, Column2 and Column3. If an entry is not found, it should print the entry from File 1 with NA for Column2 and Column3.
Output: it should be redirected to a new file, as below.
new_file
9190784,TGM,AP
9197256,NA,NA
9170546,NA,NA
9184139,AGM,KN
9196854,TGM,AP
I hope the question is clear. Can anyone please help me with this?
This is a standard join operation with awk:
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[$2]=$3 OFS $4; next}
{print $1, (($1 in a)?a[$1]:"NA" OFS "NA")}' file2 file1
A substring variation, matching on the first 7 characters only (not tested):
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[substr($2,1,7)]=$3 OFS $4; next}
{key=substr($1,1,7);
print $1, ((key in a)?a[key]:"NA" OFS "NA")}' file2 file1
Does it have to be awk? It can be done with join.
Given these two files (note that the names are swapped relative to the question):
echo '9190784
9197256
9170546
9184139
9196854' >file2
echo 'S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP' > file1
You can join them using , as the separator (-t,). The first input is file1 with its header line removed (tail -n +2) and sorted on its second field (sort -t, -k2,2), joined on that second field (-1 2). The second input is file2, sorted (sort) and joined on its first field (-2 1).
join -t, -1 2 -2 1 -o 1.2,1.3,1.4 <(tail -n +2 file1 | sort -t, -k2,2) <(sort file2)
This outputs only the matching rows, in sorted order:
9184139,AGM,KN
9196854,TGM,AP
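The join command above drops the entries that have no match, while the question asks for NA in that case. Below is a hedged sketch of how join's -a and -e options can cover that, assuming the same file names as in this answer. It writes the sorted inputs to temporary files (table.sorted and ids.sorted are names chosen here) rather than using process substitution.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Same data as above: file2 holds the IDs, file1 holds the lookup table.
printf '%s\n' 9190784 9197256 9170546 9184139 9196854 > file2
printf '%s\n' 'S NO.,Column1,Column2,Column3' \
    '72070,9196854,TGM,AP' '72071,9172071,BGM,MP' \
    '72072,9184139,AGM,KN' '72073,9172073,TGM,AP' > file1

# join needs both inputs sorted on the join field.
tail -n +2 file1 | sort -t, -k2,2 > table.sorted
sort file2 > ids.sorted

# -a 2 keeps unpairable lines from the second input, -e NA fills the
# fields they lack, and field 0 in -o stands for the join field itself.
out=$(join -t, -1 2 -2 1 -a 2 -e NA -o 0,1.3,1.4 table.sorted ids.sorted)
printf '%s\n' "$out"
```

The result lists every ID from the ID file, with NA,NA where the table has no entry. The rows come out in sorted key order, not in the original order of the ID file; if the original order matters, the awk answer is the simpler route.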

awk: two files are queried

I have two files
file1:
>string1<TAB>Name1
>string2<TAB>Name2
>string3<TAB>Name3
file2:
>string1<TAB>sequence1
>string2<TAB>sequence2
I want to use awk to compare column 1 of respective files. If both files share a column 1 value I want to print column 2 of file1 followed by column 2 of file2. For example, for the above files my expected output is:
Name1<TAB>sequence1
Name2<TAB>sequence2
this is my code:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$2], $2 }' file1 file2 >out
But all I get is an empty first column followed by the sequence.
Where is the error?
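Your assignment is not right: a[$1] = $1 stores the key itself instead of the name in column 2, and a[$2] then looks up a key that was never stored. Store $2 and look it up with $1: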
$ awk 'BEGIN {FS=OFS="\t"}
NR==FNR {a[$1]=$2; next}
$1 in a {print a[$1],$2}' file1 file2
Name1 sequence1
Name2 sequence2

Compare all but last N Columns across two files in bash

I have 2 files: one with 18 columns; another with many more. I need to find the rows that mismatch on ONLY the first 18 columns while ignoring the rest in the other file. However, I need to preserve and print the entire row (cut will not work).
File 1:
F1 F2 F3....F18
A B C.... Y
AA BB CC... YY
File 2:
F1 F2 F3... F18... F32
AA BB CC... YY... 123
AAA BBB CCC... YYY...321
Output Not In File 1:
AAA BBB CCC YYY...321
Output Not In File 2:
A B C...Y
If possible, I would like to use diff or awk with as few loops as possible.
You can use awk:
awk '{k=""; for(i=1; i<=18; i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2
For each row in both files we are first creating a key by concatenating first 18 fields
We are then storing this key in an associative array while iterating first file
Finally we print each row from 2nd file when this new key value is not found in our associative array.
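The command above reports only the rows of file2 that are missing from file1. As a sketch of how the same key-building idea covers both directions, the example below passes the column count as an awk variable n (a name introduced here) and swaps the file order for the second report. The sample rows are made up and compare on 3 columns instead of 18.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Made-up sample: compare on the first 3 columns instead of 18.
printf '%s\n' 'A B C' 'AA BB CC' > file1
printf '%s\n' 'AA BB CC 123' 'AAA BBB CCC 321' > file2

# Build a key from the first n fields of every row; store the keys of the
# first file, then print rows of the second file whose key was not stored.
prog='{ k = ""; for (i = 1; i <= n; i++) k = k SUBSEP $i }
FNR == NR { a[k]; next }
!(k in a)'

not_in_file1=$(awk -v n=3 "$prog" file1 file2)
not_in_file2=$(awk -v n=3 "$prog" file2 file1)

printf 'Not in file 1: %s\n' "$not_in_file1"
printf 'Not in file 2: %s\n' "$not_in_file2"
```

Each direction is a separate pass over both files, so this reads the data twice; the trade-off is that whole rows are printed unchanged in their original order.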
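You can use grep (note that each line of the pattern file is treated as a regular expression and may match anywhere in a line; add -F to match fixed strings instead):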
grep -vf file1 file2
grep -vf <(cut -d" " -f1-18 file2) file1
To get the set differences between the two files, you'll need a little more, similar to @anubhava's answer:
$ awk 'NR==FNR{f1[$0]; next}
{k=$1; for(i=2;i<=18;i++) k=k FS $i;
if(k in f1) delete f1[k];
else f2[$0]}
END{print "not in f1";
for(k in f2) print k;
print "\nnot in f2";
for(k in f1) print k}' file1 file2
This can be rewritten to preserve the order of file2:
$ awk 'NR==FNR{f1[$0]; next}
{k=$1; for(i=2;i<=18;i++) k=k FS $i;
if(k in f1) delete f1[k];
else {if(!p) print "not in f1";
f2[$0]; print; p=1}}
END{print "\nnot in f2";
for(k in f1) print k}' file1 file2

Values missing in awk

My Input files :
file1
231|35000
234|15000
242|60000
254|12313
345|50000
435|24300
file2
1|madhan|retl|231|tcs
2|vaisakh|retl|234|tcs
4|sam|ins|242|infy
5|tina|bfs|254|tcs
3|ram|bfs|345|infy
6|subbu|bfs|435|infy
Output:
I am trying to get col1 and col2 of file1 and col2 of file2, joined on the common column (col1 of file1 and col4 of file2).
My code :
awk 'BEGIN { FS="|";} NR==FNR{a[$1] = $2;next} ($4 in a) {print $2 "|" $4 "|" a[$1]} ' file_1 file_2
Output I got:
madhan|231|
vaisakh|234|
sam|242|
tina|254|
ram|345|
subbu|435|
Can you help me understand why the last column comes out empty?
Try something like:
join -t '|' -1 1 -2 4 file1 file2 | awk -F'|' '{print $1 "|" $2 "|" $4}'
This joins on field 1 of file1 and field 4 of file2, then uses awk to extract the fields you need. join expects both inputs to be sorted on the join field, which holds for the sample data.
This should do:
awk -F\| 'FNR==NR {a[$1]=$0;next} {for (i in a) if (i==$4) print a[i]"|"$2}' file1 file2
231|35000|madhan
234|15000|vaisakh
242|60000|sam
254|12313|tina
345|50000|ram
435|24300|subbu
It stores each line of file1 in array a, using the first field as the index.
Then it tests each index against the fourth field of file2.
If they are equal, it prints the line from file1 followed by the second field of file2.
The last column is blank because the key you look up does not exist in the array. You store the first column of file1 as the key, which corresponds to the 4th column of file2, so the lookup must be a[$4], not a[$1]:
$ awk '
BEGIN { FS=OFS="|" }
NR==FNR { a[$1]=$2; next }
($4 in a) { print $2, $4, a[$4] }
' file1 file2
madhan|231|35000
vaisakh|234|15000
sam|242|60000
tina|254|12313
ram|345|50000
subbu|435|24300
If you need the column order stated in your requested output, then:
$ awk 'BEGIN {FS=OFS="|"}NR==FNR{a[$4]=$2;next} ($1 in a) {print $0, a[$1]}' file2 file1
231|35000|madhan
234|15000|vaisakh
242|60000|sam
254|12313|tina
345|50000|ram
435|24300|subbu
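The blank column in the question comes from how awk handles a lookup of a key that was never stored: it returns an empty string and, as a side effect, creates the element. A small self-contained demonstration, using a made-up key 999 that is not in the data:

```shell
out=$(printf '231|35000\n' | awk -F'|' '
{ a[$1] = $2 }                      # store 35000 under key 231
END {
    print "before: " (("999" in a) ? "present" : "absent")
    v = a["999"]                    # lookup of a key that was never stored
    print "value: [" v "]"
    print "after: " (("999" in a) ? "present" : "absent")
}')
printf '%s\n' "$out"
```

This prints absent, then an empty value between the brackets, then present. Testing with (key in a) before reading a[key], as the answers above do, avoids both the silent empty value and the accidental new element.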

I need to merge two files and create a new files

input 1
1 10611 2 122 C:0.983607 G:0.0163934
input 2
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07
Here the 1st and 2nd columns must match in both files, the allele before the ':' in the 5th column of the first file must equal the 4th column of the second file, and the allele before the ':' in the 6th column of the first file must equal the 5th column of the second file. The output is built from rows that match in this way; the sample input and output lines should make this clear. Both files are .gz files.
output
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
Here's one way using awk:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2
Result:
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
So for compressed files, try:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz
EDIT:
From the comments below, try this:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' input1 input2
Result:
1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;
This should work (assuming you have enough disk space to store the expanded .gz files). Note that the two sed substitutions are hardcoded to the C and G alleles of the sample line:
zcat 1 | awk '{print $1$2,$0}' | sort > new1
zcat 2 | awk '{print $1$2,$0}' | sort > new2
join new1 new2 -11 -21 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7"|sed 's/ C:/;REF=/'|sed 's/ G:/;ALT=/' > output
