Left outer join of multiple files data using awk command - bash

I have a base file and multiple other files sharing common data keyed on the first field of the base file. I need an output file combining all the data. I have tried many commands, but due to the file sizes the output takes too much time. awk has often helped me out, but I have no experience with awk array programming.
example
Base File
aa
ab
ac
ad
ae
File -1
aa,Apple
ab,Orange
ac,Mango
File -2
aa,1
ab,2
ae,3
Output File expected
aa,Apple,1
ab,Orange,2
ac,Mango,
ad,,
ae,,3
This is what I tried:
awk -F, 'FNR==NR{a[$1]=$0;next}{if(b=a[$1]) print b,$2; else print $1 }' OFS=, test.txt test2.txt

You could try two successive joins. Something like the following command should work:
join -a 1 -t, -e '' -o auto <(join -a 1 -t, -e '' -o auto base_file file1) file2
Here, we first join base_file and file1, then join the result with file2.
Explanation:
join -a 1 -t, -e '' -o auto base_file file1:
-a 1: display lines from base_file even if there is no match in file1
-t,: treat the character , as the field separator. This affects both the input files and the output.
-e '' -o auto: when a field is missing, output the empty string ''. The -e option only takes effect together with -o; -o auto builds the output format automatically from the first line of each file (a GNU extension).
Output :
aa,Apple,1
ab,Orange,2
ac,Mango,
ad,,
ae,,3
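For reference, here is a self-contained way to reproduce the two-step join with the sample data from the question (bash is assumed for the <(...) process substitution, and GNU join for -o auto):

```shell
# Recreate the sample files from the question.
printf '%s\n' aa ab ac ad ae > base_file
printf '%s\n' aa,Apple ab,Orange ac,Mango > file1
printf '%s\n' aa,1 ab,2 ae,3 > file2

# First join base_file with file1, then join that result with file2.
join -a 1 -t, -e '' -o auto <(join -a 1 -t, -e '' -o auto base_file file1) file2
```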

awk way:
awk -F, -v OFS="," '
    NR==FNR           {a[$1]=$2}
    FILENAME==ARGV[2] {b[$1]=$2}
    FILENAME==ARGV[3] {print $0,a[$0],b[$0]}' f1 f2 base
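To sanity-check this approach with the sample data (file names f1, f2, and base as used in the command):

```shell
# Recreate the sample data under the names the command expects.
printf '%s\n' aa,Apple ab,Orange ac,Mango > f1
printf '%s\n' aa,1 ab,2 ae,3 > f2
printf '%s\n' aa ab ac ad ae > base

# f1 fills array a, f2 fills array b, then each base key prints both lookups.
awk -F, -v OFS="," 'NR==FNR{a[$1]=$2}FILENAME==ARGV[2]{b[$1]=$2}FILENAME==ARGV[3]{print $0,a[$0],b[$0]}' f1 f2 base
```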

This will work in any awk for any number of input files:
$ cat tst.awk
BEGIN { FS=OFS="," }
!seen[$1]++ { keys[++numKeys] = $1 }
FNR==1 { ++numFiles }
{ a[$1,numFiles] = $2 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s%s", key, OFS
        for (fileNr=2; fileNr<=numFiles; fileNr++) {
            printf "%s%s", a[key,fileNr], (fileNr<numFiles ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk base file1 file2
aa,Apple,1
ab,Orange,2
ac,Mango,
ad,,
ae,,3


sed unterminated 's' command modify line of file

I'm trying to modify a groups.tsv file (I'm on repl.it so path to file is fine).
Each line in the file looks like this:
groupname \t amountofpeople \t lastadded
and I'm trying to count the occurrences of both a group name ($nomgrp) and a login ($login), and change lastadded to the login.
varcol2=$(grep "$nomgrp" groups | cut "-d " -f2- | awk -F"\t" '{print $2}' )
((varcol21=varcol2+1));
varcol3=$(awk -F"\t" '{print $3}' groups)
sed -i "s|${nomgrp}\t${varcol2}\t$varcol3|${nomgrp}\t${varcol21}\t${login}|" groups
However, I'm getting the error message:
sed: -e expression #1, char 27: unterminated 's' command
The groups file has lines such as " sudo 2 user1" (delimited with a tab): a user inputs "user" which is stored in $login, then "sudo" which is stored in $nomgrp.
What am I doing wrong?
Sorry if this has been answered/super easy to fix, I'm quite the newbie here...
If I understand what you are trying to do correctly and if you have GNU awk, you could do
gawk -i inplace -F '\t' -v group="$nomgrp" -v login="$login" -v OFS='\t' '$1 == group { $2 = $2 + 1; $3 = login; } { print }' groups.tsv
Example:
$ cat groups.tsv
wheel 1000 2019-12-10
staff 1234 2019-12-11
users 9001 2019-12-12
$ gawk -i inplace -F '\t' -v group=wheel -v login=2019-12-12 -v OFS='\t' '$1 == group { $2 = $2 + 1; $3 = login; } 1' groups.tsv
$ cat groups.tsv
wheel 1001 2019-12-12
staff 1234 2019-12-11
users 9001 2019-12-12
This works as follows:
-i inplace is a GNU awk extension that allows you to change a file in place,
-F '\t' sets the input field separator to a tab so that the input is interpreted as TSV and fields with spaces in them are not split apart,
-v variable=name sets an awk variable for use in awk's code,
specifically, -v OFS='\t' sets the output field separator variable to a tab, so that the output is again a TSV
So we set variables group, login to your shell variables and ensure that awk outputs a TSV. The code then works as follows:
$1 == group {   # if the first field of a line equals the group variable,
    $2 = $2 + 1 # add 1 to the second field
    $3 = login  # and overwrite the third with the login variable
}
{               # on all lines:
    print       # print
}
{ print } could also be abbreviated as 1, as someone will surely point out, but I find the longer form easier to explain.
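A quick way to convince yourself that { print } and 1 are equivalent (1 is a pattern that is always true, and the default action is to print the record):

```shell
# The same two-line input through both forms; the output is identical.
printf 'a\nb\n' | awk '{ print }'
printf 'a\nb\n' | awk '1'
```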
If you do not have GNU awk, you could achieve the same with a temporary file, e.g.
awk -F '\t' -v group="$nomgrp" -v login="$login" -v OFS='\t' '$1 == group { $2 = $2 + 1; $3 = login; } { print }' groups.tsv > groups.tsv.new
mv groups.tsv.new groups.tsv

Consolidate two tables awk

I have two files (all tab delimited):
database.txt
MAR001;string1;H
MAR002;string2;G
MAR003;string3;H
data.txt
data1;MAR002
data2;MAR003
And I want to consolidate these two tables using the MAR### column. Expected output (tab-delimited):
data1;MAR002;string2;G
data2;MAR003;string3;H
I want to use awk; this is my attempt:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$2] = $1; next } $2 in a { print $0, a[$1] }' data.txt database.txt
but this fails...
I would just use the join command. It's very easy:
join -t \; -1 1 -2 2 database.txt data.txt
MAR002;string2;G;data1
MAR003;string3;H;data2
You can specify output column order using -o. For example:
join -t \; -1 1 -2 2 -o 2.1,2.2,1.2,1.3 database.txt data.txt
data1;MAR002;string2;G
data2;MAR003;string3;H
P.S. I assumed your files are semicolon-separated rather than tab-separated. Also, both files need to be sorted on the key column.
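If the inputs aren't already sorted on their key columns, you can sort them on the fly; a sketch assuming bash for the process substitution:

```shell
# Sort database.txt on field 1 and data.txt on field 2 before joining.
join -t ';' -1 1 -2 2 -o 2.1,2.2,1.2,1.3 \
    <(sort -t ';' -k1,1 database.txt) \
    <(sort -t ';' -k2,2 data.txt)
```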
awk -F '\t' '
    FNR==1 && NR==1 { strt=1 }
    FNR==1 && NR!=1 { strt=0 }
    strt==1 { dat[$1]=$2";"$3 }
    strt==0 { if ( dat[$2] != "" ) print $1";"$2";"dat[$2] }
' database.txt data.txt
Read database.txt in first and read the data into an array dat. Then when we encounter the data.txt file, check for entries in the dat array and print the required data if there is one.
Output:
data1;MAR002;string2;G
data2;MAR003;string3;H
First of all, ; and \t are different characters. If your real input files are tab-delimited, here is the fix to your code:
Change your code to:
awk '....... $1 in a { print a[$1], $0 }' data.txt database.txt
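Putting the pieces together, a complete corrected version of the attempt might look like this (a sketch assuming the semicolon-separated sample as shown; swap FS=OFS=";" for "\t" if the real files are tab-delimited):

```shell
# Recreate the sample files from the question.
printf '%s\n' 'MAR001;string1;H' 'MAR002;string2;G' 'MAR003;string3;H' > database.txt
printf '%s\n' 'data1;MAR002' 'data2;MAR003' > data.txt

# Read data.txt first, keyed on its second field, then look up database.txt rows.
awk 'BEGIN{FS=OFS=";"} FNR==NR { a[$2]=$1; next } $1 in a { print a[$1], $0 }' data.txt database.txt
```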

using awk how to merge 2 files, say A & B and do a left outer join function and include all columns in both files

I have multiple files with different numbers of columns. I need to merge the first and second files, doing a left outer join in awk with respect to the first file, and print all columns of both files, matching on the first column of each.
I have tried the code below to get close to my output, but I can't print the empty fields where no matching number is found in the second file. join needs sorted input and takes more time than awk; my files are big, around 30 million records.
awk -F ',' '{
    if (NR==FNR) { r[$1]=$0 }
    else { if ($1 in r) r[$1]=r[$1] gensub($1,"",1) }
}
END { for (i in r) print r[i] }' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    if ( FNR == 1 ) {
        empty = gensub(/[^,]/,"","g",tail)
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
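For instance, a POSIX-compatible sketch of the first script, replacing gensub() with sub()/gsub() on temporary variables:

```shell
# Recreate the sample files from the question.
printf '%s\n' 1,a,b,c 2,a,b,c 3,a,b,c 5,a,b,c > file1
printf '%s\n' 1,x,y 2,x,y 5,x,y 6,x,y 7,x,y > file2

awk '
BEGIN { FS=OFS="," }
NR==FNR {
    tail = $0
    sub(/[^,]*,/, "", tail)       # strip the key field and its comma
    if (FNR == 1) {
        empty = tail
        gsub(/[^,]/, "", empty)   # keep only the commas: an all-empty tail
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
' file2 file1
```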
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    idx[$1] = NR
    file2[NR] = tail
    if ( FNR == 1 ) {
        file2[""] = gensub(/[^,]/,"","g",tail)
    }
    next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
You can try:
awk 'BEGIN{FS=OFS=","}
     FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
     {print $0, ($1 in d ? d[$1] : ",")}' file2 file1
You get:
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

Merging output of two awk scripts into 1 file

I have a large input file with 150+ columns and 50M rows, a sample of which is shown here:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1
I have a bash shell script:
function awkScript() {
    awk -F, -v cols="$1" -v hdr="$2" '
    BEGIN {OFS=FS}
    NR==1 {n=split(cols,cn);
           for(i=1;i<=NF;i++)
               for(j=1;j<=n;j++)
                   if($i==cn[j]) c[++k]=i;
           $(NF+1)=hdr}
    NR>1  {v1=$c[1]; v2=$c[2]; v3=$c[3]
           if(!v2 && !v3) $(NF+1) = v1?10:0
           else $(NF+1) = v3?(v1-v3)/v3:0 + v2?(v1-v2)/v2:0}1' "$3"
}
function awkScript1() {
    awk -F, -v cols="$1" -v hdr="$2" '
    BEGIN {OFS=FS}
    NR==1 {n=split(cols,cn);
           for(i=1;i<=NF;i++)
               for(j=1;j<=n;j++)
                   if($i==cn[j]) c[++k]=i;
           $(NF+1)=hdr}
    NR>1  {v1=$c[1]; v2=$c[2]; v3=$c[3]; v4=$c[4]
           $(NF+1) = v1?(v1/(v1+v2+v3+v4)):0
    }1' "$3"
}
function awkScriptWrapper() {
    awkScript "$1" "$2"
}
function awkScriptWrapper1() {
    awkScript1 "$1" "$2"
}
awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" >> output.txt
awkScript1 "c7,c8,c9,c10" "Header3" "input.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4" >> output1.txt
Sample of output.txt is:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1
Sample of output1.txt is:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5
My requirement is that I have to append Header1,Header2,Header3,Header4 into the end of the same input file i.e., the above script should produce just 1 output file "finaloutput.txt":
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1,1,0.5
I tried doing the following statements:
awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" >> temp_output.txt
awkScript1 "c7,c8,c9,c10" "Header3" "temp_output.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4" >> finaloutput.txt
But I'm not getting the output I expect.
Any help would be much appreciated.
Assuming that you need to join two commands in a pipeline:
$ cmd1 | join --header -j1 -t, -o1.{1..17} -o2.16,2.17 - <(cmd2)
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1,1,0.5
The above assumes that cmd1 outputs:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1
While cmd2 outputs:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5
How does it work?
--header treats the first line in each file as field headers
-j1 joins on field one
-t, specifies , as the field delimiter
-o xxx specifies the output columns: 1.1 means column one from file one (in this case cmd1), and 2.1 means column one from file two (in this case cmd2)
-o1.{1..17} expands (via bash brace expansion) to:
-o1.1 -o1.2 -o1.3 -o1.4 -o1.5 -o1.6 -o1.7 -o1.8 -o1.9 -o1.10 -o1.11 -o1.12 -o1.13 -o1.14 -o1.15 -o1.16 -o1.17
and is a quick way to specify the first 17 columns from cmd1
- refers to standard input, which in this case is the output of cmd1
<(command) is a process substitution.
You can change to:
join [options] file1 file2
if you need to join two regular files.
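As a concrete check of the join step itself, with the two intermediate outputs saved to files (out1.csv and out2.csv are hypothetical names standing in for cmd1 and cmd2; GNU join is assumed for --header and multiple -o options, and bash for the brace expansion):

```shell
cat > out1.csv <<'EOF'
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1
EOF
cat > out2.csv <<'EOF'
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5
EOF

# Keep all 17 columns of out1.csv and append the two header columns of out2.csv.
join --header -j1 -t, -o1.{1..17} -o2.16,2.17 out1.csv out2.csv
```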

AWK: Compare two CSV files

I have two CSV files and I want to compare them using AWK and generate a new file.
file1.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc123","C:/pro/xyz"
"abc124","C:/pro/in"
file2.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc125","C:/pro/xyz"
"abc126","C:/pro/in"
output.csv:
"file1","file2","Diff"
"abc121","abc121","Match"
"abc122","abc122","Match"
"abc123","","Unmatch"
"abc124","","Unmatch"
"","abc125","Unmatch"
"","abc126","Unmatch"
One way with awk:
script.awk:
BEGIN {
    FS = ","
}
NR>1 && NR==FNR {
    a[$1] = $2
    next
}
FNR>1 {
    print ($1 in a) ? $1 FS $1 FS "Match" : "\"\"" FS $1 FS "Unmatch"
    delete a[$1]
}
END {
    for (x in a) {
        print x FS "\"\"" FS "Unmatch"
    }
}
Output:
$ awk -f script.awk file1.csv file2.csv
"abc121","abc121",Match
"abc122","abc122",Match
"","abc125",Unmatch
"","abc126",Unmatch
"abc124","",Unmatch
"abc123","",Unmatch
I didn't use awk alone, but if I understood the gist of what you're asking correctly, I think this long one-liner should do it...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 file1.csv file2.csv | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e '1d' -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
Description:
The join portion takes the two CSV files, joins them on the first column (default behavior of join) and outputs all four fields (-o 1.1 2.1 1.2 2.2), making sure to include rows that are unmatched for both files (-a 1 -a 2).
The awk portion takes that output and replaces the combination of the 3rd and 4th columns with either "Match" or "Unmatch", based on whether they do in fact match. I had to make an assumption about this behavior based on your example.
The sed portion deletes the "no","loc" header from the output (-e '1d') and replaces empty fields with open-close quote marks (-e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'). This last part might not be necessary for you.
EDIT:
As tripleee points out, the above fails if the two initial files are unsorted. Here's an updated command to fix that. It punts the header line and sorts each file before passing them to join...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 <( sed 1d file1.csv | sort ) <( sed 1d file2.csv | sort ) | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
