compare two columns in awk and print values from lookup files into output file - shell

I have two files with first file having ~16000 files and the second file is lookup file having ~4000 lines.
Sample contents of file1 is given below:
id,title,name,value,details
01,23456, , ,abcdefg
02,23456, , ,abcdefg
03,12345, , ,abcdefg
04,34534, , ,abcdefg
...
Sample contents of lookup file file2 is given below:
sno,title,name,value
1,23456,abc,xyz
2,12345,cde,efg
3,34534,543,234
Now my requirement is compare column 2 of file1 in the lookup file and insert the values of column3 and column4 from lookup file into new output file.
The output file should look like below:
id,title,name,value,details
01,23456,abc,xyz,abcdefg
02,23456,abc,xyz,abcdefg
03,12345,cde,efg,abcdefg
04,34534,543,234,abcdefg
I did try few iterations by looking at existing questions but didn't get the results I desired. Any solution with awk would be much helpful.

$ cat vino.awk
BEGIN { FS = OFS = "," }
NR==FNR { name[$2]=$3; value[$2]=$4; next }
{ print $1, $2, name[$2], value[$2], $5 }
$ cat file1
id,title,name,value,details
01,23456, , ,abcdefg
02,23456, , ,abcdefg
03,12345, , ,abcdefg
04,34534, , ,abcdefg
$ cat file2
sno,title,name,value
1,23456,abc,xyz
2,12345,cde,efg
3,34534,543,234
$ awk -f vino.awk file2 file1
id,title,name,value,details
01,23456,abc,xyz,abcdefg
02,23456,abc,xyz,abcdefg
03,12345,cde,efg,abcdefg
04,34534,543,234,abcdefg

Here's an awk oneliner:
awk -F, 'FNR==NR {n[$2]=$3;v[$2]=$4} FNR!=NR{OFS=","; print $1,$2,n[$2],v[$2],$5}' file2 file1
The idea is to process in two passes, first for file2 to store all of the names and values, then for file1, to print out each line including the collected names and values.

awk -F"," 'BEGIN{OFS=","} NR==FNR {a[$2]=$3","$4;next} {print $1,$2,a[$2],$5;}' file2 file1

Related

awk for string comparison with multiple delimiters

I have a file with multiple delimiters, I m looking to compare the value after the first / when read from right left with another file.
code :-
awk -F'[/|]' NR==FNR{a[$3]; next} ($1 in a )' file1 file2 > output
cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
cat file2
customer
trust
expected output :-
AAB/BBC/customer|fed|12931|
/customer|fed|12931|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
$ awk -F\| ' # just use one FS
NR==FNR {
a[$1]
next
}
{
n=split($1,t,/\//) # ... and use split to the 1st field
if(t[n] in a) # and compare the last split part
print
}' file2 file1
Output:
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
If you use this [/|] you will have 2 delimiters and you will not know what the value after the last pipe was.
Reading your question, you want to compare the first value after the last slash without pipe chars.
If there has to be a / present in the string, you can set that as the field separator and check if there are at least 2 fields using NF > 1
Then take the last field using $NF, split on | and check if the first part is present in one of the values of file2 which are stored in array a
$cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
customer
Example code
awk -F/ '
NR==FNR {a[$1];next}
NF > 1 {
split($NF, t, "|")
if(t[1] in a) print
}
' file2 file1
Output
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|

Merge File1 with File2 (keep appending from File1 to File2 until no more rows)

I can't find a solution.
So here is the problem.
Result should be 100 rows (File1) with contents from File2 repeating 25 times.
What I want is to join the contents even though the number of rows is not equal. Keep repeating including lines from File2 until number of rows from File1 is met.
File1:
test1#domain.com
test2#domain2.com
test3#domain3.com
test4#domain4.com
File2:
A1,B11
A2,B22
A3,B33
A4,B44
What I want is to combine the files in the following to have the following expected result:
File3:
test1#domain.com,A1,B12
test2#domain2.com,A2,B22
test3#domain3.com,A3,B33
test4#domain4.com,A4,B44
Note here: After it finishes with the 4 rows from File2, start again from first line, then repeat.
test5#domain5.com,A1,B12
test6#domain6.com,A2,B22
test7#domain7.com,A3,B33
test8#domain8.com,A4,B44
The example in your question isn't clear but I THINK this is what you're trying to do:
$ awk -v OFS=',' 'NR==FNR{a[++n]=$0;next} {print $0, a[(FNR-1)%n+1]}' file2 file1
test1#domain.com,A1,B11
test2#domain2.com,A2,B22
test3#domain3.com,A3,B33
test4#domain4.com,A4,B44
test5#domain5.com,A1,B11
test6#domain6.com,A2,B22
The above was run against this input:
$ cat file1
test1#domain.com
test2#domain2.com
test3#domain3.com
test4#domain4.com
test5#domain5.com
test6#domain6.com
$
$ cat file2
A1,B11
A2,B22
A3,B33
A4,B44
Could you please try following.
awk '
BEGIN{
OFS=","
}
FNR==NR{
a[++count]=$0
next
}
{
count_curr++
count_curr=count_curr>count?1:count_curr
print a[count_curr],$0
}
' Input_file2 Input_file1

using awk how to merge 2 files, say A & B and do a left outer join function and include all columns in both files

I have multiple files with different number of columns, i need to do a merge on first file and second file and do a left outer join in awk respective to first file and print all columns in both files matching the first column of both files.
I have tried below codes to get close to my output. But i can't print the ",', where no matching number is found in second file. Below is the code. Join needs sorting and takes more time than awk. My file sizes are big, like 30 million records.
awk -F ',' '{
if (NR==FNR){ r[$1]=$0}
else{ if($1 in r)
r[$1]=r[$1]gensub($1,"",1)}
}END{for(i in r){print r[i]}}' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
if ( FNR == 1 ) {
empty = gensub(/[^,]/,"","g",tail)
}
file2[$1] = tail
next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
idx[$1] = NR
file2[NR] = tail
if ( FNR == 1 ) {
file2[""] = gensub(/[^,]/,"","g",tail)
}
next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
you can try,
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
you get,
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

splitting file1 using AWK and then name the new files based on lines from file2

File 1 has long string of data between <PhotoField1> multiple times.
Example:
<PhotoField1>alidkfjaeijwoeij<PhotoField1>akdfjalskdfasd<PhotoField1>
File 2 has list of IDs that i want to use to label the Files
Example:
A00565415
A00505050
A54531245
I have an AWK command to parse each string between <PhotoField1> from File1 into its own file, but it only labels the files temp with numbers:
awk -v RS="<PhotoField1>" '{ print $0 > "temp" NR }' File1.xml
I need to replace the temp* part with the a line from the 2nd file
So the new files would be named A00565415, A00505050, A54531245, etc..
- it would be excellent if I could add a .txt to the end of the files: A54531245.txt
The awk command works great for separating it into different files, but i need to be able to name them based on the File2 list.
awk 'NR==FNR{fname[NR]=$0".txt";next} {print > fname[FNR]}' File2.list RS="<PhotoField1>" File1.xml
You can use this awk:
awk -v RS="<PhotoField1>|\n" 'FNR==NR{a[NR]=$0; next}
NF{ print $0 > a[FNR] ".txt" }' file2 file1

How to check whehter rows of a file within the rows of another file

I am fresh to Shell or Bash. I have file1 with one column and about 5000 rows and file2 have five columns with 240k rows. How can I check whether the values of the 5000 rows in file1 within or not the second column of file2?
$wc -l file1
$5188
$wc -l file2
$240,888
You can do this with awk, something like this:
awk 'NR == FNR {a[$2] = $1; next} {if ($2 in a){print(a[$2], $1)}}' file1 file2
Basically you read the first file in and store its contents in an array "a". Then you read the second file and check if the second field of each line is contained within array "a" and print it if it is.
My answer assumes your fields are separated by white space, if they are not you will have to change the separator. So, if your fields are separated by commas, you will need:
awk -F, .....
The above syntax does work, and it can be further simplified as:
awk 'FNR==NR{a[$1]=$2; next} {print $1, a[$1]}' file2 file1

Resources