awk to input column data from one file to another based on a match - shell

Objective
I am trying to fill out $9 (booking ref) and $10 (client) in file1.csv with information pulled from $2 (booking ref) and $3 (client) of file2.csv, using "CAMPAIGN ID" ($5 in file1.csv, $1 in file2.csv) as the key. Wherever the two files match on "CAMPAIGN ID", I want to print those columns of file2.csv into the matching rows of file1.csv.
File1.csv
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,,
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,,
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,,
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,,
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,,
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,,
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,,
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,,
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,,
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,,
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,,
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,,
File2.csv
CAMPAIGN ID,Booking Ref,client
20572431,ref1,1
21489000,ref2,1
23030494,ref3,1
22973424,ref4,1
23150347,ref5,1
20572149,ref6,1
20578983,ref7,2
22243554,ref8,2
20589101,ref9,3
23077785,ref10,3
21811100,ref11,3
23194965,ref12,3
Desired Output
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,ref1,1
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,ref9,3
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,ref3,1
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,ref4,1
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,ref2,1
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,ref5,1
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,ref12,3
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,ref7,2
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,ref8,2
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,ref6,1
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,ref10,3
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,ref11,3
What I've tried
From the research I've done online, this appears to be possible using awk and join (How to merge two files using AWK? got me the closest of what I found). I've tried various awk snippets I found online, but I can't get them to achieve my goal. Below is the code I've been trying to massage into shape, which gets me the closest. At the moment it's set up to populate just the booking ref, as I presume I can rinse and repeat for the client column. With this code I was able to populate the booking ref, but it required me to move CAMPAIGN ID to $1, and all it did was replace the values.
NOTE: The row order of file1.csv won't match file2.csv; rows may appear in a different order, as shown in this example.
current code
awk -F"," -v OFS=',' 'BEGIN { while (getline < "fil2.csv") { f[$1] = $2; } } {print $0, f[$1] }' file1.csv
Can someone confirm where I'm going wrong with this code, as I've tried altering the columns in it - and in the file - without success? Maybe it's just how I'm understanding the code itself.

Like this:
awk 'BEGIN{FS=OFS=","} NR==FNR{r[$1]=$2;c[$1]=$3;next} NR>1{$9=r[$5];$10=c[$5]} 1' \
file2.csv file1.csv
Explanation in multi line form:
# Set input and output field delimiter to ,
BEGIN{
FS=OFS=","
}
# NR (overall record number) equals FNR (per-file record number)
# only while the first file, file2.csv, is being read
NR==FNR{
# Store booking ref and client id indexed by campaign id
r[$1]=$2
c[$1]=$3
# Skip blocks below
next
}
# From here code runs only on file1.csv
NR>1{
# Set booking ref and client id according to the campaign id
# in field 5
$9=r[$5]
$10=c[$5]
}
# Print the modified line of file1.csv (includes the header line)
{
print
}
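To see the answer work end to end, here is a minimal, self-contained run (file contents trimmed to two rows from the question's samples):

```shell
# Recreate trimmed versions of the question's two files.
cat > file2.csv <<'EOF'
CAMPAIGN ID,Booking Ref,client
20572431,ref1,1
20589101,ref9,3
EOF

cat > file1.csv <<'EOF'
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,,
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,,
EOF

# file2.csv is read first and cached; file1.csv is then rewritten.
awk 'BEGIN{FS=OFS=","} NR==FNR{r[$1]=$2;c[$1]=$3;next} NR>1{$9=r[$5];$10=c[$5]} 1' \
    file2.csv file1.csv
```

Note that the header line of file1.csv comes out intact: file2.csv's own header was cached too, so "CAMPAIGN ID" maps back to "Booking Ref" and "client".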

Could you please try the following.
awk '
BEGIN{
FS=OFS=","
print "INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client"
}
FNR==NR && FNR>1{
val=$1
$1=""
sub(/^,/,"")
a[val]=$0
next
}
($5 in a) && FNR>1{
sub(/,*$/,"")
print $0,a[$5]
}
' file2.csv file1.csv
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
FS=OFS="," ##Setting FS(field separator) and OFS(output field separator)as comma here.
print "INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client"
} ##Closing BEGIN section of this program now.
FNR==NR && FNR>1{ ##Checking condition FNR==NR which will be true when file2.csv is being read.
val=$1 ##Creating variable val whose value is $1 here.
$1="" ##Nullifying $1 here.
sub(/^,/,"") ##Substitute initial comma with NULL in this line.
a[val]=$0 ##Creating an array a whose index is val and value is $0.
next ##next will skip all further statements from here.
} ##Closing BLOCK for condition FNR==NR here.
($5 in a) && FNR>1{ ##Checking if $5 is present in array a this condition will be checked when file1.csv is being read.
sub(/,*$/,"") ##Substituting all commas at last of line with NULL here.
print $0,a[$5] ##Printing current line and value of array a with index $5 here.
} ##Closing BLOCK for above ($5 in a) condition here.
' file2.csv file1.csv ##Mentioning Input_file names here.
Output will be as follows.
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,ref1,1
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,ref9,3
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,ref3,1
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,ref4,1
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,ref2,1
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,ref5,1
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,ref12,3
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,ref7,2
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,ref8,2
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,ref6,1
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,ref10,3
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,ref11,3
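Since the question also mentions join(1), here is a sketch of the same lookup using join instead of an awk array. Both inputs must be sorted on the join key, so the original row order of file1.csv is not preserved and the header has to be carried separately - which is why the pure-awk answers above are usually the more convenient fit. The `.sorted` temp file names are just for illustration; the sample files are trimmed to two rows.

```shell
cat > file1.csv <<'EOF'
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,,
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,,
EOF
cat > file2.csv <<'EOF'
CAMPAIGN ID,Booking Ref,client
20572431,ref1,1
20589101,ref9,3
EOF

export LC_ALL=C                                     # keep sort and join collation in agreement

head -n 1 file1.csv                                 # emit the header untouched
tail -n +2 file1.csv | sort -t, -k5,5 > f1.sorted   # sort data rows on CAMPAIGN ID
tail -n +2 file2.csv | sort -t, -k1,1 > f2.sorted
# Join on field 5 of file1 and field 1 of file2; output file1's first
# eight fields followed by file2's booking ref and client.
join -t, -1 5 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.2,2.3 f1.sorted f2.sorted
```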

Related

Copy one csv header to another csv with type modification

I want to copy one csv header to another row-wise, with some modifications.
Input csv
name,"Mobile Number","mobile1,mobile2",email2,Address,email21
test, 123456789,+123456767676,a#test.com,testaddr,a1#test.com
test1,7867778,8799787899898,b#test,com, test2addr,b2#test.com
In the new csv it should look like this (the file should also be created). For a string column I will pass the column name, so only that column is converted to string:
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
As you can see above, each header with its type modification should be inserted on a separate row.
I have tried the below command, but it only copies the first row:
sed '1!d' input.csv > output.csv
You may try this alternative gnu awk command as well:
awk -v FPAT='"[^"]+"|[^,]+' 'NR == 1 {
for (i=1; i<=NF; ++i)
print gensub(/"/, "", "g", $i) "." ($i ~ /,/ ? "string" : "auto") "()"
exit
}' file
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
Or using sed:
sed -i -e '1i 1234567890.string(),My address is test.auto(),abc3#gmail.com.auto(),120000003.auto(),abc-003.auto(),3.com.auto()' -e '1d' test.csv
EDIT: As per the OP's comment, to print only the first line (header) please try the following.
awk -v FPAT='[^,]*|"[^"]+"' '
FNR==1{
for(i=1;i<=NF;i++){
if($i~/^".*,.*"$/){
gsub(/"/,"",$i)
print $i".string()"
}
else{
print $i".auto()"
}
}
exit
}
' Input_file > output_file
Could you please try the following, written and tested with GNU awk on the shown samples.
awk -v FPAT='[^,]*|"[^"]+"' '
FNR==1{
for(i=1;i<=NF;i++){
if($i~/^".*,.*"$/){
gsub(/"/,"",$i)
print $i".string()"
}
else{
print $i".auto()"
}
}
next
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program and setting FPAT to [^,]*|"[^"]+".
FNR==1{ ##Checking condition if this is first line then do following.
for(i=1;i<=NF;i++){ ##Running for loop from i=1 to till NF value.
if($i~/^".*,.*"$/){ ##Checking condition if current field starts from " and ends with " and having comma in between its value then do following.
gsub(/"/,"",$i) ##Substitute all occurrences of " with NULL in current field.
print $i".string()" ##Printing current field and .string() here.
}
else{ ##else do following.
print $i".auto()" ##Printing current field dot auto() string here.
}
}
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.

How to compare two files and print the values of both the files which are different

There are 2 files. I need to sort them first, then compare them, and for the rows that differ print the values from both File 1 and File 2.
file1:
pair,bid,ask
AED/MYR,3.918000,3.918000
AED/SGD,3.918000,3.918000
AUD/CAD,3.918000,3.918000
file2:
pair,bid,ask
AUD/CAD,3.918000,3.918000
AUD/CNY,3.918000,3.918000
AED/MYR,4.918000,4.918000
Output should be:
pair,inputbid,inputask,outputbid,outtputask
AED/MYR,3.918000,3.918000,4.918000,4.918000
The only difference between the 2 files is AED/MYR, which has different bid/ask rates. How can I print the differing values from file 1 and file 2?
I tried the below command:
nawk -F, 'NR==FNR{a[$1]=$4;a[$2]=$5;next} !($4 in a) || !($5 in a) {print $1 FS a[$1] FS a[$2] FS $4 FS $5}' file1 file2
Result output as below:
pair,bid,ask,bid,ask
AUD/CAD,3.918000,3.918000,3.918000,3.918000
AUD/CHF,3.918000,3.918000,3.918000,3.918000
AUD/CNH,3.918000,3.918000,3.918000,3.918000
AUD/CNY,3.918000,3.918000,3.918000,3.918000
AED/MYR,3.918000,3.918000,4.918000,4.918000
We are still not able to get only the difference.
Could you please try the following, written and tested in GNU awk with the shown samples.
awk -v header="pair,inputbid,inputask,outputbid,outtputask" '
BEGIN{
FS=OFS=","
}
FNR==NR{
arr[$1]=$0
next
}
($1 in arr) && arr[$1]!=$0{
val=$1
$1=""
sub(/^,/,"")
if(!found){
print header
found=1
}
print arr[val],$0
}' Input_file1 Input_file2
Explanation: Adding detailed explanation for above.
awk -v header="pair,inputbid,inputask,outputbid,outtputask" ' ##Starting awk program from here and setting this to header value here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting field separator and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file1 is being read.
arr[$1]=$0 ##Creating arr with index $1 and keep value as current line.
next ##next will skip all further statements from here.
}
($1 in arr) && arr[$1]!=$0{ ##Checking condition if first field is present in arr and its value NOT equal to $0
val=$1 ##Saving the first field's value in val.
$1="" ##Nullifying the first field here.
sub(/^,/,"") ##Substitute starting , with NULL here.
if(!found){ ##Checking if found is NULL then do following.
print header ##Printing header here only once.
found=1 ##Setting found here.
}
print arr[val],$0 ##Printing arr with index of val and current line here.
}' Input_file1 Input_file2 ##Mentioning Input_files here.
With bash process substitution, then join, and then filtering with awk:
# print header
printf "%s\n" "pair,inputbid,inputask,outputbid,outtputask"
# remove first line from both files, then sort them on first field
# then join them on first field and output first 5 fields
join -t, -11 -21 -o1.1,1.2,1.3,2.2,2.3 <(tail -n +2 file1 | sort -t, -k1) <(tail -n +2 file2 | sort -t, -k1) |
# output only those lines, that columns differ
awk -F, '$2 != $4 || $3 != $5'
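A self-contained version of that pipeline, with the process substitutions replaced by temporary files so it also runs under a plain POSIX sh (file contents trimmed from the question; the `.sorted` file names are just for the demo):

```shell
cat > file1 <<'EOF'
pair,bid,ask
AED/MYR,3.918000,3.918000
AUD/CAD,3.918000,3.918000
EOF
cat > file2 <<'EOF'
pair,bid,ask
AUD/CAD,3.918000,3.918000
AED/MYR,4.918000,4.918000
EOF

export LC_ALL=C                       # consistent collation for sort and join
printf '%s\n' "pair,inputbid,inputask,outputbid,outtputask"
tail -n +2 file1 | sort -t, -k1,1 > f1.sorted
tail -n +2 file2 | sort -t, -k1,1 > f2.sorted
# Join on the pair name, then keep only rows whose bid or ask differ.
join -t, -o 1.1,1.2,1.3,2.2,2.3 f1.sorted f2.sorted | awk -F, '$2 != $4 || $3 != $5'
```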

Append delimiters for implied blank fields

I am looking for a simple solution to make every line in a (CSV) file have the same number of commas.
e.g.
example of file:
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
expected:
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
The line with the largest number of commas has 5 in this case (line #2), so I want to pad every other line with commas until each line has the same number (i.e. 5 commas).
Using awk:
$ awk 'BEGIN{FS=OFS=","} {$6=$6} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
As you can see above, in this approach the max. number of fields must be hardcoded in the command.
Another take on making all lines in a CSV file have the same number of fields. The number of fields need not be known: the maximum is calculated, and a substring of the needed commas is appended to each record, e.g.
awk -F, -v max=0 '{
lines[n++] = $0 # store lines indexed by line number
fields[lines[n-1]] = NF # store number of field indexed by $0
if (NF > max) # find max NF value
max = NF
}
END {
for(i=0;i<max;i++) # form string with max commas
commastr=commastr","
for(i=0;i<n;i++) # loop appended substring of commas
printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
}' file
Example Use/Output
Pasting at the command-line, you would receive:
$ awk -F, -v max=0 '{
> lines[n++] = $0 # store lines indexed by line number
> fields[lines[n-1]] = NF # store number of field indexed by $0
> if (NF > max) # find max NF value
> max = NF
> }
> END {
> for(i=0;i<max;i++) # form string with max commas
> commastr=commastr","
> for(i=0;i<n;i++) # loop appended substring of commas
> printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
> }' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
Could you please try the following, a more generic way. This code works even when the number of fields differs across your Input_file: it first reads the whole file to find the maximum number of fields, then on the second pass it resets each line's fields (because OFS is set to comma, any line with fewer fields than nf gets that many commas appended). An enhanced version of @oguz ismail's answer.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
nf=nf>NF?nf:NF
next
}
{
$nf=$nf
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of awk program from here.
FS=OFS="," ##Setting FS and OFS as comma for all lines here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
nf=nf>NF?nf:NF ##Setting nf to the running maximum: keep nf if it is already greater than NF, otherwise take NF.
next ##next will skip all further statements from here.
}
{
$nf=$nf ##Mentioning $nf=$nf will reset current lines value and will add comma(s) at last of line if NF is lesser than nf.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
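A quick check of the two-pass approach on the sample from the question (note the file name is passed twice, once per pass):

```shell
cat > file <<'EOF'
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
EOF

# Pass 1 finds the maximum NF (6); pass 2 assigns $nf=$nf, which forces
# awk to rebuild each record with OFS commas up to field nf.
awk 'BEGIN{FS=OFS=","} FNR==NR{nf=nf>NF?nf:NF;next} {$nf=$nf} 1' file file
```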

gsub with awk on a file by name

I'm trying to learn how to use awk with gsub on a particular field, passing the name rather than the number of the column, on this data:
especievalida,nom059
Rhizophora mangle,Amenazada (A)
Avicennia germinans,Amenazada (A)
Laguncularia racemosa,Amenazada (A)
Cedrela odorata,Sujeta a protección especial (Pr)
Litsea glaucescens,En peligro de extinción (P)
Conocarpus erectus,Amenazada (A)
Magnolia schiedeana,Amenazada (A)
Carpinus caroliniana,Amenazada (A)
Ostrya virginiana,Sujeta a protección especial (Pr)
I tried
awk -F, -v OFS="," '{gsub("\\(.*\\)", "", $2 ) ; print $0}'
which removes everything between parentheses in the second ($2) column; but I'd really like to be able to pass "nom059" to the expression to get the same result.
When reading the first line of your input file (the header line) build an array (f[] below) that maps the field name to the field number. Then you can access the fields by just using their names as an index to f[] to get their numbers and then contents:
$ cat tst.awk
BEGIN {
FS = OFS = ","
}
NR==1 {
for (i=1; i<=NF; i++) {
f[$i] = i
}
}
{
gsub(/\(.*\)/,"",$(f["nom059"]))
print
}
$ awk -f tst.awk file
especievalida,nom059
Rhizophora mangle,Amenazada
Avicennia germinans,Amenazada
Laguncularia racemosa,Amenazada
Cedrela odorata,Sujeta a protección especial
Litsea glaucescens,En peligro de extinción
Conocarpus erectus,Amenazada
Magnolia schiedeana,Amenazada
Carpinus caroliniana,Amenazada
Ostrya virginiana,Sujeta a protección especial
By the way, read https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for why you should be using gsub(/.../ (a constant or literal regexp) instead of gsub("..." (a dynamic or computed regexp).
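A small illustration of that point, sticking to POSIX awk: the same substitution written once as a regexp constant and once as a dynamic (string) regexp, where every backslash has to be doubled because the string is parsed once before it ever reaches the regexp engine.

```shell
# Regexp constant: backslashes written once, between slashes.
echo 'Amenazada (A)' | awk '{sub(/ \(.*\)/, ""); print}'
# Dynamic regexp: the string parser consumes one level of backslashes,
# so each one must be doubled to survive to the regexp engine.
echo 'Amenazada (A)' | awk -v re=' \\(.*\\)' '{sub(re, ""); print}'
```

Both commands print `Amenazada`.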
Could you please try the following. I have made an awk variable named header_value in which you can set the field name on which you want to use gsub.
awk -v header_value="nom059" '
BEGIN{
FS=OFS=","
}
FNR==1{
for(i=1;i<=NF;i++){
if($i==header_value){
field_value=i
}
}
print
next
}
{
gsub(/\(.*\)/, "",$field_value)
}
1
' Input_file
Explanation: Adding explanation of above code.
awk -v header_value="nom059" ' ##Starting awk program here and creating a variable named header_value whose value is set as nom059.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR==1{ ##Checking condition if FNR==1, line is 1st line then do following.
for(i=1;i<=NF;i++){ ##Starting a for loop which starts from i=1 to till value of NF.
if($i==header_value){ ##checking condition if any field value is equal to variable header_value then do following.
field_value=i ##Creating variable field_value whose value is variable i value.
}
}
print ##Printing 1st line here.
next ##next will skip all further statements from here.
}
{
gsub(/\(.*\)/, "",$field_value) ##Using gsub to globally substitute everything between ( and ) with NULL on each line.
}
1 ##Mentioning 1 will print edited/non-edited line.
' Input_file ##Mentioning Input_file name here.
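A compact run of the same idea on a trimmed sample. The file name datos.csv is just for the demo, and the regexp is widened to `/ *\(.*\)/` so the space before the parenthesis is removed as well:

```shell
cat > datos.csv <<'EOF'
especievalida,nom059
Rhizophora mangle,Amenazada (A)
Cedrela odorata,Sujeta a protección especial (Pr)
EOF

awk -v header_value="nom059" '
BEGIN{FS=OFS=","}
FNR==1{
  for(i=1;i<=NF;i++) if($i==header_value) field_value=i   # map name -> number
  print; next                                             # header passes through
}
{gsub(/ *\(.*\)/,"",$field_value)}                        # strip "(...)" and the space before it
1' datos.csv
```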

Compare two columns of two files and display the third column with input if it matching of not in unix

I would like to compare the first two columns of two files, file1.txt and file2.txt, and write to another file output.txt the third column of both files, along with whether they match or not.
file1.txt
ab|2001|name1
cd|2000|name2
ef|2002|name3
gh|2003|name4
file2.txt
xy|2001|name5
cd|2000|name6
ef|2002|name7
gh|2003|name8
output.txt
name1 name5 does not match
name2 name6 matches
name3 name7 matches
name4 name8 matches
Welcome to Stack Overflow. Could you please try the following and let me know if it helps.
awk -F"|" 'FNR==NR{a[$2]=$1;b[$2]=$3;next} ($2 in a) && ($1==a[$2]){print b[$2],$3,"matches";next} {print b[$2],$3,"does not match"}' file1.txt file2.txt
EDIT: Adding a non-one liner form of solution too here.
awk -F"|" '
FNR==NR{
a[$2]=$1;
b[$2]=$3;
next
}
($2 in a) && ($1==a[$2]){
print b[$2],$3,"matches";
next
}
{
print b[$2],$3,"does not match"
}
' file1.txt file2.txt
Explanation: Adding explanation for above code.
awk -F"|" ' ##Starting awk program from here and setting field separator as | here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1.txt is being read.
a[$2]=$1; ##Creating an array with named a whose index is $2 and value is $1.
b[$2]=$3; ##Creating an array named b whose index is $2 and value is $3.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && ($1==a[$2]){ ##Checking condition if $2 is in array a AND the first field equals the value stored in a for index $2.
print b[$2],$3,"matches"; ##Printing array b's value for index $2 and the 3rd field with the string matches.
next ##next will skip all further statements from here.
} ##Closing BLOCK for above condition here.
{
print b[$2],$3,"does not match" ##Printing array b's value for index $2 and the 3rd field with the string does not match.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
You can use paste and awk to get what you want.
The solution below assumes the fields in file1 and file2 are always delimited by "|", and (since paste pairs lines positionally) that both files list the rows in the same order.
paste -d "|" file1.txt file2.txt | awk -F "|" '{ if( $1 == $4 && $2 == $5 ){print $3, $6, "matches"} else {print $3, $6, "does not match"} }' > output.txt
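A self-contained run of the paste pipeline against the question's sample files:

```shell
cat > file1.txt <<'EOF'
ab|2001|name1
cd|2000|name2
ef|2002|name3
gh|2003|name4
EOF
cat > file2.txt <<'EOF'
xy|2001|name5
cd|2000|name6
ef|2002|name7
gh|2003|name8
EOF

# paste glues line N of file1 to line N of file2; awk then compares the
# first two columns of each side and labels the pair of names.
paste -d "|" file1.txt file2.txt |
  awk -F "|" '{ if ($1 == $4 && $2 == $5) print $3, $6, "matches"; else print $3, $6, "does not match" }' > output.txt
cat output.txt
```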
