awk match and merge two files on the basis of key values - bash

I have two files in which file1 $3,$4 = file2 $3,$2.
file1:
1211,A2,ittp,1,IPSG,a2,PA,3000,3000
1311,A4,iztv,1,IPSG,a4,PA,240,250
1411,B4,iztq,0,IPSG,b4,PA,230,250
file2:
TP,0,nttp,0.865556,0.866667
TP,1,ittp,50.7956,50.65
TP,1,iztv,5.42444,13.8467
TP,0,iztq,645.194,490.609
I want to merge these files: if file1 $3,$4 = file2 $3,$2, then print the merged line like
TP,1211,A2,ittp,1,IPSG,a2,PA,3000,3000,50.7956,50.65
TP,1311,A4,iztv,1,IPSG,a4,PA,240,250,5.42444,13.8467
TP,1411,B4,iztq,0,IPSG,b4,PA,230,250,645.194,490.609
BOTH THE FILES ARE CSV FILES.
I tried using awk but I'm not getting the desired output. It's printing only file1.
$ awk -F, 'NR==FNR{a[$3,$4]=$3$2;next}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 a[$1] }' OFS=, 1.csv 2.csv

awk -F, 'BEGIN {OFS=",";}
NR == FNR {a[$3,$4] = $0;}
NR != FNR && a[$3,$2] {print $1, a[$3,$2], $4, $5;}' 1.csv 2.csv

One way with awk:
awk 'NR==FNR{a[$4,$3]=$0;next}($2,$3) in a{print $1,a[$2,$3],$4,$5}' FS=, OFS=, f1 f2
TP,1211,A2,ittp,1,IPSG,a2,PA,3000,3000,50.7956,50.65
TP,1311,A4,iztv,1,IPSG,a4,PA,240,250,5.42444,13.8467
TP,1411,B4,iztq,0,IPSG,b4,PA,230,250,645.194,490.609

Using Join
With i1.txt and i2.txt as the input files:
awk -F',' '{print $3 "-" $4 "," $1 "," $2 "," $5 "," $6 "," $7 "," $8 "," $9}' i1.txt | sort > s1.txt
awk -F',' '{print $3 "-" $2 "," $1 "," $4 "," $5}' i2.txt | sort > s2.txt
join -t',' s1.txt s2.txt | tr '-' ',' > t12.txt
awk -F',' '{print $10 "," $3 "," $4 "," $1 "," $2 "," $5 "," $6 "," $7 "," $8 "," $9 "," $11 "," $12}' t12.txt
Note that tr '-' ',' rewrites every hyphen, so this only works when the data itself contains no hyphens.
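If your shell is bash, the temporary files and the tr '-' ',' step (which would corrupt any hyphen occurring in the data) can be avoided with process substitution. A sketch, using a | character as the key glue and the filenames 1.csv and 2.csv from the question:

```shell
# Join file1 ($3,$4) with file2 ($3,$2) on a composite key, without temp files.
# "|" glues the key parts together; it must not occur in the data itself.
join -t, \
  <(awk -F, '{print $3 "|" $4 "," $1 "," $2 "," $5 "," $6 "," $7 "," $8 "," $9}' 1.csv | sort -t, -k1,1) \
  <(awk -F, '{print $3 "|" $2 "," $1 "," $4 "," $5}' 2.csv | sort -t, -k1,1) |
awk -F, -v OFS=, '{split($1, k, "|"); print $9, $2, $3, k[1], k[2], $4, $5, $6, $7, $8, $10, $11}'
```

join needs its inputs sorted on the key, hence the sort -t, -k1,1 on each side; unmatched rows (nttp here) are silently dropped, just as with the awk array lookups above.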

Related

AWK - Processing multiple file through for loop and conditional check

File 1: myfilename_WEEK.csv
w27_2018,257,1,26.20,0.00,24.26
w28_2018,257,1,7.97,0.00,24.26
w29_2018,257,1,34.86,0.00,24.26
w30_2018,257,1,3.29,0.00,24.26
File 2: myfilename_MONTH.csv
m07_2018,257,1,94.78,0.00,121.31
m08_2018,257,1,719.60,0.00,262.47
m09_2018,257,1,14925.60,0.00,13903.24
m10_2018,257,1,51099.66,0.00,81600.69
File 3: myfilename_HALF.csv
h02_2018,257,1,155345.19,480029.21,235802.91
h01_2019,257,1,273961.84,552545.36,140706.27
h02_2018,258,1,3250552.06,1299785.91,3697749.57
h01_2019,258,1,3582585.66,2670427.72,4009391.28
calendar_file:
20180805,08/05/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,01,SUNDAY
20180806,08/06/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,02,MONDAY
...
20180811,08/11/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,07,SATURDAY
20180812,08/12/2018,w28_2018,WK28 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,28,WEEK 28,01,SUNDAY
..
20180816,08/16/2018,w28_2018,WK28 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,28,WEEK 28,05,THURSDAY
Expected output (newlines added for readability):
2018,w27_2018,WK27 2018,257,1,26.20,0.00,24.26
2018,w27_2018,WK27 2018,258,1,97192.07,9028.38,52130.32
2018,w27_2018,WK27 2018,300,1,181.44,0.00,-69.72
2018,m07_2018,AUG 2018,257,1,94.78,0.00,121.31
2018,m07_2018,AUG 2018,258,1,509253.46,45141.91,399648.71
2018,m07_2018,AUG 2018,300,1,409.10,0.00,-348.60
2018,h02_2018,H02 2018,257,1,155345.19,480029.21,235802.91
2018,h02_2018,H02 2018,258,1,3250552.06,1299785.91,3697749.57
2018,h02_2018,H02 2018,300,1,1112.93,0.00,-1164.35
I would like to join all myfilename_* to add a label and Fiscal Year using calendar_file:
Individual commands are:
awk -F, 'NR==FNR {a[$3]=substr($12,3,4) FS $3 FS $4; next} {print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6}' calendar_file myfilename_WEEK.csv >> my_report.csv
awk -F, 'NR==FNR {a[$5]=substr($12,3,4) FS $5 FS $6; next} {print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6}' calendar_file myfilename_MONTH.csv >> my_report.csv
awk -F, 'NR==FNR {a[$9]=substr($12,3,4) FS $9 FS $10; next} {print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6}' calendar_file myfilename_HALF.csv >> my_report.csv
I am trying to combine these into a single loop. I have tried the following, but it doesn't work:
for exp_file in `ls myfilename_*.csv`
do
awk -F, '\
{ \
if(NR==FNR && FILENAME ~ /WEEK/) {a[$3]=substr($12,3,4) FS $3 FS $4; next} ;\
if(NR==FNR && FILENAME ~ /MONTH/) {a[$5]=substr($12,3,4) FS $5 FS $6; next} ;\
if(NR==FNR && FILENAME ~ /HALF/) {a[$9]=substr($12,3,4) FS $9 FS $10; next} ;\
{print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6} \
}' calendar_file $exp_file >> my_report.csv
done
How can I achieve this? Thanks for your help in advance!
First way (GNU awk; if you don't have GNU awk, please leave a comment):
awk -F, 'NR==FNR{y=substr($12,3,4); a[$3]=y FS $3 FS $4; b[$5]=y FS $5 FS $6; c[$9]=y FS $9 FS $10; next} FNR==1{printf nl;nl="\n"} match(FILENAME, /myfilename_([A-Z]*)/, f){NF=6;switch(f[1]){case "WEEK": $1=a[$1];break; case "MONTH": $1=b[$1];break; case "HALF": $1=c[$1];}}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
Multiple lines for readability:
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3]=y FS $3 FS $4;
b[$5]=y FS $5 FS $6;
c[$9]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Newline between sections; remove this line if you don't need it.
match(FILENAME, /myfilename_([A-Z]*)/, f){
NF=6; ## Limit the output to 6 columns; remove if you want all columns.
switch(f[1]){
case "WEEK":
$1=a[$1];
break;
case "MONTH":
$1=b[$1];
break;
case "HALF":
$1=c[$1];
}
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
An update to it:
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3]=y FS $3 FS $4;
b[$5]=y FS $5 FS $6;
c[$9]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Newline between sections; remove this line if you don't need it.
match(FILENAME, /myfilename_([A-Z]*)/, f){
NF=6; ## Limit the output to 6 columns; can be removed in your case.
$1 = f[1]=="WEEK" ? a[$1] : ( f[1]=="MONTH" ? b[$1] : (f[1]=="HALF" ? c[$1] : $1) )
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
Second Way, more concise and without using switch (also GNU awk):
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3 "WEEK"]=y FS $3 FS $4;
a[$5 "MONTH"]=y FS $5 FS $6;
a[$9 "HALF"]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Newline between sections; remove this line if you don't need it.
match(FILENAME, /myfilename_([A-Z]*)/, f){
$1=a[$1 f[1]];
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
Third way: if your data all correspond to their filenames, as shown in your samples, there's a third way that removes the need for match(), so it can work with other awks:
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3 "w"]=y FS $3 FS $4;
a[$5 "m"]=y FS $5 FS $6;
a[$9 "h"]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Newline between sections; remove this line if you don't need it.
$1~/^([wmh])[0-9]{2}_[0-9]{4}/{ ## Optional check of the first field; note that {n} intervals need a POSIX-compliant awk.
$1=a[$1 substr($1,1,1)]
print
}' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
On second thought, given the data-to-filename relations, there's actually no need to check the first letter (nor the filename):
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3]=y FS $3 FS $4;
a[$5]=y FS $5 FS $6;
a[$9]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Newline between sections; remove this line if you don't need it.
{ ## Prepend $1~/^([wmh])[0-9]{2}_[0-9]{4}/ to this line if you want to validate the first column.
$1=a[$1]
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
Here is another awk solution that is portable and efficient, and that doesn't rely on the input filenames, only on the order in which they are given on the command line.
awk -F ',' -v OFS=',' '
NR==FNR {
y=substr($12,3,4)
a[ARGV[2],$3]=y OFS $3 OFS $4 # week
a[ARGV[3],$5]=y OFS $5 OFS $6 # month
a[ARGV[4],$9]=y OFS $9 OFS $10 # half
next
}
{
$1=a[FILENAME,$1]
} 1' calendar.csv week.csv month.csv half.csv
Note that if your calendar file is sorted, there is no need to parse the fiscal-year field again for every line. Something like this would be far more efficient in that case:
if(p!=$12) y=substr(p=$12,3,4)
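Put together, the caching line sits at the top of the NR==FNR block of the portable answer above; a sketch (same field numbers; the calendar is assumed sorted, so consecutive lines share a fiscal year):

```shell
awk -F ',' -v OFS=',' '
NR==FNR {
    # Recompute the fiscal year only when $12 changes (calendar is sorted)
    if (p != $12) y = substr(p = $12, 3, 4)
    a[ARGV[2],$3] = y OFS $3 OFS $4   # week
    a[ARGV[3],$5] = y OFS $5 OFS $6   # month
    a[ARGV[4],$9] = y OFS $9 OFS $10  # half
    next
}
{ $1 = a[FILENAME,$1] } 1
' calendar.csv week.csv month.csv half.csv
```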

Print a column in awk if it matches, if not then still print the line (without that column)

I'm trying to do some filtering with awk but I'm currently running into an issue. I can make awk match a regex and print the line with the column but I cannot make it print the line without the column.
awk -v OFS='\t' '$6 ~ /^07/ {print $3, $4, $5, $6}' file
Is currently what I have. Can I make awk print the line without the sixth column if it doesn't match the regex?
Set $6 to the empty string if the regex doesn't match. As simple as that. This should do it:
awk -v OFS='\t' '{ if ($6 ~ /^07/) { print $3, $4, $5, $6 } else { $6 = ""; print $0; } }' file
Note that $0 is the entire line, including $2 (which you didn't seem to use). Also note that $6 = "" does not remove the field, it just empties it, so the output will contain two consecutive tab separators where $6 used to be.
If you just want to print $3, $4 and $5 when there isn't a match, use this instead:
awk -v OFS='\t' '{ if ($6 ~ /^07/) print $3, $4, $5, $6; else print $3, $4, $5 }' file
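If the intent is to actually delete the sixth column on non-matching lines, rather than leave an empty field behind, the remaining fields have to be shifted left and NF decremented. A sketch (decrementing NF works in GNU awk and most modern awks, though POSIX leaves it undefined):

```shell
awk -v OFS='\t' '{
    if ($6 ~ /^07/) {
        print $3, $4, $5, $6
    } else {
        # Shift fields 7..NF one position left, then drop the duplicated last field
        for (i = 6; i < NF; i++) $i = $(i + 1)
        NF--    # rebuilds $0 with OFS, so the output comes out tab-separated
        print
    }
}' file
```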

Need to separate row using sed or awk command in Solaris

I have the following data in one line.
2014-12-30 00:00:02,317 pool-14076-thread-3 DEBUG [com.fundamo.connector.airtime.service.AirtimeService] ERS Response XML - <soap:Envelope><soap:Body><TopUpPhoneAccountResult><MessageID>1913351092</MessageID><MessageRefID>BD9123000000003</MessageRefID><TopUpPhoneAccountStatus><StatusID>200</StatusID><Comment>Transaction Successful</Comment></TopUpPhoneAccountStatus><TopUpPhoneAccountAmountSent><Amount>2000</Amount><AmountExcludingTax>2000</AmountExcludingTax><TaxName/><TaxAmount>0</TaxAmount><PhoneNumber>1766910910</PhoneNumber><ResponseDateTime>20141230000002320</ResponseDateTime><ServiceType>PRETOP</ServiceType><CurrencyCode>TK</CurrencyCode></TopUpPhoneAccountAmountSent></TopUpPhoneAccountResult></soap:Body></soap:Envelope>
Now I want to take a few values from them. I used this command:
cat ERS_RESPONSE_30Dec_atp11.txt |awk -F'<' '{print $1 "," $5 "," $7 "," $10 ","$12"," $16 "," $23}'
Output:
2014-12-30 00:00:02,317 pool-14076-thread-3 DEBUG [com.fundamo.connector.airtime.service.AirtimeService] ERS Response XML - ,MessageID>1913351092,MessageRefID>BD9123000000003,StatusID>200,Comment>Transaction Successful,Amount>2000,PhoneNumber>1766910910
However, I only want the fields shown below.
2014-12-30 00:00:02,317 ,1913351092,BD9123000000003,200,Transaction Successful,2000,1766910910
What should I do?
Here is how to do it with awk
awk -F"[ <>]" '{print $1" "$2,$18,$22,$28,$32" "$33,$41,$55}' OFS=, ERS_RESPONSE_30Dec_atp11.txt
2014-12-30 00:00:02,317,1913351092,BD9123000000003,200,Transaction Successful,2000,1766910910
And here are some tips:
Try to find what separates the fields; here it is space, < and >.
Then find all the fields by running: awk -F"[ <>]" '{for (i=1;i<=NF;i++) print i"="$i}' file
Then it's just a matter of putting it all together.
You can try sed as follows (this is a little long), followed by your file name:
sed 's#\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\} [0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\},\S*\).*<MessageID>\([[:digit:]]\{1,\}\)<.*<MessageRefID>\([[:alpha:]]\{1,\}[[:digit:]]\{1,\}\).*<StatusID>\([[:digit:]]\{1,\}\).*\(Transaction Successful\).*<Amount>\([[:digit:]]\{1,\}\).*<PhoneNumber>\([[:digit:]]\{1,\}\).*#\1 ,\2,\3,\4,\5,\6,\7#g'
Replace sed with sed -i.bak to back up the original file and apply the changes in place once this works (tested on the command line on my side).
You need to use nawk on Solaris instead of awk. In Solaris' version of awk, the -F parameter can only take a single character, while in nawk, it can take a regular expression.
You need to specify the whole pattern of <.....> as a separator instead of just <:
This works on the Mac:
$ awk -F'<[^<]+>' '{print $1 "," $5 "," $7 "," $10 ","$12"," $16 "," $23}' ERS_RESPONSE_30Dec_atp11.txt
Try the following on Solaris
$ nawk -F'<[^<]+>' '{print $1 "," $5 "," $7 "," $10 ","$12"," $16 "," $23}' ERS_RESPONSE_30Dec_atp11.txt
If that doesn't work...
$ nawk -F'<[^<][^<]*>' '{print $1 "," $5 "," $7 "," $10 ","$12"," $16 "," $23}' ERS_RESPONSE_30Dec_atp11.txt
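All of these count fields by position, which breaks as soon as the XML gains or loses an element. A more robust sketch picks values by tag name instead (plain string splitting, not a real XML parser; it assumes each wanted value appears as <Tag>value</Tag> on the line):

```shell
awk '{
    out = $1 " " $2                       # date and "time,millis"
    n = split($0, parts, /</)             # break the line at every "<"
    for (i = 1; i <= n; i++) {
        # Each interesting chunk now looks like "Tag>value"
        if (split(parts[i], kv, />/) == 2 &&
            kv[1] ~ /^(MessageID|MessageRefID|StatusID|Comment|Amount|PhoneNumber)$/)
            out = out "," kv[2]
    }
    print out
}' ERS_RESPONSE_30Dec_atp11.txt
```

The anchored regex keeps AmountExcludingTax and TaxAmount from matching Amount, and closing-tag chunks such as "/MessageID>" never match.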

Multiple condition in nawk command

I have a nawk command where I need to format the data based on its length. In every case I need to keep the first 6 digits and the last 4 digits and put xxxx in the middle. Can you help fine-tune the script below?
#!/bin/bash
FILES=/export/home/input.txt
cat $FILES | nawk -F '|' '{
if (length($3) >= 13 )
print $1 "|" $2 "|" substr($3,1,6) "xxxxxx" substr($3,13,4) "|" $4"|" $5
else
print $1 "|" $2 "|" $3 "|" $4 "|" $5
}' > output.txt
input.txt
"2"|"X"|"A"|"ST"|"245552544555201"|"1111-11-11"|75.00
"6"|"Y"|"D"|"VT"|"245652544555200"|"1111-11-11"|95.00
"5"|"X"|"G"|"ST"|"3445625445552023"|"1111-11-11"|75.00
"3"|"Y"|"S"|"VT"|"24532254455524"|"1111-11-11"|95.00
output.txt
"X"|"ST"|"245552544555201"|"245552xxxxx5201"
"Y"|"VT"|"245652544555200"|"245652xxxxx5200"
"X"|"ST"|"3445625445552023"|"344562xxxxxx2023"
"Y"|"VT"|"24532254455524"|"245322xxxx5524"
Try this:
$ awk '
BEGIN {FS = OFS = "|"}
length($5)>=13 {
fld5=$5
start = substr($5,1,7)
end = substr($5,length($5)-4)
gsub(/./,"x",fld5)
sub(/^......./,start,fld5)
sub(/.....$/,end,fld5)
$1=$2; $2=$4; $3=$5; $4=fld5; NF-=3;
}1' file
"X"|"ST"|"245552544555201"|"245552xxxxx5201"
"Y"|"VT"|"245652544555200"|"245652xxxxx5200"
"X"|"ST"|"3445625445552023"|"344562xxxxxx2023"
"Y"|"VT"|"24532254455524"|"245322xxxx5524"
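If the gsub/sub trick feels opaque, a more literal-minded alternative builds the masked value with substr and a loop (same layout as above: opening quote plus 6 digits kept, 4 digits plus closing quote kept, everything between becomes x):

```shell
awk '
BEGIN { FS = OFS = "|" }
{
    if (length($5) >= 13) {
        masked = substr($5, 1, 7)                    # opening quote + first 6 digits
        for (i = 8; i <= length($5) - 5; i++)        # mask the variable-length middle
            masked = masked "x"
        masked = masked substr($5, length($5) - 4)   # last 4 digits + closing quote
        print $2, $4, $5, masked
    } else {
        print $2, $4, $5, $5                         # too short to mask: left as-is
    }
}' input.txt
```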

awk syntax error help identifying

I can't seem to identify where the syntax error is. I've tried these two statements, but nothing gets written to the BlockedIPs file. Can someone please help? Thanks!
awk '/ (TCP|UDP) / { split($5, addr, /:/); cmd = "/Users/user1/Scripts/geoiplookup " addr[1] | awk '{print $4, $5, $6}'; cmd | getline rslt; close(cmd); print $1, $2, $3, rslt }' < "$IP_PARSED" >> "$BlockedIPs"
awk '/ (TCP|UDP) / { split($5, addr, /:/); cmd = "/Users/user1/Scripts/geoiplookup " addr[1] " | awk '{print $4, $5, $6}'" ; cmd | getline rslt; close(cmd); print $1, $2, $3, rslt }' < "$IP_PARSED" >> "$BlockedIPs"
Your problem is primarily with quoting and stems from the fact that you're trying to call AWK from within an AWK one-liner. It's certainly possible, but getting the quoting right would be very tricky.
It would be much better if you retrieved the complete output of geoiplookup into a variable then did a split() to get just the data you need. Something like:
awk '/ (TCP|UDP) / { split($5, addr, /:/); cmd = "/Users/user1/Scripts/geoiplookup " addr[1]; cmd | getline rslt; split(rslt, r); close(cmd); print $1, $2, $3, r[4], r[5], r[6] }' < "$IP_PARSED" >> "$BlockedIPs"
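The cmd | getline / split() / close() sequence can be tried in isolation with any command; a minimal sketch with echo standing in for geoiplookup (fabricated output, purely illustrative):

```shell
awk 'BEGIN {
    # echo stands in for "/Users/user1/Scripts/geoiplookup <ip>"
    cmd = "echo alpha beta gamma delta"
    cmd | getline rslt     # read one line of the command output
    close(cmd)             # close the pipe so cmd could be run again
    split(rslt, r, " ")
    print r[3], r[4]       # select fields, like r[4], r[5], r[6] above
}'
```

This prints "gamma delta", confirming that the output line landed in rslt and was split into fields.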
