using awk to lookup and insert data - shell

In my continuing crusade not to use MS Excel, I'd like to process some data, send it to a file, and then insert some records from a separate file into a third file using field $1 as the index. Is this possible?
I have data like this:
2600,foo,stack,1,04/02/2015,ACH Payment,ACH Settled,1500
2600,foo,stack,2,04/06/2015,Credit Card Sale,Settled,100
2600,foo,stack,3,04/07/2015,Credit Card Sale,Settled,157.13
2600,foo,stack,4,04/07/2015,ACH Credit,ACH Settled,.03
I have this to group it:
cat group.awk
#!/usr/bin/awk -f
BEGIN {
    OFS = FS = ","
}
NR > 1 {
    arr[$1 OFS $2 OFS $3]++
}
END {
    for (key in arr)
        print key, arr[key]
}
The grouping makes it look like this:
2600,foo,stack,4
Simple multiplication is applied to fields 5, 6 and 7 where applicable; which multiplier applies depends on field 3.
In this example we can say the finished record looks like this:
2600,foo,stack,4,.2,19.8
Now in a separate file, I have this data:
2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
The basic flow is:
awk -f group.awk data.csv | awk -f math.awk > finished.csv
Then use awk (if it can do this) to look up field $1 in finished.csv, find the corresponding record above in the separate file (bill.csv), and print to a third file or insert into bill.csv.
Expected output in the third file (bill.csv):
x,y,,1111111,2600,,,,,,,19.8,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
x,y,,1111111,RS,z,a will be pre-populated, so I only need to insert three new records.
Is this something awk can accomplish?
Edit
Field $3 is the accountID that sets the multiplication on 5, 6 and 7.
Here's the idea:
bill.awk:
NR>1{if($3=="stack" && $4>199) $5=$4*0.03;
if($3=="stack" && $4<200) $5=$4*0.05
if($3=="user") $5=$4*.01
}1
total.awk:
awk -F, -v OFS="," 'NR>1{if($3=="stack" && $5<20) $6=20-$5;
if($3=="stack" && $5>20) $6=0;}1'
This part is working and the final output is like the above:
2600,foo,stack,4,.2,19.8
4*.05 = .2 and 20 - .2 = 19.8
But the minimum charge is $20,
so we'll correct it:
4*.05 = .2, and with the $20 minimum the total becomes 20
The extra populated fields come from a separate file (bill.csv), and I need to fill in the 20 in the correct record of bill.csv.
bill.csv contains everything needed except the 20
before:
x,y,,1111111,2600,,,,,,,,,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
after:
x,y,,1111111,2600,,,,,,,20,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
Is this a better explanation? Go on the assumption that group.awk, bill.awk and total.awk are working correctly. I just need to extract the correct total for field $1 and put it in bill.csv in the correct spot.

Maybe this last awk is what you need. I've tried to understand what you want, and I think it's just a matter of merging the files the awk way.
To explain: we first save fileA in an array, indexed by the first field. Then, for each line of file B, we check whether field 1 is among the indexes of our array, and if it is, we print the data from both files together:
awk -F"," 'BEGIN {while (getline < "record.dat"){ a[$1]=$0; }} {if($1 in a){ print a[$1]","$0}}' file.dat
2600,foo,stack,4,10,10.4,2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
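That prints the joined records side by side. To actually drop the total into the bill.csv layout from the edit, here is a minimal sketch (untested; the field positions are assumptions read off the before/after lines above: account ID in field 1 of finished.csv and field 5 of bill.csv, the total in field 6 of finished.csv, and the empty slot in field 12 of bill.csv, so adjust them to your real files):
$ cat insert.awk
# usage: awk -f insert.awk finished.csv bill.csv > bill_new.csv
BEGIN { FS = OFS = "," }
NR == FNR {              # first file: finished.csv
    total[$1] = $6       # remember the total, keyed by the account ID in field 1
    next
}
$5 in total {            # second file: bill.csv, matched on the account ID in field 5
    $12 = total[$5]      # fill the (assumed) empty slot with the total
}
{ print }
$ awk -f insert.awk finished.csv bill.csv > bill_new.csv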

This is the kind of solution you need:
$ cat fileA
2600,foo,stack,1,04/02/2015,ACH Payment,ACH Settled,1500
2600,foo,stack,2,04/06/2015,Credit Card Sale,Settled,100
2600,foo,stack,3,04/07/2015,Credit Card Sale,Settled,157.13
2600,foo,stack,4,04/07/2015,ACH Credit,ACH Settled,.03
2600,foo,stack,5,04/09/2015,ACH Payment,ACH Settled,147.10
$ cat fileB
2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    cnts[$1][$2 FS $3]++
    next
}
{
    for (val in cnts[$1]) {
        cnt = cnts[$1][val]
        print $1, val, cnt, cnt*2.5, $2, $3
    }
}
$ awk -f tst.awk fileA fileB
2600,foo,stack,5,12.5,registered user,5hPASLJlHlgJR4AQc9sZQ==
but until you update your question we can't provide any more concrete help than that.
The above uses GNU awk 4.* for true 2D arrays.
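If your awk doesn't have true multidimensional arrays, a rough equivalent (a sketch, untested) using the classic SUBSEP-style pseudo-multidimensional arrays that every awk supports would be:
$ cat tst_posix.awk
BEGIN { FS=OFS="," }
NR==FNR {
    cnts[$1, $2 FS $3]++        # key is id SUBSEP "foo,stack"
    next
}
{
    for (key in cnts) {
        split(key, part, SUBSEP)
        if (part[1] == $1) {
            cnt = cnts[key]
            print $1, part[2], cnt, cnt*2.5, $2, $3
        }
    }
}
$ awk -f tst_posix.awk fileA fileB
It loops over all saved keys for every fileB line instead of indexing straight into cnts[$1], so it's less efficient, but it should print the same output.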

Related

awk to get the first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. It then checks whether the 2nd element of arr is greater than 15 and, if so, prints the 1st element of arr, as required.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= is, and substr, to get the 2 characters after ETA=; 4 is used because ETA= is 4 characters long and index gives the start position. I use +0 to convert to an integer and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
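If the ETA value can have more or fewer digits than 2, a small variation of the same idea (my sketch, untested) takes everything after ETA= and lets the +0 conversion stop at the first non-numeric character, while also guarding against lines with no ETA= at all:
awk '{ p = index($0, "ETA=") } p && substr($0, p+4)+0 > 15 { print $1 }' file.txt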
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[]) below and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means: it is then everything before the first occurrence of the separator. You probably want to split the line separately, on whitespace, to obtain the first column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one, if any); adding +0 forces a numeric conversion, which simply ignores any non-numeric text after the number at the beginning of the field. So for example, on the first line, $2 is literally 12:00, team=xyz,user1=tom,dom=dby.com, but the comparison effectively checks whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
  if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
  if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

awk keep header when using IGNORECASE

I have a csv file. I want to search for a string in a specific column, columnB (column 5 in my dataset), ignoring case, and apply a filter on another column, columnC (column 10 in my dataset). Then I want to save selected columns to a file.
sample of the dataset
columnA columnB columnC columnD
abc Apple 100 today
nbd apple 50 tomorrow
ccc apple 101 today
desired output
columnB columnC
Apple 100
apple 101
The problem is that when I use awk I can select on columnB, but I can't output the header.
awk 'BEGIN {IGNORECASE = 1} {if($5 == "Apple") print $0 }' Data.csv> testPipe.txt
I have tried using NR==1 but for some reason it doesn't work with IGNORECASE.
I also tried the methods here and here.
I tried to use grep; I can output the header, but I can't specify columnB for the string matching, and the search will be applied to all columns.
cat Data.csv |{ head -1; grep -I "Apple";} | awk -F',' '{ if ($10 >100 ) { print } }'>testPipe.txt
Is there a way to combine both methods and get the desired output?
Thanks
Use the function tolower():
awk 'NR==1{print; next} tolower($5) == "apple"' file
Explanation:
# Print the headers
NR==1 {
    print
    next
}
# Print the current line if $5 matches the condition
# Note that if there is no action specified, awk will
# use print $0 by default
tolower($5) == "apple"
If you want to write further actions if the condition is true, put them into a block
tolower($5) == "apple" {
...
}
Unlike IGNORECASE, which works only with GNU awk, tolower() will work with any version of awk because it is defined by POSIX.
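If you also need the filter on columnC and only want the two columns in the output, a combined sketch building on the same idea might be (assuming comma-separated fields as in your grep pipeline, and a >= 100 cutoff, which is what the desired output suggests; adjust -F and the column numbers to your real file):
awk -F',' -v OFS=',' '
NR == 1 { print $5, $10; next }                          # header row: print just the selected columns
tolower($5) == "apple" && $10+0 >= 100 { print $5, $10 }
' Data.csv > testPipe.txt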
Update: apparently my answer isn't as good as I thought, see the comment from Ed Morton below. I'll keep it anyway, as a "How not to do it".
Original (bad) answer:
add the following to your BEGIN clause, either before or after setting IGNORECASE:
getline;
print;
explanation: the BEGIN clause is executed once before everything else, so you can process lines there, too, but you have to read them in manually.
full example:
awk '
BEGIN {
    getline;
    print;
    IGNORECASE = 1;
}
$2 == "apple" && $3 <= 100 {
    print $1;
}
'

Multiple Big file sort

I have two files; in each file the lines are ordered by timestamp, but the two files have different structures. I want to merge the two files into one single file ordered by timestamp. They look like:
file A (less than 2G)
1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858
file B (less than 15G)
1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858
How can I accomplish this? Is there any way to make it as fast as possible?
$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
I originally posted the above as just a comment under @VM17's answer but (s)he suggested I make it a new answer.
The above is more robust and efficient: it uses the default separator for sort+cut (tab); it truly sorts only on the first key (the other answer would use the whole line despite the -k1, since sort's field separator, tab, isn't present in the line); it uses a stable sort (sort -s) to preserve input order; and it uses cut to strip off the added key field, which is more efficient than invoking awk again, since awk does field splitting etc. on each record, which isn't needed just to remove the leading field(s).
Alternatively you might find something like this more efficient:
$ cat tst.awk
{ currRec = $0; currKey = $NF }
NR>1 {
    print prevRec
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {
        if ($NF < currKey) {
            print
        }
        else {
            saved = $0 ORS
            break
        }
    }
}
{ prevRec = currRec; prevKey = currKey }
END {
    print prevRec
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {
        print
    }
}
$ awk -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the 2 files and doesn't require a subsequent pipe to sort and cut.
Just check the logic, as I didn't think about it very much, and be aware that this is a rare situation where getline might be appropriate for efficiency, but make sure to read http://awk.freeshell.org/AllAboutGetline to understand all its caveats if you're ever considering using it again.
Try this-
awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'
Output-
1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
This concatenates the two files and puts the timestamp at the start of each line. It then sorts according to the timestamp and removes that dummy column.
This will be slow for big files though.
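Since both files are already individually sorted by timestamp, another option (a sketch relying on bash process substitution and sort -m, untested on large inputs) is to decorate each file separately and let sort merge the two streams instead of re-sorting everything:
sort -m -s -n -k1,1 \
    <(awk -F, -v OFS='\t' '{print $NF, $0}' fileA) \
    <(awk -F, -v OFS='\t' '{print $NF, $0}' fileB) |
cut -f2-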

Compare multiple Columns and Append the result into another file

I have two files, file1 and file2. Both files have 5 columns.
I want to compare the first 4 columns of file1 with file2.
If they are equal, I need to compare the 5th column. If the 5th column values are different, I need to print file1's 5th column as file2's 6th column.
I have used the awk below to compare two columns in two different files, but how do I compare multiple columns and append the particular column to another file when a match is found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat
Awk can simulate multidimensional arrays by subscripting with a comma-separated list of indices. Underneath, the indices are concatenated using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)
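To drop that assumption, a minimal variation (my sketch) tests membership with in instead of relying on the stored value being truthy:
$ cat tst2.awk
BEGIN { FS=OFS="," }
NR==FNR { a[$1,$2,$3,$4] = $5; next }
($1,$2,$3,$4) in a {
    if (a[$1,$2,$3,$4] != $5) print $0, a[$1,$2,$3,$4]
}
$ awk -f tst2.awk file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat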
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
    print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat

How to remove several columns and the field separators at once in AWK?

I have a big file with several thousand columns. I want to delete some specific columns, and the corresponding field separators, at once with AWK in Bash.
I can delete one column at a time with this oneliner (column 3 will be deleted and its corresponding field separator):
awk -vkf=3 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' < Big_File
However, I want to delete several columns at once... Can someone help me figure this out?
You can pass the list of columns to be deleted from the shell to awk like this:
awk -vkf="3,5,11" ...
then in the awk program parse it into an array:
split(kf,kf_array,",")
and then go through all the columns, test whether each particular column is in kf_array, and skip it if so.
The other possibility is to call your one-liner several times :-)
Here is an implementation of Kamil's idea:
awk -v remove="3,8,5" '
BEGIN {
    OFS=FS="\t"
    split(remove,a,",")
    for (i in a) b[a[i]]=1
}
{
    j=1
    for (i=1;i<=NF;++i) {
        if (!(i in b)) {
            $j=$i
            ++j
        }
    }
    NF=j-1
    print
}
'
If you can use cut instead of awk, this is easier with cut:
e.g. this obtains columns 1, 3, and 50 onwards from file:
cut -f1,3,50- file
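If your cut is GNU cut, you can also state the columns to delete rather than the columns to keep (an aside, since --complement is a GNU extension):
cut -f3,5,11 --complement file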
Something like this should work:
awk -F'\t' -v remove='3|8|5' '
{
    rec=ofs=""
    for (i=1;i<=NF;i++) {
        if (i !~ "^(" remove ")$" ) {
            rec = rec ofs $i
            ofs = FS
        }
    }
    print rec
}
' file
