awk keep header when using IGNORECASE - bash

I have a CSV file. I want to search for a string in a specific column, columnB (column 5 in my dataset), case ignored, and apply a filter on another column, columnC (column 10 in my dataset). Then save selected columns to a file.
Sample of the dataset:
columnA columnB columnC columnD
abc Apple 100 today
nbd apple 50 tomorrow
ccc apple 101 today
Desired output:
columnB columnC
Apple 100
apple 101
The problem: when I use awk I can select columnB, but I can't output the header.
awk 'BEGIN {IGNORECASE = 1} {if($5 == "Apple") print $0}' Data.csv > testPipe.txt
I have tried using NR==1 but for some reason it doesn't work with IGNORECASE.
I also tried the methods here and here.
I tried to use grep; I can output the header, but I can't specify columnB for the string matching, and the search will be applied to all columns.
cat Data.csv | { head -1; grep -i "Apple"; } | awk -F',' '{ if ($10 > 100) { print } }' > testPipe.txt
Is there a way to combine both methods and get the desired output?
Thanks

Use the function tolower():
awk 'NR==1{print; next} tolower($5) == "apple"' file
Explanation:
# Print the headers
NR==1 {
    print
    next
}
# Print the current line if $5 matches the condition
# Note that if there is no action specified, awk will
# use print $0 by default
tolower($5) == "apple"
If you want to perform further actions when the condition is true, put them into a block:
tolower($5) == "apple" {
    ...
}
Unlike IGNORECASE, which works only with GNU awk, tolower() will work with any version of awk because it is defined by POSIX.
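To also apply the numeric filter from the question and print only the two desired columns, the pieces can be combined like this. This is just a sketch: it assumes a comma-separated file, columnB in field 5, columnC in field 10, and a filter of columnC >= 100, which is what the desired output suggests (use > 100 if that was the intent):
awk -F',' -v OFS=',' '
    NR == 1 { print $5, $10; next }                          # always pass the header through
    tolower($5) == "apple" && $10+0 >= 100 { print $5, $10 }
' Data.csv > testPipe.txt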

Update: apparently my answer isn't as good as I thought, see the comment from Ed Morton below. I'll keep it anyway, as a "How not to do it".
Original (bad) answer:
add the following to your BEGIN clause, either before or after setting IGNORECASE:
getline;
print;
Explanation: the BEGIN clause is executed once before everything else, so you can process lines there too, but you have to read them in manually.
full example:
awk '
BEGIN {
    getline;
    print;
    IGNORECASE = 1;
}
$2 == "apple" && $3 <= 100 {
    print $1;
}
'

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) containing the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F '[,TPF=]' '{print $1}' but it prints the whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match() function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates two capturing groups and saves their values into the array arr. If the second element of arr is greater than 15, the first element is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
    print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= is, and substr, to get the 2 characters after ETA= (4 is used because ETA= is 4 characters long and index gives the start position). I use +0 to convert to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
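If that two-digit assumption is too fragile, a variant (again, only a sketch) is to take everything after ETA= and let awk's numeric coercion stop at the first non-digit, i.e. the colon:
awk 'index($0,"ETA=") && substr($0,index($0,"ETA=")+4)+0>15 {print $1}' file.txt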
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[]) below and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line to obtain the first, space-separated column.
awk -F 'ETA=' '$2 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one) but using it in a numeric comparison simply ignores any non-numeric text after the number at the beginning of the field. So for example, on the first line, we are actually literally checking if 12:00, team=xyz,user1=tom,dom=dby.com is larger than 15 but it effectively checks if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
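That leading-prefix coercion is easy to demonstrate on its own, for example:
awk 'BEGIN { print "12:00, team=xyz,user1=tom" + 0 }'    # prints 12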
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
    if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
    if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

awk: select first column and value in column after matching word

I have a .csv where each row corresponds to a person (first column) and attributes with values that are available for that person. I want to extract the names and values of a particular attribute for persons where the attribute is available. The doc is structured as follows:
name,attribute1,value1,attribute2,value2,attribute3,value3
joe,height,5.2,weight,178,hair,
james,,,,,,
jesse,weight,165,height,5.3,hair,brown
jerome,hair,black,breakfast,donuts,height,6.8
I want a file that looks like this:
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
Using this earlier post, I've tried a few different awk methods but am still having trouble getting both the first column and then whatever column has the desired value for the attribute (say height). For example, the following returns everything:
awk -F "height," '{print $1 "," FS$2}' file.csv
I could grep only the rows with height in them, but I'd prefer to do everything in a single line if I can.
You may use this awk:
cat attrib.awk
BEGIN {
    FS=OFS=","
    print "name,attribute,value"
}
NR > 1 && match($0, k "[^,]+") {
    print $1, substr($0, RSTART+1, RLENGTH-1)
}
# then run it as
awk -v k=',height,' -f attrib.awk file
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
# or this one
awk -v k=',weight,' -f attrib.awk file
name,attribute,value
joe,weight,178
jesse,weight,165
With your shown samples, please try the following awk code, written and tested in GNU awk. The simple explanation is: using GNU awk, set RS (the record separator) to ^[^,]*,height,[^,]* and then print RT to get the expected output.
awk -v RS='^[^,]*,height,[^,]*' 'RT{print RT}' Input_file
I'd suggest a sed one-liner:
sed -n 's/^\([^,]*\).*\(,height,[^,]*\).*/\1\2/p' file.csv
One awk idea:
awk -v attr="height" '
BEGIN { FS=OFS="," }
FNR==1 { print "name", "attribute", "value"; next }
{ for (i=2;i<=NF;i+=2) # loop through even-numbered fields
if ($i == attr) { # if field value is an exact match to the "attr" variable then ...
print $1,$i,$(i+1) # print current name, current field and next field to stdout
next # no need to check rest of current line; skip to next input line
}
}
' file.csv
NOTE: this assumes the input value (height in this example) will match exactly (including same capitalization) with a field in the file
This generates:
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
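If a case-insensitive match were wanted instead (see the NOTE above), one sketch is to lowercase both sides of the comparison:
awk -v attr="height" '
BEGIN { FS=OFS="," }
FNR==1 { print "name", "attribute", "value"; next }
{ for (i=2;i<=NF;i+=2)                        # same even-numbered-field loop as above
      if (tolower($i) == tolower(attr)) { print $1, $i, $(i+1); next }
}
' file.csv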
With a perl one-liner:
$ perl -lne '
print "name,attribute,value" if $.==1;
print "$1,$2" if /^(\w+).*(height,\d+\.\d+)/
' file
output
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
awk accepts variable-value arguments following a -v flag before the script. Thus, the name of the required attribute can be passed into an awk script using the general pattern:
awk -v attr=attribute1 ' {} ' file.csv
Inside the script, the value of the passed variable is referenced by the variable name, in this case attr.
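For instance, a minimal sketch showing the variable arriving inside the script:
awk -v attr=height 'BEGIN { print "attribute requested:", attr }'    # prints: attribute requested: height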
Your criteria are to print column 1 (the column containing the name), the column whose header matches the required attribute, and the column immediately after it (holding the matched value).
Thus, the following script allows you to fish out the column headed "attribute1" and its next neighbour:
awk -v attr=attribute1 ' BEGIN {FS=","} /attr/{for (i=1;i<=NF;i++) if($i == attr) col=i;} {print $1","$col","$(col+1)} ' data.txt
result:
name,attribute1,value1
joe,height,5.2
james,,
jesse,weight,165
jerome,hair,black
another column (attribute 3):
awk -v attr=attribute3 ' BEGIN {FS=","} /attr/{for (i=1;i<=NF;i++) if($i == attr) col=i;} {print $1","$col","$(col+1)} ' awkNames.txt
result:
name,attribute3,value3
joe,hair,
james,,
jesse,hair,brown
jerome,height,6.8
Just change the value of the -v attr= argument for the required column.

How to add an if statement before calculation in AWK

I have a series of files that I am looping through, calculating the mean of a column within each file after performing a series of filters. Each filter is piped into the next, BEFORE calculating the mean on the final output. All of this is done within a subshell to assign the result to a variable for later use.
for example:
variable=$(filter1 | filter 2 | filter 3 | calculate mean)
to calculate the mean I use the following code
... | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
So, my problem is that depending on the file, the number of rows after the final filter is reduced to 0, i.e. the pipe passes nothing to AWK and I end up with awk: fatal: division by zero attempted printed to screen, and the variable then remains empty. I later print the variable to file and in this case I end up with BLANK in a text file. Instead what I am attempting to do is state that if NR==0 then assign 0 to the variable so that my final output in the text file is 0.
To do this I have tried to add an if statement at the start of my awk command
... | awk '{if (NR==0) print 0}BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
but this doesn't change the output/error, and I am left with BLANKs.
I did try moving the BEGIN statement, but this caused other errors (syntax and output errors).
Expected results:
given that a column from a file has 5 lines and looks like this, I would filter on apple and pipe into the calculation:
apple 10
apple 10
apple 10
apple 10
apple 10
code:
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /apple/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
then I would expect the variable to be set to 10 (10*5/5 = 10)
In the following scenario where I filter on banana
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
given that the pipe passes nothing to AWK I would want the variable to be 0
is it just easier to accept the blank space and change it later when printed to file - i.e. replace BLANK with 0?
The default value of a variable which you treat as a number in AWK is 0, so you don't need BEGIN {s=0}.
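A quick way to see this default:
awk 'BEGIN { print s + 1 }'       # prints 1: an unset variable is 0 in numeric context
awk 'BEGIN { print "[" s "]" }'   # prints []: and "" in string context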
You should put the condition in the END block. NR is not the number of all rows, but the index of the current row, so it only equals the total number of rows once you reach the END block.
awk '{s += $5} END { if (NR == 0) { print 0 } else { print s/NR } }'
Or, using a ternary:
awk '{s += $5} END { print (NR == 0) ? 0 : s/NR }'
Also, a side note about your {OFS="\t"; if($1 ~ /banana/) print $0} examples: most of that code is unnecessary. You can just pass the condition:
awk -F'\t' '$1 ~ /banana/'
When an awk program is only a condition, it uses that as a condition for whether or not to print a line. So you can use conditions as a quick way to filter through the text.
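For example, these two commands are equivalent:
awk -F'\t' '$1 ~ /banana/' file.in                 # condition only; implicit { print $0 }
awk -F'\t' '$1 ~ /banana/ { print $0 }' file.in    # the same thing, spelled out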
The correct way to write:
awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
is (assuming a regexp comparison for $1 really is appropriate, which it probably isn't):
awk 'BEGIN{FS=OFS="\t"} $1 ~ /banana/{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in
Is that what you're looking for?
Or are you trying to get the mean per column 1 like this:
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' file.in
or something else?

using awk to lookup and insert data

In my continuing crusade not to use MS Excel, I'd like to process some data, send it to a file, and then insert some records from a separate file into a third file using field $1 as the index. Is this possible?
I have data like this:
2600,foo,stack,1,04/02/2015,ACH Payment,ACH Settled,1500
2600,foo,stack,2,04/06/2015,Credit Card Sale,Settled,100
2600,foo,stack,3,04/07/2015,Credit Card Sale,Settled,157.13
2600,foo,stack,4,04/07/2015,ACH Credit,ACH Settled,.03
I have this to group it:
cat group.awk
#!/usr/bin/awk -f
BEGIN {
    OFS = FS = ","
}
NR > 1 {
    arr[$1 OFS $2 OFS $3]++
}
END {
    for (key in arr)
        print key, arr[key]
}
The grouping makes it look like this:
2600,foo,stack,4
Simple multiplication is applied to fields 5, 6 and 7 where applicable; it depends on field 3.
In this example we can say the finished record looks like this:
2600,foo,stack,4,.2,19.8
Now in a separate file, I have this data:
2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
The basic flow is:
awk -f group.awk data.csv | awk -f math.awk > finished.csv
Then use awk (if it can do this) to look up field $1 in finished.csv, find the corresponding record above in the separate file (bill.csv), and print to a third file or insert into bill.csv.
Expected output in third file(bill.csv):
x,y,,1111111,2600,,,,,,,19.8,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
x,y,,1111111,RS,z,a will be pre-populated, so I only need to insert three new records.
Is this something awk can accomplish?
Edit
Field $3 is the accountID that sets the multiplication on 5, 6 and 7.
Here's the idea:
bill.awk:
NR>1 {
    if ($3=="stack" && $4>199) $5 = $4*0.03
    if ($3=="stack" && $4<200) $5 = $4*0.05
    if ($3=="user") $5 = $4*.01
}1
total.awk:
awk -F, -v OFS="," 'NR>1 { if ($3=="stack" && $5<20) $6 = 20-$5;
if ($3=="stack" && $5>20) $6 = 0 }1'
This part is working and the final output is as above:
2600,foo,stack,4,.2,19.8
4*.05 = .2 & 20 - .2 = 19.8
But the minimum charge is $20, so we'll correct it:
4*.05 = .2, and instead of 20 - .2 = 19.8 the billed amount becomes the $20 minimum.
The extra populated fields come from a separate file (bill.csv), and I need to fill in the 20 on the correct record in bill.csv.
bill.csv contains everything needed except the 20
before:
x,y,,1111111,2600,,,,,,,,,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
after:
x,y,,1111111,2600,,,,,,,20,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
Is this a better explanation? Go on the assumption that group.awk, bill.awk and total.awk are working correctly. I just need to extract the correct total for field $1 and put it in bill.csv in the correct spot.
Maybe this last awk is what you need. I've tried to understand what you want, and I think it is just this awk way of merging:
Explanation: we first save fileA in an array indexed by the first field. Then, for each line of fileB, we check whether field 1 is among the indexes of our array, and if it is, we print the data from both files together.
awk -F"," 'BEGIN {while (getline < "record.dat"){ a[$1]=$0; }} {if($1 in a){ print a[$1]","$0}}' file.dat
2600,foo,stack,4,10,10.4,2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
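Applied to the actual goal in the question, the same lookup idea might look like the following sketch. It assumes the account ID is field 5 of bill.csv, the computed total is the last field of finished.csv, and the total belongs in field 12, which is what the before/after lines suggest:
awk -F, -v OFS=, '
NR==FNR { total[$1] = $NF; next }     # finished.csv: remember each account total, e.g. 19.8
($5 in total) { $12 = total[$5] }     # bill.csv: drop the total into slot 12
{ print }
' finished.csv bill.csv > third_file.csv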
This is the kind of solution you need:
$ cat fileA
2600,foo,stack,1,04/02/2015,ACH Payment,ACH Settled,1500
2600,foo,stack,2,04/06/2015,Credit Card Sale,Settled,100
2600,foo,stack,3,04/07/2015,Credit Card Sale,Settled,157.13
2600,foo,stack,4,04/07/2015,ACH Credit,ACH Settled,.03
2600,foo,stack,5,04/09/2015,ACH Payment,ACH Settled,147.10
$ cat fileB
2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    cnts[$1][$2 FS $3]++
    next
}
{
    for (val in cnts[$1]) {
        cnt = cnts[$1][val]
        print $1, val, cnt, cnt*2.5, $2, $3
    }
}
$ awk -f tst.awk fileA fileB
2600,foo,stack,5,12.5,registered user,5hPASLJlHlgJR4AQc9sZQ==
but until you update your question we can't provide any more concrete help than that.
The above uses GNU awk 4.* for true 2D arrays.
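For non-GNU awks that lack true multi-dimensional arrays, a sketch of the same idea using the classic SUBSEP emulation:
BEGIN { FS=OFS="," }
NR==FNR { cnts[$1, $2 FS $3]++; next }   # key is $1 SUBSEP "$2,$3"
{
    for (key in cnts) {
        split(key, parts, SUBSEP)
        if (parts[1] == $1) {
            cnt = cnts[key]
            print $1, parts[2], cnt, cnt*2.5, $2, $3
        }
    }
}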

How to remove several columns and the field separators at once in AWK?

I have a big file with several thousand columns. I want to delete some specific columns and the field separators at once with AWK in Bash.
I can delete one column at a time with this oneliner (column 3 and its corresponding field separator will be deleted):
awk -vkf=3 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' < Big_File
However, I want to delete several columns at once... Can someone help me figure this out?
You can pass a list of columns to be deleted from the shell to awk like this:
awk -vkf="3,5,11" ...
then, in the awk program, parse it into an array:
split(kf, kf_array, ",")
and then go through all the columns, testing whether each particular column is in kf_array, and skip it if so.
Another possibility is to call your oneliner several times :-)
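The "several times" route might look like this (a sketch, reusing the oneliner from the question; note that after each pass the columns shift left, so deleting from the highest index down keeps the original numbering valid):
awk -vkf=11 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' < Big_File |
awk -vkf=5 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' |
awk -vkf=3 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}'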
Here is an implementation of Kamil's idea:
awk -v remove="3,8,5" '
BEGIN {
    OFS=FS="\t"
    split(remove,a,",")
    for (i in a) b[a[i]]=1
}
{
    j=1
    for (i=1;i<=NF;++i) {
        if (!(i in b)) {
            $j=$i
            ++j
        }
    }
    NF=j-1
    print
}
'
If you can use cut instead of awk, this is easier. For example, this obtains columns 1, 3, and 50 onwards from file:
cut -f1,3,50- file
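Since the task is deleting columns, GNU cut also has a --complement flag (a GNU extension) that expresses it directly:
cut --complement -f3,5,11 file    # keep everything except columns 3, 5 and 11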
Something like this should work:
awk -F'\t' -v remove='3|8|5' '
{
    rec = ofs = ""
    for (i=1;i<=NF;i++) {
        if (i !~ "^(" remove ")$" ) {
            rec = rec ofs $i
            ofs = FS
        }
    }
    print rec
}
' file
