Using AWK to compare values in the same column - bash

I'm using AWK and I've been trying to compare the previous value in a column with the next one until it finds the highest, but I haven't been able to.
awk '$6 >= $6 {print $6}'
With the above, it returns every single value.
For example:
money:
49
90
30
900
I would like it to return 900

$ awk '(NR>1) && ((NR==2) || ($1>max)){max=$1} END{if (max != "") print max}' file
900
The above is based on your posted example (including the money: header line) but would also work even if all input values were 0 or negative or the input file was empty. Change $1 to $6 if the real field number you're interested in is 6.
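For instance, with all-negative input (a quick hypothetical check; a version that initialized max to 0 would wrongly print 0 here):
$ printf 'money:\n-5\n-3\n-9\n' | awk '(NR>1) && ((NR==2) || ($1>max)){max=$1} END{if (max != "") print max}'
-3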
Also consider:
$ tail -n +2 file | cut -f1 | sort -rn | head -1
900
and change -f1 to -f6.
Set separator chars in awk with -F'<char>' and cut with -d'<char>' if necessary.
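For example, if the file were comma-separated (hypothetical), the two commands become:
$ awk -F',' '(NR>1) && ((NR==2) || ($6>max)){max=$6} END{if (max != "") print max}' file.csv
$ tail -n +2 file.csv | cut -d',' -f6 | sort -rn | head -1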

Like this:
awk 'int($6) && $6 > n{n=$6}END{print n}' file

Assuming the columns are separated by the default (whitespace) separator, the following awk line checks the 6th value in each data row to see if it is greater than any previously seen value. The variable current, set to zero in a BEGIN block, gets updated each time a greater 6th value is encountered; the END block prints the final value held in current (which is the greatest value of the 6th field):
awk 'BEGIN {current=0} NR>1{if ($6>current) current=$6} END {print current}' dataFile.txt
For .csv data, the file separator can be defined as a comma with FS="," in the BEGIN block:
awk 'BEGIN {FS=","; current=0} NR>1{if ($6>current) current=$6} END {print current}' dataFile.csv

$ awk 'NR==2 {max=$6}
       $6>max {max=$6}
       END {print max}' file

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines should be printed:
345
456
I tried cat file.txt | awk -F'[,TPF=]' '{print $1}' but it prints the whole line, which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. It then checks whether the 2nd element of arr is greater than 15 and, if so, prints the 1st element of arr, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= is, and substr, to get the 2 characters after ETA=; the offset is 4 because ETA= is 4 characters long and index gives the start position. I add +0 to convert the substring to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to split the line separately to obtain the first, space-separated column.
awk -F 'ETA=' '$2 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one), but using it in a numeric comparison simply ignores any non-numeric text after the number at the beginning of the field. So, for example, on the first line we are literally checking whether 12:00, team=xyz,user1=tom,dom=dby.com is larger than 15, but it effectively checks whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
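To see what the comparison actually operates on, you can print $2 in isolation (a quick check against the sample data above):
$ awk -F 'ETA=' '{ print "[" $2 "]" }' file.txt
[12:00, team=xyz,user1=tom,dom=dby.com]
[23:00]
[22:00,dom=sss.co.uk,user2=lis]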
Using awk, you could match ETA= followed by 1 or more digits, then take the matched text without the ETA= part, check whether the number is greater than 15, and if so print the first field.
awk 'match($0, /ETA=[0-9]+/) {
  if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
  if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

How to add an if statement before calculation in AWK

I have a series of files that I am looping through, calculating the mean of a column within each file after applying a series of filters. Each filter is piped into the next, BEFORE calculating the mean on the final output. All of this is done within a subshell to assign the result to a variable for later use.
for example:
variable=$(filter1 | filter2 | filter3 | calculate_mean)
to calculate the mean I use the following code
... | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
So, my problem is that depending on the file, the number of rows after the final filter is reduced to 0, i.e. the pipe passes nothing to AWK and I end up with "awk: fatal: division by zero attempted" printed to screen, and the variable then remains empty. I later print the variable to a file, and in this case I end up with a BLANK in the text file. Instead, what I am attempting to do is state that if NR==0 then assign 0 to the variable, so that my final output in the text file is 0.
To do this I have tried to add an if statement at the start of my awk command
... | awk '{if (NR==0) print 0}BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
but this doesn't change the output or the error, and I am left with BLANKs.
I did try moving the BEGIN statement, but this caused other errors (syntax and output errors).
Expected results:
given that the column from a file has 5 lines and looks like this, I would filter on apple and pipe into the calculation:
apple 10
apple 10
apple 10
apple 10
apple 10
code:
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /apple/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
then I would expect the variable to be set to 10 (10*5/5 = 10)
In the following scenario where I filter on banana
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
given that the pipe passes nothing to AWK I would want the variable to be 0
is it just easier to accept the blank space and change it later when printed to file - i.e. replace BLANK with 0?
The default value of a variable which you treat as a number in AWK is 0, so you don't need BEGIN {s=0}.
You should put the condition in the END block. NR is not the number of all rows but the index of the current row, so only in the END block does it hold the total number of rows.
awk '{s += $5} END { if (NR == 0) { print 0 } else { print s/NR } }'
Or, using a ternary:
awk '{s += $5} END { print (NR == 0 ? 0 : s/NR) }'
Also, a side note about your BEGIN{OFS="\t"} and ($1 ~ /banana/) { print $0 } examples: most of that code is unnecessary. You can just pass the condition:
awk -F'\t' '$1 ~ /banana/'
When an awk program is only a condition, it uses that as a condition for whether or not to print a line. So you can use conditions as a quick way to filter through the text.
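For example, filtering the apple rows from the sample above (assuming tab-separated input):
$ awk -F'\t' '$1 ~ /apple/' file.in
apple	10
apple	10
apple	10
apple	10
apple	10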
The correct way to write:
awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
is (assuming a regexp comparison for $1 really is appropriate, which it probably isn't):
awk 'BEGIN{FS=OFS="\t"} $1 ~ /banana/{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in
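If an exact string match is what's actually wanted (an assumption; only a regexp test was shown), the comparison would instead be:
awk 'BEGIN{FS=OFS="\t"} $1 == "banana"{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in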
Is that what you're looking for?
Or are you trying to get the mean per column 1 like this:
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' file.in
or something else?

awk combine 2 commands for csv file formatting

I have a CSV file which has 4 columns. I want to first:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another, but I didn't get exactly what I wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
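A quick way to convince yourself of the off-by-one (assuming seq is available):
$ seq 12 | awk 'NR < 10' | wc -l
9
$ seq 12 | awk 'NR <= 10' | wc -l
10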
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, 'NR>10{exit} {print $3}'
This exits as soon as the record number exceeds 10; until then it prints the third column. It does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.
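If you do need quote-aware parsing and GNU awk is available, a minimal sketch is to define fields by their content with FPAT rather than by separators (GNU awk only; the regex is the classic one from the gawk manual):
$ echo 'red,"orange,yellow",green' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF, $2 }'
3 "orange,yellow"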

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
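For instance, on a hypothetical tab-separated inputfile (columns separated by real tabs):
foo	x	8
bar	y	3
baz	z	8
the first command would print:
foo
baz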
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read -r y; do split=(${y}); [ "${split[2]}" = '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. The IFS (input field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If it is equal, the first field of the array is printed. Remember that array indices start counting at zero.
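A sketch of the same idea that sidesteps the array indexing pitfalls is to let read split directly into named variables (assuming the file really has exactly three columns):
while read -r c1 c2 c3; do [ "$c3" = '8' ] && echo "$c1"; done < file1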

Shell Script add column values

I have a text file which contains lines like the below:
{"userId":"f1fcab","count":"3","type":"Stack"}
{"userId":"fcab","count":"2","type":"Stack"}
{"userId":"abcd","count":"5","type":"Stack"}
I want to get the sum of the values of count.
I am using awk to achieve this, like below:
$ awk -F "," '{print $4}' test.txt
How can I get only the integer values using awk and add them all?
My script should give me as
sum=10
You could try the below,
$ awk -F'"' '{sum = sum + $8;}END{print "sum="sum+0}' file
sum=10
-F'"' Sets the double quotes as FS value. Awk splits the row into colunms according to the value of FS variable.
sum = sum + $8 Calculate the sum of all the values in column no 8 and store it into a variable called sum
Finally, printing the variable sum in the END block gives you the desired output.
You can get the value of the count key by using the double quote (") as the delimiter, so that the eighth column will be the value to sum:
$ awk -F"\"" 'BEGIN {sum=0} {sum+=$8} END {print sum}' fd
10
Assuming consistent use of double quote characters, you can use:
awk -F\" '{s += $8} END{print "sum=" s+0}' inputFile
This will generate:
sum=10
This works because a quote delimiter gives you the fields:
$1={  $2=userId  $3=:  $4=f1fcab  $5=,  $6=count  $7=:  $8=3  ...
awk -F'[:"]' '{sum+=$10} END{print "sum=" sum}' File
Setting ':' and '"' as delimiters, then taking the 10th field, which is the count value; add them up in sum and print it at the end.
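To see exactly how FS='[:"]' numbers the fields (adjacent delimiters produce empty fields, which is why the count lands in $10), a quick hypothetical check:
$ echo '{"userId":"f1fcab","count":"3","type":"Stack"}' | awk -F'[:"]' '{ for (i=1; i<=NF; i++) if ($i != "") print "$" i "=" $i }'
$1={
$2=userId
$5=f1fcab
$6=,
$7=count
$10=3
(and so on for the remaining fields)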
Example:
sdlcb@ubuntu:~/AMD_C/SO$ cat File
{"userId":"f1fcab","count":"3","type":"Stack"}
{"userId":"fcab","count":"2","type":"Stack"}
{"userId":"abcd","count":"5","type":"Stack"}
sdlcb@ubuntu:~/AMD_C/SO$ awk -F'[:"]' '{sum+=$10} END{print "sum=" sum}' File
sum=10
