awk calculate average or zero - bash

I am calculating the average for a bunch of numbers in a bunch of text files like this:
grep '^num' file.$i | awk '{ sum += $2 } END { print sum / NR }'
But sometimes the file doesn't contain the pattern, in which case I want the script to return zero. Any ideas for a slightly modified one-liner?

You're adding to your load (average) by spawning an extra process to do what the first can do alone. Using grep and awk together is a red flag. You would be better off writing:
awk '/^num/ {n++;sum+=$2} END {print n?sum/n:0}' file

Try this:
... END { print NR ? sum/NR : 0 }

Use awk's ternary operator, i.e. m ? m : n, which means: if m evaluates to a true value, use it, otherwise use n. Both m and n can be strings, numbers, or expressions that produce a value.
grep '^num' file.$i | awk '{ sum += $2 } END { print sum ? sum / NR : 0.0 }'
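As a quick sanity check, the single-process variant can be exercised on stand-in data (the input lines below are invented):

```shell
# Two matching lines: the average of 4 and 6 is 5.
printf 'num 4\nnum 6\nother 99\n' |
awk '/^num/ {n++; sum+=$2} END {print n ? sum/n : 0}'
# → 5

# No matching lines: the guard prints 0 instead of dividing by zero.
printf 'other 99\n' |
awk '/^num/ {n++; sum+=$2} END {print n ? sum/n : 0}'
# → 0
```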


How to add an if statement before calculation in AWK

I have a series of files that I am looping through, calculating the mean of a column within each file after applying a series of filters. Each filter is piped into the next before calculating the mean on the final output. All of this is done within a subshell to assign the result to a variable for later use.
for example:
variable=$(filter1 | filter 2 | filter 3 | calculate mean)
to calculate the mean I use the following code
... | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
So, my problem is that depending on the file, the number of rows after the final filter can be reduced to 0, i.e. the pipe passes nothing to AWK, and I end up with awk: fatal: division by zero attempted printed to the screen, leaving the variable empty. I later print the variable to a file, and in this case I end up with a blank in the text file. Instead, I want to state that if NR==0 then 0 is assigned to the variable, so that my final output in the text file is 0.
To do this I have tried to add an if statement at the start of my awk command
... | awk '{if (NR==0) print 0}BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
but this doesn't change the output/error and I am left with blanks.
I did move the BEGIN statement, but this caused other errors (syntax and output errors).
Expected results:
given that the column from a file has 5 lines and looks like this, I would filter on apple and pipe into the calculation:
apple 10
apple 10
apple 10
apple 10
apple 10
code:
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /apple/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
then I would expect the variable to be set to 10 (10*5/5 = 10)
In the following scenario where I filter on banana
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
given that the pipe passes nothing to AWK I would want the variable to be 0
is it just easier to accept the blank space and change it later when printed to file - i.e. replace BLANK with 0?
The default value of a variable which you treat as a number in AWK is 0, so you don't need BEGIN {s=0}.
You should put the condition in the END block. NR is not the total number of rows but the index of the current row; only in the END block does it equal the total number of rows.
awk '{s += $5} END { if (NR == 0) { print 0 } else { print s/NR } }'
Or, using a ternary:
awk '{s += $5} END { print (NR == 0) ? 0 : s/NR }'
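Either form is safe on empty input; for example, with invented five-column rows:

```shell
# Normal case: the mean of column 5 over two rows.
printf 'a b c d 10\na b c d 20\n' |
awk '{s += $5} END { print (NR == 0) ? 0 : s/NR }'
# → 15

# Empty input: NR stays 0, so the ternary prints 0 instead of dividing.
printf '' |
awk '{s += $5} END { print (NR == 0) ? 0 : s/NR }'
# → 0
```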
Also, a side note about your OFS="\t"; if($1 ~ /banana/) print $0 examples: most of that code is unnecessary. You can just pass the condition:
awk -F'\t' '$1 ~ /banana/'
When an awk program is only a condition, it uses that as a condition for whether or not to print a line. So you can use conditions as a quick way to filter through the text.
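A minimal illustration of the condition-only form, on invented tab-separated data:

```shell
# A pattern with no action block defaults to { print $0 },
# so this prints only the lines whose first field matches.
printf 'banana\t3\napple\t5\nbanana\t7\n' | awk -F'\t' '$1 ~ /banana/'
# → banana	3
# → banana	7
```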
The correct way to write:
awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
is (assuming a regexp comparison for $1 really is appropriate, which it probably isn't):
awk 'BEGIN{FS=OFS="\t"} $1 ~ /banana/{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in
Is that what you're looking for?
Or are you trying to get the mean per column 1 like this:
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' file.in
or something else?
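For example, with made-up fruit rows (column 5 holding the value, as in the question), the per-key variant prints one mean per distinct $1; note that the order of awk's for (k in s) loop is unspecified, so the output is piped through sort here:

```shell
printf 'apple\t-\t-\t-\t10\nbanana\t-\t-\t-\t30\napple\t-\t-\t-\t20\n' |
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' |
sort
# → apple	15
# → banana	30
```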

Finding max value of a field based on the value in another row

I have a file like the one below (example snippet) and I want to find the task that has taken the maximum time; the time field does not have to immediately follow the task field.
task: a
time:10
log: akjafgasgf
...
....
task:b
log: taskb
.....
time:30
....
....
task:c
time:20
....
....
log:hhhhs
Sample output for the above input is:
task:b time:30
I have tried
awk -F":" '/task/{i=$2}{if($0 ~ "time" ) arr[i]=$2}END{for(i in arr) print i,arr[i]}' filename | sort -nr -k2,2 | head -n1
and it works, but I think this can be optimized further, so please advise something better than this.
You can do it like this:
awk -F: '
$1=="task" { ct = $2 }
$1=="time" { if($2 > mti){ mti = $2; mta = ct } }
END { printf("task:%s time:%s\n",mta, mti) }
' yourfile
: is used as separator (-F:)
the current task is stored in ct, this means that we need a task before a time line
when we see the time line the value in $2 is compared with the max time value seen so far; mti and mta are updated if necessary
in the END the max values are printed
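Run against a small stand-in for the sample records, the one-pass version yields:

```shell
printf 'task: a\ntime:10\ntask:b\nlog: taskb\ntime:30\ntask:c\ntime:20\n' |
awk -F: '
$1=="task" { ct = $2 }
$1=="time" { if($2 > mti){ mti = $2; mta = ct } }
END { printf("task:%s time:%s\n", mta, mti) }
'
# → task:b time:30
```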
A wholesale solution, assuming tasks precede times, one to one:
awk -F' *: *' '$1=="task" {n=$2}
$1=="time" {t[n]=$2}
END {for(n in t) print n,t[n]}' file |
sort -k2nr
will give
b 30
c 20
a 10
Note the handling of whitespace around the field delimiter, as in your first task. You can of course sort in awk as well by using an index array, but the tool already exists for this task.
Perl :) (also sums multiple times for the given task)
perl -nlE '
$t = $1 if /^task:\s*(.*)/;
$s->{$t} += $1 if /^time:\s*(.*)/;
}{
say "$_: $s->{$_}" for sort { $s->{$b} <=> $s->{$a} } keys %$s
' tasks.txt
output
b: 30
c: 20
a: 10
or
perl -nlE '
$t = $1 if /^task:\s*(.*)/;
$s->{$t} += $1 if /^time:\s*(.*)/;
}{
say "$_: $s->{$_}" for [sort { $s->{$b} <=> $s->{$a} } keys %$s]->[0]
' tasks.txt
output
b: 30
It would be much better to use a nice, full-featured Perl script instead of such a hackish "multiliner". But YMMV - you can always squeeze it into one unreadable line, as:
perl -nlE '$t=$1if/^task:\s*(.*)/;$s->{$t}+=$1if/^time:\s*(.*)/}{say"$_: $s->{$_}"for sort{$s->{$b}<=>$s->{$a}}keys%$s' task
awk -F: '/^task/{ta=$0} /^time/&&($2>ti||ti==""){ti=$2;b=ta OFS $0} END{print b}' file
task:b time:30
Explained:
awk -F: ' # set delimiter
/^task/ { ta=$0 } # buffer task record
/^time/ && ( $2>ti || ti=="" ) { # if time value larger than previous max
ti=$2 # store new max time value
b=ta OFS $0 # construct output buffer while at it
}
END {
print b # output max
}' file

Trying to obtain the mean of array values using awk

I'm new to bash programming. Here I'm trying to obtain the mean of the array values.
Here's what I'm trying:
${GfieldList[#]} | awk '{ sum += $1; n++ } END { if (n > 0) print "mean: " sum / n; }';
Using $1 I'm not able to get all the values. Could anyone help me out with this?
For each non-empty line of input, this will sum everything on the line and print the mean:
$ echo 21 20 22 | awk 'NF {sum=0;for (i=1;i<=NF;i++)sum+=$i; print "mean=" sum / NF; }'
mean=21
How it works
NF
This serves as a condition: the statements which follow will only be executed if the number of fields on this line, NF, evaluates to true, meaning non-zero.
sum=0
This initializes sum to zero. This is only needed if there is more than one line.
for (i=1;i<=NF;i++)sum+=$i
This sums all the fields on this line.
print "mean=" sum / NF
This prints the sum of the fields divided by the number of fields.
The bare
${GfieldList[#]}
will not print the array to the screen. You want this:
printf "%s\n" "${GfieldList[#]}"
All those quotes are definitely needed.
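Putting the two pieces together (array contents invented; the array name comes from the question):

```shell
GfieldList=(21 20 22)
# printf emits one value per line, which awk then averages.
printf "%s\n" "${GfieldList[@]}" |
awk '{ sum += $1; n++ } END { if (n > 0) print "mean: " sum / n }'
# → mean: 21
```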

Getting gawk to output 0 if no arguments available

I am struggling to make this piece of code output 0 if there are no arguments in $5
ls -AFl | sed "1 d" | grep [^/]$ | gawk '{ if ($5 =="") sum = 0; else sum += $5 } END { print sum }'
When I run this line in a directory without any files in it, it outputs a newline instead of 0.
I don't understand why. How can I make it output 0 when there are no files in the directory? Any help would be appreciated, thank you.
You can change awk command to:
gawk 'BEGIN { sum = 0 } $5 { sum += $5 } END { print sum }'
i.e. initialize sum to 0 in BEGIN block and aggregate sum only when $5 is non-empty.
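The key point is that the END block runs even when no input arrives, so the BEGIN initialization guarantees a 0 (plain awk behaves the same as gawk here):

```shell
# Empty input: no records are read, but END still prints the initialized sum.
: | awk 'BEGIN { sum = 0 } $5 { sum += $5 } END { print sum }'
# → 0

# With data, only rows whose $5 is non-empty (and non-zero) are accumulated.
printf 'a b c d 3\na b c d 4\n' |
awk 'BEGIN { sum = 0 } $5 { sum += $5 } END { print sum }'
# → 7
```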
Here's an alternative way of achieving what you want:
stat * 2>/dev/null | awk '/Size/ && !/directory/ { sum += $2 } END { print (sum ? sum : 0) }'
This uses awk to parse the output of stat. The shell expands the * to the names of everything in the current directory. If the directory is empty, stat produces an error, which is sent to /dev/null.
The awk script adds the value of the second column for lines which contain "Size" but not "directory", so files and symbolic links are included. If you wanted to only count files, you could change !/directory/ to /regular file/.
The ternary operator ? : means if sum is "true", print sum, otherwise print 0. If the directory is empty, sum is not defined, so 0 is printed.
As mentioned in the comments, a more concise way of coercing sum to a number is to use print sum+0, or alternatively using the unary + operator print +sum. In this case, either are perfectly fine to use, although many recommend against +sum in more complex scenarios.
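A quick demonstration of the coercion (empty input, so sum is never set):

```shell
# An unset awk variable prints as the empty string...
: | awk 'END { print sum }'

# ...but adding 0 forces numeric context, yielding an explicit 0.
: | awk 'END { print sum + 0 }'
# → 0
```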

median of column with awk

How can I use AWK to compute the median of a column of numerical data?
I can think of a simple algorithm but I can't seem to program it:
What I have so far is:
sort | awk 'END{print NR}'
And this gives me the number of elements in the column. I'd like to use this to print a certain row (NR/2). If NR/2 is not an integer, I round up to the nearest integer and that row is the median; otherwise I take the average of the rows at (NR/2) and (NR/2)+1.
With awk you have to store the values in an array and compute the median at the end, assuming we look at the first column:
sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Sure, for real median computation do the rounding as described in the question:
sort -n file | awk ' { a[i++]=$1; }
END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
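For instance, on odd- and even-length inputs (seq used as stand-in data):

```shell
# Odd count (1..5): the middle element is 3.
seq 5 | sort -n |
awk '{ a[i++]=$1 } END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1] }'
# → 3

# Even count (1..4): the mean of the two middle elements is 2.5.
seq 4 | sort -n |
awk '{ a[i++]=$1 } END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1] }'
# → 2.5
```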
This awk program assumes one column of numerically sorted data:
#!/usr/bin/env awk
{
count[NR] = $1;
}
END {
if (NR % 2) {
print count[(NR + 1) / 2];
} else {
print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
}
}
Sample usage:
sort -n data_file | awk -f median.awk
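The same program inlined as a one-liner, checked on odd and even counts (seq used as stand-in data):

```shell
# Odd count (1..9): the middle element is 5.
seq 9 | sort -n |
awk '{ count[NR] = $1 } END { if (NR % 2) print count[(NR + 1) / 2]; else print (count[NR / 2] + count[NR / 2 + 1]) / 2.0 }'
# → 5

# Even count (1..10): the mean of the two middle elements is 5.5.
seq 10 | sort -n |
awk '{ count[NR] = $1 } END { if (NR % 2) print count[(NR + 1) / 2]; else print (count[NR / 2] + count[NR / 2 + 1]) / 2.0 }'
# → 5.5
```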
OK, I just saw this topic and thought I could add my two cents, since I looked for something similar in the past. Even though the title says awk, all these answers make use of sort as well. Calculating the median of a column of data can be easily accomplished with datamash:
> seq 10 | datamash median 1
5.5
Note that sort is not needed, even if you have an unsorted column:
> seq 10 | gshuf | datamash median 1
5.5
The documentation gives all the functions it can perform, and good examples as well for files with many columns. Anyway, it has nothing to do with awk, but I think datamash is of great help in cases like this, and could also be used in conjunction with awk. Hope it helps somebody!
This AWK based answer to a similar question on unix.stackexchange.com gives the same results as Excel for calculating the median.
If you have an array to compute the median from (this wraps Johnsyweb's solution in a one-liner):
array=(5 6 4 2 7 9 3 1 8) # numbers 1-9
IFS=$'\n'
median=$(sort -n <<< "${array[*]}" | awk '{arr[NR]=$1} END {if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}')
unset IFS
