Calculate the average over a number of columns - bash

I am trying to create a script which calculates the average over a number of rows.
This number would depend on the number of samples that I have, which varies.
An example of these files is here:
24 1 2.505
24 2 0.728
24 3 0.681
48 1 2.856
48 2 2.839
48 3 2.942
96 1 13.040
96 2 12.922
96 3 13.130
192 1 50.629
192 2 51.506
192 3 51.016
The average is calculated on the 3rd column, and the second column indicates the number of samples, 3 in this particular case.
Therefore, I should obtain 4 values here, one average value per group of 3 rows.
I have tried something like:
count=3;
total=0;
for i in $( awk '{ print $3; }' ${file} )
do
for j in 1 2 3
do
total=$(echo $total+$i | bc )
done
echo "scale=2; $total / $count" | bc
done
But it is not giving me the right answer; it should instead calculate one average for each group of three rows.
Expected output
24 1.3046
48 2.879
96 13.0306
192 51.0503

You can use the following awk script:
awk '{t[$2]+=$3;n[$2]++}END{for(i in t){print i,t[i]/n[i]}}' file
Output:
1 17.2575
2 16.9988
3 16.9423
This is better explained as a multiline script with comments in it:
# On every line of input
{
# sum up the value of the 3rd column in an array t
# which is indexed by the 2nd column
t[$2]+=$3
# Increment the number of lines having the same value of
# the 2nd column
n[$2]++
}
# At the end of input
END {
# Iterate through the array t
for(i in t){
# Print the number of samples along with the average
print i,t[i]/n[i]
}
}
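Note that this groups by the 2nd column (the sample index). If the averages should instead be grouped by the 1st column, as in the expected output above, the same idea applies with the arrays keyed on $1; a minimal sketch:
awk '{t[$1]+=$3; n[$1]++} END{for(i in t) print i, t[i]/n[i]}' file
Since for (i in t) iterates in no particular order, pipe the result through sort -n if the groups must come out in ascending order.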

Apparently I brought a third view to the problem. In awk:
$ awk 'NR>1 && $1!=p{print p, s/c; c=s=0} {s+=$3;c++;p=$1} END {print p, s/c}' file
24 1.30467
48 2.879
96 13.0307
192 51.0503
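This streaming version assumes that rows sharing the same first-column value are contiguous, as they are in the sample data; if they might not be, sorting first restores that assumption, e.g.:
$ sort -n file | awk 'NR>1 && $1!=p{print p, s/c; c=s=0} {s+=$3;c++;p=$1} END {print p, s/c}'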

Related

Optimally finding the index of the maximum element in BASH array

I am using bash in order to process software responses on-the-fly and I am looking for a way to find the
index of the maximum element in the array.
The data that gets fed to the bash script is like this:
25 9
72 0
3 3
0 4
0 7
And so I create two arrays:
arr1 = [ 25 72 3 0 0 ]
arr2 = [ 9 0 3 4 7 ]
And what I need is to find the index of the maximum number in arr1 in order to use it also for arr2.
But I would like to see if there is a quick, optimal way to do this.
Would it maybe be better to use a dictionary structure [key][value] with the data I have? Would this make the process easier?
I have also found [1] (from user jhnc) but I don't quite think it is what I want.
My brute-force approach is the following:
function MAX {
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )
local indx=0
local max=${arr1[0]}
local flag
for ((i=1; i<${#arr1[@]}; i++)); do
#To avoid invalid arithmetic operators when items are floats/doubles
flag=$( python <<< "print(${arr1[${i}]} > ${max})")
if [ $flag == "True" ]; then
indx=${i}
max=${arr1[${i}]}
fi
done
echo "MAX:INDEX = ${max}:${indx}"
echo "${arr1[${indx}]}"
echo "${arr2[${indx}]}"
}
This approach obviously will work, but is it the optimal one? Is there a faster way to perform the task?
arr1 = [ 99.97 0.01 0.01 0.01 0 ]
arr2 = [ 0 6 4 3 2 ]
In this example, if an array contains floats then I would get a
syntax error: invalid arithmetic operator (error token is ".97)
So, I am using
flag=$( python <<< "print(${arr1[${i}]} > ${max})")
In order to overcome this issue.
Finding a maximum is inherently an O(n) operation. But there's no need to spawn a Python process on each iteration to perform the comparison. Write a single awk script instead.
awk 'BEGIN {
split(ARGV[1], a1);
split(ARGV[2], a2);
max=a1[1];
indx=1;
for (i in a1) {
if (a1[i] > max) {
indx = i;
max = a1[i];
}
}
print "MAX:INDEX = " max ":" (indx - 1)
print a1[indx]
print a2[indx]
}' "${arr1[*]}" "${arr2[*]}"
The two shell arrays are passed as space-separated strings to awk, which splits them back into awk arrays.
It's difficult to do this efficiently if you really do need to compare floats. Bash can't do floats, which means invoking an external program for every number comparison. However, comparing every number in bash is not necessarily needed.
Here is a fast, pure-bash, integer-only solution using comparison:
#!/bin/bash
arr1=( 25 72 3 0 0)
arr2=( 9 0 3 4 7)
# Get the maximum, and also save its index(es)
for i in "${!arr1[#]}"; do
if ((arr1[i]>arr1_max)); then
arr1_max=${arr1[i]}
max_indexes=($i)
elif [[ "${arr1[i]}" == "$arr1_max" ]]; then
max_indexes+=($i)
fi
done
# Print the results
printf '%s\n' \
"Array1 max is $arr1_max" \
"The index(s) of the maximum are:" \
"${max_indexes[#]}" \
"The corresponding values from array 2 are:"
for i in "${max_indexes[#]}"; do
echo "${arr2[i]}"
done
Here is another optimal method that can handle floats. Comparison in bash is avoided altogether; instead the much faster sort(1) is used, and only once, rather than starting a new python instance for every number.
#!/bin/bash
arr1=( 25 72 3 0 0)
arr2=( 9 0 3 4 7)
arr1_max=$(printf '%s\n' "${arr1[@]}" | sort -n | tail -1)
for i in "${!arr1[#]}"; do
[[ "${arr1[i]}" == "$arr1_max" ]] &&
max_indexes+=($i)
done
# Print the results
printf '%s\n' \
"Array 1 max is $arr1_max" \
"The index(s) of the maximum are:" \
"${max_indexes[#]}" \
"The corresponding values from array 2 are:"
for i in "${max_indexes[#]}"; do
echo "${arr2[i]}"
done
Example output:
Array 1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0
Unless you need those arrays, you can also feed the output of your input script directly into something like this:
#!/bin/bash
input-script |
sort -nr |
awk '
(NR==1) {print "Max: "$1"\nCorresponding numbers:"; max = $1}
{if (max == $1) print $2; else exit}'
Example (with some extra numbers):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
sort -nr |
awk '(NR==1) {max = $1; print "Max: "$1"\nCorresponding numbers:"}
{if (max == $1) print $2; else exit}'
Max: 72
Corresponding numbers:
4
11
0
You can also do it 100% in awk, including the sorting (asort() here requires GNU awk):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
awk '
{
col1[a++] = $1
line[a-1] = $0
}
END {
asort(col1)
col1_max = col1[a]
print "Max is "col1_max"\nCorresponding numbers are:"
for (i in line) {
if (line[i] ~ col1_max"\\s") {
split(line[i], max_line)
print max_line[2]
}
}
}'
Max is 72
Corresponding numbers are:
0
11
4
Or, to just get the maximum of column 1 and any single number from column 2 that corresponds with it, as simply as possible:
$ echo \
'25 9
72 0
3 3
0 4
0 7' |
sort -nr |
head -1
72 0

Merging sums of numbers from different files and deleting select duplicate lines

I've checked other threads here on merging, but they seem to be mostly about merging text, and not quite what I needed, or at least I couldn't figure out a way to connect their solutions to my own problem.
Problem
I have 10+ input files, each consisting of two columns of numbers (think of them as x,y data points for a graph). Goals:
Merge these files into 1 file for plotting
For any duplicate x values in the merge, add their respective y-values together, then print one line with x in field 1 and the added y-values in field 2.
Consider this example for 3 files:
y1.dat
25 16
27 18
y2.dat
24 10
27 9
y3.dat
24 2
29 3
According to my goals above, I should be able to merge them into one file with output:
final.dat
24 12
25 16
27 27
29 3
Attempt
So far, I have the following:
#!/bin/bash
loops=3
for i in `seq $loops`; do
if [ $i == 1 ]; then
cp -f y$i.dat final.dat
else
awk 'NR==FNR { arr[NR] = $1; p[NR] = $2; next } {
for (n in arr) {
if ($1 == arr[n]) {
print $1, p[n] + $2
n++
}
}
print $1, $2
}' final.dat y$i.dat >> final.dat
fi
done
Output:
25 16
27 18
24 10
27 27
27 9
24 12
24 2
29 3
On closer inspection, it's clear I have duplicates of the original x-values.
The problem is my script needs to print all the x-values first, and then I can add them together for my output. However, I don't know how to go back and remove the lines with the old x-values that I needed to make the addition.
If I blindly use uniq, I don't know whether the old x-values or the new x-value is deleted. With awk '!duplicate[$1]++' the order of lines deleted was reversed over the loop, so it deletes on the first loop correctly but the wrong ones after that.
Been at this for a long time, would appreciate any help. Thank you!
I am assuming you already merged all the files into a single one before making the calculation. Once that's done, the script is as simple as:
awk '{ if ( $1 != "" ) { coord[$1]+=$2 } } END { for ( k in coord ) { print k " " coord[k] } }' input.txt
Hope it helps!
Edit: how does this work?
if ( $1 != "" ) { coord[$1]+=$2 }
This line will get executed for each line in your input. It will first check whether there is a value for X; otherwise it simply ignores the line. This helps to ignore empty lines, should your file have any. The block which gets executed, coord[$1]+=$2, is the heart of the script: it creates a dictionary with X as the key of each entry and at the same time adds up each value of Y found for that X.
END { for ( k in coord ) { print k " " coord[k] } }
This block will execute after awk has iterated over all the lines in your file. It will simply grab each key from the dictionary and print it, then a space and finally the sum of all the values which were found, or in other words, the value for that specific key.
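Since awk reads any number of input files in one run, the pre-merge step is optional; the same script can be pointed at all the data files at once (a sketch using the file names from the example):
awk '{ if ( $1 != "" ) { coord[$1]+=$2 } } END { for ( k in coord ) { print k " " coord[k] } }' y1.dat y2.dat y3.dat
As with any for (k in coord) loop, the output order is unspecified, so pipe the result through sort -n if final.dat needs to be sorted by the x-value.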
Using Perl one-liner
> cat y1.dat
25 16
27 18
> cat y2.dat
24 10
27 9
> cat y3.dat
24 2
29 3
> perl -lane ' $kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for(sort keys %kv) }' y*dat
24 12
25 16
27 27
29 3
>

Difference between two files after average of selected entries using shell script or awk

I have two files. Each has one column, with missing data coded as 9999 and 9000, e.g.
ifile1.txt ifile2.txt
30 20
9999 10
10 40
40 30
10 31
29 9000
9000 9999
9999 9999
31 1250
550 29
I would like to calculate the difference between the averages of the values (which are > 10) in the above two files without considering the missing values. i.e.
average ( the entries > 10 in ifile1.txt) - average (the entries > 10 in ifile2.txt)
Kindly note: the average should be taken over the selected values only, i.e. those > 10, e.g.
(30+40+29+31+550)/5 in ifile1.txt
I asked a similar question here (Difference between two files after average using shell script or awk) and tried the following, but I am getting an error.
awk '($0>10) && !/9000|9999/{a[ARGIND]+=$0;b[ARGIND]++}END{print a[1]/b[1]-a[2]/b[2]}' file1 file2
Try this awk:
awk '$1>10 && $1 !~ /^(9000|9999)$/{a[ARGIND]+=$1; b[ARGIND]++}
END{printf "%.2f\n", a[1]/b[1]-a[2]/b[2]}' ifile[12].txt
Output:
-97.33
awk '$1>10 && !/^9999$|^9000$/ {if(NR==FNR) {s1+=$1;n1++} else {s2+=$1;n2++}} END {print s1/n1 - s2/n2}' file1 file2
For the first file (NR==FNR), for values greater than 10 and values not exactly equal to 9999 or 9000, add the values to variable s1. Also increment the count variable n1. So s1/n1 gives average for the first file. Similarly for the second file (NR!=FNR), update variables s2 and n2. In the END block, print the difference of the averages.
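As a sanity check with the sample data: ifile1.txt keeps 30, 40, 29, 31 and 550, giving (30+40+29+31+550)/5 = 136; ifile2.txt keeps 20, 40, 30, 31, 1250 and 29, giving (20+40+30+31+1250+29)/6 = 233.33; the difference 136 - 233.33 = -97.33 matches the output above.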

Bash/shell script: create four random-length strings with fixed total length

I would like to create four strings, each with a random length, but their total length should be 10. So possible length combinations could be:
3 3 3 1
or
4 0 2 2
Which would then (respectively) result in strings like this:
111 222 333 4
or
1111 33 44
How could I do this?
$RANDOM will give you a random integer in range 0..32767.
Using some arithmetic expansion you can do:
remaining=10
for i in {1..3}; do
next=$((RANDOM % remaining)) # get a number in range 0..remaining-1
echo -n "$next "
((remaining -= next))
done
echo $remaining
Update: to repeat the number N times, you can use a function like this:
repeat() {
for ((i=0; i<$1; i++)); do
echo -n $1
done
echo
}
repeat 3
333
Here is an algorithm:
Make the first 3 strings with random lengths, each not greater than the remaining total length (subtracting each from the total as you go). The rest of the length is your last string.
Consider this:
sumlen=10
for i in {1..3}
do
strlen=$(($RANDOM % $sumlen)); sumlen=$(($sumlen-$strlen)); echo $strlen
done
echo $sumlen
This will output your lengths; now you can create the strings from them, assuming you know how.
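For completeness, here is a minimal pure-bash sketch that turns those lengths into the four strings, using the digits 1-4 as fill characters as in the question's example (the lengths/out variable names are just for illustration):
#!/bin/bash
sumlen=10
lengths=()
for i in 1 2 3; do
    strlen=$((RANDOM % sumlen))        # 0..sumlen-1, as in the loop above
    sumlen=$((sumlen - strlen))
    lengths+=("$strlen")
done
lengths+=("$sumlen")                   # the last string takes whatever is left

out=""
for i in "${!lengths[@]}"; do
    str=""
    for ((j = 0; j < lengths[i]; j++)); do
        str+="$((i + 1))"              # fill character is the string's number
    done
    out+="$str "
done
echo "$out"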
An alternative awk solution:
awk 'function r(n) {return int(n*rand())}
BEGIN{srand(); s=10;
for(i=1;i<=3;i++) {a=r(s); s-=a; print a}
print s}'
3
5
1
1
srand() sets a randomized seed; otherwise awk will generate the same random numbers each time.
Here you can combine the next task of generating the strings into the same awk script
$ awk 'function r(n) {return int(n*rand())};
function rep(n,t) {c="";for(i=1;i<=n;i++) c=c t; return c}
BEGIN{srand(); s=10;
for(j=1;j<=3;j++) {a=r(s); s-=a; printf("%s ", rep(a,j))}
printf("%s\n", rep(s,j))}'
Generated output:
1111 2 3 4444

bash: find pattern in one file and apply some code for each pattern found

I created a script that will auto-login to a router and check the current CPU load; if the load exceeds a certain threshold, I need it to print the current CPU value to standard output.
I would like to search the script output for a certain pattern (the value 80 in this case, which is the threshold for high CPU load) and then, for each instance of the pattern, check whether the current value is greater than 80; if so, print the 5 lines before the pattern followed by the line containing the pattern.
Question 1: how do I loop over each instance of the pattern and apply some code to each of them separately?
Question 2: how do I print n lines before the pattern followed by x lines after it?
For example, I used awk to search for the pattern "health" and print 6 lines after it, as below:
awk '/health/{x=NR+6}(NR<=x){print}' ./logs/CpuCheck.log
I would like to do the same for the pattern "80", but this time print 5 lines before it and one line after, and only if $3 (the current CPU load) exceeds the value 80.
Below is the output of the auto-login script (file name: CpuCheck.log):
ABCD-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 39 36 36 47
WXYZ-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 29 31 31 43
Thanks in advance for the help
Rather than use awk, you could use the -B and -A switches to grep, which print a number of lines before and after a pattern is matched:
grep -E -B 5 -A 1 '^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])' CpuCheck.log
The pattern matches lines which start with some numbers, followed by spaces, followed by 80, followed by a number between 81 and 100. The -E switch enables extended regular expressions (EREs), which are needed if you want to use the + character to mean "one or more". If your version of grep doesn't support EREs, you can instead use the slightly more verbose \{1,\} syntax:
grep -B 5 -A 1 '^[0-9]\{1,\}[[:space:]]\{1,\}80[[:space:]]\{1,\}\(100\|9[0-9]\|8[1-9]\)' CpuCheck.log
If grep isn't an option, one alternative would be to use awk. The easiest way would be to store all of the lines in a buffer:
awk 'f-->0;{a[NR]=$0}/^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])/{for(i=NR-5;i<=NR;++i)print i, a[i];f=1}'
This stores every line in an array a. When the third column is greater than 80, it prints the previous 5 lines from the array. It also sets the flag f to 1, so that f-->0 is true for the next line, causing it to be printed.
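For readability, here is the same one-liner spelled out as a multiline script with comments (the logic is unchanged):
# f is set to 1 on a match; when the next line is read, f-- > 0 is true,
# so awk's default action prints that one trailing line
f-- > 0
# buffer every line, keyed by its record number
{ a[NR] = $0 }
# match rows whose Limit column is 80 and whose Curr column is 81-100
/^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])/ {
    # print the 5 buffered lines before the match and the matching line itself,
    # each prefixed with its line number
    for (i = NR - 5; i <= NR; ++i)
        print i, a[i]
    # arm the flag so the line after the match is printed too
    f = 1
}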
Originally I had opted for a comparison $3>80 instead of the regular expression but this isn't a good idea due to the varying format of the lines.
If the log file is really big, meaning that reading the whole thing into memory is unfeasible, you could implement a circular buffer so that only the previous 5 lines were stored, or alternatively, read the file twice.
Unfortunately, awk is stream-oriented and doesn't have a simple way to get the lines before the current line. But that doesn't mean it isn't possible:
awk '
BEGIN {
bufferSize = 6;
}
{
buffer[NR % bufferSize] = $0;
}
$2 == 80 && $3 > 80 {
# print the five lines before the match and the line with the match
for (i = 1; i <= bufferSize; i++) {
print buffer[(NR + i) % bufferSize];
}
}
' ./logs/CpuCheck.log
I think the easiest way is with awk, reading the file twice.
This should use essentially no memory except whatever is used to store the line numbers.
If there is only one occurrence
awk 'NR==FNR&&$2=="80"{to=NR+1;from=NR-5}NR!=FNR&&FNR<=to&&FNR>=from' file{,}
If there is more than one occurrence
awk 'NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
NR!=FNR{for(i in to)if(FNR<=to[i]&&FNR>=from[i]){print;next}}' file{,}
Input/output
Input
1
2
3
4
5
6
7
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
19
20
Output
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
How it works
NR==FNR&&$2=="80"{to[++x]=NR+5;from[x]=NR-5}
In the first file if the second field is 80 set to and from to the record number + or - whatever you want.
Increment the occurrence variable x.
NR!=FNR
In the second file
for(i in to)
For each occurrence
if(FNR<=to[i]&&FNR>=from[i]){print;next}
If the current record number (in this file) is between this occurrence's to and from, then print the line. The next prevents the line from being printed multiple times if occurrences of the pattern are close together.
file{,}
Use the file twice as two arguments; the {,} expands to file file.
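For example, echo CpuCheck.log{,} prints CpuCheck.log CpuCheck.log, which is exactly the pair of file arguments that awk receives.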
