Using sed commands on an output to find an average - bash

My goal is to compute the average ping time to a website over the first 10 pings, the second 10 pings, and the last 10 pings, and to print all three in one go using a bash script.
So far, I have come up with the following script:
ping -c 30 google.com > pings.txt
sed 's/\=/= /g' pings.txt > formatted_op.txt
sed -n '2, 11p' formatted_op.txt > 1.txt
sed -n '12, 22p' formatted_op.txt > 2.txt
sed -n '22, 32p' formatted_op.txt > 3.txt
awk '{print $10}' 1.txt | awk '{ sum += $1 } END {print sum/10}' > 1_avg.txt
awk '{print $10}' 2.txt | awk '{ sum += $1 } END {print sum/10}' > 2_avg.txt
awk '{print $10}' 3.txt | awk '{ sum += $1 } END {print sum/10}' > 3_avg.txt
cat 1_avg.txt 2_avg.txt 3_avg.txt > final_avg.txt
cat final_avg.txt
rm 1.txt 2.txt 3.txt 1_avg.txt 2_avg.txt 3_avg.txt formatted_op.txt pings.txt
However, I want to be able to do this without creating any temporary files. How do I do so?
I have also tried doing it using pipes as follows:
ping -c 30 google.com | sed 's/\=/= /g' | sed -n '2, 11p' | awk '{print $10}'| awk '{ sum += $1 } END {print sum/10}'
but this only computes the average of the first 10 pings, and running ping three separate times would give different results each time.

This can all be done with awk pretty easily. Assuming your ping output looks like this:
64 bytes from yyz08s09-in-f110.1e100.net (172.217.1.110): icmp_seq=27 ttl=59 time=0.636 ms
64 bytes from yyz08s09-in-f110.1e100.net (172.217.1.110): icmp_seq=28 ttl=59 time=0.638 ms
64 bytes from yyz08s09-in-f110.1e100.net (172.217.1.110): icmp_seq=29 ttl=59 time=0.658 ms
64 bytes from yyz08s09-in-f110.1e100.net (172.217.1.110): icmp_seq=30 ttl=59 time=0.666 ms
This will do the trick:
ping -c 30 google.com |
awk '
{
    split($8, a, "=")
    if (NR > 1 && NR < 12) {
        round1 += a[2]
    } else if (NR < 22) {
        round2 += a[2]
    } else if (NR < 32) {
        round3 += a[2]
    }
}
END {
    print round1/10" "round2/10" "round3/10
}
'
We use the NR variable to check which line of output is being processed and add the time to the appropriate accumulator. (The time value is obtained by splitting the time field on the equals sign.)
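If you don't want to rely on the time field being $8 (ping output format varies between implementations), a slightly more defensive sketch could match the time= token explicitly; this is my own variant, not part of the answer above:
ping -c 30 google.com |
awk 'match($0, /time=[0-9.]+/) {
         t = substr($0, RSTART + 5, RLENGTH - 5)   # the number after "time="
         n++                                       # count only reply lines
         if (n <= 10)      round1 += t
         else if (n <= 20) round2 += t
         else if (n <= 30) round3 += t
     }
     END { print round1/10, round2/10, round3/10 }'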

You don't have to make it that complicated. Do something like
for i in {1..10}
do
ping -c 30 google.com | tail -n1 |
awk -v count="$i" -v FS="/" '{print "Count", count,"average : ",$5}'
done
Sample Output
Count 1 average : 78.484
Count 2 average : 74.473
Count 3 average : 76.971
Count 4 average : 78.789
Count 5 average : 103.609
Count 6 average : 105.754
Count 7 average : 98.969
Count 8 average : 99.009
Count 9 average : 86.186
Count 10 average : 86.521
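If you want exactly the three 10-ping averages from the question, the same idea can be adapted; a sketch, assuming ping prints the usual rtt min/avg/max/mdev summary as its last line:
for i in {1..3}
do
ping -c 10 google.com | tail -n1 |
awk -v round="$i" -v FS="/" '{print "Round", round, "average : ", $5}'
done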

A pure Bash solution is not possible, because Bash has no support for floating point arithmetic. You need at least bc.
In order to calculate an average it is not necessary to collect and store all of the values; you can calculate the average incrementally. For the mathematics, see Volume 2 of Donald Knuth's "The Art of Computer Programming". The formula as a Bash function might look like this:
avg ()
{
local a=$1 # average of the values already seen in the past
local x=$2 # new sample
local n=$3 # sequence number
bc -l <<<"$a + ($x - $a) / $n"
}
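For example, if the average after the first sample is 10 and the second sample is 20, the call below should print 15 (padded with decimals, since bc -l defaults to scale=20):
avg 10 20 2    # 10 + (20 - 10) / 2 = 15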
A single ping time can be collected by the following function.
pingtime ()
{
ping -c1 ${host:-localhost} |
sed -n 's/.*time=\(.*\) .*/\1/p'
}
With the above functions the calculation of n average values with m samples for each average can be done with two for loops.
n=3
m=10
for i in $(seq $n); do
a=
for j in $(seq $m); do
t=$(pingtime)
if [ -z "$a" ]; then
a=$t
else
a=$(avg $a $t $j)
fi
done
LC_NUMERIC=C printf '%2g\n' $a
done
In the outer loop the average a is first cleared. In the inner loop the ping time is measured. If the average is empty it is initialized with the first sample. Otherwise the average is calculated. And the result is reported in the outer loop.
bc and printf behave differently depending on the locale. If necessary, a specific behavior can be forced by setting LC_NUMERIC.
If you wish you can add a sleep 1 in front of the ping, to measure only one sample per second.

You can also try my solution
#!/bin/bash
pings=10
for i in {1..10};
do
ping -c ${pings} google.pl | grep -o 'time=.*' | sed 's/time=\(.*\) ms/\1/g' | awk -v iteration="$i" -v pings="$pings" '{sum+=$1+$2+$3} END {print "Iteration " iteration " Average time " sum/pings}'
done

Related

Random line using sed

I want to select a random line with sed. I know shuf -n and sort -R | head -n do the job, but for shuf you have to install coreutils, and the sort solution isn't optimal on large data:
Here is what I tested :
echo "$var" | shuf -n1
Which gives the optimal solution but I'm afraid for portability
that's why I want to try it with sed.
`var="Hi
i am a student
learning scripts"`
output:
i am a student
output:
Hi
It must be random.
It depends greatly on what you want your pseudo-random probability distribution to look like. (Don't try for random; be content with pseudo-random. If you do manage to generate a truly random value, go collect your Nobel Prize.) If you just want a uniform distribution (e.g., each line has equal probability of being selected), then you'll need to know a priori how many lines are in the file. That distribution is not quite so easy to get as one in which the earlier lines in the file are slightly more likely to be selected, but it is still easy enough, so we'll do that. Assuming that the number of lines is less than 32769, you can simply do:
N=$(wc -l < input-file)
sed -n -e $((RANDOM % N + 1))p input-file
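A small aside of my own (not from the answer): sed can quit as soon as it has printed the chosen line, which helps on large files:
N=$(wc -l < input-file)
sed -n "$((RANDOM % N + 1)){p;q}" input-file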
-- edit --
After thinking about it for a bit, I realize you don't need to know the number of lines, so you don't need to read the data twice. I haven't done a rigorous analysis, but I believe that the following gives a uniform distribution (line k ends up as the output exactly when it is picked at step k and never replaced afterwards, which happens with probability (1/k) * (k/(k+1)) * ... * ((N-1)/N) = 1/N):
awk 'BEGIN{srand()} rand() < 1/NR { out=$0 } END { print out }' input-file
-- edit --
Ed Morton suggests in the comments that we should be able to invoke rand() only once. That seems like it ought to work, but doesn't seem to. Curious:
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed); r=rand()} r < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 205
2 64
3 37
4 21
5 9
6 9
7 9
8 46
real 0m1.862s
user 0m0.689s
sys 0m0.907s
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed)} rand() < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 55
2 60
3 37
4 50
5 57
6 45
7 50
8 46
real 0m1.924s
user 0m0.710s
sys 0m0.932s
var="Hi
i am a student
learning scripts"
mapfile -t array <<< "$var" # create array from $var
echo "${array[$RANDOM % (${#array}+1)]}"
echo "${array[$RANDOM % (${#array}+1)]}"
Output (e.g.):
learning scripts
i am a student
See: help mapfile
This seems to be the best solution for large input files:
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file
as it uses standard UNIX tools, it's not restricted to files that are 32,769 lines long or less, it doesn't have any bias towards either end of the input, it'll produce different output even if called twice in 1 second, and it exits immediately after the target line is printed rather than continuing to the end of the input.
Update:
Having said the above, I have no explanation for why a script that calls rand() once per line and reads every line of input is about twice as fast as a script that calls rand() once and exits at the first matching line:
$ seq 100000 > file
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file;
done > o3
real 1m0.712s
user 0m8.062s
sys 0m9.340s
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" 'BEGIN{srand(seed)} rand() < 1/NR{ out=$0 } END { print out}' file;
done > o4
real 0m29.950s
user 0m9.918s
sys 0m2.501s
They both produced very similar types of output:
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o3 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
498 500 1 2
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o4 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
490 500 1 3
Final Update:
Turns out it was calling wc that (unexpectedly to me at least!) was taking most of the time. Here's the improvement when we take it out of the loop:
$ time { max=$(wc -l < file); for i in $(seq 500); do awk -v seed="$RANDOM" -v max="$max" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file; done } > o3
real 0m24.556s
user 0m5.044s
sys 0m1.565s
so the solution where we call wc up front and rand() once is faster than calling rand() for every line as expected.
On the bash shell, first initialize the seed to the line count cubed (or a value of your choice):
$ i=;while read a; do let i++;done<<<$var; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;echo -e $var |sed -En "$l p"
If you move your data to a file, varfile:
$ echo -e $var >varfile
$ i=;while read a; do let i++;done<varfile; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;sed -En "$l p" varfile
Put the last command inside a loop, e.g. for ((c=0; c<9; c++)) { ... ; }
Using GNU sed and bash; no wc or awk:
f=input-file
sed -n $((RANDOM%($(sed = $f | sed '2~2d' | sed -n '$p')) + 1))p $f
Note: The three seds in $(...) are an inefficient way to fake wc -l < $f. Maybe there's a better way -- using only sed of course.
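One guess at a better way (my assumption, not the author's): sed can count the lines by itself if the = command is applied only to the last line, which cuts the counting down to a single sed:
f=input-file
sed -n "$((RANDOM % $(sed -n '$=' $f) + 1))p" $f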
Using shuf:
$ echo "$var" | shuf -n 1
Output:
Hi

Get Average of Found Numbers in Each File to Two Decimal Places

I have a script that searches through all files in the directory and pulls the number next to the word <Overall>. I now want to get the average of those numbers for each file and output the filename next to the average, to two decimal places. I've gotten most of it to work except displaying the average. I should say I think it works; I'm not sure if it's pulling all of the instances in the file, and I'm definitely not sure it's finding the average, which is hard to tell without the precision. I'm also sorting by the average at the end. I'm trying to use awk and bc to get the average; there's probably a better method.
What I have now:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc`
done) | sort -g -k 2
The output i get is:
John 4
Lucy 4
Matt 5
Sara 5
But it shouldn't be an integer and it should be to two decimal places.
Additionally, the files I'm searching through look like this:
<Student>John
<Math>2
<English>3
<Overall>5
<Student>Richard
<Math>2
<English>2
<Overall>4
In general, your script does not extract all numbers from each file, but only the first digit of the first number. Consider the following file:
<Overall>123 ...
<Overall>4 <Overall>56 ...
<Overall>7.89 ...
<Overall> 0 ...
The command grep '<Overall>' | head -c 10 | tail -c 1 will only extract 1.
To extract all numbers preceded by <Overall> you can use grep -Eo '<Overall> *[0-9.]*' | grep -o '[0-9.]*' or (depending on your version) grep -Po '<Overall>\s*\K[0-9.]*'.
To compute the average of these numbers you can use your awk command or specialized tools like ... | average (from the package num-utils) or ... | datamash mean 1.
To print numbers with two decimal places (that is 1.00 instead of 1 and 2.35 instead of 2.34567) you can use printf.
#! /bin/bash
path=/home/Downloads/scores/
for i in "$path"/*; do
    avg=$(grep -Eo '<Overall> *[0-9.]*' "$i" | grep -o '[0-9.]*' |
          awk '{total += $1} END {print total/NR}')
    printf '%s %.2f\n' "$(basename "$i" .dat)" "$avg"
done |
sort -g -k 2
Sorting works only if file names are free of whitespace (like space, tab, newline).
Note that you can swap out the two lines after avg=$( with any method mentioned above.
You can use a sed command and retrieve the values to calculate their average with bc:
# Read stdin, store the values in an array, and call bc to compute the average
function avg() { mapfile -t l ; local IFS=+ ; bc <<< "scale=2; (${l[*]})/${#l[@]}" ; }
# Browse the .dat files, then display the average for each file
find . -iname "*.dat" |
while read -r f
do
    name=${f##*/}             # Remove the dirname for display
    # Echo the file basename and a tab (no newline)
    echo -en "${name%.dat}\t"
    # Retrieve all the "Overall" values and pass them to our avg function
    sed -nE 's/<Overall>([0-9]+)/\1/p' "$f" | avg
done
Output example:
score-2 1.33
score-3 1.33
score-4 1.66
score-5 .66
The pipeline head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc needs improvement.
head -c 10 | tail -c 1 leaves only the 10th character of the first Overall line from each file; better drop that.
Instead, use awk to "remove" the prefix <Overall> and extract the number; we can do this by using <Overall> for the input field separator.
Also use awk to format the result to two decimal places.
Since awk did the job, there's no more need for bc; drop it.
The above pipeline becomes awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'.
Don't forget to keep the closing backtick (`) after it.
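Applied to the script from the question, the loop might then look something like this (a sketch that keeps the original backtick-style command substitution):
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'`
done) | sort -g -k 2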

how to find maximum and minimum values of a particular column using AWK [duplicate]

I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(The similar commands can return me the right minimum and maximum of the second column.)
Could someone tell me where I was wrong? Thank you!
Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`
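A quick way to see the difference between string and numeric comparison (my own demo, not from the answer):
$ awk 'BEGIN { print ("10" < "4"), (10 < 4) }'
1 0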
a non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit too clever. tee duplicates its stdin stream to the files named as arguments, and it also streams the same data to stdout. I'm using process substitutions to filter the streams.
The same effect can be used (with less flourish) to extract the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
read line
echo "min=$line"
while read line; do max=$line; done
echo "max=$max"
}
Your problem was simply that in your script you had:
if ($1<a) a=$1 fi
and that final fi is not part of awk syntax, so it is treated as a variable. That makes a=$1 fi a string concatenation, so you are TELLING awk that a contains a string, not a number, and hence you get a string comparison instead of a numeric one in $1<a.
More importantly in general, never start with some guessed value for max/min, just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
min = (NR==1 || $1<min ? $1 : min)
max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN pick whatever you'd prefer to print when the input file is empty.
Late, but here is a shorter command that needs no assumed initial value:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d, Max is %d\n",Min,Max}' FileName.dat
A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head, tail to get the desired output as shown above.
(PS: In case if you want to extract the first column/any desired column you can use the cut command i.e. to extract the first column cut -d " " -f 1 sample.dat)
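Putting the two together for the first column of the sample data would look roughly like this:
cut -d " " -f 1 sample.dat | sort -n -r | tail -n1   # min -> 4
cut -d " " -f 1 sample.dat | sort -n -r | head -n1   # max -> 12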
#minimum
cat your_data_file.dat | sort -nk3,3 | head -1
#this will find the minimum of column 3
#maximum
cat your_data_file.dat | sort -nk3,3 | tail -1
#this will find the maximum of column 3
#to find it in column 2, use -nk2,2
#assign to a variable and use
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`

bash checking all the elements in a list are within a range

LessThanThirty=1
GreaterThanTwenty=1
while read line
do
LessThanThirty=$(echo "$line <= 30.0" | bc)
GreaterThanTwenty=$(echo "$line >= 20.0" | bc)
done < <(grep -A 26 "some text" someFile.txt | awk '/More Text/ { gsub(/M/, " "); print $4 }' | uniq )
echo $LessThanThirty
echo $GreaterThanTwenty
I have a whole list of numbers and want to test that they are all within the range of 20 - 30. If any of them are greater than 30, then LessThanThirty should end up false. As it stands, my echos at the end only report the status of the checks on the last element in the list. I need a way for my variables to be set false if ANY of the numbers in the list are outside the range.
Just do the comparisons in awk:
read GreaterThanTwenty LessThanThirty < <(
grep -A 26 "some text" someFile.txt | \
awk '/More Text/ { gsub(/M/, " "); print $4 }' | \
awk 'BEGIN { g20=l30=1 }
$0 < 20.0 { g20=0 }
$0 > 30.0 { l30=0 }
END { print g20, l30 }'
)
You can probably eliminate the call to uniq and merge the two awk scripts into one fairly easily by using an awk array to eliminate duplicates, but this demonstrates the idea.
This works by setting two awk variables, g20 and l30 initially to 1. If an input line is ever less than 20, we set g20 to 0 (i.e., not all inputs are greater than 20). Likewise, we set l30 to 0 if any input is ever greater than 30. After consuming all the input, we print the values of g20 and l30. This becomes the output of the process substitution, which is read into the two bash variables using the read command.
(I've removed the call to uniq, since it requires its input to be sorted anyway, and you don't seem to be interested in counts, just the existence of values outside of the range. It will probably take more time to sort and remove duplicates than it will to just run them all through awk to do the range check. This would make it easier to merge the two awk programs, but for simplicity I'll leave them be.)
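If you did want to merge them, a rough sketch (keeping the same grep, 'More Text' filter and gsub as the original) could be:
read GreaterThanTwenty LessThanThirty < <(
    grep -A 26 "some text" someFile.txt |
    awk 'BEGIN { g20 = l30 = 1 }
         /More Text/ { gsub(/M/, " ")
                       if ($4 < 20.0) g20 = 0
                       if ($4 > 30.0) l30 = 0 }
         END { print g20, l30 }'
)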
I guess you could just AND your checks together, so a single out-of-range value locks the flags at false:
LessThanThirty=1
GreaterThanTwenty=1
while read line
do
LessThanThirty=$(echo "$line <= 30.0 && $LessThanThirty" | bc)
GreaterThanTwenty=$(echo "$line >= 20.0 && $GreaterThanTwenty" | bc)
done < <(grep -A 26 "some text" someFile.txt | awk '/More Text/ { gsub(/M/, " "); print $4 }' | uniq )
echo $LessThanThirty
echo $GreaterThanTwenty
The variables keep their initial value of true (1); as soon as any value is outside the range, the corresponding flag becomes 0 and stays 0.
