bash: how to compute average of different columns?

I am writing a script for automatically computing average runtime.
First I need to run $ time ./foo.py for 100 times and save output to file time.txt (working)
$ for i in `seq 100`; do { time ./foo.py; } 2>> time.txt; done
Output looks as follows
time ./foo.py
real 0m0,030s
user 0m0,030s
sys 0m0,000s
[...]
Runtimes from different scripts are in the same file. Each entry starts with time ./foo.py, followed by 100 "triplets" of real, user and sys.
Now, if possible, I would love to have the script automatically compute the average runtime for each tested file by using all 100 "triplets", and neatly returning only one "mean triplet".
I have thought about maybe using awk to calculate the mean, like this
awk '{ total += $2 } END { print total/NR }' time.txt
But the command would need to be adapted to fit my needs - after all, only the parts after the , (e.g. ,030s) may be used for computation and the s would also need to be disregarded.
Since I do not know how to achieve this objective, I thought to ask the community.
Any help is greatly appreciated.

It's easier if you tell time to print its results in POSIX format (time -p), so that each value is a plain number of seconds:
awk '/^real/ { totalReal += $2 } /^user/ { totalUser += $2 } /^sys/ { totalSys += $2 } END { print "realAvg " totalReal/(NR/4) "\n" "userAvg " totalUser/(NR/4) "\n" "sysAvg " totalSys/(NR/4) }' time.txt
Prints output as follows:
realAvg 12.62
userAvg 27
sysAvg 3.8
Explanation:
Tell awk to go through each line in the file, and if the line starts with real, add its value to the totalReal variable; do the same for user and sys. This keeps a running total for each of the three "types".
At the end, print the three running totals, each divided by the number of lines divided by 4. This is because you want each "set" of 4 lines to count as 1 instance, and awk's NR just counts the number of lines. Note that this only works if every run really contributes exactly 4 lines to the file.
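If the file keeps bash's default format from the question (values such as 0m0,030s), the conversion can also be done inside awk. The following is a minimal sketch, not a definitive solution: the function name avg_times is made up for illustration, it assumes a comma or dot as decimal separator, and it counts runs by the number of "real" lines rather than relying on NR.

```shell
# Sketch: average real/user/sys from bash's default `time` output.
# Assumes lines such as "real 0m0,030s"; avg_times is a hypothetical helper name.
avg_times() {
    # LC_ALL=C keeps awk's own number parsing and printing on "." regardless of locale.
    LC_ALL=C awk '
    # Convert "0m0,030s" to seconds: minutes * 60 + seconds.
    function secs(t,    parts) {
        sub(/s$/, "", t)        # drop the trailing "s"
        sub(/,/, ".", t)        # comma decimal separator -> dot
        split(t, parts, "m")    # parts[1] = minutes, parts[2] = seconds
        return parts[1] * 60 + parts[2]
    }
    $1 == "real" { real += secs($2); n++ }   # one "real" line per run
    $1 == "user" { user += secs($2) }
    $1 == "sys"  { sys  += secs($2) }
    END {
        if (n > 0)
            printf "real %.3f\nuser %.3f\nsys %.3f\n", real / n, user / n, sys / n
    }' "$@"
}

# Example with two inline runs instead of time.txt:
printf 'real 0m0,030s\nuser 0m0,020s\nsys 0m0,000s\nreal 0m0,050s\nuser 0m0,040s\nsys 0m0,020s\n' | avg_times
```

With a real log, the call would be avg_times time.txt. Lines that do not start with real, user or sys (such as a header line or blank lines) are simply ignored.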

Related

Parsing multiline program output

I've recently been working on some lab assignments and in order to collect and analyze results well, I prepared a bash script to automate my job. It was my first attempt to create such script, thus it is not perfect and my question is strictly connected with improving it.
Exemplary output of the program is shown below, but I would like to make it more general for more purposes.
>>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478
All data I want to collect is always in different lines. My first attempt was running the same program twice (or more times depending on the amount of data) and then using grep in each run to extract the data I need by looking for the keyword. It is very inefficient, as there probably are some possibilities of parsing whole output of one run, but I could not come up with any idea. At the moment the script is:
#!/bin/bash
write() {
o1=$(./progname args | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
o2=$(./progname args | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
o3=$(./progname args | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
echo "$1 $o1 $o2 $o3"
}
for ((i = 1; i <= 10; i++)); do
write $i >> times.dat
done
It is worth mentioning that echoing results in one line is crucial, as I am using gnuplot later and having data in columns is perfect for that use. Sample output should be:
1 0.019306 3.369 170620476
2 0.019559 3.375 170620475
3 0.021971 3.334 170620478
4 0.020536 3.378 170620480
5 0.019692 3.390 170620475
6 0.020833 3.375 170620477
7 0.019951 3.450 170620477
8 0.019417 3.381 170620476
9 0.020105 3.374 170620476
10 0.020255 3.402 170620475
My question is: how could I improve the script to collect such data in just one program execution?

You could use awk here to collect the values into a bash array and later access them by index 0, 1 and 2, in case you want to do this in a single command.
myarr=($(your_program args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
Or use the following to print all values on a single line, so they stay together even if someone double-quotes the command substitution to preserve newlines.
myarr=($(your_program args | awk '/Total/{val=$NF;next} /cycle/{val=(val?val OFS:"")$NF;next} /Time/{print val OFS $(NF-1)}'))
Explanation: a commented version of the first awk program above.
awk ' ##Starting awk program from here.
/Total/{ ##Checking if a line has Total keyword in it then do following.
print $NF ##Printing last field of that line which has Total in it here.
next ##next keyword will skip all further statements from here.
}
/cycle/{ ##Checking if a line has cycle in it then do following.
print $NF ##Printing last field of that line which has cycle in it here.
next ##next keyword will skip all further statements from here.
}
/Time/{ ##Checking if a line has Time in it then do following.
print $(NF-1) ##Printing 2nd last field of that line which has Time in it here.
}'
To access individual items you could use like:
echo ${myarr[0]}, echo ${myarr[1]} and echo ${myarr[2]} for Total, cycle and time respectively.
Example to access all elements by loop in case you need:
for i in "${myarr[@]}"
do
echo "$i"
done

You can execute your program once and save the output in a variable.
o0=$(./progname args)
Then you can grep that saved string as many times as you like, for example:
o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')

Assumptions:
each of the 3x search patterns (Time, cycle, Total) occur just once in a set of output from ./progname
format of ./progname output is always the same (ie, same number of space-separated items for each line of output)
I've created my own progname script that just does an echo of the sample output:
$ cat progname
echo ">>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478"
One awk solution to parse and print the desired values:
$ i=1
$ ./progname | awk -v i=${i} ' # assign awk variable "i" = ${i}
/Time/ { o1 = $3 } # o1 = field 3 of line that contains string "Time"
/cycle/ { o2 = $5 } # o2 = field 5 of line that contains string "cycle"
/Total/ { o3 = $4 } # o3 = field 4 of line that contains string "Total"
END { printf "%s %s %s %s\n", i, o1, o2, o3 } # print 4x variables to stdout
'
1 0.042127 3.386 170620482
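To tie this back to the loop in the question, the parsing can be wrapped in a small function so that each iteration runs the program only once. This is a sketch under the same assumptions as above; parse_run is a hypothetical helper name, and ./progname with args stands in for the real command.

```shell
# Sketch: parse one run of the program (read from stdin) into a single output line.
# parse_run is a hypothetical helper; $1 is the run index used as the first column.
parse_run() {
    awk -v i="$1" '
    /Time/  { t   = $(NF-1) }   # "Time elapsed: 0.042127 seconds"  -> 0.042127
    /cycle/ { ipc = $NF }       # "> Instructions per cycle: 3.386" -> 3.386
    /Total/ { ins = $NF }       # "> Total instructions: 170620482" -> 170620482
    END     { print i, t, ipc, ins }'
}

# Demo on the sample lines from the question:
printf '> Total instructions: 170620482\n> Instructions per cycle: 3.386\nTime elapsed: 0.042127 seconds\n' | parse_run 1

# Intended use (placeholders from the question, left commented out here):
# for ((i = 1; i <= 10; i++)); do
#     ./progname args | parse_run "$i"
# done > times.dat
```

Redirecting once after done, instead of appending inside the loop, opens times.dat a single time and starts from an empty file on each script run.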

Having SUM issues with a bash script

I'm trying to write a script that pulls the numbers out of 4 files storing temperature readings from 4 industrial freezers. It's a hobby script, and it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the following printout in the file. My goal is to print only the final SUM, not the individual numbers printed vertically.
Any help would be greatly appreciated; here's my code
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the individual numbers listed, just the SUM total

Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
print sum"/"count"="sum/count # print stuff
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.

The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
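Putting these points together, a sketch that prints only the total could look like the following. It assumes readings with at least one digit on each side of the dot (such as 17.28), and sum_temps is a hypothetical helper name, not something from the question.

```shell
# Sketch: print only the sum of all decimal readings found in the input.
# sum_temps is a hypothetical helper; it reads the named files, or stdin if none are given.
sum_temps() {
    grep -oE '[0-9]+\.[0-9]+' "$@" |
        awk '{ sum += $1 } END { print sum }'   # no per-line print, only the total
}

# Demo on the sample data from the question:
printf 'Morningtemp:17.28\nNoontemp:17.01\nLowtemp:17.00 Hightemp:18.72\n' | sum_temps

# To append the total to the same file, finish reading before writing
# (the path is the placeholder from the question, so this is left commented out):
# total=$(sum_temps "/location/$value-1.txt")
# echo "$total" >> "/location/$value-1.txt"
```

Capturing the total in a variable first means the whole pipeline has finished with the file before anything is appended to it.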

Why awk for loop require ENTER key before printing anything?

Program: for_loop.awk
{
sum = 0
i = 1
for (i=1; i<=10; i++)
{
sum += i;
}
printf "Sum for 1 to 10 numbers = %d \nGoodbuy!\n\n", sum
exit 1
}
If I run the code above with awk -f for_loop.awk, it waits for input. Only after I press ENTER does it show the printf output.
This does not happen with a while or do-while loop. Only with the for loop does it need keyboard input (ENTER) before showing the output.
Can anybody explain why this is happening? (Forgive any spelling or grammar mistakes.)
Edit:
One more question related to above problem i.e,
If I remove exit 1 from the program, it still waits for my input (ENTER), and it then prints the same output every time I press ENTER, until I press Ctrl+D to exit.
How can it run over and over without a loop? With exit 1 it leaves after one pass, but without it, how does it go back and execute the same statements again?
Thanks

Your awk program consists of a single action that will be triggered by each line of input. When you press ENTER you trigger that action with the first, albeit empty, line of input. You print the sum from 1 to 10 as expected, and then the exit statement quits the program without trying to read any more input.
That answers your question, but I suspect you have more in mind that you'd like to work out. Update your question (or start a new one) if you like and we'll try to help more!
EDIT
As mentioned in the comments, the fundamental aspect of awk is that the normal actions are run for every line of input, without having to explicitly do the looping yourself. Play with this example and/or take a few minutes with an online tutorial to get the idea:
BEGIN { print "do this once before reading input" }
{ print "do this for each line of input (now processing '" $0 "')" }
END { print "do this once after reading all input" }

Unless you run your code in a BEGIN block, awk will be waiting for input:
Example:
This waits for input:
awk '{print "hello world!"}'
This does not:
awk 'BEGIN{print "hello world!"}'
So to fix your code:
BEGIN{
for (i=1; i<=10; i++)
{
sum += i;
}
printf "Sum for 1 to 10 numbers = %d \nGoodbye!\n\n", sum
exit 0
}

You can also run awk without any input files.
If you type the following command line: awk 'program'
awk applies the program to the standard input, which usually means whatever you type on the terminal. This continues until you indicate end-of-file by typing Ctrl-d. (On other operating systems, the end-of-file character may be different. For example, on OS/2, it is Ctrl-z)
(Source: GAWK: Effective AWK Programming by Arnold D. Robbins)

Why does awk skip the second field in first entry?

I have a manually created log file of the format
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
==============================================
2:05 TOTAL time spent
There are many entries in the log. To avoid manually recomputing total time every time an entry is added, I wrote the following script:
#!/bin/bash
file=`ls | grep log`
head -n -1 $file | egrep -o [0-9]:[0-9]{2}[^ap] \
| awk '{ FS = ":" ; SUM += 60*$1 ; SUM += $2 } END { print SUM }'
First, the script assumes there is exactly one file with log in its name, and that's the file I'm after. Second, it takes all lines other than the line with the current total, greps the time information from the line, and feeds it to awk, which converts it to minutes.
This is where I run into problems. The final sum would always be slightly off. Through trial and error, I discovered that awk will never count the second field of the very first record, e.g. the 45 minutes in this case. It will count the hour; it won't count the minutes. It has no such problem with the other records, but it's always off by the minutes in the first record.
What could be causing this behavior? How do I debug it?

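You set FS inside the main action, which is too late for the first line: awk has already split that record into fields using the default separator by the time the assignment runs.
The right way to do it is: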
echo -e "1:45\n0:20" | awk 'BEGIN { FS=":" } { SUM += 60*$1 + $2 } END { print SUM }'
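The difference can be seen side by side in a small sketch. The two function names are made up for illustration; both read duration values such as 1:45 on stdin.

```shell
# FS assigned inside the main action: too late for the first record,
# which awk has already split on whitespace.
sum_fs_in_action() {
    awk '{ FS = ":"; sum += 60 * $1 + $2 } END { print sum }'
}

# FS assigned before any input is read (-F: is equivalent to BEGIN { FS = ":" }).
sum_fs_up_front() {
    awk -F: '{ sum += 60 * $1 + $2 } END { print sum }'
}

printf '1:45\n0:20\n' | sum_fs_in_action   # the minutes of the first record are lost
printf '1:45\n0:20\n' | sum_fs_up_front    # every record is split on ":"
```

For this input the first call prints 80 and the second prints 125: in the first record $1 is the whole string 1:45, which awk converts to the number 1, and $2 is empty.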

You did not show us what output you expect.
Is it something like this?
$ cat log
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
==============================================
2:05 TOTAL time spent
Awk Code
awk '$3~/([[:digit:]]):([[:digit:]])/ && !/TOTAL/{
split($3,A,":")
sum+=A[1]*60+A[2]
}
END{
print "Total",sum,"Minutes"
}' log
Resulting
Total 125 Minutes

Using awk with Operations on Variables

I'm trying to write a Bash script that reads files with several columns of data and multiplies each value in the second column by each value in the third column, adding the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84) etc..
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
set -- $(awk '/genome/ {print $2,$3}' $file.hist)
var1=$1
var2=$2
var3=$((var1*var2))
total=$((total+var3))
echo var1 \= $var1
echo var2 \= $var2
echo var3 \= $var3
echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!

That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special pattern (a GNU awk extension) that means "run this next block at the end of each file, not at each line."
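For awk implementations without ENDFILE, a similar per-file total can be sketched by watching for the first line of each new file (FNR == 1). The helper name per_file_totals and the demo data are assumptions for illustration; the /genome/ filter is taken from the question's own script so that a header line is skipped.

```shell
# Sketch: per-file sum of column 2 * column 3 without ENDFILE.
# per_file_totals is a hypothetical helper name.
per_file_totals() {
    awk '
    FNR == 1 && NR > 1 { print total + 0; total = 0 }   # a new file started: flush the previous total
    /genome/           { total += $2 * $3 }             # only data lines, as in the question
    END                { if (NR > 0) print total + 0 }  # flush the last file
    ' "$@"
}

# Demo with two throwaway files (made-up data in the question's layout):
tmp=$(mktemp -d)
printf 'genome 1 30 500\ngenome 2 27 500\ngenome 3 83 500\n' > "$tmp/fileone.hist"
printf 'genome 2 10 500\n' > "$tmp/filetwo.hist"
per_file_totals "$tmp/fileone.hist" "$tmp/filetwo.hist"
rm -r "$tmp"
```

The demo prints 333 for the first file (30 + 54 + 249) and 20 for the second. One limitation of this approach is that an empty input file produces no line of its own, because FNR == 1 never fires for it.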
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.

I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
total=0
while read -r _ x y _; do
((total += x * y))
done < "$f"
echo "$total"
done
