Why does awk skip the second field in first entry? - bash

I have a manually created log file of the format
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
==============================================
2:05 TOTAL time spent
There are many entries in the log. To avoid manually recomputing total time every time an entry is added, I wrote the following script:
#!/bin/bash
file=`ls | grep log`
head -n -1 $file | egrep -o [0-9]:[0-9]{2}[^ap] \
| awk '{ FS = ":" ; SUM += 60*$1 ; SUM += $2 } END { print SUM }'
First, the script assumes there is exactly one file with log in its name, and that's the file I'm after. Second, it takes all lines other than the line with the current total, greps the time information from the line, and feeds it to awk, which converts it to minutes.
This is where I run into problems. The final sum would always be slightly off. Through trial and error, I discovered that awk will never count the second field of the very first record, e.g. the 45 minutes in this case. It will count the hour; it won't count the minutes. It has no such problem with the other records, but it's always off by the minutes in the first record.
What could be causing this behavior? How do I debug it?

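You set FS inside the main action block, which is too late for the first line: awk has already split that record into fields with the default separator (whitespace) by the time the assignment runs. So for the first record $1 is the whole string 1:45 (which counts as 1 in arithmetic) and $2 is empty.
The right way is to set FS in a BEGIN block: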
echo -e "1:45\n0:20" | awk 'BEGIN { FS=":" } { SUM += 60*$1 + $2 } END { print SUM }'
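To see the effect directly, here is a small sketch that runs both variants on the same input. The function names sum_buggy and sum_fixed are made up for this illustration. It assumes a POSIX-conforming awk such as gawk or mawk, where a new FS only takes effect from the next record.

```shell
#!/bin/bash
# sum_buggy: FS is assigned inside the main block, so the first record
# has already been split on whitespace when the assignment happens.
sum_buggy() {
  awk '{ FS = ":" ; SUM += 60*$1 ; SUM += $2 } END { print SUM }'
}

# sum_fixed: FS is assigned in BEGIN, before any record is read.
sum_fixed() {
  awk 'BEGIN { FS = ":" } { SUM += 60*$1 + $2 } END { print SUM }'
}
```

With the input printf '1:45\n0:20\n', sum_buggy should print 80 (the 45 minutes are lost) and sum_fixed should print 125. Passing the separator on the command line with awk -F: should behave like the BEGIN variant.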

You did not show us your expected output.
Is it something like this?
$ cat log
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
==============================================
2:05 TOTAL time spent
Awk Code
awk '$3~/([[:digit:]]):([[:digit:]])/ && !/TOTAL/{
split($3,A,":")
sum+=A[1]*60+A[2]
}
END{
print "Total",sum,"Minutes"
}' log
Resulting
Total 125 Minutes
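If the goal is to regenerate the last line of the log, the same idea can print the sum in the H:MM form the log already uses. This is only a sketch: the function name log_total is hypothetical, and it assumes the duration is always the third whitespace-separated field, as in the sample log.

```shell
#!/bin/bash
# log_total FILE: sum the H:MM durations in field 3 and print them
# in the same "H:MM TOTAL time spent" form as the log's last line.
log_total() {
  awk '
    $3 ~ /^[0-9]+:[0-9][0-9]$/ {        # only lines whose 3rd field is H:MM
      split($3, a, ":")
      sum += a[1] * 60 + a[2]           # accumulate minutes
    }
    END { printf "%d:%02d TOTAL time spent\n", int(sum / 60), sum % 60 }
  ' "$1"
}
```

Because the header, the separator line, and the existing TOTAL line do not have an H:MM value in field 3, they are skipped without a separate check.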

Related

Parsing multiline program output

I've recently been working on some lab assignments, and in order to collect and analyze the results I prepared a bash script to automate the job. It was my first attempt at such a script, so it is not perfect, and my question is about improving it.
Example output of the program is shown below, but I would like to make the script general enough for other purposes.
>>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478
All the data I want to collect is always on different lines. My first attempt ran the same program twice (or more, depending on the amount of data) and used grep on each run to extract the value I need by keyword. This is very inefficient; there is probably a way to parse the whole output of a single run, but I could not come up with one. At the moment the script is:
#!/bin/bash
write() {
o1=$(./progname args | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
o2=$(./progname args | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
o3=$(./progname args | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
echo "$1 $o1 $o2 $o3"
}
for ((i = 1; i <= 10; i++)); do
write $i >> times.dat
done
It is worth mentioning that echoing results in one line is crucial, as I am using gnuplot later and having data in columns is perfect for that use. Sample output should be:
1 0.019306 3.369 170620476
2 0.019559 3.375 170620475
3 0.021971 3.334 170620478
4 0.020536 3.378 170620480
5 0.019692 3.390 170620475
6 0.020833 3.375 170620477
7 0.019951 3.450 170620477
8 0.019417 3.381 170620476
9 0.020105 3.374 170620476
10 0.020255 3.402 170620475
My question is: how could I improve the script to collect such data in just one program execution?
You could use awk here to collect the values into an array and later access them by index 0, 1 and 2, in case you want to do this in a single command.
myarr=($(your_program args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
OR use the following to force all elements onto a single line, so that the values do not end up on separate lines even if someone wraps the command substitution in " to preserve newlines.
myarr=($(your_program args | awk '/Total/{val=$NF;next} /cycle/{val=(val?val OFS:"")$NF;next} /Time/{print val OFS $(NF-1)}'))
Explanation: Adding detailed explanation of awk program above.
awk ' ##Starting awk program from here.
/Total/{ ##Checking if a line has Total keyword in it then do following.
print $NF ##Printing last field of that line which has Total in it here.
next ##next keyword will skip all further statements from here.
}
/cycle/{ ##Checking if a line has cycle in it then do following.
print $NF ##Printing last field of that line which has cycle in it here.
next ##next keyword will skip all further statements from here.
}
/Time/{ ##Checking if a line has Time in it then do following.
print $(NF-1) ##Printing 2nd last field of that line which has Time in it here.
}'
To access individual items you could use:
echo ${myarr[0]}, echo ${myarr[1]} and echo ${myarr[2]} for Total, cycle and time respectively.
Example to access all elements in a loop, in case you need it:
for i in "${myarr[@]}"
do
echo "$i"
done
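As a variation on the array approach, bash 4 and later can fill the array with mapfile, which stores one line per element without relying on word splitting. The sketch below is an assumption-laden illustration: load_metrics is a made-up helper name, and the command to run is passed as its arguments.

```shell
#!/bin/bash
# load_metrics CMD [ARGS...]: run the command once and store the three
# values in the global array myarr (Total, cycle, Time, in output order).
# Requires bash 4+ for mapfile.
load_metrics() {
  mapfile -t myarr < <("$@" | awk '
    /Total/ { print $NF; next }       # last field of the "Total" line
    /cycle/ { print $NF; next }       # last field of the "cycle" line
    /Time/  { print $(NF-1) }         # second-to-last field of the "Time" line
  ')
}
```

After load_metrics ./progname args, the values would be available as "${myarr[0]}", "${myarr[1]}" and "${myarr[2]}".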
You can execute your program once and save the output at a variable.
o0=$(./progname args)
Then you can grep that saved string any number of times, like this:
o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
Assumptions:
each of the 3x search patterns (Time, cycle, Total) occur just once in a set of output from ./progname
format of ./progname output is always the same (ie, same number of space-separated items for each line of output)
I've created my own progname script that just does an echo of the sample output:
$ cat progname
echo ">>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478"
One awk solution to parse and print the desired values:
$ i=1
$ ./progname | awk -v i=${i} ' # assign awk variable "i" = ${i}
/Time/ { o1 = $3 } # o1 = field 3 of line that contains string "Time"
/cycle/ { o2 = $5 } # o2 = field 5 of line that contains string "cycle"
/Total/ { o3 = $4 } # o3 = field 4 of line that contains string "Total"
END { printf "%s %s %s %s\n", i, o1, o2, o3 } # print 4x variables to stdout
'
1 0.042127 3.386 170620482
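Putting this together with the original loop, one possible shape for the whole script is sketched below. The helper name collect_row is invented for this example, and it assumes the three keywords each occur once per run, as stated in the assumptions above.

```shell
#!/bin/bash
# collect_row I CMD [ARGS...]: run the command once and print
# "I time ipc total" on a single line, ready for gnuplot.
collect_row() {
  local i=$1
  shift
  "$@" | awk -v i="$i" '
    /Time elapsed/           { t = $(NF-1) }   # value before "seconds"
    /Instructions per cycle/ { ipc = $NF }
    /Total instructions/     { total = $NF }
    END { print i, t, ipc, total }
  '
}

# Usage (assumes ./progname exists, so it is left commented out here):
# for ((i = 1; i <= 10; i++)); do
#   collect_row "$i" ./progname args >> times.dat
# done
```

Each iteration runs the program exactly once, so the three values in a row come from the same execution.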

bash: how to compute average of different columns?

I am writing a script for automatically computing average runtime.
First I need to run $ time ./foo.py for 100 times and save output to file time.txt (working)
$ for i in `seq 100`; do { time ./foo.py; } 2>> time.txt; done
Output looks as follows
time ./foo.py
real 0m0,030s
user 0m0,030s
sys 0m0,000s
[...]
Runtimes from different scripts are in the same file. Each entry starts with time ./foo.py, followed by 100 "triplets" of real, user and sys.
Now, if possible, I would love to have the script automatically compute the average runtime for each tested file by using all 100 "triplets", and neatly returning only one "mean triplet".
I have thought about maybe using awk to calculate the mean, like this
awk '{ total += $2 } END { print total/NR }' time.txt
But the command would need to be adapted to fit my needs - after all, only the parts after the , (e.g. ,030s) may be used for computation and the s would also need to be disregarded.
Since I do not know how to achieve this objective, I thought to ask the community.
Any help is greatly appreciated.
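It's easier if you tell time to output the time info in POSIX format (time -p), which reports each time as plain seconds without the m and s suffixes. The totals can then be averaged like this: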
awk '/^real/ { totalReal += $2 } /^user/ { totalUser += $2 } /^sys/ { totalSys += $2 } END { print "realAvg " totalReal/(NR/4) "\n" "userAvg " totalUser/(NR/4) "\n" "sysAvg " totalSys/(NR/4) }' time.txt
Prints output as follows:
realAvg 12.62
userAvg 27
sysAvg 3.8
Explanation:
Tell awk to go through each line in the file, and if the line starts with real, add its second field to the totalReal variable; do the same for user and sys. In other words, keep a running total for each of the three "types".
At the end, simply print the three running totals, each divided by the number of lines divided by 4. This is because you want each "set" of 4 lines to count as 1 instance, and awk's NR just counts the number of lines.
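If the file already contains times in the default format from the question (for example 0m0,030s), a variant along the following lines might work without re-running anything. This is a sketch under assumptions: avg_times is a made-up name, the decimal separator may be a comma or a dot, and each kind of line is counted separately instead of relying on NR/4.

```shell
#!/bin/bash
# avg_times [FILE...]: average the real/user/sys lines of `time` output.
# Accepts both "0m0,030s" (default format) and plain seconds ("0.03").
avg_times() {
  LC_ALL=C awk '
    /^(real|user|sys)[ \t]/ {
      t = $2
      gsub(/,/, ".", t)                 # normalize a decimal comma
      sub(/s$/, "", t)                  # drop the trailing "s"
      if (split(t, p, "m") == 2)        # "MmS.SSS" -> minutes and seconds
        secs = p[1] * 60 + p[2]
      else                              # already plain seconds
        secs = t
      sum[$1] += secs
      n[$1]++
    }
    END {
      split("real user sys", key, " ")
      for (j = 1; j <= 3; j++)
        if (n[key[j]])
          printf "%sAvg %.3f\n", key[j], sum[key[j]] / n[key[j]]
    }
  ' "$@"
}
```

Counting matches per keyword means header lines and blank lines in time.txt do not affect the averages. Note that if several scripts share one file, this averages across all of them; the entries would need to be separated first.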

Having SUM issues with a bash script

I'm trying to write a script to pull the integers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script, and it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the following printout in the file. My goal is to print only the final SUM, not the individual numbers printed vertically.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the numbers listed above it, just the SUM total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
print sum"/"count"="sum/count # print sum, count and average
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
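One way to sidestep both issues is sketched below. It is an illustration under assumptions, not the asker's final script: sum_readings is a hypothetical helper, the regex assumes readings of the form digits, dot, digits (such as 17.28), and the awk block prints only in END so the individual numbers are not echoed.

```shell
#!/bin/bash
# sum_readings FILE: print only the sum of all decimal numbers in FILE.
sum_readings() {
  grep -oE '[0-9]+\.[0-9]+' "$1" |
    LC_ALL=C awk '{ sum += $1 } END { printf "%.2f\n", sum }'
}

# Usage: the command substitution finishes reading the file before
# anything is appended to it.
# total=$(sum_readings "/location/$value-1.txt")
# printf 'Total:%s\n' "$total" >> "/location/$value-1.txt"
```

Capturing the sum in a variable first means the file is fully read before the append starts. Bear in mind that re-running the script on the same file would also pick up the previously appended total.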

Inserting if loop within awk

I had a problem solved in a previous post using awk, but now I want to put an if condition in it, and I am getting an error.
Here's the problem:
I had a lot of files that looked like this:
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
And I wanted to extract the first 2000 lines (line 1 to 2000), then extract the lines 1500 to 3500, then 3000 to 5000 and so on... What I mean is: extract a window of 2000 lines with an overlap of 500 lines between contiguous windows until the end of the file.
This is the awk command used for it:
awk -v i=1 -v t=2000 -v d=501 'NR>1{a[NR-1]=$0}END{
while(i<NR-1){
++n;
for(k=i;k<i+t;k++)print a[k] > "win"n".txt";
close("_win"n".txt")
i=i+t-d
}
}' myfile.txt
done
And I get several files with names win1.txt , win2.txt , win3.txt , etc...
My problem now is that, because the file length is not a multiple of 2000, my last window has fewer than 2000 lines. How can I add an if condition that does this: if the last window would have fewer than 2000 numbers, the previous window should instead contain all the lines up to the end of the file?
EXTRA INFO
When the windows are created, there is a line break at the end. That is why I need the if condition to take into account a window of fewer than 2000 numbers, and not just lines.
If you don't have to use awk for some other reason, try the sed approach
#!/bin/bash
file="$(sed '/^\s*$/d' myfile.txt)"
sed -n 1,2000p <<< "$file"
first=1500
last=3500
max=$(wc -l <<< "$file" | awk '{print $1}')
while [[ $max -ge 2000 && $last -lt $((max+1500)) ]]; do
sed -n "$first","$last"p <<< "$file"
((first+=1500))
((last+=1500))
done
Obviously this is going to be slower than awk and more error-prone for gigantic files, but it should work in most cases.
Change the while condition to make it stop earlier:
while (i+t <= NR) {
Change the end condition of the for loop to compensate for the last output file being potentially bigger:
for (k = i; k < (i+t+t-d <= NR ? i+t : NR); k++)
The rest of your code can stay the same; although I took the liberty of removing the close statement (why was that?), and to set d=500, to make the output files really overlap by 500 lines.
awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}END{
while (i+t <= NR) {
++n;
for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt";
i=i+t-d
}
}' myfile.txt
I tested it with small values of t and d, and it seems to work as requested.
One final remark: for big input files, I wouldn't encourage storing the whole thing in array a.
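For trying this out with small values of t and d, the same logic can be wrapped in a function. This is a sketch: split_windows and its prefix parameter are invented for the example, and it closes each output file after writing it, which is an addition to the code above intended to avoid running out of open files when there are many windows.

```shell
#!/bin/bash
# split_windows FILE T D PREFIX
#   FILE   input with one header line followed by data lines
#   T      window size, D overlap between consecutive windows
#   PREFIX output files are PREFIX1.txt, PREFIX2.txt, ... (hypothetical)
split_windows() {
  awk -v t="$2" -v d="$3" -v prefix="$4" '
    NR > 1 { a[NR-1] = $0 }                    # skip header, store data lines
    END {
      i = 1
      while (i + t <= NR) {                    # only start full windows
        n++
        out = prefix n ".txt"
        # if the next window would not fit, extend this one to the end
        last = (i + t + t - d <= NR) ? i + t : NR
        for (k = i; k < last; k++) print a[k] > out
        close(out)
        i += t - d
      }
    }
  ' "$1"
}
```

With a header plus the numbers 1 to 11, t=4 and d=1, this should produce three files holding 1 to 4, 4 to 7, and 7 to 11, the last one absorbing the leftover line.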

How to increment hours in a datetimestamp?

I have a file which contains time in YYYYMMDDhhmmss.sss
I am fetching only hours/minutes from the file using the following command
start=grep -i "XYZ" | head -1 | awk '{print $3}' | cut -c9-12
The start variable would contain number of hours/minutes (Example: - 1041 [HHMM])
My task is to increment this time by 60 minutes.
Please help me to do so. I am not using system date.
Here's what I tried:
start=grep -i "XYZ" | head -1 | awk '{print $3}' | cut -c9-12
end=$(($start) + 3600 )
But this logic is wrong, as it adds the value like a normal number. Also, converting the time to seconds would be a tedious job. Is there any way to increment it via system commands? Please suggest.
60 minutes is 1 hour, so all you need to do is increment the hour by one. This can be done by adding 100 to the HHMM time as shown below:
start="$1"
# add 100 to the start time to increment the hour
newTime=$((10#$start+100))
# check if we have crossed midnight
if (( newTime >= 2400 )); then
newTime=$((newTime-2400))
fi
#pad with leading zero
newTime="$(printf %04d $newTime)"
echo "$newTime"
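The add-100 trick only works for whole hours. If other increments are ever needed, one possible generalization is to convert to minutes, add, and wrap at 24 hours. The function name add_minutes below is made up for this sketch, and it assumes the input is always a 4-digit HHMM string.

```shell
#!/bin/bash
# add_minutes HHMM N: add N minutes to a 4-digit HHMM time,
# wrapping around midnight, and print the result as HHMM.
add_minutes() {
  local hh=$((10#${1:0:2}))            # 10# forces base 10 ("09" is not octal)
  local mm=$((10#${1:2:2}))
  local total=$(( (hh * 60 + mm + $2) % 1440 ))   # 1440 minutes per day
  printf '%02d%02d\n' $((total / 60)) $((total % 60))
}
```

For example, add_minutes 1041 60 should print 1141, and add_minutes 2330 60 should print 0030.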
