How to add two time outputs in bash

I want to calculate in bash the average time spent by several commands. The output of the time command is min:sec.millisec.
I don't know how to add two outputs of this kind in bash and then calculate the average.
I tried to convert the output with date, but I get "date: invalid date `0:01.00'".

This is a three-part answer.
Part one
First, use the TIMEFORMAT variable to output only the elapsed seconds. Then you can add the values directly.
From man bash:
TIMEFORMAT
The value of this parameter is used as a format string specifying how the timing information
for pipelines prefixed with the time reserved word should be displayed. The % character
introduces an escape sequence that is expanded to a time value or other information. The
escape sequences and their meanings are as follows; the braces denote optional portions.
Here is an example which outputs only seconds with a precision of 0, i.e. no decimal point. Read part three to see why that's important.
TIMEFORMAT='%0R'; time sleep 1
1
Part two
Second, how do we capture the output of time? It's actually a bit tricky. Here's how you capture the time from the command above:
TIMEFORMAT='%0R'; time1=$( { time sleep 1; } 2>&1 )
Part three
How do we add the times together and get the average?
In bash we use the $(( )) construct to do math. Note that bash does not natively support floating point, so you will be doing integer division (hence the precision of 0). Here is a script that captures the time from two commands and outputs each of the individual times along with their average:
#!/bin/bash
TIMEFORMAT='%0R'
time1=$( { time sleep 1; } 2>&1 )
time2=$( { time sleep 4; } 2>&1 )
ave=$(( (time1 + time2) / 2))
echo "time1 is $time1 | time2 is $time2 | average is $ave"
Output
time1 is 1 | time2 is 4 | average is 2
If integer division is a non-starter for you and you want precision, as long as you don't mind calling the external binary bc, you can do this quite easily.
#!/bin/bash
TIMEFORMAT='%3R'
time1=$( { time sleep 1; } 2>&1 )
time2=$( { time sleep 4; } 2>&1 )
ave=$( bc <<<"scale=3; ($time1 + $time2)/2" )
echo "time1 is $time1 | time2 is $time2 | average is $ave"
Output
time1 is 1.003 | time2 is 4.003 | average is 2.503
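If you need the average over more than two commands, the same capture can run in a loop. A minimal sketch, assuming bc is available; the commands array and its contents are placeholders, and each command's own output is discarded so it cannot mix into the captured time:
#!/bin/bash
TIMEFORMAT='%3R'
commands=( "sleep 1" "sleep 2" "sleep 4" )   # placeholders; simple commands only (word splitting applies)
sum=0
for cmd in "${commands[@]}"; do
    # discard the command's own output, keep only the time report
    t=$( { time $cmd >/dev/null 2>&1; } 2>&1 )
    sum=$( bc <<< "$sum + $t" )
done
ave=$( bc <<< "scale=3; $sum / ${#commands[@]}" )
echo "average over ${#commands[@]} commands is $ave"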

For the example, I'll use a preinitialized variable:
time="54:32.96";
minutes=$(echo "$time" | cut -d":" -f1)
seconds=$(echo "$time" | cut -d":" -f2 | cut -d"." -f1)
millis=$(echo "$time" | cut -d":" -f2 | cut -d"." -f2)
#Total time in millis
totalMillisOne=$(($millis+$seconds*1000+$minutes*60000))
You do this with every command, saving the result in a different variable each time, and then you compute the average:
let avMillis=$totalMillisOne+$totalMillisTwo
let avMillis=$avMillis/2
And you output it in the same format as the input:
let avSeconds=$avMillis/1000
let avMillis=$avMillis-$avSeconds*1000;
let avMinutes=$avSeconds/60;
let avSeconds=$avSeconds-$avMinutes*60;
echo "${avMinutes}:${avSeconds}.${avMillis}"

Related

How to write a bash function that can detect if a given input ends in Kilobytes `K` or Megabytes `M`?

I have a bash function that is currently set up as:
MB=$(( $(echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) | cut -d "K" -f 1 | sed 's/^.*- //') / 1000 ))
where the middle portion echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) returns a value that ends in K or M (for example: 515223 K or 36326 M) for kilobytes or megabytes. I have currently designed the function to strip the trailing unit indicator for K and then divide by 1000 to convert to megabytes. However, when the inner part ends in M, it fails. How can I write a function that detects whether it's in kilobytes or megabytes?
Don't reinvent the wheel - there is numfmt:
function_that_returns_Kb_or_Mb() { echo "515223 K"; }
mb=$(function_that_returns_Kb_or_Mb | numfmt -d '' --from=iec --to-unit=Mi)
# mb=504
function_that_returns_Kb_or_Mb() { echo "36326 M"; }
mb=$(function_that_returns_Kb_or_Mb | numfmt -d '' --from=iec --to-unit=Mi)
# mb=36326
Notes:
echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) is a useless use of echo. It's like echo $(echo $(echo $(...))). Just FUNCTION_THAT_RETURNS_Kb_OR_Mb | blabla.
By convention, UPPERCASE VARIABLES are used for exported variables, like PATH, COLUMNS, UID, PWD, etc. - use lower-case identifiers in your scripts.
I assumed input and output is using IEC scale, for SI scale use --from=si --to-unit=M.
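If numfmt is not available, a case statement over the unit suffix is a workable pure-bash fallback. A minimal sketch, assuming the "NUMBER K" / "NUMBER M" format shown above and 1024-based units; note that the integer division truncates, while numfmt rounds:
#!/bin/bash
# hypothetical stand-in for the real function:
function_that_returns_Kb_or_Mb() { echo "515223 K"; }
to_mb() {
    local value unit
    read -r value unit <<< "$(function_that_returns_Kb_or_Mb)"
    case "$unit" in
        K) echo $(( value / 1024 )) ;;   # kibibytes -> mebibytes, truncated
        M) echo "$value" ;;              # already mebibytes
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
}
mb=$(to_mb)
# mb=503 (numfmt's rounding gives 504)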

Is it really slow to handle a text file (more than 10K lines) with a shell script?

I have a file with more than 10K lines of record.
Within each line, there are two date+time info. Below is an example:
"aaa bbb ccc 170915 200801 12;ddd e f; g; hh; 171020 122030 10; ii jj kk;"
I want to filter out the lines where the two dates are less than 30 days apart.
Below is my source code:
#!/bin/bash
filename="$1"
echo $filename
touch filterfile
totalline=`wc -l $filename | awk '{print $1}'`
i=0
j=0
echo $totalline lines
while read -r line
do
    i=$[i+1]
    if [ $i -gt $[j+9] ]; then
        j=$i
        echo $i
    fi
    shortline=`echo $line | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'`
    date1=`echo $shortline | awk '{print $1}'`
    date2=`echo $shortline | awk '{print $2}'`
    if [ $date1 -gt 700000 ]
    then
        continue
    fi
    d1=`date -d $date1 +%s`
    d2=`date -d $date2 +%s`
    diffday=$[(d2-d1)/(24*3600)]
    #diffdays=`date -d $date2 +%s` - `date -d $date1 +%s`)/(24*3600)
    if [ $diffday -lt 30 ]
    then
        echo $line >> filterfile
    fi
done < "$filename"
I am running it in Cygwin. It took about 10 seconds to handle 10 lines. I use echo $i to show the progress.
Is it because I am using some wrong approach in my script?
This answer does not answer your question but gives an alternative method to your shell script. The answer to your question is given by Sundeep's comment:
Why is using a shell loop to process text considered bad practice?
Furthermore, you should be aware that every time you call sed, awk, echo, date, ... you are requesting the system to execute a binary which needs to be loaded into memory, etc. So if you do this in a loop, it is very inefficient.
alternative solution
awk programs are commonly used to process log files containing timestamp information, indicating when a particular log record was written. gawk extended the awk standard with time-handling functions. The one you are interested in is:
mktime(datespec [, utc-flag ]) Turn datespec into a timestamp in the
same form as is returned by systime(). It is similar to the function
of the same name in ISO C. The argument, datespec, is a string of the
form "YYYY MM DD HH MM SS [DST]". The string consists of six or seven
numbers representing, respectively, the full year including century,
the month from 1 to 12, the day of the month from 1 to 31, the hour of
the day from 0 to 23, the minute from 0 to 59, the second from 0 to
60, and an optional daylight-savings flag.
The values of these numbers need not be within the ranges specified;
for example, an hour of -1 means 1 hour before midnight. The
origin-zero Gregorian calendar is assumed, with year 0 preceding year
1 and year -1 preceding year 0. If utc-flag is present and is either
nonzero or non-null, the time is assumed to be in the UTC time zone;
otherwise, the time is assumed to be in the local time zone. If the
DST daylight-savings flag is positive, the time is assumed to be
daylight savings time; if zero, the time is assumed to be standard
time; and if negative (the default), mktime() attempts to determine
whether daylight savings time is in effect for the specified time.
If datespec does not contain enough elements or if the resulting time
is out of range, mktime() returns -1.
As your date format is of the form yymmdd HHMMSS, we need to write a parser function convertTime for this. Be aware that we will pass times of the form yymmddHHMMSS to this function. Furthermore, using space-delimited fields, your times are located in fields $4$5 and $11$12. As mktime converts the time to seconds since 1970-01-01, all we need to do is check whether the delta time is smaller than 30*24*3600 seconds.

awk 'function convertTime(t) {
    s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
    s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
    return mktime(s)
}
{ t1=convertTime($4$5); t2=convertTime($11$12) }
(t2-t1 < 30*3600*24) { print }' <file>
If you are not interested in the real delta time (your sed line removes the actual time of day), then you can adapt it to:
awk 'function convertTime(t) {
    s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
    s= s "00 00 00"
    return mktime(s)
}
{ t1=convertTime($4); t2=convertTime($11) }
(t2-t1 < 30*3600*24) { print }' <file>
If the dates are not in fixed fields, you can use match to find them:
awk 'function convertTime(t) {
    gsub(/ /, "", t)   # drop the space so the substr offsets below line up
    s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
    s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
    return mktime(s)
}
{ match($0,/[0-9]{6} [0-9]{6}/)
  t1=convertTime(substr($0,RSTART,RLENGTH))
  a=substr($0,RSTART+RLENGTH)
  match(a,/[0-9]{6} [0-9]{6}/)
  t2=convertTime(substr(a,RSTART,RLENGTH)) }
(t2-t1 < 30*3600*24) { print }' <file>
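To get the same filterfile that the original script produces, just redirect the output. A sketch, assuming one of the programs above is saved (without the surrounding awk '...' quoting) in a hypothetical file filter.awk; mktime requires gawk:
gawk -f filter.awk "$filename" > filterfile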
With some modifications, not all of them aimed at speed, I can reduce the processing time by 50% - which is a lot:
#!/bin/bash
filename="$1"
echo "$filename"
# touch filterfile
totalline=$(wc -l < "$filename")
i=0
j=0
echo "$totalline" lines
while read -r line
do
    i=$((i+1))
    if (( i > j+9 )); then
        j=$i
        echo $i
    fi
    shortline=($(echo "$line" | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'))
    date1=${shortline[0]}
    date2=${shortline[1]}
    if (( date1 > 700000 ))
    then
        continue
    fi
    d1=$(date -d "$date1" +%s)
    d2=$(date -d "$date2" +%s)
    diffday=$(((d2-d1)/(24*3600)))
    # diffdays=$(date -d $date2 +%s) - $(date -d $date1 +%s))/(24*3600)
    if (( diffday < 30 ))
    then
        echo "$line" >> filterfile
    fi
done < "$filename"
Some remarks:
# touch filterfile
Well - the later echo "$line" >> filterfile creates this file anyway if it doesn't exist, so the touch is not needed.
totalline=$(wc -l < "$filename")
You don't need awk here. The filename output is suppressed if wc doesn't see the filename.
Capturing the output in an array:
shortline=($(echo "$line" | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'))
date1=${shortline[0]}
date2=${shortline[1]}
allows us array access and saves further calls to awk.
On my machine, your code took about 42s for 2880 lines (on your machine 2880 s?) and about 19s for the same file with my code.
So I suspect that cygwin might be a slowdown, unless you are running it on an i486 machine. It's a Linux environment for Windows, isn't it? Well, I'm on a core Linux system. Maybe you could try the GNU utils for Windows - the last time I looked for them, they were advertised as gnu-utils x32 or something; maybe there is an x64 version available by now.
And the next thing I would have a look at is the date calculation - that might be a slowdown, too.
2880 lines isn't that much, so I don't suspect that my SSD drive plays a huge role in the game.

Creating histograms in bash

EDIT
I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. However, if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram, i.e. if that solution tells me that the frequency of 0.45 is 2 and 0.44 is 1 (for my input data), I'm still left with the problem of grouping those two frequencies into a total of 3 for the range 0.4-0.5.
END EDIT
QUESTION-
I have a long column of data with values between 0 and 1.
This will be of the type-
0.34
0.45
0.44
0.12
0.45
0.98
.
.
.
A long column of decimal values with repetitions allowed.
I'm trying to change it into a histogram sort of output such as (for the input shown above)-
0.0-0.1 0
0.1-0.2 1
0.2-0.3 0
0.3-0.4 1
0.4-0.5 3
0.5-0.6 0
0.6-0.7 0
0.7-0.8 0
0.8-0.9 0
0.9-1.0 1
Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.
I wrote it (badly) as-
for i in $(seq 0 0.1 0.9)
do
awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l;
done
Which basically does a wc -l of the entries it finds in each range.
Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also please note that the bin size should be a variable, like in my proposed solution.
I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?
Following the same algorithm as my previous answer, I wrote a script in awk which is extremely fast.
The script is the following:
#!/usr/bin/awk -f
BEGIN{
    bin_width=0.1;
}
{
    bin=int(($1-0.0001)/bin_width);
    if( bin in hist){
        hist[bin]+=1
    }else{
        hist[bin]=1
    }
}
END{
    for (h in hist)
        printf " * > %2.2f -> %i \n", h*bin_width, hist[h]
}
The bin_width is the width of each channel. To use the script, just copy it into a file, make it executable (with chmod +x <namefile>) and run it with ./<namefile> <name_of_data_file>.
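Since the question asks for a variable bin size, the width can also be passed on the command line with -v. A sketch of the same program written that way; the guard in BEGIN is an addition so the -v value is not overwritten, and data.txt is a hypothetical input file:
awk -v bin_width=0.05 '
BEGIN { if (bin_width == "") bin_width = 0.1 }   # default when -v is omitted
{ hist[int(($1 - 0.0001) / bin_width)] += 1 }    # unset cells count as 0, so no if/else needed
END { for (h in hist) printf " * > %2.2f -> %i\n", h * bin_width, hist[h] }
' data.txt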
For this specific problem, I would drop the last digit, then count occurrences of sorted data:
cut -b1-3 | sort | uniq -c
which gives, on the specified input set:
1 0.1
1 0.3
3 0.4
1 0.9
Output formatting can be done by piping through this awk command:
| awk 'BEGIN{r=0.0}
{while($2>r+0.05){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
 printf "%1.1f-%1.1f %3d\n",r,r+0.1,$1;r=r+.1}
END{while(r<0.95){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'
The only loop in this algorithm is the implicit one over the lines of the file.
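Putting the two steps together as a single pipeline (a sketch; data.txt stands in for the input file):
cut -b1-3 data.txt | sort | uniq -c | awk 'BEGIN{r=0.0}
{while($2>r+0.05){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
 printf "%1.1f-%1.1f %3d\n",r,r+0.1,$1;r=r+.1}
END{while(r<0.95){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'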
This is an example of how to do what you asked in bash. Bash is probably not the best language for this, since it is slow at math. I use bc; you can use awk if you prefer.
How the algorithm works
Imagine you have many bins: each bin corresponds to an interval. Each bin is characterized by a width (CHANNEL_DIM) and a position. All the bins together must cover the entire interval over which your data are spread. Dividing the value of a number by the bin width gives the position of its bin, so you just add 1 to that bin's counter. Here is a much more detailed explanation.
#!/bin/bash
# This is the input: you can use $1 and $2 to read input as cmd line argument
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9 # They are actually 10: 0 is already a channel
# check the max and the min to define the dimension of the channels:
MAX=`sort -n $FILE | tail -n 1`
MIN=`sort -rn $FILE | tail -n 1`
# Define the channel width
CHANNEL_DIM_LONG=`echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l`
CHANNEL_DIM=`printf '%2.2f' $CHANNEL_DIM_LONG `
# Probably printf is not the best function in this context because
#+the result could be system dependent.
# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
    NUMBER=$1
    CHANNEL_DIM=$2
    # The channel is found by dividing the value by the channel width and
    #+rounding it.
    RESULT_LONG=`echo $NUMBER/$CHANNEL_DIM | bc -l`
    RESULT=`printf '%.0f' $RESULT_LONG`
    echo $RESULT
}
# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do
    CHANNEL=`find_channel $line $CHANNEL_DIM`
    [[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
    let HIST[$CHANNEL]+=1
done < $FILE
# Iterate over the indices that actually exist, so empty bins
#+do not shift the labels.
for counter in "${!HIST[@]}"; do
    CHANNEL_START=`echo "$CHANNEL_DIM * $counter - .04" | bc -l`
    CHANNEL_END=`echo " $CHANNEL_DIM * $counter + .05" | bc`
    printf '%+2.1f : %2.1f => %i\n' $CHANNEL_START $CHANNEL_END ${HIST[$counter]}
done
Hope this helps. Comment if you have other questions.

unix find the difference from a file row wise

I have some data like
[09359]0000.365604| =>SttSasph_Hmbm_bSPO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365687| =>Hmbm_bSPO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365879| =>SttSasph_Hmbm_quOuO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365890| =>Hmbm_quOuO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
[09359]0001.625300| db_HOt_POPPWon_Wd: aspuQQOnt POPPWon WD WP 1016,59
[09359]0002.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
Every line starts with a process number (which can change) in square brackets,
then seconds after the module (0001 in this case),
then microseconds after the full stop,
then a pipe to terminate.
The rest of the line can be ignored.
What I need is to calculate:
Convert seconds into microseconds
Add the microseconds to the converted microseconds (from 1)
Find the difference in microseconds, e.g. line2-line1, line3-line2, line4-line3, and so on
Print the result to a separate file
I tried to use this logic, but it didn't work. May I get suggestions for an optimised way to do it, or improvements to my existing logic?
sec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f1)
msec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f2)
$f_msec=$((sec * 1000000 + msec)) > final_difference_file
If you are comfortable with awk, then you can use this script:
script.awk
BEGIN{ FS="[\\[\\]\\|]+" }
{ printf("[%s]%011.6f|%s\n", $2,$3-prev,$4)
prev = $3 }
Use it like this: awk -f script.awk yourfile
The first line sets up the field splitting to use the brackets and the pipe (ignore the backslashes; they are needed to escape the symbols that are regexp metacharacters). The second line prints the fields and calculates the time difference. The last line stores the current time for the calculation in the next line.
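For the sample data above, the first lines of output should look like this (each timestamp is replaced by the delta to the previous line; the first delta is measured from zero):
[09359]0000.365604| =>SttSasph_Hmbm_bSPO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.000083| =>Hmbm_bSPO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.000192| =>SttSasph_Hmbm_quOuO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)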
This can also be done with a bash script. Since bash lacks floating point arithmetic, we have to gather seconds and microseconds separately (or call an external tool like bc for each line):
script.sh
IFS='|[].'
factor=1000000
prev=0
while read dummy pid secs msecs text;
do
    # 10# forces base 10, so leading zeros are not parsed as octal
    msecs=$(( 10#$secs * $factor + 10#$msecs ))
    timediff=$(( $msecs - $prev ))
    prev=$msecs
    secs=$(( $timediff / $factor ))
    msecs=$(( $timediff - $secs * $factor ))
    printf "[%s]%04d.%06d|%s\n" "$pid" "$secs" "$msecs" "$text"
done
Use it like this: bash script.sh < yourfile

How to round a floating point number up to 3 digits after the decimal point in bash

I am a new bash learner. I want to print the result of an expression given as input, with 3 digits after the decimal point, rounded if needed.
I can use the following code, but it does not round. Say I give 5+50*3/20 + (19*2)/7 as input to the following code; the output is 17.928. The actual result is 17.92857..., so it is truncating instead of rounding. I want it to round, which means the output should be 17.929. My code:
read a
echo "scale = 3; $a" | bc -l
Equivalent C++ code would be (in the main function):
float a = 5+50*3.0/20.0 + (19*2.0)/7.0;
cout<<setprecision(3)<<fixed<<a<<endl;
What about
a=`echo "5+50*3/20 + (19*2)/7" | bc -l`
a_rounded=`printf "%.3f" $a`
echo "a = $a"
echo "a_rounded = $a_rounded"
which outputs
a = 17.92857142857142857142
a_rounded = 17.929
?
You can use awk:
awk 'BEGIN{printf "%.3f\n", (5+50*3/20 + (19*2)/7)}'
17.929
The %.3f output format will round the number to 3 decimal places.
Try using this:
Here bc gives bash the functionality of a calculator, -l loads bc's math library so the division keeps its decimals, and finally printf prints only three decimals at the end:
read num
echo "$num" | bc -l | xargs printf "%.3f"
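A variant of the same idea that avoids xargs by using a command substitution instead (the here-string is bash-specific):
read -r num
printf '%.3f\n' "$(bc -l <<< "$num")"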
