Calculating percentile for each request from a log file based on start time and end time using bash script - bash

I have a simulation.log file containing results like the ones below, and I want to calculate the 5th, 25th, 95th and 99th percentile of each request by reading the file with a shell script.
Below is a sample of the simulation.log file, where 1649410339141 and 1649410341026 are the start and end times in milliseconds.
REQUEST1 somelogprinted TTP123099SM000202 002 1649410339141 1649410341026 OK
REQUEST2 somelogprinted TTP123099SM000202 001 1649410339141 1649410341029 OK
......
I tried the code below, but it did not give me any result (I am not a Unix developer):
FILE=filepath
sort -n $* > $FILE
N=$(wc -l $FILE | awk '{print $1}')
P50=$(dc -e "$N 2 / p")
P90=$(dc -e "$N 9 * 10 / p")
P99=$(dc -e "$N 99 * 100 / p")
echo ";; 50th, 90th and 99th percentiles for $N data points"
awk "FNR==$P50 || FNR==$P90 || FNR==$P99" $FILE
Sample output:
Request | 5thpercentile | 25Percentile | 95Percentile | 99Percentile
Request1 | 657 | 786 | 821 | 981
Request2 | 453 | 654 | 795 | 854
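For reference, here is a minimal sketch of one way to do this, assuming GNU awk (for asort and arrays of arrays) and the field layout shown above, with the start and end timestamps in fields 5 and 6; percentiles use the nearest-rank method (index = ceil(p/100 * n)):
gawk '
{
    # duration in milliseconds, grouped by request name (field 1)
    dur[$1][++n[$1]] = $6 - $5
}
END {
    print "Request | 5thpercentile | 25Percentile | 95Percentile | 99Percentile"
    for (r in dur) {
        # copy the durations of this request into a flat array and sort ascending
        delete t
        for (i = 1; i <= n[r]; i++) t[i] = dur[r][i]
        asort(t)
        printf "%s | %d | %d | %d | %d\n", r,
            t[int((5*n[r]+99)/100)],  t[int((25*n[r]+99)/100)],
            t[int((95*n[r]+99)/100)], t[int((99*n[r]+99)/100)]
    }
}' simulation.log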

Related

How to sum file sizes from an ls-like output log with Bytes, KiB, MiB, GiB

I have a pre-computed ls-like output (it is not from the actual ls command) and I cannot modify or recalculate it. It looks as follows:
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
The log contains at least GiB, MiB, KiB and Bytes units. Possible values include zero, and values with or without a decimal part, for each prefix:
0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB
A similar approach is the following:
awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1; x = $1; y = 1; while( x > 1024 ) { x = (x + 1023)/1024; y++; } printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 ) { total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'
but it does not work correctly in my case:
$ head -n 10 files_sizes.log | awk '{print $3,$4}' | sort | awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1; x = $1; y = 1; while( x > 1024 ) { x = (x + 1023)/1024; y++; } printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 ) { total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'
0K Bytes
1.1K KiB
201K Bytes
201K Bytes
201K Bytes
201K Bytes
201K Bytes
411K Bytes
590K Bytes
920K Bytes
Total: 3.8M
This output calculates the sizes wrongly. My desired output is the correct total sum of the input log file:
0 Bytes
201 Bytes
201 Bytes
201 Bytes
411 Bytes
201 Bytes
1.1 KiB
201 Bytes
920 Bytes
590 Bytes
Total: 3.95742 KiB
NOTE
The correct value for the sum of the Bytes entries is
201 * 5 + 411 + 590 + 920 = 2926, so the total after adding the KiB entry is
2926 / 1024 + 1.1 = 2.857422 + 1.1 = 3.957422 KiB = 4052.4 Bytes
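A quick way to double-check that arithmetic with bc (which truncates at the given scale, hence the last digit):
$ echo 'scale=6; (201*5 + 411 + 590 + 920)/1024 + 1.1' | bc
3.957421
$ echo '201*5 + 411 + 590 + 920 + 1.1*1024' | bc
4052.4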
[UPDATE]
I have updated the question with a comparison of the results from the KamilCuk, Ted Lyngmo and Walter A solutions, which give pretty much the same values:
$ head -n 10 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
117538 Bytes
$ head -n 1000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
1225857 Bytes
$ head -n 10000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
12087518 Bytes
$ head -n 1000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
77238840381 Bytes
$ head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
2306569381835 Bytes
and
$ head -n 10 files_sizes.log | ./count_files.sh
3.957422 KiB
$ head -n 1000 files_sizes.log | ./count_files.sh
1.168946 MiB
$ head -n 10000 files_sizes.log | ./count_files.sh
11.526325 MiB
$ head -n 1000000 files_sizes.log | ./count_files.sh
71.934024 GiB
$ head -n 100000000 files_sizes.log | ./count_files.sh
2.097807 TiB
and
(head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
2306563692898.8
where
2.097807 TiB = 2.3065631893 TB = 2306569381835 Bytes
I have also compared all three solutions for speed:
$ time head -n 100000000 files_sizes.log | ./count_files.sh
2.097807 TiB
real 2m7.956s
user 2m10.023s
sys 0m1.696s
$ time head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
2306569381835 Bytes
real 4m12.896s
user 5m45.750s
sys 0m4.026s
$ time (head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
2306563692898.8
real 4m31.249s
user 6m40.072s
sys 0m4.252s
Use numfmt to convert those numbers.
cat <<EOF |
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
2019-02-17 14:43:10 3.9 KiB folder/file9.txt
2019-02-17 14:43:10 2.7 MiB folder/file9.txt
2019-02-17 14:43:10 1.3 GiB folder/file9.txt
EOF
# extract 3rd and 4th column
tr -s ' ' | cut -d' ' -f3,4 |
# Remove space, remove "Bytes", remove "B"
sed 's/ //; s/Bytes//; s/B//' |
# convert to numbers
numfmt --from=auto |
# sum
awk '{s+=$1}END{print s}'
outputs the sum.
For input like the one described:
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
You could use a table of units that you'd like to be able to decode:
BEGIN {
    unit["Bytes"] = 1;
    unit["kB"] = 10**3;
    unit["MB"] = 10**6;
    unit["GB"] = 10**9;
    unit["TB"] = 10**12;
    unit["PB"] = 10**15;
    unit["EB"] = 10**18;
    unit["ZB"] = 10**21;
    unit["YB"] = 10**24;
    unit["KB"] = 1024;
    unit["KiB"] = 1024**1;
    unit["MiB"] = 1024**2;
    unit["GiB"] = 1024**3;
    unit["TiB"] = 1024**4;
    unit["PiB"] = 1024**5;
    unit["EiB"] = 1024**6;
    unit["ZiB"] = 1024**7;
    unit["YiB"] = 1024**8;
}
Then just sum it up in the main loop:
{
    if ($4 in unit) total += $3 * unit[$4];
    else printf("ERROR: Can't decode unit at line %d: %s\n", NR, $0);
}
And print the result at the end:
END {
    binaryunits[0] = "Bytes";
    binaryunits[1] = "KiB";
    binaryunits[2] = "MiB";
    binaryunits[3] = "GiB";
    binaryunits[4] = "TiB";
    binaryunits[5] = "PiB";
    binaryunits[6] = "EiB";
    binaryunits[7] = "ZiB";
    binaryunits[8] = "YiB";
    for (i = 8;; --i) {
        if (total >= 1024**i || i == 0) {
            printf("%.3f %s\n", total/(1024**i), binaryunits[i]);
            break;
        }
    }
}
Output:
3.957 KiB
Note that you can add a shebang at the beginning of the awk script so that it can run on its own and you won't have to put it in a bash script:
#!/usr/bin/awk -f
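For example, with the shebang followed by the BEGIN, main and END blocks above saved together in one executable file (sum_sizes.awk is just an illustrative name) and the ten sample lines in sample.log:
$ chmod +x sum_sizes.awk
$ ./sum_sizes.awk sample.log
3.957 KiB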
You can parse the input before sending it to bc:
echo "0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB" |
sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;
s/GiB/* 1024 * 1024 * 1024/; s/$/ + /' |
tr -d '\n' |
sed 's/+ $/\n/' |
bc
When your sed doesn't support \n in the replacement, you can try replacing the '\n' with a real newline, like
sed 's/+ $/
/'
or add an echo after parsing (and move part of the last sed into the first sed command so that it removes the trailing +):
(echo "0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB" | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;
s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
Very good idea from @KamilCuk to make use of numfmt. Based on his answer, here is an alternative command which uses a single awk call wrapping numfmt with a two-way pipe. It requires a recent version of GNU awk (OK with 5.0.1, unstable with 4.1.4, not tested in between).
LC_NUMERIC=C gawk '
BEGIN {
    conv = "numfmt --from=auto"
    PROCINFO[conv, "pty"] = 1
}
{
    sub(/B.*/, "", $4)
    print $3 $4 |& conv
    conv |& getline val
    sum += val
}
END { print sum }
' input
Notes:
LC_NUMERIC=C (bash/ksh/zsh) is for portability on systems using a non-English locale.
PROCINFO[conv, "pty"] = 1 makes the output of numfmt get flushed on each line (to avoid a deadlock).
Let me give you a better way to work with ls: don't use it as a command, but as a find switch:
find . -maxdepth 1 -ls
This returns file sizes in a uniform unit, as explained in find's manpage, which makes it far easier to do calculations on.
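For instance, with GNU find the byte size is the seventh field of the -ls output, so a sketch of a total would be:
find . -maxdepth 1 -type f -ls | awk '{ sum += $7 } END { print sum " Bytes" }'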
Good luck

Print out the value with the highest number of occurrences in a file

In a bash shell script, I want to go through a list of numbers and then print out the number that occurs most often. If several different numbers appear an equal number of times, I want to print the highest number. For example, in a file like this:
10
10
10
15
15
20
20
20
20
I want to print the value 20.
How can I achieve this?
If the numbers are in a file, one per line:
sort -n < myfile | uniq -c | sort -k1,1nr -k2,2nr | head -1
without the count:
A=$(sort -n < myfile | uniq -c | sort -k1,1nr -k2,2nr | head -1)
set -- $A
echo $2
You can use this command -
echo 10 10 10 15 15 20 20 20 20 | sed 's/ /\n/g' | sort | uniq -c | sort -V | tail -n 1 | awk '{print $2}'
It will print the number you want.
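Both of the above rely on sort for the tie-breaking; a single-pass awk sketch that makes the rule explicit (highest count first, then highest value, assuming positive numbers as in the example) would be:
awk '{ count[$1]++ }
     END {
         for (v in count)
             if (count[v] > best || (count[v] == best && v+0 > top+0)) {
                 best = count[v]
                 top = v
             }
         print top
     }' myfile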

parse command line values

I'm having a problem with my script and I can't seem to see where the problem occurs.
rules=$(echo "$result" | grep '^[[:space:]]\{2\}[0-9]\|^\*' | sed 's/^.//' | \
awk '{ x = $0 "\n" x } END { printf "%s", x }' | awk '{print $1}')
numRules=$(echo "$rules" | wc -l)
This is my script for the data below this would be the value of $result
ID Action Category From Hits
----------------------------
100 deny trial1 herb 0
200 deny trial2222 herb.patrick 0
300 deny triaaaals herb.patrick.hernandez 0
My goal is to get the IDs, which are 100, 200 and 300, placed in $rules, and to get the total count of IDs; for this example, 3 would be the right value for $numRules.
$rules= 100 200 300
$numRules = 3
With GNU grep and an array:
rules=($(grep -o '^[0-9]\+' file))
numRules=${#rules[@]}
echo ${rules[@]}
echo $numRules
Output:
100 200 300
3
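If GNU grep is not available, an awk sketch along the same lines (reusing the question's variables) could be:
rules=$(echo "$result" | awk '/^[0-9]/ { print $1 }')   # keep the lines that start with a digit
numRules=$(echo "$rules" | wc -l)
echo $rules      # 100 200 300
echo $numRules   # 3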

Split a column into separate columns based on value

I have a tab delimited file that looks as follows:
cat myfile.txt
gives:
1 299
1 150
1 50
1 57
2 -45
2 62
3 515
3 215
3 -315
3 -35
3 3
3 6789
3 34
5 66
5 1334
5 123
I'd like to use Unix commands to get a tab-delimited file in which, based on the values in column #1, each column of the output file holds all the corresponding values of column #2
(I'm using the separator "|" here instead of a tab only to illustrate my desired output file):
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
The corresponding headers (1, 2, 3, 5; based on the column #1 values) would be a nice addition to the code (as shown below), but the main request is to split the information of the first file into separate columns. Thanks!
1 | 2 | 3 | 5
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
Here's a one-liner that matches your output. It builds a string $ARGS containing as many process substitutions as there are unique values in the first column. Then $ARGS is used as the argument for the paste command:
HEADERS=$(cut -f 1 file.txt | sort -n | uniq); ARGS=""; for h in $HEADERS; do ARGS+=" <(grep ^"$h"$'\t' file.txt | cut -f 2)"; done; echo $HEADERS | tr ' ' '|'; eval "paste -d '|' $ARGS"
Output:
1|2|3|5
299|-45|515|66
150|62|215|1334
50||-315|
57||-35|
||3|
You can use GNU awk:
awk '
BEGIN { max = 0 }
{
    d[$1][length(d[$1])+1] = $2;
    if (length(d[$1]) > max)
        max = length(d[$1]);
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc";
    line = "";
    flag = 0;
    for (j in d) {
        line = line (flag ? "\t|\t" : "") j;
        flag = 1;
    }
    print line;
    for (i = 1; i <= max; ++i) {
        line = "";
        flag = 0;
        for (j in d) {
            line = line (flag ? "\t|\t" : "") d[j][i];
            flag = 1;
        }
        print line;
    }
}' file.txt
you get
1 | 2 | 3 | 5
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
Or you can use Python; for example, in split2Columns.py:
import sys
records = [line.split() for line in open(sys.argv[1])]
import collections
records_dict = collections.defaultdict(list)
for key, val in records:
records_dict[key].append(val)
from itertools import izip_longest
print "\t|\t".join(records_dict.keys())
print "\n".join(("\t|\t".join(map(str,l)) for l in izip_longest(*records_dict.values(), fillvalue="")))
python split2Columns.py file.txt
you get the same result
@Jose Ricardo Bustos M. - thanks for your answer! I unfortunately couldn't install GNU awk on my Mac, but based on your suggested answer I've done something similar using awk:
HEADERS=$(cut -f 1 try.txt | awk '!x[$0]++');
H=( ${HEADERS// / });
MAXUNIQNUM=$(cut -f 1 try.txt |uniq -c|awk '{print $1}'|sort -nr|head -1);
awk -v header="${H[*]}" -v max=$MAXUNIQNUM \
'BEGIN {
split(header,headerlist," ");
for (q = 1;q <= length(headerlist); q++)
{counter[q]=1;}
}
{for (z = 1; z <= length(headerlist); z++){
if (headerlist[z] == $1){
arr[counter[z],headerlist[z]] = $2;
counter[z]++
};
}
}
END {
for (x = 1; x <= max; x++){
for (y = 1; y<= length(headerlist); y++){
printf "%s\t",arr[x,headerlist[y]];
}
printf "\n"
}
}' try.txt
This uses an array to keep track of the column headings, uses them to name temporary files, and pastes everything together at the end:
#!/bin/bash
infile=$1
filenames=()
idx=0
while read -r key value; do
if [[ "${filenames[$idx]}" != "$key" ]]; then
(( ++idx ))
filenames[$idx]="$key"
echo -e "$key\n----" > "$key"
fi
echo "$value" >> "$key"
done < "$infile"
paste "${filenames[@]}"
rm "${filenames[@]}"

How to retrieve the specified parameters in a file (Shell Scripting)?

Here is my query:
/path/newdir/newtext.csv
newtext.csv looks like below :
Record 1
line 1
line 2
Sample Number: 123456789 (line no. 3)
|
|
|
|
|
Time In: 2012-05-29T10:21:06Z (line no. 21)
|
|
|
Time Out: 2012-05-29T13:07:46Z (line no. 30)
Record 2
line 1
line 2
Sample Number: 363214563 (line no. 3)
|
|
|
|
|
Time In: 2012-05-29T10:21:06Z (line no. 21)
|
|
|
Time Out: 2012-05-29T13:07:46Z (line no. 30)
Record 3
line 1
line 2
Sample Number: 987654321 (line no. 3)
|
|
|
|
|
Time In: 2012-05-29T10:21:06Z (line no. 21)
|
|
|
Time Out: 2012-05-29T13:07:46Z (line no. 30)
Assume there are 100 such records in newtext.csv.
So now I need the parameters for the entered input string, something like below.
Example Input Search String:
123456789
Example Output:
Sample Number: 123456789
Time In: 2012-05-29T10:21:06Z
Time Out: 2012-05-29T13:07:46Z
This is exactly what I need. Can you please help me?
For a plain input string*,
grep -F "InputString" -A27 inputFile.csv | sed -n '1p;19p;$p'
For a pattern (extended regex) string*,
grep -E "InputPattern" -A27 inputFile.csv | sed -n '1p;19p;$p'
Script:
user$ cat script.sh
#!/bin/bash
grep -F "$1" -A27 inputFile.csv | sed -n '1p;19p;$p'
user$ chmod +x script.sh
user$ ./script.sh "inputString"
Edit:
A solution not based on line numbers:
#!/bin/bash
grep -F "$1" -A27 inputFile.csv |sed -n "/$1/p;/^Time\s[^:]*:/p"
* The input must be unique to the file.
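An awk sketch that keys off the "Sample Number:" lines themselves instead of a fixed line offset (assuming the labels start their lines as in the sample) could look like this, where 123456789 is just the example search value and a script would pass "$1" as id instead:
awk -v id="123456789" '
    /^Sample Number:/ { found = ($3 == id) }
    found && (/^Sample Number:/ || /^Time In:/ || /^Time Out:/) { print $1, $2, $3 }
' newtext.csv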
Try This
csv.txt (Input File)
Sample Number: 123456789 (line no. 3)
|
|
|
|
|
Time In: 2012-05-29T10:21:06Z (line no. 21)
|
|
|
Time Out: 2012-05-29T13:07:46Z (line no. 30)
line 1
line 2
Sample Number: 363214563 (line no. 3)
|
|
|
|
|
Time In: 2012-05-29T10:21:06Z (line no. 21)
|
|
|
Time Out: 2012-05-29T13:07:46Z (line no. 30)
line 1
line 2
Sample Number: 987654321 (line no. 3)
|
|
|
|
|
Time In: 2012-05-29T10:21:06Z (line no. 21)
|
|
|
Time Out: 2012-05-29T13:07:46Z (line no. 30)
csv.sh (Code)
echo "Enter your search string:"
read name
grep -A 10 "$name" csv.txt | grep -v "|" | awk -F "(" '{print $1}'
Output
Enter your search string: 123456789
Sample Number: 123456789
Time In: 2012-05-29T10:21:06Z
Time Out: 2012-05-29T13:07:46Z
