Log analysis script in shell - shell

I'm a newbie at scripting and I need your help. I have a log file that I've already cleaned up. It looks like this (time, duration in milliseconds, action):
2012-04-28 00:00:00;277.406;
2012-04-28 00:00:00;299.680;
2012-04-28 00:00:00;282.338;
2012-02-28 00:00:00;272.241;
I need to make a script that uses the duration data and counts the actions.

First, you need to make it easier to parse the different fields. A simple way is to change the semicolons to spaces using
tr ";" " " < logfile | awkscript
Second, you need to create a table of low and high values. I'm using associative arrays whose index is the name of the bin (which becomes a column in the output). I do this in the BEGIN section.
You need to count when a value is within the low and high values. I do this in the middle section.
In the END section, I print out the values. I use 2 similar printf format strings to make sure the headers and values line up nicely:
#!/usr/bin/awk -f
BEGIN {
    low["<1ms"]=0;        high["<1ms"]=1
    low["1-10ms"]=1;      high["1-10ms"]=10
    low["10-100ms"]=10;   high["10-100ms"]=100
    low["100-500ms"]=100; high["100-500ms"]=500
    low["500+ms"]=500;    high["500+ms"]=1000000000
}
{
    # Middle section - for each line
    duration=$3
    for (i in high) {
        if ((duration > low[i]) && (duration <= high[i])) {
            # printf("duration: %d, low: %s, high: %s\n", duration, low[i], high[i]);
            total+=duration   # total duration
            bin[i]++          # store a count into different bins
            count++           # total number of measurements
        }
    }
}
END {
    average=total/count
    FMT="%-10s %10s %10s %10s %10s %10s\n"
    NFMT="%-10.3f %10s %10s %10s %10s %10s\n"
    printf(FMT, "AVG", "<1ms", "1-10ms", "10-100ms", "100-500ms", "500+ms")
    printf(NFMT, average, bin["<1ms"], bin["1-10ms"], bin["10-100ms"], bin["100-500ms"], bin["500+ms"])
}
When I run this with your data, I get
AVG <1ms 1-10ms 10-100ms 100-500ms 500+ms
282.916 4
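As an aside, and only a minimal sketch of an alternative: you can drop the tr step entirely by letting awk split on both semicolons and spaces, so the duration is still field 3. Only the BEGIN section changes; the rest of the script stays as above:
BEGIN {
    FS="[; ]"             # split fields on semicolons and spaces alike
    low["<1ms"]=0;        high["<1ms"]=1
    # ... remaining bins as above ...
}
The script can then be run directly, e.g. ./histogram.awk logfile (histogram.awk is just a hypothetical file name here).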

Aggregating data in a CSV

I have to generate an HTML file to show how I have aggregated data in a CSV file.
The structure of this file is as follows:
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;coste;positiva_droga
2022S000001;Enero;Noche;AVENIDA ALBUFERA;19;13;13_PUENTE DE VALLECAS;Choque;Despejado;Vehículo ligero;Conductor;<30;Mujer;0;Sin asistencia;443359,226;4472082,272;0;0;0
2022S000002;Enero;Noche;PLAZA CANOVAS DEL CASTILLO;2;3;3_RETIRO;Choque;Desconocido;Motocicleta;Conductor;31_60;Hombre;0;Sin asistencia;441155,351;4474129,588;1;0;0
2022S000003;Enero;Noche;CALLE SAN BERNARDO;53;1;1_CENTRO;Atropello;Despejado;Motocicleta;Conductor;Desconocido;Desconocido;0;Sin asistencia;439995,351;4475212,523;0;0;0
2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Conductor;31_60;Hombre;2;Leve;449693,925;4477837,552;0;200;0
2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Pasajero;31_60;Mujer;3;Grave;449693,925;4477837,552;0;3000;0
num_expediente is the id of the accident
fecha is the month of the accident
sexo is the gender of the person implied in the accident
coste is the cost of the accident for the person implied
I would like to create a table showing the accumulated cost per month and gender. I use this script:
#! /usr/bin/awk -f
BEGIN {FS=OFS=";"}
function loop(array, name, i) {
    for (i in array) {
        if (isarray(array[i]))
            loop(array[i], (name "[" i "]"))
        else
            printf("%s[%s] = %s\n", name, i, arr[i])
    }
}
NR!=1{
    array[$2][$13]+=$19
}
END {
    loop(array, "")
}
But the output is not aggregating the cost:
[Enero][Hombre] =
[Enero][Desconocido] =
[Enero][Mujer] =
[Febrero][Hombre] =
[Febrero][Mujer] =
[Febrero][Desconocido] =
[Marzo][Hombre] =
[Marzo][Desconocido] =
[Marzo][Mujer] =
I don't know why this is not working.
I don't have any idea how to generate the HTML out of this output. Could you help with that too?
As mentioned in the comments, OP has a typo in the printf where arr[i] should be array[i]. While that should address OP's current issue, I'm not sure I understand the use of a recursive function call unless OP's real-world problem deals with arrays of varying dimensions.
Since we're dealing with an array of known dimension (ie, 2), one simplified awk idea:
awk -F';' '
NR>1 { array[$2][$13]+=$19 }
END { for (month in array)
for (gender in array[month])
printf "[%s][%s] = %s\n", month, gender, array[month][gender]
}
' raw.csv
For the provided input this generates:
[Enero][Hombre] = 200
[Enero][Desconocido] = 0
[Enero][Mujer] = 3000
NOTES:
this solution does not address any sorting requirements OP may have for the output
for an additional sorting requirement I'd suggest OP first address the current issue and once solved then attempt to apply the additional sorting requirement and ...
if having problems with sorting then ask a new question (making sure to include a complete list of months and genders and the desired sort order for both components)
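As for OP's follow-up about generating HTML: here is only a minimal sketch (it assumes GNU awk for the true multidimensional arrays, and report.html is a hypothetical output name). It performs the same aggregation and wraps the result in a bare-bones HTML table:
awk -F';' '
NR>1 { array[$2][$13]+=$19 }
END { print "<table>"
      print "  <tr><th>fecha</th><th>sexo</th><th>coste</th></tr>"
      for (month in array)
          for (gender in array[month])
              printf "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n", month, gender, array[month][gender]
      print "</table>"
}
' raw.csv > report.html
The resulting <table> element can then be embedded in whatever HTML page skeleton is required.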

How to avoid inserting a line into the file if the line is already present in the file?

How should the check be made so that there are no duplicate lines in the file?
open ( FILE, ">newfile");
for( $a = 1; $a < 20; $a = $a + 1 ) {
my $random_number = 1+ int rand(10);;
# check to avoid inserting the line if the line is already present in the file
print FILE "Random number is $random_number \n";
}
close(FILE);
!$seen{$_}++ is a common idiom for identifying duplicates.
use feature qw( say );  # needed for say

my %seen;
for (1..19) {
    my $random_number = 1 + int rand(10);
    say "Random number is $random_number" if !$seen{$random_number}++;
}
But that doesn't guarantee that you will get all numbers from 1 to 10 in random order. If that's what you are trying to achieve, the following is a far better solution:
use List::Util qw( shuffle );
say "Random number is $_" for shuffle 1..10;
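As an aside, not part of the Perl answer: the same negated-seen idiom exists in awk, where it is handy for de-duplicating an existing file from the shell:
awk '!seen[$0]++' file
It prints only the first occurrence of each line; file is just a placeholder name.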
It seems like what you are asking is how to randomize the order of the numbers 1 to 20. I.e. no duplicates, random order. That can be easily done with a Schwartzian transform. For example:
perl -le'print for map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, rand()] } 1..20'
6
7
16
14
5
20
3
13
19
17
4
8
15
10
9
11
18
1
2
12
In this case, reading from the end backwards: we create a list of the numbers 1..20; we feed that into a map statement which turns each number into an array ref containing the number and a random number; then we feed that list of array refs to a sort, where we sort numerically on the second element of each array ref (the random number), hence producing a random order; then we transform each array ref back into a simple number with another map statement; finally we print the list using a for loop.
So in your case, the code would look something like:
print "Random number is: $_\n" for # print each number
map { $_->[0] } # restore to a number
sort { $a->[1] <=> $b->[1] } # sort the list on the random number
map { [ $_, rand() ] } # create array ref with random number as index
1 .. 20; # create list of numbers to randomize order of
Then you can use the program like below to redirect output to a file:
$ perl numbers.pl > newfile.txt
Enter each line into a hash as well, which makes it easy and efficient to check for it later
use warnings;
use strict;
use feature 'say';
my $filename = shift or die "Usage: $0 filename\n";
open my $fh, '>', $filename or die "Can't open $filename: $!";
my %existing_lines;
for my $i (1..19)
{
my $random_number = 1 + int rand(10);
# Check to avoid inserting the line if it is already in the file
if (not exists $existing_lines{$random_number}) {
say $fh "Random number is $random_number";
$existing_lines{$random_number} = 1;
}
}
close $fh;
This assumes that the intent in the question is to not repeat that number (symbolizing content to be stored without repetition).
But if it is indeed the whole line (sentence) to be avoided, where that random number is used merely to make each line different, then use the whole line as the key:
for my $i (1..19)
{
my $random_number = 1 + int rand(10);
my $line = "Random number is $random_number";
# Check to avoid inserting the line if it is already in the file
if (not exists $existing_lines{$line}) {
say $fh $line;
$existing_lines{$line} = 1;
}
}
Notes and literature
Lexical filehandles (my $fh) are much better than globs (FILE), and the three-argument open is better. See the guide perlopentut and the reference for open
Always check the open call (or die ... above). It can and does fail -- quietly. In that check, always print the error for which it failed, $!
The C-style for loop is very rarely needed, while the usual foreach (with synonym for) is much nicer to use; see it in perlsyn. The .. is the range operator
Always declare variables with my, and enforce that with the strict pragma; always use warnings
If the filehandle refers to a pipe-open (not the case here), always check its close
See perlintro for a general overview and for hashes; for more about Perl's data types see perldata. Keep in mind for later the notion of complex data structures, perldsc
Because you cannot generate 20 distinct numbers in the range [1, 10].
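One more aside, outside Perl: if the requirement is literally "do not append a line that already exists in a file on disk", a small shell sketch (newfile and the line text are placeholders, and newfile is assumed to already exist) is
line="Random number is 7"
grep -qxF -- "$line" newfile || echo "$line" >> newfile
grep -qxF does a quiet, whole-line, fixed-string match, so the echo runs only when the line is not already present.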

Moving SQL logic to the backend - bash

Some of the SQL logic is moving to the backend, and I need to generate a report using shell scripting.
To make it easier to understand, I'm simplifying it as follows.
My input file - sales.txt (id, price, month)
101,50,2019-10
101,80,2020-08
101,80,2020-10
201,100,2020-09
201,350,2020-10
The output should cover a 6-month window for each id, e.g. t1=2020-07 and t2=2020-12:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
For id 101, though there is no entry for 2020-07, it should take the value from the most recent previous month available in the sales file.
So the price=50 from 2019-10 is used for 2020-07.
For 201, the first entry itself is from 2020-09, so 2020-08 and 2020-07 are not applicable.
Wherever there are gaps, the immediately preceding month's value should be propagated.
I'm trying to use awk to solve this problem. I'm creating a reusable script util.awk, shown below, to generate the missing values, piping its output to the sort command, and then using util.awk again for the final output.
util.awk
function get_month(a,b,t1) { return strftime("%Y%m", mktime(a " " b t1)) }
BEGIN { ss=" 0 0 0 "; ts1=" 1 " ss; ts2=" 35 " ss; OFS=","; x=1 }
{
    tsc=get_month($3,$4,ts1);
    if ( NR>1 && $1==idp )
    {
        if ( tsc == tsp ) { print $1,$2,get_month($3,$4,ts1); x=0 }
        else {
            for (i=tsp; i < tsc; i=get_month(j1,j2,i))
            {
                j1=substr(i,1,4); j2=substr(i,5,2);
                print $1,tpr,i;
            }
        }
    }
    tsp=get_month($3,$4,ts2);
    idp=$1;
    tpr=$2;
    if (x!=0) print $1,$2,tsc
    x=1;
}
But it runs forever when I invoke it as awk -F"[,-]" -f util.awk sales.txt
Though I tried awk, I welcome other answers as well, as long as they work in a bash environment.
General plan:
assumption: sales.txt is already sorted (numerically) by the first column
user provides the min->max date range to be displayed (awk variables mindt and maxdt)
for a distinct id value we'll load all prices and dates into an array (prices[])
dates will be used as the indices of an associative array to store prices (prices[YYYY-MM])
once we've read all records for a given id ...
sort the prices[] array by the indices (ie, sort by YYYY-MM)
find the price for the max date less than mindt (save as prevprice)
for each date between mindt and maxdt (inclusive), if we have a price then display it (and save as prevprice) else ...
if we don't have a price but we do have a prevprice then use this prevprice as the current date's price (ie, fill the gap with the previous price)
One (GNU) awk idea:
mindate='2020-07'
maxdate='2020-12'
awk -v mindt="${mindate}" -v maxdt="${maxdate}" -v OFS=',' -F',' '
# function to add "months" (number) to "indate" (YYYY-MM)
function add_month(indate,months) {
dhms="1 0 0 0" # default day/hr/min/secs
split(indate,arr,"-")
yr=arr[1]
mn=arr[2]
return strftime("%Y-%m", mktime(arr[1]" "(arr[2]+months)" "dhms))
}
# function to print the list of prices for a given "id"
function print_id(id) {
if ( length(prices) == 0 ) # if prices array is empty then do nothing (ie, return)
return
PROCINFO["sorted_in"]="#ind_str_asc" # sort prices[] array by index in ascending order
for ( i in prices ) # loop through indices (YYYY-MM)
{ if ( i < mindt ) # as long as less than mindt
prevprice=prices[i] # save the price
else
break # no more pre-mindt indices to process
}
for ( i=mindt ; i<=maxdt ; i=add_month(i,1) ) # for our mindt - maxdt range
{ if ( !(i in prices) && prevprice ) # if no entry in prices[], but we have a prevprice, then ...
prices[i]=prevprice # set prices[] to prevprice (ie, fill the gap)
if ( i in prices ) # if we have an entry in prices[] then ...
{ prevprice=prices[i] # update prevprice (for filling future gap) and ...
print id,prices[i],i # print our data to stdout
}
}
}
BEGIN { split("",prices) } # pre-declare prices as an array
previd != $1 { print_id(previd) # when id changes print the prices[] array, then ...
previd=$1 # reset some variables for processing of the next id and ...
prevprice=""
delete prices # delete the prices[] array
}
{ prices[$3]=$2 } # for the current record create an entry in prices[]
END { print_id(previd) } # flush the last set of prices[] to stdout
' sales.txt
NOTE: This assumes sales.txt is sorted (numerically) by the first field; if this is not true then the last line should be changed to ' <(sort -n sales.txt)
This generates:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
I hope I understood your question correctly. The following awk should do the trick:
$ awk -v t1="2020-07" -v d="6" '
function next_month(d,a) {
split(d,a,"-"); if (a[2]==12) { a[1]++; a[2]=1 } else a[2]++
return sprintf("%0.4d-%0.2d",a[1],a[2])
}
BEGIN{FS=OFS=",";t2=t1; for(i=1;i<=d;++i) t2=next_month(t2)}
{k[$1]}
($3<t1){a[$1,t1]=$2}
(t1 <= $3 && $3 < t2) { a[$1,$3]=$2 }
END{ for (key in k) {
p=""; t=t1;
for(i=1;i<=d;++i) {
if(p!="" || (key,t) in a) print key, ((key,t) in a ? p=a[key,t] : p), t
t=next_month(t)
}
}
}' input.txt
We implemented a straightforward function next_month that computes the next month from a YYYY-MM value. Based on the duration of d months, we compute the time period that should be shown in the BEGIN block. The time period of interest is t1 <= t < t2.
Every time we read a record/line, we keep track of the key that has been processed and store it in the array k. This way we know which keys have been seen up to this point.
For all times before the time period of interest, we store the value in the array a with index (key,t1), while for all other times we store the value in the array a with index (key,$3).
When the file is fully processed, we just cycle over all keys and print the output. We use a bit of logic to check whether or not the month was listed in the original file.
Note: the output will be sorted in time per key, but the keys will not appear in the same order as in the original file.
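If the key order matters, one simple option (a sketch, assuming the comma-separated id,price,month output shown above) is to pipe the result through sort:
awk -v t1="2020-07" -v d="6" ' ... ' input.txt | sort -t, -k1,1n -k3,3
Here ' ... ' stands for the program above; -k1,1n sorts numerically by id, and -k3,3 keeps each id's months in ascending (chronological) order, since YYYY-MM sorts correctly as a string.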

Need help with a loop for averaging grades using awk; the output prints the same grade for every student

BEGIN {
    FS=","
    OFS = "\t"
    OFMT = "%.2f"
}
$4~/[0-9]/ {
    EARN[$1$2]+=$4
    POS[$1$2]+=$5
    CLASS[$1]++
    TYPE[$2]++
}
END {
    TOTAL=0
    for (STUDENT in CLASS) {
        HW=(EARN[$1"Homework"]/POS[$1"Homework"])*0.30
        LAB=(EARN[$1"Lab"]/POS[$1"Lab"])*0.50
        QUIZ=(EARN[$1"Quiz"]/POS[$1"Quiz"])*0.10
        FINAL=(EARN[$1"Final"]/POS[$1"Final"])*0.10
        WS=(EARN[$1"Survey"]/POS[$1"Survey"])*0.10
        TOTAL=(HW+LAB+QUIZ+FINAL+WS)*100
        GRADE="A"
        if (TOTAL < 90) {
            GRADE="B"
        }
        if (TOTAL < 80) {
            GRADE="C"
        }
        if (TOTAL < 70) {
            GRADE="D"
        }
        if (TOTAL < 60) {
            GRADE="E"
        }
    }
    print "Student\t Total \t Letter Grade"
    print STUDENT, TOTAL, "\t" GRADE
}
The code should give a unique grade for every student, but with my sample file every student receives the same grade (which I assume is the first student's grade). The code sums column 4 into EARN (earned points) and column 5 into POS (possible points).
You have 2 primary problems with your script (but good job and good effort -- you were close). The first is that you need to move your print statements: the header print to the beginning of END, e.g.
END{
print "Student\t Total \t Letter Grade"
And the per-student print to within the for (STUDENT in CLASS){ loop, e.g.
if( TOTAL < 60) {
GRADE="E"
}
print STUDENT, TOTAL, "\t" GRADE
}
}
The second, and most problematic, error is the use of $1 in, e.g., HW=(EARN[$1"Homework"]/... instead of using STUDENT (which is your loop variable), e.g. HW=(EARN[STUDENT"Homework"]/...
Along with the second issue: in your calculations, if a student does not have a specific grade category (like "Survey" or "Final", as in the data you supplied in the comment), you will get a divide-by-zero, because the corresponding element, e.g. POS[STUDENT"Survey"], does not exist and is taken as zero.
You can avoid that with a ternary that makes sure the denominator is nonzero, substituting 1 when it isn't. E.g., instead of:
FINAL=(EARN[STUDENT"Final"]/POS[STUDENT"Final"])*0.10
you can use:
FINAL=(EARN[STUDENT"Final"]/(POS[STUDENT"Final"] ? POS[STUDENT"Final"] : 1))*0.10
(this just ensures the denominator is 1 if the index isn't in the POS array)
With that change you can use:
#!/bin/awk -f
BEGIN {
    FS=","
    OFS = "\t"
    OFMT = "%.2f"
}
$4~/[0-9]/ {
    EARN[$1$2]+=$4
    POS[$1$2]+=$5
    CLASS[$1]++
    TYPE[$2]++
}
END {
    print "Student\t Total \t Letter Grade"
    TOTAL=0
    for (STUDENT in CLASS) {
        HW=(EARN[STUDENT"Homework"]/(POS[STUDENT"Homework"] ? POS[STUDENT"Homework"] : 1))*0.30
        LAB=(EARN[STUDENT"Lab"]/(POS[STUDENT"Lab"] ? POS[STUDENT"Lab"] : 1))*0.50
        QUIZ=(EARN[STUDENT"Quiz"]/(POS[STUDENT"Quiz"] ? POS[STUDENT"Quiz"] : 1))*0.10
        FINAL=(EARN[STUDENT"Final"]/(POS[STUDENT"Final"] ? POS[STUDENT"Final"] : 1))*0.10
        WS=(EARN[STUDENT"Survey"]/(POS[STUDENT"Survey"] ? POS[STUDENT"Survey"] : 1))*0.10
        TOTAL=(HW+LAB+QUIZ+FINAL+WS)*100
        GRADE="A"
        if (TOTAL < 90) {
            GRADE="B"
        }
        if (TOTAL < 80) {
            GRADE="C"
        }
        if (TOTAL < 70) {
            GRADE="D"
        }
        if (TOTAL < 60) {
            GRADE="E"
        }
        print STUDENT, TOTAL, "\t" GRADE
    }
}
(note: I would use "F" as the final grade instead of "E" -- never heard of that one..., and also use lower-case names for your user variables)
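To illustrate that note with a sketch (not part of the fix above): with lower-case variable names, the grade cascade can also be written as a small helper function that uses "F" for the failing grade:
function letter(total) {
    if (total >= 90) return "A"
    if (total >= 80) return "B"
    if (total >= 70) return "C"
    if (total >= 60) return "D"
    return "F"
}
Inside the loop the print would then become something like print student, total, "\t" letter(total), where student and total are lower-cased equivalents of the variables above.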
Example Use/Output
Based on the data you provided you would receive:
$ ./calcgrades.awk grades.txt
Student Total Letter Grade
Chelsey 86.89 B
Sam 40.77 E
Let me know if you have questions.

bash awk moving average with skipping

I am trying to calculate a moving average over a data set, but in addition I want it to skip a number of data points each time the averaging 'window' moves. For example, if my data set is a column from 1 to 20 and my averaging window is 5, then the current calculation is the average of (1-5), (2-6), (3-7), (4-8), ...
But I want to skip a few data points each time the window moves. Say I want to skip 2; then the new averages will be (1-5), (4-8), (6-10), (8-12), ...
Here is the current awk file I am using. Can anyone help me edit it so that I can skip a few data points each time the window moves? I also want to be able to change the skip size and window size. Thank you very much!
#!/bin/awk -f
BEGIN {
    N=5                # the window size
}
{
    n[NR]=$1           # store the value in an array
}
NR>=N {                # for records where NR >= N
    x=0                # reset the sum variable
    delete n[NR-N]     # delete the one that fell outside the window of N
    for (i in n)       # all array elements
        x+=n[i]        # ... must be summed
    print x/N          # print the average for the window ending at this row
}
I think your ranges are not well specified, but what you want to achieve can be done by parallel windowing, as below
awk '{sum[1]+=$1}
!(NR%5){print NR-4"-"NR, sum[1]/5; sum[1]=0}
NR>3{sum[4]+=$1}
NR>3 && !((NR-3)%5){print NR-4"-"NR, sum[4]/5; sum[4]=0}' <(seq 15)
This will give the output below; you can remove the printed ranges, which are only there for debugging.
1-5 3
4-8 6
6-10 8
9-13 11
11-15 13
To make the window size and skip count variables:
awk -v w=5 -v s=3 'function pr(x) {print (NR-s-1)"-"NR, sum[x]/w; sum[x]=0}
{sum[1]+=$1}
NR>s {sum[s+1]+=$1}
!(NR%w) {pr(1)}
NR>s && !((NR-s)%w){pr(s+1)}' file
The first window always starts at 1, and the second window starts at s+1. This can be generalized to more than 2 windows as well; perhaps you can find someone to do it...
I see that you want to print the MA every K ticks instead of printing for every tick (K=1). So you could add a condition NR%K==0 before printing in your existing code.
But it would be better to keep an array of N elements and overwrite them instead of deleting, using NR%N as the array index. This way, when K is not 1 and you don't want to calculate the MA for a given tick, you avoid having to work out how many elements to delete, etc.
awk -v n=5 -v k=2 '{ a[NR%n]=$0 }
NR>=n && (NR-n)%k==0 { s=0; for (i in a) s+=a[i]; print NR ":\t" s/n }' file
Update: the condition (NR-n)%k==0 ensures the output always starts from the first tick where the MA can be calculated (that is, NR=n).
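For example (expected behaviour sketched by hand, not output from OP's data), feeding seq 20 with n=5 and k=2 prints an average every second tick, starting at the first full window (NR=5):
$ awk -v n=5 -v k=2 '{ a[NR%n]=$0 }
NR>=n && (NR-n)%k==0 { s=0; for (i in a) s+=a[i]; print NR ":\t" s/n }' <(seq 20)
5:      3
7:      5
9:      7
11:     9
13:     11
15:     13
17:     15
19:     17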
