Moving SQL logic to backend - bash

One piece of SQL logic is moving to the backend, and I need to generate a report using shell scripting.
To keep it simple, here is a reduced example.
My input file - sales.txt (id, price, month)
101,50,2019-10
101,80,2020-08
101,80,2020-10
201,100,2020-09
201,350,2020-10
The output should cover a 6-month window for each id, e.g. t1=2020-07 and t2=2020-12:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
For id 101, though there is no entry for 2020-07, it should take from the immediate previous month value that is available in the sales file.
So the price=50 from 2019-10 is used for 2020-07.
For 201, the first entry itself is from 2020-09, so 2020-08 and 2020-07 are not applicable.
Wherever there are gaps the immediate previous month value should be propagated.
I'm trying to use awk to solve this problem. I'm creating a reusable script, util.awk (below),
to generate the missing values, pipe the result to the sort command, and then run util.awk again for the final output.
util.awk
function get_month(a,b,t1) { return strftime("%Y%m",mktime(a " " b t1)) }
BEGIN { ss=" 0 0 0 "; ts1=" 1 " ss; ts2=" 35 " ss ; OFS="," ; x=1 }
{
tsc=get_month($3,$4,ts1);
if ( NR>1 && $1==idp )
{
if( tsc == tsp) { print $1,$2,get_month($3,$4,ts1); x=0 }
else { for(i=tsp; i < tsc; i=get_month(j1,j2,i) )
{
j1=substr(i,1,4); j2=substr(i,5,2);
print $1,tpr,i;
}
}
}
tsp=get_month($3,$4,ts2);
idp=$1;
tpr=$2;
if(x!=0) print $1,$2,tsc
x=1;
}
But it runs forever when invoked as: awk -F"[,-]" -f utils.awk sales.txt
Though I tried in awk, I welcome other answers as well that would work in bash environment.

General plan:
assumption: sales.txt is already sorted (numerically) by the first column
user provides the min->max date range to be displayed (awk variables mindt and maxdt)
for a distinct id value we'll load all prices and dates into an array (prices[])
dates will be used as the indices of an associative array to store prices (prices[YYYY-MM])
once we've read all records for a given id ...
sort the prices[] array by the indices (ie, sort by YYYY-MM)
find the price for the max date less than mindt (save as prevprice)
for each date between mindt and maxdt (inclusive), if we have a price then display it (and save as prevprice) else ...
if we don't have a price but we do have a prevprice then use this prevprice as the current date's price (ie, fill the gap with the previous price)
One (GNU) awk idea:
mindate='2020-07'
maxdate='2020-12'
awk -v mindt="${mindate}" -v maxdt="${maxdate}" -v OFS=',' -F',' '
# function to add "months" (number) to "indate" (YYYY-MM)
function add_month(indate,months) {
dhms="1 0 0 0" # default day/hr/min/secs
split(indate,arr,"-")
yr=arr[1]
mn=arr[2]
return strftime("%Y-%m", mktime(arr[1]" "(arr[2]+months)" "dhms))
}
# function to print the list of prices for a given "id"
function print_id(id) {
if ( length(prices) == 0 ) # if prices array is empty then do nothing (ie, return)
return
PROCINFO["sorted_in"]="@ind_str_asc" # sort prices[] array by index in ascending order
for ( i in prices ) # loop through indices (YYYY-MM)
{ if ( i < mindt ) # as long as less than mindt
prevprice=prices[i] # save the price
else
break # no more pre-mindt indices to process
}
for ( i=mindt ; i<=maxdt ; i=add_month(i,1) ) # for our mindt - maxdt range
{ if ( !(i in prices) && prevprice ) # if no entry in prices[], but we have a prevprice, then ...
prices[i]=prevprice # set prices[] to prevprice (ie, fill the gap)
if ( i in prices ) # if we have an entry in prices[] then ...
{ prevprice=prices[i] # update prevprice (for filling future gap) and ...
print id,prices[i],i # print our data to stdout
}
}
}
BEGIN { split("",prices) } # pre-declare prices as an array
previd != $1 { print_id(previd) # when id changes print the prices[] array, then ...
previd=$1 # reset some variables for processing of the next id and ...
prevprice=""
delete prices # delete the prices[] array
}
{ prices[$3]=$2 } # for the current record create an entry in prices[]
END { print_id(previd) } # flush the last set of prices[] to stdout
' sales.txt
NOTE: This assumes sales.txt is sorted (numerically) by the first field; if this is not true then the last line should be changed to ' <(sort -n sales.txt)
This generates:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12

I hope I understood your question a bit. The following awk should do the trick
$ awk -v t1="2020-07" -v d="6" '
function next_month(d,a) {
split(d,a,"-"); if (a[2]==12) { a[1]++; a[2]=1 } else a[2]++
return sprintf("%0.4d-%0.2d",a[1],a[2])
}
BEGIN{FS=OFS=",";t2=t1; for(i=1;i<=d;++i) t2=next_month(t2)}
{k[$1]}
($3<t1){a[$1,t1]=$2}
(t1 <= $3 && $3 < t2) { a[$1,$3]=$2 }
END{ for (key in k) {
p=""; t=t1;
for(i=1;i<=d;++i) {
if(p!="" || (key,t) in a) print key, ((key,t) in a ? p=a[key,t] : p), t
t=next_month(t)
}
}
}' input.txt
We implemented a straightforward function next_month that computes the next month based on a format YYYY-MM. Based on the duration of d months, we compute the time-period that should be shown in the BEGIN block. The time-period of interest is t1 <= t < t2.
Every time we read a record/line, we keep track of the key that has been processed by storing it in the array k. This way we know which keys have been seen so far.
For all times before the period of interest, we store the value in the array a under the index (key,t1), while for all other times within the period we store it under the index (key,$3). Because later records overwrite earlier ones, this relies on each key's records appearing in chronological order.
When the file is fully processed, we just cycle over all keys and print the output. We used a bit of logic, to check whether or not the month was listed in the original file.
Note: the output will be per key sorted in time, but the key will not appear in the same order as in the original file.
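If the original key order matters, one possible variation (a sketch, not part of the answer above) is to record each id the first time it is seen and loop over that list in END instead of using for (key in k). The names seen, order and n are made up for this illustration, the sample rows from the question are fed in with printf, and only POSIX awk features are used:

```shell
out=$(printf '%s\n' 101,50,2019-10 101,80,2020-08 101,80,2020-10 201,100,2020-09 201,350,2020-10 |
awk -v t1="2020-07" -v d=6 '
# next_month: step a YYYY-MM string forward by one month using plain arithmetic
function next_month(ym,    p) {
    split(ym, p, "-")
    if (p[2] == 12) { p[1]++; p[2] = 1 } else p[2]++
    return sprintf("%04d-%02d", p[1], p[2])
}
BEGIN { FS = OFS = ","; t2 = t1; for (i = 1; i <= d; i++) t2 = next_month(t2) }
!($1 in seen)       { seen[$1] = 1; order[++n] = $1 }  # hypothetical helper arrays: first-seen order of ids
$3 <  t1            { a[$1, t1] = $2 }                  # carry the latest earlier price into t1
$3 >= t1 && $3 < t2 { a[$1, $3] = $2 }                  # price recorded inside the window
END {
    for (k = 1; k <= n; k++) {
        id = order[k]; p = ""; t = t1
        for (i = 1; i <= d; i++) {
            if ((id, t) in a) p = a[id, t]    # a real entry replaces the carried price
            if (p != "") print id, p, t       # print only once a price is known
            t = next_month(t)
        }
    }
}')
printf '%s\n' "$out"
```

Like the answer above, this assumes the records for each id appear in chronological order.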

Related

Aggregating data in a csv

I have to generate a HTML file to show how I have aggregated data in a csv file.
The structure of this file is as follows:
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;coste;positiva_droga
2022S000001;Enero;Noche;AVENIDA ALBUFERA;19;13;13_PUENTE DE VALLECAS;Choque;Despejado;Vehículo ligero;Conductor;<30;Mujer;0;Sin asistencia;443359,226;4472082,272;0;0;0
2022S000002;Enero;Noche;PLAZA CANOVAS DEL CASTILLO;2;3;3_RETIRO;Choque;Desconocido;Motocicleta;Conductor;31_60;Hombre;0;Sin asistencia;441155,351;4474129,588;1;0;0
2022S000003;Enero;Noche;CALLE SAN BERNARDO;53;1;1_CENTRO;Atropello;Despejado;Motocicleta;Conductor;Desconocido;Desconocido;0;Sin asistencia;439995,351;4475212,523;0;0;0
2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Conductor;31_60;Hombre;2;Leve;449693,925;4477837,552;0;200;0
2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Pasajero;31_60;Mujer;3;Grave;449693,925;4477837,552;0;3000;0
num_expediente is the id of the accident
fecha is the month of the accident
sexo is the gender of the person implied in the accident
coste is the cost of the accident for the person implied
I would like to create a table showing the accumulated cost per month and gender. I use this script:
#! /usr/bin/awk -f
BEGIN {FS=OFS=";"}
function loop(array, name, i) {
for (i in array) {
if (isarray(array[i]))
loop(array[i], (name "[" i "]"))
else
printf("%s[%s] = %s\n",name, i, arr[i])
}
}
NR!=1{
array[$2][$13]+=$19
}
END {loop(array, "")
}
But the output is not aggregating the cost:
[Enero][Hombre] =
[Enero][Desconocido] =
[Enero][Mujer] =
[Febrero][Hombre] =
[Febrero][Mujer] =
[Febrero][Desconocido] =
[Marzo][Hombre] =
[Marzo][Desconocido] =
[Marzo][Mujer] =
I don't know why this is not working.
I also have no idea how to generate the HTML from this output. Could you help with that too?
As mentioned in the comments OP has a typo in the printf where arr[i] should be array[i]; while this should address OP's current issue I'm not sure I understand the use of a recursive function call unless OP's real world problem is dealing with arrays of varying dimensions.
Since we're dealing with an array of known dimension (ie, 2), one simplified awk idea:
awk -F';' '
NR>1 { array[$2][$13]+=$19 }
END { for (month in array)
for (gender in array[month])
printf "[%s][%s] = %s\n", month, gender, array[month][gender]
}
' raw.csv
For the provided input this generates:
[Enero][Hombre] = 200
[Enero][Desconocido] = 0
[Enero][Mujer] = 3000
NOTES:
this solution does not address any sorting requirements OP may have for the output
for an additional sorting requirement I'd suggest OP first address the current issue and once solved then attempt to apply the additional sorting requirement and ...
if having problems with sorting then ask a new question (making sure to include a complete list of months and genders and the desired sort order for both components)
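For the HTML part of the question, which the answer above does not cover, a minimal sketch could print table rows from the same aggregation. This is only an illustration under some assumptions: it uses a composite (month, gender) key instead of gawk's arrays of arrays so that it runs in any POSIX awk, it keeps rows in first-seen order (the arrays cost and order are names invented here), and it does not HTML-escape the cell values:

```shell
html=$(printf '%s\n' \
 'num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;coste;positiva_droga' \
 '2022S000001;Enero;Noche;AVENIDA ALBUFERA;19;13;13_PUENTE DE VALLECAS;Choque;Despejado;Vehículo ligero;Conductor;<30;Mujer;0;Sin asistencia;443359,226;4472082,272;0;0;0' \
 '2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Conductor;31_60;Hombre;2;Leve;449693,925;4477837,552;0;200;0' \
 '2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Pasajero;31_60;Mujer;3;Grave;449693,925;4477837,552;0;3000;0' |
awk -F';' '
NR > 1 {
    key = $2 SUBSEP $13                      # composite key: month + gender
    if (!(key in cost)) order[++n] = key     # remember first-seen order of the pairs
    cost[key] += $19                         # accumulate the cost column
}
END {
    print "<table>"
    print "  <tr><th>Month</th><th>Gender</th><th>Cost</th></tr>"
    for (k = 1; k <= n; k++) {
        split(order[k], part, SUBSEP)        # part[1] = month, part[2] = gender
        printf "  <tr><td>%s</td><td>%s</td><td>%d</td></tr>\n", part[1], part[2], cost[order[k]]
    }
    print "</table>"
}')
printf '%s\n' "$html"
```

To run it against the real file, the printf pipeline would be replaced by passing raw.csv to awk directly.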

Needing Help on a Loop for averaging grades using awk. Output prints the same grade for every student

BEGIN {
FS=","
OFS = "\t"
OFMT = "%.2f"
}
$4~/[0-9]/ {
EARN[$1$2]+=$4
POS[$1$2]+=$5
CLASS[$1]++
TYPE[$2]++
}
END{
TOTAL=0
for (STUDENT in CLASS){
HW=(EARN[$1"Homework"]/POS[$1"Homework"])*0.30
LAB=(EARN[$1"Lab"]/POS[$1"Lab"])*0.50
QUIZ=(EARN[$1"Quiz"]/POS[$1"Quiz"])*0.10
FINAL=(EARN[$1"Final"]/POS[$1"Final"])*0.10
WS=(EARN[$1"Survey"]/POS[$1"Survey"])*0.10
TOTAL=(HW+LAB+QUIZ+FINAL+WS)*100
GRADE= "A"
if (TOTAL < 90) {
GRADE="B"
}
if ( TOTAL < 80){
GRADE="C"
}
if (TOTAL < 70){
GRADE="D"
}
if( TOTAL < 60) {
GRADE="E"
}
}
print "Student\t Total \t Letter Grade"
print STUDENT, TOTAL, "\t" GRADE
}
The code /should/ give a unique grade for every student, but with my sample file every student receives the same grade (which I assume is the first student's grade). The code goes through column 4, EARN (earned points), and compares it to column 5, POS (possible points).
You have 2 primary problems with your script (but good job and good effort -- you were close). The first is you need to move your print statements. The first to the beginning of END, e.g.
END{
print "Student\t Total \t Letter Grade"
And the next within the for (STUDENT in CLASS){ loop, e.g.
if( TOTAL < 60) {
GRADE="E"
}
print STUDENT, TOTAL, "\t" GRADE
}
}
The second and most problematic error is the use of $1 in, e.g., HW=(EARN[$1"Homework"]/... instead of STUDENT (which is your loop variable), e.g. HW=(EARN[STUDENT"Homework"]/...
Along with the second, if a student does not have a specific grade category (like "Survey" or "Final" in the data you supplied in the comment), you will get a divide-by-zero error, because there is no corresponding element such as POS[STUDENT"Survey"] and it is taken as zero.
You can avoid that with a ternary that uses the element when it is nonzero and 1 otherwise, e.g. instead of:
FINAL=(EARN[STUDENT"Final"]/POS[STUDENT"Final"])*0.10
you can use:
FINAL=(EARN[STUDENT"Final"]/(POS[STUDENT"Final"] ? POS[STUDENT"Final"] : 1))*0.10
(this just ensures the denominator is 1 if the index isn't in the POS array)
With that change you can use:
#!/bin/awk -f
BEGIN {
FS=","
OFS = "\t"
OFMT = "%.2f"
}
$4~/[0-9]/ {
EARN[$1$2]+=$4
POS[$1$2]+=$5
CLASS[$1]++
TYPE[$2]++
}
END{
print "Student\t Total \t Letter Grade"
TOTAL=0
for (STUDENT in CLASS){
HW=(EARN[STUDENT"Homework"]/(POS[STUDENT"Homework"] ? POS[STUDENT"Homework"] : 1))*0.30
LAB=(EARN[STUDENT"Lab"]/(POS[STUDENT"Lab"] ? POS[STUDENT"Lab"] : 1))*0.50
QUIZ=(EARN[STUDENT"Quiz"]/(POS[STUDENT"Quiz"] ? POS[STUDENT"Quiz"] : 1))*0.10
FINAL=(EARN[STUDENT"Final"]/(POS[STUDENT"Final"] ? POS[STUDENT"Final"] : 1))*0.10
WS=(EARN[STUDENT"Survey"]/(POS[STUDENT"Survey"] ? POS[STUDENT"Survey"] : 1))*0.10
TOTAL=(HW+LAB+QUIZ+FINAL+WS)*100
GRADE= "A"
if (TOTAL < 90) {
GRADE="B"
}
if ( TOTAL < 80){
GRADE="C"
}
if (TOTAL < 70){
GRADE="D"
}
if( TOTAL < 60) {
GRADE="E"
}
print STUDENT, TOTAL, "\t" GRADE
}
}
(note: I would use "F" as the final grade instead of "E" -- never heard of that one..., and also use lower-case names for your user variables)
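The repeated ternary could also be factored into a small helper function. The following is only a sketch: the function name share, the helper array names, and the three input rows are made up for illustration (the real grades.txt is not shown here), and a category with no possible points is simply counted as 0:

```shell
out=$(printf '%s\n' 'Chelsey,Homework,H01,18,20' 'Chelsey,Lab,L01,45,50' 'Sam,Homework,H01,10,20' |
awk -F',' '
# share (hypothetical helper): earned/possible scaled by weight, or 0 when nothing was possible
function share(key, weight) {
    return (POS[key] > 0) ? EARN[key] / POS[key] * weight : 0
}
$4 ~ /[0-9]/ {
    EARN[$1 $2] += $4
    POS[$1 $2]  += $5
    if (!($1 in CLASS)) names[++n] = $1      # keep students in first-seen order
    CLASS[$1]++
}
END {
    for (k = 1; k <= n; k++) {
        s = names[k]
        total = (share(s "Homework", 0.30) + share(s "Lab", 0.50) + share(s "Quiz", 0.10) \
               + share(s "Final", 0.10) + share(s "Survey", 0.10)) * 100
        printf "%s\t%.2f\n", s, total
    }
}')
printf '%s\n' "$out"
```

The weights are copied from the script above; the loop over names[] replaces for (STUDENT in CLASS) only to make the output order predictable.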
Example Use/Output
Based on the data you provided you would receive:
$ ./calcgrades.awk grades.txt
Student Total Letter Grade
Chelsey 86.89 B
Sam 40.77 E
Let me know if you have questions.

Log analysis script in shell

I'm a newbie in scripting and I need your help. I have a log file that I've cleaned up. It looks like this (time, duration in milliseconds, action):
2012-04-28 00:00:00;277.406;
2012-04-28 00:00:00;299.680;
2012-04-28 00:00:00;282.338;
2012-02-28 00:00:00;272.241;
I need to make a script that uses the duration data and counts the actions.
First - you need to make it easier to parse the different fields. A simple way is to change a semicolon to a space using
tr ";" " " <logfile|awkscript
Second, you need to create a table of low and high values. I'm using an associative array whose index is the name of the column. I do this in the BEGIN section.
You need to count when a value is within the low and high values. I do this in the middle section.
In the END section, I print out the values. I use 2 similar printf format strings to make sure the headers and values line up nicely:
#!/usr/bin/awk -f
BEGIN {
low["<1ms"]=0;high["<1ms"]=1
low["1-10ms"]=1;high["1-10ms"]=10
low["10-100ms"]=10;high["10-100ms"]=100
low["100-500ms"]=100;high["100-500ms"]=500
low[">500ms"]=500;high[">500ms"]=1000000000
}
{
# Middle section - for each line
duration=$3
for (i in high) {
if ((duration > low[i]) && (duration <= high[i]) ) {
# printf("duration: %d, low: %s,high: %s\n", duration, low[i], high[i]);
total+=duration # total duration
bin[i]++ # store a count into different bins
count++ # total number of measurements
}
}
}
END {
average=total/count
FMT="%-10s %10s %10s %10s %10s %10s\n"
NFMT="%-10.3f %10s %10s %10s %10s %10s\n"
printf(FMT,"AVG", "<1ms", "1-10ms", "10-100ms", "100-500ms", "500+ms")
# the last bin is stored under the key ">500ms" (see BEGIN), so read it back with that key
printf(NFMT,average, bin["<1ms"], bin["1-10ms"], bin["10-100ms"], bin["100-500ms"], bin[">500ms"])
}
When I run this with your data, I get
AVG              <1ms     1-10ms   10-100ms  100-500ms     500+ms
282.916                                              4
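As a variation on the script above, the tr step can be skipped by letting awk split on the semicolon itself, and the bin lookup can live in a function so that each label is spelled in one place. This is a hedged sketch rather than the answer's method: the function name label is invented here, the four sample lines are fed in with printf, and unlike the original it also counts a duration of exactly 0 in the first bin:

```shell
out=$(printf '%s\n' '2012-04-28 00:00:00;277.406;' '2012-04-28 00:00:00;299.680;' \
                    '2012-04-28 00:00:00;282.338;' '2012-02-28 00:00:00;272.241;' |
awk -F';' '
# label (hypothetical helper): map a duration in milliseconds to the name of its bin
function label(ms) {
    if (ms <= 1)   return "<1ms"
    if (ms <= 10)  return "1-10ms"
    if (ms <= 100) return "10-100ms"
    if (ms <= 500) return "100-500ms"
    return ">500ms"
}
$2 != "" { bin[label($2 + 0)]++; total += $2; count++ }   # with FS=";" the duration is field 2
END {
    if (count) printf "AVG %.3f\n", total / count          # guard against an empty log
    n = split("<1ms 1-10ms 10-100ms 100-500ms >500ms", names, " ")
    for (k = 1; k <= n; k++) printf "%s %d\n", names[k], bin[names[k]]
}')
printf '%s\n' "$out"
```

Here the counts are printed one bin per line; the two printf format strings from the script above could be reused if the single-row layout is preferred.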

How to deal with this situation when picking a single owner from a list of owners using perl hashes?

I run a Perforce command on a list of files, and after some parsing I generate a file that contains owners like this (call it owner.log):
ownerA
ownerB
ownerC
ownerA
ownerA
Then I go through the owner.log file and pick an owner like this:
while(<OWNER>) {
$vote->{$_} += 1;
}
The owner with the highest vote then gets selected for email notification. But the problem is when I have an owner log like this:
ownerA
ownerB
ownerC
ownerD
Each one gets the same vote. How should I pick one?
Thank you.
Is there a quick way of finding out whether all hash entries have the same value? That way I can pick one at random.
One way to determine if all hash keys have the same value is to use uniq. If there is only one common value, use the keys of your hash as an array and use rand to find a random index within the array bounds:
use List::MoreUtils qw(uniq);
my @keys = keys %hash;
my @vals = values %hash;
if (scalar uniq(@vals) == 1) {
print "all of equal weight\n";
print $keys[ int(rand(@keys)) ], "\n";
}
Assuming the array @winners:
print "The winner is: ", $winners[rand @winners];
The whole process:
my $last = 0;
my @winners;
for my $name (sort { $vote->{$b} <=> $vote->{$a} } keys %$vote) {
last if ($vote->{$name} < $last);
push @winners, $name;
$last = $vote->{$name};
}
my $winner = $winners[rand @winners];
print "The winner is, by ",
@winners == 1 ? "unanimous vote: " : "luck of the draw: ", $winner;

Sort an associative array in awk

I have an associative array in awk that gets populated like this:
chr_count[$3]++
When I try to print my chr_counts, I use this:
for (i in chr_count) {
print i,":",chr_count[i];
}
But not surprisingly, the order of i is not sorted in any way.
Is there an easy way to iterate over the sorted keys of chr_count?
Instead of asort, use asorti(source, destination) which sorts the indices into a new array and you won't have to copy the array.
Then you can use the destination array as pointers into the source array.
For your example, you would use it like this:
n=asorti(chr_count, sorted)
for (i=1; i<=n; i++) {
print sorted[i] " : " chr_count[sorted[i]]
}
You can also pipe the output through the sort command, e.g.:
for ( i in data )
print i ":", data[i] | "sort"
I recently came across this issue and found that with gawk I could set the value of PROCINFO["sorted_in"] to control iteration order. I found a list of valid values for this by searching for PROCINFO online and landed on this GNU Awk User's Guide page: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
This lists options of the form @{ind|val}_{num|type|str}_{asc|desc} with:
ind sorting by key (index) and val sorting by value.
num sorting numerically, str by string and type by assigned type.
asc for ascending order and desc for descending order.
I simply used:
PROCINFO["sorted_in"] = "@val_num_desc"
for (i in map) print i, map[i]
And the output was sorted in descending order of values.
Note that asort() and asorti() are specific to gawk, and are unknown to awk. For plain awk, you can roll your own sort() or get one from elsewhere.
This is taken directly from the documentation:
# ... populate the array data ...
# copy indices
j = 1
for (i in data) {
ind[j] = i # index value becomes element value
j++
}
n = asort(ind) # index values are now sorted
for (i = 1; i <= n; i++) {
# do something with ind[i]: work with sorted indices directly
# ...
# do something with data[ind[i]]: access original array via sorted indices
}
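For plain awk without asort() or asorti(), a hand-rolled sort of the keys might look like the sketch below. The function name sort_keys and the three-column input lines are assumptions made for this example. It copies the indices into a numerically indexed array and insertion-sorts them as strings, which is fine for a modest number of keys:

```shell
out=$(printf '%s\n' 'a b chr2' 'a b chr10' 'a b chr1' 'a b chr2' |
awk '
# sort_keys (hypothetical helper): copy the indices of src into dest[1..n],
# insertion-sort them as strings, and return n
function sort_keys(src, dest,    k, n, i, j, tmp) {
    n = 0
    for (k in src) dest[++n] = k
    for (i = 2; i <= n; i++) {
        tmp = dest[i]
        for (j = i - 1; j >= 1 && (dest[j] "") > (tmp ""); j--) dest[j + 1] = dest[j]
        dest[j + 1] = tmp
    }
    return n
}
{ chr_count[$3]++ }
END {
    n = sort_keys(chr_count, sorted)
    for (i = 1; i <= n; i++) print sorted[i] " : " chr_count[sorted[i]]
}')
printf '%s\n' "$out"
```

Because the comparison is string-based, chr10 sorts before chr2; a numeric or natural ordering would need a different comparison inside the inner loop.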
