I have to generate an HTML file to show how I have aggregated data in a CSV file.
The structure of this file is as follows:
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;coste;positiva_droga
2022S000001;Enero;Noche;AVENIDA ALBUFERA;19;13;13_PUENTE DE VALLECAS;Choque;Despejado;Vehículo ligero;Conductor;<30;Mujer;0;Sin asistencia;443359,226;4472082,272;0;0;0
2022S000002;Enero;Noche;PLAZA CANOVAS DEL CASTILLO;2;3;3_RETIRO;Choque;Desconocido;Motocicleta;Conductor;31_60;Hombre;0;Sin asistencia;441155,351;4474129,588;1;0;0
2022S000003;Enero;Noche;CALLE SAN BERNARDO;53;1;1_CENTRO;Atropello;Despejado;Motocicleta;Conductor;Desconocido;Desconocido;0;Sin asistencia;439995,351;4475212,523;0;0;0
2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Conductor;31_60;Hombre;2;Leve;449693,925;4477837,552;0;200;0
2022S000004;Enero;Noche;CALLE ALCALA;728;20;20_SAN BLAS-CANILLEJAS;Choque;Despejado;Vehículo ligero;Pasajero;31_60;Mujer;3;Grave;449693,925;4477837,552;0;3000;0
num_expediente is the id of the accident
fecha is the month of the accident
sexo is the gender of the person involved in the accident
coste is the cost of the accident for the person involved
I would like to create a table showing the accumulated cost per month and gender. I use this script:
#! /usr/bin/awk -f
BEGIN { FS = OFS = ";" }
function loop(array, name, i) {
    for (i in array) {
        if (isarray(array[i]))
            loop(array[i], (name "[" i "]"))
        else
            printf("%s[%s] = %s\n", name, i, arr[i])
    }
}
NR != 1 {
    array[$2][$13] += $19
}
END {
    loop(array, "")
}
But the output is not aggregating the cost:
[Enero][Hombre] =
[Enero][Desconocido] =
[Enero][Mujer] =
[Febrero][Hombre] =
[Febrero][Mujer] =
[Febrero][Desconocido] =
[Marzo][Hombre] =
[Marzo][Desconocido] =
[Marzo][Mujer] =
I don't know why this is not working.
I have no idea how to generate the HTML out of this output. Could you help with that too?
As mentioned in the comments, OP has a typo in the printf where arr[i] should be array[i]. While this should address OP's current issue, I'm not sure I understand the use of a recursive function call unless OP's real-world problem is dealing with arrays of varying dimensions.
Since we're dealing with an array of known dimension (ie, 2), one simplified awk idea:
awk -F';' '
NR>1 { array[$2][$13]+=$19 }
END { for (month in array)
          for (gender in array[month])
              printf "[%s][%s] = %s\n", month, gender, array[month][gender]
    }
' raw.csv
For the provided input this generates:
[Enero][Hombre] = 200
[Enero][Desconocido] = 0
[Enero][Mujer] = 3000
NOTES:
this solution does not address any sorting requirements OP may have for the output
for an additional sorting requirement I'd suggest OP first address the current issue and, once solved, then attempt to apply the additional sorting requirement and ...
if having problems with sorting, then ask a new question (making sure to include a complete list of months and genders and the desired sort order for both components)
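As for OP's question about generating HTML: a minimal, untested sketch (assuming the same raw.csv layout as above and GNU awk 4+ for the array-of-arrays; the column headings are an arbitrary choice) wraps the same aggregation in HTML table markup:

awk -F';' '
NR>1 { array[$2][$13]+=$19 }
END { print "<table border=\"1\">"
      print "  <tr><th>fecha</th><th>sexo</th><th>coste</th></tr>"
      for (month in array)
          for (gender in array[month])
              printf "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n", month, gender, array[month][gender]
      print "</table>"
}
' raw.csv > report.html

Redirecting stdout to report.html yields a file any browser can open; styling and sorting can be layered on afterwards.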
Some of the SQL logic is moving to the backend, and I need to generate a report using shell scripting.
For understanding, I'm making it simple as follows.
My input file - sales.txt (id, price, month)
101,50,2019-10
101,80,2020-08
101,80,2020-10
201,100,2020-09
201,350,2020-10
The output should be a 6-month window for each id, e.g. t1=2020-07 and t2=2020-12:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
For id 101, though there is no entry for 2020-07, it should take the value from the immediately preceding month that is available in the sales file.
So the price=50 from 2019-10 is used for 2020-07.
For 201, the first entry itself is from 2020-09, so 2020-08 and 2020-07 are not applicable.
Wherever there are gaps the immediate previous month value should be propagated.
I'm trying to use awk to solve this problem. I'm creating a reusable script util.awk like the one below
to generate the missing values, piping it to the sort command, and then using util.awk again for the final output.
util.awk
function get_month(a,b,t1) { return strftime("%Y%m",mktime(a " " b t1)) }
BEGIN { ss=" 0 0 0 "; ts1=" 1 " ss; ts2=" 35 " ss; OFS=","; x=1 }
{
    tsc=get_month($3,$4,ts1);
    if ( NR>1 && $1==idp )
    {
        if ( tsc == tsp ) { print $1,$2,get_month($3,$4,ts1); x=0 }
        else {
            for ( i=tsp; i < tsc; i=get_month(j1,j2,i) )
            {
                j1=substr(i,1,4); j2=substr(i,5,2);
                print $1,tpr,i;
            }
        }
    }
    tsp=get_month($3,$4,ts2);
    idp=$1;
    tpr=$2;
    if (x!=0) print $1,$2,tsc
    x=1;
}
But it runs infinitely: awk -F"[,-]" -f util.awk sales.txt
Though I tried awk, I welcome other answers as well that would work in a bash environment.
General plan:
assumption: sales.txt is already sorted (numerically) by the first column
user provides the min->max date range to be displayed (awk variables mindt and maxdt)
for a distinct id value we'll load all prices and dates into an array (prices[])
dates will be used as the indices of an associative array to store prices (prices[YYYY-MM])
once we've read all records for a given id ...
sort the prices[] array by the indices (ie, sort by YYYY-MM)
find the price for the max date less than mindt (save as prevprice)
for each date between mindt and maxdt (inclusive), if we have a price then display it (and save as prevprice) else ...
if we don't have a price but we do have a prevprice then use this prevprice as the current date's price (ie, fill the gap with the previous price)
One (GNU) awk idea:
mindate='2020-07'
maxdate='2020-12'
awk -v mindt="${mindate}" -v maxdt="${maxdate}" -v OFS=',' -F',' '
# function to add "months" (number) to "indate" (YYYY-MM)
function add_month(indate,months) {
dhms="1 0 0 0" # default day/hr/min/secs
split(indate,arr,"-")
yr=arr[1]
mn=arr[2]
return strftime("%Y-%m", mktime(arr[1]" "(arr[2]+months)" "dhms))
}
# function to print the list of prices for a given "id"
function print_id(id) {
if ( length(prices) == 0 ) # if prices array is empty then do nothing (ie, return)
return
PROCINFO["sorted_in"]="#ind_str_asc" # sort prices[] array by index in ascending order
for ( i in prices ) # loop through indices (YYYY-MM)
{ if ( i < mindt ) # as long as less than mindt
prevprice=prices[i] # save the price
else
break # no more pre-mindt indices to process
}
for ( i=mindt ; i<=maxdt ; i=add_month(i,1) ) # for our mindt - maxdt range
{ if ( !(i in prices) && prevprice ) # if no entry in prices[], but we have a prevprice, then ...
prices[i]=prevprice # set prices[] to prevprice (ie, fill the gap)
if ( i in prices ) # if we have an entry in prices[] then ...
{ prevprice=prices[i] # update prevprice (for filling future gap) and ...
print id,prices[i],i # print our data to stdout
}
}
}
BEGIN { split("",prices) } # pre-declare prices as an array
previd != $1 { print_id(previd) # when id changes print the prices[] array, then ...
previd=$1 # reset some variables for processing of the next id and ...
prevprice=""
delete prices # delete the prices[] array
}
{ prices[$3]=$2 } # for the current record create an entry in prices[]
END { print_id(previd) } # flush the last set of prices[] to stdout
' sales.txt
NOTE: This assumes sales.txt is sorted (numerically) by the first field; if this is not true then the last line should be changed to ' <(sort -n sales.txt)
This generates:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
I hope I understood your question correctly. The following awk should do the trick:
$ awk -v t1="2020-07" -v d="6" '
function next_month(d,a) {
    split(d,a,"-")
    if (a[2]==12) { a[1]++; a[2]=1 } else a[2]++
    return sprintf("%0.4d-%0.2d",a[1],a[2])
}
BEGIN{FS=OFS=",";t2=t1; for(i=1;i<=d;++i) t2=next_month(t2)}
{k[$1]}
($3<t1){a[$1,t1]=$2}
(t1 <= $3 && $3 < t2) { a[$1,$3]=$2 }
END{ for (key in k) {
         p=""; t=t1;
         for (i=1; i<=d; ++i) {
             if (p!="" || (key,t) in a) print key, ((key,t) in a ? p=a[key,t] : p), t
             t=next_month(t)
         }
     }
}' input.txt
We implemented a straightforward function next_month that computes the next month based on the format YYYY-MM. Based on the duration of d months, we compute in the BEGIN block the time-period that should be shown. The time-period of interest is t1 <= t < t2.
Every time we read a record/line, we keep track of the key that has been processed and store it in the array k. This way we know which keys have been seen up to this point.
For all times before the time-period of interest, we store the value in an array a with index (key,t1), while for all other times, we store the value in the array a with index (key,$3).
When the file is fully processed, we just cycle over all keys and print the output. We use a bit of logic to check whether or not the month was listed in the original file.
Note: the output will be per key sorted in time, but the key will not appear in the same order as in the original file.
My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
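For what it's worth, that command spread out with comments (a restatement only, behaviour unchanged; the cat is unnecessary since awk reads files itself):

awk 'BEGIN { OFS=FS="\t" }
{
    if (count[$3] > 1)           # third or later occurrence of this key: print the line
        print $0
    else if (count[$3] == 1) {   # second occurrence: print the saved first line, then this one
        print save[$3]
        print $0
    } else                       # first occurrence: remember the line in case a duplicate appears
        save[$3] = $0
    count[$3]++
}' inputfile.txt > outputfile.txt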
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would also want it to be able to match any other random pattern of differences, as long as there are no more than three, like this:
20?7G?N?RAL
These differences could be in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
You can use this script:
$ more hamming.awk
function hamming(x,y,xs,ys,min,max,h) {
    if (x==y) return 0;
    else {
        nx=split(x,xs,"");
        mx=split(y,ys,"");
        min=nx<mx?nx:mx;
        max=nx<mx?mx:nx;
        for (i=1; i<=min; i++) if (xs[i]!=ys[i]) h++;
        return h+(max-min);
    }
}
BEGIN {FS=OFS="\t"}
NR==FNR {
    if ($3 in a) nrs[NR];
    for (k in a)
        if (hamming(k,$3)<4) {
            nrs[NR];
            nrs[a[k]];
        }
    a[$3]=NR;
    next
}
FNR in nrs
Usage:
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that finds the Hamming distance (the one you described) between keys. Notice that it's an O(n^2) algorithm, so it may not be suitable for very large data sets. However, I'm not sure any other algorithm can do better.
NB: An additional note, based on a comment which I missed from the post. This algorithm compares the keys character by character, so displacements won't be identified. For example, 123 and 23 will give a distance of 3.
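That behaviour is easy to verify by pasting the same function into a one-off gawk call (a throwaway check, not part of the solution; note that split(x,xs,"") splitting into single characters is a gawk feature):

awk '
function hamming(x,y,xs,ys,min,max,h) {
    if (x==y) return 0
    nx=split(x,xs,""); mx=split(y,ys,"")
    min=nx<mx?nx:mx; max=nx<mx?mx:nx
    for (i=1; i<=min; i++) if (xs[i]!=ys[i]) h++
    return h+(max-min)
}
BEGIN { print hamming("123","23") }   # prints 3: positions 1 and 2 differ, plus one for the length difference
'

An edit-distance metric, as in the next answer, would report 1 for the same pair.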
Levenshtein distance, aka "edit distance", suits your task best. The Perl script below requires installing the module Text::Levenshtein (for Debian/Ubuntu do: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);

$maxdist = shift;
@ll = (<>);
@k = map {
    $k = (split /\t/, $_)[2];
    # $k =~ s/O/0/g;
    $k;
} @ll;
for ($i = 0; $i < @ll; ++$i) {
    for ($j = 0; $j < @ll; ++$j) {
        if ($i != $j and distance($k[$i], $k[$j]) < $maxdist) {
            print $ll[$i];
            last;
        }
    }
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in @karakfa's post, but matching is more flexible.
Also note the commented line # $k =~ s/O/0/g;. If you uncomment it, then all O's in the key will become 0's, which will fix keys damaged by the O->0 transformation. When working with damaged data I always use small rules like this to fix the data gradually, refining rules from run to run, to the point where the data is almost perfect and fuzzy matching is no longer needed.
I have input records of the form
2013-07-09T19:17Z,f1,f2
2013-07-09T03:17Z,f1,f2
2013-07-09T21:17Z,f1,f2
2013-07-09T16:17Z,f1,f2
2013-07-09T16:14Z,f1,f2
2013-07-09T16:16Z,f1,f2
2013-07-09T01:17Z,f1,f2
2013-07-09T16:18Z,f1,f2
These represent timestamps and events. I have written these by hand, but the actual data will be sorted by time.
I would like to generate a set of records which would be input to graph plotting function which needs continuous time series. I would like to fill in missing values, i.e. if there are entries for "2013-07-09T19:17Z" and "2013-07-09T19:19Z", I would like to generate entry for "2013-07-09T19:18Z" with predefined value.
My thoughts on doing this:
Use MIN and MAX to find the start and end date in the series
Write a UDF which takes min and max and returns a relation with the missing timestamps
Join the above 2 relations
I can't get my head around how to implement this in Pig, though. I would appreciate any help.
Thanks!
Generate another file using a script (outside Pig) with all timestamps between MIN and MAX, including MIN and MAX. Load this as a second data set. Here is a sample that I used from your data set. Please note I filled in only a few gaps, not all.
2013-07-09T01:17Z,d1,d2
2013-07-09T01:18Z,d1,d2
2013-07-09T03:17Z,d1,d2
2013-07-09T16:14Z,d1,d2
2013-07-09T16:15Z,d1,d2
2013-07-09T16:16Z,d1,d2
2013-07-09T16:17Z,d1,d2
2013-07-09T16:18Z,d1,d2
2013-07-09T19:17Z,d1,d2
2013-07-09T21:17Z,d1,d2
Do a COGROUP on the original dataset and the generated dataset above. Use a nested FOREACH ... GENERATE to write the output dataset: if the first dataset is empty, use the values from the second set to generate the output dataset, else use the first dataset. Here is the piece of code I used on these two datasets.
Org_Set = LOAD 'pigMissingData/timeSeries' USING PigStorage(',') AS (timeStamp, fl1, fl2);
Default_set = LOAD 'pigMissingData/timeSeriesFull' USING PigStorage(',') AS (timeStamp, fl1, fl2);
coGrouped = COGROUP Org_Set BY timeStamp, Default_set BY timeStamp;
Filled_Data_set = FOREACH coGrouped {
    x = COUNT(Org_Set);
    y = (x == 0 ? (Default_set.fl1, Default_set.fl2) : (Org_Set.fl1, Org_Set.fl2));
    GENERATE FLATTEN(group), FLATTEN(y.$0), FLATTEN(y.$1);
};
If you need further clarification or help, let me know.
In addition to @Rags' answer, you could use the STREAM x THROUGH command and a simple awk script (similar to this one) to generate the date range once you have the min and max dates. Something similar to the following (untested! - you might need to put the awk script on a single line with semicolons as command delimiters, or better, ship it as a script file):
grunt> describe bounds;
(min:chararray, max:chararray)
grunt> dump bounds;
(2013/01/01,2013/01/04)
grunt> fullDateBounds = STREAM bounds THROUGH `gawk '{
    split($1,s,"/")
    split($2,e,"/")
    st=mktime(s[1] " " s[2] " " s[3] " 0 0 0")
    et=mktime(e[1] " " e[2] " " e[3] " 0 0 0")
    for (i=st; i<=et; i+=60*60*24) print strftime("%Y/%m/%d",i)  # step one day in seconds
}'`;
I am looking to see if there is an "easy" or simple way to make an array of something, let's say ice creams. This would be a class of ice cream with various attributes (ID, flavour, size, scoops). I would like to build an array that gathers every ice cream ordered and then search through this list for any duplicate values (2+ the same size).
The first idea I had was a for loop that creates the array, then grabs the ice cream ID for the first instance and checks its "flavour" against the array. If no duplicate is found, the ID is increased by 1 (ID++) and then that ice cream's flavour is run against the array; if a match is found, I would set a boolean to true.
Every approach I take seems rather long-winded, and I haven't got one working yet. I'm hoping some fresh/more experienced eyes can help with this.
In answer to the comments below:
The XML would hold something like below
<iceCream id="1">
  <flavour>chocolate</flavour>
  <scoops>5</scoops>
</iceCream>
<iceCream id="2">
  <flavour>banana</flavour>
  <scoops>2</scoops>
</iceCream>
I would want to use Drools (probably an ArrayList?) to gather each iceCream tag and let me check whether any of the ice creams have the same flavour, outputting something (setting a boolean to true) if a match is found. My understanding was to make an array, then run each ice cream through the array by using its ID to identify it, doing ID + 1 inside each loop (int ID = 1, then ID++ in the loop), as well as searching through the flavour child tag.
int ID = 0;
boolean match = false;
ArrayList iceCreams = new ArrayList($cont.getIceCreams());
for (iceCream $Flavour : (ArrayList<iceCream>) iceCreams)
{
    ID++;
    if ($Flavour.getFlavour().equals(icecream with id of (ID variable).getFlavour))
    {
        match = true;
    }
}
if (match)
{ etc etc etc }
Something along these lines if this helps?
1) If you have control over the initial array creation, why don't you make sure that, during insertion, you insert only the ice creams that are unique? So, while you are inserting into the array, say with ID=1, first iterate through the array and check if there is an ice cream in the array with ID 1; if not, put this one into the array and do other stuff.
2) Searching part: while inserting, make sure that you are doing so in ascending order of IDs, so you can perform a binary search later.
Note: I don't know Drools; I have just posted the logic as per my understanding of the problem.
I don't know Drools either, but I'll post some pseudocode for what I think you are trying to accomplish:
for (i = 0; i < len(ice_cream_array); i++)
{
    for (j = (i + 1); j < len(ice_cream_array); j++)
    {
        if (ice_cream_array[i] == ice_cream_array[j])
            break from inner loop
        else
            there is no match
    }
}
You may also want to look up bubble sorts and binary searches.
I have an associative array in awk that gets populated like this:
chr_count[$3]++
When I try to print my chr_counts, I use this:
for (i in chr_count) {
print i,":",chr_count[i];
}
But not surprisingly, the order of i is not sorted in any way.
Is there an easy way to iterate over the sorted keys of chr_count?
Instead of asort, use asorti(source, destination) which sorts the indices into a new array and you won't have to copy the array.
Then you can use the destination array as pointers into the source array.
For your example, you would use it like this:
n=asorti(chr_count, sorted)
for (i=1; i<=n; i++) {
print sorted[i] " : " chr_count[sorted[i]]
}
You can use the sort command, e.g.
for ( i in data )
print i ":", data[i] | "sort"
I recently came across this issue and found that with gawk I could set the value of PROCINFO["sorted_in"] to control iteration order. I found a list of valid values for this by searching for PROCINFO online and landed on this GNU Awk User's Guide page: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
This lists options of the form @{ind|val}_{num|type|str}_{asc|desc} with:
ind sorting by key (index) and val sorting by value.
num sorting numerically, str by string and type by assigned type.
asc for ascending order and desc for descending order.
I simply used:
PROCINFO["sorted_in"] = "#val_num_desc"
for (i in map) print i, map[i]
And the output was sorted in descending order of values.
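Tying this back to the question's chr_count, a minimal gawk-only sketch (PROCINFO["sorted_in"] is a gawk feature, and "file" here is a stand-in for the real input) that iterates the keys in ascending order:

gawk '
{ chr_count[$3]++ }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate keys in ascending string order
    for (i in chr_count)
        print i, ":", chr_count[i]
}
' file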
Note that asort() and asorti() are specific to gawk, and are unknown to awk. For plain awk, you can roll your own sort() or get one from elsewhere.
This is taken directly from the documentation:
# populate the array data
# copy indices
j = 1
for (i in data) {
    ind[j] = i    # index value becomes element value
    j++
}
n = asort(ind)    # index values are now sorted
for (i = 1; i <= n; i++) {
    # do something with ind[i]          (work with sorted indices directly)
    ...
    # do something with data[ind[i]]    (access original array via sorted indices)
}