Bash - How to combine duplicate rows and sum the values of a CSV file? [duplicate]

I have a CSV file with a heading row and multiple data rows each with 11 data columns like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Mildred#email.com,205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Misty#email.com,205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to remove the duplicates in that file and sum the values in the Quantity data column. I want the result to be like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to sum only the values in the fifth data column Quantity while leaving the rest as it is. I have tried the solution in Sum duplicate row values with awk, but the answer there works only if the file has only two data columns. My CSV file has 11 data columns and so it doesn't work.
How to do it with awk?

awk to the rescue!
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5; $5="~"; a[$0]+=q}
END {for(k in a) {sub("~",a[k],k); print k}}' file
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
Note that the order of the output records is not guaranteed; on the other hand, the input doesn't need to be sorted either. There are multiple ways to preserve the order...
Also, I use ~ as a placeholder. If your data can include this character, replace it with one that doesn't appear.
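For instance, a control character is usually a safe choice; here is the same logic with \001 as the placeholder (a sketch; that \001 never occurs in your data is an assumption, pick any byte that can't appear):
$ awk 'BEGIN{FS=OFS=","; ph="\001"}          # ph: placeholder assumed absent from the data
NR==1{print; next}
{q=$5; $5=ph; a[$0]+=q}                      # same dedupe-and-sum logic as above
END {for(k in a) {sub(ph,a[k],k); print k}}' file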
UPDATE
To preserve the order (based on first appearance of a row)
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5;$5="~"; if(!($0 in a)) b[++c]=$0; a[$0]+=q}
END {for(k=1;k<=c;k++) {sub("~",a[b[k]],b[k]); print b[k]}}' file
The idea is to keep a separate structure that records the order of the rows and to iterate over that structure at the end.
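Running this on the sample input prints the rows in their first-appearance order, matching the desired output:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address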

Taking a direct adaptation of Karafka's solution and adding a bit of code to it to get the lines in the proper order (the one in which they are present in Input_file), as per the OP's request.
awk -F, '
FNR==1{
print;
next}
{
val=$5;
$5="~";
a[$0]+=val
}
!b[$0]++{
c[++count]=$0}
END{
for(i=1;i<=count;i++){
sub("~",a[c[i]],c[i]);
print c[i]}
}' OFS=, Input_file
Explanation: now also adding an explanation for the above code.
awk -F, ' ##Setting the field separator as comma here.
FNR==1{ ##Checking the condition: if the line number is 1, then do the following.
print; ##Printing the current line.
next} ##next will skip all further statements from here.
{
val=$5; ##Creating a variable named val whose value is the 5th field of the current line.
$5="~"; ##Setting the value of the 5th field to ~ so that duplicate lines become identical (this creates the index for array a).
a[$0]+=val ##Creating an array named a whose index is the current line and whose value accumulates val.
}
!b[$0]++{ ##Checking whether array b has no entry yet for the current line (i.e. this is its first occurrence); if so, do the following.
c[++count]=$0} ##Creating an array named c whose index is the count variable, increased by 1 each time, and whose value is the current line.
END{ ##Starting the END block of the awk code here.
for(i=1;i<=count;i++){ ##Starting a for loop that runs from 1 to the value of the count variable.
sub("~",a[c[i]],c[i]); ##Substituting the ~ in array c's value (the stored line) with the summed $5.
print c[i]} ##Printing array c's value, in which $5 has now been replaced with its summed total.
}' OFS=, Input_file ##Setting OFS as comma and naming the Input_file here too.

This solution uses an extra array to guarantee the output is deduplicated and kept in the original input order:
the 1st array tracks input row order,
the 2nd one both de-dupes and sums up $5
% cat testfile.txt
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Mildred#email.com,205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Misty#email.com,205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
% < testfile.txt mawk 'BEGIN {
print $( (FS=OFS=",")*(getline))\
($(!(__=(_+=(_=_~_)+_)+--_-(_="")))=_)
} (______=($+__)($__=_=""))==(___[_=$_]+=______) {
____[++_____]=_
} END {
for(_^=!__;_<=_____;_++) {
print $(($__=___[$!_=____[_]])<-_) } }'
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
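If the obfuscated style is hard to follow, a plain sketch of the same dedupe-and-sum-in-input-order behavior could look like this (it assumes the Quantity field is never legitimately empty, so the row with $5 blanked out can serve as the grouping key, avoiding a placeholder character entirely):
awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5; $5=""; if(!($0 in sum)) order[++n]=$0; sum[$0]+=q}   # key: the row with $5 blanked
END {for(i=1;i<=n;i++) {row=order[i]; $0=row; $5=sum[row]; print}}' testfile.txt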

Related

Number of logins on Linux using Shell script and AWK

How can I get the number of logins on each day from the beginning of the wtmp file using AWK?
I thought about using an associative array, but I don't know how to implement it in AWK.
myscript.sh
#!/bin/bash
awk 'BEGIN{numberoflogins=0}
#code goes here'
The output of the last command:
[fnorbert#localhost Documents]$ last
fnorbert tty2 /dev/tty2 Mon Apr 24 13:25 still logged in
reboot system boot 4.8.6-300.fc25.x Mon Apr 24 16:25 still running
reboot system boot 4.8.6-300.fc25.x Mon Apr 24 13:42 still running
fnorbert tty2 /dev/tty2 Fri Apr 21 16:14 - 21:56 (05:42)
reboot system boot 4.8.6-300.fc25.x Fri Apr 21 19:13 - 21:56 (02:43)
fnorbert tty2 /dev/tty2 Tue Apr 4 08:31 - 10:02 (01:30)
reboot system boot 4.8.6-300.fc25.x Tue Apr 4 10:30 - 10:02 (00:-27)
fnorbert tty2 /dev/tty2 Tue Apr 4 08:14 - 08:26 (00:11)
reboot system boot 4.8.6-300.fc25.x Tue Apr 4 10:13 - 08:26 (-1:-47)
wtmp begins Mon Mar 6 09:39:43 2017
The shell script's output should be the following, using an associative array if possible:
Apr 4: 4
Apr 21: 2
Apr 24: 3
In awk, arrays can be indexed by strings or numbers, so you can use them as associative arrays.
However, what you're asking will be hard to do reliably with awk, because the delimiters are whitespace: empty fields will throw off the column positions, and if you use FIELDWIDTHS you'll also be thrown off by columns longer than their assigned width.
If all you're looking for is just the number of logins per day you might want to use a combination of sed and awk (and sort):
last | \
sed -E 's/^.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ([ 0-9]{2}).*$/\2 \3/p;d' | \
awk '{arr[$0]++} END { for (a in arr) print a": " arr[a]}' | \
sort -M
The sed -E uses extended regular expressions, and the pattern prints just the date of each line emitted by last (it matches on the day of week, but only prints the month and day).
We could have used uniq -c to get the counts, but using awk we can do an associative array as you hinted.
Finally using sort -M we're sorting on the abbreviated date formats like Apr 24, Mar 16, etc.
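For comparison, the uniq -c variant mentioned above could look like this (a sketch; the sed stage is unchanged, and the trailing awk merely reshuffles uniq's count-first output into the requested format):
last | \
sed -E 's/^.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ([ 0-9]{2}).*$/\2 \3/p;d' | \
sort -M | uniq -c | \
awk '{print $2" "$3": "$1}'   # turns "   3 Apr 24" into "Apr 24: 3"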
Try the following awk script (assuming that all entries fall in the same month, i.e. the current month):
myscript.awk:
#!/bin/awk -f
{
a[NR]=$0; # saving each line into an array indexed by line number
}
END {
for (i=NR-1;i>1;i--) { # iterating over the lines in reverse order (skipping the first and last lines)
if (match(a[i],/[A-Z][a-z]{2} ([A-Z][a-z]{2}) *([0-9]{1,2}) [0-9]{2}:[0-9]{2}/, b)) { # three-argument match() is a gawk extension
m=b[1]; # saving the month name
c[b[2]]++; # accumulating the number of occurrences per day
}
}
for (i in c) print m,i": "c[i]
}
Usage:
last | awk -f myscript.awk
The output:
Apr 4: 4
Apr 21: 2
Apr 24: 3
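Note that the three-argument match() used above is a GNU awk extension. A portable sketch can instead scan each line for a weekday token (the field positions differ between user and reboot entries, which is why a fixed field number won't do); it assumes last output shaped like the sample and skips the trailing wtmp begins line:
last | awk '
/^wtmp begins/ || NF==0 {next}              # skip the summary line and blanks
{
for (i=1; i<NF-1; i++)                      # locate the weekday token
if ($i ~ /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)$/) {
c[$(i+1)" "$(i+2)]++                        # count per "Month Day"
break
}
}
END {for (d in c) print d": "c[d]}'
The for (d in c) traversal order is unspecified, so pipe the result through sort -M if you need calendar order.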

How to get the data from log file with different months in unix?

I want to get the data between two times from a log file that spans different months and dates. If my start time is not present in the log file, I want to extract the data from the nearest later time that is in the log file. Likewise, the output has to end before the end time if the entered end time is not present in the log file.
My log file data,
Apr 10 16 02:07:20 Data 1
Apr 11 16 02:07:20 Data 1
May 10 16 04:11:09 Data 2
May 12 16 04:11:09 Data 2
Jun 11 16 06:22:35 Data 3
Jun 12 16 06:22:35 Data 3
The solution I am using is:
awk -v start="$StartTime" -v stop="$EndTime" 'start <= $StartTime && $EndTime <= stop' $file
where I am storing my start time in $StartTime and end time in $EndTime, but I am not getting the correct output. Please help.
Something like this maybe:
$ BashVarStart="16 05 10 00 00 00" # the same format that awk function will reformat to
$ BashVarStop="16 06 11 00 00 00"
$ awk -v start="$BashVarStart" -v stop="$BashVarStop" -F"[ :]" -v OFS=\ '
function reformatdate(m,d,y,h,mm,s) { # basically throw year to the beginning
monstr="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"; # numerize the months
split(monstr,monarr," "); # split monstr to an array to enumerate the months
# monarr[1]="Jan", monarr[2]="Feb" etc
for(i in monarr) { # iterate over all month numbers in monarr index
if(monarr[i]==m) # when month number matches
m=sprintf("%02d",i) # zeropad if month number below 10: 9 -> 09
};
return y" "m" "d" "h" "mm" "s # return in different order
}
start < reformatdate($1,$2,$3,$4,$5,$6) && stop > reformatdate($1,$2,$3,$4,$5,$6)
' test.in
May 10 16 04:11:09 Data 2
May 12 16 04:11:09 Data 2
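A common shorthand for the month lookup is index() arithmetic on a fixed month string instead of the split() loop; a condensed sketch of the same filter (it assumes two-digit days, as in the sample log):
awk -v start="16 05 10 00 00 00" -v stop="16 06 11 00 00 00" -F"[ :]" '
{
# Jan starts at offset 1, Feb at 4, ...: (offset+2)/3 yields the month number
m = sprintf("%02d", (index("JanFebMarAprMayJunJulAugSepOctNovDec", $1) + 2) / 3)
key = $3" "m" "$2" "$4" "$5" "$6   # year month day hh mm ss
}
start < key && key < stop
' test.in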

Awk sum rows in csv file based on value of three columns

I am using this awk to process CSV files:
awk 'BEGIN {FS=OFS=";"} (NR==1) {$9="TpmC"; print $0} (NR>1 && NF) {a=$2$5; sum6[a]+=$6; sum7[a]+=$7; sum8[a]+=$8; other[a]=$0} END
{for(i in sum7) {$0=other[i]; $6=sum6[i]; $7=sum7[i]; $8=sum8[i];
$9=(sum8[i]?sum8[i]/sum6[i]:"NaN"); print}}' input.csv > output.csv
It sums the rows in columns 6, 7 and 8 and then divides sum8 by sum6, all for rows with the same values in columns 2 and 5.
I have two questions about it.
1) I need the same functionality, but all calculations must be done for rows with the same values in columns 2, 3 and 5. I have tried to replace
a=$2$5;
with
b=$2$3; a=$b$5;
but it's giving me wrong numbers.
2) How can I delete all rows with the value:
Date;DBMS;Mode;Test type;W;time;TotalTPCC;NewOrder Tpm
except the first row?
Here is an example of input.csv:
Date;DBMS;Mode;Test type;W;time;TotalTPCC;NewOrder Tpm
Tue Jun 16 21:08:33 CEST 2015;sqlite;in-memory;TPC-C test;1;10;83970;35975
Tue Jun 16 21:18:43 CEST 2015;sqlite;in-memory;TPC-C test;1;10;83470;35790
Date;DBMS;Mode;Test type;W;time;TotalTPCC;NewOrder Tpm
Tue Jun 16 23:35:35 CEST 2015;hsql;in-memory;TPC-C test;1;10;337120;144526
Tue Jun 16 23:45:44 CEST 2015;hsql;in-memory;TPC-C test;1;10;310230;133271
Thu Jun 18 00:10:45 CEST 2015;derby;on-disk;TPC-C test;5;120;64720;27964
Thu Jun 18 02:41:27 CEST 2015;sqlite;on-disk;TPC-C test;1;120;60030;25705
Thu Jun 18 04:42:14 CEST 2015;hsql;on-disk;TPC-C test;1;120;360900;154828
output.csv should be
Date;DBMS;Mode;Test type;W;time;TotalTPCC;NewOrder Tpm;TpmC
Tue Jun 16 21:08:33 CEST 2015;sqlite;in-memory;TPC-C test;1;20;167440;71765;3588.25
Tue Jun 16 23:35:35 CEST 2015;hsql;in-memory;TPC-C test;1;20;647350;277797;13889.85
Thu Jun 18 00:10:45 CEST 2015;derby;on-disk;TPC-C test;5;120;64720;27964;233.03
Thu Jun 18 02:41:27 CEST 2015;sqlite;on-disk;TPC-C test;1;120;60030;25705;214.20
Thu Jun 18 04:42:14 CEST 2015;hsql;on-disk;TPC-C test;1;120;360900;154828;1290.23
To group by columns 2, 3, and 5, use a=$2$3$5. To delete the extra header rows, add the match condition ($1 !~ /^Date/).
So the whole awk script becomes:
BEGIN {
FS=OFS=";"
}
(NR==1) {$9="TpmC"; print $0}
(NR>1 && NF && ($1 !~ /^Date/)) {
a=$2$3$5; sum6[a]+=$6; sum7[a]+=$7; sum8[a]+=$8; other[a]=$0
}
END {
for(i in sum7) {
$0=other[i]; $6=sum6[i]; $7=sum7[i]; $8=sum8[i]; $9=(sum8[i]?sum8[i]/sum6[i]:"NaN"); print
}
}
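One caveat: plain concatenation like a=$2$3$5 can produce the same key for different field combinations (e.g. "ab"+"c" and "a"+"bc" both yield "abc"). Joining the fields with awk's built-in SUBSEP separator avoids such collisions:
a=$2 SUBSEP $3 SUBSEP $5; sum6[a]+=$6; sum7[a]+=$7; sum8[a]+=$8; other[a]=$0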

Search a string between two timestamps starting from bottom of log file

I was trying to search for the string 'Cannot proceed: the database is empty' in the file out.log, scanning from bottom to top only (the log file is quite huge and every day new entries are appended at the end), and only between the timestamps yesterday 10:30 pm and today 00:30 am.
Extract from out.log is as below:
[Thu Jun 5 07:56:17 2014]Local/data///47480280486528/Info(1019022)
Writing Database Mapping For [data]
[Thu Jun 5 07:56:18 2014]Local/data///47480280486528/Info(1250008)
Setting Outline Paging Cachesize To [8192KB]
[Thu Jun 5 07:56:18 2014]Local/data///47480280486528/Info(1013202)
Cannot proceed: the database is empty
[Thu Jun 5 07:56:20 2014]Local/data///47480280486528/Info(1013205)
Received Command [Load Database]
[Thu Jun 5 07:56:21 2014]Local/data///47480280486528/Info(1019018)
Writing Parameters For Database
I searched on Google and SO and explored commands like sed and grep, but unfortunately it seems that grep doesn't parse timestamps and sed prints all lines between two patterns.
Can anybody please let me know how I can achieve this?
You can make use of date comparison in awk:
tac file | awk '/Cannot proceed: the database is empty/ {f=$0; next} f{if (($3==5 && $4>"22:30:00") || ($3==6 && $4<="00:30:00")) {print; print f} f=""}'
Test
For this given file:
$ cat a
[Thu Jun 5 07:56:17 2014]Local/data///47480280486528/Info(1019022)
Writing Database Mapping For [data]
[Thu Jun 5 07:56:18 2014]Local/data///47480280486528/Info(1250008)
Setting Outline Paging Cachesize To [8192KB]
[Thu Jun 5 07:56:18 2014]Local/data///47480280486528/Info(1013202)
Cannot proceed: the database is empty
[Thu Jun 5 07:56:20 2014]Local/data///47480280486528/Info(1013205)
Received Command [Load Database]
[Thu Jun 5 07:56:21 2014]Local/data///47480280486528/Info(1019018)
Writing Parameters For Database
[Thu Jun 5 23:56:20 2014]Local/data///47480280486528/Info(1013205)
Writing Parameters For Database
[Thu Jun 5 23:56:20 2014]Local/data///47480280486528/Info(1013205)
Cannot proceed: the database is empty
[Thu Jun 5 22:56:21 2014]Local/data///47480280486528/Info(1019018)
Cannot proceed: the database is empty
It returns:
$ tac a | awk '/Cannot proceed: the database is empty/ {f=$0; next} f{if (($3==5 && $4>"22:30:00") || ($3==6 && $4<="00:30:00")) {print; print f} f=""}'
[Thu Jun 5 22:56:21 2014]Local/data///47480280486528/Info(1019018)
Cannot proceed: the database is empty
[Thu Jun 5 23:56:20 2014]Local/data///47480280486528/Info(1013205)
Cannot proceed: the database is empty
You can get the line with this:
awk '/Cannot proceed: the database is empty/{ts = last; msg = $0; next}; {last = $0}; END{if (ts) printf "%s\n%s\n", ts, msg}' log
Output:
[Thu Jun 5 07:56:18 2014]Local/data///47480280486528/Info(1013202)
Cannot proceed: the database is empty
It should be easy to refine the code depending on which part is really needed.
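Since the asker notes the log is huge and only ever appended at the end, a bottom-up scan can also stop reading as soon as it scrolls past the start of the window. A sketch combining tac with an early exit, hard-coding the same Jun 5 22:30 / Jun 6 00:30 window as the first answer:
tac out.log | awk '
/Cannot proceed: the database is empty/ {f=$0; next}
/^\[/ {
if (f && (($3==5 && $4>"22:30:00") || ($3==6 && $4<="00:30:00"))) {print; print f}
f=""
if ($3<5 || ($3==5 && $4<"22:30:00")) exit   # reading bottom-up: everything beyond here is older than the window
}'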

using awk to do exact match in a file

I'm just wondering how we can use awk to do exact matches.
For example:
$ cal 09 09 2009
   September 2009
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30
$ cal 09 09 2009 | awk '{day="9"; col=index($0,day); print col }'
17
0
0
11
20
0
8
0
As you can see, the above command outputs the index number for every line that contains the string/number "9". Is there a way to make awk output the index number in only the 4th line of the cal output above, or maybe an even more elegant solution?
I'm using awk to get the day name from the cal command. Here's the whole line of code:
$ dayOfWeek=$(cal $day $month $year | awk '{day='$day'; split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array); column=index($o,day); dow=int((column+2)/3); print array[dow]}')
The problem with the above code is that if multiple matches are found, I get multiple results, whereas I want it to output only one result.
Thanks!
Limit the check to only those lines which have your "day" standing alone as a word:
awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Proof of Concept
$ cal 02 1956
   February 1956
Su Mo Tu We Th Fr Sa
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Saturday
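Note that \< and \> are GNU regex word-boundary operators, so the one-liner above may not work in mawk or BSD awk. Since the field loop already performs an exact comparison, a portable sketch can simply drop the regex pre-filter:
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} {for(i=1;i<=NF;i++) if($i == day) print array[i]}'
Saturday
Like the original, though, this maps the field position to a weekday, so it misreports days in a partial first week (in February 1956 the 3rd sits in field 3 but falls on a Friday); the date approach in the update below doesn't have that problem.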
Update
If all you are looking for is to get the day of the week from a certain date, you should really be using the date command like so:
$ day=9;month=9;year=2009;
$ dayOfWeek=$(date +%A -d "$day/$month/$year")
$ echo $dayOfWeek
Wednesday
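One caveat: GNU date parses slash-separated strings as month/day/year, so "$day/$month/$year" only happens to work here because day and month are both 9; with, say, day=18 it would be rejected. The unambiguous ISO form sidesteps this:
$ dayOfWeek=$(date +%A -d "$year-$month-$day")   # ISO YYYY-MM-DD is parsed unambiguously
$ echo $dayOfWeek
Wednesday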
you wrote
cal 09 09 2009
I'm not aware of a version of cal that accepts day of month as an input,
only
cal ${mon} (optional) ${year} (optional)
But, that doesn't affect your main issue.
you wrote
is there a way to make awk output the index number in only the 4th line of the cal output above?
NR (Num Rec) is your friend
and there are numerous ways to use it.
cal 09 09 2009 | awk 'NR==4{day="9"; col=index($0,day); print col }'
OR
cal 09 09 2009 | awk '{day="9"; if (NR==4) {col=index($0,day); print col } }'
ALSO
In awk, if you have variable assignments that should be used throughout your whole program, it is better to put them in the BEGIN section so that the assignment is only performed once. Not a big deal in your example, but why pick up bad habits ;-)?
HENCE
cal 09 2009 | awk 'BEGIN{day="9"}; NR==4 {col=index($0,day); print col }'
FINALLY
It is not completely clear what problem you are trying to solve. Are you sure you always want to grab line 4? If not, then how do you propose to solve that?
Problems stated as " 1. I am trying to do X. 2. Here is my input. 3. Here is my output. 4. Here is the code that generated that output" are much easier to respond to.
It looks like you're trying to do date calculations. You can get much more robust and general solutions by using the GNU date command. I have seen numerous useful discussions of this tagged as bash, shell, (date?).
I hope this helps.
This is so much easier to do in a language that has time functionality built-in. Tcl is great for that, but many other languages are too:
$ echo 'puts [clock format [clock scan 9/9/2009] -format %a]' | tclsh
Wed
If you want awk to only output for line 4, restrict the rule to line 4:
$ awk 'NR == 4 { ... }'
