How to group records into buckets based on the timestamp? - shell

I have a list of entries from the logs:
15:38:52.363 1031
15:41:06.347 1259
15:41:06.597 1171
15:48:44.115 1588
15:48:44.125 1366
15:48:44.125 1132
15:53:14.525 1348
15:53:15.121 1553
15:53:15.181 1286
15:53:15.187 1293
The first field is the timestamp, the second is the value.
Now I'm trying to group them by an interval of, say, 20 seconds, and either sum the values or get their average. What's the easiest way to do this? Preferably something I can do with a simple shell script, so I can pipe my grep output into it and get a bucketed list. Thanks!

This gawk script completely ignores fractional seconds. It also knows nothing about spanning from one day to the next (crossing 00:00:00):
grep ... | awk -v interval=20 'function groupout() {print "----", "Timespan ending:", strftime("%T", prevtime), "Sum:", sum, "Avg:", sum/count, "----"} BEGIN {prevtime = 0} {split($1, a, "[:.]"); time = mktime(strftime("%Y %m %d") " " a[1] " " a[2] " " a[3]); if (time > prevtime + interval) {if (NR != 1) {groupout(); sum=0; count=0}}; print; sum+=$2; count++; prevtime = time} END {groupout()}'
Output:
15:38:52.363 1031
---- Timespan ending: 15:38:52 Sum: 1031 Avg: 1031 ----
15:41:06.347 1259
15:41:06.597 1171
---- Timespan ending: 15:41:06 Sum: 2430 Avg: 1215 ----
15:48:44.115 1588
15:48:44.125 1366
15:48:44.125 1132
---- Timespan ending: 15:48:44 Sum: 4086 Avg: 1362 ----
15:53:14.525 1348
15:53:15.121 1553
15:53:15.181 1286
15:53:15.187 1293
---- Timespan ending: 15:53:15 Sum: 5480 Avg: 1370 ----
Here it is again more readably:
awk -v interval=20 '
function groupout() {
    print "----", "Timespan ending:", strftime("%T", prevtime), "Sum:", sum, "Avg:", sum/count, "----"
}
BEGIN {
    prevtime = 0
}
{
    split($1, a, "[:.]")
    time = mktime(strftime("%Y %m %d") " " a[1] " " a[2] " " a[3])
    if (time > prevtime + interval) {
        if (NR != 1) { groupout(); sum = 0; count = 0 }
    }
    print
    sum += $2
    count++
    prevtime = time
}
END { groupout() }'
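If you would rather have fixed wall-clock buckets (00-19 s, 20-39 s, ...) than groups that keep growing as long as consecutive lines stay within the interval, a minimal sketch along the same lines could key each line on int(seconds/interval). This is my own variant, with the same single-day assumption; it only needs plain awk, since it avoids mktime()/strftime():
grep ... | awk -v interval=20 '
function bucketout() {
    printf "---- Bucket starting: %02d:%02d:%02d Sum: %d Avg: %g ----\n",
           int(b/3600), int(b%3600/60), int(b%60), sum, sum/count
}
{
    split($1, a, "[:.]")
    secs = a[1]*3600 + a[2]*60 + a[3]   # seconds since midnight, fraction ignored
    bucket = int(secs / interval)
    if (NR > 1 && bucket != prev) { bucketout(); sum = 0; count = 0 }
    prev = bucket
    b = bucket * interval               # start of the current bucket, for the report line
    print
    sum += $2
    count++
}
END { bucketout() }'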

Related

Unix converting time format to integer value

I have the following text file.
Account1,2h 01m 00s
Account2,4h 25m 23s
Account3,5h 43m 59s
I wish to combine the hours, minutes and seconds values in order to get a total number of minutes for each account.
Account1 minute total = 121
Account2 minute total = 265
Account3 minute total = 343
I have the following bash file
cat data.txt | cut -f2 -d','
This isolates the time values; however, from here I don't know what steps I would take to isolate the time, convert it to integers and then convert it to minutes. I have tried using a PARAM but to no avail.
If awk is an option, you can try this
awk -F"[, ]" '{h=60; m=1; s=0.01666667}{split($2,a,/h/); split($3,b,/m/); split($4,c,/s/); print$1, "minute total = " int(a[1] * h + b[1] * m + c[1] * s)}' input_file
$ cat awk.script
BEGIN {
FS=",| "
} {
h=60
m=1
s=0.01666667
}{
split($2,a,/h/)
split($3,b,/m/)
split($4,c,/s/)
print $1, "minute total = " int(a[1] * h + b[1] * m + c[1] * s)
}
Output
awk -f awk.script input_file
Account1 minute total = 121
Account2 minute total = 265
Account3 minute total = 343
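As a side note, the 0.01666667 constant is just a rounded 1/60; an equivalent sketch that divides the seconds by 60 instead of multiplying by the rounded constant gives the same totals for this data:
awk -F"[, ]" '{split($2,a,/h/); split($3,b,/m/); split($4,c,/s/); print $1, "minute total = " int(a[1]*60 + b[1] + c[1]/60)}' input_file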

sum by year and insert missing entries with 0

I have a report for year-month entries like below
201703 5
201708 10
201709 20
201710 40
201711 80
201712 100
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201902 10
I need to sum the year-month entries by year and print after all the months for that particular year. The year-month can have missing entries for any month(s).
For those months, a dummy value (0) should be inserted.
Required output:
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
I can get the summary of year by using below command.
awk ' { c=substr($1,0,4); if(c!=p) { print p,s ;s=0} s=s+$2 ; p=c ; print } ' ym.dat
But how do I insert entries for the missing months?
Also, the last entry should not exceed the current (system time) year-month; i.e. for this specific example, dummy values should not be inserted for 201904, 201905, etc. It should just stop at 201903.
You may use this awk script mmyy.awk:
{
    rec[$1] = $2
    yy = substr($1, 1, 4)
    mm = substr($1, 5, 2) + 0
    ys[yy] += $2
}
NR == 1 {
    fm = mm
    fy = yy
}
END {
    for (y = fy; y <= cy; y++)
        for (m = 1; m <= 12; m++) {
            # print the previous year's sum
            if (m == 1 && y-1 in ys)
                print y-1, ys[y-1]
            if (y == fy && m < fm)
                continue
            else if (y == cy && m > cm)
                break
            # print the year-month with its value, or 0 if the entry is missing
            k = sprintf("%d%02d", y, m)
            printf "%d%02d %d\n", y, m, (k in rec ? rec[k] : 0)
        }
    print y-1, ys[y-1]
}
Then call it as:
awk -v cy=$(date '+%Y') -v cm=$(date '+%m') -f mmyy.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With GNU awk for strftime():
$ cat tst.awk
NR==1 {
    begDate = $1
    endDate = strftime("%Y%m")
}
{
    val[$1] = $NF
    year = substr($1,1,4)
}
year != prevYear { prt(); prevYear = year }
END { prt() }
function prt(   mth, sum, date) {
    if (prevYear != "") {
        for (mth=1; mth<=12; mth++) {
            date = sprintf("%04d%02d", prevYear, mth)
            if ( (date >= begDate) && (date <= endDate) ) {
                print date, val[date]+0
                sum += val[date]
                delete val[date]
            }
        }
        print prevYear, sum+0
    }
}
$ awk -f tst.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With other awks you'd just pass in endDate using awk -v endDate=$(date +'%Y%m') '...'
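For example, a sketch of what that portable invocation might look like (the same script body, with the NR==1 block reduced to just begDate = $1, since strftime() was its only gawk-ism):
awk -v endDate=$(date +'%Y%m') '
NR==1 { begDate = $1 }
{
    val[$1] = $NF
    year = substr($1,1,4)
}
year != prevYear { prt(); prevYear = year }
END { prt() }
function prt(   mth, sum, date) {
    if (prevYear != "") {
        for (mth=1; mth<=12; mth++) {
            date = sprintf("%04d%02d", prevYear, mth)
            if ( (date >= begDate) && (date <= endDate) ) {
                print date, val[date]+0
                sum += val[date]
                delete val[date]
            }
        }
        print prevYear, sum+0
    }
}' file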
Perl to the rescue!
perl -lane '$start ||= $F[0];
$Y{substr $F[0], 0, 4} += $F[1];
$YM{$F[0]} = $F[1];
END { for $y (sort keys %Y) {
for $m (1 .. 12) {
$m = sprintf "%02d", $m;
next if "$y$m" lt $start;
print "$y$m ", $YM{$y . $m} || 0;
last if $y == 1900 + (localtime)[5]
&& (localtime)[4] < $m;
}
print "$y ", $Y{$y} || 0;
}
}' -- file
-n reads the input line by line
-l removes newlines from input and adds them to output
-a splits each line on whitespace into the @F array
substr extracts the year from the YYYYMM date. Hashes %Y and %YM use the dates as keys and the counts as values. That's why the year hash uses +=, which adds the value to the already accumulated one.
The END block is evaluated after the input has been exhausted.
It just iterates over the years stored in the hash; the range 1 .. 12 is used for the months so that zeroes can be inserted (the || operator supplies the 0 when a month is missing).
next and $start skip the months before the start of the report.
last is responsible for skipping the rest of the current year.
The following awk script will do what you expect. The idea is:
store data in an array
print and sum only when the year changes
This gives:
# function that prints the year starting
# at month m1 and ending at m2
function print_year(m1,m2, s,str) {
s=0
for(i=(m1+0); i<=(m2+0); ++i) {
str=y sprintf("%0.2d",i);
print str, a[str]+0; s+=a[str]
}
print y,s
}
# This works for GNU awk, replace for posix with a call as
# awk -v stime=$(date "+%Y%m") -f script.awk file
BEGIN{ stime=strftime("%Y%m") }
# initializer on first record
(NR==1){ y=substr($1,1,4); m1=substr($1,5) }
# print intermediate year
(substr($1,1,4) != y) {
print_year(m1,12)
y=substr($1,1,4); m1="01";
delete a
}
# set array value and keep track of last month
{a[$1]=$2; m2=substr($1,5)}
# check if entry is still valid (past stime or not)
($1 > stime) { exit }
# print all missing years full
# print last year upto system time month
END {
for (;y<substr(stime,1,4)+0;y++) { print_year(m1,12); m1=1; m2=12; }
print_year(m1,substr(stime,5))
}
Nice question, btw. Friday afternoon brain frier. Time to head home.
In awk. The optional endtime and its value are brought in as arguments:
$ awk -v arg1=201904 -v arg2=100 ' # optional parameters
function foo(ym,v) {
while(p<ym){
y=substr(p,1,4) # get year from previous round
m=substr(p,5,2)+0 # get month
p=y+(m==12) sprintf("%02d",m%12+1) # December magic
if(m==12)
print y,s[y] # print the sums (delete maybe?)
print p, (p==ym?v:0) # print yyyymm and 0/$2
}
}
{
s[substr($1,1,4)]+=$2 # sums in array, year index
}
NR==1 { # handle first record
print
p=$1
}
NR>1 {
foo($1,$2)
}
END {
if(arg1)
foo(arg1,arg2)
print y=substr($1,1,4),s[y]+arg2
}' file
Tail from output:
2018 775
201901 0
201902 10
201903 0
201904 100
2019 110

bash group times and average + sum columns

I have a daily file output on a Linux system like the one below, and was wondering: is there a way to group the data in 30-minute increments based on $1, average $3, and sum $4 $5 $6 $7 $8 via a shell script using awk/gawk or something similar?
04:04:13 04:10:13 2.13 36 27 18 18 0
04:09:13 04:15:13 2.37 47 38 13 34 0
04:14:13 04:20:13 2.19 57 37 23 33 1
04:19:13 04:25:13 2.43 43 35 13 30 0
04:24:13 04:30:13 2.29 48 40 19 28 1
04:29:13 04:35:13 2.33 56 42 16 40 0
04:34:13 04:40:13 2.21 62 47 30 32 0
04:39:13 04:45:13 2.25 44 41 19 25 0
04:44:13 04:50:13 2.20 65 50 32 33 0
04:49:13 04:55:13 2.47 52 38 16 36 0
04:54:13 05:00:13 2.07 72 54 40 32 0
04:59:13 05:05:13 2.35 53 41 19 34 0
so basically this hour of data would result in something like this:
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.25 348 271 156 192 0
This is what I have gotten so far, using awk to search between the time frames, but I think there is an easier way to get the grouping done without awking each 30-minute interval:
awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ total += $3; count++ } END { print total/count }'|awk '{printf "%0.2f\n", $1}'
awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ sum+=$4} END {print sum}'
This should do what you want:
{
    split($1, times, ":")
    i = (2 * times[1])
    if (times[2] >= 30) i++
    if (!start[i] || $1 < start[i]) start[i] = $1
    if (!end[i] || $1 > end[i]) end[i] = $1
    count[i]++
    for (col = 3; col <= 8; col++) {
        data[i, col] += $col
    }
}
END {
    for (i = 0; i < 48; i++) {
        if (start[i]) {
            data[i, 3] = data[i, 3] / count[i]
            printf("%s-%s %.2f", start[i], end[i], data[i, 3])
            for (col = 4; col <= 8; col++) {
                printf(" " data[i, col])
            }
            print ""
        }
    }
}
As you can see, I divide the day into 48 half-hour intervals and place the data into one of these bins depending on the time in the first column. After the input has been exhausted, I print out all bins that are not empty.
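No invocation is shown above; assuming the script is saved as, say, halfhours.awk and the data is in daily.txt (both names are mine), it would be run as:
awk -f halfhours.awk daily.txt
and prints one line per non-empty half-hour bin, in time order.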
Personally, I would do this in Python or Perl. In awk, the arrays are not ordered (well, in gawk you could use asorti to sort the array...) which makes printing ordered buckets more work.
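For what it's worth, a minimal gawk sketch of that asorti route might look like this (the bucket-key format and the choice to sum only column 4 are mine, just to keep it short):
gawk '{
    key = substr($1, 1, 2) (substr($1, 4, 2) < "30" ? ":00" : ":30")
    sum[key] += $4
}
END {
    n = asorti(sum, keys)          # keys[] gets the bucket labels in sorted order
    for (i = 1; i <= n; i++)
        print keys[i], sum[keys[i]]
}' file
The Python approach below sidesteps the ordering question entirely.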
Here is the outline:
Read input
Convert the time stamp to seconds
Add to an ordered (or sortable) associative array of the data elements in buckets of the desired time frame (or, just keep running totals).
After the data is read, process as you wish.
Here is a Python version of that:
#!/usr/bin/python
from collections import OrderedDict
import fileinput

times = []
interval = 30 * 60
od = OrderedDict()

for line in fileinput.input():
    li = line.split()
    secs = sum(x*y for x, y in zip([3600, 60, 1], map(int, li[0].split(":"))))
    times.append([secs, [li[0], float(li[2])] + map(int, li[3:])])

current = times[0][0]
for t, li in times:
    if t - current < interval:
        od.setdefault(current, []).append(li)
    else:
        current = t
        od.setdefault(current, []).append(li)

for s, LoL in od.items():
    avg = sum(e[1] for e in LoL) / len(LoL)
    sums = [sum(e[i] for e in LoL) for i in range(2, 7)]
    print "{}-{} {:.3} {}".format(LoL[0][0], LoL[-1][0], avg, ' '.join(map(str, sums)))
Running that on your example data:
$ ./ts.py ts.txt
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0
The advantage is that you can easily change the interval, and a similar technique can handle timestamps that span more than a day.
If you really want awk you could do:
awk 'BEGIN{ interval=30*60 }
function fmt(){
line=sprintf("%s-%s %.2f %i %i %i %i %i", ls, $1, sums[3]/count,
sums[4], sums[5], sums[6], sums[7], sums[8])
}
{
split($1,a,":")
secs=a[1]*3600+a[2]*60+a[3]
if (NR==1) {
low=secs
ls=$1
count=0
for (i=3; i<=8; i++)
sums[i]=0
}
for (i=3; i<=8; i++){
sums[i]+=$i
}
count++
if (secs-low<interval) {
fmt()
}
else {
print line
low=secs
ls=$1
count=1
for (i=3; i<=8; i++)
sums[i]=$i
}
}
END{
fmt()
print line
}' file
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0

Calculating sum of gradients with awk

I have a file that contains 4 columns such as:
A B C D
1 2 3 4
10 20 30 40
100 200 300 400
.
.
.
I can calculate the gradient of a column versus A with a command such as the following (shown here for column B):
awk 'NR>1{print $0,($2-b)/($1-a)}{a=$1;b=$2}' file
How can I print sum of gradients as the 5th column in the file? The results should be:
A B C D sum
1 2 3 4 1+2+3+4=10
10 20 30 40 (20-2)/(10-1)+(30-3)/(10-1)+(40-4)/(10-1)=9
100 200 300 400 (200-20)/(100-10)+(300-30)/(100-10)+(400-40)/(100-10)=9
.
.
.
awk 'NR == 1 { print $0, "sum"; next } { if (NR == 2) { sum = $1 + $2 + $3 + $4 } else { t = $1 - a; sum = ($2 - b) / t + ($3 - c) / t + ($4 - d) / t } print $0, sum; a = $1; b = $2; c = $3; d = $4 }' file
Output:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
With ... | column -t:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
Update:
#!/usr/bin/awk -f

NR == 1 {
    print $0, "sum"
    next
}

{
    sum = 0
    if (NR == 2) {
        for (i = 1; i <= NF; ++i)
            sum += $i
    } else {
        t = $1 - a[1]
        for (i = 2; i <= NF; ++i)
            sum += ($i - a[i]) / t
    }
    print $0, sum
    for (i = 1; i <= NF; ++i)
        a[i] = $i
}
Usage:
awk -f script.awk file
If you apply the same logic to the first line of numbers as you do to the rest, taking the initial value of each column as 0, you get 9 as the result of the sum (as it was in your question originally). This approach uses a loop to accumulate the sum of the gradient from the second field up to the last one. It uses the fact that on the first time round, the uninitialised values in the array a evaluate to 0:
awk 'NR==1 { print $0, "sum"; next }
{
s = 0
for(i=2;i<=NF;++i) s += ($i-a[i])/($1-a[1]) # accumulate sum
for(i=1;i<=NF;++i) a[i] = $i # fill array to be used for next iteration
print $0, s
}' file
You can pack it all onto one line if you want, but remember to separate the statements with semicolons. It's also slightly shorter to use a single for loop with an if; looping from the last field down to the first keeps a[1] untouched until all of the current line's gradients have been computed:
awk 'NR==1{print $0,"sum";next}{s=0;for(i=NF;i>=1;--i){if(i>1)s+=($i-a[i])/($1-a[1]);a[i]=$i};print $0,s}' file
Output:
A B C D sum
1 2 3 4 9
10 20 30 40 9
100 200 300 400 9

Using awk create two arrays from two column values, find difference and sum differences, and output data

I have a file with the following fields (and an example value to the right):
hg18.ensGene.bin 0
hg18.ensGene.name ENST00000371026
hg18.ensGene.chrom chr1
hg18.ensGene.strand -
hg18.ensGene.txStart 67051161
hg18.ensGene.txEnd 67163158
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
hg18.ensGene.name2 ENSG00000152763
hg18.ensGene.exonFrames 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0,
This is a shortened version of the file:
0 ENST00000371026 chr1 - 67051161 67163158 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0, uc009waw.1,uc009wax.1,uc001dcx.1,
0 ENST00000371023 chr1 - 67075869 67163055 67075869,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163055, ENSG00000152763 0,1,1,1,2,1,2,0,2,0, uc001dcy.1
0 ENST00000395250 chr1 - 67075991 67163158 67075991,67076022,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076018,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,0,1,1,1,2,0,-1,-1,-1,-1, n/a
I need to sum the difference of the exon starts and ends for example:
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
difference:
1290,157,227,99,122,158,152,87,203,195,156,140,157,113,185,175,226
sum (hg18.ensGene.exonLenSum):
3842
And I would like the output to have the following fields:
hg18.ensGene.name
hg18.ensGene.name2
hg18.ensGene.exonLenSum
such as this:
ENST00000371026 ENSG00000152763 3842
I would like to do this with one awk script for all lines in the input file. How can I do this? This is useful for calculating exon lengths, say for an RPKM (Reads Per Kilobase of exon model per Million mapped reads) calculation.
so ross$ awk -f gene.awk gene.dat
ENST00000371026 ENSG00000152763 3842
ENST00000371023 ENSG00000152763 1645
ENST00000395250 ENSG00000152763 1622
so ross$ cat gene.awk
/./ {
name = $2
name2 = $9
s = $7
e = $8
sc = split(s, sa, ",")
ec = split(e, ea, ",")
if (sc != ec) {
print "starts != ends ", name, name2, sc, ec
}
diffsum = 0
for(i = 1; i <= sc; ++i) {
diffsum += ea[i] - sa[i]
}
print name, name2, diffsum
}
using the UCSC mysql anonymous server:
mysql -N -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select name,name2,exonStarts,exonEnds from ensGene' |\
awk -F ' ' '{n=split($3,a1,"[,]"); split($4,a2,"[,]"); size=0; for(i=1;i<=n;++i) {size+=int(a2[i]-a1[i]);} printf("%s\t%s\t%d\n",$1,$2,size); }'
result:
ENST00000404059 ENSG00000219789 632
ENST00000326632 ENSG00000146556 1583
ENST00000408384 ENSG00000221311 138
ENST00000409575 ENSG00000222003 1187
ENST00000409981 ENSG00000222027 1187
ENST00000359752 ENSG00000197490 126
ENST00000379479 ENSG00000205292 873
ENST00000326183 ENSG00000177693 918
ENST00000407826 ENSG00000219467 2820
ENST00000405199 ENSG00000220902 1231
(...)
