This is a follow-up to the question:
Calculating the total time in an online meeting
Suppose there was a meeting and the meeting record is saved in a CSV file. How can I write a bash/awk script to find the total amount of time for which an employee stayed online? An employee may leave and rejoin the meeting, and all of his/her online time should be counted.
What I did is as follows, but I got stuck on how to compare one record with all the other records and add up the time of each joined/left pair for a person.
$ cat tst.awk
BEGIN { FS=" *, *"; OFS=", " }
NR==1 { next }
$1 in joined {
    jt = time2secs(joined[$1])
    lt = time2secs($3)
    totSecs[$1] += (lt - jt)
    delete joined[$1]
    next
}
{ joined[$1] = $3 }
END {
    for (name in totSecs) {
        print name, secs2time(totSecs[name])
    }
}
function time2secs(time, t) {
    split(time,t,/:/)
    return (t[1]*60 + t[2])*60 + t[3]
}
function secs2time(secs, h,m,s) {
    h = int(secs / (60*60))
    m = int((secs - (h*60*60)) / 60)
    s = int(secs % 60)
    return sprintf("%02d:%02d:%02d", h, m, s)
}
The start_time and end_time of the meeting are given on the command line, such as:
$ ./script.sh input.csv 10:00:00 13:00:00
Only the time between startTime (10:00:00) and endTime (13:00:00) should be considered. Persons who have joined but not left must be treated as having left at endTime, and that online time should be added as well. I tried, but could not get the desired result.
The output should look like this (it can be stored in an output file):
Bob, 02:44:00
John, 00:41:00
David, 02:50:00
James, 01:39:30
The contents of the CSV file are as follows:
Employee_name, Joined/Left, Time
David, joined, 09:40:00
David, left, 10:20:00
David, joined, 10:30:00
John, joined, 10:00:00
Bob, joined, 10:01:00
James, joined, 10:00:30
Bob, left, 10:20:00
James, left, 11:40:00
John, left, 10:41:00
Bob, joined, 10:35:00
$ cat tst.awk
BEGIN { FS=" *, *"; OFS=", " }
NR==1 { next }       # skip the header line
{ names[$1] }        # remember every employee seen
$3 < beg  { next }   # ignore events before the start time (zero-padded times compare correctly as strings)
$3 >= end { exit }   # stop reading once the end time is reached
$2 == "joined" {
    joined[$1] = $3
}
($2 == "left") && ($1 in joined) {
    jt = time2secs(joined[$1])
    lt = time2secs($3)
    totSecs[$1] += (lt - jt)
    delete joined[$1]
    next
}
END {
    for (name in names) {
        if (name in joined) {   # still online: count the time up to the end time
            jt = time2secs(joined[name])
            lt = time2secs(end)
            totSecs[name] += (lt - jt)
        }
        print name, secs2time(totSecs[name])
    }
}
function time2secs(time, t) {
    split(time,t,/:/)
    return (t[1]*60 + t[2])*60 + t[3]
}
function secs2time(secs, h,m,s) {
    h = int(secs / (60*60))
    m = int((secs - (h*60*60)) / 60)
    s = int(secs % 60)
    return sprintf("%02d:%02d:%02d", h, m, s)
}
$ awk -v beg='10:00:00' -v end='13:00:00' -f tst.awk file
James, 01:39:30
David, 02:30:00
Bob, 02:44:00
John, 00:41:00
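If you want the exact ./script.sh input.csv 10:00:00 13:00:00 interface from the question, a thin wrapper can simply forward the positional arguments to the awk script above. A minimal sketch (the output file name output.txt is my own assumption, not something given in the question):

#!/usr/bin/env bash
# script.sh - usage: ./script.sh input.csv 10:00:00 13:00:00
# Forwards the CSV file and the start/end times to tst.awk and also
# saves the result to output.txt (an assumed file name).
awk -v beg="$2" -v end="$3" -f tst.awk "$1" | tee output.txt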
Related
I'm trying to create a nice little table of values using bash, but not all the values are in order. On top of that, the values happen to be in their own files. My first few thoughts are to use cat and grep to grab the values, but from there I'm not sure what is appropriate. I feel like awk would do wonders in this situation, but I do not know awk very well.
file1 might look like this
V 0.001
A 98.6
N Measurement1
T 14:15:01
S 20.2
F 212.86
G 28.19
file2 might look like this
V 0.008
A 103.4
N Measurement2
T 16:20:31
S 21.2
F 215.86
G 28.19
The final file would look like this
N Measurement1 Measurement2
T 14:15:01 16:20:31
V 0.001 0.008
G 28.19 28.19
A 98.6 103.4
S 20.2 21.2
F 212.86 215.86
Self-commented code is provided below to help with understanding awk:
awk '
# create a new reference (per file)
FNR==1{Ref++}
# each line
{   # add the label to memory
    N[$1]
    # add the value to a 2-dimensional array
    V[Ref ":" $1] = $2
    # remember the maximum value length in this series
    if( length( $2 ) > M[Ref] ) M[Ref] = length( $2 )
}
# after the last file
END{
    # print the header (the name of each series)
    printf( "N ")
    for( i=1;i<=Ref;i++) printf( "%" M[i] "s ", V[ i ":N" ] )
    printf( "\n")
    # print the data for each label (the format width keeps columns aligned);
    # do not print the series name a second time
    for ( n in N ){
        if( n != "N" ){
            printf( "%s ", n)
            for( i=1;i<=Ref;i++) printf( "%" M[i] "s ", V[ i ":" n ] )
            printf( "\n")
        }
    }
}
' file*
I have a file test.txt with multiple records as below.
100,200,300,08-May-2012 11:24:25
100,400,300,25-May-2012 09:24:25
Now I want to output the data using the following "format":
$1,$2,$3,$4,$4+30days,$4+60days,$4+90days
How can I do that using awk?
script.awk file
function map_month(m)
{
    if (m ~ /[jJ][aA][nN]/) return 1;
    if (m ~ /[fF][eE][bB]/) return 2;
    if (m ~ /[mM][aA][rR]/) return 3;
    if (m ~ /[aA][pP][rR]/) return 4;
    if (m ~ /[mM][aA][yY]/) return 5;
    if (m ~ /[jJ][uU][nN]/) return 6;
    if (m ~ /[jJ][uU][lL]/) return 7;
    if (m ~ /[aA][uU][gG]/) return 8;
    if (m ~ /[sS][eE][pP]/) return 9;
    if (m ~ /[oO][cC][tT]/) return 10;
    if (m ~ /[nN][oO][vV]/) return 11;
    if (m ~ /[dD][eE][cC]/) return 12;
    return 0;
}
function cvt_timestamp(str, t, a)
{
    split(str, a, /[- :]/)
    a[2] = map_month(a[2])
    #print a[3] " " a[2] " " a[1] " " a[4] " " a[5] " " a[6]
    t = mktime(a[3] " " a[2] " " a[1] " " a[4] " " a[5] " " a[6])
    #print t
    return t
}
function fmt_timestamp(t)
{
    return strftime("%d-%b-%Y %H:%M:%S", t)
}
BEGIN { FS = OFS = "," }
{
    tm = cvt_timestamp($4);
    t0 = fmt_timestamp(tm + 0 * 86400)
    t1 = fmt_timestamp(tm + 30 * 86400)
    t2 = fmt_timestamp(tm + 60 * 86400)
    t3 = fmt_timestamp(tm + 90 * 86400)
    print $1, $2, $3, t0, t1, t2, t3
}
There's probably a better way to do the month-abbreviation-to-number mapping; have at it. Using functions in awk is valuable, just as it is in any other language with functions.
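For instance, one common alternative (a sketch of my own, not part of the answer above) replaces the twelve regex tests with a single index() lookup into a string of lower-cased month abbreviations:

# Hypothetical replacement for map_month(): locate the abbreviation in a
# fixed string and convert its offset to a month number; returns 0 if the
# abbreviation is not found.
function map_month(m,    pos) {
    pos = index("janfebmaraprmayjunjulaugsepoctnovdec", tolower(m))
    return (pos ? (pos + 2) / 3 : 0)
}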
Example data file
100,200,300,08-May-2012 11:24:25
100,400,300,25-May-2012 09:24:25
100,400,300,15-Sep-2012 09:24:25
100,400,300,29-Feb-2012 09:24:25
Example output
$ gawk -f script.awk data
100,200,300,08-May-2012 11:24:25,07-Jun-2012 11:24:25,07-Jul-2012 11:24:25,06-Aug-2012 11:24:25
100,400,300,25-May-2012 09:24:25,24-Jun-2012 09:24:25,24-Jul-2012 09:24:25,23-Aug-2012 09:24:25
100,400,300,15-Sep-2012 09:24:25,15-Oct-2012 09:24:25,14-Nov-2012 08:24:25,14-Dec-2012 08:24:25
100,400,300,29-Feb-2012 09:24:25,30-Mar-2012 10:24:25,29-Apr-2012 10:24:25,29-May-2012 10:24:25
$
If you decide you want to handle the switches in time zone (the daylight-saving transitions visible in the last two rows) differently, that's your prerogative; you can fix the code.
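For example, if you want to suppress those daylight-saving jumps entirely and your gawk is 4.2 or newer, mktime() and strftime() accept an optional utc-flag, so the whole calculation can be done in UTC. A sketch, assuming that gawk version:

# Sketch: pass a non-zero utc-flag (gawk 4.2+) so that adding multiples of
# 86400 seconds never crosses a DST boundary; times are read and printed as UTC.
function cvt_timestamp(str, t, a)
{
    split(str, a, /[- :]/)
    a[2] = map_month(a[2])
    return mktime(a[3] " " a[2] " " a[1] " " a[4] " " a[5] " " a[6], 1)
}
function fmt_timestamp(t)
{
    return strftime("%d-%b-%Y %H:%M:%S", t, 1)
}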
I have an input file whose fields are spread over several lines. In this file, the field pattern is repeated according to the query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per complete group of fields. Empty fields should be marked, for example with "X":
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how I can get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {
    delete fieldpos;
    delete fieldlen;
    lastspace = 1;
    fieldindex = 0;
    for (i = 1; i <= length(); i++) {
        if (substr($0, i, 1) != " ") {
            if (lastspace) {
                fieldpos[fieldindex] = i;
                if (fieldindex > 0) {
                    fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
                }
                fieldindex++;
            }
            lastspace = 0;
        } else {
            lastspace = 1;
        }
    }
}
NR%3 == 2 {
    for (i = 0; i < fieldindex; i++) {
        if (i in fieldlen) {
            f = substr($0, fieldpos[i], fieldlen[i]);
        } else { # last field, go to end of line
            f = substr($0, fieldpos[i]);
        }
        gsub(/^ +| +$/, "", f); # trim surrounding spaces
        if (f == "") { f = "X" }
        printf("%s ", f);
    }
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
    FIELDWIDTHS = ""
    while ( match($0,/\S+\s*/) ) {
        FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
        $0 = substr($0,RLENGTH+1)
    }
    next
}
NF {
    for (i=1;i<=NF;i++) {
        gsub(/^\s+|\s+$/,"",$i)
        $i = ($i=="" ? "X" : $i)
    }
    rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect in that it truncates a character off 21293; that's because I'm not convinced your input file is accurate, and if it is, you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
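A rough sketch of that portable approach (my own illustration, not part of the answer above): record each header field's start position and width with match(), then slice the data lines apart with substr():

# Sketch for awks without FIELDWIDTHS: remember each column's start and
# width from the most recent header line, then cut the data lines with substr().
/^ZZZZ/ { if (rec != "") print rec; rec = "" }
/^[[:upper:]]/ {
    nf = 0
    line = $0
    pos = 1
    while (match(line, /[^ ]+ */)) {
        nf++
        start[nf] = pos + RSTART - 1
        width[nf] = RLENGTH
        pos += RSTART + RLENGTH - 1
        line = substr(line, RSTART + RLENGTH)
    }
    next
}
NF {
    for (i = 1; i <= nf; i++) {
        f = substr($0, start[i], width[i])
        gsub(/^ +| +$/, "", f)
        rec = (rec == "" ? "" : rec " ") (f == "" ? "X" : f)
    }
}
END { print rec }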
I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the big file with an awk script to compute the mean.
So for each small file (called small below) you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    print "start =", range_start, "end =", range_end, "mean =", sum / count
}
You can try the following:
for r in *; do
    awk -v r=$r -F' ' \
        'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first pass it saves columns 2 & 4 into temporary variables. On every other pass it checks whether the file name r falls between the start of the range (the previous column 2) and the end of the range (the current column 2).
It then works out the mean and prints the result.
I am trying to parse some CSV files using awk. I am new to shell scripting and awk.
The CSV file I am working on looks something like this:
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
I need to find the average AccessTimes of the different functions.
I have been working with awk and have been able to get the average times, provided the exact column numbers are specified, like $2, $3, etc.
However, I need a general script in which, if I pass "minAccessTime" as a command argument, the script prints the average of that column (instead of explicitly specifying $2 or $3 in the awk program).
I have been googling this and looked in various forums, but none of the suggestions seems to work.
Can someone tell me how to do this? It would be of great help!
Thanks in advance!!
This awk script should give you all that you want.
It first evaluates which column you're interested in by using the name passed in as the COLM variable and checking against the first line. It converts this into an index (it's left as the default 0 if it couldn't find the column).
It then basically runs through all other lines in your input file. On all these other lines (assuming you've specified a valid column), it updates the count, sum, minimum and maximum for both the overall data plus each individual function name.
The former is stored in count, sum, min and max. The latter are stored in associative arrays with similar names (with _arr appended).
Then, once all records are read, the END section outputs the information.
NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i == COLM) {
            cidx = i;
        }
    }
}
NR > 1 {
    if (cidx > 0) {
        count++;
        sum += $cidx;
        if (count == 1) {
            min = $cidx;
            max = $cidx;
        } else {
            if ($cidx < min) { min = $cidx; }
            if ($cidx > max) { max = $cidx; }
        }
        count_arr[$1]++;
        sum_arr[$1] += $cidx;
        if (count_arr[$1] == 1) {
            min_arr[$1] = $cidx;
            max_arr[$1] = $cidx;
        } else {
            if ($cidx < min_arr[$1]) { min_arr[$1] = $cidx; }
            if ($cidx > max_arr[$1]) { max_arr[$1] = $cidx; }
        }
    }
}
END {
    if (cidx == 0) {
        print "Column '" COLM "' does not exist"
    } else {
        print "Overall:"
        print " Total records = " count
        print " Sum of column = " sum
        if (count > 0) {
            print " Min of column = " min
            print " Max of column = " max
            print " Avg of column = " sum / count
        }
        for (task in count_arr) {
            print "Function " task ":"
            print " Total records = " count_arr[task]
            print " Sum of column = " sum_arr[task]
            print " Min of column = " min_arr[task]
            print " Max of column = " max_arr[task]
            print " Avg of column = " sum_arr[task] / count_arr[task]
        }
    }
}
Storing that script into qq.awk and placing your sample data into qq.in, then running:
awk -F, -vCOLM=minAccessTime -f qq.awk qq.in
generates the following output, which I'm relatively certain will give you every possible piece of information you need:
Overall:
Total records = 7
Sum of column = 2930
Min of column = 200
Max of column = 700
Avg of column = 418.571
Function getStage:
Total records = 3
Sum of column = 1900
Min of column = 600
Max of column = 700
Avg of column = 633.333
Function getInfo:
Total records = 4
Sum of column = 1030
Min of column = 200
Max of column = 300
Avg of column = 257.5
For `maxAccessTime`, you get:
Overall:
Total records = 7
Sum of column = 5120
Min of column = 600
Max of column = 1000
Avg of column = 731.429
Function getStage:
Total records = 3
Sum of column = 2600
Min of column = 800
Max of column = 1000
Avg of column = 866.667
Function getInfo:
Total records = 4
Sum of column = 2520
Min of column = 600
Max of column = 700
Avg of column = 630
And, for xyzzy (a non-existent column), you'll see:
Column 'xyzzy' does not exist
If I understand the requirements correctly, you want the average of a column, and you'd like to specify the column by name.
Try the following script (avg.awk):
BEGIN {
    FS=",";
}
NR == 1 {
    for (i=1; i <= NF; ++i) {
        if ($i == SELECTED_FIELD) {
            SELECTED_COL=i;
        }
    }
}
NR > 1 && $1 ~ SELECTED_FNAME {
    sum[$1] = sum[$1] + $SELECTED_COL;
    count[$1] = count[$1] + 1;
}
END {
    for (f in sum) {
        printf("Average %s for %s: %d\n", SELECTED_FIELD, f, sum[f] / count[f]);
    }
}
and invoke your script like this
awk -v SELECTED_FIELD=minAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -v SELECTED_FNAME=getInfo -f avg.awk < data.csv
EDIT:
Rewritten to group by function name (assumed to be first field)
EDIT2:
Rewritten to allow additional parameter to filter by function name (assumed to be first field)
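With the sample data above, that last invocation should print something close to the line below (note that the %d format in the printf truncates any fractional part of the average):
Average maxAccessTime for getInfo: 630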