find time difference in following file - shell

Firstly, thank you to the forum members.
I need to find the time difference between two timestamp rows using awk/shell.
Here is the logfile:
cat business_file
start:skdjh:22:06:2010:10:30:22
sdfnskjoeirg
wregn'wergnoeirnfqoeitgherg
end:siifneworigo:22:06:2010:10:45:34
start:srsdneriieroi:24:06:2010:11:00:45
dfglkndfogn
sdgsdfgdfhdfg
end:erfqefegoieg:24:06:2010:11:46:34
oeirgoeirg\
start:sdjfhsldf:25:07:2010:12:55:43
wrgnweoigwerg
ewgjw]egpojwepr
etwasdf
gwdsdfsdf
fgpwj]pgojwth
wtr
wtwrt
end:dgnoeingwoit:25:07:2010:01:42:12
===========
The above logfile is a kind of API log. Some rows start with "start" and some with "end"; on those rows, the 3rd column through the end of the row is a timestamp (take ":" as the delimiter).
We have to find the time difference between the start and end timestamps of each such pair of rows.
Hope I am clear with the question; please let me know if you need more explanation.
Thx
Srinivas

Since the timestamp is separated by the same field separator as the rest of the line, this does not even require manual splitting. Simply
awk -F : 'function timestamp() { return mktime($5 " " $4 " " $3 " " $6 " " $7 " " $8) } $1 == "start" { t_start = timestamp() } $1 == "end" { print(timestamp() - t_start) }' filename
works and prints the time difference in seconds (note that mktime is a GNU awk function). Broken out for readability, the code is:
# return the timestamp for the current row, using the pre-split fields
function timestamp() {
return mktime($5 " " $4 " " $3 " " $6 " " $7 " " $8)
}
# in start lines, remember the timestamp
$1 == "start" {
t_start = timestamp()
}
# in end lines, print the difference.
$1 == "end" {
print(timestamp() - t_start)
}
If you want to format the time difference in another manner, see the time functions in the GNU awk manual; there is also a short formatting sketch at the end of this answer. By the way, the last block in your example has a negative length of several hours. You may want to look into that.
Addendum: in case that negative length is because of the am/pm convention some countries use, this opens up a can of worms: every timestamp then has two possible meanings (the log file does not seem to record whether a time is am or pm), so durations of more than half a day become an unsolvable problem. If you know that durations are never longer than half a day and that the end time is always after the start time, you might be able to hack your way around it with something like
$1 == "end" {
t_end = timestamp();
if(t_end < t_start) {
t_end += 43200 # add 12 hours
}
print(t_end - t_start)
}
...but in that case the log file format is broken and should be fixed. This sort of hackery is not something you want to rely on in the long term.
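As for formatting the difference in some other manner: a minimal sketch (still GNU awk, reusing the timestamp() function from above) that prints the duration as hours, minutes and seconds instead of raw seconds:
# print the difference as HH:MM:SS rather than plain seconds
$1 == "end" {
    diff = timestamp() - t_start
    printf("%02d:%02d:%02d\n", int(diff / 3600), int((diff % 3600) / 60), diff % 60)
}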

Related

Unix script, awk and csv handling

Unix noob here again. I'm writing a script for a Unix class I'm taking online, and it is supposed to handle a csv file while excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.
After sorting out my syntax errors, the script runs; however, all it does is print the header before the loop at the end, and a single value of zero under "Gross Sales", where it doesn't belong.
My script thus far:
I am open to any and all suggestions. However, I should mention again, I don't know what I'm doing, so you may have to explain a bit.
#!/bin/bash
grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'
THE ASSIGNMENT Summary:
Using only awk and sed (or grep for the clean-up portion) write a script that prepares the following reports on the above retail dataset.
First, do some data "clean up" -- you can use awk, sed and/or grep for these:
Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
Any items with a StockCode of "POST" should be deleted.
Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
Delete any rows labeled 'Discount' in the Description (or have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.
Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:
Gross sales for each region (printed in any order). The regions in the file, less Australia, are below. To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total.
France
United Kingdom
Netherlands
Germany
Norway
Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.
Use field widths so the columns are aligned in the output.
The output table should look like this, although the data will produce different results. You can format the table any way you choose but it should be in a readable table with aligned columns. Like so:
Country           Gross Sales         Items Under $2.00
---------------------------------------------------------
France            801.86              12
Netherlands       177.60              1
United Kingdom    23144.4             488
Germany           243.48              11
Norway            1919.14             56
A small sample of the csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom
Seriously, Thank you to whoever can help. You all are amazing!
edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...
edit2: okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\',' and now it runs, however, the report data is all incorrect. sigh... I will trudge on.
Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call processing all records in your file. You can do it with three rules.
The first rule simply sets the conditions which, if found in the record (line), cause awk to skip to the next line, ignoring the record. According to the description, that would be:
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
Your second rule simply sums Quantity * UnitPrice for each Country and also keeps track of low-priced goods where the UnitPrice is less than 2. The a[] array tracks the total sales per Country, and the lpc[] array tracks the quantity of goods sold with a UnitPrice under $2.00. That can simply be:
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
The final END rule just outputs the heading and then the table, formatted in columns. That could be:
END {
printf "%-20s%-20s%s", "Country", "Group Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
That's it. If you put it all together, with your input in file, you would have:
awk -F, '
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
END {
printf "%-20s%-20s%s", "Country", "Group Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
' file
Example Use/Output
You can just select-copy and middle-mouse-paste the command above into an xterm with the current working directory containing file. Doing so, your results would be:
Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36
Which is similar to the format specified -- though I took the time to decimal align the Gross Sales column for easy reading.
Look things over and let me know if you have further questions.
note: processing CSV files with awk (or sed or grep) is generally not a good idea IF the values can contain embedded commas within double-quoted fields, e.g.
field1,"field2,part_a,part_b",fields3,...
Embedded commas like this prevent you from choosing a single field separator that will correctly parse the file.
If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.
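If you do end up with quoted fields containing embedded commas and still want to stay in awk, GNU awk's FPAT variable (which describes what a field looks like rather than what separates fields) is one way to cope. A small sketch, separate from the assignment solution above:
# gawk only: a field is either a run of non-comma characters or a double-quoted string
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
NR > 1 { print $NF }' online_retail.csv
With FPAT a quoted field keeps its internal commas, so $NF is still the Country column. It still does not handle embedded double quotes or newlines inside fields, so a real CSV parser remains the safer choice for messy data.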

replace terms with associated abbreviations from other file, in case of matching

I have two files:
1. Pattern file = pattern.txt
2. File containing different terms = terms.txt
pattern.txt contains two columns, separated by ";".
In the first column I have several terms, and in the second column the abbreviations
associated with the terms in the first column, on the same line.
terms.txt contains single words, but also terms made up of
a combination of several words.
pattern.txt
Berlin;Brln
Barcelona;Barcln
Checkpoint Charly;ChckpntChrl
Friedrichstrasse;Fridrchstr
Hall of Barcelona;HllOfBarcln
Paris;Prs
Yesterday;Ystrdy
terms.txt
Berlin
The Berlinale ended yesterday
Checkpoint Charly is still in Friedrichstrasse
There will be a fiesta in the Hall of Barcelona
Paris is a very nice city
The target is to replace terms with their standardised abbreviations and to find out which terms
have no abbreviation.
As a result I would like to have two files.
The first file is a new terms file, with terms replaced by abbreviations wherever a replacement exists.
The second file contains a list of all terms that don't have an abbreviation.
The matching is case insensitive; I make no difference between "The" and "the".
new_terms.txt
Brln
The Berlinale ended Ystrdy
ChckpntChrl is still in Fridrchstr
There will be a fiesta in the HllOfBarcln
Prs is a very nice city
terms_without_abbreviations.txt
a
be
Berlinale
city
ended
fiesta
in
is
nice
of
still
The
There
very
will
I will appreciate your help and thanks in advance for your time and hints!
This is mostly what you need (note that asorti requires GNU awk):
BEGIN { FS=";"; }
FNR==NR { dict[tolower($1)] = $2; next }
{
line = "";
count = split($0, words, / +/);
for (i = 1; i <= count; i++) {
key = tolower(words[i]);
if (key in dict) {
words[i] = dict[key];
} else {
result[key] = words[i];
}
line = line " " words[i];
}
print substr(line, 2);
}
END {
count = asorti(result, sorted);
for (i = 1; i <= count; i++) {
print result[sorted[i]];
}
}
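Assuming you save the above as, say, abbrev.awk (the name is only for illustration), it reads the pattern file first and the terms file second; both the rewritten lines and the sorted list of unmatched words go to standard output, so a possible invocation is:
# asorti requires GNU awk
gawk -f abbrev.awk pattern.txt terms.txt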
Ok, so I had a bit of a crack, but will explain the issues:
If you have multiple patterns in pattern.txt that can apply to a single line, only one of them will make its change (e.g. Barcelona;Barcln and Hall of Barcelona;HllOfBarcln: if the Barcelona → Barcln change has already been made by the time you get to the longer pattern, Hall of Barcelona no longer exists in the line, so no change is made).
Similar to the above, there is no abbreviation for the word 'Hall' on its own, so if only the first change was made, your file of terms without abbreviations will also include hall.
#!/usr/bin/awk -f
BEGIN{
FS = ";"
IGNORECASE = 1
}
FNR == NR{
abbr[tolower($1)] = $2
next
}
FNR == 1{ FS = " "; $0 = $0 }   # re-split this first record of the terms file with the new FS
{
for(i = 1; i <= NF; i++){
item = tolower($i)
if(!(item in abbr) && !(item in twa)){
twa[item]
print item > "terms_without_abbreviations.txt"
}
}
for(i in abbr)
gsub("\\<"i"\\>", abbr[i])
print > "new_terms.txt"
}
There are probably other gotchas to look out for, but it is a rough direction. I am not sure how you would get around the points above, though one idea is sketched below.
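One possible way around the ordering problem, assuming GNU awk 4.0 or newer (PROCINFO["sorted_in"] with a user-defined comparison function is gawk-specific, and this is untested against the real data), is to traverse abbr longest-key-first so that "Hall of Barcelona" is replaced before "Barcelona":
# visit longer (multi-word) patterns before shorter ones
function by_length_desc(i1, v1, i2, v2) {
    return length(i2) - length(i1)
}
{
    PROCINFO["sorted_in"] = "by_length_desc"
    for (i in abbr)
        gsub("\\<" i "\\>", abbr[i])
    print > "new_terms.txt"
}
The terms_without_abbreviations.txt list would still pick up words such as hall and of, since that check still works word by word.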

What is the best way to process times in awk and sponge the output to a new column in a csv file?

Pretend there is a simple csv file:
date,miles,time,min
2016-01-01,5.15,0:21:10,0:03:30
2016-01-03,15.30,1:10:00,0:03:45
2016-02-02,08.37,0:31:24,0:03:22
Say I want to add two more columns, in which the H:M:S times are converted to decimal numbers, where 1.0 equals one hour. How can I efficiently achieve this with awk? Currently I pipe a field of this file to another awk command where I use : as a field separator, use some arithmetic (e.g. field 2 divided by 60) to get the decimal numbers, save the results to a file, then use paste to combine the original and derived file. There is an easier way, no?
Since you didn't show us your expected output, this may or may not be what you want:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
if (NR==1) {
tdec = "time_dec"
mdec = "min_dec"
}
else {
split($3,a,/:/); tdec = a[1] + a[2]/60 + a[3]/3600
split($4,a,/:/); mdec = a[1] + a[2]/60 + a[3]/3600
}
print $0, tdec, mdec
}
$ awk -f tst.awk file
date,miles,time,min,time_dec,min_dec
2016-01-01,5.15,0:21:10,0:03:30,0.352778,0.0583333
2016-01-03,15.30,1:10:00,0:03:45,1.16667,0.0625
2016-02-02,08.37,0:31:24,0:03:22,0.523333,0.0561111
but hopefully you get the idea if it's not exactly what you want.
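Since the question mentions sponge: if you want the new columns written back into the same csv rather than into a new file, moreutils' sponge soaks up all of the output before overwriting the input (assuming moreutils is installed):
awk -f tst.awk file | sponge file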

Troubling figuring out one part of this script

I'm completely new to this, so I think this is a fairly easy question.
I have the following script which was given to me to go through our logs and pull out information:
awk '
match($0, /"username":"[^"]*"/) {
split($3, d, "#")
user = substr($0, RSTART + 12, RLENGTH - 17)
split(user, e, "#")
c[e[2] "," d[1] "," e[1]]++
}
END { for(i in c)
printf("%d,%s\n", c[i], i)
}' mycompany.log | sort -t, -k2,2 -k3,3 -k4,4
What this script does is go through the log entries; for any entry that contains a username, it grabs the date, username, organization and the number of unique entries for that user on that date. I pretty much understand how it works to get all of these values except the number of entries per user (I can't figure out where in the script this is done).
Basically right now the output is sorted in columns:
number of entries, organization, date, username
like this:
609,organization,05-22,someuserfromthatorganization
and I want this:
organization,05-22,someuserfromthatorganization,609
But as I mentioned I'm unsure how/where in the script this number is calculated so I can't figure out how to do that.
The associative array c contains the count of entries for each user on a date. e[2] "," d[1] "," e[1] concatenates the organization, date, and username. This is then used as the key into the c array with c[e[2] "," d[1] "," e[1]]. Finally, the ++ increment operator counts how many times that key occurs.
At the end it prints the contents of this array.
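Since the count is simply the value of c[i] when the END block prints, getting the column order you want only means swapping the printf arguments and shifting the sort keys left by one. A small sketch of the change, untested against your real log:
END { for(i in c)
printf("%s,%d\n", i, c[i])
}' mycompany.log | sort -t, -k1,1 -k2,2 -k3,3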

How to merge rows from the same column using unix tools

I have a text file that looks like the following:
1000000 45 M This is a line This is another line Another line
that breaks into that also breaks that has a blank
multiple rows into multiple rows - row below.
How annoying!
1000001 50 F I am another I am well behaved.
column that has
text spanning
multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.
Using GNU awk
gawk '
BEGIN { FIELDWIDTHS="11 6 5 22 22" }
length($1) == 11 {
if ($1 ~ /[^[:blank:]]/) {
if (f1) print_line()
f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
}
else {
f4 = f4" "$4; f5 = f5" "$5
}
}
function rtrim(str) {
sub(/[[:blank:]]+$/, "", str)
return str
}
function print_line() {
gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
}
END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" means any character and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can drop the map and keep just the grep { $_ ne "" } in both places:
use strict;
use warnings;

# work out the field widths from the first line: fields are separated by 2+ spaces
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;

while (<>) {
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        # a non-blank first column starts a new record: print the previous one
        print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
        @cols = @tmp;
    }
    else {
        # continuation line: append each non-empty column to the previous record
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }
}
print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
Using this awk:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
You can write a script in Python that does that. Read each line and call split on it; if the line is a continuation, append its fields to the previous row, and if it is blank, start a new row with the next line. Finally use the csv module to write the result set to file.
Something along the lines of:
import csv

inputFile = open(filename, 'r')              # filename: path of the report to convert
isNewItem = True
results = []
for line in inputFile:
    if line.strip() == '':
        # a blank row separates records; the next line starts a new item
        isNewItem = True
        continue
    temp = line.split()
    if isNewItem or len(results) == 0:
        results.append(temp)
        isNewItem = False
    else:
        # continuation row: glue each column onto the previous row
        lastRow = results[-1]
        combinedRow = []
        for leftColumn, rightColumn in zip(lastRow, temp):
            combinedRow.append(leftColumn + ' ' + rightColumn)
        results[-1] = combinedRow
with open(csvOutputFileName, 'w', newline='') as outFile:    # csvOutputFileName: output path
    csv.writer(outFile).writerows(results)
