Unix noob here again. I'm writing a script for a Unix class I'm taking online, and it is supposed to handle a csv file while excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.
After sorting out my syntax errors, the script runs; however, all it does is print the header before the loop at the end, plus a single value of zero under "Gross Sales", where it doesn't belong.
My script thus far:
I am open to any and all suggestions. However, I should mention again, I don't know what I'm doing, so you may have to explain a bit.
#!/bin/bash
grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'
THE ASSIGNMENT (summary):
Using only awk and sed (or grep for the clean-up portion), write a script that prepares the following reports on the above retail dataset.
First, do some data "clean up" -- you can use awk, sed and/or grep for these:
Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
Any items with a StockCode of "POST" should be deleted.
Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
Delete any rows labeled 'Discount' in the Description (or have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.
Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:
Gross sales for each region (printed in any order). The regions in the file, less Australia, are below. To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total.
France
United Kingdom
Netherlands
Germany
Norway
Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.
Use field widths so the columns are aligned in the output.
The output table should look like this, although the data will produce different results. You can format it any way you choose, but it should be a readable table with aligned columns. Like so:
Country          Gross Sales    Items Under $2.00
---------------------------------------------------------
France           801.86         12
Netherlands      177.60         1
United Kingdom   23144.4        488
Germany          243.48         11
Norway           1919.14        56
A small sample of the csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom
Seriously, Thank you to whoever can help. You all are amazing!
edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...
edit2: okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\',' and now it runs; however, the report data is all incorrect. sigh... I will trudge on.
Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call processing all records in your file. You can do it with three rules.
The first rule simply sets the conditions which, if found in the record (line), cause awk to skip to the next record, ignoring it. According to the description, that would be:
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
Your second rule simply sums Quantity * UnitPrice for each Country and also keeps track of the sales of low-price goods when the UnitPrice is under 2. The a[] array tracks the total sales per Country, and the lpc[] array tracks the low-price goods sold when the UnitPrice is less than $2.00. That can simply be:
{
    a[$NF] += $4 * $6
    if ($6 < 2)
        lpc[$NF] += $4
}
The final END rule just outputs the heading and then the table, formatted into columns. That could be:
END {
    printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
    for (i = 0; i < 57; i++)
        printf "-"
    print ""
    for (i in a)
        printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
That's it. If you put it all together, with your input in file, you would have:
awk -F, '
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
{
    a[$NF] += $4 * $6
    if ($6 < 2)
        lpc[$NF] += $4
}
END {
    printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
    for (i = 0; i < 57; i++)
        printf "-"
    print ""
    for (i in a)
        printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
' file
Example Use/Output
You can just select-copy and middle-mouse-paste the command above into an xterm with the current working directory containing file. Doing so, your results would be:
Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36
Which is similar to the format specified -- though I took the time to decimal-align the Gross Sales for easy reading.
Look things over and let me know if you have further questions.
Note: processing CSV files with awk (or sed or grep) is generally not a good idea IF the values can contain embedded commas within double-quoted fields, e.g.
field1,"field2,part_a,part_b",fields3,...
Embedded separators like this prevent you from choosing a field separator that will correctly parse the file.
If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.
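For example, with a line like the one above, a naive comma split miscounts the fields, while GNU awk's FPAT can treat the quoted field as one unit. This is just a sketch of the gotcha, and the second command assumes gawk:
$ echo 'field1,"field2,part_a,part_b",field3' | awk -F, '{ print NF }'
5
$ echo 'field1,"field2,part_a,part_b",field3' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF }'
3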
I have a file, say xyz.dat, which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (m.dat and o.dat) from the original xyz.dat.
m.dat contains columns 2|4|6, like below, after running some logic on it -
b11|d11|f11
b22|d22|f22
b33|d33|f33
o.dat contains all the columns except 2|4|6, like below, without any change -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge the M and O files to recreate the original xyz.dat format:
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note the column positions can change for another file. I will be given the column positions (in the above example they are 2, 4 and 6), so I need either a generic command to run in a loop to merge the new M and O files, or a single command to which I can pass the column positions and which will copy the columns from the M.dat file and paste them into the O.dat file.
I tried paste, sed and cut but was not able to come up with a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting engine (Python, awk, Perl, or even bash). Tools like paste, sed and cut do not have enough flexibility for these tasks (join may come close, but requires extra work).
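For instance, you can hack this particular 2,4,6 layout together in bash with paste, cut and process substitution, but notice how the command is hard-wired to the column positions, which is exactly the inflexibility mentioned above (file names m.dat/o.dat assumed, as in the script below):
paste -d'|' <(cut -d'|' -f1 o.dat) <(cut -d'|' -f1 m.dat) \
            <(cut -d'|' -f2 o.dat) <(cut -d'|' -f2 m.dat) \
            <(cut -d'|' -f3 o.dat) <(cut -d'|' -f3 m.dat) \
            <(cut -d'|' -f4 o.dat)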
Consider the following awk based script
awk -F'|' -v OFS='|' '
{
    # Read the matching line from o.dat and split it on FS into a[]
    getline s < "o.dat"
    n = split(s, a)
    # Print output; add a[n], or $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
Based on clarification from the OP, it looks like the goal is: given two input files and a list of columns where data should be merged in from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# Or, if the file is marked executable (chmod +x mergeCols):
./mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns will contain the data from a.dat.
Implementation, using awk (create a file named mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS = OFS = "|"
}
NR==1 {
    # Set the column map
    nc = split(COLS, c, ",")
    for (i=1 ; i<=nc ; i++ ) {
        cmap[c[i]] = i
    }
}
{
    # Read one line from merged file, split into tokens in 'a'
    getline s < M
    n = split(s, a)
    # Merge columns using pre-set 'cmap'
    k = 0
    for (i=1 ; i<=NF+nc ; i++ ) {
        # Pick up a column
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i<NF+nc) ? "|" : "\n"
        printf "%s%s", v, sep
    }
}
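For example, with the OP's sample m.dat and o.dat from above (those file names are assumptions here), running it should reproduce the original layout:
$ awk -f mergeCols COLS=2,4,6 M=m.dat o.dat
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3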
I'm fairly new to linux/bash shell and I'm really having trouble printing two values (the highest and lowest) from a particular column in a text file. The file is formatted like this:
Geoff Audi 2:22:35.227
Bob Mercedes 1:24:22.338
Derek Jaguar 1:19:77.693
Dave Ferrari 1:08:22.921
As you can see, the final column is a timing. I'm trying to use awk to print out the highest and lowest timing in the column. I'm really stumped; I've tried:
awk '{print sort -n < $NF}' timings.txt
However, that didn't even seem to sort anything; I just received an output of:
1
0
1
0
...
repeating over and over. It went on for longer, but I didn't want to paste a massive wall of it since you get the point after the first couple of iterations.
My desired output would be:
Min: 1:08:22.921
Max: 2:22:35.227
After question clarifications: if the time field always has the same number of digits in the same places, e.g. h:mm:ss.ss, the solution can be drastically simplified. Namely, we don't need to convert the time to seconds to compare it any more; we can do a simple string/lexicographical comparison:
$ awk 'NR==1 {m=M=$3} {$3<m&&m=$3; $3>M&&M=$3} END {printf("min: %s\nmax: %s",m,M)}' file
min: 1:08:22.921
max: 2:22:35.227
The logic is the same as in the (previous) script below, just using a simpler, string-only comparison for ordering values (determining min/max). We can do that since we know all timings conform to the same format, and if a < b lexicographically (for example "1:22:33" < "1:23:00") we know a is "smaller" than b. (If values are not consistently formatted, lexicographical comparison alone can't order them correctly, e.g. "12:00:00" < "3:00:00" as strings even though 12 hours is more than 3.)
So, on the first value read (first record, NR==1), we set the initial min/max value to the timing read (in the 3rd field). For each record we test if the current value is smaller than the current min, and if it is, we set the new min. Similarly for the max. We use short-circuiting instead of if to make the expressions shorter ($3<m && m=$3 is equivalent to if ($3<m) m=$3). In the END we simply print the result.
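Spelled out without the short-circuit trick, the same logic reads like this (just a sketch, equivalent to the one-liner above):
NR == 1 { m = M = $3 }              # initialize min/max from the first record
{
    if ($3 < m) m = $3              # lexicographically smaller => new min
    if ($3 > M) M = $3              # lexicographically larger  => new max
}
END { printf("min: %s\nmax: %s\n", m, M) }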
Here's a general awk solution that accepts time strings with variable number of digits for hours/minutes/seconds per record:
$ awk '{split($3,t,":"); s=t[3]+60*(t[2]+60*t[1]); if (s<min||NR==1) {min=s;min_t=$3}; if (s>max||NR==1) {max=s;max_t=$3}} END{print "min:",min_t; print "max:",max_t}' file
min: 1:08:22.921
max: 2:22:35.227
Or, in a more readable form:
#!/usr/bin/awk -f
{
    split($3, t, ":")
    s = t[3] + 60 * (t[2] + 60 * t[1])
    if (s < min || NR == 1) {
        min = s
        min_t = $3
    }
    if (s > max || NR == 1) {
        max = s
        max_t = $3
    }
}
END {
    print "min:", min_t
    print "max:", max_t
}
For each line, we convert the time components (hours, minutes, seconds) from the third field to seconds which we can later simply compare as numbers. As we iterate, we track the current min val and max val, printing them in the END. Initial values for min and max are taken from the first line (NR==1).
Given your statements that the time field is actually a duration and the hours component is always a single digit, this is all you need:
$ awk 'NR==1{min=max=$3} {min=(min<$3?min:$3); max=(max>$3?max:$3)} END{print "Min:", min ORS "Max:", max}' file
Min: 1:08:22.921
Max: 2:22:35.227
You don't want to run sort inside of awk (even with the proper syntax).
Try this:
sed 1d timings.txt | sort -k3,3n | sed -n '1p; $p'
where
the first sed will remove the header
sort on the 3rd column numerically
the second sed will print the first and last line
First of all, thank you to the forum members.
I need to find the time difference between two rows of timestamps using awk/shell.
Here is the logfile:
cat business_file
start:skdjh:22:06:2010:10:30:22
sdfnskjoeirg
wregn'wergnoeirnfqoeitgherg
end:siifneworigo:22:06:2010:10:45:34
start:srsdneriieroi:24:06:2010:11:00:45
dfglkndfogn
sdgsdfgdfhdfg
end:erfqefegoieg:24:06:2010:11:46:34
oeirgoeirg\
start:sdjfhsldf:25:07:2010:12:55:43
wrgnweoigwerg
ewgjw]egpojwepr
etwasdf
gwdsdfsdf
fgpwj]pgojwth
wtr
wtwrt
end:dgnoeingwoit:25:07:2010:01:42:12
===========
The above logfile is a kind of API log file. Some rows start with "start" or "end", and on those rows everything from the 3rd column to the end of the row is a timestamp (take the delimiter as ":").
We have to find the time difference between the start and end timestamps of each consecutive start/end pair of rows.
Hope the question is clear; please let me know if you need more explanation.
Thx
Srinivas
Since the timestamp is separated by the same field separator as the rest of the line, this does not even require manual splitting. Simply
awk -F : 'function timestamp() { return mktime($5 " " $4 " " $3 " " $6 " " $7 " " $8) } $1 == "start" { t_start = timestamp() } $1 == "end" { print(timestamp() - t_start) }' filename
works and prints the time difference in seconds (note that mktime is a GNU awk extension, so this needs gawk). The code is
# return the timestamp for the current row, using the pre-split fields
function timestamp() {
    return mktime($5 " " $4 " " $3 " " $6 " " $7 " " $8)
}
# in start lines, remember the timestamp
$1 == "start" {
    t_start = timestamp()
}
# in end lines, print the difference.
$1 == "end" {
    print(timestamp() - t_start)
}
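For example, saving the log shown above as business_file and running the one-liner against it should print one difference in seconds per start/end pair (note the negative value for the last pair, discussed below):
$ awk -F : 'function timestamp() { return mktime($5 " " $4 " " $3 " " $6 " " $7 " " $8) } $1 == "start" { t_start = timestamp() } $1 == "end" { print(timestamp() - t_start) }' business_file
912
2749
-40411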
If you want to format the time difference in another manner, see this handy reference of relevant functions. By the way, the last block in your example has several hours negative length. You may want to look into that.
Addendum: In case that's because of the am/pm thing some countries have, this opens up a can of worms in that all timestamps have two possible meanings (because the log file does not seem to include the information whether it's an am or pm timestamp), so you have an unsolvable problem with durations of more than half a day. If you know that durations are never longer than half a day and the end time is always after the start time, you might be able to hack your way around it with something like
$1 == "end" {
t_end = timestamp();
if(t_end < t_start) {
t_end += 43200 # add 12 hours
}
print(t_end - t_start)
}
...but in that case the log file format is broken and should be fixed. This sort of hackery is not something you want to rely on in the long term.
I want to figure out how to sort a list of names (FS=" ") into two piles: one with 2 fields, and the other with 3+ fields, but only show the total number of records in both lists in one sentence. Is this even possible? If so, how?
(I am new to awk and scripting. There is loads of info regarding how to show sums of fields, but not how to split up a list into two files by NF and then show the totals.)
Here goes
awk 'NF >= 2{x[NF==2?2:3]++};
END{for (i in x) printf "%d records with %s fields\n", x[i], i==3?"3+":"2"}' file
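If you also want the two piles actually written out to separate files (the question mentions splitting the list into two files), a variant along these lines would do it; the output file names here are just placeholders:
awk 'NF == 2 { two++;   print > "two_fields.txt" }
     NF >= 3 { three++; print > "three_plus_fields.txt" }
     END { printf "%d records with 2 fields, %d records with 3+ fields\n", two, three }' file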
Assuming numbers are in the last field
awk '{ sum += $NF } END { print sum }' file.txt