Unix script, awk and csv handling - bash

Unix noob here again. I'm writing a script for a Unix class I'm taking online; it's supposed to process a CSV file, excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.
After sorting out my syntax errors, the script runs; however, all it does is print the header before the loop at the end, plus one single value of zero under "Gross Sales", where it doesn't belong.
My script thus far:
I am open to any and all suggestions. However, I should mention again, I don't know what I'm doing, so you may have to explain a bit.
#!/bin/bash
grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'
THE ASSIGNMENT Summary:
Using only awk and sed (or grep for the clean-up portion) write a script that prepares the following reports on the above retail dataset.
First, do some data "clean up" -- you can use awk, sed and/or grep for these:
Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
Any items with a StockCode of "POST" should be deleted.
Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
Delete any rows labeled 'Discount' in the Description (or have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.
Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:
Gross sales for each region (printed in any order). The regions in the file, less Australia, are below. To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total.
France
United Kingdom
Netherlands
Germany
Norway
Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.
Use field widths so the columns are aligned in the output.
The output table should look like this, although the data will produce different results. You can format the table any way you choose but it should be in a readable table with aligned columns. Like so:
Country           Gross Sales   Items Under $2.00
---------------------------------------------------------
France                 801.86                  12
Netherlands            177.60                   1
United Kingdom        23144.4                 488
Germany                243.48                  11
Norway                1919.14                  56
A small sample of the csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom
Seriously, Thank you to whoever can help. You all are amazing!
edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...
edit2: okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\','; now it runs, but the report data is all incorrect. sigh... I will trudge on.

Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call to awk processing all records in your file. You can do it with three rules.
The first rule simply sets the conditions which, if found in the record (line), cause awk to skip to the next line, ignoring the record. According to the description, that would be:
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
Your second rule simply sums the Quantity * UnitPrice for each Country and also keeps track of the sales of low-priced goods when the UnitPrice < 2. The a[] array tracks the total sales per Country, and the lpc[] array counts the low-priced items sold when the UnitPrice is less than $2.00. That can simply be:
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
The final END rule outputs the heading and then the table, formatted in column form. That could be:
END {
printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
That's it. If you put it all together, providing your input in file, you would have:
awk -F, '
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
END {
printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
' file
Example Use/Output
You can just select-copy and middle-mouse-paste the command above into an xterm whose current working directory contains file. Doing so, your results would be:
Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36
Which is similar to the format specified -- though I took the time to decimal-align the sales figures for easy reading. (Germany shows no count because none of its items in the sample are under $2.00.)
Look things over and let me know if you have further questions.
note: processing CSV files with awk (or sed or grep) is generally not a good idea IF the values can contain embedded commas within double-quoted fields, e.g.
field1,"field2,part_a,part_b",fields3,...
Embedded separators like these prevent you from choosing a field separator that will correctly parse the file.
If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.
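To see the gotcha concretely, here is a quick sketch (the input line is made up): a naive comma split miscounts the fields as soon as a quoted field contains a comma.

```shell
# A quoted field with an embedded comma fools a plain -F, split:
# the line has 3 logical fields, but awk counts 4.
printf 'field1,"part_a,part_b",field3\n' | awk -F, '{ print NF }'
```

GNU awk's FPAT variable (which describes what a field is, rather than what separates fields) is one workaround; for heavily quoted data, a dedicated CSV parser is the safer choice.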

Related

How to count the number of hotels in every country using awk?

I have a dataset hotels.csv with columns: doc_id, hotel_name, hotel_url, street, city, state, country, zip, class, price, num_reviews, CLEANLINESS, ROOM, SERVICE, LOCATION, VALUE, COMFORT, overall_ratingsource
And I want to count the number of hotels in every country. How can I do it using awk?
I can count the number of hotels for China or the USA:
cat /home/data/hotels.csv | awk -F, '$7=="China"{n+=1} END {print n}'
But how to do it for every country?
Parsing CSV with awk is usually not a good idea. If some of your fields contain commas, for instance, it will not work as expected. Anyway, associative arrays are usually convenient for this kind of task:
awk -F, '{num[$7]++} END{for(country in num) print country, num[country]}' /home/data/hotels.csv
Note: cat file | awk ... is useless. Simply pass the file to awk.
If you have the columns as the first row, you can start processing the data starting from the second row, use the name of the country as the array key and increment the value when encountering the same key.
awk -F, 'NR > 1 {
ary[$7]++
}
END {
for(item in ary) print item, ary[item]
}
' /home/data/hotels.csv

Awk printing out smallest and highest number, in a time format

I'm fairly new to linux/bash shell and I'm really having trouble printing two values (the highest and lowest) from a particular column in a text file. The file is formatted like this:
Geoff Audi 2:22:35.227
Bob Mercedes 1:24:22.338
Derek Jaguar 1:19:77.693
Dave Ferrari 1:08:22.921
As you can see the final column is a timing, I'm trying to use awk to print out the highest and lowest timing in the column. I'm really stumped, I've tried:
awk '{print sort -n < $NF}' timings.txt
However that didn't even seem to sort anything, I just received an output of:
1
0
1
0
...
Repeating over and over; it went on for longer, but I didn't want to paste a massive wall of it when you get the point after the first couple of iterations.
My desired output would be:
Min: 1:08:22.921
Max: 2:22:35.227
After question clarifications: if the time field always has a same number of digits in the same place, e.g. h:mm:ss.ss, the solution can be drastically simplified. Namely, we don't need to convert time to seconds to compare it anymore, we can do a simple string/lexicographical comparison:
$ awk 'NR==1 {m=M=$3} {$3<m && (m=$3); $3>M && (M=$3)} END {printf("min: %s\nmax: %s\n",m,M)}' file
min: 1:08:22.921
max: 2:22:35.227
The logic is the same as in the (previous) script below, just using a simpler string-only based comparison for ordering values (determining min/max). We can do that since we know all timings will conform to the same format, and if a < b (for example "1:22:33" < "1:23:00") we know a is "smaller" than b. (If values are not consistently formatted, then by using the lexicographical comparison alone, we can't order them, e.g. "12:00:00" < "3:00:00".)
So, on the first value read (first record, NR==1), we set the initial min/max value to the timing read (in the 3rd field). For each record we test whether the current value is smaller than the current min, and if it is, we set the new min. Similarly for the max. We use short-circuiting instead of an if to make the expressions shorter ($3<m && (m=$3) is equivalent to if ($3<m) m=$3). In the END we simply print the result.
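The same logic can be spread out into separate pattern-action pairs, which some find easier to read (same string comparison, same output):

```shell
awk '
NR == 1 { m = M = $3 }                        # seed min/max from the first record
$3 < m  { m = $3 }                            # new minimum (lexicographic compare)
$3 > M  { M = $3 }                            # new maximum
END     { printf "min: %s\nmax: %s\n", m, M }
' timings.txt
```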
Here's a general awk solution that accepts time strings with variable number of digits for hours/minutes/seconds per record:
$ awk '{split($3,t,":"); s=t[3]+60*(t[2]+60*t[1]); if (s<min||NR==1) {min=s;min_t=$3}; if (s>max||NR==1) {max=s;max_t=$3}} END{print "min:",min_t; print "max:",max_t}' file
min: 1:22:35.227
max: 10:22:35.228
Or, in a more readable form:
#!/usr/bin/awk -f
{
split($3, t, ":")
s = t[3] + 60 * (t[2] + 60 * t[1])
if (s < min || NR == 1) {
min = s
min_t = $3
}
if (s > max || NR == 1) {
max = s
max_t = $3
}
}
END {
print "min:", min_t
print "max:", max_t
}
For each line, we convert the time components (hours, minutes, seconds) from the third field to seconds which we can later simply compare as numbers. As we iterate, we track the current min val and max val, printing them in the END. Initial values for min and max are taken from the first line (NR==1).
Given your statements that the time field is actually a duration and the hours component is always a single digit, this is all you need:
$ awk 'NR==1{min=max=$3} {min=(min<$3?min:$3); max=(max>$3?max:$3)} END{print "Min:", min ORS "Max:", max}' file
Min: 1:08:22.921
Max: 2:22:35.227
You don't want to run sort inside of awk (even with the proper syntax).
Try this:
sort -k3,3 timings.txt | sed -n '1p; $p'
where
sort orders the lines on the 3rd column (a plain string sort is enough here, since every timing has the same h:mm:ss.sss format; a numeric sort would stop reading at the first colon and mis-order the 1:xx times)
sed prints the first (min) and last (max) line

Change date and data cells in .csv file progressively

I have a file that I'm trying to get ready for my boss in time for his manager's meeting tomorrow morning at 8:00 AM GMT-8. I want to retroactively change the dates in non-consecutive rows in this .csv file: (truncated)
,,,,,
,,,,,sideshow
,,,
date_bob,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bob_available,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383
bob_used,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312
,,,
date_mel,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
mel_available,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537
mel_used,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159
,,,
date_sideshow-ws2,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
sideshow-ws2_available,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239
sideshow-ws2_used,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441
,,,
,,,,,simpsons
,,,
date_bart,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bart_available,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559
bart_used,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117
,,,
date_homer,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
homer_available,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799
homer_used,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877
,,,
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
lisa_available,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899
lisa_used,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777
In other words a row that now reads:
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
would desirably read:
date_lisa,09-04-14,09-05-14,09-06-14,09-07-14,09-08-14,09-09-14,09-10-14,09-11-14,09-12-14,09-13-14,09-14-14,09-15-14,09-16-14,09-17-14
I'd like to make the daily available numbers smaller at the beginning and then progressively bigger day by day. This means the used rows will have to be proportionately smaller at the beginning and then grow progressively bigger in lock step as the available rows change.
Not by a large amount; don't make it look obvious, just a few GB here and there. I plan to make pivot tables and graphs out of this, so it has to vary a little. BTW the numbers are all in MB, as I generated them using df -m.
Thanks in advance if anyone can help me.
The following awk does what you need:
awk -F, -v OFS=, '
/^date/ {
split ($2, date, /-/);
for (i=2; i<=NF; i++) {
$i = date[1] "-" sprintf ("%02d", date[2] - NF + i) "-" date[3]
}
}
/available|used/ {
for (i=2; i<=NF; i++) {
$i = int (($i*i)/NF)
}
}1' csv
Set the input and output field separator to ,.
On lines that start with date, we split the second column to extract the month, day, and year parts.
We then iterate from the second column to the end of the line, setting each column to a newly calculated date, counting backwards from the last (current) date using the total number of fields.
This has the caveat of not rolling over month boundaries correctly.
For the data (available/used) lines, we iterate from the second column to the end of the line and scale each value so it is progressively greater than the previous one, with the last field matching the original value.
All other lines remain as-is and get printed along with the modified lines.
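If the month/year rollover matters, one workaround (a sketch, assuming GNU date is available; the end date below is just an example) is to let date do the calendar arithmetic and generate the 14 dates directly:

```shell
# Print the 14 dates ending at $end, oldest first; GNU date handles
# month and year boundaries for us.
end=2014-09-17
for i in $(seq 13 -1 0); do
    date -d "$end - $i days" +%m-%d-%y
done
```

Those dates could then be substituted into the date_* rows instead of computing the day field arithmetically in awk.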

How to Write a Unix Shell to Sum the Values in a Row Against Each Unique Column (e.g., how to calculate total votes for each distinct candidate)

In its basic form, I am given a text file with state vote results from the 2012 Presidential Election, and I need to write a one-line shell script in Unix to determine which candidate won. The file has various fields, one of which is CandidateName and another is TotalVotes. Each record in the file is the result from one precinct within the state, so there are many records for any given CandidateName. What I'd like to do is sort the data by CandidateName and then sum the TotalVotes for each unique CandidateName (so each sum starts at a unique CandidateName and ends before the next unique CandidateName).
No need for sorting with awk and its associative arrays. For convenience, the data file format can be:
precinct1:candidate name1:732
precinct1:candidate2 name:1435
precinct2:candidate name1:9920
precinct2:candidate2 name:1238
Thus you need to create totals of field 3 based on field 2 with : as the delimiter.
awk -F: '{sum[$2] += $3} END { for (name in sum) { print name " = " sum[name] } }' data.file
Some versions of awk can sort internally; others can't. I'd use the sort program to process the results:
sort -t= -k2nb
(field separator is the = sign; the sort is on field 2, which is a numeric field, possibly with leading blanks).
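Putting the two steps together (using the sample precinct:candidate:votes format from above), the winner is simply the last line once the totals are sorted ascending:

```shell
# Sample input in the format shown above
cat > data.file <<'EOF'
precinct1:candidate name1:732
precinct1:candidate2 name:1435
precinct2:candidate name1:9920
precinct2:candidate2 name:1238
EOF

# Sum per candidate, sort numerically on the total, keep the top line
awk -F: '{sum[$2] += $3} END {for (name in sum) print name " = " sum[name]}' data.file \
| sort -t= -k2,2nb \
| tail -n 1
# -> candidate name1 = 10652
```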
Not quite one line, but will work
$ cat votes.txt
Colorado Obama 50
Colorado Romney 20
Colorado Gingrich 30
Florida Obama 60
Florida Romney 20
Florida Gingrich 30
script
while read loc can num
do
    if ! [ "${!can}" ]
    then
        cans+=($can)
    fi
    (( $can += num ))
done < votes.txt

for can in ${cans[*]}
do
    echo $can ${!can}
done
output
Obama 110
Romney 40
Gingrich 60

SPLIT file by Script (bash, cpp) - numbers in columns

I have files with some columns filled with numbers (floats). I need to split these files according to the value in one of the columns. This means that when a row
a b c
in my file has a value c that fulfils 0.05 <= c <= 0.1, then create a file named c and copy into it the whole rows that fulfil the c-condition.
Is this possible? I can do something small with bash and awk, and a little with C++ too.
I have searched for some solutions, but all I can manage is to sort the data and read the first number of each line.
Thank you
Jane
As you mentioned awk, the basic rule in awk is 'match a line (either by default or with a regexp, condition or line number)' AND 'do something because you found a match'.
awk uses values like $1, $2, $3 to indicate which column in the current line of data it is looking at. $0 refers to the whole line. So ...
awk '
BEGIN{
afile="afile.txt"
bfile="bfile.txt"
cfile="cfile.txt"
}
{
# test c value between .05 and .1
if ($3 >= 0.05 && $3 <= 0.1) print $0 > cfile
}' inputData
Note that I am testing the value of the third column (c in your example). You can use $2 to test b column, etc.
If you don't know about the sort of condition test I have included ($3 >= 0.05 && $3 <= 0.1), you'll have some learning ahead of you.
Questions in the form of 1. I have this input, 2. I want this output. 3. (but) I'm getting this output, 4. with this code .... {code here} .... have a much better chance of getting a reasonable response in a reasonable amount of time ;-)
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
If I understand your requirements correctly:
awk '{print > $3}' file ...
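One caveat with {print > $3}: awk keeps every output file open, and non-GNU awks can hit a small limit on simultaneously open files when column 3 takes many distinct values. A variant (a sketch; the input here is made up) that appends and closes after each write avoids that:

```shell
# Append each line to the file named after column 3, closing the file
# immediately so we never hold many descriptors open at once.
awk '{ print >> $3; close($3) }' file
```

Since this appends, remove any stale output files (e.g. 0.05) before re-running.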
