How to count the number of hotels in every country using awk? - bash

I have a dataset hotels.csv with columns: doc_id, hotel_name, hotel_url, street, city, state, country, zip, class, price, num_reviews, CLEANLINESS, ROOM, SERVICE, LOCATION, VALUE, COMFORT, overall_ratingsource
I want to count the number of hotels in every country. How can I do that using awk?
I can count the number of hotels for China or the USA:
cat /home/data/hotels.csv | awk -F, '$7=="China"{n+=1} END {print n}'
But how to do it for every country?

Parsing CSV with awk is usually not a good idea. If some of your fields contain commas, for instance, it will not work as expected. Anyway, associative arrays are usually convenient for this kind of task:
awk -F, '{num[$7]++} END{for(country in num) print country, num[country]}' /home/data/hotels.csv
Note: cat file | awk ... is a useless use of cat. Simply pass the file name to awk.

If the first row holds the column names, start processing from the second row: use the country name as the array key and increment its value each time the same key is encountered.
awk -F, 'NR > 1 {
ary[$7]++
}
END {
for(item in ary) print item, ary[item]
}
' /home/data/hotels.csv
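To see the pattern in action, here is a self-contained run on a few invented rows (the file name and data below are made up for illustration; the real hotels.csv has more columns, but country is still field 7):

```shell
# Invented sample with country in field 7, as in the question.
cat > hotels_sample.csv <<'EOF'
doc_id,hotel_name,hotel_url,street,city,state,country,zip
1,Grand Hotel,url1,street1,Beijing,,China,100000
2,City Inn,url2,street2,Shanghai,,China,200000
3,Lone Star Lodge,url3,street3,Austin,TX,USA,73301
EOF

# Skip the header (NR > 1), count field 7, print one line per country.
awk -F, 'NR > 1 { n[$7]++ } END { for (c in n) print c, n[c] }' hotels_sample.csv | sort
```

The pipe through sort is only there because awk's for-in traversal order is unspecified; this prints China 2 and USA 1.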

Related

Unix script, awk and csv handling

Unix noob here again. I'm writing a script for a Unix class I'm taking online, and it is supposed to handle a csv file while excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.
After sorting out my syntax errors, the script runs; however, all it does is print the header before the final loop, plus one single value of zero under "Gross Sales", where it doesn't belong.
I am open to any and all suggestions. However, I should mention again: I don't know what I'm doing, so you may have to explain a bit. My script thus far:
#!/bin/bash
grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'
THE ASSIGNMENT Summary:
Using only awk and sed (or grep for the clean-up portion) write a script that prepares the following reports on the above retail dataset.
First, do some data "clean up" -- you can use awk, sed and/or grep for these:
Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
Any items with a StockCode of "POST" should be deleted.
Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
Delete any rows labeled 'Discount' in the Description (or have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.
Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:
Gross sales for each region (printed in any order). The regions in the file, less Australia, are below. To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total.
France
United Kingdom
Netherlands
Germany
Norway
Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.
Use field widths so the columns are aligned in the output.
The output table should look like this, although the data will produce different results. You can format the table any way you choose but it should be in a readable table with aligned columns. Like so:
Country          Gross Sales   Items Under $2.00
---------------------------------------------------------
France              801.86           12
Netherlands         177.60            1
United Kingdom    23144.4           488
Germany             243.48           11
Norway             1919.14           56
A small sample of the csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom
Seriously, Thank you to whoever can help. You all are amazing!
edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...
edit2: okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\',' and now it runs, however, the report data is all incorrect. sigh... I will trudge on.
Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call processing all records in your file. You can do it with three rules.
The first rule sets the conditions which, if found in the record (line), cause awk to skip to the next line, ignoring the record. According to the description, that would be:
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
Your second rule sums Quantity * UnitPrice for each Country and also keeps track of low-priced goods. The a[] array tracks the total sales per Country, and the lpc[] array counts the goods sold whose UnitPrice is less than $2.00. That can simply be:
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
The final END rule prints the heading and then the table, formatted in columns. That could be:
END {
printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
That's it. If you put it all together, with your input in file, you would have:
awk -F, '
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
END {
printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
' file
Example Use/Output
You can just select-copy and middle-mouse-paste the command above into an xterm with the current working directory containing file. Doing so, your results would be:
Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36
Which is similar to the format specified -- though I took the time to decimal-align the Gross Sales column for easier reading.
Look things over and let me know if you have further questions.
note: processing CSV files with awk (or sed or grep) is generally not a good idea IF the values can contain embedded commas within double-quoted fields, e.g.
field1,"field2,part_a,part_b",fields3,...
Embedded commas prevent you from choosing a single field separator that will correctly parse the file.
If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.

Search duplicates in a column, add value

Convert file input.csv.
id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith#example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza, Manager Commanding Officer,,
(Done!) Need to update column name. Name format: first letter of name/surname uppercase and all other letters lowercase.
(Done!) Need to update column email with domain #abc. Email format: first letter of the name and the full surname, lowercase.
(Not done) Emails with the same ID should contain numbers. Example: Name Max Houston, email mhouston1#examples.com etc.
#!/bin/bash
inputfile="accounts.csv"
echo "id,location_id,organization_id,service_id,name,title,email,department" > accounts_new.csv
while IFS="," read -r rec_column1 rec_column2 rec_column3 rec_column4 rec_column5 rec_column6 rec_column7 rec_column8
do
surnametemp="${rec_column5:0:1}$(echo $rec_column5 | awk '{print $2}')"
namesurname=$(echo $rec_column5 | sed 's! .!\U&!g')
echo $rec_column1","$rec_column2","$rec_column3","$rec_column4","$namesurname","$rec_column6",""${surnametemp,,}#abc.com"","$rec_column8 >>accounts_new.csv
done < <(tail -n +2 $inputfile)
How can I do that?
Outputfile
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy2#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
Task specification
This task would be much easier if specified otherwise:
add email iterator to every email
or
add email iterator to second,third... occurrence
But it was specified:
add email iterator to every email if email is used multiple times.
This specification requires double iteration through lines, thus making this task more difficult.
The right tool
My rule of thumb is: use basic command-line tools (grep, sed, etc.) for simple tasks, awk for moderate tasks, and python for complicated tasks. In this case (double iteration over lines) I would use python. However, there was no python tag on the question, so I used awk.
Solution
<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN {
OFS = ","
};
{
if ($7 == "") {
split($5,name," ");
firstname = substr(tolower(name[1]),1,1);
lastname = tolower(name[2]);
domain="#abc.com";
$7=firstname "." lastname domain;
};
emailcounts[$7]++;
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8;
emails[iter]=$7;
}
END {
for (iter in immutables) {
if (emailcounts[emails[iter]] > 1) {
emailiter[emails[iter]]++;
email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]);
} else {
email=emails[iter]
};
print immutables[iter], email
}
}'
Results
id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith#example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1#abc.com
14,9,,,David Genesy,Library Director,,d.genesy2#abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza#abc.com
Explanation
-vFPAT='[^,]*|"[^"]*"' parse the CSV, treating quoted strings with commas as single fields
$7=firstname "." lastname domain substitute the email field
emailcounts[$7]++ count email occurrences
iter iterator that preserves the original line order
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8 save the non-email fields for the second loop
emails[iter]=$7 save the email for the second loop
for (iter in immutables) iterate over the keys of the immutables dictionary
if (emailcounts[emails[iter]] > 1) change the email only if it occurs more than once
emailiter[emails[iter]]++ increment the per-address iterator
email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]) splice the iterator into the email
print immutables[iter], email print the result
With the input (mailscsv) file as:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
You can use awk like so:
awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1' mailscsv
Set the field delimiter to , and create an array keyed by email address, incrementing its value each time an address is encountered. If an address has been seen more than once, split it on "#" into another array mail1, then set $7 to the part before the "#" (mail1[1]), followed by the current count for that address, then "#" and the part after it (mail1[2]). If the address has only been seen once, the line is simply left as is. The trailing 1 prints every line.
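As a sanity check, the one-liner can be run over the sample input shown above (file name mailscsv, as in the answer):

```shell
# Recreate the sample input from the answer.
cat > mailscsv <<'EOF'
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
EOF

# Second and later occurrences of an address get the counter spliced in.
awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1' mailscsv
```

Only the second dgenesy#abc.com line changes, becoming dgenesy2#abc.com; all other lines pass through untouched.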

Comma delimited but need to exclude an enclosed field filled with comma's

I have a CSV data file say with 5 columns separated by comma.
c1, c2, col3, c4, c5
stack, over, upon, true, yes
ab, zy, pq,rs,tu,vw,ef, four, ivef
Viewing the csv file in Excel will clearly show those 5 columns, and the 3rd field having the following values: pq,rs,tu,vw,ef.
However, how do I get awk to print col3 ($3) as "pq,rs,tu,vw,ef"? Right now it sees $3 as just pq, and the remaining fields shift out of place.
Updated csv sample:
Movie ID,Remit ID,Property ID,Movie Uploader,Channel ID,Channel Display Name,Video Title,Views Count,Status,Claim Origin,Claim Type,Is Affiliate Uploaded,Is Premium,Reference Movie ID,Policy,Applied Policy,Claim Date,Movie Upload Date,Custom ID,EWRC,Title,Authors,Notes,Asset Labels
G4pelo5M9XI,ka-9foAPFkg,N103145385208693,originalkaraoke,UCnF6KQeanPgBRyEMeFmrNnA,Karaoke,Motel Fornia - Karaoke,6702511,Active,Descriptive Search,AudioVisual,No,No,,,Block the following countries: US; Track in all countries except: US,2017/01/25,2011/12/30,fW1aUnBbwL8,,MOTEL FORNIA - BLOCK,,,
uZ94drkfB5c,WIMPvt22JY8,B103945385208693,,UCBa3saYRQTO8WzsKacgaJNQ,Best Songs Backing Tracks,"Motel Fornia - Bass Backing Track with scale, chords and lyrics",1913,Active,Descriptive Search,AudioVisual,No,No,,,Track in all countries except: US; Block the following countries: US,2017/01/25,2016/01/19,fW1aUqBzwL2,,MOTEL FORNIA - BLOCK,,,
2p1te0kAE2A,HMR7M2SjJJw,N103945385208693,,UCLAvPQhYyx8yUNMG0AkPYuw,Jordy Nalgas,HOSTEL NARNIA,751,Active,Descriptive Search,AudioVisual,No,No,,,Block the following countries: US; Track in all countries except: US,2017/01/25,2016/09/11,fW1dUnBhwL8,,HOSTEL NARNIA - BLOCK,,,
and we need to extract the value of the Views Count column.
In GNU awk you can use FPAT to tell awk what a valid field looks like, rather than what separates fields.
You can use:
awk -v col='Views Count' -v FPAT='"[^"]*"|[^,]*' '
NR==1{for (h=1; h<=NF; h++) if ($h == col) break; next} {print $h}' file.csv
6702511
1913
751
If you also want the column heading, remove next from the awk script above.

How to get two sums of number of fields from a list of names: one with 2 fields, the other with 3 or more?

I want to figure out how to sort a list of names (FS=" ") into two piles: one with 2 fields, and the other with 3 + fields, but only show the total number of records in both lists in one sentence. Is this even possible? If so, how?
(I am new to awk and scripting. There is plenty of information on how to sum fields, but not on how to split a list into two piles by NF and then report the totals.)
Here goes
awk 'NF >= 2{x[NF==2?2:3]++};
END{for (i in x) printf "%d records with %s fields\n", x[i], i==3?"3+":"2"}' file
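A quick self-contained check with three invented names (two with 2 fields, one with 3):

```shell
# Invented name list: two 2-field names and one 3-field name.
cat > names_sample.txt <<'EOF'
John Smith
Mary Jane Watson
Ada Lovelace
EOF

# Bucket each record under key 2 or 3, then report both counts.
awk 'NF >= 2 { x[NF==2 ? 2 : 3]++ }
END { for (i in x) printf "%d records with %s fields\n", x[i], i==3 ? "3+" : "2" }' names_sample.txt
```

This reports 2 records with 2 fields and 1 with 3+ fields (the two lines may appear in either order, since for-in order is unspecified).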
Assuming the numbers are in the last field:
awk '{ sum += $NF } END { print sum }' file.txt

Totalling a column based on another column

Say I have a text file items.txt
Another Purple Item:1:3:01APR13
Another Green Item:1:8:02APR13
Another Yellow Item:1:3:01APR13
Another Orange Item:5:3:04APR13
Where the 2nd column is price and the 3rd is quantity. How could I loop through this in bash such that each unique date had a total of price * quantity?
try this awk one-liner:
awk -F: '{v[$NF]+=$2*$3}END{for(x in v)print x, v[x]}' file
result:
01APR13 6
04APR13 15
02APR13 8
EDIT: sorting
As I commented, there are two approaches to sorting the output by date; I'll just take the simpler one ^_^:
kent$ awk -F: '{ v[$NF]+=$2*$3}END{for(x in v){"date -d\""x"\" +%F"|getline d;print d,x,v[x]}}' file|sort|awk '$0=$2" "$3'
01APR13 6
02APR13 8
04APR13 15
Take a look at bash's associative arrays.
You can create a map from dates to totals, then go over the lines, compute price * quantity, and add it to the value currently in the map (inserting it if not present).
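A minimal sketch of that approach (requires bash 4+ for associative arrays; items.txt is the file from the question):

```shell
#!/usr/bin/env bash
# Tally price * quantity per date with a bash associative array.
declare -A totals

# Fields are colon-separated: item:price:quantity:date
while IFS=: read -r item price qty date; do
    totals[$date]=$(( ${totals[$date]:-0} + price * qty ))
done < items.txt

# Print one "date total" line per unique date (order unspecified).
for d in "${!totals[@]}"; do
    printf '%s %d\n' "$d" "${totals[$d]}"
done
```

On the sample above this yields 01APR13 6, 02APR13 8, and 04APR13 15, matching the awk one-liner.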
