Unix noob here again. I'm writing a script for a Unix class I'm taking online, and it is supposed to handle a csv file while excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.
After sorting out my syntax errors, the script runs; however, all it does is print the header before the loop at the end, plus a single value of zero under "Gross Sales", where it doesn't belong.
My script thus far:
I am open to any and all suggestions. However, I should mention again, I don't know what I'm doing, so you may have to explain a bit.
#!/bin/bash
grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'
THE ASSIGNMENT (summary):
Using only awk and sed (or grep for the clean-up portion), write a script that prepares the following reports on the above retail dataset.
First, do some data "clean up" -- you can use awk, sed and/or grep for these:
Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
Any items with a StockCode of "POST" should be deleted.
Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
Delete any rows labeled 'Discount' in the Description (or have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.
Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:
Gross sales for each region (printed in any order). The regions in the file, less Australia, are below. To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total.
France
United Kingdom
Netherlands
Germany
Norway
Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.
Use field widths so the columns are aligned in the output.
The output table should look like this, although the data will produce different results. You can format the table any way you choose, but it should be readable with aligned columns. Like so:
Country          Gross Sales   Items Under $2.00
---------------------------------------------------------
France           801.86        12
Netherlands      177.60        1
United Kingdom   23144.4       488
Germany          243.48        11
Norway           1919.14       56
A small sample of the csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom
Seriously, thank you to whoever can help. You all are amazing!
edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...
edit2: okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\',' and now it runs, however, the report data is all incorrect. sigh... I will trudge on.
Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call processing all records in your file. You can do it with three rules.
The first rule simply sets the conditions which, if found in the record (line), cause awk to skip to the next record, ignoring the current one. According to the description, that would be:
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
Your second rule sums Quantity * UnitPrice for each Country and also keeps count of low-priced goods. The a[] array tracks total sales per Country, and the lpc[] array counts items sold whose UnitPrice is under $2.00. That can simply be:
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
The final END rule outputs the heading and then the table, formatted in columns. That could be:
END {
printf "%-20s%-20s%s", "Country", "Group Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
That's it. If you put it all together and provide your input in file, you would have:
awk -F, '
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
{
a[$NF] += $4 * $6
if ($6 < 2)
lpc[$NF] += $4
}
END {
printf "%-20s%-20s%s", "Country", "Group Sales", "Items Under $2.00\n"
for (i = 0; i < 57; i++)
printf "-"
print ""
for (i in a)
printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
' file
Example Use/Output
You can just select-copy and middle-mouse-paste the command above into an xterm with the current working directory containing file. Doing so, your results would be:
Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36
That's similar to the format specified -- though I took the time to decimal-align the Gross Sales for easy reading. (Germany shows no count because neither of its sample items is under $2.00, so lpc["Germany"] is unset and prints as an empty string.)
Look things over and let me know if you have further questions.
note: processing CSV files with awk (or sed or grep) is generally not a good idea IF the values can contain embedded commas within double-quoted fields, e.g.
field1,"field2,part_a,part_b",fields3,...
This prevents problems with choosing a field-separator that will correctly parse the file.
If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.
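To see that gotcha concretely, split the example line above on commas and watch the quoted field break apart (a throwaway demonstration, nothing more):
$ echo 'field1,"field2,part_a,part_b",field3' | awk -F, '{print NF, $2}'
5 "field2
awk reports five fields instead of three, and $2 is only the first fragment of the quoted value.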
In its basic form, I am given a text file with state vote results from the 2012 Presidential Election, and I need to write a one-line shell script in Unix to determine which candidate won. The file has various fields, one of which is CandidateName and another is TotalVotes. Each record in the file holds the results from one precinct within the state, so there are many records for any given CandidateName. What I'd like to do is sort the data by CandidateName and then sum TotalVotes for each unique CandidateName (so each sum starts at a unique CandidateName and ends before the next one).
With awk and its associative arrays, there is no need for sorting. For convenience, the data file format can be:
precinct1:candidate name1:732
precinct1:candidate2 name:1435
precinct2:candidate name1:9920
precinct2:candidate2 name:1238
Thus you need to create totals of field 3 based on field 2 with : as the delimiter.
awk -F: '{sum[$2] += $3} END { for (name in sum) { print name " = " sum[name] } }' data.file
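Run against the four sample lines above, that prints (in unspecified order, since awk's for (name in sum) traversal order is arbitrary):
candidate name1 = 10652
candidate2 name = 2673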
Some versions of awk can sort internally; others can't. I'd use the sort program to process the results:
sort -t= -k2nb
(field separator is the = sign; the sort is on field 2, which is a numeric field, possibly with leading blanks).
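Putting the two together, assuming the same data.file:
awk -F: '{sum[$2] += $3} END { for (name in sum) { print name " = " sum[name] } }' data.file |
sort -t= -k2nb
which, for the sample data, prints the lowest total first and the winner last:
candidate2 name = 2673
candidate name1 = 10652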
Not quite one line, but will work
$ cat votes.txt
Colorado Obama 50
Colorado Romney 20
Colorado Gingrich 30
Florida Obama 60
Florida Romney 20
Florida Gingrich 30
script
while read -r loc can num
do
    # first time we see this candidate, remember the name
    if [ -z "${!can}" ]
    then
        cans+=("$can")
    fi
    # accumulate votes in a variable named after the candidate
    (( $can += num ))
done < votes.txt
for can in "${cans[@]}"
do
    echo "$can" "${!can}"
done
output
Obama 110
Romney 40
Gingrich 60
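For comparison, the associative-array approach from the earlier awk answer does collapse this to a true one-liner on the same votes.txt (the state column is simply ignored):
awk '{sum[$2] += $3} END { for (can in sum) print can, sum[can] }' votes.txt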
I am trying to resolve locations given as lat and long in one file against a couple of named fields in another file.
I have one file that is like this:
f1--f2--f3--------f4-------- f5---
R 20175155 41273951N078593973W 18012
R 20175156 41274168N078593975W 18000
R 20175157 41274387N078593976W 17999
R 20175158 41274603N078593977W 18024
R 20175159 41274823N078593978W 18087
Each character is in a specific place, so I need to define fields based on character positions:
f1 chars 18-21; f2 chars 22-25; f3 chars 26-35; f4 chars 36-45; f5 chars 62-66.
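Those offsets map directly onto awk's substr, if that turns out to be easier than sed; a minimal sketch, with the file name as a placeholder:
awk '{
    f1 = substr($0, 18, 4)     # chars 18-21
    f2 = substr($0, 22, 4)     # chars 22-25
    f3 = substr($0, 26, 10)    # chars 26-35
    f4 = substr($0, 36, 10)    # chars 36-45
    f5 = substr($0, 62, 5)     # chars 62-66
    print f1, f2, f3, f4, f5
}' file1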
I have another, much larger CSV file whose fields 11, 12, and 13 correspond to f3, f4, and f5.
awk -F',' '{print $11, $12, $13}'
41.46703821 -078.98476926 519.21
41.46763555 -078.98477791 524.13
41.46824123 -078.98479015 526.67
41.46884129 -078.98480615 528.66
41.46943371 -078.98478482 530.50
I need to find the closest match to file 1 fields f3 and f4 in file 2 fields 11 and 12.
When the closest match is found, I need to insert fields 1, 2, 3, 4, 5 from file 1 into file 2 as fields 16, 17, 18, 19, 20.
As you can see, the format is slightly different. File 1 breaks down like this:
File 1
f3-------f4--------
DDMMSSdd DDDMMSSdd
41273951N078593973W
File 2
f11-------- f12---------
DD dddddddd DDD dddddddd
41.46703821 -078.98476926
N means f3 is a positive number, W means f4 is a negative number.
I changed file 1 with sed -- a ridiculous one-liner that works great... (better way??? -- see the note after the converted output below)
cat $file1 |sed 's/.\{17\}//' |sed 's/\(.\{4\}\)\(.\{4\}\)\(.\{9\}\)\(.\)\(.\{9\}\)\(.\)\(.\{16\}\)\(.\{5\}\)/\1,\2,\3,\4,\5,\6,\8/'|sed 's/\(.\{10\}\)\(.\{3\}\)\(.\{2\}\)\(.\{2\}\)\(.\{2\}\)\(.\{3\}\)\(.\{3\}\)\(.\{2\}\)\(.*\)/\1\2,\3,\4.\5\6\7,\8\9/'|sed 's/\(.\{31\}\)\(.\{2\}\)\(.*\)/\1,\2.\3/'
2017,5155, 41,27,39.51,N,078,59,39.73,W,18012
2017,5156, 41,27,41.68,N,078,59,39.75,W,18000
2017,5157, 41,27,43.87,N,078,59,39.76,W,17999
2017,5158, 41,27,46.03,N,078,59,39.77,W,18024
2017,5159, 41,27,48.23,N,078,59,39.78,W,18087
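One modest answer to the "better way???" aside: the leading cat isn't needed, and the three sed calls can be chained into a single invocation with -e expressions -- the same substitutions, just merged (untested beyond the sample lines above):
sed -e 's/.\{17\}//' \
    -e 's/\(.\{4\}\)\(.\{4\}\)\(.\{9\}\)\(.\)\(.\{9\}\)\(.\)\(.\{16\}\)\(.\{5\}\)/\1,\2,\3,\4,\5,\6,\8/' \
    -e 's/\(.\{10\}\)\(.\{3\}\)\(.\{2\}\)\(.\{2\}\)\(.\{2\}\)\(.\{3\}\)\(.\{3\}\)\(.\{2\}\)\(.*\)/\1\2,\3,\4.\5\6\7,\8\9/' \
    -e 's/\(.\{31\}\)\(.\{2\}\)\(.*\)/\1,\2.\3/' "$file1"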
Now I have to convert the formats. (RESOLVED -- see below. The problem was that the numbers were rounded off too far; I need at least six decimal places.)
awk -F',' '{
    for (i = 1; i <= NF; i++) {
        if (i <= 2) printf ($i",");
        else if (i == 3 && $6 == "S") printf("-"$3+($4/60)+($5/3600)",");
        else if (i == 3 && $6 == "N") printf($3+($4/60)+($5/3600)",");
        else if (i == 7 && $10 == "W") printf("-"$7+($8/60)+($9/3600)",");
        else if (i == 7 && $10 == "E") printf($7+($8/60)+($9/3600)",");
        if (i == 11) printf ($i"\n")
    }
}'
2017,5155,41.461,-78.9944,18012
2017,5156,41.4616,-78.9944,18000
2017,5157,41.4622,-78.9944,17999
2017,5158,41.4628,-78.9944,18024
2017,5159,41.4634,-78.9944,18087
That's where I'm at.
RESOLVED THIS
I need to get the number format to have at least 6 decimal places from this formula.
printf($3+($4/60)+($5/3600))
Added "%.8f"
printf("%.8f", $3+($4/60)+($5/3600))
The next issue will be to match file 1 fields f3 and f4 to the closest match in file 2 fields f11 and f12.
Any ideas?
Then I will need to calculate the distance between the fields.
In Excel the formula would be like this:
=ATAN2(COS(lat1)*SIN(lat2)-SIN(lat1)*COS(lat2)*COS(lon2-lon1), SIN(lon2-lon1)*COS(lat2))
What could I use for that calculation?
UPDATE:
I am looking at short distances for the matching locations, so I was thinking of applying something simple like Pythagoras' theorem for the nearest match, maybe even with fewer decimal places. It should be many times faster.
Maybe something like this:
x = (lon2-lon1) * Math.cos((lat1+lat2)/2);
y = (lat2-lat1);
d = Math.sqrt(x*x + y*y) * R;
Then I could do the heavy calculations required for greater accuracy after the final file is updated.
Thanks
You can't do the distance calculation after you perform the closest match: closest is defined by comparison of the distance values. Awk can evaluate the formula that you want (looks like great-circle distance?). Take a look at this chapter to see what you need.
The big problem is finding the nearest match. Write an awk script that takes a single line of file 1 and outputs the lines of file 2 with an extra column: the distance between the pair of points according to your distance formula. If you sort that output numerically (sort -n), your closest match is at the top. Then you need a script that loops over each line in file 1, calls your awk script, uses head -n1 to pull out the closest match, and outputs it in the format you want.
This is all possible in bash and awk, but it would be a much simpler script in Python. Depends on which you prefer.
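A minimal sketch of that loop-and-sort pipeline, assuming file 1 has already been converted to decimal degrees (latitude and longitude as the 3rd and 4th comma-separated fields, as in the converted output above), file 2 has them in fields 11 and 12, and using the asker's equirectangular approximation for the ranking; the file names are placeholders:
#!/bin/bash
# For each point in converted file 1, append an approximate distance
# to every row of file 2, sort on it, and keep the nearest row.
while IFS=, read -r f1 f2 lat lon f5
do
    awk -F',' -v lat="$lat" -v lon="$lon" '
    {
        pi = 3.14159265358979
        # equirectangular approximation: x = dLon * cos(mean lat), y = dLat;
        # the units cancel out, which is fine when only the ranking matters
        x = (lon - $12) * cos((lat + $11) * pi / 360)
        y = (lat - $11)
        print sqrt(x * x + y * y) "," $0
    }' file2.csv | sort -t, -k1,1n | head -n1
done < file1_converted.csv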
Problem
I need to insert text of arbitrary length (# of lines) into a template while maintaining an exact number of total lines.
Sample source data file:
You have a hold available for pickup as of 2012-01-13:
Title: Really Long Test Title Regarding Random Gibberish. Volume 1, A-B, United States
and affiliated territories, United Nations, countries of the world
Author: Barrel Roll Morton
Title: How to Compromise Free Speech Using Everyday Tools. Volume XXVI
Author: Lamar Smith
#end-of-record
You have a hold available for pickup as of 2012-01-13:
Title: Selling Out Democracy For Fun and Profit. Volume 1, A-B, United States
Author: Lamar Smith
Copy: 12
#end-of-record
Sample Template (simplified for brevity):
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY-ZIP%>
<%TITLES GO HERE%>
<%STORE-NAME%>
<%STORE-ADDR%>
<%STORE-CTY-ZIP%>
At this point I use bash's 'mapfile' to load the source file record by record using the /^#end-of-record/ regex... so far so good. Then I pull predictable aspects of each record according to the line on which they occur, and process the info using a series of sed search-and-replace statements.
The Hang-Up
So the problem is the unknown number of 'title' records that could occur. How can I accommodate an unknown number of titles and always have output of precisely 65 lines?
Given that title records always occur starting on line 8, I can pull the titles easily with:
sed -n '8,$p' test-match.txt
However, how can I insert this within an allotted space, e.g. between <%CUST-CTY-ZIP%> and <%STORE-NAME%>, without pushing the store info out of place in the template?
My idea so far:
-first send the customer info through:
Ex.
sed 's/<%CUST-NAME%>/Benedict Arnold/' template.txt
-Append title records
???
-Then the store/location info
sed "s/<%STORE-NAME%>/Smith's House of Greasy Palms/" template.txt
I have code and functions for this stuff if interested, but this post is 'windy' as it is.
Just need help with inserting the title records while maintaining the position of the following text and keeping the total line count at 65.
UPDATE
I've decided to change tactics. I'm going to create placeholders in the template for all available lines between the customer and store info -- then:
Test if the line is null in the source.
If yes -- replace the placeholder with null, leaving the line ending. Line number maintained.
If not null -- again, replace the placeholder with the text, maintaining line numbers and line endings in the template.
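A minimal sketch of that placeholder idea, assuming the template carries numbered markers like <%TITLE-1%> through <%TITLE-5%> (the marker names, the count of 5, and the file names are all hypothetical here):
#!/bin/bash
# Pull the title lines from the record, then fill each numbered
# placeholder; a placeholder with no matching title is replaced by
# nothing, leaving an empty line, so the total line count never changes.
mapfile -t titles < <(sed -n '8,$p' test-match.txt)
script=''
for i in 1 2 3 4 5
do
    line=${titles[i-1]:-}        # empty when there is no such title
    line=${line//\//\\/}         # escape slashes for sed (a fuller version would also escape & and \)
    script+="s/<%TITLE-$i%>/$line/;"
done
sed "$script" template.txt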
Eventually, I plan to invest some time looking closer at Triplee's suggestion regarding Perl. The Perl way really does look simpler and easier to maintain if I'm going to be stuck with this project long term.
This might work for you:
cat <<! >titles.txt
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> Title 1
> Title 2
> Title 3
> Title 4
> Title 5
> Title 6
> !
cat <<! >template.txt
> <%CUST-NAME%>
> <%CUST-ADDR%>
> <%CUST-CTY-ZIP%>
>
> <%TITLES GO HERE%>
>
> <%STORE-NAME%>
> <%STORE-ADDR%>
> <%STORE-CTY-ZIP%>
> !
sed '1,7d;:a;$!{N;ba};:b;G;s/\n[^\n]*//5g;tc;bb;:c;s/\n/\\n/g;s|.*|/<%TITLES GO HERE%>/c\\&|' titles.txt |
sed -f - template.txt
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY-ZIP%>
Title 1
Title 2
Title 3
Title 4
Title 5
<%STORE-NAME%>
<%STORE-ADDR%>
<%STORE-CTY-ZIP%>
This pads or squeezes the titles to 5 lines (the s/\n[^\n]*//5g); if you want fewer or more, change the 5 to the desired number.
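For the curious, the first sed emits this one-line script, which the second sed (via -f -) then reads from standard input and applies to the template:
/<%TITLES GO HERE%>/c\Title 1\nTitle 2\nTitle 3\nTitle 4\nTitle 5
(GNU sed turns the \n escapes in the c\ text into embedded newlines.)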
This will give you five lines of output regardless of the number of lines in titles.txt:
sed -n '$s/$/\n\n\n\n\n/;8,$p' test-match.txt | head -n 5
Another version:
sed -n '8,$N; ${s/$/\n\n\n\n\n/;s/\(\([^\n]*\n\)\{4\}\).*/\1/p}' test-match.txt
Use one less than the number of lines you want (4 in this example will cause 5 lines of output).
Here's a quick proof of concept using Perl formats. If you are unfamiliar with Perl, I guess you will need some additional help with how to get the values from two different files, but it's quite doable, of course. Here, the data is simply embedded into the script itself.
I set the $titles format to 5 lines instead of the proper value (58 or something?) in order to make this easier to try out in a terminal window, and to demonstrate that the output is indeed truncated when it is longer than the allocated space.
#!/usr/bin/perl
use strict;
use warnings;
use vars (qw($cust_name $cust_addr $cust_cty_zip $titles
$store_name $store_addr $store_cty_zip));
my $fmtline = '#' . '<' x 78;
my $titlefmtline = '^' . '<' x 78;
my $empty = '';
my $fmt = join ("\n$fmtline\n", 'format STDOUT = ',
'$cust_name', '$cust_addr', '$cust_cty_zip', '$empty') .
("\n$titlefmtline\n" . '$titles') x 5 . #58
join ("\n$fmtline\n", '', '$empty',
'$store_name', '$store_addr', '$store_cty_zip');
#print $fmt;
eval "$fmt\n.\n";
$titles = <<____HERE;
Title: Really Long Test Title Regarding Random Gibberish. Volume 1, A-B, United States
and affiliated territories, United Nations, countries of the world
Author: Barrel Roll Morton
Title: How to Compromise Free Speech Using Everyday Tools. Volume XXVI
Author: Lamar Smith
____HERE
# Preserve line breaks -- ^<< will fill lines, but preserves line breaks on \r
$titles =~ s/\n/\r\n/g;
while (<DATA>) {
chomp;
($cust_name, $cust_addr, $cust_cty_zip, $store_name, $store_addr, $store_cty_zip)
= split (",");
write STDOUT;
}
__END__
Charlie Bravo,23 Alpa St,Delta ND 12345,Spamazon,98 Spamway,Atlanta GA 98765
The use of $empty to get an empty line is pretty ugly, but I wanted to keep the format as regular as possible. I'm sure it could be avoided, but at the cost of additional code complexity IMHO.
If you are unfamiliar with Perl, the use strict is a complication, but a practical necessity; it requires you to declare your variables either with use vars or my. It is a best practice which helps immensely if you try to make changes to the script.
Here documents with <<HERE work like in shell scripts; it allows you to create a multi-line string easily.
The x operator is for repetition; 'string' x 3 is 'stringstringstring' and ("list") x 3 is ("list", "list", "list"). The dot operator is string concatenation; that is, "foo" . "bar" is "foobar".
Finally, the DATA filehandle allows you to put arbitrary data in the script file itself after the __END__ token which signals the end of the program code. For reading from standard input, use <> instead of <DATA>.
I am finding the difference between two columns in a file, like this:
cat "trace-0-dir2.txt" | awk '{print expr $2-$1}' | sort
this gives me values like :
-1.28339e+09
-1.28339e+09
-1.28339e+09
-1.28339e+09
I want to avoid the rounding off and want the exact value. How can this be achieved?
FYI, trace-0-dir2.txt contains:
1283453524.342134 65337.141749 10 2
1283453524.556784 65337.388047 11 2
1283453524.556794 65337.411165 12 2
1283453524.556806 65337.435947 13 2
1283453524.556811 65337.435989 14 2
1283453524.556816 65337.453931 15 2
1283453524.771522 65337.484866 16 2
The printf function can get you the formatting you need. You don't need expr, and you don't need cat: awk can do the calculation itself, and you can invoke it directly on the file.
You can alter the 20.20 to any number based on the format you are looking for.
[jaypal:~/Temp] cat file0
1283453524.342134 65337.141749 10 2
1283453524.556784 65337.388047 11 2
1283453524.556794 65337.411165 12 2
1283453524.556806 65337.435947 13 2
1283453524.556811 65337.435989 14 2
1283453524.556816 65337.453931 15 2
1283453524.771522 65337.484866 16 2
[jaypal:~/Temp] awk '{ printf("%20.20f\n", $2-$1)}' file0
-1283388187.20038509368896484375
-1283388187.16873693466186523438
-1283388187.14562892913818359375
-1283388187.12085914611816406250
-1283388187.12082219123840332031
-1283388187.10288500785827636719
-1283388187.28665614128112792969
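If six decimal places are enough, the same idea with a smaller precision (values rounded from the run above):
[jaypal:~/Temp] awk '{ printf("%.6f\n", $2-$1)}' file0
-1283388187.200385
-1283388187.168737
-1283388187.145629
-1283388187.120859
-1283388187.120822
-1283388187.102885
-1283388187.286656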
From the man page:
Field Width:
An optional digit string specifying a field width; if the output string has fewer characters than the field width it will be blank-padded on the left (or right, if the left-adjustment indicator has been given) to make up the field width (note that a leading zero is a flag, but an embedded zero is part of a field width);
Precision:
An optional period, `.', followed by an optional digit string giving a precision which specifies the number of digits to appear after the decimal point, for e and f formats, or the maximum number of characters to be printed from a string; if the digit string is missing, the precision is treated as zero;