Change date and data cells in .csv file progressively - bash

I have a file that I'm trying to get ready for my boss in time for his manager's meeting tomorrow morning at 8:00 AM GMT-8. I want to retroactively change the dates in non-consecutive rows in this .csv file: (truncated)
,,,,,
,,,,,sideshow
,,,
date_bob,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bob_available,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383
bob_used,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312
,,,
date_mel,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
mel_available,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537
mel_used,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159
,,,
date_sideshow-ws2,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
sideshow-ws2_available,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239
sideshow-ws2_used,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441
,,,
,,,,,simpsons
,,,
date_bart,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bart_available,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559
bart_used,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117
,,,
date_homer,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
homer_available,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799
homer_used,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877
,,,
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
lisa_available,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899
lisa_used,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777
In other words, a row that now reads:
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
would ideally read:
date_lisa,09-04-14,09-05-14,09-06-14,09-07-14,09-08-14,09-09-14,09-10-14,09-11-14,09-12-14,09-13-14,09-14-14,09-15-14,09-16-14,09-17-14
I'd like to make the daily available numbers smaller at the beginning and have them get progressively bigger day by day. This will mean that the used rows will have to be proportionately smaller at the beginning and then get progressively bigger, in lock step with the available rows.
Not by a large amount; don't make it look obvious, just a few GB here and there. I plan to make pivot tables and graphs out of this, so it has to vary a little. BTW, the numbers are all in MB, as I generated them using df -m.
Thanks in advance if anyone can help me.

The following awk does what you need:
awk -F, -v OFS=, '
/^date/ {
    split($2, date, /-/)
    for (i = 2; i <= NF; i++) {
        $i = date[1] "-" sprintf("%02d", date[2] - NF + i) "-" date[3]
    }
}
/available|used/ {
    for (i = 2; i <= NF; i++) {
        $i = int(($i * i) / NF)
    }
}
1' csv
Set the input and output field separators to ,.
For all lines that start with date, we split the second column into its month, day, and year parts.
We then iterate from the second column to the end of the line and set each column to a newly calculated date, working backwards from the current date using the total number of fields.
All other lines remain as-is and get printed along with the modified lines.
Caveat: this does not roll over month boundaries correctly.
For the data rows (available and used) we iterate from the second column to the end of the line and scale each value by i/NF, so the values grow progressively and the last field keeps its original value.
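If the dates need to roll over a month boundary correctly (the caveat noted above), a GNU awk variant using mktime and strftime could handle it. This is a minimal sketch, assuming gawk is available and that the two-digit years mean 20YY:
awk -F, -v OFS=, '
/^date/ {
    # parse MM-DD-YY from the last column and build an epoch timestamp at noon
    # (noon avoids off-by-one dates around daylight-saving transitions)
    split($NF, d, /-/)
    base = mktime("20" d[3] " " d[1] " " d[2] " 12 00 00")
    # walk backwards one day per column; the final column keeps the original date
    for (i = 2; i <= NF; i++)
        $i = strftime("%m-%d-%y", base - (NF - i) * 86400)
}
1' csv
mktime and strftime are GNU awk extensions, so this will not work with mawk or a strictly POSIX awk.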

Related

Unix script, awk and csv handling

Unix noob here again. I'm writing a script for a Unix class I'm taking online, and it is supposed to handle a csv file while excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.
After sorting out my syntax errors, the script runs; however, all it does is print the header before the loop at the end, and one single value of zero under "Gross Sales", where it doesn't belong.
My script thus far:
I am open to any and all suggestions. However, I should mention again, I don't know what I'm doing, so you may have to explain a bit.
#!/bin/bash
grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'
THE ASSIGNMENT Summary:
Using only awk and sed (or grep for the clean-up portion) write a script that prepares the following reports on the above retail dataset.
First, do some data "clean up" -- you can use awk, sed and/or grep for these:
Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
Any items with a StockCode of "POST" should be deleted.
Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
Delete any rows labeled 'Discount' in the Description (or have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.
Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:
Gross sales for each region (printed in any order). The regions in the file, less Australia, are below. To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total.
France
United Kingdom
Netherlands
Germany
Norway
Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.
Use field widths so the columns are aligned in the output.
The output table should look like this, although the data will produce different results. You can format the table any way you choose but it should be in a readable table with aligned columns. Like so:
Country           Gross Sales    Items Under $2.00
---------------------------------------------------------
France            801.86         12
Netherlands       177.60         1
United Kingdom    23144.4        488
Germany           243.48         11
Norway            1919.14        56
A small sample of the csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom
Seriously, thank you to whoever can help. You all are amazing!
edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...
edit2: okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\',' and now it runs, however, the report data is all incorrect. sigh... I will trudge on.
Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call processing all records in your file. You can do it with three rules.
The first rule simply sets the conditions that, if found in the record (line), cause awk to skip to the next line, ignoring the record. According to the description, that would be:
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
The second rule simply sums the Quantity * Unit Price for each Country and also keeps track of low-price goods where the Unit Price < 2. The a[] array tracks the total sales per Country, and the lpc[] array tracks the quantity of low-price goods sold when the Unit Price is less than $2.00. That can simply be:
{
    a[$NF] += $4 * $6
    if ($6 < 2)
        lpc[$NF] += $4
}
The final END rule prints the heading and then outputs the table in column form. That could be:
END {
    printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
    for (i = 0; i < 57; i++)
        printf "-"
    print ""
    for (i in a)
        printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
That's it. If you put it all together, with your input in file, you would have:
awk -F, '
FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
{
    a[$NF] += $4 * $6
    if ($6 < 2)
        lpc[$NF] += $4
}
END {
    printf "%-20s%-20s%s", "Country", "Gross Sales", "Items Under $2.00\n"
    for (i = 0; i < 57; i++)
        printf "-"
    print ""
    for (i in a)
        printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
}
' file
Example Use/Output
You can just select-copy and middle-mouse-paste the command above into an xterm whose current working directory contains file. Doing so, your results would be:
Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36
This is similar to the format specified -- though I took the time to decimal-align the Gross Sales column for easy reading.
Look things over and let me know if you have further questions.
note: processing CSV files with awk (or sed or grep) is generally not a good idea IF the values can contain embedded commas within double-quoted fields, e.g.
field1,"field2,part_a,part_b",fields3,...
Embedded commas like this prevent you from choosing a field separator that will correctly parse the file.
If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.
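If embedded commas do show up, GNU awk (4.0 or later) has the FPAT variable, which defines what a field looks like instead of what separates fields. A minimal sketch, assuming gawk and simple double-quoting with no escaped quotes inside fields:
gawk '
BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" }   # a field is either comma-free text or a double-quoted string
NR > 1 { print $3 }                       # e.g. prints the Description even if it contains commas
' online_retail.csv
This only helps when the quoting is simple; fully general CSV (escaped quotes, embedded newlines) still needs a real CSV parser.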

Subtracting Start TIME Values from Finish TIME Values and adding the values as a new column in CSV File in Unix

I have a CSV file (output from a SQL query). It has Start Time and Finish Time values in different columns. I need to get the difference between Start Time and Finish Time and generate an HTML report based on that difference. For this, I wanted to include a new column that will hold the result of "Finish Time" - "Start Time". The columns are as below.
The time format is as follows:
START TIME: 2018-11-08 01:45:39.0
FINISH TIME: 2018-11-06 06:48:20.0
I used the code below, but I am not sure whether it's returning correct values. Any help on this will be appreciated.
Below are the first 3 lines of my CSV file:
DESCRIPTION,SCHEDULE,JOBID,CLASSIFICATION,STARTTIME,FINISHTIME,NEXTRUNSTART,SYSTEM,CREATIONDATETIME,
DailyClearance,Everyday,XXXXXX, Standard,2018-11-08 01:59:59.0,2018-11-08 02:00:52.0,CAK-456,018-11-08 04:28:18,
Miscellinious,Everyday,XXXXXX, standart,2018-11-08 02:59:59.0,2018-11-08 03:29:39.0,2018-11-09 03:00:00.0,CAT-251,2018-11-08 04:28:18,
And this is my attempt
awk 'NR==1 {$7 = "DIFFMIN"} NR > 1 { $7 = $5 - $6} 1' <inputfile.csv
This might be of assistance to you. The idea is to use GNU awk which has time functions.
awk 'BEGIN { FS = OFS = "," }
(NR==1) { print $0 OFS "DURATION"; next }
{
    tstart = $5; tend = $6
    gsub(/[-:]/, " ", tstart); tstart = mktime(tstart)
    gsub(/[-:]/, " ", tend);   tend   = mktime(tend)
    $(NF+1) = tend - tstart
    print
}'
This should add the extra column. The time will be expressed in seconds.
The idea is to select the two columns and convert them into seconds since epoch (1970-01-01T00:00:00). This is done using the mktime function which expects a string of the form YYYY MM DD hh mm ss. That is why we first perform a substitution. Once we have the seconds since epoch for the start and end-time, we can just subtract them to get the duration in seconds.
note: there might be some problems during daylight saving time. This depends on the settings of your system.
note: subsecond accuracy is ignored.
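As a quick sanity check of the conversion step, this minimal sketch (GNU awk again) shows what the gsub/mktime pair does to one of the timestamps from the sample:
gawk 'BEGIN {
    ts = "2018-11-08 01:59:59.0"
    gsub(/[-:]/, " ", ts)     # ts is now "2018 11 08 01 59 59.0"
    print mktime(ts)          # seconds since the epoch; the exact value depends on your timezone
}'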

Timestamp Subtraction Using Bash-Utils

Hey, I am trying to calculate the difference between the timestamps in a file, where I always want to subtract the earlier timestamp from the later one, using bash/bash-utils. This is how the file looks:
14:11:56.953700000,172.20.10.1
14:25:49.233263000,172.20.10.1
Now the issue is that I want to leave that huge fractional-seconds number and the IP out of the calculation.
I can put them in csv or any data file needed.
Could you please try the following and let me know if it helps you.
awk -F'[.,]' '
FNR==1 {
    split($1, time, ":")
    sec = time[1]*3600 + time[2]*60 + time[3]
}
FNR==2 {
    split($1, time1, ":")
    sec1 = time1[1]*3600 + time1[2]*60 + time1[3]
    seconds = (sec1 - sec) % 60
    min = sprintf("%d", (sec1 - sec) / 60)
    printf("%s %s\n", min " min", seconds " sec")
}' Input_file
Output will be as follows.
13 min 53 sec
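If the file can contain more than two lines and you want the difference between each consecutive pair of timestamps, a variant along these lines should work (a sketch, untested against your real data):
awk -F'[.,]' '
{
    # splitting on [.,] keeps only HH:MM:SS in $1, dropping the fractional seconds and the IP
    split($1, t, ":")
    sec = t[1]*3600 + t[2]*60 + t[3]
    if (NR > 1) {
        diff = sec - prev
        printf "%d min %d sec\n", diff/60, diff%60
    }
    prev = sec
}' Input_file
Like the original, this assumes all timestamps fall within the same day.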

Edit fields in csv files using bash

I have a bunch of csv files that need "cleaning".
Specifically, there is a column that contains timestamp values, however some lines have a value of '1' instead.
What I wish to do is replace those 1's with the last valid (timestamp) value, i.e. replace the value on line i with that of line i-1.
I provide a sample of the file
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042, 1,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
So in this example, the 1 must be replaced with 20/07/2015 09:40:00. I tried it using awk but couldn't nail it.
Assuming no commas in the other fields, an awk program like this should work:
BEGIN { FS = OFS = "," }
$3!=1 { prev = $3 }
$3==1 { $3 = prev }
{ print }
Warning: this is untested code.
The first line sets the field separator to a comma, for both input and output. The second line saves the timestamp of every row that has a timestamp in the third field. The third line writes the most recently saved timestamp to every row that doesn't have a timestamp in the third field. And the fourth line writes every input line, whether modified or not, to the output.
Let me know how you get on.
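For example, saving the program as fix_col3.awk and running it over your sample (file names here are just for illustration) should produce something like:
awk -f fix_col3.awk sample.csv
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
Only the third line changes; the leading space before the 1 also disappears because awk rebuilds the record when a field is reassigned.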

How to get two sums of number of fields from a list of names: one with 2 fields, the other with 3 or more?

I want to figure out how to sort a list of names (FS=" ") into two piles: one with 2 fields and the other with 3+ fields, but only show the total number of records in both lists in one sentence. Is this even possible? If so, how?
(I am new to awk and scripting. There is loads of info on how to show sums of fields, but not on how to split a list into two files by NF and then show the totals.)
Here goes
awk 'NF >= 2{x[NF==2?2:3]++};
END{for (i in x) printf "%d records with %s fields\n", x[i], i==3?"3+":"2"}' file
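For example, given a hypothetical names file containing:
John Smith
Mary Jane Watson
Peter Parker
it would print something like the following (the for (i in x) loop does not guarantee the order of the two lines):
2 records with 2 fields
1 records with 3+ fields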
Assuming numbers are in the last field
awk '{ sum += $NF } END { print sum }' file.txt
