Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way, using bash or awk scripting, to tidy it up and remove all useless zeros? By useless I mean: this data will be used for line charts on web pages, but reading the entire CSV file in the browser via JavaScript/jQuery etc. is very slow, so it would be more efficient to eliminate the useless zeros before uploading the file. If I simply remove all the zeros, the lines more or less jump from peak to peak instead of drawing real lines from zero up to some larger value and back down to zero, followed by a gap until the next value greater than zero.
As you can see, there are 3 groups in the data. Any time there are 3 identical values in a row for, say, GRP1, I'd like to remove the middle (2nd) one. In reality this should work for values greater than zero as well: if the same value were found every 10 seconds for, say, 10 rows in a row, it would be good to leave both ends in place and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk prior to reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or, another expected result, this time using 0 instead of 3 as the repeated consecutive value, for GRP2:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
@karakfa's answer gets me close. I like it, but after applying the awk to one unique group and then eliminating some duplicates that also showed up for some reason, I still end up with portions similar to this:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
Would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
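To make the intent concrete, here is a rough sketch of the kind of per-group run collapsing I have in mind, keyed on fields 3 and 4 and comparing field 6 (only a sketch to illustrate the desired behaviour, not something I have battle-tested):

awk -F, '
{
    key = $3 FS $4                         # group identity
    if (key in prev && $6 == prev[key])
        held[key] = $0                     # same value as the previous line of this group: hold it
    else {
        if (key in held) { print held[key]; delete held[key] }   # close the previous run with its last line
        print                              # value changed (or first line for this group): print it
    }
    prev[key] = $6
}
END { for (k in held) print held[k] }      # flush any run still open at end of file
' file

With interleaved groups the collapsed lines come out in the order the runs close, so the result may need a final sort -t, -k1,1 -k2,2 to get back into strict date/time order.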
That's one ill-placed question but I'll take a crack at the title, if you don't mind:
$ awk -F, ' {
      if($3 OFS $4 OFS $6 in first)
          last[$3 OFS $4 OFS $6]=$0
      else
          first[$3 OFS $4 OFS $6]=$0 }
    END {
      for(i in first) {
          print first[i]
          if(i in last)
              print last[i] }
    }' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically it keeps the first and last (if one exists) occurrence of each unique combination of the 3rd, 4th and 6th fields.
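One caveat: for(i in first) iterates in an unspecified order, which is why the output above comes out grouped by key rather than by time. If the chart needs the rows back in date/time order, the output can be piped through sort, e.g. (keepends.awk is just a hypothetical file holding the program above):

awk -F, -f keepends.awk file | sort -t, -k1,1 -k2,2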
Edit: In light of the word consecutive, how about this awful hack:
$ awk -F, '
    (p!=$3 OFS $4 OFS $6) {
        if(NR>1 && lp<(NR-1))
            print q
        print $0
        lp=NR }
    {
        p=$3 OFS $4 OFS $6
        q=$0 }
    ' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and output for the second data:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
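Note that with the groups interleaved as in the first sample, the key changes on every line, so nothing gets removed (which is what the first output above shows). Assuming the hack is saved as hack.awk (a hypothetical file name), one way around that is to run it once per group and re-sort the result:

for g in GRP1 GRP2 GRP3; do
    awk -F, -v g="$g" '$3 == g' file | awk -F, -f hack.awk
done | sort -t, -k1,1 -k2,2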
Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 - takes into account only those lines which don't have 0 as their last field value
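For example, to do the trimming on disk before uploading (trimmed.csv is just a placeholder name):

awk -F, '$NF!=0' inputfile > trimmed.csv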
awk to the rescue!
$ awk -F'[,:]' '
    $4==pt+10 && $NF==p {pt=$4; pl=$0; next}   # same value exactly 10 seconds later: hold this line
    pl {print pl}                              # run just ended: emit the held last line of the run
    {pt=$4; p=$NF} 1                           # remember seconds field and value, print the current line
  ' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
I have a .csv file that is formatted thus:
myfile.csv
Date,Timestamp,Data1,Data2,Data3,Data4,Data5,Data6
20130730,22:08:51.244,APPLES,Spain,67p,blah,blah
20130730,22:08:51.244,PEARS,Spain,32p,blah,blah
20130730,22:08:51.708,APPLES,France,102p,blah,blah
20130730,22:10:62.108,APPLES,Spain,67p,blah,blah
20130730,22:10:68.244,APPLES,Spain,67p,blah,blah
I wish to feed in a timestamp which most likely will NOT match up perfectly to the millisecond with those in the file, and find the preceding line that matches a particular grep search.
so e.g. something like;
cat myfile.csv | grep 'Spain' | grep 'APPLES' | grep -B1 "22:09"
should return
20130730,22:08:51.244,APPLES,Spain,67p,blah,blah
But thus far I can only get it to work with exact timestamps in the grep. Is there a way to get it to treat these as a time series? (I am guessing that's the issue here: grep is doing pure pattern matching and, not unreasonably, failing to find a match.)
I also have a fancy solution using awk:
awk -F ',' -v mytime="2013 07 30 22 09 00" '
BEGIN {tlimit=mktime(mytime); lastline=""}
{
l_y=substr($1,1,4); l_m=substr($1,5,2); l_d=substr($1,7,2);
split($2,l_hms,":"); l_hms[3]=int(l_hms[3]);
line_time=mktime(sprintf("%d %d %d %d %d %d", l_y, l_m, l_d, l_hms[1], l_hms[2], l_hms[3]));
if (line_time>tlimit) exit; lastline=$0;
}
END {if (lastline=="") print $0; else print lastline}' myfile.csv
It works by building a timestamp from each line with awk's time function mktime (a gawk extension). I also assume that $1 is the date.
On the first line, you provide the timestamp of the time limit you want (here I chose 2013 07 30 22 09 00). You have to write it in the format mktime expects: YYYY MM DD hh mm ss. The awk program begins by building the timestamp of your time limit. Then, for each line, it extracts the year, month and day from $1 (line 4) and the time of day from $2 (line 5). As mktime only takes whole seconds, I truncate the seconds (you could round instead with int(l_hms[3]+0.5)); this is also the place to apply whatever other approximation of the timestamp you want, such as discarding the seconds entirely. On line 6, I build the timestamp from the six date fields just extracted. Finally, on line 7, I compare timestamps and exit as soon as the time limit is exceeded; because you want the preceding line, each line that passes the check is stored in the variable lastline. The END block prints lastline, or, if the time limit was already exceeded on the very first line, that first line itself.
This solution works well on your sample file, and works for any date you supply. You only have to supply the date limit in the correct format!
EDIT
I realize that mktime is not actually necessary. If the assumption holds that $1 is the date written as YYYYMMDD, you can compare the date as a plain number and then compare the time (extracted with split and rebuilt as a number, as in the other answers). In that case you can supply the time limit in whatever format you want and recover the proper date and time limits in the BEGIN block.
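A minimal sketch of that mktime-free variant might look like this (my own illustration of the idea, not code from above; it reuses the mytime and myfile.csv conventions and rebuilds the day and time limits in BEGIN):

awk -F ',' -v mytime="2013 07 30 22 09 00" '
BEGIN {
    split(mytime, m, " ")                                 # recover the limits from the supplied string
    dlimit = m[1] m[2] m[3]                               # "20130730", comparable as a number
    tlimit = sprintf("%02d%02d%02d", m[4], m[5], m[6])    # "220900"
}
{
    split($2, t, "[:.]")                                  # "22:08:51.244" -> 22, 08, 51, 244
    hms = sprintf("%02d%02d%02d", t[1], t[2], t[3])
    if ($1+0 > dlimit+0 || ($1+0 == dlimit+0 && hms+0 > tlimit+0)) exit
    lastline = $0
}
END { if (lastline != "") print lastline }' myfile.csv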
You could have an awk script that keeps in memory the last line it has seen whose timestamp is lower than the one you feed it, and prints that last match at the end (assuming the lines are in ascending order)
ex:
awk -v FS=',' -v thetime="22:09" '($2 < thetime) { before=$0 ; } END { print before ; }' myfile.csv
This happens to work because the comparison is lexicographic, so the string you feed it doesn't need to be the complete timestamp (i.e. 22:09:00.000) to compare correctly.
The same, but on several lines for readability:
awk -v FS=',' -v thetime="22:09" '
($2 < thetime) { before=$0 ; }
END { print before ; }' myfile.csv
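A quick way to convince yourself that a truncated timestamp such as 22:09 compares the way you want against the full values in the file (purely illustrative):

$ awk 'BEGIN { print ("22:08:51.244" < "22:09"), ("22:10:62.108" < "22:09") }'
1 0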
Now, if I understand your complete requirement: you need to find, among the lines matching a country and a type of product, the last line before a given timestamp? Then:
awk -v FS=',' -v thetime="${timestamp}" -v country="${thecountry}" -v product="${theproduct}" '
( $4 == country ) && ( $3 == product ) && ( $2 < thetime ) { before=$0 ; }
END { print before ; }' myfile.csv
should work for you... (feed it with 22:09, Spain and APPLES, and it returns the expected "20130730,22:08:51.244,APPLES,Spain,67p,blah,blah" line)
And if your file spans several days (to address Bentoy13's concern),
awk -v FS=',' -v theday="${theday}" -v thetime="${timestamp}" -v thecountry="${thecountry}" -v theproduct="${theproduct}" '
( $4 == thecountry ) && ( $3 == theproduct ) && (($1<theday)||(($1==theday)&&($2<thetime))) { before=$0 ; }
END { print before ; }' myfile.csv
That last one also works if the first column changes (i.e. if the file spans several days), but you also need to feed it theday.
You could use awk instead of your grep like this:
awk -v FS=',' -v Hour=22 -v Min=9 '{split($2, a, "[:]"); if ((3600*a[1] + 60*a[2] + a[3] - 3600*Hour - 60*Min)^2 < 100) print $0}' file
and basically change the 100 to whatever tolerance you want (note the difference is squared, so 100 corresponds to a tolerance of 10 seconds).
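If you prefer to pass the tolerance in seconds explicitly instead of squaring it into the constant, a possible variant (the Tol variable is my own addition):

awk -v FS=',' -v Hour=22 -v Min=9 -v Tol=10 '
{
    split($2, a, ":")
    delta = 3600*a[1] + 60*a[2] + a[3] - 3600*Hour - 60*Min
    if (delta < 0) delta = -delta      # absolute difference in seconds
    if (delta < Tol) print $0
}' file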