Check number in another file if in range using shell script

I have two files (fileA and fileB). FileA contains the list of numbers and fileB contains the number range.
fileA
446646452
000000001
63495980020
fileB (range_from and range_to)
22400208, 22400208
446646450, 446646450
63495980000, 63495989999
OUTPUT MUST BE
63495980020
In an SQL script it's just like:
select *
from fileB
where 446646452 between Range_from and Range_To
How can I do it using a shell script?

Per clarification from the OP, each value in fileA should be checked against all ranges in fileB to see if it falls into at least one range.
>= and <= logic for range checking is assumed (i.e., values that coincide with the range endpoints are included).
awk -F', +' '
# 1st pass (fileB): read the lower and upper range bounds
FNR==NR { lbs[++count] = $1+0; ubs[count] = $2+0; next }
# 2nd pass (fileA): check each line against all ranges.
{
for(i=1;i<=count;++i) {
if ($1+0 >= lbs[i] && $1+0 <= ubs[i]) { print; next }
}
}
' fileB fileA
awk is used to read both files, using separate passes:
FNR==NR is true for all lines from fileB; parallel arrays for the lower bounds (lbs) and upper bounds (ubs) of the ranges are built up; thanks to next, no further processing is applied to fileB lines.
The subsequent {...} block is then only applied to lines from fileA.
Each value from fileA is checked against all ranges, and as soon as a match is found, the input line is printed and processing proceeds to the next line.
To ensure that all tokens involved are treated as numbers, +0 is added to them.
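As a side illustration (a minimal sketch, not part of the original answer): when awk compares values as strings, digits compare lexicographically, which is exactly the surprise the +0 guards against.
echo '9 15' | awk '{
s = ($1 "" > $2 "")   # appending "" forces string comparison: "9" > "15" is true
n = ($1+0 > $2+0)     # +0 forces numeric comparison: 9 > 15 is false
print s, n            # prints: 1 0
}'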

Printing Numbers That Match Any of the Ranges
$ awk 'FNR==NR{low[NR]=$1+0; hi[NR]=$2+0;next} {for (i in low)if ($1>low[i] && $1<hi[i]){print $1;next}}' fileB fileA
63495980020
How it works
FNR==NR{low[NR]=$1+0; hi[NR]=$2+0;next}
When reading in the first file, fileB, save the low end of the range in the array low and the high end in the array hi.
for (i in low)if ($1>low[i] && $1<hi[i]){print $1;next}
When reading in the second file, fileA, check the number against each range. If it satisfies any of the ranges, then print it and skip to the next line.
Printing Numbers That Match Their Respective Range
$ paste fileA fileB | awk '$1>$2+0 && $1<$3+0{print $1}'
63495980020
Note that only 63495980020 is printed. 446646452 is not between 22400208 and 22400208, so it is omitted.
How it works
The utility paste combines the files like this:
$ paste fileA fileB
446646452 22400208, 22400208
000000001 446646450, 446646450
63495980020 63495980000, 63495989999
The first column is the number we are interested in while the second column is the low value of the range and the third the high value. We want to print the first value, $1, if it is between the second and third. To test if it is bigger than the second, we might try:
$1>$2
However, to ensure that awk is treating the fields as numbers, not strings, we perform addition on one of the numbers like this:
$1>$2+0
Similarly, to test if the first number is smaller than the third:
$1<$3+0
Putting those two tests together with a print command yields:
$1>$2+0 && $1<$3+0 {print $1}
This tests for strictly between. Depending on your requirements, you may prefer:
$1>=$2+0 && $1<=$3+0 {print $1}
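Spelled out (combining the pieces above), the inclusive version of the full command would be:
paste fileA fileB | awk '$1>=$2+0 && $1<=$3+0 {print $1}'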

Old-fashioned script
sed 's/,[[:space:]]*/ /' fileB \
| while read LowVal HighVal
do
while read ThisLine
do
[ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}"
done < fileA
done
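Note that the loop above rereads fileA once per range and prints a number once for every range it falls into. If you want each number printed at most once (as in the expected output), one possible variation (a sketch along the same lines, not extensively tested) swaps the loops and stops at the first matching range:
while read ThisLine
do
sed 's/,[[:space:]]*/ /' fileB \
| while read LowVal HighVal
do
if [ "${ThisLine}" -ge "${LowVal}" ] && [ "${ThisLine}" -le "${HighVal}" ]
then
echo "${ThisLine}"   # print once, then stop checking further ranges
break
fi
done
done < fileA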

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. It then checks whether the 2nd element of arr is greater than 15 and, if so, prints the 1st element, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= is, and substr, to get the 2 characters after ETA= (4 is used because ETA= is 4 characters long and index gives the start position). I use +0 to convert the result to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
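If the hour can have one digit (or more than two), a variant of the same index/substr idea could take everything up to the colon instead of exactly 2 characters; an untested sketch along those lines:
awk 'index($0, "ETA=") {                      # only lines that actually contain ETA=
eta = substr($0, index($0, "ETA=") + 4)       # everything after ETA=
sub(/:.*/, "", eta)                           # keep only the hours before the colon
if (eta+0 > 15) print $1
}' file.txt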
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line to obtain the first space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one, if any), and the +0 converts it to a number, ignoring any non-numeric text after the leading digits. So, for example, on the first line the field is actually 12:00, team=xyz,user1=tom,dom=dby.com, but the check effectively tests whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
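To see what the two fields actually contain with that separator, a quick throwaway check (hypothetical, using the file name from the question) can help:
awk -F 'ETA=' '{ printf "field 1: [%s]  field 2: [%s]\n", $1, $2 }' file.txt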
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Subsetting a CSV based on a percentage of unique values

I've been reading through other similar questions. I have this working, but it is very slow due to the size of the CSV I'm working with. Are there ways to make this more efficient?
My goal:
I have an incredibly large CSV (>100 GB). I would like to take all of the unique values in a column, extract 10% of these, and then use that 10% to subsample the original CSV.
What I'm doing:
1 - I'm pulling all unique values from column 11 and writing those to a text file:
cat File1.csv | cut -f11 -d , | sort | uniq > uniqueValues.txt
2 - Next, I'm sampling a random 10% of the values in uniqueValues.txt:
cat uniqueValues.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .10) print $0}' > uniqueValues.10pct.txt
3 - Next, I'm pulling the rows in File1.csv which have column 11 matching values from uniqueValues.10pct.txt:
awk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
As far as I can tell, this seems to be working. Does this seem reasonable? Any suggestions on how to improve the efficiency?
Any suggestions on how to improve the efficiency?
Avoid sort in the 1st step, as the 2nd and 3rd steps do not care about order; you might do your whole 1st step using a single awk command as follows:
awk 'BEGIN{FS=","}!arr[$11]++{print $11}' File1.csv > uniqueValues.txt
Explanation: I inform GNU AWK that the field separator (FS) is a comma, then for each line I do arr[$11]++ to get the number of occurrences of the value in the 11th column and use ! to negate it, so 0 becomes true, whilst 1 and greater becomes false. If this holds true, I print the 11th column.
Please test the above against your 1st step for your data and then select the one which is faster.
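One rough way to compare the two (a sketch, assuming bash's time keyword and writing to two hypothetical scratch files so the results can also be checked against each other):
# time the original cut|sort|uniq pipeline vs. the single-awk version
time (cut -f11 -d',' File1.csv | sort | uniq > uniqueValues.sort.txt)
time (awk 'BEGIN{FS=","}!arr[$11]++{print $11}' File1.csv > uniqueValues.awk.txt)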
As for the 3rd step, you might attempt using a non-GNU AWK if you are allowed to install tools on your machine. For example, the author of the article¹ Don't MAWK AWK – the fastest and most elegant big data munging language! found nawk faster than GNU AWK and mawk faster than nawk. After installing, prepare test data and measure times for
gawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
nawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
mawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
then use the one which proves to be fastest.
¹ Be warned that the values shown pertain to the versions available in September 2009; you might get different times with the versions available in June 2022.
You might find this to be faster (untested since no sample input/output provided):
cut -f11 -d',' File1.csv |
sort -u > uniqueValues.txt
numUnq=$(wc -l < uniqueValues.txt)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv
You could try replacing that first cut | sort; numUnq=$(wc...) with
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
to see if that's any faster but I doubt it since cut, sort, and wc are all very fast while awk has to do regexp-based field splitting and store all $11 values in memory (which can get slow as the array size increases due to how dynamic array allocation works).
Create a sample *.csv file:
for ((i=1;i<=10;i++))
do
for ((j=1;j<=100;j++))
do
echo "a,b,c,d,e,f,g,h,i,j,${j},k,l,m"
done
done > large.csv
NOTES:
1,000 total lines
100 unique values in the 11th field
each unique value shows up 10 times in the file
We'll look at a couple awk ideas that:
keep track of unique values as we find them
apply the random percentage check as we encounter a new (unique) value
require just a single pass through the source file
NOTE: both of these awk scripts (below) replace all of OP's current code (cat/cut/sort/uniq/cat/awk/awk)
First idea applies our random percentage check each time we find a new unique value:
awk -F',' '
BEGIN { srand() }
!seen[$11]++ { if (rand() <= 0.10) # if this is the 1st time we have seen this value and rand() is <= 10% then ...
keep[$11] # add the value to our keep[] array
}
$11 in keep # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
one drawback to this approach is that the total number of unique values is not guaranteed to always be exactly 10% since we're at the mercy of the rand() function, for example ...
a half dozen sample runs generated 70, 110, 100, 140, 110 lines (ie, 7, 11, 10, 14 and 11 unique values) in small.csv
A different approach where we pre-generate a random set of modulo 100 values (ie, 0 to 99); as we find a new uniq value we check the count (of uniq values) modulo 100 and if we find a match to our pre-generated set then we print the row:
awk -F',' -v pct=10 '
BEGIN { srand()
delete mods # force awk to treat all "mods" references as an array and not a scalar
while (length(mods) < pct) # repeat loop until we have "pct" unique indices in the mods[] array
mods[int(rand() * 100)] # generate random integers between 0 and 99
}
!seen[$11]++ { if ((++uniqcnt % 100) in mods) # if this is the 1st time we have seen this value then increment our unique value counter and if "modulo 100" is an index in the mods[] array then ...
keep[$11] # add the value to our keep[] array
}
$11 in keep # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
for a large pct this assumes the rand() results are evenly distributed between 0 and 1 so that the mods[] array is populated in a timely manner
this has the benefit of printing lines that represent exactly 10% of the possible unique values (depending on number of unique values the percentage will actually be 10% +/- 1%)
a half dozen sample runs all generated exactly 100 lines (ie, 10 unique values) in small.csv
If OP still needs to generate the two intermediate (sorted) files (uniqueValues.txt and uniqueValues.10pct.txt) then this could be done in the same awk script via an END {...} block, eg:
END { PROCINFO["sorted_in"]="#ind_num_asc" # this line of code requires GNU awk otherwise OP can sort the files at the OS/bash level
for (i in seen)
print i > "uniqueValues.txt"
for (i in keep)
print i > "uniqueValues.10pct.txt" # use with 1st awk script
# print i > "uniqueValues." pct "pct.txt" # use with 2nd awk script
}

Formatting output using awk

I've a file with following content:
A 28713.64 27736.1000
B 9835.32
C 38548.96
Now, I need to check if the first column of the last row is 'C'; if it is, the value of the third column of the first row should be printed in the third column against 'C'.
Expected Output:
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
I tried below, but it's not working:
awk '{if ($1 == "C") ; print $1,$2,$3}' file_name
Any help is most welcome!!!
This works for the given example:
awk 'NR==1{v=$3}$1=="C"{$0=$0 FS v}7' file|column -t
If you want to append the 3rd column value from A row to C row, change NR==1 into $1=="A"
The column -t part is just for making output pretty. :-)
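Spelled out, that suggested change would look like this (a sketch of the variation, relying on the same trick as above):
awk '$1=="A"{v=$3} $1=="C"{$0=$0 FS v} 7' file | column -t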
EDIT: As per the OP's comment, the OP is looking to take the value from the very first line and to match the C string on the very last line of the Input_file; if this is the case, then one should try the following.
awk '
FNR==1{
value=$NF
print
next
}
prev{
print prev
}
{
prev=$0
prev_first=$1
}
END{
if(prev_first=="C"){
print prev,value
}
else{
print
}
}' file | column -t
Assuming that your actual Input_file is the same as the shown samples and you want to pick the value from the line whose 1st column is A:
awk '$1=="A" && FNR==1{value=$NF} $1=="C"{print $0,value;next} 1' Input_file| column -t
Output will be as follows.
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
POSIX dictates that "assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS."
So...
awk 'NR==1{x=$3} $1=="C"{$3=x} 1' input.txt
Note that the output is not formatted well, but that's likely the case with most of the solutions here. You could pipe the output through column, as Ravinder suggested. Or you could control things precisely by printing your data with printf.
awk 'NR==1{x=$3} $1=="C"{$3=x} {printf "%-2s%-26s%s\n",$1,$2,$3}' input.txt
If your lines can be expressed in a printf format, you'll be able to avoid the unpredictability of column -t and save the overhead of a pipe.

Average of first ten numbers of text file using bash

I have a file of two columns. The first column is dates and the second contains a corresponding number. The two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file. Then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay#db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, '{
x = $2;                   # current value
i = NR % n;               # circular index into the window
ma += (x - q[i]) / n;     # running mean: add the new value, drop the one leaving the window
q[i] = x;                 # remember this value for when it leaves the window
if(NR>=n)print ma;        # start printing once the first full window has been seen
}' file.txt
2
2
4
Or, the one below is useful for plotting, since it keeps the reference axis (in your case, the date) at the center of the averaging window.
Script
akshay#db-3325:/tmp$ cat avg.awk
BEGIN {
m=int((n+1)/2)
}
{L[NR]=$2; sum+=$2}
NR>=m {d[++i]=$1}
NR>n {sum-=L[NR-n]}
NR>=n{
a[++k]=sum/n
}
END {
for (j=1; j<=k; j++)
print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A little math trick here: $2 is stored in a[NR%3] for each record, so the values in the three elements are updated cyclically, and the sum of a[0], a[1] and a[2] is always the sum of the past 3 numbers.
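The same trick generalizes to any window size (a sketch, with the window size passed in as n):
awk -F, -v n=3 '
{ a[NR%n] = $2 }                                      # overwrite cyclically: only the last n values are kept
NR>=n { s=0; for (i=0;i<n;i++) s+=a[i]; print s/n }   # average of the current window
' file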
Updated based on the changes made due to the helpful feedback from Ed Morton.
Here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f.
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
Explanation:
Line 1: Match all lines; // is the match operator with an empty pattern, which means "do this on every line".
Line 2: Use the record number (NR) as the key and store the value from column 2.
Line 3: If we have 3 or more values read from the file...
Line 4: ...do the maths and print as an integer.
BEGIN block: Change the field separator to a comma ",".

Eliminate useless repeats of values from CSV for line charting

Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way using bash or awk scripting to tidy it up and remove all useless zeros? By useless I mean: this data will be used for line charts in web pages. However, reading the entire CSV file in the web browser via JavaScript/jQuery etc. is very slow. It would be more efficient to eliminate the useless zeros prior to uploading the file. If I remove all the zeros, the lines all more or less show peak to peak to peak instead of real lines from zero to some larger value back to zero, followed by a space until the next value greater than zero.
As you see there are 3 groups in the list of data. Any time there are 3 in a row for example for GRP1, I'd like to remove the middle or 2nd 0 in that list. In reality, this could work for values greater than zero also...if the same values were found every 10 seconds for say 10 in a row... it would be good to leave both ends in place and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk prior to reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or... Another Expected Result using 0 this time...instead of 3 as the common consecutive value for GRP2...
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
@karakfa's answer gets me close, but I still end up with portions similar to this after applying the awk to one unique group and then eliminating some duplicates that also showed up for some reason.
I like it, but it still ends up with this:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
Would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
That's one ill-placed question but I'll take a crack at the title, if you don't mind:
$ awk -F, ' {
if($3 OFS $4 OFS $6 in first)
last[$3 OFS $4 OFS $6]=$0
else
first[$3 OFS $4 OFS $6]=$0 }
END {
for(i in first) {
print first[i]
if(i in last)
print last[i] }
}' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically it keeps the first and last (if exists) occurrence of each unique combination of 3rd, 4th and 6th field.
Edit: In the new light of the word consecutive, how about this awful hack:
$ awk -F, '
(p!=$3 OFS $4 OFS $6) {
if(NR>1 && lp<(NR-1))
print q
print $0
lp=NR }
{
p=$3 OFS $4 OFS $6
q=$0 }
' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and output for the second data:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 - takes into account only those lines which don't have 0 as their last field value
awk to the rescue!
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}
pl {print pl}
{pt=$4;p=$NF}1' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2

Resources