Eliminate useless repeats of values from CSV for line charting - bash

Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way, using bash or awk scripting, to tidy it up and remove all the useless zeros? By useless I mean: this data will be used for line charts on web pages, but reading the entire CSV file in the browser via JavaScript/jQuery etc. is very slow, so it would be more efficient to eliminate the useless zeros before uploading the file. The catch is that if I remove all the zeros, the lines more or less show peak to peak to peak instead of real lines going from zero up to some larger value and back down to zero, followed by a gap until the next value greater than zero.
As you can see, there are 3 groups in the data. Any time there are 3 identical values in a row (for example for GRP1), I'd like to remove the middle, or 2nd, 0 in that run. In reality, this should also work for values greater than zero: if the same value were found every 10 seconds for, say, 10 rows in a row, it would be good to leave both ends in place and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk prior to reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or, another expected result, this time using 0 instead of 3 as the repeated consecutive value, for GRP2:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
@karakfa's answer gets me close. I like it, but after applying the awk to one unique group and then eliminating some duplicates that also showed up for some reason, I still end up with portions similar to this:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
It would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9

That's one ill-placed question but I'll take a crack at the title, if you don't mind:
$ awk -F, '{
      if ($3 OFS $4 OFS $6 in first)
          last[$3 OFS $4 OFS $6] = $0
      else
          first[$3 OFS $4 OFS $6] = $0 }
  END {
      for (i in first) {
          print first[i]
          if (i in last)
              print last[i] }
  }' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically, it keeps the first and the last (if one exists) occurrence of each unique combination of the 3rd, 4th and 6th fields.
Edit: In the new light of the word consecutive, how about this awful hack:
$ awk -F, '
  (p != $3 OFS $4 OFS $6) {
      if (NR > 1 && lp < (NR-1))
          print q
      print $0
      lp = NR }
  {
      p = $3 OFS $4 OFS $6
      q = $0 }
' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and the output for the second data set:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
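One caveat, assuming the input can end in the middle of a run of repeats: nothing flushes the buffered line q at end of input, so the last occurrence of a trailing run would be dropped. A minimal sketch of the same hack with an END block added to cover that case:
$ awk -F, '
  (p != $3 OFS $4 OFS $6) {
      if (NR > 1 && lp < (NR-1))
          print q
      print $0
      lp = NR }
  {
      p = $3 OFS $4 OFS $6
      q = $0 }
  END {
      # flush the buffered last repeat, if the final line was not already printed
      if (lp < NR)
          print q }
' file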

Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 takes into account only those lines which don't have 0 as the value of their last field.

awk to the rescue!
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}
pl {print pl}
{pt=$4;p=$NF}1' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
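In case a gloss helps (my own reading of the one-liner): with -F'[,:]' the timestamp is split into its own fields, so $4 is the seconds field. A line is buffered in pl for as long as the seconds keep advancing by exactly 10 and the value ($NF) stays the same; when that run breaks, the buffered last line of the run is printed, followed by the current line via the trailing 1. Two caveats: since only the seconds field is compared, a run also breaks whenever the seconds wrap from :51 back to :01, which is why the :51/:01 pairs survive in the DTE output shown in the question; and pl is never cleared after printing, so a stale buffered line can print a second time after a later break, a plausible source of the stray duplicates mentioned there. A sketch of the same one-liner with the buffer cleared:
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}
                pl {print pl; pl=""}
                {pt=$4; p=$NF} 1' file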


How is $0 used in awk, and how does it work?

read n
awk '
BEGIN { sum = 0 }
{
    if ($0 % 2 == 0)
        sum += $0
}
END { print sum }'
Here I sum even numbers. What I want is: first I give as input how many numbers there are (the count), and then the numbers themselves, which I want to check for evenness and add up.
e.g.:
3
6
7
8
output is: 14
Here 3 is the count, followed by the numbers I want to check. The code executes correctly and the output is correct, but I wanted to know how $0 skipped the count value (i.e. the 3) and calculated only the remaining numbers.
Please update your question to be meaningful: There is no relationship between $0 and the Unix operating system, as choroba already pointed out in his comment. You obviously want to know the meaning of $0 in the awk programming language. From the awk man-page in the section about Fields:
$0 is the whole record, including leading and trailing whitespace.
You're reading the count with the shell's read, which consumes that first line before awk ever starts (which is why awk never sees the 3), but you're not using it in the script. A rewrite that handles the count inside awk can be:
$ awk 'NR==1 {n=$1; next}           # read the first value and skip that line
       !($1%2) {sum+=$1}            # add up even numbers
       NR>n {print sum; exit}' file   # done when the number of lines passes the counter
In awk, $0 corresponds to the record (here, the line), and $i to the fields, for i=1,2,3... An even number is one with remainder 0 when divided by 2. NR is the line number.
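A quick illustration of $0 versus the fields, using an arbitrary line of text:
$ echo "alpha beta gamma" | awk '{print $0; print $1; print NF}'
alpha beta gamma
alpha
3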

Detecting semi-duplicate records in Bash/AWK

Right now I have a script that rifles through tabulated data for cross-referencing record by record (using AWK). But I've run into a problem. AWK is great for line-by-line comparisons to run through formatted data, but I also want to detect semi-duplicate records. Unfortunately, uniq will not work by itself as the record is not 100% carbon-copy.
This is an orderly list, sorted by the second and third columns. What I want to detect is the same values in columns 3, 6 and 7.
Here's an example:
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth
JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth
The second number is different while the other information is exactly the same, so uniq will not find it solo.
Is there something in AWK that lets me reference the previous line? I already have this code block from AWK going line-by-line. (EDIT: the awk statement was an older version that was terrible.)
awk '{printf "%s", $0; if($6 != $7 && $9 != "Void" && $5 == "N") {printf "****\n"} else {printf "\n"}}' /tmp/verbout.txt
Is there something in AWK that lets me reference the previous line?
No, but there's nothing stopping you from explicitly saving certain info from the last line and using that later:
{
    if (last3 != $3 || last6 != $6 || last7 != $7) {
        print
    } else {
        # handle duplicate here
    }
    last3 = $3
    last6 = $6
    last7 = $7
}
The lastN variables all (effectively) default to an empty string at the start; then we just compare each line with those and print that line if any of them differ. Then we store the fields from that line to use for the next.
That is, of course, assuming duplicates should only be detected if they're consecutive. If you want to remove duplicates when order doesn't matter, you can sort on those fields first.
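For example, a sketch of that pre-sort, assuming whitespace-separated columns and the /tmp/verbout.txt file from the question:
sort -k3,3 -k6,6 -k7,7 /tmp/verbout.txt
This brings records sharing columns 3, 6 and 7 next to each other, so the consecutive comparison above will catch them.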
If order needs to be maintained, you can use an associative array to store the fact that the key has been seen before, something like:
{
    seenkey = $3 " " $6 " " $7
    if (seen[seenkey] == 0) {
        print
        seen[seenkey] = 1
    } else {
        # handle duplicate here
    }
}
One way of doing this with awk is:
$ awk '{print $0, (a[$3,$6,$7]++ ? "duplicate" : "")}' file
This will mark the duplicate records; note that you don't need to sort the file.
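For the two sample records above, that would print something like:
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth
JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth duplicate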
If you want to print just the unique records, the idiomatic way is
$ awk '!a[$3,$6,$7]++' file
Again, sorting is not required: the first time a key is seen, a[$3,$6,$7] is 0, so the negated expression is true and the line prints; every later occurrence sees a non-zero counter and is suppressed.

How to Compare two files line by line and output the whole line if different

I have two sorted files in question:
1) one is a control file (ctrl.txt) generated by an external process
2) the other is a line-count file (count.txt) that I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files, ignoring wrinkles in column 1 (the filenames) such as the "_" (in Thunder_bird) or the upper case (in MUSTANG), so that my output shows only the line below as the only real difference, the one whose counts don't match:
Hurricane|3000
My idea is to compare only the second column from both files and output the whole line if they differ. I have seen other examples in AWK, but I could not get anything to work.
Could you please try the following awk and let me know if it helps you.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' ctrl.txt count.txt
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' cntrl.txt count.txt
Explanation: adding an explanation of the above code here too.
awk -F"|" ' ##Setting the field separator as | (pipe) for all lines of the Input_file(s).
FNR==NR{ ##Checking the condition FNR==NR, which is TRUE while the first Input_file (ctrl.txt) is being read. The following instructions are executed while this condition is TRUE.
gsub(/_/,""); ##Using awk's gsub utility to globally substitute _ with NULL in the current line.
a[tolower($1)]=$2; ##Creating an array named a whose index is the first field in LOWER CASE (to avoid confusion) and whose value is $2 of the current line.
next} ##next is awk's out-of-the-box keyword which skips all further instructions (to make sure they only run while the 2nd Input_file, count.txt, is being read).
{ gsub(/_/,"") } ##Statements from here on are executed while the 2nd Input_file is being read; using gsub to remove all occurrences of _ from the line.
((tolower($1) in a) && $2!=a[tolower($1)]) ##Checking whether the lower-cased $1 is present in array a and the current line's $2 is NOT equal to a's stored value. If this condition is TRUE then the current line from count.txt is printed, since no action is given and awk's default action is to print.
' ctrl.txt count.txt ##Mentioning the Input_file names which we have to pass to awk.

How to select two specific lines with awk?

/!\ The question is basically solved, see my own answer below for more details and a subsidiary question /!\
I'm trying to add two lines based on specific words, but all I could find is how to grab everything after some pattern: How to select lines between two marker patterns which may occur multiple times with awk/sed
which is not what I'm after.
Consider the following output:
aji 1
bsk 2
cmq 3
doh 4
enr 5
fwp 6
gzx 7
What I'm trying to get is something like cmq + fwp, whose output should be:
9
I do know how to add values, but I'm missing the "select the line containing cmq, then select the line containing fwp" part.
So, is there a way awk could strictly select two specific lines independently (and then add them)?
Edit:
As far as I know, matching a word is awk '/cmq/', but I need to do that for, let's say, "fwp" too, so I can add them.
$ awk '$1 ~ /^(cmq|fwp)$/{sum+=$2} END { print sum}' infile
Explanation:
awk '$1 ~ /^(cmq|fwp)$/{ # look for a match in the first field
sum+=$2                  # sum up the 2nd field ($2) value, where sum is a variable
}
END{                     # at the end
print sum                # print the variable sum
}' infile
Test Results:
$ cat infile
aji 1
bsk 2
cmq 3
doh 4
enr 5
fwp 6
gzx 7
$ awk '$1 ~ /^(cmq|fwp)$/{sum+=$2} END { print sum}' infile
9
Now, for a more generic way this time (one which even works for subtracting):
awk '/cmq/{x=$2} /fwp/{y=$2} END {print x+y}'
Where:
awk '        # Invoke awk and its instructions
/cmq/{x=$2}  # Select the line with "cmq", then store its value in x. Pattern and action must stay together
/fwp/{y=$2}  # Select the line with "fwp", then store its value in y. Pattern and action must stay together
END          # End of the pattern matching
{print x+y}  # Print the calculated result
'            # End of awk's instructions
Unfortunately, two variables are used (x and y).
So, I'm still interested in finding out how to do it without any variable, or with only one at the very most.
I do have a single-variable way for summing:
awk '/cmq|fwp/ {x+=$2} END {print x}'
But doing the same for subtracting:
awk '/cmq|fwp/ {x-=$2} END {print x}'
doesn't work (it just accumulates -3-6 = -9, since both values get subtracted).
As a subsidiary question, does anyone know how to achieve such a subtraction with only one variable, or none at all?
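For what it's worth, here is one single-variable sketch for the subtraction case, assuming exactly two matching lines and that the value to subtract (cmq's) appears first:
awk '/cmq|fwp/ {x = $2 - x} END {print x}'
On each match the sign flips: after the cmq line x is 3 - 0 = 3, and after the fwp line x is 6 - 3 = 3, i.e. fwp minus cmq.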

Check number in another file if in range using shell script

I have two files (fileA and fileB). FileA contains a list of numbers and fileB contains number ranges.
fileA
446646452
000000001
63495980020
fileB (range_from and range_to)
22400208, 22400208
446646450, 446646450
63495980000, 63495989999
OUTPUT MUST BE
63495980020
In an SQL script it's just like:
select *
from fileB
where 446646452 between Range_from and Range_To
How can I do it using shell script?
Per clarification from the OP, each value in fileA should be checked against all ranges in fileB to see if it falls into at least one range.
>= and <= logic for range checking is assumed (i.e., values that coincide with the range endpoints are included).
awk -F', +' '
  # 1st pass (fileB): read the lower and upper range bounds
  FNR==NR { lbs[++count] = $1+0; ubs[count] = $2+0; next }
  # 2nd pass (fileA): check each line against all ranges.
  {
    for (i=1; i<=count; ++i) {
      if ($1+0 >= lbs[i] && $1+0 <= ubs[i]) { print; next }
    }
  }
' fileB fileA
awk is used to read both files, using separate passes:
FNR==NR is true for all lines from fileB; parallel arrays for the lower bounds (lbs) and upper bounds (ubs) of the ranges are built up; thanks to next, no further processing is applied to fileB lines.
The subsequent {...} block is then only applied to lines from fileA.
Each value from fileA is checked against all ranges, and as soon as a match is found, the input line is printed and processing proceeds to the next line.
To ensure that all tokens involved are treated as numbers, +0 is added to them.
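Run against the sample fileA and fileB above, the output is just:
63495980020
since that is the only value that falls inside one of the ranges.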
Printing Numbers That Match Any of the Ranges
$ awk 'FNR==NR{low[NR]=$1+0; hi[NR]=$2+0;next} {for (i in low)if ($1>low[i] && $1<hi[i]){print $1;next}}' fileB fileA
63495980020
How it works
FNR==NR{low[NR]=$1+0; hi[NR]=$2+0;next}
When reading in the first file, fileB, save low end of the range in the array low and the high end in the array hi.
for (i in low)if ($1>low[i] && $1<hi[i]){print $1;next}
When reading in the second file, fileA, check the number against each range. If it satisfies any of the ranges, then print it and skip to the next line.
Printing Numbers That Match Their Respective Range
$ paste fileA fileB | awk '$1>$2+0 && $1<$3+0{print $1}'
63495980020
Note that only 63495980020 is printed. 446646452 is not between 22400208 and 22400208, so it is omitted.
How it works
The utility paste combines the files like this:
$ paste fileA fileB
446646452 22400208, 22400208
000000001 446646450, 446646450
63495980020 63495980000, 63495989999
The first column is the number we are interested in while the second column is the low value of the range and the third the high value. We want to print the first value, $1, if it is between the second and third. To test if it is bigger than the second, we might try:
$1>$2
However, to assure that awk is treating the fields as numbers, not strings, we perform addition on one of the numbers like this:
$1>$2+0
Similarly, to test if the first number is smaller than the third:
$1<$3+0
Putting those two tests together with a print command yields:
$1>$2+0 && $1<$3+0 {print $1}
This tests strictly between (the endpoints are excluded). Depending on your requirements, you may prefer:
$1>=$2+0 && $1<=$3+0 {print $1}
Old-fashioned script
sed 's/,[[:space:]]*/ /' fileB \
| while read LowVal HighVal
  do
    while read ThisLine
    do
      [ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}"
    done < fileA
  done
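Run on the sample files, this too prints only:
63495980020
Note, though, that it re-reads fileA once per range, so it would be the slowest option on large inputs.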
