How is $0 used in awk, and how does it work? - bash

read n
awk '
BEGIN { sum = 0 }
{ if ($0 % 2 == 0) { sum += $0 } }
END { print sum }'
Here I sum the even numbers. What I want is to first give the count of how many numbers follow, and then the numbers themselves, which should be checked for evenness and added.
Example:
3
6
7
8
output is : 14
Here 3 is the count, followed by the numbers I want to check. The code executes correctly and the output is correct, but I want to know how $0 skipped the count value (i.e. 3) and only processed the remaining numbers.

Please update your question to be meaningful: there is no relationship between $0 and the Unix operating system, as choroba already pointed out in a comment. You obviously want to know the meaning of $0 in the awk programming language. From the awk man page, in the section about Fields:
$0 is the whole record, including leading and trailing whitespace.
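A minimal illustration of that definition (the sample string and the brackets are just for demonstration):
$ echo '   hello world   ' | awk '{print "[" $0 "]"; print "[" $1 "]"}'
[   hello world   ]
[hello]
$0 keeps the record's leading and trailing whitespace, while $1 is the first field with that whitespace stripped.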

You're reading the count with read n but never using it in the awk script; the count line (3) only stays out of the sum because 3 happens to be odd. A rewrite that actually uses the count could be:
$ awk 'NR==1 {n=$1; next}            # read the first value (the count) and skip that line
       !($1%2) {sum+=$1}             # add up the even numbers
       NR>n {print sum; exit}' file  # done once the line number passes the count
In awk, $0 is the whole record (here, the line), and $i is field i for i = 1, 2, 3, ...
An even number is one whose remainder is 0 when divided by 2. NR is the current record (line) number.
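As a quick check, feeding the question's sample input to the rewrite (via printf here instead of a file) prints the expected 14:
$ printf '3\n6\n7\n8\n' | awk 'NR==1 {n=$1; next}
                               !($1%2) {sum+=$1}
                               NR>n {print sum; exit}'
14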

Detecting semi-duplicate records in Bash/AWK

Right now I have a script that rifles through tabulated data for cross-referencing record by record (using AWK). But I've run into a problem. AWK is great for line-by-line comparisons to run through formatted data, but I also want to detect semi-duplicate records. Unfortunately, uniq will not work by itself as the record is not 100% carbon-copy.
This is an ordered list, sorted by the second and third columns. What I want to detect is rows having the same values in columns 3, 6 and 7.
Here's an example:
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth
JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth
The second number is different while the other information is exactly the same, so uniq will not find it solo.
Is there something in AWK that lets me reference the previous line? I already have this code block from AWK going line-by-line. (EDIT: the awk statement was an older version that was terrible.)
awk '{printf "%s", $0; if($6 != $7 && $9 != "Void" && $5 == "N") {printf "****\n"} else {printf "\n"}}' /tmp/verbout.txt
Is there something in AWK that lets me reference the previous line?
No, but there's nothing stopping you from explicitly saving certain info from the last line and using that later:
{
    if (last3 != $3 || last6 != $6 || last7 != $7) {
        print
    } else {
        # handle duplicate here
    }
    last3 = $3
    last6 = $6
    last7 = $7
}
The lastN variables all (effectively) default to an empty string at the start, then we compare each line against them and print the line if any of the three fields differ.
Then we store those fields from the current line to use for the next one.
That is, of course, assuming duplicates should only be detected if they're consecutive. If you want to remove duplicates when order doesn't matter, you can sort on those fields first.
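For example, one possible pipeline (the file name is taken from the question; the sort keys assume whitespace-separated columns 3, 6 and 7):
sort -k3,3 -k6,6 -k7,7 /tmp/verbout.txt |
awk 'last3 != $3 || last6 != $6 || last7 != $7 { print }
     { last3 = $3; last6 = $6; last7 = $7 }'
After sorting, duplicates become consecutive, so the same last-line comparison keeps only the first record of each group.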
If order needs to be maintained, you can use an associative array to store the fact that the key has been seen before, something like:
{
    seenkey = $3 " " $6 " " $7
    if (seen[seenkey] == 0) {
        print
        seen[seenkey] = 1
    } else {
        # handle duplicate here
    }
}
One way of doing this with awk is
$ awk '{print $0, (a[$3,$6,$7]++ ? "duplicate" : "")}' file
this will mark the duplicate records; note that you don't need to sort the file.
If you want to print just the unique records, the idiomatic way is
$ awk '!a[$3,$6,$7]++' file
again, sorting is not required.
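For instance, assuming /tmp/verbout.txt held just the two sample lines from the question, only the first would survive, since both lines share fields 3, 6 and 7:
$ awk '!a[$3,$6,$7]++' /tmp/verbout.txt
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth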

Need the first two characters from a file and make sure they both aren't below 6, then return the line

I have a text file like this
17 Blue
45 Purple
And I need to make sure the first two digits aren't both less than the number 6, and then print the line. So, for example, the first line would print since the first digit 1 is lower than 6 but the second digit 7 is higher than 6, so they aren't both lower than 6. The next line would not print because 4 is lower than 6 and 5 is also lower than 6. I'm trying to use awk and not having any success; this is what I have so far. It's just crashing in the terminal/bash.
awk 'BEGIN { FS = "";} {if ($1 < 6 && $2 < 6) else print}' file.txt
I'm using FS = "" to separate the first two digits by columns ($1 and $2) not sure if there's an easier way to do this.
awk '!/^[0-5][0-5]/' file.txt
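Against the sample input, that keeps only the line whose first two characters aren't both in the 0-5 range:
$ awk '!/^[0-5][0-5]/' file.txt
17 Blue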
One more approach could be:
awk -v val="6" 'substr($1,1,1)>=val || substr($1,2,1)>=val' Input_file
Here I check whether either the 1st character or the 2nd character of the 1st field is 6 or more (i.e. not less than 6). The threshold is kept in a variable named val, set to 6 here, which one could change as needed.
About the OP's approach: yes, one could set FS="", but empty FS is GNU-awk-specific behaviour; not all awks support it, so the script may fail where FS="" is not supported. It is better to use substr or a regex for this problem, so the solution works across awk implementations.

How to select two specific lines with awk?

/!\ The question is basically solved, see my own answer below for more details and a subsidiary question /!\
I'm trying to add two lines based on specific words, but all I could find is adding everything after some pattern: How to select lines between two marker patterns which may occur multiple times with awk/sed
Which is not what I'm looking after.
Consider the following output:
aji 1
bsk 2
cmq 3
doh 4
enr 5
fwp 6
gzx 7
What I'm trying to get is something like cmq + fwp, whose output should be:
9
I do know how to add values, but I'm missing the "select the line containing cmq, then select the line containing fwp" part.
So, is there a way awk could strictly select two specific lines independently (and then add them)?
Edit:
As far as I know, matching words is awk '/cmq/', but I need to do that for let's say "fwp" too so I can add them.
$ awk '$1 ~ /^(cmq|fwp)$/{sum+=$2} END { print sum}' infile
Explanation:
awk '$1 ~ /^(cmq|fwp)$/{   # look for a match in the first field
         sum+=$2           # add the 2nd field ($2) to the variable sum
     }
     END{                  # at the end of input
         print sum         # print the variable sum
     }' infile
Test Results:
$ cat infile
aji 1
bsk 2
cmq 3
doh 4
enr 5
fwp 6
gzx 7
$ awk '$1 ~ /^(cmq|fwp)$/{sum+=$2} END { print sum}' infile
9
Now, for a more generic way this time, one which even works for subtracting:
awk '/cmq/{x=$2} /fwp/{y=$2} END {print x+y}'
Where:
awk '               # Invoke awk and open its instructions
/cmq/{x=$2}         # Select the line containing "cmq" and store its value in x (pattern and action go together)
/fwp/{y=$2}         # Select the line containing "fwp" and store its value in y (pattern and action go together)
END                 # After the last line has been read
{print x+y}         # Print the calculated result
'                   # Close awk's instructions
Unfortunately, two variables are used (x and y).
So, I'm still interested in finding how to do it without any variable, or with only one at the very most.
I do have a single-variable way for summing:
awk '/cmq|fwp/ {x+=$2} END {print x}'
But doing the same for subtracting:
awk '/cmq|fwp/ {x-=$2} END {print x}'
doesn't work: both matching lines subtract from x, so it prints -(3+6) = -9 rather than a difference.
As a subsidiary question, does anyone know how to achieve such subtraction without a variable, or with only one at most?
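One possible single-variable sketch, assuming the intended difference is cmq minus fwp (swap the patterns for the other direction), is to let each pattern apply its own sign to the same variable:
awk '/cmq/ {x+=$2} /fwp/ {x-=$2} END {print x}'
With the sample data this prints -3 (that is, 3 - 6).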

bash command for group by count

I have a file in the following format
abc|1
def|2
abc|8
def|3
abc|5
xyz|3
I need to group by the words in the first column and sum the values of the second column. For instance, the output for this file should be
abc|14
def|5
xyz|3
Explanation: the corresponding values for word "abc" are 1, 8, and 5. By adding these numbers, the sum comes out to be 14 and the output becomes "abc|14". Similarly, for word "def", the corresponding values are 2 and 3. Summing up these, the final output comes out to be "def|5".
Thank you very much for the help :)
I tried the following command
awk -F "|" '{arr[$1]+=$2} END {for (i in arr) {print i"|"arr[i]}}' filename
another command which I found was
awk -F "," 'BEGIN { FS=OFS=SUBSEP=","}{arr[$1]+=$2 }END {for (i in arr) print i,arr[i]}' filename
Neither showed me the intended results, and I'm also in doubt about how these commands actually work.
Short GNU datamash solution:
datamash -s -t\| -g1 sum 2 < filename
The output:
abc|14
def|5
xyz|3
-s - sort the input first (datamash only groups adjacent identical keys)
-t\| - set | as the field separator
-g1 - group by the 1st column
sum 2 - sum up the values of the 2nd column
I will just add an answer to fix the sorting issue you had. With your awk logic you don't need to pipe awk's output through sort/uniq; you can handle the ordering within awk itself.
Referring to the GNU Awk documentation on Using Predefined Array Scanning Orders with gawk, you can use the PROCINFO["sorted_in"] variable (gawk-specific) to control how awk orders your final output.
Referring to the section below,
#ind_str_asc
Order by indices in ascending order compared as strings; this is the most basic sort. (Internally, array indices are always strings, so with a[2*5] = 1 the index is "10" rather than numeric 10.)
So, to use this for your requirement, in the END clause just do:
END{PROCINFO["sorted_in"]="#ind_str_asc"; for (i in unique) print i,unique[i]}
with your full command being,
awk '
BEGIN{FS=OFS="|"}{
unique[$1]+=$2;
next
}
END{
PROCINFO["sorted_in"]="#ind_str_asc";
for (i in unique)
print i,unique[i]
}' file
awk -F\| '{ arry[$1]+=$2 } END { n = asorti(arry, arry2); for (i = 1; i <= n; i++) print arry2[i] "|" arry[arry2[i]] }' filename
Your initial solution should work apart from the ordering issue. Use the asorti function (gawk) to sort the indices of arry into arry2, then loop over arry2 by numeric index so the output comes out sorted.
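If gawk-specific features such as asorti or PROCINFO aren't available, a portable sketch is to let sort order awk's output (same field separator and file name as above):
awk -F'|' '{ arr[$1] += $2 } END { for (k in arr) print k "|" arr[k] }' filename | sort -t'|' -k1,1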

Eliminate useless repeats of values from CSV for line charting

Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way, using bash or awk scripting, to tidy it up and remove all useless zeros? By useless I mean: this data will be used for line charts in web pages, and reading the entire CSV file in the web browser via JavaScript/jQuery etc. is very slow. It would be more efficient to eliminate the useless zeros before uploading the file. If I simply remove all the zeros, the lines more or less show peak to peak to peak instead of real lines from zero up to some larger value and back to zero, followed by a gap until the next value greater than zero.
As you can see, there are 3 groups in the list of data. Any time there are 3 identical values in a row, for example for GRP1, I'd like to remove the middle one (the 2nd 0 in that list). In reality this could also work for values greater than zero: if the same value is found every 10 seconds for, say, 10 rows in a row, it would be good to keep both ends in place and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk prior to reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or... another expected result, this time using 0 instead of 3 as the common consecutive value, for GRP2...
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
karakfa's answer gets me close, but I still end up with portions similar to this after applying the awk to one unique group and then eliminating some duplicates that also showed up for some reason.
I like it, but it still ends up with this:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
Would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
That's one ill-placed question but I'll take a crack at the title, if you don't mind:
$ awk -F, '{
      if ($3 OFS $4 OFS $6 in first)
          last[$3 OFS $4 OFS $6] = $0
      else
          first[$3 OFS $4 OFS $6] = $0 }
  END {
      for (i in first) {
          print first[i]
          if (i in last)
              print last[i] }
  }' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically it keeps the first and last (if one exists) occurrence of each unique combination of the 3rd, 4th and 6th fields.
Edit: in light of the word consecutive, how about this awful hack:
$ awk -F, '
  (p != $3 OFS $4 OFS $6) {
      if (NR > 1 && lp < (NR-1))
          print q
      print $0
      lp = NR }
  {
      p = $3 OFS $4 OFS $6
      q = $0 }
  ' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and output for the second data:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 - takes into account only those lines which don't have 0 as their last field value
awk to the rescue!
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}   # same value, 10 seconds later: extend the current run
                pl {print pl; pl=""}                       # a run just ended: print its last saved line once
                {pt=$4; p=$NF} 1' file                     # remember time and value, then print the current line
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
