AWK: is there some flag to ignore comments?

Comment rows are counted in NR.
Is there some flag to ignore comments?
How can you restrict the records inside awk itself, rather than piping through something like sed -e '1d', so that comment rows are ignored?
Example
$ awk '{sum+=$3} END {avg=sum/NR} END {print avg}' coriolis_data
0.885491 // WRONG: divided by 11 (the comment row is counted in NR), should be divided by 10
$ cat coriolis_data
#d-err-t-err-d2-err
.105 0.005 0.9766 0.0001 0.595 0.005
.095 0.005 0.9963 0.0001 0.595 0.005
.115 0.005 0.9687 0.0001 0.595 0.005
.105 0.005 0.9693 0.0001 0.595 0.005
.095 0.005 0.9798 0.0001 0.595 0.005
.105 0.005 0.9798 0.0001 0.595 0.005
.095 0.005 0.9711 0.0001 0.595 0.005
.110 0.005 0.9640 0.0001 0.595 0.005
.105 0.005 0.9704 0.0001 0.595 0.005
.090 0.005 0.9644 0.0001 0.595 0.005

It is best not to touch NR; use a different variable for counting the rows. This version skips comments as well as blank lines.
$ awk '!/^[ \t]*#/&&NF{sum+=$3;++d}END{ave=sum/d;print ave}' file
0.97404

Just decrement NR yourself on comment lines:
awk '/^[[:space:]]*#/ { NR-- } {sum+=$3} END { ... }' coriolis_data
Okay, that answers the question you asked, but here is the question you probably meant:
awk '{ if ($0 ~ /^[[:space:]]*#/) {NR--} else {sum+=$3} } END { ... }' coriolis_data
(It's more awk-ish to use patterns outside the blocks as in the first answer, but to do it that way, you'd have to write your comment pattern twice.)
Edit: Will suggests in the comments using /.../ {NR--; next} to avoid having the if-else block. My thought is that this looks cleaner when you have more complex actions for the matching records, but doesn't matter too much for something this simple. Take your favorite!
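Filling in the elided parts, here is one runnable sketch of the next-based variant (the sample data is a shortened copy of the question's file; rather than decrementing NR, it simply skips comment lines and counts data rows in its own variable):

```shell
# Shortened copy of the question's data file (hypothetical path).
cat > coriolis_data <<'EOF'
#d-err-t-err-d2-err
.105 0.005 0.9766 0.0001 0.595 0.005
.095 0.005 0.9963 0.0001 0.595 0.005
EOF

# Skip comment lines with `next`, count the remaining rows in n,
# and average column 3 over n instead of NR.
awk '/^[[:space:]]*#/ { next }
     { sum += $3; n++ }
     END { printf "%.5f\n", sum / n }' coriolis_data
# prints 0.98645, i.e. (0.9766 + 0.9963) / 2
```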

Another approach is to use a conditional statement...
awk '{ if( $1 != "#" ){ print $0 } }' coriolis_data
What this does is tell awk to print only the lines whose first field is not #. Of course, this requires the comment character # to stand alone at the beginning of a comment.

There is a SIMPLER way to do it!
$ awk '!/#/ {print $0}' coriolis_data
.105 0.005 0.9766 0.0001 0.595 0.005
.095 0.005 0.9963 0.0001 0.595 0.005
.115 0.005 0.9687 0.0001 0.595 0.005
.105 0.005 0.9693 0.0001 0.595 0.005
.095 0.005 0.9798 0.0001 0.595 0.005
.105 0.005 0.9798 0.0001 0.595 0.005
.095 0.005 0.9711 0.0001 0.595 0.005
.110 0.005 0.9640 0.0001 0.595 0.005
.105 0.005 0.9704 0.0001 0.595 0.005
.090 0.005 0.9644 0.0001 0.595 0.005
Correction: no, it is not!
$ awk '!/#/ {sum+=$3}END{ave=sum/NR}END{print ave}' coriolis_data
0.885491 // WRONG: comment lines are excluded from the sum, but NR still counts them.
$ awk '{if ($0 ~ /^[[:space:]]*#/){NR--}else{sum+=$3}}END{ave=sum/NR}END{print ave}' coriolis_data
0.97404 // RIGHT.

The file that you provide for awk to parse is not a source file; it is data, so awk knows nothing about its syntax. In other words, to awk, lines beginning with # are nothing special.
That said, you can of course skip comments, but you have to supply the logic yourself: tell awk to ignore every line that starts with "#" and count the number of data lines yourself.
awk 'BEGIN {lines=0} {if(substr($1, 1, 1) != "#") {sum+=$3; lines++} } END {avg=sum/lines} END {print avg}' coriolis_data
You can, of course, indent it for better readability.
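For instance, the indented form could be kept in a small script file (avg.awk is a hypothetical name; note that substr's start index is 1, since awk strings are 1-based):

```shell
# Shortened sample input in the question's format (hypothetical path).
printf '%s\n' '#d-err-t-err-d2-err' \
    '.105 0.005 0.9766 0.0001 0.595 0.005' \
    '.095 0.005 0.9963 0.0001 0.595 0.005' > coriolis_data

# Same logic as the one-liner, indented for readability.
cat > avg.awk <<'EOF'
BEGIN { lines = 0 }
{
    if (substr($1, 1, 1) != "#") {   # first char of field 1 is not "#"
        sum += $3
        lines++
    }
}
END {
    avg = sum / lines
    print avg
}
EOF

awk -f avg.awk coriolis_data
# prints 0.98645
```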

I would remove them with sed first, then remove blank lines with grep.
sed 's/#.*//' < coriolis_data | egrep -v '^$' | awk ...
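A runnable sketch of that pipeline (sample data generated inline; grep -v stands in for egrep -v since no extended regex is needed, and the awk stage can then safely divide by NR):

```shell
printf '%s\n' '#d-err-t-err-d2-err' \
    '.105 0.005 0.9766 0.0001 0.595 0.005' \
    '.095 0.005 0.9963 0.0001 0.595 0.005' > coriolis_data

# Strip comments, drop the resulting blank lines, then average column 3.
# NR in the awk stage now counts only data lines.
sed 's/#.*//' coriolis_data | grep -v '^$' | awk '{sum += $3} END {print sum / NR}'
# prints 0.98645
```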

Related

grep a word and print that line plus column 1 and column 3 of the next three lines

I have a text like this,
test to print
1 aa ee 0.000 0.000 0.000
2 bb ff 0.000 0.000 0.000
3 cc gg 0 0 0
I want to print output like below:
1 2 3
test to print ee ff gg
or just
test to print ee ff gg
How can I do that?
$ mygrep(){
    grep -A3 "$@" | awk '
        NR%4==1 { c1r2=$0; next }
        NR%4==2 { c2r1=$1; c2r2=$3; next }
        NR%4==3 { c3r1=$1; c3r2=$3; next }
        NR%4==0 { c4r1=$1; c4r2=$3
                  printf " \t%s\t%s\t%s\n%s\t%s\t%s\t%s\n",
                      c2r1, c3r1, c4r1,
                      c1r2, c2r2, c3r2, c4r2
        }
    ' | column -t -s $'\t'
}
$ mygrep test <<'EOD'
test to print
1 aa ee 0.000 0.000 0.000
2 bb ff 0.000 0.000 0.000
3 cc gg 0 0 0
EOD
1 2 3
test to print ee ff gg
I would harness GNU AWK for this task in the following way. Let file.txt content be
test to print
1 aa ee 0.000 0.000 0.000
2 bb ff 0.000 0.000 0.000
3 cc gg 0 0 0
then
awk 'BEGIN{ORS="\t"}{print /test/?$0:$3}' file.txt
gives output
test to print ee ff gg
Explanation: I tell GNU AWK to use the tab character as the output record separator (ORS), so each print ends with a tab rather than a newline. Then for each line I print either the whole line ($0), if test is present in it, or the 3rd column ($3) otherwise. If you want to know more about ORS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR.
(tested in gawk 4.2.1)
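If the trailing tab (and missing final newline) from the ORS trick matters, a variant that should also work in non-GNU awks is to accumulate the row in a variable and print it once at the end:

```shell
printf '%s\n' 'test to print' \
    '1 aa ee 0.000 0.000 0.000' \
    '2 bb ff 0.000 0.000 0.000' \
    '3 cc gg 0 0 0' > file.txt

# Build the output record field by field, then emit it with a final newline.
awk '{ row = row (NR == 1 ? "" : "\t") (/test/ ? $0 : $3) }
     END { print row }' file.txt
# prints: test to print<TAB>ee<TAB>ff<TAB>gg
```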

Merging similar columns from different files into a matrix

I have files in the following format, each with the first column common amongst all the files:
File1.txt
ID Score
ABCD 0.9
BCBS 0.2
NBNC 0.67
TCGS 0.8
File2.txt
ID Score
ABCD 0.3
BCBS 0.9
NBNC 0.73
TCGS 0.12
File3.txt
ID Score
ABCD 0.23
BCBS 0.65
NBNC 0.94
TCGS 0.56
I want to merge the second column (Score) of all the files, keyed on the common first column, and display each file's name minus its extension as the column header, to identify where each score came from, so that the matrix would look something like:
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
$ cat tst.awk
BEGIN { OFS="\t" }
FNR>1 { id[FNR] = $1; score[FNR,ARGIND] = $2 }
END {
    printf "%s%s", "ID", OFS
    for (colNr=1; colNr<=ARGIND; colNr++) {
        sub(/\..*/,"",ARGV[colNr])
        printf "%s%s", ARGV[colNr], (colNr<ARGIND?OFS:ORS)
    }
    for (rowNr=2; rowNr<=FNR; rowNr++) {
        printf "%s%s", id[rowNr], OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", score[rowNr,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}
$ awk -f tst.awk File1.txt File2.txt File3.txt
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
Pick some string that can't occur in your input as the OFS; I used tab.
If you don't have GNU awk add FNR==1{ ARGIND++ } at the start of the script.
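A sketch of that portable tweak, keeping the file counter in an ordinary variable (fno here) so it does not collide with gawk's built-in ARGIND; two abbreviated sample files are generated inline to keep the example self-contained:

```shell
# Two abbreviated sample files in the question's format.
printf 'ID Score\nABCD 0.9\nBCBS 0.2\n' > File1.txt
printf 'ID Score\nABCD 0.3\nBCBS 0.9\n' > File2.txt

awk '
FNR==1 { fno++ }                              # count input files by hand
FNR>1  { id[FNR] = $1; score[FNR,fno] = $2 }
END {
    printf "%s\t", "ID"
    for (c = 1; c <= fno; c++) {
        sub(/\..*/, "", ARGV[c])              # strip the extension
        printf "%s%s", ARGV[c], (c < fno ? "\t" : "\n")
    }
    for (r = 2; r <= FNR; r++) {
        printf "%s\t", id[r]
        for (c = 1; c <= fno; c++)
            printf "%s%s", score[r,c], (c < fno ? "\t" : "\n")
    }
}' File1.txt File2.txt
```

This prints the tab-separated matrix with header ID, File1, File2 and one row per ID, in any awk.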
Another alternative
$ awk 'NR==1{$0=$1"\t"FILENAME}1' File1 > all
$ for f in File{2..6}; do
      paste all <(p "$f") > temp && cp temp all
  done
with the function p defined beforehand as
p() { awk 'NR==1{print FILENAME;next} {print $2}' "$1"; }
I copied your data to 6 identical files File1..File6, and the script produced the output below. Most of the work is setting up the column names.
ID File1 File2 File3 File4 File5 File6
ABCD 0.9 0.9 0.9 0.9 0.9 0.9
BCBS 0.2 0.2 0.2 0.2 0.2 0.2
NBNC 0.67 0.67 0.67 0.67 0.67 0.67
TCGS 0.8 0.8 0.8 0.8 0.8 0.8

Grep a pattern and ignore others

I have an output with this pattern:
Auxiliary excitation energy for root 3: (variable value)
It appears a large number of times in the output, but I only want to grep the last occurrence.
I'm a beginner in bash, so I don't understand the "tail" command yet...
Here is what I wrote:
for nn in 0.00000001 0.4 1.0; do
for w in 0.0 0.001 0.01 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45 0.475 0.5; do
a=`grep ' Auxiliary excitation energy for root 3: ' $nn"_"$w.out`
echo $w" "${a:47:16} >> data_$nn.dat
done
done
With $nn and $w as parameters.
But with this grep I only get the first match. How do I get only the last one?
data example :
line 1 Auxiliary excitation energy for root 3: 0.75588889
line 2 Auxiliary excitation energy for root 3: 0.74981555
line 3 Auxiliary excitation energy for root 3: 0.74891111
line 4 Auxiliary excitation energy for root 3: 0.86745155
My command greps line 1; I would like to grep the last line that has my pattern: here, line 4 in my example.
To get the last match, you can use:
grep ... | tail -n 1
Where ... are your grep parameters. So your script would read (with a little cleanup):
for nn in 0.00000001 0.4 1.0; do
for w in 0.0 0.001 0.01 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45 0.475 0.5; do
a=$( grep ' Auxiliary excitation energy for root 3: ' "${nn}_${w}.out" | tail -n 1 )
echo "$w ${a:47:16}" >> "data_$nn.dat"
done
done
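The grep | tail pipe can also be collapsed into a single awk pass: remember the most recent matching line and print it at end of file. A sketch (sample.out is a made-up file standing in for one of the $nn"_"$w.out files):

```shell
# A made-up output file with several matches.
printf '%s\n' \
    'Auxiliary excitation energy for root 3: 0.75588889' \
    'some unrelated line' \
    'Auxiliary excitation energy for root 3: 0.86745155' > sample.out

# Overwrite `last` on every match; at EOF it holds the final occurrence.
awk '/Auxiliary excitation energy for root 3:/ { last = $0 }
     END { if (last != "") print last }' sample.out
# prints: Auxiliary excitation energy for root 3: 0.86745155
```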

Awk - extracting information from an xyz-format matrix

I have an x y z matrix of the format:
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
...and so on. I'm interested in summing all of the z values (column 3) for a particular x value (column 1) and printing the output on separate lines (with the x value as a prefix), such that the output for the previous example would appear as:
1 0.34
2 0.92
3 0.86
I have a strong feeling that awk is the right tool for the job, but my knowledge of awk is really lacking, and I'd really appreciate any help that anyone can offer.
Thanks in advance.
I agree that awk is a good tool for this job — this is pretty much exactly the sort of task it was designed for.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data
For the given data, I got:
2 0.92
3 0.86
1 0.34
Clearly, you could pipe the output to sort -n to get the results in sorted order.
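For example, with the question's data recreated inline in a file named data, matching the answer's command:

```shell
printf '%s\n' '1 1 0.02' '1 2 0.10' '1 4 0.22' \
    '2 1 0.70' '2 2 0.22' '3 2 0.44' '3 3 0.42' > data

# Sum column 3 per x value, then sort the (unordered) for-in output numerically.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data | sort -n
# prints:
# 1 0.34
# 2 0.92
# 3 0.86
```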
To get that in sorted order with awk, you have to go outside the realm of POSIX awk and use the GNU awk extension function asorti:
gawk '{ sum[$1] += $3 }
END { n = asorti(sum, map); for (i = 1; i <= n; i++) print map[i], sum[map[i]] }' data
Output:
1 0.34
2 0.92
3 0.86

counting how many times the order of two consecutive numbers in a two-column file is reversed

Given 2 files of N numbers like
file1
1 0.001
2 0.002
3 0.002
4 0.005
5 0.007
6 0.008
7 0.008
8 0.009
9 0.0010
0 0.011
and the file2 is just a shuffled version of file1:
0 0.011
8 0.009
7 0.008
3 0.002
5 0.007
9 0.0010
1 0.001
4 0.005
2 0.002
6 0.008
I would like to count the inversions of two consecutive numbers, but with one twist: if the second column of file1.dat has two consecutive equal values (as for the pairs 2-3 and 6-7), the pair should count directly as 0.5 of an inversion, without looking into file2.dat. In this case the result would be 4 inversions. A similar question (and answer) was posted as counting how many time the order of two consecutive number in a file are reversed in a second file in BASH
I did it for two cases; pick the one you need.
This one counts each pair as either 0.5 or 1 (result = 4):
kent$ awk 'FNR==NR{o[NR]=$1;next;}{v[$1]=FNR;m[$1]=$2;n=FNR}
END{ for(i=1;i<=n-1;i++) { t+=m[o[i]]==m[o[i+1]]?0.5:v[o[i]]>v[o[i+1]]?1:0};
print "inversions:"t;
}' f1 f2
inversions:4
This one counts an inversion as 1 and adds an extra 0.5 for equal pairs (result = 6):
kent$ awk 'FNR==NR{o[NR]=$1;next;}{v[$1]=FNR;m[$1]=$2;n=FNR}
END{ for(i=1;i<=n-1;i++) {t1+=(v[o[i]]>v[o[i+1]])?1:0; t2+=m[o[i]]==m[o[i+1]]?0.5:0};
print "inversions:"t1+t2;
}' f1 f2
inversions:6