Awk Calc Avg Rows Below Certain Line

Awk Calc Avg Rows Below Certain Line - bash

I'm having trouble calculating an average of specific numbers in column BELOW a specific text identifier using awk. I have two columns of data and I'm trying to start the average keying on a common identifier that repeats, which is 01/1991. So, awk should calc the average of all lines beginning with 01/1991, which repeats, using the next 21 lines with total count of rows for average = 22 for the total number of years 1991-2012. The desired output is an average of each TextID/Name entry for all the January's (01) for each year 1991 - 2012 show below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using this code shown below but the average is starting to calculate BEFORE the specific identifier line and not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks and explanations of the solution is greatly appreciated! I have edited the original answer with more description - thank you again.

If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991". Also, NR%22==0 will look at line numbers divisible by 22, not 22 lines after the point it thinks you care about.
You can do something like this instead:
awk '
BEGIN { l=-1; }
$1 == "01/1991," {
l=22;
s=0;
}
l > 0 { s+=$2; l--; }
l == 0 { print s/22; l--; }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust.

If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
$start = 0;
$have_started = 0;
$count = 0;
$sum = 0;
while (<>) {
$line = $_;
# Grab the value after the date and comma
if ($line = /\d+\/\d+,\s+([\d\.]+)/) {
$val = $+;
}
# Start summing values after 01/1991
if (/01\/1991,\s+([\d\.]+)/) {
$have_started = 1;
$val = $+;
}
# If we have started counting,
if ($have_started) {
$count++;
$sum += $+;
}
}
print "Average of all values = " . $sum/$count;
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl

Related

Get line number where first occurrence of a value appears?

I have a CSV file like below:
E Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Mean
1 0.7019 0.6734 0.6599 0.6511 0.701 0.6977 0.680833333
2 0.6421 0.6478 0.6095 0.608 0.6525 0.6285 0.6314
3 0.6039 0.6096 0.563 0.5539 0.6218 0.5716 0.5873
4 0.5564 0.5545 0.5138 0.4962 0.5781 0.5154 0.535733333
5 0.5056 0.4972 0.4704 0.4488 0.5245 0.4694 0.485983333
I'm trying to use find the row number where the final column has a value below a certain range. For example, below 0.6.
Using the above CSV file, I want to return 3 because E = 3 is the first row where Mean <= 0.60. If there is no value below 0.6 I want to return 0. I am in effect returning the value in the first column based on the final column.
I plan to initialize this number as a constant in gnuplot. How can this be done? I've tagged awk because I think it's related.

In case you want a gnuplot-only version... if you use a file remove the datablock and replace $Data by your filename in " ".
Edit: You can do it without a dummy table, it can be done shorter with stats (check help stats). Even shorter than the accepted solution (well, we are not at code golf here), but additionally platform-independent because it's gnuplot-only.
Furthermore, in case E could be any number, i.e. 0 as well, then it might be better
to first assign E = NaN and then compare E to NaN (see here: gnuplot: How to compare to NaN?).
Script:
### conditional extraction into a variable
reset session
$Data <<EOD
E Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Mean
1 0.7019 0.6734 0.6599 0.6511 0.701 0.6977 0.680833333
2 0.6421 0.6478 0.6095 0.608 0.6525 0.6285 0.6314
3 0.6039 0.6096 0.563 0.5539 0.6218 0.5716 0.5873
4 0.5564 0.5545 0.5138 0.4962 0.5781 0.5154 0.535733333
5 0.5056 0.4972 0.4704 0.4488 0.5245 0.4694 0.485983333
EOD
E = NaN
stats $Data u ($8<=0.6 && E!=E? E=$1 : 0) nooutput
print E
### end of script
Result:
3.0
Actually, OP wants to return E=0 if the condition was not met. Then the script would be like this:
E=0
stats $Data u ($8<=0.6 && E==0? E=$1 : 0) nooutput

Another awk. You could initialize the default return value to var ret in BEGIN but since it's 0 there is really no point as empty var+0 produces the same effect. If the threshold value of 0.6 is not met before the ENDis reached, that is returned. If it is met, exit invokes the END and ret is output:
$ awk '
NR>1 && $NF<0.6 { # final column has a value below a certain range
ret=$1 # I want to return 3 because E = 3
exit
}
END {
print ret+0
}' file
Output:
3

Something like this should do the trick:
awk 'NR>1 && $8<.6 {print $1;fnd=1;exit}END{if(!fnd){print 0}}' yourfile

Parsing multiple instances of data

I am trying to parse multiple instances of data from a textfile. I can grep and grab one line and the lat/lon associated with that find, but I am having issued parsing multiple instances:
... CATEGORICAL ...
SLGT 33618675 34608681 35658642 36668567 38218542 41018363
41588227 41918045 41377903 40177805 38927813 37817869
36678030 35068154 33368262 33078321 32888462 33618675
SLGT 30440169 31710202 33010185 33730148 34010037 33999962
33709892 32869871 30979883 29539912 29430025 30440169
SLGT 41788755 41698893 42069059 42639132 43889124 44438960
44438757 43988717 43278708 42398720 41788755
MRGL 42897922 41907743 40147624 38837627 37637700 35897915
35028021 34038079 33118130 31998226 31698419 32078601
32818733 33848809 34758764 36998623 38588677 39458701
40178757 40608870 41069099 43549479 44499512 44809478
45259379 44989263 45109100 45718986 46478920 46758853
46738752 46398664 44768565 44308457 43198218
MRGL 29720174 31900221 33650181 34160154 34430032 34649931
34159800 32539784 31359767 29739808 29299723 28969581
28959440 99999999 26769674 26579796 26139874
TSTM 45077438 43177245 40597113 99999999 30488085 30248563
29588926 28739072 28569092 99999999 27138160 27578139
27908100 27848061 27518032 26968006 26338005 25698017
25338025 25088048 25058071 25238109 25578128 25888157
26218171 26578170 26988163 27138160 99999999 29200399
31910374 33520340 35190229 35450147 36109944 36399709
35779395 36399167 38559059 40189373 41729594 43029985
42820283 42860489 43580863 44121062 44521135 45281179
46271166 47561286 48251548 48671765 49051814 99999999
38810245 37660271 37120322 36950398 37090559 37380662
38090741 39410791 39980777 40930695 41380598 41370510
41190353 40840299 40220263 38810245
From: https://www.spc.noaa.gov/products/outlook/archive/2019/KWNSPTSDY1_201906241300.txt
Here is my code and results:
#!/bin/sh
sed -n '/^MRGL/,/^TSTM/p;/^TSTM/q' day1_status | sed '$ d' | sed -e 's/MRGL//g' > MRGL
while read line
do
count=1
ncols=$(echo $line | wc -w)
while [ $count -le $ncols ]
do
echo $line | cut -d' ' -f$count
((count++))
done
done < MRGL > MRGL_output.txt
cat MRGL_output.txt | sed ':a;s/\B[0-9]\{2\}\>/.&/;ta'| sed 's/./, -/6' > MRGL_final
Results:
one instance of MRGL and the lat/lon associated with that polygon
more MRGL
32947889 34137855 35307825 36147735 36327622 35797468
27107968 25518232 99999999 27088303 28418215 30208125
30618064
Turn the line above into a single instance of lines
more MRGL_output.txt
32947889
34137855
35307825
36147735
36327622
35797468
27107968
25518232
99999999
27088303
28418215
30208125
30618064
Final format that I need it in
more MRGL_final
32.94, -78.89
34.13, -78.55
35.30, -78.25
36.14, -77.35
36.32, -76.22
35.79, -74.68
27.10, -79.68
25.51, -82.32
99.99, -99.99
27.08, -83.03
28.41, -82.15
30.20, -81.25
30.61, -80.64
Just need to parse multiple instances that show up.
UPDATE for better explanation.
... CATEGORICAL ...
ENH 38298326 40108202 40518094 40357974 39907953 39017948
38038052 36148202 35848297 35888367 36618371 38298326
SLGT 30440169 31710202 33010185 33730148 34010037 33999962
33709892 32869871 30979883 29539912 29430025 30440169
SLGT 33548672 34408661 35918543 36858496 38648520 41018363
41588227 41918045 41377903 40177805 38927813 37817869
36678030 35068154 33368262 33078321 32888462 33548672
SLGT 41788755 41698893 42069059 42639132 43889124 44438960
44438757 43988717 43278708 42398720 41788755
MRGL 29720174 31900221 33650181 34160154 34430032 34649931
34159800 32539784 31359767 30059748 29299723 28969581
28959440 99999999 26769674 26579796 26139874
MRGL 42897922 41907743 40147624 38837627 37637700 35897915
35028021 34038079 33118130 31938225 30758424 30678620
30988709 34128741 36208583 37738554 39508601 40628878
41069099 43549479 44499512 44809478 45259379 44989263
45109100 45718986 46478920 46758853 46738752 46398664
44768565 44308457 43198218
TSTM 30488085 29978211 29408316 29068379 99999999 27138160
27578139 27908100 27848061 27518032 26968006 26338005
25698017 25338025 25088048 25058071 25238109 25578128
25888157 26218171 26578170 26988163 27138160 99999999
45427410 43217292 40247181 99999999 28650405 31910374
33520340 35190229 35450147 36109944 36399709 35779395
36769245 38319148 40189373 41219571 41299753 39959979
38220054 37320091 36560136 36070290 36100295 35840394
36790544 37150626 37880709 39110774 40120876 41150895
41600769 41890540 43070599 43580863 43390914 43401262
44171458 45521497 46131301 47181242 47561286 48251548
48671765 49371856
Wanting to take this data set above and grab each available risk ENH, SLGT, MRGL, TSTM lat and long and place into this format:
"Enhanced Risk"
38.29, -83.26
40.10, -82.02
40.51, -80.94
40.35, -79.74
39.90, -79.53
39.01, -79.48
38.03, -80.52
36.14, -82.02
35.84, -82.97
35.88, -83.67
36.61, -83.71
38.29, -83.26
End:
"Slight Risk"
30.44, -101.69
31.71, -102.02
33.01, -101.85
33.73, -101.48
34.01, -100.37
33.99, -99.62
33.70, -98.92
32.86, -98.71
30.97, -98.83
29.53, -99.12
29.43, -100.25
30.44, -101.69
End:
"Slight Risk"
33.54, -86.72
34.40, -86.61
35.91, -85.43
36.85, -84.96
38.64, -85.20
41.01, -83.63
41.58, -82.27
41.91, -80.45
41.37, -79.03
40.17, -78.05
38.92, -78.13
37.81, -78.69
36.67, -80.30
35.06, -81.54
33.36, -82.62
33.07, -83.21
32.88, -84.62
33.54, -86.72
End:
"Slight Risk"
41.78, -87.55
41.69, -88.93
42.06, -90.59
42.63, -91.32
43.88, -91.24
44.43, -89.60
44.43, -87.57
43.98, -87.17
43.27, -87.08
42.39, -87.20
41.78, -87.55
End:
"Marginal Risk"
29.72, -101.74
31.90, -102.21
33.65, -101.81
34.16, -101.54
34.43, -100.32
34.64, -99.31
34.15, -98.00
32.53, -97.84
31.35, -97.67
30.05, -97.48
29.29, -97.23
28.96, -95.81
28.95, -94.40
26.76, -96.74
26.57, -97.96
26.13, -98.74
End:

Here's a little awk program which seems to work, although I'm not certain about some of the details. In particular, I don't know what the minimum value for longitude is; evidently, a value under the minimum has 100 added to it before the longitude is negated. So you'll have to change LON_THRESHOLD to what you consider the correct value.
I've tried to avoid the usual temptation to golf awk programs into a textual minimum, in the hopes that the way this program works is less obscure. But it's entirely possible that some awkisms snuck in anyway. I added a bit of explanation at the end.
BEGIN { risk["HIGH"] = "High Risk"
risk["ENH"] = "Enhanced Risk"
risk["SLGT"] = "Slight Risk"
risk["MRGL"] = "Marginal Risk"
LON_THRESHOLD = 30
END_STRING = "End:"
}
END { if (in_risk) print END_STRING }
in_risk && substr($0, 1, 1) != " " {
print END_STRING "\n" "\n"
in_risk = 0
}
$1 in risk { printf("\"%s\"\n", risk[$1])
in_risk = 2
}
in_risk { for (i = in_risk; i <= NF; ++i) {
lat = substr($i, 1, 4) / 100
lon = substr($i, 5, 4) / 100
if (lon < LON_THRESHOLD) lon += 100
printf "%5.2f, %.2f\n", lat, -lon
}
in_risk = 1
}
Save that program as, for example, noaa.awk, and then apply it with:
awk -f noaa.awk input.txt
By way of explanation:
Awk programs consist of a series of rules. Each rule has a predicate -- that is, an expression which evaluates to a true or false value -- and an action.
Awk processes each line from its input in turn, running through all of the rules and executing the actions of the ones whose predicates evaluate to a true value. Inside the action, you can use the $ operator to access individual fields in the input (by default, fields are separated with whitespace). $0 is the entire input line, and $n is field number n. Unlike bash/sh, $ is an operator and can be applied to an expression.
BEGIN and END rules are special, in that they are not real variables. BEGIN rules are executed exactly once, before any other processing; END rules are executed exactly once after all processing is finished. In this example, as is common, BEGIN is used to initialise reference data, while END is used for any necessary termination -- in this case, printing the final End: line.
In cases like this, where the desired action is really dependent on where we are in the file, it's necessary to build some kind of state machine, and I did that using the variable in_risk, which has three possible values:
0 or undefined: We're not currently in a block corresponding to a risk selector.
1: The current line, if it starts with a space, is part of a previously identified risk selector.
2: The current line has been detected as starting with a risk selector.
The reason for the difference between the last two values is that $1 in a line which starts with a risk selector is the risk selector, whereas in a line which starts with a space, $1 is actually the first number. So when we're iterating over the numbers in a line, we have to start with $2 for lines which start with a risk selector.

If you're just asking how to turn a file of lines of like AABBCCDD into lines like AA.BB, -CC.DD:
perl -nE '/^(..)(..)(..)(..)$/ && say "$1.$2, -$3.$4"' MRGL_output.txt
(There's almost certainly better ways to get from your original input to those lines, but I'm not really clear on what your posted code is doing or why)
I think this will process your original input correctly, but can't be sure because the numbers in your sample output don't match up with your sample input so I can't verify:
perl -anE 'if (/^MRGL/ .. /^TSTM/) { exit if /^TSTM/; push #nums, #F }
END { for (#nums) {
if (/^(..)(..)(..)(..)$/) { say "$1.$2, -$3.$4" }
}}' day1_status

Got GNU Awk?
awk -v RS='\\s+' '
/[A-Z]/ {p = /^MRGL$/? 1: 0; next}
p {print gensub(/(..)(..)(..)(..)/, "\\1.\\2, -\\3.\\4", "G")}
' file
-v RS'\\s+' - Use any amount of whitespace as the Record Separator
/[A-Z]/ {...} - On records with uppercase alphabetics, do
p = /^MRGL$/? 1: 0; next - Set flag if record is MRGL, else unset, but always skip any other rules.
p {print gensub(...)} - Print result of gensub if flag is set
/(...)/, "\\1", "G" - Capturing groups, Backreferences, Global substitution.

AWK performance while processing big files

I have an awk script that I use for calculate how much time some transactions takes to complete. The script gets the unique ID of each transaction and stores the minimum and maximum timestamp of each one. Then it calculates the difference and at the end it shows those results that are over 60 seconds.
It works very well when used with some thousand (200k) but it takes more time when used in real world. I tested it several times and it takes about 15 minutes to process about 28 million of lines. Can I consider this good performance or it is possible to improve it?
I'm open to any kind of suggestion.
Here you have the complete code
zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log | awk '{
gsub("[()]|:.*","",$4); #just removing ugly chars
++cont
min=$4"min" #name for maximun value of current transaction
max=$4"max" #same as previous, just for readability
split($2,secs,/[:,]/) #split hours,minutes and seconds
seconds = 3600*secs[1] + 60*secs[2] + secs[3] #turn everything into seconds
if(arr[min] > seconds || arr[min] == 0)
arr[min]=seconds
if(arr[max] < seconds)
arr[max]=seconds
dif=arr[max] - arr[min]
if(dif > 60)
result[$4] = dif
}
END{
for(x in result)
print x" - "result[x]
print ":Processed "cont" lines"
}'

You don't need to calculate the dif every time you read a record. Just do it once in the END section.
You don't need that cont variable, just use NR.
You dont need to populate min and max separately string concatenation is slow in awk.
You shouldn't change $4 as that will force the record to be recompiled.
Try this:
awk '{
name = $4
gsub(/[()]|:.*/,"",name); #just removing ugly chars
split($2,secs,/[:,]/) #split hours,minutes and seconds
seconds = 3600*secs[1] + 60*secs[2] + secs[3] #turn everything into seconds
if (NR==1) {
min[name] = max[name] = seconds
}
else {
if (min[name] > seconds) {
min[name] = seconds
}
if (max[name] < seconds) {
max[name] = seconds
}
}
}
END {
for (name in min) {
diff = max[name] - min[name]
if (diff > 60) {
print name, "-", diff
}
}
print ":Processed", NR, "lines"
}'

After making some test, and with the suggestions gave by Ed Morton (both for code improvement and performance test) I found that the bottleneck was the zgrep command. Here is an example that does several things:
Check if we have a transaction line (first if)
Cleans the transaction id
checks if this has been already registered (second if) by checking if it is in the array
If is not registered then checks if it is the appropriate type of transaction and if so it registers the timestamp in second
If is already registered saves the new time-stamp as the maximun
After all it makes the necessary operations to calculate the time difference
Thank you very much to all that helped me.
zcat /veryBigLog.gz | awk '
{if($4 ~ /^\([:alnum:]/ ){
name=$4;gsub(/[()]|:.*/,"",name);
if(!(name in min)){
if($0 ~ /TypeOFTransaction/ ){
split($2,secs,/[:,]/)
seconds = 3600*secs[1] + 60*secs[2] + secs[3]
max[name] = min[name]=seconds
print lengt(min) "new "name " start at "seconds
}
}else{
split($2,secs,/[:,]/)
seconds = 3600*secs[1] + 60*secs[2] + secs[3]
if( max[name] < seconds) max[name]=seconds
print name " new max " max[name]
}
}}END{
for(x in min){
dif=max[x]- min[x]
print max[x]" max - min "min[x]" : "dif
}
print "Processed "NR" Records"
print "Found "length(min)" MOs" }'

UNIX find every line in a text file that contains a string and output last word of these

I have a text file that contains a number of lines that starts with "# Control Point No"
I managed to get an output of only these lines by doing
grep '# Control Point No'
Now I want only to keep the last word of all these line.
The lines look like
"# Control Point No 39217: 1.52520046527084"
So I want to output only the last numbers as 1.52520046527084
and then:
-find lowest value
-find highest value
-calculate average value
All this I want to do is not all included in the post title, sorry
Thanks

Python is your friend:
#!/usr/bin/python
import re, fileinput, sys
numlines = 0
lowest = sys.float_info.max
highest = sys.float_info.min
total = 0.0
for line in fileinput.input():
m = re.match(r'# Control Point No (\d+): (.+)', line)
if m:
value = float(m.group(2))
numlines += 1
if value < lowest:
lowest = value
if value > highest:
highest = value
total += value
print "lowest=", lowest, ", highest=", highest, ", average=", (total / numlines)
$ chmod 0755 procdata.py
$ ./procdata.py < testdata
lowest= 1.0 , highest= 67.9 , average= 7.31550797863

How Break The Infinite Loop in Perl Under Redo (generating random number)

I intend to generate random number the following step:
Read the data from file (<DATA>)
Generate random numbers as many as the input data lines
The random number should not be generated twice,
e.g. if the rand number generated in loop no x, has been created
before then, recreate the random number.
Here is the code I have which leads to infinite loop.
What's wrong with my logic, and how can I fix it?
#!/usr/bin/perl -w
use strict;
my %chrsize = ('chr1' =>249250621);
# For example case I have created the
# repository where a value has been inserted.
my %done =("chr1 182881372" => 1);
while ( <DATA> ) {
chomp;
next if (/^\#/);
my ($chr,$pos) = split(/\s+/,$_);
# this number has been generated before
# with this: int(rand($chrsize{$chr}));
# hence have to create other than this one
my $newst =182881372;
my $newpos = $chr ."\t".$newst;
# recreate random number
for (0...10){
if ( $done{$newpos} ) {
# INFINITE LOOP
$newst = int(rand($chrsize{$chr}));
redo;
}
}
$done{$newpos}=1;
print "$newpos\n";
}
__DATA__
# In reality there are 20M of such lines
# name positions
chr1 157705682
chr1 19492676
chr1 169660680
chr1 226586538
chr1 182881372
chr1 11246753
chr1 69961084
chr1 180227256
chr1 141449512

You had a couple of errors:
You were setting $newst within your loop every time, so $newpos never took on a new value.
Your inner for loop didn't make sense, because you never actually changed $newpos before checking the condition again.
redo; was working on the inner loop.
Here is a corrected version that avoids redo altogether.
Update: I edited the algorithm a bit to make it simpler.
#!/usr/bin/perl -w
use strict;
my $chr1size = 249250621;
my %done;
my $newst;
while ( <DATA> ) {
chomp;
next if (/^\#/);
my ($chr,$pos) = split(/\s+/,$_);
my $newpos;
#This will always run at least once
do {
$newst = int(rand($chr1size));
$newpos = $chr ."\t".$newst;
} while ( $done{$newpos} );
$done{$newpos}=1;
print "$newpos\n";
}
Update 2: while the above algorithm will work, it will get really slow on 20,000,000 lines. Here is an alternative approach that should be faster (There is sort of a pattern to the random numbers it generates, but it would probably ok for most situations).
#!/usr/bin/perl -w
use strict;
my $newst;
#make sure you have enough. This is good if you have < 100,000,000 lines.
use List::Util qw/shuffle/;
my #rand_pieces = shuffle (0..10000);
my $pos1 = 0;
my $offset = 1;
while ( <DATA> ) {
chomp;
next if (/^\#/);
my ($chr,$pos) = split(/\s+/,$_);
$newst = $rand_pieces[$pos1] * 10000 + $rand_pieces[($pos1+$offset)%10000];
my $newpos = $chr ."\t".$newst;
$pos1++;
if ($pos1 > $#rand_pieces)
{
$pos1 = 0;
$offset = ++$offset % 10000;
if ($offset == 1) { die "Out of random numbers!"; }
}
print "$newpos\n";
}

Add a counter to your loop like this:
my $counter = 0;
# recrate
for (0...10){
if ( $done{$newpos} ) {
# INFINITE LOOP
$newst = int(rand($chr1size));
redo if ++$counter < 100; # Safety counter
# It will continue here if the above doesn't match and run out
# eventually
}
}

To get rid of the infinite loop, replace redo with next.
http://www.tizag.com/perlT/perlwhile.php :
"Redo will execute the same iteration over again."
Then you may need to fix the rest of the logic ;).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Awk Calc Avg Rows Below Certain Line - bash

Related

Get line number where first occurrence of a value appears?

Parsing multiple instances of data

AWK performance while processing big files

UNIX find every line in a text file that contains a string and output last word of these

How Break The Infinite Loop in Perl Under Redo (generating random number)

Categories

Resources