Find the N max differences between two consecutive numbers in a file using unix shell

Integer numbers are stored in a file, one integer per row/line. I need to find the max difference and the N max differences between two consecutive numbers in the file.
e.g.
12
15
50
80
Max diff: 35 (50 - 15), and say N=2, so 1st max: 35 and 2nd max: 30 (80 - 50).

#!/usr/bin/awk -f
NR > 1 {
    diff = $0 - prev
    # maxdiff[0..N-1] holds the N largest differences seen so far,
    # in descending order
    for (i = 0; i < N; ++i)
        if (diff > maxdiff[i]) {
            # shift smaller entries down to make room, then insert
            for (j = N; --j > i; )
                if (j-1 in maxdiff) maxdiff[j] = maxdiff[j-1]
            maxdiff[j] = diff
            break
        }
}
{ prev = $0 }
END { for (i = 0; i < N; ++i) if (maxdiff[i] != "") print maxdiff[i] }
E.g., if the script is named nmaxdiff.awk (and is executable) and the numbers are stored in the file numbers, enter
./nmaxdiff.awk N=2 numbers
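As a quick sanity check, a hypothetical run against the four sample numbers above (the consecutive differences are 3, 35 and 30):

$ printf '12\n15\n50\n80\n' > numbers
$ ./nmaxdiff.awk N=2 numbers
35
30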

Related

Averaging diagonally over a matrix

I have a matrix, e.g. a 5 x 5 matrix:
$ cat input.txt
1 5.6 3.4 2.2 -9.99E+10
2 3 2 2 -9.99E+10
2.3 3 7 4.4 5.1
4 5 6 7 8
5 -9.99E+10 9 11 13
Here I would like to ignore -9.99E+10 values.
I am looking for the average of all entries after dividing the matrix diagonally. Here are four possibilities (using 999 in place of -9.99E+10 to save space in the graphic):
I would like to average over all the values under the different shaded triangles.
So the desired output is:
$ cat outfile.txt
P1U 3.39 (average of all values on the upper side of Possibility 1, without considering -9.99E+10)
P1L 6.88 (average of all values on the lower side of Possibility 1, without considering -9.99E+10)
P2U 4.90
P2L 5.59
P3U 3.31
P3L 6.41
P4U 6.16
P4L 4.16
I am finding it difficult to develop a proper algorithm to write this in Fortran or in a shell script.
I am thinking of the following algorithm, but can't work out what comes next.
step 1: #Assign -9.99E+10 to the lower diagonal values of a[i,j]
for i in {1..5}; do
    for j in {1..5}; do
        a[i,j+1]=-9.99E+10
    done
done
step 2: #take the average
sum=0
for i in {1..5}; do
    for j in {1..5}; do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f",P1U, sum
step 3: #Assign -9.99E+10 to the upper diagonal values of a[i,j]
for i in {1..5}; do
    for j in {1..5}; do
        a[i-1,j]=-9.99E+10
    done
done
step 4: #take the average
sum=0
for i in {1..5}; do
    for j in {1..5}; do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f",P1L,sum
Just save all the values in an array indexed by row and column number, and then in the END section repeat this process, setting the beginning and end row and column loop delimiters as needed when defining the loops for each section:
$ cat tst.awk
{
    for (colNr=1; colNr<=NF; colNr++) {
        vals[colNr,NR] = $colNr
    }
}
END {
    sect = "P1U"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        for (colNr=begColNr; colNr<=endColNr-rowNr+1; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)

    sect = "P1L"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        for (colNr=endColNr-rowNr+1; colNr<=endColNr; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
}
$ awk -f tst.awk file
P1U 3.39
P1L 6.88
Given the above for handling the first quadrant's diagonal halves, I assume you'll be able to figure out the other quadrants' diagonal halves. The horizontal/vertical quadrant halves are trivial: just set begRowNr to int(NR/2)+1, or endRowNr to int(NR/2), or begColNr to int(NF/2)+1, or endColNr to int(NF/2), and then loop through the resulting full range of values of each, as in the sketch below.
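For example, a minimal sketch of one horizontal half (hypothetically labeled P3U here), following the same pattern as the P1U/P1L blocks above. Note that whether the middle row of an odd-sized matrix is included (endRowNr = int(NR/2) versus int(NR/2)+1) depends on how the halves are defined; the question's expected P3U value (3.31) averages rows 1-3, i.e. it includes the middle row:

sect = "P3U"
begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = int(NR/2)+1
sum = cnt = 0
for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
    for (colNr=begColNr; colNr<=endColNr; colNr++) {
        val = vals[colNr,rowNr]
        if ( val != "-9.99E+10" ) {
            sum += val
            cnt++
        }
    }
}
printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)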
You can compute them all in one pass:
$ awk -v NA='-9.99E+10' '{ for (i=1; i<=NF; i++) a[NR,i]=$i }
END {
    for (i=1; i<=NR; i++)
        for (j=1; j<=NF; j++) {
            v = a[i,j]
            if (v != NA) {
                if (i+j <= 6) { p["1U"] += v; c["1U"]++ }
                if (i+j >= 6) { p["1L"] += v; c["1L"]++ }
                if (j >= i)   { p["2U"] += v; c["2U"]++ }
                if (i <= 3)   { p["3U"] += v; c["3U"]++ }
                if (i >= 3)   { p["3D"] += v; c["3D"]++ }
                if (j <= 3)   { p["4U"] += v; c["4U"]++ }
                if (j >= 3)   { p["4D"] += v; c["4D"]++ }
            }
        }
    for (k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}' file | sort
P1L 6.88
P1U 3.39
P2U 4.90
P3D 6.41
P3U 3.31
P4D 6.16
P4U 4.16
I forgot to add P2D, but from the pattern it should be clear what needs to be done.
To generalize further, as suggested: assert NF==NR, otherwise the diagonals are not well defined. Let n=NF (= NR). You can then replace 6 with n+1 and 3 with ceil(n/2), which can be implemented as function ceil(x) { return x==int(x) ? x : int(x)+1 }.
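Putting that together, a rough sketch of the generalized one-pass script, with the missing P2D case filled in (untested; assumes a square matrix, and gawk for writing to /dev/stderr):

$ awk -v NA='-9.99E+10' '{ for (i=1; i<=NF; i++) a[NR,i]=$i }
function ceil(x) { return x==int(x) ? x : int(x)+1 }
END {
    if (NF != NR) { print "matrix is not square" > "/dev/stderr"; exit 1 }
    n = NF; mid = ceil(n/2)
    for (i=1; i<=n; i++)
        for (j=1; j<=n; j++) {
            v = a[i,j]
            if (v != NA) {
                if (i+j <= n+1) { p["1U"] += v; c["1U"]++ }   # above the anti-diagonal
                if (i+j >= n+1) { p["1L"] += v; c["1L"]++ }   # below the anti-diagonal
                if (j >= i)     { p["2U"] += v; c["2U"]++ }   # above the main diagonal
                if (j <= i)     { p["2D"] += v; c["2D"]++ }   # below the main diagonal
                if (i <= mid)   { p["3U"] += v; c["3U"]++ }   # top rows
                if (i >= mid)   { p["3D"] += v; c["3D"]++ }   # bottom rows
                if (j <= mid)   { p["4U"] += v; c["4U"]++ }   # left columns
                if (j >= mid)   { p["4D"] += v; c["4D"]++ }   # right columns
            }
        }
    for (k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}' file | sort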

Algorithm to find an interval with the highest summed weight of weighted overlapping intervals

Well, I think it's hard to explain, so I've made a figure to show it.
As we can see in the figure, there are 6 intervals of time, each with its own weight. The higher the opacity, the higher the weight. I want an algorithm to find the interval with the highest summed weight. In the case of the figure, it would be the overlap of intervals 5 and 6, which is the area with the highest opacity.
Split each interval into start and end points.
Sort the points.
Start with a sum of 0.
Iterate through the points using a sweep-line algorithm:
If you get a start point:
Increase the sum by the weight of the corresponding interval.
If the sum is higher than the best sum so far, store this start point and set a flag.
If you get an end point:
If the flag is set, record the stored start point and this end point, with the current sum, as the best interval so far, and reset the flag.
Decrease the sum by the weight of the corresponding interval.
This is derived from the answer I wrote here, which is based on the unweighted version, i.e. finding the maximum number of overlapping intervals, rather than the maximum summed weight.
Example:
For this example:
The start / end points will be sorted as: (S = start, E = end)
1S, 1E, 2S, 3S, 2E, 3E, 4S, 5S, 4E, 6S, 5E, 6E
Iterating through them, you'll set the flag on 1S, 5S and 6S, and you'll store the respective intervals at 1E, 4E and 5E (which are the first end points you reach after those start points).
You won't set the flag on 2S, 3S or 4S, as the sum will be lower than the best sum so far.
The algorithm logic can be derived from the figure. Assuming the resolution of the time intervals is 1 minute, an array can be created and used for all the calculations:
create an array of 24 * 60 elements and fill it with 0 weights;
for each time interval, add the weight of this interval to the corresponding slice of the array;
find the maximum summed weight by iterating over the array;
iterate over the array again and output the array index (time) with the maximal summed weight.
This algorithm can be modified for a slightly different task, if you need to have the interval indices in the output. In this case the array should contain a list of the input time interval indices as a second dimension (or it can be a separate array, depending on the particular language).
UPD. I was curious whether this simple algorithm is significantly slower than the more elegant one suggested by @Dukeling. I coded both algorithms and created an input generator to estimate their performance.
Generator:
#!/bin/sh
awk -v n=$1 '
BEGIN {
    tmax = 24 * 60; wmax = 100;
    for (i = 0; i < n; i++) {
        t1 = int(rand() * tmax);
        t2 = int(rand() * tmax);
        w = int(rand() * wmax);
        if (t2 >= t1) { print t1, t2, w } else { print t2, t1, w }
    }
}' | sort -n > i.txt
Algorithm #1:
#!/bin/sh
awk '
{ t1[++i] = $1; t2[i] = $2; w[i] = $3 }
END {
    # add each interval's weight to every minute the interval covers
    for (i in t1) {
        for (t = t1[i]; t <= t2[i]; t++) {
            W[t] += w[i];
        }
    }
    Wmax = 0.;
    for (t in W) {
        if (W[t] > Wmax) { Wmax = W[t] }
    }
    print Wmax;
    for (t in W) {
        if (W[t] == Wmax) { print t }
    }
}
' i.txt > a1.txt
Algorithm #2:
#!/bin/sh
# requires gawk for asorti() with a sort specification
awk '
{ t1[++i] = $1; t2[i] = $2; w[i] = $3 }
END {
    # unique sortable keys: time, then "a" (start) or "b" (end), then
    # interval index; the value encodes the index plus an S/E marker
    for (i in t1) {
        p[t1[i] "a" i] = i "S";
        p[t2[i] "b" i] = i "E";
    }
    n = asorti(p, psorted, "@ind_num_asc");
    W = 0.; Wmax = 0.; f = 0;
    for (i = 1; i <= n; i++) {
        P = p[psorted[i]];
        k = int(P);                # interval index encoded in the value
        if (index(P, "S") > 0) {   # start point: add weight
            W += w[k];
            if (W > Wmax) {
                f = 1;
                Wmax = W;
                to1 = t1[k]
            }
        }
        else {                     # end point: record best interval, subtract weight
            if (f != 0) {
                to2 = t2[k];
                f = 0
            }
            W -= w[k];
        }
    }
    print Wmax, to1 "-" to2
}
' i.txt > a2.txt
Results:
$ ./gen.sh 1000
$ time ./a1.sh
real 0m0.283s
$ time ./a2.sh
real 0m0.019s
$ cat a1.txt
24618
757
$ cat a2.txt
24618 757-757
$ ./gen.sh 10000
$ time ./a1.sh
real 0m3.026s
$ time ./a2.sh
real 0m0.144s
$ cat a1.txt
252452
746
$ cat a2.txt
252452 746-746
$ ./gen.sh 100000
$ time ./a1.sh
real 0m34.127s
$ time ./a2.sh
real 0m1.999s
$ cat a1.txt
2484719
714
$ cat a2.txt
2484719 714-714
The simple one is ~20x slower, which is expected: its cost grows with the total number of minutes covered by all the intervals, while the sweep-line version only sorts and scans the interval endpoints.

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find which range of the second file the numbers in the first file fall into, and then estimate the mean of the values in the 2nd column over that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the bigfile with an awk script to compute the mean.
So for each small file small you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;   # NR of the first row in [start, end]
    range_end = -1;     # NR of the last row in [start, end]
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    # guard against an empty range to avoid dividing by zero
    print "start =", range_start, "end =", range_end, "mean =", (count ? sum / count : 0)
}
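As a quick illustration with hypothetical data: suppose small contains 4 on its first line and 8 on its last line, and bigfile contains

2 0.1
4 0.2
6 0.3
8 0.4
10 0.5

Then

awk -v start=4 -v end=8 -f script bigfile

prints

start = 2 end = 4 mean = 0.3

since the rows with first column 4, 6 and 8 fall in the range (values 0.2, 0.3 and 0.4, averaging 0.3).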
You can try the below:
for r in *; do
    awk -v r=$r -F' ' \
        'NR==1{b=$2; v=$4; next}
         {if (r >= b && r <= $2) {m=(v+$4)/2; print m; exit}; b=$2; v=$4}' bigfile.txt
done
Explanation:
On the first line it saves columns 2 & 4 into temp variables. On every later line it checks whether the filename r falls between the beginning of the range (the previous line's column 2) and the end of the range (the current line's column 2).
It then works out the mean and prints the result.

How to match a column name and find out the column position in awk?

I am trying to parse some csv files using awk. I am new to shell scripting and awk.
The csv file I am working on looks something like this:
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
I need to find the average AccessTimes of the different functions.
I have been working with awk and have been able to get the average times, provided the exact column numbers are specified, like $2, $3 etc.
However, I need a general script where, if I pass "minAccessTime" as a command argument, the script prints the average of that column (instead of explicitly specifying $2 or $3 in the awk program).
I have been googling this and saw various forum threads, but none of them seems to work.
Can someone tell me how to do this ? It would be of great help !
Thanks in advance!!
This awk script should give you all that you want.
It first evaluates which column you're interested in by using the name passed in as the COLM variable and checking against the first line. It converts this into an index (it's left as the default 0 if it couldn't find the column).
It then basically runs through all other lines in your input file. On all these other lines (assuming you've specified a valid column), it updates the count, sum, minimum and maximum for both the overall data plus each individual function name.
The former is stored in count, sum, min and max. The latter are stored in associative arrays with similar names (with _arr appended).
Then, once all records are read, the END section outputs the information.
NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i == COLM) {
            cidx = i;
        }
    }
}
NR > 1 {
    if (cidx > 0) {
        count++;
        sum += $cidx;
        if (count == 1) {
            min = $cidx;
            max = $cidx;
        } else {
            if ($cidx < min) { min = $cidx; }
            if ($cidx > max) { max = $cidx; }
        }
        count_arr[$1]++;
        sum_arr[$1] += $cidx;
        if (count_arr[$1] == 1) {
            min_arr[$1] = $cidx;
            max_arr[$1] = $cidx;
        } else {
            if ($cidx < min_arr[$1]) { min_arr[$1] = $cidx; }
            if ($cidx > max_arr[$1]) { max_arr[$1] = $cidx; }
        }
    }
}
END {
    if (cidx == 0) {
        print "Column '" COLM "' does not exist"
    } else {
        print "Overall:"
        print "   Total records = " count
        print "   Sum of column = " sum
        if (count > 0) {
            print "   Min of column = " min
            print "   Max of column = " max
            print "   Avg of column = " sum / count
        }
        for (task in count_arr) {
            print "Function " task ":"
            print "   Total records = " count_arr[task]
            print "   Sum of column = " sum_arr[task]
            print "   Min of column = " min_arr[task]
            print "   Max of column = " max_arr[task]
            print "   Avg of column = " sum_arr[task] / count_arr[task]
        }
    }
}
Storing that script into qq.awk and placing your sample data into qq.in, then running:
awk -F, -vCOLM=minAccessTime -f qq.awk qq.in
generates the following output, which I'm relatively certain will give you every possible piece of information you need:
Overall:
Total records = 7
Sum of column = 2930
Min of column = 200
Max of column = 700
Avg of column = 418.571
Function getStage:
Total records = 3
Sum of column = 1900
Min of column = 600
Max of column = 700
Avg of column = 633.333
Function getInfo:
Total records = 4
Sum of column = 1030
Min of column = 200
Max of column = 300
Avg of column = 257.5
For maxAccessTime, you get:
Overall:
Total records = 7
Sum of column = 5120
Min of column = 600
Max of column = 1000
Avg of column = 731.429
Function getStage:
Total records = 3
Sum of column = 2600
Min of column = 800
Max of column = 1000
Avg of column = 866.667
Function getInfo:
Total records = 4
Sum of column = 2520
Min of column = 600
Max of column = 700
Avg of column = 630
And, for xyzzy (a non-existent column), you'll see:
Column 'xyzzy' does not exist
If I understand the requirements correctly, you want the average of a column, and you'd like to specify the column by name.
Try the following script (avg.awk):
BEGIN {
    FS = ",";
}
NR == 1 {
    # find the index of the requested column in the header line
    for (i = 1; i <= NF; ++i) {
        if ($i == SELECTED_FIELD) {
            SELECTED_COL = i;
        }
    }
}
NR > 1 && $1 ~ SELECTED_FNAME {
    sum[$1] = sum[$1] + $SELECTED_COL;
    count[$1] = count[$1] + 1;
}
END {
    for (f in sum) {
        printf("Average %s for %s: %.2f\n", SELECTED_FIELD, f, sum[f] / count[f]);
    }
}
and invoke your script like this
awk -v SELECTED_FIELD=minAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -v SELECTED_FNAME=getInfo -f avg.awk < data.csv
EDIT:
Rewritten to group by function name (assumed to be first field)
EDIT2:
Rewritten to allow additional parameter to filter by function name (assumed to be first field)
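For reference, with the sample data.csv shown in the question, the first invocation should print something like the following (hypothetical output; the for (f in sum) iteration order is unspecified, so the lines may appear in either order):

Average minAccessTime for getInfo: 257.50
Average minAccessTime for getStage: 633.33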

Find max number of concurrent events

I'd like to print the max number of concurrent events given the start time and end time of each event in "hhmm" format (example input below)
$ cat input.txt
1030,1100
1032,1100
1032,1033
1033,1050
1034,1054
1039,1043
1040,1300
For this, I would
Sort by start time (column 1)
Use awk/sed to iterate over all values in column 2 (i.e., end time) and, for each event, count the end times of preceding events which are greater than the current value (i.e., find all currently running events). To elaborate, assume line 3 is being processed by awk: its end time is 10:33. The end times of the preceding 2 events are 11:00 and 11:00.
Since both these values are greater than 10:33 (i.e., those events are still running at 10:33), the third column (i.e., the number of concurrent jobs) would contain 2 for this line.
The expected output of the awk script to find concurrent events for this input would be
0
1
2
2
2
4
0
Find the max value of this third column.
My awk is rudimentary at best and I am having difficulty implementing step 2.
I'd like this to be a pure script without resorting to a heavyweight language like Java.
Hence any help from awk gurus would be highly appreciated. Any non-awk linux one liners are also most welcome.
BEGIN { FS = ","; i = 0 }
{
    # count previously seen events whose end time is after this event's end
    superpos = 0;
    for (j = 1; j <= i; j++) {
        if ($2 < a[j,2])
            ++superpos
    }
    a[++i,1] = $1;
    a[i,2] = $2;
    print superpos;
    a[i,3] = superpos;
}
END {
    max = 0;
    for (j = 1; j <= i; j++) {
        if (a[j,3] > max)
            max = a[j,3];
    }
    print "max = ", max;
}
Running at ideone
HTH!
Output (note the second value is 0, not the 1 in the question's expected output: the comparison $2 < a[j,2] is strict, and event 1 ends at exactly 11:00, the same as event 2):
0
0
2
2
2
4
0
max = 4
Edit
Or more awkish, if you prefer:
BEGIN { FS = ","; max = 0 }
{
    # count stored end times that are strictly after this event's end
    b = 0;
    for (var in a) {
        if ($2 < a[var]) b++;
    }
    a[NR] = $2;
    print b;
    if (b > max) max = b;
}
END { print "max = ", max }
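A hypothetical run, assuming this version is saved as concurrent.awk and the input is already sorted by start time as above:

$ awk -f concurrent.awk input.txt
0
0
2
2
2
4
0
max =  4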
