Average diagonally over a matrix - shell

I have a matrix, e.g. a 5 x 5 matrix:
$ cat input.txt
1 5.6 3.4 2.2 -9.99E+10
2 3 2 2 -9.99E+10
2.3 3 7 4.4 5.1
4 5 6 7 8
5 -9.99E+10 9 11 13
Here I would like to ignore the -9.99E+10 values.
I am looking for the average of all entries after dividing the matrix diagonally. There are four possibilities, shown in a graphic (not reproduced here) that used 999 in place of -9.99E+10 to save space.
I would like to average all the values under each of the shaded triangles.
So the desired output is:
$ cat outfile.txt
P1U 3.39 (average of all values on the upper side of possibility 1, without considering -9.99E+10)
P1L 6.88 (average of all values on the lower side of possibility 1, without considering -9.99E+10)
P2U 4.90
P2L 5.59
P3U 3.31
P3L 6.41
P4U 6.16
P4L 4.16
I am finding it difficult to develop a proper algorithm to write this in Fortran or in a shell script.
I am thinking of the following algorithm, but can't figure out what comes next.
step 1: # assign -9.99E+10 to the lower diagonal values of a[i,j]
for i in {1..5};do
    for j in {1..5};do
        a[i,j+1]=-9.99E+10
    done
done
step 2: # take the average
sum=0
for i in {1..5};do
    for j in {1..5};do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f" P1U $sum
step 3: # assign -9.99E+10 to the upper diagonal values of a[i,j]
for i in {1..5};do
    for j in {1..5};do
        a[i-1,j]=-9.99E+10
    done
done
step 4: # take the average
sum=0
for i in {1..5};do
    for j in {1..5};do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f" P1L $sum

Just save all the values in an array indexed by row and column number, and then in the END section repeat the process of setting the beginning and end row/column loop delimiters as needed when defining the loops for each section:
$ cat tst.awk
{
    for (colNr=1; colNr<=NF; colNr++) {
        vals[colNr,NR] = $colNr
    }
}
END {
    sect = "P1U"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        # upper-left triangle: columns up to and including the anti-diagonal
        for (colNr=begColNr; colNr<=endColNr-rowNr+1; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)

    sect = "P1L"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        # lower-right triangle: columns from the anti-diagonal onward
        for (colNr=endColNr-rowNr+1; colNr<=endColNr; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
}
$ awk -f tst.awk file
P1U 3.39
P1L 6.88
I assume that, given the above for handling the first quadrant's diagonal halves, you'll be able to figure out the other quadrant's diagonal halves. The horizontal/vertical halves are trivial; note that the expected output (P3U 3.31, P3L 6.41) shares the middle row/column between both halves when the size is odd, so set endRowNr or begRowNr to int((NR+1)/2) (and likewise begColNr/endColNr using NF), then loop through the resulting full range of values, as in the sketch below.
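For instance, here is a minimal sketch of the P3U (top-half) section, written to drop into the same END block as above; P3L is identical except that begRowNr becomes int((NR+1)/2) and endRowNr becomes NR:

sect = "P3U"
begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = int((NR+1)/2)
sum = cnt = 0
for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
    for (colNr=begColNr; colNr<=endColNr; colNr++) {
        val = vals[colNr,rowNr]
        if ( val != "-9.99E+10" ) {
            sum += val
            cnt++
        }
    }
}
printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)

For the sample matrix this averages the 13 valid entries in rows 1-3 (43/13), printing P3U 3.31 as expected.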

You can compute them all in one iteration:
$ awk -v NA='-9.99E+10' '{for(i=1;i<=NF;i++) a[NR,i]=$i}
  END {
    for(i=1;i<=NR;i++)
      for(j=1;j<=NF;j++) {
        v=a[i,j]
        if(v!=NA) {
          if(i+j<=6) {p["1U"]+=v; c["1U"]++}
          if(i+j>=6) {p["1L"]+=v; c["1L"]++}
          if(j>=i)   {p["2U"]+=v; c["2U"]++}
          if(i<=3)   {p["3U"]+=v; c["3U"]++}
          if(i>=3)   {p["3D"]+=v; c["3D"]++}
          if(j<=3)   {p["4U"]+=v; c["4U"]++}
          if(j>=3)   {p["4D"]+=v; c["4D"]++}
        }
      }
    for(k in p) printf "P%s %.2f\n", k, p[k]/c[k]
  }' file | sort
P1L 6.88
P1U 3.39
P2U 4.90
P3D 6.41
P3U 3.31
P4D 6.16
P4U 4.16
I forgot to add P2D, but from the pattern it should be clear what needs to be done.
To generalize further, as suggested: assert NF==NR, otherwise the diagonals are not well defined. Let n=NF (and n=NR). You can then replace 6 with n+1 and 3 with ceil(n/2), which can be implemented as function ceil(x) {return x==int(x)?x:int(x)+1} (note int(x)+1, not x+1, so that non-integers round up correctly).
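A sketch of that generalization, with the missing 2D case filled in; run as before with awk -v NA='-9.99E+10' -f script file | sort:

function ceil(x) {return x==int(x) ? x : int(x)+1}
{for(i=1;i<=NF;i++) a[NR,i]=$i}
END {
    n = NF   # assumes NF == NR (square matrix)
    m = ceil(n/2)
    for(i=1;i<=n;i++)
        for(j=1;j<=n;j++) {
            v = a[i,j]
            if(v != NA) {
                if(i+j<=n+1) {p["1U"]+=v; c["1U"]++}
                if(i+j>=n+1) {p["1L"]+=v; c["1L"]++}
                if(j>=i)     {p["2U"]+=v; c["2U"]++}
                if(j<=i)     {p["2D"]+=v; c["2D"]++}
                if(i<=m)     {p["3U"]+=v; c["3U"]++}
                if(i>=m)     {p["3D"]+=v; c["3D"]++}
                if(j<=m)     {p["4U"]+=v; c["4U"]++}
                if(j>=m)     {p["4D"]+=v; c["4D"]++}
            }
        }
    for(k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}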

Related

Find the N max differences between consecutive numbers in a file using unix

Integer numbers are stored in a file, one per line, and I need to find the max and the N max differences between two consecutive numbers in the file.
e.g.
12
15
50
80
Max diff: 35 (50 - 15), and say N=2, so the 1st max is 35 and the 2nd max is 30.
#!/usr/bin/awk -f
NR>1 {
    diff = $0 - prev
    for (i = 0; i < N; ++i)
        if (diff > maxdiff[i]) {
            # shift smaller maxima down one slot to keep the list sorted
            for (j = N; --j > i; )
                if (j-1 in maxdiff) maxdiff[j] = maxdiff[j-1]
            maxdiff[j] = diff
            break
        }
}
{ prev = $0 }
END { for (i in maxdiff) print maxdiff[i] }
E.g., if the script is named nmaxdiff.awk, is executable, and the numbers are stored in the file numbers, enter
./nmaxdiff.awk N=2 numbers
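For the four sample numbers above, the consecutive differences are 3, 35 and 30, so a run would look like this (awk's for (i in array) traversal order is unspecified, so the two lines may appear in either order):

$ ./nmaxdiff.awk N=2 numbers
35
30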

awk skipping records with the getline command

This is a task related to data compression using the Fibonacci binary representation.
What I have is this text file:
result.txt
a 20
b 18
c 18
d 15
e 7
This file is the result of scanning a text file and counting the appearances of each char using awk.
Now I need to give each char the length of its Fibonacci-binary representation.
Since I'm new to Ubuntu and the terminal, I wrote a program in Java that receives a number and prints all the Fibonacci codeword lengths up to that number, and it works.
This is exactly what I'm trying to do here. The problem is that it doesn't work...
The lengths of the Fibonacci codewords also grow like Fibonacci numbers.
These are the rules:
f(1)=1 - there is 1 codeword of length 1.
f(2)=1 - there is 1 codeword of length 2.
f(3)=2 - there are 2 codewords of length 3.
f(4)=3 - there are 3 codewords of length 4.
and so on...
(I'm adding one more bit to each codeword, so the first two lengths will be 2 and 3.)
This is the code I've made; its name is scr5:
{
    a=1;
    b=1;
    len=2
    print $1, $2, len;
    getline;
    print $1, $2, len+1;
    getline;
    len=4;
    for(i=1; i<num; i++){
        c= a+b;
        g=c;
        while (c >= 1){
            print $1, $2, len;
            if (getline<=0){
                print "EOF"
                exit;
            }
            c--;
            i++;
        }
        a=b;
        b=c;
        len++;
    }
}
Now I write in the terminal:
n=5
awk -v num=$n -f scr5 a
and there are two problems:
1. It skips the third letter, c.
2. For the fourth letter, d, it prints the length of the first letter, 2, instead of length 3.
I guess there is a problem with the getline command.
Thank you very much!
Search Google for getline and awk and you'll mostly find reasons to avoid getline completely! It's often a sign that you're not really doing things the "awk" way. Find an awk tutorial and work through the basics, and I'm sure you'll quickly see why your attempt using getline is not heading in the right direction.
In the script below, the BEGIN block is run once at the beginning, before any input is read, and then the next block is automatically run once for each line of input, without any need for getline.
Good luck!
$ cat fib.awk
BEGIN { prior_count = 0; count = 1; len = 1; remaining = count; }
{
    if (remaining == 0) {
        # this length is used up; advance the Fibonacci counts
        temp = count;
        count += prior_count;
        prior_count = temp;
        remaining = count;
        ++len;
    }
    print $1, $2, len;
    --remaining;
}
$ cat fib.txt
a 20
b 18
c 18
d 15
e 7
f 0
g 0
h 0
i 0
j 0
k 0
l 0
m 0
$ awk -f fib.awk fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
The above solution, in compressed form:
mawk 'BEGIN{ ___= __= _^=____=+_ } !_ { __+=(\
____=___+_*(_=___+=____))^!_ } $++NF = (_--<_)+__' fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
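For what it's worth, this one-liner computes the same Fibonacci length sequence while golfing the variable names down to runs of underscores; the $++NF = ... assignment appends the computed length as an extra field on each line, which is why its output matches the fib.awk version above.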

Algorithm to find an interval with the highest summed weight of weighted overlapping intervals

Well, I think it's hard to explain, so I've made a figure to show it.
As we can see in the figure (not reproduced here), there are 6 intervals of time, each with its own weight; the higher the opacity, the higher the weight. I want an algorithm to find the interval with the highest summed weight. In the case of the figure, it would be the overlap of intervals 5 and 6, which is the area with the highest opacity.
Split each interval into start and end points.
Sort the points.
Start with a sum of 0.
Iterate through the points using a sweep-line algorithm:
If you get a start point:
Increase the sum by the value of the corresponding interval.
If the sum is higher than the best sum so far, store this start point and set a flag.
If you get an end point:
If the flag is set, record the stored start point and this end point, with the current sum, as the best interval so far, and reset the flag.
Decrease the sum by the value of the corresponding interval.
This is derived from the answer I wrote here, which is based on the unweighted version, i.e. finding the maximum number of overlapping intervals rather than the maximum summed weight.
Example:
For this example (the intervals are from a second figure, not reproduced here):
The start/end points will be sorted as (S = start, E = end):
1S, 1E, 2S, 3S, 2E, 3E, 4S, 5S, 4E, 6S, 5E, 6E
Iterating through them, you'll set the flag at 1S, 5S and 6S, and you'll store the respective intervals at 1E, 4E and 5E (the first end points you reach after those start points).
You won't set the flag at 2S, 3S or 4S, as the sum will be lower than the best sum so far. A minimal sketch of this sweep follows.
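Here is a minimal sketch of the sweep in shell + awk, assuming an input file i.txt with one interval per line as "start end weight" (the format used in the benchmark below). To stay short it reports only the peak sum and a time at which it occurs, not the full best interval, and it treats intervals as half-open so that an end and a start at the same instant don't count as overlapping:

$ cat sweep.sh
#!/bin/sh
# emit two events per interval: +weight at the start, -weight at the end
awk '{print $1, $3; print $2, -$3}' i.txt |
# sort by time; at equal times the negative (end) events come first
sort -k1,1n -k2,2n |
# scan once, keeping a running sum and remembering its peak
awk '{ sum += $2; if (sum > best) { best = sum; at = $1 } }
     END { print best, "at time", at }'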
The algorithm logic can be derived from the figure. Assuming the resolution of the time intervals is 1 minute, an array can be created and used for all the calculations:
create the array of 24 * 60 elements and fill it with 0 weights;
for each time interval, add the weight of this interval to the corresponding part of the array;
find the maximum summed weight by iterating over the array;
iterate over the array again and output the array indices (times) with the maximal summed weight.
This algorithm can be modified for a slightly different task if you need the interval indices in the output. In that case the array should contain the list of input interval indices as a second dimension (or in a separate array, depending on the particular language).
UPD. I was curious whether this simple algorithm is significantly slower than the more elegant one suggested by @Dukeling. I coded both algorithms and created an input generator to estimate their performance.
Generator:
#!/bin/sh
awk -v n=$1 '
BEGIN {
    tmax = 24 * 60; wmax = 100;
    for (i = 0; i < n; i++) {
        t1 = int(rand() * tmax);
        t2 = int(rand() * tmax);
        w = int(rand() * wmax);
        if (t2 >= t1) {print t1, t2, w} else {print t2, t1, w}
    }
}' | sort -n > i.txt
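Note that rand() is not seeded here, so each run of the generator produces the same pseudo-random intervals for a given n; adding srand() to the BEGIN block would vary the data between runs.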
Algorithm #1:
#!/bin/sh
awk '
{t1[++i] = $1; t2[i] = $2; w[i] = $3}
END {
    for (i in t1) {
        for (t = t1[i]; t <= t2[i]; t++) {
            W[t] += w[i];
        }
    }
    Wmax = 0.;
    for (t in W) {
        if (W[t] > Wmax) {Wmax = W[t]}
    }
    print Wmax;
    for (t in W) {
        if (W[t] == Wmax) {print t}
    }
}
' i.txt > a1.txt
Algorithm #2:
#!/bin/sh
awk '
{t1[++i] = $1; t2[i] = $2; w[i] = $3}
END {
    for (i in t1) {
        p[t1[i] "a" i] = i "S";
        p[t2[i] "b" i] = i "E";
    }
    n = asorti(p, psorted, "#ind_num_asc");
    W = 0.; Wmax = 0.; f = 0;
    for (i = 1; i <= n; i++) {
        P = p[psorted[i]];
        k = int(P);
        if (index(P, "S") > 0) {
            W += w[k];
            if (W > Wmax) {
                f = 1;
                Wmax = W;
                to1 = t1[k]
            }
        }
        else {
            if (f != 0) {
                to2 = t2[k];
                f = 0
            }
            W -= w[k];
        }
    }
    print Wmax, to1 "-" to2
}
' i.txt > a2.txt
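Note that asorti() with the "#ind_num_asc" sort specifier is specific to GNU awk (gawk 4.0 or later), so Algorithm #2 won't run under mawk or BSD awk.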
Results:
$ ./gen.sh 1000
$ time ./a1.sh
real 0m0.283s
$ time ./a2.sh
real 0m0.019s
$ cat a1.txt
24618
757
$ cat a2.txt
24618 757-757
$ ./gen.sh 10000
$ time ./a1.sh
real 0m3.026s
$ time ./a2.sh
real 0m0.144s
$ cat a1.txt
252452
746
$ cat a2.txt
252452 746-746
$ ./gen.sh 100000
$ time ./a1.sh
real 0m34.127s
$ time ./a2.sh
real 0m1.999s
$ cat a1.txt
2484719
714
$ cat a2.txt
2484719 714-714
The simple one is ~20x slower, as it touches every minute covered by every interval rather than just the 2n interval end points.

Finding the range of numbers from one file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find which range of the second file the numbers in the first file fall into, and then estimate the mean of the values in the 2nd column over that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2, 4, 6, ...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the big file with an awk script to compute the mean.
So for each small file small you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
where script can be something simple like:
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    print "start =", range_start, "end =", range_end, "mean =", sum / count
}
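Putting it together, a sketch of the loop over many small files (the small_*.txt glob is an assumed naming pattern, and the file name is printed before each result):

for small in small_*.txt; do
    printf '%s: ' "$small"
    awk -v start="$(head -n 1 "$small")" \
        -v end="$(tail -n 1 "$small")" \
        -f script bigfile
done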
You can try the below:
for r in *; do
    awk -v r="$r" \
    'NR==1{b=$1;v=$2;next}{if(r >= b && r <= $1){m=(v+$2)/2; print m; exit}; b=$1;v=$2}' bigfile.txt
done
Explanation:
On the first line it saves columns 1 & 2 into temp variables. On every other line it checks whether the filename r (treated as a number) lies between the begin of the range (the previous line's column 1) and the end of the range (the current line's column 1).
It then works out the mean of the two stored values and prints the result.

Find max number of concurrent events

I'd like to print the max number of concurrent events given the start time and end time of each event in "hhmm" format (example input below)
$ cat input.txt
1030,1100
1032,1100
1032,1033
1033,1050
1034,1054
1039,1043
1040,1300
For this, I would:
1. Sort by start time (column 1).
2. Use awk/sed to iterate over all values in column 2 (i.e. end time) to find the count of end times preceding this event which are greater than the current value (i.e. find all currently running events). To elaborate, assume line 3 is being processed by awk: its end time is 10:33, and the end times of the preceding 2 events are 11:00 and 11:00. Since both these values are greater than 10:33 (i.e. they are still running at 10:33), the third column (i.e. number of concurrent jobs) would contain 2 for this line.
The expected output of the awk script to find concurrent events for this input would be
0
1
2
2
2
4
0
3. Find the max value of this third column.
My awk is rudimentary at best and I am having difficulty implementing step 2.
I'd like this to be a pure script without resorting to a heavyweight language like Java.
Hence any help from awk gurus would be highly appreciated. Any non-awk Linux one-liners are also most welcome.
BEGIN {FS=","; i=0}
{
    superpos=0;
    # count previously seen events whose end time is later than this one's
    for (j=1; j<=i; j++){
        if($2 < a[j,2])
            ++superpos
    }
    a[++i,1]=$1;
    a[i,2]=$2;
    print superpos;
    a[i,3]=superpos;
}
END {
    max=0;
    for (j=1; j<=i; j++){
        if (a[j,3]>max)
            max=a[j,3];
    }
    print "max = ", max;
}
Running at ideone
HTH!
Output:
0
0
2
2
2
4
0
max = 4
Edit
Or more awkish, if you prefer:
BEGIN {FS=","; max=0}
{
    b=0;
    # count stored end times later than the current end time
    for (var in a){
        if($2 < a[var]) b++;
    }
    a[NR]=$2;
    print b;
    if (b > max) max = b;
}
END { print "max = ", max }
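Saved as, say, concurrent.awk (a name chosen here for illustration), this compact version produces the same counts and the same maximum for the sample input:

$ awk -f concurrent.awk input.txt
0
0
2
2
2
4
0
max = 4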
