Find max number of concurrent events - bash

I'd like to print the max number of concurrent events given the start time and end time of each event in "hhmm" format (example input below)
$ cat input.txt
1030,1100
1032,1100
1032,1033
1033,1050
1034,1054
1039,1043
1040,1300
For this, I would
Sort by start time (column 1)
Use awk/sed to iterate over all values in column 2 (i.e. end time) and find the count of end times from preceding events which are greater than the current value (i.e. find all currently running events). To elaborate, assume line 3 is being processed by awk. Its end time is 10:33. The end times of the preceding 2 events are 11:00 and 11:00. Since both these values are greater than 10:33 (i.e. they are still running at 10:33), the third column (i.e. the number of concurrent jobs) would contain 2 for this line.
The expected output of the awk script to find concurrent events for this input would be
0
1
2
2
2
4
0
Find the max value of this third column.
My awk is rudimentary at best and I am having difficulty implementing step 2.
I'd like this to be a pure script without resorting to a heavyweight language like Java.
Hence any help from awk gurus would be highly appreciated. Any non-awk Linux one-liners are also most welcome.

BEGIN { FS = ","; i = 0 }
{
    superpos = 0
    for (j = 1; j <= i; j++) {
        if ($2 < a[j,2])
            ++superpos
    }
    a[++i,1] = $1
    a[i,2] = $2
    print superpos
    a[i,3] = superpos
}
END {
    max = 0
    for (j = 1; j <= i; j++) {
        if (a[j,3] > max)
            max = a[j,3]
    }
    print "max = ", max
}
Running at ideone
HTH!
Output:
0
0
2
2
2
4
0
max = 4
Edit
Or more awkish, if you prefer:
BEGIN { FS = ","; max = 0 }
{
    b = 0
    for (var in a) {
        if ($2 < a[var]) b++
    }
    a[NR] = $2
    print b
    if (b > max) max = b
}
END { print "max = ", max }
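Since the question also welcomes non-awk one-liners: a different way to measure overlap is a sweep-line over start/end events. Note this is a sketch of true peak concurrency (every event running at a given instant), which is a different metric from the per-line count above; for the sample input it reports 6 rather than 4. It assumes end times are exclusive, so ends sort before starts at the same timestamp.

```shell
# Sweep-line sketch: +1 at each start, -1 at each end, track the running total.
printf '1030,1100\n1032,1100\n1032,1033\n1033,1050\n1034,1054\n1039,1043\n1040,1300\n' > input.txt
result=$(awk -F, '{print $1, 1; print $2, -1}' input.txt |
    sort -k1,1n -k2,2n |
    awk '{cur += $2; if (cur > max) max = cur} END {print "max =", max}')
echo "$result"
```

For the sample input this prints max = 6, since at 10:40 six events are running (only 1032,1033 has already ended).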


Average over diagonally in a Matrix

I have a matrix, e.g. a 5 x 5 matrix:
$ cat input.txt
1 5.6 3.4 2.2 -9.99E+10
2 3 2 2 -9.99E+10
2.3 3 7 4.4 5.1
4 5 6 7 8
5 -9.99E+10 9 11 13
Here I would like to ignore -9.99E+10 values.
I am looking for the average of all entries after dividing the matrix diagonally. Here are four possibilities (using 999 in place of -9.99E+10 to save space in the graphic):
I would like to average over all the values under different shaded triangles.
So the desire output is:
$cat outfile.txt
P1U 3.39 (average of all values on the upper side of Possibility 1, without considering -9.99E+10)
P1L 6.88 (average of all values on the lower side of Possibility 1, without considering -9.99E+10)
P2U 4.90
P2L 5.59
P3U 3.31
P3L 6.41
P4U 6.16
P4L 4.16
I am finding it difficult to develop a proper algorithm to write this in Fortran or in a shell script.
I am thinking of the following algorithm, but can't figure out what comes next.
step 1: # Assign -9.99E+10 to the lower diagonal values of a[i,j]
for i in {1..5}; do
    for j in {1..5}; do
        a[i,j+1]=-9.99E+10
    done
done
step 2: # take the average
sum=0
for i in {1..5}; do
    for j in {1..5}; do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f",P1U, sum
step 3: # Assign -9.99E+10 to the upper diagonal values of a[i,j]
for i in {1..5}; do
    for j in {1..5}; do
        a[i-1,j]=-9.99E+10
    done
done
step 4: # take the average
sum=0
for i in {1..5}; do
    for j in {1..5}; do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f",P1L,sum
Just save all the values in an array indexed by row and column number, and then in the END section repeat this process, setting the beginning and end row/column loop delimiters as needed when defining the loops for each section:
$ cat tst.awk
{
    for (colNr=1; colNr<=NF; colNr++) {
        vals[colNr,NR] = $colNr
    }
}
END {
    sect = "P1U"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        for (colNr=begColNr; colNr<=endColNr-rowNr+1; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)

    sect = "P1L"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        for (colNr=endColNr-rowNr+1; colNr<=endColNr; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
}
$ awk -f tst.awk file
P1U 3.39
P1L 6.88
I assume that, given the above for handling the first quadrant's diagonal halves, you'll be able to figure out the other quadrants' diagonal halves. The horizontal/vertical quadrant halves are trivial: just set begRowNr to int(NR/2)+1, or endRowNr to int(NR/2), or begColNr to int(NF/2)+1, or endColNr to int(NF/2), then loop through the resulting full range of values of each.
You can compute them all in one pass:
$ awk -v NA='-9.99E+10' '{for(i=1;i<=NF;i++) a[NR,i]=$i}
END {
    for(i=1;i<=NR;i++)
        for(j=1;j<=NF;j++) {
            v=a[i,j]
            if(v!=NA) {
                if(i+j<=6) {p["1U"]+=v; c["1U"]++}
                if(i+j>=6) {p["1L"]+=v; c["1L"]++}
                if(j>=i)   {p["2U"]+=v; c["2U"]++}
                if(i<=3)   {p["3U"]+=v; c["3U"]++}
                if(i>=3)   {p["3D"]+=v; c["3D"]++}
                if(j<=3)   {p["4U"]+=v; c["4U"]++}
                if(j>=3)   {p["4D"]+=v; c["4D"]++}
            }
        }
    for(k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}' file | sort
P1L 6.88
P1U 3.39
P2U 4.90
P3D 6.41
P3U 3.31
P4D 6.16
P4U 4.16
I forgot to add P2D, but from the pattern it should be clear what needs to be done.
To generalize further, as suggested: assert NF==NR, otherwise the diagonals are not well defined. Let n=NF (and n=NR). You can then replace 6 with n+1 and 3 with ceil(n/2), which can be implemented as function ceil(x) {return x==int(x)?x:int(x)+1}.
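Putting those pieces together, here is a sketch of the generalized one-pass version for an n x n matrix, with the missing P2D case filled in (the "2D" label is my guess, following the existing naming pattern):

```shell
# Generalized sketch: n+1 replaces 6, ceil(n/2) replaces 3, and the lower
# triangle of the main diagonal (P2D) is added.
cat > file <<'EOF'
1 5.6 3.4 2.2 -9.99E+10
2 3 2 2 -9.99E+10
2.3 3 7 4.4 5.1
4 5 6 7 8
5 -9.99E+10 9 11 13
EOF
result=$(awk -v NA='-9.99E+10' '
    function ceil(x) { return x == int(x) ? x : int(x) + 1 }
    { for (i = 1; i <= NF; i++) a[NR,i] = $i }
    END {
        n = NR; m = ceil(n / 2)
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++) {
                v = a[i,j]
                if (v == NA) continue
                if (i+j <= n+1) { p["1U"] += v; c["1U"]++ }
                if (i+j >= n+1) { p["1L"] += v; c["1L"]++ }
                if (j >= i)     { p["2U"] += v; c["2U"]++ }
                if (j <= i)     { p["2D"] += v; c["2D"]++ }  # the missing case
                if (i <= m)     { p["3U"] += v; c["3U"]++ }
                if (i >= m)     { p["3D"] += v; c["3D"]++ }
                if (j <= m)     { p["4U"] += v; c["4U"]++ }
                if (j >= m)     { p["4D"] += v; c["4D"]++ }
            }
        for (k in p) printf "P%s %.2f\n", k, p[k] / c[k]
    }' file | sort)
echo "$result"
```

On the sample matrix this reproduces the eight averages, with P2D 5.59 now included.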

MapReduce fundamentals

1)
map(nr, txt)
    words = split(txt, ' ')
    for (i = 0; i < |words| - 1; i++)
        emit(words[i] + ' ' + words[i+1], 1)
reduce(key, vals)
    s = 0
    for v : vals
        s += v
    if (s = 5)
        emit(key, s)
2)
map(nr, txt)
    words = split(txt, ' ')
    for (i = 0; i < |words|; i++)
        emit(txt, length(words[i]))
reduce(key, vals)
    s = 0
    c = 0
    for v : vals
        s += v
        c += 1
    r = s / c
    emit(key, r)
I am new to MapReduce and I am not able to understand whether the if condition in code (1) will ever be satisfied.
Q1: What does each of these MapReduce programs do?
Could you please give any input on the above question.
The first block of code emits all bigrams that appear 5 times: the reducer's if condition is satisfied when a pair of adjacent words occurs 5 times in the input.
The second block emits every word of the input text with its length, keyed by the full line. It attempts to calculate the average word length, but since a reducer only sees a single key, that calculation wouldn't change anything (seeing "foo" 1000 times still gives a length of 3).
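To see the reducer threshold of code (1) in action, here is a toy awk simulation on made-up input (my own sketch; the condition is read as "at least 5", and awk's associative array plays the role of the shuffle):

```shell
# Simulate map/reduce (1): map emits (bigram, 1), reduce sums the counts
# and keeps bigrams whose count reaches the threshold of 5.
result=$(echo 'a b a b a b a b a b a b' |
    awk '{ for (i = 1; i < NF; i++) count[$i " " $(i+1)]++ }   # map + shuffle
         END { for (k in count) if (count[k] >= 5) print k, count[k] }' |
    sort)
echo "$result"
```

With 12 alternating words there are 11 bigrams ("a b" six times, "b a" five times), so both clear the threshold.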

Find N max difference between two consecutive number present in file using unix

Integer numbers are stored in a file, one per row/line. I need to find the max and the N max differences between two consecutive numbers in the file.
e.g.
12
15
50
80
Max diff: 35 (50 - 15), and for, say, N=2: 1st max 35 and 2nd max 30.
#!/usr/bin/awk -f
NR > 1 {
    diff = $0 - prev
    for (i = 0; i < N; ++i)
        if (diff > maxdiff[i]) {
            # shift smaller maxima down to make room for the new one
            for (j = N; --j > i; )
                if (j-1 in maxdiff) maxdiff[j] = maxdiff[j-1]
            maxdiff[j] = diff
            break
        }
}
{ prev = $0 }
END { for (i in maxdiff) print maxdiff[i] }
e.g., if the script is named nmaxdiff.awk, is executable, and the numbers are stored in the file numbers, enter
./nmaxdiff.awk N=2 numbers
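If a pipeline is acceptable, the same result can be had without tracking the maxima by hand (my own sketch, with N=2 and the sample numbers):

```shell
# Alternative: compute all consecutive differences, then let sort/head
# pick out the N largest.
printf '12\n15\n50\n80\n' > numbers
N=2
result=$(awk 'NR > 1 { print $0 - prev } { prev = $0 }' numbers |
    sort -rn | head -n "$N")
echo "$result"
```

For the sample this prints 35 then 30, in descending order; the trade-off is an O(n log n) sort versus the script's single pass.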

Tracing this algorithm, is my trace right?

For a classwork problem I am doing, I am supposed to trace (check for bugs) the following algorithm (in pseudocode):
num <- 2
count <- 1
while count < 5
{
count <- count * num
if count / 2 < 2
print "Hello"
else
while count < 7
{
count <- count + 1
}
print "The count is " + count + "."
}
When I traced this code, I got
num count output
2 1 Hello The count is 1.
My question is, was my trace right? It looks like there is something else I have to add.
When you are tracing the problem, you need to note down all value changes in the program.
In your program, we have 2 variables to trace: count and num. From the program, we can figure out 2 facts:
There is no assignment of num;
All output statements are related to count.
Therefore, we should focus on tracing the changes on count.
Notice that this block:
while count < 7
{
count <- count + 1
}
can be replaced with
if count < 7
{
count = 7
}
The workflow of the program can be depicted in English like below:
1. Check if count is smaller than 5: YES go to 2, NO the program ends;
2. Double count;
3. If count / 2 is smaller than 2: YES go to 4, NO go to 5;
4. Print "Hello", go to 6;
5. If count is smaller than 7, set count to 7;
6. Print "The count is " + count + ".", go to 1;
Now the task is to use 1 as initial value of count and walk through the work flow until the program terminates.
Let's do it together:
count equals 1, so go to 2;
Now count equals 2;
count / 2 equals 1, which is smaller than 2, so go to 4;
Hello is printed, go to 6;
"The count is 2." is printed, go to 1;
count equals 2, so go to 2;
Now count equals 4;
count / 2 equals 2, which is NOT smaller than 2, so go to 5;
count is set to 7;
"The count is 7." is printed, go to 1;
count equals 7, so the program terminates.
Therefore the output will be:
HelloThe count is 2.The count is 7.
Here is how you should walk through this.
num = 2
count = 1
while 1 < 5
{
    2 = 1 * 2
    if 2 / 2 < 2 // since 1 < 2, print Hello
        print "Hello"
    else // this is skipped because the if was true
        while count < 7
        {
            count <- count + 1
        }
    print "The count is " + count + "." // this prints "The count is 2."
}
Then you continue through the while loop with count = 2.
Start of second iteration.
while 2 < 5
{
    4 = 2 * 2
count changes each time through the loop.
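The pseudocode can also be checked mechanically; this direct shell translation (my own sketch using shell arithmetic) reproduces the trace in the first answer:

```shell
# Direct translation of the pseudocode: the inner while bumps count to 7
# whenever the if branch is not taken.
result=$(
    num=2
    count=1
    while [ "$count" -lt 5 ]; do
        count=$((count * num))
        if [ $((count / 2)) -lt 2 ]; then
            echo "Hello"
        else
            while [ "$count" -lt 7 ]; do
                count=$((count + 1))
            done
        fi
        echo "The count is $count."
    done
)
echo "$result"
```

It prints Hello, The count is 2., then The count is 7., confirming the first answer's trace.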

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the big file with an awk script to compute the mean.
So for each small file small you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    print "start =", range_start, "end =", range_end, "mean =", sum / count
}
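A quick way to sanity-check the approach with toy data (my own stripped-down sketch that only computes the mean, using made-up small/bigfile contents):

```shell
# Minimal end-to-end run: the small file gives the range [3, 23], and the
# mean is taken over bigfile rows whose first column falls inside it.
printf '3\n10\n23\n' > small
printf '2 0.004\n4 0.003\n6 0.034\n996 0.01\n' > bigfile
result=$(awk -v start="$(head -n 1 small)" -v end="$(tail -n 1 small)" '
    $1 >= start && $1 <= end { sum += $2; count++ }
    END { printf "mean = %.4f\n", (count ? sum / count : 0) }' bigfile)
echo "$result"
```

Only rows 4 and 6 fall inside [3, 23], so the mean is (0.003 + 0.034) / 2 = 0.0185.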
You can try below:
for r in *; do
    awk -v r="$r" -F' ' \
        'NR==1{b=$2;v=$4;next}
         {if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first line it saves columns 2 and 4 into temp variables. On every subsequent line it checks whether the filename r is between the begin range (previous column 2) and the end range (current column 2).
It then works out the mean and prints the result.
