Here's a rather nebulous question.
I have a list of start/stop times from script executions, which may include nested script calls.
| script | start | stop | duration | time executing |
| ------ | ----- | ---- | -------------- | ----------------------------------- |
| A | 1 | 8 | 7 i.e. (8-1) | 3 i.e. ((8-1) - (6-2)) |
| ->B | 2 | 6 | 4 i.e. (6-2) | 3 i.e. ((6-2) - (5-4)) |
| ->->C | 4 | 5 | 1 i.e. (5-4) | 1 |
| D | 9 | 10 | 1 i.e. (10-9) | 1 |
| E | 11 | 12 | 1 i.e. (12-11) | 1 |
| F | 9 | 16 | 7 i.e. (16-9) | 5 i.e. ((16-9) - (14-13) - (16-15)) |
| ->G | 13 | 14 | 1 i.e. (14-13) | 1 i.e. (14-13) |
| ->H | 15 | 16 | 1 i.e. (16-15) | 1 i.e. (16-15) |
Duration is the total time spent in a script.
Time executing is the time spent in the script, but not in any subscript.
So A calls B and B calls C. C takes 1 tick, B takes 4 but time executing is just 3, and A takes 7 ticks, but time executing is 3.
F calls G and then H, so takes 7 ticks but time executing is only 5.
What I'm trying to wrap my ('flu-ridden) head around is a pseudo-code algorithm for stepping or recursing through the list of times in order to generate the time-executing value for each row.
Any help for this problem (or cure for common cold) gratefully received. :-)
If all time points are distinct, then script execution timespans are related to each other by an ordered tree: Given any pair of script execution timespans, either one strictly contains the other, or they don't overlap at all. This enables an easy recovery of parent-child relationships, if you wanted to do that.
But if you just care about execution times, we don't even need that! :) There's a pretty simple algorithm that just sorts the starting and ending times and walks through the resulting array of "events", maintaining a stack of open "frames" (a code sketch follows the steps below):
Create an array of (time, scriptID) pairs, and insert the start time and end time of each script into it (i.e., insert two pairs per script into the same array).
Sort the array by time.
Create a stack of integer triples, and push a single (0, 0, 0) entry on it. (This is just a dummy entry to simplify later code.) Also create an array seen[] with a boolean flag per script ID, all initially set to false.
Iterate through the sorted array of (time, scriptID) pairs:
Whenever you see a (time, scriptID) pair for a script ID that you have not seen before, that script is starting.
Set seen[scriptID] = true.
Push the triple (time, scriptID, 0) onto the stack. The final component, initially 0, will be used to accumulate the total duration spent in this script's "descendant" scripts.
Whenever you see a time for a script ID that you have seen before (because seen[scriptID] == true), that script is ending.
Pop the top (startTime, scriptID, descendantDuration) triple from the stack (note that the scriptID in this triple should match the scriptID in the pair at the current index of the array; if not, then somehow you have "intersecting" script timespans that could not correspond to any sequence of nested script runs).
The duration for this script ID is (as you already knew) time - startTime, using the startTime from the popped triple.
Its execution time is duration - descendantDuration.
Record the time spent in this script and its descendants by adding its duration to the new top-of-stack's descendantDuration (i.e., third) field.
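Here's a minimal Python sketch of those steps. The `runs` dictionary is mine, and the F/G/H times are shifted slightly from the question's table so that every timespan is strictly nested or disjoint, as assumed at the top of this answer:

```python
# runs maps script -> (start, stop); F/G/H are shifted so all spans nest cleanly.
runs = {
    'A': (1, 8), 'B': (2, 6), 'C': (4, 5),
    'D': (9, 10), 'E': (11, 12),
    'F': (13, 20), 'G': (14, 15), 'H': (17, 18),
}

# 1-2. One (time, script) event per start and per stop, sorted by time.
events = sorted((t, s) for s, times in runs.items() for t in times)

# 3. Stack of [startTime, script, descendantDuration] with a dummy base entry.
stack = [[0, None, 0]]
seen = set()
executing = {}

# 4. Walk the sorted events.
for time, script in events:
    if script not in seen:                      # script is starting
        seen.add(script)
        stack.append([time, script, 0])
    else:                                       # script is ending
        start, top, descendant = stack.pop()
        assert top == script, "timespans intersect without nesting"
        duration = time - start
        executing[script] = duration - descendant
        stack[-1][2] += duration                # charge the full duration to the parent

print(executing)   # A: 3, B: 3, C: 1, D: 1, E: 1, F: 5, G: 1, H: 1
```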
That's all! For n script executions this will take O(n log n) time, because the sorting step takes that long (iterating over the array and performing the stack operations take just O(n)). Space usage is O(n).
I'm measuring the performance of two programs, A and B. A is written in Golang, B is written in Python. The important point here is that I'm interested in how the performance value increases over time, not in the absolute performance values of the two programs.
For example,
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 3 | 500 |
+------+-----+-----+
| 2 | 5 | 800 |
+------+-----+-----+
| 3 | 9 | 1300|
+------+-----+-----+
| 4 | 13 | 1800|
+------+-----+-----+
Here the values in columns A and B (A: 3, 5, 9, 13 / B: 500, 800, 1300, 1800) are the execution times of the programs. This execution time can be seen as performance, and the difference between the absolute performance values of A and B is very large, so comparing the slopes of the two performance graphs directly would be meaningless (Python is very slow compared to Golang).
I want to compare the performance of Program A written in Golang with Program B written in Python, and I'm looking for a calibration tool or formula, based on benchmarks, that estimates the execution time Program A would have if it were written in Python.
Is there any way to solve this problem?
If you are interested in the relative change, you should normalize the data for each programming language. In other words, divide the Golang values by 3 and the Python values by 500 (their respective values at time 1); a code sketch follows the table below.
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 1 | 1 |
+------+-----+-----+
| 2 | 1.66| 1.6 |
+------+-----+-----+
| 3 | 3 | 2.6 |
+------+-----+-----+
| 4 |4.33 | 3.6 |
+------+-----+-----+
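In code, that normalization is just dividing each series by its own first element; a tiny sketch (the variable names are mine):

```python
times = [1, 2, 3, 4]
a = [3, 5, 9, 13]             # Golang execution times
b = [500, 800, 1300, 1800]    # Python execution times

# Divide each series by its first value so only relative growth remains.
a_norm = [x / a[0] for x in a]   # ~[1.0, 1.67, 3.0, 4.33]
b_norm = [x / b[0] for x in b]   # [1.0, 1.6, 2.6, 3.6]

# The slopes of a_norm and b_norm are now directly comparable.
```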
I have a drug analysis experiment that needs to generate a value based on a given drug database and a set of 1000 random experiments.
The original database looks like this, where the numbers in the columns represent the rank for each drug. This is a simplified version of the actual database; the real database will have more drugs and more genes.
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 1 | 3 |
| B | 2 | 1 |
| C | 4 | 5 |
| D | 5 | 4 |
| E | 3 | 2 |
+-------+-------+-------+
A score is calculated based on the user's input, A and C, using the following formula:
# Compute Function
# ['A','C'] as array input
computeFunction(array) {
    # do some stuff with the array ...
}
The formula used will be the same for any provided values.
For the randomness test, each set of experiments requires the algorithm to provide randomized values for A and C, so both A and C can take any value from 1 to 5.
Now I have two methods of selecting values to generate the 1000 sets for the P-value calculation, but I would need someone to point out whether one is better than the other, or whether there is any way to compare these two methods.
Method 1
Generate 1000 randomized databases based on the given database shown above, meaning each table should contain a different set of value pairs (see the sketch after Method 2).
Example of 1 database out of the 1000 randomized databases:
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 2 | 3 |
| B | 4 | 4 |
| C | 3 | 2 |
| D | 1 | 5 |
| E | 5 | 1 |
+-------+-------+-------+
Next we perform computeFunction() with the new A and C values.
Method 2
Pick random genes from the original database and use their values as the newly randomized gene values.
For example, we pick the values of E and B as the new values for A and C.
From the original database, E is 3 and B is 2.
So now A is 3 and C is 2. Next we perform computeFunction() with the new A and C values.
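For what it's worth, here is a rough Python sketch of how the two sampling schemes could look. Everything here (the `db` layout, `method1`/`method2`, and the commented-out `compute_function` calls) is illustrative, since the real compute function isn't shown:

```python
import random

# Original ranks per drug, as in the example table above.
db = {
    'DrugA': {'A': 1, 'B': 2, 'C': 4, 'D': 5, 'E': 3},
    'DrugB': {'A': 3, 'B': 1, 'C': 5, 'D': 4, 'E': 2},
}
genes = list(db['DrugA'])        # ['A', 'B', 'C', 'D', 'E']
user_genes = ['A', 'C']

def method1(db):
    """Shuffle the ranks within each drug column, then read off A and C."""
    out = {}
    for drug, ranks in db.items():
        values = list(ranks.values())
        random.shuffle(values)
        shuffled = dict(zip(genes, values))
        out[drug] = [shuffled[g] for g in user_genes]
    return out

def method2(db):
    """Pick two random genes from the original table and use their ranks for A and C."""
    picked = random.sample(genes, k=len(user_genes))
    return {drug: [db[drug][g] for g in picked] for drug in db}

# 1000-experiment null distributions for the P-value, one per method:
# null1 = [compute_function(method1(db)) for _ in range(1000)]
# null2 = [compute_function(method2(db)) for _ in range(1000)]
```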
Summary
Since both methods produce completely randomized input, it seems to me that they will produce similar 1000-value outcomes. Is there any way I could prove they are similar?
I would like some help with a bit of recursive code I need to traverse a graph stored as a collection in PL/SQL.
---------
|LHS|RHS|
---------
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
| 3 | 5 |
---------
Assuming 1 is the start node, I would like to be able to find 2-3 and 2-4 without looping through the entire collection to check each LHS. I know one solution is to use a global temporary table instead of a collection, but I would really like to avoid reading from and writing to disk if at all possible.
Edit: The expected output for the above example would be XML like this:
<1>
  <2>
    <3>
      <5>
      </5>
    </3>
    <4>
    </4>
  </2>
</1>
Thanks.
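One way to avoid scanning the whole collection at every step is to index it once by LHS and then recurse over that index. Here is a language-agnostic sketch of that idea in Python (in PL/SQL the dictionary would correspond to an in-memory associative array keyed by LHS, so nothing touches disk); the names are mine:

```python
# Edges as (LHS, RHS) pairs, as in the table above.
edges = [(1, 2), (2, 3), (2, 4), (3, 5)]

# One pass to build LHS -> list of RHS children.
children = {}
for lhs, rhs in edges:
    children.setdefault(lhs, []).append(rhs)

def to_xml(node):
    """Depth-first traversal; each lookup is a single keyed access, not a scan."""
    inner = ''.join(to_xml(child) for child in children.get(node, []))
    return f'<{node}>{inner}</{node}>'

print(to_xml(1))   # <1><2><3><5></5></3><4></4></2></1>
```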
Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 10 | A |
| 1 | 2 | 20 | A |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 1 | 1 | 5 | B |
| 1 | 2 | 7 | B |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
And we want to calculate the product of these matrices: C = A x B
By definition: C_ij = sum_k( A_ik * B_kj )
And here is a two-step MapReduce algorithm for the calculation of this product (in pseudocode):
First step:
function Map (input is a single row of the file from above):
    i = row[0]
    j = row[1]
    value = row[2]
    matrix = row[3]
    if (matrix == 'A')
        emit(j, {i, value, 'A'})   # key A cells by their column index, i.e. the shared index k
    else
        emit(i, {j, value, 'B'})   # key B cells by their row index, i.e. the shared index k
Complexity of this Map function is O(1)
function Reduce (Key, List of tuples from the Map function):
    Matrix_A_tuples = filter( List of tuples from the Map function, where matrix == 'A' )
    Matrix_B_tuples = filter( List of tuples from the Map function, where matrix == 'B' )
    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]
        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]
            emit({i, j}, {value_A * value_B, 'C'})
Complexity of this Reduce function is O(N^2)
After the first step we will get something like the following file (which contains O(N^3) lines):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 50 | C |
| 1 | 1 | 45 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 2 | 2 | 70 | C |
| 2 | 2 | 17 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
So all we have to do is sum the values from the lines that contain the same values of i and j.
Second step:
function Map (input is a single row of the file produced in the first step):
    i = row[0]
    j = row[1]
    value = row[2]
    emit({i, j}, value)

function Reduce (Key, List of values from the Map function):
    i = Key[0]
    j = Key[1]
    result = 0
    for each Value from List of values from the Map function
        result += Value
    emit({i, j}, result)
After the second step we will get a file which contains the cells of the matrix C.
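As a sanity check of the logic (not of the performance), the two steps can be simulated in memory with plain Python; the tiny example matrices and the use of a dict in place of the shuffle phase are mine:

```python
from collections import defaultdict

# Tiny example matrices, given as (i, j, value, matrix) rows like the file above.
rows = [(1, 1, 10, 'A'), (1, 2, 20, 'A'), (2, 1, 30, 'A'), (2, 2, 40, 'A'),
        (1, 1, 5, 'B'), (1, 2, 7, 'B'), (2, 1, 1, 'B'), (2, 2, 2, 'B')]

# --- Step 1: key A cells by their column and B cells by their row (the shared k).
groups = defaultdict(lambda: {'A': [], 'B': []})
for i, j, value, matrix in rows:
    if matrix == 'A':
        groups[j]['A'].append((i, value))
    else:
        groups[i]['B'].append((j, value))

partial = []  # lines of the intermediate "C" file
for k, g in groups.items():
    for i, value_A in g['A']:
        for j, value_B in g['B']:
            partial.append(((i, j), value_A * value_B))

# --- Step 2: sum all partial products that share the same (i, j).
C = defaultdict(int)
for key, value in partial:
    C[key] += value

print(dict(C))   # {(1, 1): 70, (1, 2): 110, (2, 1): 190, (2, 2): 290}
```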
So the question is:
Taking into account that there are multiple instances in the MapReduce cluster, what is the most correct way to estimate the complexity of the provided algorithm?
The first estimate that comes to mind is this:
Assume that the number of instances in the MapReduce cluster is K.
Because the number of lines in the file produced after the first step is O(N^3), the overall complexity can be estimated as O(N^3 / K).
But this estimation doesn't take into account many details, such as the network bandwidth between instances of the MapReduce cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.
So I would like to know: what is the best approach for estimating the efficiency of the provided MapReduce algorithm, and does it make sense to use Big-O notation to estimate the efficiency of MapReduce algorithms at all?
As you said, Big-O estimates the computational complexity and does not take into consideration networking issues such as bandwidth, congestion, delay, etc.
If you want to calculate how efficient the communication between instances is, you need other networking metrics...
However, I want to point out that if your file is not big enough, you will not see an improvement in terms of execution speed. This is because MapReduce works efficiently only with big data. Moreover, your code has two steps, which means two jobs. Between one job and the next, MapReduce takes time to upload the file and start the next job, and this can slightly affect performance.
I think you can measure efficiency in terms of speed and time, as the MapReduce approach is certainly faster when it comes to big data, if we compare it to sequential algorithms.
Moreover, efficiency can also be considered with regard to fault tolerance, because MapReduce handles failures by itself, so programmers don't need to handle instance failures or networking failures.
Suppose I have a black box with 3 inputs (each input is 1 bit) and a 2-bit output.
The black box counts the number of turned-on input bits.
Using only such black boxes, one needs to implement a counter of turned-on bits for a 7-bit input. The implementation should use the minimum possible number of black boxes.
//This is a job interview question
You're making a binary adder. Try this...
Two black boxes for input with one input remaining:
7 6 5 4 3 2 1
| | | | | | |
------- ------- |
| | | | |
| H L | | H L | |
------- ------- |
| | | | |
Take the two low outputs and the remaining input (1) and feed them to another black box:
L L 1
| | |
-------
| |
| C L |
-------
| |
The low output from this black box will be the low bit of the result. The high output is the carry bit. Feed this carry bit along with the high bits from the first two black boxes into the fourth black box:
H H C L
| | | |
------- |
| | |
| H M | |
------- |
| | |
The result should be the number of "on" bits in the input expressed in binary by the High, Middle and Low bits.
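If it helps, this wiring can be brute-force checked in a few lines of Python (the `box` and `count7` names are mine; `box` just models the described black box):

```python
from itertools import product

def box(a, b, c):
    """The black box: counts how many of its 3 input bits are on,
    returned as (high, low) of a 2-bit binary number."""
    n = a + b + c
    return n // 2, n % 2

def count7(i1, i2, i3, i4, i5, i6, i7):
    h1, l1 = box(i7, i6, i5)     # first black box
    h2, l2 = box(i4, i3, i2)     # second black box
    c, low = box(l1, l2, i1)     # third box: low bit of the result plus carry
    high, mid = box(h1, h2, c)   # fourth box: high and middle bits
    return 4 * high + 2 * mid + low

# Exhaustively verify against a plain popcount for all 128 inputs.
assert all(count7(*bits) == sum(bits) for bits in product([0, 1], repeat=7))
print("4-box construction counts bits correctly")
```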
Suppose that each BB outputs a 2-bit binary count 00, 01, 10, or 11, when 0, 1, 2, or 3 of its inputs are on. Also suppose that the desired ultimate output O₄O₂O₁ is a 3-bit binary count 000 ... 111, when 0, 1, ... 7 of the 7 input bits i₁...i₇ are on.
For problems like this in general, you can write a boolean expression for what the BB does and a boolean expression for the desired output and then synthesize the output. In this particular case, however, try the obvious approach of putting i₁, i₂, i₃ into a first box B₁, i₄, i₅, i₆ into a second box B₂, and i₇ into one input of a third box B₃.
Looking at this, it's clear that if you run the units outputs from B₁ and B₂ into the other two inputs of B₃, then the units output from B₃ is equal to the desired value O₁. You can get the sum of the twos outputs from B₁, B₂, B₃ via a box B₄, and this sum is equal to the desired values O₄O₂.