This question is about developing a distributed consensus.
Let's assume we have n nodes (N1, N2, ..., Nn), each holding a different value (A1, A2, ..., An). The nodes can communicate with each other, and a node replaces its own value with the other node's value whenever the other value is smaller.
For example, if I am N1 and, on communicating with N2, I find A2 < A1, then I replace my value with A2.
I need to find the least number of exchanges so that more than half of the nodes (> n / 2) hold the smallest possible value.
An exchange is a single communication between two nodes; it may change one of the two involved nodes' values, and does so if and only if the other node holds a different, smaller value.
Given the postulated properties, in particular that the values Ai are initially all different, the answer follows from the size of the requested strictly dominant majority MAJ: the node already holding the global minimum must pass it to enough other nodes to form that majority. This sets the minimum (optimistic-case) number of exchange operations, written xg! below, inductively:
(The original diagrams for n = 2, 3, 4 showed node (1), holding the minimum, pushing its value to nodes (2), (3), ... one exchange at a time.)

 n   |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | ... | n
 MAJ |  2 |  2 |  3 |  3 |  4 |  4 |  5 |  5 |  6 |  6 | ... | 1 + n // 2
 xg! |  1 |  1 |  2 |  2 |  3 |  3 |  4 |  4 |  5 |  5 | ... | n // 2

(MAJ is the size of the required strict majority; xg! is the minimum number of exchanges.)
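A small Python sketch (function names are mine, not from the question) that encodes this closed form and checks it against a direct simulation of the optimistic schedule:

```python
def min_exchanges(n):
    # One node already holds the global minimum; to reach a strict
    # majority of 1 + n // 2 holders, the minimum must be copied to
    # n // 2 more nodes, at one exchange each.
    return n // 2

def simulate(values):
    # Sketch of the optimistic schedule: the holder of the minimum
    # exchanges with n // 2 other nodes, each exchange copying the
    # minimum to one more node.
    n = len(values)
    m = min(values)
    holder = values.index(m)
    exchanges = 0
    for i in range(n):
        if i != holder and exchanges < n // 2:
            values[i] = m          # one exchange
            exchanges += 1
    majority = sum(v == m for v in values)
    return exchanges, majority > n // 2

# Matches the xg! row of the table above for n = 2 .. 11.
print([min_exchanges(n) for n in range(2, 12)])
```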
I have a drug analysis experiment that needs to generate a value based on a given drug database and a set of 1000 random experiments.
The original database looks like this, where the numbers in the columns represent the rank for each drug. This is a simplified version of the actual database, which will have more drugs and more genes.
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 1 | 3 |
| B | 2 | 1 |
| C | 4 | 5 |
| D | 5 | 4 |
| E | 3 | 2 |
+-------+-------+-------+
A score is calculated based on the user's input, A and C, using the following formula:
# Compute Function
# ['A','C'] as array input
computeFunction(array) {
# do some stuff with the array ...
}
The formula used will be the same for any provided values.
For the randomness test, each set of experiments requires the algorithm to provide randomized values of A and C, so both A and C can take any number from 1 to 5.
Now I have two methods of selecting values to generate the 1000 sets for the P-value calculation, but I would need someone to point out whether one is better than the other, or whether there is any way to compare the two methods.
Method 1
Generate 1000 randomized databases based on the given database shown above, meaning every table should contain a different set of value pairs.
Example for 1 database from 1000 randomized database:
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 2 | 3 |
| B | 4 | 4 |
| C | 3 | 2 |
| D | 1 | 5 |
| E | 5 | 1 |
+-------+-------+-------+
Next we perform computeFunction() with new A and C value.
Method 2
Pick any random gene from original database and use it as a newly randomized gene value.
For example, we pick the values from E and B as a new value for A and C.
From original database, E is 3, B is 2.
So, now A is 3, C is 2. Next we perform computeFunction() with new A and C value.
Summary
Since both methods produce completely randomized input, it seems to me that they will produce similar 1000-value outcomes. Is there any way I could prove they are similar?
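One way to compare the two methods is to generate both sets of 1000 scores and measure the gap between their empirical distributions, for example with a two-sample Kolmogorov-Smirnov statistic. A sketch follows; since the real computeFunction is not given, a stand-in formula (sum of the selected genes' DrugA ranks) is assumed here:

```python
import random

# Simplified rank database from the question: gene -> (DrugA, DrugB).
DB = {"A": (1, 3), "B": (2, 1), "C": (4, 5), "D": (5, 4), "E": (3, 2)}
GENES = list(DB)

def compute_function(rows):
    # Hypothetical stand-in for computeFunction() -- the real formula is
    # not given.  Here: sum of the selected genes' DrugA ranks.
    return sum(a for a, _b in rows)

def method1(rng):
    # Method 1: build a fully re-randomized database by permuting each
    # drug column, then read off the new rows for A and C.
    col_a = rng.sample([v[0] for v in DB.values()], len(DB))
    col_b = rng.sample([v[1] for v in DB.values()], len(DB))
    table = dict(zip(GENES, zip(col_a, col_b)))
    return compute_function([table["A"], table["C"]])

def method2(rng):
    # Method 2: pick two random genes from the original database and use
    # their rows as the new values for A and C.
    g1, g2 = rng.sample(GENES, 2)
    return compute_function([DB[g1], DB[g2]])

def ks_statistic(xs, ys):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    # the two empirical CDFs.  Values near 0 suggest similar distributions.
    grid = sorted(set(xs) | set(ys))
    cdf = lambda data, t: sum(v <= t for v in data) / len(data)
    return max(abs(cdf(xs, t) - cdf(ys, t)) for t in grid)

rng = random.Random(0)
scores1 = [method1(rng) for _ in range(1000)]
scores2 = [method2(rng) for _ in range(1000)]
print(ks_statistic(scores1, scores2))
```

Under this stand-in formula both methods draw two distinct ranks from the same column, so the statistic should come out small; for the real computeFunction, a large statistic would show the two methods are not interchangeable.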
I am trying to expand a data set to include dates outside of the current range.
The data I have ranges from 1992q1 to 2017q1. Each observation exists within a portion of that larger window, for example from 1993q2 to 1997q1.
I need to create quarterly observations for each range to fill the missing time. I have already expanded the existing data into quarters.
What I cannot figure out how to do is add in those missing quarters. For example, country1 may have the dates 1993q2 to 1997q1. I need to add in the missing dates from 1992q1 to 1993q1 and 1997q2 to 2017q1.
A very simple analogue of what I think your question asks for is shown by this sandbox dataset.
clear
set obs 10
gen id = cond(_n < 7, 1, 2)
gen qdate = yq(1992, 1) in 1
replace qdate = yq(1992, 3) in 7
bysort id (qdate) : replace qdate = qdate[_n-1] + 1 if missing(qdate)
format qdate %tq
list, sepby(id)
+-------------+
| id qdate |
|-------------|
1. | 1 1992q1 |
2. | 1 1992q2 |
3. | 1 1992q3 |
4. | 1 1992q4 |
5. | 1 1993q1 |
6. | 1 1993q2 |
|-------------|
7. | 2 1992q3 |
8. | 2 1992q4 |
9. | 2 1993q1 |
10. | 2 1993q2 |
+-------------+
fillin id qdate
list, sepby(id)
+-----------------------+
| id qdate _fillin |
|-----------------------|
1. | 1 1992q1 0 |
2. | 1 1992q2 0 |
3. | 1 1992q3 0 |
4. | 1 1992q4 0 |
5. | 1 1993q1 0 |
6. | 1 1993q2 0 |
|-----------------------|
7. | 2 1992q1 1 |
8. | 2 1992q2 1 |
9. | 2 1992q3 0 |
10. | 2 1992q4 0 |
11. | 2 1993q1 0 |
12. | 2 1993q2 0 |
+-----------------------+
So: fillin is a simple way of ensuring that all cross-combinations of identifier and time are present. However, to what benefit? Although not shown in this example, values of other variables spring into existence only as missing values. In some situations proceeding with interpolation is justified, but usually you just live with incomplete panels.
How to find solutions like these? One good strategy is to skim through the [D] manual to see what basic data management commands exist.
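For readers working outside Stata, the same cross-combination fill can be sketched in Python/pandas (the frame below mirrors the sandbox data above; this is an illustrative equivalent, not part of the original answer):

```python
import pandas as pd

# id 1 covers 1992q1-1993q2, id 2 covers 1992q3-1993q2.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 4,
    "qdate": (list(pd.period_range("1992Q1", periods=6, freq="Q"))
              + list(pd.period_range("1992Q3", periods=4, freq="Q"))),
})

# Equivalent of `fillin id qdate`: build the cross-product of all ids and
# all quarters, then left-join; combinations absent from the original data
# get _fillin = 1, just as in the Stata output.
grid = pd.MultiIndex.from_product(
    [df["id"].unique(), df["qdate"].unique()], names=["id", "qdate"]
).to_frame(index=False)
balanced = grid.merge(df.assign(_fillin=0), on=["id", "qdate"], how="left")
balanced["_fillin"] = balanced["_fillin"].fillna(1).astype(int)
print(balanced)
```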
Say I have a number of weights which I need to spread out across a finite number of knapsacks so that each knapsack has as even a distribution of weight as possible. The catch is that each weight can only be put into the first k bags, where the value of k varies from weight to weight.
For example, one weight might only be insertable into bags up to bag 4, i.e. bags 1 through 4. Another might have a limit of 5. The goal, as previously stated, is to achieve an even spread across all bags, with the number of bags set by the weight with the highest limit.
Is there a name for this problem, and what algorithms exist?
EDIT: To help visualise, say I have 4 weights:
+----------+--------+-----------+
| Weight # | Weight | Bag Limit |
+----------+--------+-----------+
| 1 | 2 | 2 |
| 2 | 3 | 3 |
| 3 | 1 | 1 |
| 4 | 2 | 4 |
+----------+--------+-----------+
A solution to the problem might look like this
| 1 |
| 2 |   | 3 |   | 2 |   |   |
|___|   |___|   |___|   |___|
Bag 1   Bag 2   Bag 3   Bag 4
Weights 3 and 1 were placed into Bag 1
Weight 2 was placed into Bag 2
Weight 4 was placed into Bag 3
Here, the load is spread as evenly as possible, and the problem is solved (although perhaps not optimally, as I did this in my head)
Hopefully this might clear up what I'm trying to solve.
I'd describe this problem as bin packing with side constraints -- a lot of NP-hard problems don't have good names because there are so many of them. I would expect the LP-based methods for packing variable-sized bins -- which decompose the problem into (1) a packing problem over whole bins and (2) a knapsack problem within a bin to generate candidate bins -- to carry over reasonably well.
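As a starting point, a quick greedy heuristic (a sketch, not guaranteed optimal) captures the idea: process the most constrained and heaviest items first, always placing each into the lightest bag it is allowed into:

```python
def spread(weights):
    # weights: list of (weight, bag_limit) pairs, where bag_limit k means
    # the item may only go into bags 1..k.  Greedy heuristic: handle the
    # most constrained (then heaviest) items first, dropping each into
    # the currently lightest eligible bag.
    n_bags = max(limit for _, limit in weights)
    loads = [0] * n_bags
    bags = [[] for _ in range(n_bags)]
    for w, limit in sorted(weights, key=lambda p: (p[1], -p[0])):
        b = min(range(limit), key=lambda i: loads[i])
        loads[b] += w
        bags[b].append(w)
    return bags

# The four weights from the question's table: (weight, bag limit).
print(spread([(2, 2), (3, 3), (1, 1), (2, 4)]))
# → [[1], [2], [3], [2]]
```

This reaches the same maximum load (3) as the hand-made solution in the question, though it distributes the items differently.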
Binary search splits an array into two parts and searches in one of them.
But my teacher asked us to find a solution that splits the array into four parts and then searches among those parts.
Binary search:
binary_search(A, target):
    lo = 1, hi = size(A)
    while lo <= hi:
        mid = lo + (hi - lo) / 2    # integer division
        if A[mid] == target:
            return mid
        else if A[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return not_found
But I want to split the array into 4 parts and then search. Is there a way?
A normal binary search splits the array (container) into two pieces, usually at the midpoint:
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+
|
V
+---+---+---+---+ +---+---+---+---+
| 1 | 2 | 3 | 4 | | 5 | 6 | 7 | 8 |
+---+---+---+---+ +---+---+---+---+
Based on the midpoint value, the search key is either in the lower section (left) or the higher section (right).
If we take the same concept and split into 4 pieces, the key will be in one of the four quadrants:
+---+---+ +---+---+ +---+---+ +---+---+
| 1 | 2 | | 3 | 4 | | 5 | 6 | | 7 | 8 |
+---+---+ +---+---+ +---+---+ +---+---+
By comparing the key to the highest slot of each quadrant, one can determine which quadrant the key lies in.
In a binary search, the midpoint is found by dividing the search range by 2.
In a 4-part search, the quadrant boundaries are found by dividing the range by four.
Try this algorithm out with pen and paper before coding. When you have developed steps that work, then code them. This is called designing before coding, a popular development process.
Nobody should be spoon-feeding you code. Work it out yourself.
Edit 1: Search Trees
Arrays and trees are very different. With an array, you know where all the items are and you can use an index to access the elements. With a binary search tree, you need to follow the links, since you don't know where each element is.
A divide-by-4 search tree usually follows the principles of a B-tree. Instead of single nodes, you have a page of nodes:
+---------------------------+
| Page Details |
+-----+---------------------+
| key | pointer to sub-tree |
+-----+---------------------+
| key | pointer to sub-tree |
+-----+---------------------+
| key | pointer to sub-tree |
+-----+---------------------+
| key | pointer to sub-tree |
+-----+---------------------+
The page node is an array of nodes. Most algorithms use a binary search within the page's array. When the key range is found, the algorithm traverses the link to the appropriate sub-tree. The process repeats until the key is found in a page node or the search ends at a leaf node.
What is your data structure and where lies your confusion?
Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 10 | A |
| 1 | 2 | 20 | A |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 1 | 1 | 5 | B |
| 1 | 2 | 7 | B |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
And we want to calculate the product of these matrices: C = A x B
By definition: C_i_j = sum_k( A_i_k * B_k_j )
And here is a two-step MapReduce algorithm for calculating this product (in pseudocode):
First step:
function Map (input is a single row of the file from above):
i = row[0]
j = row[1]
value = row[2]
matrix = row[3]
if(matrix == 'A')
emit(i, {j, value, 'A'})
else
emit(j, {i, value, 'B'})
Complexity of this Map function is O(1)
function Reduce(Key, List of tuples from the Map function):
Matrix_A_tuples =
filter( List of tuples from the Map function, where matrix == 'A' )
Matrix_B_tuples =
filter( List of tuples from the Map function, where matrix == 'B' )
for each tuple_A from Matrix_A_tuples
i = tuple_A[0]
value_A = tuple_A[1]
for each tuple_B from Matrix_B_tuples
j = tuple_B[0]
value_B = tuple_B[1]
emit({i, j}, {value_A * value_B, 'C'})
Complexity of this Reduce function is O(N^2)
After the first step we will get something like the following file (which contains O(N^3) lines):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 50 | C |
| 1 | 1 | 45 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 2 | 2 | 70 | C |
| 2 | 2 | 17 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
So all we have to do is sum the values from the lines that contain the same i and j.
Second step:
function Map (input is a single row of the file produced in the first step):
i = row[0]
j = row[1]
value = row[2]
emit({i, j}, value)
function Reduce(Key, List of values from the Map function)
i = Key[0]
j = Key[1]
result = 0;
for each Value from List of values from the Map function
result += Value
emit({i, j}, result)
After the second step we will get the file which contains the cells of matrix C.
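The two steps above can be sketched in-memory (a simulation of the data flow for a tiny 2x2 example, not a distributed implementation; the input rows are made up for illustration):

```python
from collections import defaultdict

# Input rows: (i, j, value, matrix), as in the file described above.
rows = [
    (1, 1, 10, 'A'), (1, 2, 20, 'A'), (2, 1, 30, 'A'), (2, 2, 40, 'A'),
    (1, 1, 5, 'B'),  (1, 2, 7, 'B'),  (2, 1, 6, 'B'),  (2, 2, 8, 'B'),
]

# Step 1 map: key A-cells by their column k = j, B-cells by their row k = i.
groups = defaultdict(list)
for i, j, value, matrix in rows:
    if matrix == 'A':
        groups[j].append((i, value, 'A'))
    else:
        groups[i].append((j, value, 'B'))

# Step 1 reduce: emit one partial product per (A-cell, B-cell) pair.
partials = []
for k, tuples in groups.items():
    a_side = [t for t in tuples if t[2] == 'A']
    b_side = [t for t in tuples if t[2] == 'B']
    for i, value_a, _ in a_side:
        for j, value_b, _ in b_side:
            partials.append(((i, j), value_a * value_b))

# Step 2: sum the partial products per (i, j) to get the cells of C.
C = defaultdict(int)
for key, value in partials:
    C[key] += value
print(dict(C))
```

For A = [[10, 20], [30, 40]] and B = [[5, 7], [6, 8]] this yields C = [[170, 230], [390, 530]], matching the ordinary matrix product.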
So the question is:
Taking into account that there are multiple instances in the MapReduce cluster, what is the most correct way to estimate the complexity of the provided algorithm?
The first estimate that comes to mind is this:
Assume that the number of instances in the MapReduce cluster is K.
Because the file produced after the first step has O(N^3) lines, the overall complexity can be estimated as O(N^3 / K).
But this estimate doesn't take many details into account: the network bandwidth between instances of the cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.
So, I would like to know which is the best approach for estimation of efficiency of the provided MapReduce algorithm, and does it make sense to use Big-O notation to estimate efficiency of MapReduce algorithms at all?
As you said, Big-O estimates computational complexity and does not take into consideration networking issues such as bandwidth, congestion, and delay.
If you want to measure how efficient the communication between instances is, you need other networking metrics.
However, note that if your file is not big enough, you will not see an improvement in execution speed, because MapReduce works efficiently only with big data. Moreover, your algorithm has two steps, which means two jobs; between one job and the next, MapReduce takes time to write out the intermediate file and start the new job, and this can slightly affect performance.
You can assess efficiency in terms of speed and time: the MapReduce approach is certainly faster than a sequential algorithm when it comes to big data.
Efficiency can also be considered with regard to fault tolerance, since MapReduce handles failures by itself; programmers do not need to handle instance failures or network failures.