Suppose I have n computers. Each of them has a set of integers. Each computer will not have the same set.
E.g. computer 1 has {1,2,3,4}, computer 2 has {4,5,10,20,21}, computer 3 has {-10,3,5}, and so on.
I want to replicate this data so that all computers will have all the integers, i.e. all of them will have {-10,1,2,3,4,5,10,20,21}.
I want to minimize the number of messages that each computer sends and also minimize the time (i.e. avoid a serial approach where computer 1 first communicates with everyone and gets the data it is missing, then computer 2 does the same, and so on).
What is an efficient way of doing this?
Thanks.
A minimal approach would be: all computers send their info to just one (master) computer and get back the result.
For reliability you could designate at least two computers as masters.
Assumptions:
Total n computers
One of the computers is considered the master
Algorithm:
All computers send their input-info to the master (total n-1 messages)
The master processes the info
The master sends the result-info to all computers (total n-1 messages)
Reliability:
Total failure of the system under this algorithm can only occur if all the masters fail.
Efficiency:
With 1 master, total messages: 2 * (n-1)
With 2 masters, total messages: 2 * 2 * (n-1)
With 3 masters, total messages: 3 * 2 * (n-1)
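A minimal in-process sketch of the master-based scheme, counting messages (the node sets and the choice of index 0 as master are illustrative assumptions, not part of the answer):

```python
def replicate_via_master(node_sets):
    """node_sets: list of sets, one per computer; index 0 acts as master.
    Returns (replicated sets, number of messages exchanged)."""
    messages = 0
    union = set(node_sets[0])
    # Every non-master computer sends its set to the master: n-1 messages.
    for s in node_sets[1:]:
        union |= s
        messages += 1
    # The master sends the combined result back to every other computer: n-1 messages.
    messages += len(node_sets) - 1
    return [set(union) for _ in node_sets], messages

sets = [{1, 2, 3, 4}, {4, 5, 10, 20, 21}, {-10, 3, 5}]
replicated, msgs = replicate_via_master(sets)
# msgs == 2 * (n - 1) == 4; every computer ends with the full union
```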
If all the computers are on the same network, you could use UDP sockets with SO_BROADCAST option.
This way, when one computer sends a message, all the other computers receive it and update as necessary.
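A minimal Python sketch of setting up such sockets (the port number and payload are arbitrary placeholders; actually receiving broadcasts requires all machines to be on the same LAN segment):

```python
import socket

BCAST_PORT = 50007  # arbitrary example port

def make_broadcast_sender():
    """UDP socket permitted to send to the broadcast address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return s

def make_receiver():
    """UDP socket listening for broadcasts on BCAST_PORT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", BCAST_PORT))
    return s

# Usage on a real LAN (not run here):
#   make_broadcast_sender().sendto(b"4,5,10", ("<broadcast>", BCAST_PORT))
```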
Here's one way of doing it in 2*n - 2 moves. Model the machines as nodes in a linked list, numbered 1..n.
Let node 1 send all of its data in one message to node 2.
Node 2 remembers the message node 1 sent, performs a union of its content with node 1's content and sends the unified message to Node 3. Then Node 2 waits for a response from Node 3.
Node 3 does the same as above and so on until we get to node 'n'. Node 'n' now has the complete set.
Node 'n' already knows what message node 'n - 1' sent it, so it sends the diff back to node 'n - 1'
Node 'n - 1' performs the union as above. Since it has remembered the message of node 'n - 2' (in step 2 above), it can send the diff back to node 'n - 2',
and so on.
I think it is not hard to show that the above leads to 2 * (n - 1) messages being sent in the network.
I think it can be proven that 2n - 2 messages are necessary by considering the case where each node holds a unique element. It should be a short exercise in mathematical induction.
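A toy in-process simulation of the chain scheme above, counting messages only (no real networking; the node sets are the example from the question):

```python
def chain_replicate(node_sets):
    """Simulate the two-pass chain: unions forward, diffs back."""
    n = len(node_sets)
    messages = 0
    acc = set()
    known = []                  # what each node knows after the forward pass
    for i in range(n):
        acc |= node_sets[i]
        known.append(set(acc))
        if i < n - 1:
            messages += 1       # node i forwards the running union to node i+1
    full = known[-1]            # the last node now holds the complete set
    for i in range(n - 2, -1, -1):
        diff = full - known[i]  # the payload node i+1 sends back to node i
        messages += 1
    return [set(full) for _ in range(n)], messages

results, msgs = chain_replicate([{1, 2, 3, 4}, {4, 5, 10, 20, 21}, {-10, 3, 5}])
# msgs == 2 * (n - 1) == 4 for n == 3
```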
Well, there's already a system that does this, is widely adopted, well documented, widely available and, while it's perhaps not perfect (for assorted definitions of perfect), it's practical.
It's called rsync.
You can start here: http://www.samba.org/~tridge/phd_thesis.pdf
There are distributed computation nodes and there is a set of computation tasks represented by rows in a database table (a row per task):
A node has no information about other nodes: it can't talk to other nodes and doesn't even know how many other nodes there are
Nodes can be added and removed, nodes may die and be restarted
A node is connected only to the database
There is no limit of tasks per node
The task pool is not finite; new tasks always arrive
A node takes a task by marking that row with a timestamp, so that other nodes don't consider it until some timeout has passed since that timestamp (in case the node dies with the task not done)
The goal is to evenly distribute tasks among nodes. To achieve that I need to define some common algorithm of tasks acquisition: when a node starts, how many tasks to take?
If a node takes all available tasks, then one node is always busy while the others idle. So that's not an option.
A good approach would be for each node to take tasks one by one with some delay: each node periodically (once in some interval) checks for free tasks and takes only one. That way, shortly after start, the nodes acquire all the tasks, more or less equally distributed. However, the drawback is that, because of the delay, it takes a while to get the last task into processing (say there are 10000 tasks, 10 nodes, and the delay is 1 second: it would take 10000 tasks * 1 second / 10 nodes = 1000 seconds from start until all tasks are taken). Also the distribution is non-deterministic and skew is possible.
Question: what kind/class of algorithms solves such a problem, allowing tasks to be distributed quickly and evenly using some sync point (the database in this case), without electing a leader?
For example: nodes use some table to announce what tasks they want to take, then after some coordination steps they achieve consensus and start processing, etc.
So this comes down to a few factors to consider:
How many tasks are currently available overall?
How many tasks are currently accepted overall?
How many tasks has the node accepted in the last X minutes.
How many tasks has the node completed in the last X minutes.
Can the row fields be modified? (A field added).
Can a node request more tasks after it has finished its current tasks, or must all tasks be distributed immediately?
My inclination is do the following:
If practical, add a "node identifier" field (UUID) to the table with the rows. When a node runs, it generates a UUID node identifier. When it accepts a task, it adds a timestamp and its UUID. This easily lets other nodes determine how many "active" nodes there are.
To determine allocation, the node determines how many tasks are available/accepted. It then notes how many unique node identifiers (including itself) have accepted tasks. It then uses this formula to accept more tasks (ideally chosen at random to minimize the chance of competition with other nodes): available_tasks / active_nodes - tasks_already_accepted. So if there are 100 available tasks, 10 active nodes, and this node has accepted 5 tasks already, it would accept 100 / 10 - 5 = 5 more. If nodes only look for more tasks when they no longer have any, you can just use available_tasks / active_nodes.
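A sketch of the allocation rule, following the worked example (fair share minus tasks already accepted); the max_batch cap is an assumption from the next point, not part of the formula:

```python
def tasks_to_accept(available, active_nodes, already_accepted, max_batch=20):
    """How many more tasks this node should grab, per the fair-share rule.
    max_batch is an assumed cap on tasks taken at once."""
    fair_share = available // active_nodes
    want = max(fair_share - already_accepted, 0)
    return min(want, max_batch)

tasks_to_accept(100, 10, 5)  # -> 5, as in the worked example
```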
To avoid issues, there should be a max number of tasks that a node will accept at once.
If a node identifier is impractical, I would just say that each node should aim to take ceil(sqrt(N)) random tasks, where N is the number of available tasks. If there are 100 tasks, the first node will take 10, the second will take 10, the 3rd will take 9, the 4th will take 9, the 5th will take 8, and so on. This won't distribute all the tasks at once, but it will ensure the nodes get a roughly even number of tasks. The slight staggering of task counts means that the nodes will not all finish at the same time (which admittedly may or may not be desirable). By not fully distributing the tasks (unless there are sqrt(N) nodes), it also reduces the likelihood of conflicts (especially if tasks are selected at random), and it reduces the number of "failed" tasks if a node goes down.
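The staggering can be checked with a short simulation (pure arithmetic, no database involved):

```python
import math

def sqrt_batches(total_tasks, rounds):
    """Each arriving node takes ceil(sqrt(N)) of the N tasks still available."""
    batches = []
    remaining = total_tasks
    for _ in range(rounds):
        take = math.ceil(math.sqrt(remaining))
        batches.append(take)
        remaining -= take
    return batches

sqrt_batches(100, 5)  # -> [10, 10, 9, 9, 8], the sequence described above
```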
This of course assumes that a node can request more tasks after it has started; if not, things get much trickier.
As for an additional table, you could use that to keep track of the current status of the nodes: each node records how many tasks it has, its unique UUID, and when it last completed a task. That may cause database churn, though. I think it's probably good enough to just record which node accepted each task and when. This is again more useful if nodes can request tasks in the future.
I am working on a system where I need to select a leader (out of n nodes) randomly. The leader changes after each round (after the current leader has finished its task). All the nodes communicate with each other.
A re-election would take place in two conditions:
The round is finished.
The leader dies prematurely.
Are there any implementations of this idea for reference? Is doing so a good idea? Why? Should this situation be approached differently?
As far as I have understood your question, you need to select a different leader from your nodes every time. To do this, you can put all the nodes in a queue, find the length of the queue, generate a random number from 0 to the queue length minus 1, and name the node at that index as the leader. When it dies or finishes its work, remove that node from the queue and re-elect a leader by the same process; the length is now one less.
Hope I have understood the question correctly.
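A minimal sketch of the queue-based random selection described above (node names are placeholders; real node tracking and failure detection are out of scope):

```python
import random

class LeaderElector:
    """Pick a random leader from the live nodes; remove it and
    re-elect when it finishes its round or dies."""
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def elect(self):
        if not self.nodes:
            raise RuntimeError("no nodes left to lead")
        # Random index into the queue of candidates.
        return self.nodes[random.randrange(len(self.nodes))]

    def retire(self, node):
        """Remove a finished or dead leader before the next election."""
        self.nodes.remove(node)

elector = LeaderElector(["n1", "n2", "n3"])
leader = elector.elect()  # one of n1..n3, chosen uniformly
elector.retire(leader)    # the next election draws from the remaining two
```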
Let's say we have three devices we loan out to people for dates they request. The current program automatically assigns device IDs to the people when they request them. The way that algorithm works is that it first checks the status of all devices to see if there are any that are "un-requested". If so, it will just assign that to the person requesting.
If all devices have been requested for a certain time period, it will check to see if any device has request dates that do not overlap with the new request. If that is true, it will put in a request for that device.
I would like to write another algorithm that runs in the case that devices could be assigned more efficiently. For example:
Device 1: ####--##---######
Device 2: ----###-###------
Device 3: ---##---####-----
Now let's say another user comes along and puts in a request for a device that lines up like so:
Device #: --------####-----
Device 1: ####--##---######
Device 2: ----###-###------
Device 3: ---##---####-----
With the current system, they are out of luck as there are no devices available for that time period. However, if the algorithm was able to look at the three devices it could move the reservation from Device 2 to the open slot on Device 1 and then fill the request by giving them Device 2, ultimately looking like:
Device 1: ####--###########
Device 2: ----###-####-----
Device 3: ---##---####-----
How would I go about reorganizing these requests assuming a single request cannot span across multiple devices?
How about
initialize device end times to 0 (or some non-zero time if the device is in use)
sort the intervals by end time
for each interval
{
assign the interval to the device
a) whose end time is less than the interval start time (no overlap)
b) has the minimum gap between the device end time and the interval start time
update the device end time
}
Using the example in the question, the intervals (sorted by end time) are:
1 - [1,4]
2 - [4,5]
3 - [5,7]
4 - [7,8]
5 - [9,11]
6 - [9,12]
7 - [12,17]
The algorithm would assign the intervals to two devices like so
time: 12345678901234567
Device 1: 1111333-6666-----
Device 2: ---22-44555777777
Interval 1 could be assigned to either device. Intervals 2,3, and 4 are assigned due to the no overlap rule. Interval 5 is assigned based on the minimum gap rule. Then 6 and 7 are assigned based on the no overlap rule.
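The pseudocode above can be sketched in Python (closed intervals, so an interval starting exactly when another ends counts as an overlap, matching the worked example; ties on the gap go to the lowest-numbered device):

```python
def assign_intervals(intervals, num_devices):
    """Assign intervals (processed in order of end time) to the free device
    with the minimum gap. Returns one list of (start, end) pairs per device."""
    end_times = [0] * num_devices
    assignment = [[] for _ in range(num_devices)]
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Devices whose last interval ends strictly before this one starts.
        candidates = [d for d in range(num_devices) if end_times[d] < start]
        if not candidates:
            raise ValueError(f"no device free for interval ({start}, {end})")
        # Minimum gap between device end time and interval start time.
        best = min(candidates, key=lambda d: start - end_times[d])
        assignment[best].append((start, end))
        end_times[best] = end
    return assignment

intervals = [(1, 4), (4, 5), (5, 7), (7, 8), (9, 11), (9, 12), (12, 17)]
assign_intervals(intervals, 2)
# Device 1 gets (1,4), (5,7), (9,12); Device 2 gets (4,5), (7,8), (9,11), (12,17)
```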
A greedy algorithm can maximize device utilization:
If someone requests a device for interval (a, b), start with device 1 and check if it's available sometime during (a, b). Subtract from (a, b) those sub-intervals where device 1 is free, and mark it as "taken" for those time periods. If nothing is left of the original interval, then you're done; otherwise continue with device 2 and the remaining time intervals (there can be multiple disjoint intervals remaining!).
Of course this could mean that a given request will be fulfilled by multiple devices, so e.g. if I want to lease a device for the whole week, I might have to use Device 1 from Monday to Wednesday, then return it and use Device 2 from Wednesday to Friday.
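The interval-subtraction step can be sketched as follows (half-open (start, end) pairs are an assumed representation; the Monday-to-Friday numbers are made-up placeholders):

```python
def subtract_busy(request, busy_periods):
    """Subtract a device's busy periods from the requested interval.
    Returns the disjoint sub-intervals this device can still cover."""
    remaining = [request]
    for b_start, b_end in sorted(busy_periods):
        next_remaining = []
        for start, end in remaining:
            if b_end <= start or b_start >= end:
                next_remaining.append((start, end))      # no overlap
                continue
            if start < b_start:
                next_remaining.append((start, b_start))  # free part before
            if b_end < end:
                next_remaining.append((b_end, end))      # free part after
        remaining = next_remaining
    return remaining

# Request days 1-6 against a device busy days 3-5:
subtract_busy((1, 6), [(3, 5)])  # -> [(1, 3), (5, 6)]
```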
In order to convince oneself that the complications of standard algorithms such as Paxos and Raft are necessary, one must understand why simpler solutions aren't satisfactory. Suppose that, in order to reach consensus w.r.t. a stream of events in a cluster of N machines (i.e., implement a replicated, ever-growing log), the following algorithm is proposed:
Whenever a machine wants to append a message to the log, it broadcasts the tuple (msg, rnd, prev), where msg is the message, rnd is a random number, and prev is the ID of the last message on the log.
When a machine receives a tuple, it inserts msg as a child of prev, forming a tree.
If a node has more than one child, only the one with highest rnd is considered valid; the path of valid messages through the tree is the main chain.
If a message is part of the main chain, and it is old enough, it is considered decided/final.
If a machine attempts to submit a message and, after some time, it isn't present on the main chain, that means another machine broadcasted a message at roughly the same time, so you re-broadcast it until it is there.
Looks simple, efficient and resilient to crashes. Would this algorithm work?
I think you have a problem if a machine sends two tuples in sequence and the first gets lost (packet loss/corruption or whatever).
In that case, let's say machine 1 has a prev element id of 10 and sends two more tuples, (msg,rnd,10)=11 and (msg,rnd,11)=12, to machine 2.
Machine 2 only receives (msg,rnd,11) but does not have prev id 11 in its tree.
Machine 3 receives both, so it inserts them into the main tree.
At this point you would have a desync between the distributed trees.
I propose an ack for each package: after machine x inserts it into its tree, it acks the sender, which waits for the ack before sending the next message.
That way the sender needs to resend the previous message to any machine that failed to ack within a given timeframe.
I have the following problem:
I would like to create a virtual topology based on tree graph for example:
0
/ \
1 5
| \ |
2 4 3
Vertices' numbers are ranks of processes.
I managed to do that, and I have a handle on my new communicator:
MPI_Comm graph_comm;
MPI_Graph_create(MPI_COMM_WORLD, nnodes, indexes, edges, 0, &graph_comm);
Now my question is:
Is there a possibility to send a broadcast (MPI_Bcast) from each of parent nodes that has children to their children only (in this example process with rank 0 sends bcast to processes 1, 5; process with rank 1 sends bcast to processes 2, 4; process with rank 5 sends bcast to process 3)
It seems to be impossible, and one has to create separate communicators for broadcasting. While MPI_Graph_neighbors_count and MPI_Graph_neighbors should be enough to create the new groups, one might wonder why we need graph topologies in the first place if those groups can be created from exactly the same data as the graph topology.
Yes, you must create groups in every process, and then you can call MPI_Bcast on each group, where the root is the parent of the node (in your example, 0 is the parent of 1 & 5). Remember that the root rank is relative to the local communicator, so rank 0 in MPI_COMM_WORLD does not have to be rank 0 in the local group; it depends on how you create the group.
This can help: Group and Communicator Management Routines