MPI virtual graph topology broadcast

I have the following problem:
I would like to create a virtual topology based on a tree graph, for example:

      0
     / \
    1   5
   / \   \
  2   4   3
The vertex numbers are the ranks of the processes.
I managed to do that, and I have a handle to my new communicator:
MPI_Comm graph_comm;
MPI_Graph_create(MPI_COMM_WORLD, nnodes, indexes, edges, 0, &graph_comm);
Now my question is:
Now my question is: is it possible to send a broadcast (MPI_Bcast) from each parent node that has children to its children only? In this example: process 0 broadcasts to processes 1 and 5; process 1 broadcasts to processes 2 and 4; process 5 broadcasts to process 3.

It seems to be impossible; one has to create separate communicators for broadcasting. While MPI_Graph_neighbors_count and MPI_Graph_neighbors are enough to create the new groups, one might wonder why we need graph topologies in the first place if those groups can be created from exactly the same data the graph topology was built from.

Yes, you must create groups in every process, and then you can call MPI_Bcast on each group with the parent as the root (in your example, 0 is the parent of 1 and 5, but remember that the root rank is relative to the local communicator, so the parent does not have to be rank 0 in the local group; it depends on how you create it).
This can help: Group and Communicator Management Routines
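For illustration, here is a minimal C sketch of that approach. It assumes six processes and hardcodes the parent map for the tree in the question (a real program would derive it, e.g. from MPI_Graph_neighbors); one sub-communicator is created per "family" (a parent plus its children), and the parent broadcasts inside it:

#include <mpi.h>
#include <stdio.h>

#define NNODES 6

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != NNODES) {
        if (rank == 0) fprintf(stderr, "run with %d processes\n", NNODES);
        MPI_Finalize();
        return 1;
    }

    /* Hardcoded parent map for the tree in the question (-1 = root):
       0 -> {1, 5}, 1 -> {2, 4}, 5 -> {3}. */
    int parent[NNODES] = { -1, 0, 1, 5, 1, 0 };

    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    int value = (rank == 0) ? 42 : -1;   /* payload pushed down the tree */

    /* One communicator per family. MPI_Comm_create is collective over
       MPI_COMM_WORLD, so every process executes the whole loop; processes
       outside a family just get MPI_COMM_NULL. Iterating parents in
       increasing rank order happens to be top-down for this tree; in
       general, iterate families in BFS order so a node receives from its
       parent before broadcasting to its own children. */
    for (int p = 0; p < size; ++p) {
        int members[NNODES], m = 0;
        members[m++] = p;                 /* parent listed first => local rank 0 */
        for (int r = 0; r < size; ++r)
            if (parent[r] == p) members[m++] = r;
        if (m == 1) continue;             /* p has no children */

        MPI_Group fam_group;
        MPI_Comm fam_comm;
        MPI_Group_incl(world_group, m, members, &fam_group);
        MPI_Comm_create(MPI_COMM_WORLD, fam_group, &fam_comm);
        if (fam_comm != MPI_COMM_NULL) {
            MPI_Bcast(&value, 1, MPI_INT, 0, fam_comm);  /* root = parent */
            MPI_Comm_free(&fam_comm);
        }
        MPI_Group_free(&fam_group);
    }

    printf("rank %d got value %d\n", rank, value);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}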

Related

How to test the Banker's algorithm and show that other orderings have problems

I have found a Python version of the Banker's algorithm on the GeeksForGeeks site here.
However, how can I test and show that the safe ordering is correct?
And how can I show, with an example, that other orderings have an error or problem?
https://www.geeksforgeeks.org/bankers-algorithm-in-operating-system-2/
Introduction
Let's consider a very simple example. Say there are 2 processes, P0 and P1, and only one type of resource, A. The system has allocated 10 units of A to P0 and 0 to P1, and it still has 1 unit of A left. Moreover, in total, P0 may request up to 11 units during its execution, and P1 up to 5.
Let's quickly build up tables and vectors used to determine safe or unsafe sequences for these processes.
Allocation table
The allocation table shows how many resources of each type are allocated to each process. In your example, it looks as follows:
Process | A
P0      | 10
P1      | 0
Availability vector
The availability vector shows how many units the system can still offer if it decides to:
A
1
Maximum table
The maximum table shows how many units of A each process may request during its execution (in total):
Process | A
P0      | 11
P1      | 5
Need table
The need table shows how many units of A each process may additionally request during its execution (Need = Maximum - Allocation):
Process | A
P0      | 1
P1      | 5
Safe sequence
Now, let's say we ran the Banker's algorithm for our configuration and got the following sequence:
P0 -> P1
Why is it safe?
Case 1 - processes are executed in sequence
P0 starts executing, and demands and receives the remaining 1 unit. So the system has 0 available units left. However, once P0 completes, it releases 11 units of A, which is more than enough for P1 to run and complete.
Case 2 - processes are executed in parallel
P0 starts executing, and demands and receives the remaining 1 unit. Then, during its execution, P1 starts too and asks for 5 units. However, its request gets postponed because the system has none left, so the request is put on a waiting list. Later, when P0 releases at least 5 units, P1 finally gets its 5. No deadlock can happen: if P0 needs resources again, it will either wait for P1 or just ask the system, and vice versa.
Unsafe sequence
P1 -> P0
P1 starts executing and demands 5 units from the system. It gets denied, and its request is put on a waiting list, because the system has only 1 unit. Then P0 starts and demands 1 unit. It also gets denied, because P1 is already waiting for 5 units, and the request from P0 is put on the waiting list too. So we have a deadlock: neither request can ever go through.
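To check a given ordering mechanically, you can simulate it: walk the sequence, verify at each step that the process's remaining need fits in the currently available units, and then release its allocation. Here is a minimal C sketch, hardcoding the single-resource configuration above (is_safe_order and the table names are illustrative):

#include <stdio.h>

#define NPROC 2
#define NRES  1

/* Configuration from the example above. */
int allocation[NPROC][NRES] = { {10}, {0} };
int maximum[NPROC][NRES]    = { {11}, {5} };
int available[NRES]         = { 1 };

/* Returns 1 if running the processes in `order` never blocks, 0 otherwise. */
int is_safe_order(const int order[NPROC]) {
    int work[NRES];
    for (int r = 0; r < NRES; ++r) work[r] = available[r];

    for (int i = 0; i < NPROC; ++i) {
        int p = order[i];
        /* The process can finish only if its remaining need fits in `work`. */
        for (int r = 0; r < NRES; ++r)
            if (maximum[p][r] - allocation[p][r] > work[r])
                return 0;
        /* It finishes and releases everything it holds. */
        for (int r = 0; r < NRES; ++r)
            work[r] += allocation[p][r];
    }
    return 1;
}

int main(void) {
    int seq1[NPROC] = { 0, 1 };  /* P0 -> P1 */
    int seq2[NPROC] = { 1, 0 };  /* P1 -> P0 */
    printf("P0 -> P1: %s\n", is_safe_order(seq1) ? "safe" : "unsafe");
    printf("P1 -> P0: %s\n", is_safe_order(seq2) ? "safe" : "unsafe");
    return 0;
}

Running it prints "safe" for P0 -> P1 (need 1 fits in the 1 available unit, and P0's release of 11 covers P1's need of 5) and "unsafe" for P1 -> P0 (need 5 exceeds the 1 available unit).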

Recommend algorithm of fair distributed resources allocation consensus

There are distributed computation nodes, and there is a set of computation tasks represented by rows in a database table (a row per task):
A node has no information about other nodes: it can't talk to other nodes and doesn't even know how many other nodes there are
Nodes can be added and removed; nodes may die and be restarted
A node is connected only to the database
There is no limit on tasks per node
The task pool is not finite; new tasks always arrive
A node takes a task by marking that row with a timestamp, so that other nodes don't consider it until some timeout has passed after that timestamp (in case the node dies with the task not done)
The goal is to distribute tasks evenly among nodes. To achieve that, I need to define some common algorithm of task acquisition: when a node starts, how many tasks should it take?
If a node takes all available tasks, then one node is always busy and the others are idle. So that's not an option.
A good approach would be for each node to take tasks one by one with some delay: each node periodically (once in some interval) checks if there are free tasks and takes only 1 task. In this way, shortly after start, all nodes acquire all tasks, more or less equally distributed. However, the drawback is that because of the delay, it would take some time for the last task to be taken into processing (say there are 10000 tasks, 10 nodes, and the delay is 1 second: it would take 10000 tasks * 1 second / 10 nodes = 1000 seconds from start until all tasks are taken). Also, the distribution is non-deterministic and skew is possible.
Question: what kind/class of algorithms solves such a problem, allowing tasks to be distributed quickly and evenly using some sync point (the database in this case), without electing a leader?
For example: nodes use some table to announce what tasks they want to take, then after some coordination steps they achieve consensus and start processing, etc.
So this comes down to a few factors to consider.
How many tasks are currently available overall?
How many tasks are currently accepted overall?
How many tasks has the node accepted in the last X minutes?
How many tasks has the node completed in the last X minutes?
Can the row fields be modified? (Can a field be added?)
Can a node request more tasks after it has finished its current tasks, or must all tasks be distributed immediately?
My inclination is do the following:
If practical, add a "node identifier" field (a UUID) to the table with the rows. When a node starts, it generates a UUID node identifier. When it accepts a task, it adds a timestamp and its UUID. This easily lets other nodes determine how many "active" nodes there are.
To determine the allocation, the node determines how many tasks are available/accepted. It then notes how many unique node identifiers (including itself) have accepted tasks, and uses this formula to decide how many more tasks to accept (ideally picking them at random, to minimize the chance of competition with other nodes): 2 * available_tasks / active_nodes - nodes_accepted_tasks. So if there are 100 available tasks, 10 active nodes, and this node has accepted 5 tasks already, it would accept 2 * 100 / 10 - 5 = 15 tasks. If nodes only look for more tasks when they no longer have any, then you can just use available_tasks / active_nodes.
To avoid issues, there should be a max number of tasks that a node will accept at once.
If a node identifier is impractical, I would just say that each node should aim to take ceil(sqrt(N)) random tasks, where N is the number of available tasks. If there are 100 tasks, the first node will take 10, the second will take 10, the 3rd will take 9, the 4th will take 9, the 5th will take 8, and so on. This won't evenly distribute all the tasks at once, but it will ensure the nodes get a roughly even number of tasks. The slight staggering of the number of tasks means that the nodes will not all finish their tasks at the same time (which admittedly may or may not be desirable). By not fully distributing the tasks (unless there are at least sqrt(N) nodes), it also reduces the likelihood of conflicts (especially if tasks are randomly selected), and it reduces the number of "failed" tasks if a node goes down.
This of course assumes that a node can request more tasks after it has started, if not, it makes it much more tricky.
As for an additional table, you could actually use one to keep track of the current status of the nodes: each node records how many tasks it has, its unique UUID, and when it last completed a task. That may have potential issues with database churn, though. I think it's probably good enough to just record which node has accepted each task along with when it accepted it. This is again more useful if nodes can request tasks in the future.
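As a sketch, here is how the two formulas above might look in C. All names and the max-batch cap are illustrative assumptions; in practice the counts would come from the database:

#include <math.h>
#include <stdio.h>

/* How many tasks this node should grab, per the UUID-based formula above.
   The counts would come from querying the task table. */
int tasks_to_take(int available, int active_nodes, int already_accepted,
                  int max_batch) {
    int n = 2 * available / active_nodes - already_accepted;
    if (n < 0) n = 0;
    if (n > max_batch) n = max_batch;   /* cap to avoid hogging the pool */
    return n;
}

/* Fallback when nodes are anonymous: take ceil(sqrt(N)) random tasks. */
int tasks_to_take_anonymous(int available) {
    return (int)ceil(sqrt((double)available));
}

int main(void) {
    /* 100 available, 10 active nodes, 5 already accepted -> 15 tasks. */
    printf("%d\n", tasks_to_take(100, 10, 5, 50));
    /* 100 available, anonymous nodes -> 10 tasks. */
    printf("%d\n", tasks_to_take_anonymous(100));
    return 0;
}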

Random leader selection after each round

I am working on a system where I need to select a leader (out of n nodes) randomly. The leader changes after each round (after the current leader has finished its task). All the nodes communicate with each other.
A re-election would take place in two conditions:
The round is finished.
The leader dies prematurely.
Are there any implementations of this idea for reference? Is doing so a good idea? Why? Should this situation be approached differently?
As far as I have understood your question, you need to select a different leader from your nodes every time. To do this, you can put all the nodes in a queue, find the length of the queue, generate a random number from 0 (inclusive) to the queue length (exclusive), and name the node at this index as the leader. When it dies or finishes its work, you can remove this node from the queue and re-elect a leader by the same process; now the length is 1 less.
Hope I have understood the question correctly.
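Here is a minimal C sketch of that idea; the node ids are illustrative, and rand() stands in for whatever source of randomness you use:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Keep the candidate nodes in an array, pick a random index as leader,
   and drop the leader from the pool once its round ends (or it dies). */
int pick_leader(int nodes[], int *count) {
    int i = rand() % *count;        /* random index in [0, count) */
    int leader = nodes[i];
    nodes[i] = nodes[--*count];     /* remove leader: swap with last */
    return leader;
}

int main(void) {
    int nodes[] = { 10, 20, 30, 40, 50 };   /* illustrative node ids */
    int count = 5;
    srand((unsigned)time(NULL));
    while (count > 0)
        printf("leader for this round: %d\n", pick_leader(nodes, &count));
    return 0;
}

Note that in a distributed setting all nodes must agree on the random choice, e.g. by seeding identically from shared state or by having the outgoing leader announce the next pick.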

Oracle 11g resource manager clarification

I'm hoping that someone can just confirm my understanding of how the resource manager works...
If I've got a 4-node RAC with 2 consumer groups and 2 services, the services pin each consumer group to specific nodes, i.e. consumer group 1 ALWAYS gets sent to nodes 1 and 2, and consumer group 2 ALWAYS gets sent to nodes 3 and 4.
If I've got a tiered resource plan such as:
Group Name | L0  | L1  | Max
Group 1    | 75% | 0   | 80%
Group 2    | 0   | 75% | 80%
Am I right in saying that, as group 1 is on nodes 1 and 2 and group 2 is on nodes 3 and 4, each will have 75% of resources available on its respective nodes, and both will be limited to 80% on the nodes they run on?
I.e., resources are constrained and calculated on a per-node basis, not per cluster.
So just because a connection in group 1 on node 1 is using 80% of resources, another group 1 connection on node 2 will still have up to 80% available to it, not 0%.
And similarly, if group 1 is using its allocated maximum, group 2 will still get its full share on nodes 3 and 4, since group 1, which has higher priority, isn't running on those nodes.
I've had a response from Oracle Support:
Resource Manager's limits are applied per node, except for PARALLEL_TARGET_PERCENTAGE, so for your example you are right. Connections in consumer group 2 only ever hit nodes 3 and 4 (due to the services), so group 2 will get a minimum of 75% of resources on those nodes, and potentially 100% if no max limit has been set, or 80% if the max limit has been set.

algorithm to replicate data between computers

Suppose I have n computers. Each of them has a set of integers, and each computer's set is different.
E.g. computer 1 has {1,2,3,4}, computer 2 has {4,5,10,20,21}, computer 3 has {-10,3,5}, and so on.
I want to replicate this data so that all computers have all the integers, i.e. all of them end up with {-10,1,2,3,4,5,10,20,21}.
I want to minimize the number of messages each computer sends and also minimize the time (i.e. avoid a serial approach where computer 1 first communicates with everyone and gets the data it is missing, then computer 2 does the same, and so on).
What is an efficient way of doing this?
Thanks.
A minimal approach would be: all computers send their info to just one (master) computer and get back the result.
For reliability you could use at least two computers as masters.
Assumptions:
Total of n computers
One of the computers is designated the master
Algorithm:
All computers send their input info to the master (n-1 messages in total)
The master processes the info
The master sends the result info to all computers (n-1 messages in total)
Reliability:
Total failure of the system under this algorithm can only occur if all the masters fail.
Efficiency:
With 1 master, total messages: 2 * (n-1)
With 2 masters, total messages: 2 * 2 * (n-1)
With 3 masters, total messages: 3 * 2 * (n-1)
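As a sanity check of the message count, here is a small C simulation of the single-master variant, modeling each computer's set of small integers as a bitmask (the sets are illustrative):

#include <stdio.h>

#define N 4  /* number of computers (illustrative) */

int main(void) {
    unsigned sets[N] = { 0x0F, 0x33, 0xC1, 0x88 };  /* illustrative sets */
    int messages = 0;
    unsigned result = sets[0];           /* computer 0 acts as the master */

    for (int i = 1; i < N; ++i) {        /* everyone sends to the master */
        result |= sets[i];
        ++messages;
    }
    for (int i = 1; i < N; ++i) {        /* master sends the union back */
        sets[i] = result;
        ++messages;
    }
    sets[0] = result;
    printf("messages: %d (2 * (n-1) = %d)\n", messages, 2 * (N - 1));
    return 0;
}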
If all the computers are on the same network, you could use UDP sockets with the SO_BROADCAST option.
This way, when one computer sends a message, all the other computers receive it and update as necessary.
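A minimal sketch of the sending side in C, assuming an illustrative port and subnet broadcast address (a real implementation also needs a receiving loop and a portable wire format, since the raw int array below is endianness-dependent):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    int yes = 1;
    /* SO_BROADCAST must be enabled before sending to a broadcast address. */
    setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &yes, sizeof(yes));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                       /* illustrative port */
    addr.sin_addr.s_addr = inet_addr("192.168.1.255"); /* subnet broadcast */

    int my_set[] = { 1, 2, 3, 4 };   /* this computer's integers */
    sendto(sock, my_set, sizeof(my_set), 0,
           (struct sockaddr *)&addr, sizeof(addr));
    close(sock);
    return 0;
}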
Here's one way of doing it in 2n - 2 messages. Model the machines as nodes in a linked list, numbered 1..n.
Node 1 sends all of its data in one message to node 2.
Node 2 remembers the message node 1 sent, performs a union of its content with node 1's content, and sends the unified message to node 3. Then node 2 waits for a response from node 3.
Node 3 does the same, and so on, until we get to node n. Node n now has the complete set.
Node n already knows what message node n-1 sent it, so it sends the diff back to node n-1.
Node n-1 performs the union as above. Since it has remembered the message of node n-2 (from step 2 above), it can send the diff back to node n-2,
and so on.
It is not hard to show that this sends 2 * (n - 1) messages in the network.
I think it can also be proven that 2n - 2 messages are necessary, by considering the case where each node holds a unique element; it should be a short exercise in mathematical induction.
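Here is a small C simulation of the two passes, again modeling each machine's set as a bitmask (the sets are illustrative); it confirms the 2n - 2 message count and that every machine ends with the full union:

#include <stdio.h>

#define N 4  /* number of machines (illustrative) */

int main(void) {
    unsigned sets[N] = { 0x0F, 0x33, 0xC1, 0x88 };  /* illustrative sets */
    int messages = 0;

    /* Forward pass: node i sends its accumulated union to node i+1. */
    for (int i = 0; i + 1 < N; ++i) {
        sets[i + 1] |= sets[i];
        ++messages;
    }
    /* Backward pass: node i sends the complete union back to node i-1. */
    for (int i = N - 1; i > 0; --i) {
        sets[i - 1] |= sets[i];
        ++messages;
    }

    printf("messages sent: %d (expected 2n-2 = %d)\n", messages, 2 * N - 2);
    for (int i = 0; i < N; ++i)
        printf("machine %d set: 0x%X\n", i, sets[i]);  /* all identical */
    return 0;
}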
Well, there's already a system that does this: it's widely adopted, well documented, widely available and, while perhaps not perfect (for assorted definitions of perfect), practical.
It's called rsync.
You can start here: http://www.samba.org/~tridge/phd_thesis.pdf
