Say I have 4 MPI processes labelled: P0, P1, P2, P3. Each process potentially has packets to send to other processes, but may not.
For example, P0 needs to send packets to P1 and P2, i.e.
P0->[P1, P2]
Similarly,
P1->[P3]
P2 ->[]
P3 -> [P1]
So P1 has to receive potential packets from both P0 and P3, and P3 has to receive packets from P1, and P2 from P0.
How do I do this in MPI? It's sort of like a 'sparse' all-to-all communication. However, in order to set up the receives, each process needs to know how many packets it will receive, and I'm not sure how to determine this: using MPI_Mprobe in a loop breaks out as soon as the receiver detects a single packet. How do I ensure that it only stops once all packets have been received?
Each process needs to tell every other process how many messages there will be, including zero. You can do that with an all-to-all.
However, you can do this more efficiently with a reduce-scatter. Each process makes a send buffer of length P with a 0/1 entry depending on whether a message is sent to that process. View that as a matrix where element (i,j) is 1 if process i sends to j. A reduce-scatter then gives each process j the sum of the elements in column j, i.e. the number of messages it will receive. You then run an MPI_Probe that many times.
I've solved it with the following method, similar to yours, @Victor Eijkhout.
Code snippet in Rust:
let mut to_receive = vec![0i32; size as usize];
world.all_reduce_into(
    &packet_destinations,
    &mut to_receive,
    SystemOperation::sum(),
);
Here packet_destinations is a vector whose i-th entry is 1 if the current process sends data to process i, and zero otherwise.
Thank you for your response, I loved your HPC textbook by the way.
I have an app in Golang with a pipeline set up where each component performs some work, then passes its results along to another component via a buffered channel; that component performs some work on its input and passes its results along to yet another component via another buffered channel, and so on. For example:
C1 -> C2 -> C3 -> ...
where C1, C2, C3 are components in the pipeline and each "->" is a buffered channel.
In Golang buffered channels are great because they force a fast producer to slow down to match its downstream consumer (or a fast consumer to slow down to match its upstream producer). So, like an assembly line, my pipeline moves along as fast as the slowest component in that pipeline.
The problem is I want to figure out which component in my pipeline is the slowest one so I can focus on improving that component in order to make the whole pipeline faster.
The way that Golang forces a fast producer or a fast consumer to slow down is by blocking the producer when it tries to send to a buffered channel that is full, or by blocking the consumer when it tries to consume from a channel that is empty. Like this:
outputChan <- result // producer would block here when sending to full channel
input := <- inputChan // consumer would block here when consuming from empty channel
This makes it hard to tell which one, the producer or the consumer, is blocking the most, and thus which is the slowest component in the pipeline, because I cannot tell how long each one is blocked for. The one that is blocked for the most time is the fastest component, and the one that is blocked the least (or not blocked at all) is the slowest component.
I can add code like this just before the read from or write to the channel to tell whether it would block:
// for producer
if len(outputChan) == cap(outputChan) {
    producerBlockingCount++
}
outputChan <- result

// for consumer
if len(inputChan) == 0 {
    consumerBlockingCount++
}
input := <-inputChan
However, that would only tell me the number of times it would block, not the total amount of time it is blocked. Not to mention the TOCTOU issue: the check looks at a single point in time, and the state could change immediately after the check, rendering it incorrect/misleading.
Anybody who has ever been to a casino knows that it's not the number of times you win or lose that matters, it's the total amount of money you win or lose that really matters. I can lose 10 hands at $10 each (for a total loss of $100) and then win one single hand of $150, and I still come out ahead.
Likewise, it's not the number of times that a producer or consumer is blocked that's meaningful. It's the total amount of time that a producer or consumer is blocked that determines whether it's the slowest component or not.
But I cannot think of any way to determine the total amount of time that something is blocked when reading from / writing to a buffered channel. Or my google-fu isn't good enough. Does anyone have any bright ideas?
There are several solutions that spring to mind.
1. stopwatch
The least invasive and most obvious is to just note the time, before and after, each read or write. Log it, sum it, and report on the total I/O delay. Similarly report on elapsed processing time.
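For example, here is a minimal sketch of the stopwatch idea in Go, assuming hypothetical sendTimed/recvTimed wrappers and a deliberately slowed consumer; the counters, stage logic, and sleep duration are placeholders, not your actual pipeline:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

var producerBlockedNs, consumerBlockedNs int64 // total nanoseconds spent blocked

// sendTimed notes the time before and after the send and accumulates the delay.
func sendTimed(ch chan<- int, v int, blockedNs *int64) {
    start := time.Now()
    ch <- v
    atomic.AddInt64(blockedNs, time.Since(start).Nanoseconds())
}

// recvTimed notes the time before and after the receive and accumulates the delay.
func recvTimed(ch <-chan int, blockedNs *int64) int {
    start := time.Now()
    v := <-ch
    atomic.AddInt64(blockedNs, time.Since(start).Nanoseconds())
    return v
}

func main() {
    const n = 1000
    ch := make(chan int, 8)
    var wg sync.WaitGroup
    wg.Add(2)

    go func() { // fast producer
        defer wg.Done()
        for i := 0; i < n; i++ {
            sendTimed(ch, i, &producerBlockedNs)
        }
    }()

    go func() { // slow consumer: the simulated work makes it the bottleneck
        defer wg.Done()
        for i := 0; i < n; i++ {
            _ = recvTimed(ch, &consumerBlockedNs)
            time.Sleep(100 * time.Microsecond)
        }
    }()

    wg.Wait()
    fmt.Println("producer blocked for:", time.Duration(producerBlockedNs))
    fmt.Println("consumer blocked for:", time.Duration(consumerBlockedNs))
}

The stage that reports the least blocked time is the one everybody else is waiting on.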
2. benchmark
Do a synthetic bench, where you have each stage operate on a million identical inputs, producing a million identical outputs.
Or do a "system test" where you wiretap the messages that flowed through production, write them to log files, and replay the relevant log messages to each of your various pipeline stages, measuring elapsed times. Due to the replay, there will be no I/O throttling.
3. pub/sub
Re-architect to use a higher-overhead comms infrastructure, such as Kafka / 0mq / RabbitMQ. Change the number of nodes participating in stage-1 processing, stage-2, etc. The idea is to overwhelm the stage currently under study, with no idle cycles, to measure its transactions / second throughput when saturated.
Alternatively, just distribute each stage to its own node, and measure {user, sys, idle} times during normal system behavior.
I have found a Python version of the Banker's algorithm on the GeeksForGeeks site (linked below).
However, how can I test and show that the safe ordering is correct?
And how can I show, with an example, that other orderings lead to an error or problem?
https://www.geeksforgeeks.org/bankers-algorithm-in-operating-system-2/
Introduction
Let's consider a very simple example. Let's say there are 2 processes, P0 and P1, and there's only one type of resource, A. The system allocates 10 units of A to P0 and 0 to P1, and it still has 1 unit of A left. Moreover, in total, P0 may request up to 11 units during its execution, and P1 up to 5.
Let's quickly build up tables and vectors used to determine safe or unsafe sequences for these processes.
Allocation table
Allocation table shows how many resources of each type are allocated to processes. In your example, it looks as follows:
Process   A
P0        10
P1        0
Availability vector
The availability vector shows how many units the system still has available to grant.
A
1
Maximum table
Maximum table shows how many units of A each process may request during the execution (in total).
Process   A
P0        11
P1        5
Need table
The need table shows how many units of A each process may additionally request during its execution (Need = Maximum - Allocation).
Process   A
P0        1
P1        5
Safe sequence
Now, let's say we ran the Banker's algorithm for our configuration and got the following sequence:
P0 -> P1
Why is it safe?
Case 1 - processes are executed in sequence
P0 starts executing, and demands and receives the remaining 1 unit. So, the system has 0 available resources left. However, once P0 completes, it releases 11 units of A, which is more than enough for P1 to run and complete.
Case 2 - processes are executed in parallel
P0 starts executing, and demands and receives the remaining 1 unit. Then, during its execution, P1 starts too and asks for 5 units. However, its request gets postponed because the system has none. So, the request is put on a waiting list. Later, when P0 releases at least 5 units, P1 finally gets 5. Obviously, no deadlock can happen because if P0 needs resources again, it will either wait for P1 or just ask the system and vice versa.
Unsafe sequence
P1 -> P0
P1 starts executing and demands 5 units from the system. It gets denied and its request is put on a waiting list because the system has only 1 unit. Then, P0 starts and demands 1 unit. It also gets denied because P1 is waiting for 5 units already. The request from P0 is put on the waiting list too. So, we have a deadlock situation because neither of the requests can ever go through.
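To actually test an ordering, you can run the safety check that the Banker's algorithm itself uses over the candidate sequence: at each step the process's remaining need must fit in the currently available units, and when the process finishes, its allocation is returned to the pool. Here is a minimal single-resource sketch in Go using the numbers from the example above (the function name is mine; the GeeksForGeeks version generalizes this to several resource types):

package main

import "fmt"

// isSafeOrder runs the Banker's safety check over a candidate ordering:
// each process must be able to obtain its remaining need from the pool of
// available units; once it finishes, its allocation is returned to the pool.
func isSafeOrder(order []int, allocation, need []int, available int) bool {
    work := available
    for _, p := range order {
        if need[p] > work {
            return false // p could never get its remaining need in this order
        }
        work += allocation[p] // p finishes and releases everything it held
    }
    return true
}

func main() {
    allocation := []int{10, 0} // P0 holds 10 units of A, P1 holds 0
    need := []int{1, 5}        // Need = Maximum - Allocation
    available := 1

    fmt.Println(isSafeOrder([]int{0, 1}, allocation, need, available)) // true:  P0 -> P1 is safe
    fmt.Println(isSafeOrder([]int{1, 0}, allocation, need, available)) // false: P1 -> P0 deadlocks
}

For multiple resource types, the comparison need[p] > work becomes an element-wise comparison over vectors.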
I'm looking for the best algorithm for message scheduling. What I mean by message scheduling is a way to send messages on a bus when we have many consumers at different rates.
Example :
Suppose that we have data D1 to Dn:
- D1 is sent to many consumers: C1 every 5 ms, C2 every 19 ms, C3 every 30 ms, ..., Cn every Rn ms
- Dn is sent to C1 every 10 ms, C2 every 31 ms, ..., Cn every 50 ms
What is the best algorithm to schedule these actions with the best performance (CPU, memory, IO)?
Regards
I can think of quite a few options, each with their own costs and benefits. It really comes down to exactly what your needs are -- what really defines "best" for you. I've pseudocoded a couple possibilities below to hopefully help you get started.
Option 1: Execute the following every time unit (in your example, every millisecond)
func callEachMs
    time = getCurrentTime()
    for each datum
        for each customer
            if time % datum.customer.rate == 0
                sendMsg()
This has the advantage of requiring no consistently stored memory -- you just check at each time unit whether you should be sending a message. This can also deal with messages that weren't sent at time == 0 -- just store the time the message was initially sent modulo the rate, and replace the conditional with if time % datum.customer.rate == data.customer.firstMsgTimeMod.
A downside to this method is that it is completely reliant on always being called at a rate of 1 ms. If there's lag caused by another process on a CPU and it misses a cycle, you may miss sending a message altogether (as opposed to sending it a little late).
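As a rough sketch of option 1 in Go (the rates, the sendMsg stub, and the use of time.Ticker as the once-per-millisecond driver are all placeholders/assumptions, not part of the original pseudocode):

package main

import (
    "fmt"
    "time"
)

// rates[datum][consumer] is the period, in milliseconds, at which that datum
// is sent to that consumer. These values are placeholders.
var rates = map[string]map[string]int64{
    "D1": {"C1": 5, "C2": 19, "C3": 30},
    "Dn": {"C1": 10, "C2": 31},
}

func sendMsg(datum, consumer string) { // stub
    fmt.Printf("send %s -> %s\n", datum, consumer)
}

func main() {
    ticker := time.NewTicker(time.Millisecond) // the "call each ms" driver
    defer ticker.Stop()

    for t := int64(1); t <= 100; t++ { // run for 100 ms in this sketch
        <-ticker.C // assumes no ticks are missed, which is exactly the caveat above
        for datum, consumers := range rates {
            for consumer, rate := range consumers {
                if t%rate == 0 { // this consumer is due this millisecond
                    sendMsg(datum, consumer)
                }
            }
        }
    }
}

Note that time.Ticker drops ticks if the receiver falls behind, so this sketch shares the "missed cycle" caveat described above.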
Option 2: Maintain a list of lists of tuples, where each entry represents the tasks that need to be done at that millisecond. Make your list at least as long as the longest rate divided by the time unit (if your longest rate is 50 ms and you're going by ms, your list must be at least 50 long). When you start your program, place the first time a message will be sent into the queue. And then each time you send a message, update the next time you'll send it in that list.
func buildList(&list)
    for each datum
        for each customer
            if list.size < datum.customer.rate
                list.resize(datum.customer.rate+1)
            list[customer.rate].push_back(tuple(datum.name, customer.name))

func callEachMs(&list)
    for each (datum.name, customer.name) in list[0]
        sendMsg()
        list[customer.rate].push_back((datum.name, customer.name))
    list.pop_front()
    list.push_back(empty list)
This has the advantage of avoiding the many unnecessary modulus calculations option 1 required. However, that comes with the cost of increased memory usage. This implementation would also not be efficient if there's a large disparity in the rates of your various messages (although you could modify it to deal with messages with longer rates more efficiently). And it still has to be called every millisecond.
Finally, you'll have to think very carefully about what data structure you use, as this will make a huge difference in its efficiency. Because you pop from the front and push from the back at every iteration, and the list is a fixed size, you may want to implement a circular buffer to avoid unneeded moving of values. For the lists of tuples, since they're only ever iterated over (random access isn't needed), and there are frequent additions, a singly-linked list may be your best solution.
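As a rough sketch of option 2 in Go, using a slice as a circular buffer of per-millisecond slots (the task type, rates, and loop bound are placeholders):

package main

import "fmt"

// task says which datum goes to which consumer, and at what period.
type task struct {
    datum, consumer string
    rate            int // milliseconds between sends; placeholder values below
}

func sendMsg(t task) { fmt.Printf("send %s -> %s\n", t.datum, t.consumer) } // stub

func main() {
    tasks := []task{{"D1", "C1", 5}, {"D1", "C2", 19}, {"Dn", "C1", 10}}

    // Size the ring one slot larger than the longest rate so that an entry
    // rescheduled "rate ms from now" never lands back on the current slot.
    size := 0
    for _, t := range tasks {
        if t.rate > size {
            size = t.rate
        }
    }
    size++
    slots := make([][]task, size) // slots[(head+k)%size] = tasks due k ms from now
    head := 0

    for _, t := range tasks {
        slots[t.rate] = append(slots[t.rate], t) // first send is rate ms from start
    }

    for now := 0; now < 60; now++ { // in a real system this body runs once per ms
        due := slots[head]
        slots[head] = nil // clear the slot before rescheduling into the ring
        for _, t := range due {
            sendMsg(t)
            next := (head + t.rate) % size // reschedule the next send
            slots[next] = append(slots[next], t)
        }
        head = (head + 1) % size
    }
}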
Obviously, there are many more ways that you could do this, but hopefully these ideas can get you started. Also, keep in mind that the nature of the system you're running this on could have a strong effect on which method works better, or whether you want to do something else entirely. For example, both methods require that they can be reliably called at a certain rate. I also haven't described parallelized implementations, which may be the best option if your application supports them.
As Helium_1s2 described, there is a second way, based on what I call a schedule table. This is what I use now, but this solution has its limits.
Suppose that we have one datum to send and two consumers, C1 and C2. We must extract a schedule table and identify the repeating transmission cycle and the value of the IDLE MINIMUM PERIOD. In fact, it is useless to loop on the smallest possible unit of time (e.g. 1 ms, 1 ns, 1 min or 1 h, depending on the case); that is not always the best period, and we can optimize this loop as follows.
For example (C1 every 6 ms and C2 every 9 ms), we note that there is a cycle that repeats from 0 to 18 ms, with a minimal gap between two consecutive send events equal to 3 ms.
So:
HCF(6,9) = 3 = IDLE MINIMUM PERIOD
LCM(6,9) = 18 = transmission cycle length
LCM/HCF = 6 = size of our schedule table
And the schedule table is:

t = 0 ms  -> C1, C2
t = 3 ms  -> idle
t = 6 ms  -> C1
t = 9 ms  -> C2
t = 12 ms -> C1
t = 15 ms -> idle
and the sending loop looks like:

while (1) {
    send(ScheduleTable[i]);      // i is initialized to 0
    sleep(IDLE_MINIMUM_PERIOD);  // free the CPU for the idle minimum period
    i++;
    if (i == sizeof(ScheduleTable) / sizeof(ScheduleTable[0]))
        i = 0;                   // wrap around at the end of the transmission cycle
}
The problem with this method is that this array grows as the LCM grows, which happens with bad combinations, e.g. when the rates are prime numbers, etc.
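For reference, here is a small sketch in Go of how the table above can be derived (the periods 6 and 9 are the ones from the example; the helper names are mine):

package main

import "fmt"

// gcd and lcm over the consumer periods give the idle minimum period (the tick)
// and the length of the repeating transmission cycle.
func gcd(a, b int) int {
    for b != 0 {
        a, b = b, a%b
    }
    return a
}

func lcm(a, b int) int { return a / gcd(a, b) * b }

func main() {
    periods := map[string]int{"C1": 6, "C2": 9} // consumer -> period in ms

    tick, cycle := 0, 1
    for _, p := range periods {
        tick = gcd(tick, p) // gcd(0, p) == p, so this folds over all periods
        cycle = lcm(cycle, p)
    }
    slots := cycle / tick // size of the schedule table

    // scheduleTable[i] lists the consumers to serve at time i*tick.
    scheduleTable := make([][]string, slots)
    for name, p := range periods {
        for t := 0; t < cycle; t += p {
            scheduleTable[t/tick] = append(scheduleTable[t/tick], name)
        }
    }

    fmt.Println("idle minimum period:", tick, "ms, cycle:", cycle, "ms")
    for i, entry := range scheduleTable {
        fmt.Printf("t = %2d ms -> %v\n", i*tick, entry)
    }
}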
In order to convince oneself that the complications of standard algorithms such as Paxos and Raft are necessary, one must understand why simpler solutions aren't satisfactory. Suppose that, in order to reach consensus w.r.t a stream of events in a cluster of N machines (i.e., implement a replicated time-growing log), the following algorithm is proposed:
Whenever a machine wants to append a message to the log, it broadcasts the tuple (msg, rnd, prev), where msg is the message, rnd is a random number, and prev is the ID of the last message on the log.
When a machine receives a tuple, it inserts msg as a child of prev, forming a tree.
If a node has more than one child, only the one with highest rnd is considered valid; the path of valid messages through the tree is the main chain.
If a message is part of the main chain, and it is old enough, it is considered decided/final.
If a machine attempts to submit a message and, after some time, it isn't present on the main chain, that means another machine broadcasted a message at roughly the same time, so you re-broadcast it until it is there.
Looks simple, efficient and resilient to crashes. Would this algorithm work?
I think you have a problem if a machine sends two tuples in sequence and the first gets lost (packet loss/corruption or whatever).
In that case, let's say machine 1 has a prev element ID of 10 and sends two more messages, (msg, rnd, 10) = 11 and (msg, rnd, 11) = 12, to machine 2.
Machine 2 only receives (msg, rnd, 11), i.e. message 12, but does not have the prev ID 11 in its tree.
Machine 3 receives both, so it inserts them into its tree.
At this point you would have a desync between the distributed trees.
I propose an ack for the messages, sent back by machine x to the sender after they are inserted into its tree, with the sender waiting for the acks before sending the next message.
This way the sender resends the previous message to the machines that failed to ack within a given timeframe.
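A rough sketch of that ack-and-resend loop in Go, with buffered channels standing in for the network links (the message shape, names, and timeout are illustrative only):

package main

import (
    "fmt"
    "time"
)

type msg struct{ id, prev int }
type ack struct{ peer, id int }

// broadcastWithAck keeps resending m to every peer that has not acknowledged
// it yet and returns only once all peers have, so the next message (whose
// prev points at m) can never reach a machine that is still missing m.
func broadcastWithAck(m msg, peers []chan msg, acks chan ack, timeout time.Duration) {
    pending := make(map[int]bool) // peer index -> still waiting for its ack
    for i := range peers {
        pending[i] = true
        peers[i] <- m
    }
    for len(pending) > 0 {
        select {
        case a := <-acks:
            if a.id == m.id {
                delete(pending, a.peer)
            }
        case <-time.After(timeout):
            for i := range pending { // resend to the machines that failed to ack
                peers[i] <- m
            }
        }
    }
}

func main() {
    peers := []chan msg{make(chan msg, 8), make(chan msg, 8)}
    acks := make(chan ack, 8)

    // Each peer would insert the message into its tree (omitted here) and then ack it.
    for i := range peers {
        go func(i int) {
            for m := range peers[i] {
                acks <- ack{peer: i, id: m.id}
            }
        }(i)
    }

    broadcastWithAck(msg{id: 11, prev: 10}, peers, acks, 50*time.Millisecond)
    broadcastWithAck(msg{id: 12, prev: 11}, peers, acks, 50*time.Millisecond)
    fmt.Println("both messages acknowledged before the next was sent")
}

Resends can cause duplicate deliveries, so inserting into the tree should be idempotent on the message ID.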
What deterministic algorithm is suitable for the following resource allocation/scheduling problem?
Consider a set of players: P1, P2, P3 and P4. Each player receives data from a cell tower (e.g. in a wireless network). The tower transmits data in 1 second blocks. There are 5 blocks. Each player can be scheduled to receive data in an arbitrary number of the blocks.
Now, the amount of data received in each block is a constant (C) divided by the number of other players scheduled in the same block (because the bandwidth must be shared). A greedy approach would allocate each player to each block but then the data received per block would be reduced.
How can we find an allocation of the players to time-blocks so that the amount of data delivered by the network is maximised? I have tried a number of heuristic methods on this problem (genetic algorithms, simulated annealing) and they work well. However, I'd like to solve for the optimum schedule.