MPI: blocking vs non-blocking - parallel-processing

MPI: blocking vs non-blocking - parallel-processing

I am having trouble understanding the concept of blocking communication and non-blocking communication in MPI. What are the differences between the two? What are the advantages and disadvantages?

Blocking communication is done using MPI_Send() and MPI_Recv(). These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the buffer passed to MPI_Send() can be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv() returns when the receive buffer has been filled with valid data.
In contrast, non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These function return immediately (i.e., they do not block) even if the communication is not finished yet. You must call MPI_Wait() or MPI_Test() to see whether the communication has finished.
Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary, for example, you may call MPI_Isend(), do some computations, then do MPI_Wait(). This allows computations and communication to overlap, which generally leads to improved performance.
Note that collective communication (e.g., all-reduce) is only available in its blocking version up to MPIv2. IIRC, MPIv3 introduces non-blocking collective communication.
A quick overview of MPI's send modes can be seen here.

This post, although is a bit old, but I contend the accepted answer. the statement " These functions don't return until the communication is finished" is a little misguiding because blocking communications doesn't guarantee any handshake b/w the send and receive operations.
First one needs to know, send has four modes of communication : Standard, Buffered, Synchronous and Ready and each of these can be blocking and non-blocking
Unlike in send, receive has only one mode and can be blocking or non-blocking .
Before proceeding further, one must also be clear that I explicitly mention which one is MPI_Send\Recv buffer and which one is system buffer( which is a local buffer in each processor owned by the MPI Library used to move data around among ranks of a communication group)
BLOCKING COMMUNICATION :
Blocking doesn't mean that the message was delivered to the receiver/destination. It simply means that the (send or receive) buffer is available for reuse. To reuse the buffer, it's sufficient to copy the information to another memory area, i.e the library can copy the buffer data to own memory location in the library and then, say for e.g, MPI_Send can return.
MPI standard makes it very clear to decouple the message buffering from send and receive operations. A blocking send can complete as soon as the message was buffered, even though no matching receive has been posted. But in some cases message buffering can be expensive and hence direct copying from send buffer to receive buffer might be efficient. Hence MPI Standard provides four different send modes to give the user some freedom in selecting the appropriate send mode for her application. Lets take a look at what happens in each mode of communication :
1. Standard Mode
In the standard mode, it is up to the MPI Library, whether or not to buffer the outgoing message. In the case where the library decides to buffer the outgoing message, the send can complete even before the matching receive has been invoked. In the case where the library decides not to buffer (for performance reasons, or due to unavailability of buffer space), the send will not return until a matching receive has been posted and the data in send buffer has been moved to the receive buffer.
Thus MPI_Send in standard mode is non-local in the sense that send in standard mode can be started whether or not a matching receive has been posted and its successful completion may depend on the occurrence of a matching receive ( due to the fact it is implementation dependent if the message will be buffered or not) .
The syntax for standard send is below :
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
2. Buffered Mode
Like in the standard mode, the send in buffered mode can be started irrespective of the fact that a matching receive has been posted and the send may complete before a matching receive has been posted. However the main difference arises out of the fact that if the send is stared and no matching receive is posted the outgoing message must be buffered. Note if the matching receive is posted the buffered send can happily rendezvous with the processor that started the receive, but in case there is no receive, the send in buffered mode has to buffer the outgoing message to allow the send to complete. In its entirety, a buffered send is local. Buffer allocation in this case is user defined and in the event of insufficient buffer space, an error occurs.
Syntax for buffer send :
int MPI_Bsend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
3. Synchronous Mode
In synchronous send mode, send can be started whether or not a matching receive was posted. However the send will complete successfully only if a matching receive was posted and the receiver has started to receive the message sent by synchronous send. The completion of synchronous send not only indicates that the buffer in the send can be reused, but also the fact that receiving process has started to receive the data. If both send and receive are blocking then the communication does not complete at either end before the communicating processor rendezvous.
Syntax for synchronous send :
int MPI_Ssend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
4. Ready Mode
Unlike the previous three mode, a send in ready mode can be started only if the matching receive has already been posted. Completion of the send doesn't indicate anything about the matching receive and merely tells that the send buffer can be reused. A send that uses ready mode has the same semantics as standard mode or a synchronous mode with the additional information about a matching receive. A correct program with a ready mode of communication can be replaced with synchronous send or a standard send with no effect to the outcome apart from performance difference.
Syntax for ready send :
int MPI_Rsend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
Having gone through all the 4 blocking-send, they might seem in principal different but depending on implementation the semantics of one mode may be similar to another.
For example MPI_Send in general is a blocking mode but depending on implementation, if the message size is not too big, MPI_Send will copy the outgoing message from send buffer to system buffer ('which mostly is the case in modern system) and return immediately. Lets look at an example below :
//assume there are 4 processors numbered from 0 to 3
if(rank==0){
tag=2;
MPI_Send(&send_buff1, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
MPI_Send(&send_buff2, 1, MPI_DOUBLE, 2, tag, MPI_COMM_WORLD);
MPI_Recv(&recv_buff1, MPI_FLOAT, 3, 5, MPI_COMM_WORLD);
MPI_Recv(&recv_buff2, MPI_INT, 1, 10, MPI_COMM_WORLD);
}
else if(rank==1){
tag = 10;
//receive statement missing, nothing received from proc 0
MPI_Send(&send_buff3, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
MPI_Send(&send_buff3, 1, MPI_INT, 3, tag, MPI_COMM_WORLD);
}
else if(rank==2){
MPI_Recv(&recv_buff, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
//do something with receive buffer
}
else{ //if rank == 3
MPI_Send(send_buff, 1, MPI_FLOAT, 0, 5, MPI_COMM_WORLD);
MPI_Recv(recv_buff, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
}
Lets look at what is happening at each rank in the above example
Rank 0 is trying to send to rank 1 and rank 2, and receive from rank 1 andd 3.
Rank 1 is trying to send to rank 0 and rank 3 and not receive anything from any other ranks
Rank 2 is trying to receive from rank 0 and later do some operation with the data received in the recv_buff.
Rank 3 is trying to send to rank 0 and receive from rank 1
Where beginners get confused is that rank 0 is sending to rank 1 but rank 1 hasn't started any receive operation hence the communication should block or stall and the second send statement in rank 0 should not be executed at all (and this is what MPI documentation stress that it is implementation defined whether or not the outgoing message will be buffered or not). In most of the modern system, such messages of small sizes (here size is 1) will easily be buffered and MPI_Send will return and execute its next MPI_Send statement. Hence in above example, even if the receive in rank 1 is not started, 1st MPI_Send in rank 0 will return and it will execute its next statement.
In a hypothetical situation where rank 3 starts execution before rank 0, it will copy the outgoing message in the first send statement from the send buffer to a system buffer (in a modern system ;) ) and then start executing its receive statement. As soon as rank 0 finishes its two send statements and begins executing its receive statement, the data buffered in system by rank 3 is copied in the receive buffer in rank 0.
In case there's a receive operation started in a processor and no matching send is posted, the process will block until the receive buffer is filled with the data it is expecting. In this situation an computation or other MPI communication will be blocked/halted unless MPI_Recv has returned.
Having understood the buffering phenomena, one should return and think more about MPI_Ssend which has the true semantics of a blocking communication. Even if MPI_Ssend copies the outgoing message from send buffer to a system buffer (which again is implementation defined), one must note MPI_Ssend will not return unless some acknowledge (in low level format) from the receiving process has been received by the sending processor.
Fortunately MPI decided to keep things easer for the users in terms of receive and there is only one receive in Blocking communication : MPI_Recv, and can be used with any of the four send modes described above. For MPI_Recv, blocking means that receive returns only after it contains the data in its buffer. This implies that receive can complete only after a matching send has started but doesn't imply whether or not it can complete before the matching send completes.
What happens during such blocking calls is that the computations are halted until the blocked buffer is freed. This usually leads to wastage of computational resources as Send/Recv is usually copying data from one memory location to another memory location, while the registers in cpu remain idle.
NON-BLOCKING COMMUNICATION :
For Non-Blocking Communication, the application creates a request for communication for send and / or receive and gets back a handle and then terminates. That's all that is needed to guarantee that the process is executed. I.e the MPI library is notified that the operation has to be executed.
For the sender side, this allows overlapping computation with communication.
For the receiver side, this allows overlapping a part of the communication overhead , i.e copying the message directly into the address space of the receiving side in the application.

In using blocking communication you must be care about send and receive calls for example
look at this code
if(rank==0)
{
MPI_Send(x to process 1)
MPI_Recv(y from process 1)
}
if(rank==1)
{
MPI_Send(y to process 0);
MPI_Recv(x from process 0);
}
What happens in this case?
Process 0 sends x to process 1 and blocks until process 1 receives x.
Process 1 sends y to process 0 and blocks until process 0 receives y, but
process 0 is blocked such that process 1 blocks for infinity until the two processes are killed.

It is easy.
Non-blocking means computation and transferring data can happen in the same time for a single process.
While Blocking means, hey buddy, you have to make sure that you have already finished transferring data then get back to finish the next command, which means if there is a transferring followed by a computation, computation must be after the success of transferring.

Both the accepted answer and the other very long one mention overlap of computation and communication as an advantage. That is 1. not the main motivation, and 2. very hard to attain. The main advantage (and the original motivation) of non-blocking communication is that you can express complicated communication patterns without getting deadlock and without processes serializing themselves unnecessarily.
Examples: Deadlock: everyone does a receive, then everyone does a send, for instance along a ring. This will hang.
Serialization: along a linear ordering, everyone except the last does a send to the right, then everyone except the first does a receive from the left. This will have all processes executing sequentially rather than in parallel.

Related

Create channels with extra flags in an idiomatic way

TL;DR I want to have the functionality where a channel has two extra fields that tell the producer whether it is allowed to send to the channel and if so tell the producer what value the consumer expects. Although I know how to do it with shared memory, I believe that this approach goes against Go's ideology of "Do not communicate by sharing memory; instead, share memory by communicating."
Context:
I wish to have a server S that runs (besides others) three goroutines:
Listener that just receives UDP packets and sends them to the demultplexer.
Demultiplexer that takes network packets and based on some data sends it into one of several channels
Processing task which listens to one specific channel and processes data received on that channel.
To check whether some devices on the network are still alive, the processing task will periodically send out nonces over the network and then wait for k seconds. In those k seconds, other participants of my protocol that received the nonce will send a reply containing (besides other information) the nonce. The demultiplexer will receive the packets from the listener, parse them and send them to the processing_channel. After the k seconds elapsed, the processing task processes the messages pushed onto the processing_channel by the demultiplexer.
I want the demultiplexer to not just blindly send any response (of the correct type) it received onto the the processing_channel, but to instead check whether the processing task is currently even expecting any messages and if so which nonce value it expects. I made this design decision in order to drop unwanted packets a soon as possible.
My approach:
In other languages, I would have a class with the following fields (in pseudocode):
class ActivatedChannel{
boolean flag_expecting_nonce;
int expected_nonce;
LinkedList chan;
}
The demultiplexer would then upon receiving a packet of the correct type simply acquire the lock for the ActivatedChannel processing_channel object, check whether the flag is set and the nonce matches, and if so add the message to the LinkedList chan!
Problem:
This approach makes use of locks and shared memory, which does not align with Golang's "Do not communicate by sharing memory; instead, share memory by communicating" mantra. Hence, I would like to know... :
... whether my approach is "bad" regarding Go in the sense that it relies on shared memory.
... how to achieve the outlined result in a more Go-like way.

Yes, the approach described by you doesn't align with Golang's Idiomatic way of implementation. And you have rightly pointed out that in the above approach you are communicating by sharing memory.
To achieve this in Go's Idiomatic way, one of the approaches could be that your Demultiplexer "remembers" all the processing_channels that are expecting nonce and the corresponding type of the nonce. Whenever a processing_channels is ready to receive a reply, it sends a signal to the Demultiplexe saying that it is expecting a reply.
Since Demultiplexer is at the center of all the communication it can maintain a mapping between a processing_channel & the corresponding nonce it expects. It can also maintain a "registry" of all the processing_channels which are expecting a reply.
In this approach, we are Sharing memory by communicating
For communicating that a processing_channel is expecting a reply, the following struct can be used:
type ChannelState struct {
ChannelId string // unique identifier for processing channel
IsExpectingNonce bool
ExpectedNonce int
}
In this approach, there is no lock used.

MPI_SSEND : How does it guarantee reuse of the sender buffer?

I understand that MPI_Bsend will save the sender's buffer in local buffer managed by MPI library, hence it's safe to reuse the sender's buffer for other purposes.
What I do not understand is how MPI_Ssend guarantee this?
Send in synchronous mode.
A send that uses the synchronous mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive
As per above, MPI_Ssend will return (ie allow further program execution) if matching receive has been posted and it has started to receive the message sent by the synchronous send. Consider the following case:
I send a huge data array of int say data[1 million] via MPI_Ssend. Another process starts receiving it (but might not have done so completely), which allows MPI_Ssend to return and execute the next program statement. The next statement makes changes to the buffer at very end data[1 million] = \*new value*\. Then the MPI_Ssend finally reaches the buffer end and sends this new value which was not what I wanted.
What am I missing in this picture?
TIA

MPI_Ssend() is both blocking and synchronous.
From the MPI 3.1 standard (chapter 3.4, page 37)
Thus, the completion of a synchronous send not only indicates
that the send buffer can be reused, but it also indicates that the
receiver has reached a certain point in its execution, namely that it
has started executing the matching receive.

MPI non-blocking communication and pthreads difference?

In MPI, there are non-blocking calls like MPI_Isend and MPI_Irecv.
If I am working on a p2p project, the Server would listen to many clients.
One way to do it:
for(int i = 1; i < highest_rank; i++){
MPI_Irecv(....,i,....statuses[i]); //listening to all slaves
}
while(true){
for( int i = 1; i < highest_rank; i++){
checkStatus(statuses[i])
if true do somthing
}
Another old way that I could do it is:
Server creating many POSIX threads, pass in a function,
that function will call MPI_Recv and loop forever.
Theoretically, which one would perform faster on the server end? If there is another better way to write the server, please let me know as well.

The latter solution does not seem very efficient to me because of all the overhead from managing the pthreads inside a MPI process.
Anyway I would rewrite you MPI code as:
for(int i = 1; i < highest_rank; i++){
MPI_Irev(....,i,....requests[i]); //listening to all slaves
}
while(true){
MPI_waitany(highest_rank, request[i], index, status);
//do something useful
}
Even better you can use MPI_Recv with MPI_ANY_SOURCE as the rank of the source of a message. It seems like your server does not have anything to do except serving request therefore there is no need to use an asynchronous recv.
Code would be:
while(true){
MPI_Recv(... ,MPI_ANY_SOURCE, REQUEST_TAG,MPI_comm,status)
//retrieve client id from status and do something
}

When calling MPI_Irecv, it is NOT safe to test the recv buffer until AFTER MPI_Test* or MPI_Wait* have been called and successfully completed. The behavior of directly testing the buffer without making those calls is implementation dependent (and ranges from not so bad to a segfault).
Setting up a 1:1 mapping with one MPI_Irecv for each remote rank can be made to work. Depending on the amount of data that is being sent, and the lifespan of that data once received, this approach may consume an unacceptable amount of system resources. Using MPI_Testany or MPI_Testall will likely provide the best balance between message processing and CPU load. If there is no non-MPI processing that needs to be done while waiting on incoming messages, MPI_Waitany or MPI_Waitall may be preferable.
If there are outstanding MPI_Irecv calls, but the application has reached the end of normal processing, it is "necessary" to MPI_Cancel those outstanding calls. Failing to do that may be caught in MPI_Finalize as an error.
A single MPI_Irecv (or just MPI_Recv, depending on how aggressive the message handling needs to be) on MPI_ANY_SOURCE also provides a reasonable solution. This approach can also be useful if the amount of data received is "large" and can be safely discarded after processing. Processing a single incoming buffer at a time can reduce the total system resources required, at the expense of serializing the processing.

Let me just comment on your idea to use POSIX threads (or whatever other threading mechanism there might be). Making MPI calls from multiple threads at the same time requires that the MPI implementation is initialised with the highest level of thread support of MPI_THREAD_MULTIPLE:
int provided;
MPI_Init_thread(&argv, &argc, MPI_THREAD_MULTIPLE, &provided);
if (provided != MPI_THREAD_MULTIPLE)
{
printf("Error: MPI does not provide full thread support!\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
Although the option to support concurrent calls from different threads was introduced in the MPI standard quite some time ago, there are still MPI implementations that struggle to provide fully working multithreaded support. MPI is all about writing portable, at least in theory, applications, but in this case real life differs badly from theory. For example, one of the most widely used open-source MPI implementation - Open MPI - still does not support native InfiniBand communication (InfiniBand is the very fast low latency fabric, used in most HPC clusters nowadays) when initialised at MPI_THREAD_MULTIPLE level and therefore switches to different, often much slower and with higher latency transports like TCP/IP over regular Ethernet or IP-over-InfiniBand. Also there are some supercomputer vendors, whose MPI implementations do not support MPI_THREAD_MULTIPLE at all, often because of the way the hardware works.
Besides, MPI_Recv is a blocking call which poses problems with proper thread cancellation (if necessary). You have to make sure that all threads escape the infinite loop somehow, e.g. by having each worker send a termination message with the appropriate tag or by some other protocol.

How to send a message without a specific destination in MPI?

I want to send a message to one of the ranks receiving a message with a specific tag. If there is any rank received the message and the message is consumed. In MPI_Recv() we can receive a message using MPI_ANY_SOURCE/MPI_ANY_TAG but the MPI_Send() can't do this. How can I send an message with unknown destination?
The MPI_Bcast() can't do it because after receiving, I have to reply to the source process.
Thanks.

What I would do is have the worker processes signal to the master that they're ready to receive. The master would keep track of which ranks are ready, pick one (lowest rank first, random, round robin, however you like), send to it, and clear its "ready" flag.

Do you just want to send a message it to a random rank?
MPI_Comm_size(MPI_COMM_WORLD, &size);
sendto = rand() % size;
MPI_Send(buffer, count, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);

The short answer is: You cannot do this in MPI.
The slightly longer answer is: You probably do not want to do this. I am guessing that you are trying to set up some sort of work-stealing. As suszterpatt suggested, you could use one-sided communication to 'grab' the work from the sending process, but you will need to use locks, and this will not scale well to many processes unless there is some idea of a local process group (i.e., you cannot have 1,000 processes all work-stealing from one process, you will need to decompose things).

MPI buffered send/receive order

I'm using MPI (with fortran but the question is more specific to the MPI standard than any given language), and specifically using the buffered send/receive functions isend and irecv. Now if we imagine the following scenario:
Process 0:
isend(stuff1, ...)
isend(stuff2, ...)
Process 1:
wait 10 seconds
irecv(in1, ...)
irecv(in2, ...)
Are the messages delivered to Process 1 in the order they were sent, i.e. can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?

Yes, the messages are received in the order they are sent. This is described by the standard as non-overtaking messages. See this MPI Standard section for more details, here's an excerpt:
Order Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)

Yes and no.
can I be sure that in1 == stuff1 and
in2 == stuff2 if the tag used is the
same in all cases?
Yes. There is a deterministic 1:1 correlation between send's and recv's that will get the correct input into the correct recv buffer. This behavior is guaranteed by the standard, and is enforced by all MPI implementations.
No. The exact order of internal message progression and the exact order in which buffers on the receiver side are populated is somewhat of a black box....especially when RDMA style message transfers with multiple in-flight buffers are being used (e.g. InfiniBand).
If your code is using multiple threads, and inspecting the buffer to determine completeness (e.g. waiting on a bit to be toggled) rather than using MPI_Test or MPI_Wait, then it is possible that the messages can arrive out of order (but in the correct buffer).
If your code is dependent on the in1 = stuff1 being populated BEFORE in2 = stuff2 is populated on the receiver side, and there is a single sending rank for both messages, then using MPI_Issend (non-blocking, synchronous send) will guarantee the messages are recv'd in order. If you need to guarantee the buffer population order of multiple recv's from multiple sending ranks, then some kind of blocking call is required between each revc (e.g. MPI_Recv, MPI_Barrier, MPI_Wait, etc).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio