MPI buffered send/receive order - parallel-processing

I'm using MPI (with Fortran, but the question is more specific to the MPI standard than to any given language), and specifically the non-blocking send/receive functions isend and irecv (MPI_Isend and MPI_Irecv). Now imagine the following scenario:
Process 0:
isend(stuff1, ...)
isend(stuff2, ...)
Process 1:
wait 10 seconds
irecv(in1, ...)
irecv(in2, ...)
Are the messages delivered to Process 1 in the order they were sent, i.e. can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?

Yes, the messages are received in the order they are sent. This is described by the standard as non-overtaking messages. See this MPI Standard section for more details, here's an excerpt:
Order Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)
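The non-overtaking rule quoted above can be illustrated with a toy model in plain C. This is not MPI code: the names Channel, model_send, model_recv and ANY_TAG are hypothetical, and the model assumes a single sender and a single receiver. It only shows the matching rule: a receive takes the earliest pending message whose tag matches.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT real MPI: one sender, one receiver. */
#define MAX_PENDING 16
#define ANY_TAG (-1) /* hypothetical wildcard, plays the role of MPI_ANY_TAG */

typedef struct {
    int tag;
    int payload;
} Message;

typedef struct {
    Message pending[MAX_PENDING]; /* kept in the order the sends were issued */
    size_t count;
} Channel;

/* Sender side: append the message behind everything sent earlier. */
static void model_send(Channel *ch, int tag, int payload)
{
    assert(ch->count < MAX_PENDING);
    ch->pending[ch->count].tag = tag;
    ch->pending[ch->count].payload = payload;
    ch->count++;
}

/* Receiver side: take the EARLIEST pending message whose tag matches.
 * Returns 1 and stores the payload on success, 0 if nothing matches yet. */
static int model_recv(Channel *ch, int tag, int *payload)
{
    for (size_t i = 0; i < ch->count; i++) {
        if (tag == ANY_TAG || ch->pending[i].tag == tag) {
            *payload = ch->pending[i].payload;
            /* Close the gap while preserving the order of the rest. */
            for (size_t j = i + 1; j < ch->count; j++)
                ch->pending[j - 1] = ch->pending[j];
            ch->count--;
            return 1;
        }
    }
    return 0;
}
```

With two sends on the same tag followed by two receives on that tag, the first receive always gets the first payload. Messages with different tags can still be picked up in a different order than they were sent, which is why the guarantee in the question depends on the tag being the same.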

Yes and no.
can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?
Yes. There is a deterministic 1:1 correspondence between sends and receives that gets the correct input into the correct receive buffer. This behavior is guaranteed by the standard, and is enforced by all MPI implementations.
No. The exact order of internal message progression, and the exact order in which buffers on the receiver side are populated, is somewhat of a black box, especially when RDMA-style message transfers with multiple in-flight buffers are being used (e.g. InfiniBand).
If your code is using multiple threads and inspecting the buffer to determine completeness (e.g. waiting on a bit to be toggled) rather than using MPI_Test or MPI_Wait, then the messages may arrive out of order (but in the correct buffers).
If your code depends on in1 = stuff1 being populated BEFORE in2 = stuff2 is populated on the receiver side, and there is a single sending rank for both messages, then using MPI_Issend (non-blocking, synchronous send) will guarantee the messages are received in order. If you need to guarantee the buffer population order of multiple receives from multiple sending ranks, then some kind of blocking call is required between each recv (e.g. MPI_Recv, MPI_Barrier, MPI_Wait, etc.).

Related

Create channels with extra flags in an idiomatic way

TL;DR I want to have the functionality where a channel has two extra fields that tell the producer whether it is allowed to send to the channel and if so tell the producer what value the consumer expects. Although I know how to do it with shared memory, I believe that this approach goes against Go's ideology of "Do not communicate by sharing memory; instead, share memory by communicating."
Context:
I wish to have a server S that runs (besides others) three goroutines:
Listener that just receives UDP packets and sends them to the demultiplexer.
Demultiplexer that takes network packets and based on some data sends it into one of several channels
Processing task which listens to one specific channel and processes data received on that channel.
To check whether some devices on the network are still alive, the processing task will periodically send out nonces over the network and then wait for k seconds. In those k seconds, other participants of my protocol that received the nonce will send a reply containing (besides other information) the nonce. The demultiplexer will receive the packets from the listener, parse them and send them to the processing_channel. After the k seconds elapsed, the processing task processes the messages pushed onto the processing_channel by the demultiplexer.
I want the demultiplexer to not just blindly send any response (of the correct type) it received onto the processing_channel, but to instead check whether the processing task is currently even expecting any messages and, if so, which nonce value it expects. I made this design decision in order to drop unwanted packets as soon as possible.
My approach:
In other languages, I would have a class with the following fields (in pseudocode):
class ActivatedChannel{
boolean flag_expecting_nonce;
int expected_nonce;
LinkedList chan;
}
The demultiplexer would then upon receiving a packet of the correct type simply acquire the lock for the ActivatedChannel processing_channel object, check whether the flag is set and the nonce matches, and if so add the message to the LinkedList chan!
Problem:
This approach makes use of locks and shared memory, which does not align with Golang's "Do not communicate by sharing memory; instead, share memory by communicating" mantra. Hence, I would like to know... :
... whether my approach is "bad" regarding Go in the sense that it relies on shared memory.
... how to achieve the outlined result in a more Go-like way.
Yes, the approach you describe doesn't align with Go's idiomatic way of doing things. As you rightly point out, that approach communicates by sharing memory.
To achieve this in an idiomatic Go way, one approach is for your Demultiplexer to "remember" all the processing_channels that are expecting a nonce, together with the nonce each one expects. Whenever a processing_channel is ready to receive a reply, it sends a signal to the Demultiplexer saying that it is expecting a reply.
Since the Demultiplexer is at the center of all the communication, it can maintain a mapping between a processing_channel and the corresponding nonce it expects. It can also maintain a "registry" of all the processing_channels which are expecting a reply.
In this approach, we are sharing memory by communicating.
For communicating that a processing_channel is expecting a reply, the following struct can be used:
type ChannelState struct {
ChannelId string // unique identifier for processing channel
IsExpectingNonce bool
ExpectedNonce int
}
In this approach, there is no lock used.

MPI, If using non-blocking i_send or i_recv, does it matter which one goes first if there is a wait at the end?

If I call MPI_Isend and then MPI_Irecv, or MPI_Irecv first and then MPI_Isend, with a wait at the end, does the order matter?
if i MPI_isend then MPI_irecv or do a irecv first then a isend, with a
wait at the end? does it matter which order?
MPI_Irecv and MPI_Isend are nonblocking communication routines, therefore one needs to use the MPI_Wait (or use MPI_Test to test for the completion of the request) to ensure that the message is completed, and that the data in the send/receive buffer can be again safely manipulated.
The nonblocking here means one does not wait for the data to be read and sent; rather the data is immediately made available to be read and sent. That does not imply, however, that the data is immediately sent. If it were, there would not be a need to call MPI_Wait.
Quoting #hristo-away-iliev on source
You must always wait on or test nonblocking operations if you'd like
your programs to be standard compliant and hence portable. The
standard allows the implementation to postpone the actual data
transmission until the wait/test call. Some MPI operations (other than
wait/test) progress non-blocking operations but one should not rely on
this behaviour.
and
MPI_Isend does not necessarily progress in the background but only
when the MPI implementation is given the chance to progress it.
MPI_Wait progresses the operation and guarantees its completion. Some
MPI implementations can progress operations in the background using
progression threads. Some cannot. It is implementation-dependent and
one should never rely on one or another specific behaviour.

MPI_SSEND : How does it guarantee reuse of the sender buffer?

I understand that MPI_Bsend will save the sender's buffer in local buffer managed by MPI library, hence it's safe to reuse the sender's buffer for other purposes.
What I do not understand is how MPI_Ssend guarantee this?
Send in synchronous mode.
A send that uses the synchronous mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive
As per above, MPI_Ssend will return (ie allow further program execution) if matching receive has been posted and it has started to receive the message sent by the synchronous send. Consider the following case:
I send a huge data array of int, say data[1 million], via MPI_Ssend. Another process starts receiving it (but might not have done so completely), which allows MPI_Ssend to return and execute the next program statement. The next statement changes the very end of the buffer: data[1 million] = /* new value */. Then the MPI_Ssend finally reaches the end of the buffer and sends this new value, which is not what I wanted.
What am I missing in this picture?
TIA
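MPI_Ssend() is both blocking and synchronous. Because it is blocking, it does not return until the send buffer can safely be reused, so the scenario you describe cannot happen; the synchronous part additionally guarantees that the matching receive has started.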
From the MPI 3.1 standard (chapter 3.4, page 37)
Thus, the completion of a synchronous send not only indicates
that the send buffer can be reused, but it also indicates that the
receiver has reached a certain point in its execution, namely that it
has started executing the matching receive.

What data structure does Erlang use in its inboxes?

Erlang uses message passing to communicate between processes. How does it handle concurrent incoming messages? What data structure is used?
The process inbox is made of two lists.
The main one is a FIFO where all incoming messages are stored, waiting for the process to examine them in the exact order they were received. The second one is a stack that stores the messages that don't match any clause in a given receive statement.
When the process executes a receive statement, it tries to pattern match the first message against each clause of the receive, in the order they are declared, until the first match occurs.
If no match is found, the message is removed from the FIFO and pushed onto the second list; the process then moves on to the next message. (Note that process execution may be suspended in the meantime, either because the FIFO is empty or because the process has reached its "reduction quota".)
If a match is found, the message is removed from the FIFO, and the stacked messages are restored to the FIFO in their original order.
Note that pattern matching includes copying anything of interest into the process variables. For example, if {request,write,Value,_} -> ... succeeds, then the examined message is a 4-element tuple whose first and second elements are the atoms request and write, and whose third element was successfully matched against the variable Value: Value is bound to this element if it was previously unbound, or Value already equals the element. The fourth element is discarded. After this operation completes, there is no means to retrieve the original message.
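The selective-receive procedure described above can be sketched as a toy model in plain C. This is not how the Erlang runtime is implemented: Mailbox, mailbox_deliver, mailbox_receive and the example patterns are hypothetical names, messages are plain ints, and a receive that finds no match simply returns 0 with the mailbox unchanged instead of suspending the process.

```c
#include <assert.h>
#include <stddef.h>

#define MAILBOX_CAP 32

typedef struct {
    int fifo[MAILBOX_CAP];  /* incoming messages, oldest first */
    size_t fifo_len;
    int saved[MAILBOX_CAP]; /* messages examined but not matched */
    size_t saved_len;
} Mailbox;

/* A "receive clause": returns nonzero when the message matches. */
typedef int (*Pattern)(int msg);

static int is_even(int msg) { return msg % 2 == 0; }
static int is_negative(int msg) { return msg < 0; }

/* A new message always goes to the back of the FIFO. */
static void mailbox_deliver(Mailbox *mb, int msg)
{
    assert(mb->fifo_len < MAILBOX_CAP);
    mb->fifo[mb->fifo_len++] = msg;
}

/* Examine messages oldest first; set aside the ones that do not match.
 * On a match, store it in *out and return 1. In both cases the set-aside
 * messages go back in front of the unexamined ones, in original order. */
static int mailbox_receive(Mailbox *mb, Pattern matches, int *out)
{
    size_t head = 0;
    int found = 0;

    mb->saved_len = 0;
    while (head < mb->fifo_len) {
        int msg = mb->fifo[head++];
        if (matches(msg)) {
            *out = msg;
            found = 1;
            break;
        }
        mb->saved[mb->saved_len++] = msg;
    }

    int rebuilt[MAILBOX_CAP];
    size_t n = 0;
    for (size_t i = 0; i < mb->saved_len; i++)
        rebuilt[n++] = mb->saved[i];
    for (size_t i = head; i < mb->fifo_len; i++)
        rebuilt[n++] = mb->fifo[i];
    for (size_t i = 0; i < n; i++)
        mb->fifo[i] = rebuilt[i];
    mb->fifo_len = n;
    mb->saved_len = 0;
    return found;
}
```

For example, after delivering 1, 3, 4, 5 and receiving with is_even, the call yields 4 and the mailbox holds 1, 3, 5 in their original order, ready for the next receive.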
You may get some info out of checking out the erl_message primitive, erl_message.c, and its declaration file, erl_message.h.
You may also find these threads helpful (one, two), although I think your question is more about the data structures in play.
ERTS Structures
The erlang runtime system (erts) allocates a fragmented (linked) heap for the scheduling of message passing (see source). The ErlHeapFragment structure can be found here.
However, each process also has a pretty simple fifo queue structure to which they copy messages from the heap in order to consume them. Underlying the queue is a linked list, and there are mechanisms to bypass the heap and use the process queue directly. See here for more info on that guy.
Finally each process also has a stack (also implemented as a list) where messages that don't have a matching pattern in receive are placed. This acts as a way to store messages that might be important, but that the process has no way of handling (matching) until another, different message is received. This is part of how erlang has such powerful "hot-swapping" mechanisms.
Concurrent Message Passing Semantics
At a high level, the erts receives a message and places it in the heap (unless explicitly told not to), and each process is responsible for selecting messages to copy into its own process queue. From what I have read, the messages currently in the queue will be processed before copying from the heap again, but there is likely more nuance.

MPI: blocking vs non-blocking

I am having trouble understanding the concept of blocking communication and non-blocking communication in MPI. What are the differences between the two? What are the advantages and disadvantages?
Blocking communication is done using MPI_Send() and MPI_Recv(). These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the buffer passed to MPI_Send() can be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv() returns when the receive buffer has been filled with valid data.
In contrast, non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These functions return immediately (i.e., they do not block) even if the communication is not finished yet. You must call MPI_Wait() or MPI_Test() to see whether the communication has finished.
Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary, for example, you may call MPI_Isend(), do some computations, then do MPI_Wait(). This allows computations and communication to overlap, which generally leads to improved performance.
Note that collective communication (e.g., all-reduce) is only available in its blocking version up to MPI-2. MPI-3 introduced non-blocking collective communication.
A quick overview of MPI's send modes can be seen here.
Although this post is a bit old, I contest the accepted answer. The statement "These functions do not return until the communication is finished" is a little misleading, because blocking communication doesn't guarantee any handshake between the send and receive operations.
First, one needs to know that send has four modes of communication: Standard, Buffered, Synchronous and Ready, and each of these can be blocking or non-blocking.
Unlike send, receive has only one mode, which can be blocking or non-blocking.
Before proceeding further, I should make clear which buffer I mean in each case: the MPI_Send/MPI_Recv buffer supplied by the application, or the system buffer (a local buffer on each process, owned by the MPI library and used to move data among the ranks of a communication group).
BLOCKING COMMUNICATION :
Blocking doesn't mean that the message was delivered to the receiver/destination. It simply means that the (send or receive) buffer is available for reuse. To make the send buffer reusable, it is sufficient to copy its contents to another memory area; for example, the library can copy the buffer data to its own memory and then MPI_Send can return.
The MPI standard deliberately decouples message buffering from the send and receive operations. A blocking send can complete as soon as the message has been buffered, even though no matching receive has been posted. But in some cases message buffering can be expensive, and copying directly from the send buffer to the receive buffer may be more efficient. Hence the MPI standard provides four different send modes to give the user some freedom in selecting the appropriate send mode for her application. Let's take a look at what happens in each mode of communication:
1. Standard Mode
In the standard mode, it is up to the MPI Library, whether or not to buffer the outgoing message. In the case where the library decides to buffer the outgoing message, the send can complete even before the matching receive has been invoked. In the case where the library decides not to buffer (for performance reasons, or due to unavailability of buffer space), the send will not return until a matching receive has been posted and the data in send buffer has been moved to the receive buffer.
Thus MPI_Send in standard mode is non-local in the sense that send in standard mode can be started whether or not a matching receive has been posted and its successful completion may depend on the occurrence of a matching receive ( due to the fact it is implementation dependent if the message will be buffered or not) .
The syntax for standard send is below :
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
2. Buffered Mode
As in standard mode, a send in buffered mode can be started whether or not a matching receive has been posted, and the send may complete before a matching receive has been posted. The main difference is that if the send is started and no matching receive is posted, the outgoing message must be buffered. Note that if the matching receive is posted, the buffered send can happily rendezvous with the process that started the receive; but if there is no receive, the send in buffered mode has to buffer the outgoing message to allow the send to complete. In its entirety, a buffered send is local. Buffer allocation in this case is user defined (the user attaches the buffer with MPI_Buffer_attach), and in the event of insufficient buffer space, an error occurs.
Syntax for buffer send :
int MPI_Bsend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
3. Synchronous Mode
In synchronous send mode, a send can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive was posted and the receiver has started to receive the message sent by the synchronous send. The completion of a synchronous send indicates not only that the send buffer can be reused, but also that the receiving process has started to receive the data. If both send and receive are blocking, then the communication does not complete at either end before the communicating processes rendezvous.
Syntax for synchronous send :
int MPI_Ssend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
4. Ready Mode
Unlike the previous three modes, a send in ready mode can be started only if the matching receive has already been posted. Completion of the send doesn't indicate anything about the matching receive and merely tells you that the send buffer can be reused. A send that uses ready mode has the same semantics as a standard-mode or synchronous-mode send, with the additional information that a matching receive has been posted. In a correct program, a ready-mode send can be replaced by a synchronous send or a standard send with no effect on the outcome apart from a performance difference.
Syntax for ready send :
int MPI_Rsend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
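The completion rules of the four modes described above can be summarised in a small plain-C model. This is only a sketch, not MPI code: SendMode, SendOutcome and blocking_send_outcome are hypothetical names, "can_buffer" lumps together the library's decision to buffer (standard mode) and the room left in the user-attached buffer (buffered mode), and the buffered case takes the conservative reading that buffer space is always required. A ready send without a posted receive is erroneous in MPI; the model reports that as an error for simplicity.

```c
#include <assert.h>

typedef enum {
    MODE_STANDARD,
    MODE_BUFFERED,
    MODE_SYNCHRONOUS,
    MODE_READY
} SendMode;

typedef enum {
    SEND_COMPLETES, /* the call can return now                     */
    SEND_BLOCKS,    /* the call waits until a receive shows up     */
    SEND_ERROR      /* error or erroneous program (model shortcut) */
} SendOutcome;

/* recv_posted: a matching receive has already been posted and started.
 * can_buffer:  the message can be copied into a buffer right now.     */
static SendOutcome blocking_send_outcome(SendMode mode, int recv_posted,
                                         int can_buffer)
{
    switch (mode) {
    case MODE_STANDARD:
        /* Implementation's choice: buffer it, or wait for the receiver. */
        return (recv_posted || can_buffer) ? SEND_COMPLETES : SEND_BLOCKS;
    case MODE_BUFFERED:
        /* Local operation: never waits, but needs buffer space. */
        return can_buffer ? SEND_COMPLETES : SEND_ERROR;
    case MODE_SYNCHRONOUS:
        /* Buffering does not help: the receive must have started. */
        return recv_posted ? SEND_COMPLETES : SEND_BLOCKS;
    case MODE_READY:
        /* Only legal when the receive is already posted. */
        return recv_posted ? SEND_COMPLETES : SEND_ERROR;
    }
    assert(0 && "unknown send mode");
    return SEND_ERROR;
}
```

Calling blocking_send_outcome(MODE_STANDARD, 0, 1) and blocking_send_outcome(MODE_SYNCHRONOUS, 0, 1) shows the key difference: with no receive posted but buffering available, the standard send can return while the synchronous send still waits.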
Having gone through all four blocking sends, they might seem different in principle, but depending on the implementation the semantics of one mode may be similar to another.
For example, MPI_Send is in general a blocking call, but depending on the implementation, if the message is not too big, MPI_Send will copy the outgoing message from the send buffer to a system buffer (which is mostly the case on modern systems) and return immediately. Let's look at an example below:
// Assume there are 4 processes, ranked 0 to 3.
// The send_buff*/recv_buff* variables are assumed to be scalars of the matching type.
if (rank == 0) {
    tag = 2;
    MPI_Send(&send_buff1, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Send(&send_buff2, 1, MPI_DOUBLE, 2, tag, MPI_COMM_WORLD);
    MPI_Recv(&recv_buff1, 1, MPI_FLOAT, 3, 5, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&recv_buff2, 1, MPI_INT, 1, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
else if (rank == 1) {
    tag = 10;
    // Receive statement deliberately missing: nothing is received from rank 0.
    MPI_Send(&send_buff3, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
    MPI_Send(&send_buff3, 1, MPI_INT, 3, tag, MPI_COMM_WORLD);
}
else if (rank == 2) {
    MPI_Recv(&recv_buff, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do something with the receive buffer
}
else { // rank == 3
    MPI_Send(&send_buff, 1, MPI_FLOAT, 0, 5, MPI_COMM_WORLD);
    MPI_Recv(&recv_buff, 1, MPI_INT, 1, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
Let's look at what is happening at each rank in the above example.
Rank 0 is trying to send to rank 1 and rank 2, and receive from rank 1 and rank 3.
Rank 1 is trying to send to rank 0 and rank 3 and not receive anything from any other ranks
Rank 2 is trying to receive from rank 0 and later do some operation with the data received in the recv_buff.
Rank 3 is trying to send to rank 0 and receive from rank 1
Where beginners get confused is that rank 0 is sending to rank 1 but rank 1 hasn't started any receive operation, so they expect the communication to block or stall and the second send statement in rank 0 never to be executed. (This is what the MPI documentation stresses: it is implementation defined whether or not the outgoing message will be buffered.) On most modern systems, messages of such a small size (here the count is 1) will easily be buffered, so MPI_Send will return and rank 0 will execute its next MPI_Send statement. Hence, in the above example, even if the receive in rank 1 is never started, the first MPI_Send in rank 0 will return and rank 0 will execute its next statement.
In a hypothetical situation where rank 3 starts execution before rank 0, it will copy the outgoing message in the first send statement from the send buffer to a system buffer (in a modern system ;) ) and then start executing its receive statement. As soon as rank 0 finishes its two send statements and begins executing its receive statement, the data buffered in system by rank 3 is copied in the receive buffer in rank 0.
If a receive operation is started in a process and no matching send has been posted, the process will block until the receive buffer is filled with the data it is expecting. In this situation, any computation or other MPI communication in that process is halted until MPI_Recv has returned.
Having understood the buffering phenomenon, one should return and think more about MPI_Ssend, which has the true semantics of a blocking communication. Even if MPI_Ssend copies the outgoing message from the send buffer to a system buffer (which again is implementation defined), note that MPI_Ssend will not return until some acknowledgement (in a low-level format) from the receiving process has been received by the sending process.
Fortunately, MPI decided to keep things easier for users on the receive side: there is only one receive in blocking communication, MPI_Recv, and it can be used with any of the four send modes described above. For MPI_Recv, blocking means that the receive returns only after its buffer contains the data. This implies that a receive can complete only after a matching send has started, but it doesn't imply whether or not it can complete before the matching send completes.
What happens during such blocking calls is that the computations are halted until the blocked buffer is freed. This usually leads to wastage of computational resources as Send/Recv is usually copying data from one memory location to another memory location, while the registers in cpu remain idle.
NON-BLOCKING COMMUNICATION :
For non-blocking communication, the application creates a request for a send and/or receive, gets back a handle, and the call returns. That's all that is needed to guarantee that the operation will be executed, i.e. the MPI library is notified that the operation has to be executed.
For the sender side, this allows overlapping computation with communication.
For the receiver side, this allows overlapping a part of the communication overhead , i.e copying the message directly into the address space of the receiving side in the application.
When using blocking communication you must be careful about the order of send and receive calls. For example, look at this code:
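// x and y are assumed to be double arrays of length n; tag 0 is arbitrary.
if (rank == 0)
{
    MPI_Send(x, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);                     // x to process 1
    MPI_Recv(y, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // y from process 1
}
if (rank == 1)
{
    MPI_Send(y, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);                     // y to process 0
    MPI_Recv(x, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // x from process 0
}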
What happens in this case?
Process 0 sends x to process 1 and, if the MPI library does not buffer the message, blocks until process 1 receives x.
Process 1 sends y to process 0 and likewise blocks until process 0 receives y, but
process 0 is itself blocked in its send, so both processes wait forever until they are killed.
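The deadlock described above can be reproduced with a tiny plain-C model of two processes that talk only to each other. This is not MPI code: Op and runs_to_completion are hypothetical names, and the model assumes unbuffered blocking operations (the behaviour of MPI_Ssend, or of MPI_Send when the library does not buffer), so a send can finish only when the other side is sitting in a receive.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_SEND, OP_RECV } Op;

/* Each process executes its list of operations in order.
 * Returns 1 if both programs run to completion, 0 if they deadlock. */
static int runs_to_completion(const Op *p0, size_t n0,
                              const Op *p1, size_t n1)
{
    size_t pc0 = 0, pc1 = 0;

    assert(p0 != NULL && p1 != NULL);
    while (pc0 < n0 && pc1 < n1) {
        if (p0[pc0] == p1[pc1]) {
            /* Both are sending, or both are receiving: neither call
             * can ever return, so the pair is stuck. */
            return 0;
        }
        /* A send met its matching receive: both calls return. */
        pc0++;
        pc1++;
    }
    /* If one side still has operations left, it waits forever. */
    return pc0 == n0 && pc1 == n1;
}
```

With both processes running {OP_SEND, OP_RECV} the function returns 0, which is the situation in the code above. Swapping the order on one side, {OP_SEND, OP_RECV} against {OP_RECV, OP_SEND}, returns 1; that reordering is one common way to avoid the deadlock with blocking calls.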
It is easy.
Non-blocking means computation and data transfer can happen at the same time within a single process.
Blocking means: hey buddy, you have to make sure the data transfer has finished before you move on to the next command. So if a transfer is followed by a computation, the computation can start only after the transfer has succeeded.
Both the accepted answer and the other very long one mention overlap of computation and communication as an advantage. That is 1. not the main motivation, and 2. very hard to attain. The main advantage (and the original motivation) of non-blocking communication is that you can express complicated communication patterns without getting deadlock and without processes serializing themselves unnecessarily.
Examples: Deadlock: everyone does a receive, then everyone does a send, for instance along a ring. This will hang.
Serialization: along a linear ordering, everyone except the last does a send to the right, then everyone except the first does a receive from the left. This will have all processes executing sequentially rather than in parallel.
