MPI_Ssend: How does it guarantee reuse of the sender buffer?

I understand that MPI_Bsend copies the sender's data into a local buffer managed by the MPI library, so it is safe to reuse the sender's buffer for other purposes.
What I do not understand is how MPI_Ssend guarantees this.
Send in synchronous mode.
A send that uses the synchronous mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive
As per the above, MPI_Ssend will return (i.e., allow further program execution) once a matching receive has been posted and has started to receive the message sent by the synchronous send. Consider the following case:
I send a huge array of int, say data[1 million], via MPI_Ssend. Another process starts receiving it (but might not have finished), which allows MPI_Ssend to return and the next program statement to execute. That statement changes the very end of the buffer: data[1 million - 1] = /* new value */. The MPI_Ssend then finally reaches the end of the buffer and sends this new value, which is not what I wanted.
What am I missing in this picture?
TIA

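MPI_Ssend() is both blocking and synchronous. Because it is blocking, it does not return until the send buffer is safe to reuse, that is, until the library no longer needs to read from it. Because it is synchronous, it additionally does not return until the matching receive has started.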
From the MPI 3.1 standard (chapter 3.4, page 37)
Thus, the completion of a synchronous send not only indicates
that the send buffer can be reused, but it also indicates that the
receiver has reached a certain point in its execution, namely that it
has started executing the matching receive.

Related

Is it guaranteed that the FD_CLOSE event is only posted when there is no data buffered in the socket?

We are using WSAEventSelect to associate a socket with an event. From MSDN:
The FD_CLOSE network event is recorded when a close indication is
received for the virtual circuit corresponding to the socket. In TCP
terms, this means that the FD_CLOSE is recorded when the connection
goes into the TIME WAIT or CLOSE WAIT states. This results from the
remote end performing a shutdown on the send side or a closesocket.
FD_CLOSE being posted after all data is read from a socket. An
application should check for remaining data upon receipt of FD_CLOSE
to avoid any possibility of losing data. For more information, see the
section on Graceful Shutdown, Linger Options, and Socket Closure and
the shutdown function.
It seems the first highlighted sentence means that FD_CLOSE will only be posted after all data has been read from the socket. But the second sentence requires the application to check whether there is still data in the socket when it receives FD_CLOSE.
Don't these conflict? How should I understand this?
Unfortunately there is a lot of speculation and very little official word. My understanding is the following:
FD_CLOSE being posted after all data is read from a socket.
Edit: My original response here appears to be false. I believe this statement to be referring to a specific type of socket closure, but there doesn't seem to be agreement on exactly what. It is expected that this should be true, but experience shows that it will not always be.
An application should check for remaining data upon receipt of FD_CLOSE to avoid any possibility of losing data.
There may be data still available at the point your application code receives the FD_CLOSE event. In fact, reading around indicates that new data may become available at the socket after you have received the FD_CLOSE. You should check for this data in order to avoid losing it. I've seen some people implement recv loops until the recv call fails (indicating the socket is actually closed) or even restart the event loop waiting for more FD_READs. I think in the general case you can simply attempt a recv with a sufficiently large buffer and assume nothing more will arrive.
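The "recv until it fails" approach mentioned above might be sketched roughly as follows. To keep the sketch independent of Winsock, the receive call is passed in as a callable; `drain_socket` and `recv_fn` are hypothetical names, and in a real application `recv_fn` is assumed to wrap `recv()` on the socket that signalled FD_CLOSE.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: keep calling recv_fn until it reports that no more
// data is coming, collecting every byte that was still readable after
// FD_CLOSE was received.
// recv_fn(buf, len) is assumed to behave like recv(): it returns the number
// of bytes written into buf, 0 on an orderly close, or a negative value on
// failure.
template <typename RecvFn>
std::string drain_socket(RecvFn recv_fn, std::size_t chunk_size = 4096) {
    assert(chunk_size > 0);
    std::string received;
    std::string chunk(chunk_size, '\0');
    for (;;) {
        int n = recv_fn(&chunk[0], static_cast<int>(chunk.size()));
        if (n <= 0) break;  // closed or failed: stop reading
        received.append(chunk, 0, static_cast<std::size_t>(n));
    }
    return received;
}
```

Note that on a non-blocking socket a failing `recv` may only mean "nothing available right now", so this loop treats any failure as the end of the data, which is the simplification the answer describes.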

C++ IRC Client design

I'm attempting to write an RFC 2812 compliant C++ IRC library.
I am having some trouble with the design of the client itself.
From what I have read IRC communication tends to be asynchronous.
I am using boost::asio::async_read and boost::asio::async_write.
From reading the documentation I have gathered that you cannot perform multiple async_write requests before one is completed. You therefore end up with rather nested callbacks. Doesn't this defeat the purpose of doing async calls? Wouldn't it just be better to use synchronous calls to prevent the nesting? If not, why?
Secondly, if I am not mistaken, each boost::asio::async_write should be followed up by a boost::asio::async_read to receive the server's response to the commands sent. My client's functions, therefore, would need to take a callback parameter so a user of the class may do something after the client receives a response (ex. send another message...).
If I were to continue implementing this with async, should I keep a std::deque<std::tuple<message, callback>> and each time a boost::asio::async_write is finished, and there is a tuple in the queue, dequeue and send the message then raise the callback? Would this be the optimal way to implement this system?
I'm thinking since messages are sent all the time I'm going to have to implement some kind of listener loop that queues up responses, but how would you associate these responses with the specific command that triggered them? Or in the case that the response is just a message to the channel from another user?
The IRC protocol is a full-duplex protocol. As such, one should always be listening to the server connection expecting commands to process. It could be argued that one should primarily use the messages received from the server to update state, rather than correlating requests and responses, as the server may not respond to a command or may respond much later than expected. For example, one may issue a WHOIS command, but receive multiple PRIVMSG commands before receiving a response to WHOIS. For a chat client, a user would likely expect to be able to receive chat messages while waiting for a response to WHOIS. Hence, having an async_write() to async_read() call chain may not be ideal for handling the protocol.
For a given socket, the Asio documentation does recommend not initiating additional read operations if there is an outstanding composed read operation and not initiating additional write operations if there is an outstanding composed write operation. Queuing up messages and having an asynchronous call chain process the queue is a great way to fulfill this recommendation. Consider reading this answer for a nice solution using a queue and an asynchronous call chain.
Also, be aware that the server may send a PING command even on an active connection. When the client is responding with a PONG command, it may be necessary to insert the PONG command near the front of the outbound queue so that it gets sent out as soon as possible.
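A minimal sketch of such an outbound queue, with one write in flight at a time and a fast lane for PONG, might look like the following. It deliberately avoids Boost so that it stays self-contained: `OutboundQueue`, `send_urgent` and the `start_write` callback are hypothetical names, and `start_write` is assumed to be where a real client would call boost::asio::async_write and later invoke `on_write_complete()` from the completion handler.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical outbound queue: at most one write is in flight at a time,
// and the completion of one write starts the next (the "call chain").
class OutboundQueue {
public:
    // start_write is assumed to begin one asynchronous write of the given
    // line and to arrange for on_write_complete() to be called afterwards,
    // not from inside start_write itself.
    explicit OutboundQueue(std::function<void(const std::string&)> start_write)
        : start_write_(std::move(start_write)) {}

    // Ordinary messages go to the back of the queue.
    void send(std::string line) {
        waiting_.push_back(std::move(line));
        maybe_start();
    }

    // Urgent messages (such as PONG) jump ahead of everything still waiting,
    // but cannot overtake the write that is already in flight.
    void send_urgent(std::string line) {
        waiting_.push_front(std::move(line));
        maybe_start();
    }

    // Call this from the write completion handler.
    void on_write_complete() {
        assert(writing_);
        writing_ = false;
        maybe_start();
    }

    std::size_t pending() const { return waiting_.size(); }

private:
    void maybe_start() {
        if (writing_ || waiting_.empty()) return;
        // Move the line out of the deque so the buffer handed to the write
        // stays valid even if the deque is modified while it is in flight.
        in_flight_ = std::move(waiting_.front());
        waiting_.pop_front();
        writing_ = true;
        start_write_(in_flight_);
    }

    std::function<void(const std::string&)> start_write_;
    std::deque<std::string> waiting_;
    std::string in_flight_;
    bool writing_ = false;
};
```

With this shape, callers never nest callbacks themselves: they call `send()` or `send_urgent()` at any time, and the queue decides when the next write actually starts.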
Doesn't this defeat the purpose of doing async calls?
The usual solution is to use strands:
Why do I need strand per connection when using boost::asio?
You are free to queue multiple asynchronous operations on the same io objects using an (implicit) strand¹.
Using a strand ensures that the completion handlers are invoked on that same logical thread.
On the Protocol
You could indeed keep a queue of commands and await responses for each command before sending the next.
You might be a little bit smarter about this if you can spot the correlation from the different types of reply, but then you'd need to keep a queue per type of command. I'd consider that premature optimization.

MPI: blocking vs non-blocking

I am having trouble understanding the concept of blocking communication and non-blocking communication in MPI. What are the differences between the two? What are the advantages and disadvantages?
Blocking communication is done using MPI_Send() and MPI_Recv(). These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the buffer passed to MPI_Send() can be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv() returns when the receive buffer has been filled with valid data.
In contrast, non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These functions return immediately (i.e., they do not block) even if the communication is not finished yet. You must call MPI_Wait() or MPI_Test() to see whether the communication has finished.
Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary. For example, you may call MPI_Isend(), do some computations, then call MPI_Wait(). This allows computation and communication to overlap, which generally leads to improved performance.
Note that collective communication (e.g., all-reduce) is only available in its blocking version up to MPI-2. MPI-3 introduced non-blocking collective communication (e.g., MPI_Iallreduce()).
A quick overview of MPI's send modes can be seen here.
Although this post is a bit old, I dispute the accepted answer. The statement "These functions don't return until the communication is finished" is a little misleading, because blocking communication doesn't guarantee any handshake between the send and receive operations.
First, one needs to know that send has four modes of communication: Standard, Buffered, Synchronous and Ready, and each of these can be blocking or non-blocking.
Unlike send, receive has only one mode, which can be blocking or non-blocking.
Before proceeding further, note that I distinguish explicitly between the MPI_Send/MPI_Recv buffer (the buffer passed in by the application) and the system buffer (a local buffer in each process, owned by the MPI library and used to move data among the ranks of a communication group).
BLOCKING COMMUNICATION :
Blocking doesn't mean that the message was delivered to the receiver/destination. It simply means that the (send or receive) buffer is available for reuse. To make the buffer reusable, it is sufficient to copy the information to another memory area, i.e., the library can copy the buffer data to its own memory and then, for example, MPI_Send can return.
The MPI standard makes it very clear that message buffering is decoupled from the send and receive operations. A blocking send can complete as soon as the message has been buffered, even though no matching receive has been posted. But in some cases message buffering can be expensive, and copying directly from the send buffer to the receive buffer may be more efficient. Hence the MPI standard provides four different send modes to give the user some freedom in selecting the appropriate mode for her application. Let's take a look at what happens in each mode of communication:
1. Standard Mode
In the standard mode, it is up to the MPI Library, whether or not to buffer the outgoing message. In the case where the library decides to buffer the outgoing message, the send can complete even before the matching receive has been invoked. In the case where the library decides not to buffer (for performance reasons, or due to unavailability of buffer space), the send will not return until a matching receive has been posted and the data in send buffer has been moved to the receive buffer.
Thus MPI_Send in standard mode is non-local in the sense that send in standard mode can be started whether or not a matching receive has been posted and its successful completion may depend on the occurrence of a matching receive ( due to the fact it is implementation dependent if the message will be buffered or not) .
The syntax for standard send is below :
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
2. Buffered Mode
As in standard mode, a send in buffered mode can be started whether or not a matching receive has been posted, and the send may complete before a matching receive has been posted. The main difference is that if the send is started and no matching receive is posted, the outgoing message must be buffered. Note that if the matching receive is posted, the buffered send can happily rendezvous with the process that started the receive; but if there is no receive, the send in buffered mode has to buffer the outgoing message so that the send can complete. A buffered send is therefore entirely local. Buffer allocation in this case is the user's responsibility (via MPI_Buffer_attach), and if there is insufficient buffer space, an error occurs.
Syntax for buffer send :
int MPI_Bsend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
3. Synchronous Mode
In synchronous send mode, a send can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive was posted and the receiver has started to receive the message sent by the synchronous send. The completion of a synchronous send indicates not only that the send buffer can be reused, but also that the receiving process has started to receive the data. If both send and receive are blocking, the communication does not complete at either end until the communicating processes rendezvous.
Syntax for synchronous send :
int MPI_Ssend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
4. Ready Mode
Unlike the previous three modes, a send in ready mode can be started only if the matching receive has already been posted. Completion of the send says nothing about the matching receive; it merely indicates that the send buffer can be reused. A send that uses ready mode has the same semantics as a standard-mode or synchronous-mode send, with the additional information that a matching receive has been posted. In a correct program, a ready-mode send can be replaced by a synchronous send or a standard send with no effect on the outcome apart from a difference in performance.
Syntax for ready send :
int MPI_Rsend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
Having gone through all four blocking sends, they might seem different in principle, but depending on the implementation the semantics of one mode may be similar to another.
For example, MPI_Send is in general a blocking call, but depending on the implementation, if the message is not too big, MPI_Send will copy the outgoing message from the send buffer to a system buffer (which is mostly the case on modern systems) and return immediately. Let's look at an example:
// assume there are 4 processes, ranked 0 to 3
if (rank == 0) {
    tag = 2;
    MPI_Send(&send_buff1, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Send(&send_buff2, 1, MPI_DOUBLE, 2, tag, MPI_COMM_WORLD);
    // MPI_Recv also takes a count and a status argument
    MPI_Recv(&recv_buff1, 1, MPI_FLOAT, 3, 5, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&recv_buff2, 1, MPI_INT, 1, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
else if (rank == 1) {
    tag = 10;
    // receive statement missing, nothing received from rank 0
    MPI_Send(&send_buff3, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
    MPI_Send(&send_buff3, 1, MPI_INT, 3, tag, MPI_COMM_WORLD);
}
else if (rank == 2) {
    MPI_Recv(&recv_buff, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do something with the receive buffer
}
else { // rank == 3
    MPI_Send(send_buff, 1, MPI_FLOAT, 0, 5, MPI_COMM_WORLD);
    MPI_Recv(recv_buff, 1, MPI_INT, 1, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
Let's look at what is happening at each rank in the above example.
Rank 0 is trying to send to rank 1 and rank 2, and receive from rank 1 and rank 3.
Rank 1 is trying to send to rank 0 and rank 3, and does not receive anything from any other rank.
Rank 2 is trying to receive from rank 0 and later do some operation with the data received in recv_buff.
Rank 3 is trying to send to rank 0 and receive from rank 1.
Where beginners get confused is that rank 0 is sending to rank 1 but rank 1 hasn't started any receive operation, so the communication should block or stall and the second send statement in rank 0 should never execute. (This is what the MPI documentation means when it stresses that it is implementation-defined whether or not the outgoing message is buffered.) On most modern systems, such small messages (here the count is 1) are easily buffered, so MPI_Send returns and rank 0 goes on to its next MPI_Send statement. Hence, in the above example, even though rank 1 never starts a receive, the first MPI_Send in rank 0 returns and rank 0 executes its next statement.
In a hypothetical situation where rank 3 starts execution before rank 0, it will copy the outgoing message of its send statement from the send buffer to a system buffer (on a modern system ;) ) and then start executing its receive statement. As soon as rank 0 finishes its two send statements and begins executing its receive statement, the data buffered by rank 3 is copied into the receive buffer in rank 0.
If a receive operation is started in a process and no matching send is posted, the process will block until the receive buffer is filled with the data it is expecting. In this situation, any computation or other MPI communication in that process is halted until MPI_Recv has returned.
Having understood the buffering phenomenon, one should return and think more about MPI_Ssend, which has the true semantics of a blocking communication. Even if MPI_Ssend copies the outgoing message from the send buffer to a system buffer (which again is implementation-defined), note that MPI_Ssend will not return until some acknowledgement (at a low level) from the receiving process has been received by the sending process.
Fortunately, MPI decided to keep things easier for users on the receive side, and there is only one receive in blocking communication: MPI_Recv, which can be used with any of the four send modes described above. For MPI_Recv, blocking means that the receive returns only after its buffer contains the data. This implies that a receive can complete only after a matching send has started, but it does not imply whether or not it can complete before the matching send completes.
What happens during such blocking calls is that computation is halted until the blocked buffer is freed. This usually wastes computational resources, as Send/Recv is typically just copying data from one memory location to another while the CPU does no useful work.
NON-BLOCKING COMMUNICATION :
For non-blocking communication, the application creates a communication request for a send and/or receive, gets back a handle, and the call then returns immediately. That is all that is needed to guarantee that the operation will be executed, i.e., the MPI library has been notified that the operation has to be performed.
For the sender side, this allows overlapping computation with communication.
For the receiver side, this allows overlapping a part of the communication overhead , i.e copying the message directly into the address space of the receiving side in the application.
When using blocking communication, you must be careful about the order of the send and receive calls. For example,
look at this code:
if (rank == 0)
{
    MPI_Send(x to process 1);
    MPI_Recv(y from process 1);
}
if (rank == 1)
{
    MPI_Send(y to process 0);
    MPI_Recv(x from process 0);
}
What happens in this case?
Unless the MPI library buffers the messages, process 0 sends x to process 1 and blocks until process 1 receives x.
Process 1 sends y to process 0 and blocks until process 0 receives y, but
process 0 is itself blocked, so both processes wait forever until they are killed.
It is easy.
Non-blocking means that computation and data transfer can happen at the same time within a single process.
Blocking, on the other hand, means: hey buddy, you have to make sure the data transfer has finished before you move on to the next command. So if a transfer is followed by a computation, the computation can only start after the transfer has succeeded.
Both the accepted answer and the other very long one mention overlap of computation and communication as an advantage. That is 1. not the main motivation, and 2. very hard to attain. The main advantage (and the original motivation) of non-blocking communication is that you can express complicated communication patterns without getting deadlock and without processes serializing themselves unnecessarily.
Examples: Deadlock: everyone does a receive, then everyone does a send, for instance along a ring. This will hang.
Serialization: along a linear ordering, everyone except the last does a send to the right, then everyone except the first does a receive from the left. This will have all processes executing sequentially rather than in parallel.

How to handle lifecycle of dynamically allocated data in Windows messages?

Simple task: Send a windows message with dynamically allocated data, e.g. an arbitrary length string. How would you manage the responsibility to free this data?
The receiver(s) of the windows message could be responsible for freeing this data. But how can you guarantee that all messages will actually be received, and thus that the linked data will be freed? Imagine that the receiver is shutting down, so it won't process its message queue any more. However, the message queue still exists (for some time) and can still accept messages, which will never be processed.
Thanks!
PostMessage returns a BOOL that tells you whether the message was posted or not. This is usually good enough, because your window should be valid until it receives the WM_DESTROY and the WM_NCDESTROY messages. After a call to DestroyWindow (which sends these messages) you should not be able to successfully call PostMessage again.
Now, if your PostMessage returns FALSE you have to clean up. If it doesn't, the window procedure has to clean up. Don't send messages that have to be cleaned up to random windows that might not handle them. Actually, don't send any WM_USER + x messages to any windows you don't handle.
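The ownership rule described above (clean up yourself if the post fails, otherwise leave it to the window procedure) can be sketched with std::unique_ptr. This is a portable illustration rather than real Win32 code: `post_owned` is a hypothetical helper, and the `post` callable is assumed to wrap a call such as PostMessage with the raw pointer passed in LPARAM, returning true when the message was queued.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical helper: hand a heap-allocated payload to a poster that
// behaves like PostMessage, i.e. post(raw_pointer) returns true if the
// message was queued and false if it was not.
template <typename T, typename PostFn>
bool post_owned(std::unique_ptr<T> payload, PostFn post) {
    assert(payload != nullptr);
    if (post(payload.get())) {
        // Queued: ownership now belongs to the receiving window procedure,
        // which must delete the object after handling the message.
        (void)payload.release();
        return true;
    }
    // Not queued: the payload is still owned here and is freed when
    // `payload` goes out of scope, so nothing leaks.
    return false;
}
```

On the receiving side, the window procedure would be expected to wrap the incoming pointer in a std::unique_ptr again as its first step, so the data is freed on every path through the handler.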
There's nothing to do here. As soon as the call to SendMessage returns, you can free the data. As it happens, the other app isn't looking at your memory anyway since it's in a different process. Instead Windows marshals the data across the process boundary.
What's more, if you are receiving the data in a WndProc, you can't take a copy of the pointer to the string. Instead you must take a copy of the contents of the string since that pointer is only valid for the duration of that call to WndProc.
The other point to make is that you seem to be confused about the message queue. When you send a message, that happens synchronously and the queue is not involved. The message queue is where posted messages are placed. They are processed asynchronously.

Optimally reading data from an Asynchronous Socket

I have a problem with a socket library that uses WSAAsyncSelect to put the socket into asynchronous mode. In asynchronous mode the socket is placed into non-blocking mode (WSAEWOULDBLOCK is returned on any operation that would block) and windows messages are posted to a notification window to inform the application when the socket is ready to be read, written to, etc.
My problem is this: when receiving an FD_READ event, I don't know how many bytes to try to recv. If I pass a buffer that's too small, then Winsock will automatically post another FD_READ event telling me there's more data to read. If data is arriving very fast, this can saturate the message queue with FD_READ messages, and as WM_TIMER and WM_PAINT messages are only posted when the message queue is empty, an application could stop painting if it is receiving a lot of data and using asynchronous sockets with too small a buffer.
How large should the buffer be, then? I tried using ioctlsocket(FIONREAD) to get the number of bytes to read and making a buffer exactly that large, BUT KB192599 explicitly warns that this approach is fraught with inefficiency.
How do I pick a buffer size that's big enough, but not crazy big?
As far as I could ever work out, the value set using setsockopt with the SO_RCVBUF option is an upper bound on the FIONREAD value. So rather than call ioctlsocket it should be OK to call getsockopt to find out the SO_RCVBUF setting, and use that as the (attempted) value for each recv.
Based on your comment to Aviad P.'s answer, it sounds like this would solve your problem.
(Disclaimer: I have always used FIONREAD myself. But after reading the linked-to KB article I will probably be changing...)
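Querying the SO_RCVBUF setting might look roughly like the sketch below. Note the swap: it is written against the POSIX sockets API so that it is self-contained, whereas the question is about Winsock, where getsockopt takes a SOCKET, a char* value and an int* length instead. `receive_buffer_size` is a hypothetical helper name.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper (POSIX flavour): returns the socket's SO_RCVBUF
// setting in bytes, or -1 if the query fails (e.g. invalid descriptor).
int receive_buffer_size(int fd) {
    int size = 0;
    socklen_t len = sizeof(size);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) != 0) {
        return -1;
    }
    assert(len == sizeof(size));  // SO_RCVBUF is reported as an int
    return size;
}
```

The result would then be used once, when the connection is set up, to size the buffer passed to every recv, instead of calling ioctlsocket(FIONREAD) before each read.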
You can set your buffer to be as big as you can without impacting performance, relying on the TCP PUSH flag to make your reads return before filling the buffer if the sender sent a smaller message.
The TCP PUSH flag is set at a logical message boundary (normally after a send operation, unless explicitly set to false). When the receiving end sees the PUSH flag on a TCP packet, it returns any blocking reads (or asynchronous reads, doesn't matter) with whatever's accumulated in the receive buffer up to the PUSH point.
So if your sender is sending reasonably sized messages, you're OK. If not, limit your buffer size so that even if a read fills it completely, you don't negatively impact performance (subjective).
