Difference between MPI non-blocking communication and pthreads for a client-server design?

In MPI, there are non-blocking calls like MPI_Isend and MPI_Irecv.
If I am working on a p2p project, the server would listen to many clients.
One way to do it:
for (int i = 1; i < highest_rank; i++) {
    MPI_Irecv(...., i, ...., &statuses[i]); // listening to all slaves
}
while (true) {
    for (int i = 1; i < highest_rank; i++) {
        checkStatus(statuses[i]);
        // if it completed, do something
    }
}
Another, older way I could do it: the server creates many POSIX threads and passes each one a function that calls MPI_Recv and loops forever.
Theoretically, which one would perform faster on the server end? If there is a better way to write the server, please let me know as well.

The latter solution does not seem very efficient to me because of all the overhead of managing the pthreads inside an MPI process.
In any case, I would rewrite your MPI code as:
for (int i = 1; i < highest_rank; i++) {
    MPI_Irecv(...., i, ...., &requests[i]); // listening to all slaves
}
while (true) {
    // requests[0] should be initialised to MPI_REQUEST_NULL so it is ignored
    MPI_Waitany(highest_rank, requests, &index, &status);
    // do something useful with the message from requests[index]
}
Even better, you can use MPI_Recv with MPI_ANY_SOURCE as the rank of the source of a message. It seems like your server has nothing to do except serve requests, so there is no need for an asynchronous recv.
Code would be:
while (true) {
    MPI_Recv(...., MPI_ANY_SOURCE, REQUEST_TAG, comm, &status);
    // retrieve the client id from status and do something
}
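For completeness, here is a minimal, self-contained sketch of that server loop; the REQUEST_TAG value and the integer payload are illustrative assumptions, not from the question:
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int REQUEST_TAG = 42;      // assumed tag value
    if (rank == 0) {                 // the server
        int payload;
        for (int served = 0; served < size - 1; served++) {
            MPI_Status status;
            // Block until any client sends a request.
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST_TAG,
                     MPI_COMM_WORLD, &status);
            // The client id comes from the status object.
            printf("request %d from rank %d\n", payload, status.MPI_SOURCE);
        }
    } else {                         // a client
        int payload = rank * 100;
        MPI_Send(&payload, 1, MPI_INT, 0, REQUEST_TAG, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}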

When calling MPI_Irecv, it is NOT safe to test the recv buffer until AFTER MPI_Test* or MPI_Wait* have been called and successfully completed. The behavior of directly testing the buffer without making those calls is implementation dependent (and ranges from not so bad to a segfault).
Setting up a 1:1 mapping with one MPI_Irecv for each remote rank can be made to work. Depending on the amount of data that is being sent, and the lifespan of that data once received, this approach may consume an unacceptable amount of system resources. Using MPI_Testany or MPI_Testall will likely provide the best balance between message processing and CPU load. If there is no non-MPI processing that needs to be done while waiting on incoming messages, MPI_Waitany or MPI_Waitall may be preferable.
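A sketch of that Testany trade-off; handle_message, do_other_work, BUF_LEN and the surrounding setup are assumed names, not from the answer:
#include <mpi.h>

enum { BUF_LEN = 1024 };
void handle_message(char *buf, int source);  // application-defined
void do_other_work();                        // application-defined

void serve(MPI_Request *requests, char (*buffers)[BUF_LEN], int nreq,
           volatile bool &running) {
    while (running) {
        int index, flag;
        MPI_Status status;
        // Check whether any outstanding receive has completed.
        MPI_Testany(nreq, requests, &index, &flag, &status);
        if (flag && index != MPI_UNDEFINED) {
            // buffers[index] is safe to read only now; repost the receive.
            handle_message(buffers[index], status.MPI_SOURCE);
            MPI_Irecv(buffers[index], BUF_LEN, MPI_BYTE, status.MPI_SOURCE,
                      MPI_ANY_TAG, MPI_COMM_WORLD, &requests[index]);
        } else {
            do_other_work();  // non-MPI processing between polls
        }
    }
}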
If there are outstanding MPI_Irecv calls, but the application has reached the end of normal processing, it is "necessary" to MPI_Cancel those outstanding calls. Failing to do that may be caught in MPI_Finalize as an error.
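A sketch of that shutdown path, with illustrative names: cancel any receive that never completed, then wait so the request is actually released before MPI_Finalize:
#include <mpi.h>

void cancel_outstanding(MPI_Request *requests, int nreq) {
    for (int i = 0; i < nreq; i++) {
        if (requests[i] == MPI_REQUEST_NULL) continue;
        MPI_Cancel(&requests[i]);
        MPI_Status status;
        MPI_Wait(&requests[i], &status);   // completes the cancelled request
        int was_cancelled;
        MPI_Test_cancelled(&status, &was_cancelled);
    }
}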
A single MPI_Irecv (or just MPI_Recv, depending on how aggressive the message handling needs to be) on MPI_ANY_SOURCE also provides a reasonable solution. This approach can also be useful if the amount of data received is "large" and can be safely discarded after processing. Processing a single incoming buffer at a time can reduce the total system resources required, at the expense of serializing the processing.

Let me just comment on your idea to use POSIX threads (or whatever other threading mechanism there might be). Making MPI calls from multiple threads at the same time requires that the MPI implementation be initialised with the highest level of thread support, MPI_THREAD_MULTIPLE:
int provided;

MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided != MPI_THREAD_MULTIPLE)
{
    printf("Error: MPI does not provide full thread support!\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
}
Although the option to support concurrent calls from different threads was introduced in the MPI standard quite some time ago, there are still MPI implementations that struggle to provide fully working multithreaded support. MPI is all about writing portable (at least in theory) applications, but in this case real life differs badly from theory. For example, one of the most widely used open-source MPI implementations, Open MPI, still does not support native InfiniBand communication (InfiniBand is the very fast, low-latency fabric used in most HPC clusters nowadays) when initialised at the MPI_THREAD_MULTIPLE level, and therefore switches to different, often much slower and higher-latency transports like TCP/IP over regular Ethernet or IP-over-InfiniBand. There are also some supercomputer vendors whose MPI implementations do not support MPI_THREAD_MULTIPLE at all, often because of the way the hardware works.
Besides, MPI_Recv is a blocking call, which poses problems with proper thread cancellation (if necessary). You have to make sure that all threads escape the infinite loop somehow, e.g. by having each worker send a termination message with an appropriate tag or by some other protocol.
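A sketch of that escape protocol; the tag values and message layout are assumptions for illustration:
#include <mpi.h>

const int WORK_TAG = 1;
const int TERMINATE_TAG = 2;

void recv_loop(int client_rank) {
    for (;;) {
        int payload;
        MPI_Status status;
        MPI_Recv(&payload, 1, MPI_INT, client_rank, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TERMINATE_TAG)
            break;                 // the agreed-upon shutdown signal
        // ... handle a WORK_TAG message ...
    }
}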

Related

Synchronous vs Asynchronous socket reads

Most example apps I come across for receiving data use async calls. For instance, C++ examples use Boost.Asio services to bind message handlers to callbacks. But what about an app that only needs to listen for data from a single socket and process the messages in order? Would it be faster to have a loop that polls/recvs from the socket and calls the handler without using a callback (assume main and logging threads are separate)? Or is there no performance difference (assume messages are coming in as fast as the network card and kernel can handle them)?
There are many intricacies I don't know, such as the impact of callbacks on performance due to things like branch prediction, or whether there is a performance penalty when the callbacks hand off to a different thread for processing. I'm curious to hear some thoughts, experiences, and dialog on this subject to save myself from attempting both implementations to discover the answer.
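For concreteness, a minimal sketch of the synchronous variant being asked about, using plain POSIX sockets rather than Boost.Asio; handle() and the absence of message framing are simplifying assumptions:
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstddef>

void handle(const char *msg, size_t len);  // application-defined

void read_loop(int fd) {
    char buf[64 * 1024];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);  // blocks until data arrives
        if (n <= 0) break;                         // peer closed or error
        handle(buf, static_cast<size_t>(n));       // processed in arrival order
    }
    close(fd);
}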

std::promise/std::future vs std::condition_variable in C++

Signaling between threads can be achieved with std::promise/std::future or with good old condition variables. Can someone provide examples/use cases where one would be a better choice than the other?
I know that CVs can be used to signal multiple times between threads. Can you give an example with std::future/promise signaling multiple times?
Also, is std::future::wait_for equivalent in performance to std::condition_variable::wait?
Let's say I need to wait on multiple futures in a queue as a consumer; does it make sense to go through each of them and check whether they are ready, like below?
for (auto it = activeFutures.begin(); it != activeFutures.end();) {
    if (it->valid() && it->wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) {
        Printer::print(std::string("+++ Value " + std::to_string(it->get()->getBalance())));
        it = activeFutures.erase(it);  // erase() returns the next valid iterator
    } else {
        ++it;
    }
}
Can someone provide examples/use cases where one would be a better choice than the other?
These are two different tools from the standard library.
In order to give an example where one would be better than the other, you'd have to come up with a scenario where both tools are a good fit.
However, they sit at different levels of abstraction in what they do and what they are good for.
From cppreference (emphasis mine):
Condition variables
A condition variable is a synchronization primitive that allows
multiple threads to communicate with each other. It allows some number
of threads to wait (possibly with a timeout) for notification from
another thread that they may proceed. A condition variable is always
associated with a mutex.
Futures
The standard library provides facilities to obtain values that are
returned and to catch exceptions that are thrown by asynchronous tasks
(i.e. functions launched in separate threads). These values are
communicated in a shared state, in which the asynchronous task may
write its return value or store an exception, and which may be
examined, waited for, and otherwise manipulated by other threads that
hold instances of std::future or std::shared_future that reference that shared state.
As you can see, a condition variable is a synchronization primitive whereas a future is a facility used to communicate results of asynchronous tasks.
The condition variable can be used in a variety of scenarios where you need to synchronize multiple threads; however, you would typically use a std::future when you have tasks/jobs/work to do and you need it done without interrupting your main flow, i.e. asynchronously.
So, in my opinion, a good example of where you would use a future + promise is when you need to run a long-running calculation and get/wait_for the result at a later point in time. With a condition variable, you would have had to basically implement std::future + std::promise on your own, possibly using std::condition_variable somewhere internally.
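A minimal sketch of that use case; the calculation itself is an arbitrary stand-in:
#include <future>
#include <thread>
#include <iostream>

long long expensive_sum(long long n) {
    long long s = 0;
    for (long long i = 1; i <= n; i++) s += i;
    return s;
}

int main() {
    std::promise<long long> result_promise;
    std::future<long long> result = result_promise.get_future();

    std::thread worker([&result_promise] {
        result_promise.set_value(expensive_sum(100000000));
    });

    // ... main flow continues uninterrupted ...

    std::cout << result.get() << '\n';  // blocks only if not ready yet
    worker.join();
}
Note that std::async with std::launch::async would shorten this further by returning the future directly, hiding the promise and the thread.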
Can you give an example with std::future/promise signaling multiple times?
Have a look at the toy example from std::shared_future.
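A condensed sketch along the lines of that toy example: a single shared state released to several waiting threads at once. A future can only be signaled once per shared state, so this is the closest future-based analogue of notify_all:
#include <future>
#include <thread>
#include <iostream>

int main() {
    std::promise<void> go;
    std::shared_future<void> ready = go.get_future().share();

    auto work = [ready](int id) {
        ready.wait();                // every thread waits on the same state
        std::cout << "thread " << id << " released\n";
    };
    std::thread t1(work, 1), t2(work, 2), t3(work, 3);

    go.set_value();                  // signals all waiters at once
    t1.join(); t2.join(); t3.join();
}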
Also, is std::future::wait_for equivalent in performance to std::condition_variable::wait?
Well, GCC's implementation of std::future::wait_for uses std::condition_variable::wait_for, which matches the explanation of the difference between the two above. So, as you might expect, std::future::wait_for adds only a very small performance overhead on top of std::condition_variable::wait_for.

WinHttpWriteData completion

I'm using WinHTTP to transfer large files to a PHP-based web server and I want to display the progress and an estimated speed. After reading the docs I have decided to use chunked transfer encoding. The files get transferred correctly but there is an issue with estimating the time that I cannot solve.
I'm using a loop to send chunks with WinHttpWriteData (chunk header + data + chunk trailer) and I compute the time difference between start and finish with GetTickCount. I have a fixed bandwidth of 4 Mbit/s configured on my router in order to test the correctness of my estimation.
The typical time difference for chunks of 256 KB is between 450 and 550 ms, which is correct. The problem is that once in a while (every few seconds or tens of seconds) WinHttpWriteData returns really, really fast, like 4-10 ms, which is obviously not possible. The next difference is then much higher than the average 500 ms.
Why does WinHttpWriteData confirm, either synchronously or asynchronously, that it has written the data to the destination when, in reality, the data is still being transferred? Any ideas?
Oversimplified, my code looks like:
while (dataLeft)
{
    t1 = GetTickCount();

    WinHttpWriteData(hRequest, chunkHdr, chunkHdrLen, NULL);
    waitWriteConfirm();

    WinHttpWriteData(hRequest, actualData, actualDataLen, NULL);
    waitWriteConfirm();

    WinHttpWriteData(hRequest, chunkFtr, chunkFtrLen, NULL);
    waitWriteConfirm();

    t2 = GetTickCount();
    tdif = t2 - t1;
}
This is simply the nature of how sockets work in general.
Whether you call a lower-level function like send() or a higher-level function like WinHttpWriteData(), the functions return success/failure based on whether they were able to pass the data to the underlying socket kernel buffer or not. The kernel queues up data for eventual transmission in the background. The kernel does not report back when the data is actually transmitted, or whether the receiver acks the data. The kernel happily accepts new data as long as there is room in the queue, even if it will take a while to actually transmit it. Otherwise, it blocks the sender until room becomes available in the queue.
If you need to monitor actual transmission speed, you have to monitor the low level network activity directly, such as with a packet sniffer or driver hook. Otherwise, you can only monitor how fast you are able to pass data to the kernel (which is usually good enough for most purposes).
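A sketch that makes the effect visible with a plain socket; the already-connected descriptor fd is assumed. Timing send() measures how fast the kernel accepts data into its buffer, not how fast the peer receives it:
#include <sys/types.h>
#include <sys/socket.h>
#include <chrono>
#include <cstdio>

void timed_send(int fd, const char *buf, size_t len) {
    auto t1 = std::chrono::steady_clock::now();
    ssize_t n = send(fd, buf, len, 0);   // returns once the data is queued
    auto t2 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    // Near-zero times mean the kernel buffer had room, not that the transfer
    // finished; long times mean the buffer was full and the call blocked.
    printf("queued %zd bytes in %lld ms\n", n, (long long)ms.count());
}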

ZeroMQ: several I/O threads but only of them in user code?

By default, there is only one thread doing I/O in ZeroMQ. Thus, there will be no more than one such thread executing user code, in the case that we are using callbacks, as in Node.js:
aSocket.on('message', function(request) { ... user code ... });
But, at least in the C API, one may ask ZeroMQ to have more than one I/O thread.
In this case (several I/O threads), can we assume that no more than one I/O thread will be executing user code in callbacks?
If that is not true in general, I guess it is at least the case in Node.js.
To directly answer:
In this case (several I/O threads), can we assume that no more than one I/O thread will be executing user code in callbacks?
The ZeroMQ C library doesn't have a callback-based framework, so yes, we can assume that. However, as you note in your post, you can set it up to have multiple I/O threads, in which case you need to deal with this manually in your own way; again, no callbacks.
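A sketch of what that looks like in the C API; the endpoint and reply logic are illustrative. Even with ZMQ_IO_THREADS raised, the receive loop, and therefore all user code, runs in the thread that calls zmq_recv, never in an I/O thread:
#include <zmq.h>

int main() {
    void *ctx = zmq_ctx_new();
    zmq_ctx_set(ctx, ZMQ_IO_THREADS, 4);  // more I/O threads, still no callbacks

    void *sock = zmq_socket(ctx, ZMQ_REP);
    zmq_bind(sock, "tcp://*:5555");

    char buf[256];
    for (;;) {
        int n = zmq_recv(sock, buf, sizeof buf, 0);  // runs in *this* thread
        if (n < 0) break;
        // ... user code: always here, never inside an I/O thread ...
        zmq_send(sock, "ok", 2, 0);
    }
    zmq_close(sock);
    zmq_ctx_term(ctx);
}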

Is MPI_Bsend_init/MPI_Start the best asynchronous buffered communication?

Is MPI_Bsend_init/MPI_Start the best asynchronous buffered communication? Can you think of a better way to communicate data between processors? Pseudo-code for N processing nodes:
MPI_Recv(request[i]);            // receive data
for (i = 0; i < N; i++)
    MPI_Bsend_init(request[i]);  // set up the request
MPI_Start(request[i]);           // send data
Bsend is the wrong function here from a performance standpoint. There is little to no advantage to Bsend, as the eager protocol used by essentially all implementations today already buffers small messages on the receiver side automatically, and small messages are the only case where Bsend would be viable.
In any case, persistent send, as you are using, is already nonblocking, so there is no such thing as MPI_Isend_init. See e.g. http://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node51.html:
The call is local, with similar semantics to the nonblocking communication operations described in section Nonblocking communication. That is, a call to MPI_START with a request created by MPI_SEND_INIT starts a communication in the same manner as a call to MPI_ISEND...
And my colleagues have surprised me with confirmation that persistent Send-Recv does provide efficiency gains on modern InfiniBand clusters. I can only assume this is because IB page registration is done up-front.
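For reference, a sketch of the persistent-request pattern being discussed; the buffer, peer, and iteration names are illustrative. The request is set up once, then started and waited on each iteration, so any per-buffer setup cost is paid only once:
#include <mpi.h>

void exchange(int peer, double *buf, int len, int iterations) {
    MPI_Request req;
    MPI_Send_init(buf, len, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    for (int it = 0; it < iterations; it++) {
        // ... fill buf; it must not be touched between Start and Wait ...
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);  // release the persistent request
}
The receiving side would set up a matching MPI_Recv_init and drive it with the same MPI_Start/MPI_Wait pair.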
