Is MPI_Bcast() blocking? - parallel-processing

Is MPI_Bcast() blocking? - parallel-processing

Is MPI_Bcast() blocking or nonblocking? In other word, when the root sends a data, do all processors block until every processor has received this data? If not, how to synchronized (block) all of them so that no one proceeds until all receives the same data.

You need to be a bit careful about terminology here as what MPI means by "blocking" may not be how you have seen it used in other contexts.
In MPI terms, Bcast is blocking. Blocking means that, when the function returns, it has completed the operation it was meant to do. In this case, it means that on return from Bcast it is guaranteed that the receive buffer in every process contains the data you want to broadcast. The non-blocking version is Ibcast.
In MPI terms, what you are asking is whether the operation is synchronous, i.e. implies synchronisation amongst processes. For a point-to-point operation such as Send, this refers to whether or not the sender waits for the receive to be posted before returning from the send call. For collective operations, the question is whether there is a barrier (as pointed out by #Vladimir). Bcast does not necessarily imply a barrier.
However, the reason I am posting is that, in almost all MPI programs written using the standard Send/Recv calls (as opposed to single-sided Put/Get) you do not care if there is a synchronisation after the barrier. All each process cares about is that it has received the data it needs - why would it matter what the other processes are doing? If you subsequently want to communicate with any other process then the MPI routines are designed so that the required synchronisation happens automatically. If you issue a receive and another process is slow, you wait; if you issue a send and the other process has not issued a receive, everything will still work correctly (this is assuming you don't call Rsend - you should never call Rsend!). Whether or not there is synchronisation has effects on performance, but rarely affects whether a program is correct or not.
Unless processes are interacting via some other mechanism (e.g. all accessing the same file) then it is hard to come up with a real example where you care whether or not the Bcast synchronises. Of course you can always construct some edge case, but in real practical applications of MPI it almost never matters.
Many MPI programs are littered with barriers and in my experience they are almost never required for correctness; the only common use case is to ensure meaningful timings for performance measurements.

No, this kind of blocking (waiting for the other processes to finish their part of the job) would be very bad for performance. Every process continues as soon as it has all it need -- that means that the data it was to receive are there, or the data to be sent are at least copied to some buffer.
You can use an MPI_Barrier to synchronize processes if you need to be sure all processes finished. As already said, it can slowdown the program significantly. I use it only for certain diagnostic logging when initializing my code. Not during the actual integration.

Related

Is it safe to call MPI_Finalize immediately after MPI_Send or MPI_Ssend?

Suppose I have 3 nodes with ranks 0, 1 and 2. 1 and 2 are doing some calculations and then transfer the results to 0. Is it safe to call MPI_Send or MPI_Ssend, then immediately MPI_Finalize and then immediately return from main?
Consider this example: 2 fishishes much faster, sends the results to 0 and finalizes. After that, 0 and 1 will still need to communicate, once 1 has finished. Is this an allowed MPI "state"? The standard repeatedly states that no communication via MPI may occur after MPI_Finalize has been called. However, it seems a little impractical if that counts for all nodes and not only the ones that have finalized.
Furthermore, it is unclear to me, whether the MPI_Send and MPI_Ssend calls are safe, since at least for MPI_Send, the call may return before 0 posted a receive. What happens if MPI_Finalize is called immediately after that, if there is still data in some internal buffer? Does hat also apply to MPI_Ssend?

MPI_FINALIZE is collective over all connected ranks but, as with all other collectives except MPI_BARRIER, it may return early. When you call it, the MPI library cleans up after itself and, if possible, returns control back to you. The standard does not require that behaviour though - an implementation that waits until all the other ranks have called MPI_FINALIZE, or simply never returns (except in rank 0 of MPI_COMM_WORLD), is still a compliant one.
As long as the rest of the MPI ranks do not try to communicate with a rank that has called MPI_FINALIZE, they are free to continue communicating with each other. This precludes though any collective calls unless those are on communicators that the exited rank wasn't a member of. The scenario you are describing is a perfectly valid use of MPI.
Regarding what happens when you call MPI_FINALIZE, the standard requires that you take measures to complete the visible part of the communication operations, e.g., wait/test any non-blocking calls. As for buffered operations, the standard says (Section 8.7 Startup, pg. 359):
Advice to implementors. Even though a process has executed all MPI calls needed to complete the communications it is involved with, such communication may not yet be completed from the viewpoint of the underlying MPI system. For example, a blocking send may have returned, even though the data is still buffered at the sender in an MPI buffer; an MPI process may receive a cancel request for a message it has completed receiving. The MPI implementation must ensure that a process has completed any involvement in MPI communication before MPI_FINALIZE returns. Thus, if a process exits after the call to MPI_FINALIZE, this will not cause an ongoing communication to fail. The MPI implementation should also complete freeing all objects marked for deletion by MPI calls that freed them. (End of advice to implementors.)
(emphasis in bold is mine)

How can I tell Windows XP/7 not to switch threads during a certain segment of my code?

I want to prevent a thread switch by Windows XP/7 in a time critical part of my code that runs in a background thread. I'm pretty sure I can't create a situation where I can guarantee that won't happen, because of higher priority interrupts from system drivers, etc. However, I'd like to decrease the probability of a thread switch during that part of my code to the minimum that I can. Are there any create-thread flags or Window API calls that can assist me? General technique tips are appreciated too. If there is a way to get this done without having to raise the threads priority to real-time-critical that would be great, since I worry about creating system performance issues for the user if I do that.
UPDATE: I am adding this update after seeing the first responses to my original post. The concrete application that motivated the question has to do with real-time audio streaming. I want to eliminate every bit of delay I can. I found after coding up my original design that a thread switch can cause a 70ms or more delay at times. Since my app is between two sockets acting as a middleman for delivering audio, the instant I receive an audio buffer I want to immediately turn around and push it out the the destination socket. My original design used two cooperating threads and a semaphore since the there was one thread managing the source socket, and another thread for the destination socket. This architecture evolved from the fact the two devices behind the sockets are disparate entities.
I realized that if I combined the two sockets onto the same thread I could write a code block that reacted immediately to the socket-data-received message and turned it around to the destination socket in one shot. Now if I can do my best to avoid an intervening thread switch, that would be the optimal coding architecture for minimizing delay. To repeat, I know I can't guarantee this situation, but I am looking for tips/suggestions on how to write a block of code that does this and minimizes as best as I can the chance of an intervening thread switch.
Note, I am aware that O/S code behind the sockets introduces (potential) delays of its own.

AFAIK there are no such flags in CreateThread or etc (This also doesn't make sense IMHO). You may snooze other threads in your process from execution during in critical situations (by enumerating them and using SuspendThread), as well as you theoretically may enumerate & suspend threads in other processes.
OTOH snoozing threads is generally not a good idea, eventually you may call some 3rd-party code that would implicitly wait for something that should be accomplished in another threads, which you suspended.
IMHO - you should use what's suggested for the case - playing with thread/process priorities (also you may consider SetThreadPriorityBoost). Also the OS tends to raise the priority to threads that usually don't use CPU aggressively. That is, threads that work often but for short durations (before calling one of the waiting functions that suspend them until some condition) are considered to behave "nicely", and they get prioritized.

How does cooperative multitasking work?

I read this Wikipedia text slice:
Because a cooperatively multitasked system relies on each process regularly giving up time to other processes on the system, one poorly designed program can consume all of the CPU time for itself or cause the whole system to hang.
Out of curiosity, how does one give up that time? Is this some sort of OS call? Let's think about non-preemptive cases like fibers or evented IO that do cooperative multitasking. How do they give up that time?
Take this NodeJS example:
var fs = require('fs');
fs.readFile('/path/to/file', function(err, data) {});
It is obvious to me that the process does nothing while it's waiting for the data, but how does V8 in this case give up time for other processes?
Let's assume Linux/Windows as our OS.
Edit: I found out how Google is doing this with their V8.
On Windows they basically sleep zero time:
void Thread::YieldCPU() {
Sleep(0);
}
And on Linux they make an OS call:
void Thread::YieldCPU() {
sched_yield();
}
of sched.h.

Yes, every program participates in the scheduling decisions of the OS, so you have to call a particular syscall that tells the kernel to take back over. Often this was called yield(). If you imagine how difficult it is to guarantee that a paticular line of code is called at regular, short intervals, or even at all, you get an idea of why cooperative multitasking is a suboptimal solution.
In your example, it is the javascript engine itself is interrupted by the OS scheduler, if it's a preemptive OS. If it's a cooperative one, then no, the engine gets no work done, and neither does any other process. As a result, such systems are usually not suitable for real-time (or even serious) workloads.

An example of such an OS is NetWare. In that system, it was necessary to call a specific function (I think it is called ThreadSwitch or maybe ThreadSwitchWithDelay). And it was always a guess as to how often it was needed. In every single CPU-intensive loop in the product it was necessary to call one of those functions periodically.
But in that system other calls would result in allowing other threads to run. In particular (and germane to the question) is that I/O calls resulted in giving the OS the opportunity to run other threads. Basically any system call that gave control to the OS was sufficient to allow other threads to run (mutex/semaphore calls being important ones).

As a general rule, co-operative multitasking involves the functions signalling that they are now waiting, rather than going into spin loops ( where they process while waiting ) they suspend themselves.
In this case, the processing behind the ReadFile will handle the waiting for data and the relevant signalling that it is suspendable. Within you own code, whatever it is written in, you should suspend processing if you are waiting for a long-running process, not spin. However, in many cases, the suspend processes are automatically handled, because suspension activities are built in. The danger in this is tht if you deliberately force long-term spins, then you will hang the system.
The alternative ( from that wiki ) is pre-emptive multitasking, where the process is forced out action after a certain time, irrespective of what it is doing. This means that whatever you do, it cannot run forever, because the system process will force it out. However, it can be less efficient as the break points are not defined.

Is it better to poll or wait?

I have seen a question on why "polling is bad". In terms of minimizing the amount of processor time used by one thread, would it be better to do a spin wait (i.e. poll for a required change in a while loop) or wait on a kernel object (e.g. a kernel event object in windows)?
For context, assume that the code would be required to run on any type of processor, single core, hyperthreaded, multicore, etc. Also assume that a thread that would poll or wait can't continue until the polling result is satisfactory if it polled instead of waiting. Finally, the time between when a thread starts waiting (or polling) and when the condition is satisfied can potentially vary from a very short time to a long time.
Since the OS is likely to more efficiently "poll" in the case of "waiting", I don't want to see the "waiting just means someone else does the polling" argument, that's old news, and is not necessarily 100% accurate.

Provided the OS has reasonable implementations of these type of concurrency primitives, it's definitely better to wait on a kernel object.
Among other reasons, this lets the OS know not to schedule the thread in question for additional timeslices until the object being waited-for is in the appropriate state. Otherwise, you have a thread which is constantly getting rescheduled, context-switched-to, and then running for a time.
You specifically asked about minimizing the processor time for a thread: in this example the thread blocking on a kernel object would use ZERO time; the polling thread would use all sorts of time.
Furthermore, the "someone else is polling" argument needn't be true. When a kernel object enters the appropriate state, the kernel can look to see at that instant which threads are waiting for that object...and then schedule one or more of them for execution. There's no need for the kernel (or anybody else) to poll anything in this case.

Waiting is the "nicer" way to behave. When you are waiting on a kernel object your thread won't be granted any CPU time as it is known by the scheduler that there is no work ready. Your thread is only going to be given CPU time when it's wait condition is satisfied. Which means you won't be hogging CPU resources needlessly.

I think a point that hasn't been raised yet is that if your OS has a lot of work to do, blocking yeilds your thread to another process. If all processes use the blocking primitives where they should (such as kernel waits, file/network IO etc.) you're giving the kernel more information to choose which threads should run. As such, it will do more work in the same amount of time. If your application could be doing something useful while waiting for that file to open or the packet to arrive then yeilding will even help you're own app.

Waiting does involve more resources and means an additional context switch. Indeed, some synchronization primitives like CLR Monitors and Win32 critical sections use a two-phase locking protocol - some spin waiting is done fore actually doing a true wait.
I imagine doing the two-phase thing would be very difficult, and would involve lots of testing and research. So, unless you have the time and resources, stick to the windows primitives...they already did the research for you.

There are only few places, usually within the OS low-level things (interrupt handlers/device drivers) where spin-waiting makes sense/is required. General purpose applications are always better off waiting on some synchronization primitives like mutexes/conditional variables/semaphores.

I agree with Darksquid, if your OS has decent concurrency primitives then you shouldn't need to poll. polling usually comes into it's own on realtime systems or restricted hardware that doesn't have an OS, then you need to poll, because you might not have the option to wait(), but also because it gives you finegrain control over exactly how long you want to wait in a particular state, as opposed to being at the mercy of the scheduler.

Waiting (blocking) is almost always the best choice ("best" in the sense of making efficient use of processing resources and minimizing the impact to other code running on the same system). The main exceptions are:
When the expected polling duration is small (similar in magnitude to the cost of the blocking syscall).
Mostly in embedded systems, when the CPU is dedicated to performing a specific task and there is no benefit to having the CPU idle (e.g. some software routers built in the late '90s used this approach.)
Polling is generally not used within OS kernels to implement blocking system calls - instead, events (interrupts, timers, actions on mutexes) result in a blocked process or thread being made runnable.

There are four basic approaches one might take:
Use some OS waiting primitive to wait until the event occurs
Use some OS timer primitive to check at some defined rate whether the event has occurred yet
Repeatedly check whether the event has occurred, but use an OS primitive to yield a time slice for an arbitrary and unknown duration any time it hasn't.
Repeatedly check whether the event has occurred, without yielding the CPU if it hasn't.
When #1 is practical, it is often the best approach unless delaying one's response to the event might be beneficial. For example, if one is expecting to receive a large amount of serial port data over the course of several seconds, and if processing data 100ms after it is sent will be just as good as processing it instantly, periodic polling using one of the latter two approaches might be better than setting up a "data received" event.
Approach #3 is rather crude, but may in many cases be a good one. It will often waste more CPU time and resources than would approach #1, but it will in many cases be simpler to implement and the resource waste will in many cases be small enough not to matter.
Approach #2 is often more complicated than #3, but has the advantage of being able to handle many resources with a single timer and no dedicated thread.
Approach #4 is sometimes necessary in embedded systems, but is generally very bad unless one is directly polling hardware and the won't have anything useful to do until the event in question occurs. In many circumstances, it won't be possible for the condition being waited upon to occur until the thread waiting for it yields the CPU. Yielding the CPU as in approach #3 will in fact allow the waiting thread to see the event sooner than would hogging it.

Win32 Overlapped I/O - Completion routines or WaitForMultipleObjects?

I'm wondering which approach is faster and why ?
While writing a Win32 server I have read a lot about the Completion Ports and the Overlapped I/O, but I have not read anything to suggest which set of API's yields the best results in the server.
Should I use completion routines, or should I use the WaitForMultipleObjects API and why ?

You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event and you can wait for that event to be signalled to indicate the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, this doesn't scale well as you're limited to the number of handles that you can pass to WaitForMultipleObjects().
Alternatively you can pass a completion routine which is called when completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves in an alertable state in several ways (calling SleepEx() or the various EX versions of the Wait functions, etc). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said IMHO.
There's a third way, before issuing the call you should associate the handle that you want to do overlapped I/O on with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() and looping. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
As previously mentioned, Vista/Server 2008 have cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread that issued the overlapped request continued to run until the request completed. Link to a reference to that can be found here. But this problem is easy to work around anyway; you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside; IMHO, you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre as it deals will all you need to know about overlapped I/O and most other advanced windows platform techniques and APIs.

WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However completion ports, and the event based programming model, are a more difficult concept to really work against.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server2008 made a change with completion ports that the originating thread is not now needed to complete IO operations, this may make a bigger difference (see this article by Mark Russinovich).

Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.

The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO() however does a round-robin on its objects, so busy objects with low index numbers can starve objects with high index numbers. (E.g. if object 0 is going off constantly, it will starve objects 1, 2, 3, etc). This is obviously undesireable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. I was able on a 512mb W2K machine to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets, if they're idle - a busy socket consumes more non-paged pool and that's the ultimate limit on how many sockets you can have).
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.

Not sure. But I use WaitForMultipleObjects and/or WaitFoSingleObjects. It's very convenient.

Either routine works and I don't really think one is any significant faster then another.
These two approaches exists to satisfy different programming models.
WaitForMultipleObjects is there to facilitate async completion pattern (like UNIX select() function) while completion ports is more towards event driven model.
I personally think WaitForMultipleObjects() approach result in cleaner code and more thread safe.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio