Can I use MPI_Barrier() to synchronize data in-between iteration steps

Is it a good idea to use MPI_Barrier() to synchronize data in between iteration steps? Please see the pseudocode below.
while (numberIterations < MaxIterations)
    MPI_Iprobe()                  -- check for incoming data
    while (flagprobe != 0)
        MPI_Recv()                -- receive data
        MPI_Iprobe()              -- loop if more data
    updateData()                  -- update myData
    for (i = 0; i < N; i++) MPI_Bsend_init(request[i])  -- set up request
    for (i = 0; i < N; i++) MPI_Start(request[i])       -- send data to all other N processors
    if (numberIterations == MaxIterations/2)
        MPI_Barrier()             -- wait for all processors -- CAN I DO THIS
    numberIterations++

Barriers should only be used if the correctness of the program depends on it. From your pseudocode, I can't tell if that's the case, but one barrier halfway through a loop looks very suspect.

Your code will deadlock, with or without a barrier. You receive in every rank before sending any data, so none of the ranks will ever get to a send call. Most applications will have a call such as MPI_Allreduce instead of a barrier after each iteration so all ranks can decide whether an error level is small enough, a task queue is empty, etc. and thus decide whether to terminate.

In this article it says that you have to call MPI_Request_free() before MPI_Bsend_init().


Loop to check condition in concurrent program

I'm reading a book about concurrency in Go (I'm learning it now) and I found this code:
c := sync.NewCond(&sync.Mutex{})
queue := make([]interface{}, 0, 10)

removeFromQueue := func(delay time.Duration) {
    time.Sleep(delay)
    c.L.Lock()
    queue = queue[1:]
    fmt.Println("Removed from queue")
    c.L.Unlock()
    c.Signal()
}

for i := 0; i < 10; i++ {
    c.L.Lock()
    // Why this loop?
    for len(queue) == 2 {
        c.Wait()
    }
    fmt.Println("Adding to queue")
    queue = append(queue, struct{}{})
    go removeFromQueue(1*time.Second)
    c.L.Unlock()
}
The problem is that I don't understand why the author introduces the for loop marked by the comment. As far as I can see, the program would be correct without it, but the author says that the loop is there because Cond only signals that something has happened, which doesn't mean that the state has truly changed.
In what case could that be possible?
Without the actual book at hand, and instead just some code snippets that seem out of context, it is hard to say what the author had in mind in particular. But we can guess. There is a general point about condition variables in most languages, including Go: waiting for some condition to be satisfied does require a loop in general. In some specific cases, the loop is not required.
The Go documentation is, I think, clearer about this. In particular, the text description for sync's func (c *Cond) Wait() says:
Wait atomically unlocks c.L and suspends execution of the calling goroutine. After later resuming execution, Wait locks c.L before returning. Unlike in other systems, Wait cannot return unless awoken by Broadcast or Signal.
Because c.L is not locked when Wait first resumes, the caller typically cannot assume that the condition is true when Wait returns. Instead, the caller should Wait in a loop:
for !condition() {
    c.Wait()
}
... make use of condition ...
I added bold emphasis to the phrase that explains the reason for the loop.
Whether you can omit the loop depends on more than one thing:
Under what condition(s) does another goroutine invoke Signal and/or Broadcast?
How many goroutines are running, and what might they be doing in parallel?
As the Go documentation says, there's one case we don't have to worry about in Go that we might in some other systems. In some systems, the equivalent of Wait can return even though Signal (or its equivalent) has not actually been invoked on the condition variable (a so-called spurious wakeup).
The queue example you've quoted is particularly odd because there is only one goroutine—the one running the for loop that counts to ten—that can add entries to the queue. The remaining goroutines only remove entries. So if the queue length is 2, and we pause and wait for a signal that the queue length has changed, the queue length can only have changed to either one or zero: no other goroutine can add to it and only the two goroutines we have created at this point can remove from it. This means that given this particular example, we have one of those cases where the loop is not required after all.
(It's also odd in that queue is given an initial capacity of 10, which is as many items as we'll put in, and then we start waiting when its length is exactly 2, so that we should not reach that capacity anyway. If we were to spin off additional goroutines that might add to the queue, the loop that waits while len(queue) == 2 could indeed be signaled by a removal that drops the count from 2 to 1 but not get a chance to resume until insertion occurs, pushing the count back up to 2. However, depending on the situation, that loop might not be resumed until two other goroutines have each added an entry, pushing the count to 3, for instance. So why repeat the loop when the length is exactly two? If the idea is to preserve queue slots, we should loop while the count is greater than or equal to 2.)
(Besides all this, the initial capacity is not relevant, as the queue will be dynamically resized to a larger slice if necessary.)

Spin off one statement in a loop parallel OpenMP

I have a large C++ project (Windows, VC++ 13) in which I have a costly loop. The loop is costly because one statement (a call to another function) is very expensive. I want to make that one line parallel. What I mean is, I don't want each call to wait for completion. On reaching that line, I want to spin it off as a separate thread. At the end of the loop I want all the threads to join, and then I want to continue execution. Is that possible? A simplified version of the problem I am solving is:
for (int i = 0; i < limit; i++)
{
    // some processing and getting some value into a variable, say x
    <value to use> += <costly function which takes x as a parameter>
    // some more processing
}
What I would like to do is, when it hits the line that calls the costly function, spin off a thread. At the end of the loop I would wait to join all the threads and continue execution. I obviously must not use an OpenMP parallel for on the loop, since that breaks the calculation of x and sends the same x to multiple calls of the function. I would also prefer to use OpenMP. Can somebody help with this, please?

3 queues + 1 finish or device-side checkpoints for all queues

Is there a special "wait for event" function that can wait for 3 queues at the same time on the device side, so that the host does not have to wait for each queue serially?
Is there a checkpoint command that can be sent into a command queue so that the queue must wait for the other command queues to reach the same (vertically aligned) barrier/checkpoint and then continue, all on the device side, so that no host-side round trip is needed?
For now, I tried two different versions:
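// Version 1
clWaitForEvents(3, evt_);

// Version 2
int evtStatus0 = 0;
clGetEventInfo(evt_[0], CL_EVENT_COMMAND_EXECUTION_STATUS,
    sizeof(cl_int), &evtStatus0, NULL);
while (evtStatus0 > 0)
    clGetEventInfo(evt_[0], CL_EVENT_COMMAND_EXECUTION_STATUS,
        sizeof(cl_int), &evtStatus0, NULL);
int evtStatus1 = 0;
clGetEventInfo(evt_[1], CL_EVENT_COMMAND_EXECUTION_STATUS,
    sizeof(cl_int), &evtStatus1, NULL);
while (evtStatus1 > 0)
    clGetEventInfo(evt_[1], CL_EVENT_COMMAND_EXECUTION_STATUS,
        sizeof(cl_int), &evtStatus1, NULL);
int evtStatus2 = 0;
clGetEventInfo(evt_[2], CL_EVENT_COMMAND_EXECUTION_STATUS,
    sizeof(cl_int), &evtStatus2, NULL);
while (evtStatus2 > 0)
    clGetEventInfo(evt_[2], CL_EVENT_COMMAND_EXECUTION_STATUS,
        sizeof(cl_int), &evtStatus2, NULL);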
The second one is a bit faster (I saw it in someone else's code), and both are executed after 3 flush commands.
Looking at CodeXL profiler results, first one waits longer between finish points and some operations don't even seem to be overlapping. Second one shows 3 finish points are all within 3 milliseconds so it is faster and longer parts are overlapped(read+write+compute at the same time).
If there is a way to achieve this with only 1 wait command from the host side, there must be a "flush" version of it too, but I couldn't find one.
Is there any way to achieve below picture instead of adding flushes between each pipeline step?
queue1 write checkpoint write checkpoint write
queue2 - compute checkpoint compute checkpoint compute
queue3 - checkpoint read checkpoint read
all checkpoints have to be vertically synchronized and all these actions must not start until a signal is given. Such as:
checkpoints are all handled in device side and only 3 finish commands are needed from host side(even better,only 1 finish for all queues?)
How I bind 3 queues to 3 events with "clWaitForEvents(3, evt_);" for now is:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[2]);
if this "enqueue barrier" can talk with other queues, how could I achieve that? Do I need to keep host-side events alive until all queues are finished or can I delete them or re-use them later? From the documentation, it seems like first barrier's event can be put to second queue and second one's barrier event can be put to third one along with first one's event so maybe it is like:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(evt_0, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(evt_0_and_1, &evt[2]);
in the end wait for only evt[2] maybe or using only 1 same event for all:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[2]);
where to get sameEvt object?
Has anyone tried this? Should I start all queues with a barrier so they don't start until I raise some event from the host side, or can the lazy execution of "enqueue" be trusted 100% not to start until I flush/finish the queues? How do I raise an event from host to device (sameEvt doesn't have a "raise" function; is it clCreateUserEvent)?
All 3 queues are in-order type and are in same context. Out-of-order type is not supported by all graphics cards. C++ bindings are being used.
Also there are enqueueWaitList(is this deprecated?) and clEnqueueMarker but I don't know how to use them and documentation doesn't have any example in Khronos' website.
You asked too many questions and described too many variants for me to give a single solution, so I will answer in general terms so that you can work out the most suitable solution.
If the queues are bound to the same context (possibly to different devices within that context), then it is possible to synchronize them through events. That is, you can obtain an event from a command submitted to one queue and use this event to synchronize a command submitted to another queue, e.g.
queue1.enqueue(comm1, /*dependency*/ NULL, /*result event*/ &e1);
queue2.enqueue(comm2, /*dependency*/ &e1, /*result event*/ NULL);
In this example, comm2 will wait for comm1 completion.
If you need to enqueue commands first but not allow them to execute yet, you can create a user event (clCreateUserEvent) and signal it manually (clSetUserEventStatus). The implementation is allowed to process commands as soon as they are enqueued (the driver is not required to wait for the flush).
The barrier seems like overkill for your purpose because it waits for all commands previously submitted to the queue. You can instead use clEnqueueMarker (clEnqueueMarkerWithWaitList in OpenCL 1.2 and later), which can wait for a list of events and provides one event to be used by other commands.
As far as I know, you can release the event at any moment once you no longer need it. The implementation should prolong the event's lifetime if it is required for internal purposes.
I do not know what enqueueWaitList is.
Off-topic: if you need non-trivial dependencies between calculations you may want to consider TBB flow graph and opencl_node. The opencl_node uses events for synchronization and avoids "host-device" synchronizations if possible. However, it can be tricky to use multiple queues for the same device.
As far as I know, Intel HD Graphics 530 supports out-of-order queues (at least host-side).
You are making it much harder than it needs to be. On the write queue take an event. Use that as a condition for the compute on the compute queue, and take another event. Use that as a condition on the read on the read queue. There is no reason to force any other synchronization. Note: My interpretation of the spec is that you must clFlush on a queue that you took an event from before using that event as a condition on another queue.

Golang Event Model

So my problem stems from the fact that I thought I'd be awfully clever and try to model hardware in Golang. Yes, use it for the sort of thing that in my day job I write in Verilog.
To start off I'm trying to do the simplest possible hardware pipeline. i.e. a data source that pushes data to a Reg stage which pushes it to the next Reg which pushes it to a sink stage. Things can only advance on the clock. The problem is the event manager seems to have it in for me.
Starting at the data being sent on a clock
select {
case out_chan <- tmp:
    front_chan = saved_front_chan
    out_chan = nil
    fmt.Println("Sent Data, ", tmp, name)
default:
    c.ckwg.Done() // Send failed so remove token
    if out_chan != nil {
        log.Fatal("Stalled, Can't send ", tmp, name)
    } else {
        fmt.Println("Nothing to send ", name)
    }
}
i.e. if we can send the data, do, if we can't then fine; but if we can't send data then the front end receiver (see main code) doesn't get re-enabled for reception.
This is sending to a front end which is a bit simpler
for itm := range in_chan {
    fmt.Println("Front Channel received", itm, name)
    saved_front_chan <- itm
}
Note the use of a sync.WaitGroup to make sure that this code is run before the end of the clock evaluation phase.
So far so good. The problem is that sometimes one of these front ends will have successfully executed "saved_front_chan <- itm" but will not yet have got back around to listening on in_chan.
This will mean that the previous stage in the pipeline will think it is stalled.
Now this would be fine in a data processing pipeline, but for something that's trying to model hardware to some degree of accuracy it is not. (For this test case I'm modelling a pipeline that tries to support stalls but has no need to stall, so doesn't) As far as I can see though there are no tricks left to me to force the scheduler to have the loop ready to read. Fundamentally I need to know the difference between "Channel can't receive because of pipeline back pressure" and "Channel can't receive because the scheduler hasn't got around to that routine yet."
So the full code is here:
For context, the specification I am actually trying to meet is a stage that only advances the data on a clock pulse, and that clock pulse is supplied to all stages "Simultaneously". Yes I know nothing is really in parallel but I'm trying to fake that behavior. Specifically it must appear that the inputs are sampled at the start of the clock phase and then the data is output at the end of the clock phase. Crucially in the faking of the parallelism it must not matter which order the reg stages receive their clock pulses. I have also experimented with the flow control not being done by golang channel back pressure but by a separate token system - but that got stupendously complex and error prone.
FWIW I have another trial version of this where the broadcast of the clock pulse is done with sync package broadcast structure - but that code got in the way of demonstrating the problem I'm having.
Thanks in advance.

PostMessage occasionally loses a message

I wrote a multi-threaded windows application where thread:
A – is a Windows form that handles user interaction and processes the data from B.
B – occasionally generates data and passes it to A.
A thread-safe queue is used to pass the data from thread B to A. The enqueue and dequeue functions are guarded using Windows critical section objects.
If the queue is empty when the enqueue function is called, the function will use PostMessage to tell A that there is data in the queue. The function checks to make sure the call to PostMessage is executed successfully and repeatedly calls PostMessage if it is not successful (PostMessage has yet to fail).
This worked well for quite some time until one specific computer started to lose the occasional message. By lose I mean that, PostMessage returns successfully in B but A never receives the message. This causes the software to appear frozen.
I have already come up with a couple of acceptable workarounds. I am interested in knowing why Windows is losing these messages and why this is only happening on the one computer.
Here is the relevant portions of the code.
// Only called by B
procedure TSharedQueue.Enqueue(AItem: TSQItem);
var
  B: boolean;
begin
  EnterCriticalSection(FQueueLock);
  if FCount > 0 then
  begin
    FLast.FNext := AItem;
    FLast := AItem;
  end
  else
  begin
    FFirst := AItem;
    FLast := AItem;
  end;
  if (FCount = 0) or (FCount mod 10 = 0) then // just in case a message is lost
    repeat
      B := PostMessage(FConsumer, SQ_HAS_DATA, 0, 0);
      if not B then
        Sleep(1000); // this line of code has never been reached
    until B;
  Inc(FCount);
  LeaveCriticalSection(FQueueLock);
end;
// Only called by A
function TSharedQueue.Dequeue: TSQItem;
begin
  EnterCriticalSection(FQueueLock);
  if FCount > 0 then
  begin
    Result := FFirst;
    FFirst := FFirst.FNext;
    Result.FNext := nil;
    Dec(FCount);
  end
  else
    Result := nil;
  LeaveCriticalSection(FQueueLock);
end;
// procedure called when SQ_HAS_DATA is received
procedure TfrmMonitor.SQHasData(var AMessage: TMessage);
var
  Item: TSQItem;
begin
  while FMessageQueue.Count > 0 do
  begin
    Item := FMessageQueue.Dequeue;
    // use the Item somehow
  end;
end;
Is FCount also protected by FQueueLock? If not, then your problem lies with FCount being incremented after the posted message is already processed.
Here's what might be happening:
B enters critical section
B calls PostMessage
A receives the message but doesn't do anything since FCount is 0
B increments FCount
B leaves critical section
A sits there like a duck
A quick remedy would be to increment FCount before calling PostMessage.
Keep in mind that things can happen quicker than one would expect (i.e. the message posted with PostMessage being caught and processed by another thread before you have a chance to increment FCount a few lines later), especially when you're in a true multi-threaded environment (multiple CPUs). That's why I asked earlier if the "problem machine" had multiple CPUs/cores.
An easy way to troubleshoot problems like these is to scaffold the code with additional logging that records every time you enter a method, enter or leave a critical section, etc. Then you can analyze the log to see the true order of events.
On a separate note, a nice little optimization that can be done in a producer/consumer scenario like this is to use two queues instead of one. When the consumer wakes up to process the full queue, you swap the full queue with an empty one and just lock/process the full queue while the new empty queue can be populated without the two threads trying to lock each other's queues. You'd still need some locking in the swapping of the two queues though.
If the queue is empty when the enqueue
function is called, the function will
use PostMessage to tell A that there
is data in the queue.
Are you locking the message queue before checking the queue size and issuing the PostMessage? You may be experiencing a race condition where you check the queue and find it non-empty when in fact A is processing the very last message and is about to go idle.
To see if you're in fact experiencing a race condition and not a problem with PostMessage, you could switch to using an event. The worker thread (A) would wait on the event instead of waiting for a message. B would simply set that event instead of posting a message.
This worked well for quite some time
until one specific computer started to
lose the occasional message.
By any chance, is the number of CPUs or cores in this specific computer different from the others where you see no problem? Sometimes when you switch from a single-CPU machine to a machine with more than one physical CPU/core, new race conditions or deadlocks may arise.
Could there be a second instance unknowingly running and eating the messages, marking them as handled?
