MPI Odd/Even Compare-Split Deadlock

I'm trying to write an MPI version of a program that runs an odd/even compare-split operation on n randomly generated elements.
Process 0 should generate the elements and send nlocal of them to each of the other processes (keeping the first nlocal for itself). From there, process 0 should print its results after running the CompareSplit algorithm, then receive the results from the other processes' runs of the algorithm, and finally print out the results it has just received.
I have a large chunk of this already done, but I'm getting a deadlock that I can't seem to fix. I would greatly appreciate any hints that people could give me.
Here is my code http://pastie.org/3742474
Right now I'm pretty sure that the deadlock is coming from the Send/Recv at lines 134 and 151. I've tried changing the Send to use "tag" instead of myrank for the tag parameter, but when I did that I kept getting "MPI_ERR_TAG: invalid tag" for some reason.
Obviously I would also run the algorithm within the processes with rank > 0, but I took that part out for now until I figure out what is going wrong.
Any help is appreciated.
EDIT: I've written a smaller test case that doesn't contain any CompareSplit operations but is still deadlocking. http://pastie.org/3744691
I fixed the above test case by changing the tag at line 83 from "myrank" to "tag".
Well, the test case works, but when the actual algorithm is added in, as in my program, it deadlocks.
So, I think I've narrowed the deadlock down to this chunk of code. It looks to be the Sendrecv under the else.
for (i = 1; i <= npes; i++) {
    if (i % 2 == 1) // odd phase
        MPI_Sendrecv(elmnts, nlocal, MPI_INT, oddrank, 1, relmnts,
                     nlocal, MPI_INT, oddrank, 1, MPI_COMM_WORLD, &status);
    else // even phase
        MPI_Sendrecv(elmnts, nlocal, MPI_INT, evenrank, 1, relmnts,
                     nlocal, MPI_INT, evenrank, 1, MPI_COMM_WORLD, &status);

    CompareSplit(nlocal, elmnts, relmnts, wspace,
                 myrank < status.MPI_SOURCE);
}

The tag error was because tags have to be non-negative integers, in the range from 0 up to an implementation-dependent maximum (MPI_TAG_UB) which is guaranteed to be at least 32767.
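If you want to double-check the bound on your MPI implementation, you can query it at run time (a small illustrative fragment, assuming MPI_Init has already been called):

int *tag_ub, flag;
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
if (flag)
    printf("largest valid tag: %d\n", *tag_ub);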
The deadlock is pretty easy to understand; look at what the non-rank-zero processes are doing:
else {
    // The rest of the processes
    // Receive nlocal randomly generated elements from process 0
    MPI_Recv(elmnts, nlocal, MPI_INT, 0, tag, comm, &status);
    qsort(elmnts, nlocal, sizeof(int), IncOrder); // does it matter where we sort at?
    // Send results back to process 0
    MPI_Send(elmnts, nlocal, MPI_INT, 0, myrank, comm);
}
So they're doing one receive, and one send back. But process 0 is doing much more than this; it sends everyone their data, then executes a bunch of send-receives to process 1 (evenrank) and MPI_PROC_NULL (oddrank). The send-receives to MPI_PROC_NULL are no-ops, but the send-receives to process 1 will never be answered, because process 1 isn't doing the same thing.
I think you need to move that part of the algorithm outside of the if (rank == 0) test.
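Very roughly, the fixed structure might look like the sketch below. This is only an outline, not your actual code: variable names follow the snippets above, and it assumes oddrank/evenrank are set to MPI_PROC_NULL when the partner is out of range, as in the usual formulation of this algorithm.

if (myrank == 0) {
    // generate all n elements, keep the first nlocal, distribute the rest
    // (assuming rank 0 generated everything into one large array)
    for (i = 1; i < npes; i++)
        MPI_Send(elmnts + i * nlocal, nlocal, MPI_INT, i, 1, MPI_COMM_WORLD);
} else {
    MPI_Recv(elmnts, nlocal, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
qsort(elmnts, nlocal, sizeof(int), IncOrder);

// every rank, not just rank 0, takes part in all npes phases
for (i = 1; i <= npes; i++) {
    int partner = (i % 2 == 1) ? oddrank : evenrank;
    MPI_Sendrecv(elmnts, nlocal, MPI_INT, partner, 1,
                 relmnts, nlocal, MPI_INT, partner, 1,
                 MPI_COMM_WORLD, &status);
    if (partner != MPI_PROC_NULL) // nothing to merge on a null exchange
        CompareSplit(nlocal, elmnts, relmnts, wspace, myrank < partner);
}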

It looks like you are calling MPI_Sendrecv [line 113], but there is no process with rank oddrank to answer it, because oddrank == -1.

Related

MPI_Send does not work with higher buffer size?

When the MPI_Send buffer size is 100 the program works, but it gets stuck when the size is 1000 or greater. Why?
if (id == 0) {
    rgb_image = stbi_load(argv[1], &width, &height, &bpp, CHANNEL_NUM);
    for (int i = 0; i < size - 1; i++)
        MPI_Send(rgb_image, 1000, MPI_UINT8_T, i, 0, MPI_COMM_WORLD);
}
uint8_t *part = (uint8_t*) malloc(sizeof(uint8_t) * 1000);
if (id != size - 1 && size > 1)
    MPI_Recv(part, 1000, MPI_UINT8_T, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
This program is not valid w.r.t. the MPI standard, since there is no matching receive (on rank 0) for
MPI_Send(..., dest=0, ...)
MPI_Send() is allowed to block until a matching receive is posted (and that generally happens when the message is "large") ... and the required matching receive never gets posted.
A typical fix would be to post an MPI_Irecv(..., src = 0, ...) on rank 0 before the MPI_Send() (and an MPI_Wait() after it), or to handle the 0 -> 0 communication with MPI_Sendrecv().
That being said, it would likely be more efficient to create a communicator with all the ranks minus the last one, and MPI_Bcast() in this communicator.
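For what it's worth, the first fix might look roughly like this (a sketch built from the question's snippet, not tested against the full program):

uint8_t *part = (uint8_t*) malloc(sizeof(uint8_t) * 1000);
MPI_Request req = MPI_REQUEST_NULL;

if (id == 0) {
    rgb_image = stbi_load(argv[1], &width, &height, &bpp, CHANNEL_NUM);
    // post the matching receive for the 0 -> 0 send before sending
    MPI_Irecv(part, 1000, MPI_UINT8_T, 0, 0, MPI_COMM_WORLD, &req);
    for (int i = 0; i < size - 1; i++)
        MPI_Send(rgb_image, 1000, MPI_UINT8_T, i, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
} else if (id != size - 1) {
    // the other ranks (except the last) receive as before
    MPI_Recv(part, 1000, MPI_UINT8_T, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}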
If a program works for small buffers but not for large ones, you are probably running into "eager sends". Normally, a send/receive transaction involves the sender and receiver talking back and forth, confirming that the data went across. This is overhead, so for small messages, many MPI implementations will just send the data without confirmation. The data then goes into some secret buffer on the receiver.
But this means that your program can "succeed" even though it's not a correct program, as is the case here. See @Gilles's answer.
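One quick way to expose this class of bug regardless of message size is to temporarily swap MPI_Send for MPI_Ssend, which always waits for the matching receive; with it, the incorrect program above hangs even at a buffer size of 100:

MPI_Ssend(rgb_image, 100, MPI_UINT8_T, i, 0, MPI_COMM_WORLD);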

Static Analysis erroneously reports out of bounds access

While reviewing a codebase, I came upon a particular piece of code that triggered a warning regarding an "out of bounds access". After looking at the code, I could not see a way for the reported access to happen - and tried to minimize the code to create a reproducible example. I then checked this example with two commercial static analysers that I have access to - and also with the open-source Frama-C.
All 3 of them see the same "out of bounds" access.
I don't. Let's have a look:
 3  extern int checker(int id);
 4  extern int checker2(int id);
 5
 6  int compute(int *q)
 7  {
 8      int res = 0, status;
 9
10      status = checker2(12);
11      if (!status) {
12          status = 1;
13          *q = 2;
14          for (int i = 0; i < 2 && 0 != status; i++) {
15              if (checker(i)) {
16                  res = i;
17                  status = checker2(i);
18              }
19          }
20      }
21      if (!status)
22          *q = res;
23      return status;
24  }
25
26  int someFunc(int id)
27  {
28      int p;
29      extern int data[2];
30
31      int status = checker2(132);
32      status |= compute(&p);
33      if (status == 0) {
34          return data[p];
35      } else
36          return -1;
37  }
Please don't try to judge the quality of the code, or why it does things the way it does. This is a hacked, cropped and mutated version of the original, with the sole intent being to reach a small example that demonstrates the issue.
All analysers I have access to report the same thing - that the indexing in the caller at line 34, doing the return data[p], may read via the invalid index 2. Here's the output from Frama-C - but note that the two commercial static analysers provide exactly the same assessment:
$ frama-c -val -main someFunc -rte why.c |& grep warning
...
why.c:34:[value] warning: accessing out of bounds index. assert p < 2;
Let's step the code in reverse, to see how this out of bounds access at line 34 can happen:
To end up at line 34, the status returned from both calls - to checker2 and to compute - must be 0.
For compute to return 0 (at line 32 in the caller, line 23 in the callee), we must have performed the assignment at line 22 - since it is guarded at line 21 with a check for status being 0. So we wrote into the passed-in pointer q whatever was stored in the variable res. This pointer points to the variable used to perform the indexing - the supposed out-of-bounds index.
So, to experience an out of bounds access into the data, which is dimensioned to contain exactly two elements, we must have written a value that is neither 0 nor 1 into res.
We write into res via the for loop at line 14, which conditionally assigns into res; if it does assign, the value it writes will be one of the two valid indexes, 0 or 1 - because those are the values the for loop allows through (it is bounded by i<2).
Due to the initialization of status at line 12, if we do reach line 12, we will for sure enter the loop at least once. And if we do write into res, we will write a nice valid index.
What if we don't write into it, though? The "default" setup at line 13 has written a "2" into our target - which is probably what scares the analysers. Can that "2" indeed escape out into the caller?
Well, it doesn't seem so... if the status check at either line 11 or line 21 fails, we will return with a non-zero status; so whatever value we wrote (or didn't write, leaving it uninitialised) into the passed-in q is irrelevant: the caller will not read that value, due to the check at line 33.
So either I am missing something and there is indeed a scenario that leads to an out of bounds access with index 2 at line 34 (how?) or this is an example of the limits of mainstream formal verification.
Help?
When dealing with a case such as having to distinguish between == 0 and != 0 inside a range such as [INT_MIN; INT_MAX], you need to tell Frama-C/Eva to split the cases.
By adding //@ split annotations in the appropriate spots, you can tell Frama-C/Eva to maintain separate states, thus preventing it from merging them before status is evaluated.
Here's how your code would look in this case (courtesy of @Virgile):
extern int checker(int id);
extern int checker2(int id);

int compute(int *q)
{
    int res = 0, status;
    status = checker2(12);
    //@ split status <= 0;
    //@ split status == 0;
    if (!status) {
        status = 1;
        *q = 2;
        for (int i = 0; i < 2 && 0 != status; i++) {
            if (checker(i)) {
                res = i;
                status = checker2(i);
            }
        }
    }
    //@ split status <= 0;
    //@ split status == 0;
    if (!status)
        *q = res;
    return status;
}

int someFunc(int id)
{
    int p;
    extern int data[2];

    int status = checker2(132);
    //@ split status <= 0;
    //@ split status == 0;
    status |= compute(&p);
    if (status == 0) {
        return data[p];
    } else
        return -1;
}
int someFunc(int id)
{
int p;
extern int data[2];
int status = checker2(132);
//# split status <= 0;
//# split status == 0;
status |= compute(&p);
if (status == 0) {
return data[p];
} else
return -1;
}
In each case, the first split annotation tells Eva to consider the cases status <= 0 and status > 0 separately; this allows "breaking" the interval [INT_MIN, INT_MAX] into [INT_MIN, 0] and [1, INT_MAX]; the second annotation allows separating [INT_MIN, 0] into [INT_MIN, -1] and [0, 0]. When these 3 states are propagated separately, Eva is able to precisely distinguish between the different situations in the code and avoid the spurious alarm.
You also need to allow Frama-C/Eva some margin for keeping the states separated (by default, Eva will optimize for efficiency, merging states somewhat aggressively); this is done by adding -eva-precision 1 (higher values may be required for your original scenario).
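Putting those pieces together, the run from the question might become something like this (same why.c as above; -eva-precision is available in recent Frama-C releases):

$ frama-c -eva -main someFunc -eva-precision 1 why.c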
Related options: -eva-domains sign (previously -eva-sign-domain) and -eva-partition-history N
Frama-C/Eva also has other options which are related to splitting states; one of them is the signs domain, which computes information about sign of variables, and is useful to distinguish between 0 and non-zero values. In some cases (such as a slightly simplified version of your code, where status |= compute(&p); is replaced with status = compute(&p);), the sign domain may help splitting without the need for annotations. Enable it using -eva-domains sign (-eva-sign-domain for Frama-C <= 20).
Another related option is -eva-partition-history N, which tells Frama-C to keep the states partitioned for longer.
Note that keeping states separated is a bit costly in terms of analysis, so it may not scale when applied to the "real" code, if it contains several more branches. Increasing the values given to -eva-precision and -eva-partition-history may help, as well as adding # split annotations.
I'd like to add some remarks which will hopefully be useful in the future:
Using Frama-C/Eva effectively
Frama-C contains several plug-ins and analyses. Here in particular, you are using the Eva plug-in. It performs an analysis based on abstract interpretation that reports all possible runtime errors (undefined behaviors, as the C standard puts it) in a program. Using -rte is thus unnecessary, and adds noise to the result. If Eva cannot be certain about the absence of some alarm, it will report it.
Replace the -val option with -eva. It's the same thing, but the former is deprecated.
If you want to improve precision (to remove false alarms), add -eva-precision N, where 0 <= N <= 11. In your example program, it doesn't change much, but in complex programs with multiple callstacks, extra precision will take longer but minimize the number of false alarms.
Also, consider providing a minimal specification for the external functions, to avoid warnings; here they contain no pointers, but if they did, you'd need to provide an assigns clause to explicitly tell Frama-C whether the functions modify such pointers (or any global variables, for instance).
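For instance, a minimal contract for the two external functions could be written as follows; the \from clauses are guesses at what the real functions depend on, so adjust them to match the actual behavior:

/*@ assigns \result \from id; */
extern int checker(int id);

/*@ assigns \result \from id; */
extern int checker2(int id);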
Using the GUI and Studia
With the Frama-C graphical interface and the Studia plug-in (accessible by right-clicking an expression of interest and choosing the popup menu Studia -> Writes), and using the Values panel in the GUI, you can easily track what the analysis inferred, and better understand where the alarms and values come from. The only downside is that it does not report exactly where merges happen. For the most precise results possible, you may need to add calls to an Eva built-in, Frama_C_show_each(exp), and put it inside a loop to get Eva to display, at each iteration of its analysis, the values contained in exp.
See section 9.3 (Displaying intermediate results) of the Eva user manual for more details, including similar built-ins (such as Frama_C_domain_show_each and Frama_C_dump_each, which show information about abstract domains). You may need to #include "__fc_builtin.h" in your program. You can use #ifdef __FRAMAC__ to allow the original code to compile when including this Frama-C-specific file.
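As an illustration, inside the loop of compute that could look roughly like this (a sketch for the analysis only; the suffix after Frama_C_show_each_ is free-form and merely labels the output):

#ifdef __FRAMAC__
#include "__fc_builtin.h"
#endif
/* any function named Frama_C_show_each_* is displayed by Eva;
   declare it so that a regular compiler also accepts the file */
void Frama_C_show_each_loop(int i, int res, int status);

/* ... inside compute(): */
for (int i = 0; i < 2 && 0 != status; i++) {
    Frama_C_show_each_loop(i, res, status); /* Eva prints these abstract values at each step */
    if (checker(i)) {
        res = i;
        status = checker2(i);
    }
}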
Being nitpicky about the term erroneous reports
Frama-C is a semantic-based tool whose main analyses are exhaustive, but may contain false positives: Frama-C may report alarms when they do not happen, but it should never forget any possible alarm. It's a trade-off, you can't have an exact tool in all cases (though, in this example, with sufficient -eva-precision, Frama-C is exact, as in reporting only issues which may actually happen).
In this sense, erroneous would mean that Frama-C "forgot" to indicate some issue, and we'd be really concerned about it. Indicating an alarm where it may not happen is still problematic for the user (and we work to improve it, so such situations should happen less often), but not a bug in Frama-C, and so we prefer using the term imprecisely, e.g. "Frama-C/Eva imprecisely reports an out of bounds access".

MPI_Waitall() behavior given MPI_Request array with possibly uninitialized slots for asynchronous send/recv

I have come across a scenario in which I need to allocate a static array of type MPI_Request for keeping track of asynchronous send and receive MPI operations. I have a total of 8 Isend and Irecv operations - 4 of which are Isend and the remaining 4 are Irecv. However, I do not call these 8 functions all at once. Depending on the incoming data, the functions are called in pairs, which means I may call 1 send/receive pair, or 2, or 3, or all 4 at once. That they will be called in pairs is certain, but how many pairs will be called is not. Below is pseudo-code:
MPI_Request reqs[8];
MPI_Status stats[8];

if (Rank A exists) {
    // The process has to send data to A and receive data from A
    MPI_Isend(A, ..., &reqs[0]);
    MPI_Irecv(A, ..., &reqs[1]);
}
if (Rank B exists) {
    // The process has to send data to B and receive data from B
    MPI_Isend(B, ..., &reqs[2]);
    MPI_Irecv(B, ..., &reqs[3]);
}
if (Rank C exists) {
    // The process has to send data to C and receive data from C
    MPI_Isend(C, ..., &reqs[4]);
    MPI_Irecv(C, ..., &reqs[5]);
}
if (Rank D exists) {
    // The process has to send data to D and receive data from D
    MPI_Isend(D, ..., &reqs[6]);
    MPI_Irecv(D, ..., &reqs[7]);
}

// Wait for asynchronous operations to complete
MPI_Waitall(8, reqs, stats);
Now, I am not sure what the behavior of the program will be. There are a total of 8 distinct asynchronous send and receive calls, and one slot in reqs[8] for each call, but not all of the functions will always be used. When some of them are not called, some slots in reqs[8] will be uninitialized. However, I need MPI_Waitall(8, reqs, stats) to return regardless of whether all slots in reqs[8] are initialized or not.
Could someone explain how the program might behave in this particular scenario?
You could set / initialize those missing requests to MPI_REQUEST_NULL. That said, why not just
int count = 0;
...
MPI_Isend(A, ..., &reqs[count++]);
...
MPI_Waitall(count, reqs, stats);
Of course, leaving the value uninitialized and feeding it to some function that reads from it is not a good idea.
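A sketch combining both suggestions, following the shape of the pseudo-code above (the buffers, counts, and rank variables are placeholders): a request left at MPI_REQUEST_NULL is treated by MPI_Waitall as already complete, so either variant returns.

MPI_Request reqs[8];
MPI_Status  stats[8];

for (int i = 0; i < 8; i++)
    reqs[i] = MPI_REQUEST_NULL;   // option 1: null out unused slots up front

int count = 0;                    // option 2: pack the used slots tightly
if (rankA >= 0) {                 // "Rank A exists"
    MPI_Isend(sendA, n, MPI_INT, rankA, 0, MPI_COMM_WORLD, &reqs[count++]);
    MPI_Irecv(recvA, n, MPI_INT, rankA, 0, MPI_COMM_WORLD, &reqs[count++]);
}
// ... same pattern for ranks B, C, D ...

MPI_Waitall(count, reqs, stats);  // with option 1 alone, MPI_Waitall(8, reqs, stats) also works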

Atomic Broadcast Exercise

I'm trying to solve exercise 5.10 of the book
"Foundations of Multithreaded, Parallel, and Distributed Programming".
The exercise is:
"Assume one producer process and N consumer processes share a bounded buffer having B slots. The producer deposits messages in the buffer; consumers fetch them. Every message deposited by the producer is to be received by all N consumers. Furthermore, each consumer is to receive the messages in the order they were deposited. However, consumers can receive messages at different times. For example, one consumer could receive up to B more messages than another if the second consumer is slow.
Develop a monitor that implements this kind of communication. Use the Signal and Continue discipline."
Can someone help me, please?
Thank you very much!
--
EDIT:
Here is what I have done so far (I left it out at first because I thought it would make the question too long).
/* creating a buffer of B positions */
global buffer[B];

Monitor {
    cond ok_write;
    cond ok_read;
    int stamp_buffer[B] = [0, 0, ..., 0];

    request_write (int pos) {
        if (stamp_buffer[pos] > 0)
            wait(ok_write);
        write_message(buffer[pos]);
        stamp_buffer[pos] = N;
        signalAll(ok_read);
    }

    request_read (int pos) {
        if (stamp_buffer[pos] == 0)
            wait(ok_read);
        stamp_buffer[pos]--;
    }

    release_read (int pos) {
        if (stamp_buffer[pos] == 0)
            signal(ok_write);
    }
}
So, I think I still have this problem: "A reader can read the same message two times."
The basic idea of my algorithm is:
The writer writes to a position pos and sets the value of stamp[pos] to N.
Then, each time a reader reads position pos, it decrements stamp[pos].
So, if stamp[pos] is zero, the message buffer[pos] has already been read N times and the writer can write to this position again.
But if some reader reads a message two times (or more), the writer can write a new message into position pos and some reader will never see the old message.
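For what it's worth, one way to eliminate the double-read problem is to give each consumer its own read cursor instead of sharing a per-slot counter: the cursor determines exactly which message each consumer reads next, so it can never fetch the same slot twice. Below is a minimal sketch of that idea with POSIX threads (all names are illustrative, and this is a different bookkeeping scheme than the stamp approach above). Note that pthread condition variables naturally give Signal-and-Continue semantics: a signalled thread re-acquires the lock and re-checks its condition.

#include <pthread.h>

#define B 8   /* buffer slots */
#define N 4   /* consumers   */

typedef struct {
    int buf[B];
    long deposited;           /* total messages deposited so far      */
    long fetched[N];          /* messages fetched, per consumer       */
    pthread_mutex_t lock;     /* init with PTHREAD_MUTEX_INITIALIZER  */
    pthread_cond_t ok_write;  /* producer waits here when buffer full */
    pthread_cond_t ok_read;   /* consumers wait here when caught up   */
} monitor_t;

static long slowest(monitor_t *m) {  /* fewest messages fetched by any consumer */
    long min = m->fetched[0];
    for (int c = 1; c < N; c++)
        if (m->fetched[c] < min)
            min = m->fetched[c];
    return min;
}

void deposit(monitor_t *m, int msg) {
    pthread_mutex_lock(&m->lock);
    /* slot (deposited % B) is reusable only once every consumer has read it,
       i.e. the slowest consumer is at most B-1 messages behind */
    while (m->deposited - slowest(m) == B)
        pthread_cond_wait(&m->ok_write, &m->lock);
    m->buf[m->deposited % B] = msg;
    m->deposited++;
    pthread_cond_broadcast(&m->ok_read);
    pthread_mutex_unlock(&m->lock);
}

int fetch(monitor_t *m, int id) {    /* id in 0..N-1 */
    pthread_mutex_lock(&m->lock);
    while (m->fetched[id] == m->deposited)   /* nothing new for this consumer */
        pthread_cond_wait(&m->ok_read, &m->lock);
    int msg = m->buf[m->fetched[id] % B];    /* each message read once, in order */
    m->fetched[id]++;
    pthread_cond_broadcast(&m->ok_write);    /* the slowest reader may have advanced */
    pthread_mutex_unlock(&m->lock);
    return msg;
}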

OpenCL, Is this a normal execution time?

I'm working on an algorithm using OpenCL and I need to measure the execution time of its parallel and sequential versions. To do this, I'm using an external loop to iterate both codes and measure their times, but I have obtained:
Sequential: 3.06 s
Parallel: 269 s
The code that I'm using for the parallel version is:
t_start = clock(); /* Start measuring time */
for (i = 0; i <= N; i++) // N is really big, around a million, but is the same for both versions
{
    fitness = 0;
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, NULL, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vdistance, CL_TRUE, 0, siz_mem_distance_code, distance_code, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vsumatorio, CL_TRUE, 0, siz_mem_sumatorio, sumatorio, 0, NULL, NULL);
    fitness = (1/(*sumatorio)) + (*distance_code/12) + ((pow(*distance_code,2))/4) + ((pow(*distance_code,3))/6);
}
t_finish = clock(); /* End measuring time */
t_finish=clock(); /* End measuring time */
Before this piece of code, I create/initialize everything needed to run an OpenCL program (platform, device, context, queue, buffers, kernel, ...), and after this code, I release everything.
I have checked that this increase in time is due to reading both variables (distance_code and sumatorio) in each iteration, but I must do it because the fitness computation is a sequential instruction that can only be executed once the kernel has finished. So... could you help me? What am I doing wrong?
I hope to have explained myself properly, thanks in advance.
Note: I'm only working with the CPU.
The overhead of launching so many kernels exceeds the benefit of parallelizing a for loop over only 64 data items. You need to rewrite your problem so that you launch relatively few kernels over large batches of data. In that case, and provided the OpenCL compiler generates appropriately vectorized machine code, you would see an improvement over the sequential version.
Additionally, you should check with either AMD's CodeXL or Intel's Offline Compiler whether the generated code contains any vector instructions.
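As a rough illustration of the batching idea: this sketch assumes the kernel has been rewritten to process the data for many iterations in a single launch, and the function and buffer names (run_batched, vresults) are made up for the example.

#include <CL/cl.h>

/* enqueue one big launch and one read instead of a million small ones */
cl_int run_batched(cl_command_queue queue, cl_kernel batched_kernel,
                   cl_mem vresults, size_t n_iters, size_t items_per_iter,
                   float *results)
{
    cl_int ret;
    size_t global = n_iters * items_per_iter;  /* all work-items at once */
    ret = clEnqueueNDRangeKernel(queue, batched_kernel, 1, NULL,
                                 &global, NULL, 0, NULL, NULL);
    if (ret != CL_SUCCESS)
        return ret;
    /* one blocking read for all per-iteration results */
    return clEnqueueReadBuffer(queue, vresults, CL_TRUE, 0,
                               n_iters * sizeof(float), results,
                               0, NULL, NULL);
}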
