MPI communication stalls when node is only partially reserved

This is a tricky one; I will try to describe it as accurately as I can. I inherited a Fortran program consisting of a couple of thousand lines of code (in one subroutine) that uses MPI to parallelize computations. Fortunately it uses only a few MPI calls; here they are:
call mpi_gather(workdr(ipntr(2)),int(icount/total),mpi_double_precision,&
& workdr(ipntr(2)),int(icount/total),mpi_double_precision,0,mpi_comm_world,mpierr)
call mpi_gather(workdi(ipntr(2)),int(icount/total),mpi_double_precision,&
& workdi(ipntr(2)),int(icount/total),mpi_double_precision,0,mpi_comm_world,mpierr)
A couple of dozen lines later this is followed by:
call mpi_bcast(workdr(istart),j,mpi_double_precision,total-1,&
& mpi_comm_world,mpiierr)
call mpi_bcast(workdi(istart),j,mpi_double_precision,total-1,&
& mpi_comm_world,mpiierr)
call mpi_bcast(workdr(ipntr(2)),icount,mpi_double_precision,0,mpi_comm_world,mpiierr)
call mpi_bcast(workdi(ipntr(2)),icount,mpi_double_precision,0,mpi_comm_world,mpiierr)
Both sets of calls and the surrounding code sit inside an if statement that is evaluated on each rank:
call znaupd ( ido, bmat, n, which, nev, tol, resid, ncv,&
& v, ldv, iparam, ipntr, workd, workl, lworkl,&
& rworkl,info )
if (ido .eq. -1 .or. ido .eq. 1) then
[...code here...]
[...mpi code here...]
[...couple of dozen lines...]
[...mpi code here...]
[...code here...]
end if
This code compiles, runs, and produces reasonable results (it's a physics simulation).
It runs fine on a single node, tested with anything from 4 to 64 CPUs.
It also runs fine across multiple nodes, as long as each node is reserved completely, i.e.
node1 using 24 of 24 cpus
node2 using 24 of 24 cpus
But it stalls when each node is only partially reserved, e.g.
node1 using 12 of 24 cpus
node2 using 12 of 24 cpus
To be more precise, it is the rank 0 process that stalls. The above code runs in a loop until the if statement evaluates to false. The output shows that the loop executes several times, but then it happens: several of the processes evaluate the if statement as false and exit the loop, while rank 0 evaluates it as true and stalls when it calls mpi_gather.
You are probably thinking that this cannot be answered without seeing the full code, and that there must be something that causes the if statement to be evaluated incorrectly on rank 0.
Consider, however, that it runs fine with an arbitrary number of processes on a single node, and with an arbitrary number of nodes as long as all processors on each node are reserved.
My thoughts on this so far and my questions:
Are the above MPI calls blocking? My understanding is that they are buffered, so that even if some of the processes continue executing, their messages are saved in the receiver's buffer and cannot be lost. Is that correct?
Has anybody else ever experienced an issue similar to this: stalling when not all processors on a node are reserved? I have to admit I am sort of lost on this; I don't really know where to start debugging. Any hints and pointers are greatly appreciated.
This issue has been reproduced on different clusters with different compilers and different mpi implementations. It really seems to be an issue in the code.
Thanks so much for your help. Any ideas are greatly appreciated.
EDIT: Here are further details of our system:
MPI environment: MVAPICH2 v1.6
Compiler: Intel ifort 13.2
The ARPACK library used is the standard ARPACK, not PARPACK; the code takes care of the parallelization itself (to optimize memory usage).
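For reference, one sanity check that can help in a situation like this (just a sketch, reusing the variable names from the snippets above plus two hypothetical temporaries ido_min and ido_max) is to verify, immediately after the call to znaupd and before the if statement, that every rank has arrived at the same value of ido, since the collectives can only complete if all ranks enter the branch together:
call mpi_allreduce(ido, ido_min, 1, mpi_integer, mpi_min, mpi_comm_world, mpierr)
call mpi_allreduce(ido, ido_max, 1, mpi_integer, mpi_max, mpi_comm_world, mpierr)
if (ido_min .ne. ido_max) write(*,*) 'ranks disagree on ido:', ido_min, ido_max
If the printed values ever differ, the ranks really are taking different branches in the same iteration; if the check never fires but the gather still hangs, the problem is more likely below the application code.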

It turns out the problem was not in the code; it was the MPI implementation! Initially I hadn't thought of this, since I was running the program on two different clusters using different MPI implementations (MVAPICH and Intel MPI). However, it turns out that both were derived from the same eight-year-old MPICH implementation. After upgrading to a more recent version of MVAPICH, which is derived from a more recent version of MPICH, the odd behavior stopped and the code runs as expected.
Thanks again to everybody who provided comments.

Related

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition occurs here because all the threads are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as there are threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
Do 10, i=-(b-1),b-1
Do 20, j=-(b-1),b-1
if (abs(i).le.l.and.abs(j).eq.d) then
cycle
endif
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
if (k.eq.n-1) then
vtest(i,j,1)=v(i,j)
endif
if (k.eq.n) then
vtest(i,j,2)=v(i,j)
endif
20 continue
10 continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values of i and j, is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v, it is, in practice, not deterministic, as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on the cache at some level in your system's memory hierarchy: a nice, cache-friendly run over every element of an array in memory order in a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory, requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well, the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i, which means that they write to different elements of v, removing (almost) the data race. As you've written the program, each thread gets a different set of values of k, but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program; why not just
do k=1,n-2
and deal with the updates to vtest after the loop?
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g. private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these, or on using a schedule clause.
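To make that concrete, here is a minimal sketch of the restructuring I have in mind, assuming v, vtest, b, l, d and n are declared as in the question; the vtest updates are hoisted out of the main k loop as suggested earlier, and default(none) forces an explicit scope for every variable. (It is only a sketch, and, as the next paragraph explains, it still does not make the stencil update itself deterministic.)
!$omp parallel default(none) shared(v, vtest, b, l, d, n) private(i, j, k)
do k = 1, n-2
!$omp do
   do i = -(b-1), b-1
      do j = -(b-1), b-1
         if (abs(i) .le. l .and. abs(j) .eq. d) cycle
         v(i,j) = 0.25*(v(i+1,j) + v(i-1,j) + v(i,j+1) + v(i,j-1))
      end do
   end do
!$omp end do
end do
! the last two sweeps also record v for the convergence check
do k = 1, 2
!$omp do
   do i = -(b-1), b-1
      do j = -(b-1), b-1
         if (abs(i) .le. l .and. abs(j) .eq. d) cycle
         v(i,j) = 0.25*(v(i+1,j) + v(i-1,j) + v(i,j+1) + v(i,j-1))
         vtest(i,j,k) = v(i,j)
      end do
   end do
!$omp end do
end do
!$omp end parallel
The implicit barrier at each !$omp end do keeps the sweeps separated, and within a sweep each thread works on its own set of i values.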
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.
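(Following up on that last point, since it was left hanging: one common way to make the update deterministic is a Jacobi-style sweep that reads only the old values of v and writes into a second array, copied back after each sweep. In the fragment below, which would sit inside the k loop of the sketch above, vnew is a hypothetical work array with the same bounds as v, introduced just for this illustration; it would need to be added to the shared clause. Note that this also changes the numerical scheme from in-place relaxation to Jacobi iteration, so the convergence behaviour will not be identical.)
!$omp do
do i = -(b-1), b-1
   do j = -(b-1), b-1
      if (abs(i) .le. l .and. abs(j) .eq. d) then
         vnew(i,j) = v(i,j)   ! masked cells are carried over unchanged
      else
         vnew(i,j) = 0.25*(v(i+1,j) + v(i-1,j) + v(i,j+1) + v(i,j-1))
      end if
   end do
end do
!$omp end do
!$omp single
v(-(b-1):b-1, -(b-1):b-1) = vnew(-(b-1):b-1, -(b-1):b-1)   ! one thread copies the sweep back
!$omp end single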

Finding the amount of execution time spent on each processor in a Beowulf cluster

I have downloaded an LU decomposition program from the following link http://www.cs.nyu.edu/wanghua/course...el/h3/mpi_lu.c and the program is running very well. The reason for me writing this thread is: can anyone help me with getting the execution time spent on the processors of the nodes connected in the cluster, so that it aids me in getting statistical values from my cluster?
Kindly help me, as I don't know much about MPI programming; all I want is the amount of time spent on each processor of the nodes in the cluster for the above program.
There are at least 2 ways of getting the times you seek, or at least a close approximation to them.
If you have a job management system installed on your cluster (if you don't, you should), then I expect that it will log the time spent on each node by each process involved in your computation. Certainly Grid Engine keeps this data in its accounting file and provides the utility qacct for inspecting that file. I'd be very surprised to learn that the other widely used job management systems don't offer similar data and functions.
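(For example, with Grid Engine, something like the line below, run after the job has finished and with the placeholder replaced by your job's id, prints the accounting record for the job, including its recorded CPU and wallclock usage; check the qacct man page for the exact fields your version reports.)
qacct -j <job_id>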
You could edit your program and insert mpi_wtime calls at the critical points. Of course, like all MPI routines, this can only be called after mpi_init and before mpi_finalize; you would have to make other arrangements for timing the parts of your code which lie outside the scope of MPI. (On most MPI implementations that do not support clock synchronisation, calls to mpi_wtime are possible even before mpi_init and after mpi_finalize, since there mpi_wtime is simply a wrapper around the system timer routines, but that's not guaranteed to be portable.)
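The pattern is simply to bracket the region of interest with two mpi_wtime calls and print the difference. A minimal sketch, shown in Fortran to match the other snippets in this thread (the linked program is C, but MPI_Wtime is used the same way there), assuming the usual MPI header/module is included and MPI is already initialised:
integer :: myrank, ierr
double precision :: t0, t1
call mpi_comm_rank(mpi_comm_world, myrank, ierr)
t0 = mpi_wtime()
! ... the part of the computation you want to time ...
t1 = mpi_wtime()
write(*,*) 'rank', myrank, 'spent', t1 - t0, 'seconds in this section'
Printing or summing these per rank gives you the per-processor breakdown you are after.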

Impact of binary code length on performance of CUDA program

I wrote a CUDA program that has several subroutines. When I disable subroutine A, the runtime improves by amount a. When I disable subroutine B the runtime improves by amount b. When I disable subroutines A and B the runtime improves by amount c > a + b. Both subroutines are completely independent of each other.
This next part may be a naive approach to analyze this, but here is what I did: I compiled each version of the code and ran cuobjdump --dump-sass for each binary. The resulting output was about 1350 lines for the complete binary and around 1100 lines for each binary with one subroutine disabled. If I disabled both subroutines I got 850 lines. It appears that I need 3.1 us per line for the first three and 2.4 us for both subroutines disabled.
Since A and B do not contain anything complicated or use memory more intensively than the rest of the code I do not think that this is caused by commenting out all the time intensive operations and leaving the simple ones active. My guess is that the program code with both A and B disabled still fits in the streaming multiprocessors' instruction caches while the other versions are too large. That might result in global memory accesses so that more program code can be loaded and the latency causes this discrepancy. Unfortunately I could not find any information on the instruction cache size.
Can anyone help me with the interpretation of these results?
There can be several things leading to your results. Some may include:
Lack of A & B leads to some additional optimization of the remaining code. The resulting code might not be shorter but may still execute faster.
The optimizer found that certain code in A exists also in B (A and B may still be independent). As a result having both A & B slows the program less than having separate A and separate B.
Having just A or B (or neither) results in better memory access patterns or more cache hits when reading the data in the remaining code. (I am talking about data, not program code)
Lack of A and B may lead to better occupancy, allowing more threads to be run concurrently.
I highly doubt that the instruction cache is the issue. Even if global memory reads were necessary, they would never be "misaligned", as a warp always needs only one instruction at a time. My guess is that a dedicated piece of hardware is used for fetching instructions.

MPI on a single machine dualcore

What happens if I run an MPI program which requires 3 nodes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try it with any command-line executable you have sitting around, something like mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. And for this reason, for instance, OpenMPI has optimized for the usual case -- it just makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see if a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
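(If I'm remembering the MCA parameter name correctly for the 1.x series, telling Open MPI to yield the CPU when idle looks something like the line below; the FAQ entry linked above has the authoritative spelling for your version.)
mpirun --mca mpi_yield_when_idle 1 -np 3 ./Program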
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case -- though I think even there it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPIs, or even versions of MPIs, do a quick search to see if there are any parameters that need to be tweaked for this case.

What is easier to learn and debug, OpenMP or MPI?

I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We got access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie to both MPI and OpenMP. I just wonder which is the easiest one to learn and to debug, even if the performance is not the best.
I also wonder which is the most suitable for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started. The advantage is though that your program could run on several nodes at one time, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category, where, provided you've got more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speed-up over using a single node.
For example, if your cluster is using Condor as the scheduler, then you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this with Condor which may be more sensible, and there are similar mechanisms for Torque, SGE, etc.)
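(A minimal submit description for that approach might look something like the sketch below. The executable name and the convention of selecting the data set through a command-line argument are hypothetical; adjust them to however your program actually picks its input.)
universe   = vanilla
executable = crunch
# each of the 100 queued jobs gets a different data-set index via $(Process)
arguments  = dataset_$(Process)
output     = crunch.$(Process).out
error      = crunch.$(Process).err
log        = crunch.log
queue 100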
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes, and OpenMP to handle parallelism across cores or multiple CPUs per node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser grained, the speed-up you will get from OpenMP will usually be lower than from a hand-optimized pthreads implementation.
