Using MPI to parallelize a function - parallel-processing

I want to use MPI to parallelize a function that is called multiple times in my code. What I wanted to know is: if I use MPI_Init inside the function, will it spawn the processes every time the function is called, or will the spawning take place only once? Is there some known design pattern to do this in a systematic way?

The MPI_Init() call just initialises the MPI environment; it doesn't do any parallelisation itself. The parallelism comes from how you write the program.
In the parallel "Hello, World" below, the printf() does different things depending on which rank (process) it's running on. The number of processes is determined by how you execute the program (e.g. it is set via the -n parameter to mpiexec or mpirun):
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    char name[BUFSIZ];
    int length = BUFSIZ;
    int rank;
    int numprocesses;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &length);
    printf("%s, Rank %d of %d: hello world\n", name, rank, numprocesses);
    MPI_Finalize();
    return 0;
}

That's not really the way MPI (or distributed-memory programming) works; you can't just parallelize a function the way you can with something like OpenMP. In MPI, processes aren't spawned at the time of MPI_Init(), but at the time the executable is launched (e.g. with mpiexec; this is true even with MPI_Comm_spawn()). Part of the reason for that is that in distributed-memory computing, launching processes on a potentially large number of shared-nothing nodes is a very expensive task.
You could cobble something together by having the function you're calling be in a separate executable, but I'm not sure that's what you want.
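If the function may be entered more than once, a common pattern is to keep start-up concerns out of it entirely: initialise MPI once (typically in main(), finalising at exit) and have the function merely use the communicator. As a minimal sketch, a guard built on the standard MPI_Initialized() query could look like this (ensure_mpi_initialized is just a hypothetical helper name):

#include <mpi.h>

/* Calls MPI_Init() at most once per process, no matter how often it runs.
   The processes themselves are still created by mpiexec/mpirun, not here. */
void ensure_mpi_initialized(int *argc, char ***argv)
{
    int initialized = 0;
    MPI_Initialized(&initialized);
    if (!initialized)
        MPI_Init(argc, argv);
}

Note that MPI_Init() may legally be called at most once per process, so a guard like this only protects against double initialisation; it does not restart or respawn anything on later calls.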

Related

Barrier before MPI_Bcast()?

I see some open source code use MPI_Barrier before broadcasting the root value:
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
I am not sure whether MPI_Bcast() already has a natural blocking feature. If it does, I may not need MPI_Barrier() to synchronize the progress of all the cores, and could just use:
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
So which one is correct?
There is rarely a need to perform explicit synchronisation in MPI and code like that one makes little sense in general. Ranks in MPI mostly process data locally, share no access to global objects, and synchronise implicitly following the semantics of the send and receive operations. Why should rank i care whether some other rank j has received the broadcast when i is processing the received data locally?
Explicit barriers are generally needed in the following situations:
benchmarking - a barrier before a timed region of the code removes any extraneous waiting times resulting from one or more ranks being late to the party (see the sketch after this list)
parallel I/O - in this case, there is a global object (a shared file) and the consistency of its content may depend on the proper order of I/O operations, hence the need for explicit synchronisation
one-sided operations (RMA) - similarly to the parallel I/O case, some RMA scenarios require explicit synchronisation
shared-memory windows - a subset of RMA where access to memory shared between several ranks doesn't go through MPI calls; instead, direct memory reads and writes are issued, which brings all the problems inherent to shared-memory programming (data races and hence the need for locks and barriers) into MPI
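For reference, the benchmarking case mentioned above typically follows this pattern (do_timed_work() is a placeholder for whatever is being measured):

/* line up all ranks before starting the clock */
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();

do_timed_work();                  /* placeholder for the timed region */

MPI_Barrier(MPI_COMM_WORLD);      /* wait for the slowest rank to finish */
double elapsed = MPI_Wtime() - t0;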
There are rare cases when such code actually makes sense. Depending on the number of ranks, their distribution throughout the network of processing elements, the size of the data to broadcast, the latency and bandwidth of the interconnect, and the algorithm used by the MPI library to implement the data distribution, the broadcast may take much longer to complete when the ranks are even slightly out of alignment in time, due to the phenomenon of delay propagation (which may also apply to the user code itself). Those are pathological cases that usually occur under specific conditions, which is why you may sometimes see code like:
#ifdef UNALIGNED_BCAST_IS_SLOW
MPI_Barrier(MPI_COMM_WORLD);
#endif
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
or even
if (config.unaligned_bcast_performance_remedy)
    MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
I've seen at least one MPI-enabled quantum chemistry simulation software package include similar code.
That said, collective operations in MPI are not necessarily synchronising. The only one that guarantees a point in time where all ranks are simultaneously inside the call is MPI_BARRIER. MPI allows ranks to exit a collective early once their participation in it has finished. For example, MPI_BCAST may be implemented as a linear sequence of sends from the root:
int rank, size;
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);

if (rank == root)
{
    for (int i = 0; i < size; i++)
        if (i != rank)
            MPI_Send(buffer, count, type, i, SPECIAL_BCAST_TAG, comm);
}
else
{
    MPI_Recv(buffer, count, type, root, SPECIAL_BCAST_TAG, comm, MPI_STATUS_IGNORE);
}
In this case, rank 0 (if root is not 0) or rank 1 (when root is 0) will be the first one to receive the data and, since no further communication is directed to or from it, it can safely return from the broadcast call. If the data buffer is large and the interconnect is slow, this creates quite some temporal staggering between the ranks.

Different OpenMP output on different machines

When I try to run the following code on my system (CentOS, running in a virtual machine) I get the right output, but when I run the same code on the compact supercomputer "Param Shavak" I get incorrect output.
#include <stdio.h>
#include <omp.h>

int main()
{
    int p = 1, s = 1, tid;
    #pragma omp parallel private(p, tid) shared(s)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
If your program runs correctly on one machine, it must be because it's not actually running in parallel on that machine.
Your program suffers from a race condition in the s=s+tid; line of code. s is a shared variable, so several threads at the same time try to update it, which results in data loss.
You can fix the problem by turning that line of code into an atomic operation:
#pragma omp atomic
s=s+tid;
That way only one thread at a time can read and update the variable s, and the race condition is no more.
In more complex programs you should use atomic operations or critical regions only when necessary, because you don't have parallelism in those regions and that hurts performance.
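For completeness, the same fix written as a critical region would look like the sketch below; for a single scalar update, atomic is usually the cheaper of the two.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int s = 1;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp critical
        s = s + tid;            /* only one thread at a time, so no race */
    }
    printf("S=%d\n", s);
    return 0;
}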
EDIT: As suggested by user High Performance Mark, I must remark that the program above is quite inefficient because of the atomic operation. The proper way to do that kind of calculation (adding to the same variable in all iterations of a loop) is to implement a reduction. OpenMP makes it easy with the reduction clause, which is attached to a parallel or for directive:
#pragma omp parallel reduction(operator : variable-list)
Try this version of your program, using reduction:
#include <stdio.h>
#include <omp.h>

int main()
{
    int p = 1, s = 1, tid;
    #pragma omp parallel reduction(+ : s) private(p, tid)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
The following link explains critical sections, atomic operations and reduction in a more verbose way: http://www.lindonslog.com/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/

Available cores vs. number of processes in openMPI

I tried the following "hello world" code, first on my system (8 cores), then on a server (160 cores):
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    double t;

    t = MPI_Wtime();
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    //printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
    printf("%f---%d---%s\n", MPI_Wtime() - t, rank, processor_name);
    sleep(.5); // to make sure each process needs a significant amount of time to finish
    MPI_Finalize();
    return 0;
}
I run the program with 160 processes using mpirun -np 160 ./hello
I expected the server run to be more efficient, as it has a core available for each process from the start, but the result was the opposite.
8 cores : 2.25 sec
160 cores : 5.65 sec
Please correct me if I am confused regarding the core assignment to each process.
Also, please explain how the mapping is done by default. I know there are several ways to do it manually, either by using a rankfile or by using options related to socket/core affinity.
I want to know how the processes are treated in Open MPI and how they are given resources by default.
You're not actually measuring the performance of anything that would benefit from scale here. The only thing you're measuring is the startup time, and in that case you would expect starting more processes to take more time: you have to launch the processes, wire up the network connections, etc. Also, both your laptop and the server run one process per core, so that doesn't change from one to the other.
A better test of whether more cores are more efficient is to do some sort of sample calculation and measure the speedup from adding cores. You could try the traditional pi calculation, along the lines of the sketch below.
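For instance, a minimal sketch of the classic midpoint-rule pi computation with MPI_Reduce (the interval count n is chosen arbitrarily):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long n = 100000000;    /* number of intervals, arbitrary */
    int rank, size;
    double local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t = MPI_Wtime();
    double h = 1.0 / (double)n;

    /* each rank integrates a strided subset of the intervals */
    for (long i = rank; i < n; i += size) {
        double x = h * ((double)i + 0.5);
        local_sum += 4.0 / (1.0 + x * x);
    }

    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f, computed in %f s on %d ranks\n",
               pi * h, MPI_Wtime() - t, size);

    MPI_Finalize();
    return 0;
}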

Why is this simple OpenCL kernel running so slowly?

I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
    const __global char* pSrc,
    __global __write_only char* pDst,
    int length)
{
    const int tid = get_global_id(0);
    if(tid < length) {
        pDst[tid] = pSrc[tid];
    }
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(length),
    cl::NDRange(1),
    NULL,
    &event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7, with an Intel i5-3450 chip (Sandy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think the event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!
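For reference, the float4 variant of the kernel presumably looks something like the sketch below (a reconstruction, not the poster's exact code; length4 would be the buffer length in float4 elements, i.e. bytes divided by 16):

__kernel void copy4(
    const __global float4* pSrc,
    __global __write_only float4* pDst,
    int length4)
{
    const int tid = get_global_id(0);
    if(tid < length4) {
        pDst[tid] = pSrc[tid];   // 16 bytes per work-item instead of 1
    }
}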
I think your main problem is the 2048*2048 single-item work groups you are using. The OpenCL driver on your system has to manage a lot more overhead when there are this many one-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. What should this size be? I have used 64 below as an example; 64 happens to be a decent number on most hardware.
const size_t myOptimalGroupSize = 64;
cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(length),
    cl::NDRange(myOptimalGroupSize),
    NULL,
    &event);
event.wait();
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.
CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my little experience with OpenCL on CPU, I have never reached performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel with a CPU would be to divide the block to copy into a small number of large sub-block, and let each thread copy a sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on a CPU with OpenCL, use a loop inside your kernel to copy contiguous data, and then use a work-group size no larger than the number of available cores (see the sketch below).
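A minimal sketch of such a kernel, where each work-item copies one contiguous sub-block and the kernel is launched with roughly one work-item per core:

__kernel void copy_blocked(
    const __global char* pSrc,
    __global char* pDst,
    int length)
{
    /* each work-item handles one contiguous sub-block of the buffer */
    const int gid    = get_global_id(0);
    const int nitems = get_global_size(0);
    const int block  = (length + nitems - 1) / nitems;
    const int begin  = gid * block;
    const int end    = min(begin + block, length);

    for (int i = begin; i < end; i++)
        pDst[i] = pSrc[i];
}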
I believe it is the cl::NDRange(1) which is telling the runtime to use single item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work group size up to the runtime; there should be a way to do that in the C++ API as well (perhaps also just NULL). This should be faster on the CPU; it certainly will be on a GPU.
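In the C++ wrapper the equivalent is to pass cl::NullRange as the local size, which leaves the work-group size up to the runtime; a sketch against the same queue, kernel and length as in the question:

cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,          // no offset
    cl::NDRange(length),    // global size
    cl::NullRange,          // let the runtime choose the work-group size
    NULL,
    &event);
event.wait();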

OpenMP slows down program instead of speeding it up: a bug in gcc?

I will first give some background about the problem I'm having so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found out that we could benefit greatly from using OpenMP to parallelize some of the biggest loops in this software. We actually parallelized the loops successfully, and with just two cores the loops executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP enabled, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization and commented out the OpenMP pragmas in this function. This appeared to have no effect on the execution time, though. We are currently using GCC 4.4.6 with the -fopenmp flag for OpenMP support. And here is the current problem:
If we don't use any omp pragmas in the code, all runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:
//beginning of main function
...
#pragma omp parallel
{
    #pragma omp single
    {}
}
//main function continues
...
Does anyone have any clues about why this happens? I don't understand why the program slows down so greatly just from using the OpenMP pragmas. If we take out all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds. I would guess that this is some sort of compiler bug, as I have no other explanation in mind right now.
Not everything that can be parallelized should be parallelized. If you use a single, then only one thread executes it and the rest have to wait until the region is done. They can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and that the waiting threads will see the completion faster than if they were sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive (e.g. via the OMP_WAIT_POLICY environment variable), but this is only in OpenMP V3.0 and is only a hint to the implementation (so it might not have any effect). Basically, unless you have a lot of work in the parallel region that can compensate for the single, the single is going to increase the parallel overhead substantially and may well make it too expensive to parallelize.
First, OpenMP often reduces performance on the first try. It can be tricky to use omp parallel if you don't understand it inside-out. I may be able to help if you can tell me a little more about the program structure, specifically the following questions annotated by ????.
//beginning of main function
...
#pragma omp parallel
{
???? What goes here, is this a loop? if so, for loop, while loop?
#pragma omp single
{
???? What goes here, how long does it run?
}
}
//main function continues
....
???? Does performance of this code reduce or somewhere else?
Thanks.
Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately and we were able to get around 40% improvement in execution times over a non-threaded version. We used 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory management solution. Otherwise the program may just slow down.
I did some more testing and made a small test program to check whether the issue could be related to memory operations. I was unable to replicate the issue of an empty parallel-single region causing the program to slow down in my small test program, but I was able to replicate the slow-down by parallelizing some malloc calls.
When running the test program on Windows 7 64-bit with 2 CPU cores, no noticeable slow-down was caused by compiling with the -fopenmp flag in gcc (g++) compared to running the program without OpenMP support.
Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. This issue seems to appear only on Unix systems and not on Windows.
The source of my test program is below. I have also uploaded zipped-source for win and unix version as well as assembly source for win and unix version for both with and without OpenMP-support. This zip can be downloaded here http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html
#include <stdio.h>
#include <windows.h>
#include <list>
#include <sys/time.h>
//#include <cstdlib>

using namespace std;

int main(int argc, char* argv[])
{
    // #pragma omp parallel
    // #pragma omp single
    // {}

    int start = GetTickCount();
    /*
    struct timeval begin, end;
    int usecs;
    gettimeofday(&begin, NULL);
    */

    list<void *> pointers;
    #pragma omp parallel for default(shared)
    for(int i = 0; i < 10000; i++)
        //pointers.push_back(calloc(20000, sizeof(void *)));
        pointers.push_back(malloc(20000));

    for(list<void *>::iterator i = pointers.begin(); i != pointers.end(); i++)
        free(*i);

    /*
    gettimeofday(&end, NULL);
    if (end.tv_usec < begin.tv_usec) {
        end.tv_usec += 1000000;
        begin.tv_sec += 1;
    }
    usecs = (end.tv_sec - begin.tv_sec) * 1000000;
    usecs += (end.tv_usec - begin.tv_usec);
    */

    printf("It took %d milliseconds to finish the memory operations", GetTickCount() - start);
    //printf("It took %d milliseconds to finish the memory operations", usecs/1000);
    return 0;
}
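As an aside, pushing onto a shared std::list from several threads as in the test program above is itself a data race. A variant that keeps only the allocator calls in the parallel region (a plain-C sketch using an array instead of the list and omp_get_wtime for timing) might look like:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000

int main(void)
{
    static void *pointers[N];
    double start = omp_get_wtime();

    /* every iteration writes to its own slot, so only malloc() itself is contended */
    #pragma omp parallel for default(shared)
    for (int i = 0; i < N; i++)
        pointers[i] = malloc(20000);

    for (int i = 0; i < N; i++)
        free(pointers[i]);

    printf("It took %f seconds to finish the memory operations\n",
           omp_get_wtime() - start);
    return 0;
}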
What remains unanswered now is: what can I do to avoid issues such as these on the Unix platform?

Resources