What is easier to learn and debug, OpenMP or MPI?

I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We have access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie at both MPI and OpenMP. I just wonder which is the easier one to learn and to debug, even if the performance is not the best.
I also wonder which is the most appropriate for my main-loop application.
Thanks

If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
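For a loop over independent data sets, that could look roughly like this (a minimal sketch; struct data_set and process_data_set are placeholders for your own type and loop body):

struct data_set;                            /* your per-data-set type */
void process_data_set(struct data_set *d);  /* whatever the loop body currently does */

void run_all(struct data_set *sets, int n)
{
    /* Valid only if the iterations are independent of each other. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        process_data_set(&sets[i]);
}

Compile with your compiler's OpenMP flag (e.g. -fopenmp for GCC) and the iterations are spread across the cores of a single node.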
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program could run on several nodes at once, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category: provided you have more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speed-up over using a single node.
For example, if your cluster uses Condor as the scheduler, you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this with Condor which may be more sensible, and there are similar mechanisms for Torque, SGE, etc.)

OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyway. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across cores or multiple CPUs per node. I would say OpenMP is a lot easier than messing with pthreads, but because it is coarser grained, the speed-up you get from OpenMP will usually be lower than from a hand-optimized pthreads implementation.
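A rough sketch of that hybrid layout, assuming one MPI rank per node and a loop over independent data sets (the count and the work function are made up for illustration):

#include <mpi.h>

/* Stub standing in for the real per-data-set work. */
static void process_data_set(int id) { (void)id; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int total = 1000;                 /* assumed number of data sets */
    /* Each rank takes a strided share of the data sets, and OpenMP
       spreads that share across the cores of its node. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = rank; i < total; i += nranks)
        process_data_set(i);

    MPI_Finalize();
    return 0;
}

You would launch it with something like mpirun -np <number of nodes> ./app and set OMP_NUM_THREADS to the cores per node; the exact launch line depends on your MPI implementation and scheduler.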

Related

Hybrid OpenMP + OpenMPI for mixed distributed & shared memory?

I am developing a code to perform a few very large computations by my standards. Based on single-CPU estimates, expected run-time is ~10 CPU years, and memory requirements are ~64 GB. Little to no IO is required. My serial version of the code in question (written in C) is working well enough and I have to start thinking about how to best parallelize the code.
I have access to clusters with ~64 GB RAM and 16 cores per node. I will probably limit myself to using e.g. <= 8 nodes. I'm imagining a setup where memory is shared between threads on a single node, with separate memory used on different nodes and relatively little communication between nodes.
From what I've read so far, the solution I have come up with is to use a hybrid OpenMP + OpenMPI design, using OpenMP to manage threads on individual compute nodes, and OpenMPI to pass information between nodes, like this:
https://www.rc.colorado.edu/crcdocs/openmpi-openmp
My question is whether this is the "best" way to implement this parallelization. I'm an experienced C programmer but have very limited experience in parallel programming (a little bit with OpenMP, none with OpenMPI; most of my jobs in the past were embarrassingly parallel). As an alternative suggestion, is it possible with OpenMPI to efficiently share memory on a single host? If so then I could avoid using OpenMP, which would make things slightly simpler (one API instead of two).
Hybrid OpenMP and MPI coding is most appropriate for problems where one can clearly identify two separate levels of parallelism - a coarse-grained one and a fine-grained one nested inside each coarse subdomain. Since fine-grained parallelism requires lots of communication when implemented with message passing, it doesn't scale, because the communication overhead can become comparable to the amount of work being done. As OpenMP is a shared-memory paradigm, no data communication is necessary, only access synchronisation, and it is more appropriate for finer-grained parallel tasks. OpenMP also benefits from data sharing between threads (and the corresponding cache sharing on modern multi-core CPUs with a shared last-level cache) and usually requires less memory than the equivalent message-passing code, where some of the data might need to be replicated in all processes. MPI, on the other hand, can run across nodes and is not limited to a single shared-memory system.
Your words suggest that your parallelisation is very coarse grained, or that it belongs to the class of so-called embarrassingly parallel problems. If I were you, I would go hybrid. If you only employ OpenMP pragmas and don't use runtime calls (e.g. omp_get_thread_num()), your code can be compiled either as pure MPI (i.e. with non-threaded MPI processes) or as hybrid, depending on whether you enable OpenMP or not (you can also provide a dummy OpenMP runtime so the code can be compiled as serial). This will give you both the benefits of OpenMP (data sharing, cache reuse) and of MPI (transparent networking, scalability, easy job launching), with the added option of switching OpenMP off and running in MPI-only mode. And as an added bonus, you will be ready for the future, which looks like it will bring us interconnected many-many-core CPUs.
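To make that point concrete, here is a tiny hedged example of the style being described (the function and its contents are invented): only pragmas are used, no omp.h and no omp_* calls, so the same source builds as pure MPI when OpenMP is disabled, because the compiler then simply ignores the pragmas.

/* No #include <omp.h> and no OpenMP runtime calls anywhere. */

void relax_subdomain(double *u, int n)
{
    /* Fine-grained parallelism inside this rank's coarse subdomain.
       Built without OpenMP, the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        u[i] = 0.5 * u[i];        /* stand-in for the real stencil/update */

    /* ...the MPI halo exchange with neighbouring subdomains would follow here... */
}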

Parallel computing cluster with MPI (MPICH2) and nVidia CUDA

I need to write an application that hashes words from a dictionary to make WPA pre-shared keys. This is my thesis for a "Networking Security" course. The application needs to be parallel for increased performance. I have some experience with MPI from my IT studies, but I would like to tie it in with CUDA. The idea is to use MPI to distribute the load evenly across the nodes of the cluster and then utilize CUDA to run the individual chunks in parallel on the GPUs of the nodes.
Distributing the load with MPI is something I can easily do and have done in the past. Also computing with CUDA is something I can learn. There is also a project (pyrit) that does more or less what I need to do (actually a lot more) and I can get ideas from there.
I would like some advice on how to make the connection between MPI and CUDA. If somebody has built anything like this, I would greatly appreciate their advice and suggestions. Also, if you happen to know of any resources on the topic, please point me to them.
Sorry for the lengthy intro but I thought it was necessary to give some background.
This question is largely open-ended, and so it's hard to give a definitive answer. This one is just a summary of the comments made by High Performance Mark, me, and Jonathan Dursi. I do not claim authorship and have therefore made this answer a community wiki.
MPI and CUDA are orthogonal. The former is an IPC middleware and is used to communicate between processes (possibly residing on separate nodes) while the latter provides highly data-parallel shared-memory computing to each process that uses it. You can break the task into many small subtasks and use MPI to distribute them to worker processes running on the network. The master/worker approach is suitable for this kind of application, especially if words in the dictionary vary greatly in their length and variance in processing time is to be expected. Provided with all the necessary input values, worker processes can then use CUDA to perform the necessary computations in parallel and then return results back using MPI. MPI also provides the mechanisms necessary to launch and control multinode jobs.
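As a very rough skeleton of that master/worker structure (the tags, buffer sizes, the tiny built-in "dictionary" and the per-word stub are all assumptions; a real master would hand out the next word as each result comes back rather than round-robin, and the CUDA part is reduced to a comment):

#include <mpi.h>
#include <string.h>

#define TAG_WORK 1
#define TAG_STOP 2
#define WORD_LEN 64

/* Stub: in the real application this would launch the CUDA kernel(s) that
   derive the WPA PSK from the word and send the result back to rank 0. */
static void process_word(const char *word) { (void)word; }

int main(int argc, char **argv)      /* run with at least 2 ranks */
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char word[WORD_LEN] = {0};
    if (rank == 0) {
        /* Master: a tiny hard-coded "dictionary" stands in for the real word list. */
        const char *dict[] = { "password", "letmein", "qwerty12" };
        for (int i = 0; i < 3; ++i) {
            strncpy(word, dict[i], WORD_LEN - 1);
            MPI_Send(word, WORD_LEN, MPI_CHAR, 1 + i % (size - 1), TAG_WORK, MPI_COMM_WORLD);
        }
        for (int w = 1; w < size; ++w)               /* tell every worker to stop */
            MPI_Send(word, WORD_LEN, MPI_CHAR, w, TAG_STOP, MPI_COMM_WORLD);
    } else {
        MPI_Status st;
        for (;;) {                                   /* worker: receive words until told to stop */
            MPI_Recv(word, WORD_LEN, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            process_word(word);
        }
    }
    MPI_Finalize();
    return 0;
}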
Although MPI and CUDA can be used entirely separately, modern MPI implementations provide some mechanisms that blur the boundary between the two. It could be direct support for device pointers in MPI communication operations, which transparently calls CUDA functions to copy memory when necessary, or it could even be support for RDMA to/from device memory without an intermediate copy to main memory. The former simplifies your code, while the latter can save a varying amount of time, depending on how your algorithm is structured. The latter also requires both fairly new CUDA hardware and drivers and newer networking equipment (e.g. a newer InfiniBand HCA).
MPI libraries that support direct GPU memory operations include MVAPICH2 and the trunk SVN version of Open MPI.
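To illustrate the first of those mechanisms: with a CUDA-aware build, the device pointer can be handed straight to the communication call (a hedged sketch; with a plain MPI build you would have to cudaMemcpy into a host buffer first):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)      /* needs exactly 2 ranks and a CUDA-aware MPI */
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;                                   /* device memory, no host staging buffer */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));

    /* The MPI library detects the device pointer and copies (or RDMAs) the data itself. */
    if (rank == 0)
        MPI_Send(d_buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}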

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application in Linux. Here is the scenario:
Suppose I have x instances of a class BloomFilter and some y GB of data (more than the available memory). I need to test membership for this y GB of data against each of the Bloom filter instances. It is pretty clear that parallel programming will help to speed up the task; moreover, since I am only reading the data, it can be shared across all processes or threads.
Now I am confused about which one to use: Cilk, Cilk++ or OpenMP (which one is better)? I am also confused about whether to go for multithreading or multiprocessing.
Cilk Plus is the current implementation of Cilk by Intel.
They are both multithreaded environments, i.e., multiple threads are spawned during execution.
If you are new to parallel programming, OpenMP is probably better for you, since it allows easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragmas to instruct the compiler which portions of the code have to run in parallel. If I understand your problem correctly, you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for (long i = 0; i < data_count; ++i) {      /* data_count = number of items in DATA */
    check(data[i], array_of_bloom_filters);  /* each thread gets its own copy of the filters */
}
The instances of the different Bloom filters are replicated in every thread in order to avoid contention, while the data is shared among the threads.
Update:
The paper actually considers an application which is very unbalanced, i.e., different tasks (allocated to different threads) may incur very different workloads. Citing the paper that you mentioned: "a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronization.
In other words, good load balancing always comes at a cost. The description of your problem is not very detailed, but it seems to me that your problem is quite balanced. If this is not the case, then go for Cilk: its work-stealing approach is probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version and then, at run time, try various values of the environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogues to the way Cilk(tm) Plus work stealing operates.
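For illustration, a loop marked up that way defers the scheduling decision to the environment (the work function here is a placeholder for unevenly sized tasks):

void do_task(long i);    /* placeholder: pieces of work of very different sizes */

void run(long n)
{
    /* schedule(runtime) reads OMP_SCHEDULE at program start, so e.g.
       OMP_SCHEDULE="guided" ./app   or   OMP_SCHEDULE="dynamic,2" ./app
       can be compared without recompiling. */
    #pragma omp parallel for schedule(runtime)
    for (long i = 0; i < n; ++i)
        do_task(i);
}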
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work stealing, or dynamic scheduling, may lose much of the potential advantage OpenMP has in promoting cache locality by pinning threads, e.g. with OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.

MATLAB Parallel Computing Toolbox - Parallelization vs GPU?

I'm working with someone who has some MATLAB code that they want sped up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run it across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth exploring the time it takes to convert all of the MATLAB code to CUDA code to run on a machine with multiple GPUs... but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation happens on the GPU.
The key difference between the two techniques is that parfor computations happen on the CPUs of the cluster nodes, with direct access to main memory. CPU cores typically have a high clock rate, but there are usually fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core, and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs, each PC has one or more Nvidia Tesla boards, and you use both parfor loops and gpuArrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason for this is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU, you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than two cores in your computer's CPU.
If you write it in CUDA, it is guaranteed to run in parallel at the chip level, versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.

MPI for multicore?

With the recent buzz around multicore programming, is anyone exploring the possibilities of using MPI?
I've used MPI extensively on large clusters with multi-core nodes. I'm not sure if it's the right thing for a single multi-core box, but if you anticipate that your code may one day scale larger than a single chip, you might consider implementing it in MPI. Right now, nothing scales larger than MPI. I'm not sure where the posters who mention unacceptable overheads are coming from, but I've tried to give an overview of the relevant tradeoffs below. Read on for more.
MPI is the de-facto standard for large-scale scientific computation and it's in wide use on multicore machines already. It is very fast. Take a look at the most recent Top 500 list. The top machines on that list have, in some cases, hundreds of thousands of processors, with multi-socket dual- and quad-core nodes. Many of these machines have very fast custom networks (Torus, Mesh, Tree, etc) and optimized MPI implementations that are aware of the hardware.
If you want to use MPI with a single-chip multi-core machine, it will work fine. In fact, recent versions of Mac OS X come with OpenMPI pre-installed, and you can download and install OpenMPI pretty painlessly on an ordinary multi-core Linux machine. OpenMPI is in use at Los Alamos on most of their systems. Livermore uses MVAPICH on their Linux clusters. What you should keep in mind before diving in is that MPI was designed for solving large-scale scientific problems on distributed-memory systems. The multi-core boxes you are dealing with probably have shared memory.
OpenMPI and other implementations use shared memory for local message passing by default, so you don't have to worry about network overhead when you're passing messages to local processes. It's pretty transparent, and I'm not sure where other posters are getting their concerns about high overhead. The caveat is that MPI is not the easiest thing you could use to get parallelism on a single multi-core box. In MPI, all the message passing is explicit. It has been called the "assembly language" of parallel programming for this reason. Explicit communication between processes isn't easy if you're not an experienced HPC person, and there are other paradigms more suited for shared memory (UPC, OpenMP, and nice languages like Erlang to name a few) that you might try first.
My advice is to go with MPI if you anticipate writing a parallel application that may need more than a single machine to solve. You'll be able to test and run fine with a regular multi-core box, and migrating to a cluster will be pretty painless once you get it working there. If you are writing an application that will only ever need a single machine, try something else. There are easier ways to exploit that kind of parallelism.
Finally, if you are feeling really adventurous, try MPI in conjunction with threads, OpenMP, or some other local shared-memory paradigm. You can use MPI for the distributed message passing and something else for on-node parallelism. This is where big machines are going; future machines with hundreds of thousands of processors or more are expected to have MPI implementations that scale to all nodes but not all cores, and HPC people will be forced to build hybrid applications. This isn't for the faint of heart, and there's a lot of work to be done before there's an accepted paradigm in this space.
I would have to agree with tgamblin. You'll probably have to roll your sleeves up and really dig into the code to use MPI, explicitly handling the organization of the message-passing yourself. If this is the sort of thing you like or don't mind doing, I would expect that MPI would work just as well on multicore machines as it would on a distributed cluster.
Speaking from personal experience... I coded up some C code in graduate school to do some large scale modeling of electrophysiologic models on a cluster where each node was itself a multicore machine. Therefore, there were a couple of different parallel methods I thought of to tackle the problem.
1) I could use MPI alone, treating every processor as its own "node", even though some of them are grouped together on the same machine.
2) I could use MPI to handle data moving between multicore nodes, and then use threading (POSIX threads) within each multicore machine, where processors share memory.
For the specific mathematical problem I was working on, I first tested two formulations on a single multicore machine: one using MPI and one using POSIX threads. As it turned out, the MPI implementation was much more efficient, giving a speed-up of close to 2 on a dual-core machine, as opposed to 1.3-1.4 for the threaded implementation. With the MPI code, I was able to organize operations so that processors were rarely idle, staying busy while messages were passed between them and masking much of the delay of transferring data. With the threaded code, I ended up with a lot of mutex bottlenecks that forced threads to sit and wait while other threads completed their computations. Keeping the computational load balanced between threads didn't seem to help.
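The usual pattern for that kind of overlap is non-blocking communication, something like this hedged sketch (the buffers, neighbour ranks and compute routines are placeholders, not the original electrophysiology code):

#include <mpi.h>

void compute_interior(double *u, int n);                 /* work needing no neighbour data */
void compute_boundary(double *u, const double *halo, int n);

void step(double *u, double *halo, int n, int left, int right)
{
    MPI_Request req[2];
    /* Start the exchange with the neighbouring processes... */
    MPI_Irecv(halo, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(u,    n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    compute_interior(u, n);                              /* ...and keep computing while it is in flight */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, n);                        /* only this part had to wait for the data */
}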
This may have been specific to just the models I was working on, and the effectiveness of threading vs. MPI would likely vary greatly for other types of parallel problems. Nevertheless, I would disagree that MPI has an unwieldy overhead.
No, in my opinion it is unsuitable for most processing you would do on a multicore system. The overhead is too high, the objects you pass around must be deeply cloned, and passing large object graphs around only to run a very small computation is very inefficient. It is really meant for sharing data between separate processes, most often running in separate memory spaces and most often running long computations.
A multicore processor is a shared-memory machine, so there are much more efficient ways to do parallel processing that do not involve copying objects and where most of the threads run for a very short time. For example, think of a multithreaded quicksort. Because of the overhead of allocating memory and copying the data to a worker before it can be partitioned, a quicksort built on MPI with an unlimited number of processors would still be slower than quicksort running on a single processor.
As an example, in Java, I would use a BlockingQueue (a shared memory construct), to pass object references between threads, with very little overhead.
Not that it does not have its place, see for example the Google search cluster that uses message passing. But it's probably not the problem you are trying to solve.
MPI is not inefficient. You need to break the problem down into chunks, pass the chunks around, and reassemble the results as each chunk is finished. No one in their right mind would pass around the whole object via MPI when only a portion of the problem is being worked on per thread. It is not an inefficiency of the interface or design pattern; it is a gap in the programmer's knowledge of how to break up a problem.
When you use a locking mechanism, the overhead of the mutex does not scale well. This is because the underlying run queue does not know when you are going to lock the thread next. You will cause more kernel-level thrashing using mutexes than with a message-passing design pattern.
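A hedged sketch of that idea using the standard scatter/gather collectives (the sizes and the per-element work are invented):

#include <mpi.h>
#include <stdlib.h>

static double crunch(double x) { return x * x; }          /* stand-in for the real per-element work */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int per_rank = 1000;                            /* chunk size, assumed to divide the data evenly */
    double *all = NULL;
    double *chunk = malloc(per_rank * sizeof *chunk);
    if (rank == 0) {
        all = malloc((size_t)per_rank * size * sizeof *all);
        for (int i = 0; i < per_rank * size; ++i) all[i] = i;
    }

    /* Each rank receives only its own chunk, never the whole data set. */
    MPI_Scatter(all, per_rank, MPI_DOUBLE, chunk, per_rank, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < per_rank; ++i)
        chunk[i] = crunch(chunk[i]);
    MPI_Gather(chunk, per_rank, MPI_DOUBLE, all, per_rank, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(chunk);
    free(all);
    MPI_Finalize();
    return 0;
}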
MPI has a very large amount of overhead, primarily to handle inter-process communication and heterogeneous systems. I've used it in cases where a small amount of data is being passed around, and where the ratio of computation to data is large.
This is not the typical usage scenario for most consumer or business tasks, and in any case, as a previous reply mentioned, on a shared memory architecture like a multicore machine, there are vastly faster ways to handle it, such as memory pointers.
If you had some sort of problem with the properties described above, and you wanted to be able to spread the job around to other machines, which must be on the same high-speed network as you, then maybe MPI could make sense. I have a hard time imagining such a scenario, though.
I personally have taken up Erlang (and I like it so far). The message-based approach seems to fit most of the problems, and I think it is going to be one of the key items for multi-core programming. I never knew about the overhead of MPI; thanks for pointing it out.
You have to decide whether you want low-level or high-level threading. If you want low level, use pthreads. You have to be careful that you don't introduce race conditions and that you don't make threading performance work against you.
I have used some OSS packages for C and C++ that are scalable and optimize the task scheduling. TBB (Threading Building Blocks) and Cilk Plus are good, easy to code with, and help get applications off the ground. I also believe they are flexible enough to integrate other threading technologies at a later point if needed (OpenMP, etc.).
www.threadingbuildingblocks.org
www.cilkplus.org

Resources