Fortran code using auto parallelization vs MPI - parallel-processing

I am using a Fortran code to run a large scale simulation on a supercomputer. I am able to run the code in serial, but I want to improve the turn around time. I am looking in to making it parallel and i have found that I can use auto-parallelization or MPI, the question I have is: which is more likely to improve the turn around time?
I was able to use Intel Fortran complier with the compiler flag -parallel -par-report to see which DO loops where made parallel, so if I run the complied code on 4 processors would that actually work or do I have to do something special?
In addition, do you know of any useful resources for me too learn MPI. I want to be able to use more processors to increase the simulation time that is my end goal.

More than likely, MPI is going to be faster than auto-parallelization. However, auto-parallelization would take about 0.5 seconds worth of work to get a speed-up of, say, 1.2 compared to Y hours (maybe even up to Q weeks) of trial-and-error debugging to get a speed-up of, say, 1.7.
If you're interested in self-learning MPI through a book, Gropp, Lusk, & Skjellum's Using MPI is probably a good start.

Answer a bit depends on nature of your hardware and your application/workload. Do you use multi-node cluster (most typical) or big shared memory machine? Assuming you are cluster user, you will have to use MPI or Fortran coarray for (more likely) distributed memory cross-node parallelism AND SOMETHING fon inter-node shared memory parallelism (SMP).
Shared memory parallelism can give you speed-up proportional to number of cores on a node(up to 32x with Xeons) or even more with coprocessors. Distributed memory parallelism can give you speedup proportional to number of nodes. Both types (or actually all 3 types) of parallelism have to be used these days to get reasonable performance. You may think of it like a hierarchy: 1.MPI or coarray on the top, 2.something for shared memory threading in the middle and 3. vectorization in the innermost level.
Well, from your question, it sounds like you are talking mostly about SMP multicore threading parallelism level. This is where -parallel Auto-Parallelization behaves. Dont expect big magic from auto-par. If you want to get better scalable parallelism, you have to try fortran OpenMP or MPI-for-shared memory. I would recommend OpenMP in most cases; its often easier to program and more performance.
But. its up to you and you really should think bigger- about all 3 levels of parallelism. If you plan to address all 3 levels, then probably optimal combination (since you are a happy intel fortran user) is 1. MPI for 1st level+ 2. OpenMP for SMP level + 3. AutoVectorization guided by OpenMP 4.0 pragma simd on 3rd level. Im not an expert in coarray, but it might be good alternative to 1.MPI.
My answer does make less sence if you dont deal with classic cluster hardware.

Related

OpenMP and MPI Interaction

Hi currently I'm working on a program that I have working in parallel using MPI. I was wondering if I could gain additional speed in the for loops using OpenMP so I could get more out of each processor. Would I gain anything out of doing this? Also how would I go about it?
From experience it really depend on your problem and on how many MPI processes you are using.
Using large amount of MPI processes usually improve data locality, but your parallelization might not allow large amount of processes.
The thought that you will gain for sure a decent speedup is very often wrong :-(... But then if you reach the point where you cant use more MPI processes due to lack of parallel efficiency you will probably gain the possibility of using more cores efficiently.
From experience you should target a small number of thread (4-8, 1/2 of the socket cores count), especially if you have only small loops (which should be the case if you reach the max number of MPI processes).
A good intro of hybrid parallelism:
http://www.openmp.org/press-release/sc13-tutorial-hybrid-mpi-openmp-parallel-programming/

Emulate a very fast (virtual) CPU core

I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job in for example 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible for example in Python with multiprocessing module: on a 4-core CPU, it would allow to use 100% of the processing power of the computer instead of only 25% for a single-process job.
But let's say we want to make faster a non-easily-splittable computation job.
Example: we are given a number generator function generate(n) that takes the previously-generated number as input, and "it is said to have 10^20 as period". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1..10^20
a = generate(a)
check if a == 17
Instead of having a computer's 4 CPU cores (3.3 Ghz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 Ghz (4*3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEnginnering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
There is a scheme studied mostly in the academy called "Thread Decomposition". It aims to do more or less what you ask about - given a single-threaded code, it tries to break it down into multiple threads in order to divide the work on a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time which is very hard), by a JIT runtime, or through HW binary-translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal as it can hardly match true manual parallelization done by a person how understands the code. It also doesn't simply scale performance according to the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/

Hybrid OpenMP + OpenMPI for mixed distributed & shared memory?

I am developing a code to perform a few very large computations by my standards. Based on single-CPU estimates, expected run-time is ~10 CPU years, and memory requirements are ~64 GB. Little to no IO is required. My serial version of the code in question (written in C) is working well enough and I have to start thinking about how to best parallelize the code.
I have access to clusters with ~64 GB RAM and 16 cores per node. I will probably limit myself to using e.g. <= 8 nodes. I'm imagining a setup where memory is shared between threads on a single node, with separate memory used on different nodes and relatively little communication between nodes.
From what I've read so far, the solution I have come up with is to use a hybrid OpenMP + OpenMPI design, using OpenMP to manage threads on individual compute nodes, and OpenMPI to pass information between nodes, like this:
https://www.rc.colorado.edu/crcdocs/openmpi-openmp
My question is whether this is the "best" way to implement this parallelization. I'm an experienced C programmer but have very limited experience in parallel programming (a little bit with OpenMP, none with OpenMPI; most of my jobs in the past were embarrassingly parallel). As an alternative suggestion, is it possible with OpenMPI to efficiently share memory on a single host? If so then I could avoid using OpenMP, which would make things slightly simpler (one API instead of two).
Hybrid OpenMP and MPI coding is most appropriate for problems where one can clearly identify two separate levels of parallelism - corase grained one and the fine grained one nested inside each coarse subdomain. Since fine grained parallelism requires lots of communication when implemented with message passing, it doesn't scale, because the communication overhead can become comparable to the amount of work being done. As OpenMP is a shared memory paradigm, no data communication is necessary, only access synchronisation, and it is more appropriate for finer grained parallel tasks. OpenMP also benefits from data sharing between threads (and the corresponding cache sharing on modern multi-core CPUs with shared last-level cache) and usually requires less memory than the equivalent message passing code, where some of the data might need to be replicated in all processes. MPI on the other side can run cross nodes and is not limited to running on a single shared-memory system.
Your words suggest that your parallelisation is very coarse grained or belongs to the so-called embarassingly parallel problems. If I were you, I would go hybrid. If you only employ OpenMP pragmas and don't use runtime calls (e.g. omp_get_thread_num()) your code can be compiled as both pure MPI (i.e. with non-threaded MPI processes) or as hybrid, depending on whether you enable OpenMP or not (you can also provide a dummy OpenMP runtime to enable code to be compiled as serial). This will give you both the benefits of OpenMP (data sharing, cache reusage) and MPI (transparent networking, scalability, easy job launching) with the added option to switch off OpenMP and run in an MPI-only mode. And as an added bonus, you will be able to meet the future, which looks like brining us interconnected many-many-core CPUs.

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application in Linux. here is the scenario:
Suppose I am having x instance of a class BloomFilter and I have some y GB of data(greater than memory available). I need to test membership for this y GB of data in each of the bloom filter instance. It is pretty much clear that parallel programming will help to speed up the task moreover since I am only reading the data so it can be shared across all processes or threads.
Now I am confused about which one to use Cilk, Cilk++ or OpenMP(which one is better). Also I am confused about which one to go for Multithreading or Multiprocessing
Cilk Plus is the current implementation of Cilk by Intel.
They both are multithreaded environment, i.e., multiple threads are spawned during execution.
If you are new to parallel programming probably OpenMP is better for you since it allows an easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragma to instruct the compiler which portions of the code has to run in parallel. If I understand your problem correctly you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for i in DATA:
check(i,array_of_bloom_filters);
the instances of different bloom filters are replicated in every thread in order to avoid contention while data is shared among thread.
update:
The paper actually consider an application which is very unbalanced, i.e., different taks (allocated on different thread) may incur in very different workload. Citing the paper that you mentioned "a highly unbalanced task graph that challenges scheduling,
load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronizations.
In other words, good load balancing comes always at a cost. The description of your problem is not very detailed but it seems to me that the problem you have is quite balanced. If this is not the case then go for Cilk, its work stealing approach its probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work-stealing, or dynamic scheduling, may lose much of the potential advantage of OpenMP in promoting cache locality by pinning threads with cache locality e.g. by OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.

How to get best performance of 8 core system using INTEL fortran

Please let me know how to set INTEL fortran compiler option to gain the best performance of 8 core system for IA32 and X64 bits. Actually I want to execute a fortran program and take the advantages of the all CPU time available in 8 core system. Now the program is only using 13 % of CPU time.
You can learn about autovectorization and guided auto-parallelization features of Intel FORTRAN in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf.
If you are doing linear algebra, solvers, FFTs, you might get best results if you map your problem into calls into the Intel Math Kernel Libraries: http://software.intel.com/en-us/articles/intel-mkl/
which are already multithreaded and vectorized and cache optimized.
If you are doing media / signal processing you might map your problem into calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
In my specific application, a computational network model containing several loops running thoughout 20k iterations, each iteration accessing a number of nested if's, just by enabling /Q2 level optimization in the compiler was sufficient to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I have noticed rising the optimization setting to the last level (/Q3), did do what you were asking (running all CPUs at about full load), but the computing time have NOT been reduced at all.
Therefore, if one has a small problem and several cases to test and processing capacity is the only bottleneck, it could be a good idea to open more than one Fortran solution and run those cases simultaneously.

Resources