Multiple single-thread binaries faster than one parallel OpenMP process?

Multiple single-thread binaries faster than one parallel OpenMP process? - performance

I have a relatively large C++ program that does a lot of dynamic memory allocation (size and structure are calculated at runtime; it's pointers all the way down).
The program is embarrassingly parallel, and OpenMP does speed it up, up to a point. However, I found that launching single-threaded binaries, and having them sync up through the file system runs faster, and slows down less as the number of cores used increases.
Is there an intuitive reason why this might happen? (sorry I don't have a code snippet.)
Thank you!

Related

OpenMP and MPI Interaction

Hi currently I'm working on a program that I have working in parallel using MPI. I was wondering if I could gain additional speed in the for loops using OpenMP so I could get more out of each processor. Would I gain anything out of doing this? Also how would I go about it?

From experience it really depend on your problem and on how many MPI processes you are using.
Using large amount of MPI processes usually improve data locality, but your parallelization might not allow large amount of processes.
The thought that you will gain for sure a decent speedup is very often wrong :-(... But then if you reach the point where you cant use more MPI processes due to lack of parallel efficiency you will probably gain the possibility of using more cores efficiently.
From experience you should target a small number of thread (4-8, 1/2 of the socket cores count), especially if you have only small loops (which should be the case if you reach the max number of MPI processes).
A good intro of hybrid parallelism:
http://www.openmp.org/press-release/sc13-tutorial-hybrid-mpi-openmp-parallel-programming/

Comparing performance of two copying techniques?

For copying a huge double array to another array I have following two options:
Option 1
copy(arr1, arr1+N, arr2);
Option 2
#pragma omp parallel for
for(int i = 0; i < N; i++)
arr2[i] = arr1[i];
I want to know for a large value of N. Which of the following will be the better (takes less time) option and when?"
System configuration:
Memory: 15.6 GiB
Processor: Intel® Core™ i5-4590 CPU # 3.30GHz × 4
OS-Type: 64-bit
compiler: gcc (Ubuntu 4.9.2-0ubuntu1~12.04) 4.9.2

Practically, if performance matters, measure it.
std::copy and memcpy are usually highly optimized, using sophisticated performance tricks. Your compiler may or may not be clever enough / have the right configuration options to achieve that performance from a raw loop.
That said, theoretically, parallelizing the copy can provide a benefit. On modern systems you must use multiple threads to fully utilize both your memory and cache bandwidth. Take a look at these benchmark results, where the first two rows compare parallel versus single threaded cache, and the last two rows parallel vs. single threaded main memory bandwidth. On a desktop system like yours, the gap is not very large. In a high-performance oriented system, especially with multiple sockets, more threads are very important to exploit the available bandwidth.
For an optimal solution, you have to consider things like not writing the same cache-line from multiple threads. Also if your compiler doesn't produce perfect code from the raw loop, you may have to actually run std::copy on multiple threads/chunks. In my tests, the raw loop performed much worse, because it doesn't use AVX. Only the Intel compiler managed to actually replace parts in the OpenMP loop with an avx_rep_memcpy - interestingly it did not perform this optimization with a non-OpenMP loop. The optimal number of threads for memory bandwidth is also usually not the number of cores, but less.
The general recommendation is: Start with a simple implementation, in this case the idiomatic std::copy, and later analyze your application to understand where the bottleneck actually is. Do not invest in complex, hard to maintain, system specific optimizations that may only affect a tiny faction of your codes overall runtime. If it turns out this is a bottleneck for your application, and your hardware resources are not utilized well, then you need to understand the performance characteristics of your underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.

Option 1 is better.
RAM is a shared resource, you can not simply parallelize it. When one core uses the RAM, the others wait.
Moreover, RAM is usually slower that the CPU -- RAM frequency is lower than CPU freqency, so in the case above even the single core has cycles that just wait on the RAM.
You also might consider memcpy() for copying, it might be faster than std::copy(). It generally depends from the implementation.
Last but not lest, always measure. For start, just put omp_get_wtime() before and after the piece of code you are measuring and see the difference.

(Un)Limit CPU usage - Windows - Mathematica

I'm programming in Mathematica 8.
When I run my programme, I check with Win8 task manager that the CPU usage is at 35% as soon as it starts to run, and my memory usage also increases to 44%. Does Win8 limit the amount of CPU usage that a certain programme may have? I need to make my computer to use more of its resources to run the programme faster.
Any help would be appreciated.

What's happening here is a common misconception about how processors approach problems involving heavy computation.
Although you may indeed have a powerful 4-core processing machine, and you have a program which is capable of using all 4 processing cores (which mathematica definitely is!), unless the code is written in a parallel fashion, you will only be able to use 1 core at a time to do the calculations. As Mysticial mentioned in the comment, not all code is parallelizable, in fact, I'd say a great many problems are not inherently able to be parallelized.
Check here for some good examples of problems that can be split up in a parallel fashion well. Now, your memory usage will simply increase with the size of the data that's being stored temporarily. (ex: storing a 69X69 matrix takes up less memory (RAM) than a 4000X4000, being parallel has little to do with this, and more with the problem itself).
Anyway, tl;dr, your computer is acting normally. To use all 100% of that 4-core machine you're using, check out This mathematica reference guide to parallel operations.

Fortran code using auto parallelization vs MPI

I am using a Fortran code to run a large scale simulation on a supercomputer. I am able to run the code in serial, but I want to improve the turn around time. I am looking in to making it parallel and i have found that I can use auto-parallelization or MPI, the question I have is: which is more likely to improve the turn around time?
I was able to use Intel Fortran complier with the compiler flag -parallel -par-report to see which DO loops where made parallel, so if I run the complied code on 4 processors would that actually work or do I have to do something special?
In addition, do you know of any useful resources for me too learn MPI. I want to be able to use more processors to increase the simulation time that is my end goal.

More than likely, MPI is going to be faster than auto-parallelization. However, auto-parallelization would take about 0.5 seconds worth of work to get a speed-up of, say, 1.2 compared to Y hours (maybe even up to Q weeks) of trial-and-error debugging to get a speed-up of, say, 1.7.
If you're interested in self-learning MPI through a book, Gropp, Lusk, & Skjellum's Using MPI is probably a good start.

Answer a bit depends on nature of your hardware and your application/workload. Do you use multi-node cluster (most typical) or big shared memory machine? Assuming you are cluster user, you will have to use MPI or Fortran coarray for (more likely) distributed memory cross-node parallelism AND SOMETHING fon inter-node shared memory parallelism (SMP).
Shared memory parallelism can give you speed-up proportional to number of cores on a node(up to 32x with Xeons) or even more with coprocessors. Distributed memory parallelism can give you speedup proportional to number of nodes. Both types (or actually all 3 types) of parallelism have to be used these days to get reasonable performance. You may think of it like a hierarchy: 1.MPI or coarray on the top, 2.something for shared memory threading in the middle and 3. vectorization in the innermost level.
Well, from your question, it sounds like you are talking mostly about SMP multicore threading parallelism level. This is where -parallel Auto-Parallelization behaves. Dont expect big magic from auto-par. If you want to get better scalable parallelism, you have to try fortran OpenMP or MPI-for-shared memory. I would recommend OpenMP in most cases; its often easier to program and more performance.
But. its up to you and you really should think bigger- about all 3 levels of parallelism. If you plan to address all 3 levels, then probably optimal combination (since you are a happy intel fortran user) is 1. MPI for 1st level+ 2. OpenMP for SMP level + 3. AutoVectorization guided by OpenMP 4.0 pragma simd on 3rd level. Im not an expert in coarray, but it might be good alternative to 1.MPI.
My answer does make less sence if you dont deal with classic cluster hardware.

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application in Linux. here is the scenario:
Suppose I am having x instance of a class BloomFilter and I have some y GB of data(greater than memory available). I need to test membership for this y GB of data in each of the bloom filter instance. It is pretty much clear that parallel programming will help to speed up the task moreover since I am only reading the data so it can be shared across all processes or threads.
Now I am confused about which one to use Cilk, Cilk++ or OpenMP(which one is better). Also I am confused about which one to go for Multithreading or Multiprocessing

Cilk Plus is the current implementation of Cilk by Intel.
They both are multithreaded environment, i.e., multiple threads are spawned during execution.
If you are new to parallel programming probably OpenMP is better for you since it allows an easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragma to instruct the compiler which portions of the code has to run in parallel. If I understand your problem correctly you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for i in DATA:
check(i,array_of_bloom_filters);
the instances of different bloom filters are replicated in every thread in order to avoid contention while data is shared among thread.
update:
The paper actually consider an application which is very unbalanced, i.e., different taks (allocated on different thread) may incur in very different workload. Citing the paper that you mentioned "a highly unbalanced task graph that challenges scheduling,
load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronizations.
In other words, good load balancing comes always at a cost. The description of your problem is not very detailed but it seems to me that the problem you have is quite balanced. If this is not the case then go for Cilk, its work stealing approach its probably the best solution for unbalanced workloads.

At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work-stealing, or dynamic scheduling, may lose much of the potential advantage of OpenMP in promoting cache locality by pinning threads with cache locality e.g. by OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio