Can parallelization have a negative performance impact?

Can parallelization have a negative performance impact? - performance

With the abundance of techniques being employed to increase parallelization in today's compiler-tools (especially auto-parallelization of certain viable for-constructs, c.f. the Intel C++ Compiler, Microsoft Visual Studio 2011, alongside various others), I wondered if parallelization is always guaranteed to improve or have no impact on performance.
Are there any cases in which parallelization would have a distinctly negative impact on performance?
A quick internet search didn't yield much hope, so I decided to turn here to see if anyone has any knowledge of cases where parallelization has a detrimental impact on performance, or better yet, experience in a project where parallelization actually caused difficulties.
I am also curious about whether there are any negative performance implication of auto-vectorization, although I find it quite unlikely that there would be.
Thanks in advance!

Parallelisation usually involves some abstract data exchange between the different processing elements since not all of them have exclusive access to all the data that it needs in order to complete its part of the computation. It could either be messages passed between different processes in an MPI job or it could be synchronisation actions in a multithreaded program. Passing data around or synchronising things takes time and that's why it is usually called communication or synchronisation overhead. There are different classes of problems depending on the ratio between overhead and computation.
Parallel algorithms that require no communication or synchronisation at all are called trivially (or "embarrassingly") parallel problems. An example of this class is a ray-tracing application: each pixel can be computed independently of all the others. Problems in this class scale linearly with the number of processing elements used (and sometimes even superlinearly because of caching effects) - give it twice as many processing elements and it will take twice as less time to perform the computation.
If any amount of communication or synchronisation is involved then things get progressively worse as the ratio between communication/synchronisation and computation increases. Usually this is the case when the problem size is kept fixed as one increases the number of processing elements. Usually the overhead increases with the number of processing elements while the amount of computation per element decreases.

Auto-vectorization can theoretically fall into "traps" where the overhead of getting all the elements in the right places is actually bigger than the time saved by doing things in parallel. Analyzing how much time a piece of code will take is hard, so it's hard for compilers to make the right decision.
Towards the end of these slides are some examples and statistics about auto-vectorization making the performance worse.

Usually with reasonable usage parallelization (mean parallel processing) gives positive performance imact.
But in some cases, from developer point of view, it could cause negative effects:
When allocating to many thread for parallel and/or multithreading processing.
Fork/join parallelism and loops parallelization when iteration is to small and allocating threads costs more time and resources than simple to process items synchronously
Typical multithreading/parallel execution problems like deadlocks, livelocks, threads stravation, race conditions etc.
Debugging and diagnostic, it's harder to find bugs
So all should be used reasonably.
And some links. Sorry they are .NET/Microsoft specific but problems described there are same:
Potential Pitfalls in Data and Task Parallelism
Potential Pitfalls with Parallel LINQ (PLINQ)
Good book where common problems and pitfalls are described:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4

From a more theoretical point of view, you may be interested in problems that are not in NC, i.e. the class of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
Off the top of my head, I cannot think of any computational problem that is not, in some way or another, parallelizable. What I have encountered many times though are problems that have been badly parallelized.
Badly parallelized programs can easily be slower than their sequential versions. This can be a result of:
Massive overheads due to the parallelism being too fine-grained, e.g. the amount of work performed per thread is negligible compared to the overhead of starting/scheduling the operation. In OpenMP, this could be the case of a #pragma omp parallel for schedule(dynamic,k) for a small chunk size k.
Repeated concurrent access to shared resources, e.g. if all threads have to wait to access some resource or memory location sequentially. In OpenMP, this can be caused by too many or too large #pragma omp critical sections.
Over-use of slow atomic operations to update variables shared between threads, e.g. using #pragma omp atomic where, in the sequential case, faster regular memory access would be used.
In summary, and in my opinion, there are few inherently sequential problems, but mountains of badly-implemented parallel solutions.

Related

Spinlock implementation reasoning

I want to improve the performance of a program by replacing some of the mutexes
with spinlocks. I have found a spinlock implementation in
http://www.boost.org/doc/libs/1_36_0/boost/detail/spinlock_sync.hpp
which I intend to reuse. I believe this implementation is safer than simpler implementations in which threads keep trying forever like the one found here
http://www.boost.org/doc/libs/1_54_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.example_spinlock.implementation
But i need to clarify some things on the yield function found here
http://www.boost.org/doc/libs/1_36_0/boost/detail/yield_k.hpp
First of all I can assume that the numbers 4,16,32 are arbitrary. I actually tested some other values and I have found that I got best performance in my case by using other values.
But can someone explain the reasoning behind the yield code. Specifically why do we need all three
BOOST_SMT_PAUSE
sched_yield and
nanosleep

Yes, this concept is known as "adaptive spinlock" - see e.g. https://lwn.net/Articles/271817/.
Usually the numbers are chosen for exponential back-off: https://geidav.wordpress.com/tag/exponential-back-off/
So, the numbers aren't arbitrary. However, which "numbers" work for your case depend on your application patterns, requirements and system resources.
The three methods to introduce "micro-delays" are designed explicitly to balance the cost and the potential gain:
zero-cost is to spin on high-CPU, but it results in high power consumption and wasted cycles
a small "cheap" delay might be able to prevent the cost of a context-switch while reducing the CPU load relative to a busy-spin
a simple yield might allow the OS to avoid a context switch depending on other system load (e.g. if the number of threads < number logical cores)
The trade-offs with these are important for low-latency applications where the effect of a context switch or cache misses are significant.
TL;DR
All trade-offs try to find a balance between wasting CPU cycles and losing cache/thread efficiency.

Are there any metrics for both performance and energy efficiency?

For many parallel programs, the parallelization brings substantial cost, making the speedup sublinear. In this case, the parallel versions are less energy efficient than sequential one.
However, people may care both the time performance and energy efficiency, are there any specific metrics commonly used for this purpose?
More specifically, a metric that can determine the number of threads for best energy and performance goal.

The most common metric is performance per watt. Take a look at the "Green500 List". Wikipedia also has an article on performance per watt. The metric is not as clear cut as it first appears because "performance" is not clear cut. FLOPS is very popular at the moment but it has a lot of deficiencies. I disagree that performance/watt can't be used to evaluate the performance of software. Depending upon your application, you may want to use performance/watt/sec.
I don’t know why you want to determine energy efficiency if parallelism is costing you. In fact, I don’t really understand how parallelism can be decreasing energy efficiency unless you are using a single core machine, doing pure computation, and are doing a lot of thrashing between threads. I’m guessing that this is not your own code.
Software power efficiency: The most important two factors are:
getting your computation done faster
making sure that periods between computation are truly idle
These factors break down into a whole host of other more concrete guidelines:
avoid timing interrupts and (shutter) polling
minimize synchronization constructs
exploit parallelism (thread and vectorization)
use a good optimizing compiler
use a thread pool if you are continuously creating and terminating a lot of threads
use efficient high performance libraries
avoid virtual machines (e.g. java and flash)
use a modern (tickless) OS
etc. etc. etc
Dividing your computation between parallel threads should decrease computation times, or else why add its complications? (Yes, I understand that some programming constructs, such as recursion, can result in simpler and cleaner code but worse performance, but these are exceptions.) Decreasing computation should increase energy efficiency. If it doesn't, look at the algorithm and code practice.
If you can give me more detail about your app, I may be able to make more concrete suggestions.

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application in Linux. here is the scenario:
Suppose I am having x instance of a class BloomFilter and I have some y GB of data(greater than memory available). I need to test membership for this y GB of data in each of the bloom filter instance. It is pretty much clear that parallel programming will help to speed up the task moreover since I am only reading the data so it can be shared across all processes or threads.
Now I am confused about which one to use Cilk, Cilk++ or OpenMP(which one is better). Also I am confused about which one to go for Multithreading or Multiprocessing

Cilk Plus is the current implementation of Cilk by Intel.
They both are multithreaded environment, i.e., multiple threads are spawned during execution.
If you are new to parallel programming probably OpenMP is better for you since it allows an easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragma to instruct the compiler which portions of the code has to run in parallel. If I understand your problem correctly you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for i in DATA:
check(i,array_of_bloom_filters);
the instances of different bloom filters are replicated in every thread in order to avoid contention while data is shared among thread.
update:
The paper actually consider an application which is very unbalanced, i.e., different taks (allocated on different thread) may incur in very different workload. Citing the paper that you mentioned "a highly unbalanced task graph that challenges scheduling,
load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronizations.
In other words, good load balancing comes always at a cost. The description of your problem is not very detailed but it seems to me that the problem you have is quite balanced. If this is not the case then go for Cilk, its work stealing approach its probably the best solution for unbalanced workloads.

At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work-stealing, or dynamic scheduling, may lose much of the potential advantage of OpenMP in promoting cache locality by pinning threads with cache locality e.g. by OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.

Parallel speedup with OpenMP

I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).
Scenario 1:
Sequential time measurement:
startTime=omp_get_wtime();
for loop computation
endTime=omp_get_wtime();
seq_time = endTime-startTime;
Parallel time measurement:
startTime = omp_get_wtime();
for loop computation (#pragma omp parallel for reduction (+:pi) private (i)
for (blah blah) {
computation;
}
endTime=omp_get_wtime();
paralleltime = endTime-startTime;
speedup = seq_time/paralleltime;
Scenario 2:
Sequential time measurement:
for loop{
startTime=omp_get_wtime();
computation;
endTime=omp_get_wtime();
seq_time += endTime-startTime;
}
Parallel time measurement:
for loop computation (#pragma omp parallel for reduction (+:pi, paralleltime) private (i,startTime,endTime)
for (blah blah) {
startTime=omp_get_wtime();
computation;
endTime=omp_get_wtime();
paralleltime = endTime-startTime;
}
speedup = seq_time/paralleltime;
I know that Scenario 2 is NOT the best production code, but I think that it measures the actual theoretical performance by OVERLOOKING the overhead involved in openmp spawning and managing (thread context switching) several threads. So it will give us a linear speedup. But Scenario 1 considers the overhead involved in spawning and managing threads.
My doubt is this:
With Scenario 1, I am getting a speedup which starts out linear, but tapers off as we move to a higher number of iterations. With Scenario 2, I am getting a full on linear speedup irrespective of the number of iterations. I was told that in reality, Scenario 1 will give me a linear speedup irrespective of the number of iterations. But I think it will not because of the high overload due to thread management. Can someone please explain to me why I am wrong?
Thanks! And sorry about the rather long post.

There's many situations where scenario 2 won't give you linear speedup either -- false sharing between threads (or, for that matter, true sharing of shared variables which get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.
More generally, once you get to the point where you're putting timers inside for loops, you're considering more fine-grained timing information than is really appropriate to measure using timers like this. You might well want to be able to disentangle the thread management overhead from the actual work being done for a variety of reasons, but here you're trying to do that by inserting N extra function calls to omp_get_wtime(), as well as the arithmetic and the reduction operation, all of which will have non-negligable overhead of their own.
If you really want accurate timing of how much time is being spent in the computation; line, you really want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof or scalasca or openspeedshop (all free software) or Intel's VTune or something (commercial package) will give you the information about how much time is being spent on that line -- often even by thread -- with much lower overhead.

First of all, by the definition of the speedup, you should use the scenario 1, which includes parallel overhead.
In the scenario 2, you have the wrong code in the measurement of paralleltime. To satisfy your goal in the scenario 2, you need to have a per-thread paralleltime by allocating int paralleltime[NUM_THREADS] and accessing them by omp_get_thread_num() (Note that this code will have false sharing, so you'd better to allocate 64-byte struct with padding). Then, measure per-thread computation time, and finally take the longest one to calculate a different kind of speedup (I'd say a sort of parallelism instead).
No, you may see sub-linear speedup for even Scenario 2, or even super-linear speedup could be obtained. The potential reasons (i.e., excluding parallel overhead) are:
Load imbalance: the workload length in compuation is different on iteration. This would be the most common reason of low speedup (But, you're saying load imbalance is not the case).
Synchronization cost: if there are any kind of synchronizations (e.g., mutex, event, barrier), you may have waiting/block time.
Caches and memory cost: when computation requires large bandwidth and high working set set, parallel code may suffer from bandwidth limitation (though it's rare in reality) and cache conflicts. Also, false sharing would be a significant reason, but it's easy to avoid it. Super-linear effect can be also observed because using multicore may have more caches (i.e., private L1/L2 caches).
In the scenario 1, it will include the overhead of parallel libraries:
Forking/joining threads: although most parallel libraries implementations won't do physical thread creation/termination on every parallel construct.
Dispatching/joining logical tasks: even if physical threads are already created, you need to dispatch logical task to each thread (in general M task to N thread), and also do a sort of joining operation at the end (e.g., implicit barrier).
Scheduling overhead: for static scheduling (as shown in your code, which uses OpenMP's static scheduling), the overhead is minimal. You can safely ignore the overhead when the workload is sufficiently enough (say 0.1 second). However, dynamic scheduling (such as work-stealing in TBB) has some overhead, but it's not significant once your workload is sufficient.
I don't think your code (1-level static-scheduling parallel loop) does have high parallel overhead due to thread management, unless this code is called million times per a second. So, may be the other reasons that I've mentioned in the above.
Keep in mind that there are many factors that will determine speedup; from the inherent parallelism (=load imbalance and synchronizations) to the overhead of a parallel library (e.g., scheduling overhead).

What do you want to measure exactly? The overhead due to parallelism is part of the real execution time, hence IMHO scenario 1 is better.
Besides, according to your OpenMP directives, you're doing a reduction on some array. In scenario 1, you're taking this into account. In scenario 2, you're not. So basically, you're measuring less things than in scenario 1. This probably has some impact on your measurements too.
Otherwise, Jonathan Dursi's answer is excellent.

There are several options with OpenMP for how the work is distributed among the threads. This can impact the linearity of your measurement method 1. You measurement method 2 doesn't seem useful. What were you trying to get at with that? If you want to know single thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.

CUDA: Bigger problems in threads

Almost all of the CUDA exemplar code describes doing near-atomic operations on large data sets. What kind of practical limitations are the to the size of a problem each thread can do?
For example, I have another question open at the minute that involves per-thread matrix solving. Is this kind of thing too large to put within each thread?

CUDA is a data parallel programming model for what is effectively an SIMD architecture, so obviously it isn't as flexible as a general purpose multithreaded or MIMD architecture. Certainly kernels can be a lot more complex than simple arithmetic operations.
In my own work I use CUDA a lot for solving partial differential equations (so the finite element, finite difference and finite volume methods), which every thread processes a cell or element from a discretised continuum. In that sort of calculation, there are a lot of FLOPs per thread per cell/element.
The key area to be mindful of is branch divergence. Because it is an SIMD architecture under the hood, code where there is a lot of branching within a warp of threads (which is effectively the SIMD width), will suffer performance penalties. But branch divergence and code complexity need not be synonymous, you can write very "branchy" and "loopy" code which will run well, as long as threads within any given warp don't diverge too often. In FLOP and IOP heavy algorithms, that is usually not too hard to achieve.

I just want to reiterate talonmies and say that there is no real limit to the "size" of a kernel in number of operations. As long as the computation is parallel, CUDA will be effective!
As far a practical considerations, I would just add a few small notes
long running kernels can timeout, depending on os (or when profiling with cudaProf). You might have to change a setting somewhere to increase maximum kernel execution time.
long running kernels on systems without a dedicated gpu can freeze the display (interrupting ui).
warps are executed asynchronously - one warp can access memory while another performs arithmetic in order to use clock cycles effectively. long running kernels might benefit more from attention to this kind of optimization. i'm not really sure about this last one.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio