Parallel speedup with OpenMP - performance

I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).
Scenario 1:
Sequential time measurement:
startTime=omp_get_wtime();
for loop computation
endTime=omp_get_wtime();
seq_time = endTime-startTime;
Parallel time measurement:
startTime = omp_get_wtime();
for loop computation:
#pragma omp parallel for reduction(+:pi) private(i)
for (blah blah) {
computation;
}
endTime=omp_get_wtime();
paralleltime = endTime-startTime;
speedup = seq_time/paralleltime;
Scenario 2:
Sequential time measurement:
for loop{
startTime=omp_get_wtime();
computation;
endTime=omp_get_wtime();
seq_time += endTime-startTime;
}
Parallel time measurement:
for loop computation:
#pragma omp parallel for reduction(+:pi, paralleltime) private(i, startTime, endTime)
for (blah blah) {
startTime=omp_get_wtime();
computation;
endTime=omp_get_wtime();
paralleltime += endTime-startTime;
}
speedup = seq_time/paralleltime;
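(For concreteness, a compilable sketch of the Scenario 1 measurement might look like the following. The question only shows reduction(+:pi), so the pi-by-rectangles loop body below is an illustrative stand-in, not the actual computation; Scenario 2 would differ only in moving the omp_get_wtime() calls inside the loop and reducing over paralleltime as well.)

#include <stdio.h>
#include <omp.h>

#define N 100000000   /* number of iterations (placeholder) */

int main(void)
{
    double step = 1.0 / (double)N;
    double pi, startTime, endTime, seq_time, paralleltime;
    int i;

    /* Sequential time measurement */
    pi = 0.0;
    startTime = omp_get_wtime();
    for (i = 0; i < N; i++) {
        double x = (i + 0.5) * step;
        pi += 4.0 / (1.0 + x * x);
    }
    endTime = omp_get_wtime();
    seq_time = endTime - startTime;

    /* Parallel time measurement (fork/join overhead is included in the interval) */
    pi = 0.0;
    startTime = omp_get_wtime();
    #pragma omp parallel for reduction(+:pi) private(i)
    for (i = 0; i < N; i++) {
        double x = (i + 0.5) * step;
        pi += 4.0 / (1.0 + x * x);
    }
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;

    printf("pi ~= %.12f\n", pi * step);
    printf("speedup = %.2f\n", seq_time / paralleltime);
    return 0;
}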
I know that Scenario 2 is NOT the best production code, but I think that it measures the actual theoretical performance by OVERLOOKING the overhead involved in OpenMP spawning and managing (thread context switching) several threads. So it should give us a linear speedup. Scenario 1, on the other hand, does consider the overhead involved in spawning and managing threads.
My doubt is this:
With Scenario 1, I am getting a speedup which starts out linear but tapers off as we move to a higher number of iterations. With Scenario 2, I am getting a fully linear speedup irrespective of the number of iterations. I was told that in reality, Scenario 1 will give me a linear speedup irrespective of the number of iterations. But I think it will not, because of the high overhead due to thread management. Can someone please explain to me why I am wrong?
Thanks! And sorry about the rather long post.

There are many situations where Scenario 2 won't give you linear speedup either -- false sharing between threads (or, for that matter, true sharing of shared variables which get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.
More generally, once you get to the point where you're putting timers inside for loops, you're looking at more fine-grained timing information than is really appropriate to measure with timers like this. You might well want to disentangle the thread-management overhead from the actual work being done, for a variety of reasons, but here you're trying to do that by inserting N extra calls to omp_get_wtime(), plus the arithmetic and the reduction operation, all of which have non-negligible overhead of their own.
If you really want accurate timing of how much time is being spent in the computation line, you really want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof or scalasca or openspeedshop (all free software), or a commercial package like Intel's VTune, will give you information about how much time is being spent on that line -- often even per thread -- with much lower overhead.

First of all, by the definition of speedup, you should use Scenario 1, which includes the parallel overhead.
In Scenario 2, you have the wrong code in the measurement of paralleltime. To satisfy your goal in Scenario 2, you need a per-thread paralleltime: allocate double paralleltime[NUM_THREADS] and index it with omp_get_thread_num() (note that this code will have false sharing, so you'd better allocate a 64-byte padded struct per thread). Then measure the per-thread computation time, and finally take the longest one to calculate a different kind of speedup (I'd call it a measure of parallelism rather than speedup).
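A sketch of what that per-thread measurement could look like (NUM_THREADS, the loop bound and the loop body are placeholder assumptions, not taken from the question):

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 8

typedef struct {
    double t;
    char pad[64 - sizeof(double)];  /* pad each slot to roughly one cache line to avoid false sharing */
} padded_time_t;

int main(void)
{
    padded_time_t per_thread[NUM_THREADS] = {{0}};
    double pi = 0.0;
    int i;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < 100000000; i++) {
        double t0 = omp_get_wtime();
        double x = (i + 0.5) * 1e-8;                  /* placeholder computation */
        pi += 4.0 / (1.0 + x * x);
        per_thread[omp_get_thread_num()].t += omp_get_wtime() - t0;
    }

    /* The longest per-thread compute time plays the role of "paralleltime". */
    double longest = 0.0;
    for (i = 0; i < NUM_THREADS; i++)
        if (per_thread[i].t > longest)
            longest = per_thread[i].t;

    printf("longest per-thread compute time: %f s\n", longest);
    return 0;
}

Note that this still pays for two omp_get_wtime() calls per iteration, which is exactly the instrumentation overhead the previous answer warns about.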
No, you may see sub-linear speedup even for Scenario 2, and super-linear speedup can also be observed. The potential reasons (i.e., excluding parallel overhead) are:
Load imbalance: the amount of work in the computation differs from iteration to iteration. This is the most common reason for low speedup (but you're saying load imbalance is not the case here).
Synchronization cost: if there is any kind of synchronization (e.g., mutex, event, barrier), you may have waiting/blocking time.
Caches and memory cost: when the computation requires large bandwidth and a large working set, parallel code may suffer from bandwidth limitations (though this is rare in reality) and cache conflicts. False sharing can also be a significant reason, but it's easy to avoid. A super-linear effect can also be observed because using multiple cores gives you more cache in total (i.e., private L1/L2 caches).
Scenario 1 will additionally include the overhead of the parallel library:
Forking/joining threads: although most parallel library implementations won't do physical thread creation/termination on every parallel construct.
Dispatching/joining logical tasks: even if physical threads are already created, you need to dispatch logical tasks to the threads (in general, M tasks onto N threads), and also do a sort of join operation at the end (e.g., the implicit barrier).
Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's static scheduling), the overhead is minimal. You can safely ignore it when the workload is sufficiently long (say, 0.1 second). Dynamic scheduling (such as work-stealing in TBB) has some overhead, but it is not significant once your workload is sufficient.
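If you want to convince yourself how small the fork/join and dispatch costs typically are, a rough (and admittedly crude) way is to time many executions of an almost-empty parallel region and divide; a minimal sketch, with made-up repetition counts:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int reps = 10000;
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        {
            (void)omp_get_thread_num();   /* trivial call; the region is otherwise empty */
        }
    }
    double per_region = (omp_get_wtime() - t0) / reps;
    printf("approx. fork/join + dispatch cost per parallel region: %.2f us\n",
           per_region * 1e6);
    return 0;
}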
I don't think your code (a one-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless the code is called millions of times per second. So it may be one of the other reasons mentioned above.
Keep in mind that there are many factors that determine speedup, from the inherent parallelism of the problem (i.e., load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).

What do you want to measure exactly? The overhead due to parallelism is part of the real execution time, hence IMHO scenario 1 is better.
Besides, according to your OpenMP directives, you're doing a reduction on some variable. In Scenario 1, you're taking this into account. In Scenario 2, you're not. So basically, you're measuring fewer things than in Scenario 1. This probably has some impact on your measurements too.
Otherwise, Jonathan Dursi's answer is excellent.

There are several options in OpenMP for how the work is distributed among the threads, and this can affect the linearity of your measurement method 1. Your measurement method 2 doesn't seem useful. What were you trying to get at with that? If you want to know single-thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.

Related

Emulate a very fast (virtual) CPU core

I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job into, for example, 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible for example in Python with the multiprocessing module: on a 4-core CPU, it would allow us to use 100% of the processing power of the computer instead of only 25% for a single-process job.
But let's say we want to speed up a computation job that is not easily splittable.
Example: we are given a number generator function generate(n) that takes the previously generated number as input, and "is said to have a period of 10^20". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1..10^20
a = generate(a)
check if a == 17
Instead of having a computer's 4 CPU cores (3.3 GHz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 GHz (4 * 3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEngineering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
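To make the distinction concrete, here is a small hedged sketch: a plain array sum has no loop-carried dependency and can be split across threads (shown here with an OpenMP reduction), while an LCG-style generate() chain carries its state from one iteration to the next and cannot be split as written. The function names, constants and sizes below are illustrative only.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Parallelizable: each element is independent, so chunks can be summed
 * separately and the partial sums combined (compile with -fopenmp). */
long long sum_array(const int *v, size_t n)
{
    long long s = 0;
    #pragma omp parallel for reduction(+:s)
    for (long long i = 0; i < (long long)n; i++)
        s += v[i];
    return s;
}

/* Not parallelizable as written: iteration i needs the result of i-1,
 * unless you know a mathematical shortcut for the recurrence. */
uint64_t iterate_lcg(uint64_t a, uint64_t steps)
{
    for (uint64_t i = 0; i < steps; i++)
        a = a * 6364136223846793005ULL + 1442695040888963407ULL;  /* "generate(a)" */
    return a;
}

int main(void)
{
    int v[1000];
    for (int i = 0; i < 1000; i++) v[i] = i;
    printf("%lld\n", sum_array(v, 1000));
    printf("%llu\n", (unsigned long long)iterate_lcg(17, 1000000));
    return 0;
}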
There is a scheme studied mostly in the academy called "Thread Decomposition". It aims to do more or less what you ask about - given a single-threaded code, it tries to break it down into multiple threads in order to divide the work on a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time which is very hard), by a JIT runtime, or through HW binary-translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal, as it can hardly match true manual parallelization done by a person who understands the code. It also doesn't simply scale performance with the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/

Can parallelization have a negative performance impact?

With the abundance of techniques being employed to increase parallelization in today's compiler tools (especially auto-parallelization of certain viable for-constructs, cf. the Intel C++ Compiler, Microsoft Visual Studio 2011, alongside various others), I wondered whether parallelization is always guaranteed to improve performance or, at worst, have no impact on it.
Are there any cases in which parallelization would have a distinctly negative impact on performance?
A quick internet search didn't yield much hope, so I decided to turn here to see if anyone has any knowledge of cases where parallelization has a detrimental impact on performance, or better yet, experience in a project where parallelization actually caused difficulties.
I am also curious whether there are any negative performance implications of auto-vectorization, although I find it quite unlikely that there would be.
Thanks in advance!
Parallelisation usually involves some data exchange between the different processing elements, since not all of them have exclusive access to all the data they need in order to complete their part of the computation. It could either be messages passed between different processes in an MPI job or synchronisation actions in a multithreaded program. Passing data around or synchronising things takes time, and that's why it is usually called communication or synchronisation overhead. There are different classes of problems depending on the ratio between overhead and computation.
Parallel algorithms that require no communication or synchronisation at all are called trivially (or "embarrassingly") parallel problems. An example of this class is a ray-tracing application: each pixel can be computed independently of all the others. Problems in this class scale linearly with the number of processing elements used (and sometimes even superlinearly because of caching effects): give it twice as many processing elements and it will take half as much time to perform the computation.
If any amount of communication or synchronisation is involved, then things get progressively worse as the ratio between communication/synchronisation and computation increases. Usually this is the case when the problem size is kept fixed while one increases the number of processing elements: the overhead grows with the number of processing elements while the amount of computation per element decreases.
Auto-vectorization can theoretically fall into "traps" where the overhead of getting all the elements in the right places is actually bigger than the time saved by doing things in parallel. Analyzing how much time a piece of code will take is hard, so it's hard for compilers to make the right decision.
Towards the end of these slides are some examples and statistics about auto-vectorization making the performance worse.
Used reasonably, parallelization (meaning parallel processing) usually has a positive performance impact.
But in some cases, from a developer's point of view, it can cause negative effects:
Allocating too many threads for parallel and/or multithreaded processing.
Fork/join parallelism and loop parallelization when the iterations are too small, so that spawning threads costs more time and resources than simply processing the items sequentially.
Typical multithreading/parallel execution problems like deadlocks, livelocks, thread starvation, race conditions, etc. (a minimal deadlock sketch appears below).
Debugging and diagnostics: it's harder to find bugs.
So all of this should be used reasonably.
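As an illustration of the deadlock item above, a minimal sketch using OpenMP locks (chosen only to stay consistent with the rest of this thread; the pattern is the same in any threading framework): two threads take the same pair of locks in opposite order.

#include <omp.h>

int main(void)
{
    omp_lock_t a, b;
    omp_init_lock(&a);
    omp_init_lock(&b);

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            omp_set_lock(&a);          /* thread 0: A then B */
            omp_set_lock(&b);
            omp_unset_lock(&b);
            omp_unset_lock(&a);
        } else {
            omp_set_lock(&b);          /* thread 1: B then A -- potential deadlock */
            omp_set_lock(&a);
            omp_unset_lock(&a);
            omp_unset_lock(&b);
        }
    }

    omp_destroy_lock(&a);
    omp_destroy_lock(&b);
    return 0;
}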
And some links. Sorry they are .NET/Microsoft specific, but the problems described there are the same:
Potential Pitfalls in Data and Task Parallelism
Potential Pitfalls with Parallel LINQ (PLINQ)
A good book where common problems and pitfalls are described:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4
From a more theoretical point of view, you may be interested in problems that are not in NC, i.e. the class of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
Off the top of my head, I cannot think of any computational problem that is not, in some way or another, parallelizable. What I have encountered many times though are problems that have been badly parallelized.
Badly parallelized programs can easily be slower than their sequential versions. This can be a result of:
Massive overheads due to the parallelism being too fine-grained, e.g. the amount of work performed per thread is negligible compared to the overhead of starting/scheduling the operation. In OpenMP, this could be the case for a #pragma omp parallel for schedule(dynamic,k) with a small chunk size k.
Repeated concurrent access to shared resources, e.g. if all threads have to wait to access some resource or memory location sequentially. In OpenMP, this can be caused by too many or too large #pragma omp critical sections.
Over-use of slow atomic operations to update variables shared between threads, e.g. using #pragma omp atomic where, in the sequential case, faster regular memory access would be used (a small sketch contrasting this with a reduction follows below).
In summary, and in my opinion, there are few inherently sequential problems, but mountains of badly-implemented parallel solutions.
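As a concrete illustration of the second and third points above, here is a small sketch that computes the same trivial sum twice: once serialized through a per-iteration atomic update, and once with a reduction where each thread keeps a private partial sum. The loop body and size are made up for illustration only.

#include <stdio.h>
#include <omp.h>

#define N 10000000

int main(void)
{
    double slow = 0.0, fast = 0.0;

    /* Badly parallelized: every iteration contends on the shared variable. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        slow += 1.0 / (i + 1.0);
    }

    /* Same work with a reduction: private partial sums are combined once at the end. */
    #pragma omp parallel for reduction(+:fast)
    for (int i = 0; i < N; i++)
        fast += 1.0 / (i + 1.0);

    printf("%f %f\n", slow, fast);
    return 0;
}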

For parallel algorithm with N threads, can performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear with the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible in the general case: at best we can divide the work perfectly by N. The reason is that with a serial fraction S, Amdahl's formula for speedup is
Speedup = 1 / (S + (1 - S)/N)
where N is the number of processors. With no serial portion (S = 0), this reduces to just N.

CUDA: Bigger problems in threads

Almost all of the CUDA example code describes doing near-atomic operations on large data sets. What kind of practical limitations are there to the size of the problem each thread can work on?
For example, I have another question open at the minute that involves per-thread matrix solving. Is this kind of thing too large to put within each thread?
CUDA is a data parallel programming model for what is effectively an SIMD architecture, so obviously it isn't as flexible as a general purpose multithreaded or MIMD architecture. Certainly kernels can be a lot more complex than simple arithmetic operations.
In my own work I use CUDA a lot for solving partial differential equations (so the finite element, finite difference and finite volume methods), where every thread processes a cell or element from a discretised continuum. In that sort of calculation, there are a lot of FLOPs per thread per cell/element.
The key area to be mindful of is branch divergence. Because it is an SIMD architecture under the hood, code with a lot of branching within a warp of threads (which is effectively the SIMD width) will suffer performance penalties. But branch divergence and code complexity need not be synonymous; you can write very "branchy" and "loopy" code which will run well, as long as the threads within any given warp don't diverge too often. In FLOP- and IOP-heavy algorithms, that is usually not too hard to achieve.
I just want to reiterate what talonmies said: there is no real limit to the "size" of a kernel in number of operations. As long as the computation is parallel, CUDA will be effective!
As far as practical considerations go, I would just add a few small notes:
Long-running kernels can time out, depending on the OS (or when profiling with cudaProf). You might have to change a setting somewhere to increase the maximum kernel execution time.
Long-running kernels on systems without a dedicated GPU can freeze the display (interrupting the UI).
Warps are executed asynchronously: one warp can access memory while another performs arithmetic, in order to use clock cycles effectively. Long-running kernels might benefit more from attention to this kind of optimization. I'm not really sure about this last one.

Cost/benefit of parallelization based on code size?

How do you figure out whether it's worth parallelizing a particular code block based on its code size? Is the following calculation correct?
Assume:
Thread pool consisting of one thread per CPU.
CPU-bound code block with execution time of X milliseconds.
Y = min(number of CPUs, number of concurrent requests)
Therefore:
Cost: code complexity, potential bugs
Benefit: (X * Y) milliseconds
My conclusion is that it isn't worth parallelizing for small values of X or Y, where "small" depends on how responsive your requests must be.
One thing that will help you figure that out is Amdahl's Law
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x.
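To see where the 20x comes from: the one non-parallelizable hour is a serial fraction S = 1/20 = 0.05, and Amdahl's formula gives Speedup = 1 / (S + (1 - S)/N) = 1 / (0.05 + 0.95/N), which approaches 1/0.05 = 20 as N grows.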
Figure out what speedup you want to achieve and how much parallelism you can actually get, then see if it's worth it.
It depends on many factors, such as the difficulty of parallelizing the code, the speedup obtained from it (there are overhead costs in dividing the problem and joining the results), and the fraction of time the code spends there (Amdahl's Law).
Well, the benefit is really more like:
(X * (Y-1)) * Tc * Pf
Where Tc is the cost of the threading framework you are using. No threading framework scales perfectly, so using 2x threads will likely give, at best, 1.9x speed.
Pf is some factor for parallelization that depends entirely on the algorithm (i.e., whether or not you'll need to lock, which will slow the process down).
Also, it's Y-1, since single-threaded is basically the case Y == 1.
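As a purely hypothetical worked example: with X = 100 ms, Y = 4, Tc = 0.95 and Pf = 0.9, the benefit is roughly (100 * (4 - 1)) * 0.95 * 0.9 ≈ 256 ms saved per request, to be weighed against the fixed cost in complexity and potential bugs.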
As for deciding, it's also a matter of user frustration/expectation: if the user is annoyed at waiting for something, there's a greater benefit than for a task the user doesn't really mind waiting for -- which is not always just about wait times; it's partly about expectations.
