What's the difference between parallel cost and parallel work? - performance

I read a paper in which the parallel cost of a (parallel) algorithm is defined as CP(n) = p * TP(n), where p is the number of processors, TP(n) the parallel running time and n the input size. An algorithm is cost-optimal if CP(n) stays approximately constant as p grows, i.e. if the algorithm uses two processors instead of one on the same input, it takes only half the time.
There is another concept called parallel work, which I don't fully grasp.
The paper says it measures the number of executed parallel operations (OPs).
An algorithm is work-optimal if it performs (asymptotically) as many OPs as its sequential counterpart. A cost-optimal algorithm is always work-optimal, but not vice versa.
Can someone illustrate the concept of parallel work and show the similarities and differences to parallel cost?

It sounds like parallel work is simply a measure of the total number of instructions run by all processes, but counting the ones executed in parallel only once. If that's the case, then it's more closely related to the time term in your parallel cost equation. Think of it this way: if the parallel version of the algorithm runs more instructions than the sequential version (meaning it is not work-optimal), it will necessarily take more time, assuming all instructions are equal in duration. Typically these extra instructions come at the beginning or end of the parallel algorithm and are viewed as overhead of the parallel algorithm; they can correspond to extra bookkeeping, communication, or final aggregation of the result.
Thus an algorithm that is not work-optimal cannot be cost-optimal.
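To make the distinction concrete, here is a small, hedged sketch (my own example, not from the paper): summing n numbers with a tree reduction performs the same n - 1 additions as the sequential loop, so it is work-optimal, but with p = n/2 processors and TP(n) = O(log n) its cost p * TP(n) = O(n log n) exceeds the sequential O(n), so it is not cost-optimal.
/* Illustrative sketch: tree-style parallel sum. Work = N - 1 additions
 * (work-optimal); with ~N/2 processors the cost is O(N log N), which is
 * larger than the sequential O(N), so it is not cost-optimal. */
#include <stdio.h>
#include <omp.h>

#define N 1024   /* assume N is a power of two to keep the sketch short */

int main(void) {
    double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* log2(N) rounds; in the round with stride s there are N/(2s)
     * independent additions that can all run in parallel */
    for (int s = 1; s < N; s *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i += 2 * s)
            a[i] += a[i + s];    /* one unit of work per addition */
    }

    printf("sum = %f\n", a[0]);  /* total work: N - 1 additions */
    return 0;
}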

Another name for the "parallel cost" is the "cost of context switching", although it can also arise from interdependencies between the different threads.
Consider sorting.
If you implement Bubble Sort in parallel, where each thread just picks up the next comparison, you will have a huge cost to run it "in parallel", to the point where it will essentially be a messed-up sequential version of the algorithm, and your parallel work will be essentially zero because most threads just wait most of the time.
Now compare that to Quick Sort and spawn a thread for each split of the original array: the threads don't need data from other threads, and asymptotically, for larger starting arrays, the cost of spinning up these threads is paid for by the parallel nature of the work done... if the system has infinite memory bandwidth. In reality it wouldn't be worth spinning up more threads than there are memory access channels, because the threads still have an invisible (from the code's perspective) dependency between them: their shared, sequential access to memory.

Short
I think parallel cost and parallel work are two sides of the same coin. They're both measures for speed-up,
whereby the latter is the theoretical concept enabling the former.
Long
Let's consider n-dimensional vector addition as a problem that is easy to parallelize, since it can be broken down into n independent tasks.
The problem is inherently work-optimal, because the parallel work doesn't change when the algorithm runs in parallel: there are always n vector components that need to be added.
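For concreteness, a minimal sketch (my own, not part of the original answer) of those n independent additions, written with OpenMP:
/* n-dimensional vector addition: each component is an independent task,
 * so the total work (n additions) is the same with or without threads. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];   /* static: too large for the stack */
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for          /* no dependencies between iterations */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
    return 0;
}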
The parallel cost, on the other hand, cannot be assessed without executing the algorithm on a (virtual) machine, where practical limitations like a shortage of memory bandwidth arise. Thus a work-optimal algorithm can only be cost-optimal if the hardware (or the hardware access patterns) allows the problem, and the time, to be divided and executed perfectly.
Cost-optimality is a stronger demand and, as I'm realizing now, just another illustration of efficiency.
Under normal circumstances a cost-optimal algorithm will also be work-optimal, but if the speed-up gained by caching, memory access patterns etc. is super-linear, i.e. the execution time with two processors is one-tenth instead of the expected half, then it is possible for an algorithm that performs more work, and thus is not work-optimal, to still be cost-optimal.

Related

Computing time in relation to number of operations

is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly language operations, it is still not possible to calculate the computation time based on the number of instructions and CPU speed alone. Some instructions take multiple CPU cycles. If you have a lot of memory accesses, you will likely have cache misses and have to fetch data from main memory (or, if it has been paged out, from disk), which is unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel goes, a program can't just do that by itself. You need to actually write the program as a parallel program; a for loop on its own is a sequential construct. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can get is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
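For illustration, here is a minimal sketch of what "dividing up the work" can look like. It is my own example, and it uses four OpenMP threads rather than four forked processes (an assumption made only to keep the snippet short); the iteration count and the per-iteration work are made up.
/* Hypothetical sketch: splitting a long loop across 4 threads with OpenMP. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long long n = 100000000LL;   /* made-up iteration count */
    double sum = 0.0;

    omp_set_num_threads(4);
    /* schedule(static): each of the 4 threads gets a contiguous quarter
     * of the iterations; reduction(+:sum) combines the partial results */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long long i = 0; i < n; i++)
        sum += (double)i * 0.5;        /* stand-in for the real work */

    printf("sum = %f\n", sum);
    return 0;
}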

Can parallelization have a negative performance impact?

With the abundance of techniques being employed to increase parallelization in today's compiler tools (especially auto-parallelization of certain viable for-constructs, cf. the Intel C++ Compiler, Microsoft Visual Studio 2011, and various others), I wondered whether parallelization is always guaranteed to improve performance or at least to have no impact on it.
Are there any cases in which parallelization would have a distinctly negative impact on performance?
A quick internet search didn't yield much hope, so I decided to turn here to see if anyone has any knowledge of cases where parallelization has a detrimental impact on performance, or better yet, experience in a project where parallelization actually caused difficulties.
I am also curious about whether there are any negative performance implications of auto-vectorization, although I find it quite unlikely that there would be.
Thanks in advance!
Parallelisation usually involves some form of data exchange between the different processing elements, since not all of them have exclusive access to all the data they need in order to complete their part of the computation. It could either be messages passed between different processes in an MPI job, or it could be synchronisation actions in a multithreaded program. Passing data around or synchronising things takes time, and that's why it is usually called communication or synchronisation overhead. There are different classes of problems depending on the ratio between overhead and computation.
Parallel algorithms that require no communication or synchronisation at all are called trivially (or "embarrassingly") parallel problems. An example of this class is a ray-tracing application: each pixel can be computed independently of all the others. Problems in this class scale linearly with the number of processing elements used (and sometimes even superlinearly because of caching effects): give them twice as many processing elements and they will take half as much time to perform the computation.
If any amount of communication or synchronisation is involved, then things get progressively worse as the ratio of communication/synchronisation to computation increases. Usually this is the case when the problem size is kept fixed while the number of processing elements increases: the overhead grows with the number of processing elements while the amount of computation per element decreases.
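As a rough sketch of how synchronisation overhead can dominate the computation (my own example, not from the answer), compare a shared accumulator updated inside a critical section with the same loop written as a reduction:
/* Illustrative only: the critical-section version serialises every update,
 * so its synchronisation-to-computation ratio is terrible; the reduction
 * keeps per-thread partial sums and merges them once per thread. */
#include <stdio.h>
#include <omp.h>

#define N 10000000

int main(void) {
    double slow = 0.0, fast = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp critical      /* every thread waits here, every iteration */
        slow += i * 1e-7;
    }

    #pragma omp parallel for reduction(+:fast)
    for (int i = 0; i < N; i++)
        fast += i * 1e-7;         /* merged once per thread at the end */

    printf("%f %f\n", slow, fast);
    return 0;
}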
Auto-vectorization can theoretically fall into "traps" where the overhead of getting all the elements in the right places is actually bigger than the time saved by doing things in parallel. Analyzing how much time a piece of code will take is hard, so it's hard for compilers to make the right decision.
Towards the end of these slides are some examples and statistics about auto-vectorization making the performance worse.
Usually, with reasonable usage, parallelization (meaning parallel processing) has a positive performance impact.
But in some cases, from the developer's point of view, it can have negative effects:
Allocating too many threads for parallel and/or multithreaded processing.
Fork/join parallelism and loop parallelization when the work per iteration is too small, so that allocating threads costs more time and resources than simply processing the items synchronously.
Typical multithreading/parallel execution problems like deadlocks, livelocks, thread starvation, race conditions etc.
Debugging and diagnostics: it's harder to find bugs.
So it should all be used reasonably.
And some links. Sorry, they are .NET/Microsoft-specific, but the problems described there are the same:
Potential Pitfalls in Data and Task Parallelism
Potential Pitfalls with Parallel LINQ (PLINQ)
A good book where common problems and pitfalls are described:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4
From a more theoretical point of view, you may be interested in problems that are not in NC, i.e. the class of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
Off the top of my head, I cannot think of any computational problem that is not, in some way or another, parallelizable. What I have encountered many times though are problems that have been badly parallelized.
Badly parallelized programs can easily be slower than their sequential versions. This can be a result of:
Massive overheads due to the parallelism being too fine-grained, e.g. the amount of work performed per thread is negligible compared to the overhead of starting/scheduling the operation. In OpenMP, this could be the case for a #pragma omp parallel for schedule(dynamic,k) with a small chunk size k (see the sketch after this list).
Repeated concurrent access to shared resources, e.g. if all threads have to wait to access some resource or memory location sequentially. In OpenMP, this can be caused by too many or too large #pragma omp critical sections.
Over-use of slow atomic operations to update variables shared between threads, e.g. using #pragma omp atomic where, in the sequential case, faster regular memory access would be used.
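A minimal, hedged sketch of the first point above (my own example, with made-up sizes): the same loop once with an overly fine-grained dynamic schedule and once with a coarser chunk size.
/* Sketch only: schedule(dynamic,1) hands out one iteration at a time, so
 * threads return to the scheduler constantly; schedule(dynamic,10000)
 * amortises that overhead over a large chunk of work. */
#include <stdio.h>
#include <omp.h>

#define N 10000000

int main(void) {
    double a = 0.0, b = 0.0;

    #pragma omp parallel for schedule(dynamic,1) reduction(+:a)
    for (int i = 0; i < N; i++)
        a += i * 1e-9;             /* tiny amount of work per iteration */

    #pragma omp parallel for schedule(dynamic,10000) reduction(+:b)
    for (int i = 0; i < N; i++)
        b += i * 1e-9;

    printf("%f %f\n", a, b);
    return 0;
}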
In summary, and in my opinion, there are few inherently sequential problems, but mountains of badly-implemented parallel solutions.

Parallel speedup with OpenMP

I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).
Scenario 1:
Sequential time measurement:
startTime = omp_get_wtime();
for (blah blah) {
    computation;
}
endTime = omp_get_wtime();
seq_time = endTime - startTime;
Parallel time measurement:
startTime = omp_get_wtime();
#pragma omp parallel for reduction(+:pi) private(i)
for (blah blah) {
    computation;
}
endTime = omp_get_wtime();
paralleltime = endTime - startTime;
speedup = seq_time / paralleltime;
Scenario 2:
Sequential time measurement:
for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    seq_time += endTime - startTime;
}
Parallel time measurement:
#pragma omp parallel for reduction(+:pi, paralleltime) private(i, startTime, endTime)
for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;
}
speedup = seq_time / paralleltime;
I know that Scenario 2 is NOT the best production code, but I think it measures the theoretical performance by OVERLOOKING the overhead involved in OpenMP spawning and managing (context switching between) several threads, so it should give us a linear speedup. Scenario 1, on the other hand, includes the overhead involved in spawning and managing the threads.
My doubt is this:
With Scenario 1, I am getting a speedup which starts out linear but tapers off as I move to a higher number of iterations. With Scenario 2, I am getting a fully linear speedup irrespective of the number of iterations. I was told that in reality Scenario 1 will give me a linear speedup irrespective of the number of iterations, but I think it will not, because of the high overhead due to thread management. Can someone please explain to me why I am wrong?
Thanks! And sorry about the rather long post.
There are many situations where scenario 2 won't give you linear speedup either: false sharing between threads (or, for that matter, true sharing of shared variables which get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.
More generally, once you get to the point where you are putting timers inside for loops, you are considering more fine-grained timing information than is really appropriate to measure using timers like this. You might well want to be able to disentangle the thread management overhead from the actual work being done, for a variety of reasons, but here you are trying to do that by inserting N extra calls to omp_get_wtime(), as well as the arithmetic and the reduction operation, all of which have non-negligible overhead of their own.
If you really want accurate timing of how much time is being spent in the computation line, you really want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof or scalasca or openspeedshop (all free software) or Intel's VTune or the like (commercial packages) will give you information about how much time is being spent on that line, often even per thread, with much lower overhead.
First of all, by the definition of speedup, you should use scenario 1, which includes the parallel overhead.
In scenario 2, the code that measures paralleltime is wrong. To satisfy your goal in scenario 2, you need a per-thread paralleltime: allocate double paralleltime[NUM_THREADS] and index it by omp_get_thread_num() (note that this layout will have false sharing, so you'd better allocate a 64-byte padded struct per thread). Then measure the per-thread computation time, and finally take the longest one to calculate a different kind of speedup (I'd call it a measure of parallelism rather than speedup).
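A rough sketch of what that per-thread measurement could look like (my own interpretation of the suggestion above, with a made-up loop body):
/* Sketch: per-thread timers padded to a cache line to avoid false sharing;
 * the final "parallel time" is the longest per-thread time. */
#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

struct padded_timer {
    double t;
    char pad[64 - sizeof(double)];    /* pad to a typical 64-byte cache line */
};

int main(void) {
    struct padded_timer timer[NUM_THREADS] = {{0}};
    double pi = 0.0;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double t0 = omp_get_wtime();
        /* nowait: don't count the implicit barrier as computation time */
        #pragma omp for reduction(+:pi) nowait
        for (int i = 0; i < 10000000; i++)
            pi += 1e-7;               /* stand-in for the real computation */
        timer[tid].t = omp_get_wtime() - t0;
    }

    double paralleltime = 0.0;
    for (int i = 0; i < NUM_THREADS; i++)
        if (timer[i].t > paralleltime) paralleltime = timer[i].t;
    printf("longest per-thread compute time: %f s\n", paralleltime);
    return 0;
}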
No, you may see sub-linear speedup even for Scenario 2, and super-linear speedup can also be obtained. The potential reasons (i.e., excluding parallel overhead) are:
Load imbalance: the amount of work in computation differs from iteration to iteration. This would be the most common reason for low speedup (but you're saying load imbalance is not the case here).
Synchronization cost: if there is any kind of synchronization (e.g., mutex, event, barrier), you may have waiting/blocking time.
Caches and memory cost: when the computation requires large bandwidth and has a large working set, parallel code may suffer from bandwidth limitations (though this is rare in reality) and cache conflicts. False sharing can also be a significant reason, but it is easy to avoid. A super-linear effect can also be observed because using multiple cores means more aggregate cache (i.e., more private L1/L2 caches).
Scenario 1, on the other hand, will also include the overhead of the parallel library:
Forking/joining threads: although most parallel library implementations won't do physical thread creation/termination on every parallel construct.
Dispatching/joining logical tasks: even if physical threads are already created, you need to dispatch the logical tasks to the threads (in general M tasks to N threads), and also do a sort of join operation at the end (e.g., an implicit barrier).
Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's static scheduling), the overhead is minimal. You can safely ignore it when the workload is sufficiently large (say, 0.1 second). Dynamic scheduling (such as work-stealing in TBB) has some overhead, but it is not significant once your workload is sufficient.
I don't think your code (a 1-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless it is called millions of times per second. So it may be due to the other reasons I mentioned above.
Keep in mind that there are many factors that determine speedup, from the characteristics inherent to the parallel algorithm (load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).
What do you want to measure exactly? The overhead due to parallelism is part of the real execution time, hence IMHO scenario 1 is better.
Besides, according to your OpenMP directives, you're doing a reduction on some array. In scenario 1, you're taking this into account. In scenario 2, you're not. So basically, you're measuring fewer things than in scenario 1. This probably has some impact on your measurements too.
Otherwise, Jonathan Dursi's answer is excellent.
There are several options in OpenMP for how the work is distributed among the threads. This can impact the linearity of your measurement method 1. Your measurement method 2 doesn't seem useful. What were you trying to get at with that? If you want to know single-thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.

For parallel algorithm with N threads, can performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (for parallelization) tells us this is not possible in the general case: at best we can divide the work perfectly among the N processors. The general formula for the speedup is
Speedup = 1 / ((1 - P) + P/N)
where P is the parallelizable fraction of the work and N is the number of processors. With no serial portion, i.e. P = 1, this becomes
Speedup = 1 / (1/N)
which of course reduces to just N.
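For reference, a tiny sketch (my own, not from the answer) that evaluates Amdahl's formula for a few values of the parallel fraction P:
/* Evaluates Speedup(P, N) = 1 / ((1 - P) + P / N).
 * With P = 1.0 the result is exactly N; any serial fraction caps it below N. */
#include <stdio.h>

static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void) {
    const int n = 8;                               /* example processor count */
    const double fractions[] = { 1.0, 0.95, 0.75, 0.5 };
    for (int i = 0; i < 4; i++)
        printf("P = %.2f -> speedup on %d processors: %.2f\n",
               fractions[i], n, amdahl(fractions[i], n));
    return 0;
}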

CUDA: Bigger problems in threads

Almost all of the CUDA example code describes doing near-atomic operations on large data sets. What kind of practical limitations are there to the size of the problem each thread can work on?
For example, I have another question open at the minute that involves per-thread matrix solving. Is this kind of thing too large to put within each thread?
CUDA is a data parallel programming model for what is effectively an SIMD architecture, so obviously it isn't as flexible as a general purpose multithreaded or MIMD architecture. Certainly kernels can be a lot more complex than simple arithmetic operations.
In my own work I use CUDA a lot for solving partial differential equations (so the finite element, finite difference and finite volume methods), where every thread processes a cell or element from a discretised continuum. In that sort of calculation, there are a lot of FLOPs per thread per cell/element.
The key area to be mindful of is branch divergence. Because it is an SIMD architecture under the hood, code with a lot of branching within a warp of threads (which is effectively the SIMD width) will suffer performance penalties. But branch divergence and code complexity need not be synonymous: you can write very "branchy" and "loopy" code which will run well, as long as the threads within any given warp don't diverge too often. In FLOP- and IOP-heavy algorithms, that is usually not too hard to achieve.
I just want to reiterate talonmies' answer and say that there is no real limit to the "size" of a kernel in terms of the number of operations. As long as the computation is parallel, CUDA will be effective!
As far as practical considerations go, I would just add a few small notes:
long-running kernels can time out, depending on the OS (or when profiling with cudaProf). You might have to change a setting somewhere to increase the maximum kernel execution time.
long-running kernels on systems without a dedicated GPU can freeze the display (interrupting the UI).
warps are executed asynchronously - one warp can access memory while another performs arithmetic in order to use clock cycles effectively. Long-running kernels might benefit more from attention to this kind of optimization. I'm not really sure about this last one.
