Gustafson's law is a counter to Amdahl's law, claiming that most parallel-processing applications actually increase their workload when given access to more parallel processing. Thus Amdahl's law, which assumes a constant workload when measuring speedup, is a poor method for determining the benefits of parallel processing.
However, I'm confused as to what exactly Gustafson is trying to argue.
For example, say I take Gustafson's law and apply it to a sequential processor and two parallel processors:
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
For each workload, there is a speedup relative to the sequential processor. However, this doesn't seem to violate Amdahl's law since there is still some upper speedup limit for each workload. All this is doing is varying the workload such that the "serial only" portion of the code is likely reduced. As per Amdahl's law, decreasing the serial only portion will increase the speed-up limit.
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining constant execution time, doesn't seem to add any new information.
What exactly is Gustafson arguing? And specifically, what does "scaled-speedup" even mean?
Gustafson's law is a counter to Amdahl's law, claiming that most parallel-processing applications actually increase their workload when given access to more parallel processing. Thus Amdahl's law, which assumes a constant workload when measuring speedup, is a poor method for determining the benefits of parallel processing.
No.
First of all, the two laws are not claiming anything. They show the theoretical (maximum) speed-up an application can get based on a specific configuration. They are basic mathematical formulas modelling the behaviour of a parallel algorithm. Their goal is to understand some limitations of parallel computing (Amdahl's law) and what developers can do to overcome them (Gustafson's law).
Amdahl's law is a mathematical formula describing the theoretical speed-up of an application with a given workload whose execution time varies (but whose amount of computation is fixed) when computed by several processing units. The workload contains a serial execution part and a parallel one. It shows that the speed-up is bounded by the serial portion of the program. This is not great for developers of parallel applications, since it means the scalability of their application will be quite bad for rather-sequential workloads, assuming the workload is fixed.
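In the usual textbook form (writing s for the serial fraction of the workload and N for the number of processing units), the bound is
speed-up = 1 / (s + (1 - s)/N) <= 1/s,
so no matter how many processing units you add, the speed-up never exceeds 1/s.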
Gustafson's law breaks this assumption and adds a new one, looking at the problem differently. Its mathematical formula describes the theoretical speed-up of an application with a fixed-time workload (but a variable amount of computation) computed by several processing units. It shows that the speed-up can be good as long as application developers can add more parallel work to a given workload.
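With the same notation, the usual statement of Gustafson's law for a fixed-time run whose serial part takes a fraction s of that time is
scaled speed-up = s + (1 - s)·N,
which keeps growing with N as long as more parallel work can be added to fill the fixed time budget.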
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
A parallel program cannot be more than 2 times faster with 2 processing units than with one processing unit under these two models. This is possible in practice due to confounding factors (typically cache effects), but the effect does not come solely from the processing units. What do you mean by "two parallel processors"? The number of processing units is missing from your example (unless you want to deduce it from the provided information).
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining constant execution time, doesn't seem to add any new information.
The two laws are like two points of view on the same scalability problem. However, they make different assumptions: a fixed amount of work with a variable work time vs. a variable amount of work with a fixed work time.
If you are familiar with the concepts of strong/weak scaling, note that Amdahl's law is to strong scaling what Gustafson's law is to weak scaling.
Related
I have been given a project to find all the primes below N. For this, the test-case timeout condition is [TestMethod, Timeout(1000)], which means our algorithm should execute in less than 1 second.
When I run the same program on an i5 processor it passes, but when I run it on an i3 the test cases fail with a timeout error.
Does the runtime of an algorithm depend on the processor?
What are the factors that affect the run time of an algorithm?
Does the runtime of an algorithm depend on the processor?
Of course execution time depends on the processor. That's why companies like Intel spend vast amounts of money on producing faster and faster processors.
For example: if an algorithm consists of 1 million operations and a CPU is capable of performing 1 million operations per second, executing that algorithm takes 1 second. On a processor capable of 10 million operations per second, the same algorithm takes only 0.1 s. That's a very simplified example, since not every instruction takes the same time and there are more factors involved (cache, memory bandwidth, special instructions like SSE), but it illustrates the general point: a faster processor executes code in less time; that's the point of faster processors.
Note: that's totally unrelated to time complexity.
I know about Amdahl's law and the maximum speedup of a parallel program. But I couldn't research Gustafson's law properly. What is Gustafson's law, and what is the difference between Amdahl's and Gustafson's laws?
Amdahl's law
Suppose you have a sequential code and that a fraction f of its computation is parallelized and run on N processing units working in parallel, while the remaining fraction 1-f cannot be improved, i.e., it cannot be parallelized. Amdahl’s law states that the speedup achieved by parallelization is
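S = 1 / ((1 - f) + f/N)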
Gustafson's law
Amdahl’s point of view is focused on a fixed computation problem size, as it deals with a code taking a fixed amount of sequential calculation time. Gustafson's objection is that massively parallel machines allow computations previously unfeasible, since they enable computations on very large data sets in a fixed amount of time. In other words, a parallel platform does more than speed up the execution of a code: it enables dealing with larger problems.
Suppose you have an application taking a time ts to be executed on N processing units. Of that computing time, a fraction (1-f) must be run sequentially. Accordingly, this application would run on a fully sequential machine in a time t equal to
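t = (1 - f)·ts + N·f·ts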
If we increase the problem size, we can increase the number of processing units so that the amount of time the code spends executing in parallel remains equal to f·ts. In this case, the sequential execution time increases with N, which now becomes a measure of the problem size. The speedup then becomes
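S = t / ts = (1 - f) + N·f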
The efficiency would then be
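E = S / N = (1 - f)/N + f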
so that the efficiency tends to f for increasing N.
The pitfall of these rather optimistic speedup and efficiency evaluations is related to the fact that, as the problem size increases, communication costs will increase, but increases in communication costs are not accounted for by Gustafson’s law.
References
G. Barlas, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann
M.D. Hill, M.R. Marty, Amdahl’s law in the multicore era, Computer, vol. 41, n. 7, pp. 33-38, Jul. 2008.
GPGPU
There are interesting discussions on Amdahl's law as applied to General Purpose Graphics Processing Units, see
Amdahl's law and GPU
Amdahl's Law for GPU
Is Amdahl's law accepted for GPUs too?
The two laws look at the same problem from different perspectives. Amdahl's law asks: if you have, say, 100 CPUs, how much faster can you solve the same problem?
Gustafson's law asks: if a parallel computer with 100 CPUs can solve this problem in 30 minutes, how long would it take a computer with just ONE such CPU to solve the same problem?
Gustafson's law reflects the situation better. For example, we cannot use a 20-year-old PC to play most of today's video games, because it is just too slow.
Amdahl's law
Generally, if a fraction f of a job is impossible to divide into parallel parts, the whole thing can only get 1/f times faster in parallel. That is Amdahl's law.
Gustafson's law
Generally, for p participants and an un-parallelizable fraction f, the whole thing can get p + (1 − p)·f times faster. That is Gustafson's law.
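A quick worked example (numbers chosen only for illustration): with an un-parallelizable fraction f = 0.1 and p = 100 participants, Amdahl limits the speedup to 1/f = 10, while Gustafson's scaled speedup is p + (1 − p)·f = 100 − 99·0.1 = 90.1, because the problem is grown so that the run time stays fixed.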
I read a paper in which the parallel cost of a (parallel) algorithm is defined as CP(n) = p * TP(n), where p is the number of processors, TP(n) the parallel processing time, and n the input size. An algorithm is cost-optimal if CP(n) is approximately constant, i.e., if the algorithm uses two processors instead of one on the same input, it takes only half the time.
There is another concept called parallel work, which I don't fully grasp.
The paper says it measures the number of executed parallel OPs.
An algorithm is work-optimal if it performs (asymptotically) as many OPs as its sequential counterpart. A cost-optimal algorithm is always work-optimal, but not vice versa.
Can someone illustrate the concept of parallel work and show the similarities and differences to parallel cost?
It sounds like parallel work is simply a measure of the total number of instructions run by all processes, but counting the ones executed in parallel only once. If that's the case, then it's more closely related to the time term in your parallel cost equation. Think of it this way: if the parallel version of the algorithm runs more instructions than the sequential version, meaning it is not work-optimal, it will necessarily take more time, assuming all instructions are equal in duration. Typically these extra instructions are at the beginning or end of the parallel algorithm and are viewed as overhead of the parallel algorithm. They can correspond to extra bookkeeping, communication, or the final aggregation of the result.
Thus an algorithm that is not work-optimal cannot be cost-optimal.
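As an illustration (not from the paper, just the textbook example of summing n numbers with p processors): each processor adds up n/p of the numbers, and the p partial sums are then combined in log2(p) steps. The total number of additions is p·(n/p − 1) + (p − 1) = n − 1, the same as the sequential algorithm, so the parallel sum is work-optimal. Its running time is TP(n) ≈ n/p + log2(p), so the parallel cost is CP(n) = p·TP(n) ≈ n + p·log2(p); it is cost-optimal only while p·log2(p) grows no faster than n, which is exactly the "work-optimal but not necessarily cost-optimal" gap described above.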
Another way to call the "parallel cost" is "cost of context switching" although it can also arise from interdependencies between the different threads.
Consider sorting.
If you implement Bubble Sort in parallel, where each thread just picks up the next comparison, you will have a huge cost to run it in "parallel", to the point where it will essentially be a messed-up sequential version of the algorithm, and your parallel work will be essentially zero because most threads just wait most of the time.
Now compare that to Quick Sort and implement a thread for each split of the original array: threads don't need data from other threads, and asymptotically, for bigger starting arrays, the cost of spinning up these threads will be paid for by the parallel nature of the work done... if the system has infinite memory bandwidth. In reality it wouldn't be worth spinning up more threads than there are memory access channels, because the threads still have an invisible (from the code's perspective) dependency between them through their shared sequential access to memory.
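To make that concrete, here is a rough illustrative sketch (not anyone's production code) of the Quick Sort idea using OpenMP tasks, where each split becomes an independent unit of work; the 10000-element cutoff is an arbitrary threshold chosen only to avoid spawning tasks for tiny sub-arrays:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    #pragma omp task shared(a) if(hi - lo > 10000)  /* left half becomes a separate task */
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);                        /* current thread keeps the right half */
    #pragma omp taskwait                            /* wait for the left half before returning */
}

int main(void) {
    enum { N = 1000000 };
    static int a[N];
    for (int k = 0; k < N; k++) a[k] = rand() % N;  /* arbitrary test data */
    #pragma omp parallel
    #pragma omp single                              /* one thread seeds the task tree */
    quicksort(a, 0, N - 1);
    printf("first = %d, last = %d\n", a[0], a[N - 1]);
    return 0;
}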
Short
I think parallel cost and parallel work are two sides of the same coin. They're both measures for speed-up,
whereby the latter is the theoretical concept enabling the former.
Long
Let's consider n-dimensional vector addition as a problem that is easy to parallelize, since it can be broken down into n independent tasks.
The problem is inherently work-optimal, because the work doesn't change when the algorithm runs in parallel: there are always n vector components that need to be added.
The parallel cost, on the other hand, cannot be considered without executing the algorithm on a (virtual) machine, where practical limitations like a shortage of memory bandwidth arise. Thus a work-optimal algorithm can only be cost-optimal if the hardware (or the hardware access patterns) allows perfect execution and division of the problem, and of the time.
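A minimal sketch of that vector addition, assuming OpenMP (the pragma is simply ignored if the code is built without it):

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* the n component additions are independent, so the total work is the
       same whether this loop runs on one core or is split across many */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}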
Cost-optimality is a stronger demand, and, as I'm realizing now, just another illustration of efficiency.
Under normal circumstances a cost-optimal algorithm will also be work-optimal, but if the speed-up gained from caching, memory access patterns, etc. is super-linear, i.e. the execution time with two processors is one-tenth instead of the expected half, it is possible for an algorithm that performs more work, and thus is not work-optimal, to still be cost-optimal.
We know that the parallel efficiency of a program running on a multicore system can be calculated as speedup/N, where N is the number of cores. So in order to use this formula, we first need to execute the code on a multicore system and know the speedup.
I would like to know: if I don't have a multicore system, is it possible to estimate the speedup of the given code on a multicore system just by executing it on a unicore processor?
I have access to performance counters (instructions per cycle, number of cache misses, number of instructions, etc.) and I only have binaries of the code.
[Note: I estimated parallel_running_time (T_P) = serial_running_time/N, but this estimation has unacceptable error.]
Thanks
Read up on Amdahl's Law, especially the bit about parallelization.
For you to determine how much you can speed up your program, you have to know what parts of the program can benefit from parallelization and what parts must be executed sequentially. If you know that, and if you know how long the serial and the parallel parts take (individually) on a single processor, then you can estimate how fast the program will be on multiple processors.
From your description, it seems that you don't know which parts can make use of parallel processing and which parts have to be executed sequentially. So it won't be possible to estimate the parallel running time.
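For illustration only: if you could somehow estimate the parallelizable fraction of the single-core run time, a rough Amdahl-style projection is easy to compute. The 0.8 below is a made-up number, not something derived from your performance counters:

#include <stdio.h>

/* plain Amdahl's-law estimate from a guessed parallel fraction */
static double amdahl_speedup(double parallel_fraction, int cores) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

int main(void) {
    double f = 0.8;  /* assumption: 80% of the single-core time is parallelizable */
    for (int n = 1; n <= 16; n *= 2)
        printf("%2d cores -> estimated speedup %.2f\n", n, amdahl_speedup(f, n));
    return 0;
}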
I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).
Scenario 1:
Sequential time measurement:
startTime=omp_get_wtime();
for loop computation
endTime=omp_get_wtime();
seq_time = endTime-startTime;
Parallel time measurement:
startTime = omp_get_wtime();
#pragma omp parallel for reduction (+:pi) private (i)
for (blah blah) {
    computation;
}
endTime = omp_get_wtime();
paralleltime = endTime - startTime;
speedup = seq_time / paralleltime;
Scenario 2:
Sequential time measurement:
for loop {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    seq_time += endTime - startTime;
}
Parallel time measurement:
#pragma omp parallel for reduction (+:pi, paralleltime) private (i, startTime, endTime)
for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;
}
speedup = seq_time / paralleltime;
I know that Scenario 2 is NOT the best production code, but I think it measures the actual theoretical performance by OVERLOOKING the overhead involved in OpenMP spawning and managing (thread context switching) several threads. So it will give us a linear speedup. But Scenario 1 considers the overhead involved in spawning and managing threads.
My doubt is this:
With Scenario 1, I am getting a speedup which starts out linear but tapers off as we move to a higher number of iterations. With Scenario 2, I am getting a fully linear speedup irrespective of the number of iterations. I was told that in reality Scenario 1 will give me a linear speedup irrespective of the number of iterations, but I think it will not, because of the high overhead due to thread management. Can someone please explain to me why I am wrong?
Thanks! And sorry about the rather long post.
There are many situations where Scenario 2 won't give you linear speedup either: false sharing between threads (or, for that matter, true sharing of shared variables which get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.
More generally, once you get to the point where you're putting timers inside for loops, you're considering more fine-grained timing information than is really appropriate to measure with timers like this. You might well want to be able to disentangle the thread management overhead from the actual work being done for a variety of reasons, but here you're trying to do that by inserting N extra calls to omp_get_wtime(), as well as the arithmetic and the reduction operation, all of which have non-negligible overhead of their own.
If you really want accurate timing of how much time is being spent on the computation line, you really want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof or scalasca or openspeedshop (all free software) or Intel's VTune (a commercial package) will give you information about how much time is being spent on that line, often even per thread, with much lower overhead.
First of all, by the definition of speedup, you should use Scenario 1, which includes the parallel overhead.
In Scenario 2, you have the wrong code in the measurement of paralleltime. To satisfy your goal in Scenario 2, you need a per-thread paralleltime, by allocating double paralleltime[NUM_THREADS] and indexing it with omp_get_thread_num() (note that this code will have false sharing, so you'd better allocate a 64-byte padded struct per thread). Then measure the per-thread computation time, and finally take the longest one to calculate a different kind of speedup (I'd call it a sort of parallelism instead).
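A rough sketch of what that per-thread measurement could look like (the names and the pi computation are just placeholders standing in for your loop):

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 8
#define N 10000000

/* one timer slot per thread, padded to a cache line to avoid false sharing */
struct padded_time { double t; char pad[64 - sizeof(double)]; };

int main(void) {
    static struct padded_time thread_time[NUM_THREADS];
    double pi = 0.0;

    #pragma omp parallel for reduction(+:pi) num_threads(NUM_THREADS)
    for (int i = 0; i < N; i++) {
        double t0 = omp_get_wtime();
        double x = (i + 0.5) / N;                 /* stand-in for "computation;" */
        pi += 4.0 / (1.0 + x * x) / N;
        thread_time[omp_get_thread_num()].t += omp_get_wtime() - t0;
    }

    double longest = 0.0;
    for (int k = 0; k < NUM_THREADS; k++)
        if (thread_time[k].t > longest) longest = thread_time[k].t;

    printf("pi = %f, longest per-thread compute time = %f s\n", pi, longest);
    /* divide a separately measured seq_time by 'longest' to get the "parallelism" figure */
    return 0;
}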
No, you may see sub-linear speedup even in Scenario 2, and even super-linear speedup can be obtained. The potential reasons (i.e., excluding parallel overhead) are:
Load imbalance: the amount of work in computation differs from iteration to iteration. This is the most common reason for low speedup (but you're saying load imbalance is not the case here).
Synchronization cost: if there is any kind of synchronization (e.g., mutex, event, barrier), you may have waiting/blocking time.
Caches and memory cost: when computation requires large bandwidth and a large working set, parallel code may suffer from bandwidth limitations (though this is rare in reality) and cache conflicts. Also, false sharing can be a significant reason, but it's easy to avoid. A super-linear effect can also be observed, because using multiple cores gives you more total cache (i.e., private L1/L2 caches).
Scenario 1 will also include the overhead of the parallel library:
Forking/joining threads: although most parallel library implementations won't do physical thread creation/termination on every parallel construct.
Dispatching/joining logical tasks: even if physical threads are already created, you need to dispatch logical tasks to the threads (in general, M tasks to N threads), and also do a sort of joining operation at the end (e.g., an implicit barrier).
Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's static scheduling), the overhead is minimal. You can safely ignore it when the workload is sufficiently large (say, 0.1 second). However, dynamic scheduling (such as work-stealing in TBB) has some overhead, though it's not significant once your workload is sufficient.
I don't think your code (a 1-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless this code is called millions of times per second. So the cause is probably one of the other reasons I've mentioned above.
Keep in mind that there are many factors that determine speedup, from the inherent parallelism (load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).
What do you want to measure exactly? The overhead due to parallelism is part of the real execution time, hence IMHO Scenario 1 is better.
Besides, according to your OpenMP directives, you're doing a reduction on some variable. In Scenario 1, you're taking this into account; in Scenario 2, you're not. So basically, you're measuring less than in Scenario 1. This probably has some impact on your measurements too.
Otherwise, Jonathan Dursi's answer is excellent.
There are several options in OpenMP for how the work is distributed among the threads. This can impact the linearity of your measurement method 1. Your measurement method 2 doesn't seem useful. What were you trying to get at with that? If you want to know single-thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.