Gustafson's law vs Amdahl's law - parallel-processing

I know about Amdahl's law and the maximum speedup of a parallel program. But I couldn't find a proper explanation of Gustafson's law. What is Gustafson's law, and what is the difference between Amdahl's and Gustafson's laws?

Amdahl's law
Suppose you have a sequential code and that a fraction f of its computation is parallelized and run on N processing units working in parallel, while the remaining fraction 1-f cannot be improved, i.e., it cannot be parallelized. Amdahl's law states that the speedup achieved by parallelization is
S = 1 / ((1 - f) + f/N)
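As a quick numerical illustration (a sketch of my own, not part of the original answer; the fraction f = 0.9 and the processor counts are arbitrary choices):

    # Amdahl's law: f is the parallelized fraction, N the number of processing units.
    def amdahl_speedup(f, n):
        return 1.0 / ((1.0 - f) + f / n)

    for n in (2, 4, 16, 256, 65536):
        print(n, round(amdahl_speedup(0.9, n), 2))
    # The printed speedup approaches 1 / (1 - f) = 10 no matter how large N gets.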
Gustafson's law
Amdahl’s point of view is focused on a fixed computation problem size, as it deals with a code taking a fixed amount of sequential calculation time. Gustafson's objection is that massively parallel machines allow computations previously unfeasible, since they enable computations on very large data sets in a fixed amount of time. In other words, a parallel platform does more than speed up the execution of a code: it enables dealing with larger problems.
Suppose you have an application taking a time ts to be executed on N processing units. Of that computing time, a fraction (1-f) must be run sequentially. Accordingly, this application would run on a fully sequential machine in a time t equal to
t = (1 - f)·ts + N·f·ts
If we increase the problem size, we can increase the number of processing units to keep the amount of time the code is executed in parallel equal to f·ts. In this case, the sequential execution time increases with N, which now becomes a measure of the problem size. The speedup then becomes
S = t / ts = (1 - f) + N·f
The efficiency would then be
E = S / N = (1 - f)/N + f
so that the efficiency tends to f for increasing N.
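This limit can be checked numerically (again a sketch of my own; f = 0.9 is an arbitrary choice):

    # Gustafson's law: scaled speedup and efficiency for a problem that grows with N.
    def gustafson_speedup(f, n):
        return (1.0 - f) + n * f

    for n in (2, 4, 16, 256):
        s = gustafson_speedup(0.9, n)
        print(n, round(s, 1), round(s / n, 3))  # N, speedup S, efficiency E = S/N
    # The efficiency column approaches f = 0.9 as N grows, as stated above.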
The pitfall of these rather optimistic speedup and efficiency evaluations is that, as the problem size increases, communication costs also increase, and these increases are not accounted for by Gustafson’s law.
References
G. Barlas, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann
M.D. Hill, M.R. Marty, Amdahl’s law in the multicore era, Computer, vol. 41, no. 7, pp. 33-38, Jul. 2008.
GPGPU
There are interesting discussions on Amdahl's law as applied to General Purpose Graphics Processing Units, see
Amdahl's law and GPU
Amdahl's Law for GPU
Is Amdahl's law accepted for GPUs too?

We are looking at the same problem from different perspectives. Amdahl's law asks: if you have, say, 100 more CPUs, how much faster can you solve the same problem?
Gustafson's law asks: if a parallel computer with 100 CPUs can solve this problem in 30 minutes, how long would it take a computer with just ONE such CPU to solve the same problem?
Gustafson's law often reflects real situations better. For example, we cannot use a 20-year-old PC to play most of today's video games, because such an old machine is simply too slow for them.

Amdahl's law
Generally, if a fraction f of a job is impossible to divide into parallel parts, the whole thing can get at most 1/f times faster in parallel. That is Amdahl's law.
Gustafson's law
Generally, for p participants and an un-parallelizable fraction f, the whole thing can get p + (1−p)·f times faster. That is Gustafson's law.
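For a concrete comparison (illustrative numbers of my own): with an un-parallelizable fraction f = 0.1 and p = 100 processors, Amdahl bounds the speedup of a fixed job at 1/f = 10, while Gustafson's scaled speedup for a job that grows with the machine is p + (1−p)·f = 100 + (1−100)·0.1 = 90.1.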


What is Gustafson's law trying to argue?

Gustafson's law is a counter to Amdahl's law, claiming that most parallel-processing applications actually increase the workload when given increased access to parallel processing. Thus Amdahl's law, which assumes a constant workload to measure speedup, is a poor method for determining the benefits of parallel processing.
However, I'm confused as to what exactly Gustafson is trying to argue.
For example, say I take Gustafson's law and apply it to a sequential processor and two parallel processors:
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
For each workload, there is a speedup relative to the sequential processor. However, this doesn't seem to violate Amdahl's law, since there is still some upper speedup limit for each workload. All this is doing is varying the workload such that the "serial-only" portion of the code is likely reduced. As per Amdahl's law, decreasing the serial-only portion will increase the speed-up limit.
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining constant execution time, doesn't seem to add any new information.
What exactly is Gustafson arguing? And specifically, what does "scaled-speedup" even mean?
Gustafson's law is a counter to Amdahl's law, claiming that most parallel-processing applications actually increase the workload when given increased access to parallel processing. Thus Amdahl's law, which assumes a constant workload to measure speedup, is a poor method for determining the benefits of parallel processing.
No.
First of all, the two laws are not claiming anything. They show the theoretical (maximum) speed-up an application can get based on a specific configuration. The two are basic mathematical formulas modelling the behaviour of a parallel algorithm. Their goal is to understand some limitations of parallel computing (Amdahl's law) and what developers can do to overcome them (Gustafson's law).
Amdahl's law is a mathematical formula describing the theoretical speed-up of an application with a given workload whose execution time can vary (but whose amount of computation is fixed), computed by several processing units. The workload contains a serial execution part and a parallel one. It shows that the speed-up is bounded by the serial portion of the program. This is not great for developers of parallel applications, since it means the scalability of their application will be quite bad for rather-sequential workloads, as long as the workload is fixed.
Gustafson's law breaks this assumption and replaces it with a new one, so as to look at the problem differently: its formula describes the theoretical speed-up of an application with a fixed-time workload (but a variable amount of computation) computed by several processing units. It shows that the speed-up can be good as long as application developers can add more parallel work to a given workload.
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
Under these two models, a parallel program cannot be more than 2 times faster with 2 processing units than with one. Exceeding that bound is possible in practice due to some confounding factors (typically cache effects), but the gain does not come purely from the processing units. Also, what do you mean by "two parallel processors"? The number of processing units is missing from your example (unless you expect it to be deduced from the information provided).
So I'm confused, Gustafson's law which advocates for increasing the workload and maintain constant execution time doesn't seem to add any new information.
The two laws are like two points of view on the same scalability problem. However, they make different assumptions: a fixed amount of work with a variable work time vs. a variable amount of work with a fixed work time.
If you are familiar with the concepts of strong/weak scaling, note that Amdahl's law is to strong scaling what Gustafson's law is to weak scaling.
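As a rough illustration of the two measurement styles (a minimal sketch of my own, not from the answer; the dummy work() kernel, chunk sizes, and worker counts are arbitrary assumptions, and on such a tiny workload real timings will be dominated by process start-up and scheduling overheads):

    # Strong scaling (Amdahl's view): fixed total work, more workers.
    # Weak scaling (Gustafson's view): work grows with the workers, time ideally constant.
    from multiprocessing import Pool
    import time

    def work(n):
        # Dummy CPU-bound kernel: sum of squares over n iterations.
        return sum(i * i for i in range(n))

    def run(total_chunks, chunk_size, workers):
        start = time.perf_counter()
        with Pool(workers) as pool:
            pool.map(work, [chunk_size] * total_chunks)
        return time.perf_counter() - start

    if __name__ == "__main__":
        base = run(total_chunks=8, chunk_size=200_000, workers=1)
        for w in (2, 4, 8):
            t = run(total_chunks=8, chunk_size=200_000, workers=w)       # strong scaling
            print(f"strong scaling, {w} workers: speedup {base / t:.2f}")
        for w in (2, 4, 8):
            t = run(total_chunks=8 * w, chunk_size=200_000, workers=w)   # weak scaling
            print(f"weak scaling, {w} workers: time {t:.2f}s vs baseline {base:.2f}s")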

Speedup in execution time using Amdahl's law

I have been given this problem in one of my assignments. I understand the principles of speedup, execution time, etc. However, I feel that this question is incomplete. Is that true, or can it be solved? If it can, please explain how.
A program executes on the original version of a machine that runs at a 2GHz clock rate. The program takes 450 micro-seconds of CPU time. An improvement is made to the machine that affects 80% of the code in the program. Based on Amdahl’s law, this improvement would yield an N% speedup in the execution time for the program. What value is N? Express your answer to two decimal places.
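For reference (my own sketch of how the formula would be applied, assuming the 80% refers to execution time and that the improvement makes that portion k times faster, a factor the problem does not state):

Speedup = 1 / ((1 − 0.8) + 0.8 / k)

The 2 GHz clock rate and the 450 microseconds of CPU time cancel out of this ratio, so without k the percentage speedup N cannot be pinned down.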

For parallel algorithm with N threads, can performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
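A minimal structural sketch of such a pipeline (my own illustration, not from the answer: it only shows the shape of the decomposition, with stand-in stage counts, packet counts, and "static data", and a toy example like this cannot demonstrate the cache effect itself):

    # Each pipeline stage runs in its own process, holds its own static data,
    # and passes packets downstream through a queue.
    from multiprocessing import Process, Queue

    STAGES = 3      # illustrative number of pipeline steps
    PACKETS = 10    # illustrative number of data packets

    def stage(index, q_in, q_out):
        # In the scenario above this table would be large enough to fill a
        # core's data cache; here it is just a small stand-in.
        static_data = [index] * 1024
        while True:
            packet = q_in.get()
            if packet is None:                  # sentinel: shut the stage down
                q_out.put(None)
                break
            q_out.put(packet + sum(static_data[:4]))  # trivial per-stage work

    if __name__ == "__main__":
        queues = [Queue() for _ in range(STAGES + 1)]
        workers = [Process(target=stage, args=(i, queues[i], queues[i + 1]))
                   for i in range(STAGES)]
        for w in workers:
            w.start()
        for p in range(PACKETS):
            queues[0].put(p)
        queues[0].put(None)
        results = []
        while (item := queues[-1].get()) is not None:
            results.append(item)
        for w in workers:
            w.join()
        print(results)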
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible for the general case. At best we can perfectly divide the work by N. The reason for this is that given no serial portion, Amdahl's formula for speedup becomes:
Speedup = 1/(1/N)
where N is the number of processors. This of course reduces to just N.

Law of parallel programming

There is a law of parallel programming but I can't find a link to it. In short, it says the following: if 1/3 of the time your program is executed sequentially (in one thread), you can increase your application's performance by at most 3 times.
Amdahl's law dates from the era of single-core CPUs. You may want to read carefully what happens on current multi-core CPUs. See
Amdahl's Law in the Multicore Era
and the related page where you will find additional resources.
That would be Amdahl's Law.
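Applying the formula to that statement (a quick check of my own): with a serial fraction of 1/3,

Speedup = 1 / ((1/3) + (2/3)/N)

which approaches 1/(1/3) = 3 as the number of threads N grows, so no amount of parallel hardware can make such a program more than 3 times faster.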

Algorithms for modern hardware?

Once again, I find myself with a set of broken assumptions. The article itself is about a 10x performance gain by modifying a proven-optimal algorithm to account for virtual memory:
On a modern multi-issue CPU, running at some gigahertz clock frequency, the worst-case loss is almost 10 million instructions per VM page fault. If you are running with a rotating disk, the number is more like 100 million instructions.
What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n^2) algorithm, which avoids page faults, will run circles around it.
Are there more such algorithms around? Should we re-examine all those fundamental building blocks of our education? What else do I need to watch out for when writing my own?
Clarification:
The algorithm in question isn't faster than the proven-optimal one because the Big-O notation is flawed or meaningless. It's faster because the proven-optimal algorithm relies on an assumption that is not true in modern hardware/OSes, namely that all memory access is equal and interchangeable.
You only need to re-examine your algorithms when your customers complain about the slowness of your program or it is missing critical deadlines. Otherwise, focus on correctness, robustness, readability, and ease of maintenance. Until these are achieved, any performance optimization is a waste of development time.
Page faults and disk operations may be platform specific. Always profile your code to see where the bottlenecks are. Spending time on these areas will produce the most benefits.
If you're interested, along with page faults and slow disk operations, you may want to be aware of:
Cache hits -- Data Oriented Design
Cache hits -- Reducing unnecessary branches/jumps
Cache prediction -- Shrink loops so they fit into the processor's cache
Again, these items matter only after quality has been achieved, customers have complained, and a profiler has analyzed your program.
One important thing to realize is that the most common usage of big-O notation (to talk about runtime complexity) is only half of the story; there's another half, namely space complexity (which can also be expressed using big-O), and it can be quite relevant too.
Generally these days, memory capacity advances have outpaced computing speed advances (for a single core - parallelization can get around this), so less focus is given to space complexity, but it's still a factor that should be kept in mind, especially on machines with more limited memory or if working with very large amounts of data.
I'll expand on GregS's answer: the difference is between effective complexity and asymptotic complexity. Asymptotic complexity ignores constant factors and is valid only for “large enough” inputs. Oftentimes “large enough” can actually mean “larger than any computer can deal with, now and for a few decades”; this is where the theory (justifiably) gets a bad reputation. Of course there are also cases where “large enough” means n=3!
A more complex (and thus more accurate) way of looking at this is to first ask “what is the size range of problems you are interested in?” Then you need to measure the efficiency of various algorithms in that size range, to get a feeling for the ‘hidden constants’. Or you can use finer methods of algorithmic asymptotics which actually give you estimates on the constants.
The other thing to look at is ‘transition points’. Of course an algorithm which runs in 2n^2 time will be faster than one that runs in 10^16·n·log(n) time for all n < 1.99·10^17. So the quadratic algorithm will be the one to choose (unless you are dealing with the sizes of data that CERN worries about). Even lower-order terms can bite – 3n^3 is a lot better than n^3 + 10^16·n^2 for n < 5·10^15 (assuming that these are actual complexities).
There are no broken assumptions that I see. Big-O notation is a measure of algorithmic complexity on a very, very simplified, idealized computing machine, ignoring constant terms. Obviously it is not the final word on actual speeds on actual machines.
O(n) is only part of the story -- a big part, and frequently the dominant part, but not always the dominant part. Once you get to performance optimization (which should not be done too early in your development), you need to consider all of the resources you are using. You can generalize Amdahl's Law to mean that your execution time will be dominated by the most limited resource. Note that also means that the particular hardware on which you're executing must also be considered. A program that is highly optimized and extremely efficient for a massively parallel computer (e.g., CM or MassPar) would probably not do well on a big vector box (e.g., Cray-2) nor on a high-speed microprocessor. That program might not even do well on a massive array of capable microprocessors (map/reduce style). The different optimizations for the different balances of cache, CPU communications, I/O, CPU speed, memory access, etc. mean different performance.
Back when I spent time working on performance optimizations, we would strive for "balanced" performance over the whole system. A super-fast CPU with a slow I/O system rarely made sense, and so on. O() typically considers only CPU complexity. You may be able to trade off memory space (unrolling loops doesn't make O() sense, but it does frequently help real performance); concern about cache hits vice linear memory layouts vice memory bank hits; virtual vs real memory; tape vs rotating disk vs RAID, and so on. If your performance is dominated by CPU activity, with I/O and memory loafing along, big-O is your primary concern. If your CPU is at 5% and the network is at 100%, maybe you can get away from big-O and work on I/O, caching, etc.
Multi-threading, particularly with multi-cores, makes all of that analysis even more complex. This opens up into very extensive discussion. If you are interested, Google can give you months or years of references.

Resources