Gustafson's law is a counter to Amdahl's law. It claims that most parallel processing applications actually increase their workload when they get access to more parallel processing power. Thus Amdahl's law, which assumes a constant workload when measuring speedup, is a poor way to determine the benefits of parallel processing.
However, I'm confused as to what exactly Gustafson is trying to argue.
For example, say I take Gustafson's law and apply it to a sequential processor and two parallel processors:
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
For each workload, there is a speedup relative to the sequential processor. However, this doesn't seem to violate Amdahl's law since there is still some upper speedup limit for each workload. All this is doing is varying the workload such that the "serial only" portion of the code is likely reduced. As per Amdahl's law, decreasing the serial only portion will increase the speed-up limit.
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining constant execution time, doesn't seem to add any new information.
What exactly is Gustafson arguing? And specifically, what does "scaled-speedup" even mean?
Gustafson's law is a counter to Amdahl's law. It claims that most parallel processing applications actually increase their workload when they get access to more parallel processing power. Thus Amdahl's law, which assumes a constant workload when measuring speedup, is a poor way to determine the benefits of parallel processing.
No.
First of all, the two laws are not claiming anything. They show the theoretical (maximum) speed-up an application can get based on a specific configuration. They are basic mathematical formulas modelling the behaviour of a parallel algorithm. Their goal is to understand some limitations of parallel computing (Amdahl's law) and what developers can do to overcome them (Gustafson's law).
Amdahl's law is a mathematical formula describing the theoretical speed-up of an application with a given workload (the amount of computation is fixed, so the execution time varies with the number of processing units). The workload contains a serial execution part and a parallel one. It shows that the speed-up is bounded by the serial portion of the program. This is not great for developers of parallel applications, since it means that the scalability of their application will be quite bad for rather-sequential workloads, as long as the workload is fixed.
Gustafson's law breaks this assumption and adds a new one, so as to look at the problem differently. Its mathematical formula describes the theoretical speed-up of an application with a fixed-time workload (the amount of computation is variable) computed by several processing units. It shows that the speed-up can be good as long as application developers can add more parallel work to a given workload.
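To make the contrast concrete, here is a minimal sketch of the two formulas in Python, using their standard textbook forms; p is the parallel fraction of the work, n is the number of processing units, and the numbers below are only illustrative:

def amdahl_speedup(p, n):
    # Fixed workload: the serial part (1 - p) bounds the speed-up.
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    # Fixed execution time: the parallel part grows with n.
    return (1.0 - p) + p * n

for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2),
             round(gustafson_speedup(0.95, n), 2))

With p = 0.95, Amdahl's speed-up saturates near 20 no matter how large n gets, while Gustafson's scaled speed-up keeps growing almost linearly, because the extra processing units are fed with extra parallel work.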
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
Under these two models, a parallel program cannot be more than 2 times faster with 2 processing units than with one. Super-linear speed-up is possible in practice due to some confounding factors (typically cache effects), but the effect does not come from the processing units alone. What do you mean by "two parallel processors"? The number of processing units is missing from your example (unless you want to derive it from the provided information).
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining constant execution time, doesn't seem to add any new information.
The two laws are like two points of view on the same scalability problem. However, they make different assumptions: a fixed amount of work with a variable work time versus a variable amount of work with a fixed work time.
If you are familiar with the concepts of strong/weak scaling, note that Amdahl's law is to strong scaling what Gustafson's law is to weak scaling.
I was given a project to find all the primes below N. The test case's timeout condition is [TestMethod, Timeout(1000)], which means our algorithm should execute in under 1 second.
When I run the same program on an i5 processor it runs successfully, but when I run it on an i3 the test cases fail because of a timeout error.
Does the runtime of an algorithm depend on the processor?
What are the factors that affect the runtime of an algorithm?
Does the runtime of an algorithm depend on the processor?
Of course execution time depends on the processor. That's why companies like Intel spend vast amounts of money on producing faster and faster processors.
For example: if an algorithm consists of 1 million operations and a CPU is capable of performing 1 million operations per second, executing that algorithm takes 1 second. On a processor that performs 10 million operations per second, the same algorithm takes only 0.1 s. That's a very simplified example, as not every instruction takes the same time and there are more factors involved (cache, memory bandwidth, special instructions like SSE), but it illustrates the general point: a faster processor executes code in less time; that's the point of faster processors.
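The same back-of-the-envelope model in a few lines of Python (operation counts and rates made up for illustration):

ops = 1_000_000                  # operations in the algorithm
for rate in (1e6, 1e7):          # operations per second the CPU can do
    print(f"{rate:.0e} ops/s -> {ops / rate:.1f} s")
# 1e+06 ops/s -> 1.0 s
# 1e+07 ops/s -> 0.1 s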
Note: that's totally unrelated to time complexity, which describes how the runtime grows with the input size, not how fast the code runs in absolute terms.
How do I measure the execution time of a method in ECLiPSe CLP? Currently, I have this:
measure_traditional(Difficulty,Selection,Choice):-
statistics(runtime, _),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
time(solve_traditional(Difficulty,Selection,Choice,_)),
statistics(runtime,[_|T]), % T
write(T).
I need to measure the time it takes to perform the method solve_traditional(...) and write it out to a text file. However, the measurement is not precise enough. Sometimes time prints 0.015 or 0.016 seconds for the given method, but usually it prints 0.0 seconds.
Figuring the method completes too quickly, I decided to use statistics(runtime, ...) to measure the time between two runtime calls. I could then measure, for example, the time it takes to complete 20 method calls and divide the measured time T by 20.
The only problem is, with 20 calls T equals either 0, 16, 32 or 48 milliseconds. Apparently it measures the time for each method call separately and sums the execution times (which are often just 0.0 s). This defeats the whole purpose of measuring the runtime of N method calls and dividing the time T by N.
In short: the current methods I'm using for execution time measurements are inadequate. Is there a way to make it more precise (9 decimals for example)?
Benchmarking is a tricky business in any programming language, and particularly so in CLP. Especially if you plan to publish your results, you should be extremely thorough and make absolutely sure you are measuring what you claim to measure.
Timers: Are you measuring real time, process cpu time, thread cpu time? Including time spent in system calls? Including or excluding garbage collection? ...
See the different timers offered by the statistics/2 primitive.
There is a real-time high-resolution timer that can be accessed via statistics(hr_time,T).
Timer resolution: In your example the timer resolution seems to be 1/60 sec. That means, to get 3 significant digits in your time measurement, you have to measure at least a runtime of 1000*1/60 = 16.7 seconds.
If your benchmark runtime is too short, you have to run it multiple times.
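The repeat-and-divide idea, sketched in Python for illustration only (the same principle applies with ECLiPSe's statistics/2 timers):

import time

def average_runtime(goal, n):
    # Run goal n times and report the average wall-clock seconds per
    # call, so the measured span is well above the timer resolution.
    t0 = time.perf_counter()
    for _ in range(n):
        goal()
    return (time.perf_counter() - t0) / n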
Runtime variance: On modern machines it is increasingly difficult to get reproducible timings. This is due to effects that have nothing to do with the program you are measuring, such as cache behaviour, paging, context switches, power management hardware, memory alignment, etc.
Run enough repetitions, run on a quiet machine, make sure your results are reproducible.
Repeating benchmarks: In a system like ECLiPSe, running benchmarks repeatedly must be done carefully to ensure that the successive runs really do the same computation, and ideally have same or similar cache and garbage collection behaviour.
In your code, you run the benchmark successively in a conjunction. This is not recommended, because variable instantiations, delayed goals or garbage can survive from previous runs and slow down or speed up subsequent runs. Instead, you could use the pattern
run_n_times(N,Goal) :- \+ ( between(1,N,1,_), \+ Goal ).
which is essentially a way of repeating N times the sequence
once(Goal), fail
The point of this is that the combination of once/1 and fail undoes all of Goal's computation, so that the next iteration starts as much as possible from a similar machine state. Unfortunately, this undo-process itself adds extra runtime, which distorts the measurement...
Test overheads: If you run your benchmark several times, you need a test framework that does that for you, and this contributes to the runtime you measure.
You either have to make sure that the overhead is negligible, or you have to measure the overhead (e.g. by running the test framework with a dummy benchmark) and subtract it, for example:
benchmark(N, DummyGoal, Goal, Time) :-
    cputime(T1),
    run_n_times(N, DummyGoal),   % measure the framework overhead alone
    cputime(T2),
    run_n_times(N, Goal),        % measure the goal plus the same overhead
    cputime(T3),
    Time is (T3-T2)-(T2-T1).     % net time for N runs of Goal
CLP specifics: There are many other considerations specific to the kind of data-driven operations that occur in CLP solvers, and which make CLP runtimes very difficult to compare. These solvers have many internal degrees of freedom regarding scheduling of propagators, degrees of pruning, tie breaking rules in search control, etc.
A paper that discusses these things specifically is:
On Benchmarking Constraint Logic Programming Platforms, by Mark Wallace, Joachim Schimpf, Kish Shen and Warwick Harvey. In CONSTRAINTS Journal, ed. E.C. Freuder, 9(1), pp. 5-34, Kluwer, 2004.
As far as I understand, a sampling profiler works as follows: it interrupts the program execution at regular intervals and reads out the call stack. It notes which part of the program is currently executing and increments a counter that represents this part of the program. In a post-processing step, for each function of the program, the fraction of the whole execution time for which that function is responsible is computed. This is done by looking at the counter C for the specific function and the total number of samples N:
ratio of the function = C / N
Finding the hotspots is then easy, as these are the parts of the program with a high ratio.
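For illustration, here is a minimal sketch of that post-processing step in Python; the recorded stacks are hypothetical, standing in for what a real profiler would collect on each timer interrupt:

from collections import Counter

# One recorded call stack per sample (innermost frame last); made-up data.
samples = [
    ("main", "parse", "tokenize"),
    ("main", "parse", "tokenize"),
    ("main", "solve"),
    ("main", "solve"),
    ("main", "solve"),
]

counts = Counter()
for stack in samples:
    for fn in set(stack):   # count each function at most once per sample
        counts[fn] += 1

N = len(samples)
for fn, c in counts.most_common():
    print(f"{fn}: {c}/{N} = {c / N:.0%}")   # ratio of the function = C / N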
But how can this be done for a parallel program running on parallel hardware? As far as I know, when the program execution is interrupted, the currently executing parts of the program on ALL processors are determined. Because of that, a function which is executed in parallel gets counted multiple times, so the number of samples C of this function can no longer be used for computing its share of the whole execution time.
Is my thinking correct? Are there other ways how the hotspots of a parallel program can be identified - or is this just not possible using sampling?
You're on the right track.
Whether you need to sample all the threads depends on whether they are doing the same thing or different things.
It is not essential to sample them all at the same time.
You need to look at the threads that are actually working, not just idling.
Some points:
Sampling should be on wall-clock time, not CPU time, unless you want to be blind to needless I/O and other blocking calls.
You're not just interested in which functions are on the stack, but which lines of code, because they convey the purpose of the time being spent. It is more useful to look for a "hot purpose" than a "hot spot".
The cost of a function or line of code is just the fraction of samples it appears on. To appreciate that, suppose samples are taken every 10ms for a total of N samples. If the function or line of code could be made to disappear, then all the samples in which it is on the stack would also disappear, reducing N by that fraction. That's what speedup is.
In spite of the last point, in sampling, quality beats quantity. When the goal is to understand what opportunities you have for speedup, you get farther faster by manually scrutinizing 10-20 samples to understand the full reason why each moment in time is being spent. That's why I take samples manually. Knowing the amount of time with statistical precision is really far less important.
I can't emphasize enough the importance of finding and fixing more than one problem. Speed problems come in groups, and each one you fix has a multiplier effect on the ones already fixed. The ones you don't find end up being the limiting factor.
Programs that involve a lot of asynchronous inter-thread message-passing are more difficult, because it becomes harder to discern the full reason why a moment in time is being spent.
How do you figure out whether it's worth parallelizing a particular code block based on its code size? Is the following calculation correct?
Assume:
Thread pool consisting of one thread per CPU.
CPU-bound code block with execution time of X milliseconds.
Y = min(number of CPUs, number of concurrent requests)
Therefore:
Cost: code complexity, potential bugs
Benefit: (X * Y) milliseconds
My conclusion is that it isn't worth parallelizing for small values of X or Y, where "small" depends on how responsive your requests must be.
One thing that will help you figure that out is Amdahl's Law:
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x.
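Checking that 20x limit with a few lines of Python (the processor counts are arbitrary):

total  = 20.0    # hours on a single core
serial = 1.0     # hours that cannot be parallelized
for n in (2, 16, 1024):
    t = serial + (total - serial) / n   # best-case parallel runtime
    print(f"{n} processors: {total / t:.2f}x speedup")
# The speedup approaches total/serial = 20x but never reaches it.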
Figure out what you want to achieve in speedup, and how much parallelism you can actually achieve, then see if it's worth it.
It depends on many factors, such as the difficulty of parallelizing the code, the speedup obtained from it (there are overhead costs in dividing the problem and joining the results), and the amount of time the code spends there (Amdahl's Law).
Well, the benefit is really more:
(X * (Y-1)) * Tc * Pf
Where Tc is an efficiency factor for the threading framework you are using. No threading framework scales perfectly, so using 2x the threads will likely give, at best, 1.9x the speed.
Pf is some factor for parallelization that depends completely on the algorithm (i.e. whether or not you'll need to lock, which will slow the process down).
Also, it's Y-1, since the single-threaded case is basically the Y == 1 case.
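Plugging some made-up numbers into that refined formula, just to see the shape of it:

X  = 200     # ms: execution time of the code block on one thread
Y  = 4       # min(number of CPUs, number of concurrent requests)
Tc = 0.95    # threading-framework efficiency (never a perfect 1.0)
Pf = 0.8     # algorithm-dependent parallelization factor (locking etc.)

benefit_ms = (X * (Y - 1)) * Tc * Pf
print(benefit_ms)   # 456.0 ms, under these purely illustrative assumptions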
As for deciding, it's also a matter of user frustration and expectation: if the user is annoyed at waiting for something, speeding it up has a greater benefit than speeding up a task the user doesn't really mind waiting for. That is not always just due to wait times; it's partly about expectations.