Weak vs Strong Scaling Speedup and Efficiency - parallel-processing

I have a theoretical question. As you know, for the analysis of scaling, the speedup is defined as S(N) = T(1) / T(N) where T(i) is the runtime with i processors. The efficiency is then defined as E(N) = S / N. These definitions make perfect sense for strong scaling.
Right now, I am trying to compute the weak scaling efficiency of my program, and here the following problem occurs: these formulas are nonsense for weak scaling. Weak scaling means that the workload per processor stays the same while the number of processors is increased (and thus the total problem size as well).
Using the formulas above, a perfectly scaling program would have a speedup of 1 and an efficiency of 1/N - which of course is completely unintuitive.
It would seem more appropriate to define the weak scaling efficiency as E(N) = S(1) / S(N).
So here is the actual question: how is weak scaling efficiency generally defined? As I said, the definition above would seem to make more sense to me.
I tried to find out, but all I got were the well-known formulas, probably implicitly intended only for strong scaling.

If you assume the time required for the computation shouldn't increase as the number of processors increases -- which may only be true in embarrassingly parallel problems -- weak scaling efficiency is defined as E(N) = T(1)/T(N).
For example, if every time the number of processors used is doubled the execution time increases by 25% of T(1), then T(16) = T(1) + .25*4*T(1) = 2*T(1) and E(16) = T(1)/(2*T(1)) = 0.5, or 50%.
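That arithmetic can be checked with a short sketch (Python; the 25%-of-T(1)-per-doubling overhead model is the example's assumption, and the T(1) value is an arbitrary placeholder):

```python
import math

def runtime(n_procs, t1=1.0):
    # model from the example: each doubling of the processor count
    # adds 25% of T(1) to the execution time
    doublings = int(math.log2(n_procs))
    return t1 + 0.25 * doublings * t1

def weak_efficiency(n_procs, t1=1.0):
    # weak-scaling efficiency: E(N) = T(1) / T(N)
    return t1 / runtime(n_procs, t1)

print(weak_efficiency(16))  # T(16) = 2*T(1), so E(16) = 0.5
```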
A "speedup" in a weak scaling study doesn't make sense. You refer instead to the percentage of time increase as the number of processors increases.
Finally, a minor nit: speedup is defined as the ratio of the execution time of the best known sequential algorithm to the execution time of the parallel implementation. What you're working with is scalability. This is an important thing to note because parallel algorithms often use implementations that are sequentially suboptimal.

Related

Calculating O(n^3) if we know the O(n) runtime?

If I have a program that runs over some data in O(n) time, can I semi-accurately guesstimate the O(n^3) runtime from my O(n) run?
**O(n) = 5 million iterations # 2 minutes total runtime**
**O(n^2) = ??**
(5 million)^2 = 2.5e+13 iterations
2.5e+13 / 5 million = 5 million times the measured run
5 million * 2 minutes = 1e+7 minutes
1e+7 / 60 = 166,667 hours = 6,944 days = ~19 years
**O(n^3) = ??**
(5 million)^3 = 1.25e+20 iterations
1.25e+20 / 5 million = 2.5e+13 times the measured run
2.5e+13 * 2 minutes = 5e+13 minutes
5e+13 / 60 = 8.3e+11 hours = 3.5e+10 days = ~95 million years
Technically knowing O(...) doesn't tell you anything about any execution time for specific finite inputs.
Practically, you can make an estimation, for example in the way you did, but the caveat is that it will only give you the order of magnitude, under the assumptions that 1. the constant scaling factor omitted in the O(...) notation is roughly 1 in whatever unit you chose (number of iterations here) in both programs/algorithms, and 2. the input value is large enough that the lower-order terms omitted by the O(...) notation are no longer relevant.
Whether these assumptions are good assumptions will depend on the particular programs/algorithms you are looking at. It is trivial to come up with examples where this is a really bad approximation, but there are also many cases where such an estimate may be reasonable.
If you just want to estimate whether the other program will execute in a non-absurd time frame (e.g. hours vs. centuries), I think it will often be good enough for that, assuming you did not choose a weird unit and assuming there is nothing in the program that would break these assumptions, like e.g. an inner loop with exactly 10000000 iterations (a huge hidden constant factor).
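Under those two assumptions, the extrapolation the question attempts can be written down directly (a sketch; the 5-million-iteration / 2-minute baseline comes from the question, and the constant-factor-of-1 calibration is exactly the assumption being hedged above):

```python
BASE_N = 5_000_000   # iterations observed in the O(n) run
BASE_MINUTES = 2.0   # measured total runtime of that run

def estimate_minutes(exponent):
    # assume cost ~ n**exponent with constant factor ~1,
    # calibrated against the measured O(n) run
    iterations = BASE_N ** exponent
    return iterations * (BASE_MINUTES / BASE_N)

years2 = estimate_minutes(2) / 60 / 24 / 365
print(f"O(n^2): ~{years2:.0f} years")    # ~19 years
years3 = estimate_minutes(3) / 60 / 24 / 365
print(f"O(n^3): ~{years3:.2e} years")    # ~9.5e7 years
```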
If I have a program that runs over some data in O(n) time, can I semi-accurately guesstimate the O(n^3) runtime from my O(n) run?
No.
There is no such thing as "the O(n^3) runtime", nor "the O(n) time". Asymptotic complexity speaks to how the behavior of a particular program or subprogram scales with input size. You can use that to estimate the performance of the same program for one input size from appropriate measurements of its performance for other input sizes, but that does not give you any information about any other program's specific performance for a given input size.
In particular, your idea seems to be that the usually-ignored coefficient of the bounding function is a property of the machine, but this is not at all the case. The coefficient is mostly a property of the details of the program. If you estimate it for one program then you know it only for that program. Forget programs with different asymptotic complexity: two programs with the same asymptotic complexity can be constructed that have arbitrarily different absolute performance for any given input size.

Performance and scalability of applications in parallel computers

See the picture that is part of the Advanced Computer Architecture by Hwang which talks about the scalability of performance in parallel processing.
The questions are
1- Regarding figure (a), what are the examples for theta (exponential) and alpha (constant)? Which workloads grow exponentially when increasing the number of machines? Also, I haven't seen a constant workload when working with multiple cores/computers.
2- Regarding figure (b), why is the efficiency of exponential workloads the highest? I can't understand that!
3- Regarding figure (c), what does the fixed-memory model mean? A fixed-memory workload sounds like alpha, which is noted as the fixed-load model.
4- Regarding figure (c), what does the fixed-time model mean? The term "fixed" is misleading, I think. I interpret it as "constant". The text says that the fixed-time model is actually the linear curve gamma in (a).
5- Regarding figure (c), why doesn't the exponential (memory-bound) model hit the communication bound?
You may see the text describing the figure below.
I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!!"
Can someone shed some light on that with some examples?
Workload refers to the input size or problem size, which is basically the amount of data to be processed. Machine size is the number of processors. Efficiency is defined as speedup divided by the machine size. The efficiency metric is more meaningful than speedup(1). To see this, consider for example a program that achieves a speedup of 2X on a parallel computer. This may sound impressive. But if I also told you that the parallel computer has 1000 processors, a 2X speedup is really terrible. Efficiency, on the other hand, captures both the speedup and the context in which it was achieved (the number of processors used). In this example, efficiency is equal to 2/1000 = 0.002. Note that efficiency ranges between 1 (best) and 1/N (worst). If I just tell you that the efficiency is 0.002, you'd immediately realize that it's terrible, even if I don't tell you the number of processors.
Figure (a) shows different kinds of applications whose workloads can change in different ways to utilize a specific number of processors. That is, the applications scale differently. Generally, the reason you add more processors is to be able to exploit the increasing amount of parallelism available in larger workloads. The alpha line represents an application with a fixed-size workload, i.e., the amount of parallelism is fixed, so adding more processors will not give any additional speedup. If the speedup is fixed but N gets larger, then the efficiency decreases and its curve would look like that of 1/N. Such an application has zero scalability.
The other three curves represent applications that can maintain high efficiency with an increasing number of processors (i.e., scalable applications) by increasing the workload in some pattern. The gamma curve represents the ideal workload growth. This is defined as the growth that maintains high efficiency but in a realistic way. That is, it does not put too much pressure on other parts of the system such as memory, disk, inter-processor communication, or I/O. So scalability is achievable. Figure (b) shows the efficiency curve of gamma. The efficiency slightly deteriorates due to the overhead of higher parallelism and due to the serial part of the application, whose execution time does not change. This represents a perfectly scalable application: we can realistically make use of more processors by increasing the workload. The beta curve represents an application that is somewhat scalable, i.e., good speedups can be attained by increasing the workload but the efficiency deteriorates a little faster.
The theta curve represents an application where very high efficiency can be achieved because there is so much data that can be processed in parallel. But that efficiency can only be achieved theoretically. That's because the workload has to grow exponentially, but realistically, all of that data cannot be efficiently handled by the memory system. So such an application is considered poorly scalable despite the theoretically very high efficiency.
Typically, applications with sub-linear workload growth end up being communication-bound when increasing the number of processors, while applications with super-linear workload growth end up being memory-bound. This is intuitive. Applications that process very large amounts of data (the theta curve) spend most of their time processing the data independently with little communication. On the other hand, applications that process moderate amounts of data (the beta curve) tend to have more communication between the processors, where each processor uses a small amount of data to calculate something and then shares it with others for further processing. The alpha application is also communication-bound because if you use too many processors to process the fixed amount of data, then the communication overhead will be too high since each processor will operate on a tiny data set. The fixed-time model is called that because it scales very well (it takes about the same amount of time to process more data with more processors available).
I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!!"
How do you reach the minimum execution time? Increase the number of processors as long as the speedup is increasing. Once the speedup reaches a fixed value, you've reached the number of processors that achieves the minimum execution time. However, efficiency might be very poor if the speedup is small. This follows naturally from the efficiency formula. For example, suppose that an algorithm achieves a speedup of 3X on a 100-processor system and increasing the number of processors further will not increase the speedup. Therefore, the minimum execution time is achieved with 100 processors. But the efficiency is merely 3/100 = 0.03.
Example: Parallel Binary Search
A serial binary search has an execution time proportional to log2(N), where N is the number of elements in the array to be searched. This can be parallelized by partitioning the array into P partitions, where P is the number of processors. Each processor will then perform a serial binary search on its partition. At the end, all partial results can be combined in serial fashion. So the execution time of the parallel search is (log2(N)/P) + (C*P). The latter term represents the overhead and the serial part that combines the partial results. It's linear in P, and C is just some constant. So the speedup is:
log2(N)/((log2(N)/P) + (C*P))
and the efficiency is just that divided by P. By how much should the workload (the size of the array) increase to maintain maximum efficiency (i.e., to make the speedup as close to P as possible)? Consider for example what happens when we increase the input size linearly with respect to P. That is:
N = K*P, where K is some constant. The speedup is then:
log2(K*P)/((log2(K*P)/P) + (C*P))
How does the speedup (or efficiency) change as P approaches infinity? Note that the numerator has a logarithm term, but the denominator has a logarithm plus a polynomial of degree 1. The polynomial grows much faster than the logarithm, so the denominator outgrows the numerator and the speedup (and hence the efficiency) approaches zero. It's clear that we can do better by increasing the workload at a faster rate. In particular, we have to increase it exponentially. Assume that the input size is of the form:
N = K^P, where K is some constant. The speedup is then:
log2(K^P)/((log2(K^P)/P) + (C*P))
= P*log2(K)/((P*log2(K)/P) + (C*P))
= P*log2(K)/(log2(K) + (C*P))
This is a little better now. Both the numerator and denominator grow linearly, so the speedup approaches a constant. This is still bad because the efficiency would be that constant divided by P, which drops steeply as P increases (it would look like the alpha curve in Figure (b)). It should be clear now that the input size should be of the form:
N = K^(P^2), where K is some constant. The speedup is then:
log2(K^(P^2))/((log2(K^(P^2))/P) + (C*P))
= P^2*log2(K)/((P^2*log2(K)/P) + (C*P))
= P^2*log2(K)/((P*log2(K)) + (C*P))
= P^2*log2(K)/((C + log2(K))*P)
= P*log2(K)/(C + log2(K))
Ideally, the term log2(K)/(C+log2(K)) should be one, but that's impossible since C is not zero. However, we can make it arbitrarily close to one by making K arbitrarily large. So K has to be very large compared to C. This makes the input size even larger, but does not change it asymptotically. Note that both of these constants have to be determined experimentally and they are specific to a particular implementation and platform. This is an example of the theta curve.
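The three workload-growth patterns can be compared numerically with a small sketch (Python; the overhead constant C and workload constant K are arbitrary stand-ins, since as noted above they must be determined experimentally for a real implementation):

```python
import math

C = 0.001  # hypothetical overhead constant
K = 2      # hypothetical workload constant

def speedup(n, p):
    # model from the text: serial time log2(N),
    # parallel time log2(N)/P + C*P
    return math.log2(n) / (math.log2(n) / p + C * p)

for p in (8, 64, 512):
    print(p,
          round(speedup(K * p, p) / p, 3),         # N = K*P: efficiency collapses toward 0
          round(speedup(K ** p, p) / p, 3),        # N = K^P: efficiency still decays (S -> log2(K)/C)
          round(speedup(K ** (p * p), p) / p, 3))  # N = K^(P^2): efficiency stays near 1
```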
(1) Recall that speedup = (execution time on a uniprocessor)/(execution time on N processors). The minimum speedup is 1 and the maximum speedup is N.

what is the time complexity to divide two numbers?

Assume that I have two numbers a and b (a > b), and I divide a by b (i.e. calculate a/b). How much time do I need?
Well, people are commenting about the instruction set as well as the architecture, so here is the assumption:
Assume a and b are two integers, each of them n bits, and we have a standard x86_64 machine with a standard instruction set.
A request was made to provide an answer rather than just a link, so I will have a go at this. As pointed out by phs above, there is a good link at https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division.
Division is one of a number of operations which, as far as computational complexity theory is concerned, are no more expensive than multiplication. One of the reasons for this is that computational complexity theory only really cares about how the cost of an algorithm grows as its input gets large, which in this case means multi-precision division. Another is that there is a faster algorithm for division than pen-and-paper long division - an algorithm in fact good enough to influence the design of computer hardware, famous examples being the Cray-1 reciprocal iteration and the Pentium bug.
The fast way to do division is, instead of dividing a by b, to multiply a by 1/b, reducing the problem to computing a reciprocal. To compute 1/b, you first of all scale the problem by powers of two to get b in the range [1, 2), and make a first guess of the answer, typically from a lookup table - the Pentium bug had errors in the lookup table. Now you have an answer with lots of error - you have 1/b + x, where x is the error, which is unknown to you, but small if your lookup table was of a decent size.
The theory of Newton-Raphson iteration for solving equations tells you that if c = 1/b + x is a guess for 1/b, then c(2-bc) is a better guess. If c = 1/b + x then some algebra will tell you that the better guess works out as 1/b -bx^2. You have squared the error x, and since x was small (say 0.1 to start off with) you have roughly doubled the number of bits correct.
You are doubling the number of bits you have correct every time you do this, so it doesn't take many iterations to get a (good enough) answer. Now (here comes the neat part) because you know each iteration is only an approximation anyway, you need only calculate it to the accuracy that you reckon the approximation will give, not the full accuracy of the answer you want. Most of the underlying work is the multiplication in c(2-bc), and this grows faster than linearly in the number of bits of accuracy you work to. When you sit down and work out the cost of all of this, you find that it grows rapidly enough with the number of digits that you get a sum that looks like 1/2 + 1/4 + 1/8 + ... - lots of terms but converging to an answer not too far off the very first one - and the cost of a multi-precision divide is not more than a constant factor more than the cost of a multi-precision multiply.
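A toy floating-point version of the iteration can make this concrete (a sketch only; the linear seed 48/17 - (32/17)*m is the textbook initial guess for inputs scaled into [0.5, 1), standing in for the hardware lookup table):

```python
import math

def reciprocal(b, iters=5):
    """Approximate 1/b with Newton-Raphson: c <- c*(2 - b*c)."""
    assert b > 0
    # scale b into [0.5, 1): b = m * 2**k
    k = math.floor(math.log2(b)) + 1
    m = b / 2.0**k
    # linear first guess for 1/m; the error is small but nonzero
    c = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iters):
        c = c * (2.0 - m * c)  # each step roughly doubles the correct bits
    return c / 2.0**k          # undo the scaling: 1/b = (1/m) / 2**k

print(reciprocal(7.0) * 7.0)   # ~1.0
```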

Why is the constant always dropped from big O analysis?

I'm trying to understand a particular aspect of Big O analysis in the context of running programs on a PC.
Suppose I have an algorithm that has a performance of O(n + 2). Here if n gets really large the 2 becomes insignificant. In this case it's perfectly clear the real performance is O(n).
However, say another algorithm has an average performance of O(n^2 / 2). The book where I saw this example says the real performance is O(n^2). I'm not sure I get why; I mean, the 2 in this case seems not completely insignificant. So I was looking for a nice clear explanation from the book. The book explains it this way:
"Consider though what the 1/2 means. The actual time to check each value
is highly dependent on the machine instruction that the code
translates to and then on the speed at which the CPU can execute the instructions. Therefore the 1/2 doesn't mean very much."
And my reaction is... huh? I literally have no clue what that says or more precisely what that statement has to do with their conclusion. Can somebody spell it out for me please.
Thanks for any help.
There's a distinction between "are these constants meaningful or relevant?" and "does big-O notation care about them?" The answer to that second question is "no," while the answer to that first question is "absolutely!"
Big-O notation doesn't care about constants because big-O notation only describes the long-term growth rate of functions, rather than their absolute magnitudes. Multiplying a function by a constant only influences its growth rate by a constant amount, so linear functions still grow linearly, logarithmic functions still grow logarithmically, exponential functions still grow exponentially, etc. Since these categories aren't affected by constants, it doesn't matter that we drop the constants.
That said, those constants are absolutely significant! A function whose runtime is 10^100 * n will be way slower than a function whose runtime is just n. A function whose runtime is n^2 / 2 will be faster than a function whose runtime is just n^2. The fact that the first two functions are both O(n) and the second two are O(n^2) doesn't change the fact that they don't run in the same amount of time, since that's not what big-O notation is designed for. O notation is good for determining whether in the long term one function will be bigger than another. Even though 10^100 * n is a colossally huge value for any n > 0, that function is O(n) and so for large enough n it will eventually beat the function whose runtime is n^2 / 2, because that function is O(n^2).
In summary - since big-O only talks about relative classes of growth rates, it ignores the constant factor. However, those constants are absolutely significant; they just aren't relevant to an asymptotic analysis.
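The crossover claim can be made concrete (Python; the 10^100 constant is just the answer's deliberately absurd example, and integer arithmetic is used so the huge values stay exact):

```python
f = lambda n: 10**100 * n      # O(n), but with a colossal constant factor
g = lambda n: n * n // 2       # O(n^2), tiny constant factor

n = 2 * 10**100                # break-even point: 10^100 * n == n^2 / 2
print(f(n) == g(n))            # True: the two runtimes cross exactly here
print(f(10 * n) < g(10 * n))   # True: beyond it, the O(n) function wins
```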
Big O notation is most commonly used to describe an algorithm's running time. In this context, I would argue that specific constant values are essentially meaningless. Imagine the following conversation:
Alice: What is the running time of your algorithm?
Bob: 7n^2
Alice: What do you mean by 7n^2?
What are the units? Microseconds? Milliseconds? Nanoseconds?
What CPU are you running it on? Intel i9-9900K? Qualcomm Snapdragon 845? (Or are you using a GPU, an FPGA, or other hardware?)
What type of RAM are you using?
What programming language did you implement the algorithm in? What is the source code?
What compiler / VM are you using? What flags are you passing to the compiler / VM?
What is the operating system?
etc.
So as you can see, any attempt to indicate a specific constant value is inherently problematic. But once we set aside constant factors, we are able to clearly describe an algorithm's running time. Big O notation gives us a robust and useful description of how long an algorithm takes, while abstracting away from the technical features of its implementation and execution.
Now it is possible to specify the constant factor when describing the number of operations (suitably defined) or CPU instructions an algorithm executes, the number of comparisons a sorting algorithm performs, and so forth. But typically, what we're really interested in is the running time.
None of this is meant to suggest that the real-world performance characteristics of an algorithm are unimportant. For example, if you need an algorithm for matrix multiplication, the Coppersmith-Winograd algorithm is inadvisable. It's true that this algorithm takes O(n^2.376) time, whereas the Strassen algorithm, its strongest competitor, takes O(n^2.808) time. However, according to Wikipedia, Coppersmith-Winograd is slow in practice, and "it only provides an advantage for matrices so large that they cannot be processed by modern hardware." This is usually explained by saying that the constant factor for Coppersmith-Winograd is very large. But to reiterate, if we're talking about the running time of Coppersmith-Winograd, it doesn't make sense to give a specific number for the constant factor.
Despite its limitations, big O notation is a pretty good measure of running time. And in many cases, it tells us which algorithms are fastest for sufficiently large input sizes, before we even write a single line of code.
Big-O notation only describes the growth rate of algorithms in terms of mathematical functions, rather than the actual running time of algorithms on some machine.
Mathematically: let f(x) and g(x) be positive for x sufficiently large. We say that f(x) and g(x) grow at the same rate as x tends to infinity if lim(x->infinity) f(x)/g(x) is a finite, nonzero constant.
Now let f(x) = x^2 and g(x) = x^2/2; then lim(x->infinity) f(x)/g(x) = 2, so x^2 and x^2/2 have the same growth rate, and we can say O(x^2/2) = O(x^2).
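The limit can be eyeballed numerically (a trivial sketch):

```python
f = lambda x: x**2
g = lambda x: x**2 / 2

for x in (10, 1_000, 1_000_000):
    print(f(x) / g(x))  # 2.0 every time: a finite nonzero limit,
                        # so f and g grow at the same rate
```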
As templatetypedef said, hidden constants in asymptotic notations are absolutely significant. As an example: merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time. But since the hidden constant factors in insertion sort are smaller than those of merge sort, in practice insertion sort can be faster than merge sort for small problem sizes on many machines.
You are completely right that constants matter. In comparing many different algorithms for the same problem, the O numbers without constants give you an overview of how they compare to each other. If you then have two algorithms in the same O class, you would compare them using the constants involved.
But even for different O classes the constants are important. For instance, for multi-digit or big-integer multiplication, the naive algorithm is O(n^2), Karatsuba is O(n^log_2(3)), Toom-Cook is O(n^log_3(5)) and Schönhage-Strassen is O(n*log(n)*log(log(n))). However, each of the faster algorithms has an increasingly large overhead reflected in large constants. So to get approximate cross-over points, one needs valid estimates of those constants. Thus one gets, as a rough guess, that up to n=16 the naive multiplication is fastest, up to n=50 Karatsuba is, and the cross-over from Toom-Cook to Schönhage-Strassen happens around n=200.
In reality, the cross-over points not only depend on the constants, but also on processor-caching and other hardware-related issues.
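For reference, Karatsuba's trick of trading four half-size multiplications for three is only a few lines (a sketch; the 2**32 base-case cutoff is an arbitrary stand-in for the experimentally tuned cross-over discussed above):

```python
def karatsuba(x, y):
    # O(n^log2(3)) multiplication of nonnegative integers:
    # three recursive products instead of four
    if x < 2**32 or y < 2**32:
        return x * y  # below the cutoff, naive multiplication wins
    n = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> n, x & ((1 << n) - 1)
    yh, yl = y >> n, y & ((1 << n) - 1)
    a = karatsuba(xh, yh)                    # high * high
    b = karatsuba(xl, yl)                    # low * low
    c = karatsuba(xh + xl, yh + yl) - a - b  # cross terms
    return (a << (2 * n)) + (c << n) + b

x, y = 3**200, 7**150
print(karatsuba(x, y) == x * y)  # True
```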
Big O without constants is enough for algorithm analysis.
First, the actual time depends not only on how many instructions there are, but also on the time for each instruction, which is closely tied to the platform where the code runs. That is beyond theoretical analysis, so the constant is not necessary in most cases.
Second, Big O is mainly used to measure how the run time will increase as the problem becomes larger, or how the run time will decrease as the hardware's performance improves.
Third, in situations of high-performance optimization, constants will also be taken into consideration.
The time required to do a particular task on today's computers is usually small unless the input value is very large.
Suppose we want to multiply two matrices of size 10*10: that is no problem unless we want to do this operation many times, and then the role of asymptotic notation becomes prevalent. When the value of n becomes very big, the constants don't really make any difference to the answer and are almost negligible, so we tend to leave them out while calculating the complexity.
Time complexity for O(n+n) reduces to O(2n). Now 2 is a constant. So the time complexity will essentially depend on n.
Hence the time complexity of O(2n) equates to O(n).
Also if there is something like this O(2n + 3) it will still be O(n) as essentially the time will depend on the size of n.
Now suppose there is a code which is O(n^2 + n), it will be O(n^2) as when the value of n increases the effect of n will become less significant compared to effect of n^2.
Eg:
n = 2 => 4 + 2 = 6
n = 100 => 10000 + 100 => 10100
n = 10000 => 100000000 + 10000 => 100010000
As you can see, the second term has a lesser and lesser effect as the value of n keeps increasing. Hence the time complexity evaluates to O(n^2).

How can one test time complexity "experimentally"?

Could it be done by keeping a counter to see how many iterations an algorithm goes through, or does the time duration need to be recorded?
The currently accepted answer won't give you any theoretical estimation, unless you are somehow able to fit the experimentally measured times with a function that approximates them. This answer gives you a manual technique that fills that gap.
You start by guessing the theoretical complexity function of the algorithm. You also experimentally measure the actual complexity (number of operations, time, or whatever you find practical), for increasingly larger problems.
For example, say you guess an algorithm is quadratic. Measure (say) the time, and compute the ratio of time to your guessed function (n^2):

for n = 5 to 10000 // n: problem size
    long start = System.nanoTime()
    executeAlgorithm(n)
    long end = System.nanoTime()
    long totalTime = end - start
    double ratio = (double) totalTime / (n * n)
end

As n moves towards infinity, this ratio...
Converges to zero? Then your guess is too low. Repeat with something bigger (e.g. n^3)
Diverges to infinity? Then your guess is too high. Repeat with something smaller (e.g. nlogn)
Converges to a positive constant? Bingo! Your guess is on the money (at least approximates the theoretical complexity for as large n values as you tried)
Basically that uses the definition of big-O notation, f(x) = O(g(x)) <=> f(x) <= c * g(x) for sufficiently large x - f(x) is the actual cost of your algorithm, g(x) is the guess you put in, and c is a constant. So basically you try to experimentally find the limit of f(x)/g(x); if your guess hits the real complexity, this ratio will estimate the constant c.
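The ratio test above can be run on an operation counter instead of wall-clock time to avoid measurement noise (a sketch; bubble sort is just a convenient known-quadratic guinea pig, counting comparisons as the cost):

```python
def bubble_sort_ops(n):
    # count comparisons while bubble-sorting a reversed list of n items
    data = list(range(n, 0, -1))
    ops = 0
    for i in range(len(data)):
        for j in range(len(data) - 1 - i):
            ops += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return ops

for n in (100, 200, 400, 800):
    print(n, bubble_sort_ops(n) / (n * n))  # guess: g(n) = n^2
# the ratios settle near 0.5, so the quadratic guess is right
# (and the constant c is about 1/2)
```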
Algorithm complexity is defined as (something like): "the number of operations the algorithm does as a function of its input size."
So you need to try your algorithm with various input sizes (i.e. for sort - try sorting 10 elements, 100 elements etc.), and count each operation (e.g. assignment, increment, mathematical operation etc.) the algorithm does.
This will give you a good "theoretical" estimation.
If you want real-life numbers on the other hand - use profiling.
As others have mentioned, the theoretical time complexity is a function of the number of CPU operations done by your algorithm. In general, processor time should be a good approximation for that, modulo a constant. But the real run time may vary for a number of reasons, such as:
processor pipeline flushes
cache misses
garbage collection
other processes on the machine
Unless your code is systematically causing some of these things to happen, then with a large enough number of statistical samples you should have a fairly good idea of the time complexity of your algorithm, based on observed runtime.
The best way would be to actually count the number of "operations" performed by your algorithm. The definition of "operation" can vary: for an algorithm such as quicksort, it could be the number of comparisons of two numbers.
You could measure the time taken by your program to get a rough estimate, but various factors could cause this value to differ from the actual mathematical complexity.
Yes.
You can track both the actual performance and the number of iterations.
Might I suggest using ANTS profiler? It will provide you with this kind of detail while you run your app with "experimental" data.
