Performance and scalability of applications in parallel computers

See the figure from Advanced Computer Architecture by Hwang, which illustrates the scalability of performance in parallel processing.
The questions are:
1- Regarding figure (a), what are examples of theta (exponential) and alpha (constant)? Which workloads grow exponentially as the number of machines increases? Also, I have never seen a constant workload when working with multiple cores/computers.
2- Regarding figure (b), why is the efficiency of exponential workloads the highest? I cannot understand that!
3- Regarding figure (c), what does the fixed-memory model mean? A fixed-memory workload sounds like alpha, which is labeled the fixed-load model.
4- Regarding figure (c), what does the fixed-time model mean? The term "fixed" is misleading, I think; I interpret it as "constant". The text says that the fixed-time model is actually the linear curve gamma in (a).
5- Regarding figure (c), why doesn't the exponential (memory-bound) model hit the communication bound?
You may see the text describing the figure below.
I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!"
Can someone shed some light on that with a few examples?

Workload refers to the input size or problem size, which is basically the amount of data to be processed. Machine size is the number of processors. Efficiency is defined as speedup divided by the machine size. The efficiency metric is more meaningful than speedup (1). To see this, consider for example a program that achieves a speedup of 2X on a parallel computer. This may sound impressive. But if I also told you that the parallel computer has 1000 processors, a 2X speedup is really terrible. Efficiency, on the other hand, captures both the speedup and the context in which it was achieved (the number of processors used). In this example, the efficiency is equal to 2/1000 = 0.002. Note that efficiency ranges between 1 (best) and 1/N (worst). If I just tell you that the efficiency is 0.002, you'd immediately realize that it's terrible, even if I don't tell you the number of processors.
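As a minimal sketch of these definitions (my own illustration, not from the book):

    def efficiency(speedup, n_processors):
        # Efficiency = speedup / machine size; it ranges from 1/N (worst) to 1 (best).
        return speedup / n_processors

    print(efficiency(2.0, 1000))    # 0.002: the "impressive" 2X speedup from above
    print(efficiency(800.0, 1000))  # 0.8:   a genuinely good use of 1000 processors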
Figure (a) shows different kinds of applications whose workloads can change in different ways to utilize a specific number of processors. That is, the applications scale differently. Generally, the reason you add more processors is to be able to exploit the increasing amount of parallelism available in larger workloads. The alpha line represents an application with a fixed-size workload, i.e., the amount of parallelism is fixed, so adding more processors will not give any additional speedup. If the speedup is fixed but N gets larger, then the efficiency decreases and its curve looks like that of 1/N. Such an application has zero scalability.
The other three curves represent applications that can maintain high efficiency with an increasing number of processors (i.e., they are scalable) by growing the workload in some pattern. The gamma curve represents the ideal workload growth. This is defined as the growth that maintains high efficiency in a realistic way. That is, it does not put too much pressure on other parts of the system, such as memory, disk, inter-processor communication, or I/O, so scalability is achievable. Figure (b) shows the efficiency curve of gamma. The efficiency slightly deteriorates due to the overhead of higher parallelism and due to the serial part of the application, whose execution time does not change. This represents a perfectly scalable application: we can realistically make use of more processors by increasing the workload. The beta curve represents an application that is somewhat scalable, i.e., good speedups can be attained by increasing the workload, but the efficiency deteriorates a little faster.
The theta curve represents an application where very high efficiency can be achieved because there is so much data that can be processed in parallel. But that efficiency can only be achieved theoretically. That's because the workload has to grow exponentially, and realistically, all of that data cannot be handled efficiently by the memory system. So such an application is considered poorly scalable despite the theoretically very high efficiency.
Typically, applications with sub-linear workload growth end up being communication-bound as the number of processors increases, while applications with super-linear workload growth end up being memory-bound. This is intuitive. Applications that process very large amounts of data (the theta curve) spend most of their time processing the data independently, with little communication. On the other hand, applications that process moderate amounts of data (the beta curve) tend to have more communication between the processors, where each processor uses a small amount of data to calculate something and then shares it with the others for further processing. The alpha application is also communication-bound: if you use too many processors to process the fixed amount of data, then the communication overhead will be too high, since each processor will operate on a tiny data set. The fixed-time model is called that because it scales very well (it takes about the same amount of time to process more data when more processors are available).
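To make the curves concrete, here is a toy model (my own assumptions: a unit serial part plus a logarithmic communication cost, not the book's model) that reproduces the qualitative shapes in figure (b):

    import math

    def parallel_time(w, p, serial=1.0, comm=1.0):
        # Toy model: perfectly parallel work w/p, a fixed serial part, and a
        # communication overhead that grows with log2(p). All constants are assumed.
        return serial + w / p + comm * math.log2(p)

    def efficiency(w, p):
        t1 = 1.0 + w                      # time on a single processor
        return t1 / parallel_time(w, p) / p

    for p in (4, 16, 64, 256):
        print(f"P={p:4d}"
              f"  alpha={efficiency(100.0, p):.3f}"          # fixed workload
              f"  gamma={efficiency(100.0 * p, p):.3f}"      # workload grows linearly with P
              f"  theta={efficiency(100.0 * 2**p, p):.3f}")  # workload grows exponentially with P

The alpha column collapses, gamma degrades only gently, and theta stays near 1, which matches the ordering in the figure; what this model cannot show is that the theta workload quickly stops fitting in any real memory system.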
I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!"
How do you reach the minimum execution time? Increase the number of processors as long as the speedup is increasing. Once the speedup reaches a fixed value, you've reached the number of processors that achieves the minimum execution time. However, the efficiency might be very poor if the speedup is small. This follows naturally from the efficiency formula. For example, suppose that an algorithm achieves a speedup of 3X on a 100-processor system and that increasing the number of processors further will not increase the speedup. Therefore, the minimum execution time is achieved with 100 processors. But the efficiency is merely 3/100 = 0.03.
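Here is a quick numerical sketch of that situation (a toy timing model with made-up constants, not a measured system):

    def exec_time(p, serial=10.0, parallel=90.0, overhead=0.05):
        # Assumed model: a serial part, a perfectly parallel part, and a small
        # per-processor coordination overhead that eventually dominates.
        return serial + parallel / p + overhead * p

    times = {p: exec_time(p) for p in range(1, 1001)}
    p_best = min(times, key=times.get)            # machine size giving the minimum time
    speedup = exec_time(1) / times[p_best]
    print(p_best, round(speedup, 2), round(speedup / p_best, 3))
    # -> the minimum time needs about 42 processors, yet efficiency is only about 0.17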
Example: Parallel Binary Search
A serial binary search has an execution time equal to log2(N) where N is the number of elements in the array to be searched. This can be parallelized by partitioning the array into P partitions where P is the number of processors. Each processor then will perform a serial binary search on its partition. At the end, all partial results can be combined in serial fashion. So the execution time of the parallel search is (log2(N)/P) + (C*P). The latter term represents the overhead and the serial part that combines the partial results. It's linear in P and C is just some constant. So the speedup is:
log2(N)/((log2(N)/P) + (C*P))
and the efficiency is just that divided by P. By how much should the workload (the size of the array) grow to maintain maximum efficiency (or to make the speedup as close to P as possible)? Consider, for example, what happens when we increase the input size linearly with respect to P. That is:
N = K*P, where K is some constant. The speedup is then:
log2(K*P)/((log2(K*P)/P) + (C*P))
How does the speedup (or efficiency) change as P approaches infinity? Note that the numerator has a logarithmic term, but the denominator has a logarithm plus a polynomial of degree 1. The polynomial grows asymptotically much faster than the logarithm, so the denominator outgrows the numerator and the speedup (and hence the efficiency) approaches zero rapidly. It's clear that we can do better by increasing the workload at a faster rate. In particular, we have to increase it exponentially. Assume that the input size is of the form:
N = K^P, where K is some constant. The speedup is then:
log2(K^P)/((log2(K^P)/P) + (C*P))
= P*log2(K)/((P*log2(K)/P) + (C*P))
= P*log2(K)/(log2(K) + (C*P))
This is a little better now. Both the numerator and denominator grow linearly, so the speedup approaches a constant (log2(K)/C as P goes to infinity). This is still bad because the efficiency would be that constant divided by P, which drops steeply as P increases (it would look like the alpha curve in Figure (b)). To keep the efficiency from degrading, log2(N) has to grow at least as fast as P^2, so the input size should be of the form:
N = K^(P^2), where K is some constant. The speedup is then:
log2(K^(P^2))/((log2(K^(P^2))/P) + (C*P))
= P^2*log2(K)/((P^2*log2(K)/P) + (C*P))
= P^2*log2(K)/((P*log2(K)) + (C*P))
= P^2*log2(K)/((C + log2(K))*P)
= P*log2(K)/(C + log2(K))
Ideally, the term log2(K)/(C+log2(K)) should be one, but that's impossible since C is not zero. However, we can make it arbitrarily close to one by making K arbitrarily large. So K has to be very large compared to C. This makes the input size even larger, but does not change it asymptotically. Note that both of these constants have to be determined experimentally and they are specific to a particular implementation and platform. This is an example of the theta curve.
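Putting the three growth rules side by side numerically (C and K below are hypothetical values chosen for illustration; as noted above, the real ones are implementation- and platform-specific):

    import math

    def model_speedup(log2_n, p, c=0.01):
        # Speedup model from the example: T_serial = log2(N), T_parallel = log2(N)/P + C*P.
        return log2_n / (log2_n / p + c * p)

    K = 1000.0
    for p in (10, 100, 1000):
        lin  = model_speedup(math.log2(K * p), p)      # N = K*P      -> log2(N) = log2(K*P)
        expo = model_speedup(p * math.log2(K), p)      # N = K^P      -> log2(N) = P*log2(K)
        quad = model_speedup(p * p * math.log2(K), p)  # N = K^(P^2)  -> log2(N) = P^2*log2(K)
        print(f"P={p:5d}  efficiency:  K*P={lin/p:.3f}  K^P={expo/p:.3f}  K^(P^2)={quad/p:.3f}")

Only the K^(P^2) rule keeps the efficiency pinned near log2(K)/(C + log2(K)); the other two collapse as P grows, just as derived above.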
(1) Recall that speedup = (execution time on a uniprocessor)/(execution time on N processors). The minimum speedup is 1 and the maximum speedup is N.

Related

How to account for cache misses in estimating performance?

Generally, performance is given as a big-O order of growth: O(f(N)) + K, where the constant K is usually ignored because it matters mainly for smaller values of N.
But more and more I have seen performance dominated by the size of the underlying data, and this is not part of the algorithmic complexity.
Assume algorithm A is O(log N) in time but uses O(N) space, and algorithm B is O(N) in time but uses O(log N) space. It used to be the case that algorithm A was faster. Now, with cache misses in multi-tiered caches, it is likely that algorithm B will be faster for large N, and possibly for small N if it has a smaller K.
The problem is how do you represent this?
Well, the O(N) nomenclature abstracts away some important details that only become insignificant as N approaches infinity. Those details can be, and often are, the most significant factors at values of N less than infinity. To help explain, consider that if a term is listed as O(N^x), it is only specifying the most significant term in N. In reality, the performance could be characterized as:
a*N^x + b*N^(x-1) + c*N^(x-2) + ... + K
So as N approaches infinity, the dominant term becomes N^x, but clearly at values of N less than infinity, the dominant term could be one of the lesser terms. Looking at your examples, you give two algorithms. Let's call the one that provides O(N) performance algorithm A, and the one that provides O(log N) performance algorithm B. In reality, these two algorithms have performance characteristics as follows:
Performance A = aN + b(log N) + c
Performance B = x(log N) + y
If your constant values are a=0.001 and x=99,999, you can see how A provides better performance than B. In addition, you mention that one algorithm increases the likelihood of a cache miss, and that likelihood depends on the size of the data. You'll need to figure out the likelihood of the cache miss as a function of the data size and use that as a factor when calculating the O performance of the overall algorithm. For example:
If the cost of a cache miss is CM (we'll assume it's constant), then for algorithm B the overall cache cost is F(N)*CM. If that cache cost is a factor in the dominant loop of algorithm B (the O(log N) part), then the real performance characteristic of algorithm B is O(F(N)*(log N)). For algorithm A the overall cache cost would be F(log N)*CM. If the cache miss manifests during the dominant loop of algorithm A, then the real performance of algorithm A is O(F(log N)*N). As long as you can determine F(), you can then compare algorithms A and B.
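As a toy illustration of folding F() into the cost model (the shape of F(), the cache size, and the miss penalty below are all assumptions, not measurements):

    import math

    HIT_COST, MISS_PENALTY = 1, 100      # assumed per-access costs, in cycles

    def miss_rate(working_set, cache_size=32_768):
        # Toy F(): no misses while the data fits in cache, then a rate that grows
        # toward 1 as the working set becomes much larger than the cache.
        return 0.0 if working_set <= cache_size else 1.0 - cache_size / working_set

    def cache_adjusted_cost(steps, working_set):
        # Charge every step one expected memory-access cost.
        return steps * (HIT_COST + miss_rate(working_set) * MISS_PENALTY)

    for n in (10**4, 10**6, 10**8):
        b = cache_adjusted_cost(math.log2(n), working_set=n)   # B: O(log N) steps over O(N) data
        a = cache_adjusted_cost(n, working_set=math.log2(n))   # A: O(N) steps over O(log N) data
        print(f"N={n:>11,}   B ~ {b:,.0f} cycles   A ~ {a:,.0f} cycles")

Note that in this particular toy model the misses only inflate B's constant factor (by roughly the miss penalty), which is essentially the point the next answer makes.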
Cache misses are not taken into account in big-O notation, since they are constant factors.
Even if you pessimistically assume every array access is going to be a cache miss, and let's say a cache miss takes 100 cycles (this time is constant, since we are assuming random-access memory), then iterating over an array of length n is going to take 100*n cycles for the cache misses (plus overhead for the loop and control), and in general terms it remains O(n).
One reason big O is used so often is that it is platform independent (well, when speaking about RAM machines at least). If we took cache misses into account, the result would be different for each platform.
If you are looking for a theoretical notation that takes constants into account, you are looking for tilde notation.
Also, that's why big-O notation is seldom enough for large-scale or time-critical systems; these are constantly profiled to find bottlenecks, which are then improved locally by the developers. So if you're looking for real performance, measure it empirically and don't settle for theoretical notation.

Why constants are not considered in analysis of algorithm efficiency?

Multiplicative constants are not considered in analysis of algorithm time efficiency because
A) they cancel out when computing efficiency functions
B) constant functions grow very slowly with input size growth
C) they have a small effect when input size is small
D) they can be overcome by faster machines
E) they do not affect the actual run time of the algorithm
My guess is "B", but I don't know the correct answer. Are all the options incorrect?
So here's my comment extended to an answer:
B) constant functions grow very slowly with input size growth
This doesn't even make sense. A constant function doesn't grow at all; in any case, here we are not talking about constant run-time functions, but about constant coefficients that may occur when estimating the actual number of "steps" given the asymptotic complexity of an algorithm.
In asymptotic analysis, however, we do not care about the exact number of steps, only the limit of the ratio of running times as a function of the input size as the input size goes to infinity.
E.g., O(n^2) means that if you double the input size, the running time will be approximately 4 times the original; if you triple the input size, it will be 9 times the original, etc. It does not say that the execution will take exactly "4 steps" or "9 steps".
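A quick way to see those ratios is to count the basic operations of a deliberately quadratic routine at a few input sizes (illustration only):

    def quadratic_ops(n):
        # Counts the iterations of a doubly nested loop: exactly n*n of them.
        ops = 0
        for _ in range(n):
            for _ in range(n):
                ops += 1
        return ops

    base = quadratic_ops(1000)
    print(quadratic_ops(2000) / base)   # ~4.0: doubling the input quadruples the work
    print(quadratic_ops(3000) / base)   # ~9.0: tripling the input gives nine times the work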
C) they have a small effect when input size is small
No, they have rather significant effects when the input size is small. Again, we are considering the limit as the input size approaches infinity. Any constant is asymptotically negligible compared to any non-constant monotonically growing function of n as n goes to infinity.
When n is small, constants can have a tremendous effect on execution times. For example, there are all sorts of interesting and clever data structures, but if we only have small amounts of data, we often prefer arrays over, e.g., a binary tree or a linked list, even for frequent insertion, because the good cache locality of the array makes its constant factor so small that the theoretically O(n) insertion may well be a lot faster than an O(log n) insertion into a tree.
D) they can be overcome by faster machines
This answer completely misses the point; asymptotic analysis of algorithms has nothing to do with how fast physical machines are. Yes, machines are becoming faster over time, but again, that's just a constant factor. If you run a program for an O(n^2) algorithm on a faster machine, it will still take 4 times the CPU time to execute it with a doubled input size.
E) they do not affect the actual run time of the algorithm
That's also wrong, they absolutely do.
So the only remaining answer is A, which may be correct if interpreted as in my explanation above (relating to ratios of execution times), but I would have phrased it quite differently for sure.
I think the answer is D:
Multiplicative constants are not considered in analysis of algorithm time efficiency because
D) they can be overcome by faster machines
Machines are becoming faster, giving constant-factor speedups that overcome the multiplicative constants; hence we can ignore the multiplicative constants in the analysis.
I'd rather say we ignore multiplicative constants because they depend on the particular machine, but for a multiple-choice question we have to pick the best answer offered.

Why is order of growth preferred as a benchmark for algorithm performance wrt runtime?

I learnt that growth rate is often used to gauge the runtime and efficiency of an algorithm. My question is why use growth rate instead of using the exact(or approximate) relation between the runtime and input size?
Edit:
Thanks for the responses. I would like to clarify what I meant by "relation between the runtime and input size" as it is a little vague.
From what I understand, the growth rate is the gradient of the runtime against the input size. So a growth rate of n^2 would give an equation of the form t = k(n^2) + constant. Given that the equation is more informative (as it includes constants) and shows a direct relation to the time needed, I thought it would be preferred.
I do understand that as n increases, constants soon become irrelevant, and that k will differ for different computing configurations. Perhaps that is why it is sufficient to just work with the growth rate.
The algorithm isn't the only factor affecting actual running time
Things like programming language, optimizations, branch prediction, I/O speed, paging, processing speed, etc. all come into play.
One language / machine / whatever may certainly have advantages over another, so every algorithm needs to be executed under the exact same conditions.
Beyond that, one algorithm may outperform another in C when considering input and output residing in RAM, but the other may outperform the first in Python when considering input and output residing on disk.
There will no doubt be little to no chance of agreement on the exact conditions that should be used to perform all the benchmarking. Even if such agreement could be reached, it would certainly be irresponsible to use 5-year-old benchmarking results in today's computing world, so the results would need to be recreated for all algorithms on a regular basis - a massive, very time-consuming task.
Algorithms have varying constant factors
In the extreme case, the constant factors of certain algorithms are so high that other asymptotically slower algorithms outperform it on all reasonable inputs in the modern day. If we merely go by running time, the fact that these algorithms would outperform the others on larger inputs may be lost.
In the less extreme case, we'll get results that will be different at other input sizes because of the constant factors involved - we may see one algorithm as faster in all our tests, but as soon as we hit some input size, the other may become faster.
The running times of some algorithms depend greatly on the input
Basic quicksort on already sorted data, for example, takes O(n^2), while it takes O(n log n) on average.
One can certainly determine the best and worst cases and run the algorithm for those, but the average case is something that can only be determined through mathematical analysis - you can't run it for 'the average case'. You could run it a bunch of times on random inputs and take the average, but that's fairly imprecise.
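As a concrete sketch of that input dependence (counting comparisons of a first-element-pivot quicksort rather than measuring wall-clock time):

    import random

    def quicksort_comparisons(data):
        # Basic quicksort with the first element as pivot; returns the number of
        # element comparisons. Iterative, so sorted input doesn't hit the recursion limit.
        a = list(data)
        comparisons = 0
        stack = [(0, len(a) - 1)]
        while stack:
            lo, hi = stack.pop()
            if lo >= hi:
                continue
            pivot, i = a[lo], lo
            for j in range(lo + 1, hi + 1):
                comparisons += 1
                if a[j] < pivot:
                    i += 1
                    a[i], a[j] = a[j], a[i]
            a[lo], a[i] = a[i], a[lo]
            stack.append((lo, i - 1))
            stack.append((i + 1, hi))
        return comparisons

    n = 2000
    print("random:", quicksort_comparisons(random.sample(range(n), n)))  # on the order of n*log2(n)
    print("sorted:", quicksort_comparisons(range(n)))                    # close to n*n/2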
So a rough estimate is sufficient
Because of the above reasons, it makes sense to just say an algorithm is, for example, O(n^2), which very roughly means that, if we're dealing with large enough input size, it would take 4 times longer if the input size doubles. If you've been paying attention, you'll know that the actual time taken could be quite different from 4 times longer, but it at least gives us some idea - we won't expect it to take twice as long, nor 10 times longer (although it might under extreme circumstances). We can also reasonably expect, for example, an O(n log n) algorithm to outperform an O(n^2) algorithm for a large n, which is a useful comparison, and may be easier to see what's going on than some perhaps more exact representation.
You can use both types of measures. In practice, it can be useful to measure performance with specific inputs that you are likely to work with (benchmarking), but it is also quite useful to know the asymptotic behavior of algorithms, as that tells us the (space/time) cost in the general case of "very large inputs" (technically, n->infinity).
Remember that in many cases, the main term of the runtime far outweighs the lower-order terms, especially as n takes on large values. Therefore, we can summarize or abstract away information by giving a "growth rate" or bound on the algorithm's performance, instead of working with the "exact" runtime or space requirements. "Exact" is in quotes because the constants for the various terms of your runtime can vary considerably between runs and between machines - different conditions will produce different "constants". In summary, we are interested in asymptotic algorithm behavior because it is still very useful and machine-agnostic.
Growth rate is a relation between the run time of the algorithm and the size of its input. However, this measure is not expressed in units of time, because the technology quickly makes these units obsolete. Only 20 years ago, a microsecond wasn't a lot of time; if you work with embedded systems, it is still not all that much. On the other hand, on mainstream computers with clock speeds of over a gigahertz a microsecond is a lot of time.
An algorithm does not become faster if you run it on faster hardware. If you say, for example, that an algorithm takes eight milliseconds for an input of size 100, the information is meaningless until you say on what computer you ran your computations: it could be a slow algorithm running on fast hardware, a fast algorithm running on slow hardware, or anything in between.
If you also say that it takes, say, 32 milliseconds for an input of size 200, it would be more meaningful, because the reader would be able to derive the growth rate: doubling the input size quadruples the time, which is a nice thing to know. However, you might as well specify that your algorithm is O(n^2).
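For what it's worth, the growth rate can be read straight off those two (hypothetical) measurements:

    import math

    t1, n1 = 8.0, 100      # 8 ms for an input of size 100 (the figures quoted above)
    t2, n2 = 32.0, 200     # 32 ms for an input of size 200
    exponent = math.log(t2 / t1) / math.log(n2 / n1)
    print(exponent)        # 2.0 -> consistent with O(n^2) growth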

What is the meaning of "constant" in this context?

I am currently reading the Introduction to Algorithms book and I have a question in regard to analyzing an algorithm:
The computational cost for merge sort is c lg n according to the book and it says that
We restrict c to be a constant so that the word size does not grow arbitrarily (if the word size could grow arbitrarily, we could store huge amounts of data in one word and operate on it all in constant time)
I do not understand the meaning of "constant" here. Could anyone explain clearly what this means?
Computational complexity in the study of algorithms deals with finding functions that provide upper and lower bounds for how much time (or space) the algorithm requires. Recall basic algebra in high school, where you learned the slope-intercept formula for a line? That formula, y = mx + b, provided two parameters, m (slope) and b (y-intercept), which described a line completely. Those constants (m, b) described where the line lay, and a larger slope meant that the line was steeper.
Algorithmic complexity is just a way to describe the upper (and possibly lower) bounds for how long an algorithm takes to run (and/or how much space is required). With big-O (and big-Theta) notation, you are finding a function which provides upper (and lower) bounds for the algorithm costs. The constants are just shifting the curve, not changing the shape of the curve.
We restrict c to be a constant so that the word size does not grow arbitrarily (if the word size could grow arbitrarily, we could store huge amounts of data in one word and operate on it all in constant time)
On a physical computer, there is some maximum size to a machine word. On a 32-bit system, that would be 32 bits, and on a 64-bit system, it's probably 64 bits. Operations on machine words are (usually) assumed to take time O(1) even though they operate on lots of bits at the same time. For example, if you use a bitwise OR or bitwise AND on a machine word, you can think of it as performing 32 or 64 parallel OR or AND operations in a single unit of time.
When trying to build a theoretical model for a computing system, it's necessary to assume an upper bound on the maximum size of a machine word. If you don't do this, then you could claim that you could perform operations like "compute the OR of n values in time O(1)" or "add together two arbitrary-precision numbers in time O(1)," operations that you can't actually do on a real computer. Therefore, there's usually an assumption that the machine word has some maximum size so that if you do want to compute the OR of n values, you can still do so, but you can't do it instantaneously by packing all the values into one machine word and performing a single assembly instruction to get the result.
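A small sketch of the distinction (Python integers are arbitrary precision, which is exactly the capability the bounded-word-size assumption denies the abstract machine):

    from functools import reduce
    from operator import or_
    import random

    WORD_BITS = 64                                     # assumed machine word size
    values = [random.getrandbits(WORD_BITS) for _ in range(1000)]

    # Under the bounded-word-size model: OR-ing n word-sized values costs n - 1
    # O(1) word operations, i.e. O(n) time overall.
    combined = reduce(or_, values)

    # Python ints are unbounded, so we *can* treat a million bits as one "word";
    # but this single '+' is clearly not one hardware operation, which is exactly
    # the loophole the constant word-size assumption closes.
    big_sum = random.getrandbits(1_000_000) + random.getrandbits(1_000_000)
    print(combined.bit_length(), big_sum.bit_length())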
Hope this helps!

Are algorithms rated on the big-o notation affected by parallelism?

I've just read an article about a breakthrough in matrix multiplication: an algorithm that is O(n^2.373). But I guess matrix multiplication is something that can be parallelized. So, if we ever start producing thousand-core processors, will this become irrelevant? How would things change?
Parallel execution doesn't change the basics of the complexity for a particular algorithm -- at best, you're just taking the time for some given size, and dividing by the number of cores. This may reduce time for a given size by a constant factor, but has no effect on the algorithm's complexity.
At the same time, parallel execution does sometimes change which algorithm(s) you want to use for particular tasks. Some algorithms that work well in serial code just don't split up into parallel tasks very well. Others that have higher complexity might be faster for practical-sized problems because they run better in parallel.
For an extremely large number of cores, the complexity of the calculation itself may become secondary to simply getting the necessary data to and from all the cores to do the calculation. Most computations of big-O don't take these effects into account for a serial calculation, but they can become quite important for parallel calculations, especially for some models of parallel machines that don't give uniform access to all nodes.
If quantum computing comes to something practical some day, then yes, complexity of algorithms will change.
In the meantime, parallelizing an algorithm, with a fixed number of processors, just divides its runtime proportionally (and that, in the best case, when there are no dependencies between the tasks performed at every processor). That means, dividing the runtime by a constant, and so the complexity remains the same.
By Amdahl's law, for the same problem size, parallelization will reach a point of diminishing returns as the number of cores increases (theoretically). In reality, beyond a certain degree of parallelization, the overhead of parallelization will actually decrease the performance of the program.
However, by Gustafson's law, increasing the number of cores actually helps as the size of the problem increases. That is the motivation behind cluster computing. As we have more computing power, we can tackle problems at a larger scale or with better precision with the help of parallelization.
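A short sketch of the two laws side by side (the 5% serial fraction is an assumed figure, purely for illustration):

    def amdahl_speedup(serial_fraction, cores):
        # Amdahl's law: fixed problem size; speedup is capped at 1/serial_fraction.
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

    def gustafson_speedup(serial_fraction, cores):
        # Gustafson's law: the problem grows with the machine (scaled speedup).
        return cores - serial_fraction * (cores - 1)

    for cores in (4, 64, 1024):
        print(f"{cores:5d} cores:  Amdahl {amdahl_speedup(0.05, cores):7.1f}   "
              f"Gustafson {gustafson_speedup(0.05, cores):7.1f}")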
The algorithms we learn may or may not be parallelizable as is. Sometimes a separate algorithm must be used to execute the same task efficiently in parallel. Either way, the big-O complexity must be re-analyzed for the parallel case to take into consideration the effect of parallelization on the time complexity of the algorithm.

Resources