Are algorithms rated on big-O notation affected by parallelism?

I've just read an article about a breakthrough in matrix multiplication: an algorithm that is O(n^2.373). But I guess matrix multiplication is something that can be parallelized. So, if we ever start producing thousand-core processors, will this become irrelevant? How would things change?

Parallel execution doesn't change the basic complexity of a particular algorithm -- at best, you're just taking the time for some given size and dividing by the number of cores. This may reduce the time for a given size by a constant factor, but has no effect on the algorithm's complexity.
At the same time, parallel execution does sometimes change which algorithm(s) you want to use for particular tasks. Some algorithms that work well in serial code just don't split up into parallel tasks very well. Others that have higher complexity might be faster for practical-sized problems because they run better in parallel.
For an extremely large number of cores, the complexity of the calculation itself may become secondary to simply getting the necessary data to and from all the cores to do the calculation. Most computations of big-O don't take these effects into account for a serial calculation, but they can become quite important for parallel calculations, especially for some models of parallel machines that don't give uniform access to all nodes.
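To make the first point concrete, here is a minimal Python sketch (my own illustration, not taken from the article): it counts the scalar multiplications of the naive algorithm and shows that splitting the rows across cores only divides that count by a constant, leaving the Theta(n^3) growth untouched.

def serial_mult_count(n):
    # naive n x n matrix multiply: n * n output entries, n multiplications each
    return n * n * n

def per_core_mult_count(n, cores):
    # ideal parallel split: each core handles an equal share of the rows
    return serial_mult_count(n) / cores

for n in (100, 200, 400):
    print(n, serial_mult_count(n), per_core_mult_count(n, cores=1000))
# Doubling n multiplies both counts by 8: a thousand cores change the constant
# factor, not the cubic growth.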

If quantum computing comes to something practical some day, then yes, the complexity of algorithms will change.
In the meantime, parallelizing an algorithm on a fixed number of processors just divides its runtime proportionally (and that only in the best case, when there are no dependencies between the tasks performed on each processor). Dividing the runtime by a constant means the complexity stays the same.
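Stated a bit more formally (assuming the ideal case described here, with a fixed number of processors p):

\[
\frac{T_1(n)}{p} \;\le\; T_p(n) \;\le\; T_1(n)
\qquad\Longrightarrow\qquad
T_p(n) \in \Theta\bigl(T_1(n)\bigr) \text{ for any fixed } p.
\]

The lower bound is just the total work shared among p processors; the upper bound says that, at worst, you fall back to running everything on one of them. Squeezed between two constant multiples of T_1(n), the parallel runtime sits in the same complexity class.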

By Amdahl's law, for a fixed problem size, parallelization reaches a point of diminishing returns as the number of cores increases (theoretically). In practice, beyond a certain degree of parallelization, the overhead of parallelization will actually decrease the performance of the program.
However, by Gustafson's law, the increase in the number of cores actually helps as the size of the problem increases. That is the motivation behind cluster computing: as we gain more computing power, we can tackle problems at a larger scale or at better precision with the help of parallelization.
Algorithms as we learn them may or may not be parallelizable. Sometimes a separate algorithm must be used to perform the same task efficiently in parallel. Either way, the Big-O analysis must be redone for the parallel case to take into account the effect of parallelization on the algorithm's time complexity.
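A rough sketch of the two laws (the 95% parallel fraction below is only an illustrative number, not taken from any real program):

def amdahl_speedup(parallel_fraction, p):
    # fixed problem size: the serial part (1 - f) never shrinks,
    # so the speedup is capped at 1 / (1 - f)
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / p)

def gustafson_speedup(parallel_fraction, p):
    # scaled problem size: the parallel share of the work grows with the cores
    return (1.0 - parallel_fraction) + parallel_fraction * p

for p in (2, 16, 1024):
    print(p, round(amdahl_speedup(0.95, p), 2), round(gustafson_speedup(0.95, p), 2))
# Amdahl: 1.9, 9.14, 19.64 (never more than 20x); Gustafson: 1.95, 15.25, 972.85

Amdahl's curve flattens out no matter how many cores you add, while Gustafson's keeps climbing because the problem is assumed to grow with the machine.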

Related

Complexity comparison of SIMD-based algorithm implementation [closed]

I am just curious to know: what is the way to measure the complexity of an algorithm implemented with two different SIMD approaches, say SSE3 and CUDA? Normally we compare algorithm complexity with Big-O notation. Is there any such way to compare the runtime improvement with SIMD?
If somebody asks how much an algorithm will improve if you run it on a GPU, can you measure it theoretically, without running the benchmark on both CPU and GPU?
Note: I understand what Big-O is. So all I want to know is how SSE3 performs compared to CUDA or a CPU-based implementation of the same algorithm, without raw benchmarking.
Big-O notation is, for the most part, inapplicable to CPU instructions.
Instructions belong to the realm of the lowly hardware, the impure flesh of the computer. Computer science does not concern itself with such crude notions.
(Actually, the term "Computer Science" is a misnomer. There is this widely circulated quote that "Computer science is no more about computers than astronomy is about telescopes." It is misattributed to Edsger Dijkstra, but actually it originates from Michael R. Fellows, read about it here: https://en.wikiquote.org/wiki/Computer_science)
In any case, if you insist on thinking of algorithms as being executed by CPU instructions, and if you also insist on reasoning about the run-time complexity of instructions, then you have to think of memory accesses.
You need to first come up with some "unit of work" that is comparable between SSE3 and CUDA, and then you need to set up some mechanism so that you can measure
how the number of memory accesses of SSE3 increases in relation to the amount of work to do, and
how the number of memory accesses of CUDA increases in relation to the same amount of work to do.
That would be quite hard to accomplish, and my guess would be that the results would be quite linear, meaning that no matter which one takes more or less memory accesses, the number of accesses would change linearly with respect to the amount of work to do, in both cases.
Big-O generally speaks about how an algorithm scales as the number of items considered, N, grows.
So, for an O(N) algorithm, if you have 10 times the data, you expect roughly 10 times the run-time. If you have an O(N log₂ N) algorithm, 10 times the data gives you a bit more than 10 times the work, because of the extra log factor.
Roughly speaking, CUDA and GPUs parallelize operations across p cores. The total work done is then given by W=pt, where p is the number of cores and t the time complexity of each core's operation. You could, for instance, sort N items using √N processors each doing O(√N log N) work. The total time complexity of the algorithm is still O(N log N), but the algorithm is thought of as "work-efficient" because its total time complexity is the same (or less than) the best known serial algorithm. So Big-O analysis can tell you something about how algorithms run in parallel environments.
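Spelling out the arithmetic behind that sorting example, the total work is the number of processors times the per-processor time:

\[
W = p \cdot t = \sqrt{N} \cdot O\!\bigl(\sqrt{N}\,\log N\bigr) = O(N \log N),
\]

which matches the best known serial sorting bound, hence "work-efficient".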
However, it is not the tool you are looking for. Big-O is for talking about how algorithms scale: it can't tell us how an algorithm will perform on different hardware.
For a theoretical approach to performance, you'll want to look into Amdahl's law and Gustafson's law, which provide simple metrics of how much improvement we can expect in a program when its resources are increased. They both boil down to acknowledging that the total increase in speed is a function of the portion of the program we can speed up and the amount we can speed it up by.
Can you do better than that? Yes. There are more sophisticated models you can use to determine how your code might perform in different environments; looking into the roofline model might get you started.
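As a taste of what the roofline model gives you, a hedged sketch (the peak numbers below are invented, not measurements of any real CPU or GPU): attainable throughput is capped either by compute or by memory bandwidth, depending on the kernel's arithmetic intensity.

def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    # arithmetic_intensity is in FLOPs per byte moved to/from memory;
    # the attainable rate is the lower of the compute roof and the memory slope
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)

# a kernel doing 0.25 FLOP/byte is bandwidth-bound on both made-up devices
for name, gflops, gbs in [("cpu-ish", 500.0, 50.0), ("gpu-ish", 5000.0, 500.0)]:
    print(name, roofline_gflops(0.25, gflops, gbs), "GFLOP/s attainable")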
Beyond this, your question gets too specific to be answered well here. Running the benchmarks is likely to be your personal best option.
Some of the answers (and maybe your question) reflect basic misconceptions about Big-O. O(f(n)) is just a classification of mathematical functions. It has nothing to do a priori with run time or anything else. The definition of big-O makes this pretty clear.
With the right mathematical machinery, sure, what you ask is certainly possible.
To use big-O in the context of algorithms, you must first describe the function that you're classifying. A classical example is sorting algorithms. Here a function frequently used is the number of comparisons f(n) needed to sort a list of given length n. When we say (rather imprecisely) that a sorting algorithm is O(n log n), we're usually saying that when n is big enough, the number of comparisons needed to sort is capped by K n log(n) where K is some positive constant.
You can also use big-O in the context of run time, but you must also stipulate a formal machine model. This is a mathematically precise abstraction of a real machine. There are many of these. For many purposes that don't involve parallel processing, the Word RAM or Real RAM is used, but there are other options.
When we say "an algorithm is O(g(n))", what we really mean, most of the time, is "the number of Word RAM clock cycles needed to run the algorithm on input of size n is capped by K g(n) for some constant K, given that n is large enough.
That is, with an abstract machine model in hand, the function classified by big-O is just the number of clock cycles of the abstract machine as a function of input size.
For SIMD computation, the PRAM is one commonly used abstract machine model. Like all abstract machine models, it makes simplifying assumptions. For example, both memory size and the number of processors are unlimited. It also allows for different strategies to resolve memory contention among processors. Just as for the Word RAM, big-O statements about algorithms running on a PRAM are statements about the number of clock cycles needed with respect to input size.
There has been some work on abstract models of GPUs that allow big O to approximate useful functions describing algorithm performance. It remains a topic of research.
So comparing sequential vs. parallel performance, or parallel performance on different architectures, is a matter of choosing the right abstract models, describing a program for each that implements the algorithm you're interested in, and then comparing big-O descriptions of the number of clock cycles each requires.
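As a small illustration of counting abstract clock cycles, here is a sketch of a PRAM-style parallel sum (assuming, as the model does, as many processors as needed and unit-cost steps): the number of parallel rounds grows as O(log n) even though the total work is still O(n).

import math

def parallel_sum_rounds(values):
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # one round: every processor adds one disjoint pair at the same time
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:             # an odd leftover element carries over
            nxt.append(vals[-1])
        vals = nxt
        rounds += 1
    return (vals[0] if vals else 0), rounds

total, rounds = parallel_sum_rounds(range(1, 1001))
print(total, rounds, math.ceil(math.log2(1000)))   # 500500 10 10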
Using SIMD/SSE3 does not change Big-O complexity, because it executes a constant number (up to 16) of operations at a time.
O(Function(N)/Const) === O(Function(N))
The same is true for CUDA, despite thousands of cores working (if we ignore the different memory access model and assume the CUDA implementation does not use a different algorithm).

Why is order of growth preferred as a benchmark for algorithm performance wrt runtime?

I learnt that growth rate is often used to gauge the runtime and efficiency of an algorithm. My question is: why use the growth rate instead of the exact (or approximate) relation between runtime and input size?
Edit:
Thanks for the responses. I would like to clarify what I meant by "relation between the runtime and input size" as it is a little vague.
From what I understand, the growth rate is the gradient of the runtime against the input. So a growth rate of n^2 would give an equation of the form t = k(n^3) + constant. Given that the equation is more informative (as it includes constants) and shows a direct relation to the time needed, I thought it would be preferred.
I do understand that as n increases, the constants soon become irrelevant, and that k will differ across computing configurations. Perhaps that is why it is sufficient to just work with the growth rate.
The algorithm isn't the only factor affecting actual running time
Things like programming language, optimizations, branch prediction, I/O speed, paging, processing speed, etc. all come into play.
One language / machine / whatever may certainly have advantages over another, so every algorithm needs to be executed under the exact same conditions.
Beyond that, one algorithm may outperform another in C when considering input and output residing in RAM, but the other may outperform the first in Python when considering input and output residing on disk.
There will no doubt be little to no chance of agreement on the exact conditions that should be used for all the benchmarking. Even if such agreement could be reached, it would be irresponsible to use five-year-old benchmarking results in today's computing world, so the results would need to be recreated for all algorithms on a regular basis - a massive, very time-consuming task.
Algorithms have varying constant factors
In the extreme case, the constant factors of certain algorithms are so high that other asymptotically slower algorithms outperform them on all reasonable inputs in the modern day. If we merely go by running time, the fact that these algorithms would outperform the others on larger inputs may be lost.
In the less extreme case, we'll get results that will be different at other input sizes because of the constant factors involved - we may see one algorithm as faster in all our tests, but as soon as we hit some input size, the other may become faster.
The running times of some algorithms depend greatly on the input
Basic quicksort on already sorted data, for example, takes O(n^2), while it takes O(n log n) on average.
One can certainly determine the best and worst cases and run the algorithm for those, but the average case is something that can only be determined through mathematical analysis - you can't run it for 'the average case'. You could run it a bunch of times on random input and take the average of that, but that's fairly imprecise.
So a rough estimate is sufficient
Because of the above reasons, it makes sense to just say an algorithm is, for example, O(n^2), which very roughly means that, if we're dealing with a large enough input size, it would take 4 times longer if the input size doubles. If you've been paying attention, you'll know that the actual time taken could be quite different from 4 times longer, but it at least gives us some idea - we won't expect it to take twice as long, nor 10 times longer (although it might under extreme circumstances). We can also reasonably expect, for example, an O(n log n) algorithm to outperform an O(n^2) algorithm for large n, which is a useful comparison, and it may be easier to see what's going on than with some more exact representation.
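If you want to see that intuition in action, a quick and deliberately unrigorous sketch like the one below will usually print a ratio somewhere near 4 - with exactly the kind of wobble from constant factors and system noise that the points above describe.

import time

def count_pairs(n):
    # a plain O(n^2) routine: visit every ordered pair once
    total = 0
    for i in range(n):
        for j in range(n):
            total += 1
    return total

def seconds(n):
    start = time.perf_counter()
    count_pairs(n)
    return time.perf_counter() - start

t1, t2 = seconds(1000), seconds(2000)
print("n doubled -> time grew by a factor of", round(t2 / t1, 1), "(expect ~4)")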
You can use both types of measures. In practice, it can be useful to measure performance with specific inputs that you are likely to work with (benchmarking), but it is also quite useful to know the asymptotic behavior of algorithms, as that tells us the (space/time) cost in the general case of "very large inputs" (technically, n->infinity).
Remember that in many cases, the main term of the runtime often far outweighs the importance of lower-order terms, especially as n takes on large values. Therefore, we can summarize or abstract away information by giving a "growth rate" or bound on the algorithm's performance, instead of working with the "exact" runtime or space requirements. Exact in quotes because the constants for the various terms of your runtime may very much vary between runs, between machines - basically different conditions will produce different "constants". In summary, we are interested in asymptotic algorithm behavior because it is still very useful and machine-agnostic.
Growth rate is a relation between the run time of the algorithm and the size of its input. However, this measure is not expressed in units of time, because the technology quickly makes these units obsolete. Only 20 years ago, a microsecond wasn't a lot of time; if you work with embedded systems, it is still not all that much. On the other hand, on mainstream computers with clock speeds of over a gigahertz a microsecond is a lot of time.
An algorithm does not become faster if you run it on faster hardware. If you say, for example, that an algorithm takes eight milliseconds for an input of size 100, the information is meaningless until you say on what computer you ran your computations: it could be a slow algorithm running on fast hardware, a fast algorithm running on slow hardware, or anything in between.
If you also say that it takes, say, 32 milliseconds for an input of size 200, it becomes more meaningful, because the reader can derive the growth rate: doubling the input size quadruples the time, which is a nice thing to know. However, you might as well just specify that your algorithm is O(n^2).

Analysis with Parallel Algorithms

Everyone knows that bubble sort is O(n^2), but that is based on the number of comparisons needed to sort. My question is: if I didn't care about the number of comparisons but about the output time, how do you analyze that? Is there a way to do analysis on output time instead of on comparisons?
For example, if you take bubble sort and have parallel comparisons happening for all pairs (even then odd comparisons), the throughput time would be something like 2n-1. The number of comparisons would be high, but I don't care, as the final throughput time is quick.
So in essence, is there a common analysis for overall parallel performance time? If so, just give me some key terms and I'll learn the rest from google.
Parallel programming is a bit of a red herring here. Making assumptions about run time based only on big-O notation can be misleading. To compare the run times of algorithms you need the full equation, not just the big-O notation.
The problem is that big-O notation tells you the dominating term as n goes to infinity, but the run time is over finite ranges of n. This is easy to understand from continuous mathematics (my background).
Consider y=Ax and y=Bx^2. Big-O notation would tell you that y=Bx^2 is slower. However, on the interval (0, A/B) it is less than y=Ax. In this case it could be faster to use the O(x^2) algorithm than the O(x) algorithm for x<A/B.
In fact I have heard of sorting algorithms which start off with an O(n log n) algorithm and then switch to an O(n^2) algorithm when n is sufficiently small.
The best example is matrix multiplication. The naïve algorithm is O(n^3), but there are algorithms that get that down to O(n^2.3727). However, every algorithm I have seen has such a large constant that the naïve O(n^3) is still the fastest algorithm for all practical values of n, and that does not look likely to change any time soon.
So really what you need is the full equation to compare. Something like An^3 (let's ignore lower order terms) and Bn^2.3727. In this case B is so incredibly large that the O(n^3) method always wins.
Parallel programming usually just lowers the constant. For example, when I do matrix multiplication using four cores, my time goes from An^3 to (A/4)n^3. The same thing will happen with your parallel bubble sort: you will decrease the constant. So it's possible that for some range of values of n your parallel bubble sort will beat a non-parallel (or possibly even parallel) merge sort. Though, unlike matrix multiplication, I think that range will be pretty small.
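To see how small that range can be, here is a sketch with invented constants A, B and core count p (none of them measured from real code): it simply searches for the first n at which the "constant divided by p" bubble sort cost overtakes a serial merge sort cost.

import math

def parallel_bubble_cost(n, A=1.0, p=8):
    # roughly A * n^2 comparisons, ideally spread over p cores
    return A * n * n / p

def serial_merge_cost(n, B=4.0):
    # roughly B * n * log2(n) comparisons on a single core
    return B * n * math.log2(n)

crossover = next(n for n in range(2, 10**6)
                 if parallel_bubble_cost(n) > serial_merge_cost(n))
print("with these made-up constants, parallel bubble sort loses from n =", crossover)

With these particular numbers the crossover is already in the hundreds; change A, B or p and it moves, which is exactly why the full equation matters.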
Algorithm analysis is not meant to give actual run times. That's what benchmarks are for. Analysis tells you how much relative complexity is in a program, but the actual run time for that program depends on so many other factors that strict analysis can't guarantee real-world times. For example, what happens if your OS decides to suspend your program to install updates? Your run time will be shot. Even running the same program over and over yields different results due to the complexity of computer systems (memory page faults, virtual machine garbage collection, IO interrupts, etc). Analysis can't take these into account.
This is why parallel processing doesn't usually come under consideration during analysis. The mechanism for "parallelizing" a program's components is usually external to your code, and usually based on a probabilistic algorithm for scheduling. I don't know of a good way to do static analysis on that. Once again, you can run a bunch of benchmarks and that will give you an average run time.
The time efficiency we get from parallel steps can be measured by round complexity, where each round consists of parallel steps occurring at the same time. By doing so, we can see how effective the throughput time is, using the same kind of analysis we are used to.
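A sketch of that round-counting view for the odd/even comparison scheme the question describes (my own illustration of the idea, not a library routine): all comparisons within a phase are independent, so each phase counts as one round, and the rounds come out O(n) even though the comparisons are still O(n^2).

def odd_even_transposition_rounds(a):
    a = list(a)
    rounds = 0
    swapped = True
    while swapped:
        swapped = False
        for start in (0, 1):          # even phase, then odd phase
            rounds += 1
            # every comparison in this phase touches a disjoint pair,
            # so on a parallel machine the whole phase is a single round
            for i in range(start, len(a) - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
                    swapped = True
    return a, rounds

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(odd_even_transposition_rounds(data))   # sorted list, ~n rounds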

Analyzing algorithms - Why only time complexity?

I was learning about algorithms and time complexity, and this question just sprang into my mind.
Why do we only analyze an algorithm's time complexity?
My question is, shouldn't there be another metric for analyzing an algorithm? Say I have two algos A and B.
A takes 5s for 100 elements, B takes 1000s for 100 elements. But both have O(n) time.
So this means that the running times of both A and B grow more slowly than cn for two separate constants c=c1 and c=c2. But in my very limited experience with algorithms, we've always ignored this constant term and just focused on the growth. But isn't it very important when choosing between my given examples A and B? Here c1<<c2, so Algo A is much better than Algo B.
Or am I overthinking this at an early stage, and will proper analysis come later on? What is it called?
Or is my whole concept of time complexity wrong, and in my given example the two can't both have O(n) time?
We worry about the order of growth because it provides a useful abstraction to the behaviour of the algorithm as the input size goes to infinity.
The constants "hidden" by the O notation are important, but they're also difficult to calculate because they depend on factors such as:
the particular programming language that is being used to implement the algorithm
the specific compiler that is being used
the underlying CPU architecture
We can try to estimate these, but in general it's a lost cause unless we make some simplifying assumptions and work on some well defined model of computation, like the RAM model.
But then, we're back into the world of abstractions, precisely where we started!
We measure lots of other types of complexity.
Space (Memory usage)
Circuit Depth / Size
Network Traffic / Amount of Interaction
IO / Cache Hits
But I guess you're talking more about a "don't the constants matter?" approach. Yes, they do. The reason it's useful to ignore the constants is that they keep changing. Different machines perform different operations at different speeds. You have to walk the line between useful in general and useful on your specific machine.
It's not always time. There's also space.
As for the asymptotic time cost/complexity, which O() gives you: if you have a lot of data, then, for example, O(n^2)=n^2 is going to be worse than O(n)=100*n for n>100. For smaller n you should prefer the O(n^2) one.
And, obviously, O(n)=100*n is always worse than O(n)=10*n.
The details of your problem should contribute to your decision between several possible solutions (choices of algorithms) to it.
A takes 5s for 100 elements, B takes 1000s for 100 elements. But both have O(n) time.
Why is that?
O(N) is an asymptotic measure of the number of steps required to execute a program in relation to the program's input size.
This means that for really large values of N the algorithm's running time grows linearly.
We don't compare X and Y seconds. We analyze how the algorithm behaves as the input grows larger and larger.
O(n) gives you an idea how much slower the same algorithm will be for a different n, not for comparing algorithms.
On the other hand there is also space complexity - how memory usage grows as a function of input n.

What is the computational complexity of the MapReduce overhead

Given that the complexity of the map and reduce tasks is O(map) = f(n) and O(reduce) = g(n), has anybody taken the time to write down how the Map/Reduce intrinsic operations (sorting, shuffling, sending data, etc.) increase the computational complexity? What is the overhead of the Map/Reduce orchestration?
I know that this is nonsense when your problem is big enough - just don't care about the inefficiencies - but for small problems that can run in a small machine or a couple of machines, should I go through the pain of designing parallel algorithms when I already have a Map/Reduce implementation at hand?
For small problems "that can run in a small machine or a couple of machines," yes, you should rewrite them if performance is essential. Like others have pointed out, communication overhead is high.
I don't think anybody has done any complexity analysis of M/R operations, because it's so heavily implementation-, machine-, and algorithm-specific. There would be so many variables just for, say, sorting:
O(n log n * s * (1/p)) where:
- n is the number of items
- s is the number of nodes
- p is the ping time between nodes (assuming equal ping times between all nodes in the network)
Does that make sense? It gets really messy really quick. M/R is also a programming framework, not an algorithm in and of itself, and complexity analysis is usually reserved for algorithms.
The closest thing to what you're looking for may be complexity analysis of multi-threaded algorithms, which is much simpler.
I know that this is nonsense when your problem is big enough - just don't care about the inefficiencies - but for small problems that can run in a small machine or a couple of machines, should I go through the pain of designing parallel algorithms when I already have a Map/Reduce implementation at hand?
This is a difficult problem to analyze. On the one hand, if the problem is too small then classical complexity analysis is liable to give the wrong answer due to lower order terms dominating for small N.
On the other hand, complexity analysis where one of the variables is the number of compute nodes will also fail if the number of compute nodes is too small ... once again because the overheads of the Map/Reduce infrastructure contribute to the lower-order terms.
So what can you do about it? Well, one approach would be to do a more detailed analysis that does not rely on complexity. Figure out the cost function(s), including the lower order terms and the constants, for your particular implementation of the algorithms and the map/reduce framework. Then substitute values for the problem size variables, the number of nodes etc. Complicated, though you may be able to get by with estimates for certain parts of the cost function.
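A hedged sketch of what such a cost function might look like - every constant below (the 30 s job setup, the per-record costs, the 20 nodes) is an invented placeholder that you would have to measure for your own cluster and framework before trusting the comparison.

def mapreduce_estimate_s(n_records, nodes,
                         setup_s=30.0,                # job startup / orchestration
                         per_record_map_s=2e-6,
                         per_record_reduce_s=1e-6,
                         shuffle_s_per_record=5e-7):  # sort + network copy
    parallel_part = (per_record_map_s + per_record_reduce_s) * n_records / nodes
    shuffle_part = shuffle_s_per_record * n_records   # crude: treated as serial
    return setup_s + shuffle_part + parallel_part

def single_machine_estimate_s(n_records, per_record_s=4e-6):
    return per_record_s * n_records

for n in (10**5, 10**7, 10**9):
    print(n, round(mapreduce_estimate_s(n, nodes=20), 1),
          round(single_machine_estimate_s(n), 1))
# With these numbers the single machine wins easily at 10^5 records, the two
# are comparable around 10^7, and the cluster only clearly pays off at 10^9:
# the fixed overhead and lower-order terms dominate until n gets large.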
The second approach is to "suck it and see".
Map-Reduce for Machine Learning on Multicore is worth a look; it compares how the complexity of various well-known machine learning algorithms changes when they are recast in an MR-"friendly" form.
