Given that the complexities of the map and reduce tasks are O(map) = f(n) and O(reduce) = g(n), has anybody taken the time to write down how the intrinsic Map/Reduce operations (sorting, shuffling, sending data, etc.) increase the computational complexity? What is the overhead of the Map/Reduce orchestration?
I know this is a non-issue when your problem is big enough: you just don't care about the inefficiencies. But for small problems that can run on a small machine or a couple of machines, should I go through the pain of designing parallel algorithms when a Map/Reduce implementation is already at hand?
For small problems "that can run on a small machine or a couple of machines," yes, you should rewrite them if performance is essential. As others have pointed out, the communication overhead is high.
I don't think anybody has done any complexity analysis of M/R operations, because it is so heavily implementation-, machine-, and algorithm-specific. You would end up with many variables even for something as simple as sorting (a toy sketch follows the list below):
O(n log n * s * (1/p)) where:
- n is the number of items
- s is the number of nodes
- p is the ping time between nodes (assuming equal ping times between all nodes in the network)
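To make that concrete, here is a minimal sketch (in Python) that just plugs numbers into the formula above. The formula itself is hand-wavy, and every value below is an invented placeholder, not a measurement of any real M/R implementation:

    import math

    # Illustrative only: evaluates the hand-wavy formula above.
    def sort_cost(n, s, p):
        """n = number of items, s = number of nodes,
        p = ping time between nodes (assumed equal for all pairs)."""
        return n * math.log2(n) * s * (1.0 / p)

    # Invented placeholder values, not measurements.
    print(sort_cost(n=1_000_000, s=10, p=0.001))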
Does that make sense? It gets really messy really quick. M/R is also a programming framework, not an algorithm in and of itself, and complexity analysis is usually reserved for algorithms.
The closest thing to what you're looking for may be complexity analysis of multi-threaded algorithms, which is much simpler.
To quote the question: "I know this is a non-issue when your problem is big enough: you just don't care about the inefficiencies. But for small problems that can run on a small machine or a couple of machines, should I go through the pain of designing parallel algorithms when a Map/Reduce implementation is already at hand?"
This is a difficult problem to analyze. On the one hand, if the problem is too small then classical complexity analysis is liable to give the wrong answer due to lower order terms dominating for small N.
On the other hand, complexity analysis where one of the variables is the number of compute nodes will also fail if the number of compute nodes is too small, once again because the overheads of the Map/Reduce infrastructure contribute to the lower-order terms.
So what can you do about it? Well, one approach would be to do a more detailed analysis that does not rely on complexity classes. Figure out the cost function(s), including the lower-order terms and the constants, for your particular implementation of the algorithms and the Map/Reduce framework. Then substitute values for the problem-size variables, the number of nodes, and so on. This is complicated, though you may be able to get by with estimates for certain parts of the cost function; a sketch of this follows.
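For instance, a hedged sketch of that first approach in Python. All of the constants below (startup cost, per-item shuffle cost, and so on) are invented placeholders that you would have to measure for your own setup:

    # Hypothetical cost functions with lower-order terms and constants kept.
    def cost_serial(n):
        return 5.0 * n * n + 200.0 * n + 1_000    # classic in-memory algorithm

    def cost_mapreduce(n, nodes):
        startup = 2_000_000                        # framework orchestration overhead
        shuffle = 50.0 * n                         # data movement between nodes
        compute = (5.0 * n * n) / nodes            # the parallelizable work
        return startup + shuffle + compute

    # Substitute problem sizes to find roughly where M/R starts to pay off.
    for n in (100, 1_000, 10_000, 100_000):
        print(n, cost_serial(n), cost_mapreduce(n, nodes=8))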
The second approach is to "suck it and see".
Map-Reduce for Machine Learning on Multicore is worth a look; it compares how the complexity of various well-known machine learning algorithms changes when they are recast in an MR-friendly form.
Cheers.
Apart from the obvious "the faster algorithm wins when there are many elements", when is it more appropriate to use a simple sorting algorithm (O(N^2)) rather than an advanced one (O(N log N))?
I've read quite a bit about, for example, insertion sort being preferred when you've got a small array that's nearly sorted, because you hit its best case, O(N). Why is it not good to use quicksort, for example, when you've got, say, 20 elements? Not just insertion versus quick: when and why is a simpler algorithm useful compared to an advanced one?
EDIT: If we're working with, for example, an array, does it matter what kind of data we have, such as objects or primitive types (Integer)?
The big-oh notation captures the runtime cost of the algorithm for large values of N. It is less effective at measuring the runtime of the algorithm for small values.
The actual transition from one algorithm to another is not a trivial thing. For large N, the effects of N really dominate. For small numbers, more subtle effects become very important. For example, some algorithms have better cache locality. Others are best when you know something about the data (like your example of insertion sort when the data is nearly sorted).
The balance also changes over time. In the past, CPU speeds and memory speeds were closer together, so cache behaviour mattered less. In modern times, CPU speeds have generally left memory buses behind, so cache locality is more important.
So there's no clear-cut answer to when you should use one algorithm over another. The only reliable answer is to profile your code and see.
For amusement: I was looking at the dynamic disjoint forest problem a few years back. I came across a state-of-the-art paper that permitted some operations to be done in something silly like O(log log N / log^4N). They did some truly brilliant math to get there, but there was a catch. The operations were so expensive that, for my graphs of 50-100 nodes, it was far slower than the O(n log n) solution that I eventually used. The paper's solution was far more important for people operating on graphs of 500,000+ nodes.
When programming sorting algorithms, you have to weigh the work of implementing the algorithm against its actual speed. For large inputs, the time spent implementing an advanced algorithm is outweighed by the decreased time taken to sort. For small inputs, such as 20-100 items, the difference is minimal, so taking the simpler route is much better.
First of all, O-notation gives you a sense of the worst-case scenario. So if the array is nearly sorted, insertion sort's execution time can be close to linear, which would be better than quicksort, for example.
When n is small enough, we do take other aspects into consideration. Algorithms such as quicksort can be slower because of all the recursive calls; at that point the cost of handling the recursion (function-call overhead and stack management) can end up exceeding the simple arithmetic operations required by insertion sort. Not to mention the additional memory space required by recursive algorithms.
More than 99% of the time, you should not be implementing a sorting algorithm at all.
Instead use a standard sorting algorithm from your language's standard library. In one line of code you get to use a tested and optimized implementation which is O(n log(n)). It likely implements tricks you wouldn't have thought of.
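For example, in Python (the equivalent one-liner exists in most languages' standard libraries):

    # One line gets you a tested, optimized O(n log n) sort
    # (Timsort, in CPython's case).
    records = [("carol", 35), ("alice", 30), ("bob", 25)]
    by_age = sorted(records, key=lambda r: r[1])   # non-destructive copy, sorted by age
    records.sort()                                 # in-place, lexicographic order
    print(by_age)
    print(records)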
For external sorts, I've used the Unix sort utility from time to time. Aside from the non-intuitive LC_ALL=C environment variable that I need to get it to behave, it is very useful.
In any other case where you actually need to implement your own sorting algorithm, what you implement will be driven by your precise needs. I've had to deal with this exactly once for production code in two decades of programming. (That was because, for a complex series of reasons, I needed to sort compressed data on a machine which literally did not have enough disk space to store said data uncompressed. I used a merge sort.)
While analyzing algorithms, I suddenly asked myself: if we had ternary computers, would time complexity be cheaper? Or is there any base in which we could build computers so that time complexity analysis would not matter? I could not find much on the internet, though my guess was that a ternary-based computer would process things much faster given the same resources.
I would appreciate any thoughts on this question.
No, the theoretical complexities of virtually all algorithms would remain the same in big-O notation, since they don't depend on the number representation: they just assume certain basic operations, such as addition or multiplication, take O(1) steps.
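One way to see this: representing n takes about log_b(n) digits in base b, and changing the base only changes a constant factor, which big-O discards:

    log_3(n) = log_2(n) / log_2(3) ≈ 0.631 · log_2(n), hence O(log_3 n) = O(log_2 n) = O(log n)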
For practical considerations, maybe some very narrow area dealing with base-3 representation itself would get an up-to-linear boost. Much like nowadays, getting the number of set bits in an integer has its own fast instruction (POPCNT) in modern processors, so it can be considered O(1).
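As an aside, the same operation is a one-liner in Python 3.10+; whether the interpreter actually uses the POPCNT instruction underneath is an implementation detail, but for fixed-width integers it is effectively constant time:

    x = 0b1011_0110
    print(x.bit_count())       # 5 set bits (int.bit_count(), Python 3.10+)
    print(bin(x).count("1"))   # portable fallback for older versions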
To get a feeling of what it takes for a new computing technology to wreak havoc on algorithm complexities, read about quantum computers.
I learnt that growth rate is often used to gauge the runtime and efficiency of an algorithm. My question is: why use the growth rate instead of the exact (or approximate) relation between the runtime and input size?
Edit:
Thanks for the responses. I would like to clarify what I meant by "relation between the runtime and input size" as it is a little vague.
From what I understand, growth rate is the gradient of the runtime against input size. So a growth rate of n^2 would, after integrating, give an equation of the form t = k(n^3) + constant. Given that the equation is more informative (as it includes constants) and shows a direct relation to the time needed, I thought it would be preferred.
I do understand that as n increases, the constants soon become irrelevant, and that k will differ between computing configurations. Perhaps that is why it is sufficient to work with the growth rate alone.
The algorithm isn't the only factor affecting actual running time
Things like programming language, optimizations, branch prediction, I/O speed, paging, processing speed, etc. all come into play.
One language / machine / whatever may certainly have advantages over another, so every algorithm needs to be executed under the exact same conditions.
Beyond that, one algorithm may outperform another in C when considering input and output residing in RAM, but the other may outperform the first in Python when considering input and output residing on disk.
There will no doubt be little to no chance of agreement on the exact conditions under which all the benchmarking should be performed, and even if such agreement could be reached, it would certainly be irresponsible to use 5-year-old benchmark results today, given how fast the computing world moves. So the results would need to be recreated for all algorithms on a regular basis, a massive and very time-consuming task.
Algorithms have varying constant factors
In the extreme case, the constant factors of certain algorithms are so high that asymptotically slower algorithms outperform them on all inputs that are reasonable in the modern day. If we merely went by measured running time, the fact that these algorithms would win on larger inputs might be lost.
In the less extreme case, we'll get results that differ at other input sizes because of the constant factors involved: we may see one algorithm as faster in all of our tests, but as soon as we hit some input size, the other may become faster.
The running times of some algorithms depend greatly on the input
Basic quicksort on already sorted data, for example, takes O(n^2), while it takes O(n log n) on average.
One can certainly determine the best and worst cases and run the algorithm for those, but the average case is something that can only be determined through mathematical analysis. You can't run it for "the average case"; you could run it many times on random inputs and average the results, but that's fairly imprecise.
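To see the input-dependence concretely, here is a minimal sketch. The naive first-element-pivot quicksort below is chosen deliberately because it exposes the sorted-input worst case; library quicksorts avoid it with better pivot selection:

    import random

    def quicksort_comparisons(arr):
        """Naive quicksort (first element as pivot); returns the comparison count."""
        if len(arr) <= 1:
            return 0
        pivot, rest = arr[0], arr[1:]
        less = [x for x in rest if x < pivot]
        more = [x for x in rest if x >= pivot]
        return len(rest) + quicksort_comparisons(less) + quicksort_comparisons(more)

    n = 500
    print(quicksort_comparisons(list(range(n))))              # sorted input: ~n^2/2
    print(quicksort_comparisons(random.sample(range(n), n)))  # random input: ~n log n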
So a rough estimate is sufficient
Because of the above, it makes sense to just say an algorithm is, for example, O(n^2): very roughly, if we're dealing with a large enough input, it will take about 4 times longer when the input size doubles. If you've been paying attention, you'll know that the actual time taken could be quite different from 4 times longer, but it at least gives us some idea: we won't expect it to take twice as long, nor 10 times longer (although it might under extreme circumstances). We can also reasonably expect, for example, an O(n log n) algorithm to outperform an O(n^2) algorithm for large n, which is a useful comparison, and may make it easier to see what's going on than a more exact representation would.
You can use both types of measures. In practice, it can be useful to measure performance with specific inputs that you are likely to work with (benchmarking), but it is also quite useful to know the asymptotic behavior of algorithms, as that tells us the (space/time) cost in the general case of "very large inputs" (technically, n->infinity).
Remember that the leading term of the runtime often far outweighs the lower-order terms, especially as n takes on large values. Therefore, we can summarize or abstract away information by giving a "growth rate" or bound on the algorithm's performance, instead of working with the "exact" runtime or space requirements. "Exact" is in quotes because the constants in the various terms of your runtime will vary between runs and between machines: different conditions produce different "constants". In summary, we are interested in asymptotic algorithm behaviour because it is still very useful and machine-agnostic.
Growth rate is a relation between the run time of the algorithm and the size of its input. However, this measure is not expressed in units of time, because the technology quickly makes these units obsolete. Only 20 years ago, a microsecond wasn't a lot of time; if you work with embedded systems, it is still not all that much. On the other hand, on mainstream computers with clock speeds of over a gigahertz a microsecond is a lot of time.
An algorithm does not become faster if you run it on faster hardware. If you say, for example, that an algorithm takes eight milliseconds for an input of size 100, the information is meaningless until you say what computer you ran your computations on: it could be a slow algorithm running on fast hardware, a fast algorithm running on slow hardware, or anything in between.
If you also say that it takes, say, 32 milliseconds for an input of size 200, it becomes more meaningful, because the reader can derive the growth rate: doubling the input size quadruples the time, which is a nice thing to know. However, you might as well simply specify that your algorithm is O(n^2).
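The reader's inference is just one line of arithmetic:

    t(200) / t(100) = 32 ms / 8 ms = 4 = (200 / 100)^2, so t(n) grows like n^2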
I've just read an article about a breakthrough in matrix multiplication: an algorithm that is O(n^2.373). But I guess matrix multiplication is something that can be parallelized. So, if we ever start producing thousand-core processors, will this become irrelevant? How would things change?
Parallel execution doesn't change the basics of the complexity for a particular algorithm -- at best, you're just taking the time for some given size, and dividing by the number of cores. This may reduce time for a given size by a constant factor, but has no effect on the algorithm's complexity.
At the same time, parallel execution does sometimes change which algorithm(s) you want to use for particular tasks. Some algorithms that work well in serial code just don't split up into parallel tasks very well. Others that have higher complexity might be faster for practical-sized problems because they run better in parallel.
For an extremely large number of cores, the complexity of the calculation itself may become secondary to simply getting the necessary data to and from all the cores. Most big-O calculations don't take these effects into account for a serial calculation, but they can become quite important for parallel calculations, especially for models of parallel machines that don't give uniform access to all nodes.
If quantum computing comes to something practical some day, then yes, complexity of algorithms will change.
In the meantime, parallelizing an algorithm on a fixed number of processors just divides its runtime proportionally (and that only in the best case, when there are no dependencies between the tasks performed on each processor). That means dividing the runtime by a constant, so the complexity remains the same.
By Amdahl's law, for the same problem size, parallelization comes to a point of diminishing returns as the number of cores increases (theoretically). In practice, beyond a certain degree of parallelization, the overhead of parallelization actually decreases the program's performance.
However, by Gustafson's law, increasing the number of cores does help as the size of the problem increases. That is the motivation behind cluster computing: as we get more computing power, we can tackle problems at a larger scale or with better precision with the help of parallelization.
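For reference, Amdahl's bound is easy to compute; a small sketch, where the 95% parallel fraction is just an arbitrary example value:

    def amdahl_speedup(parallel_fraction, cores):
        """Theoretical upper bound on speedup when only part of the work parallelizes."""
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    # Even a 95%-parallel program saturates near 1 / 0.05 = 20x speedup.
    for cores in (1, 2, 8, 64, 1024):
        print(cores, round(amdahl_speedup(0.95, cores), 2))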
The algorithms we learn may or may not be parallelizable as-is. Sometimes, a different algorithm must be used to perform the same task efficiently in parallel. Either way, the Big-O analysis must be redone for the parallel case, to take into account the effect of parallelization on the algorithm's time complexity.
I'm not a professional programmer, and I don't study programming. I'm an aerospace student; I developed a numerical method for my diploma thesis and also wrote a program to prove that it works.
I implemented several methods and algorithms and tried to show proofs of why different situations needed their own algorithm to solve the task.
I did these proofs with a mathematical approach, but some algorithms were so specific that, although I know what they do and that they do it correctly, it was very hard to find a mathematical function or something similar to show how many iterations or loops they have to do until they finish.
So, I would like to know how you do this comparison. Do you present a mathematical function, or do you just do a speed test of both algorithms? If you do it mathematically, how do you do that? Do you learn this during your university studies, or how?
Thank you in advance, Andreas
The standard way of comparing different algorithms is by comparing their complexity using Big O notation. In practice you would of course also benchmark the algorithms.
As an example, the sorting algorithms bubble sort and heap sort have complexities of O(n^2) and O(n log n), respectively.
As a final note it's very hard to construct representative benchmarks, see this interesting post from Christer Ericsson on the subject.
While big-O notation can provide you with a way of distinguishing an awful algorithm from a reasonable algorithm, it only tells you about a particular definition of computational complexity. In the real world, this won't actually allow you to choose between two algorithms, since:
1) Two algorithms at the same order of complexity, let's call them f and g, both with O(N^2) complexity, might differ in runtime by several orders of magnitude. Big-O notation does not measure the number of individual steps in each iteration, so f might take 100 steps per iteration while g takes 10.
In addition, different compilers or programming languages might generate more or fewer instructions for each iteration of the algorithm, and subtle choices in the description of the algorithm can make the cache or CPU perform tens to thousands of times worse, without changing either the big-O order or the number of steps!
2) An O(N) algorithm might outperform an O(log(N)) algorithm
Big-O notation does not measure the number of individual steps in each iteration, so if the O(N) algorithm takes 100 steps per iteration but the O(log(N)) algorithm takes 1000, then for data sets up to a certain size the O(N) algorithm will be better.
The same issues apply to compilers as above.
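Using the hypothetical step counts above (100 per O(N) iteration, 1000 per O(log N) iteration), a quick check of where the crossover lands:

    import math

    # Hypothetical per-iteration step counts from the example above.
    def cost_linear(n):
        return 100 * n                # O(N), 100 steps per iteration

    def cost_log(n):
        return 1000 * math.log2(n)    # O(log N), 1000 steps per iteration

    n = 2
    while cost_linear(n) < cost_log(n):
        n += 1
    print("the O(N) algorithm is cheaper for N <", n)   # crossover near N = 59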
The solution is to do an initial mathematical analysis of Big-O notation, followed by a benchmark-driven performance tuning cycle, using time and hardware performance counter data, as well as a good dollop of experience.
Firstly, one would need to define what "more efficient" means: does it mean quicker, using fewer system resources (such as memory), etc.? (These factors are sometimes in tension with each other.)
In terms of standard definitions of efficiency, one would often use Big-O notation; however, in the "real world" outside academia, one would normally profile/benchmark both implementations and then compare the results.
It's often difficult to make general assumptions from Big-O notation, as it is primarily concerned with looping and assumes a fixed cost for the code within a loop, so benchmarking is the better way to go.
One caveat to watch out for is that the result can sometimes vary significantly based on the dataset size you're working with: for small N in a loop, one will sometimes not find much difference.
You might get off easy when there is a significant difference in the asymptotic Big-O complexity class for the worst case or for the expected case. Even then you'll need to show that the hidden constant factors don't make the "better" (from the asymptotic perspective) algorithm slower for reasonably sized inputs.
If the difference isn't large, then given the complexity of today's computers, benchmarking with various datasets is the only correct way. You cannot even begin to take into account all of the convoluted interplay between branch-prediction accuracy, data and code cache hit rates, lock contention, and so on.
Running speed tests is not going to give you as good an answer as mathematics will. I think your outlined approach is correct, but perhaps your experience and breadth of knowledge let you down when analysing one of your algorithms. I recommend the book 'Concrete Mathematics' by Knuth and others, but there are a lot of other good (and even more not-so-good) books covering the topic of analysing algorithms. Yes, I learned this during my university studies.
Having written all that: most algorithmic complexity is analysed in terms of worst-case execution time (so-called big-O), and it is possible that your data sets do not approach the worst case, in which case the speed tests you run may illuminate your actual performance rather than the algorithm's theoretical performance. So tests are not without value. I'd say, though, that their value is secondary to that of the mathematics, which shouldn't cause you any undue headaches.
That depends. At university you do learn to compare algorithms by calculating the number of operations an algorithm executes depending on the size/value of its arguments (compare analysis of algorithms and big-O notation). I would expect every decent programmer to at least understand the basics of that.
However, in practice this is useful only for small algorithms, or small parts of larger algorithms. You will have trouble calculating this for, say, a parsing algorithm for an XML document. But knowing the basics often keeps you from making braindead errors; see, for instance, Joel Spolsky's amusing blog entry "Back to Basics".
If you have a larger system you will usually either compare algorithms by educated guessing, making time measurements, or find the troublesome spots in your system by using a profiling tool. In my experience this is rarely that important - fighting to reduce the complexity of the system helps more.
To answer your question: "Do you also present a mathematical function, or do you just do a speedtest of both algorithms?"
Yes to both - let's summarize.
The "Big O" method discussed above refers to the worst case performance as Mark mentioned above. The "speedtest" you mention would be a way to estimate "average case performance". In practice, there may be a BIG difference between worst case performance and average case performance. This is why your question is interesting and useful.
Worst case performance has been the classical way of defining and classifying algorithm performance. More recently, research has been more concerned with average case performance or more precisely performance bounds like: 99% of the problems will require less than N operations. You can imagine why the second case is far more practical for most problems.
Depending on the application, you might have very different requirements. One application may require response time to be less than 3 seconds 95% of the time - this would lead to defining performance bounds. Another might require performance to NEVER exceed 5 seconds - this would lead to analyzing worst case performance.
In both cases this is taught at the university or grad school level. Anyone developing new algorithms used in real-time applications should learn about the difference between average and worst case performance and should also be prepared to develop simulations and analysis of algorithm performance as part of an implementation process.
Hope this helps.
Big O notation gives you the complexity of an algorithm in the worst case, and is mainly useful for knowing how an algorithm's execution time will grow as the amount of data it has to process grows. For example (C-style syntax; the exact language is not important):
    List<int> ls = new List<int>();     // (1) O(1)
    for (int i = 0; i < ls.count; i++)  // (2) O(1)
        foo(i);                         // (3) O(log n) (an efficient function)
Cost analysis:
    (1) cost: O(1), constant cost
    (2) cost: O(1) for (int i = 0;)
            + O(1) for (i < ls.count)
            + O(1) for (i++)
        total: O(1) (constant cost) per iteration, but it repeats n times (ls.count)
    (3) cost: O(log n) (assume it; this is just an example),
        but it repeats n times, since it is inside the loop
So, in asymptotic notation, the algorithm will have a cost of O(n log n), which in this example is a reasonable result. But take this example:
    List<int> ls = new List<int>();     // (1) O(1)
    for (int i = 0; i < ls.count; i++)  // (2) O(1)
        if ((i mod 2) == 0)             // (*) O(1) (constant cost)
            foo(i);                     // (3) O(log n)
Same algorithm, but with one new line adding a condition. In this case, asymptotic notation will choose the worst case and conclude the same result as above, O(n log n), even though it is easy to see that step (3) will execute only half the time.
The figures above are only examples and may not be exact, but they illustrate the behaviour of Big O notation. It mainly tells you the behaviour of your algorithm as the data grows (whether your algorithm is linear, exponential, logarithmic, ...), but this is not what everybody means by "efficiency"; or rather, it is not the only meaning of "efficiency".
However, this method can detect intractable algorithms, that is, algorithms that will need a gigantic amount of time to run even in their early steps (think of factorials, for example, or very big matrices).
If you want a real-world efficiency study, you may prefer to gather some real-world data and benchmark your algorithm's behaviour against it. It is not a mathematical approach, but it will be more precise in the majority of cases (but not in the worst case! ;) ).
Hope this helps.
Assuming speed (not memory) is your primary concern, and assuming you want an empirical (not theoretical) way to compare algorithms, I would suggest you prepare several datasets differing in size by a wide margin, like 3 orders of magnitude. Then run each algorithm against every dataset, clock them, and plot the results. The shape of each algorithm's time vs. dataset size curve will give a good idea of its big-O performance.
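A minimal sketch of that procedure in Python; timeit and the particular sizes are just one reasonable choice, and sorted() stands in for whatever algorithm you are actually measuring:

    import random
    import timeit

    def algorithm_under_test(data):
        return sorted(data)    # stand-in for the algorithm you want to measure

    for size in (1_000, 10_000, 100_000, 1_000_000):   # ~3 orders of magnitude
        data = [random.random() for _ in range(size)]
        t = timeit.timeit(lambda: algorithm_under_test(data), number=5) / 5
        print(f"{size:>9} items: {t:.4f} s")

    # Plot size vs. time (log-log axes work well): the slope of the curve
    # reveals the exponent, i.e. the empirical big-O behaviour.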
Now, if the size of your datasets in practice are pretty well known, an algorithm with better big-O performance is not necessarily faster. To determine which algorithm is faster for a given dataset size, you need to tune each one's performance until it is "as fast as possible" and then see which one wins. Performance tuning requires profiling, or single-stepping at the instruction level, or my favorite technique, stackshots.
As others have rightfully pointed out, a common way is to use Big-O notation.
But Big-O is only good as long as you consider the processing performance of algorithms that are clearly defined and scoped (such as a bubble sort).
It's when other hardware resources, or other software running in parallel, come into play that the part called engineering kicks in. The hardware has its constraints: memory and disk are limited resources, and disk performance even depends on the mechanics involved.
An operating system scheduler will, for instance, differentiate between I/O-bound and CPU-bound work to improve the total performance for a given application. A DBMS will take into account disk reads and writes, memory and CPU usage, and even networking in the case of clusters.
These things are hard to prove mathematically but are often easily benchmarked against a set of usage patterns.
So I guess the answer is that developers use both theoretical methods, such as Big-O, and benchmarking to determine the speed of algorithms and their implementations.
This is usually expressed with big-O notation. Basically, you pick a simple function (like n^2, where n is the number of elements) that dominates the actual number of iterations.
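For example, a concrete instance of "a simple function that dominates":

    T(n) = 3n^2 + 20n + 100 ∈ O(n^2), since 3n^2 + 20n + 100 ≤ 4n^2 for all n ≥ 25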