Do you use Big-O complexity evaluation in the 'real world'? - performance

Recently in an interview I was asked several questions related to the Big-O of various algorithms that came up in the course of the technical questions. I don't think I did very well on this... In the ten years since I took programming courses where we were asked to calculate the Big-O of algorithms, I have not had one discussion about the 'Big-O' of anything I have worked on or designed. I have been involved in many discussions with other team members and with the architects I have worked with about the complexity and speed of code, but I have never been part of a team that actually used Big-O calculations on a real project. The discussions are always "is there a better or more efficient way to do this given our understanding of our data?" Never "what is the complexity of this algorithm?"
I was wondering if people actually have discussions about the "Big-O" of their code in the real world?

It's not so much using it, it's more that you understand the implications.
There are programmers who do not realise the consequence of using an O(N^2) sorting algorithm.
I doubt many apart from those working in academia would use Big-O Complexity Analysis in anger day-to-day.

No needless n-squared
In my experience you don't have many discussions about it, because it doesn't need discussing. All that ever happens in practice is that you discover something is slow, see that it's O(n^2) when it could be O(n log n) or O(n), and go and change it. There's no discussion other than "that's n-squared, go fix it".
So yes, in my experience you do use it pretty commonly, but only in the basest sense of "decrease the order of the polynomial", and not in some highly tuned analysis of "yes, but if we switch to this crazy algorithm, we'll go from log N down to the inverse of Ackermann's function" or some such nonsense. Once you are below n-squared, the theory goes out the window and you switch to profiling (e.g. even to decide between O(n) and O(n log n), measure real data).
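To make the "that's n-squared, go fix it" case concrete, here is a minimal sketch using duplicate detection as a stand-in problem (my own choice of example, the function names are hypothetical). The first version compares every pair; the second makes one pass with a set and assumes the items are hashable.

```python
def has_duplicates_quadratic(items):
    """Compare every pair: about n*(n-1)/2 comparisons in the worst case."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """One pass with a set: expected O(n), assuming hashable items."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both return the same answers; only the growth rate differs, which is the whole point of the fix.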

Big-O notation is rather theoretical, while in practice, you are more interested in actual profiling results which give you a hard number as to how your performance is.
You might have two sorting algorithms which by the book have O(n^2) and O(n log n) upper bounds, but profiling might show that the asymptotically better one carries overhead that the theoretical bound does not reflect, so for the specific problem set you are dealing with you might choose the theoretically less efficient algorithm.
Bottom line: in real life, profiling results usually take precedence over theoretical runtime bounds.

I do, all the time. When you have to deal with "large" numbers, typically in my case: users, rows in database, promotion codes, etc., you have to know and take into account the Big-O of your algorithms.
For example, an algorithm that generates random promotion codes for distribution could be used to generate billions of codes... Using an O(N^2) algorithm to generate unique codes means weeks of CPU time, whereas an O(N) algorithm means hours.
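A rough sketch of how the two classes can arise in code generation. This is an illustration under my own assumptions (8-character codes, a hypothetical `random_code` helper), not the system described above. The only difference is how uniqueness is checked: scanning a list costs O(N) per code, a set lookup is expected O(1).

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits


def random_code(rng, length=8):
    """Hypothetical helper: one random code of the given length."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def generate_codes_quadratic(count, rng):
    """Uniqueness check scans a list, so the total work grows as O(N^2)."""
    codes = []
    while len(codes) < count:
        code = random_code(rng)
        if code not in codes:  # linear scan over everything generated so far
            codes.append(code)
    return codes


def generate_codes_linear(count, rng):
    """Uniqueness check uses a set, so the total work is expected O(N)."""
    codes = set()
    while len(codes) < count:
        codes.add(random_code(rng))  # duplicates are silently ignored
    return list(codes)
```

For a few hundred codes you will not see a difference; the gap only shows once N gets large.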
Another typical example is queries in code (bad!). People look up a table then perform a query for each row... this brings up the order to N^2. You can usually change the code to use SQL properly and get orders of N or NlogN.
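The query-per-row pattern can be sketched with Python's built-in sqlite3 module. The `users` and `orders` tables and their contents are made up for illustration. The first function issues one query per user; with no index on `orders.user_id`, each of those scans the orders table. The second lets the database do the join in a single statement.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """)
    return conn


def totals_query_per_row(conn):
    """Look up the users, then run one extra query for each row."""
    totals = {}
    users = conn.execute("SELECT id, name FROM users").fetchall()
    for user_id, name in users:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        totals[name] = row[0]
    return totals


def totals_single_query(conn):
    """Same result from a single joined and grouped query."""
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id, u.name
    """)
    return dict(rows)
```

How the single query actually performs depends on the database and its indexes, so treat the N or N log N figure as the typical outcome rather than a guarantee.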
So, in my experience, profiling is useful ONLY AFTER the correct class of algorithms is used. I use profiling to catch bad behaviours, such as working out why an application bound to "small" numbers under-performs.

The answer from my personal experience is - No. Probably the reason is that I use only simple, well understood algorithms and data structures. Their complexity analysis was already done and published decades ago. Why we should avoid fancy algorithms is better explained by Rob Pike here. In short, a practitioner almost never has to invent new algorithms and as a consequence almost never has to use Big-O.
Well that doesn't mean that you should not be proficient in Big-O. A project might demand the design and analysis of an altogether new algorithm. For some real-world examples, please read the "war stories" in Skiena's The Algorithm Design Manual.

To the extent that I know that three nested for-loops are probably worse than one nested for-loop. In other words, I use it as a reference gut feeling.
I have never calculated an algorithm's Big-O outside of academia. If I have two ways to approach a certain problem, if my gut feeling says that one will have a lower Big-O than the other one, I'll probably instinctively take the smaller one, without further analysis.
On the other hand, if I know for certain the size of n that comes into my algorithm, and I know for certain it to be relatively small (say, under 100 elements), I might take the most legible one (I like to know what my code does even one month after it has been written). After all, the difference between 100^2 and 100^3 executions is hardly noticeable by the user with today's computers (until proven otherwise).
But, as others have pointed out, the profiler has the final, definitive word: if the code I write executes slowly, I trust the profiler more than any theoretical rule, and fix accordingly.

I try to hold off optimizations until profiling data proves they are needed. Unless, of course, it is blatantly obvious at design time that one algorithm will be more efficient than the other options (without adding too much complexity to the project).

Yes, I use it. And no, it's not often "discussed", just like we don't often discuss whether "orderCount" or "xyz" is a better variable name.
Usually, you don't sit down and analyze it, but you develop a gut feeling based on what you know, and can pretty much estimate the O-complexity on the fly in most cases.
I typically give it a moment's thought when I have to perform a lot of list operations. Am I doing any needless O(n^2) complexity stuff, that could have been done in linear time? How many passes am I making over the list? It's not something you need to make a formal analysis of, but without knowledge of big-O notation, it becomes a lot harder to do accurately.
If you want your software to perform acceptably on larger input sizes, then you need to consider the big-O complexity of your algorithms, formally or informally. Profiling is great for telling you how the program performs now, but if you're using an O(2^n) algorithm, your profiler will tell you that everything is just fine as long as your input size is tiny. And then your input size grows, and runtime explodes.
People often dismiss big-O notation as "theoretical", or "useless", or "less important than profiling". Which just indicates that they don't understand what big-O complexity is for. It solves a different problem than a profiler does. Both are essential in writing software with good performance. But profiling is ultimately a reactive tool. It tells you where your problem is, once the problem exists.
Big-O complexity proactively tells you which parts of your code are going to blow up if you run it on larger inputs. A profiler cannot tell you that.
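As a small illustration of an O(2^n) algorithm that looks harmless on tiny inputs, here is a brute-force subset-sum sketch (my own example, not something from a real code base). It returns how many subsets it examined, so the growth can be read off without a stopwatch.

```python
from itertools import combinations


def subset_sum_bruteforce(values, target):
    """Return (found, subsets_examined); examines up to 2^n subsets."""
    examined = 0
    for size in range(len(values) + 1):
        for combo in combinations(values, size):
            examined += 1
            if sum(combo) == target:
                return True, examined
    return False, examined
```

With an unreachable target, 10 values means 1,024 subsets, 20 values about a million, and 40 values about a trillion. A profiler run on the 10-value case would show nothing worth fixing.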

No. I don't use Big-O complexity in 'real world' situations.
My view on the whole issue is this - (maybe wrong.. but it's just my take.)
The Big-O complexity stuff is ultimately to understand how efficient an algorithm is. If from experience or by other means, you understand the algorithms you are dealing with, and are able to use the right algo in the right place, that's all that matters.
If you know this Big-O stuff and are able to use it properly, well and good.
If you don't know how to talk about algos and their efficiencies in the mathematical way (the Big-O stuff), but you know what really matters, namely the best algo to use in a situation, that's perfectly fine.
If you don't know either, it's bad.

Although you rarely need to do deep big-o analysis of a piece of code, it's important to know what it means and to be able to quickly evaluate the complexity of the code you're writing and the consequences it might have.
At development time you often feel like it's "good enough". Eh, no-one will ever put more than 100 elements in this array, right? Then, one day, someone will put 1000 elements in the array (trust users on that: if the code allows it, one of them will do it). And that n^2 algorithm that was good enough is now a big performance problem.
It's sometimes useful the other way around: if you know that you functionally have to make n^2 operations and the complexity of your algorithm happens to be n^3, there might be something you can do about it to make it n^2. Once it's n^2, you'll have to work on smaller optimizations.
On the contrary, if you just wrote a sorting algorithm and find out it has a linear complexity, you can be sure that there's a problem with it. (Of course, in real life, occasions where you have to write your own sorting algorithm are rare, but I once saw someone in an interview who was plainly satisfied with his one single for loop sorting algorithm).

Yes, for server-side code, one bottle-neck can mean you can't scale, because you get diminishing returns no matter how much hardware you throw at a problem.
That being said, there are often other reasons for scalability problems, such as blocking on file- and network-access, which are much slower than any internal computation you'll see, which is why profiling is more important than BigO.

Related

When would it be important to be able to calculate order of growth?

I'm reading chapters 2 and 3 of CLRS and get stuck so often, especially on the problems at the end of each chapter, that I wonder whether it will ever be worth this much effort. I can't understand online solutions like this one: http://clrs.skanev.com/02/problems/01.html
I heard that this book is one of the most popular textbooks for university CS classes, but do people skip the intricate parts, just memorize the important things (insertion sort has this order of growth, merge sort has that one), and move on?
Isn't it enough just to be familiar with many useful algorithms to have about as much understanding of computer science as people with a CS degree generally do?
Understanding isn't about memorization. It's about being able to apply the knowledge to solve problems. The textbook problems are quite simple compared to most real-life problems. So, skipping these simply means you're not learning at all, and you certainly won't be able to apply any of it in real life. You're memorizing, but you can't use what you've memorized.
TL;DR: The proof of being able to use the knowledge is the ability to solve problems, and textbook problems are simple.‡ One doesn't go without the other.
‡ Knuth's texts are a notable exception: he also offers some borderline intractable problems, and everything in between :)
The point is that "people with a degree in CS ... in general" can work out the order of growth of an algorithm. That's why people go to the effort of learning this stuff. If you just want to be able to say "mergesort is O(n log n)", then indeed, all you need is to see and memorise that fact. If you want to be able to work out the O() of an algorithm, even when it's one you've never seen before - then you need these methods.
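As a sketch of what "working out the O()" looks like, take merge sort, assuming n is a power of two and that merging two halves costs cn for some constant c (both are the usual simplifying assumptions). The running time satisfies a recurrence that can be unrolled step by step:

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + cn, \qquad T(1) = c \\
     &= 4\,T(n/4) + 2cn \\
     &= 2^k\,T(n/2^k) + k\,cn \\
     &= n\,T(1) + cn\log_2 n \qquad (k = \log_2 n) \\
     &= cn + cn\log_2 n = \Theta(n \log n)
\end{align*}
```

The same unrolling works on a recurrence you have never seen before, which is what memorising "mergesort is O(n log n)" cannot give you.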

what is the advantage of recursion algorithm over iteration algorithm?

I am having trouble understanding one thing. Recursion uses so much space, and the time complexity of the iterative and recursive versions of an algorithm is the same unless I apply dynamic programming, so why should we use recursion? Is it merely to reduce the number of lines of code, given that implementing recursion means a whole PCB has to be saved each time control passes from one call to the next?
I have seen many posts related to this, but it is still not clear to me what the major advantage of implementing recursion over iteration is.
what is the major advantage of implementing recursion over iteration ?
Readability - don't neglect it. If the code is readable and simple, it will take less time to write (which is very important in real life), and simpler code is also easier to maintain (since in future updates, it will be easy to understand what's going on).
Now, I am not saying "Make everything recursive!", but if something is significantly more readable as a recursive solution, keep it, unless it makes the code suffer in performance in a noticeable way in production!
Performance. Yes, you heard me. Sometimes, to transform a recursive solution to an iterative one, you need more than a simple loop - you need a loop + a stack.
However, many times, your stack data structure is not as efficient as the machine one, which is optimized exactly for this purpose by hundreds of employees at Intel/AMD.
This thread discusses a specific case, where a trivial conversion of an algorithm from recursive to iterative yields worse results, and in order to outperform the machine stack - you will need to invest lots of hours in optimizing your stack, and your time is a scarce resource.
Again, I am not saying - "recursion is always faster!". But in some situations, it could be.
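To show what "a loop + a stack" means, here is a small sketch that sums the values in a tree both ways. The tree representation, a `(value, children)` tuple, is made up for this example. Which version is faster depends on the language and runtime, so measure before assuming either way.

```python
def sum_tree_recursive(node):
    """Recursive version: the call stack keeps track of pending subtrees."""
    value, children = node
    return value + sum(sum_tree_recursive(child) for child in children)


def sum_tree_iterative(root):
    """Iterative version: an explicit list plays the role of the call stack."""
    total = 0
    stack = [root]
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)  # push the subtrees still to be visited
    return total
```

The recursive version reads like the definition of the problem; the iterative one has to manage the pending work itself.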

Comparing algorithmic performance to old methods

I have written a new algorithm for something. Now I need to compare it with existing methods, some of which are about 10 years old.
The idea I had is to look at benchmarks of different processors over the years in order to establish how much faster my processor (i7-920) is than an average processor from 2003. Then I would simply divide the old methods' execution times by the speedup factor and use those numbers to compare with my own algorithm.
Has something like this been done? So I don't redo the existing work.
Can such a comparison be done some other way?
Are there some scientific papers written about such comparisons which I can reference?
I don't know which of these are possible for you, but here's a list of options I can think of:
1. Run their implementation side-by-side on your machine against yours.
   This is the best option.
2. Rewrite their implementations and do (1).
   You preferably need to compare it against their test to ensure you get vaguely similar results.
3. Find a library that implements their algorithm (or multiple libraries) and do (1).
   I suggest multiple libraries, if possible, since a single one may not have implemented the algorithm efficiently. You may also want to compare these against their test.
4. Compare the algorithms mathematically.
   This may be difficult, but it's not impossible.
5. Do what you presented.
   (a) I would not recommend this as there are other determining factors in your computer other than the processor speed that affect the speed of an algorithm. Getting an equation that perfectly balances these will likely be very difficult.
   (b) There is a massive difference between top and bottom of the line computers, so using the average is not a particularly good idea. If the author didn't provide details regarding this, I'm afraid your benchmark is not likely to be too accurate.
6. Go out and buy a machine of similar specs to the one used by the desired test to benchmark on.
   A 10-year-old machine should be pretty cheap, if you can find one. Also, see (5.b).
7. Contact the author to allow for any of the other options.
   Papers often provide contact details of the authors, or you should be able to find them elsewhere if they have any sort of online presence and you're half-decent at using Google.
If I were reviewing your results, I would be annoyed if you attempted to demonstrate less than an order of magnitude speedup this way. There are a lot of variables determining algorithm performance, and I would be skeptical that a generic benchmark could capture the right ones. My gold standard is old and new algorithms implemented by the same programmer, with similar effort made to optimize, running on the same hardware. Using the previous authors' implementation instead of making a new one is commonplace in the experimental algorithms literature, but using different hardware isn't.
Algorithmic performance is usually measured in big-O terms, for which it is better to count basic operations, like comparisons, and do it for a range of input sizes.
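A minimal sketch of counting comparisons instead of timing, using insertion sort and merge sort as stand-ins for an "old" and a "new" method (they are my choice of example, not the algorithms in question). Each function returns the sorted list together with the number of element comparisons it made.

```python
def insertion_sort_comparisons(data):
    """Return (sorted_list, number_of_element_comparisons)."""
    items = list(data)
    comparisons = 0
    for i in range(1, len(items)):
        key = items[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if items[j] <= key:
                break
            items[j + 1] = items[j]  # shift the larger element right
            j -= 1
        items[j + 1] = key
    return items, comparisons


def merge_sort_comparisons(data):
    """Return (sorted_list, number_of_element_comparisons)."""
    items = list(data)
    if len(items) <= 1:
        return items, 0
    mid = len(items) // 2
    left, left_count = merge_sort_comparisons(items[:mid])
    right, right_count = merge_sort_comparisons(items[mid:])
    merged = []
    comparisons = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons
```

Running both on the same random inputs of size 100, 200, 400, 800 and so on, and tabulating the counts, gives a hardware-independent picture of how each method grows.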
If you must measure overall time, at least eliminate other sources of difference.
As @larsmans said, do it on the same processor.
Also, if there is existing work, there's no harm in repeating it.
Generally, in science, that's a good thing.
You should attempt to reduce the number of differing factors between the two runs. I think timing the two algorithms side by side on the same machine and/or comparing their Big-O times are both equally valid and important. You should also attempt to use updated libraries and other external functions; using outdated ones may also skew the timing results.

Is Quicksort a potential security risk?

I just wondered whether (with some serious paranoia and under certain circumstances) the use of the QuickSort algorithm can be seen as a security risk in an application.
Both its basic implementation and improved versions like 3-median quicksort have the peculiarity of behaving badly for certain input data, which means that their runtime can increase dramatically in these cases (reaching O(n^2) complexity), not to mention the possibility of a stack overflow.
Hence I would see potential to do harm by providing pre-sorted data to a program that causes the algorithm to behave like this, which could have unpredictable consequences for e.g. a multi-client web application.
Is this strange case worth any security consideration (and would it therefore force us to use Intro- or Mergesort instead)?
Edit: I know that there are ways to prevent Quicksort's worst cases, but what about language-integrated sorts (like the 3-median quicksort of .NET)? Would they be taboo?
Yes, it is a security risk - DoS, to be specific - which is trivially mitigated by adding a check for recursion depth in your quicksort, and switching to something else instead if a certain depth is reached. If you switch to heapsort, then you'll get introsort, which is what many STL implementations actually use.
Alternatively, you just randomize the selection of pivot element.
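Both mitigations can be combined in a short sketch. This is an illustration only, not how any particular standard library implements it: the depth limit of 2 * floor(log2 n) is a common choice that I am assuming here, and the version below builds new lists for readability where a production sort would work in place.

```python
import heapq
import math
import random


def heap_sort(items):
    """O(n log n) fallback built on the standard heapq module."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def guarded_quicksort(items, rng=None, depth_limit=None):
    """Quicksort with a random pivot and a recursion-depth guard."""
    items = list(items)
    if len(items) <= 1:
        return items
    if rng is None:
        rng = random.Random()
    if depth_limit is None:
        # Assumed limit; real implementations may pick a different bound.
        depth_limit = 2 * math.floor(math.log2(len(items)))
    if depth_limit <= 0:
        return heap_sort(items)  # going too deep: stop recursing
    pivot = rng.choice(items)  # random pivot defeats pre-crafted inputs
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return (guarded_quicksort(smaller, rng, depth_limit - 1)
            + equal
            + guarded_quicksort(larger, rng, depth_limit - 1))
```

Either guard on its own already takes away the attacker's control over the pivot sequence or caps the damage; together the worst case stays at O(n log n).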
Many implementations of quicksort are done using a randomized version of the algorithm. This means a DoS attack with specially-crafted input is not possible.
Also, even without this, most data sets are simply too small for O(n log n) vs O(n^2) to matter. The size of the set to sort would have to be quite large to have an impact. Even with a few million elements, the time difference would likely not be very large.
Overall, any given web-application using quicksort is much more likely to have other security flaws.
Take a look at this question (and marked answer) which discusses ways of reducing QuickSort's worst case:
Why is quicksort better than mergesort?
If performance is something that matters, then QuickSort would seem a poor choice in most circumstances, security concern or not. Is there something that causes you to shy away from algorithms like Heapsort or Mergesort?
I think this is very much a question of where you're actually using the quick sort. Using O(n^2) algorithms is perfectly fine when you're working with arrays of 5 items, for instance. On the other hand, when there's a chance the data can be significantly large, fearing a DoS is not the first problem you'll face - the first problem will be getting bad performance way before you're facing a real problem. Given the large number of other algorithms available, just have it replaced if it's in a critical location.
It is, but only in very, very unlikely cases -- all of which are easy for a properly-designed algorithm to avoid.
But if you want to be super-safe, you may want to use something like Introsort, which starts out as QuickSort but switches over to Heap Sort if it detects from the recursion depth that the algorithm is starting to go quadratic.
Edit: I see Pavel beat me to Introsort.
In Response to Edited Question: I haven't personally tested every single Quicksort library, but I feel safe betting that pretty much all of them have checks in place to avoid the worst case.

Do you find cyclomatic complexity a useful measure?

I've been playing around with measuring the cyclomatic complexity of a big code base.
Cyclomatic complexity is the number of linearly independent paths through a program's source code and there are lots of free tools for your language of choice.
The results are interesting but not surprising. That is, the parts I know to be the hairiest were in fact the most complex (with a rating of > 50). But what I am finding useful is that a concrete "badness" number is assigned to each method as something I can point to when deciding where to start refactoring.
Do you use cyclomatic complexity? What's the most complex bit of code you found?
We refactor mercilessly, and use Cyclomatic complexity as one of the metrics that gets code on our 'hit list'. 1-6 we don't flag for complexity (although it could get questioned for other reasons), 7-9 is questionable, and any method over 10 is assumed to be bad unless proven otherwise.
The worst we've seen was 87 from a monstrous if-else-if chain in some legacy code we had to take over.
Actually, cyclomatic complexity can be put to use beyond just method level thresholds. For starters, one big method with high complexity may be broken into several small methods with lower complexity. But has it really improved the codebase? Granted, you may get somewhat better readability by all those method names. But the total conditional logic hasn't changed. And the total conditional logic can often be reduced by replacing conditionals with polymorphism.
We need a metric that doesn't turn green by mere method decomposition. I call this CC100.
CC100 = 100 * (Total cyclomatic complexity of codebase) / (Total lines of code)
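The formula is simple enough to sketch directly. The figures in the note after the code are made up to show why splitting a method does not move the metric much.

```python
def cc100(total_cyclomatic_complexity, total_lines_of_code):
    """CC100 = 100 * total cyclomatic complexity / total lines of code."""
    if total_lines_of_code <= 0:
        raise ValueError("total_lines_of_code must be positive")
    return 100 * total_cyclomatic_complexity / total_lines_of_code
```

Take a hypothetical method with complexity 10 in 50 lines: `cc100(10, 50)` gives 20.0. Split it into two methods and each one contributes its own base path, so the total becomes 11, and the extra signature and call add a few lines, say 55 in total: `cc100(11, 55)` is still 20.0. Only removing conditional logic brings the number down.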
It's useful to me in the same way that big-O is useful: I know what it is, and can use it to get a gut feeling for whether a method is good or bad, but I don't need to compute it for every function I've written.
I think simpler metrics, like LOC, are at least as good in most cases. If a function doesn't fit on one screen, it almost doesn't matter how simple it is. If a function takes 20 parameters and makes 40 local variables, it doesn't matter if its cyclomatic complexity is 1.
Until there is a tool that can work well with C++ templates, and meta-programming techniques, it's not much help in my situation. Anyways just remember that
"not all things that count can be measured, and not all things that can be measured count"
Einstein
So remember to pass any information of this type through human filtering too.
We recently started to use it. We use NDepend to do some static code analysis, and it measures cyclomatic complexity. I agree, it's a decent way to identify methods for refactoring.
Sadly, we have seen #'s above 200 for some methods created by our developers offshore.
You'll know complexity when you see it. The main thing this kind of tool is useful for is flagging the parts of the code that were escaping your attention.
I frequently measure the cyclomatic complexity of my code. I've found it helps me spot areas of code that are doing too much. Having a tool point out the hot-spots in my code is much less time consuming than having to read through thousands of lines of code trying to figure out which methods are not following the SRP.
However, I've found that when I do a cyclomatic complexity analysis on other people's code it usually leads to feelings of frustration, angst, and general anger when I find code with cyclomatic complexity in the 100's. What compels people to write methods that have several thousand lines of code in them?!
It's great for help identifying candidates for refactoring, but it's important to keep your judgment around. I'd support kenj0418's ranges for pruning guides.
There's a Java metric called CRAP4J that empirically combines cyclomatic complexity and JUnit test coverage to come up with a single metric. Its author has been doing research to try to improve the empirical formula. I'm not sure how widespread it is.
Cyclomatic Complexity is just one component of what could be called Fabricated Complexity. A while back, I wrote an article to summarize several dimensions of code complexity:
Fighting Fabricated Complexity
Tooling is needed to be efficient at handling code complexity. The tool NDepend for .NET code will let you analyze many dimensions of code complexity, including code metrics like:
Cyclomatic Complexity, Nesting Depth, Lack Of Cohesion of Methods, Coverage by Tests...
It also includes dependency analysis and a language (Code Query Language) dedicated to asking what is complex in your code and to writing rules.
Yes, we use it and I have found it useful too. We have a big legacy code base to tame and we found alarmingly high cyclomatic complexity (387 in one method!). CC points you directly to areas that are worth refactoring. We use CCCC on C++ code.
I haven't used it in a while, but on a previous project it really helped identify potential trouble spots in someone else's code (wouldn't be mine of course!)
Upon finding the areas to check out, I quickly found numerous problems (also lots of GOTOs, would you believe!) with logic and some really strange WTF code.
Cyclomatic complexity is great for showing areas which are probably doing too much and therefore breaking the single responsibility principle. These should ideally be broken up into multiple functions.
I'm afraid that for the language of the project for which I would most like metrics like this, LPC, there are not, in fact, lots of free tools for producing it available. So no, not so useful to me.
+1 for kenj0418's hit list values.
The worst I've seen was a 275. There were a couple others over 200 that we were able to refactor down to much smaller CCs; they were still high but it got them pushed further back in line. We didn't have much luck with the 275 beast -- it was (probably still is) a web of if- and switch-statements that was just way too complex. Its only real value is as a step-through when they decide to rebuild the system.
The exceptions to high CC that I was comfortable with were factories; IMO, they are supposed to have a high CC but only if they are only doing simple object creation and returning.
After understanding what it means, I now have started to use it on a "trial" basis. So far I have found it to be useful, because usually high CC goes hand in hand with the Arrow Anti-Pattern, which makes code harder to read and understand. I do not have a fixed number yet, but NDepend is alerting for everything above 5, which looks like a good start to investigate methods.

Resources