Data Structures & Algorithms Optimal Solution Explanation - algorithm

I'm currently doing a DS&A Udemy course, as I am prepping for the heavy recruiting this upcoming fall. I stumbled upon a problem whose prompt was along the lines of:
"Given two list arrays, figure out which integer is missing from the second list array that was present in the first list array."
Two solutions were given in the course: one that was considered a brute-force solution, and another that was considered more optimal.
Here are the solutions:
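
    import collections

    def finderBasic(list1, list2):
        # Sort both lists so that equal values line up by index
        list1.sort()
        list2.sort()
        # The first position where the lists disagree holds the missing value
        for i in range(len(list2)):
            if list1[i] != list2[i]:
                return list1[i]
        # No mismatch found: the missing value is the last element of list1
        return list1[-1]

    def finderOptimal(list1, list2):
        # Count how many times each value occurs in list2
        d = collections.defaultdict(int)
        for num in list2:
            d[num] += 1
        # The first value of list1 with no remaining count is the missing one
        for num in list1:
            if d[num] == 0:
                return num
            else:
                d[num] -= 1
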
The course explains that finderOptimal is the more optimal way of solving the problem, as it solves it in O(n), i.e. in linear time. Can someone please explain further why that is? I just felt like finderBasic was much simpler and only went through one loop. Any help would be much appreciated, thank you!

You would be correct if it were only about the number of loops; in that case the first solution would be better.
As you said, going through a whole for loop takes O(n) time, and it doesn't matter whether you go through it once, twice, or c times (as long as c is a constant that does not grow with n).
However, the heavy operation here is sorting, which takes roughly n*log(n) time, and that grows faster than O(n). That means that even though the second solution runs through a for loop twice, it is still asymptotically better than sorting once.
Please note that accessing a dictionary key takes approximately O(1) time, so the total is still O(n) with the loop.
Refer to: https://wiki.python.org/moin/TimeComplexity
The basic solution may be better for a reader, as it's very simple and straightforward; however, its time complexity is higher.
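To make the difference concrete, here is a minimal sketch that counts work instead of timing it. The `Counted` wrapper and both helper functions are made up for this illustration; they only assume that `list.sort()` compares elements with `<`, and that the dictionary approach touches each element of both lists once.

```python
import random

class Counted:
    """Wraps a value and counts every < comparison made while sorting."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def sort_comparisons(n, seed=0):
    """Number of comparisons list.sort() makes on n shuffled values."""
    rng = random.Random(seed)
    data = [Counted(rng.random()) for _ in range(n)]
    Counted.comparisons = 0
    data.sort()
    return Counted.comparisons

def dict_operations(n):
    """Dictionary accesses in the hash-based approach:
    one per element of list2 (n - 1 items) plus one per element of list1."""
    return (n - 1) + n

for n in (1_000, 10_000):
    print(n, sort_comparisons(n), dict_operations(n))
```

The comparison count per element keeps growing as n grows, while the dictionary approach stays at about two accesses per element.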

Disclaimer: I am not familiar with Python.
There are two loops you are not accounting for in the first example: each of those sort() calls has to loop over the data itself, usually with nested loops or recursion, to implement the sorting. On top of that, the best performance you can get in the general case from comparison-based sorting is O(n log(n)).
The second case avoids all sorting and simply uses a "playcard" to mark what is present. Additionally, it uses a dictionary, which is a hash table. I am sure you have already learned that hash tables offer constant-time, O(1), operations on average.
Simpler does not always mean most efficient. Conversely, efficient is often hard to comprehend.

Related

Algorithm for finding Time Complexity of Algorithm [duplicate]

I wonder whether there is any automatic way of determining (at least roughly) the Big-O time complexity of a given function?
If I graphed an O(n) function vs. an O(n lg n) function I think I would be able to visually ascertain which is which; I'm thinking there must be some heuristic solution which enables this to be done automatically.
Any ideas?
Edit: I am happy to find a semi-automated solution, just wondering whether there is some way of avoiding doing a fully manual analysis.
It sounds like what you are asking for is an extension of the Halting Problem. I do not believe that such a thing is possible, even in theory.
Just answering the question "Will this line of code ever run?" would be very difficult if not impossible to do in the general case.
Edited to add:
Although the general case is intractable, see here for a partial solution: http://research.microsoft.com/apps/pubs/default.aspx?id=104919
Also, some have stated that doing the analysis by hand is the only option, but I don't believe that is really the correct way of looking at it. An intractable problem is still intractable even when a human being is added to the system/machine. Upon further reflection, I suppose that a 99% solution may be doable, and might even work as well as or better than a human.
You can run the algorithm over various size data sets, and you could then use curve fitting to come up with an approximation. (Just looking at the curve you create probably will be enough in most cases, but any statistical package has curve fitting).
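As a rough sketch of that curve-fitting idea, the snippet below fits a one-parameter model `t = a * f(n)` for a few candidate growth functions and keeps the one with the smallest squared residual. The candidate list, the function name `best_fit`, and the synthetic sample data are assumptions for illustration; real measurements would be noisy and would need more care.

```python
import math

# Candidate growth functions to test the measurements against (assumed set)
MODELS = {
    "n": lambda n: n,
    "n log n": lambda n: n * math.log(n),
    "n^2": lambda n: n * n,
}

def best_fit(samples):
    """samples: list of (n, seconds). Returns (best_name, residuals_by_name)."""
    residuals = {}
    for name, f in MODELS.items():
        xs = [f(n) for n, _ in samples]
        ts = [t for _, t in samples]
        # Closed-form least squares for the one-parameter model t = a * f(n)
        a = sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)
        residuals[name] = sum((t - a * x) ** 2 for x, t in zip(xs, ts))
    return min(residuals, key=residuals.get), residuals

# Synthetic, noise-free "measurements" that follow 2e-7 * n * log(n)
samples = [(n, 2e-7 * n * math.log(n)) for n in (1_000, 2_000, 4_000, 8_000, 16_000)]
name, residuals = best_fit(samples)
print(name)
```

With real timings the residuals of neighbouring classes (for example n and n log n) can be very close, so the result is a hint rather than a verdict.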
Note that some algorithms exhibit one shape with small data sets, but another with large... and the definition of large remains a bit nebulous. This means that an algorithm with a good performance curve could have so much real world overhead that (for small data sets) it doesn't work as well as the theoretically better algorithm.
As far as code inspection techniques, none exist. But instrumenting your code to run at various lengths and outputting a simple file (RunSize RunLength would be enough) should be easy. Generating proper test data could be more complex (some algorithms work better/worse with partially ordered data, so you would want to generate data that represented your normal use-case).
Because of the problems with the definition of "what is large" and the fact that performance is data dependent, I find that static analysis often is misleading. When optimizing performance and selecting between two algorithms, the real world "rubber hits the road" test is the only final arbitrator I trust.
A short answer is that it's impossible because constants matter.
For instance, I might write a function that runs in O((n^3/k) + n^2). This simplifies to O(n^3) because as n approaches infinity, the n^3 term will dominate the function, irrespective of the constant k.
However, if k is very large in the above example function, the function will appear to run in almost exactly n^2 until some crossover point, at which the n^3 term will begin to dominate. Because the constant k will be unknown to any profiling tool, it will be impossible to know just how large a dataset to test the target function with. If k can be arbitrarily large, you cannot craft test data to determine the big-oh running time.
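A small numeric sketch of that crossover, assuming a hypothetical cost of exactly n^3/k + n^2 steps: the two terms are equal when n = k, so any test run with n well below k only ever sees the quadratic term. The value of `k` here is an arbitrary assumption.

```python
def terms(n, k):
    """Return the two terms of the hypothetical cost n^3/k + n^2."""
    return n**3 / k, n**2

k = 1_000_000  # assumed constant, unknown to a profiler
for n in (10, 1_000, 100_000, 1_000_000, 10_000_000):
    cubic, quadratic = terms(n, k)
    dominant = "n^3/k" if cubic > quadratic else "n^2"
    print(f"n={n:>10}  n^3/k={cubic:.3g}  n^2={quadratic:.3g}  dominant: {dominant}")
```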
I am surprised to see so many attempts to claim that one can "measure" complexity by a stopwatch. Several people have given the right answer, but I think that there is still room to drive the essential point home.
Algorithm complexity is not a "programming" question; it is a "computer science" question. Answering the question requires analyzing the code from the perspective of a mathematician, such that computing the Big-O complexity is practically a form of mathematical proof. It requires a very strong understanding of the fundamental computer operations, algebra, perhaps calculus (limits), and logic. No amount of "testing" can be substituted for that process.
The Halting Problem applies, so the complexity of an algorithm is fundamentally undecidable by a machine.
The limits of automated tools apply, so it might be possible to write a program to help, but it would only be able to help about as much as a calculator helps with one's physics homework, or as much as a refactoring browser helps with reorganizing a code base.
For anyone seriously considering writing such a tool, I suggest the following exercise. Pick a reasonably simple algorithm, such as your favorite sort, as your subject algorithm. Get a solid reference (book, web-based tutorial) to lead you through the process of calculating the algorithm complexity and ultimately the "Big-O". Document your steps and results as you go through the process with your subject algorithm. Perform the steps and document your progress for several scenarios, such as best-case, worst-case, and average-case. Once you are done, review your documentation and ask yourself what it would take to write a program (tool) to do it for you. Can it be done? How much would actually be automated, and how much would still be manual?
Best wishes.
I am curious as to why it is that you want to be able to do this. In my experience when someone says: "I want to ascertain the runtime complexity of this algorithm" they are not asking what they think they are asking. What you are most likely asking is what is the realistic performance of such an algorithm for likely data. Calculating the Big-O of a function is of reasonable utility, but there are so many aspects that can change the "real runtime performance" of an algorithm in real use that nothing beats instrumentation and testing.
For example, the following algorithms have the same exact Big-O (wacky pseudocode):

    example a:
        huge_two_dimensional_array foo
        for i = 0; i < foo.length; i++
            for j = 0; j < foo[i].length; j++
                do_something_with foo[i][j]

    example b:
        huge_two_dimensional_array foo
        for j = 0; j < foo[0].length; j++
            for i = 0; i < foo.length; i++
                do_something_with foo[i][j]

Again, exactly the same big-O... but one of them traverses the array in row order and the other in column order. It turns out that due to locality of reference and CPU caching you might have two completely different actual runtimes, especially depending on the actual size of the array foo. This doesn't even begin to touch the actual performance characteristics of how the algorithm behaves if it's part of a piece of software that has some concurrency built in.
Not to be a negative nelly but big-O is a tool with a narrow scope. It is of great use if you are deep inside algorithmic analysis or if you are trying to prove something about an algorithm, but if you are doing commercial software development the proof is in the pudding, and you are going to want to have actual performance numbers to make intelligent decisions.
Cheers!
This could work for simple algorithms, but what about O(n^2 lg n), or O(n lg^2 n)?
You could get fooled visually very easily.
And if it's a really bad algorithm, maybe it wouldn't return even on n=10.
Proof that this is undecidable:
Suppose that we had some algorithm HALTS_IN_FN(Program, function) which determined whether a program halted in O(f(n)) for all n, for some function f.
Let P be the following program:

    if(HALTS_IN_FN(P,f(n)))
    {
        while(1);
    }
    halt;

Since the function and the program are fixed, HALTS_IN_FN on this input is constant time. If HALTS_IN_FN returns true, the program runs forever and of course does not halt in O(f(n)) for any f(n). If HALTS_IN_FN returns false, the program halts in O(1) time.
Thus, we have a paradox, a contradiction, and so the problem is undecidable.
A lot of people have commented that this is an inherently unsolvable problem in theory. Fair enough, but beyond that, even solving it for any but the most trivial cases would seem to be incredibly difficult.
Say you have a program that has a set of nested loops, each based on the number of items in an array. O(n^2). But what if the inner loop is only run in a very specific set of circumstances? Say, on average, it's run in approximately log(n) cases. Suddenly our "obviously" O(n^2) algorithm is really O(n log n). Writing a program that could determine if the inner loop would be run, and how often, is potentially more difficult than the original problem.
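A toy illustration of such a loop, under the assumption that the "very specific set of circumstances" is the outer index being a power of two (the function name and that condition are invented for the example). The loops look quadratic, but the inner one only runs about log2(n) times:

```python
def count_inner_steps(n):
    """Count inner-loop iterations when the inner loop only runs
    for outer indices that are powers of two (about log2(n) of them)."""
    steps = 0
    for i in range(1, n + 1):
        if i & (i - 1) == 0:      # true only when i is a power of two
            for _ in range(n):    # the "expensive" inner loop
                steps += 1
    return steps

for n in (16, 256, 4096):
    print(n, count_inner_steps(n), n * n)
```

A tool that only looks at the nesting would report n^2; working out how often the condition holds is the hard part.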
Remember big-O isn't god; high constants can and will change the playing field. Quicksort algorithms are O(n log n) on average, of course, but when the recursion gets small enough, say down to 20 items or so, many implementations of quicksort will change tactics to a separate algorithm, as it's actually quicker to do a different type of sort, say insertion sort, which has a worse big-O but a much smaller constant.
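A minimal sketch of that hybrid strategy follows. The cutoff of 20 comes from the paragraph above; the function names, the random pivot choice, and the partition scheme are assumptions for illustration, not how any particular library does it.

```python
import random

CUTOFF = 20  # assumed threshold; real libraries tune this empirically

def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; O(n^2) but with very little overhead."""
    for i in range(lo + 1, hi + 1):
        item = a[i]
        j = i - 1
        while j >= lo and a[j] > item:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item

def hybrid_quicksort(a, lo=0, hi=None):
    """Quicksort that hands small sub-ranges to insertion sort."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(a, lo, hi)
        return
    pivot = a[random.randint(lo, hi)]
    i, j = lo, hi
    while i <= j:                 # partition around the pivot value
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    hybrid_quicksort(a, lo, j)
    hybrid_quicksort(a, i, hi)

data = [random.randint(0, 999) for _ in range(500)]
hybrid_quicksort(data)
print(data[:10])
```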
So, understand your data, make educated guesses, and test.
I think it's pretty much impossible to do this automatically. Remember that O(g(n)) is the worst-case upper bound and many functions perform better than that for a lot of data sets. You'd have to find the worst-case data set for each one in order to compare them. That's a difficult task on its own for many algorithms.
You must also take care when running such benchmarks. Some algorithms will have a behavior heavily dependent on the input type.
Take Quicksort for example. It is O(n²) in the worst case, but usually O(n log n), so two inputs of the same size can give very different running times.
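The following sketch shows that input dependence with a deliberately naive quicksort that always picks the first element as pivot (an assumption made for the example; practical implementations choose pivots more carefully). It counts one comparison per non-pivot element in each partitioning step, as an in-place partition would.

```python
import random

def quicksort_comparisons(items):
    """Count comparisons made by a naive quicksort that always
    picks the first element as pivot."""
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    # count one comparison per non-pivot element in this partitioning step
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)

n = 300
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(1).shuffle(shuffled)

print(quicksort_comparisons(already_sorted))  # n*(n-1)/2 comparisons
print(quicksort_comparisons(shuffled))        # far fewer for the same n
```

A benchmark that only ever feeds shuffled data would never see the quadratic case.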
The traveling salesman is (I think, not sure) O(n²) (EDIT: the correct value is O(n!) for the brute force algorithm), but most algorithms get rather good approximated solutions much faster.
This means that the benchmarking structure has to be adapted on an ad hoc basis most of the time. Imagine writing something generic for the two examples mentioned. It would be very complex, probably unusable, and would likely give incorrect results anyway.
Jeffrey L Whitledge is correct. A simple reduction from the halting problem proves that this is undecidable...
ALSO, if I could write this program, I'd use it to solve P vs NP, and have $1million... B-)
I'm using a big_O library (link here) that fits the change in execution time against independent variable n to infer the order of growth class O().
The package automatically suggests the best fitting class by measuring the residual from collected data against each class growth behavior.
Check the code in this answer.
Example of output,
Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------
I guess this isn't possible in a fully automatic way since the type and structure of the input differs a lot between functions.
Well, since you can't prove whether or not a function even halts, I think you're asking a little much.
Otherwise @Godeke has it.
I don't know what your objective is in doing this, but we had a similar problem in a course I was teaching. The students were required to implement something that works at a certain complexity.
In order not to go over their solutions manually and read their code, we used the method @Godeke suggested. The objective was to find students who used a linked list instead of a balanced search tree, or students who implemented bubble sort instead of heap sort (i.e. implementations that do not work in the required complexity, but without actually reading their code).
Surprisingly, the results did not reveal students who cheated. That might be because our students are honest and want to learn (or just knew that we'll check this ;-) ). It is possible to miss cheating students if the inputs are small, or if the input itself is ordered or such. It is also possible to be wrong about students who did not cheat, but have large constant values.
But in spite of the possible errors, it is well worth it, since it saves a lot of checking time.
As others have said, this is theoretically impossible. But in practice, you can make an educated guess as to whether a function is O(n) or O(n^2), as long as you don't mind being wrong sometimes.
First, time the algorithm, running it on inputs of various sizes n. Plot the points on a log-log graph. Draw the best-fit line through the points. If the line fits all the points well, then the data suggests that the algorithm is O(n^k), where k is the slope of the line.
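Here is a minimal sketch of that log-log fit, using an ordinary least-squares line through (log n, log t). The helper name `loglog_slope` is made up; the sample points are the bubble sort timings quoted in another answer on this page, rounded to three decimals.

```python
import math

def loglog_slope(samples):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Bubble sort timings reported in another answer in this thread
samples = [(1000, 0.078), (2000, 0.344), (3000, 0.765), (4000, 1.344), (5000, 2.141)]
print(round(loglog_slope(samples), 2))
```

A slope close to 2 is what one would expect for a quadratic algorithm; a slope that drifts as n grows suggests the data does not follow a pure power law.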
I am not a statistician. You should take all this with a grain of salt. But I have actually done this in the context of automated testing for performance regressions. The patch here contains some JS code for it.
If you have lots of homogeneous computational resources, I'd time them against several samples and do linear regression, then simply take the highest term.
It's easy to get an indication (e.g. "is the function linear? sub-linear? polynomial? exponential")
It's hard to find the exact complexity.
For example, here's a Python solution: you supply the function, and a function that creates parameters of size N for it. You get back a list of (n, time) values to plot, or to perform regression analysis on. It times each size only once, for speed; to get a really good indication it would have to time it many times to minimize interference from environmental factors (e.g. with the timeit module).

    import time

    def measure_run_time(func, args):
        # perf_counter is a high-resolution clock intended for timing code
        start = time.perf_counter()
        func(*args)
        return time.perf_counter() - start

    def plot_times(func, generate_args, plot_sequence):
        return [
            (n, measure_run_time(func, generate_args(n)))
            for n in plot_sequence
        ]

And to use it to time bubble sort:

    def bubble_sort(l):
        for i in range(len(l)-1):
            for j in range(len(l)-1-i):
                # compare and swap adjacent elements
                if l[j+1] < l[j]:
                    l[j],l[j+1] = l[j+1],l[j]

    import random
    def gen_args_for_sort(list_length):
        result = list(range(list_length)) # list of 0..N-1
        random.shuffle(result) # randomize order
        # should return a tuple of arguments
        return (result,)

    # timing for N = 1000, 2000, ..., 5000
    times = plot_times(bubble_sort, gen_args_for_sort, range(1000,6000,1000))
    import pprint
    pprint.pprint(times)

This printed on my machine:

    [(1000, 0.078000068664550781),
     (2000, 0.34400010108947754),
     (3000, 0.7649998664855957),
     (4000, 1.3440001010894775),
     (5000, 2.1410000324249268)]

Is a while loop with a nested for loop O(n) or O(n^2)?

I have 2 blocks of code. One with a single while loop, and the second with a for loop inside the while loop. My professor is telling me that Option 1 has an algorithm complexity of O(n) and Option 2 has an algorithm complexity of O(n^2), however can't explain why that is, other than pointing to the nested for loops. I am confused because both perform the exact same number of calculations for any given size N, which doesn't seem to be indicative that they have different algorithm complexities.
I'd like to know:
a) if my professor is correct, and how they can boast the same calculations but have different big Os.
b) if my professor is incorrect and they are the same complexity, is it O(n) or O(n^2)? Why?
I've used inline comments denoted by '#' to note the computations. The number of packages to deliver should be N. self.trucks is a list. self.workDayCompleted is a boolean determined by whether all packages have been delivered.
Option 1:

    # initializes index for fake for loop
    truck_index = 0
    while(not self.workDayCompleted):
        # checks if truck index has reached end of self.trucks list
        if(truck_index != len(self.trucks)):
            # does X amount of calculations required for delivery of truck's packages
            while(not self.trucks[truck_index].isEmpty()):
                self.trucks[truck_index].travel()
                self.trucks[truck_index].deliverPackage()
            if(hub.packagesExist()):
                self.trucks[truck_index].travelToHub()
                self.trucks[truck_index].loadPackages()
            # increments index
            truck_index += 1
        else:
            # resets index to 0 for next iteration set through truck list
            truck_index = 0
        # does X amount of calculations required for while loop condition
        self.workDayCompleted = isWorkDayCompleted()

Option 2:

    while(not self.workDayCompleted):
        # initializes index (i)
        # each iteration checks if truck index has reached end of self.trucks list
        # increments index
        for i in range(len(self.trucks)):
            # does X amount of calculations required for delivery of truck's packages
            while(not self.trucks[i].isEmpty()):
                self.trucks[i].travel()
                self.trucks[i].deliverPackage()
            if(hub.packagesExist()):
                self.trucks[i].travelToHub()
                self.trucks[i].loadPackages()
        # does X amount of calculations required for while loop condition
        self.workDayCompleted = isWorkDayCompleted()

Any help is greatly appreciated, thank you!
It certainly seems like these two pieces of code are effectively implementing the same algorithm (i.e. deliver a package with each truck, then check to see if the work day is completed, repeat until the work day is completed). From this perspective you're right to be skeptical.
The question becomes: are they O(n) or O(n^2)? As you've described it, this is impossible to determine because we don't know what the conditions are for the work day being completed. Is it related to the amount of work that has been done by the trucks? Without that information we have no ability to reason about when the outer loop exits. For all we know the condition is that each truck must deliver 2^n packages and the complexity is actually O(n 2^n).
So if your professor is right, my only guess is that there's a difference between the implementations of isWorkDayCompleted() between the two options. Barring something like that, though, the two options should have the same complexity.
Regardless, when it comes to problems like this it is always important to make sure that you're both talking about the same things:
- What n means (presumably the number of trucks)
- What you're counting (presumably the number of deliveries and maybe also the checks for the work day being done)
- What the end state is (this is the red flag for me: the work day being completed needs to be better defined)
Subsequent edits lead me to believe both of these options are O(n), since they ultimately perform one or two "travel" operations per package, depending on the number of trucks and their capacity. Given this, I think the answer to your core question (do those different control structures result in different complexity analysis) is no, they don't.
It also seems unlikely that the internals are affecting the code complexity in some important way, so my advice would be to get back together with your professor and see if they can expand on their thoughts. It very well might be that this was an oversight on their part or that they were trying to make a more subtle point about how some of the component you're using were implemented.
If you get their explanation and there is something more complex going on that you still have trouble understanding, that should probably be a separate question (perhaps linked to this one).
a) if my professor is correct, and how they can boast the same calculations but have different big Os.
Two algorithms that do the same number of "basic operations" have the same time complexity, regardless how the code is structured.
b) if my professor is incorrect and they are the same complexity, is it O(n) or O(n^2)? Why?
First you have to define: what is "n"? Is n the number of trucks? Next, is the number of "basic operations" per truck the same, or does it vary in some way?
For example: If the number of operations per truck is constant C, the total number of operations is C*n. That's in the complexity class O(n).
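To see that the loop structure alone does not change the count, here is a stripped-down sketch of the two control structures. The per-truck work is replaced by a single counter increment, and `rounds` is an assumed stand-in for however many passes the work-day condition allows; both names are invented for this example.

```python
def flat_loop(rounds, trucks):
    """One while loop with a hand-rolled index that wraps around (like Option 1)."""
    visits = 0
    completed_rounds = 0
    truck_index = 0
    while completed_rounds < rounds:
        if truck_index != trucks:
            visits += 1          # stand-in for one truck's delivery work
            truck_index += 1
        else:
            truck_index = 0      # start the next pass over the trucks
            completed_rounds += 1
    return visits

def nested_loop(rounds, trucks):
    """A for loop nested in a while loop (like Option 2)."""
    visits = 0
    completed_rounds = 0
    while completed_rounds < rounds:
        for _ in range(trucks):
            visits += 1          # stand-in for one truck's delivery work
        completed_rounds += 1
    return visits

for rounds, trucks in ((1, 10), (3, 5), (7, 100)):
    print(rounds, trucks, flat_loop(rounds, trucks), nested_loop(rounds, trucks))
```

Both functions do rounds * trucks units of work, so whatever the complexity is in terms of n, it is the same for both.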

Hash tables Ω(n^2) runtime?

I am really confused about this. Having read the textbook and done exercises I still don't get how it works, and unfortunately I can't go in person to see the professor and it's somewhat difficult to get in touch (summer online course, different time zones). I feel like it would 'click' if I just understood how to do this problem. The textbook details hash functions and runtime individually but I feel like this question is outside the scope of what we've learned. If someone could point me at anything that might help, that would be great.
1) Consider the process of inserting m keys into a hash table T[0..m − 1], where m is a prime, and we use open addressing. The hash function we use is h(k, i) = (k + i) mod m. Give an example of m keys k1, k2 ... km, such that the following sequence of operations takes Ω(n^2) time:
insert(k1), insert(k2), ..., insert(km)
I understand that insert operations are supposed to take O(1) time or, in some cases, O(n). How exactly am I supposed to come up with keys that will turn that into Ω(n^2) time? I'm hoping to understand this and I feel like I'm missing some huge hint, because the textbook chapter seems simple, makes sense to me, and doesn't help with this at all. In the question it's stated that m is a prime, is this important? I'm just so lost, and Google for once fails me.
The keyword here is hash collision:
In order for a hash function to work well, you need the values for a certain input to be well-distributed over all m possible values the entries are stored in. If the hash table has about as many entries as elements were inserted, you can expect every element to be stored at (or near) its hash value (meaning only small amounts of probing are necessary), making access, insertion and deletion a constant-time operation.
If, however, you find different input values for which the hash function maps to the same value every time (collisions), then during insertion the probing step will have to skip over all previously added elements, taking Ω(n) time per element on average. Thus we get a runtime of Ω(n²).
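As a sketch of one such key set for the probe function h(k, i) = (k + i) mod m from the question: keys that are all multiples of m hash to slot 0, so the j-th insertion has to step over the j - 1 keys already stored. The function name and the choice m = 101 are assumptions for illustration.

```python
def insert_all(keys, m):
    """Insert keys into an open-addressing table of size m using
    h(k, i) = (k + i) mod m and return the total number of slots examined."""
    table = [None] * m
    probes = 0
    for k in keys:
        for i in range(m):
            slot = (k + i) % m
            probes += 1
            if table[slot] is None:
                table[slot] = k
                break
    return probes

m = 101                                   # a prime table size
colliding = [j * m for j in range(m)]     # every key hashes to slot 0
spread = list(range(m))                   # every key hashes to its own slot

print(insert_all(colliding, m))  # 1 + 2 + ... + m = m*(m+1)/2 slots examined
print(insert_all(spread, m))     # one slot examined per key
```

The colliding keys cost a quadratic number of probes in total, while the well-spread keys cost one probe each.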

Programmatically obtaining Big-O efficiency of code

I wonder whether there is any automatic way of determining (at least roughly) the Big-O time complexity of a given function?
If I graphed an O(n) function vs. an O(n lg n) function I think I would be able to visually ascertain which is which; I'm thinking there must be some heuristic solution which enables this to be done automatically.
Any ideas?
Edit: I am happy to find a semi-automated solution, just wondering whether there is some way of avoiding doing a fully manual analysis.
It sounds like what you are asking for is an extention of the Halting Problem. I do not believe that such a thing is possible, even in theory.
Just answering the question "Will this line of code ever run?" would be very difficult if not impossible to do in the general case.
Edited to add:
Although the general case is intractable, see here for a partial solution: http://research.microsoft.com/apps/pubs/default.aspx?id=104919
Also, some have stated that doing the analysis by hand is the only option, but I don't believe that is really the correct way of looking at it. An intractable problem is still intractable even when a human being is added to the system/machine. Upon further reflection, I suppose that a 99% solution may be doable, and might even work as well as or better than a human.
You can run the algorithm over various size data sets, and you could then use curve fitting to come up with an approximation. (Just looking at the curve you create probably will be enough in most cases, but any statistical package has curve fitting).
Note that some algorithms exhibit one shape with small data sets, but another with large... and the definition of large remains a bit nebulous. This means that an algorithm with a good performance curve could have so much real world overhead that (for small data sets) it doesn't work as well as the theoretically better algorithm.
As far as code inspection techniques, none exist. But instrumenting your code to run at various lengths and outputting a simple file (RunSize RunLength would be enough) should be easy. Generating proper test data could be more complex (some algorithms work better/worse with partially ordered data, so you would want to generate data that represented your normal use-case).
Because of the problems with the definition of "what is large" and the fact that performance is data dependent, I find that static analysis often is misleading. When optimizing performance and selecting between two algorithms, the real world "rubber hits the road" test is the only final arbitrator I trust.
A short answer is that it's impossible because constants matter.
For instance, I might write a function that runs in O((n^3/k) + n^2). This simplifies to O(n^3) because as n approaches infinity, the n^3 term will dominate the function, irrespective of the constant k.
However, if k is very large in the above example function, the function will appear to run in almost exactly n^2 until some crossover point, at which the n^3 term will begin to dominate. Because the constant k will be unknown to any profiling tool, it will be impossible to know just how large a dataset to test the target function with. If k can be arbitrarily large, you cannot craft test data to determine the big-oh running time.
I am surprised to see so many attempts to claim that one can "measure" complexity by a stopwatch. Several people have given the right answer, but I think that there is still room to drive the essential point home.
Algorithm complexity is not a "programming" question; it is a "computer science" question. Answering the question requires analyzing the code from the perspective of a mathematician, such that computing the Big-O complexity is practically a form of mathematical proof. It requires a very strong understanding of the fundamental computer operations, algebra, perhaps calculus (limits), and logic. No amount of "testing" can be substituted for that process.
The Halting Problem applies, so the complexity of an algorithm is fundamentally undecidable by a machine.
The limits of automated tools applies, so it might be possible to write a program to help, but it would only be able to help about as much as a calculator helps with one's physics homework, or as much as a refactoring browser helps with reorganizing a code base.
For anyone seriously considering writing such a tool, I suggest the following exercise. Pick a reasonably simple algorithm, such as your favorite sort, as your subject algorithm. Get a solid reference (book, web-based tutorial) to lead you through the process of calculating the algorithm complexity and ultimately the "Big-O". Document your steps and results as you go through the process with your subject algorithm. Perform the steps and document your progress for several scenarios, such as best-case, worst-case, and average-case. Once you are done, review your documentation and ask yourself what it would take to write a program (tool) to do it for you. Can it be done? How much would actually be automated, and how much would still be manual?
Best wishes.
I am curious as to why it is that you want to be able to do this. In my experience when someone says: "I want to ascertain the runtime complexity of this algorithm" they are not asking what they think they are asking. What you are most likely asking is what is the realistic performance of such an algorithm for likely data. Calculating the Big-O of a function is of reasonable utility, but there are so many aspects that can change the "real runtime performance" of an algorithm in real use that nothing beats instrumentation and testing.
For example, the following algorithms have the same exact Big-O (wacky pseudocode):
example a:
huge_two_dimensional_array foo
for i = 0, i < foo[i].length, i++
for j = 0; j < foo[j].length, j++
do_something_with foo[i][j]
example b:
huge_two_dimensional_array foo
for j = 0, j < foo[j].length, j++
for i = 0; i < foo[i].length, i++
do_something_with foo[i][j]
Again, exactly the same big-O... but one of them uses row ordinality and one of them uses column ordinality. It turns out that due to locality of reference and cache coherency you might have two completely different actual runtimes, especially depending on the actual size of the array foo. This doesn't even begin to touch the actual performance characteristics of how the algorithm behaves if it's part of a piece of software that has some concurrency built in.
Not to be a negative nelly but big-O is a tool with a narrow scope. It is of great use if you are deep inside algorithmic analysis or if you are trying to prove something about an algorithm, but if you are doing commercial software development the proof is in the pudding, and you are going to want to have actual performance numbers to make intelligent decisions.
Cheers!
This could work for simple algorithms, but what about O(n^2 lg n), or O(n lg^2 n)?
You could get fooled visually very easily.
And if its a really bad algorithm, maybe it wouldn't return even on n=10.
Proof that this is undecidable:
Suppose that we had some algorithm HALTS_IN_FN(Program, function) which determined whether a program halted in O(f(n)) for all n, for some function f.
Let P be the following program:
if(HALTS_IN_FN(P,f(n)))
{
while(1);
}
halt;
Since the function and the program are fixed, HALTS_IN_FN on this input is constant time. If HALTS_IN_FN returns true, the program runs forever and of course does not halt in O(f(n)) for any f(n). If HALTS_IN_FN returns false, the program halts in O(1) time.
Thus, we have a paradox, a contradiction, and so the program is undecidable.
A lot of people have commented that this is an inherently unsolvable problem in theory. Fair enough, but beyond that, even solving it for any but the most trivial cases would seem to be incredibly difficult.
Say you have a program that has a set of nested loops, each based on the number of items in an array. O(n^2). But what if the inner loop is only run in a very specific set of circumstances? Say, on average, it's run in approx. log(n) cases. Suddenly our "obviously" O(n^2) algorithm is really O(n log n). Writing a program that could determine if the inner loop would be run, and how often, is potentially more difficult than the original problem.
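To make that concrete, here is a small contrived sketch. The power-of-two condition and the name count_inner_steps are my own stand-ins for "a very specific set of circumstances". The loops look nested and quadratic, but the inner loop only fires for about log2(n) values of i, so the inner work grows like n log n.

```python
def count_inner_steps(n):
    """Count inner-loop iterations when the inner loop runs only for power-of-two i."""
    steps = 0
    for i in range(1, n + 1):
        if (i & (i - 1)) == 0:        # true only when i is a power of two
            for _ in range(n):
                steps += 1
    return steps

for n in (16, 256, 1024):
    # inner steps actually executed, next to the naive n*n estimate
    print(n, count_inner_steps(n), n * n)
```

For n = 1024 the inner body runs 11 * 1024 = 11264 times, against 1048576 for a true n^2 loop.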
Remember big-O isn't god; high constant factors can and will change the playing field. Quicksort is O(n log n) on average, of course, but when the recursion gets small enough, say down to 20 items or so, many implementations of quicksort switch to a different algorithm, say insertion sort, which has a worse big-O (O(n^2)) but a much smaller constant, because it's actually quicker on small inputs.
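A minimal sketch of that kind of hybrid, assuming a cutoff of 20 items (taken from the "20 items or so" above, not from any particular library) and a random-pivot Lomuto partition chosen just to keep the code short. Real implementations differ in the details.

```python
import random

CUTOFF = 20  # assumed threshold; tune by measuring, not by guessing

def insertion_sort(items, lo, hi):
    """Sort items[lo..hi] in place (inclusive bounds)."""
    for i in range(lo + 1, hi + 1):
        value = items[i]
        j = i - 1
        while j >= lo and items[j] > value:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = value

def partition(items, lo, hi):
    """Lomuto partition around a random pivot; returns the pivot's final index."""
    pivot_index = random.randint(lo, hi)
    items[pivot_index], items[hi] = items[hi], items[pivot_index]
    pivot = items[hi]
    store = lo
    for k in range(lo, hi):
        if items[k] < pivot:
            items[k], items[store] = items[store], items[k]
            store += 1
    items[store], items[hi] = items[hi], items[store]
    return store

def hybrid_quicksort(items, lo=0, hi=None):
    """Quicksort that hands small slices to insertion sort."""
    if hi is None:
        hi = len(items) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(items, lo, hi)
        return
    p = partition(items, lo, hi)
    hybrid_quicksort(items, lo, p - 1)
    hybrid_quicksort(items, p + 1, hi)

data = [random.randint(0, 10**6) for _ in range(1000)]
hybrid_quicksort(data)
print(data == sorted(data))
```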
So, understand your data, make educated guesses, and test.
I think it's pretty much impossible to do this automatically. Remember that O(g(n)) is the worst-case upper bound and many functions perform better than that for a lot of data sets. You'd have to find the worst-case data set for each one in order to compare them. That's a difficult task on its own for many algorithms.
You must also take care when running such benchmarks. Some algorithms will have a behavior heavily dependent on the input type.
Take Quicksort for example. It is O(n²) in the worst case, but usually O(n log n), so two inputs of the same size can give very different running times.
The traveling salesman problem is O(n!) for the brute-force algorithm, but most algorithms get rather good approximate solutions much faster.
This means that the benchmarking structure has to be adapted on an ad hoc basis most of the time. Imagine writing something generic for the two examples mentioned. It would be very complex, probably unusable, and would likely give incorrect results anyway.
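As a rough illustration of the quicksort point, the sketch below counts comparisons for a deliberately naive quicksort that always picks the first element as the pivot. That pivot choice is my assumption, picked because it makes the worst case easy to trigger; the name quicksort_comparisons is hypothetical. Two inputs of the same size, one already sorted and one shuffled, give very different counts.

```python
import random

def quicksort_comparisons(items):
    """Sort a copy of items with first-element-pivot quicksort.

    Returns (sorted_copy, number_of_comparisons).
    """
    data = list(items)
    comparisons = 0
    stack = [(0, len(data) - 1)]   # explicit stack avoids deep recursion
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = data[lo]
        store = lo
        for k in range(lo + 1, hi + 1):
            comparisons += 1
            if data[k] < pivot:
                store += 1
                data[k], data[store] = data[store], data[k]
        data[lo], data[store] = data[store], data[lo]
        stack.append((lo, store - 1))
        stack.append((store + 1, hi))
    return data, comparisons

n = 500
already_sorted = list(range(n))
shuffled = list(range(n))
random.Random(0).shuffle(shuffled)

_, worst_count = quicksort_comparisons(already_sorted)
_, typical_count = quicksort_comparisons(shuffled)
# The sorted input costs n*(n-1)/2 comparisons; the shuffled one far fewer.
print(worst_count, typical_count)
```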
Jeffrey L Whitledge is correct. A simple reduction from the halting problem proves that this is undecidable...
ALSO, if I could write this program, I'd use it to solve P vs NP, and have $1million... B-)
I'm using a big_O library (link here) that fits the change in execution time against independent variable n to infer the order of growth class O().
The package automatically suggests the best fitting class by measuring the residual from collected data against each class growth behavior.
Check the code in this answer.
Example of output,
Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------
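For anyone curious what this kind of fitting involves, here is a stdlib-only sketch of the idea. It is not the big_O library's code or API; the names CLASSES, fit_class and best_class are mine, and the timings are synthetic numbers generated from a formula rather than measurements. Each candidate class gets a least-squares fit of time = a + b*f(n), and the class with the smallest residual wins.

```python
import math

# Candidate growth classes: name -> f(n) in the model time = a + b * f(n)
CLASSES = {
    "linear": lambda n: n,
    "linearithmic": lambda n: n * math.log(n),
    "quadratic": lambda n: n ** 2,
    "cubic": lambda n: n ** 3,
}

def fit_class(ns, times, transform):
    """Least-squares fit of time = a + b * transform(n); returns (a, b, residual)."""
    xs = [transform(n) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_t = sum(times) / len(times)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    b = sxt / sxx
    a = mean_t - b * mean_x
    residual = sum((a + b * x - t) ** 2 for x, t in zip(xs, times))
    return a, b, residual

def best_class(ns, times):
    """Fit every class and return (name of best fit, all fits)."""
    fits = {name: fit_class(ns, times, f) for name, f in CLASSES.items()}
    return min(fits, key=lambda name: fits[name][2]), fits

# Synthetic "timings" (not measured): exactly 0.5 + 3e-6 * n^2
ns = list(range(100, 1100, 100))
times = [0.5 + 3e-6 * n ** 2 for n in ns]
best, fits = best_class(ns, times)
print(best)
```

With real, noisy measurements the residuals of neighbouring classes can be close, so the winner is only a hint.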
I guess this isn't possible in a fully automatic way since the type and structure of the input differs a lot between functions.
Well, since you can't prove whether or not a function even halts, I think you're asking a little much.
Otherwise #Godeke has it.
I don't know what your objective is in doing this, but we had a similar problem in a course I was teaching. The students were required to implement something that works at a certain complexity.
In order not to go over their solutions manually and read their code, we used the method #Godeke suggested. The objective was to find students who used a linked list instead of a balanced search tree, or students who implemented bubble sort instead of heap sort (i.e. implementations that do not work in the required complexity - but without actually reading their code).
Surprisingly, the results did not reveal students who cheated. That might be because our students are honest and want to learn (or just knew that we'll check this ;-) ). It is possible to miss cheating students if the inputs are small, or if the input itself is ordered or such. It is also possible to be wrong about students who did not cheat, but have large constant values.
But in spite of the possible errors, it is well worth it, since it saves a lot of checking time.
As others have said, this is theoretically impossible. But in practice, you can make an educated guess as to whether a function is O(n) or O(n^2), as long as you don't mind being wrong sometimes.
First, time the algorithm, running it on inputs of various sizes n. Plot the points on a log-log graph. Draw the best-fit line through the points. If the line fits all the points well, then the data suggests that the algorithm is O(n^k), where k is the slope of the line.
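A small sketch of the slope calculation, assuming you already have (n, time) pairs. The function name loglog_slope is made up, and the timings below are synthetic (generated from 2e-7 * n^2), not measurements, so the slope comes out as a clean 2.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic timings (not measurements): exactly 2e-7 * n^2 seconds.
sizes = [1000, 2000, 4000, 8000, 16000]
times = [2e-7 * n ** 2 for n in sizes]
print(round(loglog_slope(sizes, times), 3))  # prints 2.0
```

Real timings will give a slope that is only close to an integer, and logarithmic factors barely bend the line, so O(n log n) can look like a slope a bit above 1.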
I am not a statistician. You should take all this with a grain of salt. But I have actually done this in the context of automated testing for performance regressions. The patch here contains some JS code for it.
If you have lots of homogeneous computational resources, I'd time them against several samples and do linear regression, then simply take the highest term.
It's easy to get an indication (e.g. "is the function linear? sub-linear? polynomial? exponential")
It's hard to find the exact complexity.
For example, here's a Python solution: you supply the function, and a function that creates parameters of size N for it. You get back a list of (n, time) values to plot, or to perform regression analysis on. It times each size only once, for speed; to get a really good indication it would have to time it many times to minimize interference from environmental factors (e.g. with the timeit module).
import time

def measure_run_time(func, args):
    # wall-clock time of a single call to func(*args)
    start = time.time()
    func(*args)
    return time.time() - start

def plot_times(func, generate_args, plot_sequence):
    # one (n, seconds) pair per size in plot_sequence
    return [
        (n, measure_run_time(func, generate_args(n)))
        for n in plot_sequence
    ]
And to use it to time bubble sort:
def bubble_sort(l):
    for i in range(len(l)-1):
        for j in range(len(l)-1-i):
            # compare adjacent elements and swap them if out of order
            if l[j+1] < l[j]:
                l[j], l[j+1] = l[j+1], l[j]

import random

def gen_args_for_sort(list_length):
    result = list(range(list_length))  # list of 0..N-1
    random.shuffle(result)  # randomize order
    # should return a tuple of arguments
    return (result,)

# timing for N = 1000, 2000, ..., 5000
times = plot_times(bubble_sort, gen_args_for_sort, range(1000, 6000, 1000))

import pprint
pprint.pprint(times)
This printed on my machine:
[(1000, 0.078000068664550781),
(2000, 0.34400010108947754),
(3000, 0.7649998664855957),
(4000, 1.3440001010894775),
(5000, 2.1410000324249268)]
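Since a single timing per size is noisy, here is one possible way to repeat each measurement with the timeit module and keep the fastest run. This is a sketch, not part of the code above: measure_best_of and gen_shuffled_list are hypothetical names, and taking the minimum over 5 repeats is just one common choice. Fresh arguments are generated for every repeat because an in-place sort would otherwise be handed an already sorted list the second time around.

```python
import random
import timeit

def measure_best_of(func, generate_args, n, repeats=5):
    """Time func on fresh size-n arguments several times; return the fastest run in seconds."""
    best = float("inf")
    for _ in range(repeats):
        args = generate_args(n)          # fresh input, since func may mutate it
        elapsed = timeit.timeit(lambda: func(*args), number=1)
        best = min(best, elapsed)
    return best

def gen_shuffled_list(n):
    """Return a one-element tuple of arguments: a shuffled list of 0..n-1."""
    data = list(range(n))
    random.shuffle(data)
    return (data,)

# Example: best-of-5 timing of the built-in sorted() on 10000 items
print(measure_best_of(sorted, gen_shuffled_list, 10000))
```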

What are the ways of calculating the average case?

Background:
For my Data Structures and Algorithms course I am studying Big O notation. So far I understand how to work out the time complexity in the best and worst case scenarios. However, the average case is just baffling me. The teacher is just throwing equations at us that I don't understand, and he is not willing to explain them in detail.
Question:
So please guys, what is the best way to calculate this? Is there one equation that calculates this or does it vary from algorithm to algorithm?
What are the steps you take to calculate this?
Let's take the insertion sort algorithm as an example.
Research:
I looked on youtube and stackoverflow for answers. But they all use different equations.
Any help would be great
thanks
As mentioned in the comment, you have to look at the average input to the algorithm (which in this case means random). A good way to think about it is to try to trace what the algorithm would do if the input was average.
For the example of insertion sort:
In the best case (when the input is already sorted) the algorithm will look through the input but never exchange anything, clearly resulting in a running time of O(n).
In the worst case (when the input is exactly the opposite of the desired order) the algorithm will move every element all the way from its current position to the start of the list, that is, the object on index 0 will not be moved, the object on index 1 will be moved once, the object on index 2 will be moved twice and so on, resulting in a running time of 0 + 1 + 2 + 3 + ... + (n-1) ≈ 0.5n² = O(n²).
The same way of thinking can be used to find the average case, but instead of each object moving all the way to the start, we can expect that it will on average move halfway down to the start, that is, the object on index 0 will not be moved, the object on index 1 will be moved half a time (of course this only makes sense on average), the object on index 2 will be moved once, the object on index 3 will be moved 1.5 times and so on, resulting in a running time of 0 + 0.5 + 1 + 1.5 + 2 + ... + (n-1)/2 ≈ 0.25n² (at each index, we have half of what we had in the worst case) = O(n²).
Of course not all algorithms are as simple as this, but looking at what the algorithm would do on each step if the input was random usually helps. If you have any kind of information available on the input to the algorithm (for instance, insertion sort is often used as the last step after another algorithm has done most of the sorting, as it is very efficient if the input is almost sorted, and in such a case we might for example know that no object is going to be moved more than x times), then this can be taken into account when computing the average running time.
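If you want to check the three cases for insertion sort yourself, here is a small sketch that counts how many single-position moves the sort makes. The function name insertion_sort_moves is made up, and n = 6 is chosen only so that every permutation can be tried, which gives the exact average instead of an estimate.

```python
from itertools import permutations

def insertion_sort_moves(items):
    """Insertion-sort a copy of items; return how many single-position moves were made."""
    data = list(items)
    moves = 0
    for i in range(1, len(data)):
        value = data[i]
        j = i - 1
        while j >= 0 and data[j] > value:
            data[j + 1] = data[j]     # shift one position toward the end
            j -= 1
            moves += 1
        data[j + 1] = value
    return moves

n = 6
best = insertion_sort_moves(range(n))               # already sorted
worst = insertion_sort_moves(range(n - 1, -1, -1))  # reversed
all_moves = [insertion_sort_moves(p) for p in permutations(range(n))]
average = sum(all_moves) / len(all_moves)
print(best, worst, average)  # prints 0 15 7.5
```

The worst case is n(n-1)/2 = 15 moves and the average over all orderings is n(n-1)/4 = 7.5, half of the worst case, which matches the 0.5n² and 0.25n² estimates.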

Resources