Strategy for figuring out input size of recurrence equation? - algorithm

Something I'm struggling with is figuring out a precise way of knowing how our size n should be defined.
To demonstrate what I mean, take binary search for example. The input size n of T(n) is defined to be high - low + 1. I don't really understand how we can
1. figure that out from just the algorithm without taking an "educated guess", and
2. confirm that we are not wasting our time proving the recurrence equation with a falsely derived input size.
I would really appreciate some advice, thanks in advance.

Indeed the input size is usually not arbitrary. It is necessary to correctly determine what that is.
So how do we do it?
First of all you have to understand the algorithm - what it does and why you even use it. Usually you can brute force any problem quite easily, but you still decide to use something better. Why? Answering that question should make everything clear.
Let's take a look at your binary search example. You want to find a specific value by querying a monotonic function that tells you whether your choice is too low or too high. For example, you may want to find the largest value less than a specific number in an array.
What is the brute force approach? Well, you can ask about every possible value. What affects the complexity? The number of values you can choose from. That's your input size. To decrease the complexity you may want to use binary search, which lets you make far fewer queries. The input size remains the same.
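To make that concrete, here is a minimal binary-search sketch in Python (the names a, key, low and high are illustrative, not from the original question): the quantity that shrinks on every recursive call is exactly high - low + 1, which is why that expression is the n of the recurrence T(n) = T(n/2) + O(1).

def binary_search(a, key, low, high):
    # Input size of the recurrence: n = high - low + 1.
    # Each call keeps roughly half of that range.
    if low > high:                 # empty range, n == 0
        return -1
    mid = (low + high) // 2
    if a[mid] == key:
        return mid
    elif a[mid] < key:
        return binary_search(a, key, mid + 1, high)   # right half, size about n/2
    else:
        return binary_search(a, key, low, mid - 1)    # left half, size about n/2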
Let's have a look at some other algorithms:
Euclidean algorithm for GCD - brute force? Search every number less than or equal to the minimum of your inputs. What affects the complexity? The number of values you want to check, that is, the numbers for which you want to find the GCD.
BFS/DFS (graph traversal) - you want to visit each node once. For each node you have to additionally check every edge. In total: the input size is the number of nodes and edges.
KMP/Karp-Rabin/any other pattern finding algorithm - you want to find occurrences of a pattern in a text; the input size will obviously be the text size, but also the pattern size.
Once you understand the problem the algorithm solves, think of a brute force approach. Determine what affects the speed and what can be improved. That is most likely your input size. In more complex cases, the input size is stated in the algorithm's description.
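As a small illustration of the Euclidean algorithm from the list above (a minimal Python sketch, assuming the usual remainder-based formulation): here the "input size" is not an array length at all but the magnitude (or bit length) of the two numbers, which is exactly what the brute-force comparison reveals.

def gcd(a, b):
    # The work depends on how large a and b are (roughly O(log min(a, b))
    # iterations), so the size of the numbers is the input size to analyze.
    while b != 0:
        a, b = b, a % b
    return a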

Related

Sorting Algorithm that minimizes the maximum number of comparisons in which individual items are involved

I'm interested in finding a comparison sorting algorithm that minimizes the number of times each single element is compared with others during the run of the algorithm.
For a randomly sorted list, I'm interested in two distributions: the number of comparisons that are needed to sort a list (this is the traditional criterion) and the number of comparisons in which each single element of the list is involved.
Among the algorithms that have a good performance in terms of the number of comparisons, say achieving O(n log(n)) on average, I would like to find out the one for which, on average, the number of times a single element is compared with others is minimized.
I suppose the theoretical minimum is O(log(n)), which is obtained by dividing the above figure for the total number of comparisons by n.
I'm also interested in the case where data are likely to be already ordered to some extent.
Is perhaps a simulation the best way to go about finding an answer?
(My previous question was put on hold. This is now a very clear question; if you can't understand it, please explain why.)
Yes, you definitely should do simulations.
There you will implicitly set the size and pre-ordering constraints in a way that may allow more specific statements than the general question you raised.
There cannot, however, be a clear answer to such a question in general.
Big-O deals with asymptotic behaviour, while your question seems to target smaller problem sizes. So Big-O could hint at the best candidates for sufficiently large input sets to a sort run. (But, e.g., if you are interested in size <= 5 the results may be completely different!)
To get a proper estimate of the comparison operations you would need to analyze each individual algorithm.
In the end, the result (for a given algorithm) will necessarily be specific to the dataset being sorted.
Also, "on average" is not well defined in your context. I'd assume you intend to refer to the number of comparisons involving the participating objects for a given sort, and not an average over a (sufficiently large) set of sort runs.
Even within a single algorithm, the distribution of comparisons an individual object takes part in may show a large standard deviation in one case and be (nearly) equally distributed in another.
As the complexity of a sorting algorithm is determined by the total number of comparisons (and position changes), I do not expect theoretical analysis to contribute much to an answer.
Maybe you can add some background on what would make an answer to your question "interesting" in a practical sense?
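As a hedged starting point for such a simulation, here is a small Python sketch (the wrapper-class approach and the use of Python's built-in Timsort are my own assumptions, not part of the question): every comparison the sort performs increments the counters of both elements involved, which gives exactly the per-element distribution being asked about.

import random

class Counted:
    # Wraps a value and counts how many comparisons it takes part in.
    def __init__(self, value):
        self.value = value
        self.comparisons = 0

    def __lt__(self, other):
        self.comparisons += 1
        other.comparisons += 1
        return self.value < other.value

def per_element_comparisons(values):
    items = [Counted(v) for v in values]
    sorted(items)                       # Timsort; swap in any algorithm under test
    return [it.comparisons for it in items]

counts = per_element_comparisons(random.sample(range(10000), 1000))
print(max(counts), sum(counts) / len(counts))   # maximum vs. average per-element count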

Why is the runtime to construct a decision tree mnlog(n)?

When m is the number of features and n is the number of samples, the Python scikit-learn site (http://scikit-learn.org/stable/modules/tree.html) states that the runtime to construct a binary decision tree is O(mn log(n)).
I understand that the log(n) comes from the average height of the tree after splitting. I understand that at each split, you have to look at each feature (m) and choose the best one to split on. I understand that this is done by calculating a "best metric" (in my case, a gini impurity) for each sample at that node (n). However, to find the best split, doesn't this mean that you would have to look at each possible way to split the samples for each feature? And wouldn't that be something like 2^n-1 * m rather than just mn? Am I thinking about this wrong? Any advice would help. Thank you.
One way to build a decision tree would be, at each point, to do something like this:
For each possible feature to split on:
    Find the best possible split for that feature.
    Determine the "goodness" of this fit.
Of all the options tried above, take the best and use that for the split.
The question is how to perform each step. If you have continuous data, a common technique for finding the best possible split would be to sort the data points into ascending order along that feature, then consider all possible partition points between those data points and take the one that minimizes the entropy. This sorting step takes time O(n log n), which dominates the runtime. Since we're doing that for each of the O(m) features, the runtime ends up working out to O(mn log n) total work done per node.
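A hedged sketch of that "sort, then scan candidate thresholds" step for one continuous feature (Python; using Gini impurity as in the question rather than entropy, with illustrative names xs, ys, best_split):

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    # Only the n - 1 cut points between consecutive sorted values need to be
    # tried, not 2^n subsets, because a threshold split respects the ordering.
    # Note: this naive rescan of both sides is O(n^2) for clarity; real
    # implementations update class counts incrementally so the scan is O(n)
    # and the O(n log n) sort dominates.
    pairs = sorted(zip(xs, ys))                       # O(n log n)
    best_score, best_threshold = float("inf"), None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_score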

Performance of binary search algorithm when there are many duplicates

http://katemats.com/interview-questions/ says:
You are given a sorted array and you want to find the number N. How do you do the search as quickly as possible (not just traversing each element)?
How would the performance of your algorithm change if there were lots of duplicates in the array?
My answer to the first question is binary search, which is O(log(n)), where n is the number of elements in the array.
According to this answer, "we have a maximum of log_2(n-1) steps" in the worst case when "element K is not present in A and smaller than all elements in A".
I think the answer to the second question is it doesn't affect the performance. Is this correct?
If you are talking worst case / big O, then you are correct - log(n) is your bound. However, if your data is fairly uniformly distributed (or you can map it to such a distribution), interpolating where to pick your partition can get log(log(n)) behavior. When you do the interpolation, you also get rid of the worst cases where you are looking for one of the end elements (of course there are new pathological cases, though).
For many, many duplicates you might be willing to stride further away from the direct center on the next probe. With more duplicates, you get a better margin for guessing correctly. While always choosing the half-way point gets you there in good time, educated guesses might get you some really excellent average behavior.
When I interview, I like to hear those answers: both knowledge of the textbook answer and the theoretical bound, but also what can be done to specialize to the given situation. Often these constant factors can be really helpful (look at quicksort and its partition-choosing schemes).
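A hedged sketch of the interpolation idea in Python (it assumes numeric keys that are roughly uniformly distributed, which is where the expected O(log log n) behavior comes from; the duplicate check also keeps it safe on runs of equal values):

def interpolation_search(a, key):
    # Probe where the key "should" be, based on linear interpolation between
    # a[low] and a[high], instead of always probing the middle.
    low, high = 0, len(a) - 1
    while low <= high and a[low] <= key <= a[high]:
        if a[low] == a[high]:              # avoid division by zero on duplicates
            break
        pos = low + (key - a[low]) * (high - low) // (a[high] - a[low])
        if a[pos] == key:
            return pos
        elif a[pos] < key:
            low = pos + 1
        else:
            high = pos - 1
    return low if low < len(a) and a[low] == key else -1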
I don't think having duplicates matters.
You're looking for a particular number N, what matters is whether or not the current node matches N.
If I'm looking for the number 1 in the list 1-2-3-4-5-6 the performance would be identical to searching the list 1-9-9-9-9-9.
If the number N is duplicated then you will have a chance of finding it a couple steps sooner. For example if the same search was done on the list 1-1-1-1-1-9.

Maximum two-dimensional subset-sum

I've been given the task of writing an algorithm to compute the maximum two-dimensional subset of a matrix of integers. However, I'm not interested in help with such an algorithm; I'm more interested in knowing the best worst-case complexity that could possibly solve this.
Our current algorithm is roughly O(n^3).
I've been considering something like divide and conquer: splitting the matrix into a number of sub-matrices, simply by adding up the elements within them, and thereby limiting the number of matrices one has to consider in order to find an approximate solution.
Worst case (exhaustive search) is definitely no worse than O(n^3). There are several descriptions of this on the web.
Best case can be far better: O(1). If all of the elements are non-negative, then the answer is the matrix itself. If the elements are non-positive, the answer is the element that has its value closest to zero.
Likewise if there are entire rows/columns on the edges of your matrix that are nothing but non-positive integers, you can chop these off in your search.
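For reference, a hedged sketch of the usual O(n^3) approach for an n x n matrix of integers (collapse every pair of top/bottom rows into per-column sums and run Kadane's one-dimensional maximum-subarray scan over them); whether this matches the asker's own O(n^3) algorithm is an assumption on my part.

def max_submatrix_sum(matrix):
    # Maximum sum over all contiguous rectangular submatrices: O(rows^2 * cols).
    rows, cols = len(matrix), len(matrix[0])
    best = matrix[0][0]
    for top in range(rows):
        col_sums = [0] * cols
        for bottom in range(top, rows):
            # col_sums[c] = sum of column c over rows top..bottom
            for c in range(cols):
                col_sums[c] += matrix[bottom][c]
            # Kadane's algorithm on the collapsed 1-D array
            running = best_here = col_sums[0]
            for c in range(1, cols):
                running = max(col_sums[c], running + col_sums[c])
                best_here = max(best_here, running)
            best = max(best, best_here)
    return best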
I've figured out that there isn't a better way to do it - at least none known yet.
And I'm going to stick with the solution I've got, mainly because it's simple.

Is there such a thing as "negative" big-O complexity? [duplicate]

Possible Duplicate:
Are there any O(1/n) algorithms?
This just popped in my head for no particular reason, and I suppose it's a strange question. Are there any known algorithms or problems which actually get easier or faster to solve with larger input? I'm guessing that if there are, it wouldn't be for things like mutations or sorting, it would be for decision problems. Perhaps there's some problem where having a ton of input makes it easy to decide something, but I can't imagine what.
If there is no such thing as negative complexity, is there a proof that there cannot be? Or is it just that no one has found it yet?
No, that is not possible. Since Big-Oh is supposed to be an approximation of the number of operations an algorithm performs relative to its domain size, it would not make sense to describe an algorithm as using a negative number of operations.
The formal definition section of the Wikipedia article actually defines the Big-Oh notation in terms of positive real numbers. So there is not even a proof, because the whole concept of Big-Oh has no meaning on the negative real numbers per the formal definition.
Short answer: it's not possible because the definition says so.
Update
Just to make it clear, I'm answering this part of the question: Are there any known algorithms or problems which actually get easier or faster to solve with larger input?
As noted in the accepted answer here, there are no algorithms that work faster with bigger input:
Are there any O(1/n) algorithms?
Even an algorithm like sleep(1/n) has to spend time reading its input, so its running time has a lower bound.
In particular, the author refers to a relatively simple substring search algorithm:
http://en.wikipedia.org/wiki/Horspool
PS: But using the term "negative complexity" for such algorithms doesn't seem reasonable to me.
To think of an algorithm that executes in negative time is the same as thinking about time going backwards.
If the program starts executing at 10:30 AM and stops at 10:00 AM without passing through 11:00 AM, it has just executed with time = O(-1).
=]
Now, for the mathematical part:
If you can't come up with a sequence of actions that execute backwards in time (you never know...lol), the proof is quite simple:
positiveTime = O(-1) means:
positiveTime <= c * (-1), for some constant c > 0 and all n > n0 > 0
Consider the "c > 0" restriction.
We can't find a positive number that, multiplied by -1, results in another positive number.
Taking that into account, this is the result:
positiveTime <= negativeNumber, for all n > n0 > 0
Which just proves that you can't have an algorithm with O(-1) running time.
Not really. O(1) is the best you can hope for.
The closest I can think of is language translation, which uses large datasets of phrases in the target language to match up smaller snippets from the source language. The larger the dataset, the better (and to a certain extent faster) the translation. But that's still not even O(1).
Well, for many calculations like "given input A, return f(A)" you can "cache" calculation results (store them in an array or map), which will make the calculation faster with a larger number of values, if some of those values repeat.
But I don't think it qualifies as "negative complexity". In this case the fastest performance will probably count as O(1), the worst-case performance will be O(N), and the average performance will be somewhere in between.
This is somewhat applicable for sorting algorithms - some of them have O(N) best-case scenario complexity and O(N^2) worst case complexity, depending on the state of data to be sorted.
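A minimal sketch of that caching idea in Python (the placeholder function expensive and the plain dictionary cache are illustrative assumptions): repeated inputs are answered in O(1) after the first call, but a fresh input still pays the full cost, so the worst case is unchanged.

_cache = {}

def expensive(a):
    # Stand-in for some costly pure function f(A).
    return sum(i * i for i in range(a))

def cached_f(a):
    # First call for a given input pays the full price; repeats are O(1) lookups.
    if a not in _cache:
        _cache[a] = expensive(a)
    return _cache[a]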
I think that to have negative complexity, an algorithm would have to return the result before it has been asked to calculate it. I.e. it would have to be connected to a time machine and be able to deal with the corresponding "grandfather paradox".
As with the other question about the empty algorithm, this question is a matter of definition rather than a matter of what is possible or impossible. It is certainly possible to think of a cost model for which an algorithm takes O(1/n) time. (That is not negative of course, but rather decreasing with larger input.) The algorithm can do something like sleep(1/n) as one of the other answers suggested. It is true that the cost model breaks down as n is sent to infinity, but n never is sent to infinity; every cost model breaks down eventually anyway. Saying that sleep(1/n) takes O(1/n) time could be very reasonable for an input size ranging from 1 byte to 1 gigabyte. That's a very wide range for any time complexity formula to be applicable.
On the other hand, the simplest, most standard definition of time complexity uses unit time steps. It is impossible for a positive, integer-valued function to have decreasing asymptotics; the smallest it can be is O(1).
I don't know if this quite fits, but it reminds me of BitTorrent. The more people downloading a file, the faster it goes for all of them.
