Binary vs Linear searches for unsorted N elements - algorithm

I'm trying to work out a formula for when we should use quicksort first. For instance, we have an array with N = 1_000_000 elements. If we will search it only once, we should use a simple linear search, but if we'll do it 10 times we should first sort the array in O(n log n). How can I find the threshold, in terms of the number of searches and the size of the input array, at which I should sort and then use binary search?

You want to solve an inequality that roughly might be described as
t * n > C * n * log(n) + t * log(n)
where t is the number of searches and C is some constant for the sort implementation (to be determined experimentally). Once you have estimated this constant, you can solve the inequality numerically (with some uncertainty, of course).
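For what it's worth, here is one way to run that experiment in Python (my own sketch, not from the answer): it measures wall-clock time instead of abstract comparisons, so the constant C and the per-check costs are all folded into the timings, and it then solves the inequality for the number of searches t.

import bisect
import math
import random
import time

def seconds(fn):
    # Wall-clock time of one call of fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

n = 1_000_000
data = [random.random() for _ in range(n)]
targets = random.sample(data, 100)

# Measured stand-ins for the terms of the inequality, in seconds:
t_sort = seconds(lambda: sorted(data))                   # ~ C * n * log(n)
t_linear = seconds(lambda: [data.index(x) for x in targets]) / len(targets)
prepared = sorted(data)
t_binary = seconds(lambda: [bisect.bisect_left(prepared, x) for x in targets]) / len(targets)

# t * t_linear > t_sort + t * t_binary  =>  t > t_sort / (t_linear - t_binary)
breakeven = math.ceil(t_sort / (t_linear - t_binary))
print(f"sorting first pays off after roughly {breakeven} searches")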

As you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n*log_b(n) + x*log_2(n) <= x*n/2
where x is the number of searches, n the input size, and b the base of the logarithm for the sort, depending on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big O (big Omicron) or big Theta notation then you can omit the /2 in the above.

Assuming n elements and m searches, with crude approximations:
the cost of the sort will be C0 * n * log(n),
the cost of the m binary searches will be C1 * m * log(n),
the cost of the m linear searches will be C2 * m * n,
with C2 ~ C1 < C0.
Now you compare
C0 * n * log(n) + C1 * m * log(n) vs. C2 * m * n
or
C0 * n * log(n) / (C2 * n - C1 * log(n)) vs. m
For reasonably large n, the breakeven point is about C0 * log(n) / C2.
For instance, taking C0 / C2 = 5 and n = 1000000 gives m ≈ 100.

You should plot the complexities of both operations.
Linear search: O(n)
Sort and binary search: O(n log n + log n)
In the plot, you will see for which values of n it makes sense to choose the one approach over the other.
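For example, a quick matplotlib sketch (my code, assuming a single search so it matches the O(n log n + log n) term above). For one search the linear curve stays below everywhere; the crossover only appears once you multiply the per-search terms by the number of searches, as the other answers do.

import numpy as np
import matplotlib.pyplot as plt

n = np.arange(2, 10_001)
plt.plot(n, n, label="linear search: n")
plt.plot(n, n * np.log2(n) + np.log2(n), label="sort + one binary search: n log n + log n")
plt.xlabel("input size n")
plt.ylabel("abstract cost")
plt.legend()
plt.show()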

This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger subdivision). Well, given an array of size N of random values, the smallest value has a subdivision of (1, N-1), the second smallest value has a subdivision of (2, N-2), and so on. I put this in a quick script:
split = 0
for x in range(10000):
    split += float(max(x, 10000 - x)) / 10000
split /= 10000
print split
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
Now, let's assume that even a 25/75 split follows an n*log(n) progression for some unknown logarithm base. That means that num_comparisons(n) = n * log_b(n), and the question is to find b via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:
C(n) = n * log(n) / log(b)
where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script to generate an array of xs and filled it with C(n) and ys and filled it with n*log(n) and used numpy to tell me the slope of that least squares fit, which I expect to equal 1 / log(b). I ran the script and got b inside of [2.16, 2.3] depending on how high I set n to (I varied n from 100 to 100'000'000). The fact that b seems to vary depending on n shows that my model isn't exact, but I think that's okay for this example.
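Here is a rough reconstruction of that experiment (my own script, not the author's): count the comparisons of a plain last-element-pivot quicksort on random data, fit C(n) against n*ln(n) by least squares, and recover b from the slope. The value you get depends on the quicksort variant and on exactly what you count as a comparison, so it won't necessarily land in the [2.16, 2.3] range quoted above.

import math
import random
import numpy as np

def quicksort_comparisons(arr):
    # Plain last-element-pivot quicksort, counting one comparison per
    # non-pivot element at each partition step.
    if len(arr) <= 1:
        return 0
    pivot = arr[-1]
    left, right = [], []
    for x in arr[:-1]:
        (left if x < pivot else right).append(x)
    return (len(arr) - 1) + quicksort_comparisons(left) + quicksort_comparisons(right)

sizes = [1_000, 3_000, 10_000, 30_000, 100_000]
xs = np.array([m * math.log(m) for m in sizes], dtype=float)        # m * ln(m)
ys = np.array([quicksort_comparisons([random.random() for _ in range(m)])
               for m in sizes], dtype=float)                         # measured C(m)

# Fit C(m) ~ slope * m * ln(m); with C(m) = m * log_b(m) the slope is 1 / ln(b).
slope = np.linalg.lstsq(xs[:, None], ys, rcond=None)[0][0]
print("estimated base b:", math.exp(1.0 / slope))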
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when: N * n/2 = n*log_2.3(n) + N * log_2.3(n). I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:
N = n*log_2.3(n) / (n/2 - log_2.3(n))
If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question) then it will be more efficient to sort once and use binary searches on that.

Related

Runtime of triple nested for loops

I'm currently going through the "Cracking the Coding Interview" textbook and I'm reviewing Big-O and runtime. One of the examples was as follows:
Print all positive integer solutions to the equation a^3 + b^3 = c^3 + d^3 where a, b, c, d are integers between 1 and 1000.
The pseudocode solution provided is:
n = 1000;
for c from 1 to n
    for d from 1 to n
        result = c^3 + d^3
        append (c, d) to list at value map[result]
for each result, list in map
    for each pair1 in list
        for each pair2 in list
            print pair1, pair2
The runtime is O(N^2)
I'm not sure how O(N^2) is obtained, and after extensive googling and trying to figure out why, I still have no idea. My rationale is as follows:
Top half is O(N^2) because the outer loop goes to n and inner loop executes n times each.
The bottom half I'm not sure how to calculate, but I got O(size of map) * O(size of list) * O(size of list) = O(size of map) * O(size of list^2).
O(N^2) + O(size of map) * O(size of list^2)
The 2 for loops adding the pairs to the list of the map = O(N) * O(N) b/c it's 2 for loops running N times.
The outer for loop for iterating through the map = O(2N-1) = O(N) b/c the size of the map is 2N - 1 which is essentially N.
The 2 for loops for iterating through the pairs of each list = O(N) * O(N) b/c each list is <= N
Total runtime: O(N^2) + O(N) * O(N^2) = O(N^3)
Not sure what I'm missing here
Could someone help me figure out how O(N^2) is obtained or why my solution is incorrect. Sorry if my explanation is a bit confusing. Thanks
Based on the first part of the solution, sum(size of lists) == N. This means that the second part (nested loop) cannot be more complex than O(N^2). As you said, the complexity is O(size of map)*O(size of list^2), but it should rather be:
O(size of map)*(O(size of list1^2) + O(size of list2^2) + ... )
This means, that in the worst-case scenario we will get a map of size 1, and one list of size N, and the resulting complexity of O(1)*O((N-1)^2) <==> O(N^2)
In other scenarios the complexity will be lower. For instance if we have map of 2 elements, then we will get 2 lists with the total size of N. So the result will then be:
O(2)*( O(size of list1^2) + O(size of list2^2)), where (size of list1)+(size of list2) == N
and we know from basic maths that X^2 + Y^2 <= (X+Y)^2 for positive numbers.
The complexity of the second part is O(sum of (length of list)^2 over the map). The lengths of the individual lists vary, but we know that the sum of the lengths of the lists in the map is n^2, since we definitely added n^2 pairs in the first bit of the code. Since T(program) = O(n^2) + O(sum of lengths of lists in map) * O(sum of lengths of lists in map / size of map) = O(n^2) * O(sum of lengths of lists in map / size of map), it remains to show that sum of lengths of lists in map / size of map is O(1). Doing this requires quite a bit of number theory and unfortunately I can't help you there. But do check out these links for more info on how you would go about it:
https://en.wikipedia.org/wiki/Taxicab_number
https://math.stackexchange.com/questions/1274816/numbers-that-can-be-expressed-as-the-sum-of-two-cubes-in-exactly-two-different-w
http://oeis.org/A001235
This is a very interesting question! cdo256 made some good points, I will try to explain a bit more and complete the picture.
It is more or less obvious that the key questions are: how many integers exist that can be expressed as a sum of two positive cubes in k different ways (where k >= 2), and what is the possible size of k? This number determines the sizes of the lists which are the values of map, which determine the total complexity of the program. Our "search space" is from 2 to 2 * 10^9, because c and d both iterate from 1 to 1000, so the sum of their cubes is at most 2 * 10^9. If none of the numbers in the range [2, 2 * 10^9] could be expressed as a sum of two cubes in more than one way, then the complexity of our program would be O(n^2). Why? Well, the first part is obviously O(n^2), and the second part depends on the sizes of the lists which are the values of map. But in this case all lists have size 1, and there are n^2 keys in map, which gives O(n^2).
However, that is not the case; there is the famous example of the "taxicab number" 1729, so let us return to our main question: the number of different ways to express an integer as a sum of two cubes of positive integers. This is an active field of research in number theory, and a great summary is given in Joseph H. Silverman's article Taxicabs and Sums of Two Cubes. I recommend reading it thoroughly. Current records are given here. Some interesting facts:
smallest integer that can be expressed as a sum of two cubes of positive integers in three different ways is 87,539,319
smallest integer that can be expressed as a sum of two cubes of positive integers in four different ways is 6,963,472,309,248 (> 2*10^9)
smallest integer that can be expressed as a sum of two cubes of positive integers in six different ways is 24,153,319,581,254,312,065,344 (> 2*10^9)
As you can easily see e.g. here, there are only 2184 integers in range [2, 2 * 10^9] that are expressible as a sum of two positive cubes in two or three different ways, and for k = 4,5,.. these numbers are out of our range. Therefore, the number of keys in map is very close to n^2, and sizes of the value lists are at most 3, which implies that the complexity of the code
for each pair1 in list
    for each pair2 in list
        print pair1, pair2
is constant, so the total complexity is again O(n^2).
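For concreteness, here is a straight Python transcription of that pseudocode (mine, with the prints replaced by a counter so it doesn't flood the console). In this range no list holds more than three unordered representations, i.e. at most six (c, d) pairs, so the pair-of-pairs loops do constant work per key:

from collections import defaultdict

n = 1000
pairs_by_sum = defaultdict(list)

# First part: O(n^2) work, inserting n^2 ordered pairs into the map.
# (Expect this to take a couple of seconds and a fair amount of memory.)
for c in range(1, n + 1):
    for d in range(1, n + 1):
        pairs_by_sum[c**3 + d**3].append((c, d))

# Second part: for every sum, loop over all pairs of pairs.
visited = 0
longest_list = 0
for result, pairs in pairs_by_sum.items():
    longest_list = max(longest_list, len(pairs))
    for pair1 in pairs:
        for pair2 in pairs:
            visited += 1          # stand-in for: print(pair1, pair2)

print("distinct sums:", len(pairs_by_sum))
print("longest list:", longest_list)      # stays a small constant in this range
print("pairs of pairs visited:", visited)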

Ternary search is worse than binary search?

People usually ask the opposite of this question (Is ternary search better than binary search?).
I really think it's worse (not asymptotically; both run in O(log n)).
I'm really not good with the math, just pure intuition.
If we take an array of size n and use binary search, after the first comparison we have n/2 problem size, and after the 2nd comparison we have n/4 of the problem size.
With ternary search, the first loop run already makes 2 comparisons! And we're left with a problem size of n/3.
Am I right about this, or am I missing something? The thing is, everything I read seems to count the first loop iteration of ternary search as one comparison, which I think is wrong.
As a fun exercise, let's think about d-ary search, where you look at (d - 1) evenly-spaced elements of the array, then decide which of d different intervals to recursively descend into. If you use d-ary search, the size of the array remaining at each point will be n, n / d, n / d^2, n / d^3, ..., n / d^k, ... . This means that there will be log_d n layers in the recursion. At each layer, you make exactly d - 1 comparisons. This means that the total number of comparisons made is (d - 1) log_d n. For any fixed value of d, this is O(log n), though the constant factor will be different.
With binary search, d = 2, and the number of comparisons will be (roughly) (2 - 1) log_2 n = log_2 n. For simplicity, let's write that as a natural logarithm by writing it as ln n / ln 2 = 1.443 ln n.
With ternary search, d = 3, and the number of comparisons will be (roughly) (3 - 1) log_3 n = 2 log_3 n. Writing this as a natural log, we get that this is 2 ln n / ln 3 = 1.820 ln n. Therefore, the number of comparisons in ternary search is indeed bigger than the number of comparisons in binary search, so you'd expect it to be slower than binary search. Indeed, as this article points out, that's what happens in practice.
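If you want to see those constants concretely, here is a small comparison-counting sketch (my own model of d-ary search, charging d - 1 comparisons per round of narrowing, so it ignores the early-exit-on-equality detail):

import math

def d_ary_search_comparisons(n, d):
    # Worst-case comparison count of d-ary search over n sorted elements:
    # each round costs d - 1 comparisons and shrinks the range to about n / d.
    comparisons = 0
    while n > 1:
        comparisons += d - 1
        n = math.ceil(n / d)
    return comparisons

n = 1_000_000
for d in (2, 3, 4, 10):
    predicted = (d - 1) * math.log(n, d)
    counted = d_ary_search_comparisons(n, d)
    print(f"d = {d:2d}: counted {counted:3d}, predicted (d - 1) * log_d(n) = {predicted:.1f}")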
Hope this helps!

Geometric mean pivot

I am a PhD student working on a project, and I want to know what the worst-case partition time complexity will be if I use the geometric mean as the pivot to partition an array into two approximately equal parts.
Results:
Vladimir Yaroslavskiy dual-pivot quickselect partition: 2307601193 nanoseconds
Geometric mean pivot quickselect partition: 8661916394 nanoseconds
We know that it is very costly and makes the quick partition much slower. There are many algorithms which are much faster than quickselect for finding the median, but in our project we are not going to use them directly.
Example of geometric mean pivot:
Input: 789654123, 700, 10^20, 588412, 900, 5, 500
Geometric mean: (789654123 * 700 * 10^20 * 588412 * 900 * 5 * 500)^(1/7) = 1846471
Pass 1: 500, 700, 5, 588412, 900 | <---> | 10^20, 789654123
Geometric mean: (500 * 700 * 5 * 588412 * 900)^(1/5) = 984
Pass 2: 500, 700, 5, 900 | <---> | 588412, 10^20, 789654123
In this way we can divide the array into two approximately equal parts.
My question is: what will be the worst-case (most unbalanced partitioning) time complexity if I use the geometric mean as the pivot to partition the array into two approximately equal parts?
Note: we are not using negative numbers in the data set.
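As an illustration of the scheme in the question (my own sketch, not the OP's code): one partition step around the geometric mean, computed as the exponential of the mean of the logarithms so that values like 10^20 don't require a huge intermediate product.

import math

def partition_by_geometric_mean(values):
    # Split positive values into (below_or_equal, above) around their geometric
    # mean, computed as exp(mean of logs) to avoid huge intermediate products.
    log_mean = sum(math.log(v) for v in values) / len(values)
    pivot = math.exp(log_mean)
    below = [v for v in values if v <= pivot]
    above = [v for v in values if v > pivot]
    return pivot, below, above

data = [789654123, 700, 10**20, 588412, 900, 5, 500]
pivot, below, above = partition_by_geometric_mean(data)
print(round(pivot), below, above)
# pivot ~ 1846471, below = [700, 588412, 900, 5, 500], above = [789654123, 10**20]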
The geometric mean is equivalent to the arithmetic mean of the logarithms (followed by exponentiation), so we just need to find something where the arithmetic mean breaks down badly and take the exponential of it. One example would be factorials: if you have the list
1!, 2!, 3!, 4!, ..., n!
taking the arithmetic mean will split exactly before the last element. Proof: The sum of this array is larger than the last element:
s_n > n!
Consequently the arithmetic mean is larger than the element before it:
av_n = s_n/n > (n-1)!
As a result, quickselect requires n rounds and its performance will be O(n^2), in contrast to the average performance, which would be O(n). To get the same behavior with the geometric mean, you have to consider the exponentiated version of this list:
a^(1!), a^(2!), ..., a^(n!)
for any a>1 or 0<a<1. The resulting performance of a quick-select based on the geometric mean would be O(n^2).
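A quick numerical check of the arithmetic-mean half of that argument (a throwaway script of mine): the mean of 1!, 2!, ..., n! always lies strictly between (n-1)! and n!, so every partition peels off only the single largest element.

import math

for n in range(3, 15):
    facts = [math.factorial(i) for i in range(1, n + 1)]
    mean = sum(facts) / n                    # arithmetic mean of 1!, ..., n!
    above = [f for f in facts if f > mean]
    print(n, "element(s) above the mean:", len(above))   # always exactly 1 (only n!)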
The numbers 2^(2^0), 2^(2^1), 2^(2^2), ..., 2^(2^(n-1)) have geometric mean
(2^(2^0) * 2^(2^1) * 2^(2^2) * ... * 2^(2^(n-1)))^(1/n)
= (2^(2^0 + 2^1 + ... + 2^(n-1)))^(1/n)
= (2^(2^n - 1))^(1/n)
= 2^((2^n - 1) / n)
= 2^((2^n - 1) * 2^(-log n))
= 2^(2^(n - log n) - 2^(-log n))
Notice that this number is (approximately) 2^(2^(n - log n)). This means that your partition will only split approximately log n terms into the second group of the array, which is a very small number compared to the overall array size. Consequently, you'd expect that for data sets of this sort, you'd have closer to Θ(n^2) performance than to Θ(n log n) performance. However, I can't give an exact asymptotic value for this because I don't know exactly how many rounds there will be.
Hope this helps!

How do you calculate big O on a function with a hard limit?

As part of a programming assignment I saw recently, students were asked to find the big O value of their function for solving a puzzle. I was bored, and decided to write the program myself. However, my solution uses a pattern I saw in the problem to skip large portions of the calculations.
Big O shows how the time increases with a scaling n, but as n scales, once it reaches the reset point of the pattern, the time it takes drops back to low values as well. My thought was that it was O(n log n % k), where k+1 is when it resets. Another thought is that since it has a hard limit, the value is O(1), since that is the big O of any constant. Is one of those right, and if not, how should the limit be represented?
As an example of the reset, the k value is 31336.
At n=31336, it takes 31336 steps but at n=31337, it takes 1.
The code is:
def Entry(a1, q):
    F = [a1]
    lastnum = a1
    q1 = q % 31336
    rows = (q / 31336)
    for i in range(1, q1):
        lastnum = (lastnum * 31334) % 31337
        F.append(lastnum)
    F = MergeSort(F)
    print lastnum * rows + F.index(lastnum) + 1
MergeSort is a standard merge sort with O(nlogn) complexity.
It's O(1), and you can derive this from big O's definition. If f(x) is the complexity of your solution, then
|f(x)| <= M * |g(x)|
with
g(x) = 1
and with any M > 470040 (that's n*log(n) for n = 31336) and x > 0. And this implies from the definition that
f(x) = O(g(x)) = O(1).
Well, an easy way that I use to think about big-O problems is to think of n as so big it may as well be infinity. If you don't get particular about byte-level operations on very big numbers (because computing q % 31336 would scale up as q goes to infinity and is not actually constant time), then your intuition is right about it being O(1).
Imagining q as close to infinity, you can see that q % 31336 is obviously between 0 and 31335, as you noted. This fact limits the number of array elements, which limits the sort time to be some constant amount (n * log(n) ==> 31335 * log(31335) * C, for some constant C). So it is constant time for the whole algorithm.
But, in the real world, multiplication, division, and modulus all do scale based on input size. You can look up Karatsuba algorithm if you are interested in figuring that out. I'll leave it as an exercise.
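Sticking with the simplified constant-cost-arithmetic model, here is a tiny sketch of mine that counts how many elements end up in F for growing q; the count is capped at 31335, which is why the sort term stays bounded by a constant:

def elements_sorted(q):
    # Length of the list F that Entry builds for a given q (per the code above).
    q1 = q % 31336
    return q1 if q1 >= 1 else 1   # F starts with a1 and gains q1 - 1 appends

for q in (10, 1_000, 31_335, 31_336, 10**6, 10**12):
    print(q, "->", elements_sorted(q))   # never exceeds 31335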
If there are a few different instances of this problem, each with its own k value, then the complexity of the method is not O(1), but instead O(k·ln k).

Choosing minimum length k of array for merge sort where use of insertion sort to sort the subarrays is more optimal than standard merge sort

This is a question from Introduction to Algorithms by Cormen, but it isn't a homework problem, just self-study.
There is an array of length n. Consider a modification to merge sort in which n/k sublists each of length k are sorted using insertion sort and then merged using merging mechanism, where k is a value to be determined.
The relationship between n and k isn't known. The length of the array is n. n/k sublists of length k means (n/k) * k equals the n elements of the array. Hence k is simply the limit at which the splitting of the array for merge sort is stopped and insertion sort is used instead, because of its smaller constant factor.
I was able to do the mathematical proof that the modified algorithm works in Θ(n*k + n*lg(n/k)) worst-case time. Now the book went on to say to
find the largest value of k as a function of n for which this modified algorithm has the same running time as standard merge sort, in terms of Θ notation. How should we choose k in practice?
Now this got me thinking for a long time, but I couldn't come up with anything. I tried to solve
n*k + n*lg(n/k) = n*lg(n) for a relationship. I thought that finding an equality of the two running times would give me the limit, and anything greater could be checked by simple trial and error.
I solved it like this
n k + n lg(n/k) = n lg(n)
k + lg(n/k) = lg(n)
lg(2^k) + lg(n/k) = lg(n)
(2^k * n)/k = n
2^k = k
But it gave me 2 ^ k = k which doesn't show any relationship. What is the relationship? I think I might have taken the wrong equation for finding the relationship.
I can implement the algorithm, and I suppose adding an if (length_Array < k) statement in the merge_sort function here (GitHub link of merge sort implementation) to call insertion sort would be good enough. But how do I choose k in real life?
Well, this is a mathematical minimization problem, and to solve it, we need some basic calculus.
We need to find the value of k for which d[n*k + n*lg(n/k)] / dk == 0.
We should also check for the edge cases, which are k == n, and k == 1.
The candidate for the value of k that will give the minimal result for n*k + n*lg(n/k) is the minimum in the required range, and is thus the optimal value of k.
Attachment: solving the derivative equation:
d[n*k + n*lg(n/k)] / dk = d[n*k + nlg(n) - nlg(k)] / dk
= n + 0 - n*1/k = n - n/k
=>
n - n/k = 0 => n = n/k => 1/k = 1 => k = 1
Now, we have the candidates: k=n, k=1. For k=n we get O(n^2), thus we conclude optimal k is k == 1.
Note that we took the derivative of the function from the big Theta, and not of the exact complexity function with the needed constants.
Doing this on the exact complexity function, with all the constants, might yield a slightly different end result, but the way to solve it is pretty much the same; you just take the derivative of a different function.
Maybe k should be lg(n).
Θ(n*k + n*lg(n/k)) has two terms; with the assumption that k >= 1, the second term is less than n*lg(n).
Only when k = lg(n) is the whole result Θ(n*lg(n)).
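To make the "how do I choose k in practice" part concrete, here is a hedged sketch (my own code, not from CLRS): a merge sort that hands sub-arrays shorter than k to insertion sort, plus a crude timing loop you can run on your own machine and data to pick k empirically.

import random
import time

def insertion_sort(a, lo, hi):
    # Sort a[lo:hi] in place by straight insertion.
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def merge(left, right):
    # Standard two-way merge of two sorted lists into a new list.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def hybrid_merge_sort(a, lo=0, hi=None, k=32):
    # Merge sort a[lo:hi], switching to insertion sort below length k.
    if hi is None:
        hi = len(a)
    if hi - lo <= k:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(a, lo, mid, k)
    hybrid_merge_sort(a, mid, hi, k)
    a[lo:hi] = merge(a[lo:mid], a[mid:hi])

# Crude empirical search for a good k on this machine and this data.
data = [random.random() for _ in range(100_000)]
for k in (1, 8, 16, 32, 64, 128):
    trial = list(data)
    start = time.perf_counter()
    hybrid_merge_sort(trial, k=k)
    elapsed = time.perf_counter() - start
    assert trial == sorted(data)
    print(f"k = {k:3d}: {elapsed:.3f} s")

The winning k depends on the element type, the comparison cost, and the language runtime, which is exactly why the exercise asks you to choose it in practice rather than purely analytically.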
