One problem to cover all the time complexities - algorithm

A college instructor here. I am trying to find a meaningful (practical) code example to illustrate different time complexities for beginners in an ELI5 manner. The code should start with constant complexity and then, by adding a small piece of code at each step, increase in complexity: ..., log n, n, n log n, n^2, 2^n, ...
I think I can explain it better with one example that has small incremental changes rather than switching the context from searching to sorting to brute-force algorithms.

Any example will be artificial. But here is one that does reasonably well.
Let vec be a sorted array of numbers, i an integer, and x another number. In order, answer the following questions.
O(1) What is the value of vec[i]?
O(n) Is x in a range of vec, by linear search?
O(log(n)) Is x in a range of vec, by binary search?
O(n^2) Is x the sum of two elements in a range of vec, by a double loop?
O(n log(n)) Is x the sum of two elements of vec, by a linear scan over the first element with a binary search for the second? (Simplifying trick: do the linear scan over one element and the binary search for its complement, then reuse your code from step 3.)
O(2^n) Is x the sum of any subset of the elements of vec, by recursion?
(pseudopolynomial) Memoize the previous solution. Discuss memory-vs-speed tradeoffs.
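For reference, here is a minimal C++ sketch of several of these steps, assuming vec is a sorted std::vector<int>; the function names are illustrative, not part of the exercise.

#include <algorithm>
#include <cstddef>
#include <vector>

// O(1): index access.
int value_at(const std::vector<int>& vec, std::size_t i) { return vec[i]; }

// O(n): linear search.
bool contains_linear(const std::vector<int>& vec, int x) {
    for (int v : vec)
        if (v == x) return true;
    return false;
}

// O(log n): binary search on the sorted array.
bool contains_binary(const std::vector<int>& vec, int x) {
    return std::binary_search(vec.begin(), vec.end(), x);
}

// O(n log n): for each element, binary-search for its complement.
// (For simplicity this may pair an element with itself.)
bool pair_sum(const std::vector<int>& vec, int x) {
    for (int v : vec)
        if (contains_binary(vec, x - v)) return true;
    return false;
}

// O(2^n): does any subset (possibly empty) sum to x? Naive include/exclude recursion.
bool subset_sum(const std::vector<int>& vec, std::size_t i, int x) {
    if (x == 0) return true;
    if (i == vec.size()) return false;
    return subset_sum(vec, i + 1, x)            // exclude vec[i]
        || subset_sum(vec, i + 1, x - vec[i]);  // include vec[i]
}

Memoizing subset_sum on the pair (i, x) is the pseudopolynomial refinement in the last step.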

Related

K Closest with unsorted array

I am prepping for interview leet-code type problems and I came across the k closest problem, but given a sorted array. This problem requires finding the k closest elements by value to an input value from the array. The answer to this problem was fairly straightforward and I did not have any issues determining a linear-time algorithm to solve it.
However, working on this problem got me thinking: is it possible to solve this problem given an unsorted array in linear time? My first thought was to use a heap, which would give an O(n log k) solution, but I am trying to determine whether an O(n) solution is possible. I was thinking about using something like quickselect, but the issue is that it has an expected time of O(n), not a worst-case time of O(n).
Is this even possible?
The median-of-medians algorithm makes Quickselect take O(n) time in the worst case.
It is used to select a pivot:
Divide the array into groups of 5 (O(n))
Find the median of each group (O(n))
Use Quickselect to find the median of the n/5 medians (O(n))
The resulting pivot is guaranteed to be greater than at least 30% of the elements and less than at least another 30%, which is what guarantees linear-time Quickselect.
After selecting the pivot, of course, you have to continue on with the rest of Quickselect, which includes a recursive call like the one we made to select the pivot.
The worst-case total time is T(n) = O(n) + T(0.7n) + T(n/5), which is still linear because the recursive fractions sum to 0.7 + 0.2 < 1. Compared to the expected time of normal Quickselect, though, it's pretty slow, which is why we don't often use it in practice.
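For concreteness, here is a compact sketch of median-of-medians feeding a worst-case linear Quickselect; the function names and the copy-based three-way partition are simplifications for illustration, not tuned code.

#include <algorithm>
#include <cstddef>
#include <vector>

int select_kth(std::vector<int> a, std::size_t k);  // forward declaration

// Median of a group of at most 5 elements, by sorting the small group.
static int median_of_group(std::vector<int> g) {
    std::sort(g.begin(), g.end());
    return g[g.size() / 2];
}

// Median-of-medians pivot: the median of the group medians, found recursively.
static int choose_pivot(const std::vector<int>& a) {
    std::vector<int> medians;
    for (std::size_t i = 0; i < a.size(); i += 5)
        medians.push_back(median_of_group(std::vector<int>(
            a.begin() + i, a.begin() + std::min(i + 5, a.size()))));
    return select_kth(medians, medians.size() / 2);
}

// Worst-case O(n): returns the k-th smallest element (0-based), assuming 0 <= k < a.size().
int select_kth(std::vector<int> a, std::size_t k) {
    if (a.size() <= 5) {
        std::sort(a.begin(), a.end());
        return a[k];
    }
    int pivot = choose_pivot(a);
    std::vector<int> lo, eq, hi;                  // three-way partition around the pivot, O(n)
    for (int v : a) {
        if (v < pivot) lo.push_back(v);
        else if (v > pivot) hi.push_back(v);
        else eq.push_back(v);
    }
    if (k < lo.size()) return select_kth(lo, k);
    if (k < lo.size() + eq.size()) return pivot;
    return select_kth(hi, k - lo.size() - eq.size());
}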
Your heap solution would be very welcome at an interview, I'm sure.
If you really want to get rid of the log k, which in practical applications should seldom be a problem, then yes, Quickselect would be another option. Something like this:
Partition your array into values smaller and larger than x. <- O(n)
For the lower half, run Quickselect to find the k-th largest number, then take the right-side partition, which contains your k largest numbers. <- O(n)
Repeat step 2 for the higher half, but for the k smallest numbers. <- O(n)
Merge your k smallest and k largest numbers and extract the k closest numbers. <- O(k)
This gives you a total time complexity of O(n), as you said.
However, a few points about your worry about expected time vs worst-case time. I understand that if an interview question explicitly insists on worst-case O(n), then this solution might not be accepted, but otherwise, this can well be considered O(n) in practice.
The key point is that for randomized quickselect on random or well-behaved input, the probability that the running time exceeds O(n) decreases exponentially as the input grows. This means that already for largish inputs, the probability is as small as that of guessing a specific atom in the known universe. The assumption of well-behaved input means input that is somewhat random in nature and not adversarial. See this discussion of a similar (not identical) problem.
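If a worst-case guarantee is not strictly required, the same idea can be sketched in a few lines with std::nth_element (typically introselect, expected O(n)), selecting directly by distance to x; the function name is an assumption for illustration.

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns k elements of the unsorted array a closest in value to x (in no particular order).
std::vector<int> k_closest(std::vector<int> a, int x, std::size_t k) {
    if (k >= a.size()) return a;
    auto dist = [x](int v) { return v > x ? v - x : x - v; };   // |v - x|
    std::nth_element(a.begin(), a.begin() + k, a.end(),
                     [&dist](int lhs, int rhs) { return dist(lhs) < dist(rhs); });
    return std::vector<int>(a.begin(), a.begin() + k);          // the k smallest distances
}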

A linear algorithm for this specification?

This is a question I got somewhere.
Given a list of numbers in random order, write a linear-time algorithm to find the k-th smallest number in the list. Explain why your algorithm is linear.
I have searched almost half the web, and what I learned is that a linear-time algorithm is one whose time complexity is O(n). (I may be wrong somewhere.)
We can solve the above question with different algorithms, e.g.:
Sort the array and select the element at index k-1 [O(n log n)]
Use a min-heap [O(n + k log n)]
etc.
Now the problem is that I couldn't find any algorithm with O(n) time complexity, i.e. one that satisfies the requirement of being linear.
What can be the solution for this problem?
This is std::nth_element
From cppreference:
Notes
The algorithm used is typically introselect although other selection algorithms with suitable average-case complexity are allowed.
"Given a list of numbers": note, though, that std::nth_element is not compatible with std::list, only with std::vector, std::deque and std::array, as it requires a RandomAccessIterator.
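A minimal usage sketch, assuming the numbers are stored in a std::vector and k is 0-based:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v = {7, 3, 9, 1, 4, 8, 2};
    std::size_t k = 2;                               // 0-based: the third smallest
    std::nth_element(v.begin(), v.begin() + k, v.end());
    std::cout << "k-th smallest: " << v[k] << '\n';  // prints 3
}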
A linear search remembering the k smallest values is O(n*k), but if k is considered a constant then it is O(n) time.
However, if k is not considered a constant, then using a histogram leads to O(n + m log(m)) time and O(m) space complexity, where m is the number of possible distinct values (the value range) in your input data. The algorithm is like this:
create histogram counters for each possible value and set them to zero (O(m))
process all the data and count the values (O(n))
sort the histogram (O(m log(m)))
pick the k-th element from the histogram (O(1))
In case we are talking about unsigned integers from 0 to m-1, the histogram is computed like this:
int data[n] = { /* your data */ };          // n input values, each in the range 0 .. m-1
int cnt[m];                                 // one counter per possible value
for (int i = 0; i < m; i++) cnt[i] = 0;     // O(m)
for (int i = 0; i < n; i++) cnt[data[i]]++; // O(n)
However, if your input data does not satisfy the above condition, you need to remap the range by interpolation or hashing. And if m is huge (or the range contains huge gaps), this is a no-go, as such a histogram would either have to use buckets (which is not usable for your problem) or a list of values, which no longer leads to linear complexity.
So, putting all this together, your problem is solvable with linear complexity when:
n >= m log(m)
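As a small addition: assuming the values really are integers in 0..m-1, the counters are already ordered by value, so the k-th smallest can be read off with a single O(m) scan and no extra sort (the function name is illustrative):

// Picks the k-th smallest value (1-based k) from the counters built above.
int kth_smallest(const int cnt[], int m, int k) {
    for (int v = 0; v < m; v++) {   // walk the values in increasing order, O(m)
        k -= cnt[v];                // consume all occurrences of value v
        if (k <= 0) return v;       // the k-th smallest falls on value v
    }
    return -1;                      // k exceeds the number of elements
}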

Can my algorithm be done any better?

I have been presented with a challenge to make the most efficient algorithm that I can for a task. Right now I have arrived at a complexity of n log n, and I was wondering if it is even possible to do better. The task is basically this: there are kids playing a counting-out game. You are given a number n, which is the number of kids, and m, which is how many kids you skip before you execute someone. You need to return a list that gives the execution order. I tried to do it like this, using a skip list:
current = m
while table.size > 0:
    executed.add(table[current % table.size])
    table.remove(current % table.size)
    current += m
My questions are: is this correct? Is it n log n? And can you do it better?
Is this correct?
No.
When you remove an element from the table, table.size decreases, and the expression current % table.size generally ends up pointing at a different, unrelated element.
For example, 44 % 11 is 0 but 44 % 10 is 4, an element in a totally different place.
Is it n*logn?
No.
If table is just a random-access array, it can take n operations to remove an element.
For example, if m = 1, the program, after fixing the point above, would always remove the first element of the array.
With a naive array implementation, it takes table.size operations to shift the array each time, leading to about n^2 / 2 operations in total.
Now, it would be n log n if table were backed, for example, by a balanced binary search tree with implicit indexes instead of keys, plus split and merge primitives. A treap is one example; here is what a quick search turns up for an English source.
Such a data structure could be used as an array with O(log n) costs for access, merge and split.
But nothing so far suggests this is the case, and there is no such data structure in most languages' standard libraries.
Can you do it better?
Correction: partially, yes; fully, maybe.
If we solve the problem backwards, we have the following sub-problem.
Let there be a circle of k kids, and the pointer is currently at kid t.
We know that, just a moment ago, there was a circle of k + 1 kids, but we don't know where, at which kid x, the pointer was.
Then we counted to m, removed the kid, and the pointer ended up at t.
Whom did we just remove, and what is x?
It turns out the "what is x" part can be solved in O(1) (a drawing can be helpful here), so finding the last kid standing is doable in O(n).
As pointed out in the comments, the whole thing is called the Josephus Problem, and its variants are studied extensively, e.g., in Concrete Mathematics by Graham, Knuth, and Patashnik.
However, in O(1) per step, this only finds the number of the last standing kid.
It does not automatically give the whole order of counting the kids out.
There certainly are ways to make it O(log(n)) per step, O(n log(n)) in total.
But as for O(1), I don't know at the moment.
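For the last kid standing only, that backward step is the classic Josephus recurrence, which gives the O(n) solution directly. In the sketch below, the indexing convention is an assumption: kids are numbered 0..n-1, kid m is executed first (as in the pseudocode above), and m is the number of kids skipped before each execution.

// O(n): position of the last kid standing, undoing one removal per step in O(1).
int last_kid_standing(int n, int m) {
    int x = 0;                       // with a single kid, the survivor is at position 0
    for (int k = 2; k <= n; ++k)     // grow the circle back from 1 kid to n kids
        x = (x + m + 1) % k;         // every (m+1)-th kid is removed
    return x;                        // 0-based index of the survivor
}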
The complexity of your algorithm depends on the complexity of the operations
executed.add(..) and table.remove(..).
If both of them have complexity of O(1), your algorithm has complexity of O(n) because the loop terminates after n steps.
While executed.add(..) can easily be implemented in O(1), table.remove(..) needs a bit more thinking.
You can make the whole thing run in O(n):
Store your persons in a LinkedList and connect the last element with the first. Removing an element costs O(1).
Going to the next person to choose would cost O(m), but that is a constant, i.e. O(1).
This way the algorithm has the complexity of O(n*m) = O(n) (for constant m).
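A small sketch of that circular-list idea using std::list, wrapping the iterator around manually; the function name is illustrative. Overall it is O(n*m), i.e. O(n) for constant m, and it produces the full execution order.

#include <list>
#include <numeric>
#include <vector>

// Execution order for n kids (labeled 0..n-1), skipping m kids before each execution.
std::vector<int> execution_order(int n, int m) {
    std::list<int> table(n);
    std::iota(table.begin(), table.end(), 0);       // kids 0..n-1 arranged in a circle

    std::vector<int> executed;
    auto it = table.begin();
    while (!table.empty()) {
        for (int step = 0; step < m; ++step) {      // skip m kids, wrapping around
            ++it;
            if (it == table.end()) it = table.begin();
        }
        executed.push_back(*it);
        it = table.erase(it);                       // O(1) removal from the linked list
        if (it == table.end()) it = table.begin();  // keep the circle closed
    }
    return executed;
}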

Finding the m Largest Numbers

This is a problem from the Cormen text, but I'd like to see if there are any other solutions.
Given an array with n distinct numbers, you need to find the m largest ones in the array, and have them in sorted order. Assume n and m are large, but grow differently. In particular, you need to consider the situations where m = t*n, with t a small number, say 0.1, and then the possibility m = √n.
The solution given in the book offers 3 options:
Sort the array and return the top m-long segment
Convert the array to a max-heap and extract the m elements
Select the m-th largest number, partition the array about it, and sort the segment of larger entries.
These all make sense, and they all have their pros and cons, but I'm wondering, is there another way to do it? It doesn't have to be better or faster, I'm just curious to see if this is a common problem with more solutions, or if we are limited to those 3 choices.
The time complexities of the three approaches you have mentioned are as follows.
(1) O(n log n)
(2) O(n + m log n)
(3) O(n + m log m)
So option (3) is definitely better than the others in terms of asymptotic complexity, since m <= n. When m is small, the difference between (2) and (3) is so small it would have little practical impact.
As for other ways to solve the problem, there are infinitely many, so the question is somewhat open-ended in this regard. Another approach that I consider practically simple and performant is the following.
Extract the first m numbers from your list of n into an array, and sort it.
Repeatedly grab the next number from your list and insert it into the correct location in the array, shifting all the lesser numbers over by one and pushing one out.
I would only do this if m was very small though. Option (2) from your original list is also extremely easy to implement if you have a max-heap implementation and will work great.
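A rough sketch of that idea, built incrementally rather than sorting the first m up front; the function name is illustrative. Worst case it is O(n*m), so it only pays off for very small m.

#include <cstddef>
#include <vector>

// Keeps the m largest values seen so far in a descending-sorted vector.
std::vector<int> m_largest_by_insertion(const std::vector<int>& a, std::size_t m) {
    std::vector<int> best;                                   // sorted in descending order
    for (int v : a) {
        std::size_t pos = 0;
        while (pos < best.size() && best[pos] >= v) ++pos;   // find the insertion point
        if (pos < m) {
            best.insert(best.begin() + pos, v);              // shift lesser numbers over by one
            if (best.size() > m) best.pop_back();            // push the smallest one out
        }
    }
    return best;                                             // the m largest, already sorted
}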
A different approach.
Take the first m numbers and turn them into a min-heap. Run through the rest of the array; whenever a value exceeds the minimum of the current top m, extract the minimum and insert the new value. When you reach the end of the array, extract the elements into an array and reverse it.
The worst case performance of this version is O(n log(m)) placing it between the first and second methods for efficiency.
The average case is more interesting. On average only O(m log(n/m)) of the elements are going to pass the first comparison test, each time incurring O(log(m)) work, so you get O(n + m log(n/m) log(m)) work, which puts it between the second and third methods. But if n is many orders of magnitude greater than m, then the O(n) piece dominates, and the O(n) median select in the third approach has worse constants than the one comparison per element in this approach, so in that case this is actually the fastest!
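A compact sketch of this min-heap variant, using std::priority_queue with std::greater as the min-heap; the function name is an assumption. Worst case O(n log m).

#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the m largest values of a in descending order.
std::vector<int> m_largest(const std::vector<int>& a, std::size_t m) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;  // min-heap of the current top m
    for (int v : a) {
        if (heap.size() < m) heap.push(v);
        else if (v > heap.top()) {        // usually just one comparison per element
            heap.pop();
            heap.push(v);
        }
    }
    std::vector<int> out;
    while (!heap.empty()) {               // extract in ascending order...
        out.push_back(heap.top());
        heap.pop();
    }
    std::reverse(out.begin(), out.end()); // ...then reverse to descending
    return out;
}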

Does quicksort with randomized median-of-three do appreciably better than randomized quicksort?

I was just answering a question about different approaches for picking the partition in a quicksort implementation and came up with a question that I honestly don't know how to answer. It's a bit math-heavy, and this may be the wrong site on which to ask this, so if this needs to move please let me know and I'll gladly migrate it elsewhere.
It's well-known that a quicksort implementation that picks its pivots uniformly at random will end up running in expected O(n lg n) time (there's a nice proof of this on Wikipedia). However, due to the cost of generating random numbers, many quicksort implementations don't pick pivots randomly, but instead rely on a "median-of-three" approach in which three elements are chosen deterministically and the median of those is used as the pivot. This is known to degenerate to O(n^2) in the worst case (see this great paper on how to generate those worst-case inputs, for example).
Now, suppose that we combine these two approaches by picking three random elements from the sequence and using their median as the choice of pivot. I know that this also guarantees O(n lg n) average-case runtime, using a slightly different proof than the one for the regular randomized quicksort. However, I have no idea what the constant factor in front of the n lg n term is in this particular quicksort implementation. For regular randomized quicksort, Wikipedia lists the expected cost as at most 1.39 n lg n comparisons (using lg as the binary logarithm).
My question is this: does anyone know of a way to derive the constant factor for the number of comparisons made using a "median-of-three" randomized quicksort? If we go even more generally, is there an expression for the constant factor on quicksort using a randomized median-of-k approach? I'm curious because I think it would be fascinating to see if there is some "sweet spot" of this approach that makes fewer comparisons than other randomized quicksort implementations. I mean, wouldn't it be cool to be able to say that randomized quicksort with a randomized median-of-six pivot choice makes the fewest comparisons? Or be able to conclusively say that you should just pick a pivot element at random?
Here's a heuristic derivation of the constant. I think it can be made rigorous, with a lot more effort.
Let P be a continuous random variable with values in [0, 1]. Intuitively, P is the fraction of values less than the pivot. We're looking to find the constant c such that
c n lg n = E[n + c P n lg (P n) + c (1 - P) n lg ((1 - P) n)].
A little bit of algebra later, we have
c = 1/E[-P lg P - (1 - P) lg (1 - P)].
In other words, c is the reciprocal of the expected entropy of the Bernoulli distribution with mean P. Intuitively, for each element, we need to compare it to pivots in a way that yields about lg n bits of information.
When P is uniform, the pdf of P is 1. The constant is
In[1]:= -1/NIntegrate[x Log[2, x] + (1 - x) Log[2, 1 - x], {x, 0, 1}]
Out[1]= 1.38629
When the pivot is a median of 3, the pdf of P is 6 x (1 - x). The constant is
In[2]:= -1/NIntegrate[6 x (1 - x) (x Log[2, x] + (1 - x) Log[2, 1 - x]), {x, 0, 1}]
Out[2]= 1.18825
The constant for the usual randomized quicksort is easy to compute, because the probability that two elements k locations apart are compared is exactly 2/(k+1): the probability that one of those two elements is chosen as a pivot before any of the k-1 elements between them. Unfortunately nothing so clever applies to your algorithm.
I'm hesitant to attempt your bolded question because I can answer your "underlying" question: asymptotically speaking, there is no "sweet spot". The total added cost of computing medians of k elements, even O(n^(1-ε)) elements, is linear, and the constant for the n log n term decreases with the array being split more evenly. The catch is of course constants on the linear term that are spectacularly impractical, highlighting one of the drawbacks of asymptotic analysis.
Based on my comments below, I guess k = O(n^α) for 0 < α < 1 is the "sweet spot".
If the initial state of the set is randomly ordered, you will get the exact same constant factor for randomly picking three items to calculate the median as when picking three items deterministically.
The motive for picking items at random would be that the deterministic method gives a result worse than the average. If the deterministic method gives a good median, you can't improve on it by picking items at random.
So, whichever method gives the best result depends on the input data; it can't be determined for every possible set.
The only sure way to lower the constant factor is to increase the number of items that you use to calculate the median, but at some point calculating the median will be more expensive than what you gain by getting a better median value.
Yes, it does. Bentley and McIlroy, authors of the C standard library's qsort function, give the following numbers in their paper Engineering a Sort Function:
1.386 n lg n average comparisons using first, middle or a randomized pivot
1.188 n lg n average comparisons using a median of 3 pivot
1.094 n lg n average comparisons using a median of 3 medians pivot
According to the above paper:
Our final code therefore chooses the middle element of smaller arrays, the median of the first, middle and last elements of a mid-sized array, and the pseudo-median of nine evenly spaced elements of a large array.
Just a thought: if you use the median-of-three approach and you find it to be better, why not use a median-of-five or median-of-eleven approach? And while you are at it, maybe one can think of a median-of-n optimization... hmmm... OK, that is obviously a bad idea (since you would have to sort your sequence for that...).
Basically, to choose your pivot element as the median of m elements, you sort those m elements, right? Therefore, I'd simply guess that one of the constants you are looking for is "2": by first sorting 3 elements to choose your pivot, how many additional comparisons do you execute? Let's say it's 2. You do this inside quicksort over and over again. A basic conclusion would be that median-of-3 is therefore 2 times slower than the simple random quicksort.
But what is working for you here? That you get a better divide-and-conquer distribution, and that you are better protected against the degenerate case (a bit).
So, back to my infamous question at the beginning: why not choose the pivot element from a median of m, m being 5, 7, n/3, or so? There must be a sweet spot where the sorting of the m elements is worse than the gain from the better divide-and-conquer behavior of quicksort. I guess this sweet spot comes very early; you first have to fight against the constant factor of 2 comparisons if you choose median-of-3. It is worth an experiment, I admit, but I would not expect too much from the result :-) But if I am wrong, and the gain is huge: don't stop at 3!

Resources