Take k smallest numbers from a list time complixity in Haskell [duplicate] - algorithm

I came across following sentence on Real World Haskell:
Lazy evaluation has some spooky effects. Let's say we want to find the
k least-valued elements of an unsorted list. In a traditional
language, the obvious approach would be to sort the list and take the
first k elements, but this is expensive. For efficiency, we would
instead write a special function that takes these values in one pass,
and it would have to perform some moderately complex book-keeping. In
Haskell, the sort-then-take approach actually performs well: laziness
ensures that the list will only be sorted enough to find the k minimal
elements.
And they give an code implemntation for that:
minima k xs = take k (sort xs)
But is that so ? I think even in Haskell it should do a full sort of the list to take out the k elements. ( Imagine having the smallest number at the end of the list ). Am I missing something here ?

(Just putting my comment to answer here) Traversing the whole list is not equivalent to sorting it. Assume doing quicksort where you partition the list around the pivot and then recursively sort left and right half of the list. If you are asking just for elements on the left half then there is no need to sort the right half. This argument can be applied recursively. Thus if you are taking k elements from n length list after sorting, the complexity is O(n + klog k) and not O(n log n). As pointed by #MoFu, Heinrich has a good blog post here.

Related

Quicksort to already sorted array

In this question: https://www.quora.com/What-is-randomized-quicksort
Alejo Hausner told in: Cost of quicksort, in the worst case, that
Ironically, if you apply quicksort to an array that is already sorted, you will get probably get this costly behavior
I cannot get it. Can someone explain it to me.
https://www.quora.com/What-will-be-the-complexity-of-quick-sort-if-array-is-already-sorted may be answer to this, but that did not get me a complete response.
The Quicksort algorithm is this:
select a pivot
move elements smaller than the pivot to the beginning, and elements larger than pivot to the end
now the array looks like [<=p, <=p, <=p, p, >p, >p, >p]
recursively sort the first and second "halves" of the array
Quicksort will be efficient, with a running time close to n log n, if the pivot always end up close to the middle of the array. This works perfectly if the pivot is the median value. But selecting the actual median would be costly in itself. If the pivot happens, out of bad luck, to be the smallest or largest element in the array, you'll get an array like this: [p, >p, >p, >p, >p, >p, >p]. If this happens too often, your "quicksort" effectively behaves like selection sort. In that case, since the size of the subarray to be recursively sorted only reduces by 1 at every iteration, there will be n levels of iteration, each one costing n operations, so the overall complexity will be `n^2.
Now, since we're not willing to use costly operations to find a good pivot, we might as well pick an element at random. And since we also don't really care about any kind of true randomness, we can just pick an arbitrary element from the array, for instance the first one.
If the array was shuffled uniformly at random, then picking the first element is great. You can reasonably hope it will regularly give you an "average" element. But if the array was already sorted... Then by definition the first element is the smallest. So we're in the bad case where the complexity is n^2.
A simple way to avoid "bad lists" is to pick a true random element instead of an arbitrary element. Or if you have reasons to believe that quicksort will often be called on lists that are almost sorted, you could pick the element in position n/2 instead of the one in position 1.
There are also several research papers about different ways to select the pivot, with precise calculations on the impact on complexity. For instance, you could pick three random elements, rank them from smallest to largest and keep the middle one. But the conclusion usually is: if you try to write a better pivot-selection, then it will also be more costly, and the overall complexity of the algorithm won't be improved that much.
Depending on the implementations there are several 'common' ways to choose the pivot.
In general for 'unsorted' source there is no good or bad way to choose it.
So some implementations just take the first element as pivot.
In the case of a already sorted source this results in the worst pivot possible because the lest interval will always be empty.
-> recursion steps = O(n) instead the desired O(log n).
This leads to O(n²) complexity, which is very bad for sorting.
Choosing the pivot by random avoids this behavior. It is extremely unlikely that the random chosen pivot will have the same bad characteristics in every recursion as described above.
Also on purpose bad source is not possible to generate because you cannot predict the choices of the random generator (if it's a good one)

Can my algorithm be done any better?

I have been presented with a challenge to make the most effective algorithm that I can for a task. Right now I came to the complexity of n * logn. And I was wondering if it is even possible to do it better. So basically the task is there are kids having a counting out game. You are given the number n which is the number of kids and m which how many times you skip someone before you execute. You need to return a list which gives the execution order. I tried to do it like this you use skip list.
Current = m
while table.size>0:
executed.add(table[current%table.size])
table.remove(current%table.size)
Current += m
My questions are is this correct? Is it n*logn and can you do it better?
Is this correct?
No.
When you remove an element from the table, the table.size decreases, and current % table.size expression generally ends up pointing at another irrelevant element.
For example, 44 % 11 is 0 but 44 % 10 is 4, an element in a totally different place.
Is it n*logn?
No.
If table is just a random-access array, it can take n operations to remove an element.
For example, if m = 1, the program, after fixing the point above, would always remove the first element of the array.
When an array implementation is naive enough, it takes table.size operations to relocate the array each time, leading to a total to about n^2 / 2 operations in total.
Now, it would be n log n if table was backed up, for example, by a balanced binary search tree with implicit indexes instead of keys, as well as split and merge primitives. That's a treap for example, here is what results from a quick search for an English source.
Such a data structure could be used as an array with O(log n) costs for access, merge and split.
But nothing so far suggests this is the case, and there is no such data structure in most languages' standard libraries.
Can you do it better?
Correction: partially, yes; fully, maybe.
If we solve the problem backwards, we have the following sub-problem.
Let there be a circle of k kids, and the pointer is currently at kid t.
We know that, just a moment ago, there was a circle of k + 1 kids, but we don't know where, at which kid x, the pointer was.
Then we counted to m, removed the kid, and the pointer ended up at t.
Whom did we just remove, and what is x?
Turns out the "what is x" part can be solved in O(1) (drawing can be helpful here), so the finding the last kid standing is doable in O(n).
As pointed out in the comments, the whole thing is called Josephus Problem, and its variants are studied extensively, e.g., in Concrete Mathematics by Knuth et al.
However, in O(1) per step, this only finds the number of the last standing kid.
It does not automatically give the whole order of counting the kids out.
There certainly are ways to make it O(log(n)) per step, O(n log(n)) in total.
But as for O(1), I don't know at the moment.
Complexity of your algorithm depends on the complexity of the operations
executed.add(..) and table.remove(..).
If both of them have complexity of O(1), your algorithm has complexity of O(n) because the loop terminates after n steps.
While executed.add(..) can easily be implemented in O(1), table.remove(..) needs a bit more thinking.
You can make it in O(n):
Store your persons in a LinkedList and connect the last element with the first. Removing an element costs O(1).
Goging to the next person to choose would cost O(m) but that is a constant = O(1).
This way the algorithm has the complexity of O(n*m) = O(n) (for constant m).

Finding the m Largest Numbers

This is a problem from the Cormen text, but I'd like to see if there are any other solutions.
Given an array with n distinct numbers, you need to find the m largest ones in the array, and have
them in sorted order. Assume n and m are large, but grow differently. In particular, you need
to consider below the situations where m = t*n, where t is a small number, say 0.1, and then the
possibility m = √n.
The solution given in the book offers 3 options:
Sort the array and return the top m-long segment
Convert the array to a max-heap and extract the m elements
Select the m-th largest number, partition the array about it, and sort the segment of larger entries.
These all make sense, and they all have their pros and cons, but I'm wondering, is there another way to do it? It doesn't have to be better or faster, I'm just curious to see if this is a common problem with more solutions, or if we are limited to those 3 choices.
The time complexities of the three approaches you have mentioned are as follows.
O(n log n)
O(n + m log n)
O(n + m log m)
So option (3) is definitely better than the others in terms of asymptotic complexity, since m <= n. When m is small, the difference between (2) and (3) is so small it would have little practical impact.
As for other ways to solve the problem, there are infinitely many ways you could, so the question is somewhat poor in this regard. Another approach I can think of as being practically simple and performant is the following.
Extract the first m numbers from your list of n into an array, and sort it.
Repeatedly grab the next number from your list and insert it into the correct location in the array, shifting all the lesser numbers over by one and pushing one out.
I would only do this if m was very small though. Option (2) from your original list is also extremely easy to implement if you have a max-heap implementation and will work great.
A different approach.
Take the first m numbers, and turn them into a min heap. Run through the array, if its value exceeds the min of the top m then you extract the min value and insert the new one. When you reach the end of the array you can then extract the elements into an array and reverse it.
The worst case performance of this version is O(n log(m)) placing it between the first and second methods for efficiency.
The average case is more interesting. On average only O(m log(n/m)) of the elements are going to pass the first comparison test, each time incurring O(log(m)) work so you get O(n + m log(n/m) log(m)) work, which puts it between the second and third methods. But if n is many orders of magnitude greater than m then the O(n) piece dominates, and the O(n) median select in the third approach has worse constants than the one comparison per element in this approach, so in this case this is actually the fastest!

Partial selection sort vs Mergesort to find "k largest in array"

I was wondering if my line of thinking is correct.
I'm preparing for interviews (as a college student) and one of the questions I came across was to find the K largest numbers in an array.
My first thought was to just use a partial selection sort (e.g. scan the array from the first element and keep two variables for the lowest element seen and its index and swap with that index at the end of the array and continue doing so until we've swapped K elements and return a copy of the first K elements in that array).
However, this takes O(K*n) time. If I simply sorted the array using an efficient sorting method like Mergesort, it would only take O(n*log(n)) time to sort the entire array and return the K largest numbers.
Is it good enough to discuss these two methods during an interview (comparing log(n) and K of the input and going with the smaller of the two to compute the K largest) or would it be safe to assume that I'm expected to give a O(n) solution for this problem?
There exists an O(n) algorithm for finding the k'th smallest element, and once you have that element, you can simply scan through the list and collect the appropriate elements. It's based on Quicksort, but the reasoning behind why it works are rather hairy... There's also a simpler variation that probably will run in O(n). My answer to another question contains a brief discussion of this.
Here's a general discussion of this particular interview question found from googling:
http://www.geeksforgeeks.org/k-largestor-smallest-elements-in-an-array/
As for your question about interviews in general, it probably greatly depends on the interviewer. They usually like to see how you think about things. So, as long as you can come up with some sort of initial solution, your interviewer would likely ask questions from there depending on what they were looking for exactly.
IMHO, I think the interviewer wouldn't be satisfied with either of the methods if he says the dataset is huge (say a billion elements). In this case, if K to be returned is huge (nearing a billion) your partial selection would almost result in an O(n^2). I think it entirely depends on the intricacies of the question proposed.
EDIT: Aasmund Eldhuset's answer shows you how to achieve the O(n) time complexity.
If you want to find K (so for K = 5 you'll get five results - five highest numbers ) then the best what you can get is O(n+klogn) - you can build prority queue in O(n) and then invoke pq.Dequeue() k times. If you are looking for K biggest number then you can get it with O(n) quicksort modification - it's called k-th order statistics. Pseudocode looks like that: (it's randomized algorithm, avg time is approximately O(n) however worst case is O(n^2))
QuickSortSelection(numbers, currentLength, k) {
if (currentLength == 1)
return numbers[0];
int pivot = random number from numbers array;
int newPivotIndex = partitionAroundPivot(numbers) // check quicksort algorithm for more details - less elements go left to the pivot, bigger elements go right
if ( k == newPivotIndex )
return pivot;
else if ( k < newPivotIndex )
return QuickSortSelection(numbers[0..newPivotIndex-1], newPivotIndex, k)
else
return QuickSortSelection(numbers[newPivotIndex+1..end], currentLength-newPivotIndex+1, k-newPivotIndex);
}
As i said this algorithm is O(n^2) worst case because pivot is chosen at random (however probability of running time of ~n^2 is something like 1/2^n). You can convert it deterministic algorithm with same running time worst case using for instance median of three median as a pivot - but it is slower in practice (due to constant).

What is the running time of this Haskell merge code

I came up with this code which takes two SORTED in ascending order lists and then merges those two lists into one single list, preserving the ascending order-sortedness. I was trying to analyze it and see what the time complexity of it is. My belief is that in the worst case, we will have to traverse through the whole lists and since there are 2 lists we're gonna have to have a nested recurence, which means O(n^2) time for the WORST case. However, since we're comparing the sizes of each of the two elements before we recurse, I am thinking that this is probably O(log n) time. Correct me if I'm wrong please. Thanks
Here's my recursion:
mergeLists::[Integer]->[Integer]->[Integer]
mergeLists [] [] =[]
mergeLists [] (y:ys) =(y:ys)
mergeLists (x:xs) [] =(x:xs)
mergeLists (x:xs) (y:ys)
|(x<y) =x:mergeLists xs (y:ys)
|otherwise =y:mergeLists (x:xs) ys
The time efficiency will be O(n), since the two lists are presumed to be in ascending order already. You don't need to recurse or perform any part of a comparison more than the number of elements in each list passed to the function.
Ach, Kareem already answered the question. I'll make it better by providing an understanding of Big-O notation for time and space efficiency. Big-O Notation is used to represent a basic concept of algorithmic performance over a set of data of arbitrary size. It's obvious that with an increasing data set for an algorithm to work on, the actual time to perform the operation increases. If the increase in time is determined linearly (IE each data member increases the algorithms actual time to finish by an equal amount as all other elements), then the algorithm's time efficiency is said to be "on the Order of n" or O(n). This is the case with your algorithm.
To learn more about Big-O notation, a great read is on another question here on StackOverflow: What does O(log n) mean exactly?
as the listed is already sorted .. then the merging time will be O(n) while n is the total size of the two lists.
it will work like the merge sort : http://www.vogella.com/tutorials/JavaAlgorithmsMergesort/article.html
Your mergeLists will take O(n+m) where n and m are the lengths of the input list. This can be easily proved by mathematical induction on n+m that the number of recursive calls will be at most n+m.
For the first 3 cases, there is no recursion, so the number of recursive calls is trivially less than n+m.
For the last 2 cases: In each case, the nested call receives one of the original lists and the other shortened by 1. So we can use the induction principle which tells us that the nested call will make at most n+m-1 recursive calls, and so the call in question will make at most n+m calls.

Resources