An algorithm about sorting

Suppose that you are given a sequence of n elements to sort. The input sequence consists of n/k subsequences, each containing k elements. The elements in a given subsequence are all smaller than the elements in the succeeding subsequence and larger than the elements in the preceding subsequence.
So is there an O(n log k) method to turn an unordered array into an array arranged as described above? Thanks!

A different formulation of the question
The question can be thought of like this. You have n balls of different sizes. You want to organize them into n/k buckets such that each bucket contains exactly k balls. Furthermore, these buckets are placed in a line in which the leftmost bucket contains the k smallest balls. The second bucket from the left contains the next k balls that would be the smallest if we removed the leftmost bucket. The rightmost bucket contains the k largest balls.
But within each bucket you have no order. If you want the largest ball you know which bucket you must begin searching in, but you still need to search around in it.
I will be using the term bucket instead of subsequence, since subsequence makes me think about ordering, which is not what matters here; what matters is belonging, so bucket is easier for me.
A problem with the proposed complexity of the imagined solution
You are stating that k is the length (or size) of each bucket. It therefore naturally can be between 1 and n.
You then ask whether an O(n log k) solution exists that organizes the elements in this manner. There is a problem with your proposed complexity that is easy to see when we consider the two extremes k = n and k = 1.
k = n. We only have one large bucket. This case is trivial since no action is needed, yet your proposed complexity is O(n log k) = O(n log n) when k = n.
Let us consider k = 1 too, because it has a similar, but inverse, issue.
k = 1. Each bucket contains one ball, and we need n buckets. This is the same as asking for a full sort of the whole sequence, which takes at least O(n log n) with comparison sorting. But your proposed complexity is O(n log k) = O(n log 1) = O(n * 0) = 0, since log 1 = 0. It seems that your proposed complexity does not fit the problem at all.
We can pause here and say: no, you cannot do what you wish in O(n log k), because it does not make sense for the problem to become harder as you decrease the number of buckets, and, more importantly, it cannot become easier as you increase the number of buckets.
If I were asked to do this sorting manually, I would say it is trivial to sort into one bucket. Two is easy. Three would be harder than two. If you have n buckets, then that is as hard as it can get!
Answer for an altered complexity
It is however interesting to consider what would happen if we were to fix your proposed complexity so that we instead ask the following. Is there a way to sort into these buckets in O(n log b) where b is the number of buckets (b = n / k)?
The extreme cases here seem to make sense.
b = 1. One bucket, no sorting needed: O(n log b) = O(n log 1) = O(0) (technically this should still be O(1)).
b = n. n buckets, a full sort is needed: O(n log b) = O(n log n).
So a solution seems possible, though working it out is outside the scope of the question. I suspect, however, that selection algorithms such as quickselect are the way to go; a rough sketch follows.
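To make that suspicion concrete, here is an illustrative Python sketch (my own code, not part of the original question): it recursively splits the array around a bucket-boundary rank using a randomized quickselect-style partition. Each recursion level does expected O(n) work and the recursion is O(log b) levels deep, so the expected total is O(n log b). It assumes n is a multiple of k.

import random

def partition_into_buckets(a, k):
    # Arrange the elements so that each consecutive block of k elements
    # contains the correct k order statistics, without ordering inside blocks.
    # Expected O(n log b) with b = n / k buckets; assumes len(a) is a multiple of k.
    n = len(a)
    if n <= k:
        return a[:]                        # a single bucket needs no ordering
    b = n // k
    mid = (b // 2) * k                     # split at a bucket boundary near the middle
    lower, upper = split_by_rank(a, mid)
    return partition_into_buckets(lower, k) + partition_into_buckets(upper, k)

def split_by_rank(a, m):
    # Quickselect-style split: return (the m smallest elements, the rest)
    # in expected O(len(a)) time.
    if m == 0:
        return [], a[:]
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    if m <= len(less):
        lo, hi = split_by_rank(less, m)
        return lo, hi + equal + greater
    if m <= len(less) + len(equal):
        take = m - len(less)
        return less + equal[:take], equal[take:] + greater
    lo, hi = split_by_rank(greater, m - len(less) - len(equal))
    return less + equal + lo, hi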

Related

Closest pair algorithm from O(n log^2 n) to O(n log n) time

I am trying to understand how to go from n log^2 n time to n log n time for the closest pair algorithm. I understand the part below (from http://www.cs.mcgill.ca/~cs251/ClosestPair/ClosestPairDQ.html):
Divide the set into two equal sized parts by the line l, and recursively compute the minimal distance in each part.
Let d be the minimal of the two minimal distances.
Eliminate points that lie farther than d from l
Sort the remaining points according to their y-coordinates
Scan the remaining points in the y order and compute the distances of each point to its five neighbors.
If any of these distances is less than d then update d.
Step 4 is a sort that takes O(n log n) time, which dominates all other steps; this is the step that needs to be reduced to O(n) for the overall algorithm to achieve O(n log n) time, and it is the part I am having a hard time understanding. The author proposes:
Step 1: Divide the set into..., and recursively compute the distance in each part, returning the points in each set in sorted order by y-coordinate.
Step 4: Merge the two sorted lists into one sorted list in O(n) time.
You still have to sort the points by the y-coordinate in the recursive step, which takes O(n log n) time. How can we avoid this? The merging is O(n), but we still have to sort somewhere.
The reason that O(n log n) is a problem is that we do it over and over and over again: if we partition the set into two subsets, and partition each of those into two subsets, then that involves seven sorts (one over the whole set, two over halves of it, four over quarters of it).
So the proposal fixes this by reusing the results of previous sorts: we do full merge sorts only on the smallest partitions (each of which is O(1), totaling O(n)); for larger partitions we just need a single O(n) merge pass to combine the results of their halves. So we only pay the O(n log n) price once in total, which is fine.
Their proposal is that you have two already sorted lists, A and B. Combining those into one sorted list can be done with a merge (just think of step (4) as the merge step in merge sort).
The result is one sorted list with all members of both A and B. Once merged, there is no need to sort anything again.
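As a concrete illustration (my own sketch, with points assumed to be (x, y) tuples), the merge step looks like this and runs in O(n):

def merge_by_y(a, b):
    # Merge two lists of points, each already sorted by y-coordinate,
    # into one y-sorted list in O(len(a) + len(b)) time.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][1] <= b[j][1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out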
I can offer an alternative solution which is maybe easier to understand. First sort all of the points by their y-coordinate; this is done once, in O(n log n). There are n points, and in the sorted array each of them gets an index that is at most n. Save the index of each point (add an integer index field to the point data structure). Then run the original algorithm. Whenever it needs to sort points, do not use an ordinary comparison sort; sort them by their saved index instead, which can be done in O(n) with a radix/counting-style pass. The total process is then O(n log n): we used the comparison sort only once, and the rest satisfies T(n) = 2T(n/2) + O(n). The constant factor is not as good as with the modification suggested in the question, though.
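A sketch of that index trick (illustrative names and (x, y) tuples of my own choosing; the linear re-ordering is done here with a simple placement pass over the ranks, standing in for the radix sort mentioned above):

def attach_y_ranks(points):
    # One-time O(n log n) step: sort by y and remember each point's rank.
    # Points become (x, y, rank) tuples in this sketch.
    by_y = sorted(points, key=lambda p: p[1])
    return [(x, y, rank) for rank, (x, y) in enumerate(by_y)]

def sort_subset_by_rank(subset, n):
    # Re-order a subset of ranked points by y in O(n) using the
    # precomputed rank instead of a comparison sort.
    slots = [None] * n
    for p in subset:
        slots[p[2]] = p                    # ranks are unique, so no collisions
    return [p for p in slots if p is not None]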
The modified procedure suggested in the question works like the merge in merge sort: when we already have two sorted lists, we do not need to sort them again with an ordinary sort; we can combine them by merging in O(n).

Finding the m Largest Numbers

This is a problem from the Cormen text, but I'd like to see if there are any other solutions.
Given an array with n distinct numbers, you need to find the m largest ones in the array, and have
them in sorted order. Assume n and m are large, but grow differently. In particular, you need to consider the situations where m = t*n for a small constant t, say 0.1, and also the possibility m = √n.
The solution given in the book offers 3 options:
Sort the array and return the top m-long segment
Convert the array to a max-heap and extract the m elements
Select the m-th largest number, partition the array about it, and sort the segment of larger entries.
These all make sense, and they all have their pros and cons, but I'm wondering, is there another way to do it? It doesn't have to be better or faster, I'm just curious to see if this is a common problem with more solutions, or if we are limited to those 3 choices.
The time complexities of the three approaches you have mentioned are as follows.
(1) O(n log n)
(2) O(n + m log n)
(3) O(n + m log m)
So option (3) is definitely better than the others in terms of asymptotic complexity, since m <= n. When m is small, the difference between (2) and (3) is so small it would have little practical impact.
As for other ways to solve the problem, there are infinitely many, so the question is somewhat open-ended in that regard. Another approach I can think of that is practically simple and performant is the following.
Extract the first m numbers from your list of n into an array, and sort it.
Repeatedly grab the next number from your list and insert it into the correct location in the array, shifting all the lesser numbers over by one and pushing one out.
I would only do this if m were very small, though; a rough sketch is below. Option (2) from your original list is also extremely easy to implement if you have a max-heap implementation, and it will work great.
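Here is that sketch in Python (my own illustration, assuming 1 <= m <= len(nums); bisect locates the insertion point, but the shift itself still costs O(m) per accepted element):

import bisect

def m_largest_by_insertion(nums, m):
    # Keep an ascending sorted array of the m largest values seen so far.
    # Only worthwhile when m is very small. Returns the m largest values
    # in descending order.
    top = sorted(nums[:m])                 # O(m log m)
    for x in nums[m:]:
        if x > top[0]:                     # beats the smallest value kept so far
            bisect.insort(top, x)          # insert, shifting larger values over
            top.pop(0)                     # push the smallest one out
    return top[::-1]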
A different approach.
Take the first m numbers and turn them into a min-heap. Run through the rest of the array; if a value exceeds the minimum of the m kept so far, extract that minimum and insert the new value. When you reach the end of the array, extract the elements into an array and reverse it.
The worst case performance of this version is O(n log(m)) placing it between the first and second methods for efficiency.
The average case is more interesting. On average only O(m log(n/m)) of the elements are going to pass the first comparison test, each time incurring O(log(m)) work so you get O(n + m log(n/m) log(m)) work, which puts it between the second and third methods. But if n is many orders of magnitude greater than m then the O(n) piece dominates, and the O(n) median select in the third approach has worse constants than the one comparison per element in this approach, so in this case this is actually the fastest!
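A sketch of this min-heap approach using the standard-library heapq module (heapq.nlargest does essentially the same thing internally, if you only need the result):

import heapq

def m_largest_with_heap(nums, m):
    # Maintain a min-heap of the m largest values seen so far.
    # O(n log m) worst case; assumes m >= 1. Returns descending order.
    heap = list(nums[:m])
    heapq.heapify(heap)                    # O(m)
    for x in nums[m:]:
        if x > heap[0]:                    # larger than the current minimum of the top m
            heapq.heapreplace(heap, x)     # pop the min and push x in O(log m)
    return sorted(heap, reverse=True)      # final O(m log m) to produce sorted output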

What is the worst case complexity for bucket sort?

I just read the Wikipedia page about bucket sort. The article says that the worst-case complexity is O(n²), but I thought the worst-case complexity was O(n + k), where k is the number of buckets. This is how I calculate that complexity:
Adding an element to a bucket: with a linked list this is O(1)
Going through the input list and putting each element in the correct bucket: O(n)
Merging (concatenating) the buckets: O(k)
O(1) * O(n) + O(k) = O(n + k)
Am I missing something?
In order to merge the buckets, they first need to be sorted. Consider the pseudocode given in the Wikipedia article:
function bucketSort(array, n) is
    buckets ← new array of n empty lists
    for i = 0 to (length(array)-1) do
        insert array[i] into buckets[msbits(array[i], k)]
    for i = 0 to n - 1 do
        nextSort(buckets[i])
    return the concatenation of buckets[0], ..., buckets[n-1]
The nextSort(buckets[i]) call sorts each of the individual buckets. Generally a different sort (e.g. insertion sort) is used for the buckets, since once the buckets are small enough, simple non-recursive sorts often give better performance.
Now consider the case where all n elements end up in the same bucket. If we use insertion sort to sort the individual buckets, this leads to the worst-case performance of O(n^2). So the answer depends on which sort you choose for the individual buckets.
What if the algorithm puts every element in the same bucket? In that case, if the bucket is kept as a sorted linked list, the list needs to be traversed every time an element is added. That takes 1 step, then 2, then 3, 4, 5, ..., n. The total time is the sum of the numbers from 1 to n, which is (n^2 + n)/2, i.e. O(n^2).
Of course this is the worst case (all the elements in one bucket); the function that decides which bucket an element goes into is generally designed to avoid this behavior.
If you can guarantee that each bucket holds only one distinct value (i.e. all items in a bucket are equal), then the worst-case time complexity is O(n + k), as you pointed out.
Bucket sort assumes that the input is drawn from a uniform distribution, which implies that only a few items fall into each bucket. This leads to a nice average running time of O(n). Indeed, if the n elements are distributed so that O(1) elements land in each bucket (insertion takes O(1) per item), then sorting a bucket with insertion sort also takes O(1) on average (this is proved in almost all algorithms textbooks). Since you must sort n buckets, the average complexity is O(n).
Now assume that the input is not drawn from a uniform distribution. As @mfrankli already pointed out, in the worst case this may lead to a situation in which all of the items fall, say, into the first bucket. In that case insertion sort requires O(n^2) in the worst case.
Note that you can use the following trick to keep the same O(n) average complexity while guaranteeing O(n log n) in the worst case: instead of insertion sort, simply sort each bucket with an algorithm whose worst case is O(n log n), such as merge sort or heap sort (but not quicksort, which achieves O(n log n) only on average).
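A small Python sketch of that scheme (my own illustration, assuming the keys are floats in [0, 1), the usual textbook setting); Python's built-in sort stands in for the O(n log n)-worst-case inner sort:

def bucket_sort(a, num_buckets=None):
    # Bucket sort for floats in [0, 1). Each bucket is sorted with the
    # built-in sort (O(b log b) worst case per bucket), so the whole sort
    # is O(n log n) in the worst case and O(n + k) on average for
    # roughly uniform input.
    n = len(a)
    k = num_buckets or n
    buckets = [[] for _ in range(k)]
    for x in a:
        buckets[int(x * k)].append(x)      # uniform keys spread out evenly
    out = []
    for b in buckets:
        out.extend(sorted(b))              # swap in merge/heap sort if preferred
    return out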
This is an add-on answer to @perreal; I tried to post it as a comment but it was too long. @perreal correctly points out when bucket sort makes the most sense. The different answers here are making different assumptions about what data is being sorted. For example, if the keys to be sorted are strings, the range of possible keys is too large (larger than the bucket array), and we have to use only the first character of the string for the bucket position, or some other strategy. The individual buckets then have to be sorted because they hold items with different keys, leading to O(n^2).
But if we are sorting data where the keys are integers in a known range, then the buckets are always already sorted because the keys in the bucket are equal, which leads to the linear time sort. Not only are the buckets sorted, but the sort is stable because we can pull items out of the bucket array in the order they were added.
The thing that I wanted to add is that if you are facing O(n^2) because of the nature of the keys to be sorted, bucket sort might not be the right approach. When you have a range of possible keys that is proportional to the size of the input, then you can take advantage of the linear time bucket sort by having each bucket hold only 1 value of a key.

How can the complexity of bucket sort be O(n+k)?

Before saying "this has been asked before", or "find an algorithm book", please read on and tell me what part of my reasoning went wrong?
Say you have n intergers, and you divded them into k bins, this will take O(n) time. However, one need to sort each of the k bins, if using quick sort for each bin this is an O((n/k)log(n/k)) operation, so this step would take O(nlog(n/k)+k). Finally one need to assemble this array, this takes O(n+k), (see this post), so the total operation would be O(n+nlog(n/k)+k). Now, how did this nlog(n/k) disappeared, I could not figure at all. My guess is there is some mathematics going on which eliminates this n*log(n/k). Anyone could help?
Your assumption:
k - the number of buckets - is arbitrary
is wrong.
There are two variants of bucket sort, so it is quite confusing.
A
The number of buckets is equal to the number of items in the input
See analysis here
B
The number of buckets is equal to R - the number of possible values for the input integers
See analysis here and here
Your flaw is assuming that quicksort is used to sort the buckets. Typically this is not the case, and that's how you avoid the (n / k) log(n / k) terms.
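To make that concrete for variant B, here is a minimal sketch (my own illustration, assuming integer keys in a known range 0..R-1); because every bucket holds only equal keys, no comparison sort, quicksort or otherwise, is needed inside the buckets:

def bucket_sort_by_value(a, R):
    # Variant B: one bucket per possible key value 0..R-1. Every bucket
    # holds equal keys, so concatenating the buckets already yields a
    # sorted (and stable) result in O(n + R) time.
    buckets = [[] for _ in range(R)]
    for x in a:
        buckets[x].append(x)               # O(1) per element
    out = []
    for b in buckets:
        out.extend(b)                      # no per-bucket sort needed
    return out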
Your analysis looks good. The term Bucketsort is used for many different algorithms, so depending on which one you looked at its average runtime might be O(n + k) or not.
If I had to guess, you might have looked at a typical variant where one chooses k very large so that n/k will be a constant. In another popular variant even k >> n, so one divides into k/n buckets instead.
If you provide the algorithm in detail and the source which claims this to be in an average of O(n + k) I can revisit my answer.

How to calculate order (big O) for more complex algorithms (eg quicksort)

I know there are quite a bunch of questions about big O notation, I have already checked:
Plain english explanation of Big O
Big O, how do you calculate/approximate it?
Big O Notation Homework--Code Fragment Algorithm Analysis?
to name a few.
I know by "intuition" how to calculate it for n, n^2, n! and so, however I am completely lost on how to calculate it for algorithms that are log n , n log n, n log log n and so.
What I mean is, I know that Quick Sort is n log n (on average).. but, why? Same thing for merge/comb, etc.
Could anybody explain me in a not too math-y way how do you calculate this?
The main reason is that Im about to have a big interview and I'm pretty sure they'll ask for this kind of stuff. I have researched for a few days now, and everybody seem to have either an explanation of why bubble sort is n^2 or the unreadable explanation (for me) on Wikipedia
The logarithm is the inverse operation of exponentiation. An example of exponentiation is when you double the number of items at each step. Thus, a logarithmic algorithm often halves the number of items at each step. For example, binary search falls into this category.
Many algorithms require a logarithmic number of big steps, but each big step requires O(n) units of work. Mergesort falls into this category.
Usually you can identify these kinds of problems by visualizing them as a balanced binary tree. For example, here's merge sort:
6 2 0 4 1 3 7 5
2 6 0 4 1 3 5 7
0 2 4 6 1 3 5 7
0 1 2 3 4 5 6 7
At the top is the input, as the leaves of the tree. The algorithm creates each new row by merging pairs of sorted runs from the row above it. We know the height of a balanced binary tree is O(log n), so there are O(log n) big steps. However, creating each new row takes O(n) work. O(log n) big steps of O(n) work each means that mergesort is O(n log n) overall.
Generally, O(log n) algorithms look like the function below. They get to discard half of the data at each step.
def function(data, n):
    if n <= constant:
        return do_simple_case(data, n)
    if some_condition():
        function(data[:n // 2], n // 2)          # recurse on the first half of the data
    else:
        function(data[n // 2:], n - n // 2)      # recurse on the second half of the data
While O(n log n) algorithms look like the function below. They also split the data in half, but they need to consider both halves.
def function(data, n):
    if n <= constant:
        return do_simple_case(data, n)
    part1 = function(data[:n // 2], n // 2)      # recurse on the first half of the data
    part2 = function(data[n // 2:], n - n // 2)  # recurse on the second half of the data
    return combine(part1, part2)
Where do_simple_case() takes O(1) time and combine() takes no more than O(n) time.
The algorithms don't need to split the data exactly in half. They could split it into one-third and two-thirds, and that would be fine. For average-case performance, splitting it in half on average is sufficient (like QuickSort). As long as the recursion is done on pieces of (n/something) and (n - n/something), it's okay. If it's breaking it into (k) and (n-k) then the height of the tree will be O(n) and not O(log n).
You can usually claim log n for algorithms where the space/time is cut in half each time they run. A good example is any binary algorithm (e.g., binary search): you pick either left or right, which cuts the space you're searching in half. The pattern of repeatedly halving is log n.
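For example, a standard iterative binary search, which discards half of the remaining interval on every step and therefore runs in O(log n):

def binary_search(sorted_items, target):
    # Return the index of target in sorted_items, or -1 if it is absent.
    # Each iteration halves the search interval, so there are O(log n) iterations.
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1                   # discard the left half
        else:
            hi = mid - 1                   # discard the right half
    return -1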
For some algorithms, getting a tight bound for the running time through intuition is close to impossible (I don't think I'll ever be able to intuit a O(n log log n) running time, for instance, and I doubt anyone will ever expect you to). If you can get your hands on the CLRS Introduction to Algorithms text, you'll find a pretty thorough treatment of asymptotic notation which is appropriately rigorous without being completely opaque.
If the algorithm is recursive, one simple way to derive a bound is to write out a recurrence and then set out to solve it, either iteratively or using the Master Theorem or some other way. For instance, if you're not looking to be super rigorous about it, the easiest way to get QuickSort's running time is through the Master Theorem -- QuickSort entails partitioning the array into two relatively equal subarrays (it should be fairly intuitive to see that this is O(n)), and then calling QuickSort recursively on those two subarrays. Then if we let T(n) denote the running time, we have T(n) = 2T(n/2) + O(n), which by the Master Method is O(n log n).
Check out the "phone book" example given here: What is a plain English explanation of "Big O" notation?
Remember that Big-O is all about scale: how many more operations will this algorithm require as the data set grows?
O(log n) generally means you can cut the dataset in half with each iteration (e.g. binary search)
O(n log n) means you're performing an O(log n) operation for each item in your dataset
O(n log log n) running times do exist (they come up in more specialized algorithms, for example ones built on van Emde Boas trees), but they are rare and hard to build intuition for, so I won't cover them here.
I'll attempt an intuitive analysis of why mergesort is n log n, and if you can give me an example of an n log log n algorithm, I can try to work through it as well.
Mergesort is a sorting algorithm that works by repeatedly splitting a list of elements until only single elements remain, and then merging these lists back together. The primary operation in each merge is comparison, and each merge requires at most n comparisons, where n is the combined length of the two lists. From this you can derive the recurrence and solve it easily, but we'll avoid that method here.
Instead, consider how mergesort behaves: we take a list and split it, then take those halves and split them again, until we have n partitions of length 1. I hope it is easy to see that this recursion goes only log(n) levels deep before the list has been split into n partitions.
Now, each of these n partitions needs to be merged, then the results at the next level need to be merged, and so on until we have a single list of length n again. See Wikipedia's diagram for a simple example of this process: http://en.wikipedia.org/wiki/File:Merge_sort_algorithm_diagram.svg.
Now consider the amount of time this process takes: we have log(n) levels, and at each level we have to merge all of the lists. Each level takes n time to merge, because we merge a total of n elements at each level. From that you can fairly easily see that it takes n log(n) time to sort an array with mergesort, if you take the comparison operation to be the most important one.
If anything is unclear or I skipped something, please let me know and I can try to be more verbose.
Edit: a second explanation.
Let me see if I can explain this better.
The problem is broken into a bunch of smaller lists and then the smaller lists are sorted and merged until you return to the original list which is now sorted.
When you break up the problem, you have several different levels of sizes. At the first level you have two lists of size n/2, n/2; at the next level you have four lists of size n/4, n/4, n/4, n/4; at the level after that, eight lists of size n/8; and this continues until n/2^k equals 1 (each subdivision is the length divided by a power of 2; not all lengths will divide evenly, so it won't be quite this pretty). This repeated division by two can continue at most log_2(n) times, because 2^(log_2(n)) = n, so any further division by 2 would yield lists of size less than one.
Now, the important thing to note is that at every level we have n elements in total, so for each level the merging takes n time, because merging is a linear operation. If there are log(n) levels of recursion, then we perform this linear operation log(n) times, so our running time is n log(n).
Sorry if that isn't helpful either.
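If it helps to see it in code, here is a plain top-down merge sort in Python (my own illustrative version): the recursion is log(n) levels deep and each level does a total of O(n) merge work.

def merge_sort(a):
    # Top-down merge sort: O(log n) levels of splitting,
    # O(n) merge work per level, O(n log n) overall.
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))

def merge(left, right):
    # Merge two sorted lists using at most len(left) + len(right) comparisons.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]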
When applying a divide-and-conquer algorithm, where you partition the problem into sub-problems until they are so simple that they are trivial, if the partitioning goes well the size of each sub-problem is about n/2. This is often the origin of the log(n) that crops up in big-O complexity: O(log(n)) is the depth of the recursion when the partitioning goes well.
