Cormen quick sort modify partition function - algorithm

I am learning quicksort from Introduction to Algorithms. I got stuck on exercise 7.1-2 of Chapter 7, Quicksort:
"What value of q does PARTITION return when all elements in the array A[p…r] have the same value? Modify PARTITION so that q = ⌊(p+r)/2⌋ when all elements in the array A[p…r] have the same value."
The first part is easy and the answer is definitely r. But I can't even figure out what the second part is asking. I mean, what is the reason for setting the pivot to ⌊(p+r)/2⌋? Further, I can't understand the solutions I found by searching on Google.
Please help me understand the advantage of this modification when all elements are equal and, if possible, provide the algorithm to do so.

By setting the pivot to the middle of p and r, we divide the array of size n into two sub-problems of equal size n/2. If you draw the recursion tree for the following recurrence, you will see that its height is O(lg n):
T(n) = 2T(n/2) + O(n)
Now imagine that the position of the pivot returned from the partition is always the last element in the array. Then the recurrence for the running time would be
T(n) = T(n-1) + O(n)
Do you see now why it is inefficient when the recursion tree degenerates into a linked list? Try drawing the tree and adding up the costs at each node in both cases.
Also, modifying the partition method to return ⌊(p+r)/2⌋ is easy.
Hint: one easy way is to split the <= comparison in the if condition of PARTITION so that elements equal to the pivot go alternately to either side of the split.
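A minimal sketch of such a modification (my own illustration in C, 0-based, Lomuto-style partitioning as in CLRS; the take_equal flag is an assumption, not the book's wording). Elements equal to the pivot are sent alternately to the left and right sides, so an all-equal A[p..r] is split at ⌊(p+r)/2⌋:

/* Sketch: Lomuto-style PARTITION where elements equal to the pivot
   alternate sides. For an all-equal A[p..r] it returns floor((p+r)/2). */
int partition(int A[], int p, int r) {
    int pivot = A[r];
    int i = p - 1;
    int take_equal = 0;                 /* toggles how ties are handled */
    for (int j = p; j < r; j++) {
        int goes_left;
        if (A[j] < pivot) {
            goes_left = 1;
        } else if (A[j] == pivot) {
            goes_left = take_equal;     /* every other tie goes left */
            take_equal = !take_equal;
        } else {
            goes_left = 0;
        }
        if (goes_left) {
            i++;
            int tmp = A[i]; A[i] = A[j]; A[j] = tmp;
        }
    }
    int tmp = A[i + 1]; A[i + 1] = A[r]; A[r] = tmp;
    return i + 1;
}

With all elements equal, this turns the T(n) = T(n-1) + O(n) recurrence back into T(n) = 2T(n/2) + O(n).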

Related

Minimal non-contiguous sequence of exactly k elements

The problem I'm having can be reduced to:
Given an array of N positive numbers, find the non-contiguous sequence of exactly K elements with the minimal sum.
Ok-ish: report the sum only. Bonus: the picked elements can be identified (at least one set of indices, if many can realize the same sum).
(in layman terms: pick any K non-neighbouring elements from N values so that their sum is minimal)
Of course, 2*K <= N+1 (otherwise no solution is possible). The problem is insensitive to positive/negative values (just shift the array values by MIN = min(A...), then add back K*MIN to the answer).
What I got so far (the naive approach):
select the K+2 indexes of the values closest to the minimum. I'm not sure about this; for K=2 this seems to be required to cover all the particular cases, but I don't know whether it is required/sufficient for K>2**
brute force the minimal sum from the values at the indices resulting from the previous step, respecting the non-contiguity criterion - if I'm right and K+2 is enough, I can live with brute-forcing a (K+1)*(K+2) solution space but, as I said, I'm not sure K+2 is enough for K>2 (if in fact 2*K points are necessary, then brute-forcing goes out the window - the binomial coefficient C(2*K, K) grows prohibitively fast)
Any clever idea of how this can be done with minimal time/space complexity?
** For K=2, a non-trivial example where the 4 values closest to the absolute minimum are necessary to select the objective sum: [4,1,0,1,4,3,4] - one cannot use the 0 value for building the minimal sum, as it would break the non-contiguity criterion (the answer is 1+1=2).
PS - if you feel like showing code snippets, C/C++ and/or Java will be appreciated, but any language with decent syntax or pseudo-code will do (I reckon "decent syntax" excludes Perl, doesn't it?)
Let's assume the input numbers are stored in an array a[N].
The generic approach is DP: f(n, k) = min(f(n-1, k), f(n-2, k-1) + a[n])
It takes O(N*K) time, and there are two options for space:
O(N*K) space for a lazy, memoized recursive solution with backtracking (to recover the indexes)
O(K) space for a forward loop that keeps only the sums
In the special case of big K there is another possibility:
use recursive back-tracking
instead of a helper array of size N*K, use map(n, map(k, pair(answer, list(answer indexes))))
save the answer and the list of indexes for this answer
instantly return MAX_INT if k > N/2
This way you'll have lower time than O(N*K) for K ≈ N/2, something like O(N log(N)). It will increase up to O(N log(N) K log(K)) for small K, so the decision between the general approach and the special-case algorithm is important.
There should be a dynamic programming approach to this.
Work along the array from left to right. At each point i, for each value of j from 1..k, find the value of the best answer for picking j non-contiguous elements from 1..i. You can work out the answers at i by looking at the answers at i-1, at i-2, and at the value of array[i]. The answer you want is the answer at n for an array of length n. After you have done this, you should be able to work out which elements were picked by back-tracking along the array, deciding at each point whether the best answer there involved selecting the array element at that point, and therefore whether it came from the answer at [i-1][j] or at [i-2][j-1].
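A minimal sketch of this O(N*K) DP in C (my own illustration; the names min_noncontig_sum and dp are made up). dp[i][j] is the minimal sum obtainable by picking j non-adjacent elements among the first i values; recovering the chosen indexes is the back-tracking step described above and is left out here:

/* dp_sketch.c -- minimal sketch of the O(N*K) DP described above. */
#include <stdio.h>
#include <limits.h>

#define INF (LLONG_MAX / 2)

long long min_noncontig_sum(const long long *a, int n, int k) {
    long long dp[n + 1][k + 1];          /* VLA for brevity; heap-allocate for large N*K */
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= k; j++)
            dp[i][j] = (j == 0) ? 0 : INF;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= k; j++) {
            long long skip = dp[i - 1][j];                   /* don't take a[i-1] */
            long long take = (i >= 2) ? dp[i - 2][j - 1]     /* take a[i-1], forbid a[i-2] */
                                      : (j == 1 ? 0 : INF);
            if (take != INF) take += a[i - 1];
            dp[i][j] = (skip < take) ? skip : take;
        }
    return dp[n][k];                     /* INF means no valid selection exists */
}

int main(void) {
    long long a[] = {4, 1, 0, 1, 4, 3, 4};        /* the K=2 example from the question */
    printf("%lld\n", min_noncontig_sum(a, 7, 2)); /* prints 2 (pick the two 1s) */
    return 0;
}

Rolling only the last two rows of dp gives the O(K)-space variant mentioned in the first answer.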

Two dimensional array search in sub linear time

Today I was asked the following question in an interview:
Given an n by n array of integers that contains no duplicates and values that increase from left to right as well as top to bottom, provide an algorithm that checks whether a given value is in the array.
The answer I provided was similar to the answer in this thread:
Algorithm: efficient way to search an integer in a two dimensional integer array?
This solution is O(2n), which I believed to be the optimal solution.
However, the interviewer then informed me that it is possible to solve this problem in sub linear time. I have racked my brain with how to go about doing this, but I am coming up with nothing.
Is a sub linear solution possible, or is this the optimal solution?
The thing to ask yourself is, what information does each comparison give you? It lets you eliminate the rectangle either "above to the left" or "below to the right".
Suppose you do a comparison at 'x' and it tells you that what you are looking for is greater:
XXX...
XXX...
XXx...
......
......
'x' - checked space
'X' - check showed this is not a possible location for your data
'.' - still unknown
You have to use this information in a smart way to check the entire rectangle.
Suppose you do a binary search this way on the middle column...
You'll get a result like
XXX...
XXX...
XXX...
XXXXXX
...XXX
...XXX
Two rectangular spaces are left over, of half the width and possibly the full height. What can you do with this information?
I recommend recursing on the two resulting sub-rectangles of '.'. BUT now, instead of choosing the middle column, you choose the middle row to do your binary search on.
So the resulting run time of an N by M rectangle looks like
T(N, M) = log(N) + T(M/2, N)*2
Note the change in indexes because your recursion stack switches between checking columns and rows. The final run time (I didn't bother solving the recursion) should be something like T(M, N) = log(M) + log(N) (it's probably not exactly this but it will be similar).
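A minimal sketch of this elimination scheme (my own illustration in C, 0-based, flat row-major storage; for simplicity it always binary-searches a column instead of alternating between columns and rows as described above, but the elimination logic is the same):

/* search_rect: is target present in rows [r1, r2) x cols [c1, c2)?
   Assumes values strictly increase left-to-right and top-to-bottom. */
#include <stdbool.h>

bool search_rect(const int *a, int ncols,
                 int r1, int r2, int c1, int c2, int target) {
    if (r1 >= r2 || c1 >= c2)
        return false;
    int mid = c1 + (c2 - c1) / 2;            /* middle column */
    int lo = r1, hi = r2;                    /* binary search that column */
    while (lo < hi) {
        int m = lo + (hi - lo) / 2;
        int v = a[m * ncols + mid];
        if (v == target)
            return true;
        else if (v < target)
            lo = m + 1;
        else
            hi = m;
    }
    /* Rows r1..lo-1 have a[row][mid] < target, so only columns right of mid
       can still contain it; rows lo..r2-1 have a[row][mid] > target, so only
       columns left of mid remain. Recurse on the two '.' rectangles. */
    return search_rect(a, ncols, r1, lo, mid + 1, c2, target)
        || search_rect(a, ncols, lo, r2, c1, mid, target);
}

For an n by n matrix stored row-major you would call search_rect(a, n, 0, n, 0, n, target).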

Finding closest number in a range

I thought of a problem, which is as follows:
We have an array A of integers of size n, and we have t test cases. In every test case we are given a number m and a range [s,e], i.e. we are given s and e, and we have to find the number closest to m within that range of the array (A[s]..A[e]).
You may assume the array is indexed from 1 to n.
For example:
A = {5, 12, 9, 18, 19}
m = 13
s = 4 and e = 5
So the answer should be 18.
Constraints:
n<=10^5
t<=n
All I could think of is an O(n) solution for every test case, and I think a better solution exists.
This is a rough sketch:
Create a segment tree from the data. At each node, besides the usual data like the left and right indices, you also store the numbers found in the sub-tree rooted at that node, in sorted order. You can achieve this when you construct the segment tree bottom-up. In the node just above a leaf, you store the two leaf values in sorted order. In an intermediate node, you keep the numbers of the left child and the right child, merged together using standard merging. There are O(n) nodes in the tree, and keeping this data should take O(n log(n)) overall.
Once you have this tree, for every query walk down the tree until you reach the appropriate node(s) for the given range [s, e]. As the tutorial shows, one or more nodes combine to form the given range. Since the tree depth is O(log(n)), the range decomposes into O(log(n)) such nodes, and reaching them takes O(log(n)) time. For each node which lies completely inside the range, find the closest number using binary search in the sorted array stored at that node - another O(log(n)) per node. Find the closest among all these candidates, and that is the answer. Thus you can answer each query in O(log^2(n)) time.
The tutorial I link to contains other data structures, such as sparse table, which are easier to implement, and should give O(sqrt(n)) per query. But I haven't thought much about this.
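A rough sketch of this merge-sort-tree idea in C (my own illustration; build, best_in_sorted and query are made-up names). It reports the minimal distance |A[i] - m| over the range; returning the element itself is a small extension of the same binary search:

/* merge_sort_tree.c -- segment tree whose nodes hold sorted copies of A. */
#include <stdio.h>
#include <stdlib.h>

#define MAXN 100000
#define NONE (1LL << 60)              /* sentinel: "no candidate in this node" */

static int n, A[MAXN];
static int *tree[4 * MAXN];           /* sorted copy of A over each node's segment */
static int len[4 * MAXN];

static void build(int node, int lo, int hi) {
    len[node] = hi - lo + 1;
    tree[node] = malloc(len[node] * sizeof(int));
    if (lo == hi) { tree[node][0] = A[lo]; return; }
    int mid = (lo + hi) / 2, l = 2 * node, r = 2 * node + 1;
    build(l, lo, mid);
    build(r, mid + 1, hi);
    for (int i = 0, j = 0, k = 0; k < len[node]; k++)   /* standard merge */
        tree[node][k] = (j >= len[r] || (i < len[l] && tree[l][i] <= tree[r][j]))
                      ? tree[l][i++] : tree[r][j++];
}

/* minimal |v[x] - m| within a sorted array, via binary search */
static long long best_in_sorted(const int *v, int length, int m) {
    int lo = 0, hi = length;          /* first index with v[idx] >= m */
    while (lo < hi) { int mid = (lo + hi) / 2; if (v[mid] < m) lo = mid + 1; else hi = mid; }
    long long best = NONE;
    if (lo < length) { long long d = llabs((long long)v[lo] - m);     if (d < best) best = d; }
    if (lo > 0)      { long long d = llabs((long long)v[lo - 1] - m); if (d < best) best = d; }
    return best;
}

/* minimal |A[i] - m| over indices i in [ql, qr] (0-based) */
static long long query(int node, int lo, int hi, int ql, int qr, int m) {
    if (qr < lo || hi < ql) return NONE;
    if (ql <= lo && hi <= qr) return best_in_sorted(tree[node], len[node], m);
    int mid = (lo + hi) / 2;
    long long a = query(2 * node, lo, mid, ql, qr, m);
    long long b = query(2 * node + 1, mid + 1, hi, ql, qr, m);
    return a < b ? a : b;
}

int main(void) {
    int data[] = {5, 12, 9, 18, 19};
    n = 5;
    for (int i = 0; i < n; i++) A[i] = data[i];
    build(1, 0, n - 1);
    /* the question's example: m = 13, s = 4, e = 5 (1-based) -> [3, 4] 0-based */
    printf("%lld\n", query(1, 0, n - 1, 3, 4, 13));   /* prints 5: closest value is 18 */
    return 0;
}

Building takes O(n log(n)) time and space; each query visits O(log(n)) nodes and binary-searches each of them.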
Sort the array and do binary search. Complexity: O(n log n + t log n).
I'm fairly sure no faster solution exists. A slight variation of your problem is:
There is no array A, but each test case contains an unsorted array of numbers to search. (The array slice of A from s to e).
In that case, there is clearly no better way than a linear search for each test case.
Now, in what way is your original problem more specific than the variation above? The only added information is that all the slices come from the same array. I don't think that this additional constraint can be used for an algorithmic speedup.
EDIT: I stand corrected. The segment tree data structure should work.

Finding the repeated element

In an array with integers between 1 and 1,000,000 (or some much larger value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements, and then traverse it all over again to find the repeated element. But I think that is a process with high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: Ok. So here's the elegant solution ... if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected sum of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N).
Since we know the expected sum, all we have to do is loop through all the values and add them up. The difference between this sum and the expected sum is the duplicate value.
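A minimal sketch of that trick in C (my own illustration; it assumes A holds every integer 1..N exactly once plus one extra duplicate, i.e. N+1 entries):

/* The surplus over Gauss's sum 1+2+...+N identifies the duplicate. */
long long find_duplicate_by_sum(const int *A, int N) {
    long long expected = (long long)N * (N + 1) / 2;   /* Sum(1..N) */
    long long actual = 0;
    for (int i = 0; i < N + 1; i++)                    /* N+1 entries */
        actual += A[i];
    return actual - expected;                          /* the duplicated value */
}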
EDIT: As others have commented, the question doesn't state that the range contains all of the integers ... in this case, you have to decide whether you want to optimize for memory or time.
If you want to perform the operation using O(1) storage, you can perform an in-place sort of the list. As you sort, check adjacent elements; once you see a duplicate, you know you can stop. Optimal sorting is an O(n log n) operation on average, which establishes an upper bound for finding the duplicate in this manner.
If you want to optimize for speed, you can use an additional O(n) storage. Using a HashSet (or similar structure), insert values from your list until you determine you are inserting a duplicate into the HashSet. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
You may try to use a bit array as a hash map:
1 at position k means that the number k occurred before
0 at position k means that the number k did not occur before
In C# you could use the BitArray class; here is a sketch in C using a plain byte array (a real bit array would pack 8 flags per byte):
/* assume the input array is A with n elements, each value in 1..1000000 */
int find_repeated(const int A[], int n) {
    static unsigned char seen[1000001];   /* seen[k] != 0: number k occurred before */
    for (int i = 0; i < n; i++) {
        if (seen[A[i]])
            return A[i];                  /* second occurrence found */
        seen[A[i]] = 1;
    }
    return -1;                            /* no duplicate found */
}
The time complexity of the bitmap solution is O(n) and it doesn't seem like you could do better than that. However it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present; if they all were, LBushkin's trick would be the answer to this homework problem), there is no way to solve this problem using less than O(n) memory with an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were in the array such that the first one was in the first n/2 elements, and the second one was in the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements, and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player could simulate the run of any algorithm that takes a pass through the array, and send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any algorithm.
But it's easy to see in this simple game that they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better, as suggested already: sort the array, then take one pass through. This takes time O(n log n) and space O(1). But note, curiously, that this proves that any sorting algorithm that just makes passes through the array must take time Omega(n^2)! Sorting algorithms that break the n^2 bound must make random accesses.

Quicksort: Choosing the pivot

When implementing Quicksort, one of the things you have to do is to choose a pivot. But when I look at pseudocode like the one below, it is not clear how I should choose the pivot. First element of list? Something else?
function quicksort(array)
    var list less, greater
    if length(array) ≤ 1
        return array
    select and remove a pivot value pivot from array
    for each x in array
        if x ≤ pivot then append x to less
        else append x to greater
    return concatenate(quicksort(less), pivot, quicksort(greater))
Can someone help me grasp the concept of choosing a pivot, and whether or not different scenarios call for different strategies?
Choosing a random pivot minimizes the chance that you will encounter worst-case O(n^2) performance (always choosing the first or last would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.
Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(N^2) performance. 'Median-of-three' (first, last, middle) is also a way of avoiding problems. Beware of relative performance of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing (a single pivot value) at random. Database records can be costly to compare.
Update: Pulling comments into answer.
mdkess asserted:
'Median of 3' is NOT first last middle. Choose three random indexes, and take the middle value of this. The whole point is to make sure that your choice of pivots is not deterministic - if it is, worst case data can be quite easily generated.
To which I responded:
Analysis Of Hoare's Find Algorithm With Median-Of-Three Partition (1997)
by P Kirschenhofer, H Prodinger, C Martínez supports your contention (that 'median-of-three' is three random items).
There's an article described at portal.acm.org that is about 'The Worst Case Permutation for Median-of-Three Quicksort' by Hannu Erkiö, published in The Computer Journal, Vol 27, No 3, 1984. [Update 2012-02-26: Got the text for the article. Section 2 'The Algorithm' begins: 'By using the median of the first, middle and last elements of A[L:R], efficient partitions into parts of fairly equal sizes can be achieved in most practical situations.' Thus, it is discussing the first-middle-last Mo3 approach.]
Another short article that is interesting is by M. D. McIlroy, "A Killer Adversary for Quicksort", published in Software-Practice and Experience, Vol. 29(0), 1–4 (0 1999). It explains how to make almost any Quicksort behave quadratically.
AT&T Bell Labs Tech Journal, Oct 1984 "Theory and Practice in the Construction of a Working Sort Routine" states "Hoare suggested partitioning around the median of several randomly selected lines. Sedgewick [...] recommended choosing the median of the first [...] last [...] and middle". This indicates that both techniques for 'median-of-three' are known in the literature. (Update 2014-11-23: The article appears to be available at IEEE Xplore or from Wiley — if you have membership or are prepared to pay a fee.)
'Engineering a Sort Function' by J L Bentley and M D McIlroy, published in Software Practice and Experience, Vol 23(11), November 1993, goes into an extensive discussion of the issues, and they chose an adaptive partitioning algorithm based in part on the size of the data set. There is a lot of discussion of trade-offs for various approaches.
A Google search for 'median-of-three' works pretty well for further tracking.
Thanks for the information; I had only encountered the deterministic 'median-of-three' before.
Heh, I just taught this class.
There are several options.
Simple: Pick the first or last element of the range. (bad on partially sorted input)
Better: Pick the item in the middle of the range. (better on partially sorted input)
However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).
One improvement I've seen is to pick median(first, last, mid);
In the worst case, it can still go to O(n^2), but probabilistically, this is a rare case.
For most data, picking the first or last is sufficient. But if you find that you're running into worst-case scenarios often (partially sorted input), the first option would be to pick the central value (which is a statistically good pivot for partially sorted data).
If you're still running into problems, then go the median route.
Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst-case O(n^2) runtime, which is just asking for trouble. Quicksort's worst case runtime occurs when partitioning results in one array of 1 element, and one array of n-1 elements. Suppose you choose the first element as your partition. If someone feeds an array to your algorithm that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.
A better technique is the median-of-3 method, where you pick three elements at random, and choose the middle. You know that the element that you choose won't be the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, n log(n) time).
If you absolutely want to guarantee O(n log(n)) runtime for the algorithm, the median-of-medians ("columns of 5") method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be:
T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right)
By the Master Theorem, this is O(nlog(n)). However, the constant factor will be huge, and if worst case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average, and guarantees O(nlog(n)) time (and will be much faster than this lame median quicksort).
Explanation of the Median of Medians Algorithm
Don't try and get too clever and combine pivoting strategies. If you combined median of 3 with a random pivot by picking the median of the first, last and a random index in the middle, then you'll still be vulnerable to many of the distributions which send median of 3 quadratic (so it's actually worse than a plain random pivot).
E.g. for a pipe-organ distribution (1,2,3,...,N/2,...,3,2,1), first and last will both be 1 and the random index will be some number greater than 1; taking the median gives 1 (either first or last) and you get an extremely unbalanced partitioning.
It is easier to break the quicksort into three sections when doing this:
Exchange or swap data element function
The partition function
Processing the partitions
It is only slightly less efficient than one long function but is a lot easier to understand.
Code follows:
#include <stdlib.h>   /* for rand() */

/* This selects what the data type in the array to be sorted is */
#define DATATYPE long

/* This is the swap function .. your job is to swap the data in x & y .. how depends on
   the data type .. the example works for normal numerical data types .. like the long chosen
   above */
void swap(DATATYPE *x, DATATYPE *y) {
    DATATYPE Temp;
    Temp = *x;   // Hold current x value
    *x = *y;     // Transfer y to x
    *y = Temp;   // Set y to the held old x value
}

/* This is the partition code */
int partition(DATATYPE list[], int l, int h) {
    int i;
    int p;          // pivot element index
    int firsthigh;  // divider position for the pivot element

    // Random pivot shown; for the middle element you would use p = (l + h) / 2
    p = l + rand() % (h - l + 1);             // Random partition point
    swap(&list[p], &list[h]);                 // Move the pivot to the end
    firsthigh = l;                            // First high position starts at l
    for (i = l; i < h; i++)
        if (list[i] < list[h]) {              // Value at i is less than the pivot
            swap(&list[i], &list[firsthigh]); // So swap it into the low section
            firsthigh++;                      // Increment first high
        }
    swap(&list[h], &list[firsthigh]);         // Put the pivot between the two sections
    return firsthigh;                         // Return the pivot's final position
}

/* Finally the body sort */
void quicksort(DATATYPE list[], int l, int h) {
    int p;  // index of partition
    if ((h - l) > 0) {
        p = partition(list, l, h);   // Partition the list
        quicksort(list, l, p - 1);   // Sort the lower partition
        quicksort(list, p + 1, h);   // Sort the upper partition
    }
}
It is entirely dependent on how your data is sorted to begin with. If you think it will be pseudo-random then your best bet is to either pick a random selection or choose the middle.
If you are sorting a random-accessible collection (like an array), it's generally best to pick the physical middle item. With this, if the array is already sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.
If you are sorting something with only linear access (like a linked-list), then it's best to choose the first item, because it's the fastest item to access. Here, however, if the list is already sorted, you're screwed -- one partition will always be empty, and the other will have everything, producing the worst time.
However, for a linked-list, picking anything besides the first will just make matters worse. To pick the middle item in a linked-list, you'd have to step through it on each partition step -- adding an O(N/2) operation which is done log N times, making the total time O(1.5 N log N). And that's if we know how long the list is before we start -- usually we don't, so we'd have to step all the way through to count the elements, then step half-way through to find the middle, then step through a third time to do the actual partition: O(2.5 N log N).
Ideally the pivot should be the middle value in the entire array.
This will reduce the chances of getting worst case performance.
In a truly optimized implementation, the method for choosing pivot should depend on the array size - for a large array, it pays off to spend more time choosing a good pivot. Without doing a full analysis, I would guess "middle of O(log(n)) elements" is a good start, and this has the added bonus of not requiring any extra memory: Using tail-call on the larger partition and in-place partitioning, we use the same O(log(n)) extra memory at almost every stage of the algorithm.
Quicksort's complexity varies greatly with the selection of the pivot value. For example, if you always choose the first element as the pivot, the algorithm's complexity becomes as bad as O(n^2). Here is a smart method to choose the pivot element:
1. Choose the first, middle, and last elements of the array.
2. Compare these three numbers and find the one which is greater than one and smaller than the other, i.e. the median.
3. Make this element the pivot element.
Choosing the pivot by this method splits the array into two nearly equal halves, and hence the complexity reduces to O(n log(n)).
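A minimal sketch of that selection in C (my own illustration; it reuses the DATATYPE macro from the code above, and median_of_three is a made-up name):

/* Return the index of the median of list[l], list[mid], list[h]. */
int median_of_three(const DATATYPE list[], int l, int h) {
    int m = l + (h - l) / 2;
    DATATYPE a = list[l], b = list[m], c = list[h];
    if ((a <= b && b <= c) || (c <= b && b <= a)) return m;  /* middle value is at m */
    if ((b <= a && a <= c) || (c <= a && a <= b)) return l;  /* middle value is at l */
    return h;                                                /* otherwise it is at h */
}

In the partition function above you would then set p = median_of_three(list, l, h); instead of picking p at random.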
On the average, Median of 3 is good for small n. Median of 5 is a bit better for larger n. The ninther, which is the "median of three medians of three" is even better for very large n.
The higher you go with sampling the better you get as n increases, but the improvement dramatically slows down as you increase the samples. And you incur the overhead of sampling and sorting samples.
I recommend using the middle index, as it can be calculated easily.
You can calculate it by rounding (array.length / 2).
If you choose the first or the last element of the array, then there is a high chance that the pivot is the smallest or the largest element of the array, and that is bad.
Why?
Because in that case the number of elements smaller/larger than the pivot element is 0, and this will repeat as follows:
Consider an array of size n. Then,
n + (n - 1) + (n - 2) + ... + 1 = O(n^2)
Hence, the time complexity increases to O(n^2) from O(n log n). So I highly recommend using the median or a random element of the array as the pivot.
