Big Shot IT company interview puzzle - performance

This past week I attended a couple of interviews at a few big IT companies. one question that left me bit puzzled. below is an exact description of the problem.(from one of the interview questions website)
Given the data set,
A,B,A,C,A,B,A,D,A,B,A,C,A,B,A,E,A,B,A,C,A,B,A,D,A,B,A,C,A,B,A,F
which can be reduced to
(A; 16); (B; 8); (C; 4); (D; 2); (E; 1); (F; 1):
using the (value, frequency) format.
for a total of m of these tuples, stored in no specific order. Devise an O(m) algorithm that returns the kth order statistic of the data set. m is the number of tuples as opposed to n which is total number of elements in the data set.

You can use Quick-Select to solve this problem.
Naively:
Pick an element (called the pivot) from the array
Put things less than or equal to the pivot on the left of the array, those greater on the right.
If the pivot is in position k, then you're done. If it's greater than k, then repeat the algorithm on the left side of the array. If it's less than k, then repeat the algorithm on the right side of the array.
There's a couple of details:
You need to either pick the pivot randomly (if you're happy with expected O(m) as the cost), or use a deterministic median algorithm.
You need to be careful to not take O(m^2) time if there's lots of values equal to the pivot. One simple way to do this is to do a second pass to split the array into 3 parts rather than 2: those less than the pivot, those equal to the pivot, and those greater than the pivot.

Related

Algorithms for dividing an array into n parts

In a recent campus Facebook interview i have asked to divide an array into 3 equal parts such that the sum in each array is roughly equal to sum/3.My Approach1. Sort The Array2. Fill the array[k] (k=0) uptil (array[k]<=sum/3)3. After that increment k and repeat the above step for array[k]Is there any better algorithm for this or it is NP Hard Problem
This is a variant of the partition problem (see http://en.wikipedia.org/wiki/Partition_problem for details). In fact a solution to this can solve that one (take an array, pad with 0s, and then solve this problem) so this problem is NP hard.
There is a dynamic programming approach that is pseudo-polynomial. For each i from 0 to the size of the array, you keep track of all possible combinations of current sizes for the sub arrays, and their current sums. As long as there are a limited number possible sums of subsets of the array, this runs acceptably fast.
The solution that I would have suggested is to just go for "good enough" closeness. First let's consider the simpler problem with all values positive. Then sort by value descending. Take that array in threes. Build up the three subsets by always adding the largest of the triple to the one with the smallest sum, the smallest to the one with the largest, and the middle to the middle. You will end up dividing the array evenly, and the difference will be no more than the value of the third smallest element.
For the general case you can divide into positive and negative, use the above approach on each, and then brute force all combinations of a group of positives, a group of negatives, and the few leftover values in the middle that did not divide evenly.
Here are details on a dynamic programming solution if you are interested. The running time and memory usage is O(n*(sum)^2) where n is the size of your array and sum is the sum of absolute values of your array values. For each array index j from 1 to n, store all the possible values you can get for your 3 subset sums when you split the array from index 1 to j into 3 subsets. Also for each possibility, store one possible way to split the array to get the 3 sums. Then to extend this information for 1 to (j+1) given the information from 1 to j, simply take each possible combination of 3 sums for splitting 1 to j and form the 3 combinations of 3 sums you get when you choose to add the (j+1)th array element to any one of the 3 subsets. Finally, when you are done and reach j = n, go through the set of all combinations of 3 subset sums you can get when you split array positions 1 to n into 3 sets, and choose the one whose maximum deviation from sum/3 is minimized. At first this may seem like O(n*(sum)^3) complexity, but for each j and each combination of the first 2 subset sums, the 3rd subset sum is uniquely determined. (because you are not allowed to omit any elements of the array). Thus the complexity really is O(n*(sum)^2).

stack of piles visible from top view

This is an interview question.
Here are the notes arranged in the following manner as depicted in the image.
Given the starting and ending point of each note.
for eg. [2-5], [3-9], [7-100] on a scale of length limits 0-10^9
in this example all three notes will be visible.
we need to find out, when viewed from top, how many notes are visible??
I tried to solve in O(n*n) , where n is the number of notes by checking every note visibilty with every other note. but in this approach how will we determine if the two notes are in different stacks.
ultimately did not reached the solution.
O(n) solutions will be preferred as O(n) solution was demanded by interviewer
If the order of the notes in the input is "the former is on top" than its easy:
keep values of the min_x and the max_x, initializing it to the first note's x values. Iterate over the notes, each note that has x values either greater than max_x or lesser than min_x changes those respective value to its own x values and is considered visible, otherwise it is not. finish iterating and return the list of visible notes. collect the cash.
If O(n log n) is sufficient: first, remap all numbers in the input to between 0..(2*n+1) (that is, if a number x_i is the j-th smallest number among all numbers in the input, replace all x_i with j). You can then use Painter's algorithm on a segment tree.
Details:
Consider an array of size (2 * n + 1). Initialize all these cells with -1.
Painter's algorithm: Iterate the bank notes from the last one given (in the bottom) to the topmost one. For each note covering from a_i to b_i, replace the values of all cells in the array whose index is between a_i and b_i with i. At the end of this algorithm, we can simply look at which indexes are in the array and these form all the visible notes. However, naively this works in O(N^2).
Segment tree: So, instead of using an array, we use a segment tree. The operations above can then be done in O(log N).

Algorithm to generate a 'nearly sorted' or 'k sorted' list?

I want to generate some test data to test a function that merges 'k sorted' lists (lists where each element is at most k positions away from it's correct sorted position) into a single fully sorted list. I have an approach that works but I'm not sure how well randomized it is and I feel there should be a simpler / more elegant way to do this. My current approach:
Generate n random elements paired with an integer index.
Sort random elements.
Set paired index for each element to its sorted position.
Work backwards through the elements, swapping each element with an element a random distance between 1 and k positions behind it in the list. Only swap with the target element if its paired index is its current index (this avoids swapping an element that is already out of place and moving it further than k positions away from where it should be).
Copy the perturbed elements out into another list.
Like I say, this works but I'm interested in alternative / better approaches.
I think you could just fill an array with random integers and then run quicksort on it with a custom stopping condition.
If in a particular quicksort recursion your start and end indexes are less than k apart, then just return instead of continuing to recur.
Because of how quicksort works, every number in the start..end interval belongs somewhere in that region; worst case is that array[start] might really belong at array[end] (or vice versa) in truly sorted order. So, assuring that start and end are no more than k apart should be sufficient.
You can generate array of random numbers and then h-sort it like in shellsort, but without fiew last sorting steps when h is less then k.
Step 1: Randomly permute disjoint segments of length k. (Eg. 1 to K, k+1 to 2k ...)
Step 2: Permute conditionally again by swapping (that they don't break k-sorted assumption (1+t yo k+t, k+1+t to 1+2k+t ...) where t is a number between 1 and k (most preferably k/2)
Probably repeat step 2 multiple times with different t.
If I understand the problem, you want an algorithm to randomly pick a single k-sorted list of length n, uniformly selected from the universe U of all k-sorted lists of length n. (You will then run this algorithm m times to produce m lists as input test data.)
The first step is to count them. What is the size of U? |U|
The next step is to enumerate them. Create any one-to-one mapping F between the integers (1,2,...,|U|) and k-sorted lists of length n.
Then randomly select an integer x between 1 and |U| inclusive, and then apply F(x) to get the list.

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic up, which I have read, but it's dead, and no answer was ever accepted. On top of that, my interests lie in a slightly more constrained form of the question, with a couple practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(nlog(n)), where you approach each row separately.
O(nlog(n)) with strong best and average performance.
One that is O(n+m):
Start in a non-extreme corner, which we will assume is the bottom right.
Let the target be J. Cur Pos is M.
If M is greater than J, move left.
If M is less than J, move up.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
And I believe I've seen one with a worst-case O(n+m) but a optimal case of nearly O(log(n)).
What I am curious about:
Right now, I have proved to my satisfaction that naive partitioning attack always devolves to nlog(n). Partitioning attacks in general appear to have a optimal worst-case of O(n+m), and most do not terminate early in cases of absence. I was also wondering, as a result, if an interpolation probe might not be better than a binary probe, and thus it occurred to me that one might think of this as a set intersection problem with a weak interaction between sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicions that optimality of a O(N+M) worst case is provable, I thought I'd just go ahead and ask here, to see if anyone could bash together a counter-argument, or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be at least Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't do that. It is easy to fix, however, because you just pick a big number X=n*m, replace all 0s with unique numbers less than X, all 2s with unique numbers greater than X, and 1 with X.
And because it is also Omega(lg n) (counting argument), it is Omega(m + lg n) where n>=m.
An optimal O(m+n) solution is to start at the top-left corner, that has minimal value. Move diagonally downwards to the right until you hit an element whose value >= value of the given element. If the element's value is equal to that of the given element, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j denote the column index of the diagonal element we have stopped at. (Here, we have i = j, BTW). Let k = 1.
Repeat the below steps until i-k >= 0
Search if a[i-k][j] is equal to the given element. if yes, return found as true.
Search if a[i][j-k] is equal to the given element. if yes, return found as true.
Increment k
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11

Quicksort: Choosing the pivot

When implementing Quicksort, one of the things you have to do is to choose a pivot. But when I look at pseudocode like the one below, it is not clear how I should choose the pivot. First element of list? Something else?
function quicksort(array)
var list less, greater
if length(array) ≤ 1
return array
select and remove a pivot value pivot from array
for each x in array
if x ≤ pivot then append x to less
else append x to greater
return concatenate(quicksort(less), pivot, quicksort(greater))
Can someone help me grasp the concept of choosing a pivot and whether or not different scenarios call for different strategies.
Choosing a random pivot minimizes the chance that you will encounter worst-case O(n2) performance (always choosing first or last would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.
Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(N^2) performance. 'Median-of-three' (first, last, middle) is also a way of avoiding problems. Beware of relative performance of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing (a single pivot value) at random. Database records can be costly to compare.
Update: Pulling comments into answer.
mdkess asserted:
'Median of 3' is NOT first last middle. Choose three random indexes, and take the middle value of this. The whole point is to make sure that your choice of pivots is not deterministic - if it is, worst case data can be quite easily generated.
To which I responded:
Analysis Of Hoare's Find Algorithm With Median-Of-Three Partition (1997)
by P Kirschenhofer, H Prodinger, C Martínez supports your contention (that 'median-of-three' is three random items).
There's an article described at portal.acm.org that is about 'The Worst Case Permutation for Median-of-Three Quicksort' by Hannu Erkiö, published in The Computer Journal, Vol 27, No 3, 1984. [Update 2012-02-26: Got the text for the article. Section 2 'The Algorithm' begins: 'By using the median of the first, middle and last elements of A[L:R], efficient partitions into parts of fairly equal sizes can be achieved in most practical situations.' Thus, it is discussing the first-middle-last Mo3 approach.]
Another short article that is interesting is by M. D. McIlroy, "A Killer Adversary for Quicksort", published in Software-Practice and Experience, Vol. 29(0), 1–4 (0 1999). It explains how to make almost any Quicksort behave quadratically.
AT&T Bell Labs Tech Journal, Oct 1984 "Theory and Practice in the Construction of a Working Sort Routine" states "Hoare suggested partitioning around the median of several randomly selected lines. Sedgewick [...] recommended choosing the median of the first [...] last [...] and middle". This indicates that both techniques for 'median-of-three' are known in the literature. (Update 2014-11-23: The article appears to be available at IEEE Xplore or from Wiley — if you have membership or are prepared to pay a fee.)
'Engineering a Sort Function' by J L Bentley and M D McIlroy, published in Software Practice and Experience, Vol 23(11), November 1993, goes into an extensive discussion of the issues, and they chose an adaptive partitioning algorithm based in part on the size of the data set. There is a lot of discussion of trade-offs for various approaches.
A Google search for 'median-of-three' works pretty well for further tracking.
Thanks for the information; I had only encountered the deterministic 'median-of-three' before.
Heh, I just taught this class.
There are several options.
Simple: Pick the first or last element of the range. (bad on partially sorted input)
Better: Pick the item in the middle of the range. (better on partially sorted input)
However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).
One improvement I've seen is pick median(first, last, mid);
In the worst case, it can still go to O(n^2), but probabilistically, this is a rare case.
For most data, picking the first or last is sufficient. But, if you find that you're running into worst case scenarios often (partially sorted input), the first option would be to pick the central value( Which is a statistically good pivot for partially sorted data).
If you're still running into problems, then go the median route.
Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst case O(n2) runtime, which is just asking for trouble. Quicksort's worst case runtime occurs when partitioning results in one array of 1 element, and one array of n-1 elements. Suppose you choose the first element as your partition. If someone feeds an array to your algorithm that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.
A better technique is the median-of-3 method, where you pick three elements at random, and choose the middle. You know that the element that you choose won't be the the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, nlog(n) time).
If you absolutely want to guarantee O(nlog(n)) runtime for the algorithm, the columns-of-5 method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be:
T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right)
By the Master Theorem, this is O(nlog(n)). However, the constant factor will be huge, and if worst case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average, and guarantees O(nlog(n)) time (and will be much faster than this lame median quicksort).
Explanation of the Median of Medians Algorithm
Don't try and get too clever and combine pivoting strategies. If you combined median of 3 with random pivot by picking the median of the first, last and a random index in the middle, then you'll still be vulnerable to many of the distributions which send median of 3 quadratic (so its actually worse than plain random pivot)
E.g a pipe organ distribution (1,2,3...N/2..3,2,1) first and last will both be 1 and the random index will be some number greater than 1, taking the median gives 1 (either first or last) and you get an extermely unbalanced partitioning.
It is easier to break the quicksort into three sections doing this
Exchange or swap data element function
The partition function
Processing the partitions
It is only slightly more inefficent than one long function but is alot easier to understand.
Code follows:
/* This selects what the data type in the array to be sorted is */
#define DATATYPE long
/* This is the swap function .. your job is to swap data in x & y .. how depends on
data type .. the example works for normal numerical data types .. like long I chose
above */
void swap (DATATYPE *x, DATATYPE *y){
DATATYPE Temp;
Temp = *x; // Hold current x value
*x = *y; // Transfer y to x
*y = Temp; // Set y to the held old x value
};
/* This is the partition code */
int partition (DATATYPE list[], int l, int h){
int i;
int p; // pivot element index
int firsthigh; // divider position for pivot element
// Random pivot example shown for median p = (l+h)/2 would be used
p = l + (short)(rand() % (int)(h - l + 1)); // Random partition point
swap(&list[p], &list[h]); // Swap the values
firsthigh = l; // Hold first high value
for (i = l; i < h; i++)
if(list[i] < list[h]) { // Value at i is less than h
swap(&list[i], &list[firsthigh]); // So swap the value
firsthigh++; // Incement first high
}
swap(&list[h], &list[firsthigh]); // Swap h and first high values
return(firsthigh); // Return first high
};
/* Finally the body sort */
void quicksort(DATATYPE list[], int l, int h){
int p; // index of partition
if ((h - l) > 0) {
p = partition(list, l, h); // Partition list
quicksort(list, l, p - 1); // Sort lower partion
quicksort(list, p + 1, h); // Sort upper partition
};
};
It is entirely dependent on how your data is sorted to begin with. If you think it will be pseudo-random then your best bet is to either pick a random selection or choose the middle.
If you are sorting a random-accessible collection (like an array), it's general best to pick the physical middle item. With this, if the array is all ready sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.
If you are sorting something with only linear access (like a linked-list), then it's best to choose the first item, because it's the fastest item to access. Here, however,if the list is already sorted, you're screwed -- one partition will always be null, and the other have everything, producing the worst time.
However, for a linked-list, picking anything besides the first, will just make matters worse. It pick the middle item in a listed-list, you'd have to step through it on each partition step -- adding a O(N/2) operation which is done logN times making total time O(1.5 N *log N) and that's if we know how long the list is before we start -- usually we don't so we'd have to step all the way through to count them, then step half-way through to find the middle, then step through a third time to do the actual partition: O(2.5N * log N)
Ideally the pivot should be the middle value in the entire array.
This will reduce the chances of getting worst case performance.
In a truly optimized implementation, the method for choosing pivot should depend on the array size - for a large array, it pays off to spend more time choosing a good pivot. Without doing a full analysis, I would guess "middle of O(log(n)) elements" is a good start, and this has the added bonus of not requiring any extra memory: Using tail-call on the larger partition and in-place partitioning, we use the same O(log(n)) extra memory at almost every stage of the algorithm.
Quick sort's complexity varies greatly with the selection of pivot value. for example if you always choose first element as an pivot, algorithm's complexity becomes as worst as O(n^2). here is an smart method to choose pivot element-
1. choose the first, mid, last element of the array.
2. compare these three numbers and find the number which is greater than one and smaller than other i.e. median.
3. make this element as pivot element.
choosing the pivot by this method splits the array in nearly two half and hence the complexity
reduces to O(nlog(n)).
On the average, Median of 3 is good for small n. Median of 5 is a bit better for larger n. The ninther, which is the "median of three medians of three" is even better for very large n.
The higher you go with sampling the better you get as n increases, but the improvement dramatically slows down as you increase the samples. And you incur the overhead of sampling and sorting samples.
I recommend using the middle index, as it can be calculated easily.
You can calculate it by rounding (array.length / 2).
If you choose the first or the last element in the array, then there are high chance that the pivot is the smallest or the largest element of the array and that is bad.
Why?
Because in that case the number of element smaller / larger than the pivot element in 0. and this will repeat as follow :
Consider the size of the array n.Then,
(n) + (n - 1) + (n - 2) + ......+ 1 = O(n^2)
Hence, the time complexity increases to O(n^2) from O(nlogn). So, I highly recommend to use median / random element of the array as the pivot.

Resources