How do I find the k-nearest values in n-dimensional space? - computational-geometry

I read about kd-trees, but they are inefficient when the dimensionality of the space is high. I have a database of values and I want to find the values that are within a certain Hamming distance of the query. For instance, the database is a list of 32-bit numbers and I want to find all numbers that differ from the query value by fewer than 3 bits.
I heard somewhere about MultiVariate Partition trees but couldn't find a good reference. I know that min-Hash gives a good approximation, but I'd like an exact answer.

Hamming distance is closely related to Levenshtein distance, and this problem is similar to the ones solved by spelling-correction algorithms.
A method that works is branch-and-bound search in a trie. For small distances it takes time exponential in the distance, growing up to linear in the dictionary size.
If the dictionary consists of binary words stored in a binary trie, and the metric is strict Hamming distance, here is some simple pseudo-code:
walk(trie, word, i, hit, budget){
  if (budget < 0 || i > word.length) return;
  if (trie==NULL){
    if (i==word.length) print hit;
    return;
  }
  hit[i] = 0;
  walk(trie.subtrie[0], word, i+1, hit, (word[i]==0 ? budget : budget-1));
  hit[i] = 1;
  walk(trie.subtrie[1], word, i+1, hit, (word[i]==1 ? budget : budget-1));
}
main(){
  for (int budget = 0; ; budget++){
    walk(trie, word, 0, hit, budget);
    /* quit if enough hits have been printed */
  }
}
The idea is you walk the entire trie, keeping track of the distance between the current trie node and the original word. You prune the search by having a budget of how much distance you will tolerate. This works because the distance can never decrease as you go deeper into the trie.
Then you do this repeatedly with budgets starting at zero and increasing in steps until you print out the hits you want. Since each walk covers so many fewer nodes than the subsequent walk, it doesn't hurt that you're doing multiple walks. If k is fixed, you can simply start out with that as your budget.
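For a fixed budget, the walk can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the trie is a nested dict keyed by bit (my choice, not something from the answer), words are non-negative integers of nbits bits, and the names build_trie, walk and within_hamming are made up for the example.

```python
def build_trie(words, nbits):
    """Store each word bit by bit (most significant bit first) in nested dicts."""
    root = {}
    for w in words:
        node = root
        for i in range(nbits - 1, -1, -1):
            node = node.setdefault((w >> i) & 1, {})
    return root


def walk(node, query, nbits, depth, prefix, budget, out):
    """Collect every stored word whose Hamming distance to query fits the budget."""
    if budget < 0:
        return  # prune: the distance can only grow further down the trie
    if depth == nbits:
        out.append(prefix)  # a complete word within budget
        return
    qbit = (query >> (nbits - 1 - depth)) & 1
    for bit in (0, 1):
        child = node.get(bit)
        if child is not None:
            # a mismatching bit costs one unit of budget
            walk(child, query, nbits, depth + 1,
                 (prefix << 1) | bit, budget - (bit != qbit), out)


def within_hamming(words, query, max_dist, nbits=32):
    out = []
    walk(build_trie(words, nbits), query, nbits, 0, 0, max_dist, out)
    return sorted(out)
```

To get the k nearest values rather than everything within a fixed distance, one would call within_hamming with max_dist = 0, 1, 2, ... until at least k hits come back, reusing the trie between calls instead of rebuilding it.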

Related

Best case of fractional knapsack

The worst case running time of fractional knapsack is O(n), so what should its best case be? Is it O(1), for instance if the weight limit is 16 and you get the first item having value? Is that right?
True, if you assume that the input is given in sorted order of value per unit weight.
But as per the definition, the algorithm is expected to take non-sorted input too. See this.
If you are considering a normal input that may or may not be sorted, then there are two approaches to solve the problem:
Sort the input. This cannot take less than O(n) even in the best case, and only bubble/insertion sort reach that bound, on already-sorted input. Relying on that looks foolish, because both of these sorting algorithms have O(n^2) average/worst-case performance.
Use the weighted medians approach. That will cost you O(n), as finding the weighted median takes O(n). The code for this approach is given below.
Weighted median approach for fractional knapsack:
We will work on value per unit weight of each item in the following code. The code first finds the middle value (i.e. the middle of the values per unit weight if they were given in sorted order) and places it in its correct position. We use the quicksort partition method for this. Once we have the middle element (call it mid), two cases need to be considered:
When the sum of the weights of all items on the right side of mid is more than W, we need to search for our answer in the right side of mid.
Otherwise, take everything on the right side of mid: sum its values (call it v1) and its weights (call it w1), then search for W-w1 in the left side of mid (including mid) and add v1 to the result.
Following is the implementation in Python (use only floating point numbers everywhere):
Please note that I am not providing production-level code, and there are cases that will fail as well. Think about what can cause the worst case/failure when finding the kth max in an array (when all values are the same, maybe).
def partition(weights, values, start, end):
    # Lomuto partition on value per unit weight, with the last item as pivot.
    x = values[end] / weights[end]
    i = start
    for j in range(start, end):
        if values[j] / weights[j] < x:
            values[i], values[j] = values[j], values[i]
            weights[i], weights[j] = weights[j], weights[i]
            i += 1
    values[i], values[end] = values[end], values[i]
    weights[i], weights[end] = weights[end], weights[i]
    return i

def _find_kth(weights, values, start, end, k):
    # Put the k-th smallest ratio (1-based, within start..end) at its sorted position.
    ind = partition(weights, values, start, end)
    if ind - start == k - 1:
        return ind
    if ind - start > k - 1:
        return _find_kth(weights, values, start, ind - 1, k)
    # Skip the (ind - start + 1) items at or left of the pivot.
    return _find_kth(weights, values, ind + 1, end, k - (ind - start) - 1)

def find_kth(weights, values, k):
    return _find_kth(weights, values, 0, len(weights) - 1, k)

def fractional_knapsack(weights, values, w):
    if w <= 0 or len(weights) == 0:
        return 0
    if len(weights) == 1:
        # Take the whole item if it fits, otherwise only the fraction that fits.
        return values[0] if weights[0] <= w else w * (values[0] / weights[0])
    mid = find_kth(weights, values, len(weights) // 2)
    w1 = sum(weights[mid + 1:])
    v1 = sum(values[mid + 1:])
    if w1 > w:
        return fractional_knapsack(weights[mid + 1:], values[mid + 1:], w)
    return v1 + fractional_knapsack(weights[:mid + 1], values[:mid + 1], w - w1)
(Editing and rewriting the answer after discussion with @Shasha99, since I feel answers before 2016-12-06 are a bit deceiving)
Summary
O(1) best case is possible if the items are already sorted. Otherwise best case is O(n).
Discussion
If the items are not sorted, you need to find the best item (for the case where one item already fills the knapsack), and that alone will take O(n), since you have to check all of them. Therefore, best case O(n).
On the opposite end, you could have a knapsack where all the items fit. Searching for best would not be needed, but you need to put all of them in, so it's still O(n).
More analysis
Funnily enough, an O(n) worst case does not require the items to be sorted.
The idea apparently comes from http://algo2.iti.kit.edu/sanders/courses/algdat03/sol12.pdf, paired with a fast median selection algorithm (weighted medians or maybe median of medians?). Thanks to @Shasha99 for finding this algorithm.
Note that plain quickselect is O(n) expected and O(n*n) worst case, but if you use median-of-medians it becomes O(n) worst case. The downside is that the algorithm is quite complicated.
I'd be interested in a working implementation of any algorithm. More sources to (hopefully simple) algorithms also wouldn't hurt.
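As a point of comparison, here is a minimal sketch of the plain sort-based greedy. It is O(n log n) because of the sort, so it is not the linear-time algorithm discussed above, but it is simple enough to serve as a reference when testing a faster version. The function name and argument order are my own choices for this example.

```python
def fractional_knapsack_sorted(weights, values, capacity):
    """Greedy reference: take items in decreasing order of value per unit weight."""
    items = sorted(zip(weights, values),
                   key=lambda wv: wv[1] / wv[0], reverse=True)
    total = 0.0
    remaining = capacity
    for weight, value in items:
        if remaining <= 0:
            break
        take = min(weight, remaining)     # whole item, or the fraction that fits
        total += value * (take / weight)
        remaining -= take
    return total
```

On input that is already sorted by value per unit weight, the loop stops as soon as the knapsack is full, which is where the O(1) best case for sorted input comes from.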

Write a program to find 100 largest numbers out of an array of 1 billion numbers

I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."
I was only able to give a brute force solution which was to sort the array in O(nlogn) time complexity and take the last 100 numbers.
Arrays.sort(array);
The interviewer was looking for a better time complexity, I tried a couple of other solutions but failed to answer him. Is there a better time complexity solution?
You can keep a priority queue of the 100 biggest numbers, iterate through the 1 billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.
A priority queue implemented with a heap has insert + delete complexity of O(log K). (Where K = 100, the number of elements to find. N = 1 billion, the number of total elements in the array).
In the worst case you get billion*log2(100), which is better than billion*log2(billion) for an O(N log N) comparison-based sort (see footnote 1).
In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N). This can be very significant when K is very small compared to N.
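A minimal Python sketch of this bounded priority queue, using the standard heapq module (a min-heap, so heap[0] is the smallest of the K numbers kept so far). The function name k_largest is just for this example.

```python
import heapq


def k_largest(numbers, k):
    """Return the k largest items of an iterable, largest first, in O(N log K)."""
    if k <= 0:
        return []
    heap = []
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)        # still filling the queue
        elif x > heap[0]:
            heapq.heapreplace(heap, x)     # drop the current minimum, add x
    return sorted(heap, reverse=True)
```

Only K numbers are held in memory at any time, so the input can be streamed from a file or a generator.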
The expected time of this priority queue algorithm is pretty interesting, since in each iteration an insertion may or may not occur.
The probability of the i'th number being inserted into the queue is the probability of a random variable being larger than at least i-K random variables from the same distribution (the first K numbers are automatically added to the queue). We can use order statistics to calculate this probability.
For example, let's assume the numbers were selected uniformly at random from [0, 1]. The expected value of the (i-K)th number (out of i numbers) is (i-K)/i, and the chance of a random variable being larger than this value is 1-[(i-K)/i] = K/i.
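Thus, the expected number of insertions is:
\sum_{i=K+1}^{N} K/i = K (H_N - H_K) \approx K \ln(N/K), where H_m is the m-th harmonic number.
And the expected running time can be expressed as:
K + (N - K) + (\log(K)/2) \cdot K \ln(N/K)
(K time to generate the queue with the first K elements, then N-K comparisons, and the expected number of insertions as described above, each of which takes log(K)/2 time on average)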
Note that when N is very large compared to K, this expression is a lot closer to N than to N log K. This is somewhat intuitive: in the case of the question, even after 10,000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is very small.
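To put a number on this, here is a small sketch that evaluates the expected insertion count under the same random-order assumption, where the i'th number is inserted with probability k/i. Both function names are made up for this example; the second one uses the approximation k*ln(n/k) of the harmonic sum.

```python
import math


def expected_insertions_exact(n, k):
    """Sum of k/i for i = k+1..n (only practical for moderate n)."""
    return sum(k / i for i in range(k + 1, n + 1))


def expected_insertions_approx(n, k):
    """Closed-form approximation k * ln(n / k) of the same sum."""
    return k * math.log(n / k)
```

For n = 10^9 and k = 100 the approximation gives roughly 1,600 expected insertions after the first 100, which is negligible next to the billion comparisons against the head of the queue.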
But we don't know that the array values are uniformly distributed. They might trend towards increasing, in which case most or all numbers will be new candidates for the set of 100 largest numbers seen. The worst case for this algorithm is O(N log K).
Or if they trend towards decreasing, most of the largest 100 numbers will be very early, and our best-case run time is essentially O(N + K log K), which is just O(N) for K much smaller than N.
Footnote 1: O(N) integer sorting / histogramming
Counting Sort or Radix Sort are both O(N), but often have larger constant factors that make them worse than comparison sorts in practice. In some special cases they're actually quite fast, primarily for narrow integer types.
For example, Counting Sort does well if the numbers are small. 16-bit numbers would only need an array of 2^16 counters. And instead of actually expanding back into a sorted array, you could just scan the histogram you build as part of Counting Sort.
(After histogramming an array, you can quickly answer queries for any order statistic, e.g. the 99 largest numbers, or the 200th to 100th largest numbers.) 32-bit numbers would scatter the counts over a much larger array or hash table of counters, potentially needing 16 GiB of memory (4 bytes for each of 2^32 counters). And on real CPUs, this would probably cause lots of TLB and cache misses, unlike an array of 2^16 elements, where L2 cache would typically hit.
Similarly, Radix Sort could look at only the top buckets after a first pass. But the constant factors may still be larger than log K, depending on K.
Note that each counter must be large enough not to overflow even if all N integers are duplicates. 1 billion is somewhat below 2^30, so a 30-bit unsigned counter would be sufficient. And a 32-bit signed or unsigned integer is just fine.
If you had many more, you might need 64-bit counters, taking twice the memory footprint to initialize to zero and to randomly access. Or a sentinel value for the few counters that overflow a 16 or 32-bit integer, to indicate that the rest of the count is elsewhere (in a small dictionary such as a hash table mapping to 64-bit counters).
If this is asked in an interview, the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.
The description is quite general, so maybe you can ask him about the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stand for people's ages, then it's a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an integer array of size 200 (maybe 201) to count the number of people with the same age in just one iteration. Here the index means the age. After this it's a piece of cake to find the 100 largest numbers. By the way, this algorithm is called counting sort.
Anyway, making the question more specific and clearer is good for you in an interview.
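The age example can be sketched like this. It assumes the values are integers in the range 0..max_value (200 here, as in the answer); the function name is made up for illustration.

```python
def k_largest_small_range(values, k, max_value=200):
    """Counting-sort style top-k for integers known to lie in 0..max_value."""
    counts = [0] * (max_value + 1)        # index = age, entry = how many people
    for v in values:
        counts[v] += 1
    result = []
    for v in range(max_value, -1, -1):    # walk the histogram from the top
        take = min(counts[v], k - len(result))
        result.extend([v] * take)
        if len(result) == k:
            break
    return result
```

The input is read exactly once and the histogram has only 201 entries, so the work is one counter increment per number plus a tiny final scan.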
You can iterate over the numbers which takes O(n)
Whenever you find a value greater than the current minimum, add the new value to a circular queue with size 100.
The min of that circular queue is your new comparison value. Keep on adding to that queue. If full, extract the minimum from the queue.
I realized that this is tagged with 'algorithm', but will toss out some other options, since it probably should also be tagged 'interview'.
What is the source of the 1 billion numbers? If it is a database then 'select value from table order by value desc limit 100' would do the job quite nicely - there might be dialect differences.
Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed) | head -100' will have you quickly doing productive work that you are getting paid to do while the computer handles this trivial chore.
If it is repeated, you would advise picking any decent approach to get the initial answer and store / cache the results so that you could continuously be able to report the top 100.
Finally, there is this consideration. Are you looking for an entry level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.
Better luck on the next interview.
My immediate reaction for this would be to use a heap, but there is a way to use QuickSelect without keeping all of the input values on hand at any one time.
Create an array of size 200 and fill it up with the first 200 input values. Run QuickSelect and discard the low 100, leaving you with 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run though the entire input in batches of 100.
At the end you have the top 100 values. For N values you have run QuickSelect roughly N/100 times. Each Quickselect cost about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter size that I am hardwiring to be 100 in this explanation.
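A rough Python sketch of this batching scheme. The selection step here is a simple list-based quickselect with a random pivot, which is my own choice for readability rather than an in-place version, and both function names are made up for the example.

```python
import random


def top_k_quickselect(items, k):
    """Return the k largest items (in no particular order), expected linear time."""
    if k <= 0:
        return []
    if len(items) <= k:
        return list(items)
    pivot = random.choice(items)
    greater = [x for x in items if x > pivot]
    equal = [x for x in items if x == pivot]
    if k <= len(greater):
        return top_k_quickselect(greater, k)
    if k <= len(greater) + len(equal):
        return greater + equal[:k - len(greater)]
    less = [x for x in items if x < pivot]
    return greater + equal + top_k_quickselect(less, k - len(greater) - len(equal))


def top_k_stream(stream, k):
    """Keep a buffer of at most 2k items; halve it whenever it fills up."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == 2 * k:
            buf = top_k_quickselect(buf, k)   # discard the low k
    return top_k_quickselect(buf, k)
```

Each selection touches about 2k items and happens once per k inputs, which matches the "2N times some constant" estimate above.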
You can use the Quickselect algorithm to find the number at index [billion-100] in sorted order (0-based), i.e. the 100th largest,
and then iterate over the numbers to find those that are greater than or equal to that number.
array={...the billion numbers...}
result[100];
pivot=QuickSelect(array,billion-100);//O(N)
for(i=0;i<billion;i++)//O(N)
    if(array[i]>=pivot)
        result.add(array[i]);//with duplicates of pivot, stop once result holds 100
This algorithm Time is: 2 X O(N) = O(N) (Average case performance)
The second option like Thomas Jungblut suggest is:
Use a heap. Building the max-heap takes O(N); the 100 largest numbers are then at the top of the heap, and all you need to do is extract them from the heap (100 X O(Log(N))).
This algorithm Time is:O(N) + 100 X O(Log(N)) = O(N)
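A minimal sketch of the heap option in Python. The standard heapq module only provides a min-heap, so this example negates the values to simulate a max-heap; that trick and the function name are my own choices, and it assumes the values are numbers.

```python
import heapq


def k_largest_heapify(numbers, k):
    """Build a heap of all N items in O(N), then pop the k largest."""
    heap = [-x for x in numbers]          # negate so the largest value is on top
    heapq.heapify(heap)                   # O(N) bottom-up heap construction
    return [-heapq.heappop(heap) for _ in range(min(k, len(heap)))]
```

Unlike the bounded queue of size 100, this keeps all N numbers in the heap at once, so it needs O(N) memory.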
Although the other quickselect solution has been downvoted, the fact remains that quickselect will find the solution faster than using a queue of size 100. Quickselect has an expected running time of 2n + o(n), in terms of comparisons. A very simple implementation would be
array = input array of length n
r = Quickselect(array,n-100)
result = array of length 100
for(i = 1 to n)
    if(array[i]>r)
        add array[i] to result
This will take 3n + o(n) comparisons on average. Moreover, it can be made more efficient using the fact that quickselect will leave the largest 100 items in the array in the 100 right-most locations. So in fact, the running time can be improved to 2n+o(n).
There is the issue that this is expected running time, and not worst case, but by using a decent pivot selection strategy (e.g. pick 21 elements at random, and choose the median of those 21 as pivot), then the number of comparisons can be guaranteed with high probability to be at most (2+c)n for an arbitrarily small constant c.
In fact, by using an optimized sampling strategy (e.g. sample sqrt(n) elements at random, and choose the 99th percentile), the running time can be gotten down to (1+c)n + o(n) for arbitrarily small c (assuming that K, the number of elements to be selected is o(n)).
On the other hand, using a queue of size 100 will require O(log(100)n) comparisons, and log base 2 of 100 is approximately equal to 6.6.
If we think of this problem in the more abstract sense of choosing the largest K elements from an array of size N, where K=o(N) but both K and N go to infinity, then the running time of the quickselect version will be O(N) and the queue version will be O(N log K), so in this sense quickselect is also asymptotically superior.
In comments, it was mentioned that the queue solution will run in expected time N + K log N on a random input. Of course, the random input assumption is never valid unless the question states it explicitly. The queue solution could be made to traverse the array in a random order, but this will incur the additional cost of N calls to a random number generator as well as either permuting the entire input array or else allocating a new array of length N containing the random indices.
If the problem doesn't allow you to move around the elements in the original array, and the cost of allocating memory is high so duplicating the array is not an option, that is a different matter. But strictly in terms of running time, this is the best solution.
Take the first 100 numbers of the billion and sort them. Now just iterate through the billion; if the source number is higher than the smallest of the 100, insert it in sort order. What you end up with is something much closer to O(n) over the size of the set.
Two options:
(1) Heap (priorityQueue)
Maintain a min-heap of size 100. Traverse the array. Whenever an element is larger than the first (smallest) element in the heap, replace that element with it.
Insert element into heap: O(log100)
Compare with the first element: O(1)
There are n elements in the array, so the total would be O(nlog100), which is O(n)
(2) Map-reduce model.
This is very similar to word count example in hadoop.
Map job: count every element's frequency or times appeared.
Reduce: Get top K element.
Usually, I would give the recruiter both answers and let them pick whichever they like. Of course, coding the map-reduce version would be laborious because you have to know every exact parameter. No harm in practicing it.
Good Luck.
A very easy solution would be to iterate through the array 100 times, which is O(n).
Each time you pull out the largest number (and change its value to the minimum value, so that you don't see it in the next iteration, or keep track of indexes of previous answers (by keeping track of indexes the original array can have multiple of the same number)). After 100 iterations, you have the 100 largest numbers.
The simple solution would be using a priority queue: add the first 100 numbers to the queue and keep track of the smallest number in the queue, then iterate through the remaining numbers. Each time we find one that is larger than the smallest number in the priority queue, we remove the smallest number, add the new number, and again keep track of the smallest number in the queue.
If the numbers were in random order, this would work beautifully, because as we iterate through a billion random numbers it would be very rare for the next number to be among the 100 largest so far. But the numbers might not be random. If the array was already sorted in ascending order, then we would always insert an element into the priority queue.
So we pick say 100,000 random numbers from the array first. To avoid random access which might be slow, we add say 400 random groups of 250 consecutive numbers. With that random selection, we can be quite sure that very few of the remaining numbers are in the top hundred, so the execution time will be very close to that of a simple loop comparing a billion numbers to some maximum value.
This question can be answered with N log(100) complexity (instead of N log N) with just one line of C++ code.
std::vector<int> myvector = ...; // Define your 1 billion numbers.
// Assumed integer just for concreteness
// std::greater (from <functional>) orders descending, so the largest come first
std::partial_sort (myvector.begin(), myvector.begin()+100, myvector.end(), std::greater<int>());
The final answer would be a vector where the first 100 elements are guaranteed to be the 100 biggest numbers of your array, while the remaining elements are unordered.
C++ STL (standard library) is quite handy for this kind of problems.
Note: I am not saying that this is the optimal solution, but it would have saved your interview.
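For readers who do not use C++, the Python standard library offers a comparable one-liner in heapq.nlargest. The wrapper function below is only there to give the example a name.

```python
import heapq


def k_largest_stdlib(numbers, k=100):
    """Return the k largest items of any iterable, largest first."""
    return heapq.nlargest(k, numbers)
```

It accepts any iterable, so the numbers do not have to sit in a list first.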
I see a lot of O(N) discussions, so I propose something different just for the thought exercise.
Is there any known information about the nature of these numbers? If it's random in nature, then go no further and look at the other answers. You won't get any better results than they do.
However! See if whatever list-populating mechanism populated that list in a particular order. Are they in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If that is so, for example if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set like perhaps an incidence of insider trading or equipment failure, or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe, you can reduce the number of records you have to check significantly.
There's some food for thought anyway. Maybe this will help you give future interviewers a thoughtful answer. I know I would be impressed if someone asked me such a question in response to a problem like this - it would tell me that they are thinking of optimization. Just recognize that there may not always be a possibility to optimize.
Inspired by @ron teller's answer, here is a barebones C program to do what you want.
#include <stdlib.h>
#include <stdio.h>

#define TOTAL_NUMBERS 1000000000
#define N_TOP_NUMBERS 100

/* qsort comparator: ascending order, so top100[0] is always the minimum. */
int
compare_function(const void *first, const void *second)
{
    int a = *((const int *) first);
    int b = *((const int *) second);
    if (a > b){
        return 1;
    }
    if (a < b){
        return -1;
    }
    return 0;
}

int
main(int argc, char ** argv)
{
    if(argc != 2){
        printf("please supply a path to a binary file containing 1000000000 "
               "integers of this machine's wordlength and endianness\n");
        exit(1);
    }
    FILE * f = fopen(argv[1], "rb");  /* binary mode */
    if(!f){
        exit(1);
    }
    /* Initialised to 0, so negative inputs can never enter the top 100. */
    int top100[N_TOP_NUMBERS] = {0};
    int sorts = 0;
    for (int i = 0; i < TOTAL_NUMBERS; i++){
        int number;
        size_t ok;
        ok = fread(&number, sizeof(int), 1, f);
        if(!ok){
            printf("not enough numbers!\n");
            break;
        }
        if(number > top100[0]){
            /* Overwrite the current minimum, then re-sort the 100 entries. */
            sorts++;
            top100[0] = number;
            qsort(top100, N_TOP_NUMBERS, sizeof(int), compare_function);
        }
    }
    printf("%d sorts made\n"
           "the top 100 integers in %s are:\n",
           sorts, argv[1] );
    for (int i = 0; i < N_TOP_NUMBERS; i++){
        printf("%d\n", top100[i]);
    }
    fclose(f);
    exit(0);
}
On my machine (core i3 with a fast SSD) it takes 25 seconds, and 1724 sorts.
I generated a binary file with dd if=/dev/urandom count=1000000000 bs=1 for this run (note that this writes 10^9 bytes, which is fewer than 10^9 four-byte integers).
Obviously, there are performance issues with reading only 4 bytes at a time from disk, but this is for example's sake. On the plus side, very little memory is needed.
You can do it in O(n) time. Just iterate through the list and keep track of the 100 biggest numbers you've seen at any given point and the minimum value in that group. When you find a new number bigger than the smallest of your 100, replace it and update the min value of the 100 (it may take a constant time of 100 to determine this each time you do it, but this does not affect the overall analysis).
The simplest solution is to scan the billion-number array and hold the 100 largest values found so far in a small array buffer, without any sorting, while remembering the smallest value in this buffer. At first I thought this method was proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure was implemented as a heap. Whenever a new number is found that is larger than the minimum in the buffer, the minimum is overwritten by the new value and the buffer is searched for its current minimum again. If the numbers in the billion-number array are randomly distributed, then most of the time the value from the large array is compared to the minimum of the small array and discarded. Only for a very small fraction of the numbers must the value be inserted into the small array. So the cost of manipulating the data structure holding the 100 numbers can be neglected. For such a small number of elements it is hard to determine whether using a priority queue is actually faster than my naive approach.
I want to estimate the number of inserts into the small 100-element array buffer when the 10^9-element array is scanned. The program scans the first 1000 elements of this large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, that is 0.1 of the elements scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1. Such an element has to be inserted into the buffer. Now the program scans the next 10^4 elements from the large array. We estimated that the ratio of elements larger than our current minimum is about 0.1, so there are 0.1*10^4=1000 elements to insert. Actually the expected number of elements inserted into the buffer will be smaller, because the minimum of the buffer increases every time a new element is inserted. After the scan of these 10^4 elements, the numbers in the buffer make up about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that not more than 0.01*10^5=1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array.
So when scanning an array with 10^9 elements of random size we expect not more than 10^4 (=7000 rounded up) insertions into the buffer. After each insertion into the buffer the new minimum must be found. If the buffer is a simple array we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap) we need at least 1 comparison to find the minimum. To compare the elements of the large array we need 10^9 comparisons. So all in all we need about 10^9+100*10^4=1.001 * 10^9 comparisons when using an array as buffer and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap). So using a heap brings a gain of only 0.1% if performance is determined by the number of comparisons.
But what is the difference in execution time between inserting an element in a 100 element heap and replacing an element in an 100 element array and finding its new minimum?
At the theoretical level: How many comparisons are needed for inserting into a heap? I know it is O(log(n)), but how large is the constant factor?
At the machine level: What is the impact of caching and branch prediction on the execution time of a heap insert and a linear search in an array.
At the implementation level: What additional costs are hidden in a heap data structure supplied by a library or a compiler?
I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100 element heap or a 100 element array. So it would make sense to make an experiment and measure the real performance.
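As a starting point for such an experiment, here is a small sketch of the plain-array buffer that also counts how often the buffer is modified. The function name and the returned pair are my own choices; timing it against a heap-based version is left to the reader.

```python
from itertools import islice


def scan_with_array_buffer(numbers, k):
    """Keep the k largest items in an unsorted list; return (buffer, insert count)."""
    it = iter(numbers)
    buf = list(islice(it, k))             # the first k items fill the buffer
    if len(buf) < k:
        return buf, 0
    min_pos = min(range(k), key=buf.__getitem__)
    inserts = 0
    for x in it:
        if x > buf[min_pos]:
            buf[min_pos] = x              # overwrite the current minimum
            min_pos = min(range(k), key=buf.__getitem__)  # linear re-scan
            inserts += 1
    return buf, inserts
```

Running it on randomly ordered data of increasing size shows how slowly the insert count grows compared with the number of elements scanned.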
Although in this question we should search for top 100 numbers, I will
generalize things and write x. Still, I will treat x as constant value.
Algorithm Biggest x elements from n:
I will call return value LIST. It is a set of x elements (in my opinion that should be linked list)
First x elements are taken from pool "as they come" and sorted in LIST (this is done in constant time since x is treated as constant - O( x log(x) ) time)
For every element that comes next we check if it is bigger than smallest element in LIST and if is we pop out the smallest and insert current element to LIST. Since that is ordered list every element should find its place in logarithmic time (binary search) and since it is ordered list insertion is not a problem. Every step is also done in constant time ( O(log(x) ) time ).
So, what is the worst case scenario?
x log(x) + (n-x)(log(x)+1) = nlog(x) + n - x
So that is O(n) time for worst case. The +1 is the checking if number is greater than smallest one in LIST. Expected time for average case will depend on mathematical distribution of those n elements.
Possible improvements
This algorithm can be slightly improved for worst case scenario but IMHO (I can not prove this claim) that will degrade average behavior. Asymptotic behavior will be the same.
Improvement in this algorithm will be that we will not check if element is greater than smallest. For each element we will try to insert it and if it is smaller than smallest we will disregard it. Although that sounds preposterous if we regard only the worst case scenario we will have
x log(x) + (n-x)log(x) = nlog(x)
operations.
For this use case I don't see any further improvements. Yet you must ask yourself - what if I have to do this more than log(n) times and for different x-es? Obviously we would sort that array in O(n log(n)) and take our x element whenever we need them.
Finding the top 100 out of a billion numbers is best done using min-heap of 100 elements.
First prime the min-heap with the first 100 numbers encountered. min-heap will store the smallest of the first 100 numbers at the root (top).
Now as you go along the rest of the numbers only compare them with the root (smallest of the 100).
If the new number encountered is larger than root of min-heap replace the root with that number otherwise ignore it.
As part of the insertion of the new number in min-heap the smallest number in the heap will come to the top (root).
Once we have gone through all the numbers we will have the largest 100 numbers in the min-heap.
I have written up a simple solution in Python in case anyone is interested. It uses the bisect module and a temporary return list which it keeps sorted. This is similar to a priority queue implementation.
import bisect

def kLargest(A, k):
    '''returns list of k largest integers in A'''
    ret = []
    for i, a in enumerate(A):
        # For first k elements, simply construct sorted temp list
        # It is treated similarly to a priority queue
        if i < k:
            bisect.insort(ret, a) # properly inserts a into sorted list ret
        # Iterate over rest of array
        # Replace and update return array when more optimal element is found
        else:
            if a > ret[0]:
                del ret[0] # pop min element off queue
                bisect.insort(ret, a) # properly inserts a into sorted list ret
    return ret
Usage with 100,000,000 elements and worst-case input which is a sorted list:
>>> from so import kLargest
>>> kLargest(range(100000000), 100)
[99999900, 99999901, 99999902, 99999903, 99999904, 99999905, 99999906, 99999907,
99999908, 99999909, 99999910, 99999911, 99999912, 99999913, 99999914, 99999915,
99999916, 99999917, 99999918, 99999919, 99999920, 99999921, 99999922, 99999923,
99999924, 99999925, 99999926, 99999927, 99999928, 99999929, 99999930, 99999931,
99999932, 99999933, 99999934, 99999935, 99999936, 99999937, 99999938, 99999939,
99999940, 99999941, 99999942, 99999943, 99999944, 99999945, 99999946, 99999947,
99999948, 99999949, 99999950, 99999951, 99999952, 99999953, 99999954, 99999955,
99999956, 99999957, 99999958, 99999959, 99999960, 99999961, 99999962, 99999963,
99999964, 99999965, 99999966, 99999967, 99999968, 99999969, 99999970, 99999971,
99999972, 99999973, 99999974, 99999975, 99999976, 99999977, 99999978, 99999979,
99999980, 99999981, 99999982, 99999983, 99999984, 99999985, 99999986, 99999987,
99999988, 99999989, 99999990, 99999991, 99999992, 99999993, 99999994, 99999995,
99999996, 99999997, 99999998, 99999999]
It took about 40 seconds to calculate this for 100,000,000 elements so I'm scared to do it for 1 billion. To be fair though, I was feeding it the worst-case input (ironically an array that is already sorted).
Time ~ O(100 * N)
Space ~ O(100 + N)
Create an empty list of 100 empty slots
For every number in input-list:
If the number is smaller than the first one, skip
Otherwise replace it with this number
Then, push the number through adjacent swap; until it's smaller than the next one
Return the list
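The steps above could be sketched in Python roughly like this. It assumes the list is kept in ascending order with the smallest kept value in the first slot, and it models an "empty slot" as negative infinity; both are my reading of the description, not something the answer spells out.

```python
def top_k_by_swaps(numbers, k):
    """Keep the k largest values in an ascending list using adjacent swaps."""
    # Assumption: an empty slot is modelled as -inf so any number beats it.
    slots = [float("-inf")] * k
    for x in numbers:
        if x < slots[0]:
            continue  # smaller than the smallest kept value: skip
        slots[0] = x  # overwrite the smallest kept value
        i = 0
        # push the new value right until it is no larger than its neighbour
        while i + 1 < k and slots[i] > slots[i + 1]:
            slots[i], slots[i + 1] = slots[i + 1], slots[i]
            i += 1
    return slots
```

If the input has fewer than k numbers, the unused slots simply stay at negative infinity.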
Note: if the log(input-list.size) + c < 100, then the optimal way is to sort the input-list, then take the first 100 items.
Another O(n) algorithm -
The algorithm finds the largest 100 by elimination
Consider all the million numbers in their binary representation. Start from the most significant bit. Testing whether the MSB is 1 can be done with a bitwise AND against an appropriate mask. If more than 100 of these million numbers have a 1 in that bit, eliminate the numbers that have a 0 there. Then proceed with the next most significant bit on the remaining numbers. Keep a count of the numbers remaining after each elimination and proceed as long as this count is greater than 100.
The bulk of the boolean operations can be done in parallel on GPUs.
I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.
I know this might get buried, but here is my idea for a variation on a radix MSD.
pseudo-code:
//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = int[8][16];
for number in billion
putInTop100Array(number)
function putInTop100Array(number){
//basically if we got past all the digits successfully
if(number == null)
return true;
msdIdx = getMsdIdx(number);
msd = getMsd(number);
//check if the idx above where we are is already full
if(mynums[msdIdx][msd+1] > 99) {
return false;
} else if(putInTop100Array(removeMSD(number))){
mynums[msdIdx][msd]++;
//we've found 100 digits here, no need to keep looking below where we are
if(mynums[msdIdx][msd] > 99){
for(int i = 0; i < msd; i++){
//making it 101 just so we can tell the difference
//between numbers where we actually found 101, and
//where we just set it
mynums[msdIdx][i] = 101;
}
}
return true;
}
return false;
}
The function getMsdIdx(int num) would return the index of the most significant (non-zero) digit. The function getMsd(int num) would return the most significant digit. The function removeMSD(int num) would remove the most significant digit from a number and return the number (or return null if there was nothing left after removing the most significant digit).
Once this is done, all that is left is traversing mynums to grab the top 100 digits. This would be something like:
int[] nums = int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
int timesAdded = 0;
for(int j = 15; j >=0 && timesAdded < 100; j--){
for(int k = mynums[i][j]; k > 0; k--){
nums[idx] += j;
timesAdded++;
idx++;
}
}
}
I should note that although the above looks like it has high time complexity, it will really only be around O(7*100).
A quick explanation of what this is trying to do:
Essentially this system is trying to use every digit in a 2d-array based upon the index of the digit in the number, and the digit's value. It uses these as indexes to keep track of how many numbers of that value have been inserted in the array. When 100 has been reached, it closes off all "lower branches".
The time of this algorithm is something like O(billion*log(16)*7)+O(100). I could be wrong about that. Also it is very likely this needs debugging as it is kinda complex and I just wrote it off the top of my head.
EDIT: Downvotes without explanation are not helpful. If you think this answer is incorrect, please leave a comment why. Pretty sure that StackOverflow even tells you to do so when you downvote.
Managing a separate list is extra work and you have to move things around the whole list every time you find another replacement. Just qsort it and take the top 100.
Use nth-element to get the 100'th element O(n)
Iterate the second time but only once and output every element that is greater than this specific element.
Please note esp. the second step might be easy to compute in parallel! And it will also be efficient when you need a million biggest elements.
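Python has no direct counterpart to nth-element, so the sketch below swaps in a small hand-written quickselect (expected O(n)) to find the threshold, then does the second pass described above. The names kth_largest and top_k are hypothetical, and the padding step for ties is my own addition so that duplicates of the threshold still give exactly k results.

```python
import random

def kth_largest(values, k):
    """Return the k-th largest value (1-based) via quickselect; needs 1 <= k <= len(values)."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        greater = [x for x in items if x > pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k <= len(greater):
            items = greater                      # answer lies among the larger values
        elif k <= len(greater) + equal_count:
            return pivot                         # the pivot itself is the k-th largest
        else:
            k -= len(greater) + equal_count      # skip everything >= pivot
            items = [x for x in items if x < pivot]

def top_k(values, k):
    """Return the k largest values (unordered); values must be a list."""
    threshold = kth_largest(values, k)
    bigger = [x for x in values if x > threshold]    # the second pass
    # Pad with copies of the threshold so ties still yield exactly k values.
    return bigger + [threshold] * (k - len(bigger))
```

The second pass is a plain filter, which is why it splits across workers so easily.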
This is a question from Google or one of the other industry giants. Maybe the following code is the answer your interviewer expects.
The time and space costs depend on the maximum number in the input array. For a 32-bit int array, the maximum space cost is 4 * 125M bytes and the time cost is 5 * billion.
public class TopNumber {
public static void main(String[] args) {
final int input[] = {2389,8922,3382,6982,5231,8934
,4322,7922,6892,5224,4829,3829
,6892,6872,4682,6723,8923,3492};
//One int(4 bytes) hold 32 = 2^5 value,
//About 4 * 125M Bytes
//int sort[] = new int[1 << (32 - 5)];
//Allocate small array for local test
int sort[] = new int[1000];
//Set all bit to 0
for(int index = 0; index < sort.length; index++){
sort[index] = 0;
}
for(int number : input){
sort[number >>> 5] |= (1 << (number % 32));
}
int topNum = 0;
outer:
for(int index = sort.length - 1; index >= 0; index--){
if(0 != sort[index]){
for(int bit = 31; bit >= 0; bit--){
if(0 != (sort[index] & (1 << bit))){
System.out.println((index << 5) + bit);
topNum++;
if(topNum >= 3){
break outer;
}
}
}
}
}
}
}
I wrote my own code; I'm not sure if it's what the "interviewer" is looking for.
private static final int MAX=100;
PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);
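for (int i = 0; i < array.length; i++)
{
if (queue.size() < MAX)
{
queue.add(array[i]); // fill the heap with the first MAX values
}
else if (queue.peek() < array[i])
{
queue.poll(); // evict the smallest of the current top MAX
queue.add(array[i]);
}
}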
Possible improvements.
If the file contains 1 billion numbers, reading it could take a really long time...
To improve on this you can:
Split the file into n parts, create n threads, have each thread find the 100 biggest numbers in its part of the file (using the priority queue), and finally take the 100 biggest numbers across all the threads' outputs.
Use a cluster for such a task, with a solution like Hadoop. Here you can split the file even more and get the output quicker for a file of 1 billion (or 10^12) numbers.
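As a rough illustration of the split-and-merge idea, here is a sketch in Python that works on an in-memory list instead of a file (an assumption made to keep it short). The function name top_k_parallel and the parts parameter are hypothetical. Note that in CPython, threads will not speed up this CPU-bound work because of the GIL; the point is only the structure: a top-k per part, then a top-k over the candidates.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def top_k_parallel(numbers, k, parts=4):
    """Return the k largest values (descending) by merging per-chunk results."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # each worker keeps only the k largest values of its own chunk
        partial = pool.map(lambda chunk: heapq.nlargest(k, chunk), chunks)
        candidates = [x for part in partial for x in part]
    # the global top k must be among the per-chunk top k lists
    return heapq.nlargest(k, candidates)
```

The merge step only ever sees parts * k candidates, so it stays cheap no matter how large the input is.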
First take 1000 elements and add them in a max heap. Now take out the first max 100 elements and store it somewhere. Now pick next 900 elements from the file and add them in the heap along with the last 100 highest element.
Keep repeating this process of picking up 100 elements from the heap and adding 900 elements from the file.
The final pick of 100 elements will give us the maximum 100 elements from a billion of numbers.
The complexity is O(N)
First create an array of 100 ints and initialize its first element to the first of the N values,
keep track of the index of the current element with a another variable, call it CurrentBig
Iterate though the N values
if N[i] > M[CurrentBig] {
M[CurrentBig]=N[i]; ( overwrite the current value with the newly found larger number)
CurrentBig++; ( go to the next position in the M array)
CurrentBig %= 100; ( modulo arithmetic saves you from using lists/hashes etc.)
M[CurrentBig]=N[i]; ( pick up the current value again to use it for the next Iteration of the N array)
}
when done , print the M array from CurrentBig 100 times modulo 100 :-)
For the student: make sure that the last line of the code does not trump valid data right before the code exits

Algorithm: Distance transform - any faster algorithm?

I'm trying to solve the distance transform problem (using Manhattan distance). Basically, given a matrix of 0's and 1's, the program must assign to every position its distance to the nearest 1. For example, for this one
0000
0100
0000
0000
distance transform matrix is
2123
1012
2123
3234
Possible solutions from my head are:
Slowest ones (slowest because I have tried to implement them - they were lagging on very big matrices):
Brute-force - for every 1 that program reads, change distances accordingly from beginning till end.
Breadth-first search from 0's - for every 0, program looks for nearest 1 inside out.
Same as 2 but starting from 1's mark every distance inside out.
Much faster (read from other people's code)
Breadth-first search from 1's
1. Assign all values in the distance matrix to -1 or very big value.
2. While reading matrix, put all positions of 1's into queue.
3. While queue is not empty
a. Dequeue position - let it be x
b. For each position around x (that has distance 1 from it)
if position is valid (does not exceed matrix dimensions) then
if distance is not initialized or is greater than (distance of x) + 1 then
I. distance = (distance of x) + 1
II. enqueue position into queue
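For reference, the breadth-first search from the 1's can be sketched in Python as below. The function name distance_transform_bfs is hypothetical, and it uses -1 as the "not initialized" marker. Because all sources start in the queue at distance 0, the first time a cell is reached is already its shortest distance, so the "is greater than" re-check from step 3b is not needed in this version.

```python
from collections import deque

def distance_transform_bfs(grid):
    """Manhattan distance from every cell to the nearest 1, via multi-source BFS."""
    rows, cols = len(grid), len(grid[0])
    dist = [[-1] * cols for _ in range(rows)]  # -1 means not reached yet
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] == -1:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist
```

Cells stay at -1 only if the matrix contains no 1 at all.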
I wanted to ask if there is faster solution to that problem. I tried to search algorithms for distance transform but most of them are dealing with Euclidean distances.
Thanks in advance.
The breadth first search would perform Θ(n*m) operations where n and m are the width and height of your matrix.
You need to output Θ(n*m) numbers, so you can't get any faster than that from a theoretical point of view.
I'm assuming you are not interested in going towards discussions involving cache and such optimizations.
Note that this solution works in more interesting cases. For example, imagine the same question, but there could be different "sources":
00000
01000
00000
00000
00010
Using BFS, you will get the following distance-to-closest-source in the same time complexity:
21234
10123
21223
32212
32101
However, with a single source, there is another solution that might have a slightly better performance in practice (even though the complexity is still the same).
Before, let's observe the following property.
Property: If source is at (a, b), then a point (x, y) has the following manhattan distance:
d(x, y) = abs(x - a) + abs(y - b)
This should be quite easy to prove. So another algorithm would be:
for r in rows
for c in cols
d(r, c) = abs(r - a) + abs(c - b)
which is very short and easy.
Unless you write and test it, there is no easy way of comparing the two algorithms. Assuming an efficient bounded queue implementation (with an array), you have the following major operations per cell:
BFS: queue insertion/deletion, visit of each node 5 times (four times by neighbors, and one time out of the queue)
Direct formula: two subtractions and two ifs
It would really depend on the compiler and its optimizations as well as the specific CPU and memory architecture to say which would perform better.
That said, I'd advise for going with whichever seems simpler to you. Note however that with multiple sources, in the second solution you would need multiple passes on the array (or multiple distance calculations in one pass) and that would definitely have a worse performance than BFS for a large enough number of sources.
You don't need a queue or anything like that at all. Notice that if (i,j) is at distance d from (k,l), one way to realise that distance is to go left or right |i-k| times and then up or down |j-l| times.
So, initialise your matrix with big numbers and stick a zero everywhere you have a 1 in your input. Now do something like this:
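for (i = 0; i < sx-1; i++) {
for (j = 0; j < sy-1; j++) {
dist[i+1][j] = min(dist[i+1][j], dist[i][j]+1);
dist[i][j+1] = min(dist[i][j+1], dist[i][j]+1);
}
// cells with j = sy-1 have no j+1 neighbour, so only propagate along i
dist[i+1][sy-1] = min(dist[i+1][sy-1], dist[i][sy-1]+1);
}
for (j = 0; j < sy-1; j++) {
// cells with i = sx-1 have no i+1 neighbour, so only propagate along j
dist[sx-1][j+1] = min(dist[sx-1][j+1], dist[sx-1][j]+1);
}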
At this point, you've found all of the shortest paths that involve only going down or right. If you do a similar thing for going up and left, dist[i][j] will give you the distance from (i, j) to the nearest 1 in your input matrix.
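Here is a sketch of both passes together in Python. It is written in "pull" form (each cell takes the minimum over its already-processed neighbours) rather than the "push" form of the loops above, which avoids special-casing the last row and column; the function name and the choice of rows + cols as the "big number" are my own.

```python
def distance_transform_two_pass(grid):
    """Manhattan distance to the nearest 1 using a forward and a backward sweep."""
    rows, cols = len(grid), len(grid[0])
    big = rows + cols  # larger than any Manhattan distance inside the grid
    dist = [[0 if grid[r][c] == 1 else big for c in range(cols)]
            for r in range(rows)]
    # forward sweep: shortest paths that only go down or right
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)
    # backward sweep: extend those paths with moves going up or left
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist
```

If the grid has no 1 at all, every cell keeps a value of at least rows + cols, which can be treated as "unreachable".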

What can be the efficient approach to solve the 8 puzzle problem?

The 8-puzzle is a square board with 9 positions, filled by 8 numbered tiles and one gap. At any point, a tile adjacent to the gap can be moved into the gap, creating a new gap position. In other words the gap can be swapped with an adjacent (horizontally and vertically) tile. The objective in the game is to begin with an arbitrary configuration of tiles, and move them so as to get the numbered tiles arranged in ascending order either running around the perimeter of the board or ordered from left to right, with 1 in the top left-hand position.
I was wondering what approach will be efficient to solve this problem?
I will just attempt to rewrite the previous answer with more details on why it is optimal.
The A* algorithm taken directly from wikipedia is
function A*(start,goal)
closedset := the empty set // The set of nodes already evaluated.
openset := set containing the initial node // The set of tentative nodes to be evaluated.
came_from := the empty map // The map of navigated nodes.
g_score[start] := 0 // Distance from start along optimal path.
h_score[start] := heuristic_estimate_of_distance(start, goal)
f_score[start] := h_score[start] // Estimated total distance from start to goal through y.
while openset is not empty
x := the node in openset having the lowest f_score[] value
if x = goal
return reconstruct_path(came_from, came_from[goal])
remove x from openset
add x to closedset
foreach y in neighbor_nodes(x)
if y in closedset
continue
tentative_g_score := g_score[x] + dist_between(x,y)
if y not in openset
add y to openset
tentative_is_better := true
elseif tentative_g_score < g_score[y]
tentative_is_better := true
else
tentative_is_better := false
if tentative_is_better = true
came_from[y] := x
g_score[y] := tentative_g_score
h_score[y] := heuristic_estimate_of_distance(y, goal)
f_score[y] := g_score[y] + h_score[y]
return failure
function reconstruct_path(came_from, current_node)
if came_from[current_node] is set
p = reconstruct_path(came_from, came_from[current_node])
return (p + current_node)
else
return current_node
So let me fill in all the details here.
heuristic_estimate_of_distance is the function Σ d(xi) where d(.) is the Manhattan distance of each square xi from its goal state.
So the setup
1 2 3
4 7 6
8 5 _
would have a heuristic_estimate_of_distance of 1+2+1=4 since each of 8,5 are one away from their goal position with d(.)=1 and 7 is 2 away from its goal state with d(7)=2.
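A minimal sketch of this heuristic in Python, assuming a board is stored as a tuple of 9 numbers read row by row with 0 standing for the gap (that encoding, and the names GOAL and manhattan_heuristic, are my own choices). The example board is taken to have its gap in the bottom-right corner, which is what makes the distances add up to 4.

```python
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the gap (assumed encoding)

def manhattan_heuristic(state, goal=GOAL):
    """Sum of Manhattan distances of every tile from its goal position."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue  # the gap is not a tile, so it is not counted
        goal_index = goal.index(tile)
        total += abs(index // 3 - goal_index // 3)  # row distance
        total += abs(index % 3 - goal_index % 3)    # column distance
    return total

print(manhattan_heuristic((1, 2, 3, 4, 7, 6, 8, 5, 0)))  # prints 4
```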
The set of nodes that the A* searches over is defined to be the starting position followed by all possible legal positions. That is lets say the starting position x is as above:
x =
1 2 3
4 7 6
8 5 _
then the function neighbor_nodes(x) produces the 2 possible legal moves:
1 2 3
4 7 _
8 5 6
or
1 2 3
4 7 6
8 _ 5
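A possible neighbor_nodes in Python, again assuming a board is a tuple of 9 numbers read row by row with 0 as the gap (my encoding, not part of the pseudocode above):

```python
def neighbor_nodes(state):
    """Return every board reachable by sliding one tile into the gap."""
    gap = state.index(0)
    row, col = divmod(gap, 3)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:  # stay on the 3x3 board
            other = r * 3 + c
            board = list(state)
            board[gap], board[other] = board[other], board[gap]
            result.append(tuple(board))
    return result
```

With the gap in a corner there are 2 neighbours, on an edge 3, and in the centre 4.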
The function dist_between(x,y) is defined as the number of tile moves needed to transition from state x to y. Since neighbouring states differ by a single move, this is always equal to 1 for the purposes of your algorithm.
closedset and openset are both specific to the A* algorithm and can be implemented using standard data structures (priority queues I believe.) came_from is a data structure used to reconstruct the solution found using the function reconstruct_path, whose details can be found on wikipedia. If you do not wish to remember the solution you do not need to implement this.
Last, I will address the issue of optimality. Consider the excerpt from the A* wikipedia article:
"If the heuristic function h is admissible, meaning that it never overestimates the actual minimal cost of reaching the goal, then A* is itself admissible (or optimal) if we do not use a closed set. If a closed set is used, then h must also be monotonic (or consistent) for A* to be optimal. This means that for any pair of adjacent nodes x and y, where d(x,y) denotes the length of the edge between them, we must have:
h(x) <= d(x,y) +h(y)"
So it suffices to show that our heuristic is admissible and monotonic. For the former (admissibility), note that given any configuration our heuristic (sum of all distances) estimates that each square is not constrained by only legal moves and can move freely towards its goal position, which is clearly an optimistic estimate, hence our heuristic is admissible (or it never over-estimates since reaching a goal position will always take at least as many moves as the heuristic estimates.)
The monotonicity requirement stated in words is:
"The heuristic cost (estimated distance to goal state) of any node must be less than or equal to the cost of transitioning to any adjacent node plus the heuristic cost of that node."
It is mainly to prevent the possibility of negative cycles, where transitioning to an unrelated node may decrease the distance to the goal node more than the cost of actually making the transition, suggesting a poor heuristic.
Showing monotonicity is pretty simple in our case. Any adjacent nodes x,y have d(x,y)=1 by our definition of d. Thus we need to show
h(x) <= h(y) + 1
which is equivalent to
h(x) - h(y) <= 1
which is equivalent to
Σ d(xi) - Σ d(yi) <= 1
which is equivalent to
Σ (d(xi) - d(yi)) <= 1
We know by our definition of neighbor_nodes(x) that two neighbour nodes x,y can differ in the position of at most one square, meaning that in our sums the term
d(xi) - d(yi) = 0
for all but one value of i. Let's say without loss of generality that the exception is i=k. Furthermore, we know that for i=k, the square has moved at most one place, so its distance to its goal position can differ by at most one between the two states, thus:
Σ (d(xi) - d(yi)) = d(xk) - d(yk) <= 1
showing monotonicity. This shows what needed to be shown, thus proving this algorithm will be optimal (in a big-O notation or asymptotic kind of way.)
Note that I have shown optimality in terms of big-O notation but there is still lots of room to play in terms of tweaking the heuristic. You can add additional twists to it so that it is a closer estimate of the actual distance to the goal state, however you have to make sure that the heuristic is always an underestimate, otherwise you lose optimality!
EDIT MANY MOONS LATER
Reading this over again (much) later, I realized the way I wrote it sort of confounds the meaning of optimality of this algorithm.
There are two distinct meanings of optimality I was trying to get at here:
1) The algorithm produces an optimal solution, that is the best possible solution given the objective criteria.
2) The algorithm expands the least number of state nodes of all possible algorithms using the same heuristic.
The simplest way to understand why you need admissibility and monotonicity of the heuristic to obtain 1) is to view A* as an application of Dijkstra's shortest path algorithm on a graph where the edge weights are given by the node distance traveled thus far plus the heuristic distance. Without these two properties, we would have negative edges in the graph, thereby negative cycles would be possible and Dijkstra's shortest path algorithm would no longer return the correct answer! (Construct a simple example of this to convince yourself.)
2) is actually quite confusing to understand. To fully understand the meaning of this, there are a lot of quantifiers on this statement, such as when talking about other algorithms, one refers to similar algorithms as A* that expand nodes and search without a-priori information (other than the heuristic.) Obviously, one can construct a trivial counter-example otherwise, such as an oracle or genie that tells you the answer at every step of the way. To understand this statement in depth I highly suggest reading the last paragraph in the History section on Wikipedia as well as looking into all the citations and footnotes in that carefully stated sentence.
I hope this clears up any remaining confusion among would-be readers.
You can use the heuristic that is based on the positions of the numbers, that is the higher the overall sum of all the distances of each letter from its goal state is, the higher the heuristic value. Then you can implement A* search which can be proved to be the optimal search in terms of time and space complexity (provided the heuristic is monotonic and admissible.) http://en.wikipedia.org/wiki/A*_search_algorithm
Since the OP cannot post a picture, this is what he's talking about:
As far as solving this puzzle, goes, take a look at the iterative deepening depth-first search algorithm, as made relevant to the 8-puzzle problem by this page.
Donut's got it! IDDFS will do the trick, considering the relatively limited search space of this puzzle. It would be efficient hence respond to the OP's question. It would find the optimal solution, but not necessarily in optimal complexity.
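For what it's worth, here is a minimal sketch of IDDFS on the 8-puzzle in Python. It assumes a board is a tuple of 9 numbers read row by row with 0 as the gap and the left-to-right goal ordering; the function names are hypothetical. The default depth cap of 31 is the length of the longest optimal solution for the 8-puzzle. Plain IDDFS with no heuristic gets slow for deep positions, and on an unsolvable board it will grind through every depth up to the cap before giving up.

```python
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the gap (assumed encoding)

def neighbors(state):
    """Yield every board reachable by sliding one tile into the gap."""
    gap = state.index(0)
    row, col = divmod(gap, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            board = list(state)
            board[gap], board[r * 3 + c] = board[r * 3 + c], board[gap]
            yield tuple(board)

def depth_limited(path, limit):
    """Depth-first search from path[-1] using at most `limit` more moves."""
    state = path[-1]
    if state == GOAL:
        return path
    if limit == 0:
        return None
    for nxt in neighbors(state):
        if len(path) >= 2 and nxt == path[-2]:
            continue  # skip the move that just undoes the previous one
        found = depth_limited(path + [nxt], limit - 1)
        if found is not None:
            return found
    return None

def iddfs(start, max_depth=31):
    """Return a shortest list of boards from start to GOAL, or None."""
    for limit in range(max_depth + 1):
        found = depth_limited([start], limit)
        if found is not None:
            return found
    return None
```

Because the depth limit grows one step at a time, the first solution found uses the fewest possible moves.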
Implementing IDDFS would be the more complicated part of this problem; I just want to suggest a simple approach to managing the board, the game's rules etc. This in particular addresses a way to obtain initial states for the puzzle which are solvable. As hinted in the notes of the question, not all random assignments of 9 tiles (considering the empty slot a special tile) will yield a solvable puzzle. It is a matter of mathematical parity... So, here's a suggestion to model the game:
Make the list of all 3x3 permutation matrices which represent valid "moves" of the game.
Such list is a subset of 3x3s w/ all zeros and two ones. Each matrix gets an ID which will be quite convenient to keep track of the moves, in the IDDFS search tree. An alternative to matrices, is to have two-tuples of the tile position numbers to swap, this may lead to faster implementation.
Such matrices can be used to create the initial puzzle state, starting with the "win" state, and running an arbitrary number of permutations selected at random. In addition to ensuring that the initial state is solvable, this approach also provides an indicative number of moves with which a given puzzle can be solved.
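A small sketch of that scrambling idea in Python, using the position-swap alternative rather than permutation matrices. It assumes a board is a tuple of 9 numbers read row by row with 0 as the empty slot; the names GOAL and scramble are hypothetical.

```python
import random

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the empty slot (assumed encoding)

def scramble(moves, rng=random):
    """Start from the win state and apply `moves` random legal moves."""
    state = list(GOAL)
    gap = state.index(0)
    for _ in range(moves):
        row, col = divmod(gap, 3)
        options = []  # positions whose tile may slide into the gap
        if row > 0:
            options.append(gap - 3)
        if row < 2:
            options.append(gap + 3)
        if col > 0:
            options.append(gap - 1)
        if col < 2:
            options.append(gap + 1)
        other = rng.choice(options)
        state[gap], state[other] = state[other], state[gap]
        gap = other
    return tuple(state)
```

Since every step is a legal move, the result can always be solved by undoing the moves, so `moves` is an upper bound on the solution length (random moves often cancel each other out).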
Now let's just implement the IDDFS algo and [joke]return the assignment for an A+[/joke]...
This is an example of the classical shortest path algorithm. You can read more about shortest path here and here.
In short, think of all possible states of the puzzle as of vertices in some graph. With each move you change states - so, each valid move represents an edge of the graph. Since moves don't have any cost, you may think of the cost of each move being 1. The following c++-like pseudo-code will work for this problem:
{
int[][] field = new int[3][3];
// fill the input here
map<string, int> path;
queue<string> q;
put(field, 0); // we can get to the starting position in 0 turns
while (!q.empty()) {
string v = q.poll();
int[][] take = decode(v);
int time = path.get(v);
if (isFinalPosition(take)) {
return time;
}
for each valid move from take to int[][] newPosition {
put(newPosition, time + 1);
}
}
// no path
return -1;
}
bool isFinalPosition(int[][] q) {
return encode(q) == "123456780"; // 0 represents empty space
}
void put(int[][] position, int time) {
string s = encode(position);
if (!path.contains(s)) {
path.put(s, time);
q.add(s); // enqueue the newly discovered state so the search visits it
}
}
string encode(int[][] field) {
string s = "";
for (int i = 0; i < 3; i++) for (int j = 0; j < 3; j++) s += field[i][j];
return s;
}
int[][] decode(string s) {
int[][] ans = new int[3][3];
for (int i = 0; i < 3; i++) for (int j = 0; j < 3; j++) ans[i][j] = s[i * 3 + j] - '0'; // convert the digit character back to its tile number
return ans;
}
See this link for my parallel iterative deepening search for a solution to the 15-puzzle, which is the 4x4 big-brother of the 8-puzzle.

Find median value from a growing set

I came across an interesting algorithm question in an interview. I gave my answer but not sure whether there is any better idea. So I welcome everyone to write something about his/her ideas.
You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and they are distinct (according to the definition of set, we don't consider two elements with the same value).
Every time a new element is added to the set, the set's median value is asked. The median value is defined the same as in math: the middle element in a sorted list. Here, specially, when the size of set is even, assuming size of set = 2*x, the median element is the x-th element of the set.
An example:
Start with an empty set,
when 12 is added, the median is 12,
when 7 is added, the median is 7,
when 8 is added, the median is 8,
when 11 is added, the median is 8,
when 5 is added, the median is 8,
when 16 is added, the median is 8,
...
Notice that, first, elements are added to set one by one and second, we don't know the elements going to be added.
My answer.
Since it is a question about finding the median, sorting is needed. The easiest solution is to use a normal array and keep the array sorted. When a new element comes, use binary search to find the position for the element (O(log n)) and add the element to the array. Since it is a normal array, shifting the rest of the array is needed, whose time complexity is O(n). Once the element is inserted, we can immediately get the median, in constant time.
The WORST time complexity is: log n + n + 1.
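The sorted-array approach is short enough to sketch in Python with the bisect module. The function name medians_sorted_list is hypothetical; the index (size - 1) // 2 picks the x-th element when the size is 2*x, matching the definition in the question.

```python
import bisect

def medians_sorted_list(stream):
    """Return the median after each insertion, keeping a sorted list."""
    ordered = []
    medians = []
    for x in stream:
        bisect.insort(ordered, x)  # binary search, then O(n) shift
        medians.append(ordered[(len(ordered) - 1) // 2])  # O(1) lookup
    return medians

print(medians_sorted_list([12, 7, 8, 11, 5, 16]))  # prints [12, 7, 8, 8, 8, 8]
```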
Another solution is to use a linked list. The reason for using a linked list is to remove the need to shift the array. But finding the location of the new element requires a linear search. Adding the element takes constant time and then we need to find the median by going through half of the list, which always takes n/2 time.
The WORST time complexity is: n + 1 + n/2.
The third solution is to use a binary search tree. Using a tree, we avoid shifting array. But using the binary search tree to find the median is not very attractive. So I change the binary search tree in a way that it is always the case that the left subtree and the right subtree are balanced. This means that at any time, either the left subtree and the right subtree have the same number of nodes or the right subtree has one node more than in the left subtree. In other words, it is ensured that at any time, the root element is the median. Of course this requires changes in the way the tree is built. The technical detail is similar to rotating a red-black tree.
If the tree is maintained properly, it is ensured that the WORST time complexity is O(n).
So the three algorithms are all linear in the size of the set. If no sub-linear algorithm exists, the three algorithms can be thought of as the optimal solutions. Since they don't differ from each other much, the best is the easiest to implement, which is the second one, using a linked list.
So what I really wonder is, will there be a sub-linear algorithm for this problem and if so what will it be like. Any ideas guys?
Steve.
Your complexity analysis is confusing. Let's say that n items total are added; we want to output the stream of n medians (where the ith in the stream is the median of the first i items) efficiently.
I believe this can be done in O(n*lg n) time using two priority queues (e.g. binary or fibonacci heap); one queue for the items below the current median (so the largest element is at the top), and the other for items above it (in this heap, the smallest is at the top). Note that in fibonacci (and other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).
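A minimal sketch of the two-heap idea in Python with heapq (binary heaps, so both push and pop are O(lg n)). The class name RunningMedian is hypothetical. The max-heap is simulated by storing negated values, and the lower heap is allowed one extra element so that its top is the x-th element when the size is 2*x, as the question defines the median.

```python
import heapq

class RunningMedian:
    def __init__(self):
        self.low = []   # max-heap via negated values: the smaller half
        self.high = []  # min-heap: the larger half

    def add(self, x):
        """Insert x and return the current median in O(lg n)."""
        # Pass x through low so the largest of the small half moves to high.
        moved = -heapq.heappushpop(self.low, -x)
        heapq.heappush(self.high, moved)
        # Rebalance so that low holds the same number of items or one more.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))
        return -self.low[0]

rm = RunningMedian()
print([rm.add(x) for x in [12, 7, 8, 11, 5, 16]])  # prints [12, 7, 8, 8, 8, 8]
```

This simple version pops on every insertion; comparing x with the heap tops first would avoid some of those pops at the cost of a few more branches.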
This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)
If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.
I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf .
BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.
Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.
We use a binary search tree that is balanced (AVL/Red-Black/etc), but not super-balanced like you described. So adding an item is O(log n)
One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)
We can now access the Kth smallest element in O(log n) using these counts:
def get_kth_item(subtree, k):
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else: # k > left_size
        return get_kth_item(subtree.right, k-1-left_size)
A median is a special case of Kth smallest element (given that you know the size of the set).
So all in all this is another O(log n) solution.
We can define a min heap and a max heap to store the numbers. Additionally, we define a class DynamicArray for the number set, with two functions: Insert and GetMedian. Time to insert a new number is O(lg n), while time to get the median is O(1).
This solution is implemented in C++ as the following:
template<typename T> class DynamicArray
{
public:
void Insert(T num)
{
if(((minHeap.size() + maxHeap.size()) & 1) == 0)
{
if(maxHeap.size() > 0 && num < maxHeap[0])
{
maxHeap.push_back(num);
push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
num = maxHeap[0];
pop_heap(maxHeap.begin(), maxHeap.end(), less<T>());
maxHeap.pop_back();
}
minHeap.push_back(num);
push_heap(minHeap.begin(), minHeap.end(), greater<T>());
}
else
{
if(minHeap.size() > 0 && minHeap[0] < num)
{
minHeap.push_back(num);
push_heap(minHeap.begin(), minHeap.end(), greater<T>());
num = minHeap[0];
pop_heap(minHeap.begin(), minHeap.end(), greater<T>());
minHeap.pop_back();
}
maxHeap.push_back(num);
push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
}
}
T GetMedian()
{
int size = minHeap.size() + maxHeap.size();
if(size == 0)
throw runtime_error("No numbers are available"); // std::runtime_error, from <stdexcept>
T median = 0;
if((size & 1) == 1)
median = minHeap[0];
else
median = (minHeap[0] + maxHeap[0]) / 2;
return median;
}
private:
vector<T> minHeap;
vector<T> maxHeap;
};
For more detailed analysis, please refer to my blog: http://codercareer.blogspot.com/2012/01/no-30-median-in-stream.html.
1) As with the previous suggestions, keep two heaps and cache their respective sizes. The left heap keeps values below the median, the right heap keeps values above the median. If you simply negate the values in the right heap the smallest value will be at the root so there is no need to create a special data structure.
2) When you add a new number, you determine the new median from the size of your two heaps, the current median, and the two roots of the L&R heaps, which just takes constant time.
3) Call a private threaded method to perform the actual insert and update, but return immediately with the new median value. You only need to block until the heap roots are updated. Then, the thread doing the insert just needs to maintain a lock on the grandparent node as it traverses the tree; this will ensure that you can insert and rebalance without blocking other inserting threads working on other sub-branches.
Getting the median becomes a constant time procedure, of course now you may have to wait on synchronization from further adds.
Rob
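The negation trick from steps 1) and 2) above can be sketched as follows. This is only a minimal single-threaded illustration, not the threaded design from step 3); the class name NegatedHeapMedian and its methods are made up for the example. Both heaps are plain std::priority_queue max-heaps, and the right one stores negated values so that its root is the smallest value above the median.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <stdexcept>

// Hypothetical sketch: two max-heaps, the right one holding negated values.
class NegatedHeapMedian {
public:
    void add(int value) {
        if (left_.empty() || value <= left_.top()) {
            left_.push(value);
        } else {
            right_.push(-static_cast<long long>(value));  // negate on the way in
        }
        // Rebalance so that left_ holds either the same number of
        // elements as right_ or exactly one more.
        if (left_.size() > right_.size() + 1) {
            right_.push(-left_.top());
            left_.pop();
        } else if (right_.size() > left_.size()) {
            left_.push(-right_.top());  // negate again on the way out
            right_.pop();
        }
    }

    // Constant time: only the two roots and the sizes are inspected.
    double median() const {
        if (left_.empty()) throw std::runtime_error("No numbers are available");
        if (left_.size() > right_.size()) return static_cast<double>(left_.top());
        return (left_.top() + (-right_.top())) / 2.0;
    }

private:
    // long long is used so that negating INT_MIN cannot overflow.
    std::priority_queue<long long> left_;   // values at or below the median
    std::priority_queue<long long> right_;  // negated values above the median
};
```

Adding 5, 2 and 8 gives a median of 5; adding 1 afterwards gives 3.5, the mean of the two roots 2 and 5.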
A balanced tree (e.g. a red-black tree) with an augmented size field should find the median in O(lg n) time in the worst case. I think it is covered in Chapter 14 of the classic algorithms textbook.
To keep the explanation brief, you can efficiently augment a BST to select a key of a specified rank in O(h) by having each node store the number of nodes in its left subtree. If you can guarantee that the tree is balanced, you can reduce this to O(log(n)). Consider using an AVL which is height-balanced (or red-black tree which is roughly balanced), then you can select any key in O(log(n)). When you insert or delete a node into the AVL you can increment or decrement a variable that keeps track of the total number of nodes in the tree to determine the rank of the median which you can then select in O(log(n)).
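A rough sketch of the rank selection described above, assuming C++14 (for std::make_unique). The names Node, RankTree, select and median are invented for the illustration. Rebalancing (AVL or red-black rotations) is deliberately left out, so select runs in O(h), which is O(log(n)) only if the tree happens to stay balanced.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Hypothetical node type: each node stores the size of its left subtree.
struct Node {
    int key;
    std::size_t leftCount = 0;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

class RankTree {
public:
    // Plain BST insert (no rebalancing); equal keys go to the right.
    void insert(int key) {
        std::unique_ptr<Node>* cur = &root_;
        while (*cur) {
            if (key < (*cur)->key) {
                ++(*cur)->leftCount;  // the new node will land in this left subtree
                cur = &(*cur)->left;
            } else {
                cur = &(*cur)->right;
            }
        }
        *cur = std::make_unique<Node>(key);
        ++size_;
    }

    // Returns the key with the given 0-based rank in sorted order.
    int select(std::size_t rank) const {
        const Node* n = root_.get();
        while (n) {
            if (rank < n->leftCount) {
                n = n->left.get();
            } else if (rank == n->leftCount) {
                return n->key;
            } else {
                rank -= n->leftCount + 1;  // skip the left subtree and this node
                n = n->right.get();
            }
        }
        throw std::out_of_range("rank out of range");
    }

    // The total node count tells us which rank(s) hold the median.
    double median() const {
        if (size_ == 0) throw std::runtime_error("No numbers are available");
        if (size_ % 2 == 1) return select(size_ / 2);
        return (select(size_ / 2 - 1) + select(size_ / 2)) / 2.0;
    }

private:
    std::unique_ptr<Node> root_;
    std::size_t size_ = 0;
};
```

In a real AVL or red-black tree, every rotation would also have to update leftCount on the nodes it moves.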
To find the median in linear time per insertion, you can try this (it just came to my mind). You need to store some values every time you add a number to your set, and you won't need sorting. Here it goes.
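typedef struct
{
    int number;
    int lesser;  /* how many other stored numbers rank below this one */
    int greater; /* how many other stored numbers rank above this one */
} record;

/* Adds n to the count numbers already stored in numbers[] (which must have
   room for count + 1 records) and returns the median of all count + 1.
   Ties are broken by arrival order: a newer equal number ranks higher. */
int median(record numbers[], int count, int n)
{
    int i;
    int a = 0, b = 0;

    numbers[count].number = n;
    numbers[count].lesser = 0;
    numbers[count].greater = 0;
    for (i = 0; i < count; i++)
    {
        if (n < numbers[i].number)
        {
            numbers[i].lesser++;
            numbers[count].greater++;
        }
        else
        {
            numbers[i].greater++;
            numbers[count].lesser++;
        }
    }
    for (i = 0; i < count + 1; i++)
    {
        int diff = numbers[i].greater - numbers[i].lesser;
        if (diff == 0)
            return numbers[i].number; /* odd count: the exact middle */
        if (diff == -1)
            a = numbers[i].number;    /* even count: upper middle */
        if (diff == 1)
            b = numbers[i].number;    /* even count: lower middle */
    }
    return (a + b) / 2; /* integer division truncates */
}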
What this does: each time you add a number to the set, you update, for every stored number, how many numbers are less than it and how many are greater than it. If a number has the same "lesser than" and "greater than" counts, it sits in the very middle of the set, so it is the median, and you never have to sort. If you have an even amount of numbers, there are two candidates for the median, so you just return the mean of those two. BTW, this is C code; I hope this helps.

Resources