Find the largest k numbers in k arrays stored across k machines - algorithm

This is an interview question. I have K machines, each of which is connected to 1 central machine. Each of the K machines has an array of 4-byte numbers in a file. You can use any data structure to load those numbers into memory on those machines, and they fit. Numbers are not unique across the K machines. Find the K largest numbers in the union of the numbers across all K machines. What is the fastest way I can do this?

(This is an interesting problem because it involves parallelism. As I haven't encountered parallel algorithm optimization before, it's quite amusing: you can get away with some ridiculously high-complexity steps, because you can make up for it later. Anyway, onto the answer...)
> "What is the fastest I can do this?"
The best you can do is O(K). Below I illustrate both a simple O(K log(K)) algorithm, and the more complex O(K) algorithm.
First step:
Each computer needs enough time to read every element. This means that unless the elements are already in memory, one of the two bounds on the time is O(largest array size). If for example your largest array size varies as O(K log(K)) or O(K^2) or something, no amount of algorithmic trickery will let you go faster than that. Thus the actual best running time is O(max(K, largestArraySize)) technically.
Let us say the arrays have a max length of N. With the above caveat, we're allowed to bound N≤K: since each computer has to look at each of its elements at least once anyway (O(N) preprocessing per computer), each computer can pick out its largest K elements for free, since doing so is also O(N) (this is known as finding the Kth order statistic; linear-time selection algorithms exist).
Bounds and reasonable expectations:
Let's begin by thinking of some worst-case scenarios, and estimates for the minimum amount of work necessary.
One minimum-work-necessary estimate is O(K*N/K) = O(N), because we need to look at every element at the very least. But, if we're smart, we can distribute the work evenly across all K computers (hence the division by K).
Another minimum-work-necessary estimate is O(N): if every element of one array is larger than all elements on all other computers, we return that array's top K, but we still had to read all N of its elements.
We must output all K elements; this is at least O(K) to print them out. We can avoid this if we are content merely knowing where the elements are, in which case the O(K) bound does not necessarily apply.
Can this bound of O(N) be achieved? Let's see...
Simple approach - O(NlogN + K) = O(KlogK):
For now let's come up with a simple approach, which achieves O(NlogN + K).
Consider the data arranged like so, where each column is a computer, and each row is a number in the array:
computer: A B C D E F G
10 (o) (o)
9 o (o) (o)
8 o (o)
7 x x (x)
6 x x (x)
5 x ..........
4 x x ..
3 x x x . .
2 x x . .
1 x x .
0 x x .
You can also imagine this as a sweep-line algorithm from computational geometry, or an efficient variant of the 'merge' step from mergesort. The elements in parentheses represent the elements with which we'll initialize our potential "candidate solution" (in some central server). The algorithm will converge on the correct o answers by dumping the (x) answers for the two unselected o's.
Algorithm:
All computers start as 'active'.
Each computer sorts its elements. (parallel O(N logN))
Repeat until all computers are inactive:
Each active computer finds the next-highest element (O(1) since sorted) and gives it to the central server.
The server smartly combines the new elements with the old K elements, and removes an equal number of the lowest elements from the combined set. To perform this step efficiently, we have a global priority queue of fixed size K. We insert the new potentially-better elements, and bad elements fall out of the set. Whenever an element falls out of the set, we tell the computer which sent that element to never send another one. (Justification: This always raises the smallest element of the candidate set.)
(sidenote: Adding a callback hook to falling out of a priority queue is an O(1) operation.)
We can see graphically that this will perform at most 2K*(findNextHighest_time + queueInsert_time) operations, and as we do so, elements will naturally fall out of the priority queue. findNextHighest_time is O(1) since we sorted the arrays, so to minimize 2K*queueInsert_time, we choose a priority queue with an O(1) insertion time (e.g. a Fibonacci-heap based priority queue). This gives us an O(log(queue_size)) extraction time (we cannot have O(1) insertion and extraction); however, we never need to use the extract operation! Once we are done, we merely dump the priority queue as an unordered set, which takes O(queue_size)=O(K) time.
We'd thus have O(N log(N) + K) total running time (parallel sorting, followed by O(K)*O(1) priority queue insertions). In the worst case of N=K, this is O(K log(K)).
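As a sanity check, here is a minimal single-process Python sketch of this protocol (my own illustration, not part of the original answer). `top_k_central` plays both the K machines and the central server, uses Python's binary-heap `heapq` rather than a Fibonacci heap, and assumes the union contains at least K elements:

```python
import heapq

def top_k_central(arrays, k):
    """Simulate the rounds: each active machine streams its elements in
    descending order into a central min-heap capped at size k. When a
    machine's element is evicted from the heap, that machine is
    deactivated -- everything it could still send is even smaller."""
    streams = [sorted(a, reverse=True) for a in arrays]  # parallel O(N log N)
    pos = [0] * len(arrays)
    active = set(range(len(arrays)))
    heap = []  # min-heap of (value, machine) pairs, size <= k

    while active:
        for m in list(active):
            if m not in active:
                continue  # deactivated earlier in this round
            if pos[m] >= len(streams[m]):
                active.discard(m)  # machine exhausted
                continue
            value = streams[m][pos[m]]
            pos[m] += 1
            if len(heap) < k:
                heapq.heappush(heap, (value, m))
            else:
                # Push the new element, evict the smallest candidate,
                # and silence whichever machine sent the evicted one.
                _, evicted = heapq.heappushpop(heap, (value, m))
                active.discard(evicted)
    return sorted((v for v, _ in heap), reverse=True)
```

At termination every element outside the heap is dominated by at least k larger elements, which is exactly the eviction justification above.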
The better approach - O(N+K) = O(K):
However I have come up with a better approach, which achieves O(K). It is based on the median-of-median selection algorithm, but parallelized. It goes like this:
We can eliminate a set of numbers if we know for sure that there are at least K (not strictly) larger numbers somewhere among all the computers.
Algorithm:
Each computer finds the sqrt(N)th highest element of its set, and splits the set into elements < and > it. This takes O(N) time in parallel.
The computers collaborate to combine those statistics into a new set, and find the K/sqrt(N)th highest element of that set (let's call it the 'superstatistic'), and note which computers have statistics < and > the superstatistic. This takes O(K) time.
Now consider all elements less than their computer's statistic, on computers whose statistic is less than the superstatistic. Those elements can be eliminated. This is because the elements greater than their computer's statistic, on computers whose statistic is larger than the superstatistic, are a set of K elements which are larger. (See the rectangle diagram below.)
Now, the computers with the uneliminated elements evenly redistribute their data to the computers who lost data.
Recurse: you still have K computers, but the value of N has decreased. Once N is less than a predetermined constant, use the previous algorithm I mentioned in "simple approach - O(NlogN + K)"; except in this case, it is now O(K). =)
It turns out that the reductions are O(N) total (amazingly not order K), except perhaps the final step which might be O(K). Thus this algorithm is O(N+K) = O(K) total.
Analysis and simulation of O(K) running time below. The statistics allow us to divide the world into four unordered sets, represented here as a rectangle divided into four subboxes:
------N-----
N^.5
________________
| | s | <- computer
| | #=K s REDIST. | <- computer
| | s | <- computer
| K/N^.5|-----S----------| <- computer
| | s | <- computer
K | s | <- computer
| | s ELIMIN. | <- computer
| | s | <- computer
| | s | <- computer
| |_____s__________| <- computer
LEGEND:
s=statistic, S=superstatistic
#=K -- set of K largest elements
(I'd draw the relation between the unordered sets of rows and the s-column here, but it would clutter things up; see the addendum.)
For this analysis, we will consider N as it decreases.
At a given step, we are able to eliminate the elements labelled ELIMIN; this removes area from the rectangle representation above, reducing the problem size from K*N to K*N - (K - K/√N)(N - √N), which hilariously simplifies to K(2√N - 1).
Now, the computers with the uneliminated elements redistribute their data (REDIST rectangle above) to the computers with eliminated elements (ELIMIN). This is done in parallel, where the bandwidth bottleneck corresponds to the length of the short side of REDIST (because those computers are outnumbered by the ELIMIN computers which are waiting for their data). Therefore the data will take as long to transfer as the long side of the REDIST rectangle (another way of thinking about it: K/√N * (N-√N) is the area, divided by K/√N data-per-time, resulting in O(N-√N) time).
Thus at each step of size N, we are able to reduce the problem size to K(2√N-1), at the cost of performing N + 3K + (N-√N) work. We now recurse. The recurrence relation which will tell us our performance is:
T(N) = 2N+3K-√N + T(2√N-1)
The decimation of the subproblem size is much faster than the normal geometric series (being √N rather than something like N/2 which you'd normally get from common divide-and-conquers). Unfortunately neither the Master Theorem nor the powerful Akra-Bazzi theorem work, but we can at least convince ourselves it is linear via a simulation:
>>> from math import sqrt
>>> def T(n, k=None):
...     return 1 if n < 10 else sqrt(n)*(2*sqrt(n) - 1) + 3*k + T(2*sqrt(n) - 1, k=k)
>>> f = (lambda x: x)
>>> (lambda n: T((10**5)*n,k=(10**5)*n)/f((10**5)*n) - T(n,k=n)/f(n))(10**30)
-3.552713678800501e-15
The function T(N) is, at large scales, a multiple of the linear function x, hence linear (doubling the input doubles the output). This method, therefore, almost certainly achieves the bound of O(N) we conjecture. Though see the addendum for an interesting possibility.
...
Addendum
One pitfall is accidentally sorting. If we do anything which accidentally sorts our elements, we will incur a log(N) penalty at the least. Thus it is better to think of the arrays as sets, to avoid the pitfall of thinking that they are sorted.
Also, we might initially think that, because each step performs a constant 3K of work and the recursion depth is log(log(N)), we would have to do 3K*log(log(N)) work. But the -1 has a powerful role to play in the decimation of the problem size. It is just barely possible that the running time is actually something above linear, but definitely much smaller than even N*log(log(log(log(N)))). For example, it might be something like O(N*InverseAckermann(N)), but I hit the recursion limit when testing.
The O(K) is probably only due to the fact that we have to print them out; if we are content merely knowing where the data is, we might even be able to pull off an O(N) (e.g. if the arrays are of length O(log(K)) we might be able to achieve O(log(K)))... but that's another story.
The relation between the unordered sets is as follows (it would have cluttered things up in the explanation above).
.
_
/ \
(.....) > s > (.....)
s
(.....) > s > (.....)
s
(.....) > s > (.....)
\_/
v
S
v
/ \
(.....) > s > (.....)
s
(.....) > s > (.....)
s
(.....) > s > (.....)
\_/

Find the k largest numbers on each machine. O(n*log(k))
Combine the results (on a centralized server, if k is not huge; otherwise you can merge them in a tree hierarchy across the server cluster).
Update: to make it clear, the combine step is not a sort. You just pick the top k numbers from the results. There are many ways to do this efficiently. You can use a heap for example, pushing the head of each list. Then you can remove the head from the heap and push the head from the list the element belonged to. Doing this k times gives you the result. All this is O(k*log(k)).
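A minimal sketch of that combine step (my own illustration; it assumes each machine's top-k list arrives sorted descending):

```python
import heapq

def combine_top_k(lists, k):
    """Pick the top k from k descending-sorted lists: seed a heap with
    each list's head, then pop the overall max k times, each time
    refilling the heap from the list the popped element came from."""
    # Python's heapq is a min-heap, so store negated values.
    heap = [(-lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg, i, j = heapq.heappop(heap)
        result.append(-neg)
        if j + 1 < len(lists[i]):
            heapq.heappush(heap, (-lists[i][j + 1], i, j + 1))
    return result
```

The heap never holds more than k entries, so each of the k pops and pushes is O(log k), giving the O(k log k) claimed above.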

Maintain a min heap of size 'k' in the centralized server.
Initially insert first k elements into the min heap.
For the remaining elements
Check(peek) for the min element in the heap (O(1))
If the min element is less than the current element, then remove the min element from the heap and insert the current element.
Finally min heap will have 'k' largest elements
This would require O(n log k) time.
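In Python this might look like the following sketch (names are mine; `heapq` gives the O(1) peek and O(log k) replace):

```python
import heapq

def k_largest_stream(stream, k):
    """Keep a min-heap of the k largest values seen so far.
    heap[0] is always the smallest of the current candidates."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)          # first k elements
        elif x > heap[0]:                    # peek at the min: O(1)
            heapq.heapreplace(heap, x)       # pop min, push x: O(log k)
    return heap  # the k largest, in heap (not sorted) order
```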

I would suggest something like this:
take the k largest numbers on each machine in sorted order, O(Nk), where N is the number of elements on each machine
sort each of these arrays of k elements by largest element (you will get k arrays of k elements sorted by largest element: a square k×k matrix)
take the "upper triangle" of the matrix made of these k arrays of k elements (the k largest elements will be in this upper triangle)
the central machine can now find the k largest elements of these k(k+1)/2 elements
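A hypothetical sketch of this scheme (my own rendering). The reason the triangle suffices: for the list with i larger-headed lists above it, any element deeper than position k-i is dominated by at least k values — the elements above it in its own list plus the i larger heads:

```python
def upper_triangle_top_k(lists, k):
    """lists: one descending-sorted top-k list per machine.
    Order the lists by their largest element; the list ranked i can
    only contribute its first k - i elements, so only k(k+1)/2
    candidates remain for the central machine to examine."""
    ordered = sorted(lists, key=lambda lst: lst[0], reverse=True)
    candidates = []
    for i, lst in enumerate(ordered):
        candidates.extend(lst[:k - i])   # the "upper triangle" row
    return sorted(candidates, reverse=True)[:k]
```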

Let the machines find the k largest elements, copy them into a data structure (a stack), sort it, and pass it on to the central machine.
At the central machine, receive the stacks from all the machines. Find the greatest of the elements at the tops of the stacks.
Pop the greatest element from its stack and copy it to the 'TopK list'. Leave the other stacks intact.
Repeat the previous step k times to get the top K numbers.

1) sort the items on every machine
2) use a size-k binary heap on the central machine
a) populate the heap with the first (max) element from each machine
b) extract the first element, and put back into the heap the first element from the machine whose element you just extracted (of course, heapify your heap after the element is added).
Sort will be O(Nlog(N)) where N is the size of the largest array on the machines.
O(k) - to build the heap
O(klog(k)) to extract and populate the heap k times.
Complexity is max(O(klog(k)), O(Nlog(N)))

I would think the MapReduce paradigm would be well suited to a task like this.
Every machine runs its own independent map task to find the maximum value in its array (depending on the language used), and this will probably be O(N) complexity for N numbers on each machine.
The reduce task compares the result from the individual machines' outputs to give you the largest k numbers.

Related

constant memory reservoir sampling, O(k) possible?

I have an input stream, of size n, and I want to produce an output stream of size k that contains distinct random elements of the input stream, without requiring any additional memory for elements selected by the sample.
The algorithm I was going to use is basically as follows:
for each element in input stream
    if random() < k/n
        decrement k
        output element
        if k = 0
            halt
        end if
    end if
    decrement n
end for
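For reference, a directly runnable Python version of the pseudocode above (the name `sample_k` is mine). Note it always emits exactly k elements: the selection probability reaches 1 once the remaining k equals the remaining n:

```python
import random

def sample_k(stream, n, k):
    """Select k distinct elements from a stream of known length n,
    using only O(1) extra state beyond the output itself."""
    out = []
    for element in stream:
        if random.random() < k / n:   # probability (remaining k)/(remaining n)
            out.append(element)
            k -= 1
            if k == 0:
                break                 # early termination
        n -= 1
    return out
```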
The function random() generates a number from [0..1) on a random distribution, and I trust the algorithm's principle of operation is straightforward.
Although this algorithm can terminate early when it selects the last element, in general the algorithm is still approximately O(n). At first it seemed to work as intended (outputting roughly uniformly distributed but still random elements from the input stream), but I think there may be a non-uniform tendency to pick later elements when k is much less than n. I'm not sure about this, however... so I'd appreciate knowing for sure one way or the other. I'm also wondering if a faster algorithm exists. Obviously, since k elements must be generated, the algorithm cannot be any faster than O(k). For an O(k) solution, one could assume the existence of a function skip(x), which can skip over x elements in the input stream in O(1) time (but cannot skip backwards). I would still like to keep the requirement of not requiring any additional memory, however.
If it is a real stream, you need O(n) time to scan it.
Your existing algorithm is good. (I got that wrong before.) You can prove by induction that the probability that you have not picked the first element in i tries is 1 - i/n = (n-i)/n. First, that is true for i = 0 by inspection. Now, if you have not picked it in i tries, the odds that the next try picks it are 1/(n-i). And then the odds of picking it on the (i+1)th try are ((n-i)/n) * (1/(n-i)) = 1/n. Which means that the odds of not picking it in the first i+1 tries are 1 - i/n - 1/n = 1 - (i+1)/n. That completes the induction. And so the odds of picking the first element in the first k tries are one minus the odds of not having picked it, or 1 - (n-k)/n = k/n.
But what if you have O(1) access to any element? Well note that choosing k to take is the same as choosing n-k to leave. So without loss of generality we can assume that k <= n/2. What that means is that we can use a randomized algorithm like this:
chosen = set()
count_chosen = 0
while count_chosen < k:
    choice = random_element(stream)
    if choice not in chosen:
        chosen.add(choice)
        count_chosen = count_chosen + 1
The set will be O(k) space, and since the probability of each random choice being new to you is at least 0.5, the expected running time is no worse than 2k choices.

Simple Max Profit Scheduling Algo

Assume that you have an array of durations L = [5,8,2] with deadlines D = [13,8,7], and let E[i] be the end time of each activity. You receive (or lose) an amount D[i] - E[i] for each activity, and these sum to a total amount gained or lost. E depends on the order in which you do the activities. For example, if you do the activities in ascending order of L[i], the resulting E (indexed by the original activities) would be [7,15,2], and the total for this example is (13-7) + (8-15) + (7-2) = 4.
I've found the max value occurs after you sort the L array, which runs in O(n log n). What's fascinating is that after you sort the L array, there's no need to sort the D array, because you'll end up with the same max value for any arrangement of the deadlines (I've tried it on larger sets). Is there a better way to solve this problem, with running time less than O(n log n)? I've spent a couple of hours trying all sorts of linear tweaks on the lengths and deadlines, including conditional statements, to no avail. It seems to me this can be done in O(n) time, but I can't for the life of me find it.
You sort an unbounded array of integers. There are faster ways to sort integers than the ones based on just comparing their magnitude: O(n log log n) for a deterministic case and O(n sqrt(log log n)) for a randomized algorithm. See https://cstheory.stackexchange.com/a/19089 for more discussion.
If the integers are bounded (as in, you can guarantee they won't be larger than some value), counting sort will solve the problem in O(n).
Sorting the durations is the correct answer. As liori points out, there are different ways to sort integers, but regardless, you still need to sort the durations.
Let's look at an abstraction of the problem. Start with L[a,b,c] and D[x,y,z]. Assume that the tasks are executed in the order given, then the end times are E[a,a+b,a+b+c], and so
profit = (x - a) + (y - (a+b)) + (z - (a+b+c))
which is the same as
profit = x + y + z - 3a - 2b - c
From this, we can see that the order of the deadlines doesn't matter, but the order in which the tasks are executed is important. The duration of the first task is subtracted from the profit many times. But the duration of the last task is only subtracted from the profit once. So clearly, the tasks need to be done in order from shortest to longest.
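A quick sketch verifying this on the original example (names are mine; it brute-forces all orders to confirm that sorted durations are optimal and that the deadlines only enter through their fixed sum):

```python
from itertools import permutations

def profit(durations_in_order, deadlines):
    """Profit = sum(D[i] - E[i]). The deadlines contribute the fixed
    sum(D); only the order of execution affects the end times."""
    ends, elapsed = [], 0
    for d in durations_in_order:
        elapsed += d
        ends.append(elapsed)          # prefix sums = end times
    return sum(deadlines) - sum(ends)

# Shortest-job-first is optimal; the deadline order never matters.
L, D = [5, 8, 2], [13, 8, 7]
best = max(profit(list(p), D) for p in permutations(L))
assert best == profit(sorted(L), D) == 4
```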

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
1. Sort the n integers (for millions of integers, i.e. 10s to 100s of MB of RAM, this should not be a problem), and sum the k-1 smallest. Call this total offset.
2. Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
3. On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
4. Restrict your set of n integers to those less than or equal to this bound.
5. Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
Even if only 1 in 1000 of the k-sized sets meets your condition, that's still far too many combinations to test. I believe the runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds = 10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant amount of (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take 3 quadrillion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
min sum(i) w_i*x_i
subject to
    sum(i) x_i = k
    x_i binary
    (some constraints on w_i*x_i)
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
    for v = 5-element subsets of (1, ..., i)
        set = arr{v}
        if condition(set) is satisfied
            break_loop = true
            compute sum(set), keep set if it is the best so far
    break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
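A runnable sketch of that loop (my own rendering; `condition` is whatever predicate you have, and at each prefix length only the subsets using the newest index are tested, to avoid rechecking):

```python
from itertools import combinations

def first_satisfying_by_prefix(arr, k, condition):
    """The approximate scheme above: sort, then grow the allowed
    prefix one index at a time, checking only k-subsets that include
    the newest index. Stop after the first prefix length that yields
    a satisfying subset, keeping the smallest-sum one found there."""
    arr = sorted(arr)
    best = None
    for i in range(k - 1, len(arr)):
        # k-subsets of arr[:i+1] that include arr[i]
        for rest in combinations(range(i), k - 1):
            subset = [arr[j] for j in rest] + [arr[i]]
            if condition(subset) and (best is None or sum(subset) < sum(best)):
                best = subset
        if best is not None:
            break
    return best
```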
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

Is this searching algorithm optimal?

I have two lists, L and M, each containing thousands of 64-bit unsigned integers. I need to find out whether the sum of any two members of L is itself a member of M.
Is it possible to improve upon the performance of the following algorithm?
Sort(M)
for i = 0 to Length(L)
    for j = i + 1 to Length(L)
        BinarySearch(M, L[i] + L[j])
(I'm assuming your goal is to find all pairs in L that sum to something in M)
Forget hashtables!
Sort both lists.
Then do the outer loop of your algorithm: walk over every element i in L, then every larger element j in L. As you go, form the sum and check to see if it's in M.
But don't look using a binary search: simply do a linear scan from the last place you looked. Let's say you're working on some value i, and you have some value j, followed by some value j'. When searching for (i+j), you would have got to the point in M where that value is found, or to the first larger value. You're now looking for (i+j'); since j' > j, you know that (i+j') > (i+j), and so it cannot be any earlier in M than the last place you got to. If L and M are both smoothly distributed, there is an excellent chance that the point in M where you would find (i+j') is only a little way off.
If the arrays are not smoothly distributed, then better than a linear scan might be some sort of jumping scan - look forward N elements at a time, halving N if the jump goes too far.
I believe this algorithm is O(n^2), which is as fast as any proposed hash algorithm (those have an O(1) primitive operation, but still have to do O(n^2) of them). It also means that you don't have to worry about the O(n log n) to sort. It has much better data locality than the hash algorithms - it basically consists of paired streamed reads over the arrays, repeated n times.
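A sketch of the paired-scan inner loop (my illustration; for each i, the scan position in M only ever moves forward, exactly as described above):

```python
def pairs_summing_into(L, M):
    """For each i in sorted L, walk j forward through the rest of L;
    since L[i] + L[j] only grows with j, the matching position in M
    never moves backwards within a pass."""
    L, M = sorted(L), sorted(M)
    found = []
    for i in range(len(L)):
        m = 0  # resume position in M for this value of i
        for j in range(i + 1, len(L)):
            s = L[i] + L[j]
            while m < len(M) and M[m] < s:
                m += 1                 # linear scan, never backwards
            if m == len(M):
                break                  # every later sum is too big
            if M[m] == s:
                found.append((L[i], L[j]))
    return found
```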
EDIT: I have written implementations of Paul Baker's original algorithm, Nick Larsen's hashtable algorithm, and my algorithm, and a simple benchmarking framework. The implementations are simple (linear probing in the hashtable, no skipping in my linear search), and i had to make guesses at various sizing parameters. See http://urchin.earth.li/~twic/Code/SumTest/ for the code. I welcome corrections or suggestions, about any of the implementations, the framework, and the parameters.
For L and M containing 3438 items each, with values ranging from 1 to 34380, and with Larsen's hashtable having a load factor of 0.75, the median times for a run are:
Baker (binary search): 423 716 646 ns
Larsen (hashtable): 733 479 121 ns
Anderson (linear search): 62 077 597 ns
The difference is much bigger than I had expected (and, I admit, not in the direction I had expected). I suspect I have made one or more major mistakes in the implementation. If anyone spots one, I really would like to hear about it!
One thing is that I have allocated Larsen's hashtable inside the timed method. It is thus paying the cost of allocation and (some) garbage collection. I think this is fair, because it's a temporary structure only needed by the algorithm. If you think it's something that could be reused, it would be simple enough to move it into an instance field and allocate it only once (and Arrays.fill it with zero inside the timed method), and see how that affects performance.
The complexity of the example code in the question is O(m log m + l^2 log m), where l = |L| and m = |M|, as it runs a binary search (O(log m)) for every pair of elements in L (O(l^2) pairs), and M is sorted first.
Replacing the binary search with a hash table reduces the complexity to O(l^2), assuming that hash table insert and lookup are O(1) operations.
This is asymptotically optimal as long as you assume that you need to process every pair of numbers on the list L, as there are O(l^2) such pairs. If there are a couple of thousands of numbers on L, and they are random 64-bit integers, then you definitely need to process all the pairs.
Instead of sorting M at a cost of n * log(n), you could create a hash set at the cost of n.
You could also store all sums in another hash set while iterating and add a check to make sure you don't perform the same search twice.
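A sketch of that idea (names are mine; it returns the distinct pair sums from L that occur in M, using a second set so no sum is searched twice):

```python
def sums_in_m(L, M):
    """Hash-set variant: O(1) expected lookups into M, plus a
    'searched' set recording sums already tested."""
    m_set = set(M)        # built in O(m), replacing the O(m log m) sort
    searched = set()
    found = []
    for i in range(len(L)):
        for j in range(i + 1, len(L)):
            s = L[i] + L[j]
            if s in searched:
                continue          # never perform the same search twice
            searched.add(s)
            if s in m_set:
                found.append(s)
    return sorted(found)
```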
You can avoid the binary search by using a hashtable instead of a sorted M array.
Alternatively, add all of the members of L to a hashset lSet, then iterate over M, performing these steps for each m in M:
add m to hashset mSet - if m is already in mSet, skip this iteration; if m is in hashset dSet, also skip this iteration.
subtract each member l of L less than m from m to give d, and test whether d is also in lSet;
if so, add (l, d) to some collection rSet; add d to hashset dSet.
This will require fewer iterations, at the cost of more memory. You will want to pre-allocate the memory for the structures, if this is to give you a speed increase.

Finding the hundred largest numbers in a file of a billion

I went to an interview today and was asked this question:
Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?
I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?
Obviously the interviewers want you to point out two key facts:
You cannot read the whole list of integers into memory, since it is too large. So you will have to read it one by one.
You need an efficient data structure to hold the 100 largest elements. This data structure must support the following operations:
Get-Size: Get the number of values in the container.
Find-Min: Get the smallest value.
Delete-Min: Remove the smallest value to replace it with a new, larger value.
Insert: Insert another element into the container.
By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.
For example, for Fibonacci heaps, the operations Get-Size, Find-Min and Insert all are O(1) and Delete-Min is O(log n) (with n <= 100 in this case).
In practice, you could use a priority queue from your favorite language's standard library (e.g. priority_queue from #include <queue> in C++), which is usually implemented using a heap.
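For instance, in Python the whole exercise collapses to a sketch like this (my own example; `heapq.nlargest` maintains exactly the size-100 min-heap described above, so memory stays O(100) no matter how large the file is):

```python
import heapq

def largest_k_from_file(path, k=100):
    """Stream integers from a file (one per line), keeping only the
    k largest; the file is read once and never held in memory."""
    with open(path) as f:
        return heapq.nlargest(k, (int(line) for line in f))
```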
Here's my initial algorithm:
create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile
This has the (very slight) advantage that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort, and that you very quickly identify and throw away those that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than 50 (on average) for a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).
You may even get bonus points for suggesting the use of optimised shift operations like memcpy in C provided you can be sure the overlap isn't a problem.
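A runnable Python rendering of this first algorithm (my own sketch; `bisect.insort` plays the role of the binary search plus shift, and `islice` grabs the first 100):

```python
import bisect
from itertools import islice

def top_100(numbers):
    """The scheme above: a sorted 100-slot window. One comparison
    against window[0] rejects most numbers; survivors are placed by
    binary search, shifting the smaller entries down."""
    it = iter(numbers)
    window = sorted(islice(it, 100))   # first 100 numbers, ascending
    for n in it:
        if n > window[0]:              # beats the current smallest?
            del window[0]              # shift out the smallest
            bisect.insort(window, n)   # binary-search insertion
    return window                      # ascending order
```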
One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):
read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else
        copy list 3 to list 1.
    endif
endwhile
I'm not sure, but that may end up being more efficient than the continual shuffling.
The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):
list3.clear()
while list3.size() < 100 and not (list1.isEmpty() and list2.isEmpty()):
    while list3.size() < 100 and list1.notEmpty() and (list2.isEmpty() or list1.peek() >= list2.peek()):
        list3.add(list1.pop())
    endwhile
    while list3.size() < 100 and list2.notEmpty() and (list1.isEmpty() or list2.peek() >= list1.peek()):
        list3.add(list2.pop())
    endwhile
endwhile
Simply put, pulling the top 100 values out of the combined list by virtue of the fact they're already sorted in descending order. I haven't checked in detail whether that would be more efficient, I'm just offering it as a possibility.
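A sketch of that merge step in Python (names are mine), guarding against either list running out before list 3 is full:

```python
def merge_top(list1, list2, k=100):
    """Merge two descending-sorted lists, keeping only the k largest values."""
    out, i, j = [], 0, 0
    while len(out) < k and (i < len(list1) or j < len(list2)):
        # take from list1 if list2 is exhausted or list1's head is at least as large
        if j >= len(list2) or (i < len(list1) and list1[i] >= list2[j]):
            out.append(list1[i])
            i += 1
        else:
            out.append(list2[j])
            j += 1
    return out

print(merge_top([9, 7, 5, 3], [8, 6, 4], k=5))  # -> [9, 8, 7, 6, 5]
```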
I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.
As with most interviews, technical skill is just one of the things they're looking at.
Create an array of 100 numbers, all initialised to -2^31.
Check whether each number you read from disk is greater than the smallest in the array. If it is, shift the smaller entries down one index and insert the new number in its place. If not, move on to the next number from disk.
When you've finished reading all 1 billion numbers, you should have the highest 100 in the array.
Job done.
I'd traverse the list in order. As I go, I add elements to a set (or a multiset, depending on duplicates). Once the set reaches 100 elements, I'd only insert a value if it is greater than the min in the set (O(log m)), and then delete the min.
Calling the number of values in the list n and the number of values to find m:
this is O(n * log m)
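In Python, for instance, the standard library's heapq.nlargest does exactly this, maintaining a min-heap of size m while scanning:

```python
import heapq

# a scrambled but deterministic sample (37 is coprime to 1000,
# so this is a permutation of 0..999)
values = [n * 37 % 1000 for n in range(1000)]
top5 = heapq.nlargest(5, values)  # O(n log m) with a min-heap of size m=5
print(top5)  # -> [999, 998, 997, 996, 995]
```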
Speed of the processing algorithm is absolutely irrelevant (unless it's completely dumb).
The bottleneck here is I/O (it's specified that they are on disk). So make sure that you work with large buffers.
Keep a fixed array of 100 integers, initialised to Int.MinValue. As you read through the 1 billion integers, compare each with the number in the first cell of the array (index 0). If it's larger, move up to the next cell, and keep moving up until you hit the end or a smaller value. Then store the value at that index and shift all values in the previous cells down one cell. Do this and you will end up with the 100 largest integers.
I believe the quickest way to do this is by using a very large bit map to record which numbers are present. Representing every 32-bit integer needs 2^32 / 8 bytes, i.e. 512 MiB. Scan through the integers, simply setting the corresponding bit in the bit map. Then look for the highest 100 entries.
NOTE: This finds the highest 100 numbers not the highest 100 instances of a number if you see the difference.
This kind of approach is discussed in the very good book Programming Pearls which your interviewer may have read!
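As a rough sketch (my own toy version, sized to the maximum value rather than the full 2^32 range, and assuming non-negative inputs):

```python
def top100_bitmap(numbers, k=100):
    """Bitmap approach: mark every value seen, then scan the bitmap downward
    for the k highest distinct values (duplicates are lost, as noted above)."""
    limit = max(numbers) + 1
    bitmap = bytearray((limit + 7) // 8)
    for n in numbers:
        bitmap[n >> 3] |= 1 << (n & 7)   # set bit n
    result = []
    for n in range(limit - 1, -1, -1):   # scan from the top down
        if bitmap[n >> 3] & (1 << (n & 7)):
            result.append(n)
            if len(result) == k:
                break
    return result

print(top100_bitmap([5, 1, 9, 3, 7, 9, 2], k=3))  # -> [9, 7, 5]
```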
You are going to have to check every number, there is no way around that.
Just as a slight improvement on solutions offered,
Given a list of 100 numbers:
9595
8505
...
234
1
You would check to see whether the newly found value is greater than the minimum value in the array and, if so, insert it. However, searching from bottom to top can be quite expensive, so you might consider a divide-and-conquer approach: for example, compare against the 50th item in the array; then you know whether the value belongs among the first 50 items or the bottom 50. Repeating this process gives a much faster search, since each step eliminates 50% of the search space.
Also consider the data type of the integers. If they are 32-bit integers and you are on a 64-bit system, you may be able to do some clever memory handling and bitwise operations to process two numbers at once if they are contiguous in memory.
I think someone should have mentioned a priority queue by now. You just need to keep the current top 100 numbers, know what the lowest is and be able to replace that with a higher number. That's what a priority queue does for you - some implementations may sort the list, but it's not required.
Assuming that all 1 billion + 100 numbers fit into memory, the best approach is heap sort: build a heap and extract the first 100 numbers. Complexity: O(n + 100 log n), since building the heap is O(n) and each extraction is O(log n).
Improving the solution: divide the implementation into two heaps (so that insertions are less complex) and, while fetching the first 100 elements, use a merge algorithm.
Here's some Python code which implements the algorithm suggested by Ferdinand Beyer above. Essentially it's a min-heap; the only difference is that deletion has been merged with the insertion operation:
import random

class myds:
    """ implement a min-heap to find the k greatest numbers out of all that are provided """
    def __init__(self, k, getnext):
        """ k is the number of integers to return; getnext is an iterable that yields the input numbers """
        assert k > 0
        self.k = k
        self.getnext = getnext
        self.heap = []  # instance attribute: a shared class attribute would leak state between instances
    def housekeeping_bubbleup(self, index):
        if index == 0:
            return
        parent_index = (index - 1) // 2
        if self.heap[parent_index] > self.heap[index]:
            self.heap[index], self.heap[parent_index] = self.heap[parent_index], self.heap[index]
            self.housekeeping_bubbleup(parent_index)
    def insertonly_level1(self, n):
        """ runs for the first k numbers only; must always insert, or the heap never fills """
        self.heap.append(n)
        self.housekeeping_bubbleup(len(self.heap) - 1)
    def housekeeping_bubbledown(self, index, length):
        child_index_l = 2 * index + 1
        child_index_r = 2 * index + 2
        if child_index_l >= length:  # no child
            return
        if child_index_r >= length:  # only a left child
            child_index = child_index_l
        elif self.heap[child_index_r] < self.heap[child_index_l]:  # both children: pick the smaller
            child_index = child_index_r
        else:
            child_index = child_index_l
        if self.heap[child_index] < self.heap[index]:  # only swap if the heap property is violated
            self.heap[index], self.heap[child_index] = self.heap[child_index], self.heap[index]
            self.housekeeping_bubbledown(child_index, length)
    def insertdelete_level1(self, n):
        """ replace the minimum with n, then restore the heap property """
        self.heap[0] = n
        self.housekeeping_bubbledown(0, len(self.heap))
    def insert_to_myds(self, n):
        if len(self.heap) < self.k:
            self.insertonly_level1(n)
        elif n > self.heap[0]:
            self.insertdelete_level1(n)
    def run(self):
        for n in self.getnext:
            self.insert_to_myds(n)
        return self.heap

def createinput(n):
    input_arr = list(range(n))
    random.shuffle(input_arr)
    with open('input', 'w') as f:
        for value in input_arr:
            f.write('%d\n' % value)

createinput(20)  # generate a small test input
with open('input') as f:
    input_arr = [int(x) for x in f]
myds_object = myds(4, iter(input_arr))
output = myds_object.run()
print(output)
If you find the 100th order statistic using a quicksort-style partition (quickselect), it will run in O(n) on average, with n = one billion. But I doubt that, with numbers at this scale and the random access this approach needs, it would be faster in practice than the O(n log(100)) approaches.
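For illustration, a hedged quickselect sketch (my own code, assuming the data fits in memory):

```python
import random

def largest_k(arr, k):
    """Average-O(n) selection of the k largest values via quickselect-style
    partitioning; rearranges arr in place, result order unspecified."""
    lo, hi = 0, len(arr) - 1
    target = len(arr) - k              # final sorted position of the k-th largest
    while lo < hi:
        pivot = arr[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                  # Hoare-style partition around pivot
            while arr[i] < pivot:
                i += 1
            while arr[j] > pivot:
                j -= 1
            if i <= j:
                arr[i], arr[j] = arr[j], arr[i]
                i += 1
                j -= 1
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break                      # target lies in the run of pivot-equal values
    return arr[target:]

print(sorted(largest_k(list(range(20)), 5)))  # -> [15, 16, 17, 18, 19]
```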
Here is another solution (about an eon later, I have no shame, sorry!) based on the second one provided by paxdiablo. The basic idea is that you should read another k numbers only if they're greater than the minimum you already have, and that sorting is not really necessary:
// your variables
n = 100
k = a number > n and << 1 billion
create array1[n], array2[k]
read first n numbers into array2
find minimum and maximum of array2
while more numbers:
    if number > maximum:
        store in array1
        if array1 is full:  // I don't need the contents of array2 anymore
            array2 = array1
            array1 = []
            find minimum and maximum of array2
    else if number > minimum:
        store in array2
        if array2 is full:
            x = n - array1.count()
            find the x largest numbers of array2 and discard the rest
            find minimum and maximum of array2
    else:
        discard the number
endwhile
// Finally
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
return merge of array1 and array2
The critical step is the function for finding the x largest numbers in array2, but you can use the fact that you know its minimum and maximum to speed that function up. And there are lots of possible optimisations, since you don't really need to sort it; you just need the x largest numbers.
Furthermore, if k is big enough and you have enough memory, you could even turn it into a recursive algorithm for finding the n largest numbers.
Finally, if the numbers are already sorted (in any order), the algorithm is O(n).
Obviously, this is just theoretical, because in practice you would use standard sorting algorithms and the bottleneck would probably be the I/O.
There are lots of clever approaches (like the priority queue solutions), but one of the simplest things you can do can also be fast and efficient.
If you want the top k of n, consider:
allocate an array of k ints
while more input:
    perform insertion sort of next value into the array
This may sound absurdly simplistic. You might expect this to be O(n^2), but it's actually only O(k*n), and if k is much smaller than n (as is postulated in the problem statement), it approaches O(n).
You might argue that the constant factor is too high because doing an average of k/2 comparisons and moves per input is a lot. But most values will be trivially rejected on the first comparison against the kth largest value seen so far. If you have a billion inputs, only a small fraction are likely to be larger than the 100th so far.
(You could construe a worst-case input where each value is larger than its predecessor, thus requiring k comparisons and moves for every input. But that is essentially a sorted input, and the problem statement said the input is unsorted.)
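A sketch of the idea in Python (names are mine; a C version would use a fixed int array and memmove for the shifts):

```python
def top_k_insertion(numbers, k=100):
    """Keep the k largest values in a descending-sorted array via insertion
    sort. Most inputs fail the first comparison (against the k-th largest
    seen so far) and cost O(1)."""
    top = []
    for n in numbers:
        if len(top) < k:
            top.append(n)
            i = len(top) - 1
        elif n > top[-1]:                    # beat the k-th largest so far?
            top[-1] = n
            i = k - 1
        else:
            continue                         # trivial rejection: one comparison
        while i > 0 and top[i] > top[i - 1]: # slide up into descending order
            top[i], top[i - 1] = top[i - 1], top[i]
            i -= 1
    return top

print(top_k_insertion([3, 9, 1, 7, 5, 8], k=3))  # -> [9, 8, 7]
```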
Even the binary-search improvement (to find the insertion point) only cuts the comparisons to ceil(log_2(k)), and unless you special case an extra comparison against the kth-so-far, you're much less likely to get the trivial rejection of the vast majority of inputs. And it does nothing to reduce the number of moves you need. Given caching schemes and branch prediction, doing 7 non-consecutive comparisons and then 50 consecutive moves doesn't seem likely to be significantly faster than doing 50 consecutive comparisons and moves. It's why many system sorts abandon Quicksort in favor of insertion sort for small sizes.
Also consider that this requires almost no extra memory and that the algorithm is extremely cache friendly (which may or may not be true for a heap or priority queue), and it's trivial to write without errors.
The process of reading the file is probably the major bottleneck, so the real performance gains are unlikely to come from the selection step; keep it simple, and focus your efforts on finding a good buffering strategy to minimise the I/O.
If k can be arbitrarily large, approaching n, then it makes sense to consider a priority queue or other, smarter, data structure. Another option would be to split the input into multiple chunks, sort each of them in parallel, and then merge.
