constant memory reservoir sampling, O(k) possible? - algorithm

I have an input stream, of size n, and I want to produce an output stream of size k that contains distinct random elements of the input stream, without requiring any additional memory for elements selected by the sample.
The algorithm I was going to use is basically as follows:
for each element in input stream
if random()<k/n
decrement k
output element
if k = 0
halt
end if
end if
decrement n
end for
The function random() generates a number from [0..1) on a random distribution, and I trust the algorithm's principle of operation is straightforward.
Although this algorithm can terminate early when it selects the last element, in general the algorithm is still approximately O(n). At first it seemed to work as intended (outputting roughly uniformly distributed but still random elements from the input stream), but I think there may be a non-uniform tendency to pick later elements when k is much less than n. I'm not sure about this, however... so I'd appreciate knowing for sure one way or the other. I'm also wondering if a faster algorithm exists. Obviously, since k elements must be generated, the algorithm cannot be any faster than O(k). For an O(k) solution, one could assume the existence of a function skip(x), which can skip over x elements in the input stream in O(1) time (but cannot skip backwards). I would still like to keep the requirement of not requiring any additional memory, however.

If it is a real stream, you need O(n) time to scan it.
Your existing algorithm is good. (I got that wrong before.) You can prove by induction that the probability that you have not picked the first element in i tries is 1 - i/n = (n-i)/n. First that is true for i=0 by inspection. Now if you have not picked it in ith tries, the odds that the next one picks it is 1/(n-i). And then the odds of picking it on the i+1'th try is ((n-i)/n) * (1/(n-i)) = 1/n. Which means that the odds of not picking it in the first i+1 times is 1 - i/n - 1/n = 1 - (i+i)/n. That completes induction. And so the odds of picking the first element in the first k tries is the odds of not having not picked it, or 1 - (n - k/n) = k/n.
But what if you have O(1) access to any element? Well note that choosing k to take is the same as choosing n-k to leave. So without loss of generality we can assume that k <= n/2. What that means is that we can use a randomized algorithm like this:
chosen = set()
count_chosen = 0
while count_chosen < k:
choice = random_element(stream)
if choice not in chosen:
chosen.add(choice)
count_chosen = count_chosen + 1
The set will be O(k) space, and since the probability of each random choice being new to you is at least 0.5, the expected running time is no worse than 2k choices.

Related

searching through a vast collection of potential solutions

I have a quite difficult problem (perhaps even a NP-hard problem ^^) with looking for a solution in a massive collection of results. Perhaps there is an algorithm for it.
Below exercise is artificial but is a perfect example to illustrate my issue.
There is a big array with integers. Lets say it has 100.000 elements.
int numbers[] = {-123,32,4,-234564,23,5,....}
I want to check in a relatively quick way if a sum on any 2 numbers from this array is equal to 0. In other words, if the array has "-123" I want to find is there also a "123" number.
The easiest solution would be brute force - check everything with everything. That gives 100.000 x 100.000 a big number ;-) Obviously brute force method can by optimised. Order numbers and check negatives against positive only. My question is - is there something better then optimised brute force to find a solution?
First, sort the array by magnitude of the value.
Then, if the data contains a pair which satisfies the conditions you're after, it contains such a pair adjacent in the array. So just sweep through looking for adjacent pairs whose sum is 0.
Overall time complexity is O(n log n) for the sort, could be O(n) if you use "cheating" sorts not based solely on comparisons. Clearly it can't be done in less than linear time, because in the worst case you can't do it without looking at all the elements. I think n log n is probably optimal in the decision tree model of computing, but only because it "feels a bit like" the element uniqueness problem.
Alternative approach:
Add the elements one at a time to a hash-based or tree-based container. Before adding each element, check whether its negative is present. If so, stop.
This is likely to be faster in the case where there are lots of suitable pairs, because you save the cost of sorting the whole data. That said, you could write a modified sort that exits early by checking for adjacent pairs as soon as any subset of the data is in its final order, but that's effort.
Brute force would be an O(n^2) solution. You can certainly do better.
Off the top of my head, first sort it. Heap sort will have a complexity of O(nlogn).
Now, for the first element, say a, you know you need to find an element b, such that a+b = 0. This can be found using binary search (since your array is now sorted). Binary search has a complexity of O(logn).
This gives you an overall solution of O(nlogn) complexity.
The example you provided can be brute-force solved in O(n^2) time.
You can start ordering the numbers (O(n·logn)) from smaller to bigger. If you place one pointer at the beginning (the "most negative number") and other at the end (the "most positive"), you can check if there is such pair of numbers in an additional O(n) steps by following the next procedure:
If the numbers at both pointers have the same module, you have the solution
If not, move the pointer of the number with bigger module towards "zero" (this is, increase if it is the pointer on the negative side, decrease if it is the positive-side one)
Repeat until finding a solution, or the pointers cross.
Total complexity is O(n·logn)+O(n) = O(n·logn).
Sort your array using Quicksort. After this happened, use two indexes, let's call them positive and negative.
positive <- 0
negative <- size - 1
while ((array[positive] > 0) and (array(negative < 0) and (positive >= 0) and (negative < size)) do
delta <- array[positive] + array[negative]
if (delta = 0) then
return true
else if (delta < 0) then
negative <- negative + 1
else
positive <- positive - 1
end if
end while
return (array[positive] * array[negative] = 0)
You didn't say what should the algorithm do if 0 is part of the array, I've supposed that in this case true should be returned.

Find the largest k numbers in k arrays stored across k machines

This is an interview question. I have K machines each of which is connected to 1 central machine. Each of the K machines have an array of 4 byte numbers in file. You can use any data structure to load those numbers into memory on those machines and they fit. Numbers are not unique across K machines. Find the K largest numbers in the union of the numbers across all K machines. What is the fastest I can do this?
(This is an interesting problem because it involves parallelism. As I haven't encountered parallel algorithm optimization before, it's quite amusing: you can get away with some ridiculously high-complexity steps, because you can make up for it later. Anyway, onto the answer...)
> "What is the fastest I can do this?"
The best you can do is O(K). Below I illustrate both a simple O(K log(K)) algorithm, and the more complex O(K) algorithm.
First step:
Each computer needs enough time to read every element. This means that unless the elements are already in memory, one of the two bounds on the time is O(largest array size). If for example your largest array size varies as O(K log(K)) or O(K^2) or something, no amount of algorithmic trickery will let you go faster than that. Thus the actual best running time is O(max(K, largestArraySize)) technically.
Let us say the arrays have a max length of N, which is <=K. With the above caveat, we're allowed to bound N<K since each computer has to look at each of its elements at least once (O(N) preprocessing per computer), each computer can pick the largest K elements (this is known as finding kth-order-statistics, see these linear-time algorithms). Furthermore, we can do so for free (since it's also O(N)).
Bounds and reasonable expectations:
Let's begin by thinking of some worst-case scenarios, and estimates for the minimum amount of work necessary.
One minimum-work-necessary estimate is O(K*N/K) = O(N), because we need to look at every element at the very least. But, if we're smart, we can distribute the work evenly across all K computers (hence the division by K).
Another minimum-work-necessary estimate is O(N): if one array is larger than all elements on all other computers, we return the set.
We must output all K elements; this is at least O(K) to print them out. We can avoid this if we are content merely knowing where the elements are, in which case the O(K) bound does not necessarily apply.
Can this bound of O(N) be achieved? Let's see...
Simple approach - O(NlogN + K) = O(KlogK):
For now let's come up with a simple approach, which achieves O(NlogN + K).
Consider the data arranged like so, where each column is a computer, and each row is a number in the array:
computer: A B C D E F G
10 (o) (o)
9 o (o) (o)
8 o (o)
7 x x (x)
6 x x (x)
5 x ..........
4 x x ..
3 x x x . .
2 x x . .
1 x x .
0 x x .
You can also imagine this as a sweep-line algorithm from computation geometry, or an efficient variant of the 'merge' step from mergesort. The elements with parentheses represent the elements with which we'll initialize our potential "candidate solution" (in some central server). The algorithm will converge on the correct o responses by dumping the (x) answers for the two unselected os.
Algorithm:
All computers start as 'active'.
Each computer sorts its elements. (parallel O(N logN))
Repeat until all computers are inactive:
Each active computer finds the next-highest element (O(1) since sorted) and gives it to the central server.
The server smartly combines the new elements with the old K elements, and removes an equal number of the lowest elements from the combined set. To perform this step efficiently, we have a global priority queue of fixed size K. We insert the new potentially-better elements, and bad elements fall out of the set. Whenever an element falls out of the set, we tell the computer which sent that element to never send another one. (Justification: This always raises the smallest element of the candidate set.)
(sidenote: Adding a callback hook to falling out of a priority queue is an O(1) operation.)
We can see graphically that this will perform at most 2K*(findNextHighest_time + queueInsert_time) operations, and as we do so, elements will naturally fall out of the priority queue. findNextHighest_time is O(1) since we sorted the arrays, so to minimize 2K*queueInsert_time, we choose a priority queue with an O(1) insertion time (e.g. a Fibonacci-heap based priority queue). This gives us an O(log(queue_size)) extraction time (we cannot have O(1) insertion and extraction); however, we never need to use the extract operation! Once we are done, we merely dump the priority queue as an unordered set, which takes O(queue_size)=O(K) time.
We'd thus have O(N log(N) + K) total running time (parallel sorting, followed by O(K)*O(1) priority queue insertions). In the worst case of N=K, this is O(K log(K)).
The better approach - O(N+K) = O(K):
However I have come up with a better approach, which achieves O(K). It is based on the median-of-median selection algorithm, but parallelized. It goes like this:
We can eliminate a set of numbers if we know for sure that there are at least K (not strictly) larger numbers somewhere among all the computers.
Algorithm:
Each computer finds the sqrt(N)th highest element of its set, and splits the set into elements < and > it. This takes O(N) time in parallel.
The computers collaborate to combine those statistics into a new set, and find the K/sqrt(N)th highest element of that set (let's call it the 'superstatistic'), and note which computers have statistics < and > the superstatistic. This takes O(K) time.
Now consider all elements less than their computer's statistics, on computers whose statistic is less than the superstatistic. Those elements can be eliminated. This is because the elements greater than their computer's statistic, on computers whose statistic is larger than the superstatistic, are a set of K elements which are larger. (See the visual here).
Now, the computers with the uneliminated elements evenly redistribute their data to the computers who lost data.
Recurse: you still have K computers, but the value of N has decreased. Once N is less than a predetermined constant, use the previous algorithm I mentioned in "simple approach - O(NlogN + K)"; except in this case, it is now O(K). =)
It turns out that the reductions are O(N) total (amazingly not order K), except perhaps the final step which might by O(K). Thus this algorithm is O(N+K) = O(K) total.
Analysis and simulation of O(K) running time below. The statistics allow us to divide the world into four unordered sets, represented here as a rectangle divided into four subboxes:
------N-----
N^.5
________________
| | s | <- computer
| | #=K s REDIST. | <- computer
| | s | <- computer
| K/N^.5|-----S----------| <- computer
| | s | <- computer
K | s | <- computer
| | s ELIMIN. | <- computer
| | s | <- computer
| | s | <- computer
| |_____s__________| <- computer
LEGEND:
s=statistic, S=superstatistic
#=K -- set of K largest elements
(I'd draw the relation between the unordered sets of rows and s-column here, but it would clutter things up; see the addendum right now quickly.)
For this analysis, we will consider N as it decreases.
At a given step, we are able to eliminate the elements labelled ELIMIN; this has removed area from the rectangle representation above, reducing the problem size from K*N to , which hilariously simplifies to
Now, the computers with the uneliminated elements redistribute their data (REDIST rectangle above) to the computers with eliminated elements (ELIMIN). This is done in parallel, where the bandwidth bottleneck corresponds to the length of the short size of REDIST (because they are outnumbered by the ELIMIN computers which are waiting for their data). Therefore the data will take as long to transfer as the long length of the REDIST rectangle (another way of thinking about it: K/√N * (N-√N) is the area, divided by K/√N data-per-time, resulting in O(N-√N) time).
Thus at each step of size N, we are able to reduce the problem size to K(2√N-1), at the cost of performing N + 3K + (N-√N) work. We now recurse. The recurrence relation which will tell us our performance is:
T(N) = 2N+3K-√N + T(2√N-1)
The decimation of the subproblem size is much faster than the normal geometric series (being √N rather than something like N/2 which you'd normally get from common divide-and-conquers). Unfortunately neither the Master Theorem nor the powerful Akra-Bazzi theorem work, but we can at least convince ourselves it is linear via a simulation:
>>> def T(n,k=None):
... return 1 if n<10 else sqrt(n)*(2*sqrt(n)-1)+3*k+T(2*sqrt(n)-1, k=k)
>>> f = (lambda x: x)
>>> (lambda n: T((10**5)*n,k=(10**5)*n)/f((10**5)*n) - T(n,k=n)/f(n))(10**30)
-3.552713678800501e-15
The function T(N) is, at large scales, a multiple of the linear function x, hence linear (doubling the input doubles the output). This method, therefore, almost certainly achieves the bound of O(N) we conjecture. Though see the addendum for an interesting possibility.
...
Addendum
One pitfall is accidentally sorting. If we do anything which accidentally sorts our elements, we will incur a log(N) penalty at the least. Thus it is better to think of the arrays as sets, to avoid the pitfall of thinking that they are sorted.
Also we might initially think that with the constant amount of work at each step of 3K, so we would have to do work 3Klog(log(N)) work. But the -1 has a powerful role to play in the decimation of the problem size. It is very slightly possible that the running time is actually something above linear, but definitely much smaller than even Nlog(log(log(log(N)))). For example it might be something like O(N*InverseAckermann(N)), but I hit the recursion limit when testing.
The O(K) is probably only due to the fact that we have to print them out; if we are content merely knowing where the data is, we might even be able to pull off an O(N) (e.g. if the arrays are of length O(log(K)) we might be able to achieve O(log(K)))... but that's another story.
The relation between the unordered sets is as follows. Would have cluttered things up in explanation.
.
_
/ \
(.....) > s > (.....)
s
(.....) > s > (.....)
s
(.....) > s > (.....)
\_/
v
S
v
/ \
(.....) > s > (.....)
s
(.....) > s > (.....)
s
(.....) > s > (.....)
\_/
Find the k largest numbers on each machine. O(n*log(k))
Combine the results (on a centralized server, if k is not huge, otherwise you can merge them in a tree-hierarchy accross the server cluster).
Update: to make it clear, the combine step is not a sort. You just pick the top k numbers from the results. There are many ways to do this efficiently. You can use a heap for example, pushing the head of each list. Then you can remove the head from the heap and push the head from the list the element belonged to. Doing this k times gives you the result. All this is O(k*log(k)).
Maintain a min heap of size 'k' in the centralized server.
Initially insert first k elements into the min heap.
For the remaining elements
Check(peek) for the min element in the heap (O(1))
If the min element is lesser than the current element, then remove the min element from heap and insert the current element.
Finally min heap will have 'k' largest elements
This would require n(log k) time.
I would suggest something like this:
take the k largest numbers on each machine in sorted order O(Nk) where N is the number of element on each machine
sort each of these arrays of k elements by largest element (you will get k arrays of k elements sorted by largest element : a square matrix kxk)
take the "upper triangle" of the matrix made of these k arrays of k elements, (the k largest element will be in this upper triangle)
the central machine can now find the k largest element of these k(k+1)/2 elements
Let the machines find the out k largest elements copy it into a
datastructure (stack), sort it and pass it on to the Central
machine.
At the central machine receive the stacks from all the machine. Find
the greatest of the elements at the top of the stacks.
Pop out the greatest element form its stack and copy it to the 'TopK list'.
Leave the other stacks intact.
Repeat step 3, k times to get Top K numbers.
1) sort the items on every machine
2) use a k - binary heap on the central machine
a) populate the heap with first (max) element from each machine
b) extract the first element, and put back in the heap the first element from the machine that you extracted the element. (of course heapify your heap, after the element is added).
Sort will be O(Nlog(N)) where N is the max array on the machines.
O(k) - to build the heap
O(klog(k)) to extract and populate the heap k times.
Complexity is max(O(klog(k)),O(Nlog(N)))
I would think the MapReduce paradigm would be well suited to a task like this.
Every machine runs it's own independent map task to find the maximum value in its array (depends on the language used) and this will probably be O(N) complexity for N numbers on each machine.
The reduce task compares the result from the individual machines' outputs to give you the largest k numbers.

Is there a Sorting Algorithm that sorts in O(∞) permutations?

After reading this question and through the various Phone Book sorting scenarios put forth in the answer, I found the concept of the BOGO sort to be quite interesting. Certainly there is no use for this type of sorting algorithm but it did raise an interesting question in my mind-- could their be a sorting algorithm that is infinitely impossible to complete?
In other words, is there a process where one could attempt to compare and re-order a fixed set of data and can yet never achieve an actual sorted list?
This is much more of a theoretical/philosophical question than a practical one and if I was more of a mathematician I'd probably be able to prove/disprove such a possibility. Has anyone asked this question before and if so, what can be said about it?
[edit:] no deterministic process with a finite amount of state takes "O(infinity)" since the slowest it can be is to progress through all possible states. this includes sorting.
[earlier, more specific answer:]
no. for a list of size n you only have state space of size n! in which to store progress (assuming that the entire state of the sort is stored in the ordering of the elements and it really is "doing something," deterministically).
so the worst possible behaviour would cycle through all available states before terminating and take time proportional to n! (at the risk of confusing matters, there must be a single path through the state - since that is "all the state" you cannot have a process move from state X to Y, and then later from state X to Z, since that requires additional state, or is non-deterministic)
Idea 1:
function sort( int[] arr ) {
int[] sorted = quicksort( arr ); // compare and reorder data
while(true); // where'd this come from???
return sorted; // return answer
}
Idea 2
How do you define O(infinity)? The formal definition of Big-O merely states that f(x)=O(g(x)) implies that M*g(x) is an upper bound of f(x) given sufficiently large x and some constant M.
Typically when you talking about "infinity", you are talking about some sort of unbounded limit. So in this case, the only reasonable definition is saying that O(infinity) is O(function that's larger than every function). Obviously a function that's larger than every function is an upper bound. Thus technically everything is "O(infinity)"
Idea 3
Assuming you mean theta notation (tight bound)...
If you impose the additional restriction that the algorithm is smart (returns when it finds a sorted permutation) and every permutation of the list must be visited in a finite amount of time, then the answer no. There are only N! permutations of a list. The upper bound for such a sorting algorithm is then a finite over finite numbers, which is finite.
Your question doesn't really have much to do with sorting. An algorithm which is guaranteed never to complete would be pretty dull. Indeed, even an algorithm which would might or might not ever complete would be pretty dull. Much more interesting would be an algorithm which would be guaranteed to complete, eventually, but whose worst-case computation time with respect to the size of the input would not be expressible as O(F(N)) for any function F that could itself be computed in bounded time. My hunch would be that such an algorithm could be devised, but I'm not sure how.
How about this one:
Start at the first item.
Flip a coin.
If it's heads, switch it with the next item.
If it's tails, don't switch them.
If list is sorted, stop.
If not, move onto the next pair ...
It's a sorting algorithm -- the kind a monkey might do. Is there any guarantee that you'll arrive at a sorted list? I don't think so!
Yes -
SortNumbers(collectionOfNumbers)
{
If IsSorted(collectionOfNumbers){
reverse(collectionOfNumbers(1:end/2))
}
return SortNumbers(collectionOfNumbers)
}
Input: A[1..n] : n unique integers in arbitrary order
Output: A'[1..n] : reordering of the elements of A
such that A'[i] R(A') A'[j] if i < j.
Comparator: a R(A') b iff A'[i] = a, A'[j] = b and i > j
More generally, make the comparator something that's either (a) impossible to reconcile with the output specification, so that no solution can exist, or (b) uncomputable (e.g., sort these (input, turing machine) pairs in order of the number of steps needed for the machine to halt on the input).
Even more generally, if you have a procedure that fails to halt on a valid input, the procedure is not an algorithm which solves the problem on that input/output domain... which means you don't have an algorithm at all, or that what you have is only an algorithm if you appropriately restrict the domain.
Let's suppose that you have a random coin flipper, infinite arithmetic, and infinite rationals. Then the answer is yes. You can write a sorting algorithm which has 100% chance of successfully sorting your data (so it really is a sorting function), but which on average will take infinite time to do so.
Here is an emulation of this in Python.
# We'll pretend that these are true random numbers.
import random
import fractions
def flip ():
return 0.5 < random.random()
# This tests whether a number is less than an infinite precision number in the range
# [0, 1]. It has a 100% probability of returning an answer.
def number_less_than_rand (x):
high = fractions.Fraction(1, 1)
low = fractions.Fraction(0, 1)
while low < x and x < high:
if flip():
low = (low + high) / 2
else:
high = (low + high) / 2
return high < x
def slow_sort (some_array):
n = fractions.Fraction(100, 1)
# This loop has a 100% chance of finishing, but its average time to complete
# is also infinite. If you haven't studied infinite series and products, you'll
# just have to take this on faith. Otherwise proving that is a fun exercise.
while not number_less_than_rand(1/n):
n += 1
print n
some_array.sort()

Where can I use a technique from Majority Vote algorithm

As seen in the answers to Linear time majority algorithm?, it is possible to compute the majority of an array of elements in linear time and log(n) space.
It was shown that everyone who sees this algorithm believes that it is a cool technique. But does the idea generalize to new algorithms?
It seems the hidden power of this algorithm is in keeping a counter that plays a complex role -- such as "(count of majority element so far) - (count of second majority so far)". Are there other algorithms based on the same idea?
Umh, let's first start to understand why the algorithm works, in order to "isolate" the ideas there.
The point of the algorithm is that if you have a majority element, then you can match each occurrence of it with an "another" element, and then you have some more "spare".
So, we just have a counter which counts the number of "spare" occurrences of our guest answer.
If it reaches 0, then it isn't a majority element for the subsequence starting from when we have "elected" the "current" element as the guest major element to the "current" position.
Also, since our "guest" element matches every other element occurrence in the considered subsequence, there are no major elements in the considered subsequence.
Now, since:
our algorithm gives a correct answer only if there is a major element, and
if there is a major element, then it'll still be if we ignore the "current" subsequence when the counter goes to zero
it is obvious to see by contradiction that, if a major element exists, then we have a suffix of the whole sequence when the counter never gets to zero.
Now: what's the idea that can be exploited in new, O(1) size O(n) time algorithms?
To me, you can apply this technique whenever you have to compute a property P on a sequence of elements which:
can be exteded from seq[n, m] to seq[n, m+1] in O(1) time if Q(seq[n, m+1]) doesn't hold
P(seq[n, m]) can be computed in O(1) time and space from P(seq[n, j]) and P(seq[j, m]) if Q(seq[n, j]) holds
In our case, P is the "spare" occurrences of our "elected" major element and Q is "P is zero".
If you see things in that way, longest common subsequence exploits the same idea (dunno about its "coolness factor" ;))
Jaydev Misra and David Gries have a paper called Finding Repeated Elements (ACM page) which generalizes it to an element repeating more than n/k times (k=2 is the majority problem).
Of course, this is probably very similar to the original problem, and you are probably looking for 'different' algorithms.
Here is an example which is possibly different.
Give an algorithm which will detect if a string of parentheses ( '(' and ')') is well formed.
I believe the standard solution is to maintain a counter.
Side note:
As to answers which claim cannot be constant space etc, ask them for the model of computation. In the WORD RAM model for instance, you assume the integers/array indices etc are O(1).
A lot of folks incorrectly mix and match models. For instance, they will happily have the input array of n integers be O(n), have an array index be O(1) space, but a counter they consider Omega(log n) etc, which is nonsense. If they want to consider the size in bits, then the input itself is Omega(n log n) etc.
For people who want to understand what does this algorithm do and why does it works: look at my detailed answer.
Here I will describe a natural extension of this algorithm (or a generalization). So in a standard majority voting algorithm you have to find an element which appears at least n/2 times in the stream, where n is the size of the stream. You can do this in O(n) time (with a tiny constant and in O(log(n)) space, worse case and highly unlikely.
The generalized algorithm allows you to find k most frequent items, where each time appeared at least n/(k+1) times in the original stream. Note that if k=1, you end up with your original problem.
Solution to this problem is really similar to the original one, except instead of one counter and one possible element, you maintain k counters and k possible elements. Now the logic goes in a similar way. You iterate through the array and if the element is in the possible elements, you increase it's counter, if one of the counters is zero - substitute the element of this counter with new element. Otherwise just decrease the values.
As with original majority voting algorithm, you need to have a guarantee that you have these k majority elements, otherwise you have to do another pass over the array to verify that your previously found possible elements are correct. Here is my python attempt (have not done a thorough testing).
from collections import defaultdict
def majority_element_general(arr, k=1):
counter, i = defaultdict(int), 0
while len(counter) < k and i < len(arr):
counter[arr[i]] += 1
i += 1
for i in arr[i:]:
if i in counter:
counter[i] += 1
elif len(counter) < k:
counter[i] = 1
else:
fields_to_remove = []
for el in counter:
if counter[el] > 1:
counter[el] -= 1
else:
fields_to_remove.append(el)
for el in fields_to_remove:
del counter[el]
potential_elements = counter.keys()
# might want to check that they are really frequent.
return potential_elements

Finding the hundred largest numbers in a file of a billion

I went to an interview today and was asked this question:
Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?
I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?
Obviously the interviewers want you to point out two key facts:
You cannot read the whole list of integers into memory, since it is too large. So you will have to read it one by one.
You need an efficient data structure to hold the 100 largest elements. This data structure must support the following operations:
Get-Size: Get the number of values in the container.
Find-Min: Get the smallest value.
Delete-Min: Remove the smallest value to replace it with a new, larger value.
Insert: Insert another element into the container.
By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.
For example, for Fibonacci heaps, the operations Get-Size, Find-Min and Insert all are O(1) and Delete-Min is O(log n) (with n <= 100 in this case).
In practice, you could use a priority queue from your favorite language's standard library (e.g. priority_queue from#include <queue> in C++) which is usually implemented using a heap.
Here's my initial algorithm:
create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
get next number N.
if N > array[0]:
if N > array[99]:
shift array[1..99] to array[0..98].
set array[99] to N.
else
find, using binary search, first index i where N <= array[i].
shift array[1..i-1] to array[0..i-2].
set array[i-1] to N.
endif
endif
endwhile
This has the (very slight) advantage is that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort and that you very quickly identify and throw away those that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than 50 (on average) for a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).
You may even get bonus points for suggesting the use of optimised shift operations like memcpy in C provided you can be sure the overlap isn't a problem.
One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):
read first hundred numbers into array 1 and sort them descending.
while more numbers:
read up to next hundred numbers into array 2 and sort them descending.
merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
if more numbers:
read up to next hundred numbers into array 2 and sort them descending.
merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
else
copy list 3 to list 1.
endif
endwhile
I'm not sure, but that may end up being more efficient than the continual shuffling.
The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):
list3.clear()
while list3.size() < 100:
while list1.peek() >= list2.peek():
list3.add(list1.pop())
endwhile
while list2.peek() >= list1.peek():
list3.add(list2.pop())
endwhile
endwhile
Simply put, pulling the top 100 values out of the combined list by virtue of the fact they're already sorted in descending order. I haven't checked in detail whether that would be more efficient, I'm just offering it as a possibility.
I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.
As with most interviews, technical skill is one of the the things they're looking at.
Create an array of 100 numbers all being -2^31.
Check if the the first number you read from disk is greater than the first in the list. If it is copy the array down 1 index and update it to the new number. If not check the next in the 100 and so on.
When you've finished reading all 1 billion digits you should have the highest 100 in the array.
Job done.
I'd traverse the list in order. As I go, I add elements to a set (or multiset depending on duplicates). When the set reached 100, I'd only insert if the value was greater than the min in the set (O(log m)). Then delete the min.
Calling the number of values in the list n and the number of values to find m:
this is O(n * log m)
Speed of the processing algorithm is absolutely irrelevant (unless it's completely dumb).
The bottleneck here is I/O (it's specified that they are on disk). So make sure that you work with large buffers.
Keep a fixed array of 100 integers. Initialise them to a Int.MinValue. When you are reading, from 1 billion integers, compare them with the numbers in the first cell of the array (index 0). If larger, then move up to next. Again if larger, then move up until you hit the end or a smaller value. Then store the value in the index and shift all values in the previous cells one cell down... do this and you will find 100 max integers.
I believe the quickest way to do this is by using a very large bit map to record which numbers are present. In order to represent a 32 bit integer this would need to be 2^32 / 8 bytes which is about == 536MB. Scan through the integers simply setting the corresponding bit in the bit map. Then look for the highest 100 entries.
NOTE: This finds the highest 100 numbers not the highest 100 instances of a number if you see the difference.
This kind of approach is discussed in the very good book Programming Pearls which your interviewer may have read!
You are going to have to check every number, there is no way around that.
Just as a slight improvement on solutions offered,
Given a list of 100 numbers:
9595
8505
...
234
1
You would check to see if the new found value is > min value of our array, if it is, insert it. However doing a search from bottom to top can be quite expensive, and you may consider taking a divide and conquer approach, by for example evaluating the 50th item in the array and doing a comparison, then you know if the value needs to be inserted in the first 50 items, or the bottom 50. You can repeat this process for a much faster search as we have eliminated 50% of our search space.
Also consider the data type of the integers. If they are 32 bit integers and you are on a 64 bit system, you may be able to do some clever memory handling and bitwise operations to deal with two numbers on disk at once if they are continual in memory.
I think someone should have mentioned a priority queue by now. You just need to keep the current top 100 numbers, know what the lowest is and be able to replace that with a higher number. That's what a priority queue does for you - some implementations may sort the list, but it's not required.
Assuming that 1 bill + 100ion numbers fit into memory
the best sorting algorithm is heap sort. form a heap and get the first 100 numbers. complexity o(nlogn + 100(for fetching first 100 numbers))
improving the solution
divide the implementaion to two heap(so that insertion are less complex) and while fetching the first 100 elements do imperial merge algorithm.
Here's some python code which implements the algorithm suggested by ferdinand beyer above. essentially it's a heap, the only difference is that deletion has been merged with insertion operation
import random
import math
class myds:
""" implement a heap to find k greatest numbers out of all that are provided"""
k = 0
getnext = None
heap = []
def __init__(self, k, getnext ):
""" k is the number of integers to return, getnext is a function that is called to get the next number, it returns a string to signal end of stream """
assert k>0
self.k = k
self.getnext = getnext
def housekeeping_bubbleup(self, index):
if index == 0:
return()
parent_index = int(math.floor((index-1)/2))
if self.heap[parent_index] > self.heap[index]:
self.heap[index], self.heap[parent_index] = self.heap[parent_index], self.heap[index]
self.housekeeping_bubbleup(parent_index)
return()
def insertonly_level2(self, n):
self.heap.append(n)
#pdb.set_trace()
self.housekeeping_bubbleup(len(self.heap)-1)
def insertonly_level1(self, n):
""" runs first k times only, can be as slow as i want """
if len(self.heap) == 0:
self.heap.append(n)
return()
elif n > self.heap[0]:
self.insertonly_level2(n)
else:
return()
def housekeeping_bubbledown(self, index, length):
child_index_l = 2*index+1
child_index_r = 2*index+2
child_index = None
if child_index_l >= length and child_index_r >= length: # No child
return()
elif child_index_r >= length: #only left child
if self.heap[child_index_l] < self.heap[index]: # If the child is smaller
child_index = child_index_l
else:
return()
else: #both child
if self.heap[ child_index_r] < self.heap[ child_index_l]:
child_index = child_index_r
else:
child_index = child_index_l
self.heap[index], self.heap[ child_index] = self.heap[child_index], self.heap[index]
self.housekeeping_bubbledown(child_index, length)
return()
def insertdelete_level1(self, n):
self.heap[0] = n
self.housekeeping_bubbledown(0, len(self.heap))
return()
def insert_to_myds(self, n ):
if len(self.heap) < self.k:
self.insertonly_level1(n)
elif n > self.heap[0]:
#pdb.set_trace()
self.insertdelete_level1(n)
else:
return()
def run(self ):
for n in self.getnext:
self.insert_to_myds(n)
print(self.heap)
# import pdb; pdb.set_trace()
return(self.heap)
def createinput(n):
input_arr = range(n)
random.shuffle(input_arr)
f = file('input', 'w')
for value in input_arr:
f.write(str(value))
f.write('\n')
input_arr = []
with open('input') as f:
input_arr = [int(x) for x in f]
myds_object = myds(4, iter(input_arr))
output = myds_object.run()
print output
If you find 100th order statistic using quick sort, it will work in average O(billion). But I doubt that with such numbers and due to random access needed for this approach it will be faster, than O(billion log(100)).
Here is another solution (about an eon later, I have no shame sorry!) based on the second one provided by #paxdiablo. The basic idea is that you should read another k numbers only if they're greater than the minimum you already have and that sorting is not really necessary:
// your variables
n = 100
k = a number > n and << 1 billion
create array1[n], array2[k]
read first n numbers into array2
find minimum and maximum of array2
while more numbers:
if number > maximum:
store in array1
if array1 is full: // I don't need contents of array2 anymore
array2 = array1
array1 = []
else if number > minimum:
store in array2
if array2 is full:
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
find minimum and maximum of array2
else:
discard the number
endwhile
// Finally
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
return merge array1 and array2
The critical step is the function for finding the largest x numbers in array2. But you can use the fact, that you know the minimum and maximum to speed up the function for finding the largest x numbers in array2.
Actually, there are lots of possible optimisations since you don't really need to sort it, you just need the x largest numbers.
Furthermore, if k is big enough and you have enough memory, you could even turn it into a recursive algorithm for finding the n largest numbers.
Finally, if the numbers are already sorted (in any order), the algorithm is O(n).
Obviously, this is just theoretically because in practice you would use standard sorting algorithms and the bottleneck would probably be the IO.
There are lots of clever approaches (like the priority queue solutions), but one of the simplest things you can do can also be fast and efficient.
If you want the top k of n, consider:
allocate an array of k ints
while more input
perform insertion sort of next value into the array
This may sound absurdly simplistic. You might expect this to be O(n^2), but it's actually only O(k*n), and if k is much smaller than n (as is postulated in the problem statement), it approaches O(n).
You might argue that the constant factor is too high because doing an average of k/2 comparisons and moves per input is a lot. But most values will be trivially rejected on the first comparison against the kth largest value seen so far. If you have a billion inputs, only a small fraction are likely to be larger than the 100th so far.
(You could construe a worst-case input where each value is larger than its predecessor, thus requiring k comparisons and moves for every input. But that is essentially a sorted input, and the problem statement said the input is unsorted.)
Even the binary-search improvement (to find the insertion point) only cuts the comparisons to ceil(log_2(k)), and unless you special case an extra comparison against the kth-so-far, you're much less likely to get the trivial rejection of the vast majority of inputs. And it does nothing to reduce the number of moves you need. Given caching schemes and branch prediction, doing 7 non-consecutive comparisons and then 50 consecutive moves doesn't seem likely to be significantly faster than doing 50 consecutive comparisons and moves. It's why many system sorts abandon Quicksort in favor of insertion sort for small sizes.
Also consider that this requires almost no extra memory and that the algorithm is extremely cache friendly (which may or may not be true for a heap or priority queue), and it's trivial to write without errors.
The process of reading the file is probably the major bottleneck, so the real performance gains are likely to be by doing a simple solution for the selection, you can focus your efforts on finding a good buffering strategy for minimizing the i/o.
If k can be arbitrarily large, approaching n, then it makes sense to consider a priority queue or other, smarter, data structure. Another option would be to split the input into multiple chunks, sort each of them in parallel, and then merge.

Resources