Finding the hundred largest numbers in a file of a billion - algorithm

I went to an interview today and was asked this question:
Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?
I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?

Obviously the interviewers want you to point out two key facts:
You cannot read the whole list of integers into memory, since it is too large. So you will have to read it one by one.
You need an efficient data structure to hold the 100 largest elements. This data structure must support the following operations:
Get-Size: Get the number of values in the container.
Find-Min: Get the smallest value.
Delete-Min: Remove the smallest value to replace it with a new, larger value.
Insert: Insert another element into the container.
By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.
For example, for Fibonacci heaps, the operations Get-Size, Find-Min and Insert all are O(1) and Delete-Min is O(log n) (with n <= 100 in this case).
In practice, you could use a priority queue from your favorite language's standard library (e.g. priority_queue from #include <queue> in C++), which is usually implemented using a heap.
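For illustration, here is a minimal sketch of that idea in Python using the standard heapq module (the file path, the one-integer-per-line format and the function name are my assumptions, not from the question):
import heapq

def top_n_from_file(path, n=100):
    """Stream integers (assumed one per line) and keep only the n largest in a min-heap."""
    heap = []  # min-heap: heap[0] is the smallest of the current top n
    with open(path) as f:
        for line in f:
            value = int(line)
            if len(heap) < n:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)  # pop the old minimum, push the larger value
    return sorted(heap, reverse=True)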

Here's my initial algorithm:
create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile
This has the (very slight) advantage that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort, and that you very quickly identify and throw away those that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than the 50 comparisons (on average) of a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).
You may even get bonus points for suggesting the use of optimised shift operations like memmove in C (memcpy isn't safe here, since the source and destination ranges overlap).
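As a rough sketch of this approach (not the answerer's code; the names and the use of an in-memory iterable rather than a file are my assumptions), Python's bisect module does the binary-search insertion:
import bisect
from itertools import islice

def top_n_sorted_array(numbers, n=100):
    """Keep an ascending list of the n largest values seen so far."""
    it = iter(numbers)
    top = sorted(islice(it, n))    # read and sort the first n numbers
    for x in it:
        if x > top[0]:             # cheap rejection against the current minimum
            bisect.insort(top, x)  # binary search for the insertion point, then shift
            del top[0]             # drop the smallest to keep exactly n values
    return top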
One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):
read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else
        copy list 3 to list 1.
    endif
endwhile
I'm not sure, but that may end up being more efficient than the continual shuffling.
The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):
list3.clear()
while list3.size() < 100:
    while list3.size() < 100 and list1.peek() >= list2.peek():
        list3.add(list1.pop())
    endwhile
    while list3.size() < 100 and list2.peek() >= list1.peek():
        list3.add(list2.pop())
    endwhile
endwhile
Simply put, this pulls the top 100 values out of the combined lists, relying on the fact that they're already sorted in descending order (treat peek() on an empty list as negative infinity so an exhausted list never wins a comparison). I haven't checked in detail whether that would be more efficient; I'm just offering it as a possibility.
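A hedged sketch of this chunked idea in Python (my names; heapq.nlargest stands in for the hand-rolled merge of two descending lists):
import heapq
from itertools import islice

def top_n_by_chunks(numbers, n=100, chunk_size=100):
    """Repeatedly merge the current best n values with the next chunk, keeping only the largest n."""
    it = iter(numbers)
    best = []
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return sorted(best, reverse=True)
        best = heapq.nlargest(n, best + chunk)  # the "merge and keep 100" step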
I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.
As with most interviews, technical skill is only one of the things they're looking at.

Create an array of 100 numbers all being -2^31.
Check if the first number you read from disk is greater than the first in the list. If it is, copy the array down one index and update it with the new number. If not, check the next of the 100 and so on.
When you've finished reading all 1 billion numbers, you should have the highest 100 in the array.
Job done.

I'd traverse the list in order. As I go, I add elements to a set (or multiset, depending on duplicates). Once the set reaches 100 elements, I only insert if the value is greater than the min in the set (O(log m)), then delete the min.
Calling the number of values in the list n and the number of values to find m:
this is O(n * log m)

Speed of the processing algorithm is absolutely irrelevant (unless it's completely dumb).
The bottleneck here is I/O (it's specified that they are on disk). So make sure that you work with large buffers.

Keep a fixed array of 100 integers, initialised to Int.MinValue. As you read each of the 1 billion integers, compare it with the number in the first cell of the array (index 0). If it's larger, move up to the next cell; keep moving up while it's larger, until you hit the end or a smaller value. Then store the value at that index and shift all values in the previous cells one cell down. Do this and you will find the 100 max integers.

I believe the quickest way to do this is by using a very large bit map to record which numbers are present. In order to represent a 32-bit integer this would need to be 2^32 / 8 bytes, which is about 537 MB. Scan through the integers, simply setting the corresponding bit in the bit map, then look for the highest 100 set bits.
NOTE: This finds the highest 100 distinct numbers, not the highest 100 instances of numbers, if you see the difference.
This kind of approach is discussed in the very good book Programming Pearls which your interviewer may have read!
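A minimal sketch of the bit-map idea in Python, assuming non-negative 32-bit integers stored one per line (the downward scan is slow in pure Python, but it shows the shape of the approach):
def top_n_distinct_bitmap(path, n=100, bits=2**32):
    """Mark each value in a ~512 MiB bit map, then scan from the top down.
    Note: this yields the n largest *distinct* values only."""
    bitmap = bytearray(bits // 8)
    with open(path) as f:
        for line in f:
            v = int(line)
            bitmap[v >> 3] |= 1 << (v & 7)   # set the bit for value v
    result = []
    for v in range(bits - 1, -1, -1):        # scan from the largest possible value down
        if bitmap[v >> 3] & (1 << (v & 7)):
            result.append(v)
            if len(result) == n:
                break
    return result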

You are going to have to check every number, there is no way around that.
Just as a slight improvement on solutions offered,
Given a list of 100 numbers:
9595
8505
...
234
1
You would check to see if the newly found value is greater than the min value of our array, and if it is, insert it. However, doing a search from bottom to top can be quite expensive, and you may consider taking a divide-and-conquer approach: for example, evaluate the 50th item in the array and do a comparison, then you know whether the value needs to be inserted in the first 50 items or the bottom 50. You can repeat this process for a much faster search, as we eliminate 50% of our search space each time.
Also consider the data type of the integers. If they are 32-bit integers and you are on a 64-bit system, you may be able to do some clever memory handling and bitwise operations to process two numbers from disk at once if they are contiguous.

I think someone should have mentioned a priority queue by now. You just need to keep the current top 100 numbers, know what the lowest is and be able to replace that with a higher number. That's what a priority queue does for you - some implementations may sort the list, but it's not required.

Assuming that the 1 billion numbers fit into memory, you can use a heap: build a heap over all the numbers (O(n)) and then fetch the top 100 from it (O(100 log n)).
To improve the solution, divide the implementation into two heaps (so that insertions are less complex) and merge the two while fetching the first 100 elements.

Here's some Python code which implements the algorithm suggested by Ferdinand Beyer above. Essentially it's a min-heap; the only difference is that the deletion has been merged with the insertion operation.
import math
import random


class myds:
    """ implement a heap to find the k greatest numbers out of all that are provided """

    def __init__(self, k, getnext):
        """ k is the number of integers to return; getnext is an iterable that yields the numbers one by one """
        assert k > 0
        self.k = k
        self.getnext = getnext
        self.heap = []

    def housekeeping_bubbleup(self, index):
        if index == 0:
            return
        parent_index = int(math.floor((index - 1) / 2))
        if self.heap[parent_index] > self.heap[index]:
            self.heap[index], self.heap[parent_index] = self.heap[parent_index], self.heap[index]
            self.housekeeping_bubbleup(parent_index)
        return

    def insertonly_level2(self, n):
        self.heap.append(n)
        self.housekeeping_bubbleup(len(self.heap) - 1)

    def insertonly_level1(self, n):
        """ runs for the first k numbers only, can be as slow as I want """
        # every one of the first k numbers must be kept, even if it is smaller
        # than the current minimum, so always insert here
        self.insertonly_level2(n)

    def housekeeping_bubbledown(self, index, length):
        child_index_l = 2 * index + 1
        child_index_r = 2 * index + 2
        if child_index_l >= length and child_index_r >= length:  # no child
            return
        elif child_index_r >= length:  # only left child
            if self.heap[child_index_l] < self.heap[index]:  # if the child is smaller
                child_index = child_index_l
            else:
                return
        else:  # both children: pick the smaller one
            if self.heap[child_index_r] < self.heap[child_index_l]:
                child_index = child_index_r
            else:
                child_index = child_index_l
        self.heap[index], self.heap[child_index] = self.heap[child_index], self.heap[index]
        self.housekeeping_bubbledown(child_index, length)
        return

    def insertdelete_level1(self, n):
        # overwrite the current minimum (the root) and restore the heap property
        self.heap[0] = n
        self.housekeeping_bubbledown(0, len(self.heap))
        return

    def insert_to_myds(self, n):
        if len(self.heap) < self.k:
            self.insertonly_level1(n)
        elif n > self.heap[0]:
            self.insertdelete_level1(n)
        else:
            return

    def run(self):
        for n in self.getnext:
            self.insert_to_myds(n)
        print(self.heap)
        return self.heap


def createinput(n):
    input_arr = list(range(n))
    random.shuffle(input_arr)
    with open('input', 'w') as f:
        for value in input_arr:
            f.write(str(value))
            f.write('\n')


createinput(20)   # generate a small sample input file so the demo below is runnable
with open('input') as f:
    input_arr = [int(x) for x in f]
myds_object = myds(4, iter(input_arr))
output = myds_object.run()
print(output)

If you find the 100th order statistic using a quicksort-style partition (quickselect), it will run in O(billion) on average. But I doubt that, with numbers like these and the random access this approach needs, it would be faster than the O(billion * log(100)) approach.
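For reference, a sketch of quickselect in Python (my code; it assumes the data has been loaded into a list, which is exactly the in-memory/random-access caveat mentioned above):
import random

def quickselect(a, k):
    """Return the k-th largest element of list a (1-based k), average O(len(a)); modifies a in place."""
    lo, hi = 0, len(a) - 1
    target = len(a) - k                  # index of the answer in ascending order
    while True:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                    # Hoare-style partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            return a[target]

# Usage sketch: kth = quickselect(values, 100); a second pass collecting values >= kth
# (trimming ties) then yields the 100 largest.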

Here is another solution (about an eon later, I have no shame, sorry!) based on the second one provided by @paxdiablo. The basic idea is that you only keep the next k numbers if they're greater than the minimum you already have, and that sorting is not really necessary:
// your variables
n = 100
k = a number > n and << 1 billion
create array1[n], array2[k]
read first n numbers into array2
find minimum and maximum of array2
while more numbers:
    if number > maximum:
        store in array1
        if array1 is full: // I don't need contents of array2 anymore
            array2 = array1
            array1 = []
    else if number > minimum:
        store in array2
        if array2 is full:
            x = n - array1.count()
            find the x largest numbers of array2 and discard the rest
            find minimum and maximum of array2
    else:
        discard the number
endwhile
// Finally
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
return merge array1 and array2
The critical step is the function for finding the x largest numbers in array2, but you can use the fact that you know its minimum and maximum to speed that function up.
Actually, there are lots of possible optimisations, since you don't really need to sort it; you just need the x largest numbers.
Furthermore, if k is big enough and you have enough memory, you could even turn it into a recursive algorithm for finding the n largest numbers.
Finally, if the numbers are already sorted (in any order), the algorithm is O(n).
Obviously, this is just theoretical, because in practice you would use standard sorting algorithms and the bottleneck would probably be the I/O.
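As a small illustration of the "find the x largest numbers of array2 and discard the rest" step above, a partial selection avoids a full sort; a hedged Python sketch (the function name is mine):
import heapq

def shrink_buffer(array2, x):
    """Keep only the x largest values of the buffer, about O(len(array2) * log x),
    and return the refreshed minimum and maximum along with them."""
    kept = heapq.nlargest(x, array2)     # assumes x >= 1 and array2 is non-empty
    return kept, min(kept), max(kept)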

There are lots of clever approaches (like the priority queue solutions), but one of the simplest things you can do can also be fast and efficient.
If you want the top k of n, consider:
allocate an array of k ints
while more input
    perform insertion sort of next value into the array
This may sound absurdly simplistic. You might expect this to be O(n^2), but it's actually only O(k*n), and if k is much smaller than n (as is postulated in the problem statement), it approaches O(n).
You might argue that the constant factor is too high because doing an average of k/2 comparisons and moves per input is a lot. But most values will be trivially rejected on the first comparison against the kth largest value seen so far. If you have a billion inputs, only a small fraction are likely to be larger than the 100th so far.
(You could construct a worst-case input where each value is larger than its predecessor, thus requiring k comparisons and moves for every input. But that is essentially a sorted input, and the problem statement said the input is unsorted.)
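A hedged sketch of this plain insertion approach in Python (my names); the array is kept sorted descending so that most inputs fail the single comparison against the k-th largest so far:
def top_k_insertion(numbers, k=100):
    top = []                              # sorted descending; top[-1] is the k-th largest so far
    for x in numbers:
        if len(top) == k and x <= top[-1]:
            continue                      # trivial rejection: one comparison for most inputs
        top.append(x)
        i = len(top) - 1
        while i > 0 and top[i - 1] < top[i]:
            top[i - 1], top[i] = top[i], top[i - 1]   # insertion-sort step toward the front
            i -= 1
        if len(top) > k:
            top.pop()                     # drop the value that fell to position k+1
    return top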
Even the binary-search improvement (to find the insertion point) only cuts the comparisons to ceil(log_2(k)), and unless you special case an extra comparison against the kth-so-far, you're much less likely to get the trivial rejection of the vast majority of inputs. And it does nothing to reduce the number of moves you need. Given caching schemes and branch prediction, doing 7 non-consecutive comparisons and then 50 consecutive moves doesn't seem likely to be significantly faster than doing 50 consecutive comparisons and moves. It's why many system sorts abandon Quicksort in favor of insertion sort for small sizes.
Also consider that this requires almost no extra memory and that the algorithm is extremely cache friendly (which may or may not be true for a heap or priority queue), and it's trivial to write without errors.
The process of reading the file is probably the major bottleneck, so the real performance gains lie elsewhere: by using a simple solution for the selection, you can focus your efforts on finding a good buffering strategy for minimizing the I/O.
If k can be arbitrarily large, approaching n, then it makes sense to consider a priority queue or other, smarter, data structure. Another option would be to split the input into multiple chunks, sort each of them in parallel, and then merge.

Related

constant memory reservoir sampling, O(k) possible?

I have an input stream, of size n, and I want to produce an output stream of size k that contains distinct random elements of the input stream, without requiring any additional memory for elements selected by the sample.
The algorithm I was going to use is basically as follows:
for each element in input stream
    if random() < k/n
        decrement k
        output element
        if k = 0
            halt
        end if
    end if
    decrement n
end for
The function random() generates a number from [0..1) on a random distribution, and I trust the algorithm's principle of operation is straightforward.
Although this algorithm can terminate early when it selects the last element, in general the algorithm is still approximately O(n). At first it seemed to work as intended (outputting roughly uniformly distributed but still random elements from the input stream), but I think there may be a non-uniform tendency to pick later elements when k is much less than n. I'm not sure about this, however... so I'd appreciate knowing for sure one way or the other. I'm also wondering if a faster algorithm exists. Obviously, since k elements must be generated, the algorithm cannot be any faster than O(k). For an O(k) solution, one could assume the existence of a function skip(x), which can skip over x elements in the input stream in O(1) time (but cannot skip backwards). I would still like to keep the requirement of not requiring any additional memory, however.
If it is a real stream, you need O(n) time to scan it.
Your existing algorithm is good. (I got that wrong before.) You can prove by induction that the probability that you have not picked the first element after i tries is 1 - i/n = (n-i)/n. First, that is true for i = 0 by inspection. Now, if you have not picked it in the first i tries, the odds that the next try picks it are 1/(n-i). So the odds of picking it on the (i+1)-th try are ((n-i)/n) * (1/(n-i)) = 1/n, which means that the odds of not picking it in the first i+1 tries are 1 - i/n - 1/n = 1 - (i+1)/n. That completes the induction. And so the odds of picking the first element in the first k tries are one minus the odds of not having picked it, or 1 - (n-k)/n = k/n.
But what if you have O(1) access to any element? Well note that choosing k to take is the same as choosing n-k to leave. So without loss of generality we can assume that k <= n/2. What that means is that we can use a randomized algorithm like this:
chosen = set()
count_chosen = 0
while count_chosen < k:
    choice = random_element(stream)
    if choice not in chosen:
        chosen.add(choice)
        count_chosen = count_chosen + 1
The set will be O(k) space, and since the probability of each random choice being new to you is at least 0.5, the expected running time is no worse than 2k choices.
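For what it's worth, here is a runnable Python sketch of the algorithm from the question (often called selection sampling); the names are mine and it assumes the true stream length n is known up front:
import random

def sample_k_in_order(stream, n, k):
    """Yield k of the n stream elements, in stream order, each element chosen with overall probability k/n."""
    remaining_k, remaining_n = k, n
    for element in stream:
        if remaining_k == 0:
            break                # all k elements selected; terminate early
        if random.random() < remaining_k / remaining_n:
            yield element
            remaining_k -= 1
        remaining_n -= 1

# e.g. list(sample_k_in_order(range(1000), 1000, 5)) yields 5 elements in increasing order.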

Given a set of tuples (value, cost), is there an algorithm to find the combination of tuples that has the least cost for storing a given number?

I have a set of (value, cost) tuples: (2000000, 200), (500000, 75), (100000, 20).
Suppose X is any positive number.
Is there an algorithm to find the combination of tuples with the least cost whose values sum to enough to store X?
The sum of the tuple values can be equal to or greater than the given X.
ex.
giving x = 800000 the answer should be (500000,75) , (100000,20) , (100000,20) , (100000,20)
giving x = 900000 the answer should be (500000,75) , (500000,75)
giving x = 1500000 the answer should be (2000000,200)
I can hardcode this, but the set and the tuples are subject to change, so if this can be substituted with a well-known algorithm it would be great.
This can be solved with dynamic programming, as you have no limit on the number of tuples and can afford sums higher than the provided number.
First, you can optimise the tuples. If one big tuple can be replaced by a number of smaller ones with equal or lower cost and equal or higher value, you can remove the bigger tuple altogether.
Also, it's fruitful for later use to order the tuples in the optimised set by value/cost in descending order. A tuple is better if its value/cost ratio is bigger.
Time complexity is O(N*T), where N is the number divided by the common factor (F) of the optimised tuple values, and T is the number of tuples in the optimised tuple set.
Memory complexity O(N).
Set up an array a of size N that will contain:
    in a[i].cost, the best cost of a solution for i*F (0 is the special case "no solution yet")
    in a[i].tuple, the tuple that led to that best solution
Recursion scheme:
    the function gets n as a single parameter - the provided number divided by F to start with, and the leftover of the needed value/F for recursive calls
    if the array entry for n is filled, return a[n].cost
    otherwise set current_cost to MAXINT
    for each tuple from best to worst, try to add it to the solution:
        if value/F >= n, we've got some solution; compare the tuple cost to current_cost and if it's better, update a[n].cost and a[n].tuple
        if value/F < n, call recursively for n - value/F and compare the cost with the current solution; update current_cost and a[n].cost, a[n].tuple if needed
    after all that, return a[n].cost, or throw an exception if no solution exists
The tuple list can be retrieved from a by traversing through .tuple at each step.
It's possible to reduce the overall array size down to max(tuple.value/F), but you'll have to save a more or less complete solution instead of one best .tuple for each element, and you'll have to implement the "sliding window" carefully.
It's possible to turn the recursion into a cycle from 0 to n, as with many other dynamic programming algorithms; a sketch of that follows.
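A hedged sketch of the bottom-up version of this DP in Python (names are mine; for simplicity it skips the common-factor reduction and the tuple-set optimisation described above):
def min_cost_to_cover(x, tuples):
    """best[i] = cheapest cost to reach a total value of at least i; tuples may be reused freely."""
    INF = float('inf')
    best = [INF] * (x + 1)
    choice = [None] * (x + 1)
    best[0] = 0
    for i in range(1, x + 1):
        for value, cost in tuples:
            prev = max(0, i - value)          # a tuple may overshoot the remaining need
            if best[prev] + cost < best[i]:
                best[i] = best[prev] + cost
                choice[i] = (value, cost)
    picked, i = [], x                         # walk the choices back to list the tuples used
    while i > 0:
        value, cost = choice[i]
        picked.append((value, cost))
        i = max(0, i - value)
    return best[x], picked

# e.g. min_cost_to_cover(800000, [(2000000, 200), (500000, 75), (100000, 20)])
# returns (135, [(500000, 75), (100000, 20), (100000, 20), (100000, 20)])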

searching through a vast collection of potential solutions

I have a quite difficult problem (perhaps even an NP-hard problem ^^) with looking for a solution in a massive collection of results. Perhaps there is an algorithm for it.
Below exercise is artificial but is a perfect example to illustrate my issue.
There is a big array with integers. Let's say it has 100,000 elements.
int numbers[] = {-123,32,4,-234564,23,5,....}
I want to check in a relatively quick way if the sum of any 2 numbers from this array is equal to 0. In other words, if the array has "-123" I want to find out whether there is also a "123" number.
The easiest solution would be brute force - check everything with everything. That gives 100,000 x 100,000, a big number ;-) Obviously the brute-force method can be optimised: order the numbers and check negatives against positives only. My question is - is there something better than optimised brute force to find a solution?
First, sort the array by magnitude of the value.
Then, if the data contains a pair which satisfies the conditions you're after, it contains such a pair adjacent in the array. So just sweep through looking for adjacent pairs whose sum is 0.
Overall time complexity is O(n log n) for the sort, could be O(n) if you use "cheating" sorts not based solely on comparisons. Clearly it can't be done in less than linear time, because in the worst case you can't do it without looking at all the elements. I think n log n is probably optimal in the decision tree model of computing, but only because it "feels a bit like" the element uniqueness problem.
Alternative approach:
Add the elements one at a time to a hash-based or tree-based container. Before adding each element, check whether its negative is present. If so, stop.
This is likely to be faster in the case where there are lots of suitable pairs, because you save the cost of sorting the whole data. That said, you could write a modified sort that exits early by checking for adjacent pairs as soon as any subset of the data is in its final order, but that's effort.
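A minimal Python sketch of that hash-based early-exit approach (my function name):
def has_zero_sum_pair(numbers):
    """Single pass: remember values seen so far and stop as soon as the negation of the current value has already appeared."""
    seen = set()
    for x in numbers:
        if -x in seen:
            return True
        seen.add(x)
    return False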
Brute force would be an O(n^2) solution. You can certainly do better.
Off the top of my head, first sort it. Heap sort will have a complexity of O(nlogn).
Now, for the first element, say a, you know you need to find an element b, such that a+b = 0. This can be found using binary search (since your array is now sorted). Binary search has a complexity of O(logn).
This gives you an overall solution of O(nlogn) complexity.
The example you provided can be brute-force solved in O(n^2) time.
You can start by ordering the numbers (O(n·log n)) from smallest to biggest. If you place one pointer at the beginning (the "most negative" number) and the other at the end (the "most positive"), you can check whether there is such a pair of numbers in an additional O(n) steps by following this procedure:
If the numbers at both pointers have the same absolute value, you have the solution.
If not, move the pointer of the number with the bigger absolute value towards zero (that is, increase it if it is the pointer on the negative side, decrease it if it is the positive-side one).
Repeat until you find a solution, or the pointers cross.
Total complexity is O(n·logn)+O(n) = O(n·logn).
Sort your array using Quicksort. After that's done, use two indices; let's call them negative and positive.
negative <- 0
positive <- size - 1
while ((array[negative] < 0) and (array[positive] > 0) and (negative < size) and (positive >= 0)) do
    delta <- array[negative] + array[positive]
    if (delta = 0) then
        return true
    else if (delta < 0) then
        negative <- negative + 1
    else
        positive <- positive - 1
    end if
end while
return (array[positive] * array[negative] = 0)
You didn't say what the algorithm should do if 0 is part of the array; I've supposed that in that case true should be returned.

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. Repeat this until we find a sum s(k+1) greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2^(N/2)-element lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(n/2)) operation) and save re-computing them. The values of all the possible subsets don't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
// Since we want all in a range, rather than just the first, we need to check all combinations. Horowitz/Sahni is only designed to find one.
for (auto firstit = firstlist.begin(); firstit != firstlist.end(); firstit++) {
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); secondit++) {
        int sum = *firstit + *secondit;
        if (sum >= A && sum <= B)   // keep sums in the inclusive range [A,B]
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz Sahni, but we try and do some optimizations to reduce the constants in the BigOh.
We do the following
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indexes in S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible; for instance, when the sum from S2 is negative, we don't consider sums < A, etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
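A hedged Python sketch of Step 1 (my names): all 2^m subset sums of one half, produced with one addition or subtraction per step by flipping elements in Gray-code order:
def all_subset_sums(nums):
    """Enumerate the 2^m subset sums of nums; step i flips the element at the lowest set bit of i."""
    m = len(nums)
    sums = [0] * (2 ** m)
    included = [False] * m
    current = 0
    for i in range(1, 2 ** m):
        bit = (i & -i).bit_length() - 1   # index of the bit that changes in the Gray code at step i
        if included[bit]:
            current -= nums[bit]
        else:
            current += nums[bit]
        included[bit] = not included[bit]
        sums[i] = current
    return sums

# e.g. all_subset_sums([1, 2]) == [0, 1, 3, 2]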
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than the original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)

# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**(len(pairs))):
            yield [x for m, x in pairs if m & w]

# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S), S) for S in subsets(terms[:len(terms)//2])])
R = sorted([(sum(S), S) for S in subsets(terms[len(terms)//2:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    while toolow < len(R)-1 and R[toolow+1][0] < A-Lsum:
        toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B-Lsum:
        toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S + R[n][1])), '=', sum(S + R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.

Finding a single number in a list [duplicate]

What would be the best algorithm for finding a number that occurs only once in a list in which all other numbers occur exactly twice?
So, in the list of integers (let's take it as an array) each integer repeats exactly twice, except one. To find that one, what is the best algorithm?
The fastest (O(n)) and most memory efficient (O(1)) way is with the XOR operation.
In C:
int arr[] = {3, 2, 5, 2, 1, 5, 3};
int num = 0, i;
for (i = 0; i < 7; i++)
    num ^= arr[i];
printf("%i\n", num);
This prints "1", which is the only one that occurs once.
This works because the first time you hit a number it marks the num variable with itself, and the second time it unmarks num with itself (more or less). The only one that remains unmarked is your non-duplicate.
By the way, you can expand on this idea to very quickly find two unique numbers among a list of duplicates.
Let's call the unique numbers a and b. First take the XOR of everything, as Kyle suggested. What we get is a^b. We know a^b != 0, since a != b. Choose any 1 bit of a^b, and use that as a mask -- in more detail: choose x as a power of 2 so that x & (a^b) is nonzero.
Now split the list into two sublists -- one sublist contains all numbers y with y&x == 0, and the rest go in the other sublist. By the way we chose x, we know that a and b are in different buckets. We also know that each pair of duplicates is still in the same bucket. So we can now apply ye olde "XOR-em-all" trick to each bucket independently, and discover what a and b are completely.
Bam.
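A short Python sketch of that two-unique-values trick (my function name):
def find_two_uniques(nums):
    """Every value appears twice except two; split by one differing bit and XOR each bucket."""
    xor_all = 0
    for v in nums:
        xor_all ^= v             # xor_all ends up as a ^ b
    mask = xor_all & -xor_all    # the lowest set bit of a ^ b is an easy choice of x
    a = b = 0
    for v in nums:
        if v & mask:
            a ^= v               # bucket where the chosen bit is set
        else:
            b ^= v               # bucket where the chosen bit is clear
    return a, b

# e.g. find_two_uniques([3, 2, 5, 2, 5, 7]) returns 3 and 7 (in some order).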
O(N) time, O(N) memory
HT = Hash Table
HT.clear()
go over the list in order
for each item you see:
    if (HT.Contains(item)) -> HT.Remove(item)
    else
        HT.Add(item)
at the end, the item in the HT is the item you are looking for.
Note (credit @Jared Updike): this system will find all items that occur an odd number of times.
Comment: I don't see how people can vote up solutions that give you NLogN performance; in which universe is that "better"?
I am even more shocked that you marked the accepted answer, an NLogN solution...
I do agree, however, that if memory is required to be constant, then NLogN would be (so far) the best solution.
Kyle's solution would obviously not catch situations where the data set does not follow the rules. If all numbers were in pairs, the algorithm would give a result of zero, the exact same value as if zero were the only value with a single occurrence.
If there were multiple single-occurrence values or triples, the result would be erroneous as well.
Testing the data set might well end up with a more costly algorithm, either in memory or time.
Csmba's solution does detect some erroneous data (none or more than one single-occurrence value), but not other cases (quadruples). Regarding his solution, depending on the implementation of HT, either memory and/or time is more than O(n).
If we cannot be sure about the correctness of the input set, sorting and counting, or using a hashtable counting occurrences with the integer itself being the hash key, would both be feasible.
I would say that using a sorting algorithm and then going through the sorted list to find the number is a good way to do it.
And now the problem is finding "the best" sorting algorithm. There are a lot of sorting algorithms, each of them with its strong and weak points, so this is quite a complicated question. The Wikipedia entry seems like a nice source of info on that.
Implementation in Ruby:
a = [1,2,3,4,123,1,2,.........]
t = a.length-1
for i in 0..t
  s = a.index(a[i])+1
  b = a[s..t]
  w = b.include?(a[i])
  if w == false
    puts a[i]
  end
end
You need to specify what you mean by "best" - to some, speed is all that matters and would qualify an answer as "best" - for others, they might forgive a few hundred milliseconds if the solution was more readable.
"Best" is subjective unless you are more specific.
That said:
Iterate through the numbers, for each number search the list for that number and when you reach the number that returns only a 1 for the number of search results, you are done.
Seems like the best you could do is to iterate through the list, for every item add it to a list of "seen" items or else remove it from the "seen" if it's already there, and at the end your list of "seen" items will include the singular element. This is O(n) in regards to time and n in regards to space (in the worst case, it will be much better if the list is sorted).
The fact that they're integers doesn't really factor in, since there's nothing special you can do with adding them up... is there?
Question: I don't understand why the selected answer is "best" by any standard. O(N*lgN) > O(N), and it changes the list (or else creates a copy of it, which is still more expensive in space and time). Am I missing something?
Depends on how large/small/diverse the numbers are though. A radix sort might be applicable which would reduce the sorting time of the O(N log N) solution by a large degree.
The sorting method and the XOR method have the same time complexity. The XOR method is only O(n) if you assume that bitwise XOR of two strings is a constant time operation. This is equivalent to saying that the size of the integers in the array is bounded by a constant. In that case you can use Radix sort to sort the array in O(n).
If the numbers are not bounded, then bitwise XOR takes time O(k) where k is the length of the bit string, and the XOR method takes O(nk). Now again Radix sort will sort the array in time O(nk).
You could simply put the elements in the set into a hash until you find a collision. In ruby, this is a one-liner.
def find_dupe(array)
  h = {}
  array.detect { |e| h[e] || (h[e] = true; false) }
end
So, find_dupe([1,2,3,4,5,1]) would return 1.
This is actually a common "trick" interview question though. It is normally about a list of consecutive integers with one duplicate. In this case the interviewer is often looking for you to use the Gaussian sum of n-integers trick e.g. n*(n+1)/2 subtracted from the actual sum. The textbook answer is something like this.
def find_dupe_for_consecutive_integers(array)
  n = array.size - 1  # subtract one from array.size because of the dupe
  array.sum - n*(n+1)/2
end
