I have two files - let's call them file0 and file1.
What I would like to get is a fast algorithm for the following problem (it is clear to me how to write a rather slow algorithm that solves it):
Detect the largest suffix of file0 that is a prefix of file1; that is, find a memory block B (or more precisely, the number of bytes of such a block) of maximum length such that
file0 consists of some memory block A, followed by B
file1 consists of memory block B, followed by some memory block C
Note that the blocks A, B and C can also have a length of zero bytes.
Edit (to answer drysdam's remark): the obvious, rather slow algorithm that I thought of (pseudocode): let the lengths of the files be bounded by m and n with, wlog, m <= n.
for each length from m down to 0
    compare the last length bytes of file0 with the first length bytes of file1
    if they are equal
        return length
This is obviously an O(m^2) algorithm (up to m+1 candidate lengths, each requiring a comparison of up to m bytes). If the files are about the same size this is O(n^2).
The files that I have to handle are currently between 10 and a few hundred megabytes in size. But in extreme cases they can also be a few gigabytes - just big enough not to fit into the 32-bit address space of x86 anymore.
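For reference, the quadratic approach from the question can be sketched in Python like this (the function name and the byte-string inputs are mine, not from the question):

```python
def overlap_naive(f0: bytes, f1: bytes) -> int:
    """Length of the longest suffix of f0 that is a prefix of f1."""
    m = min(len(f0), len(f1))
    for length in range(m, 0, -1):
        if f0[len(f0) - length:] == f1[:length]:
            return length
    return 0  # block B may be empty
```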
Consider treating your bytes as numbers 0..255 held as integers mod p, where p is a prime, optionally much larger than 255. Here are two ways of computing b0*x^2 + b1*x + b2:
(b0*x + b1)*x + b2
b0*x^2 + (b1*x + b2).
Therefore I can compute this quantity efficiently either working from left to right (repeatedly multiplying by x and adding the next coefficient, e.g. b2) or working from right to left (adding terms such as b0*x^2).
Pick a random x and compute this working from right to left in AB and from left to right in BC. If the values computed match, you note down the location. Later do a slow check of all the matches starting with the longest to see if the B really is identical in both cases.
What is the chance of a match at random? If you have a false match then (a0 - c0)*x^2 + (a1 - c1)*x + (a2 - c2) = 0. A polynomial of degree d has at most d roots, so if x is random the chance of a false match is at most d / p, and you can make this small by working mod p for suitably large p. (If I remember rightly there is a scheme for message authentication which has this idea at its heart).
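A sketch of the rolling-hash scheme described above (the choice of prime, the helper names and the candidate-list interface are mine; hash matches are only probable evidence and must still be verified by a direct byte comparison, longest candidate first):

```python
import random

def overlap_candidates(f0: bytes, f1: bytes):
    """Candidate overlap lengths via the polynomial hash idea above."""
    p = (1 << 61) - 1           # a large prime modulus
    x = random.randrange(2, p)  # random evaluation point
    m = min(len(f0), len(f1))

    # Hash of the last k bytes of f0, built right to left by adding
    # a new high-order term byte * x^(k-1) at each step.
    suffix_hash = [0] * (m + 1)
    h, xk = 0, 1
    for k in range(1, m + 1):
        h = (h + f0[len(f0) - k] * xk) % p
        suffix_hash[k] = h
        xk = xk * x % p

    # Hash of the first k bytes of f1, built left to right
    # (Horner's rule: multiply by x, then add the next byte).
    candidates = []
    h = 0
    for k in range(1, m + 1):
        h = (h * x + f1[k - 1]) % p
        if h == suffix_hash[k]:
            candidates.append(k)
    return candidates  # verify each (longest first) with a byte compare
```

A true overlap always produces matching hashes, so no candidate is missed; a false match survives with probability at most about m/p.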
Depending on how much memory you have available, you may want to consider building a suffix tree for the first file. Once you have this, you can query for the prefix of the second file that maximally overlaps with a suffix of the first file by just walking the suffix tree down from the root along the edges matching the letters of the prefix of the second file. Since suffix trees can be built in linear time, the runtime for this algorithm is O(|A| + |B|) using your terminology: it takes O(|A| + |B|) time to build the suffix tree and O(|B|) time to walk it to find the block B.
If it is not an academic assignment, then it might make sense to implement the simplest solution and see how it behaves on your data.
For example, a theoretically more efficient solution based on the Knuth-Morris-Pratt algorithm can perform worse than an IndexOf-based solution (see Overlap Detection).
For large files your program might spend all its time waiting for I/O.
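For completeness, the KMP failure function computes this overlap directly in linear time: build it over the prefix of file1, a separator that matches no byte, then the suffix of file0; the last entry of the table is the answer. A sketch (the names are mine):

```python
def overlap_kmp(f0: bytes, f1: bytes) -> int:
    """Longest suffix of f0 that is a prefix of f1, via the
    Knuth-Morris-Pratt failure function. O(len(f0) + len(f1))."""
    m = min(len(f0), len(f1))
    # -1 is a sentinel that can never equal a byte value (0..255)
    s = list(f1[:m]) + [-1] + list(f0[len(f0) - m:] if m else b"")
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1]
```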
Related
I have been given the following problem: there are n files with lengths z1, ..., zn and usages u1, ..., un, where the sum of u1, ..., un equals 1 and 0 < u_i < 1.
We want an ordering for which the expected time to fetch a file from the store is minimal. For example, if z1 = 12, z2 = 3, u1 = 0.9 and u2 = 0.1, and file 1 is stored first, the expected access time is 12*0.9 + 15*0.1.
My task: Prove that this (greedy) algorithm is optimal.
My Question: Is my answer to that question correct or what should I improve?
My answer:
Suppose the algorithm is not optimal; then there has to exist an order that is more efficient. Two factors have to be considered: the usage and the length. The more a file is used, the shorter its access time has to be, so the files placed before it have to be as short as possible. If we sort by the ratio z_i/u_i in ascending order, files with high usage are placed early. Since the access time of a file is the sum of all lengths placed up to it times its usage, any other order means that frequently used files are accessed more slowly, which contradicts efficiency. Now suppose the ratio z_i/u_i were not the right criterion. Dividing by u_i has the consequence that the more a file is used, the smaller its term becomes, so more frequently used files are accessed faster (note that 0 < u_i < 1). Deviating from that division, files with higher usage would no longer be preferred, which would contradict efficiency. Also, because z_i is in the numerator, shorter files are preferred first; deviating from that would mean preferring files with longer lengths, and taking longer files first contradicts efficiency. Since every alternative ordering leads to a contradiction, the sorting is optimal and correct.
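The key quantity in an exchange argument is the cost change of swapping two adjacent files i and j, which is u_j*z_i - u_i*z_j; this is non-negative exactly when z_i/u_i <= z_j/u_j, so sorting ascending by z_i/u_i is optimal. A small sketch that checks the greedy against brute force (the names and the example data are mine):

```python
from itertools import permutations

def expected_access_time(files):
    """files: list of (length z, usage u); cost = sum over files of
    usage times the total length stored up to and including that file."""
    total, prefix = 0.0, 0.0
    for z, u in files:
        prefix += z
        total += u * prefix
    return total

def greedy_order(files):
    # sort ascending by z/u: short, heavily used files come first
    return sorted(files, key=lambda f: f[0] / f[1])

files = [(12, 0.5), (3, 0.2), (7, 0.3)]
best = min(expected_access_time(list(p)) for p in permutations(files))
assert abs(expected_access_time(greedy_order(files)) - best) < 1e-9
```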
I have a binary file and I know the number of occurrences of every symbol in it. I need to predict the length of the compressed file IF I were to compress it using the Huffman algorithm. I am only interested in the hypothetical output length and not in the codes for individual symbols, so constructing the Huffman tree seems redundant.
As an illustration, I need to get something like "a binary string of 38 bits which contains 4 a's, 5 b's and 10 c's can be compressed down to 28 bits", except both the file and the alphabet size are much larger.
The basic question is: can it be done without constructing the tree?
Looking at the greedy algorithm: http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
it seems the tree can be constructed in O(n log n) time, where n is the number of distinct symbols in the file. This is not bad asymptotically, but it requires memory allocation for tree nodes and does a lot of work which in my case goes to waste.
The lower bound on the average number of bits per symbol in the compressed file is the entropy H = -sum(p(x)*log(p(x))) over all symbols x in the input, where p(x) = freq(x)/filesize. The compressed length is thus bounded below by filesize*H. Unfortunately this bound is not achievable in most cases, because code lengths are integral, not fractional, so in practice the Huffman tree is needed to compute the exact compressed size. Still, the entropy bound can be used to estimate the best possible compression and to decide whether to use Huffman at all.
You can upper-bound the average number of bits per symbol in Huffman coding by
H(p1, p2, ..., pn) + 1, where H is the entropy and each pi is the probability of symbol i occurring in the input. Multiplying this value by the input size N gives an approximate length of the encoded output.
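The two bounds bracket the real Huffman size without building anything; for the 4/5/10 example from the earlier question (a sketch, function name mine):

```python
import math

def entropy_bounds(freqs):
    """(lower, upper) bounds in bits for a Huffman-coded file with
    the given symbol frequencies: N*H <= length <= N*(H+1)."""
    n = sum(freqs)
    h = -sum(f / n * math.log2(f / n) for f in freqs)
    return n * h, n * (h + 1)

lo, hi = entropy_bounds([4, 5, 10])   # the a/b/c example
assert lo <= 28 <= hi                 # the actual Huffman size is 28 bits
```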
You could easily modify the algorithm to build the binary tree in an array. The root node is at index 0, its left node at index 1 and right node at index 2. In general, a node's children will be at (index*2) + 1 and (index * 2) + 2. Granted, this still requires memory allocation, but if you know how many symbols you have you can compute how many nodes will be in the tree. So it's a single array allocation.
I don't see where the work that's done really goes to waste. You have to keep track of the combining logic somehow, and doing it in a tree as shown is pretty simple. I know that you're just looking for the final answer--the length of each symbol--but you can't get that answer without doing the work.
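That said, the combining work can be done without materializing any nodes: the encoded length equals the sum of the weights created by all merges (every symbol below a merge gains one code bit), so a heap of bare frequencies suffices. A sketch (the function name is mine):

```python
import heapq

def huffman_encoded_bits(freqs):
    """Exact output length in bits of Huffman coding, computed from
    symbol frequencies alone: each merge of weights a and b adds a+b."""
    if len(freqs) == 1:
        return freqs[0]  # degenerate case: one symbol, 1 bit each
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total
```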
Assume that I am tracking the usage of slots in a Fenwick tree. As an example, let's consider tracking 32 slots, leading to the Fenwick tree layout shown in the image below. The numbers in the grid indicate the index in the underlying array of counts manipulated by the Fenwick tree; the value in each cell is the number of "used" items in that segment (i.e. array cell 23 stores the number of used slots in the range [16-23]). The items at the lowest level (i.e. cells 0, 2, 4, ...) can only have the value "1" (used slot) or "0" (free slot).
What I am looking for is an efficient algorithm to find the first range of a given number of contiguous free slots.
To illustrate, suppose I have the Fenwick tree shown in the image below in which a total of 9 slots are used (note that the light gray numbers are just added for clarity, not actually stored in the tree's array cells).
Now I would like to find e.g. the first contiguous range of 10 free slots, which should find this range:
I can't seem to find an efficient way of doing this, and it is giving me a bit of a headache. Note, that as the required amount of storage space is critical for my purposes, I do not wish to extend the design to be a segment tree.
Any thoughts and suggestions on an O(log N) type of solution would be very welcome.
EDIT
Time for an update after bounty period has expired. Thanks for all comments, questions, suggestions and answers. They have made me think things over again, taught me a lot and pointed out to me (once again; one day I may learn this lesson) that I should focus more on the issue I want to solve when asking questions.
Since @Erik P was the only one who provided a reasonable answer to the question that included the requested code/pseudocode, he will receive the bounty.
He also pointed out correctly that O(log N) search using this structure is not going to be possible. Kudos to @DanBjorge for providing a proof that made me think about worst-case performance.
The comment and answer of @EvgenyKluev made me realize I should have formulated my question differently. In fact I was already doing in large part what he suggested (see https://gist.github.com/anonymous/7594508 - which shows where I got stuck before posting this question), and asked this question hoping there would be an efficient way to search contiguous ranges, thereby avoiding a change of this design to a segment tree (which would require an additional 1024 bytes). It appears however that such a change might be the smart thing to do.
For anyone interested, a binary encoded Fenwick tree matching the example used in this question (32 slot fenwick tree encoded in 64 bits) can be found here: https://gist.github.com/anonymous/7594245.
I think the easiest way to implement all the desired functionality with O(log N) time complexity, while minimizing memory requirements, is to use a bit vector to store all the 0/1 (free/used) values. A bit vector can substitute for the 6 lowest levels of both the Fenwick tree and the segment tree (if implemented as 64-bit integers). So the height of these trees may be reduced by 6, and the space requirement for each tree would be 64 (or 32) times less than usual.
The segment tree may be implemented as an implicit binary tree sitting in an array (just like the well-known max-heap implementation). The root node is at index 1; the left descendant of the node at index i is at 2*i, the right descendant at 2*i+1. This means twice as much space is needed compared to the Fenwick tree, but since the tree height is cut by 6 levels, that's not a big problem.
Each segment tree node should store a single value: the length of the longest contiguous sequence of "free" slots starting at a point covered by this node (or zero if there is no such starting point). This makes searching for the first range of a given number of contiguous zeros very simple: start from the root, then choose the left descendant if it contains a value greater than or equal to the required one, otherwise choose the right descendant. After arriving at a leaf node, check the corresponding word of the bit vector (for a run of zeros in the middle of the word).
Update operations are more complicated. When changing a value to "used", check appropriate word of bit vector, if it is empty, ascend segment tree to find nonzero value for some left descendant, then descend the tree to get to rightmost leaf with this value, then determine how newly added slot splits "free" interval into two halves, then update all parent nodes for both added slot and starting node of the interval being split, also set a bit in the bit vector. Changing a value to "free" may be implemented similarly.
If obtaining the number of nonzero items in some range is also needed, implement Fenwick tree over the same bit vector (but separate from the segment tree). There is nothing special in Fenwick tree implementation except that adding together 6 lowest nodes is substituted by "population count" operation for some word of the bit vector. For an example of using Fenwick tree together with bit vector see first solution for Magic Board on CodeChef.
All necessary operations for bit vector may be implemented pretty efficiently using various bitwise tricks. For some of them (leading/trailing zero count and population count) you could use either compiler intrinsics or assembler instructions (depending on target architecture).
If bit vector is implemented with 64-bit words and tree nodes - with 32-bit words, both trees occupy 150% space in addition to bit vector. This may be significantly reduced if each leaf node corresponds not to a single bit vector word, but to a small range (4 or 8 words). For 8 words additional space needed for trees would be only 20% of bit vector size. This makes implementation slightly more complicated. If properly optimized, performance should be approximately the same as in variant for one word per leaf node. For very large data set performance is likely to be better (because bit vector computations are more cache-friendly than walking the trees).
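To make the segment-tree part concrete (ignoring the bit-vector packing), here is a sketch of a common variant in which each node stores the longest zero run at the start, at the end, and anywhere inside its range; all names are mine, and this stores three values per node rather than the single "run starting here" value described above:

```python
class ZeroRunTree:
    """Segment tree over a 0/1 slot array: each node stores the longest
    zero run at the start (pref), end (suf), and anywhere (best) of its
    range, enough to find the first run of k contiguous free slots."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.n = len(bits)
        self.pref, self.suf, self.best = {}, {}, {}
        self._build(1, 0, self.n - 1)

    def _leaf(self, node, i):
        z = 1 if self.bits[i] == 0 else 0
        self.pref[node] = self.suf[node] = self.best[node] = z

    def _pull(self, node, lo, mid, hi):
        l, r = 2 * node, 2 * node + 1
        llen, rlen = mid - lo + 1, hi - mid
        self.pref[node] = self.pref[l] + (self.pref[r] if self.pref[l] == llen else 0)
        self.suf[node] = self.suf[r] + (self.suf[l] if self.suf[r] == rlen else 0)
        self.best[node] = max(self.best[l], self.best[r],
                              self.suf[l] + self.pref[r])

    def _build(self, node, lo, hi):
        if lo == hi:
            self._leaf(node, lo)
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid)
        self._build(2 * node + 1, mid + 1, hi)
        self._pull(node, lo, mid, hi)

    def set(self, i, used):
        self._set(1, 0, self.n - 1, i, 1 if used else 0)

    def _set(self, node, lo, hi, i, bit):
        if lo == hi:
            self.bits[i] = bit
            self._leaf(node, i)
            return
        mid = (lo + hi) // 2
        if i <= mid:
            self._set(2 * node, lo, mid, i, bit)
        else:
            self._set(2 * node + 1, mid + 1, hi, i, bit)
        self._pull(node, lo, mid, hi)

    def first_free_run(self, k):
        """Index of the first run of k contiguous free slots, or -1."""
        if k <= 0 or self.best[1] < k:
            return -1
        return self._find(1, 0, self.n - 1, k)

    def _find(self, node, lo, hi, k):
        if lo == hi:
            return lo
        mid = (lo + hi) // 2
        l, r = 2 * node, 2 * node + 1
        if self.best[l] >= k:                  # fully in the left half
            return self._find(l, lo, mid, k)
        if self.suf[l] + self.pref[r] >= k:    # straddles the middle
            return mid - self.suf[l] + 1
        return self._find(r, mid + 1, hi, k)   # must be in the right half
```

Updates rebuild only the path from the changed leaf to the root, so both set and first_free_run are O(log N).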
As mcdowella suggests in their answer, let K2 = K/2, rounding up, and let M be the smallest power of 2 that is >= K2. A promising approach would be to search for contiguous blocks of K2 zeroes fully contained in one size-M block, and once we've found those, check neighbouring size-M blocks to see if they contain sufficient adjacent zeroes. For the initial scan, if the number of 0s in a block is < K2, clearly we can skip it, and if the number of 0s is >= K2 and the size of the block is >= 2*M, we can look at both sub-blocks.
This suggests the following code. Below, A[0 .. N-1] is the Fenwick tree array; N is assumed to be a power of 2. I'm assuming that you're counting empty slots rather than nonempty ones; if you prefer to count nonempty slots, it's easy enough to transform from the one to the other.
initialize q as a stack data structure of triples of integers
push (N-1, N, A[N-1]) onto q
# An entry (i, j, z) represents the block [i-j+1 .. i] of length j, which
# contains z zeroes; we start with one block representing the whole array.
# We maintain the invariant that i always has at least as many trailing ones
# in its binary representation as j has trailing zeroes. (**)
initialize r as an empty list of pairs of integers
while q is not empty:
    pop an entry (i,j,z) off q
    if z < K2:
        next
    if j > M:
        first_half := i - j/2
        # change this if you want to count nonempty slots:
        first_half_zeroes := A[first_half]
        # Because of invariant (**) above, first_half always has exactly
        # the right number of trailing 1 bits in its binary representation
        # that A[first_half] counts elements of the interval
        # [i-j+1 .. first_half].
        push (i, j/2, z - first_half_zeroes) onto q
        push (first_half, j/2, first_half_zeroes) onto q
    else:
        process_block(i, j, z)
This lets us process all size-M blocks with at least K/2 zeroes in order. You could even randomize the order in which you push the first and second half onto q in order to get the blocks in a random order, which might be nice to combat the situation where the first half of your array fills up much more quickly than the latter half.
Now we need to discuss how to process a single block. If z = j, then the block is entirely filled with 0s and we can look both left and right to add zeroes. Otherwise, we need to find out if it starts with >= K/2 contiguous zeroes, and if so with how many exactly, and then check if the previous block ends with a suitable number of zeroes. Similarly, we check if the block ends with >= K/2 contiguous zeroes, and if so with how many exactly, and then check if the next block starts with a suitable number of zeroes. So we will need a procedure to find the number of zeroes a block starts or ends with, possibly with a shortcut if it's at least a or at most b. To be precise: let ends_with_zeroes(i, j, min, max) be a procedure that returns the number of zeroes that the block [i-j+1 .. i] ends with, with a shortcut to return max if the result will be more than max and min if the result will be less than min. Similarly for starts_with_zeroes(i, j, min, max).
def process_block(i, j, z):
    if j == z:
        # the block is entirely free
        if i > j:
            a := ends_with_zeroes(i-j, j, 0, K-z)
        else:
            a := 0
        if i < N-1:
            b := starts_with_zeroes(i+j, j, K-z-a-1, K-z-a)
        else:
            b := 0
        if b >= K-z-a:
            print "Found: starting at ", i - j - a + 1
        return
    # If the block doesn't start or end with K2 zeroes but overlaps with a
    # correct solution anyway, we don't need to find it here -- we'll find it
    # starting from the adjacent block.
    a := starts_with_zeroes(i, j, K2-1, j)
    if i > j and a >= K2:
        b := ends_with_zeroes(i-j, j, K-a-1, K-a)
        if b >= K-a:
            print "Found: starting at ", i - j - b + 1
        # Since z < 2*K2, and j != z, we know this block doesn't end with K2
        # zeroes, so we can safely return.
        return
    a := ends_with_zeroes(i, j, K2-1, j)
    if i < N-1 and a >= K2:
        b := starts_with_zeroes(i+j, j, K-a-1, K-a)
        if b >= K-a:
            print "Found: starting at ", i - a + 1
Note that in the second case where we find a solution, it may be possible to move the starting point left a bit further. You could check for that separately if you need the very first position that it could start.
Now all that's left is to implement starts_with_zeroes and ends_with_zeroes. In order to check that the block starts with at least min zeroes, we can test that it starts with 2^h zeroes (where 2^h <= min) by checking the appropriate Fenwick entry; then similarly check if it starts with 2^H zeroes where 2^H >= max to short cut the other way (except if max = j, it is trickier to find the right count from the Fenwick tree); then find the precise number.
def starts_with_zeroes(i, j, min, max):
    start := i-j
    h2 := 1
    while h2 * 2 <= min:
        h2 := h2 * 2
        if A[start + h2] < h2:
            return min
    # Now h2 = 2^h in the text.
    # If you insist, you can do the above operation faster with bit twiddling
    # to get the 2log of min (in which case, for more info google it).
    while h2 < max and A[start + 2*h2] == 2*h2:
        h2 := 2*h2
    if h2 == j:
        # Walk up the Fenwick tree to determine the exact number of zeroes
        # in interval [start+1 .. i]. (Not implemented, but easy.) Let this
        # number be z.
        if z < j:
            h2 := h2 / 2
    if h2 >= max:
        return max
    # Now we know that [start+1 .. start+h2] is all zeroes, but somewhere in
    # [start+h2+1 .. start+2*h2] there is a one.
    # Maintain invariant: the interval [start+1 .. start+h2] is all zeroes,
    # and there is a one in [start+h2+1 .. start+h2+step].
    step := h2
    while step > 1:
        step := step / 2
        if A[start + h2 + step] == step:
            h2 := h2 + step
    return h2
As you see, starts_with_zeroes is pretty bottom-up. For ends_with_zeroes, I think you'd want to do a more top-down approach, since examining the second half of something in a Fenwick tree is a little trickier. You should be able to do a similar type of binary search-style iteration.
This algorithm is definitely not O(log(N)), and I have a hunch that this is unavoidable. The Fenwick tree simply doesn't give information that is that good for your question. However, I think this algorithm will perform fairly well in practice if suitable intervals are fairly common.
One quick check, when searching for a range of K contiguous slots, is to find the largest power of two less than or equal to K/2. Any K continuous zero slots must contain at least one Fenwick-aligned range of slots of size <= K/2 that is entirely filled with zeros. You could search the Fenwick tree from the top for such chunks of aligned zeros and then look for the first one that can be extended to produce a range of K contiguous zeros.
In your example the lowest level contains 0s or 1s and the upper level contains sums of descendants. Finding stretches of 0s would be easier if the lowest level contained 0s where you are currently writing 1s and a count of the number of contiguous zeros to the left where you are currently writing zeros, and the upper levels contained the maximum value of any descendant. Updating would mean more work, especially if you had long strings of zeros being created and destroyed, but you could find the leftmost string of zeros of length at least K with a single search to the left branching left where the max value was at least K. Actually here a lot of the update work is done creating and destroying runs of 1,2,3,4... on the lowest level. Perhaps if you left the lowest level as originally defined and did a case by case analysis of the effects of modifications you could have the upper levels displaying the longest stretch of zeros starting at any descendant of a given node - for quick search - and get reasonable update cost.
@Erik covered a reasonable-sounding algorithm. However, note that this problem has a worst-case lower complexity bound of Ω(N/K).
Proof:
Consider a reduced version of the problem where:
N and K are both powers of 2
N > 2K >= 4
Suppose your input array is made up of (N/2K) chunks of size 2K. One chunk is of the form K 0s followed by K 1s, every other chunk is the string "10" repeated K times. There are (N/2K) such arrays, each with exactly one solution to the problem (the beginning of the one "special" chunk).
Let n = log2(N), k = log2(K). Let us also define the root node of the tree as being at level 0 and the leaf nodes as being at level n of the tree.
Note that, due to our array being made up of aligned chunks of size 2K, level n-k of the tree is simply going to be made up of the number of 1s in each chunk. However, each of our chunks has the same number of 1s in it. This means that every node at level n-k will be identical, which in turn means that every node at level <= n-k will also be identical.
What this means is that the tree contains no information that can disambiguate the "special" chunk until you start analyzing level n-k+1 and lower. But since all but 2 of the (N/K) nodes at that level are identical, in the worst case you'll have to examine Ω(N/K) nodes in order to distinguish the solution from the rest of the nodes.
I've reduced a compression problem I am working on to the following:
You are given as input two n-length vectors of floating point values:
float64 L1, L2, ..., Ln;
float64 U1, U2, ..., Un;
Such that for all i
0.0 <= Li <= Ui <= 1.0
(By the way, n is large: ~10^9)
The algorithm takes L and U as input and uses them to generate a program.
When executed the generated program outputs an n-length vector X:
float64 X1, X2, ..., Xn;
Such that for all i:
Li <= Xi <= Ui
The generated program can output any such X that fits these bounds.
For example, a generated program could simply store L as an array and output it. (Notice this would take 64n bits of space to store L, plus a little extra for the code that outputs it.)
The goal is to make the generated program (including data) as small as possible, given L and U.
For example, suppose every element of L were less than 0.3 and every element of U were greater than 0.4; then the generated program could just be:
for i in 1 to n
    output 0.35
Which would be tiny.
Can anyone suggest a strategy, algorithm or architecture to tackle this?
This simple heuristic is very fast and should provide very good compression if the bounds allow for a very good compression:
Prepare an arbitrary (virtual) binary search tree over all candidate values. Non-negative float64s share their sorting order with their bit patterns read as int64s, so you can arbitrarily prefer (have nearer to the root) the values with more trailing zeroes.
For each pair of bounds:
    start at the root.
    while the current node is larger than both bounds OR smaller than both bounds:
        descend down the tree.
    append the current node to the vector.
For the tree mentioned above, this means
For each pair of bounds
find the (unique) number within the specified range that has as few significant bits as possible. That is, find the first bit where both bounds differ; set it to 1 and all following bits to 0; if the bit that's set to 1 is the sign bit, set it to 0 instead.
Then you can feed this to a deflating library to compress it (and build a self-extracting archive).
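The "fewest significant bits in the range" pick from the virtual tree above can be done directly on the bit patterns. A sketch, assuming non-negative bounds so that the integer order of the IEEE-754 patterns agrees with the float order (all names are mine):

```python
import struct

def to_bits(x):
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def from_bits(n):
    return struct.unpack('<d', struct.pack('<Q', n))[0]

def simplest_in_range(lo, hi):
    """A float64 in [lo, hi] whose bit pattern ends in as many zero
    bits as possible. Assumes 0.0 <= lo <= hi."""
    a, b = to_bits(lo), to_bits(hi)
    if a == b:
        return lo
    k = (a ^ b).bit_length()        # highest bit where the bounds differ
    c = (b >> (k - 1)) << (k - 1)   # shared prefix, a 1, then all zeros
    return from_bits(c)
```

The result keeps the common prefix of both patterns, sets the first differing bit to 1 and zeroes the rest, exactly as described above; it always lies strictly between the two patterns (or equals one of them).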
A better compression might be achievable if you analyse the data and build a different binary search tree. Since the data set is very large and arrives as a stream, that might not be feasible, but this is one such heuristic:
while the output is not fully defined:
    find a value that fits within the most undecided bounds:
        sort all bounds together:
            bounds with lower value sort before bounds with higher value.
            lower bounds sort before upper bounds with the same value.
            indistinguishable bounds are grouped together.
        calculate the running total of open intervals.
        pick the largest total. Either the upper or the lower bound will do.
        You could even try to make a "smart choice" by splitting the interval
        with the least amount of significant bits.
    set this value as the output for all positions where it can be used.
Instead of recalculating the sort order, you could cache the sort order and only remove entries from it, or even cache the running total as well (or switch from recalculating the running total to updating the cached total at runtime). This does not change the result, only improves the running time.
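The running-total pass is an ordinary sweep line. A sketch of finding the value covered by the most [lo, hi] intervals (the names are mine; lower bounds sort before upper bounds at equal values, as described above):

```python
def most_covered_value(bounds):
    """(value, count): a value lying in the most closed intervals."""
    events = []
    for lo, hi in bounds:
        events.append((lo, 0))   # 0 = lower bound, sorts first at ties
        events.append((hi, 1))   # 1 = upper bound
    events.sort()
    best, best_val, open_now = 0, None, 0
    for v, kind in events:
        if kind == 0:
            open_now += 1
            if open_now > best:
                best, best_val = open_now, v
        else:
            open_now -= 1
    return best_val, best
```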
Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. I repeat this until I find a sum s(k+1) greater than B. There is a lot of computation repeated between iterations, even without rebuilding the initial two 2^(N/2) lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(N/2)) operation) and save re-computing them. The set of all possible subset sums doesn't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j;
if (s > B) ++i;
else if (s < A) ++j;
else { print s; /* ... what_goes_here? ... */ }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
// Since we want all sums in a range, rather than just the first, we need to
// check all combinations. Horowitz/Sahni is only designed to find one.
for (auto firstit = firstlist.begin(); firstit != firstlist.end(); ++firstit) {
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
        int sum = *firstit + *secondit;
        if (sum >= A && sum <= B)
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz-Sahni, but with some optimizations to reduce the constants in the big-O.
We do the following
Step 1: Split the input into two halves of N/2 elements each, and generate all 2^(N/2) subsets of each half. Call their sum lists S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, use binary search to find the sums in S1 which together with it give a total in the range [A,B]. This is O(N*2^(N/2)). At the same time, check whether the sum from S2 is itself in the range [A,B]. The optimization here is to combine the loops. Note: this gives you a representation of the sets (in terms of an index into S1 and an index into S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible; for instance, when the sum from S2 is negative, we don't need to consider sums < A, etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray codes. Gray codes are sequences of binary bit patterns in which each pattern differs from the previous one in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a Gray code with 2 bits.
There are Gray codes which go through all possible N/2-bit numbers, and these can be generated iteratively (see the wiki page I linked to) in O(1) time per step (for a total of O(2^(N/2)) steps): given the current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
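A sketch of this Gray-code enumeration (the function name is mine; each step flips the lowest set bit of the counter, which is exactly the bit that changes between consecutive Gray codes):

```python
def gray_subset_sums(nums):
    """All 2^n subset sums in Gray-code order: each sum is obtained
    from the previous one by adding or removing a single element."""
    sums = [0]
    s, subset = 0, 0
    for g in range(1, 1 << len(nums)):
        bit = (g & -g).bit_length() - 1   # bit flipped between the
        subset ^= 1 << bit                # Gray codes of g-1 and g
        s += nums[bit] if subset >> bit & 1 else -nums[bit]
        sums.append(s)
    return sums
```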
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than the original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. For this problem it's a good idea to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is Python code with example input:

# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)

# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**len(pairs)):
            yield [x for m, x in pairs if m & w]

# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S), S) for S in subsets(terms[:len(terms)//2])])
R = sorted([(sum(S), S) for S in subsets(terms[len(terms)//2:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    # Bounds checks come first so the pointers never index past the end
    while toolow < len(R) - 1 and R[toolow+1][0] < A - Lsum:
        toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B - Lsum:
        toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S + R[n][1])), '=', sum(S + R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked-list trick is an artifact of the cost model and never truly helpful, because in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each subset as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
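A small sketch of the one-word-per-subset representation (helper names are mine): carry an integer bitmask alongside each sum, so doubling the list just ORs one new bit in, and the actual subset is decoded from its mask only when it appears in the output.

```python
from heapq import merge

def sorted_sums_with_masks(nums):
    """Return (sum, bitmask) pairs for all subsets, sorted by sum.
    Each subset costs one machine word (its bitmask), not O(N) storage."""
    entries = [(0, 0)]  # (subset sum, membership bitmask)
    for i, x in enumerate(nums):
        # Subsets gaining element i get bit i set in their mask.
        shifted = [(s + x, m | (1 << i)) for s, m in entries]
        entries = list(merge(entries, shifted))
    return entries

def decode(mask, nums):
    """Recover the actual subset from its bitmask."""
    return [x for i, x in enumerate(nums) if mask >> i & 1]
```

In the Horowitz-Sahni code above, this would replace the explicit Python lists S with integers, which is exactly the "subset in one word" point.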
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.