How to efficiently find a contiguous range of used/free slots from a Fenwick tree - algorithm

Assume that I am tracking the usage of slots in a Fenwick tree. As an example, let's consider tracking 32 slots, leading to a Fenwick tree layout as shown in the image below, where the numbers in the grid indicate the index in the underlying array of counts manipulated by the Fenwick tree, and the value in each cell is the number of "used" slots in that segment (i.e. array cell 23 stores the number of used slots in the range [16-23]). The cells at the lowest level (i.e. cells 0, 2, 4, ...) can only hold the value "1" (used slot) or "0" (free slot).
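For reference, here is a minimal Python sketch of that layout (the slot contents below are an arbitrary 32-slot example with 9 used slots, not the one from the image):
def build_fenwick(slots):
    # 0-based layout as described: cell i stores the number of used slots in
    # the range it covers, e.g. cell 23 covers [16-23]; even cells cover one slot.
    tree = slots[:]
    for i in range(1, len(slots) + 1):
        j = i + (i & -i)              # parent cell (1-based arithmetic)
        if j <= len(slots):
            tree[j - 1] += tree[i - 1]
    return tree

def used_in_prefix(tree, p):
    # Number of used slots among slots 0..p (inclusive).
    total, i = 0, p + 1
    while i > 0:
        total += tree[i - 1]
        i -= i & -i
    return total

slots = [0]*3 + [1] + [0]*5 + [1]*2 + [0]*12 + [1]*6 + [0]*3   # 9 used slots
tree = build_fenwick(slots)
print(tree[23])                  # used slots in [16-23] -> 1
print(used_in_prefix(tree, 31))  # total used slots     -> 9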
What I am looking for is an efficient algorithm to find the first range of a given number of contiguous free slots.
To illustrate, suppose I have the Fenwick tree shown in the image below in which a total of 9 slots are used (note that the light gray numbers are just added for clarity, not actually stored in the tree's array cells).
Now I would like to find e.g. the first contiguous range of 10 free slots, which should find this range:
I can't seem to find an efficient way of doing this, and it is giving me a bit of a headache. Note that, as the required amount of storage space is critical for my purposes, I do not wish to extend the design into a segment tree.
Any thoughts and suggestions on an O(log N) type of solution would be very welcome.
EDIT
Time for an update after bounty period has expired. Thanks for all comments, questions, suggestions and answers. They have made me think things over again, taught me a lot and pointed out to me (once again; one day I may learn this lesson) that I should focus more on the issue I want to solve when asking questions.
Since @Erik P was the only one that provided a reasonable answer to the question that included the requested code/pseudo code, he will receive the bounty.
He also pointed out correctly that O(log N) search using this structure is not going to be possible. Kudos to @DanBjorge for providing a proof that made me think about worst case performance.
The comment and answer of @EvgenyKluev made me realize I should have formulated my question differently. In fact I was already doing in large part what he suggested (see https://gist.github.com/anonymous/7594508 - which shows where I got stuck before posting this question), and asked this question hoping there would be an efficient way to search contiguous ranges, thereby avoiding changing this design to a segment tree (which would require an additional 1024 bytes). It appears however that such a change might be the smart thing to do.
For anyone interested, a binary encoded Fenwick tree matching the example used in this question (32 slot fenwick tree encoded in 64 bits) can be found here: https://gist.github.com/anonymous/7594245.

I think the easiest way to implement all the desired functionality with O(log N) time complexity and at the same time minimize memory requirements is to use a bit vector to store all 0/1 (free/used) values. A bit vector can substitute for the 6 lowest levels of both the Fenwick tree and the segment tree (if implemented as 64-bit integers). So the height of these trees may be reduced by 6, and the space requirements for each of these trees would be 64 (or 32) times less than usual.
A segment tree may be implemented as an implicit binary tree sitting in an array (just like the well-known max-heap implementation). The root node is at index 1, the left descendant of a node at index i is placed at 2*i, the right descendant at 2*i+1. This means twice as much space is needed compared to the Fenwick tree, but since the tree height is cut by 6 levels, that's not a big problem.
Each segment tree node should store a single value: the length of the longest contiguous sequence of "free" slots starting at a point covered by this node (or zero if there is no such starting point). This makes searching for the first range of a given number of contiguous zeros very simple: start from the root, then choose the left descendant if it contains a value greater than or equal to the required length, otherwise choose the right descendant. After coming to some leaf node, check the corresponding word of the bit vector (for a run of zeros in the middle of the word).
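To make the descent concrete, here is a minimal Python sketch (my own naming; leaves here are single slots rather than 64-bit words, and the tree is rebuilt from scratch instead of being updated incrementally, so it only illustrates the query):
def build_runs_tree(used):
    # run[i] = length of the contiguous run of free slots starting at slot i;
    # each internal node stores the maximum run starting anywhere in its range.
    n = len(used)
    run = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        run[i] = 0 if used[i] else run[i + 1] + 1
    size = 1
    while size < n:
        size *= 2
    tree = [0] * (2 * size)
    tree[size:size + n] = run[:n]
    for v in range(size - 1, 0, -1):
        tree[v] = max(tree[2 * v], tree[2 * v + 1])
    return tree, size

def first_free_run(tree, size, k):
    # Index of the first slot that starts a run of at least k free slots, or -1.
    if tree[1] < k:
        return -1
    v = 1
    while v < size:                       # prefer the left child when possible
        v = 2 * v if tree[2 * v] >= k else 2 * v + 1
    return v - size

used = [0]*3 + [1] + [0]*5 + [1]*2 + [0]*12 + [1]*6 + [0]*3   # 32 slots, 9 used
tree, size = build_runs_tree(used)
print(first_free_run(tree, size, 10))     # -> 11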
Update operations are more complicated. When changing a value to "used", check the appropriate word of the bit vector; if it is empty, ascend the segment tree to find a nonzero value for some left descendant, then descend the tree to get to the rightmost leaf with this value, then determine how the newly added slot splits the "free" interval into two halves, then update all parent nodes for both the added slot and the starting node of the interval being split, and also set a bit in the bit vector. Changing a value to "free" may be implemented similarly.
If obtaining the number of nonzero items in some range is also needed, implement a Fenwick tree over the same bit vector (but separate from the segment tree). There is nothing special in the Fenwick tree implementation except that adding together the 6 lowest levels is substituted by a "population count" operation on some word of the bit vector. For an example of using a Fenwick tree together with a bit vector, see the first solution for Magic Board on CodeChef.
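Roughly, the combination looks like this (a Python sketch with my own naming; the tree counts used slots per 64-bit word, and a population count handles the bits inside a partial word):
class BitFenwick:
    def __init__(self, n_slots):
        self.n_words = (n_slots + 63) // 64
        self.words = [0] * self.n_words           # the bit vector itself
        self.tree = [0] * (self.n_words + 1)      # 1-based Fenwick over words

    def set_used(self, i):
        # Mark slot i as used and update the per-word counts.
        w, b = divmod(i, 64)
        if not (self.words[w] >> b) & 1:
            self.words[w] |= 1 << b
            w += 1
            while w <= self.n_words:              # standard Fenwick point update
                self.tree[w] += 1
                w += w & -w

    def used_before(self, i):
        # Number of used slots in [0, i): Fenwick prefix over whole words,
        # plus a population count of the partial word.
        w, b = divmod(i, 64)
        total = bin(self.words[w] & ((1 << b) - 1)).count('1') if b else 0
        while w > 0:
            total += self.tree[w]
            w -= w & -w
        return total

f = BitFenwick(256)
for slot in (3, 9, 10, 70):
    f.set_used(slot)
print(f.used_before(71))   # -> 4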
All the necessary bit vector operations may be implemented pretty efficiently using various bitwise tricks. For some of them (leading/trailing zero count and population count) you could use either compiler intrinsics or assembler instructions (depending on the target architecture).
If the bit vector is implemented with 64-bit words and the tree nodes with 32-bit words, both trees together occupy 150% additional space on top of the bit vector. This may be significantly reduced if each leaf node corresponds not to a single bit-vector word, but to a small range of words (4 or 8). With 8 words the additional space needed for the trees would be only about 20% of the bit vector size. This makes the implementation slightly more complicated. If properly optimized, performance should be approximately the same as in the one-word-per-leaf variant. For very large data sets performance is likely to be better (because bit vector computations are more cache-friendly than walking the trees).

As mcdowella suggests in their answer, let K2 = K/2, rounding up, and let M be the smallest power of 2 that is >= K2. A promising approach would be to search for contiguous blocks of K2 zeroes fully contained in one size-M block, and once we've found those, check neighbouring size-M blocks to see if they contain sufficient adjacent zeroes. For the initial scan, if the number of 0s in a block is < K2, clearly we can skip it, and if the number of 0s is >= K2 and the size of the block is >= 2*M, we can look at both sub-blocks.
This suggests the following code. Below, A[0 .. N-1] is the Fenwick tree array; N is assumed to be a power of 2. I'm assuming that you're counting empty slots, rather than nonempty ones; if you prefer to count nonempty slots, it's easy enough to transform from the one to the other.
initialize q as a stack data structure of triples of integers
push (N-1, N, A[N-1]) onto q
# An entry (i, j, z) represents the block [i-j+1 .. i] of length j, which
# contains z zeroes; we start with one block representing the whole array.
# We maintain the invariant that i always has at least as many trailing ones
# in its binary representation as j has trailing zeroes. (**)
initialize r as an empty list of pairs of integers
while q is not empty:
    pop an entry (i,j,z) off q
    if z < K2:
        next
    if j > M:
        first_half := i - j/2
        # change this if you want to count nonempty slots:
        first_half_zeroes := A[first_half]
        # Because of invariant (**) above, first_half always has exactly
        # the right number of trailing 1 bits in its binary representation
        # that A[first_half] counts elements of the interval
        # [i-j+1 .. first_half].
        push (i, j/2, z - first_half_zeroes) onto q
        push (first_half, j/2, first_half_zeroes) onto q
    else:
        process_block(i, j, z)
This lets us process all size-M blocks with at least K/2 zeroes in order. You could even randomize the order in which you push the first and second half onto q in order to get the blocks in a random order, which might be nice to combat the situation where the first half of your array fills up much more quickly than the latter half.
Now we need to discuss how to process a single block. If z = j, then the block is entirely filled with 0s and we can look both left and right to add zeroes. Otherwise, we need to find out if it starts with >= K/2 contiguous zeroes, and if so with how many exactly, and then check if the previous block ends with a suitable number of zeroes. Similarly, we check if the block ends with >= K/2 contiguous zeroes, and if so with how many exactly, and then check if the next block starts with a suitable number of zeroes. So we will need a procedure to find the number of zeroes a block starts or ends with, possibly with a shortcut if it's at least a or at most b. To be precise: let ends_with_zeroes(i, j, min, max) be a procedure that returns the number of zeroes that the block [i-j+1 .. i] ends with, with a shortcut to return max if the result will be more than max and min if the result will be less than min. Similarly for starts_with_zeroes(i, j, min, max).
def process_block(i, j, z):
    if j == z:
        if i > j:
            a := ends_with_zeroes(i-j, j, 0, K-z)
        else:
            a := 0
        if i < N-1:
            b := starts_with_zeroes(i+j, j, K-z-a-1, K-z-a)
        else:
            b := 0
        if b >= K-z-a:
            print "Found: starting at ", i - j - a + 1
        return
    # If the block doesn't start or end with K2 zeroes but overlaps with a
    # correct solution anyway, we don't need to find it here -- we'll find it
    # starting from the adjacent block.
    a := starts_with_zeroes(i, j, K2-1, j)
    if i > j and a >= K2:
        b := ends_with_zeroes(i-j, j, K-a-1, K-a)
        if b >= K-a:
            print "Found: starting at ", i - j - a + 1
        # Since z < 2*K2, and j != z, we know this block doesn't end with K2
        # zeroes, so we can safely return.
        return
    a := ends_with_zeroes(i, j, K2-1, j)
    if i < N-1 and a >= K2:
        b := starts_with_zeroes(i+j, j, K-a-1, K-a)
        if b >= K-a:
            print "Found: starting at ", i - a + 1
Note that in the second case where we find a solution, it may be possible to move the starting point left a bit further. You could check for that separately if you need the very first position that it could start.
Now all that's left is to implement starts_with_zeroes and ends_with_zeroes. In order to check that the block starts with at least min zeroes, we can test that it starts with 2^h zeroes (where 2^h <= min) by checking the appropriate Fenwick entry; then similarly check if it starts with 2^H zeroes where 2^H >= max to short cut the other way (except if max = j, it is trickier to find the right count from the Fenwick tree); then find the precise number.
def starts_with_zeroes(i, j, min, max):
    start := i-j
    h2 := 1
    while h2 * 2 <= min:
        h2 := h2 * 2
    if A[start + h2] < h2:
        return min
    # Now h2 = 2^h in the text.
    # If you insist, you can do the above operation faster with bit twiddling
    # to get the 2log of min (in which case, for more info google it).
    while h2 < max and A[start + 2*h2] == 2*h2:
        h2 := 2*h2
    if h2 == j:
        # Walk up the Fenwick tree to determine the exact number of zeroes
        # in interval [start+1 .. i]. (Not implemented, but easy.) Let this
        # number be z.
        if z < j:
            h2 := h2 / 2
    if h2 >= max:
        return max
    # Now we know that [start+1 .. start+h2] is all zeroes, but somewhere in
    # [start+h2+1 .. start+2*h2] there is a one.
    # Maintain invariant: the interval [start+1 .. start+h2] is all zeroes,
    # and there is a one in [start+h2+1 .. start+h2+step].
    step := h2
    while step > 1:
        step := step / 2
        if A[start + h2 + step] == step:
            h2 := h2 + step
    return h2
As you see, starts_with_zeroes is pretty bottom-up. For ends_with_zeroes, I think you'd want to do a more top-down approach, since examining the second half of something in a Fenwick tree is a little trickier. You should be able to do a similar type of binary search-style iteration.
This algorithm is definitely not O(log(N)), and I have a hunch that this is unavoidable. The Fenwick tree simply doesn't give information that is that good for your question. However, I think this algorithm will perform fairly well in practice if suitable intervals are fairly common.

One quick check, when searching for a range of K contiguous slots, is to find the largest power of two less than or equal to K/2. Any K contiguous zero slots must contain at least one Fenwick-aligned block of that size that is entirely filled with zeros. You could search the Fenwick tree from the top for such chunks of aligned zeros and then look for the first one that can be extended to produce a range of K contiguous zeros.
In your example the lowest level contains 0s or 1s and the upper levels contain sums of descendants. Finding stretches of 0s would be easier if the lowest level contained 0s where you are currently writing 1s and a count of the number of contiguous zeros to the left where you are currently writing zeros, and the upper levels contained the maximum value of any descendant. Updating would mean more work, especially if you had long strings of zeros being created and destroyed, but you could find the leftmost string of zeros of length at least K with a single search, branching left where the max value was at least K.
Actually here a lot of the update work is done creating and destroying runs of 1,2,3,4... on the lowest level. Perhaps if you left the lowest level as originally defined and did a case by case analysis of the effects of modifications you could have the upper levels displaying the longest stretch of zeros starting at any descendant of a given node - for quick search - and get reasonable update cost.

@Erik covered a reasonable-sounding algorithm. However, note that this problem has a lower complexity bound of Ω(N/K) in the worst case.
Proof:
Consider a reduced version of the problem where:
N and K are both powers of 2
N > 2K >= 4
Suppose your input array is made up of (N/2K) chunks of size 2K. One chunk is of the form K 0s followed by K 1s, every other chunk is the string "10" repeated K times. There are (N/2K) such arrays, each with exactly one solution to the problem (the beginning of the one "special" chunk).
Let n = log2(N), k = log2(K). Let us also define the root node of the tree as being at level 0 and the leaf nodes as being at level n of the tree.
Note that, due to our array being made up of aligned chunks of size 2K, level n-k of the tree is simply going to be made up of the number of 1s in each chunk. However, each of our chunks has the same number of 1s in it. This means that every node at level n-k will be identical, which in turn means that every node at level <= n-k will also be identical.
What this means is that the tree contains no information that can disambiguate the "special" chunk until you start analyzing level n-k+1 and lower. But since all but 2 of the (N/K) nodes at that level are identical, this means that in the worst case you'll have to examine O(N/K) nodes in order to disambiguate the solution from the rest of the nodes.
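As a quick sanity check of this argument, here is a small construction of my own (N = 16, K = 2) showing that the Fenwick cells covering whole chunks of size 2K come out identical for every placement of the special chunk, so they cannot help locate it:
def build_fenwick(slots):
    tree = slots[:]
    for i in range(1, len(slots) + 1):
        j = i + (i & -i)
        if j <= len(slots):
            tree[j - 1] += tree[i - 1]
    return tree

N, K = 16, 2
def adversarial(special):
    # One chunk of K zeroes followed by K ones; every other chunk is "10" * K.
    arr = []
    for c in range(N // (2 * K)):
        arr += ([0] * K + [1] * K) if c == special else [1, 0] * K
    return arr

trees = [build_fenwick(adversarial(s)) for s in range(N // (2 * K))]
# Cells 3, 7, 11 and 15 cover 4, 8, 4 and 16 slots respectively; each prints
# the same count for all four possible inputs.
for cell in (3, 7, 11, 15):
    print(cell, [t[cell] for t in trees])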

Related

algorithm of finding max of min in any range for two arrays

Let's say we have two arrays of ints of equal length, a_1, ..., a_n and b_1, ..., b_n. For any given index pairs i and j with 1<=i<j<=n, we need to find the max of the min for any sequence of the form a_k, ..., a_{l-1}, b_l, ..., b_{j-i+k} with 0<=k<=n-j+i and l can be j-i+k+1, i.e. that sequence is purely from array a. When k=0, the sequence is purely from array b.
We want to do this for all pairs of i and j very efficiently.
Example, given
`a=[3,2,4,1]` and `b=[4,6,1,3]`
when `i=1, j=3`, the sequence can be
`[3,2,4]`, min is 2
`[3,2,1]`, min is 1
`[3,6,1]`, min is 1
`[2,4,1]`, min is 1
`[2,4,3]`, min is 2
`[2,1,3]`, min is 1
`[4,6,1]`, min is 1
`[6,1,3]`, min is 1
So the max is 2 for this input.
Is there a good way to run this efficiently?
It seems possible to make the brute force approach run fairly quickly.
If you preprocess each sequence into a balanced tree where each node is augmented with the min of that subtree, then you can find the min of any subrange of that sequence in O(log n) time by splitting the tree at the appropriate points. See, for example, this paper for more information. Note that this preprocessing takes O(n) time.
Let's call the range (i,j) a window. The complexity of this problem doesn't depend on the specific (i,j), but rather the size of the window (that is, j-i+1). For a window size of m (=j-i+1), there are n-m+1 windows of that size. For each window, there are m+1 places where you can "cut" the window so that some prefix of the elements comes from sequence a and the suffix comes from sequence b. You pay O(log n) for each cut (to split the binary trees as I mentioned above). That's a total cost of O((n-m+1) * (m+1) * log(n)).
There is probably a faster way to do this, by reusing splits, or by noticing that nearby windows share a lot of elements. But regardless, I think the binary tree splitting trick I mentioned above might be helpful!
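To make the counting concrete, here is a rough Python sketch of that brute force over window positions and cut points, using plain range-minimum segment trees (O(log n) per cut) rather than the balanced-tree splitting from the paper; the names are mine.
import math

def build_min_tree(arr):
    # Simple segment tree for range-minimum queries in O(log n).
    n = len(arr)
    size = 1 << (n - 1).bit_length() if n > 1 else 1
    tree = [math.inf] * (2 * size)
    tree[size:size + n] = arr
    for v in range(size - 1, 0, -1):
        tree[v] = min(tree[2 * v], tree[2 * v + 1])
    return tree, size

def range_min(tree, size, lo, hi):
    # Minimum of arr[lo..hi] (inclusive); +inf for an empty range.
    res = math.inf
    lo += size; hi += size + 1
    while lo < hi:
        if lo & 1: res = min(res, tree[lo]); lo += 1
        if hi & 1: hi -= 1; res = min(res, tree[hi])
        lo //= 2; hi //= 2
    return res

def max_of_min(a, b, m):
    ta, sa = build_min_tree(a)
    tb, sb = build_min_tree(b)
    n, best = len(a), -math.inf
    for start in range(n - m + 1):            # window position
        for cut in range(m + 1):              # prefix from a, suffix from b
            lo = min(range_min(ta, sa, start, start + cut - 1),
                     range_min(tb, sb, start + cut, start + m - 1))
            best = max(best, lo)
    return best

print(max_of_min([3, 2, 4, 1], [4, 6, 1, 3], 3))   # 2, matching the example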

Finding Most Compressible Vector Within Given Bounds?

I've reduced a compression problem I am working on to the following:
You are given as input two n-length vectors of floating point values:
float64 L1, L2, ..., Ln;
float64 U1, U2, ..., Un;
Such that for all i
0.0 <= Li <= Ui <= 1.0
(By the way, n is large: ~10^9)
The algorithm takes L and U as input and uses them to generate a program.
When executed the generated program outputs an n-length vector X:
float64 X1, X2, ..., Xn;
Such that for all i:
Li <= Xi <= Ui
The generated program can output any such X that fits these bounds.
For example a generated program could simply store L as an array and output it. (Notice this would take 64n bits of space to store L and then a little extra for the program to output it)
The goal is to make the generated program (including data) as small as possible, given L and U.
For example, suppose that it happens that every element of L was less than 0.3 and every element of U was greater than 0.4; then the generated program could just be:
for i in 1 to n
output 0.35
Which would be tiny.
Can anyone suggest a strategy, algorithm or architecture to tackle this?
This simple heuristic is very fast and should provide very good compression if the bounds allow for a very good compression:
Prepare an arbitrary (virtual) binary search tree over all candidate values. float64s share the sorting order with signed int64s, so you can arbitrarily prefer (have nearer to the root) the values with more trailing zeroes.
For each pair of bounds
start at the root.
While the current node is larger than both bounds OR smaller than both bounds,
descend down the tree.
append the current node into the vector.
For the tree mentioned above, this means
For each pair of bounds
find the (unique) number within the specified range that has as few significant bits as possible. That is, find the first bit where both bounds differ; set it to 1 and all following bits to 0; if the bit that's set to 1 is the sign bit, set it to 0 instead.
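For the non-negative float64 bounds used here (everything lies in [0.0, 1.0]), that bit trick can be sketched in Python like this (function names are mine):
import struct

def to_bits(x):
    # For non-negative IEEE-754 doubles the unsigned bit pattern is order-preserving.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def from_bits(b):
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def simplest_in_range(lo, hi):
    # Keep the common prefix of the two bounds' bit patterns, set the first
    # differing bit to 1 and clear everything below it.
    a, b = to_bits(lo), to_bits(hi)
    if a == b:
        return lo
    p = (a ^ b).bit_length() - 1          # most significant differing bit
    v = ((a >> (p + 1)) << (p + 1)) | (1 << p)
    return from_bits(v)

print(simplest_in_range(0.3, 0.4))   # 0.375, i.e. binary 0.011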
Then you can feed this to a deflating library to compress (and build a self-extracting archive).
A better compression might be possible to achieve if you analyse the data and build a different binary search tree. Since the data set is very large and arrives as a stream of data, it might not be feasible, but this is one such heuristic:
while the output is not fully defined
find any value that fits within the most undecided-for bounds:
sort all bounds together:
bounds with lower value sort before bounds with higher value.
lower bounds sort before upper bounds with the same value.
indistinguishable bounds are grouped together.
calculate the running total of open intervals.
pick the largest total. Either the upper or the lower bound will do. You could even try to make a "smart choice" by splitting the interval with the least amount of significant bits.
set this value as the output for all positions where it can be used.
Instead of recalculating the sort order, you could cache the sort order and only remove from it, or even cache the running total as well (or switch from recalculating the running total to caching it at runtime). This does not change the result, it only improves the running time.
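A minimal Python sketch of that greedy pass (without the caching just mentioned; all names are mine):
def greedy_cover(bounds):
    # bounds: list of (lo, hi) pairs. Repeatedly pick the value contained in
    # the largest number of still-undecided intervals and assign it to all of
    # them; returns one chosen value per pair.
    remaining = set(range(len(bounds)))
    out = [None] * len(bounds)
    while remaining:
        # Sort lower and upper bounds together; lower bounds sort first on ties.
        events = []
        for i in remaining:
            lo, hi = bounds[i]
            events.append((lo, 0, i))      # interval opens
            events.append((hi, 1, i))      # interval closes
        events.sort()
        best, best_value, open_count = 0, None, 0
        for value, kind, _ in events:      # running total of open intervals
            open_count += 1 if kind == 0 else -1
            if kind == 0 and open_count > best:
                best, best_value = open_count, value
        for i in list(remaining):          # reuse the value wherever it fits
            lo, hi = bounds[i]
            if lo <= best_value <= hi:
                out[i] = best_value
                remaining.discard(i)
    return out

print(greedy_cover([(0.0, 0.3), (0.1, 0.5), (0.4, 0.9)]))   # [0.1, 0.1, 0.4]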

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic up, which I have read, but it's dead, and no answer was ever accepted. On top of that, my interests lie in a slightly more constrained form of the question, with a couple practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(n log n), where you approach each row separately.
O(n log n) with strong best and average performance.
One that is O(n+m):
Start in a non-extreme corner, which we will assume is the bottom right.
Let the target be J. Cur Pos is M.
If M is greater than J, move left.
If M is less than J, move up.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
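For reference, here is the staircase walk in Python, written starting from the top-right corner of an ascending matrix (also a non-extreme corner); each step discards one row or one column, giving O(n+m):
def staircase_search(matrix, target):
    # Rows and columns sorted ascending; start at the top-right corner.
    if not matrix or not matrix[0]:
        return None
    row, col = 0, len(matrix[0]) - 1
    while row < len(matrix) and col >= 0:
        value = matrix[row][col]
        if value == target:
            return (row, col)
        if value > target:
            col -= 1      # everything below in this column is even larger
        else:
            row += 1      # everything left in this row is even smaller
    return None           # absence detected after at most n + m steps

grid = [[1, 4, 7, 11],
        [2, 5, 8, 12],
        [3, 6, 9, 16],
        [10, 13, 14, 17]]
print(staircase_search(grid, 9))    # (2, 2)
print(staircase_search(grid, 15))   # None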
And I believe I've seen one with a worst case of O(n+m) but an optimal case of nearly O(log n).
What I am curious about:
Right now, I have proved to my satisfaction that a naive partitioning attack always devolves to O(n log n). Partitioning attacks in general appear to have an optimal worst case of O(n+m), and most do not terminate early in cases of absence. I was also wondering, as a result, whether an interpolation probe might not be better than a binary probe, and thus it occurred to me that one might think of this as a set intersection problem with a weak interaction between sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicion that the optimality of an O(N+M) worst case is provable, I thought I'd just go ahead and ask here, to see if anyone could bash together a counter-argument, or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be at least Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't do that. It is easy to fix, however, because you just pick a big number X=n*m, replace all 0s with unique numbers less than X, all 2s with unique numbers greater than X, and 1 with X.
And because it is also Omega(lg n) (counting argument), it is Omega(m + lg n) where n>=m.
An optimal O(m+n) solution is to start at the top-left corner, which has the minimal value. Move diagonally downwards to the right until you hit an element whose value >= the value of the given element. If the element's value is equal to that of the given element, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j denote the column index of the diagonal element we have stopped at. (Here, we have i = j, BTW). Let k = 1.
Repeat the below steps while i-k >= 0
Search if a[i-k][j] is equal to the given element. if yes, return found as true.
Search if a[i][j-k] is equal to the given element. if yes, return found as true.
Increment k
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. Repeat this until we find a sum sk+1 greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2^(N/2) lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum problem on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(n/2)) operation) and save re-computing them. The values of all the possible subsets don't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
// Since we want all in a range, rather than just the first, we need to check
// all combinations. Horowitz/Sahni is only designed to find one.
for (auto firstit = firstlist.begin(); firstit != firstlist.end(); ++firstit) {
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
        int sum = *firstit + *secondit;
        if (sum >= A && sum <= B)
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz Sahni, but we try and do some optimizations to reduce the constants in the BigOh.
We do the following
Step 1: Split the numbers into two halves of N/2 each, and generate all possible 2^(N/2) subsets (and their sums) for each half. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives a sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indices, one into S1 and one into S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible, for instance when sum from S2, is negative, we don't consider sums < A etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
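A small Python sketch of Step 1, generating all 2^(N/2) subset sums of one half with a single addition or subtraction per step via a Gray-code walk (names are mine):
def all_subset_sums(nums):
    # Walk subsets in Gray-code order: consecutive subsets differ in exactly
    # one element, so each new sum is one add or one subtract away.
    sums = [0]
    current, prev_gray = 0, 0
    for i in range(1, 1 << len(nums)):
        gray = i ^ (i >> 1)
        bit = (gray ^ prev_gray).bit_length() - 1   # the single changed bit
        if gray >> bit & 1:
            current += nums[bit]                    # element entered the subset
        else:
            current -= nums[bit]                    # element left the subset
        sums.append(current)
        prev_gray = gray
    return sums                                     # 2^len(nums) sums, unsorted

print(sorted(all_subset_sums([3, 5, 9])))   # [0, 3, 5, 8, 9, 12, 14, 17]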
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than the original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is a Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A,B) = (500000,600000)
# Subset iterator stolen from Sage
def subsets(X):
    yield []; pairs = []
    for x in X:
        pairs.append((2**len(pairs),x))
        for w in xrange(2**(len(pairs)-1), 2**(len(pairs))):
            yield [x for m, x in pairs if m & w]
# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S),S) for S in subsets(terms[:len(terms)/2])])
R = sorted([(sum(S),S) for S in subsets(terms[len(terms)/2:])])
(toolow,toohigh) = (-1,0)
for (Lsum,S) in reversed(L):
    while toolow < len(R)-1 and R[toolow+1][0] < A-Lsum: toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B-Lsum: toohigh += 1
    for n in xrange(toolow+1,toohigh):
        print '+'.join(map(str,S+R[n][1])),'=',sum(S+R[n][1])
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.

string transposition algorithm

Suppose we are given two Strings:
String s1= "MARTHA"
String s2= "MARHTA"
Here we exchange the positions of T and H. I am interested in writing code which counts how many changes are necessary to transform one String into another.
There are several edit distance algorithms; the given Wikipedia link has links to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broken up into permutation cycles in linear time. The cycles in the example are (2,1), (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix" it. Therefore, the number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
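Here is a Python sketch of that computation (assuming, as noted below, that no character repeats; the 0-based permutation p differs slightly from the notation above, but the cycle count is the same):
def min_swaps(s1, s2):
    # Minimum swaps to turn s2 into s1 when s2 is a permutation of s1 and all
    # characters are distinct: N minus the number of permutation cycles.
    n = len(s1)
    pos_in_s2 = {c: i for i, c in enumerate(s2)}
    p = [pos_in_s2[c] for c in s1]       # where each target character sits in s2
    seen = [False] * n
    cycles = 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:           # walk one cycle of the permutation
                seen[j] = True
                j = p[j]
    return n - cycles                    # a cycle of length k costs k-1 swaps

print(min_swaps("abcde", "badce"))   # 2, as in the example above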
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps=N-C needs a proof... I didn't find it in the web.
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in equality) between these two strings. Then you actually look for the count, but you should look for the smallest count (or at least I suppose so), otherwise there exist infinitely many ways to swap a string to obtain another one.
You should first check which characters are already in place, then for every character that is not, look whether there is a pair that can be swapped so that the distance between the strings is reduced. Then iterate until you finish the process.
If you don't want to effectively do it but just count the number of swaps use a bit array in which you have 1 for every well-placed character and 0 otherwise. You will finish when every bit is 1.
