Given values v1, v2, ..., which get updated one by one, maintain the maximum over subsets (v_i1, v_i2, ...) - algorithm

To set some notation, we have an array of size N consisting of non-negative floats V = [v1, v2, ..., vN], as well as M subsets S1, S2, ..., SM of {1, 2, ..., N} (the subsets will overlap). We are interested in the quantities w_j = max(v_i for i in Sj). The problem is to devise a data structure which can maintain w_j as efficiently as possible, while the values in the array V get updated one by one. We should assume that M >> N.
One idea is to construct the "inverse" of the subsets S, namely subsets T1, T2, ..., TN of {1, 2, ..., M} such that i in Sj if and only if j in Ti. Then, if vi is updated, scan every j in Ti and calculate w_j from scratch. This takes O(TN) time, where T is the maximum size of any Ti subset.
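A minimal sketch of this inverse-index idea, for reference. The class name `SubsetMaxima`, the 0-based indices, and the assumption that every subset is non-empty are illustrative choices, not part of the problem statement:

```python
class SubsetMaxima:
    """Naive baseline: recompute w_j from scratch for every subset touched by an update."""

    def __init__(self, values, subsets):
        self.values = list(values)
        self.subsets = [list(s) for s in subsets]  # S_j, as 0-based indices into values
        # inverse[i] plays the role of T_i: the subsets that contain index i.
        self.inverse = [[] for _ in self.values]
        for j, subset in enumerate(self.subsets):
            for i in subset:
                self.inverse[i].append(j)
        self.maxima = [max(self.values[i] for i in s) for s in self.subsets]

    def update(self, i, new_value):
        """Set v_i and refresh w_j for every j in T_i."""
        self.values[i] = new_value
        for j in self.inverse[i]:
            self.maxima[j] = max(self.values[k] for k in self.subsets[j])
```

Each update touches only the subsets in T_i, but recomputing each maximum costs up to O(N), which is where the O(TN) bound comes from.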
I believe I see a way to maintain these in O(T log N) time, but the algorithm involves a rather convoluted structure of copies of binary search trees and lookup tables. Is there a simpler data structure to use or a simple known solution to this problem? Does this problem have a name?
As well, since we have M >> N, it would be ideal to reduce the complexity from O(M), but is this even possible?
Edit: The goal is to construct some data structure which allows efficiently maintaining the maximums when the V array is updated. You cannot construct this data structure in less than O(M) time, but it may be possible to update it in less than that whenever a single entry of the V array changes.

As noted in my comment: we have M sets, which may overlap, and each set contains at least one element. Any algorithm therefore has to read each of the M sets at least once, so the lower bound for this problem is Ω(M).


What I'm trying to achieve is continuously add more values to a set and keep them as far apart from each other as possible. I'm sure there must be several algorithms out there to solve this problem, but I'm probably just not searching with the right terms. If someone could point me to a solution (doesn't need to be a particularly efficient one) that would be great.
Effectively, given a set of values S, within a range Min-Max, I need to calculate a new value V, within the same range, such that the sum of distances between V and all values in S gets maximized.
It's easy to show that possible candidates for V are either an already existing value of S or the minimum/maximum. Proof: Let S_1, S_2, ..., S_n be the sorted sequence of S, including min and max. If you choose S_i < V < S_{i+1}, then a sum of distances at least as large can be achieved with either V = S_i or V = S_{i+1}, depending on the number of points on the left and the right.
This observation yields an O(n^2) algorithm that just checks every potential candidate in S. It can be improved to O(n) by computing prefix sums upfront to compute the sum of distances in O(1) per element.
In general, since each element contributes two linear cost functions to the domain of possible values, this problem can be solved in O(log n) per query. You just need a data structure that can maintain a list of linear function segments and returns the point with maximum sum. A balanced binary search tree with some clever augmentation and lazy updates can solve this. Whether this is necessary or not of course depends on the number of elements and the number of queries you expect to perform.
I don't think there is a silver bullet solution to your problem, but this is how I would go about solving it generally. First, you need to define a function sumDistance() which takes in a new value V along with all the values in the current set, and outputs the sum of the distances between V and each value in the set.
Next, you can iterate over the domain d of sumDistance(), where Min <= d <= Max, and keep track of the sums for each value V in the domain. When you encounter a new largest sum, then record it. The V value which gave you the largest sum is the value you retain and add to your set.
This algorithm can be repeated for each new value you wish to add. Note that because this is essentially a one dimensional optimization problem, the running time should not be too bad so your first attempt might be good enough.
Assuming the distance d(a,b) = |a-b| then one of min and max will always yield a maximum.
Let's assume you have V that is not at an end point. You then have n1 values that are lower and n2 values that are higher. Moving V to the minimum increases the total distance by at least (n2 - n1) * (V - min), and moving it to the maximum increases it by at least (n1 - n2) * (max - V).
Since at least one of n1 - n2 and n2 - n1 must be non-negative, a maximum can always be found at one of the end points.
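A small sketch of what this end-point argument means in practice, assuming the distance is |a - b|. The function names `sum_distance` and `best_new_value` are illustrative:

```python
def sum_distance(v, values):
    """Sum of absolute distances between v and every value in the set."""
    return sum(abs(v - s) for s in values)


def best_new_value(values, lo, hi):
    """Return whichever end point of [lo, hi] maximizes the sum of distances."""
    return max((lo, hi), key=lambda v: sum_distance(v, values))
```

For example, `best_new_value([1, 2, 9], 0, 10)` compares the sums at 0 and at 10 and picks the larger. Note that with this objective every newly added value lands on one of the two end points.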

Is it possible to find the list of attributes which would yield the greatest sum without brute forcing?

I have about 2M records stored in a table.
Each record has a number and about 5K boolean attributes.
So the table looks something like this.
3, T, F, T, F, T, T, ...
29, F, F, T, F, T, T, ...
-87, T, F, T, F, T, T, ...
98, F, F, T, F, F, T, ...
And I defined SUM(A, B) as the sum of the numbers where Ath and Bth attributes are true.
For example, from the sample data above: SUM(1, 3) = 3 + ... + (-87) because the 1st and the 3rd attributes are T for 3 and -87
3, (T), F, (T), F, T, T, ...
29, (F), F, (T), F, T, T, ...
-87, (T), F, (T), F, T, T, ...
98, (F), F, (T), F, F, T, ...
And SUM() can take any number of parameters: SUM(1) and SUM(5, 7, ..., 3455) are all possible.
Are there some smart algorithms for finding a list of attributes L where SUM(L) would yield the maximum result?
Obviously, brute forcing is not feasible for this large data set.
It would be awesome if there is a way to find not only the maximum but top N lists.
It seems like it is not possible to find THE answer without brute forcing. If I changed the question to find a "good estimation", would there be a good way to do it?
Or, what if I said the cardinality of L is fixed to something like 10, would there be a way to calculate the L?
I would be happy with any.
Unfortunately, this problem is NP-complete. Your options are limited to finding a good but non-maximal solution with an approximation algorithm, or using branch-and-bound and hoping that you don't hit exponential runtime.
Proof of NP-completeness
To prove that your problem is NP-complete, we reduce the set cover problem to your problem. Suppose we have a set U of N elements, and a set S of M subsets of U, where the union of all sets in S is U. The set cover problem asks for the smallest subset T of S such that every element of U is contained in an element of T. If we had a polynomial-time algorithm to solve your problem, we could solve the set cover problem as follows:
First, construct a table with M+N rows and M attributes. The first N rows are "element" rows, each corresponding to an element of U. These have value "negative enough"; -M-1 should be enough. For element row i, the jth attribute is true if the corresponding element is not in the jth set in S.
The last M rows are "set" rows, each corresponding to a set in S. These have value 1. For set row N+i, the ith attribute is false and all others are true.
The values of the element rows are small enough that any choice of attributes that excludes all element rows beats any choice of attributes that includes any element row. Since the union of all sets in S is U, picking all attributes excludes all element rows, so the best choice of attributes is the one that includes the most set rows without including any element rows. By the construction of the table, a choice of attributes will exclude all element rows if the union of the corresponding sets is U, and if it does, its score will be better the fewer attributes it includes. Thus, the best choice of attributes corresponds directly to a minimum cover of S.
If we had a good algorithm to pick a choice of attributes that produces the maximal sum, we could apply it to this table to generate the minimum cover of an arbitrary S. Thus, your problem is as hard as the NP-complete set cover problem, and you should not waste your time trying to come up with an efficient algorithm to generate the perfect choice of attributes.
You could try a genetic algorithm approach, starting out with a certain (large) number of random attribute combinations, letting the worst x% die and mutating a certain percentage of the remaining population by adding/removing attributes.
There is no guarantee that you will find the optimal answer, but a good chance to find a good one within reasonable time.
No polynomial algorithms to solve this problem come to my mind. I can only suggest you a greedy heuristic:
For each attribute, compute its expected_score, i.e. the addend it would bring to your SUM, if selected alone. In your example, the score of 1 is 3 - 87 = -84.
Sort the attributes by expected_score in non-increasing order.
By following that order, greedily add to L the attributes. Call actual_score the score that the attribute a will actually bring to your sum (it can be better or worse than expected_score, depending on the attributes you already have in L). If actual_score(a) is not strictly positive, discard a.
This will not give you the optimal L, but I think a "fairly good" one.
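A rough sketch of this greedy heuristic. It assumes records are `(number, flags)` pairs with 0-based attribute indices, and it takes SUM of the empty list to be the sum of all numbers, so that actual_score is the change in SUM when an attribute is added. Both function names are illustrative:

```python
def attribute_sum(records, attrs):
    """SUM(attrs): total of the numbers whose listed attributes are all true."""
    return sum(num for num, flags in records if all(flags[a] for a in attrs))


def greedy_attributes(records):
    n_attrs = len(records[0][1])
    # expected_score: the sum an attribute yields when selected alone.
    expected = {a: attribute_sum(records, [a]) for a in range(n_attrs)}
    order = sorted(range(n_attrs), key=lambda a: expected[a], reverse=True)
    chosen = []
    current = attribute_sum(records, chosen)  # empty list: sum of all numbers
    for a in order:
        candidate = attribute_sum(records, chosen + [a])
        if candidate - current > 0:  # actual_score must be strictly positive
            chosen.append(a)
            current = candidate
    return chosen, current
```

Each candidate evaluation scans all records, so on 2M records with 5K attributes this is slow as written; it is meant to show the control flow, not a tuned implementation.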
Note: see below why this approach will not give the best results.
My first approach would be to start off with the special case L={} (which should give the sum of all integers) and add that to a list of solutions. From there add possible attributes as restrictions. In the first iteration, try each attribute in turn and remember those that gave a better result. After that iteration, put the remembered ones into a list of solutions.
In the second iteration, try to add another attribute to each of the remembered ones. Remember all those that improved the result. Remove duplicates from the remembered attribute combinations and add these to the list of solutions. Note that {m,n} is the same as {n,m}, so skip redundant combinations in order not to blow up your sets.
Repeat the second iterations until there are no more possible attributes that could be added to improve the final sum. If you then order the list of solutions by their sum, you get the requested solution.
Note that there are ~20G ways to select three attributes out of 5k, so you can't build a data structure containing those but you must absolutely generate them on demand. Still, the sheer amount can produce lots of temporary solutions, so you have to store those efficiently and perhaps even on disk. You can exploit the fact that you only need the previous iteration's solutions for the next iterations, not the ones before.
Another restriction here is that you can end up with less than N best solutions, because all those below L={} are not considered. In that case, I would accept all possible solutions until you have N solutions, and only once you have the N solutions discard those that don't give an improvement over the worst one.
Python code:
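# score(s) and extensions(s) are assumed helpers: score(s) returns SUM for the
# attribute set s, and extensions(s) yields s plus one extra attribute.
solutions = [frozenset()]
remembered = [frozenset()]
while remembered:
    previous = remembered
    remembered = []
    for s in previous:
        for xs in extensions(s):
            if score(xs) > score(s):
                remembered.append(frozenset(xs))
    # {m,n} is the same as {n,m}, so drop duplicates before the next round.
    remembered = list(set(remembered))
    solutions.extend(remembered)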
Why this doesn't work:
Consider a temporary solution consisting of the three records
-2, T, F
-2, F, T
+3, F, F
The overall sum of these is -1. When I now select the first attribute, I discard the second and third record, giving a sum of -2. When selecting the second attribute, I discard the first and third, giving the same sum of -2. When selecting both the first and second attribute, I discard all three records, giving a sum of zero, which is an improvement.

There is a sequence {a1, a2, a3, a4, ..... aN}. A run is the maximal strictly increasing or strictly decreasing continuous part of the sequence. Eg. If we have a sequence {1,2,3,4,7,6,5,2,3,4,1,2} We have 5 possible runs {1,2,3,4,7}, {7,6,5,2}, {2,3,4}, {4,1} and {1,2}.
Given four numbers N, M, K, L, count the number of possible sequences of N numbers that have exactly M runs, where each number in the sequence is less than or equal to K and the difference between adjacent numbers is less than or equal to L.
The question was asked during an interview.
I could only think of a brute force solution. What is an efficient solution for this problem?
Use dynamic programming. For each number in the substring maintain separate count of maximal increasing and maximally decreasing subsequences. When you incrementally add a new number to the end you can use these counts to update the counts for the new number. Complexity: O(n^2)
This can be rephrased as a recurrence problem. Look at your problem as finding #(N, M) (assume K and L are fixed, they are used in the recurrence conditions, so propagate accordingly). Now start with the more restricted count functions A(N, M; a) and D(N, M; a), where A counts those sequences with last run ascending, D counts those with last run descending, and a is the value of the last element in the sequence.
Express #(N, M) in terms of A(N, M; a) and D(N, M; a) (it's the sum over all allowable a). You might note that there are relations between the two (like the reflection A(N, M; a) = D(N, M; K-a)) but that won't matter much for the calculation except to speed table filling.
Now A(N, M; a) can be expressed in terms of A(N-1, M; w), A(N-1, M-1; x), D(N-1, M; y) and D(N-1, M-1; z). The idea is that if you start with a set of size N-1 and know the direction of the last run and the value of the last element, you know whether adding element a will add to an existing run or add a run. So you can count the number of possible ways to get what you want from the possibilities of the previous case.
I'll let you write this recursion down. Note that this is where you account for L (only add up those that obey the L distance restriction) and K (look for end cases).
Terminate the recursion using the fact that A(1, 1; a) = 1, A(1, x>1; a) = 0 (and similarly for D).
Now, since this is a multiple recursion, be sure your implementation stores results in a table and begins by trying lookup (commonly called dynamic programming).
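One way the recurrence could be filled in bottom-up is sketched below. It assumes values range over 1..K, that adjacent elements must differ (runs are strictly monotone), and that N >= 2; the question does not pin these down. The tables `up` and `down` play the roles of A and D:

```python
def count_sequences(n, m, k, l):
    """Count length-n sequences over 1..k with exactly m runs and adjacent steps of size 1..l."""
    if n < 2 or m < 1:
        raise ValueError("this sketch assumes n >= 2 and m >= 1")

    def empty():
        return [[0] * (k + 1) for _ in range(m + 1)]

    # up[r][a] / down[r][a]: sequences ending in value a with r runs,
    # whose last step was ascending / descending.
    up, down = empty(), empty()
    for a in range(1, k + 1):
        for b in range(max(1, a - l), min(k, a + l) + 1):
            if b > a:
                up[1][b] += 1
            elif b < a:
                down[1][b] += 1

    for _ in range(n - 2):
        new_up, new_down = empty(), empty()
        for r in range(1, m + 1):
            for a in range(1, k + 1):
                for b in range(max(1, a - l), min(k, a + l) + 1):
                    if b > a:
                        new_up[r][b] += up[r][a]  # extends the ascending run
                        if r < m:
                            new_up[r + 1][b] += down[r][a]  # starts a new run
                    elif b < a:
                        new_down[r][b] += down[r][a]  # extends the descending run
                        if r < m:
                            new_down[r + 1][b] += up[r][a]  # starts a new run
        up, down = new_up, new_down
    return sum(up[m]) + sum(down[m])
```

The table has O(M*K) entries per length and each entry looks at up to 2L neighbours, so the total work is O(N*M*K*L).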
I suppose you mean by 'brute force solution' what I might mean by 'straightforward solution involving nested-loops over N,M,K,L' ? Sometimes the straightforward solution is good enough. One of the times when the straightforward solution is good enough is when you don't have a better solution. Another of the times is when the numbers are not very large.
With that off my chest I would write the loops in the reverse direction, or something like that. I mean:
Create 2 auxiliary data structures, one to contain the indices of the numbers <=K, one for the indices of the numbers whose difference with their neighbours is <=L.
Run through the list of numbers and populate the foregoing auxiliary data structures.
Find the intersection of the values in those 2 data structures; these will be the indices of interesting places to start searching for runs.
Look in each of the interesting places.
Until someone demonstrates otherwise this is the most efficient solution.

For a part of a divide and conquer algorithm, I have the following question where the data structure is not fixed, so set is not to be taken literally:
Given a set X sorted wrt. some ordering of elements and subsets A and B together consisting of all elements in X, can sorted versions A' and B' of A and B be constructed in time linear in the number of elements in X ?
At the moment I am doing a standard sort at each recursive step giving the recursion
T(n) = 2*T(n/2) + O(n*log n)
for the complexity rather than
T(n) = 2*T(n/2) + O(n)
like in the procedural version, where one can utilize a structure with constant-time lookup on A and B to form A' and B' in linear time.
The added log n factor carries over to the overall complexity, giving O(n* (log n)^2) instead of O(n* log n).
Perhaps I am understanding the term lookup incorrectly. The creation of A' and B' in linear time is easy to do if membership of A and B can be checked in constant time.
I didn't succeed in my attempt at making things clearer by abstracting away the specifics, so here is the actual problem:
I am implementing the algorithm for the closest pair problem. Given a finite collection P of points in the plane it finds a pair of points in P with the minimal distance. It works roughly as follows:
If P has at least 4 points, form Px and Py, the points in P sorted by x- and y-coordinate. By splitting Px form L and R, the left- and right-most halves of points. Recursively compute the closest pair distance in L and R, and let d be the minimum of the two. Now the minimum distance in P is either d or the distance from a point in L to a point in R. If the minimal distance is between points from separate halves, it will appear between a pair of points lying in the strip of width 2*d centered around the line x = x0, where x0 is the x-coordinate of a right-most point in L. It turns out that to find a potential minimal distance pair in the strip, it is enough to compute for every point in the strip its distance to the seven following points if the strip points are in a collection sorted by y-coordinate.
It is in the steps of forming the sorted collections to pass into the recursion, and of sorting the strip points by y-coordinate, where I don't see how to utilize, in Haskell, having sorted P at the beginning of the recursion.
The following function may interest you:
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition f xs = (filter f xs, filter (not . f) xs)
If you can compute set-membership in constant time, that is, there is a predicate of type a -> Bool that runs in constant time, then partition will run in time linear in the length of its input list. Furthermore, partition is stable, so that if its input list is sorted, then so are both output lists.
I would also like to point out that the above definition is meant to give the semantics of partition only; the real implementation in GHC only walks its input list once, even if the entire output is forced.
Of course, the real crux of the question is providing a constant-time predicate. The way you phrased the question leaves sets A and B quite unstructured -- you demand that we can handle any particular partitioning. In that case, I don't know of any particularly Haskell-y way of doing constant-time lookup in arbitrary sets. However, often these problems are a bit more structured: often, rather than set-membership, you are actually interested in whether some easily-computable property holds or not. In this case, the above is just what the doctor ordered.
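To illustrate the "easily-computable property" case on the closest pair problem, here is a small sketch (in Python rather than Haskell, for brevity) that splits a y-sorted point list into left and right halves by comparing x-coordinates against x0. The helper name `stable_partition` and the sample points are made up for the example:

```python
def stable_partition(predicate, items):
    """Split items into (matching, non-matching), preserving the input order in both."""
    yes, no = [], []
    for item in items:
        (yes if predicate(item) else no).append(item)
    return yes, no


# Points sorted by y-coordinate once, before the recursion starts.
points_by_y = sorted([(3, 1), (0, 2), (5, 3), (1, 4)], key=lambda p: p[1])
x0 = 1  # x-coordinate of a right-most point in the left half
left_by_y, right_by_y = stable_partition(lambda p: p[0] <= x0, points_by_y)
```

Both outputs stay sorted by y without re-sorting, because the single pass never reorders elements. If several points share the x-coordinate x0, the predicate would need a tie-break to keep the halves balanced.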
I know very very little about Haskell but here's a shot anyway.
Given that (A+B) == X, can't you just iterate through X (in the sorted order) and add each element to A' or B' depending on whether it exists in A or B? Given constant-time lookup of an element x in the sets A and B, that would be linear.

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. Repeat this until we find a sum s_{k+1} greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2^(N/2) lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(n/2)) operation) and save re-computing them. The values of all the possible subsets don't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
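#include <algorithm>
#include <vector>

// Collect every sum of one element from each list that lies in [A, B].
// firstlist and secondlist are the two half-lists of subset sums, filled by the caller.
std::vector<int> sumsInRange(std::vector<int> firstlist, std::vector<int> secondlist, int A, int B) {
    std::vector<int> sums;
    std::sort(firstlist.begin(), firstlist.end());
    std::sort(secondlist.begin(), secondlist.end());
    // Since we want all in a range, rather than just the first, we need to check all combinations. Horowitz/Sahni is only designed to find one.
    for (auto firstit = firstlist.begin(); firstit != firstlist.end(); ++firstit) {
        // Restart the inner iterator for every element of the first list.
        for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
            int sum = *firstit + *secondit;
            if (sum >= A && sum <= B)  // the question's range [A,B] is inclusive
                sums.push_back(sum);
        }
    }
    return sums;
}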
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz Sahni, but we try and do some optimizations to reduce the constants in the BigOh.
We do the following
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indexes in S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible, for instance when sum from S2, is negative, we don't consider sums < A etc.
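Steps 2 to 4 might look roughly like the following sketch. It assumes both lists of half-sums already include 0 for the empty subset, which folds Step 3 into the main loop; the function name `sums_in_range` is illustrative:

```python
from bisect import bisect_left, bisect_right


def sums_in_range(s1_sums, s2_sums, a, b):
    """Return every x + y with x in s1_sums, y in s2_sums and a <= x + y <= b."""
    s1 = sorted(s1_sums)  # Step 2: sort only one of the two lists
    found = []
    for y in s2_sums:
        # Step 4: binary search for the slice of s1 that lands in [a - y, b - y].
        lo = bisect_left(s1, a - y)
        hi = bisect_right(s1, b - y)
        found.extend(x + y for x in s1[lo:hi])
    return found
```

The pair `(lo, hi)` is the index representation mentioned in Step 4; expanding each slice into actual sums is what adds the O(K) term.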
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
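As a sketch of this step: in the standard reflected Gray code, the bit that changes when moving from pattern i-1 to pattern i is the lowest set bit of i, so each new sum is the previous one plus or minus a single term. The function name `subset_sums_gray` is illustrative:

```python
def subset_sums_gray(terms):
    """All 2**len(terms) subset sums, visited in Gray-code order."""
    sums = [0]
    gray = 0      # current bit pattern: bit j set means terms[j] is in the subset
    current = 0   # sum of the current subset
    for i in range(1, 1 << len(terms)):
        bit = (i & -i).bit_length() - 1  # position of the single bit that flips
        gray ^= 1 << bit
        if (gray >> bit) & 1:
            current += terms[bit]  # term just entered the subset
        else:
            current -= terms[bit]  # term just left the subset
        sums.append(current)
    return sums
```

The sums come out in Gray-code order, not sorted order, so Step 2 still has to sort one of the lists.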
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is a Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)

# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**len(pairs)):
            yield [x for m, x in pairs if m & w]

# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S), S) for S in subsets(terms[:len(terms)//2])])
R = sorted([(sum(S), S) for S in subsets(terms[len(terms)//2:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    # Check the index bound first so R is never read out of range.
    while toolow < len(R)-1 and R[toolow+1][0] < A-Lsum: toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B-Lsum: toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S+R[n][1])), '=', sum(S+R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.
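The combined generate-and-sort idea described above (merge the sorted sums that exclude a term with the same sums shifted by that term) can be sketched as follows. The function name `sorted_subset_sums` is illustrative, and the sketch returns only the sums, not the subsets that produce them:

```python
import heapq


def sorted_subset_sums(terms):
    """Build the sorted list of all subset sums by repeated two-way merging."""
    sums = [0]  # sums of subsets of the terms processed so far, kept sorted
    for t in terms:
        shifted = [s + t for s in sums]            # subsets that include t; still sorted
        sums = list(heapq.merge(sums, shifted))    # merge two sorted lists
    return sums
```

Each merge is linear in the size of its output and the list doubles at every step, so the total work is proportional to the final length 2^n, with no separate sorting pass.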
