Limiting grouped sets by total items - algorithm

I have a set of sets of items: {1}, {2}, {2,3}, {13,7}, {7,2,18}, and limits (item => max number of occurrences): count(1)<=10, count(2)<=2, count(3)<=5, count(7)<=1, count(13)<=10, count(18)<=10 (i.e., no more than 2 occurrences of item 2 in total). I need to find the best subset of the initial set that fits the limits. E.g., {2}, {2,3}, {7,2,18} doesn't fit, because it contains item 2 three times in total, while the limit allows only 2.
Inner sets are immutable, e.g. {7,2,18} can't be split. Inner sets can be of any size (in practice they have about 1-5 items).
The definition of "best" is somewhat vague in my case. I'm fine with either the subset that keeps the most sets, or the subset that keeps the most items in total.
Currently, I have this (for the case "the subset that keeps the most sets"):
calculate current totals per item ({1=>1, 2=>3, 3=>1, 7=>2, 13=>1, 18=>1})
find items that exceed their limit ({2, 7})
find sets that contain affected items ({2}, {2,3}, {13,7}, {7,2,18})
generate all subsets of this smaller set (2^4 = 16 subsets, including the empty one)
check each subset against the limits
stop at the first subset that satisfies the limits
My problem: I'm not sure if my solution is optimal, and it has exponential complexity.
Is there a better solution?
(In practice, it's rare that an item hits its limit.)

The expensive step for you is enumerating all subsets of the set-of-sets to see which ones are valid from a limits perspective. You are generating an O(2^n) tree (with n the number of initial inner sets) and then looking at all leaves.
You can significantly speed this up by searching more promising candidates first, and stopping as soon as an (optimal) solution is reached. For example, going for the "the subset with the most sets is best" goal, you can use the following pseudocode:
place the initial state S (= the set of all inner sets) in queue Q
do
    pop the first state S from Q
    if S is valid, return it: it is an optimal answer
    else
        for each subset I in the current state S
            push state S - I (that is, the state that does not include subset I) onto Q
while Q is not empty
if this line is reached, return "no answer possible"
Note that, in the worst case (= no possible answer), this is just as bad as what you had before. But if there are few subsets to remove, this will find them in O(2^r), where r is the minimal number of subsets that needs to be removed to reach an optimal answer.
You can further speed things up if you avoid revisiting states: an (invalid) state with r removed subsets will be reached 2^r times. Note however that this is a space-time tradeoff: the larger the visited-state cache, the more memory you will need.
Other optimizations are possible, such as heuristics for choosing which subsets to remove first; for example, subsets that contain items that are currently exceeding their limits should be prioritized for removal over those that contain no such items.
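For concreteness, here is a minimal Python sketch of that search for the "most sets" goal. With that objective every removal costs the same, so a plain FIFO queue (breadth-first search) already pops states in order of how many sets they keep; the names and the frozenset-of-indices state encoding are my own assumptions, not part of the answer above.

from collections import deque

# Sketch only: `sets` is the list of inner sets, `limits` maps item -> max total count.
def best_subset(sets, limits):
    def valid(indices):
        counts = {}
        for i in indices:
            for item in sets[i]:
                counts[item] = counts.get(item, 0) + 1
        return all(counts.get(item, 0) <= cap for item, cap in limits.items())

    start = frozenset(range(len(sets)))      # state = indices of the inner sets still kept
    queue = deque([start])
    seen = {start}                           # visited-state cache to avoid re-expanding states
    while queue:
        state = queue.popleft()
        if valid(state):
            return [sets[i] for i in state]  # first valid state popped keeps the most sets
        for i in state:                      # try dropping each remaining inner set
            nxt = state - {i}
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return []                                # never reached: the empty selection is always valid

sets = [{1}, {2}, {2, 3}, {13, 7}, {7, 2, 18}]
limits = {1: 10, 2: 2, 3: 5, 7: 1, 13: 10, 18: 10}
print(best_subset(sets, limits))             # drops {7,2,18}, keeping four sets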

Related

Are there sorting algorithms that respect final position restrictions and run in O(n log n) time?

I'm looking for a sorting algorithm that honors a min and max range for each element¹. The problem domain is a recommendations engine that combines a set of business rules (the restrictions) with a recommendation score (the value). If we have a recommendation we want to promote (e.g. a special product or deal) or an announcement we want to appear near the top of the list (e.g. "This is super important, remember to verify your email address to participate in an upcoming promotion!") or near the bottom of the list (e.g. "If you liked these recommendations, click here for more..."), they will be curated with certain position restrictions in place. For example, this should always be the top position, these should be in the top 10, or middle 5, etc. This curation step is done ahead of time and remains fixed for a given time period and for business reasons must remain very flexible.
Please don't question the business purpose, UI or input validation. I'm just trying to implement the algorithm in the constraints I've been given. Please treat this as an academic question. I will endeavor to provide a rigorous problem statement, and feedback on all other aspects of the problem is very welcome.
So if we were sorting chars, our data would have a structure of
struct {
    char value;
    Integer minPosition;
    Integer maxPosition;
}
Where minPosition and maxPosition may be null (unrestricted). If the algorithm were called on an input where all position restrictions were null, or all minPositions were 0 or less and all maxPositions were equal to or greater than the size of the list, then the output would just be the chars in ascending order.
This algorithm would only reorder two elements if the minPosition and maxPosition of both elements would not be violated by their new positions. An insertion-based algorithm which promotes items to the top of the list and reorders the rest has obvious problems in that every later element would have to be revalidated after each iteration; in my head, that rules out such algorithms for having O(n³) complexity, but I won't rule out such algorithms without considering evidence to the contrary, if presented.
In the output list, certain elements will be out of order with regard to their value, if and only if the set of position constraints dictates it. These outputs are still valid.
A valid list is any list where all elements are in a position that does not conflict with their constraints.
An optimal list is a list which cannot be reordered to more closely match the natural order without violating one or more position constraints. An invalid list is never optimal. I don't have a strict definition I can spell out for 'more closely matching' between one ordering and another. However, I think it's fairly easy to let intuition guide you, or choose something similar to a distance metric.
Multiple optimal orderings may exist if multiple inputs have the same value. You could make an argument that the above paragraph is therefore incorrect, because either one can be reordered to the other without violating constraints and therefore neither can be optimal. However, any rigorous distance function would treat these lists as identical, with the same distance from the natural order and therefore reordering the identical elements is allowed (because it's a no-op).
I would call such outputs the correct, sorted order which respects the position constraints, but several commentators pointed out that we're not really returning a sorted list, so let's stick with 'optimal'.
For example, the following are input lists (in the form <char>(<minPosition>:<maxPosition>), where Z(1:1) indicates a Z that must be at the front of the list, M(-:-) indicates an M that may be in any position in the final list, and the natural order (sorted by value only) is A...M...Z), along with their optimal orders.
Input order
A(1:1) D(-:-) C(-:-) E(-:-) B(-:-)
Optimal order
A B C D E
This is a trivial example to show that the natural order prevails in a list with no constraints.
Input order
E(1:1) D(2:2) C(3:3) B(4:4) A(5:5)
Optimal order
E D C B A
This example is to show that a fully constrained list is output in the same order it is given. The input is already a valid and optimal list. The algorithm should still run in O(n log n) time for such inputs. (Our initial solution is able to short-circuit any fully constrained list to run in linear time; I added the example both to drive home the definitions of optimal and valid and because some swap-based algorithms I considered handled this as the worst case.)
Input order
E(1:1) C(-:-) B(1:5) A(4:4) D(2:3)
Optimal Order
E B D A C
E is constrained to 1:1, so it is first in the list even though it has the lowest value. A is similarly constrained to 4:4, so it is also out of natural order. B has essentially identical constraints to C and may appear anywhere in the final list, but B will be before C because of value. D may be in positions 2 or 3, so it appears after B because of natural ordering but before C because of its constraints.
Note that the final order is correct despite being wildly different from the natural order (which is still A,B,C,D,E). As explained in the previous paragraph, nothing in this list can be reordered without violating the constraints of one or more items.
Input order
B(-:-) C(2:2) A(-:-) A(-:-)
Optimal order
A(-:-) C(2:2) A(-:-) B(-:-)
C remains unmoved because it is already in its only valid position. B is reordered to the end because it sorts after both A's. In reality, there will be additional fields that differentiate the two A's, but from the standpoint of the algorithm, they are identical and preserving OR reversing their input ordering is an optimal solution.
Input order
A(1:1) B(1:1) C(3:4) D(3:4) E(3:4)
Undefined output
This input is invalid for two reasons: 1) A and B are both constrained to position 1, and 2) C, D, and E are constrained to a range that can only hold 2 elements. In other words, the ranges 1:1 and 3:4 are over-constrained. However, the consistency and legality of the constraints are enforced by UI validation, so it's officially not the algorithm's problem if they are incorrect, and the algorithm can return a best-effort ordering OR the original ordering in that case. Passing an input like this to the algorithm may be considered undefined behavior; anything can happen. So, for the rest of the question...
All input lists will have elements that are initially in valid positions.
The sorting algorithm itself can assume the constraints are valid and an optimal order exists.²
We've currently settled on a customized selection sort (with runtime complexity of O(n²)) and reasonably proved that it works for all inputs whose position restrictions are valid and consistent (e.g. not overbooked for a given position or range of positions).
Is there a sorting algorithm that is guaranteed to return the optimal final order and run in better than O(n²) time complexity?³
I feel that a library standard sorting algorithm could be modified to handle these constraints by providing a custom comparator that accepts the candidate destination position for each element. This would be equivalent to the current position of each element, so maybe modifying the value holding class to include the current position of the element and doing the extra accounting in the comparison (.equals()) and swap methods would be sufficient.
However, the more I think about it, an algorithm that runs in O(n log n) time could not work correctly with these restrictions. Intuitively, such algorithms are based on running n comparisons log n times. The log n is achieved by leveraging a divide and conquer mechanism, which only compares certain candidates for certain positions.
In other words, input lists with valid position constraints (i.e. counterexamples) exist for any O(n log n) sorting algorithm where a candidate element would be compared with an element (or range in the case of Quicksort and variants) with/to which it could not be swapped, and therefore would never move to the correct final position. If that's too vague, I can come up with a counter example for mergesort and quicksort.
In contrast, an O(n²) sorting algorithm makes exhaustive comparisons and can always move an element to its correct final position.
To ask an actual question: Is my intuition correct when I reason that an O(n log n) sort is not guaranteed to find a valid order? If so, can you provide more concrete proof? If not, why not? Is there other existing research on this class of problem?
1: I've not been able to find a set of search terms that points me in the direction of any concrete classification of such sorting algorithm or constraints; that's why I'm asking some basic questions about the complexity. If there is a term for this type of problem, please post it up.
2: Validation is a separate problem, worthy of its own investigation and algorithm. I'm pretty sure that the existence of a valid order can be proven in linear time:
Allocate array of tuples of length equal to your list. Each tuple is an integer counter k and a double value v for the relative assignment weight.
Walk the list, adding the fractional value of each element's position constraint to the corresponding range and incrementing its counter by 1 (e.g. range 2:5 on a list of 10 adds 0.4 to each of 2, 3, 4, and 5 in our tuple list, incrementing the counter of each as well)
Walk the tuple list and
If no entry has value v greater than the sum of the series from 1 to k of 1/k, a valid order exists.
If there is such a tuple, the position it is in is over-constrained; throw an exception, log an error, use the doubles array to correct the problem elements etc.
Edit: This validation algorithm itself is actually O(n²). Worst case, every element has the constraint 1:n, and you end up walking your list of n tuples n times. This is still irrelevant to the scope of the question, because in the real problem domain, the constraints are enforced once and don't change.
Determining that a given list is in a valid order is even easier. Just check each element's current position against its constraints.
3: This is admittedly a little bit premature optimization. Our initial use for this is for fairly small lists, but we're eyeing expansion to longer lists, so if we can optimize now we'd get small performance gains now and large performance gains later. And besides, my curiosity is piqued and if there is research out there on this topic, I would like to see it and (hopefully) learn from it.
On the existence of a solution: You can view this as a bipartite digraph with one set of vertices (U) being the k values, and the other set (V) the k ranks (1 to k), and an arc from each vertex in U to its valid ranks in V. Then the existence of a solution is equivalent to the maximum matching being a bijection. One way to check for this is to add a source vertex with an arc to each vertex in U, and a sink vertex with an arc from each vertex in V. Assign each edge a capacity of 1, then find the max flow. If it's k then there's a solution, otherwise not.
http://en.wikipedia.org/wiki/Maximum_flow_problem
--edit-- O(k^3) solution: First sort to find the sorted rank of each vertex (1 to k). Next, consider your values and ranks as 2 sets of k vertices, U and V, with weighted edges from each vertex in U to all of its legal ranks in V. The weight to assign each edge is the distance from the vertex's rank in sorted order. E.g., if U is 10 to 20, then the natural rank of 10 is 1. An edge from value 10 to rank 1 would have a weight of zero, to rank 3 a weight of 2. Next, assume all missing edges exist and assign them infinite weight. Lastly, find the "MINIMUM WEIGHT PERFECT MATCHING" in O(k^3).
http://www-math.mit.edu/~goemans/18433S09/matching-notes.pdf
This does not take advantage of the fact that the legal ranks for each element in U are contiguous, which may help get the running time down to O(k^2).
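As an illustration only (not part of the answer above), here is how that minimum-weight matching could be wired up with SciPy's assignment solver; the item encoding, the BIG stand-in for the "infinite" weights, and all names are my own assumptions:

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ranks(items):
    """items: list of (natural_rank, min_rank, max_rank), all ranks 1-based."""
    k = len(items)
    BIG = 10 * k * k                           # stands in for "infinite" weight on illegal edges
    cost = np.full((k, k), BIG, dtype=float)
    for i, (natural, lo, hi) in enumerate(items):
        for r in range(lo, hi + 1):            # legal ranks only
            cost[i, r - 1] = abs(natural - r)  # distance from the natural (sorted) rank
    rows, cols = linear_sum_assignment(cost)   # Hungarian-style minimum-weight perfect matching
    if cost[rows, cols].sum() >= BIG:          # some element was forced onto an illegal rank
        return None                            # the constraints are over-constrained
    return {i: r + 1 for i, r in zip(rows, cols)}   # element index -> assigned rank

If the optimum has to use a BIG edge, no legal perfect matching exists, so this doubles as the feasibility check from the max-flow formulation; as the answer notes, it still ignores the fact that each element's legal ranks are contiguous.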
Here is what a coworker and I have come up with. I think it's an O(n²) solution that returns a valid, optimal order if one exists, and a closest-possible effort if the initial ranges were over-constrained. I just tweaked a few things about the implementation and we're still writing tests, so there's a chance it doesn't work as advertised. This over-constrained condition is detected fairly easily when it occurs.
To start, things are simplified if you normalize your inputs to have all non-null constraints. In linear time, that is:
for each item in input
    if an item doesn't have a minimum position, set it to 1
    if an item doesn't have a maximum position, set it to the length of your list
The next goal is to construct a list of ranges, each containing all of the candidate elements that have that range, ordered by the remaining capacity of the range (ascending, so ranges with the fewest remaining spots come first), then by the start position of the range, then by its end position. This can be done by creating a set of such ranges, then sorting them in O(n log n) time with a simple comparator.
For the rest of this answer, a range will be a simple object like so
class Range<T> implements Collection<T> {
    int startPosition;
    int endPosition;
    Collection<T> items;

    public int remainingCapacity() {
        return endPosition - startPosition + 1 - items.size();
    }

    // implement Collection<T> methods, passing through to the items collection

    public void add(T item) {
        // Validity checking here exposes some simple cases of over-constraining
        // We'll catch these cases with the tricky stuff later anyways, so don't choke
        items.add(item);
    }
}
If an element A has range 1:5, construct a range(1,5) object and add A to its elements. This range has remaining capacity of 5 - 1 + 1 - 1 (max - min + 1 - size) = 4. If an element B has range 1:5, add it to your existing range, which now has capacity 3.
Then it's a relatively simple matter of picking the best element that fits each position 1 => k in turn. Iterate your ranges in their sorted order, keeping track of the best eligible element, with the twist that you stop looking once you've reached a range whose remaining elements need every one of its remaining positions. This is equivalent to the simple check range.max - current position + 1 > range.size (which can probably be simplified, but I think it's most understandable in this form): as long as it holds, keep considering further ranges. Remove each element from its range as it is selected, and remove each range from your list as it is emptied (optional; iterating an empty range will yield no candidates). That's a poor explanation, so let's work through one of the examples from the question. Note that C(-:-) has been updated to the sanitized C(1:5) as described above.
Input order
E(1:1) C(1:5) B(1:5) A(4:4) D(2:3)
Built ranges (min:max) <remaining capacity> [elements]
(1:1)0[E] (4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Find best for 1
Consider (1:1), best element from its list is E
Consider further ranges?
range.max - current position + 1 > range.size ?
range.max = 1; current position = 1; range.size = 1;
1 - 1 + 1 > 1 = false; do not consider subsequent ranges
Remove E from range, add to output list
Find best for 2; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Consider (4:4); skip it because it is not eligible for position 2
Consider (2:3); best element is D
Consider further ranges?
3 - 2 + 1 > 1 = true; check next range
Consider (1:5); best element is B
End of range list; remove B from range, add to output list
An added simplifying factor is that the capacities do not need to be updated or the ranges reordered. An item is only removed if the rest of the higher-sorted ranges would not be disturbed by doing so. The remaining capacity is never checked after the initial sort.
Find best for 3; output is now E, B; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C]
Consider (4:4); skip it because it is not eligible for position 3
Consider (2:3); best element is D
Consider further ranges?
same as previous check, but current position is now 3
3 - 3 + 1 > 1 = false; don't check next range
Remove D from range, add to output list
Find best for 4; output is now E, B, D; current range list is:
(4:4)0[A] (1:5)3[C]
Consider (4:4); best element is A
Consider further ranges?
4 - 4 + 1 > 1 = false; don't check next range
Remove A from range, add to output list
Output is now E, B, D, A and there is one element left to be checked, so it gets appended to the end. This is the output list we desired to have.
This build process is the longest part. At its core, it's a straightforward n² selection sorting algorithm. The range constraints only work to shorten the inner loop and there is no loopback or recursion; but the worst case (I think) is still the sum from i = 0 to n of (n − i), which is n²/2 − n/2.
The detection step comes into play by not excluding a candidate range if the current position is beyond that range's max position. You have to track the range your best candidate came from in order to remove it, so when you do the removal, just check whether the position you're extracting the candidate for is greater than that range's endPosition.
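To make the walkthrough above concrete, here is a rough Python sketch of the same pass. It is my own paraphrase of the steps described in this answer (normalize, bucket elements into ranges, sort the ranges, then pick the best eligible element per position), so treat the names and details as assumptions rather than the actual implementation:

from collections import defaultdict

def constrained_order(items):
    """items: list of (value, min_pos, max_pos); positions are 1-based, None = unrestricted."""
    n = len(items)
    # 1. normalize null constraints
    norm = [(v, lo or 1, hi or n) for v, lo, hi in items]
    # 2. bucket elements by their (min, max) range
    ranges = defaultdict(list)
    for v, lo, hi in norm:
        ranges[(lo, hi)].append(v)
    # 3. sort ranges: fewest remaining slots first, then start, then end
    order = sorted(ranges, key=lambda r: (r[1] - r[0] + 1 - len(ranges[r]), r[0], r[1]))
    out = []
    for pos in range(1, n + 1):
        best_key, best_val = None, None
        for lo, hi in order:
            members = ranges[(lo, hi)]
            if not members or lo > pos:
                continue                       # emptied, or not yet eligible for this position
            cand = min(members)
            if best_val is None or cand < best_val:
                best_key, best_val = (lo, hi), cand
            if not (hi - pos + 1 > len(members)):
                break                          # this range has no slack left; stop looking further
        if best_key is None:                   # only happens on over-constrained input
            best_key = next(k for k in order if ranges[k])
            best_val = min(ranges[best_key])
        if pos > best_key[1]:
            print("over-constrained at position", pos)   # the detection check described above
        ranges[best_key].remove(best_val)
        out.append(best_val)
    return out

print(constrained_order([('E', 1, 1), ('C', None, None), ('B', 1, 5), ('A', 4, 4), ('D', 2, 3)]))
# -> ['E', 'B', 'D', 'A', 'C'], matching the worked example above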
I have several other counter-examples that foiled my earlier algorithms, including a nice example that shows several over-constraint detections on the same input list and also how the final output is as close to optimal as the constraints will allow. In the meantime, please post any optimizations you can see and especially any counterexamples where this algorithm makes an objectively incorrect choice (i.e. arrives at an invalid or suboptimal output when one exists).
I'm not going to accept this answer, because I specifically asked if it could be done in better than O(n²). I haven't wrapped my head around the constraint satisfaction approach in #DaveGalvin's answer yet and I've never done a maximum flow problem, but I thought this might be helpful for others to look at.
Also, I discovered the best way to come up with valid test data is to start with a valid list and randomize it: for each position i, create a random value and constraints such that min < i < max. (Again, posting it because it took me longer than it should have to come up with, and others might find it helpful.)
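A tiny sketch of that test-data generator (my own code, using the inclusive min <= i <= max variant; drop the shuffle if you also need the input to start with every element in a valid position):

import random

def make_test_case(n):
    items = []
    for i in range(1, n + 1):                  # i is the element's position in a known-valid order
        value = random.randint(0, 100)
        lo = random.randint(1, i) if random.random() < 0.5 else None   # None = unrestricted
        hi = random.randint(i, n) if random.random() < 0.5 else None
        items.append((value, lo, hi))
    random.shuffle(items)                      # the seed ordering still satisfies every constraint
    return items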
Not likely*. I assume you mean an average run time of O(n log n), in-place, non-stable, off-line. Most sorting algorithms that improve on bubble sort's average run time of O(n^2), like Timsort, rely on the assumption that comparing two elements in a subset will produce the same result in the superset. A slower variant of quicksort would be a good approach for your range constraints. The worst case won't change, but the average case will likely decrease, and the algorithm will have the extra constraint of a valid sort existing.
Is ... O(n log n) sort is not guaranteed to find a valid order?
All popular sort algorithms I am aware of are guaranteed to find an order so long as their constraints are met. Formal analysis (concrete proof) is on each sort algorithm's Wikipedia page.
Is there other existing research on this class of problem?
Yes; there are many journals like IJCSEA with sorting research.
*but that depends on your average data set.

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
Sort the n integers (for millions of integers i.e. 10s to 100s of MB of ram, this should not be a problem), and sum the k-1 smallest. Call this total offset.
Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
Restrict your set of n integers to those less than or equal to this bound.
Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
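A rough Python sketch of that sampling loop, where satisfies stands in for the unspecified property P and max_misses for the "fixed number of iterations"; everything else (names, defaults) is my own framing:

import bisect
import random

def smallest_sum_subset(nums, k, satisfies, max_misses=100000):
    nums = sorted(nums)                       # step 1: sort once, O(n log n)
    offset = sum(nums[:k - 1])                # sum of the k-1 smallest elements
    best = None
    misses = 0
    while misses < max_misses and len(nums) >= k:
        subset = random.sample(nums, k)       # step 2: random k-subset
        if not satisfies(subset):
            misses += 1
            continue
        total = sum(subset)                   # step 3: note the sum-total of the match
        if best is None or total < sum(best):
            best = subset
        # steps 3-4: any k-subset with an equal or smaller sum can contain no element
        # larger than total - offset, so shrink the candidate pool with a binary search
        cut = bisect.bisect_right(nums, total - offset)
        nums = nums[:cut]
        misses = 0                            # step 5: restart the miss counter
    return best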
Even if only 1 in 1000 of the k-sized sets meets your condition, that's still far too many combinations to test. I believe the runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28 / 1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds, or about 10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant amount of (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take 3 quadrillion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
min   sum(i) w_i * x_i
s.t.  sum(i) x_i = k
      x_i binary
      (some constraints on w_i * x_i)
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
    for v = 5-element subsets of (1, ..., i)
        set = arr{v}
        if condition(set) is satisfied
            break_loop = true
            compute sum(set), keep set if it is the best so far
    break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
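An illustrative Python version of that loop (my own sketch; condition stands in for the unspecified test, and the inner filter just avoids re-testing combinations already seen at smaller i):

from itertools import combinations

def best_satisfying_subset(arr, condition, k=5):
    arr = sorted(arr)
    best_set, best_sum = None, None
    for i in range(k, len(arr) + 1):          # grow the allowed maximum index
        found_here = False
        for idx in combinations(range(i), k):
            if idx[-1] != i - 1:              # only subsets that actually use the new index
                continue
            subset = [arr[j] for j in idx]
            if condition(subset):
                s = sum(subset)
                if best_sum is None or s < best_sum:
                    best_set, best_sum = subset, s
                found_here = True
        if found_here:
            break                             # larger max indices only give larger minimum sums
    return best_set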
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

finding max value on each subset

I'm banging my head here. Let X={x1,x2,...,xn} be an integer set, and let A1,A2,...,Am be m subsets of X. For any i and j, Ai and Aj are not necessarily disjoint. Now the goal is to find the maximum value of each Ai (i=1,...,m) efficiently, with as few operations as possible.
For example, given X={2,4,6,3,1}, and its subsets A1={2,3,1}, A2={2,6,3,1}, A3={4,2,3,1}. We need to find Max{A1}, Max{A2}, Max{A3}, respectively.
The brute-force way for finding Max{A1}, Max{A2}, Max{A3} is to scan all the elements in each Ai, and (m*d) operations are required, with m the number of subsets of X, and d the average length of the subsets {Ai} of X.
Now, I have some observations:
(1) For any set Y⊆X, max{Y}≤max{X}.
For instance, since Max{X}=6 and 6 is in A2, Max{A2}=6 can be found directly.
(2) For any two sets A and B, if A∩B is non-empty, Max{A} and Max{B} can be identified as follows:
First, we find the maximum of the common part of A and B, denoted c=max{A∩B}.
Then, we find Max{A}=Max{Max{A-(A∩B)}, c} and Max{B}=Max{Max{B-(A∩B)}, c}.
I am not sure whether there are other interesting observations for finding these max values.
Any ideas are warmly welcome!
My question is: for the general case, when X={x1,x2,...,xn} and there are m subsets of X, denoted A1,A2,...,Am, are there more efficient techniques to find the max values Max{Ai} (i=1,...,m)?
Your help will be highly appreciated!
There is no method asymptotically better than brute force, assuming a typical representation of the given sets. Simply scanning through the sets to find the largest member of each requires linear time and linear time is optimal since every member of the set must be read in order to determine the maximum value.
Now if the input representation is not simply a listing of the elements in each set, then other bounds and algorithms may apply. For example, if we know the input sets are sorted and the length of each set is given as part of the input, we can obviously find the maximum elements in time linear only in the number of subsets, not in their length.
If your sets are implemented in a hash (or, more generally, if you can otherwise check for the presence of a value in the set in O(1) time) you can improve on a brute-force approach.
Instead of iterating through the elements of the subset and maintaining the maximum, iterate over the elements of the parent set in descending order, checking for the presence of those elements in the subset. The first element found is necessarily the subset's maximum. Technically, this still takes O(n) time (n = subset cardinality) in the general case, but will generally carry a great performance benefit in practice. (If you have any data regarding the number and size of the subsets, and they favor this approach, you can improve on O(n) in the average case.)
This approach requires sorting of the parent set's elements (n log n), however, so it may only be worthwhile if the number of subsets is much greater than the cardinality of the parent set.
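A minimal sketch of that idea in Python, using the example from the question (assuming the subsets support O(1) membership tests, e.g. as hash sets):

def subset_max(parent_desc, subset):
    for x in parent_desc:          # parent elements, largest first
        if x in subset:            # O(1) membership test for a hash-backed subset
            return x
    return None                    # subset is empty

X_desc = sorted({2, 4, 6, 3, 1}, reverse=True)        # sort the parent once: O(n log n)
A1, A2, A3 = {2, 3, 1}, {2, 6, 3, 1}, {4, 2, 3, 1}
print([subset_max(X_desc, A) for A in (A1, A2, A3)])   # [3, 6, 4]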

How to perform minimum splits to satisfy special set ordering?

I'm trying to create an algorithm to solve the following problem:
Input is an unsorted list of sets containing pairs (key, value) of ints. The first of each pair is positive and unique within the set.
I want to find an algorithm to split the input sets so the sets can be ordered such that for each key the value is nondecreasing in the set order.
There is a trivial solution, which is to split the sets into individual values and sort them; I'd like something more efficient in terms of the number of sets which are split.
Are there any similar problems you have encountered and/or techniques you can suggest?
Does the optimal (minimum number of splits) solution sound like it is possible in polynomial time?
Edit: In the example the "<=" operator indicates a constraint on the sets as a whole whereby for each key (100, 101, 102) the corresponding values are equal to or greater than the values in previous sets (or omitted from the set). I.e., extracting the values for each key using the order from the output sets gives:
Key 100 {0, 1}
Key 101 {2, 3}
Key 102 {10, 15}
A*
I propose using A* to find an optimal solution. Build the order of split sets incrementally from left to right, minimizing the number of sets required to achieve this.
A* visits states based on some heuristic estimate of the total cost. I propose that a state is described by the totality of all the pairs already included in the order as we have it so far. If all values for every key are different, then you can represent this information rather concisely by simply storing the last value for each key. Otherwise you'll have to somehow take care of equal values, so you know which ones were already included and which ones were not. For every state you maintain some representation of the best order leading to it, but that may get updated along the way while the state remains the same.
The heuristic should be an estimate of the total cost of the path from the beginning through the current state to the goal. It may be too low, but must never be too high. In our case, the heuristic should count the number of (possibly split) sets included in the order so far, and add to that the number of (unsplit) sets still waiting for insertion. As the remaining sets may need splitting, this might be too low, but as you can never have fewer sets than those still waiting for insertion, it is a suitable heuristic.
Now you have some priority queue of states, ordered by the value of this heuristic. You extract minimal items from it, and know that the moment you extract a state from the queue, the cost up to that state cannot decrease any more, so the path up to that state is optimal. Now you examine what other states can be reached from this one: which other pairs can be next in the order of split sets? For each remaining set which has pairs that are ready to be included, you create a new subsequent state, taking all the pairs from the set which are ready. The cost so far increases by one. If you manage to take a whole set, without splitting, then the estimate for the remaining cost decreases by one.
For this new state, you check whether it is already present in your priority queue. If it is, and its previous cost was higher than the one just computed, then you update its cost, and the optimal path leading to it. Make sure the priority key changes its position accordingly ("decrease key"). If the state wasn't present in the queue before, then add it to the queue.
Dijkstra
Come to think of it, this is the same as running Dijkstra's algorithm with the number of splits as cost. And as each edge has either cost zero or cost one, you can implement this even easier, without any priority queue at all. Instead, you can use two sets, called S₀ and S₁, where all elements from S₀ require the same number of splits, and all elements from S₁ require one more split. Roughly sketched in pseudocode:
S₀ = ∅ (empty set)
S₁ = ∅
add initial state (no pairs added yet, all sets remain to be added) to S₀
while True
    while S₀ ≠ ∅
        x = take and remove any element from S₀
        if x is the target state (all pairs included in the order) then
            return the path information associated with it
        for (r: those sets which remain to be added in state x)
            if we can take r as a whole then
                let y be the state obtained by taking r as the next set in the order
                if y is in S₁, remove it
                add y to S₀
            else if we can add only some elements from r then
                let y be the state obtained by taking as many elements from r as possible
                if y is not in S₀, add it to S₁
    S₀ = S₁
    S₁ = ∅
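That two-set loop is the classic "0-1 BFS" pattern; here is a generic Python skeleton of it using a deque instead of the two explicit sets. The start, is_goal, and successors callbacks are placeholders for the state representation sketched in this answer, not something given in it:

from collections import deque

def zero_one_bfs(start, is_goal, successors):
    """successors(state) yields (next_state, cost) pairs with cost 0 (whole set) or 1 (split)."""
    dist = {start: 0}
    dq = deque([start])
    while dq:
        state = dq.popleft()
        if is_goal(state):
            return dist[state]                     # minimum number of splits
        for nxt, cost in successors(state):
            d = dist[state] + cost
            if nxt not in dist or d < dist[nxt]:
                dist[nxt] = d
                # zero-cost moves go to the front, one-cost moves to the back,
                # which keeps the deque ordered by distance (the S₀/S₁ trick above)
                (dq.appendleft if cost == 0 else dq.append)(nxt)
    return None                                    # the target state is unreachable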

Dynamic Programming: Sum-of-products

Let's say you have two lists, L1 and L2, of the same length, N. We define prodSum as:
def prodSum(L1, L2):
    ans = 0
    for elem1, elem2 in zip(L1, L2):
        ans += elem1 * elem2
    return ans
Is there an efficient algorithm to find, assuming L1 is sorted, the number of permutations of L2 such that prodSum(L1, L2) < some pre-specified value?
If it would simplify the problem, you may assume that L1 and L2 are both lists of integers from [1, 2, ..., N].
Edit: Managu's answer has convinced me that this is impossible without assuming that L1 and L2 are lists of integers from [1, 2, ..., N]. I'd still be interested in solutions that assume this constraint.
I want to first dispel a certain amount of confusion about the math, then discuss two solutions and give code for one of them.
There is a counting class called #P which is a lot like the yes-no class NP. In a qualitative sense, it is even harder than NP. There is no particular reason to believe that this counting problem is any better than #P-hard, although it could be hard or easy to prove that.
However, many #P-hard problems and NP-hard problems vary tremendously in how long they take to solve in practice, and even one particular hard problem can be harder or easier depending on the properties of the input. What NP-hard or #P-hard mean is that there are hard cases. Some NP-hard and #P-hard problems also have less hard cases or even outright easy cases. (Others have very few cases that seem much easier than the hardest cases.)
So the practical question could depend a lot on the input of interest. Suppose that the threshold is on the high side or on the low side, or you have enough memory for a decent number of cached results. Then there is a useful recursive algorithm that makes use of two ideas, one of them already mentioned: (1) After partially assigning some of the values, the remaining threshold for list fragments may rule out all of the permutations, or it may allow all of them. (2) Memory permitting, you should cache the subtotals for some remaining threshold and some list fragments. To improve the caching, you might as well pick the elements from one of the lists in order.
Here is Python code that implements this algorithm:
list1 = [1,2,3,4,5,6,7,8,9,10,11]
list2 = [1,2,3,4,5,6,7,8,9,10,11]
size = len(list1)
threshold = 396   # This is smack in the middle, a hard value
cachecutoff = 6   # Cache results when up to this many are assigned

def dotproduct(v,w):
    return sum([a*b for a,b in zip(v,w)])

factorial = [1]
for n in xrange(1,len(list1)+1):
    factorial.append(factorial[-1]*n)

cache = {}

# Assumes two sorted lists of the same length
def countprods(list1,list2,threshold):
    if dotproduct(list1,list2) <= threshold:            # They all work
        return factorial[len(list1)]
    if dotproduct(list1,reversed(list2)) > threshold:    # None work
        return 0
    if (tuple(list2),threshold) in cache:                # Already been here
        return cache[(tuple(list2),threshold)]
    total = 0
    # Match the first element of list1 to each item in list2
    for n in xrange(len(list2)):
        total += countprods(list1[1:],list2[:n] + list2[n+1:],
                            threshold-list1[0]*list2[n])
    if len(list1) >= size-cachecutoff:
        cache[(tuple(list2),threshold)] = total
    return total

print 'Total permutations below threshold:',
print countprods(list1,list2,threshold)
print 'Cache size:',len(cache)
As the comment line says, I tested this code with a hard value of the threshold. It is quite a bit faster than a naive search over all permutations.
There is another algorithm that is better than this one if three conditions are met: (1) You don't have enough memory for a good cache, (2) the list entries are small non-negative integers, and (3) you're interested in the hardest thresholds. A second situation to use this second algorithm is if you want counts for all thresholds flat-out, whether or not the other conditions are met. To use this algorithm for two lists of length n, first pick a base x which is a power of 10 or 2 that is bigger than n factorial. Now make the matrix
M[i][j] = x**(list1[i]*list2[j])
If you compute the permanent of this matrix M using the Ryser formula, then the kth digit of the permanent in base x tells you the number of permutations for which the dot product is exactly k. Moreover, the Ryser formula is quite a bit faster than the summing over all permutations directly. (But it is still exponential, so it does not contradict the fact that computing the permanent is #P-hard.)
Also, yes it is true that the set of permutations is the symmetric group. It would be great if you could use group theory in some way to accelerate this counting problem. But as far as I know, nothing all that deep comes from that description of the question.
Finally, if instead of exactly counting the number of permutations below a threshold, you only wanted to approximate that number, then probably the game changes completely. (You can approximate the permanent in polynomial time, but that doesn't help here.) I'd have to think about what to do; in any case it isn't the question posed.
I realized that there is another kind of caching/dynamic programming that is missing from the above discussion and the above code. The caching implemented in the code is early-stage caching: If just the first few values of list1 are assigned to list2, and if a remaining threshold occurs more than once, then the cache allows the code to reuse the result. This works great if the entries of list1 and list2 are integers that are not too large. But it will be a failed cache if the entries are typical floating point numbers.
However, you can also precompute at the other end, when most of the values of list1 have been assigned. In this case, you can make a sorted list of the subtotals for all of the remaining values. And remember, you can use up list1 in order, and do all of the permutations on the list2 side. For example, suppose that the last three entries of list1 are [4,5,6], and suppose that three of the values in list2 (somewhere in the middle) are [2.1,3.5,3.7]. Then you would cache a sorted list of the six dot products:
endcache[ [2.1, 3.5, 3.7] ] = [44.9, 45.1, 46.3, 46.7, 47.9, 48.1]
What does this do for you? If you look in the code that I did post, the function countprods(list1,list2,threshold) recursively does its work with a sub-threshold. The first argument, list1, might have been better as a global variable than as an argument. If list2 is short enough, countprods can do its work much faster by doing a binary search in the list endcache[list2]. (I just learned from stackoverflow that this is implemented in the bisect module in Python, although a performance code wouldn't be written in Python anyway.) Unlike the head cache, the end cache can speed up the code a lot even if there are no numerical coincidences among the entries of list1 and list2. Ryser's algorithm also stinks for this problem without numerical coincidences, so for this type of input I only see two accelerations: Sawing off a branch of the search tree using the "all" test and the "none" test, and the end cache.
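As a small illustration of that end cache (my own sketch, reusing the numbers from the example above; build_endcache_entry is a hypothetical helper, not part of the original code):

import bisect
from itertools import permutations

def build_endcache_entry(tail1, group2):
    """Sorted dot products of tail1 against every permutation of group2."""
    return sorted(sum(a * b for a, b in zip(tail1, perm))
                  for perm in permutations(group2))

entry = build_endcache_entry([4, 5, 6], [2.1, 3.5, 3.7])
# entry == [44.9, 45.1, 46.3, 46.7, 47.9, 48.1] (up to floating-point rounding)

remaining = 46.5   # whatever sub-threshold is left at this point in the recursion
count = bisect.bisect_right(entry, remaining)   # completions that still fit: 3 of the 6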
Probably not (without the simplifying assumption): your problem is NP-hard. Here's a trivial reduction from SUBSET-SUM. Let count_perms(L1, L2, x) represent the function "count the number of permutations of L2 such that prodSum(L1, L2) < x".
SUBSET_SUM(L2,n): # (determine if any subset of L2 adds up to n)
    For i in [1,...,len(L2)]
        Set L1=[0]*(len(L2)-i)+[1]*i
        calculate count_perms(L1,L2,n+1)-count_perms(L1,L2,n)
        if result positive, return true
    Return false
Thus, if there were a way to calculate your function count_perms(L1, L2, x) efficiently, then we would have an efficient algorithm to calculate SUBSET_SUM(L2,n).
This also turns out to be an abstract algebra problem. It's been a while for me, but here's a few things to get started. There's nothing terribly significant about the following (it's all very basic; an expansion on the fact that every group is isomorphic to a permutation group), but it provides a different way of looking at the problem.
I'll try to stick to fairly standard notation: "x" is a vector, and "x_i" is the ith component of x. If "L" is a list, L is the equivalent vector. "1_n" is a vector with all components = 1. The set of natural numbers ℕ is taken to be the positive integers. "[a,b]" is the set of integers from a through b, inclusive. "θ(x, y)" is the angle formed by x and y.
Note prodSum is the dot product. The question is equivalent to finding all vectors L generated by an operation (permuting elements) on L2 such that θ(L1, L) is less than a given angle α. The operation is equivalent to reflecting a point in ℕ^n through a subspace with presentation:
< ℕ^n | (x_i x_j^-1), (i,j) ∈ A >
where i and j are in [1,n], A has at least one element and no (i,i) is in A (i.e. A is a non-reflexive subset of [1,n]² where |A| > 0). Stated more plainly (and more ambiguously), the subspaces are the points where one or more components are equal to one or more other components. The reflections correspond to matrices whose columns are all the standard basis vectors.
Let's name the reflection group "RP_n" (it should have another name, but memory fails). RP_n is isomorphic to the symmetric group S_n. Thus
|RP_n| = |S_n| = n!
In 3 dimensions, this gives a group of order 6. The reflection group is D_3, the triangle symmetry group, as a subgroup of the cube symmetry group. It turns out you can also generate the points by rotating L2 in increments of π/3 around the line along 1_n. This is the modular group ℤ_6 and this points to a possible solution: find a group of order n! with a minimal number of generators and use that to generate the permutations of L2 as sequences with increasing, then decreasing, angle with L2. From there, we can try to generate the elements L with θ(L1, L) < α directly (for example we can binsearch on the 1st half of each sequence to find the transition point; with that, we can specify the rest of the sequence that fulfills the condition and count it in O(1) time). Let's call this group RP'_n.
RP'_4 is constructed of 4 subspaces isomorphic to ℤ_6. More generally, RP'_n is constructed of n subspaces isomorphic to RP'_{n-1}.
This is where my abstract algebra muscles really begin to fail. I'll try to keep working on the construction, but Managu's answer doesn't leave much hope. I fear that reducing RP_3 to ℤ_6 is the only useful reduction we can make.
It looks like if l1 and l2 are both ordered high->low (or low->high, whatever, as long as they have the same order), the result is maximized, and if they are ordered opposite, the result is minimized; other alterations of the order appear to follow some rules. Swapping two numbers in a continuous list of integers always reduces the sum by a fixed amount which seems to be related to their distance apart (i.e. swapping 1 and 3 or 2 and 4 has the same effect). This was just from a little messing around, but the idea is that there is a maximum, a minimum, and if some pre-specified value is between them, there are ways to count the permutations that make that possible (although, if the list isn't evenly spaced, then there aren't. Well, not that I know of. If l2 is (1 2 4 5), swapping 1 and 2 versus 2 and 4 would have different effects).
