Find mutually compatible options from a list of lists of options - algorithm

For purposes of this question, let us call a list of mutually incompatible options an "OptionS". I have a list of such OptionS, where each Option, apart from disqualifying all other Options in its own OptionS list, also disqualifies some Options from the other OptionS lists. These rules are symmetrical, so if A forbids B, B forbids A.
I want to pick exactly one Option from each list, such that no Options disqualify each other. There are too many Options (and OptionS) and too few disqualifications in each step to brute force a backtracking solution.
It reminds me a bit of Sudoku, but it is not an exact analog. From certain external factors, I have a rough likelihood for the different Options, or at least an ordering.
Is there a known better solution to this problem? Is it in NP?
Currently, I plan to just take random "paths" through the solution space, weighted by likelihood. A sort of simulated annealing.
EDIT - Clarification
I have a number, let's say between 5 and 500, of vectors.
Each vector contains a number, between 10 and 10000, of elements
Each element rules out a number of elements in the other vectors
This relation is symmetric
I want to pick exactly one element from each vector in a way that no elements disqualify each other
If there is no way to choose one from each vector, I want to at least choose as many as possible. The nature of the data is such that there will always be at least one (and at most a few) solution (or almost-solution - with just a few misses).
I cannot share the real data, but an example would be that the elements are integers between 1 and 10e9 and that only pairs of elements whose sum has more than P prime factors are compatible. Some numbers are more likely than others to "fit" other numbers, since larger numbers tend to have more factors, which makes some choices more likely, just as in the real data.
Pick P and the sizes and number of vectors as needed to make it suitably challenging :).
My naive solution:
I order the elements by how many other elements they rule out and try those that rule out few first (because that gives a larger chance of being able to pick one from each).
Then I order the vectors by how many elements the "best" element rules out. Vectors that rule out many other elements are first. So the most constrained vector is tried first, even though the least constrained elements of that vector are tried first.
I then search depth-first.
The problem with this approach is that if the first choice is wrong, then the depth first search will never have time to reach the next choice.
A better way, which I try to explain in a comment below, would be to score each partial choice of elements (a node in the search tree) according to how many you have chosen and how many elements are left. Then I could look deeper into the highest-scoring node at each step, so the first choice is less rigid.
A similar way, which I might try first because it is slightly easier, is to do simulated annealing and take random paths, weighted by how many possibilities they keep, down the tree.
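For reference, here is a minimal sketch of that weighted random-path idea. The Compatible test and the "weight each candidate by how many options it keeps open in the next vector" heuristic are placeholders for illustration, not the real scoring:

import java.util.*;

public class RandomRestartPicker {
    // Placeholder for the real compatibility rule; in the integer example this
    // would test whether a + b has more than P prime factors.
    interface Compatible { boolean test(long a, long b); }

    // Try to pick one element per vector; on a dead end, restart from scratch.
    static long[] pick(long[][] vectors, Compatible ok, int maxRestarts, Random rnd) {
        for (int attempt = 0; attempt < maxRestarts; attempt++) {
            long[] chosen = new long[vectors.length];
            boolean failed = false;
            for (int v = 0; v < vectors.length && !failed; v++) {
                // Elements of vector v still compatible with everything chosen so far.
                List<Long> candidates = new ArrayList<>();
                for (long e : vectors[v]) {
                    boolean fits = true;
                    for (int u = 0; u < v && fits; u++) fits = ok.test(e, chosen[u]);
                    if (fits) candidates.add(e);
                }
                if (candidates.isEmpty()) { failed = true; break; }

                // Weight each candidate by how many options it keeps open in the next
                // vector, so "friendly" elements are preferred but not forced.
                long[] next = v + 1 < vectors.length ? vectors[v + 1] : new long[0];
                double[] weight = new double[candidates.size()];
                double total = 0;
                for (int c = 0; c < candidates.size(); c++) {
                    int keeps = 1;                                   // avoid zero weights
                    for (long e2 : next) if (ok.test(candidates.get(c), e2)) keeps++;
                    weight[c] = keeps;
                    total += keeps;
                }
                // Roulette-wheel selection proportional to the weights.
                double r = rnd.nextDouble() * total;
                int pickIdx = candidates.size() - 1;                 // fallback for rounding
                for (int c = 0; c < candidates.size(); c++) {
                    if (r < weight[c]) { pickIdx = c; break; }
                    r -= weight[c];
                }
                chosen[v] = candidates.get(pickIdx);
            }
            if (!failed) return chosen;
        }
        return null;   // no complete solution found within the restart budget
    }
}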

Depending on what constraints are allowed, I think you can reduce SAT to this.
Take a SAT expression e.g. (A|B|C)(~A|C|~D)...
Replace ~A by a and make a vector out of each term giving you {A,B,C} {a,C,d}...
You now have the problem of choosing one element from each vector, subject to the constraint that you cannot choose both versions of a variable - the constraints say that A is incompatible with a, B is incompatible with b, and so on.
If you can solve this instance of your problem you can solve SAT: set to true the variables chosen as A, B, C, ..., set to false the variables chosen as a, b, c, ..., and make an arbitrary choice for anything not chosen - therefore your problem is at least as hard as SAT. (Unless you never encounter these sorts of constraints, in which case I have not proved that your problem is this hard.)
Given an instance of your problem, associate a variable with each element and write the constraints as boolean expressions (typically over only 2 variables). This gives something that looks like 2-SAT, except that you also need one expression per vector of the form (A|B|C|D|...) to say that you must choose at least one element from that vector. So the exact-solution version of your problem, at least, might code up quite nicely as input for a SAT solver - it is in NP, and since we have already shown it is NP-hard, it is NP-complete.
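To make that concrete, here is a small sketch that writes such an encoding in DIMACS CNF form, ready to feed to any off-the-shelf SAT solver. The variable numbering and the conflict input format are my own assumptions for the example:

import java.util.*;

public class CnfEncoder {
    // vectorSizes[v] = number of elements in vector v; conflicts is a list of
    // {vectorA, indexA, vectorB, indexB} pairs that must not both be chosen.
    // Same-vector exclusions can be passed as conflicts too, if you want at most
    // one element chosen per vector.
    static String encode(int[] vectorSizes, List<int[]> conflicts) {
        // One boolean variable per element; var is true iff that element is chosen.
        int[] offset = new int[vectorSizes.length];
        int numVars = 0;
        for (int v = 0; v < vectorSizes.length; v++) { offset[v] = numVars; numVars += vectorSizes[v]; }

        StringBuilder clauses = new StringBuilder();
        int numClauses = 0;
        // "Choose at least one element from each vector": (x_v0 | x_v1 | ...)
        for (int v = 0; v < vectorSizes.length; v++) {
            for (int i = 0; i < vectorSizes[v]; i++) clauses.append(offset[v] + i + 1).append(' ');
            clauses.append("0\n");
            numClauses++;
        }
        // Each incompatibility becomes a binary clause (~x | ~y).
        for (int[] c : conflicts) {
            int x = offset[c[0]] + c[1] + 1, y = offset[c[2]] + c[3] + 1;
            clauses.append(-x).append(' ').append(-y).append(" 0\n");
            numClauses++;
        }
        return "p cnf " + numVars + " " + numClauses + "\n" + clauses;
    }

    public static void main(String[] args) {
        // Two vectors of two elements; element 0 of vector 0 conflicts with element 1 of vector 1.
        System.out.print(encode(new int[]{2, 2},
                Collections.singletonList(new int[]{0, 0, 1, 1})));
    }
}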

My first recommendation would be to find an off-the-shelf constraint solver and try that (request a maximum-weight solution with the log-likelihoods as weights), but if you're determined to implement a solver from scratch, then I would suggest that you start with something like WalkSAT. To summarize the link in the language of your question: at all times, keep a list of option choices (one from each option list, not necessarily compatible) and a list of conflicts (i.e., a set of pairs of indexes into the list of option lists). Repeatedly choose a conflict at random and resolve it by choosing differently for one half of the conflict or the other (most of the time) so as to decrease the number of conflicts afterward as much as possible or (some of the time) randomly, perhaps according to the likelihoods. Good data structures will be essential in making this run fast.
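A rough sketch of that repair loop follows. The Compatible interface and the noise parameter are illustrative assumptions, and a real implementation would maintain the conflict list incrementally rather than rebuilding it each step (this is where the good data structures come in):

import java.util.*;

public class WalkSatStylePicker {
    // Placeholder compatibility test between element idxA of option list vecA
    // and element idxB of option list vecB.
    interface Compatible { boolean test(int vecA, int idxA, int vecB, int idxB); }

    // listSizes[v] = number of options in list v. Returns one chosen index per list,
    // or null if no conflict-free assignment was found within maxSteps.
    static int[] solve(int[] listSizes, Compatible ok, int maxSteps, double noise, Random rnd) {
        int n = listSizes.length;
        int[] choice = new int[n];
        for (int v = 0; v < n; v++) choice[v] = rnd.nextInt(listSizes[v]);

        for (int step = 0; step < maxSteps; step++) {
            // Conflicting pairs of lists under the current choices.
            List<int[]> conflicts = new ArrayList<>();
            for (int a = 0; a < n; a++)
                for (int b = a + 1; b < n; b++)
                    if (!ok.test(a, choice[a], b, choice[b])) conflicts.add(new int[]{a, b});
            if (conflicts.isEmpty()) return choice;          // fully compatible assignment

            // Pick a random conflict and re-choose for one of its two lists:
            // occasionally at random, usually so as to minimize that list's conflict count.
            int[] conflict = conflicts.get(rnd.nextInt(conflicts.size()));
            int v = conflict[rnd.nextInt(2)];
            if (rnd.nextDouble() < noise) {
                choice[v] = rnd.nextInt(listSizes[v]);
            } else {
                int bestIdx = choice[v], bestConflicts = Integer.MAX_VALUE;
                for (int i = 0; i < listSizes[v]; i++) {
                    int c = 0;
                    for (int u = 0; u < n; u++)
                        if (u != v && !ok.test(v, i, u, choice[u])) c++;
                    if (c < bestConflicts) { bestConflicts = c; bestIdx = i; }
                }
                choice[v] = bestIdx;
            }
        }
        return null;
    }
}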

Related

General algorithm for partial backtracking search

Backtracking search is a well-known problem-solving technique that recurses through all possible combinations of variable assignments in search of a valid solution. The general algorithm is abstracted into a concise higher-order function: https://en.wikipedia.org/wiki/Backtracking
Some problems require partial backtracking, that is, they have a mixture of don't-know non-determinism (a choice to make that matters: if you get it wrong you have to backtrack) and don't-care non-determinism (a choice to make that doesn't matter for correctness, though perhaps for how long it takes to find the solution: you don't have to backtrack over it).
Consider for example the Boolean satisfiability problem, which can be solved with the DPLL algorithm. If you try to represent that with the general backtracking algorithm, the result will not only recurse through all 2^N variable assignments (which is sadly necessary in the general case), but also through all N! orders of trying the variables (completely unnecessary and hopelessly inefficient).
Is there a general algorithm for partial backtracking? A concise higher-order function that takes function parameters for both don't-know and don't-care choices?
If I understand you correctly, you’re asking about symmetry-breaking in tree search. In the specific example you gave, all permutations of the list of variable assignments are equivalent.
Symmetries are going to be domain-specific. So is the more-general technique of pruning the search tree, by short-circuiting and backtracking eagerly. There are a few symmetry-breaking techniques I’ve used that generalize.
One is to search the problem space in a canonical order. If the branch that sets variable 10 only tries variables 11, 12 and up, not variables 9, 8 or 7, it won’t search any permutation of the same solution. It will only test solutions that are unique up to permutation. (In the specific case of SAT-solving, this might rule out an optimal search order—although you could re-order the variables arbitrarily.)
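As a toy illustration of canonical-order search (deliberately simplified to choosing subsets of integers, where any permutation of a subset would count as the same solution):

import java.util.*;

public class CanonicalOrderSearch {
    // Enumerate every k-element subset of {0, ..., n-1} exactly once by only
    // trying candidates larger than the last one chosen (canonical increasing order).
    static void subsets(int n, int k, int start, Deque<Integer> chosen, List<List<Integer>> out) {
        if (chosen.size() == k) { out.add(new ArrayList<>(chosen)); return; }
        for (int i = start; i < n; i++) {          // never revisit indices below 'start'
            chosen.addLast(i);
            subsets(n, k, i + 1, chosen, out);
            chosen.removeLast();
        }
        // Without the 'start' bound, the same subset would be found in k! different orders.
    }

    public static void main(String[] args) {
        List<List<Integer>> out = new ArrayList<>();
        subsets(4, 2, 0, new ArrayDeque<>(), out);
        System.out.println(out);   // [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    }
}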
Another is to make a test that only one distinct solution of any equivalence class will pass, ideally one that can be checked near the top of the search tree. The classic example of this is, in the 8-queens problem, checking whether the queen on the row you look at first is on the left or the right side of the chessboard. Any solution where she’s on the right is a mirror-image of one other solution where she’s on the left, so you can cut the search space in half. (You can actually do better than this with that problem.) If you only need to test for satisfiability, you can get by with a filter that merely guarantees that, if any solution exists, at least one solution will pass.
If you have enough memory, you might also store a set of branches that have already been searched, and then check whether a branch that you are considering whether to search is equivalent to one already in the set. This would be more practical for a search space with a huge number of symmetries than one with a huge number of solutions unique up to symmetry.

Optimized Algorithm: Fastest Way to Derive Sets

I'm writing a program for a competition and I need to be faster than all the other competitors. For this I need a little algorithm help; ideally I'd be using the fastest algorithm.
For this problem I am given 2 things. The first is a list of tuples, each of which contains exactly two elements (strings), each of which represents an item. The second is an integer, which indicates how many unique items there are in total. For example:
# of items = 3
[("ball","chair"),("ball","box"),("box","chair"),("chair","box")]
(The same tuples can be repeated; they are not necessarily unique.) My program is supposed to figure out the maximum number of tuples that can "agree" when the items are sorted into two groups. This means: if all the items are broken into two ideal groups, group 1 and group 2, what is the maximum number of tuples that can have their first item in group 1 and their second item in group 2?
For example, the answer to my earlier example would be 2, with "ball" in group 1 and "chair" and "box" in group 2, satisfying the first two tuples. I do not necessarily need to know which items go in which group, I just need to know what the maximum number of satisfied tuples could be.
At the moment I'm trying a recursive approach, but it's running in O(n^2), far too inefficient in my opinion. Does anyone have a method that could produce a faster algorithm?
Thanks!!!!!!!!!!
Speed-up approaches for your task:
1. Use integers
Convert the strings to integers (store the strings in an array and use each string's position in the tuples).
String[] words = {"ball", "chair", "box"};
In the tuples, "ball" now has number 0 (position 0 in the array), "chair" 1, and "box" 2.
Comparing ints is faster than comparing Strings.
2. Avoid recursion
Recursion is slow due to the call overhead.
For example, look at the binary search algorithm in a recursive implementation, then look at how Java implements binarySearch() (with a while loop and iteration).
Recursion is helpful when a problem is so complex that a non-recursive implementation is too complex for a human brain.
Iteration is faster, except when you merely mimic recursive calls by implementing your own stack.
However, you can start with a recursive algorithm; once it works and it is a suitable algorithm, try to convert it to a non-recursive implementation.
3. If possible, avoid objects
If you want the fastest solution, now it becomes ugly!
A tuple array can be stored either as an array of a class Point(x,y) or, probably faster, as an array of int:
Example:
(1,2), (2,3), (3,4) can be stored as the array (1,2,2,3,3,4).
This needs much less memory, because an object needs at least 12 bytes of overhead (in Java).
Less memory means more speed: when the arrays are really big, the int array will hopefully fit in the processor cache, while the object array will not.
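A small sketch combining points 1 and 3 (the example data is from the question; the mapping code itself is just illustrative):

import java.util.*;

public class CompactTuples {
    public static void main(String[] args) {
        // Map each item string to a small int once, then store each tuple as two
        // consecutive ints in one flat array (no per-tuple objects).
        String[][] tuples = {{"ball","chair"}, {"ball","box"}, {"box","chair"}, {"chair","box"}};

        Map<String, Integer> id = new HashMap<>();
        String[] words = {"ball", "chair", "box"};
        for (int i = 0; i < words.length; i++) id.put(words[i], i);

        int[] flat = new int[tuples.length * 2];   // (first0, second0, first1, second1, ...)
        for (int t = 0; t < tuples.length; t++) {
            flat[2 * t]     = id.get(tuples[t][0]);
            flat[2 * t + 1] = id.get(tuples[t][1]);
        }
        System.out.println(Arrays.toString(flat)); // [0, 1, 0, 2, 2, 1, 1, 2]
    }
}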
4. Programming language
In C it will be faster than in Java.
Maximum cut is a special case of your problem, so I doubt you have a quadratic algorithm for it. (Maximum cut is NP-complete and it corresponds to the case where every tuple (A,B) also appears in reverse as (B,A) the same number of times.)
The best strategy for you to try here is "branch and bound." It's a variant of the straightforward recursive search you've probably already coded up. You keep track of the value of the best solution you've found so far. In each recursive call, you check whether it's even possible to beat the best known solution with the choices you've fixed so far.
One thing that may help (or may hurt) is to "probe": for each as-yet-unfixed item, see if putting that item on one of the two sides leads only to suboptimal solutions; if so, you know that item needs to be on the other side.
Another useful trick is to recurse on items that appear frequently both as the first element and as the second element of your tuples.
You should pay particular attention to the "bound" step --- finding an upper bound on the best possible solution given the choices you've fixed.
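Here is a minimal depth-first branch-and-bound sketch along those lines. The bound used is only the simplest possible one ("total tuples minus tuples already ruled out"); a stronger bound, as suggested above, would prune more:

import java.util.*;

public class MaxAgreeTuples {
    static int n;              // number of distinct items
    static int[][] pairs;      // pairs[t] = {firstItemId, secondItemId}
    static int[] group;        // 0 = unassigned, 1 = group 1, 2 = group 2
    static int best = 0;

    // Tuples that can no longer have their first item in group 1 and second in group 2.
    static int violated() {
        int v = 0;
        for (int[] p : pairs)
            if (group[p[0]] == 2 || group[p[1]] == 1) v++;
        return v;
    }

    static void search(int item) {
        int upperBound = pairs.length - violated();
        if (upperBound <= best) return;                // bound: cannot beat the incumbent
        if (item == n) { best = upperBound; return; }  // leaf: every surviving tuple is satisfied
        group[item] = 1; search(item + 1);             // branch: item goes to group 1 ...
        group[item] = 2; search(item + 1);             // ... or to group 2
        group[item] = 0;
    }

    public static void main(String[] args) {
        n = 3;                                         // ball=0, chair=1, box=2
        pairs = new int[][]{{0,1},{0,2},{2,1},{1,2}};
        group = new int[n];
        search(0);
        System.out.println(best);                      // 2, as in the example above
    }
}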

Optimal placement of objects wrt pairwise similarity weights

Ok, this is an abstract algorithmic challenge and it will remain abstract, since where I am going to use it is top secret.
Suppose we have a set of objects O = {o_1, ..., o_N} and a symmetric similarity matrix S where s_ij is the pairwise correlation of objects o_i and o_j.
Assume also that we have a one-dimensional space with discrete positions where objects may be put (like having N boxes in a row, or chairs for people).
Having a certain placement, we may measure the cost of moving from the position of one object to that of another object as the number of boxes we need to pass by until we reach our target, multiplied by their pairwise object similarity. Moving from a position to the box right after or before that position has zero cost.
Imagine an example where for three objects we have the following similarity matrix:
1.0 0.5 0.8
S = 0.5 1.0 0.1
0.8 0.1 1.0
Then, the best ordering of objects in the three boxes is obviously:
[o_3] [o_1] [o_2]
The cost of this ordering is the sum of costs (counting boxes) for moving from one object to all others. So here the only cost is for the distance between o_2 and o_3, equal to 1 box * 0.1 sim = 0.1.
On the other hand:
[o_1] [o_2] [o_3]
would have cost = cost(o_1-->o_3) = 1box * 0.8sim = 0.8.
The target is to determine a placement of the N objects in the available positions in a way that we minimize the above mentioned overall cost for all possible pairs of objects!
An analogue is to imagine that we have a table with chairs side by side in one row only (like the boxes) and you need to seat N people on the chairs. Those people have some relation, let's say how probable it is that one of them wants to speak to another. Speaking means standing up, passing by a number of chairs, and talking to the person there. When two people sit on adjacent chairs they don't need to move in order to talk to each other.
So how can we seat those people so that the overall distance cost between people is minimized? This means that over the course of the night the total distance walked by the guests is close to the minimum.
Greedy search is... ok forget it!
I am interested in hearing whether there is a standard formulation of such a problem for which I could find some literature, and also in different search approaches (e.g. dynamic programming, tabu search, simulated annealing etc. from the combinatorial optimization field).
Looking forward to hearing your ideas.
PS. My question has something in common with this thread Algorithm for ordering a list of Objects, but I think here it is better posed as problem and probably slightly different.
That sounds like an instance of the Quadratic Assignment Problem. The speciality is due to the fact that the locations are placed on one line only, but I don't think this will make it easier to solve. The QAP in general is NP hard. Unless I misinterpreted your problem you can't find an optimal algorithm that solves the problem in polynomial time without proving P=NP at the same time.
If the instances are small you can use exact methods such as branch and bound. You can also use tabu search or other metaheuristics if the problem is more difficult. We have an implementation of the QAP and some metaheuristics in HeuristicLab. You can configure the problem in the GUI, just paste the similarity and the distance matrix into the appropriate parameters. Try starting with the robust Taboo Search. It's an older, but still quite well working algorithm. Taillard also has the C code for it on his website if you want to implement it for yourself. Our implementation is based on that code.
There has been a lot of publications done on the QAP. More modern algorithms combine genetic search abilities with local search heuristics (e. g. Genetic Local Search from Stützle IIRC).
Here's a variation of the already posted method. I don't think this one is optimal, but it may be a start.
Create a list of all the pairs in descending cost order.
While the list is not empty:
Pop the head item from the list.
If neither element is in an existing group, create a new group containing the pair.
If one element is in an existing group, add the other element to whichever end puts it closer to the group member.
If both elements are in existing groups, combine them so as to minimize the distance between the pair.
Group combining may require reversal of order in a group, and the data structure should be designed to support that.
Let me help the thread (of my own) with a simplistic ordering approach.
1. Order the upper half of the similarity matrix.
2. Start with the pair of objects having the highest similarity weight and place them in the center positions.
3. The next object may be put on the left or the right side of them. So each time you select the object that, when put to the left or right, has the highest cost to the pre-placed objects. Repeat Step 3 until all objects are placed.
The selection in Step 3 is made because, if you skipped this object and placed it later, its cost would again be the greatest of the remaining ones, and even larger (it would be farther from the pre-placed objects). So the costly placements should be done as early as possible.
This is too simple and of course does not discover a good solution.
Another approach is to
1. start with a complete ordering generated somehow (random or from another algorithm)
2. try to improve it using "swaps" of object pairs.
I believe local minima would be a huge deterrent.
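A minimal sketch of that swap-based improvement step (plain hill climbing, so it will indeed get stuck in local minima; random restarts, tabu search or simulated annealing would be the natural next step). The cost function follows the definition in the question, with adjacent positions costing zero:

import java.util.*;

public class PlacementLocalSearch {
    // Cost of a placement: sum over all pairs of (boxes between them) * similarity.
    static double cost(int[] order, double[][] sim) {
        double c = 0;
        for (int i = 0; i < order.length; i++)
            for (int j = i + 1; j < order.length; j++)
                c += (j - i - 1) * sim[order[i]][order[j]];
        return c;
    }

    // Hill climbing on pairwise swaps: keep applying improving swaps
    // until no swap improves the cost (a local minimum, as noted above).
    static int[] improveBySwaps(int[] order, double[][] sim) {
        int[] best = order.clone();
        double bestCost = cost(best, sim);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i < best.length; i++) {
                for (int j = i + 1; j < best.length; j++) {
                    int[] cand = best.clone();
                    int tmp = cand[i]; cand[i] = cand[j]; cand[j] = tmp;
                    double c = cost(cand, sim);
                    if (c < bestCost) { best = cand; bestCost = c; improved = true; }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] sim = { {1.0, 0.5, 0.8}, {0.5, 1.0, 0.1}, {0.8, 0.1, 1.0} };
        int[] start = {0, 1, 2};               // [o_1] [o_2] [o_3], cost 0.8
        int[] result = improveBySwaps(start, sim);
        System.out.println(Arrays.toString(result) + " cost=" + cost(result, sim));
        // [1, 0, 2], i.e. [o_2] [o_1] [o_3], cost 0.1
    }
}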

Tricky algorithm for sorting symbols in an array while preserving relationships via order

The problem
I have multiple groups which specify the relationships of symbols, for example:
[A B C]
[A D E]
[X Y Z]
What these groups mean is that (for the first group) the symbols, A, B, and C are related to each other. (The second group) The symbols A, D, E are related to each other.. and so forth.
Given all these data, I would need to put all the unique symbols into a 1-dimension array wherein the symbols which are somehow related to each other would be placed closer to each other. Given the example above, the result should be something like:
[B C A D E X Y Z]
or
[X Y Z D E A B C]
In this resulting array, since the symbol A has multiple relationships (namely with B and C in one group and with D and E in another) it's now located between those symbols, somewhat preserving the relationship.
Note that the order is not important. In the result, X Y Z can be placed first or last since those symbols are not related to any other symbols. However, the closeness of the related symbols is what's important.
What I need help in
I need help in determining an algorithm that takes groups of symbol relationships, then outputs the 1-dimension array using the logic above. I'm pulling my hair out on how to do this since with real data, the number of symbols in a relationship group can vary, there is also no limit to the number of relationship groups and a symbol can have relationships with any other symbol.
Further example
To further illustrate the trickiness of my dilemma, suppose you add another relationship group to the example above. Let's say:
[C Z]
The result now should be something like:
[X Y Z C B A D E]
Notice that the symbols Z and C are now closer together since their relationship was reinforced by the additional data. All previous relationships are still retained in the result also.
The first thing you need to do is to precisely define the result you want.
You do this by defining how good a result is, so that you know which one is the best. Mathematically you do this with a cost function. In this case one would typically choose the sum of the distances between related elements, the sum of the squares of these distances, or the maximal distance. A list with a small value of the cost function is then the desired result.
It is not clear whether in this case it is feasible to compute the best solution by some special method (maybe if you choose the maximal distance or the sum of the distances as the cost function).
In any case it should be easy to find a good approximation by standard methods.
A simple greedy approach would be to insert each element in the position where the resulting cost function for the whole list is minimal.
Once you have a good starting point you can try to improve it further by modifying the list towards better solutions, for example by swapping elements or rotating parts of the list (local search, hill climbing, simulated annealing, other).
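A small sketch of the greedy insertion idea, using the sum of distances between related symbols as the cost function (the data layout is an assumption for illustration):

import java.util.*;

public class GreedyInsertionOrdering {
    // Sum of |pos(a) - pos(b)| over every related pair whose symbols are both placed.
    static int cost(List<String> order, List<String[]> groups) {
        Map<String, Integer> pos = new HashMap<>();
        for (int i = 0; i < order.size(); i++) pos.put(order.get(i), i);
        int c = 0;
        for (String[] g : groups)
            for (int i = 0; i < g.length; i++)
                for (int j = i + 1; j < g.length; j++)
                    if (pos.containsKey(g[i]) && pos.containsKey(g[j]))
                        c += Math.abs(pos.get(g[i]) - pos.get(g[j]));
        return c;
    }

    // Greedily insert each unique symbol at the position where the cost stays minimal.
    static List<String> order(List<String[]> groups) {
        LinkedHashSet<String> symbols = new LinkedHashSet<>();
        for (String[] g : groups) symbols.addAll(Arrays.asList(g));
        List<String> order = new ArrayList<>();
        for (String s : symbols) {
            int bestPos = 0, bestCost = Integer.MAX_VALUE;
            for (int p = 0; p <= order.size(); p++) {
                order.add(p, s);
                int c = cost(order, groups);
                if (c < bestCost) { bestCost = c; bestPos = p; }
                order.remove(p);
            }
            order.add(bestPos, s);
        }
        return order;
    }

    public static void main(String[] args) {
        List<String[]> groups = Arrays.asList(
            new String[]{"A", "B", "C"},
            new String[]{"A", "D", "E"},
            new String[]{"X", "Y", "Z"});
        // Related symbols end up next to each other, comparable to the examples above.
        System.out.println(order(groups));
    }
}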
I think that, with large amounts of data and a lack of additional criteria, it's going to be very, very difficult to make something that finds the best option. Have you considered a greedy algorithm (construct your solution incrementally in a way that gives you something close to the ideal solution)? Here's my idea:
Sort your sets of related symbols by size, and start with the largest one. Keep those all together, because without any other criteria, we might as well say their proximity is the most important since it's the biggest set. Consider every symbol in that first set an "endpoint", an endpoint being a symbol you can rearrange and put at either end of your array without damaging your proximity rule (everything in the first set is an endpoint initially because they can be rearranged in any way). Then go through your list and as soon as one set has one or more symbols in common with the first set, connect them appropriately. The symbols that you connected to each other are no longer considered endpoints, but everything else still is. Even if a bigger set only has one symbol in common, I'm going to guess that's better than smaller sets with more symbols in common, because this way, at least the bigger set stays together as opposed to possibly being split up if it was put in the array later than smaller sets.
I would go on like this, updating the list of endpoints that existed so that you could continue making matches as you went through your set. I would keep track of if I stopped making matches, and in that case, I'd just go to the top of the list and just tack on the next biggest, unmatched set (doesn't matter if there are no more matches to be made, so go with the most valuable/biggest association). Ditch the old endpoints, since they have no matches, and then all the symbols of the set you just tacked on are the new endpoints.
This may not have a good enough runtime, I'm not sure. But hopefully it gives you some ideas.
Edit: Obviously, as part of the algorithm, ditch duplicates (trivial).
The problem as described is essentially the problem of drawing a graph in one dimension.
Using the relationships, construct a graph. Treat the unique symbols as the vertices of the graph. Place an edge between any two vertices that co-occur in a relationship; more sophisticated would be to construct a weight based on the number of relationships in which the pair of symbols co-occur.
Algorithms for drawing graphs place well-connected vertices closer to one another, which is equivalent to placing related symbols near one another. Since only an ordering is needed, the symbols can just be ranked based on their positions in the drawing.
There are a lot of algorithms for drawing graphs. In this case, I'd go with Fiedler ordering, which orders the vertices using a particular eigenvector (the Fiedler vector) of the graph Laplacian. Fiedler ordering is straightforward, effective, and optimal in a well-defined mathematical sense.
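A rough sketch of Fiedler ordering without an external linear-algebra library: it approximates the Fiedler vector by power iteration on a shifted Laplacian while projecting out the all-ones direction. A production implementation would use a proper sparse eigensolver instead; the symbol-to-index mapping in main is an assumption for the example:

import java.util.*;

public class FiedlerOrdering {
    // Approximate the Fiedler vector (eigenvector of the second-smallest eigenvalue of
    // the graph Laplacian L = D - W) by power iteration on (c*I - L), keeping the iterate
    // orthogonal to the all-ones vector so the trivial eigenvector is excluded.
    // c is chosen above the largest eigenvalue of L (2 * max weighted degree suffices).
    static double[] fiedlerVector(double[][] w) {
        int n = w.length;
        double[] degree = new double[n];
        double maxDeg = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) if (j != i) degree[i] += w[i][j];
            maxDeg = Math.max(maxDeg, degree[i]);
        }
        double c = 2 * maxDeg + 1;

        Random rnd = new Random(42);
        double[] v = new double[n];
        for (int i = 0; i < n; i++) v[i] = rnd.nextDouble() - 0.5;

        for (int iter = 0; iter < 1000; iter++) {
            // u = (c*I - L) v  =  (c - degree[i]) * v[i] + sum_j w[i][j] * v[j]
            double[] u = new double[n];
            for (int i = 0; i < n; i++) {
                u[i] = (c - degree[i]) * v[i];
                for (int j = 0; j < n; j++) if (j != i) u[i] += w[i][j] * v[j];
            }
            // Project out the all-ones direction, then normalize.
            double mean = 0;
            for (double x : u) mean += x;
            mean /= n;
            double norm = 0;
            for (int i = 0; i < n; i++) { u[i] -= mean; norm += u[i] * u[i]; }
            norm = Math.sqrt(norm);
            for (int i = 0; i < n; i++) u[i] /= norm;
            v = u;
        }
        return v;
    }

    // Order the vertices by their Fiedler-vector values.
    static Integer[] ordering(double[][] w) {
        double[] f = fiedlerVector(w);
        Integer[] idx = new Integer[w.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> f[i]));
        return idx;
    }

    public static void main(String[] args) {
        // Weights = number of groups in which two symbols co-occur, for symbols
        // A=0,B=1,C=2,D=3,E=4,X=5,Y=6,Z=7 and groups [A B C] [A D E] [X Y Z] [C Z].
        double[][] w = new double[8][8];
        int[][] groups = {{0,1,2},{0,3,4},{5,6,7},{2,7}};
        for (int[] g : groups)
            for (int a : g) for (int b : g) if (a != b) w[a][b] += 1;
        // The two clusters should end up at opposite ends, with C and Z near the middle.
        System.out.println(Arrays.toString(ordering(w)));
    }
}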
It sounds like you want to do topological sorting: http://en.wikipedia.org/wiki/Topological_sorting
Regarding the initial ordering, it seems like you are trying to enforce some kind of stability condition, but it is not really clear to me what this should be from your question. Could you try to be a bit more precise in your description?

Looking for a multidimensional optimization algorithm

Problem description
There are different categories which contain an arbitrary amount of elements.
There are three different attributes A, B and C. Each element has a different distribution of these attributes. This distribution is expressed through positive integer values. For example, element 1 has the attributes A: 42, B: 1337, C: 18. The sum of these attributes is not consistent over the elements; some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories, each of which contains up to ~30 elements, so brute-forcing doesn't seem to be very effective, since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them, so that we can strip away repeated computation of already-computed values. If we reach a level where one path has less value in all three attributes than another, we don't follow it any more from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Can anybody think of an algorithm which may aid me in solving this problem? Or could you explain why you think that no such algorithm exists?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for this problem and see how well you can apply it to your stated problem.
My first inclination is to try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++) {
    walk(budget);
    // if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
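Here is a minimal sketch in that spirit, written as plain depth-first branch and bound with explicit pruning instead of the outer budget loop. The {a, b, c} element encoding and the suffix-maximum bounds are assumptions for illustration:

import java.util.*;

public class CategoryChooser {
    // elements[cat][i] = {a, b, c} for element i of category cat.
    static int[][][] elements;
    static int needA, needB;
    static int bestC = -1;

    // Suffix sums of the per-category maxima of A, B and C over the remaining categories.
    static int[] maxARest, maxBRest, maxCRest;

    static void search(int cat, int a, int b, int c) {
        // Bound 1: even taking the per-category maximum of A (or B) from every remaining
        // category cannot reach the threshold -> dead branch.
        if (a + maxARest[cat] < needA || b + maxBRest[cat] < needB) return;
        // Bound 2: even taking the per-category maximum of C cannot beat the incumbent.
        if (c + maxCRest[cat] <= bestC) return;
        if (cat == elements.length) { bestC = c; return; }
        for (int[] e : elements[cat]) search(cat + 1, a + e[0], b + e[1], c + e[2]);
    }

    public static void main(String[] args) {
        elements = new int[][][]{
            {{42, 90, 18}, {10, 60, 40}},
            {{50, 80, 5},  {30, 100, 25}},
        };
        needA = 80; needB = 150;

        int nCat = elements.length;
        maxARest = new int[nCat + 1]; maxBRest = new int[nCat + 1]; maxCRest = new int[nCat + 1];
        for (int cat = nCat - 1; cat >= 0; cat--) {
            int ma = 0, mb = 0, mc = 0;
            for (int[] e : elements[cat]) { ma = Math.max(ma, e[0]); mb = Math.max(mb, e[1]); mc = Math.max(mc, e[2]); }
            maxARest[cat] = maxARest[cat + 1] + ma;
            maxBRest[cat] = maxBRest[cat + 1] + mb;
            maxCRest[cat] = maxCRest[cat + 1] + mc;
        }
        search(0, 0, 0, 0);
        System.out.println(bestC);   // 23 here: {42,90,18} + {50,80,5}; -1 would mean infeasible
    }
}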
If that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.
