Optimization problem - vector mapping - algorithm

A and B are sets of N dimensional vectors (N=10), |B|>=|A| (|A|=10^2, |B|=10^5). Similarity measure sim(a,b) is dot product (required). The task is following: for each vector a in A find vector b in B, such that sum of similarities ss of all pairs is maximal.
My first attempt was greedy algorithm:
find the pair with the highest similarity and remove that pair from A,B
repeat (1) until A is empty
But such greedy algorithm is suboptimal in this case:
a_1=[1, 0]
a_2=[.5, .4]
b_1=[1, 1]
b_2=[.9, 0]
sim(a_1,b_1)=1
sim(a_1,b_2)=.9
sim(a_2,b_1)=.9
sim(a_2, b_2)=.45
Algorithm returns [a_1,b_1] and [a_2, b_2], ss=1.45, but optimal solution yields ss=1.8.
Is there efficient algo to solve this problem? Thanks

This is essentially a matching problem in weighted bipartite graph. Just assume that weight function f is a dot product (|ab|).
I don't think the special structure of your weight function will simplify problem a lot, so you're pretty much down to finding a maximum matching.
You can find some basic algorithms for this problem in this wikipedia article. Although at first glance they don't seem viable for your data (V = 10^5, E = 10^7), I would still research them: some of them might allow you to take advantage of your 'lame' set of vertixes, with one part orders of magnitude smaller than the other.
This article also seems relevant, although doesn't list any algorithms.
Not exactly a solution, but hope it helps.

I second Nikita here, it is an assignment (or matching) problem. I'm not sure this is computationally feasible for your problem, but you could use the Hungarian algorithm, also known as Munkres' assignment algorithm, where the cost of assignment (i,j) is the negative of the dot product of ai and bj. Unless you happen to know how the elements of A and B are formed, I think this is the most efficient known algorithm for your problem.

Related

Is finding a subset with exact cut with other given subsets NP-hard?

I am trying to figure out whether the following problem is NP-hard:
Given G_1,..,G_n subsets of {1..m}
c_1,..,c_n non-negative integers in {0..m}
Find T subset of {1..m}
S.T. for all i=1..n, T intersects G_i in exactly c_i elements
I tried to find reductions to NP problems such as coloring, or P problems such as matching, but on both cases could think of exponential conversion algorithm (i.e. one that takes into account all subsets of G_i of size c_i), which doesn't help me much :(
A side note: would any restrictions on the parameters make this problem much easier?
For instance, would m<< n make a computational difference (we would still be looking for an algorithm that is polynomial in m)?
What if we knew that some of the constants c_i were zero?
Would appreciate any insight :)
This is NP-Complete problem, and is a generalization of Hitting Set Problem. Proof of NP-Completeness follows.
The problem is in NP (trivial - given a solution T, it is easy to check the intersection with each of Gi, and verify if its size is Ci).
It is also NP-Complete, assuming you are looking for minimal such T, with reduction from Exact Hitting Set Problem:
Where the decision problem of exact hitting set is:
Given a universe of elements U, subsets S1,...,Sk, and a number
m - is there a subset S of U of size at most m that contains
exactly one element from each Si?
Given instance of hitting-set problem (S1,S2,...Sk,d) - reduce it to this problem with Gi=Si, ci = 1, if the minimal solution of this problem is of size d, there is also a solution to exact hitting set of size d (and vise versa).

Suggestions for fragment proposal algorithm

I'm currently trying to solve the following problem, but am unsure which algorithm I should be using. Its in the area of mass identification.
I have a series of "weights", *w_i*, which can sum up to a total weight. The as-measured total weight has an error associated with it, so is thus inexact.
I need to find, given the total weight T, the closest k possible combinations of weights that can sum up to the total, where k is an input from the user. Each weight can be used multiple times.
Now, this sounds suspiciously like the bounded-integer multiple knapsack problem, however
it is possible to go over the weight, and
I also want all of the ranked solutions in terms of error
I can probably solve it using multiple sweeps of the knapsack problem, from weight-error->weight+error, by stepping in small enough increments, however it is possible if the increment is too large to miss certain weight combinations that could be used.
The number of weights is usually small (4 ->10 weights) and the ratio of the total weight to the mean weight is usually around 2 or 3
Does anyone know the names of an algorithm that might be suitable here?
Your problem effectively resembles the knapsack problem which is a NP-complete problem.
For really limited number of weights, you could run over every combinations with repetition followed by a sorting which gives you a quite high number of manipulations; at best: (n + k - 1)! / ((n - 1)! · k!) for the combination and n·log(n) for the sorting part.
Solving this kind of problem in a reasonable amount of time is best done by evolutionary algorithms nowadays.
If you take the following example from deap, an evolutionary algorithm framework in Python:
ga_knapsack.py, you realise that by modifying lines 58-59 that automatically discards an overweight solution for something smoother (a linear relation, for instance), it will give you solutions close to the optimal one in a shorter time than brute force. Solutions are already sorted for you at the end, as you requested.
As a first attempt I'd go for constraint programming (but then I almost always do, so take the suggestion with a pinch of salt):
Given W=w_1, ..., w_i for weights and E=e_1,.., e_i for the error (you can also make it asymmetric), and T.
Find all sets S (if the weights are unique, or a list) st sum w_1+e_1,..., w_k+e_k (where w_1, .., w_k \elem and e_1, ..., e_k \elem E) \approx T within some delta which you derive from k. Or just set it to some reasonably large value and decrease it as you are solving the constraints.
I just realise that you also want to parametrise the expression w_n op e_m over op \elem +, - (any combination of weights and error terms) and off the top of my head I don't know which constraint solver would allow you to do that. In any case, you can always fall back to prolog. It may not fly, especially if you have a lot of weights, but it will give you solutions quickly.

SAT/CNF optimization

Problem
I'm looking at a special subset of SAT optimization problem. For those not familiar with SAT and related topics, here's the related Wikipedia article.
TRUE=(a OR b OR c OR d) AND (a OR f) AND ...
There are no NOTs and it's in conjunctive normal form. This is easily solvable. However I'm trying to minimize the number of true assignments to make the whole statement true. I couldn't find a way to solve that problem.
Possible solutions
I came up with the following ways to solve it:
Convert to a directed graph and search the minimum spanning tree, spanning only a subset of vertices. There's Edmond's algorithm but that gives a MST for the complete graph instead of a subset of the vertices.
Maybe there's a version of Edmond's algorithm that solves the problem for a subset of the vertices?
Maybe there's a way to construct a graph out of the original problem that's solvable with other algorithms?
Use a SAT solver, a LIP solver or exhaustive search. I'm not interested in those solutions as I'm trying to use this problem as lecture material.
Question
Do you have any ideas/comments? Can you come up with other approaches that might work?
This problem is NP-Hard as well.
One can show an east reduction from Hitting Set:
Hitting Set problem: Given sets S1,S2,...,Sn and a number k: chose set S of size k, such that for every Si there is an element s in S such that s is in Si. [alternative definition: the intersection between each Si and S is not empty].
Reduction:
for an instance (S1,...,Sn,k) of hitting set, construct the instance of your problem: (S'1 AND S'2 And ... S'n,k) where S'i is all elements in Si, with OR. These elements in S'i are variables in the formula.
proof:
Hitting Set -> This problem: If there is an instance of hittins set, S then by assigning all of S's elements with true, the formula is satisfied with k elements, since for every S'i there is some variable v which is in S and Si and thus also in S'i.
This problem -> Hitting set: build S with all elements whom assigment is true [same idea as Hitting Set->This problem].
Since you are looking for the optimization problem for this, it is also NP-Hard, and if you are looking for an exact solution - you should try an exponential algorithm

Is this combinatorial optimization problem NP-hard?

I working on a combinatorial optimization problem that I suspect is NP-hard, and a genetic algorithm has been working well with our dataset. We're a research group and plan to publish our results in our field (not in math or CS), and I'd like to explore the NP-hard question before sending the manuscript out for review.
There are two main questions:
1) I'd like to know whether this particular optimization problem has been studied. I've heavily searched the lit but haven't seen anything exactly the same.
2) If the problem hasn't been studied, I might take a crack at doing a reducibility proof, and would like some pointers to good NP-complete candidates for the reduction.
The problem can be described in two ways, as a subsequence variant, and as a bipartite graph problem.
In the subsequence flavor, I want to find a "relaxed" subsequence that allows permutations, and optimize to minimize the permutation count. For example: (. = any other char)
Query: abc, Target: ..b.a.b.c., Result: abc (normal subsequence)
Query: abc, Target: ..b.a.c.a., Result: bac (subsequence with one permutation)
The bipartite formulation is a matching problem or linear assignment problem, with the graph partitioned into query character nodes and target character nodes. The edges connect the query characters to the target characters, such that there is exactly one edge from each query char to a target char. The objective function is to minimize the number of edge crossings (also called "crossing number" in the lit). This is similar to bipartite graph layout algorithms that reorder node placement to minimize edge crossings, but my problem requires the that both node orders stay fixed.
Any thoughts from the experts on questions 1 or 2?
Thanks in advance!
Just some idea: Does it somehow equivalent to finding the minimal number of swap needed to sort an array (MIN-SBR)? If yes, this is NP-Hard.
(btw, are you working on something similar to this?)
I don't think this is NP-hard. See work by Pevzner and Hannehali. A paper that comes to mind is titled ``From Cabbage to Turnip''. The idea is to find the minimum number of inversions to go from one string to another. They have a polytime algorithm for this.
The problem with "word problem" should be harder, right? – J-16 SDiZ 14
Yes, having multiple occurrences of the char in the target seems to make my problem harder than MIN-SBR, so from that angle my problem would be at least as hard as NP-complete. On the other hand, I'm not yet clear on how their central notion of block reversals would affect my claim of NP-completeness.
I sure would be nice to know whether my optimization can be solved in polynomial time. Put another way, it sure would be embarrassing if a reviewer came back with five lines of pseudocode that finds the global maximum in O(n)..
Would, Query: abc Target: ..c.b.a.a Result: cba, be three permutations (as per your use of the term) then? If so, then maybe you mean transpositions rather than permutations. A transposition is the swapping of two adjacent characters.
Good question. We're interested in a mapping from Query -> Target that has as few crossings as possible. This is very much the motivation for mentioning the bipartite edge crossings in the original post. Alternatively, you can think of maximizing a rank statistic, like Spearman's Rho, over the mapping.
Also, out of curiousity, how many unique characters are there in the query/target? – Justin Peel 18
Typical query: 100, typical target: 1000. Combinatorially, it's a huge solution space.

What's the most insidious way to pose this problem?

My best shot so far:
A delivery vehicle needs to make a series of deliveries (d1,d2,...dn), and can do so in any order--in other words, all the possible permutations of the set D = {d1,d2,...dn} are valid solutions--but the particular solution needs to be determined before it leaves the base station at one end of the route (imagine that the packages need to be loaded in the vehicle LIFO, for example).
Further, the cost of the various permutations is not the same. It can be computed as the sum of the squares of distance traveled between di -1 and di, where d0 is taken to be the base station, with the caveat that any segment that involves a change of direction costs 3 times as much (imagine this is going on on a railroad or a pneumatic tube, and backing up disrupts other traffic).
Given the set of deliveries D represented as their distance from the base station (so abs(di-dj) is the distance between two deliveries) and an iterator permutations(D) which will produce each permutation in succession, find a permutation which has a cost less than or equal to that of any other permutation.
Now, a direct implementation from this description might lead to code like this:
function Cost(D) ...
function Best_order(D)
for D1 in permutations(D)
Found = true
for D2 in permutations(D)
Found = false if cost(D2) > cost(D1)
return D1 if Found
Which is O(n*n!^2), e.g. pretty awful--especially compared to the O(n log(n)) someone with insight would find, by simply sorting D.
My question: can you come up with a plausible problem description which would naturally lead the unwary into a worse (or differently awful) implementation of a sorting algorithm?
I assume you're using this question for an interview to see if the applicant can notice a simple solution in a seemingly complex question.
[This assumption is incorrect -- MarkusQ]
You give too much information.
The key to solving this is realizing that the points are in one dimension and that a sort is all that is required. To make this question more difficult hide this fact as much as possible.
The biggest clue is the distance formula. It introduces a penalty for changing directions. The first thing an that comes to my mind is minimizing this penalty. To remove the penalty I have to order them in a certain direction, this ordering is the natural sort order.
I would remove the penalty for changing directions, it's too much of a give away.
Another major clue is the input values to the algorithm: a list of integers. Give them a list of permutations, or even all permutations. That sets them up to thinking that a O(n!) algorithm might actually be expected.
I would phrase it as:
Given a list of all possible
permutations of n delivery locations,
where each permutation of deliveries
(d1, d2, ...,
dn) has a cost defined by:
Return permutation P such that the
cost of P is less than or equal to any
other permutation.
All that really needs to be done is read in the first permutation and sort it.
If they construct a single loop to compare the costs ask them what the big-o runtime of their algorithm is where n is the number of delivery locations (Another trap).
This isn't a direct answer, but I think more clarification is needed.
Is di allowed to be negative? If so, sorting alone is not enough, as far as I can see.
For example:
d0 = 0
deliveries = (-1,1,1,2)
It seems the optimal path in this case would be 1 > 2 > 1 > -1.
Edit: This might not actually be the optimal path, but it illustrates the point.
YOu could rephrase it, having first found the optimal solution, as
"Give me a proof that the following convination is the most optimal for the following set of rules, where optimal means the smallest number results from the sum of all stage costs, taking into account that all stages (A..Z) need to be present once and once only.
Convination:
A->C->D->Y->P->...->N
Stage costs:
A->B = 5,
B->A = 3,
A->C = 2,
C->A = 4,
...
...
...
Y->Z = 7,
Z->Y = 24."
That ought to keep someone busy for a while.
This reminds me of the Knapsack problem, more than the Traveling Salesman. But the Knapsack is also an NP-Hard problem, so you might be able to fool people to think up an over complex solution using dynamic programming if they correlate your problem with the Knapsack. Where the basic problem is:
can a value of at least V be achieved
without exceeding the weight W?
Now the problem is a fairly good solution can be found when V is unique, your distances, as such:
The knapsack problem with each type of
item j having a distinct value per
unit of weight (vj = pj/wj) is
considered one of the easiest
NP-complete problems. Indeed empirical
complexity is of the order of O((log
n)2) and very large problems can be
solved very quickly, e.g. in 2003 the
average time required to solve
instances with n = 10,000 was below 14
milliseconds using commodity personal
computers1.
So you might want to state that several stops/packages might share the same vj, inviting people to think about the really hard solution to:
However in the
degenerate case of multiple items
sharing the same value vj it becomes
much more difficult with the extreme
case where vj = constant being the
subset sum problem with a complexity
of O(2N/2N).
So if you replace the weight per value to distance per value, and state that several distances might actually share the same values, degenerate, some folk might fall in this trap.
Isn't this just the (NP-Hard) Travelling Salesman Problem? It doesn't seem likely that you're going to make it much harder.
Maybe phrasing the problem so that the actual algorithm is unclear - e.g. by describing the paths as single-rail railway lines so the person would have to infer from domain knowledge that backtracking is more costly.
What about describing the question in such a way that someone is tempted to do recursive comparisions - e.g. "can you speed up the algorithm by using the optimum max subset of your best (so far) results"?
BTW, what's the purpose of this - it sounds like the intent is to torture interviewees.
You need to be clearer on whether the delivery truck has to return to base (making it a round trip), or not. If the truck does return, then a simple sort does not produce the shortest route, because the square of the return from the furthest point to base costs so much. Missing some hops on the way 'out' and using them on the way back turns out to be cheaper.
If you trick someone into a bad answer (for example, by not giving them all the information) then is it their foolishness or your deception that has caused it?
How great is the wisdom of the wise, if they heed not their ego's lies?

Resources