Min cost matching with outliers - algorithm

Given a complete bipartite graph G = (V1, V2; E), |V1|=|V2|=n and a non-negative cost for each edge the min cost bipartite matching problem finds a partition of G to n pairs of vertices connected by an edge, such that the total sum of the edges costs is minimized.
This problem can be solved using the min cost flow algorithm, by adding a source and sink vertices connected to each group with a weight 0 and a capacity 1.
But what if instead we get as an input a number m < n and want to find a partition of m pairs such that the total cost is minimized?
At first I thought we can just add another vertex at the beginning which is connected to the original source with weight 0 and capacity m and call it the new source, that way the maximum flow would be m and it should choose only m pairs.
However when I ran this algorithm using boost's min cost flow function a lot of times there were 2 big problems:
1) The flow in an edge wasn't always an integer (i.e. instead of 0 or 1 the flow was 0.5 for example).
2) There were many possible (non-integer) solutions so even for the same input with different order the algorithm outputted different results.
The moment I set m to be n both of these problems were resolved.
So my question is: is there a way to solve this problems and if not is there another algorithm that can solve the min cost bipartite matching with outliers problem?

I just found out the algorithm I described in the question and said that didn't work actually did work and it happened because of floating point error caused inside boosts min cost flow function, when I multiplied all the costs by 10000 all the problems were resolved.

Related

Shortest Path in a Directed Acyclic Graph with two types of costs

I am given a directed acyclic graph G = (V,E), which can be assumed to be topologically ordered (if needed). The edges in G have two types of costs - a nominal cost w(e) and a spiked cost p(e).
The goal is to find the shortest path from a node s to a node t which minimizes the following cost:
sum_e (w(e)) + max_e (p(e)), where the sum and maximum are taken over all edges in the path.
Standard dynamic programming methods show that this problem is solvable in O(E^2) time. Is there a more efficient way to solve it? Ideally, an O(E*polylog(E,V)) algorithm would be nice.
---- EDIT -----
This is the O(E^2) solution I found using dynamic programming.
First, order all costs p(e) in an ascending order. This takes O(Elog(E)) time.
Second, define the state space consisting of states (x,i) where x is a node in the graph and i is in 1,2,...,|E|. It represents "We are in node x, and the highest edge weight p(e) we have seen so far is the i-th largest".
Let V(x,i) be the length of the shortest path (in the classical sense) from s to x, where the highest p(e) encountered was the i-th largest. It's easy to compute V(x,i) given V(y,j) for any predecessor y of x and any j in 1,...,|E| (there are two cases to consider - the edge y->x is has the j-th largest weight, or it does not).
At every state (x,i), this computation finds the minimum of about deg(x) values. Thus the complexity is O(|E| * sum_(x\in V) deg(x)) = O(|E|^2), as each node is associated to |E| different states.
I don't see any way to get the complexity you want. Here's an algorithm that I think would be practical in real life.
First, reduce the graph to only vertices and edges between s and t, and do a topological sort so that you can easily find shortest paths in O(E) time.
Let W(m) be the minimum sum(w(e)) cost of paths max(p(e)) <= m, and let P(m) be the smallest max(p(e)) among those shortest paths. The problem solution corresponds to W(m)+P(m) for some cost m. Note that we can find W(m) and P(m) simultaneously in O(E) time by finding a shortest W-cost path, using P-cost to break ties.
The relevant values for m are the p(e) costs that actually occur, so make a sorted list of those. Then use a Kruskal's algorithm variant to find the smallest m that connects s to t, and calculate P(infinity) to find the largest relevant m.
Now we have an interval [l,h] of m-values that might be the best. The best possible result in the interval is W(h)+P(l). Make a priority queue of intervals ordered by best possible result, and repeatedly remove the interval with the best possible result, and:
stop if the best possible result = an actual result W(l)+P(l) or W(h)+P(h)
stop if there are no p(e) costs between l and P(h)
stop if the difference between the best possible result and an actual result is within some acceptable tolerance; or
stop if you have exceeded some computation budget
otherwise, pick a p(e) cost t between l and P(h), find a shortest path to get W(t) and P(t), split the interval into [l,t] and [t,h], and put them back in the priority queue and repeat.
The worst case complexity to get an exact result is still O(E2), but there are many economies and a lot of flexibility in how to stop.
This is only a 2-approximation, not an approximation scheme, but perhaps it inspires someone to come up with a better answer.
Using binary search, find the minimum spiked cost θ* such that, letting C(θ) be the minimum nominal cost of an s-t path using edges with spiked cost ≤ θ, we have C(θ*) = θ*. Every solution has either nominal or spiked cost at least as large as θ*, hence θ* leads to a 2-approximate solution.
Each test in the binary search involves running Dijkstra on the subset with spiked cost ≤ θ, hence this algorithm takes time O(|E| log2 |E|), well, if you want to be technical about it and use Fibonacci heaps, O((|E| + |V| log |V|) log |E|).

Error in an approach to Maximum Bipartite matching

A bipartite graph with a source and sink is given as shown below. The capacity of every edge is 1 unit :
Source : GeeksforGeeks
I'm trying to find the maximum flow from source to sink. One approach would be using the Ford-Fulkerson Algorithm for Maximum Flow Problem, which is applicable to all the graphs.
I found a simple approach to find the maximum flow(too simple to be correct!) and I'm not able to find any error in the approach.
Approach :
c1 = Count the number of vertices having non zero number of edges originating from it ,in the list of vertices having outgoing edges.
c2 = Count the number of vertices having non zero number of edges converging into it ,in the list of vertices having incoming edges.
The max flow would be the minimum of both these numbers,i.e., min(c1,c2).[Since any path needs one vertex from the outgoing vertices list, and other from incoming vertices list.]
Any help would be appreciated.
Consider a graph like
*--*
/
/
* *
/
/
*--*
(The patch of working by connected component doesn't fix things; connect the lower left to the upper right.)
Don't have an exact answer, but I have an iterative algorithm that works.
To me you clearly need to equilibrate the flow, so that it is distributed among the left vertices that can send it to the right vertices that can receive it.
Suppose you model your situation with a matrix A containing the bipartite links. You can then assume that if your matrix contains exactly the amount of flow (between 0 and 1) you want to pass in an edge, then the total flow given this decision is b=A*a where a is a vector of ones.
If you start with the maximum capacity for A, putting all the edges at one and all the others at 0, you might have some elements of b with more than 1, but you can reduce their corresponding elements of A so they pass less flow.
Then you can revert the flow and see what is the maximum reception capacity of your bipartite part, and test it with a=A'b.
Again you might have elements of a that are more than 1 meaning that an ideal flow would be greater than the possible capacity from source to the element, and reduce these elements in the flow of A.
With this ping-pong algorithm and reducing the corrections progressively, you are guaranteed to converge on the optimal matrix.
Given a final b=Aa with a vector of ones, your maximal flow will be sum(b).
See the octave code below, I use B as the converging matrix, and let me know your comments.
A=[0 1 0 1;1 0 0 1;1 1 0 0;0 0 1 0];
B=A;
repeat
b=B*ones(4,1);
B=B.*([.8 .8 .8 1]'*ones(1,4));
a=B'*ones(4,1)
B=B.*(ones(4,1)*[.9 .9 1 .9]);
until converge
maxflow=sum(b)

Partitioning graph into 2 sets of vertices

I want to partition a connected graph into 2 sets of vertices, such that the difference of sum of edge-weights among vertices of each set is minimized.
For example, if a graph consists of vertices 1,2,3,4,5, consider this partition:
Set A - {1,2,3}
Set B - {4,5}
Sum A = {w(1 2) + w(2 3) + w(1 3)}
Sum B = {w(4 5)}
Diff = abs(Sum A - Sum B) ... (This is one possible partition difference.)
So, how do I find a partition such that the difference is minimized?
This problem is NP hard because it is at least as hard as the partition problem.
Sketch of proof
Consider a partition problem where we have the numbers {1,2,3,4,5} that we wish to partition into two sets with as small a difference as possible.
Construct the graph shown below:
If someone comes up with an algorithm to solve your problem you can use the algorithm to partition this graph into two sets such that the sum of weights within each set is minimized.
In the optimal solution the blue and green nodes must be placed into different sets (because we have an edge with weight infinity connecting them). The remaining nodes will be connected to either the blue or green nodes. Call the ones connected to blue set1, and the ones connected to green set2. This partition will give the optimal answer to the partition problem.
Greedy algorithm
However, depending on the structure of your graph and values of the weights you may well be able to do a reasonable job.
For example, you could try:
Choose a random permutation of vertices
Loop through each vertex and assign to set 1 or 2 according to whichever minimises the objective function (which is just evaluated over the vertices assigned so far)
Repeat this algorithm a few times and keep track of the best score.
When you get down to just a few vertices left to be assigned, you could also try a brute force evaluation of all possible partitions of the remaining vertices to search for a good solution.
The following algorithmic sketch is based on Iterated Local Search. The idea is to greedily optimize the current solution until a local optimal solution is found. Then disturb this solution to overcome the local optimal solution. Always keep track of the best solution found so far.
Randomly divide the set of vertice into V1 and V2
Iterate
Calculate the costs (edge-weight-difference) of your current division
Select two random vertices v1 from V1 and v2 from V2
Check whether swapping these vertices (move v1 to V2 and v2 to V1) would lead to lower costs (edge-weight-difference). If so, swap vertices v1 and v2, else keep the sets.
Disturb a converged solution by swapping half of the vertices in V1 with half of the vertices in V2. Goto 2.
Iterated Local Search is a surprisingly effective and practical heuristic -- even for NP-complete problems.

Min-cost-flow to max-flow

Is there a reduction from the min cost flow problem to the max-flow problem? Or viceversa? I would like to use a min cost flow algorithm to solve a max-flow problem.
Sorry I think I misunderstand the question the first time. Yes, Minimum Cost is a special case for max flow. Rather than max flow, min cost assumes that after going through each edge, there is a cost to the flow. Therefore, if you set the cost at each edge to be zero, then min cost is reduced to the max flow.
Edit:
Since min cost problem needs a pre-defined required flow to send to begin with. You will need to run the above algorithm (with cost of edge c(u, v) = 0) for multiple times to search for the maximum value. For a given range of values, binary search can be used to more efficiently locate the max
Do you mean Min Cut Max Flow? (Edit: I do not think you meant this, but this is the basis of proving max flow, worth looking at if you have not)
I will be easier to understand if you drop a graph and do a min cut yourself.
Add a cost (per unit flow) of -1 to each edge, then use your minimise cost algorithm. That will maximise the flow.
The accepted answer may be practical. Proofing that Max-Flow is a special case of Min-Cost-Flow there is another possibility. My solution takes one iteration of the minimum-mean-cycle-cancelling algorithm in O(m^3 n^2 log n) (cause c is not conservative):
1. set c(e) = 0 for all edges in G
2. add edge (t,s) with inf capacity and c((t,s)) = -1
3. start MIN-MEAN-CYCLE-CANCELLING on modified graph G'
correctness: Algorithm is searching for residual circles with negative weight. As long as there is an augmentive path from s to t there are negative weighted residual circles.

Minimum number of days required to solve a list of questions

There are N problems numbered 1..N which you need to complete. You've arranged the problems in increasing difficulty order, and the ith problem has estimated difficulty level i. You have also assigned a rating vi to each problem. Problems with similar vi values are similar in nature. On each day, you will choose a subset of the problems and solve them. You've decided that each subsequent problem solved on the day should be tougher than the previous problem you solved on that day. Also, to not make it boring, consecutive problems you solve should differ in their vi rating by at least K. What is the least number of days in which you can solve all problems?
Input:
The first line contains the number of test cases T. T test cases follow. Each case contains an integer N and K on the first line, followed by integers v1,...,vn on the second line.
Output:
Output T lines, one for each test case, containing the minimum number of days in which all problems can be solved.
Constraints:
1 <= T <= 100
1 <= N <= 300
1 <= vi <= 1000
1 <= K <= 1000
Sample Input:
2
3 2
5 4 7
5 1
5 3 4 5 6
Sample Output:
2
1
This is one of the challenge from interviewstreet.
Below is my approach
Start from 1st question and find out max possible number of question can be solve and remove these questions from the question list.Now again start from first element of the remainning list and do this till now size of the question list is 0.
I am getting wrong answer from this method so looking for some algo to solve this challenge.
Construct a DAG of problems in the following way. Let pi and pj be two different problems. Then we will draw a directed edge from pi to pj if and only if pj can be solved directly after pi on the same day, consecutively. Namely, the following conditions have to be satisfied:
i < j, because you should solve the less difficult problem earlier.
|vi - vj| >= K (the rating requirement).
Now notice that each subset of problems that is chosen to be solved on some day corresponds to the directed path in that DAG. You choose your first problem, and then you follow the edges step by step, each edge in the path corresponds to the pair of problems that have been solved consecutively on the same day. Also, each problem can be solved only once, so any node in our DAG may appear only in exactly one path. And you have to solve all the problems, so these paths should cover all the DAG.
Now we have the following problem: given a DAG of n nodes, find the minimal number of non-crossing directed paths that cover this DAG completely. This is a well-known problem called Path cover. Generally speaking, it is NP-hard. However, our directed graph is acyclic, and for acyclic graphs it can be solved in polynomial time using reduction to the matching problem. Maximal matching problem, in its turn, can be solved using Hopcroft-Karp algorithm, for example. The exact reduction method is easy and can be read, say, on Wikipedia. For each directed edge (u, v) of the original DAG one should add an undirected edge (au, bv) to the bipartite graph, where {ai} and {bi} are two parts of size n.
The number of nodes in each part of the resulting bipartite graph is equal to the number of nodes in the original DAG, n. We know that Hopcroft-Karp algorithm runs in O(n2.5) in the worst case, and 3002.5 ≈ 1 558 845. For 100 tests this algorithm should take under a 1 second in total.
The algorithm is simple. First, sort the problems by v_i, then, for each problem, find the number of problems in the interval (v_i-K, v_i]. The maximum of those numbers is the result. The second phase can be done in O(n), so the most costly operation is sorting, making the whole algorithm O(n log n). Look here for a demonstration of the work of the algorithm on your data and K=35 in a spreadsheet.
Why does this work
Let's reformulate the problem to the problem of graph coloring. We create graph G as follows: vertices will be the problems and there will be an edge between two problems iff |v_i - v_j| < K.
In such graph, independent sets exactly correspond to sets of problems doable on the same day. (<=) If the set can be done on a day, it is surely an independent set. (=>) If the set doesn't contain two problems not satisfying the K-difference criterion, you can just sort them according to the difficulty and solve them in this order. Both condition will be satisfied this way.
Therefore, it easily follows that colorings of graph G exactly correspond to schedules of the problems on different days, with each color corresponding to one day.
So, we want to find the chromaticity of graph G. This will be easy once we recognize the graph is an interval graph, which is a perfect graph, those have chromaticity equal to cliqueness, and both can be found by a simple algorithm.
Interval graphs are graphs of intervals on the real line, edges are between intervals that intersect. Our graph, as can be easily seen, is an interval graph (for each problem, assign an interval (v_i-K, v_i]. It can be easily seen that the edges of this interval graph are exactly the edges of our graph).
Lemma 1: In an interval graph, there exist a vertex whose neighbors form a clique.
Proof is easy. You just take the interval with the lowest upper bound (or highest lower bound) of all. Any intervals intersecting it have the upper bound higher, therefore, the upper bound of the first interval is contained in the intersection of them all. Therefore, they intersect each other and form a clique. qed
Lemma 2: In a family of graphs closed on induced subgraphs and having the property from lemma 1 (existence of vertex, whose neighbors form a clique), the following algorithm produces minimal coloring:
Find the vertex x, whose neighbors form a clique.
Remove x from the graph, making its subgraph G'.
Color G' recursively
Color x by the least color not found on its neighbors
Proof: In (3), the algorithm produces optimal coloring of the subgraph G' by induction hypothesis + closeness of our family on induced subgraphs. In (4), the algorithm only chooses a new color n if there is a clique of size n-1 on the neighbors of x. That means, with x, there is a clique of size n in G, so its chromaticity must be at least n. Therefore, the color given by the algorithm to a vertex is always <= chromaticity(G), which means the coloring is optimal. (Obviously, the algorithm produces a valid coloring). qed
Corollary: Interval graphs are perfect (perfect <=> chromaticity == cliqueness)
So we just have to find the cliqueness of G. That is easy easy for interval graphs: You just process the segments of the real line not containing interval boundaries and count the number of intervals intersecting there, which is even easier in your case, where the intervals have uniform length. This leads to an algorithm outlined in the beginning of this post.
Do we really need to go to path cover? Can't we just follow similar strategy as LIS.
The input is in increasing order of complexity. We just have to maintain a bunch of queues for tasks to be performed each day. Every element in the input will be assigned to a day by comparing the last elements of all queues. Wherever we find a difference of 'k' we append the task to that list.
For ex: 5 3 4 5 6
1) Input -> 5 (Empty lists so start a new one)
5
2) 3 (only list 5 & abs(5-3) is 2 (k) so append 3)
5--> 3
3) 4 (only list with last vi, 3 and abs(3-4) < k, so start a new list)
5--> 3
4
4) 5 again (abs(3-5)=k append)
5-->3-->5
4
5) 6 again (abs(5-6) < k but abs(4-6) = k) append to second list)
5-->3-->5
4-->6
We just need to maintain an array with last elements of each list. Since order of days (when tasks are to be done is not important) we can maintain a sorted list of last tasks therefore, searching a place to insert the new task is just looking for the value abs(vi-k) which can be done via binary search.
Complexity:
The loop runs for N elements. In the worst case , we may end up querying ceil value using binary search (log i) for many input[i].
Therefore, T(n) < O( log N! ) = O(N log N). Analyse to ensure that the upper and lower bounds are also O( N log N ). The complexity is THETA (N log N).
The minimum number of days required will be the same as the length of the longest path in the complementary (undirected) graph of G (DAG). This can be solved using Dilworth's theorem.

Resources