So I have a problem that is basically like this: I have a bunch of strings, and I want to construct a DAG such that every path corresponds to a string and vice versa. However, I have the freedom to permute my strings arbitrarily. The order of characters does not matter. The DAGs that I generate have a cost associated with them. Basically, the cost of a branch in the DAG is proportional to the length of its child paths.
For example, let's say I have the strings BAAA, CAAA, DAAA, and I construct a DAG representing them without permuting them. I get:
() -> (B, C, D) -> A -> A -> A
where the tuple represents branching.
A cheaper representation for my purposes would be:
() -> A -> A -> A -> (B, C, D)
The problem is: Given n strings, permute the strings such that the corresponding DAG has the cheapest cost, where the cost function is: If we traverse the graph from the source in depth first, left to right order, the total number of nodes we visit, with multiplicity.
So the cost of the first example is 12, because we must visit the A's multiple times on the traversal. The cost of the second example is 6, because we only visit the A's once before we deal with the branches.
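For concreteness, here is a small Python sketch of that cost function (the dict-of-lists DAG encoding and node names are just illustrative):

def traversal_cost(dag, v):
    # total node visits below v in a left-to-right depth-first walk,
    # counting a shared node once per visit
    return sum(1 + traversal_cost(dag, child) for child in dag.get(v, []))

# First example: () -> (B, C, D) -> A -> A -> A, with the A-chain shared
dag1 = {"root": ["B", "C", "D"], "B": ["A1"], "C": ["A1"], "D": ["A1"],
        "A1": ["A2"], "A2": ["A3"]}
print(traversal_cost(dag1, "root"))  # 12

# Second example: () -> A -> A -> A -> (B, C, D)
dag2 = {"root": ["A1"], "A1": ["A2"], "A2": ["A3"], "A3": ["B", "C", "D"]}
print(traversal_cost(dag2, "root"))  # 6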
I have a feeling this problem is NP-hard. It seems like a question about formal languages, and I'm not familiar enough with those sorts of algorithms to figure out how I should go about the reduction. I don't need a complete answer per se, but if someone could point out a class of well-known problems that seem related, I would much appreciate it.
To rephrase:
Given words w1, …, wn, compute permutations x1 of w1, …, xn of wn to minimize the size of the trie storing x1, …, xn.
Assuming an alphabet of unlimited size, this problem is NP-hard via a reduction from vertex cover. (I believe it might be fixed-parameter tractable in the size of the alphabet.) The reduction is easy: given a graph, let each vertex be its own letter and create a two-letter word for each edge.
There is exactly one node at depth zero, and as many nodes at depth two as there are edges. The possible sets of nodes at depth one are exactly the sets of nodes that are vertex covers.
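To make this concrete, here is a small Python illustration of the reduction (the helper is mine): each two-letter word contributes one depth-1 node (its first letter) and one depth-2 node, so the minimum trie size over all orientations is |E| plus the minimum vertex cover.

def trie_size(words):
    # number of non-root nodes in the trie of the (already permuted) words
    root, count = {}, 0
    for w in words:
        node = root
        for ch in w:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count

# Triangle graph a-b, b-c, c-a; its minimum vertex cover has size 2
print(trie_size(["ab", "cb", "ca"]))  # orient toward cover {a, c}: 2 + 3 = 5
print(trie_size(["ab", "bc", "ca"]))  # all three letters at depth 1: 3 + 3 = 6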
Related
I have a situation in which I need to find optimal split positions in an array based on some costs. The problem goes like this:
As input I have an array of events ordered by an integer timestamp and as output I want an array of indexes which split the input array into many parts. The output array needs to be optimal (more on this below).
struct e {
    int Time;
    // other values
};
Example Input: [e0, e1, e2, e3, e4, e5, ..., e10]
Example output: [0, 2, 6, 8] (the 0 at the start is always there)
Using the above examples I can use the split indices to partition the original array into 5 subarrays like so:
[ [], [e0, e1], [e2, e3, e4, e5], [e6, e7], [e8, e9, e10] ]
The cost of this example solution is the total cost of "distances" between the subarrays:
double distance(e[] arr1, e[] arr2) {
    // return distance from arr1 to arr2; order matters, so non-Euclidean
}
total cost = distance([], [e0, e1]) + distance([e0, e1], [e2, e3, e4, e5]) + ...
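In code, the cost of a candidate output could be computed like this (a Python sketch; distance is the function above, and the leading 0 in splits is kept):

def total_cost(events, splits, distance):
    bounds = splits + [len(events)]
    # the leading empty group models distance([], first_subarray)
    groups = [[]] + [events[bounds[i]:bounds[i + 1]]
                     for i in range(len(bounds) - 1)]
    return sum(distance(groups[i], groups[i + 1])
               for i in range(len(groups) - 1))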
At this point it is helpful to understand the actual problem.
The input array represents musical notes at some time (i.e. a MIDI file) and I want to split the MIDI file into optimal guitar fingerings. Hence each subarray of notes represents a chord (or a melody grouped together in a single fingering). The distance between two subarrays represents the difficulty of moving from one fingering pattern to another. The goal is to find the easiest (optimal) way to play a song on a guitar.
I have not yet proved it, but to me this looks like an NP-complete or NP-hard problem. Therefore it could be helpful if I could reduce it to another known problem and use a known divide-and-conquer algorithm. Alternatively, one could solve this with a more traditional search algorithm (A*?). That could be efficient because we can filter out bad solutions much faster than in a regular graph (the input is technically a complete graph, since each fingering can be reached from any other fingering).
I'm not able to decide what the best approach would be so I am currently stuck. Any tips or ideas would be appreciated.
It's probably not NP-hard.
Form a graph whose nodes correspond one-to-one to (contiguous) subarrays. For each pair of nodes u, v where u's right boundary is v's left, add an arc from u to v whose length is determined by distance(). Create an artificial source with an outgoing arc to each node whose left boundary is the beginning. Create an artificial sink with an incoming arc from each node whose right boundary is the end.
Now we can find a shortest path from the source to the sink via the linear-time (in the size of the graph, so cubic in the parameter of interest) algorithm for directed acyclic graphs.
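Here is a hedged sketch of that formulation, written as a memoized recursion over split points rather than an explicit graph (events and distance are the asker's inputs; recovering the split indices themselves would just need argmin bookkeeping):

from functools import lru_cache

def best_split_cost(events, distance):
    n = len(events)

    @lru_cache(maxsize=None)
    def dp(i, j):
        # cheapest cost of a split whose final subarray is events[i:j]
        if i == 0:
            return distance([], events[0:j])
        return min(dp(k, i) + distance(events[k:i], events[i:j])
                   for k in range(i))

    return min(dp(i, n) for i in range(n))

With O(n^2) states and O(n) transitions each, this is the cubic bound mentioned above.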
This is a bit late, but I did solve this problem. I ended up using a slightly modified version of Dijkstra's algorithm, but any pathfinding algorithm could work. I tried A* as well, but finding a good heuristic proved to be extremely difficult because of the non-Euclidean nature of the problem.
The main change to Dijkstra is that at some point I can already tell that some unvisited nodes cannot provide an optimal result. This speeds up the algorithm a lot, which is also one of the reasons I didn't opt for A*.
The algorithm essentially works like this:
search():
    visited = set()
    costs = map<node, double>()
    // add the initial node to costs
    while costs is not empty:
        node = the node with minimum cost in costs
        if node.Index == songEnd:
            // backtrack from node to get the fingering for the song
            return solution
        visited.add(node)
        foreach neighbour of node:
            if neighbour in visited:
                continue
            newCost = costs[node] + distance(node, neighbour)
            add neighbour with newCost to costs (keep the smaller cost if it is already there)
        // we can remove nodes that have a higher cost but
        // a lower index than our current node:
        // because every fingering position is reachable
        // from any other fingering position,
        // a higher-cost node that is not as far along as the current one
        // cannot lead to an optimal solution
        remove suboptimal nodes from costs
        remove node from costs
    // if costs ends up empty then it is impossible to play this song
    // on guitar (e.g. more than 6 notes played at the same time)
The magic of this algorithm happens in fetching the neighbours and calculating the distance between two nodes but those are irrelevant for this question.
I've been reading some papers on multicut algorithms for segmenting graph structures. I'm specifically interested in this work which proposes an algorithm to solve an extension of the multicut problem:
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Keuper_Efficient_Decomposition_of_ICCV_2015_paper.pdf
Regarding the edge costs, it says:
...for any pair of nodes, a real-valued cost (reward) to all decompositions for which these nodes are in distinct components
Fair enough. It further says that the solution to the multicut problem is a simple binary vector of length equal to the number of edges in the graph, in which a '1' indicates that the corresponding edge separates two vertices belonging to distinct graph components:
for every edge vw ∈ E ∪ F, y(v,w) = 1 if and only if v and w are in distinct components of G.
But then the optimization problem is written as a minimization over feasible decompositions y:

    min_y  sum_{e ∈ E∪F} c_e y_e
This doesn't seem to make sense. If the edge weights depict rewards for that edge connecting nodes in distinct components, shouldn't this be a maximization problem? And in either case, if all edge weights are positive, wouldn't that lead to a trivial solution where y is an all-zeros vector? The above expression is followed by some constraints in the paper, but I couldn't figure out how any of those prevent this outcome.
Furthermore, when it later tries to generate an initial solution using Greedy Additive Edge Contraction, it says:
Alg. 1 starts from the decomposition into single nodes. In every iteration, a pair of neighboring components is joined for which the join decreases the objective value maximally. If no join strictly decreases the objective value, the algorithm terminates.
Again, if edge weights are rewards for keeping nodes separate, wouldn't joining any two nodes reduce the reward? And even if I assume for a second that edge weights are penalties for keeping nodes separate, wouldn't this method simply lump all the nodes into a single component?
The only way I see this working is if the edge weights are a balanced combination of positive and negative values, but I'm pretty sure I'm missing something, because this constraint isn't mentioned anywhere in the literature.
Just citing this multicut lecture:

Minimum Multicut. The input consists of a weighted, undirected graph G = (V, E) with a non-negative weight c_k for every edge in E, and a set of terminal pairs {(s1,t1), (s2,t2), ..., (sk,tk)}. A multicut is a set of edges whose removal disconnects each of the terminal pairs.
I think from this definition it is clear that the multicut problem is a minimization problem over the accumulated weight of the edges selected to be cut. Maximizing the weight would of course be trivial (remove all edges). No?
Better late than never, here's the answer:
The weights c_e for cutting the edge e are not restricted to be positive as defined in Definition 1. In fact, Equation (7) specifies that they are log-ratios of two complementary probabilities. That means if the estimated probability for edge e being cut is greater than 0.5, then c_e will be negative. If it's smaller, then c_e will be positive.
While the trivial "all edges cut" solution is still feasible, it is quite unlikely that it is also optimal in any "non-toy" instance where you will have edges that are more likely to be cut while others are more likely to remain.
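To illustrate the sign behaviour (assuming the log-ratio has the form c_e = log((1 - p_e) / p_e), with p_e the estimated probability that edge e is cut, which matches the description above):

import math

for p in (0.1, 0.5, 0.9):
    c = math.log((1 - p) / p)
    print(f"p_e = {p}: c_e = {c:+.3f}")
# p_e = 0.1: c_e = +2.197   (edge probably kept: cutting is penalized)
# p_e = 0.5: c_e = +0.000
# p_e = 0.9: c_e = -2.197   (edge probably cut: cutting is rewarded)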
I have a DAG with N nodes, i.e., 1, 2, ..., N, and each node has a weight (we can call it time) x_1, x_2, ..., x_N. I want to do a topological sorting but the difficulty is that I have an objective function when sorting. My objective function is to minimize the total time between several pairs of nodes.
For example, I have a DAG with 7 nodes, and I want a specific topological sort such that (1,3) + (2,4) is minimized, where (A,B) denotes the time between two nodes A and B, i.e., the total weight of the nodes placed between A and B in the sorted order. For instance, in the sort [1, 6, 3, 2, 5, 4, 7], (1,3) = x_6 and (2,4) = x_5. Based on the DAG, I want to find a sort that minimizes (1,3) + (2,4).
I have been thinking about this problem for a while. Generating all possible topological sorts (reference link) and evaluating the objective function for each is always a possible solution, but it takes too much time if N is large. It was also suggested that I use branch-and-bound pruning when generating all possible sorts (I am not very familiar with branch-and-bound, but I think it won't dramatically reduce the complexity).
Any (optimal or heuristic) algorithm for this kind of problem? It would be perfect if the algorithm can also be applied to other objective functions such as minimizing the total starting time for some nodes. Any suggestion is appreciated.
PS: Or alternatively, is it possible to formulate this problem as a linear integer optimization problem?
One way to solve this is as follows:
First we run the all-pairs shortest path algorithm Floyd-Warshall. This algorithm can be coded in essentially 5 lines of code and it runs in O(V^3) time. It generates the shortest paths between each pair of vertices in the graph, i.e., it produces a V x V matrix of shortest paths as its output.
It's trivial to modify this algorithm so that we also get the count of vertices included in each of the O(V^2) paths. So now we can eliminate all paths that have fewer than V vertices. For the remaining paths, we order them by cost and then test each of them to see whether the topological-sort property is violated. If it is not violated, we have found our desired result.
The last step above, i.e., testing the topological-sort property, can be performed in O(V+E) for each of the O(V^2) paths. This yields a worst-case runtime of O(V^4). However, in practice this should be fast, because Floyd-Warshall can be made very cache friendly and we would only be testing a small fraction of the O(V^2) paths in reality. Also, if your DAG is not dense, you might be able to optimize the topological testing as well with appropriate data structures.
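A sketch of the modified Floyd-Warshall (my reading of "trivial to modify"): alongside the shortest-path weight dist[u][v] we carry cnt[u][v], the number of vertices on that path, endpoints included.

INF = float("inf")

def floyd_warshall_with_counts(n, edges):
    # edges is a list of (u, v, w) arcs of the DAG, vertices numbered 0..n-1
    dist = [[INF] * n for _ in range(n)]
    cnt = [[0] * n for _ in range(n)]
    for u, v, w in edges:
        dist[u][v] = w
        cnt[u][v] = 2
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    cnt[i][j] = cnt[i][k] + cnt[k][j] - 1  # k counted once
    return dist, cnt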
Here's an idea:
For simplicity, first suppose that you have a single pair to optimize (I'll comment on the general case later), and suppose you already have your graph topologically sorted into an array.
Take the array segment starting at the lower (in terms of your topological order) node of the pair, say l, and ending at the higher node, say h. For every single node sorted between l and h, calculate whether it is bounded from below by l and/or bounded from above by h. You can calculate the former property by marking nodes in an "upward" BFS from l, cutting at nodes sorted above h; and similarly, the latter by marking in a "downward" BFS from h, cutting at nodes sorted below l. The complexity of either pass will be O(B*L), where B is the branching factor and L is the number of nodes originally sorted between l and h.
Now, all nodes that are not bounded from above by h can be moved above h, and all nodes that are not bounded from below by l can be moved below l (the two sets may overlap), all without violating the topological sorting of the array, provided that the original sorted order within each group of nodes relocated upward or downward is preserved.
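A sketch of one marking pass (the other is symmetric, walking predecessors from h); succ maps each node to its successor list and pos gives a node's index in the current sorted order, both assumed:

from collections import deque

def bounded_below_by_l(l, h, succ, pos):
    # marks every node reachable from l without passing a node sorted
    # above h; marked nodes are exactly those bounded from below by l
    marked, queue = {l}, deque([l])
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if pos[v] <= pos[h] and v not in marked:
                marked.add(v)
                queue.append(v)
    return marked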
This same procedure can be applied to as many pairs as needed, provided that the segments that they cut from the original sorting order do not overlap.
If any two pairs overlap, say (l1, h1) and (l2, h2), so that e.g. l1 < l2 < h1 < h2 in the original sorted order, you have the following two cases:
1) In the trivial case where h1 and l2 happen to be unrelated in the topological order, you should be able to optimize the two pairs mostly independently of each other, taking care only to move either l2 above h1 or h1 below l2 (but not, e.g., h1 below l1, should that turn out to be possible).
2) If l2 < h1 in the topological order, you can treat both pairs as the single pair (l1, h2), and then possibly apply the procedure once more to (l2, h1).
As it's not clear what the complete process will achieve in the nontrivial overlapping case, especially if you have more complicated overlapping patterns, it may turn out to be better to treat all pairs uniformly, regardless of overlapping. In this case, the order can be processed repeatedly while each run yields an improvement over the previous (I'm not sure if the procedure will be monotonic in terms of the objective function though -- probably not.)
I'd like to solve a harder version of the minimum spanning tree problem.
There are N vertices. Also there are 2M edges, numbered 1, 2, ..., 2M. The graph is connected, undirected, and weighted. I'd like to choose some edges so that the graph is still connected and the total cost is as small as possible. There is one restriction: the edge numbered 2k and the edge numbered 2k-1 are tied, so both should be chosen, or both should not be chosen. So, if I want to choose edge 3, I must choose edge 4 too.
So, what is the minimum total cost to make the graph connected?
My thoughts:
Let's call the two edges 2k-1 and 2k an edge set.
Let's call an edge valid if it merges two different components.
Let's call an edge set good if both of the edges are valid.
First add exactly m good edge sets, in increasing order of cost. Then iterate over all the edge sets in increasing order of cost, and add a set if at least one of its edges is valid. Iterate m from 0 to M.
Run Kruskal's algorithm with a variation: the cost of an edge e varies.
If an edge set which contains e is good, the cost is: (the cost of the edge set) / 2.
Otherwise, the cost is: (the cost of the edge set).
I cannot prove whether Kruskal's algorithm is still correct when the costs change like this.
Sorry for the poor English, but I'd like to solve this problem. Is it NP-hard or something, or is there a good solution? :D Thanks in advance!
As I speculated earlier, this problem is NP-hard. I'm not sure about inapproximability; there's a very simple 2-approximation (split each pair in half, retaining the whole cost for both halves, and run your favorite vanilla MST algorithm).
Given an algorithm for this problem, we can solve the NP-hard Hamilton cycle problem as follows.
Let G = (V, E) be the instance of Hamilton cycle, and let v0 be an arbitrary vertex. Clone each vertex, denoting the clone of vi by vi'. We duplicate each edge e = {vi, vj} (making a multigraph; we can do this reduction with simple graphs at the cost of clarity) and pair one copy with {v0, vi'} and the other with {v0, vj'}.
No MST can use fewer than n pairs, one to connect each cloned vertex to v0. The interesting thing is that, for a candidate with exactly n pairs like this, the other halves of the pairs can be interpreted as an oriented subgraph of G where each vertex has out-degree 1 (use the index in the cloned half as the tail). This graph connects the original vertices if and only if it is a Hamilton cycle on them.
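As an aside, here is a sketch of the 2-approximation mentioned at the top (pairs[i] = ((u1, v1), (u2, v2), w_i) is my assumed encoding): split each pair, give both halves the full pair weight, run Kruskal, and buy the whole pair whenever either half is used.

def two_approx(n, pairs):
    edges = []
    for i, (e1, e2, w) in enumerate(pairs):
        edges.append((w, e1[0], e1[1], i))  # each half keeps the whole cost
        edges.append((w, e2[0], e2[1], i))
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    chosen = set()
    for w, u, v, i in sorted(edges):
        if find(u) != find(v):
            parent[find(u)] = find(v)
            chosen.add(i)  # taking the pair also adds its sibling edge
    return chosen

Any feasible pair solution of cost W induces a connected subgraph of the split graph of cost at most 2W, so the tree found here, rounded up to whole pairs, costs at most twice the optimum.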
There are various ways to apply integer programming. Here's a simple one and a more complicated one. First we formulate a binary variable x_i for each i that is 1 if edge pair 2i-1, 2i is chosen. The problem template looks like
minimize   sum_i w_i x_i   (drop the w_i if the problem is unweighted)
subject to <connectivity>
           x_i in {0, 1} for all i.
Of course I have left out the interesting constraints :). One way to enforce connectivity is to solve this formulation with no constraints at first, then examine the solution. If it's connected, then great -- we're done. Otherwise, find a set of vertices S such that there are no edges between S and its complement, and add a constraint
sum_{i such that pair i has an edge between S and its complement} x_i >= 1
and repeat.
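A minimal sketch of that loop, assuming the PuLP library with its bundled CBC solver and a pairs[i] = ((u1, v1), (u2, v2), w_i) encoding:

import pulp

def components(n, edges):
    # union-find over vertices 0..n-1; returns the components' vertex sets
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in edges:
        parent[find(u)] = find(v)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

def solve_tied_mst(n, pairs):
    prob = pulp.LpProblem("tied_mst", pulp.LpMinimize)
    x = [pulp.LpVariable("x%d" % i, cat="Binary") for i in range(len(pairs))]
    prob += pulp.lpSum(w * x[i] for i, (_, _, w) in enumerate(pairs))
    while True:
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        chosen = [i for i in range(len(pairs)) if x[i].value() > 0.5]
        comps = components(n, [e for i in chosen for e in pairs[i][:2]])
        if len(comps) == 1:
            return chosen                 # connected: we're done
        S = comps[0]                      # a set with no chosen edge across it
        prob += pulp.lpSum(
            x[i] for i, (e1, e2, _) in enumerate(pairs)
            if (e1[0] in S) != (e1[1] in S) or (e2[0] in S) != (e2[1] in S)) >= 1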
Another way is to generate constraints like this inside of the solver working on the linear relaxation of the integer program. Usually MIP libraries have a feature that allows this. The fractional problem has fractional connectivity, however, which means finding min cuts to check feasibility. I would expect this approach to be faster, but I must apologize, as I don't have the energy to describe it in detail.
I'm not sure if it's the best solution, but my first approach would be a search using backtracking:
Of all edge pairs, mark those that could be removed without disconnecting the graph.
Remove one of these sets and find the optimal solution for the remaining graph.
Put the pair back and remove the next one instead, find the best solution for that.
This works, but it is slow and inelegant. It might be possible to rescue this approach, though, with a few adjustments that avoid unnecessary branches.
Firstly, the edge pairs that could still be removed form a set that only shrinks when going deeper. So, in the next recursion, you only need to check the previous set of possibly removable edge pairs. Also, since the order in which you remove the edge pairs doesn't matter, you shouldn't reconsider any edge pairs that were already considered before.
Then, checking whether two nodes are connected is expensive. If you cache the alternative route for an edge, you can check relatively quickly whether that route still exists. If it doesn't, you have to run the expensive check, because even though that one route ceased to exist, there might still be others.
Then, some more pruning of the tree: Your set of removable edge pairs gives a lower bound to the weight that the optimal solution has. Further, any existing solution gives an upper bound to the optimal solution. If a set of removable edges doesn't even have a chance to find a better solution than the best one you had before, you can stop there and backtrack.
Lastly, be greedy. Using a regular greedy algorithm will not give you an optimal solution, but it will quickly raise the bar for any solution, making pruning more effective. Therefore, attempt to remove the edge pairs in the order of their weight loss.
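Putting those pieces together, a hedged sketch of the search (connected(n, edges) is an assumed helper, e.g. union-find, testing that the edges span all n vertices; pairs[i] = ((u1, v1), (u2, v2), w_i)):

def min_total_cost(n, pairs, connected):
    order = sorted(range(len(pairs)), key=lambda i: -pairs[i][2])  # heaviest first
    total = sum(w for _, _, w in pairs)
    best = [total]                          # incumbent: keep every pair

    def edges_of(kept):
        return [e for i in kept for e in pairs[i][:2]]

    def recurse(kept, pos, cost):
        best[0] = min(best[0], cost)        # every visited `kept` is connected
        # bound: removing every remaining candidate still cannot beat the incumbent
        removable = sum(pairs[order[k]][2] for k in range(pos, len(order)))
        if cost - removable >= best[0]:
            return
        for k in range(pos, len(order)):    # never reconsider earlier pairs
            trial = kept - {order[k]}
            if connected(n, edges_of(trial)):
                recurse(trial, k + 1, cost - pairs[order[k]][2])

    recurse(frozenset(range(len(pairs))), 0, total)
    return best[0]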
The goal is to sort a list X of n unknown variables {x0, x1, x2, ... x(n-1)} using a list C of m comparison results (booleans). Each comparison is between two of the n variables, e.g. x2 < x5, and the pair indices for each of the comparisons are fixed and given ahead of time. Also given: All pairs in C are unique (even when flipped, e.g. the pair x0, x1 means there is no pair x1, x0), and never compare a variable against itself. That means C has at most n*(n-1)/2 entries.
So the question is: can I prove that my list C of m comparisons is sufficient to sort the list X? Obviously it would be if C had the largest possible length (contained all possible comparisons). But what about shorter lists?
Then, if it has been proven that C contains enough information to sort, how do I actually go about performing the sort?
Let's imagine that you have the collection of objects to be sorted and form a graph from them with one node per object. You're then given a list of pairs indicating how the comparisons go. You can think of these as edges in the graph: if you know that object x compares less than object y, then you can draw an edge from x to y.
Assuming that the results of the comparisons are consistent - that is, you don't have any cycles - you should have a directed acyclic graph.
Think about what happens if you topologically sort this DAG. What you'll end up with is one possible ordering of the values that's consistent with all of the constraints. The reason for this is that in a topological ordering, you won't place an element x before an element y if there is any transitive series of edges leading from y to x, and there's a transitive series of edges leading from y to x if there's a chain of comparisons that transitively indicates that y precedes x.
You can actually make a stronger claim: the set of all topological orderings of the DAG is exactly the set of all possible orderings that satisfy all the constraints. We've already argued that every topological ordering satisfies all the constraints, so all we need to do now is argue that every sequence satisfying all the constraints is a valid topological ordering. The argument here is essentially that if you obey all the constraints, you never place any element in the sequence before something that it transitively compares less than, so you never place any element in the sequence before something that has a path to it.
This then gives us a nice way to solve the problem: take the graph formed this way and see if it has exactly one topological ordering. If so, then that ordering is the unique sorted order. If not, then there are two or more orderings.
So how best to go about this? Well, one of the standard algorithms for doing a topological sort is to annotate each node with its indegree, then repeatedly pull off a node of indegree zero and adjust the indegrees of its successors. The DAG has exactly one topological ordering if in the course of performing this algorithm, at every stage there is exactly one node of indegree zero, since in that case the topological ordering is forced.
With the right setup and data structures, you can implement this to run in time O(n + m), where n is the number of nodes and m is the number of constraints. I'll leave those details as a proverbial exercise to the reader. :-)
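For what it's worth, one possible realization of that uniqueness test (Kahn's algorithm over an adjacency list adj for nodes 0..n-1, checking that the indegree-zero frontier never holds more than one node):

from collections import deque

def unique_topological_order(n, adj):
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    frontier = deque(u for u in range(n) if indeg[u] == 0)
    order = []
    while frontier:
        if len(frontier) > 1:
            return None                   # two candidates: order is not forced
        u = frontier.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return order if len(order) == n else None  # None also when there is a cycle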
Your problem can be reduced to the well-known topological sort.
Proving that "C contains enough information to sort" amounts to proving the uniqueness of the topological sort:
If a topological sort has the property that all pairs of consecutive vertices in the sorted order are connected by edges, then these edges form a directed Hamiltonian path in the DAG. If a Hamiltonian path exists, the topological sort order is unique; no other order respects the edges of the path. Conversely, if a topological sort does not form a Hamiltonian path, the DAG will have two or more valid topological orderings, for in this case it is always possible to form a second valid ordering by swapping two consecutive vertices that are not connected by an edge to each other. Therefore, it is possible to test in linear time whether a unique ordering exists, and whether a Hamiltonian path exists, despite the NP-hardness of the Hamiltonian path problem for more general directed graphs (Vernet & Markenzon 1997).
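In code, given any one topological order, the Hamiltonian-path test from the excerpt is just a scan over consecutive pairs (edge_set is assumed to hold the given (u, v) comparison pairs):

def order_is_unique(order, edge_set):
    return all((order[i], order[i + 1]) in edge_set
               for i in range(len(order) - 1))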