I have a DAG with N nodes, i.e., 1, 2, ..., N, and each node has a weight (we can call it time) x_1, x_2, ..., x_N. I want to do a topological sorting but the difficulty is that I have an objective function when sorting. My objective function is to minimize the total time between several pairs of nodes.
For example, I have a DAG with 7 nodes, and I want a specific topological sorting such that (1,3) + (2,4) is minimized, where (A,B) denotes the total time of the nodes placed between A and B in the sorted order. For instance, if we have the sorting [1, 6, 3, 2, 5, 4, 7], then (1,3) = x_6 and (2,4) = x_5. Based on the DAG, I want to find a sorting that minimizes (1,3) + (2,4).
I have been thinking about this problem for a while. Generating all possible topological sorts (reference link) and calculating the objective function for each one is always a possible solution, but it takes too much time if N is large. It was also suggested that I use branch-and-bound pruning when generating all possible sorts (I am not very familiar with branch-and-bound, but I don't think it would dramatically reduce the complexity).
Is there any (optimal or heuristic) algorithm for this kind of problem? It would be perfect if the algorithm could also be applied to other objective functions, such as minimizing the total starting time of some nodes. Any suggestion is appreciated.
PS: Alternatively, is it possible to formulate this problem as an integer linear program?
One way to solve this is as follows:
First we run the all-pairs shortest path algorithm Floyd-Warshall. It can be coded in essentially 5 lines and runs in O(V^3) time. It computes the shortest paths between every pair of vertices in the graph, i.e., it outputs a V x V matrix of shortest-path costs.
It's trivial to modify this algorithm so that we also get the count of vertices on each of the O(V^2) paths. Now we can eliminate all paths that have fewer than N vertices. We order the remaining paths by their cost and then test each of them to see whether the topological sort property is violated. The first path that does not violate it is the desired result.
The last step, testing the topological sort property, can be performed in O(V+E) for each of the O(V^2) paths, which yields a worst-case runtime of O(V^4). In practice this should be fast, though, because Floyd-Warshall can be made very cache friendly and we would only be testing a small fraction of the O(V^2) paths. Also, if your DAG is not dense, you might be able to speed up the topological testing as well with appropriate data structures.
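For reference, here is a minimal sketch of that first step, assuming the edge weights are given as a dense V x V matrix with float('inf') for missing edges (the function and variable names are mine, not from the answer):

```python
# Sketch only: Floyd-Warshall that also tracks how many vertices lie on each
# stored shortest path. w is an n x n matrix with w[i][j] = weight or INF.
INF = float('inf')

def floyd_warshall_with_counts(w, n):
    dist = [[w[i][j] for j in range(n)] for i in range(n)]
    count = [[2 if w[i][j] < INF else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        dist[i][i], count[i][i] = 0, 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    # the two subpaths share vertex k, so count it only once
                    count[i][j] = count[i][k] + count[k][j] - 1
    return dist, count
```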
Here's an idea:
For simplicity, first suppose that you have a single pair to optimize (I'll comment on the general case later), and suppose you already have your graph topologically sorted into an array.
Take the array segment starting at the lower (in terms of your topological order) node of the pair, say l, and ending at the higher node, say h. For every single node sorted between l and h, calculate whether it is bounded from below by l and/or bounded from above by h. You can calculate the former property by marking nodes in an "upward" BFS from l, cutting at nodes sorted above h; and similarly, the latter by marking in a "downward" BFS from h, cutting at nodes sorted below l. The complexity of either pass will be O( B*L ), where B is the branching factor, and L is the number of nodes originally sorted between l and h.
Now, all nodes that are not bounded from above by h can be moved above h, and all nodes that are not bounded from below by l can be moved below l (the two sets may overlap), all without violating the topological sorting of the array, provided that the original sorted order within each group of relocated nodes is preserved.
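As a rough sketch of that single-pair step, assuming the graph is given as successor/predecessor adjacency lists and `order` is an existing topological sort (all names here are illustrative):

```python
from collections import deque

# Sketch only: shrink the gap between l and h in an existing topological order.
# order: list of nodes in topological order (l appears before h)
# succ, pred: dicts mapping each node to its successors / predecessors
def compact_pair(order, succ, pred, l, h):
    pos = {v: i for i, v in enumerate(order)}
    il, ih = pos[l], pos[h]
    segment = order[il + 1:ih]               # nodes currently between l and h

    def marked(start, nbrs):
        # BFS from start, cutting at nodes sorted outside the l..h segment
        seen, queue = set(), deque([start])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in seen and il < pos[v] < ih:
                    seen.add(v)
                    queue.append(v)
        return seen

    above_l = marked(l, succ)                # bounded from below by l
    below_h = marked(h, pred)                # bounded from above by h

    move_down = [v for v in segment if v not in above_l]
    move_up = [v for v in segment if v in above_l and v not in below_h]
    stay = [v for v in segment if v in above_l and v in below_h]

    # relative order inside each relocated group is preserved
    return order[:il] + move_down + [l] + stay + [h] + move_up + order[ih + 1:]
```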
This same procedure can be applied to as many pairs as needed, provided that the segments that they cut from the original sorting order do not overlap.
If any two pairs overlap, say (l1, h1) and (l2, h2), so that e.g. l1 < l2 < h1 < h2 in the original sorted order, you have the following two cases:
1) In the trivial case where h1 and l2 happen to be unrelated in the topological order, you should be able to optimize the two pairs mostly independently of each other, taking some care to move either l2 above h1 or h1 below l2 (but not, e.g., h1 below l1, should that turn out to be possible).
2) If l2 < h1 in the topological order, you can treat both pairs as the single pair (l1, h2), and then possibly apply the procedure once more to (l2, h1).
As it's not clear what the complete process will achieve in the nontrivial overlapping case, especially if you have more complicated overlapping patterns, it may turn out to be better to treat all pairs uniformly, regardless of overlap. In that case, the order can be reprocessed repeatedly as long as each run yields an improvement over the previous one (I'm not sure whether the procedure will be monotonic in terms of the objective function, though; probably not).
Suppose I have a list of tuples:
[(a1, b1), (a2, b2), ..., (an, bn)]
I could sort them by the a's, or the b's, but not both.
But what if I want to sort them by both as well as possible? A good way to measure how well they're sorted is the number of pairs of "a" values that are in the wrong order, plus the number of pairs of "b" values that are in the wrong order. What algorithm will do this quickly?
An algorithm that minimizes a different loss function would also be interesting but I think what would be best for my application is to minimize discordant pairs.
Update: it turns out there is a very simple solution in O(n log n) time.
Just sort the list by the a components, using the b components as a tiebreaker. (Or vice versa.) Or, if they are numbers, you can sort by the sum of the two components, a + b. This can be done in O(n log n) time using any efficient comparison-based sorting algorithm.
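In Python, for instance, the default tuple comparison already gives the "a first, b as tiebreaker" order (the variable names below are just for illustration):

```python
pairs = [(3, 1), (2, 4), (3, 3), (2, 2)]

pairs.sort()                                   # by a, ties broken by b
# or, for numeric components, by the sum of the two:
by_sum = sorted(pairs, key=lambda p: p[0] + p[1])
```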
This solution works because the loss function can be written as a sum of individual loss functions, one for each pair of elements. For pairs like (2, 4) vs. (3, 3), which will be discordant whatever their relative order, the individual loss for that pair is always 1. Similarly, when two tuples are equal, such as (4, 5) vs. (4, 5), the individual loss for that pair is 0 whatever their relative order.
The only non-constant individual loss functions are for pairs where one component is bigger and the other is bigger-or-equal, e.g. (2, 4) vs. (3, 4), or (2, 4) vs. (3, 5). Each of the sorting orders described above will put all such pairs in their optimal order relative to each other. This simultaneously minimises every term in the loss function, so therefore it minimises the total loss.
Note that this specifically only works for a list of 2-tuples. For 3-tuples or higher, a solution as simple as this won't work, but the ideas in my original answer can be adapted (see below). However, adapting them won't be easy, since the graph will not necessarily be acyclic.
Original answer (expanded)
This can be modelled as a kind of graph problem. Each pair (a_i, b_i) is a node in the graph.
Insert a directed edge i → j whenever both a_i <= a_j and b_i <= b_j, unless both are equal. For any pairs where a_i < a_j and b_i > b_j, or vice versa, and any pairs where a_i = a_j and b_i = b_j, there is no edge. The existence of an edge is equivalent to a preference between the relative ordering of node i and node j; if there is no edge, then the loss is the same whatever the relative ordering of those two nodes.
For the case of 2-tuples, it is quite straightforward to show that this graph is acyclic, from the way it is constructed. So a topological sorting algorithm will find an ordering such that all edges point "forwards", i.e. node i appears before node j whenever there is an edge i → j. This ordering clearly minimises the loss function, because it simultaneously minimises the individual losses of every pair i, j.
The only discordant pairs in the resulting order are those which are necessarily discordant; those where, whichever way round that pair ends up, either the a components are out of order, or the b components are.
Actually implementing a topological sorting algorithm doesn't require constructing the graph explicitly; you can just treat the "nodes" and "edges" as an implicit graph, using comparisons to find the edges, instead of looking them up in some kind of graph data structure. To avoid scanning the whole list to find a node's neighbours on every iteration, you can take advantage of the fact that the edge relation is transitive: if node A only has edges to nodes B, C and D, then node B can only have edges to C and D. This will take O(n²) time in the worst case, but should be more efficient than brute force.
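As a brute-force sketch of that implicit-graph approach (without the transitivity optimization), assuming a list of 2-tuples with comparable components; it checks for edges by direct comparison rather than storing them:

```python
# Sketch only: topological sort of the implicit comparison graph for 2-tuples.
# Computing indegrees by pairwise comparison makes this O(n^2), as noted above.
def topo_sort_pairs(pairs):
    n = len(pairs)

    def edge(i, j):
        # edge i -> j iff both components of i are <= those of j
        # and the tuples are not identical
        return (pairs[i][0] <= pairs[j][0] and pairs[i][1] <= pairs[j][1]
                and pairs[i] != pairs[j])

    indeg = [sum(edge(j, i) for j in range(n)) for i in range(n)]
    ready = [i for i in range(n) if indeg[i] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in range(n):
            if edge(u, v):
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return [pairs[i] for i in order]
```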
I have unordered elements in a vector. There's no transitivity; if element A > B and B > C, A > C doesn't need to be true.
I need to sort them so that each element is greater than the one following it.
For example, if we have three elements A, B and C, and:
A<B, A>C
B<C, B>A
C<A, C>B
and the vector is <A,B,C>, we would need to sort it as <A,C,B>.
I've done the sorting with bubble sort and other classic sorting algorithms that require O(n^2) time, but that doesn't look efficient.
Is there a more efficient algorithm?
Thanks.
Consider your data as a graph, where the elements of your array A, B, C are vertices, and there is a directed edge from vertex x to vertex y whenever x > y.
The requirement to order the elements such that each adjacent pair x, y satisfies x>y is, in the graph view of your problem, the same as finding a Hamiltonian path through the vertices.
There are no apparent restrictions on your > relation (for example, it's not transitive, and it's OK for it to contain cycles), so the graph you get is an arbitrary directed graph. So you're left with the problem of finding a Hamiltonian path in an arbitrary graph, which is an NP-complete problem. That means you're not going to find an easy, efficient solution.
What you are seeking is called topological sorting.
I ended up using Quicksort. I choose a pivot element and partition the elements into two halves: those less than the pivot and those greater than the pivot. Quicksort is then executed recursively on those two halves. That way I have O(n log n) complexity on average.
Thanks for the comments!
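As a minimal sketch of that quicksort-style pass, assuming the comparison is supplied as a beats(x, y) function that is defined for every pair of distinct elements (the names are mine):

```python
# Sketch only: order items so that each element "beats" the one after it.
def order_by_beats(items, beats):
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    winners = [x for x in rest if beats(x, pivot)]      # they beat the pivot
    losers = [x for x in rest if not beats(x, pivot)]   # the pivot beats them
    # the two boundary comparisons around the pivot hold by construction
    return order_by_beats(winners, beats) + [pivot] + order_by_beats(losers, beats)
```

With the A/B/C example from the question (B beats A, A beats C, C beats B), this returns [B, A, C], which also satisfies the adjacency requirement.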
The goal is to sort a list X of n unknown variables {x0, x1, x2, ... x(n-1)} using a list C of m comparison results (booleans). Each comparison is between two of the n variables, e.g. x2 < x5, and the pair indices for each of the comparisons are fixed and given ahead of time. Also given: All pairs in C are unique (even when flipped, e.g. the pair x0, x1 means there is no pair x1, x0), and never compare a variable against itself. That means C has at most n*(n-1)/2 entries.
So the question is: can I prove that my list C of m comparisons is sufficient to sort the list X? Obviously it would be if C had the largest possible length (contained all possible comparisons). But what about shorter lists?
Then, once it has been proven that C contains enough information to sort, how do I actually go about performing the sort?
Let's imagine that you have the collection of objects to be sorted and form a graph from them with one node per object. You're then given a list of pairs indicating how the comparisons go. You can think of these as edges in the graph: if you know that object x compares less than object y, then you can draw an edge from x to y.
Assuming that the results of the comparisons are consistent - that is, you don't have any cycles - you should have a directed acyclic graph.
Think about what happens if you topologically sort this DAG. What you'll end up with is one possible ordering of the values that's consistent with all of the constraints. The reason for this is that in a topological ordering, you won't place an element x before an element y if there is any transitive series of edges leading from y to x, and there's a transitive series of edges leading from y to x if there's a chain of comparisons that transitively indicates that y precedes x.
You can actually make a stronger claim: the set of all topological orderings of the DAG is exactly the set of all possible orderings that satisfy all the constraints. We've already argued that every topological ordering satisfies all the constraints, so all we need to do now is argue that every sequence satisfying all the constraints is a valid topological ordering. The argument here is essentially that if you obey all the constraints, you never place any element in the sequence before something that it transitively compares less than, so you never place any element in the sequence before something that has a path to it.
This then gives us a nice way to solve the problem: take the graph formed this way and see if it has exactly one topological ordering. If so, then that ordering is the unique sorted order. If not, then there are two or more orderings.
So how best to go about this? Well, one of the standard algorithms for doing a topological sort is to annotate each node with its indegree, then repeatedly pull off a node of indegree zero and adjust the indegrees of its successors. The DAG has exactly one topological ordering if in the course of performing this algorithm, at every stage there is exactly one node of indegree zero, since in that case the topological ordering is forced.
With the right setup and data structures, you can implement this to run in time O(n + m), where n is the number of nodes and m is the number of constraints. I'll leave those details as a proverbial exercise to the reader. :-)
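As a sketch of that indegree-based check, assuming the comparisons arrive as index pairs (i, j) meaning x_i < x_j (the function and variable names below are illustrative):

```python
from collections import defaultdict, deque

# Sketch only: Kahn's algorithm with the "exactly one indegree-zero node at
# every step" check. n variables, comparisons as (i, j) pairs meaning x_i < x_j.
def unique_sorted_order(n, comparisons):
    succ = defaultdict(list)
    indeg = [0] * n
    for i, j in comparisons:
        succ[i].append(j)
        indeg[j] += 1
    ready = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while ready:
        if len(ready) > 1:
            return None      # more than one choice: C cannot force the order
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) < n:
        return None          # leftover nodes mean the comparisons form a cycle
    return order             # the unique ascending order of variable indices
```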
Your problem can be reduced to the well-known Topological sort.
To prove that "C contains enough information to sort" is to prove the uniqueness of topological sort:
If a topological sort has the property that all pairs of consecutive vertices in the sorted order are connected by edges, then these edges form a directed Hamiltonian path in the DAG. If a Hamiltonian path exists, the topological sort order is unique; no other order respects the edges of the path. Conversely, if a topological sort does not form a Hamiltonian path, the DAG will have two or more valid topological orderings, for in this case it is always possible to form a second valid ordering by swapping two consecutive vertices that are not connected by an edge to each other. Therefore, it is possible to test in linear time whether a unique ordering exists, and whether a Hamiltonian path exists, despite the NP-hardness of the Hamiltonian path problem for more general directed graphs (Vernet & Markenzon 1997).
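As a sketch, that test needs only one topological sort plus a scan over consecutive vertices (names below are illustrative):

```python
# Sketch only: given any valid topological order and the edge list, the order
# is unique iff every pair of consecutive vertices is joined by an edge,
# i.e. the edges contain a directed Hamiltonian path.
def is_unique_topological_order(order, edges):
    edge_set = set(edges)                      # edges as (u, v) tuples
    return all((order[i], order[i + 1]) in edge_set
               for i in range(len(order) - 1))
```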
A graph of size n is given, along with a subset of size m of its nodes. Find all nodes that are at a distance <= k from ALL nodes of the subset.
E.g., A->B->C->D->E is the graph, subset = {A,C}, k = 2.
Now, E is at distance <= 2 from C, but not from A, so it should not be counted.
I thought of running breadth-first search from each node in the subset and taking the intersection of the respective answers.
Can it be further optimized?
I went through many posts on SO, but they all point to k-d trees, which I don't understand, so is there any other way?
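For reference, a direct sketch of the BFS-and-intersect idea from the question, assuming the graph is an adjacency-list dict (names are illustrative):

```python
from collections import deque

# Sketch only: BFS a ball of radius k around every subset node and intersect.
def within_k_of_all(graph, subset, k):
    def ball(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue                    # don't expand beyond radius k
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return set(dist)

    result = None
    for s in subset:
        b = ball(s)
        result = b if result is None else result & b
    return result if result is not None else set()
```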
I can think of two non-asymptotic (I believe) optimizations:
If you're done with BFS from one of the subset nodes, delete all nodes that have distance > k from it
Start with the two nodes in the subset whose distance is largest to get the smallest possible leftover graph
Of course this doesn't help if k is large (close to n); I have no idea in that case. I am positive, however, that k-d trees are not applicable to general graphs :)
Niklas B.'s optimizations can be applied to both of the following optimizations.
Optimization #1: Modify the BFS to do the intersection as it runs rather than afterwards.
The BFS-and-intersection approach seems to be the way to go. However, the BFS does redundant work: specifically, it expands nodes that it doesn't need to expand (after the first BFS). This can be resolved by merging the intersection step into the BFS.
The solution seems to be to keep two sets of nodes, call them "ToVisit" and "Visited", rather than labeling nodes as visited or not.
The new rules of the BFS are as follows:
Only nodes in ToVisit are expanded upon by the BFS. They are then moved from ToVisit to Visited to prevent being expanded twice.
The algorithm returns the Visited set as its result and any nodes left in ToVisit are discarded. The Visited set is then used as the ToVisit set for the next node.
The first node either uses a standard BFS algorithm or takes ToVisit to be the set of all nodes. Either way, its result becomes the ToVisit set for the second node.
This works better if the ToVisit set is small on average, which tends to be the case when m and k are much smaller than N.
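A sketch of those rules as stated, assuming the graph is an adjacency-list dict; the candidate set shrinks with every subset node processed (the names are illustrative):

```python
from collections import deque

# Sketch only: Optimization #1 as described above. Expansion is limited to the
# surviving candidates in to_visit; whatever is reached within distance k
# becomes the candidate set for the next subset node.
def intersecting_bfs(graph, src, k, to_visit):
    visited = {src} if src in to_visit else set()
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue                        # radius reached, stop expanding
        for v in graph[u]:
            if v in dist or v not in to_visit:
                continue                    # expand only surviving candidates
            dist[v] = dist[u] + 1
            visited.add(v)
            queue.append(v)
    return visited

def query_with_running_intersection(graph, subset, k):
    candidates = set(graph)                 # the first node scans everything
    for s in subset:
        candidates = intersecting_bfs(graph, s, k, candidates)
    return candidates
```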
Optimization #2: Pre-compute the distances if there are enough queries so queries just do intersections.
This is, however, incompatible with the first optimization. If there are a sufficient number of queries over differing subsets and k values, then it is better to find the distances between every pair of nodes ahead of time, at a cost of O(VE).
This way you only need to do the intersections, which is O(V*M*Q), where Q is the number of queries, M is the average subset size over the queries and V is the number of nodes. If O(M*Q) > O(E) is expected, this approach should be less work. Noting the two most distant nodes is also useful: any k at or above their distance will always return the set of all vertices, so the query cost in that case is just O(V).
The distance data should then be stored in four forms.
The first is "kCount[A][k] = number of nodes with distance k or less from A". This provides an alternative to Niklas B.'s suggestion of "Start with the two nodes in the subset whose distance is largest to get the smallest possible leftover graph" in the case that O(m) > O(sqrt(V)) since finding the smallest is O(m^2) and it may be better to avoid trying to find the best choice for the starting pair and just pick a good choice. You can start with the two nodes in the subset with the smallest value for the given k in this data structure. You could also just sort the nodes in the subset by this metric and do the intersections in that order.
The second is "kMax[A] = max k for A", which can be done using a hashmap/dictionary. If the k >= this value, then this this one can be skipped unless kCount[A][kMax[A]] < (number of vertices), meaning not all nodes are reachable from A.
The third is "kFrom[A][k] = set of nodes k distance from A", since k is valid from 0 to the max distance, an hashmap/dictionary to an array/list could be used here rather than a nested hashmap/dictionary. This allows for space and time efficient*** creating the set of nodes with distance <= k from A.
The fourth is "dist[A][B] = distance from A to B", this can be done using a nested hashmap/dictionary. This allows for handling the intersection checks fairly quickly.
* If space isn't an issue, this structure can instead store all the nodes at distance k or less from A, but that requires O(V^3) space and hence time. The main benefit is that it also allows storing a separate list of the nodes at distance greater than k, so the algorithm can use whichever of the two sets is smaller: intersection with the dist <= k set, or set subtraction with the dist > k set.
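A minimal sketch of the precompute-then-intersect idea, using just the dist[A][B] form (one BFS per node on an unweighted graph; names are illustrative):

```python
from collections import deque

# Sketch only: precompute dist[A][B] with one BFS per node (O(VE) total for an
# unweighted graph), then answer each query with intersections alone.
def all_pairs_bfs(graph):
    dist = {}
    for src in graph:
        d = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[src] = d
    return dist

def answer_query(dist, subset, k):
    # O(V*M) per query: keep a node iff every subset node reaches it within k
    return {v for v in dist
            if all(dist[s].get(v, float('inf')) <= k for s in subset)}
```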
Add a new node (let's say s) and connect it to all the m given nodes.
Then find all the nodes that are at a distance less than or equal to k+1 from s and subtract the m given nodes from the result. T(n) = O(V+E)
So I have a problem that is basically like this: I have a bunch of strings, and I want to construct a DAG such that every path corresponds to a string and vice versa. However, I have the freedom to permute my strings arbitrarily. The order of characters does not matter. The DAGs that I generate have a cost associated with them. Basically, the cost of a branch in the DAG is proportional to the length of its child paths.
For example, let's say I have the strings BAAA, CAAA, DAAA, and I construct a DAG representing them without permuting them. I get:
() -> (B, C, D) -> A -> A -> A
where the tuple represents branching.
A cheaper representation for my purposes would be:
() -> A -> A -> A -> (B, C, D)
The problem is: given n strings, permute the strings such that the corresponding DAG has the cheapest cost, where the cost is the total number of nodes we visit, with multiplicity, if we traverse the graph from the source in depth-first, left-to-right order.
So the cost of the first example is 12, because we must visit the A's multiple times on the traversal. The cost of the second example is 6, because we only visit the A's once before we deal with the branches.
I have a feeling this problem is NP-hard. It seems like a question about formal languages, and I'm not familiar enough with those sorts of algorithms to figure out how I should go about the reduction. I don't need a complete answer per se, but if someone could point out a class of well-known problems that seem related, I would much appreciate it.
To rephrase:
Given words w1, …, wn, compute permutations x1 of w1, …, xn of wn to minimize the size of the trie storing x1, …, xn.
Assuming an alphabet of unlimited size, this problem is NP-hard via a reduction from vertex cover. (I believe it might be fixed-parameter tractable in the size of the alphabet.) The reduction is easy: given a graph, let each vertex be its own letter and create a two-letter word for each edge.
There is exactly one node at depth zero, and as many nodes at depth two as there are edges. The possible sets of nodes at depth one are exactly the sets of nodes that are vertex covers.
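To make the correspondence concrete, here is a small brute-force sketch (the names are mine) that counts trie nodes for every choice of word orientations on a triangle graph; the minimum comes out to 1 + (size of a minimum vertex cover) + (number of edges):

```python
from itertools import product

# Sketch only: each edge becomes a two-letter word, a permutation of a word is
# a choice of which endpoint comes first, and the depth-1 trie nodes are
# exactly the chosen first letters, i.e. a vertex cover of the graph.
def trie_size(words):
    nodes = set()
    for w in words:
        for i in range(1, len(w) + 1):
            nodes.add(w[:i])              # every distinct prefix is a trie node
    return len(nodes) + 1                 # +1 for the root

edges = [('a', 'b'), ('b', 'c'), ('c', 'a')]       # triangle graph
best = min(
    trie_size([u + v if flip else v + u
               for (u, v), flip in zip(edges, flips)])
    for flips in product([False, True], repeat=len(edges))
)
# best == 6: the root, a minimum vertex cover of size 2, and one leaf per edge
```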