Disjoint Sets of Strings - Minimization Problem - algorithm

There are two sets, s1 and s2, each containing pairs of letters. A pair is only equivalent to another pair if their letters are in the same order, so they're essentially strings (of length 2). The sets s1 and s2 are disjoint, neither set is empty, and each pair of letters only appears once.
Here is an example of what the two sets might look like:
s1 = { ax, bx, cy, dy }
s2 = { ay, by, cx, dx }
The set of all letters in (s1 ∪ s2) is called sl. The set sr is a set of letters of your choice, but must be a subset of sl. Your goal is to define a mapping m from letters in sl to letters in sr, which, when applied to s1 and s2, will generate the sets s1' and s2', which also contain pairs of letters and must also be disjoint.
The most obvious m just maps each letter to itself. In this example (shown below), s1 is equivalent to s1', and s2 is equivalent to s2' (but given any other m, that would not be the case).
a -> a
b -> b
c -> c
d -> d
x -> x
y -> y
The goal is to construct m such that sr (the set of letters on the right-hand side of the mapping) has the fewest number of letters possible. To accomplish this, you can map multiple letters in sl to the same letter in sr. Note that depending on s1 and s2, and depending on m, you could potentially break the rule that s1' and s2' must be disjoint. For example, you would obviously break that rule by mapping every letter in sl to a single letter in sr.
So, given s1 and s2, how can someone construct an m that minimizes sr, while ensuring that s1' and s2' are disjoint?
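For very small instances, an optimal m can be found by brute force: try target alphabets of increasing size k and test every assignment of letters to k targets. The following is only a sketch to make the problem concrete (the function name is mine, and the search is exponential in |sl|):

```python
# Brute-force search for a minimal mapping m (exponential in |sl|;
# only practical for tiny instances).
from itertools import product

def minimal_mapping(s1, s2):
    letters = sorted({ch for pair in s1 | s2 for ch in pair})
    for k in range(1, len(letters) + 1):
        # try every assignment of the |sl| letters to k target letters
        for assign in product(range(k), repeat=len(letters)):
            m = dict(zip(letters, assign))
            img1 = {(m[a], m[b]) for a, b in s1}
            img2 = {(m[a], m[b]) for a, b in s2}
            if img1.isdisjoint(img2):
                return m, k  # first k that works is minimal
    # unreachable: the identity mapping always succeeds at k = |sl|

s1 = {"ax", "bx", "cy", "dy"}
s2 = {"ay", "by", "cx", "dx"}
m, k = minimal_mapping(s1, s2)  # for this example, k = 2 suffices
```

For the example sets above, mapping a, b, x to one target letter and c, d, y to another already keeps s1' and s2' disjoint, so two letters are enough.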
This problem is NP-hard. To show this, we can reduce graph coloring to it.
Proof:
Let G=(V,E) be the graph whose minimal coloring we want to compute. Formally, we want to compute the chromatic number of the graph, which is the lowest k for which G is k-colorable.
To reduce the graph coloring problem to the problem described here, define
s1 = { zu : (u,v) ∈ E }
s2 = { zv : (u,v) ∈ E }
where z is a magic letter that is unused other than in constructing s1 and s2.
By construction of the sets above, for any mapping m and any edge (u,v) we must have m(u) != m(v); otherwise the disjointness of s1' and s2' would be violated. Thus any optimal sr is an optimal set of colors (with the exception of z) for coloring the graph G, and m is the mapping that defines which node is assigned which color. QED.
The proof above may give the intuition that researching graph coloring approximations would be a good start, and it probably would, but there is a confounding factor: for two pairs ab ∈ s1 and cd ∈ s2, if m(a) = m(c) then m(b) != m(d). Logically, this is equivalent to the constraint m(a) != m(c) or m(b) != m(d). Constraints of this type, in isolation, do not map naturally to an analogous graph problem (because of the "or").
There are ways to formulate this problem as a (binary) ILP and solve it as such. This would likely perform (slightly) worse than a custom-designed and tuned branch-and-bound implementation (assuming you want to find the optimal solution), but it would work with turn-key solvers.
If you are more interested in approximations (possibly with guaranteed approximation ratios), I would investigate an SDP relaxation of the problem and an appropriate rounding scheme. That level of work is roughly what one would invest in a small-to-medium-sized research paper.

Related

Bit masks for subsets from two different sets

I have two sets:
set1 = {i1, i2, i3, ..., iN1}
set2 = {k1, k2, k3, ..., kN2}
For any single set of n items, I can represent all possible subsets using bit masks 0 to 2^n - 1.
Similarly, how can I represent all subsets of set1 and set2 that contain at least one item from each set?
For example, {i1, i2, k1} is valid,
but {i1, i2} is invalid, as it has no item from set2.
I am trying to generate two things:
A formula that gives the count of all such subsets, like the 2^n subsets we have for a single n-item set.
A bit encoding/mask with which I can represent subsets of the above type.
This will be easier if we introduce a few extra sets of interest. Let's call the two input sets S1 and S2; we'll define sets L, C, and R (for left, center, and right). Think of these as being the Venn diagram. So, define L = S1 \ S2, the elements in S1 that aren't in S2 at all; C = S1 ∩ S2, the elements that are in both S1 and S2, and R = S2 \ S1. Let's also write l, c, and r for the sizes of these sets, respectively.
Now we're ready to answer question 1: how many subsets of S1 ∪ S2 have an element from S1 and an element from S2? There are two cases to consider: either the subset has an element in C (which satisfies both the "an element from S1" and the "an element from S2" clauses), or it has no elements in C and at least one each from L and R. In the first case, there are (2^c - 1) non-empty subsets of C and 2^(l+r) subsets of the remainder, so there are (2^c - 1)*2^(l+r) sets in that case. In the second case, there are 2^l - 1 non-empty subsets of L and 2^r - 1 non-empty subsets of R, so there are (2^l - 1) * (2^r - 1) subsets in that case. Adding up the two cases, we have a total of 2^(c+l+r) - 2^l - 2^r + 1 subsets satisfying the condition.
If you have a fancy representation of non-empty subsets, this also immediately suggests a data structure for storing these: a single bit tag for which case you're in, plus the appropriate representations of subsets and non-empty subsets in each case.
But I would probably just use a single bitmask of size c+l+r, even though there are a few "invalid" bitmasks: it's very compact, it's easy to check validity, and there are many cheap operations on bitmasks.
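This count is easy to sanity-check by brute-force enumeration; here is a short sketch (the helper name is mine):

```python
# Verify the closed-form count 2^(c+l+r) - 2^l - 2^r + 1 by enumeration.
from itertools import chain, combinations

def count_valid(S1, S2):
    U = sorted(S1 | S2)
    all_subsets = chain.from_iterable(
        combinations(U, n) for n in range(len(U) + 1))
    # valid subsets contain at least one element of S1 and one of S2
    return sum(1 for s in all_subsets if set(s) & S1 and set(s) & S2)

S1, S2 = {"i1", "i2", "a"}, {"k1", "a"}
l, c, r = len(S1 - S2), len(S1 & S2), len(S2 - S1)
assert count_valid(S1, S2) == 2**(c + l + r) - 2**l - 2**r + 1
```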

Encoding directed graph as numbers

Let's say that I have a directed graph, with a single root and without cycles. I would like to add a type on each node (for example as an integer with some custom ordering) with the following property:
if Node1.type <= Node2.type then there exists a path from Node1 to Node2
Note that topological sorting actually satisfies the reversed property:
if there exists a path from Node1 to Node2 then Node1.type <= Node2.type
so it cannot be used here.
Now note that integers with natural ordering cannot be used here because every 2 integers can be compared, i.e. the ordering of integers is linear while the tree does not have to be.
So here's an example. Assume that the graph has 4 nodes A, B, C, D and 4 arrows:
A->B, A->C, B->D, C->D
So it's a diamond. Now we can put
A.type = 00
B.type = 01
C.type = 10
D.type = 11
where on the right side we have integers in binary format. The comparison is defined bitwise:
(X <= Y) if and only if (n-th bit of X <= n-th bit of Y for all n)
So I guess such ordering could be used, the question is how to construct values from a given graph? I'm not even sure if the solution always exists. Any hints?
UPDATE: Since there is some misunderstanding about the terminology I'm using, let me be more explicit: I'm interested in a directed acyclic graph such that there is exactly one node without predecessors (a.k.a. the root) and there is at most one arrow between any two nodes. The diamond would be an example. It does not have to have one leaf (i.e. one node without successors). Each node might have multiple predecessors and multiple successors. You might say that this is a partially ordered set with a smallest element (i.e. a unique globally minimal element).
You call the relation <=, but it's not necessarily complete (that is, it may be that for a given pair a and b, neither a <= b nor b <= a holds).
Here's one idea for how to define it.
If your nodes are numbered 0, 1..., N-1, then you can define type like this:
type(i) = (1 << i) + sum(1 << (N + j), for j such that Path(i, j))
And define <= like this:
type1 <= type2 if (type1 >> N) & type2 != 0
That is, type(i) encodes the value of i in the lowest N bits, and the set of all reachable nodes in the highest N bits. The <= relation looks for the target node in the encoded set of reachable nodes.
This definition works whether or not there are cycles in the graph, and in fact it just encodes an arbitrary relation on your set of nodes.
You could make the definition a little more efficient by using ceil(log2(N)) bits to encode the node number (for a total of N + ceil(log2(N)) bits per type).
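A minimal sketch of this encoding in Python (reachability computed with a DFS; the names are mine), using the diamond A->B, A->C, B->D, C->D with the nodes numbered 0..3:

```python
# DFS-based sketch of the encoding above: the low N bits hold the node id,
# the high N bits hold the set of nodes reachable from it.
def make_types(adj):
    N = len(adj)
    def reachable(i):
        seen, stack = set(), [i]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    types = []
    for i in range(N):
        t = 1 << i
        for j in reachable(i):
            t |= 1 << (N + j)
        types.append(t)
    return types, N

def leq(t1, t2, N):
    # t1 <= t2 iff t2's node appears in t1's encoded reachable set
    return (t1 >> N) & t2 != 0

# diamond A->B, A->C, B->D, C->D with A,B,C,D numbered 0,1,2,3
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
types, N = make_types(adj)
```

With this encoding, leq(types[0], types[3], N) holds because D is reachable from A, while neither direction holds between B and C.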
For any DAG, you can define x <= y as "there's a path from x to y". This relation is a partial order. I take it that the question is how to represent this relation efficiently.
For each vertex X, define ¡X to be the set of vertices reachable from X (including X itself). The two statements
¡X is a subset of ¡Y
X is reachable from Y
are equivalent.
Encode these sets as bitsets (N-bit binary numbers), and you are set.
The question said (and continues to say) that the input is a tree, but a later edit contradicted this with an example of a diamond graph. In such non-tree cases, my algorithm below won't apply.
The existing answers work for general relations on general directed graphs, which inflates their representation sizes to O(n) bits for n vertices. Since you have a tree, a shorter O(log n)-bit representation is possible.
In a tree directed away from the root, for any two vertices u and v, the sets of leaves L(u) and L(v) reachable from u and v, respectively, must either be disjoint, or one must be a subset of the other. If they are disjoint, then u is not reachable from v (and vice versa); if one is a proper subset of the other, the one with the smaller set is reachable from the other (and in this case, the one with the smaller set will necessarily have strictly greater depth). If L(u) = L(v), then u is reachable from v if and only if depth(v) < depth(u), where depth(u) is the number of edges on the path from the root to u. (In particular, if L(u) = L(v) and depth(u) = depth(v), then u = v.)
We can encode this relationship concisely by noticing that all leaves reachable from a given vertex v occupy a contiguous segment of the leaves output by an inorder traversal of the tree. For any given vertex v, this set of leaves can therefore be represented by a pair of integers (first, last), with first identifying the first leaf (in inorder traversal order) and last the last. The test for whether a path exists from i to j is then very simple -- in pseudo-C++:
bool doesPathExist(int i, int j) {
    return x[i].first <= x[j].first && x[i].last >= x[j].last && depth[i] <= depth[j];
}
Note that if every non-leaf vertex in the tree has at least 2 children, then you don't need to bother with depths, since L(u) = L(v) implies u = v in this case. (My original version of the post made this assumption; I've now fixed it to work even when this is not the case.)
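Here is a short sketch of this leaf-interval labeling for a rooted tree stored as an adjacency dict (the data layout and names are my assumptions, not the answerer's):

```python
# Label each vertex with (first, last) leaf indices in traversal order
# and its depth, then test reachability as described above.
def label(tree, root):
    first, last, depth = {}, {}, {}
    counter = [0]  # next leaf index in traversal order
    def dfs(u, d):
        depth[u] = d
        kids = tree.get(u, [])
        if not kids:  # a leaf occupies a single position
            first[u] = last[u] = counter[0]
            counter[0] += 1
        else:
            for v in kids:
                dfs(v, d + 1)
            first[u] = first[kids[0]]
            last[u] = last[kids[-1]]
    dfs(root, 0)
    return first, last, depth

def path_exists(i, j, first, last, depth):
    return (first[i] <= first[j] and last[i] >= last[j]
            and depth[i] <= depth[j])

tree = {0: [1, 2], 1: [3, 4]}  # 0 is the root; 2, 3, 4 are leaves
first, last, depth = label(tree, 0)
```

Each vertex stores only two O(log n)-bit integers plus a depth, matching the representation size claimed above.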

Finding maximum valued subset in which PartitionProblem algorithm returns true

I've got the following assignment.
You have a multiset S of 1 <= N <= 22 elements.
Each element has a positive value of up to 10000000.
Assuming there are two subsets s1 and s2 of S such that the sum of the values of all the elements of one equals the sum of the values of all the elements of the other, and that this common sum is the highest possible, I have to return the elements of S that are not included in either of the two subsets.
It's probably been solved before; I think it's some variant of the Partition problem, but I can't find it. If anyone could point me in the right direction, that'd be great.
EDIT: An element can't be in both subsets.
This is variation of subset sum, and can be solved similarly, by increasing the dimension of the problem (and the DP matrix), and then applying a solution very similar to the original one for subset-sum, which follows the recursive formula:
D(i,x,y) = D(i-1,x,y) OR D(i-1,x-l[i],y) OR D(i-1,x,y-l[i])
                ^                ^                   ^
           not chosen     chosen for 1st set   chosen for 2nd set
and base clause:
D(0,0,0) = true
D(0,x,y) = false   if x != 0 or y != 0
D(i,x,y) = false   if x < 0 or y < 0
After calculating the DP matrix (a 3D array, actually) for this problem, all you have to do is check whether there is any entry D(n,x,x) == true for some x <= SUM/2 (where SUM is the sum of the entire original set) to determine whether there is a feasible solution.
Since you want the maximal value, the answer should be the maximal such x with D(n,x,x) = true (there could be more than one).
Finding the elements themselves can be done after finding the solution (the value of x in D(n,x,x)) by following back the DP matrix and retracing your steps as explained for similar problems such as this: How to find which elements are in the bag, using Knapsack Algorithm [and not only the bag's value]?
Total complexity of this solution is O(SUM^2 * n)
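One compact way to realize this DP in Python is to carry the set of reachable (x, y) sum pairs instead of a dense 3D array (my own variant of the formulation above; with values up to 10^7 the state space can of course still get large):

```python
# Reachable (x, y) pairs: x = sum of the first subset, y = sum of the second.
def best_equal_split(values):
    states = {(0, 0)}                # base clause D(0,0,0) = true
    for v in values:                 # each element: skipped, added to x, or added to y
        states |= ({(x + v, y) for (x, y) in states}
                   | {(x, y + v) for (x, y) in states})
    # the answer is the maximal x with an entry (x, x) reachable
    return max((x for (x, y) in states if x == y), default=0)
```

For example, best_equal_split([1, 2, 3, 4]) finds the split {1, 4} vs {2, 3} with common sum 5.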
Partition S as evenly as possible into T ∪ U (put the extra element, if any, in U). Loop through the three-way partitions of T into A ∪ B ∪ C (at most 3^11 = 177,147 of them). Store the entry |sum(A) - sum(B)| → C in a map, keeping only the value C with the lowest sum in case the key already exists.
Loop through the three-way partitions of U into D ∪ E ∪ F. Look up |sum(D) - sum(E)| in the map; if it exists with value C, then consider C ∪ F as a possibility for the elements left out (the two parts with equal sum are either A ∪ D and B ∪ E, or A ∪ E and B ∪ D).
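A sketch of this meet-in-the-middle approach in Python (all names are mine; I key the map on the signed difference sum(A) - sum(B) rather than the absolute value, which handles the "either A ∪ D and B ∪ E, or A ∪ E and B ∪ D" case automatically):

```python
from itertools import product

def three_way(items):
    # yield (sum(first part) - sum(second part), leftover) over all 3-way splits
    for labels in product(range(3), repeat=len(items)):
        diff, leftover = 0, []
        for it, lab in zip(items, labels):
            if lab == 0:
                diff += it
            elif lab == 1:
                diff -= it
            else:
                leftover.append(it)
        yield diff, leftover

def excluded_elements(S):
    S = list(S)
    T, U = S[:len(S) // 2], S[len(S) // 2:]
    table = {}
    for diff, C in three_way(T):
        if diff not in table or sum(C) < sum(table[diff]):
            table[diff] = C          # keep the leftover with the lowest sum
    best = None
    for diff, F in three_way(U):
        C = table.get(-diff)         # the two sums match when differences cancel
        if C is not None and (best is None or sum(C) + sum(F) < sum(best)):
            best = C + F
    return best

print(excluded_elements([1, 2, 3, 4]))
```

Minimizing the leftover sum is equivalent to maximizing the common subset sum, since the three quantities add up to the fixed total of S.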

An algorithm to get all connected subgraphs from graph, is it correct?

I'm trying to find a quick algorithm to obtain all connected subgraphs of an undirected graph, with the subgraph size restricted. Simple methods, such as BFS or DFS from every vertex, generate a huge number of duplicate subgraphs, so on every iteration the algorithm has to prune the subgraph set. I found the following algorithm on a Russian mathematics forum:
Procedure F(X,Y)
//X set of included vertices
//Y set of forbidden vertices to construct new subgraph
1.if |X|=k, then return;
2.construct a set T[X] of vertices that are adjacent to vertices from X (if X is an empty set, then T[X]=V) but do not belong to the sets X, Y;
3.Y1=Y;
4.Foreach v from T[X] do:
__4.1.X1=X+v;
__4.2.show subgraph X1;
__4.3.F(X1,Y1);
__4.4.Y1=Y1+v;
Initial call F(X,Y):
X, Y = empty set;
F(X,Y);
The main idea of this algorithm is the use of a "forbidden set", so that no pruning is required; the author of the algorithm said it is 300 times faster than a solution based on pruning duplicate subgraphs. But I haven't found any proof that this algorithm is correct at all.
UPDATE:
A more efficient solution was found here
Here is a Python implementation of what I believe to be your original algorithm:
from collections import defaultdict

D = defaultdict(list)

def addedge(a, b):
    D[a].append(b)
    D[b].append(a)

addedge(1, 2)
addedge(2, 3)
addedge(3, 4)

V = list(D.keys())
k = 2

def F(X, Y):
    if len(X) == k:
        return
    if X:
        # candidate vertices: adjacent to X, not already included or forbidden
        T = set(a for x in X for a in D[x] if a not in Y and a not in X)
    else:
        T = set(V)
    Y1 = set(Y)
    for v in T:
        X.add(v)
        print(X)
        F(X, Y1)
        X.remove(v)
        Y1.add(v)

print('original method')
F(set(), set())
F generates all connected subgraphs of size <=k where the subgraph must include vertices in X (a connected subgraph itself), and must not include vertices in Y.
We know that to add another vertex to the subgraph we must use a vertex connected to it, so we can recurse on the identity of the first connected vertex v that appears in the final subgraph. The forbidden set ensures that a second copy of the subgraph cannot be generated: that copy would have to use v, but v is in the forbidden set and so cannot be used again.
So at this superficial level of analysis, this algorithm appears efficient and correct.
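One way to gain empirical confidence is to compare the enumeration against brute force on a small graph; the following check (entirely my own) confirms that on a path of four vertices with k = 3, the algorithm emits every connected subgraph of size at most k exactly once:

```python
# Sanity check: compare the forbidden-set enumeration against
# brute-force enumeration of connected vertex subsets of size <= k.
from itertools import combinations

def connected(vertices, adj):
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in vertices and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices

def enumerate_forbidden(adj, k):
    out = []
    def F(X, Y):
        if len(X) == k:
            return
        T = (set(a for x in X for a in adj[x] if a not in Y and a not in X)
             if X else set(adj))
        Y1 = set(Y)
        for v in sorted(T):
            X.add(v)
            out.append(frozenset(X))   # "show subgraph X1"
            F(X, Y1)
            X.remove(v)
            Y1.add(v)
    F(set(), set())
    return out

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # path 1-2-3-4
subs = enumerate_forbidden(adj, 3)
brute = [frozenset(c) for n in (1, 2, 3)
         for c in combinations(adj, n) if connected(c, adj)]
assert len(subs) == len(set(subs))   # no duplicates
assert set(subs) == set(brute)       # every connected subgraph appears
```

This is of course not a proof, only evidence on one instance.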
You did not describe the algorithm well. We don't know what k is or what V is in this algorithm. I will just assume k is the restriction on the subgraph size and V is some root vertex.
If that is true, then it looks to me like this algorithm is incorrect. Suppose we have a graph with only two connected vertices v1, v2 and the restriction on the subgraph size is k = 1.
In the first iteration: X, Y = empty, T[X] = {v1}, X1 = {v1}, Y1 = empty, and we show X1.
Then we recursively call F(X1, Y1), which returns immediately because |X1| = |{v1}| = 1.
Back in the first iteration, now Y1 = {v1}. The loop ends, and the initial call also ends here. So we print only {v1}, whereas we are supposed to print both {v1} and {v2}.
By the way, do not "test" an algorithm - there is no way to test it (the number of possible test cases is infinite). You should instead formally prove it correct.

Don't understand closest pair heuristic from “The Algorithm Design Manual”

I have been reading The Algorithm Design Manual.
A different idea might be to repeatedly connect the closest pair of endpoints whose connection will not create a problem, such as premature termination of the cycle. Each vertex begins as its own single vertex chain. After merging everything together, we will end up with a single chain containing all the points in it. Connecting the final two endpoints gives us a cycle. At any step during the execution of this closest-pair heuristic, we will have a set of single vertices and vertex-disjoint chains available to merge. In pseudocode:
ClosestPair(P)
    Let n be the number of points in set P.
    For i = 1 to n − 1 do
        d = ∞
        For each pair of endpoints (s, t) from distinct vertex chains
            if dist(s, t) ≤ d then s_m = s, t_m = t, and d = dist(s, t)
        Connect (s_m, t_m) by an edge
    Connect the two endpoints by an edge
Please note that sm and tm should be s_m and t_m, i.e. with m as a subscript.
Why d = ∞?
Could anyone please explain the nearest-neighbour tour?
Which book should I read before reading this book?
The algorithm sets d = ∞ so that the first comparison always succeeds: if dist(s, t) ≤ d then ...
An alternative would be to set d to the distance between the first pair and then try all the remaining pairs, but that takes more code. In programming you typically use the maximum value possible for your given arithmetic type, which is often provided as a constant in the language, e.g. Int.MaxValue.
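To make the pseudocode concrete, here is a small runnable sketch (my own implementation, not Skiena's) that maintains chains as lists of point indices and resets d to infinity each round, so the first comparison always succeeds:

```python
import math

def closest_pair_tour(points):
    # each chain is a list of point indices; its endpoints are chain[0], chain[-1]
    chains = [[i] for i in range(len(points))]
    dist = lambda a, b: math.dist(points[a], points[b])
    for _ in range(len(points) - 1):        # n - 1 merges leave one chain
        d, best = float('inf'), None        # d = infinity: first test succeeds
        for i in range(len(chains)):
            for j in range(i + 1, len(chains)):
                for s in (chains[i][0], chains[i][-1]):
                    for t in (chains[j][0], chains[j][-1]):
                        if dist(s, t) <= d:
                            d, best = dist(s, t), (i, j, s, t)
        i, j, s, t = best
        a = chains[i] if chains[i][-1] == s else chains[i][::-1]  # a ends at s
        b = chains[j] if chains[j][0] == t else chains[j][::-1]   # b starts at t
        chains[i] = a + b
        del chains[j]
    return chains[0]  # connecting its two endpoints closes the cycle
```

On collinear points such as (0,0), (1,0), (2,0), (3,0), this merges neighbors first and returns the chain in left-to-right order.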
