Identifying non-intersecting (super-)sets - algorithm

I am looking for an algorithm to identify non-intersecting (super-)sets in a set of sets.
Let's assume I have a set of sets containing the sets A, B, C and D, i.e. {A, B, C, D}. Each set may or may not intersect some or all of the other sets.
I would like to identify non-intersecting (super-)sets.
Examples:
If A & B intersect and C & D intersect but (A union B) does not intersect (C union D), I would like the output of {(A union B), (C union D)}
If only C & D intersect, I would like the output {A, B, (C union D)}
I am sure this problem has long been solved. Can somebody point me in the right direction?
Even better would be of course if somebody had already done the work and had an implementation in python they were willing to share. :-)

I would turn this from a set problem into a graph problem by constructing a graph whose nodes are the sets, with edges connecting sets that intersect. The connected components of that graph are the groups you want.
Here is some code that does it. It takes a dictionary mapping the name of each set to the set. It returns a list of sets of set names that are connected.
def set_supersets(sets_by_label):
    element_mappings = {}
    for label, this_set in sets_by_label.items():
        for elt in this_set:
            if elt not in element_mappings:
                element_mappings[elt] = set()
            element_mappings[elt].add(label)
    graph_conn = {}
    for elt, sets in element_mappings.items():
        for s in sets:
            if s not in graph_conn:
                graph_conn[s] = set()
            for t in sets:
                if t != s:
                    graph_conn[s].add(t)
    seen = set()
    answer = []
    for s, sets in graph_conn.items():
        if s not in seen:
            todo = [s]
            this_group = set()
            while 0 < len(todo):
                t = todo.pop()
                if t not in seen:
                    this_group.add(t)
                    seen.add(t)
                    for u in graph_conn[t]:
                        todo.append(u)
            answer.append(this_group)
    return answer

print(set_supersets({
    "A": set([1, 2]),
    "B": set([1, 3]),
    "C": set([4, 5]),
    "D": set([3, 6])
}))
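Since the question asks for the merged (super-)sets themselves rather than only their names, here is a minimal sketch of a variant that returns both. It uses a union-find over the labels instead of an explicit graph; the name merge_intersecting and the returned (labels, union) pairs are my own choices for illustration, not part of the code above.

```python
def merge_intersecting(sets_by_label):
    """Return a list of (labels, union) pairs, one per group of
    transitively intersecting sets."""
    parent = {label: label for label in sets_by_label}

    def find(label):
        # Follow parent links to the representative, halving the path.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    owner = {}  # element -> first label seen that contains it
    for label, members in sets_by_label.items():
        for elt in members:
            if elt in owner:
                # Shared element: the two sets belong to the same group.
                parent[find(label)] = find(owner[elt])
            else:
                owner[elt] = label

    groups = {}
    for label, members in sets_by_label.items():
        labels, union = groups.setdefault(find(label), (set(), set()))
        labels.add(label)
        union.update(members)
    return list(groups.values())


result = merge_intersecting({"A": {1, 2}, "B": {1, 3}, "C": {4, 5}, "D": {3, 6}})
for labels, union in result:
    print(sorted(labels), sorted(union))
```

Unlike the graph version, this sketch also reports an empty input set as its own group.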

Related

efficient algorithm for describing this join?

I'm not sure how best to describe this without using sets:
Assume we have two distinct, finite sets A & B, and a set P which contains a subset of all the different pairs of A & B (it's a predicate join, basically).
for example:
P = { (a1, b1), (a1, b3), (a2, b1), (a2, b3), (a1, b2) }
I want to find a set C which contains the fewest (or close to the fewest) pairs (as, bs) of subsets of A & B, e.g.:
C = { ( {a1, a2}, {b1, b3} ), ( {a1}, {b2} ) }
such that for each (as,bs) in C, for each combination of a in as and b in bs, (a,b) is in P and each element of P appears once and only once in this 'expansion' of C.
I'm not exactly sure how to describe this, but it seems somewhat analogous to rectangle covering in computational geometry. Maybe someone has seen something like this before?
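To make the condition on C concrete, here is a small sketch that only checks whether a candidate C expands to exactly P, with each pair produced once. It does not search for a minimal C; the function name expands_exactly is hypothetical.

```python
from itertools import product


def expands_exactly(C, P):
    """Return True if the (as, bs) blocks in C produce every pair of P
    exactly once and produce no pair outside P."""
    seen = set()
    for a_subset, b_subset in C:
        for pair in product(a_subset, b_subset):
            if pair in seen or pair not in P:
                return False  # duplicated pair, or a pair that is not in P
            seen.add(pair)
    return seen == set(P)


P = {("a1", "b1"), ("a1", "b3"), ("a2", "b1"), ("a2", "b3"), ("a1", "b2")}
C = [({"a1", "a2"}, {"b1", "b3"}), ({"a1"}, {"b2"})]
ok = expands_exactly(C, P)
print(ok)
```

A single block ({a1, a2}, {b1, b2, b3}) would fail this check because it produces (a2, b2), which is not in P.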

Maximal sets intersection

Given 5 finite sets a,b,c,d,e. Each set is assigned the arbitrary number:
a = 100, b = 34, c = 15, d = 89, e = 57
The complement of each set has the same number assigned but negated, e.g. for (a') it will be -100.
We need to find an intersection of all these sets or their complements such that the resulting set is not empty and the sum of the assigned numbers is maximal.
I only see one brute force solution to this problem, but it will be very inefficient and it's not elegant. In this case we just generate all combinations and resolve them to see if they are not empty, combinations look like this:
{a∩b'∩c'∩d'∩e'}, {a'∩b∩c'∩d∩e'}, {a'∩b'∩c∩d'∩e'}, {a'∩b'∩c'∩d∩e'}, {a'∩b'∩c'∩d'∩e} {a∩b∩c'∩d'∩e'}, {a∩b'∩c∩d'∩e'}, {a∩b'∩c'∩d∩e}, {a∩b'∩c'∩d'∩e}, {a'∩b∩c∩d'∩e'} {a'∩b∩c'∩d∩e'} {a'∩b∩c'∩d'∩e} ...
and then just pick the max number.
Looking forward to seeing if someone can think of something better :)
Define score(x, X) to be the value of set X if x is in X, otherwise its negation.
Then, letting * represent an element that's not in any of the 5 sets, the highest score possible is:
max_{x in union(A, B, C, D, E, {*})} sum_{X in {A, B, C, D, E}} score(x, X)
This follows from the observation that any particular x is either in a set or its complement. You don't actually have to compute the union here. In Python you might write:
def max_config(A, B, C, D, E):
    sets = (A, B, C, D, E)
    best = float("-inf")
    # {None} plays the role of *, an element that is in none of the sets
    for S in sets + ({None},):
        for x in S:
            best = max(best, sum(score(x, X) for X in sets))
    return best
Assuming a set membership test is O(1), this has complexity O(N) for a fixed number of sets, where N is the total size of the given sets (O(kN) for k sets in general).
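For a version that runs on its own, the values have to be passed in alongside the sets. The sketch below takes a list of (set, value) pairs. The name max_score, the sentinel object standing in for *, and the sample sets are all assumptions made for illustration; only the five values come from the question. It also computes the union up front, which is optional but avoids scoring a shared element more than once.

```python
def max_score(valued_sets):
    """valued_sets: list of (set, value) pairs.
    Returns the largest sum of +value (x in set) / -value (x not in set)
    over all candidate elements x."""
    outside = object()  # stands for *, an element in none of the sets
    candidates = set().union(*(s for s, _ in valued_sets)) | {outside}

    def total(x):
        return sum(value if x in s else -value for s, value in valued_sets)

    return max(total(x) for x in candidates)


# Hypothetical contents for a, b, c, d, e; the values are from the question.
example = [
    ({1, 2}, 100),
    ({2, 3}, 34),
    ({3}, 15),
    ({1, 2, 4}, 89),
    ({5}, 57),
]
best = max_score(example)
print(best)
```

With this sample data the best element is 2, which lies in a, b and d but not in c or e.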

Tarjan's algorithm: do lowest-links have to be similar for two or more nodes to be inside the same SCC

I'm having some trouble with a homework question involving using Tarjan's algorithm on a provided graph to find the particular SCC's for that graph. While (according to my professor) I have found the correct SCC's by using the pseudo-code algorithm found here, some of the nodes in my SCC's do not share the same lowest-link number as the root node for that SCC.
From what I can gather from the pseudo-code, this is because if an un-referenced node i (which is the input node for the current recursive call to the algorithm) has an arc to an already visited node i + 1 which is not the root node, then the algorithm sets i.LL = MIN(i.LowestLink, (i + 1).index), and (i + 1).index may not be equal to its own lowest-link value anymore.
For example (this is similar to a part of the graph from the problem I'm trying to solve): if we have nodes in N = {a, b, c, d}, and arcs in E = {a ⇒ c, c ⇒ b, c ⇒ d, b ⇒ a, d ⇒ b}, and our root node which we start the algorithm from is a, then:
1.1) We set a.index = 1 (using 1 rather than 0), a.LL = 1, and push a onto the stack; a has a single arc to c, so we check c; finding that it is undiscovered, we call the algorithm on c.
2.1) We set c.index = 2, c.LL = 2, and push c onto the stack; c has two arcs, one to b, and one to d. Assume our for loop checks b first; b is undiscovered, and so we call the algorithm on b.
3.1) We set b.index = 3, b.LL = 3, and push b onto the stack; b has one arc to a; checking a we find that it is already on the stack, and so (by the pseudo-code linked above) we set b.LL = MIN(b.LL, a.index) = a.index = 1; b has no further arcs, so we exit our for loop, and check if b.LL = b.index, it does not, so we end this instance of the algorithm.
2.2) Now that the recursive call on b has ended, we set c.LL = MIN(c.LL, b.LL) = b.LL = 1. c still has the arc from c to d remaining; checking d we find it is undiscovered, so we call the algorithm on d.
4.1) d.index is set to 4, d.LL is set to 4, and we push d onto the stack. d has one arc from d to b, so we check b; we find that b is already in the stack, so we set d.LL = MIN(d.LL, b.index) = b.index = 3. d has no further arcs, so we exit our for loop and check if d.LL = d.index; it does not, so we end this instance of the algorithm.
2.3) With the recursive call on d ended, we again set c.LL = MIN(c.LL, d.LL) = c.LL = 1. c has no further arcs, and so we end our for loop. We check to see if c.LL = c.index; it does not, so we end this instance of the algorithm.
1.2) With the recursive call on c ended, we set a.LL = MIN(a.LL, c.LL) = 1. a has no further arcs, so we end our for loop. We check if a.LL = a.index; they are equal, so we have found a root node for this SCC; we create a new SCC, and pop each item in the stack into this SCC until we find a in the stack (which also goes into this SCC).
After these steps all the nodes in the graph are discovered, so running the algorithm with the other nodes initially does nothing, and we have one SCC = {a, b, c, d}. However, d.LL = 3, which is not equal to the rest of the nodes' lowest-links (which are all 1).
Have I done something wrong here? Or is it possible in this situation to have an SCC with differing lowest-links among its nodes?
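The walk-through can be reproduced with a short sketch of the commonly published recursive formulation of Tarjan's algorithm, which may differ in detail from the pseudo-code linked in the question. Indices start at 1 and c's arcs are visited in the order b, d, to match the steps above. In this formulation a node is placed in an SCC by being popped off the stack when a root is found, not by comparing lowest-link values, so differing lowest-links inside one SCC are possible.

```python
def tarjan(graph):
    """graph: dict mapping each node to a list of successors.
    Returns (sccs, lowlink, index)."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 1  # start at 1 to match the walk-through

    def strongconnect(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                # Undiscovered successor: recurse, then inherit its lowlink.
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                # Successor already on the stack: use its index.
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is a root: pop the stack down to v to form one SCC.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs, lowlink, index


graph = {"a": ["c"], "b": ["a"], "c": ["b", "d"], "d": ["b"]}
sccs, lowlink, index = tarjan(graph)
print(sccs)
print(lowlink)
```

On this graph the sketch yields a single SCC containing all four nodes, with d keeping a lowest-link of 3 while a, b and c end at 1.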

Algorithm for determining equivalency classes

Rather general question. I have a list like this:
A B
A C
C A
D E
F G
E F
C L
M N
and so on.
What I want to do - is to figure out all the relations and put everything that's related in a single line. The example above would become:
A B C L
D E F G
M N
so that every letter appears only once, and the letters that are related to each other are on one line (list, array, whatever).
Is this some kind of known problem with a well-defined algorithm? Does it have a name? Sounds like it should be. I'd assume some kind of a recursive solution should be in place.
One way to solve this is to use an undirected graph G=(V,E). Each pair in your input represents an edge in E, and the output you want is the connected components of G. There are some great Python graph modules such as NetworkX.
Demo
>>> data
[['A', 'B'], ['A', 'C'], ['C', 'A'], ['D', 'E'], ['F', 'G'], ['E', 'F'], ['C', 'L'], ['M', 'N']]
>>> import networkx as nx
>>> G = nx.Graph()
>>> G.add_edges_from( data )
>>> components = nx.connected_components( G )
>>> print("\n".join(" ".join(sorted(cc)) for cc in components))
A B C L
D E F G
M N
https://en.wikipedia.org/wiki/Connected_component_(graph_theory)
(but don't worry too much about their suggested algorithms, because you have a list of edges, whereas they assume that you don't.)
Let's call a letter a Node, and a set of nodes a Component. You need to produce a set of Components given a list of edges.
First, map Nodes to Components:
Map<Node, Component> map.
Then:
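For each edge (A, B):
    Component ca = map.get (A)
    Component cb = map.get (B)
    if neither exists then:
        c = new Component containing A and B
        map.put (A, c)
        map.put (B, c)
    else if only one exists (say ca) then:
        ca.add (B)
        map.put (B, ca)
    else if ca is not cb then:
        For each node N in cb:
            ca.add (N)
            map.put (N, ca)
For each distinct Component C in map.values ():
    Print (sort C's nodes)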
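The same idea can be sketched in Python with a small union-find (disjoint-set) structure, which handles the case where an edge joins two components that already exist. The function name components is my own; the input is the edge list from the question.

```python
def components(edges):
    """Group nodes into connected components given a list of (a, b) edges."""
    parent = {}

    def find(node):
        # Register unseen nodes, then walk up to the representative.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


data = [["A", "B"], ["A", "C"], ["C", "A"], ["D", "E"],
        ["F", "G"], ["E", "F"], ["C", "L"], ["M", "N"]]
result = components(data)
for group in result:
    print(" ".join(group))
```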

closest pair pseudo code example

I have been reading Algorithm design manual. I have the same question as What is the meaning of "from distinct vertex chains" in this nearest neighbor algorithm? but I am not able to follow the answers there.
A different idea might be to repeatedly connect the closest pair of endpoints whose connection will not create a problem, such as premature termination of the cycle. Each vertex begins as its own single vertex chain. After merging everything together, we will end up with a single chain containing all the points in it. Connecting the final two endpoints gives us a cycle. At any step during the execution of this closest-pair heuristic, we will have a set of single vertices and vertex-disjoint chains available to merge. In pseudocode:
ClosestPair(P)
    Let n be the number of points in set P.
    For i = 1 to n − 1 do
        d = ∞
        For each pair of endpoints (s, t) from distinct vertex chains
            if dist(s, t) ≤ d then sm = s, tm = t, and d = dist(s, t)
        Connect (sm, tm) by an edge
    Connect the two endpoints by an edge
Please note that sm and tm should be read as s_m and t_m (the m is a subscript).
I am not able to follow the above logic. Please demonstrate the computation for the simple example given in the book: -21, -5, -1, 0, 1, 3, and 11. Show the computations step by step, so that one can follow the above code easily.
First example
In the following notation, I use parentheses to denote chains. Each vertex starts out as its own chain. The inner loop iterates over all pairs of chain endpoints, i.e. all pairs of nodes which have a parenthesis written immediately next to them, but only those pairs where the two endpoints come from different chains. The result of this inner loop is a pair of endpoints which minimizes the distance d. I'll assume that pairs are sorted s < t, and furthermore that pairs are traversed in lexicographical order. In that case, the rightmost pair matching that minimal d will be returned, due to the ≤ in the code.
(-21), (-5), (-1), (0), (1), (3), (11)  d = 1,  sm = 0,   tm = 1
(-21), (-5), (-1), (0 , 1), (3), (11)   d = 1,  sm = -1,  tm = 0
(-21), (-5), (-1 , 0 , 1), (3), (11)    d = 2,  sm = 1,   tm = 3
(-21), (-5), (-1 , 0 , 1 , 3), (11)     d = 4,  sm = -5,  tm = -1
(-21), (-5 , -1 , 0 , 1 , 3), (11)      d = 8,  sm = 3,   tm = 11
(-21), (-5 , -1 , 0 , 1 , 3 , 11)       d = 16, sm = -21, tm = -5
(-21 , -5 , -1 , 0 , 1 , 3 , 11)        d = 32 to close the loop
So in this example, the code works as intended.
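The trace above can be checked with a small sketch of the heuristic for points on a line. It follows the same assumptions I made: candidate pairs are stored with s < t, traversed in lexicographical order, and compared with ≤ so the last minimal pair wins. The function name, the default distance for plain numbers, and the returned triple are my own choices; it assumes at least one point and that all points are distinct.

```python
def closest_pair_tour(points, dist=lambda s, t: abs(s - t)):
    """Run the closest-pair heuristic.
    Returns (steps, tour, closing_distance), where steps is a list of
    (d, sm, tm) triples, one per merge."""
    chains = [[p] for p in points]  # every vertex starts as its own chain
    steps = []
    for _ in range(len(points) - 1):
        candidates = []
        for i in range(len(chains)):
            for j in range(i + 1, len(chains)):
                for s in {chains[i][0], chains[i][-1]}:
                    for t in {chains[j][0], chains[j][-1]}:
                        # Store each pair with s < t, remembering its chains.
                        candidates.append((s, t, i, j) if s < t else (t, s, j, i))
        d, best = float("inf"), None
        for s, t, i, j in sorted(candidates):
            if dist(s, t) <= d:  # <= keeps the last minimal pair
                d, best = dist(s, t), (s, t, i, j)
        s, t, i, j = best
        left, right = chains[i], chains[j]
        if left[-1] != s:
            left = left[::-1]   # orient so s is the last vertex
        if right[0] != t:
            right = right[::-1]  # orient so t is the first vertex
        chains = [c for k, c in enumerate(chains) if k not in (i, j)]
        chains.append(left + right)
        steps.append((d, s, t))
    tour = chains[0]
    return steps, tour, dist(tour[0], tour[-1])


steps, tour, closing = closest_pair_tour([-21, -5, -1, 0, 1, 3, 11])
for d, sm, tm in steps:
    print(f"d = {d}, sm = {sm}, tm = {tm}")
print(tour, "closing edge:", closing)
```

For points in the plane, the same sketch should work with coordinate tuples and math.dist passed as the dist argument.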
Second example
Figure 1.4 gives an example where this code does not work, i.e. it yields a suboptimal result. Label the vertices like this:
A <--(1+e)--> B <--(1+e)--> C
^             ^             ^
|             |             |
(1-e)         (1-e)         (1-e)
|             |             |
v             v             v
D <--(1+e)--> E <--(1+e)--> F
In this case you'll get
(A), (B), (C), (D), (E), (F)  d = 1-e, sm = C, tm = F
(C, F), (A), (B), (D), (E)    d = 1-e, sm = B, tm = E
(C, F), (B, E), (A), (D)      d = 1-e, sm = A, tm = D
(C, F), (B, E), (A, D)        d = 1+e, sm = E, tm = F
(B, E, F, C), (A, D)          d = 1+e, sm = A, tm = B
(D, A, B, E, F, C)            d = sqrt((2+2e)^2+(1-e)^2) to close
which is not the optimal solution.
