Transitive set merging - algorithm

Is there a well-known algorithm that, given a collection of sets, merges every two sets that share at least one element? For example, given the following input:
A B C
B D
A E
F G H
I J
K F
L M N
E O
It would produce:
A B C D E O
F G H K
I J
L M N
I already have a working implementation, but this seems common enough that there should be a name for what I am doing.

You can model this as a simple graph problem: Introduce a node for every distinct element. Introduce a node for every set. Connect every set to the elements it contains. You get an (undirected) bipartite graph, in which the connected components are the solution of your problem. You can use depth-first search to find the CCs.
The runtime should be linear (with hash tables, so only expected runtime unless your numbers are bounded).
I don't think it deserves a special name, it's just an application of well-known concepts.
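The connected-components view also has a classic incremental counterpart: a disjoint-set (union-find) structure, which merges components as each set is read. A minimal Python sketch (an illustration, not the answerer's code; `merge_sets` is an assumed name):

```python
from collections import defaultdict

def merge_sets(sets):
    """Merge sets that share an element, via a disjoint-set (union-find)."""
    parent = {}

    def find(x):
        # Find the representative of x's component, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for s in sets:
        items = list(s)
        for x in items:
            parent.setdefault(x, x)
        for x in items[1:]:          # everything in one set joins one component
            union(items[0], x)

    components = defaultdict(set)
    for x in parent:
        components[find(x)].add(x)
    return list(components.values())
```

On the question's input this yields the four merged sets {A,B,C,D,E,O}, {F,G,H,K}, {I,J}, {L,M,N}.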

Algorithm for "balanced" breadth-first search

I'm looking for references for an algorithm that conducts a breadth-first tree search in a balanced manner, one that is resilient in a situation where
we expect most nodes to have few children, but
a few nodes may have many (possibly infinitely many) children.
Consider this simple tree (modified from this post):
      A
     / \
    B   C
   /   / \
  D   E   F
  |  /|\   \
  G H I J.. K
Depth-first visits nodes in this order:
A B D G C E H I J ... F K
Breadth-first visits nodes in this order:
A B C D E F G H I J ... K
The balanced breadth-first algorithm that I have in mind would visit nodes in this order:
A B C D E G F H K I J ...
Note how
we visit G before F, and
we visit K after H but before I.
G is deeper than F, but it is an only child of B whereas F is a second child of C and must share its search priority with E. Similarly between K and the many children H, I, J, ... of E.
I call this "balanced" because a node with lots of children cannot choke the algorithm. Concretely, if E has 𝜔 (infinitely) many children, then a pure breadth-first strategy would never reach K, whereas the "balanced" algorithm would still reach K after H but before the other children of E.
(The reader who does not like 𝜔 can attain a similar effect with a large but still finite number such as "the greatest number of steps any practical search algorithm will ever make, plus 1".)
I can only imagine that this style of search or something like it must have been the subject of much research and practical application. I would be grateful to be pointed in the right direction. Thank you.
Transform your tree to a different kind of representation. In this new representation, each node has at most two links: one to its leftmost child, and one to its right sibling.
      A
     / \
    B   C
   /   / \
  D   E   F
  |  /|\   \
  G H I J.. K
  ⇓
    A
   /
  B --> C
 /     /
D     E --> F
|    /       \
G   /         K
   /
  H --> I --> J --> ...
Then treat this representation as a normal binary tree, and traverse it breadth-first. It may have infinite height, but as with any binary tree, the width of any particular level is finite.
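A sketch of this idea in Python (not from the answer; the `Node` class and function names are assumptions). Each queue entry pairs a node with an iterator over its remaining right siblings, so the "binary tree" is never built explicitly and a node with infinitely many children only contributes one entry at a time:

```python
from collections import deque

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = children        # any iterable, possibly a generator

def balanced_bfs(root):
    """Breadth-first traversal of the left-child/right-sibling binary view."""
    queue = deque([(root, iter(()))])
    while queue:
        node, siblings = queue.popleft()
        yield node.name
        sib = next(siblings, None)
        if sib is not None:             # "right" link: next sibling
            queue.append((sib, siblings))
        kids = iter(node.children)
        first = next(kids, None)
        if first is not None:           # "left" link: first child
            queue.append((first, kids))
```

On the question's tree this yields A B C D E G F H K I J; enqueueing the sibling link before the child link reproduces exactly the within-level order the question asks for.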
Depth-first search, breadth-first search, A* search (and others) differ only in how you handle the list of "nodes still to visit".
(I assume you always process the node at the start of the list next.)
depth-first search: add new nodes at the front of the list
breadth-first search: add new nodes at the end of the list
A* search: insert new nodes into the list ordered by cost + heuristic
So you need to formalize how to insert new nodes into the list to fulfil your requirements.
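The three policies above can be sketched as one traversal skeleton (names are assumptions, not the answerer's code):

```python
from collections import deque
import heapq

def generic_search(start, children, strategy="bfs", priority=None):
    """One traversal skeleton; only the frontier policy differs.

    `children(node)` returns the node's successors.  `strategy` is
    "dfs", "bfs", or "best" (the latter needs a `priority` function,
    as in A*: cost so far plus heuristic).
    """
    if strategy == "best":
        heap = [(priority(start), start)]
        order = []
        while heap:
            _, node = heapq.heappop(heap)
            order.append(node)
            for c in children(node):
                heapq.heappush(heap, (priority(c), c))
        return order

    todo, order = deque([start]), []
    while todo:
        node = todo.popleft()                # always process the head
        order.append(node)
        kids = list(children(node))
        if strategy == "dfs":
            todo.extendleft(reversed(kids))  # new nodes go to the front
        else:
            todo.extend(kids)                # new nodes go to the end
    return order
```

The "balanced" search then amounts to one more insertion policy for the same skeleton.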

How to represent a dependency graph with alternative paths

I'm having some trouble trying to represent and manipulate dependency graphs in this scenario:
a node has some dependencies that have to be resolved
there must be no dependency cycles (the graph is a DAG)
every dependency could be satisfied by more than one other node
I start from the target node and recursively look up its dependencies, but I have to maintain the above properties, in particular the third one.
Just a little example here:
I would like to have a graph like the following one
         (A)
         / \
        /   \
       /     \
[(B),(C),(D)] (E)
  / \      \
 /   \      (H)
(F)   (G)
which means:
F,G,C,H,E have no dependencies
D depends on H
B depends on F and G
A depends on E and
B or
C or
D
So, if I write down all the possible topologically sorted paths to A, I should have:
E -> F -> G -> B -> A
E -> C -> A
E -> H -> D -> A
How can I model a graph with these properties? Which kind of data structure is the most suitable for this?
You should use a normal adjacency list, with an additional property: each node knows the other nodes that would also satisfy the same dependency. This means that B, C, and D should all know that they belong to the same equivalence class. You can achieve this by inserting them all into a set.
Node:
    List<Node> adjacencyList
    Set<Node> equivalentDependencies
To use this data structure in a topological sort, whenever you remove a source and all its outgoing edges, also remove the other nodes in its equivalence class together with their outgoing edges.
From Wikipedia (modified):
L ← Empty list that will contain the sorted elements
S ← Set of all nodes with no incoming edges
while S is non-empty do
    remove a node n from S
    add n to tail of L
    for each node j in the equivalence class of n do   <=== removing equivalent dependencies
        remove j from S
        for each node k with an edge e from j to k do
            remove edge e from the graph
            if k has no other incoming edges then
                insert k into S
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into S
if graph has edges then
    return error (graph has at least one cycle)
else
    return L (a topologically sorted order)
This algorithm will give you one of the modified topologically sorted orderings.
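A minimal Python sketch of the modified Kahn's algorithm above, applied to the question's example (all names are assumptions; the pseudocode's final cycle check is omitted for brevity):

```python
def topo_with_alternatives(edges, equiv):
    """Kahn's algorithm, extended so that picking a node also discharges
    the other members of its equivalence class (alternative providers).

    `edges` maps a dependency to the set of nodes that depend on it;
    `equiv` maps a node to the other members of its equivalence class.
    """
    nodes = set(edges) | {m for vs in edges.values() for m in vs}
    succ = {n: set(edges.get(n, ())) for n in nodes}
    indeg = {n: 0 for n in nodes}
    for n in nodes:
        for m in succ[n]:
            indeg[m] += 1

    L, removed = [], set()
    S = {n for n in nodes if indeg[n] == 0}

    def discharge(j):
        # Remove j's outgoing edges; nodes left with no deps become sources.
        for k in succ[j]:
            indeg[k] -= 1
            if indeg[k] == 0 and k not in removed:
                S.add(k)

    while S:
        n = min(S)                       # deterministic pick for the demo
        S.remove(n)
        L.append(n)
        removed.add(n)
        for j in equiv.get(n, ()):       # discharge equivalent providers too
            if j not in removed:
                removed.add(j)
                S.discard(j)
                discharge(j)
        discharge(n)
    return L

# The question's example: B, C, D are alternatives for A's dependency.
edges = {"E": {"A"}, "B": {"A"}, "C": {"A"}, "D": {"A"},
         "F": {"B"}, "G": {"B"}, "H": {"D"}}
equiv = {"B": {"C", "D"}, "C": {"B", "D"}, "D": {"B", "C"}}
```

With the deterministic pick, C is chosen as A's provider, so B and D never enter the output order and A is sorted after C and E.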

Don't understand closest pair heuristic from “The Algorithm Design Manual”

I have been reading The Algorithm Design Manual.
A different idea might be to repeatedly connect the closest pair of endpoints whose connection will not create a problem, such as premature termination of the cycle. Each vertex begins as its own single vertex chain. After merging everything together, we will end up with a single chain containing all the points in it. Connecting the final two endpoints gives us a cycle. At any step during the execution of this closest-pair heuristic, we will have a set of single vertices and vertex-disjoint chains available to merge. In pseudocode:
ClosestPair(P)
    Let n be the number of points in set P.
    For i = 1 to n − 1 do
        d = ∞
        For each pair of endpoints (s, t) from distinct vertex chains
            if dist(s, t) ≤ d then sm = s, tm = t, and d = dist(s, t)
        Connect (sm, tm) by an edge
    Connect the two endpoints by an edge
Please note that sm and tm should read sₘ and tₘ (the subscripts were lost in formatting).
Why d = ∞?
Could anyone please explain the nearest-neighbour tour?
Which book should I read before reading this one?
The algorithm sets d = ∞ so that the first comparison always succeeds: if dist(s, t) ≤ d then ...
An alternative would be to initialize d to the distance between the first pair and then try all the remaining pairs, but that takes more code. In programming you typically use the maximum value of your arithmetic type, which is often provided as a constant in the language, e.g. Int.MaxValue.
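A direct Python transcription of the pseudocode may make the d = ∞ initialization concrete (names are assumptions; this is the naive O(n³) version, not an optimized one):

```python
import math

def closest_pair_tour(points):
    """The ClosestPair heuristic: repeatedly join the closest endpoints
    of distinct chains, then close the final chain into a cycle.

    `points` is a list of (x, y) tuples; returns the tour's edges as
    index pairs.  Chains are kept as lists read endpoint to endpoint.
    """
    def dist(i, j):
        (x1, y1), (x2, y2) = points[i], points[j]
        return math.hypot(x1 - x2, y1 - y2)

    n = len(points)
    chains = [[i] for i in range(n)]
    edges = []
    for _ in range(n - 1):
        d = math.inf                     # so the first comparison succeeds
        best = None
        for a in range(len(chains)):
            for b in range(a + 1, len(chains)):
                for s in (chains[a][0], chains[a][-1]):
                    for t in (chains[b][0], chains[b][-1]):
                        if dist(s, t) <= d:
                            d, best = dist(s, t), (a, b, s, t)
        a, b, s, t = best
        edges.append((s, t))
        # Orient chain a to end at s and chain b to start at t, then join.
        ca = chains[a] if chains[a][-1] == s else chains[a][::-1]
        cb = chains[b] if chains[b][0] == t else chains[b][::-1]
        chains = [c for i, c in enumerate(chains) if i not in (a, b)]
        chains.append(ca + cb)
    edges.append((chains[0][0], chains[0][-1]))   # close the cycle
    return edges
```

Because every finite distance is ≤ ∞, the very first pair examined in each round becomes the provisional best without any special-casing.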

How to deal with crossover when sequence matters?

How does one go about crossing over two parents when the children must have a particular ordering?
For example, when applying genetic algorithms to the Travelling Salesman Problem on a fixed graph of vertices and edges, you must contend with the fact that not every vertex is reachable from every other vertex. This makes crossover much more difficult, because unlike the classic TSP, where every vertex can travel to every other vertex, a crossover must be performed at a point that produces a legal path. The alternative is to cross over anyway and reject illegal paths, but the risk is great computational expense and few to no legal paths.
I've read about permutation crossover but I'm not entirely sure how this solves the issue. Can someone point me in the right direction or advise?
Ordering should, as far as possible, not be a constraint in genetic programming. Maybe you should consider picking another representation for your solutions.
For example, in your TSP, consider the codon A->B.
Instead of meaning 'take the edge from A to B', you could interpret it as 'take the shortest path from A to B'. This way, your solutions are always feasible. You just have to pre-compute a shortest-path matrix as a preprocessing step.
Now, this alone does not guarantee that candidates will be feasible solutions after your crossover. Your crossover should be tuned to guarantee that your solutions are still feasible. For example, for the TSP, consider the sequences:
1 : A B C D E F G H
2 : A D E G C B F H
Choose a pivot randomly (E in our example). This leads to the following sequences to be completed:
1' : A B C D E . . .
2' : A D E . . . . .
All the vertices have to be visited in order to have a valid solution. In 1', F, G, and H still have to be visited; we order them as they appear in sequence 2. In 2', G, C, B, F, and H are re-ordered as in sequence 1:
1' : A B C D E G F H
2' : A D E B C F G H
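The procedure above can be sketched in a few lines of Python (function and parameter names are assumptions, not from the answer):

```python
import random

def order_crossover(p1, p2, pivot=None):
    """One-point, order-preserving crossover as described above.

    Keeps p1 up to and including the pivot vertex, then appends the
    remaining vertices in the order they appear in p2, so every child
    is a permutation of the vertices.
    """
    if pivot is None:
        pivot = random.choice(p1)
    head = p1[:p1.index(pivot) + 1]
    tail = [v for v in p2 if v not in head]
    return head + tail

p1 = list("ABCDEFGH")
p2 = list("ADEGCBFH")
child1 = order_crossover(p1, p2, pivot="E")   # A B C D E G F H
child2 = order_crossover(p2, p1, pivot="E")   # A D E B C F G H
```

Combined with the shortest-path encoding, both children decode to feasible tours because they remain permutations.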
Hope this helps.

Why is greedy algorithm not finding maximum independent set of a graph?

Given a graph G, why is the following greedy algorithm not guaranteed to find a maximum independent set of G:
Greedy(G):
    S = {}
    While G is not empty:
        Let v be a node with minimum degree in G
        S = union(S, {v})
        Remove v and its neighbors from G
    return S
Can someone show me a simple example of a graph where this algorithm fails?
I'm not sure this is the simplest example, but here is one that fails: http://imgur.com/QK3DC
For the first step, you can choose B, C, D, or F since they all have degree 2. Suppose we remove B and its neighbors. That leaves F and D with degree 1 and E with degree 2. During the next two steps we remove F and D, ending up with an independent set of size 3, which is the maximum.
Instead, suppose on the first step we remove C and its neighbors. This leaves us with F, A, and E, each with degree 2. Whichever of these we take next empties the graph, so our solution contains only 2 nodes, which, as we have seen, isn't the maximum.
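The failure can be reproduced in code. The adjacency below is not the linked image itself but a graph consistent with the answer's description: a path B-C-D between two corners of the triangle A-E-F (all names and the `tie_break` parameter are assumptions):

```python
def greedy_mis(graph, tie_break=min):
    """The Greedy(G) procedure above; `tie_break` chooses among the
    minimum-degree nodes, since that choice decides success or failure.

    `graph` maps each node to its set of neighbours.
    """
    g = {v: set(ns) for v, ns in graph.items()}
    S = set()
    while g:
        d = min(len(ns) for ns in g.values())
        v = tie_break(sorted(u for u in g if len(g[u]) == d))
        S.add(v)
        doomed = g[v] | {v}              # v and its neighbours
        for u in doomed:
            for w in g[u]:
                if w not in doomed:
                    g[w].discard(u)
            del g[u]
    return S

graph = {
    "A": {"B", "E", "F"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"A", "D", "F"},
    "F": {"A", "E"},
}
```

Picking B first (the default alphabetical tie-break) yields {B, D, F}, size 3; forcing C first yields a set of size 2, reproducing the failure described in the answer.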
