Disjoint sets on apache spark - algorithm

I'm trying to find an algorithm for computing disjoint sets (connected components / union-find) on a large amount of data with Apache Spark.
The problem is the amount of data. Even the raw representation of the graph vertices doesn't fit into RAM on a single machine, and neither do the edges.
The source data is a text file of graph edges on HDFS: "id1 \t id2".
The IDs are string values, not ints.
The naive solution that I found is (sketched in PySpark below):
1. take an RDD of edges -> [id1:id2] [id3:id4] [id1:id3]
2. group edges by key -> [id1:[id2;id3]] [id3:[id4]]
3. for each group, set the minimum id on every member -> (flatMap) [id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]
4. reverse the RDD from stage 3: [id2:id1] -> [id1:id2]
5. leftOuterJoin the RDDs from stages 3 and 4
6. repeat from stage 2 while the size of the RDD in stage 3 keeps changing
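Roughly, in PySpark (a sketch only; sc, the HDFS path, and the parsing are illustrative, and stages 4-6 are just indicated):

edges = sc.textFile("hdfs:///graph/edges.tsv") \
          .map(lambda line: tuple(line.split("\t")))   # stage 1: (id1, id2)

def relabel(group):
    key, members = group[0], list(group[1])
    smallest = min([key] + members)                    # minimum id in the group
    return [(m, smallest) for m in [key] + members]

labels = edges.groupByKey().flatMap(relabel)           # stages 2-3
# stages 4-6: reverse, leftOuterJoin with the stage-3 RDD, and iterate
# until labels.count() stops changing; every pass shuffles the whole RDD.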
But this results in the transfer of large amounts of data between nodes (shuffling).
Any advice?

If you are working with graphs I would suggest that you take a look at either of these libraries:
GraphX
GraphFrames
They both provide the connected components algorithm out of the box.
GraphX:
val graph: Graph = ...
val cc = graph.connectedComponents().vertices
GraphFrames:
val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
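Since the question's IDs are strings, GraphFrames is the more direct fit: it accepts arbitrary ID types. A hedged PySpark sketch (the HDFS path and checkpoint directory are illustrative; connectedComponents requires a checkpoint directory to be set):

from graphframes import GraphFrame

edges = spark.read.csv("hdfs:///graph/edges.tsv", sep="\t").toDF("src", "dst")
vertices = (edges.selectExpr("src as id")
                 .union(edges.selectExpr("dst as id"))
                 .distinct())

spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoints")  # required by connectedComponents
g = GraphFrame(vertices, edges)
cc = g.connectedComponents()    # DataFrame with columns "id" and "component"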

In addition to @Marsellus Wallace's answer, below is full code to get disjoint sets from an RDD of edges using GraphX.
val edges: RDD[(Long, Long)] = ???

val g = Graph.fromEdgeTuples(edges, -1L)
val disjointSets: RDD[Iterable[Long]] = g.connectedComponents()
  // Get tuples with (vertexId, parent vertexId)
  .vertices
  // Group by parent vertex id so it aggregates the disjoint set
  .groupBy(_._2)
  .values
  .map(_.map(_._1))
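One wrinkle for the question as asked: GraphX vertex IDs must be Longs, while the source IDs are strings. A sketch of bridging that gap with zipWithUniqueId (PySpark RDD API shown; the same calls exist on Scala RDDs):

# edges: RDD of (string_id1, string_id2) pairs parsed from the HDFS file
ids = edges.flatMap(lambda e: e).distinct()
id_to_long = ids.zipWithUniqueId()              # (string_id, unique Long)

long_edges = (edges.join(id_to_long).values()   # (id2, long1)
                   .join(id_to_long).values())  # (long1, long2)
long_to_id = id_to_long.map(lambda kv: (kv[1], kv[0]))  # to translate results back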

Related

Removing unnecessary nodes in graph

I have a graph that has two distinct classes of nodes, class A nodes and class B nodes.
Class A nodes are not connected to any other A nodes and class B nodes aren’t connected to any other B nodes, but B nodes are connected to A nodes and vice versa. Some B nodes are connected to lots of A nodes and most A nodes are connected to lots of B nodes.
I want to eliminate as many of the A nodes as possible from the graph.
I must keep all of the B nodes, and they must still be connected to at least one A node (preferably only one A node).
I can eliminate an A node when it has no B nodes connected only to it. Are there any algorithms that could find an optimal, or at least close to optimal, solution for which A nodes I can remove?
Old, Incorrect Answer, But Start Here
First, you need to recognize that you have a bipartite graph. That is, you can colour the nodes red and blue such that no edges connect a red node to a red node or a blue node to a blue node.
Next, recognize that you're trying to solve a vertex cover problem. From Wikipedia:
In the mathematical discipline of graph theory, a vertex cover (sometimes node cover) of a graph is a set of vertices such that each edge of the graph is incident to at least one vertex of the set. The problem of finding a minimum vertex cover is a classical optimization problem in computer science and is a typical example of an NP-hard optimization problem that has an approximation algorithm.
Since you have a special graph, it's reasonable to think that maybe the NP-hardness doesn't apply to you. This thought brings us to Kőnig's theorem, which relates the maximum matching problem to the minimum vertex cover problem. Once you know this, you can apply the Hopcroft–Karp algorithm to solve the problem in O(|E|√|V|) time, though you'll probably need to jigger it a bit to ensure you keep all the B nodes.
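For reference, networkx bundles both pieces, so the matching-to-cover step can be sketched like this (a sketch, assuming G is the bipartite graph and a_nodes is the set of A-nodes):

from networkx.algorithms import bipartite

# Hopcroft-Karp maximum matching, then Konig's theorem turns it
# into a minimum vertex cover of the bipartite graph.
matching = bipartite.hopcroft_karp_matching(G, top_nodes=a_nodes)
cover = bipartite.to_vertex_cover(G, matching, top_nodes=a_nodes)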
New, Correct Answer
It turns out this jiggering is the creation of a "constrained bipartite graph vertex cover problem", which asks whether there is a vertex cover that uses fewer than a A-nodes and fewer than b B-nodes. The problem is NP-complete, so that's a no-go. The jiggering was harder than I thought!
But using fewer than the minimum number of nodes isn't the constraint we want anyway. We want to ensure that the minimum number of A-nodes and the maximum number of B-nodes are used.
Kőnig's theorem, above, is a special case of the maximum flow problem. Thinking about the problem in terms of flows brings us pretty quickly to minimum-cost flow problems.
In these problems we're given a graph whose edges have specified capacities and unit costs of transport. The goal is to find the minimum cost needed to move a supply of a given quantity from an arbitrary set of source nodes to an arbitrary set of sink nodes.
It turns out your problem can be converted into a minimum-cost flow problem. To do so, let us generate a source node that connects to all the A nodes and a sink node that connects to all the B nodes.
Now, let us make the cost of using a Source->A edge equal to 1 and give all other edges a cost of zero. Further, let us make the capacity of the Source->A edges equal to infinity and the capacity of all other edges equal to 1.
This looks like the following:
The red edges have Cost=1, Capacity=Inf. The blue edges have Cost=0, Capacity=1.
Now, solving the minimum flow problem becomes equivalent to using as few red edges as possible. Any red edge that isn't used allocates 0 flow to its corresponding A node and that node can be removed from the graph. Conversely, each B node can only pass 1 unit of flow to the sink, so all B nodes must be preserved in order for the problem to be solved.
Since we've recast your problem into this standard form, we can leverage existing tools to get a solution; namely, Google's Operations Research Tools (OR-Tools).
Doing so gives the following answer to the above graph:
The red edges are unused and the black edges are used. Note that if a red edge emerges from the source the A-node it connects to generates no black edges. Note also that each B-node has at least one in-coming black edge. This satisfies the constraints you posed.
We can now detect the A-nodes to be removed by looking for Source->A edges with zero usage.
Source Code
The source code necessary to generate the foregoing figures and associated solutions is as follows:
#!/usr/bin/env python3

#Documentation: https://developers.google.com/optimization/flow/mincostflow
#Install dependency: pip3 install ortools

from ortools.graph import pywrapgraph
import matplotlib.pyplot as plt
import networkx as nx
import random
import sys


def GenerateGraph(Acount, Bcount):
    assert Acount > 5
    assert Bcount > 5
    G = nx.DiGraph()  # Directed graph
    source_node = Acount + Bcount
    sink_node = source_node + 1
    for a in range(Acount):
        for i in range(random.randint(int(0.2 * Bcount), int(0.3 * Bcount))):  # Connect to 20-30% of the B-nodes
            b = Acount + random.randint(0, Bcount - 1)  # In the half-open range [0,Bcount). Offset from A's indices
            G.add_edge(source_node, a, capacity=99999, unit_cost=1, usage=1)
            G.add_edge(a, b, capacity=1, unit_cost=0, usage=1)
            G.add_edge(b, sink_node, capacity=1, unit_cost=0, usage=1)
            G.nodes[a]['type'] = 'A'  # networkx >= 2.4 spells this G.nodes, not G.node
            G.nodes[b]['type'] = 'B'
    G.nodes[source_node]['type'] = 'source'
    G.nodes[sink_node]['type'] = 'sink'
    G.nodes[source_node]['supply'] = Bcount
    G.nodes[sink_node]['supply'] = -Bcount
    return G


def VisualizeGraph(graph, color_type):
    Acount = len([1 for p, d in graph.nodes(data=True) if d['type'] == 'A'])
    Bcount = len([1 for p, d in graph.nodes(data=True) if d['type'] == 'B'])
    if color_type == 'usage':
        edge_color = ['black' if d['usage'] > 0 else 'red' for u, v, d in graph.edges(data=True)]
    elif color_type == 'unit_cost':
        edge_color = ['red' if d['unit_cost'] > 0 else 'blue' for u, v, d in graph.edges(data=True)]
    # Lay the nodes out in columns: source, A-nodes, B-nodes, sink
    Ai = 0
    Bi = 0
    pos = dict()
    for p, d in graph.nodes(data=True):
        if d['type'] == 'source':
            pos[p] = (0, Acount / 2)
        elif d['type'] == 'sink':
            pos[p] = (3, Bcount / 2)
        elif d['type'] == 'A':
            pos[p] = (1, Ai)
            Ai += 1
        elif d['type'] == 'B':
            pos[p] = (2, Bi)
            Bi += 1
    nx.draw(graph, pos=pos, edge_color=edge_color, arrows=False)
    plt.show()


def GenerateMinCostFlowProblemFromGraph(graph):
    min_cost_flow = pywrapgraph.SimpleMinCostFlow()
    for node, neighbor, data in graph.edges(data=True):
        min_cost_flow.AddArcWithCapacityAndUnitCost(node, neighbor, data['capacity'], data['unit_cost'])
    for p, d in graph.nodes(data=True):
        if (d['type'] == 'source' or d['type'] == 'sink') and 'supply' in d:
            min_cost_flow.SetNodeSupply(p, d['supply'])
    return min_cost_flow


def ColorGraphEdgesByUsage(graph, min_cost_flow):
    for i in range(min_cost_flow.NumArcs()):
        graph[min_cost_flow.Tail(i)][min_cost_flow.Head(i)]['usage'] = min_cost_flow.Flow(i)


def main():
    Acount = 20
    Bcount = 20
    graph = GenerateGraph(Acount, Bcount)
    VisualizeGraph(graph, 'unit_cost')  # show the problem setup (costs and capacities)
    min_cost_flow = GenerateMinCostFlowProblemFromGraph(graph)
    if min_cost_flow.Solve() != min_cost_flow.OPTIMAL:
        print('Unable to find a solution! It is likely that one does not exist for this input.')
        sys.exit(-1)
    print('Minimum cost:', min_cost_flow.OptimalCost())
    ColorGraphEdgesByUsage(graph, min_cost_flow)
    VisualizeGraph(graph, 'usage')  # show which edges actually carry flow


if __name__ == '__main__':
    main()
Although this is an old question, I see it has not been correctly answered yet.
An analogous question to this one has also been answered earlier in this post.
The problem you are presenting here is indeed the Minimum Set Cover Problem, which is one of the well-known NP-hard problems. From the Wikipedia, the Minimum Set Cover Problem can be formulated as:
Given a set of elements {1,2,...,n} (called the universe) and a collection S of m sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of S whose union equals the universe. For example, consider the universe U={1,2,3,4,5} and the collection of sets S={{1,2,3},{2,4},{3,4},{4,5}}. Clearly the union of S is U. However, we can cover all of the elements with the following, smaller number of sets: {{1,2,3},{4,5}}.
In your formulation, B nodes represent the elements in the universe, A nodes represent the sets and edges between A nodes and B nodes determine which elements (B nodes) belong to each set (A node). Then, the minimum set cover is equivalent to the minimum number of A nodes so that they are connected to all B nodes. Consequently, the maximum number of A nodes which can be removed from the graph while being connected to every B node are those which do not belong to the minimum set cover.
Since it is NP-hard, no polynomial-time algorithm for computing the optimum is known, but a simple greedy algorithm can efficiently provide approximate solutions with tight bounds on the optimum. From the Wikipedia:
There is a greedy algorithm for polynomial time approximation of set covering that chooses sets according to one rule: at each stage, choose the set that contains the largest number of uncovered elements.
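A minimal sketch of that greedy rule in Python (the sets are given as Python sets):

def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))
        if not best & uncovered:
            raise ValueError("the given sets do not cover the universe")
        cover.append(best)
        uncovered -= best
    return cover

# The example from the quote:
print(greedy_set_cover({1, 2, 3, 4, 5},
                       [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]))
# -> [{1, 2, 3}, {4, 5}]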

Isolate connected_components

I have the following networkx graph excerpt:
The following functions were executed to explore the structure of the connected components, as I have a sparse network with lots of singular connections:
nx.number_connected_components(G)
>>> 702
list(nx.connected_components(G))
>>> [{120930, 172034},
{118787, 173867, 176202},
{50376, 151561},
...]
Question: How can I restrict my whole graph visualization to connected_components with equal or more than three nodes?
graphs = list(nx.connected_component_subgraphs(G))
list_subgraphs = [node for g in graphs if len(g) >= 3 for node in g]
F = G.subgraph(list_subgraphs)
This creates a flat list of the nodes in all components with at least three nodes, and takes the subgraph of G induced by them.
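Note that connected_component_subgraphs was removed in networkx 2.4; on newer versions an equivalent sketch is:

# networkx >= 2.4: build the component subgraphs directly
graphs = [G.subgraph(c) for c in nx.connected_components(G)]
list_subgraphs = [node for g in graphs if len(g) >= 3 for node in g]
F = G.subgraph(list_subgraphs)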
We can create a subgraph containing the components with three or more nodes:
s = G.subgraph(
    set.union(
        *filter(lambda x: len(x) >= 3, nx.connected_components(G))
    )
)
Now you just need to visualize this subgraph s.
You might need an independent copy instead of a SubGraph view; in that case, s = s.copy() will make a copy of the subgraph.
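For example, to draw just this subgraph (drawing parameters are illustrative):

import matplotlib.pyplot as plt
import networkx as nx

nx.draw(s, node_size=30, with_labels=False)
plt.show()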

Error in the algorithm of ordering nodes in the undirected graph

The idea is to construct a list of the nodes in an undirected graph, ordered by their degrees.
The graph is given in the form {node: (set of its neighbours) for node in the graph}.
The code raises a KeyError exception at the line graph[neighbor].remove(node). It seems like the node has already been deleted from the set, but I don't see where.
Can anyone please point out the mistake?
Edit: This list of nodes is used in the simulation of a targeted attack, in order of the degrees of the nodes in the graph. So, after an attack on the node with the biggest degree, that node is removed from the graph, and the degrees of the remaining nodes should be recalculated accordingly.
def fast_targeted_order(graph):
    """returns an ordered list of the nodes in the graph in decreasing
    order of their degrees"""
    number_of_nodes = len(graph)
    # initialise a list of sets of every possible degree
    degree_sets = [set() for dummy_ind in range(number_of_nodes)]
    # group nodes in sets according to their degrees
    for node in graph:
        degree = len(graph[node])
        degree_sets[degree] |= {node}
    ordered_nodes = []
    # starting from the set of nodes with the maximal degree
    for degree in range(number_of_nodes - 1, -1, -1):
        # copy the set to avoid raising the exception "set changed size
        # during iteration"
        copied_degree_set = degree_sets[degree].copy()
        while degree_sets[degree]:
            for node in copied_degree_set:
                degree_sets[degree] -= {node}
                for neighbor in graph[node]:
                    neighbor_degree = len(graph[neighbor])
                    degree_sets[neighbor_degree] -= {neighbor}
                    degree_sets[neighbor_degree - 1] |= {neighbor}
                    graph[neighbor].remove(node)
                ordered_nodes.append(node)
                graph.pop(node)
    return ordered_nodes
My previous answer (now deleted) was incorrect: the issue was not in using a set, but in deleting items from a sequence while iterating over that same sequence.
The Python tutorial for version 3.1 clearly warns:
It is not safe to modify the sequence being iterated over in the loop
(this can only happen for mutable sequence types, such as lists). If
you need to modify the list you are iterating over (for example, to
duplicate selected items) you must iterate over a copy.
However, the tutorial for Python 3.5 (which I use) only advises:
If you need to modify the sequence you are iterating over while inside
the loop (for example to duplicate selected items), it is
recommended that you first make a copy.
It appears that this operation is still very unpredictable in Python 3.5, producing different results with the same input.
From my point of view, the previous version of the tutorial is preferable to the current one.
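A minimal demonstration of the pitfall (CPython raises immediately for sets):

s = {1, 2, 3}
for x in s:
    s.discard(x)   # RuntimeError: Set changed size during iteration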
@PetarPetrovic and @jdehesa, thanks for the valuable advice.
Working solution:
from collections import OrderedDict

def fast_targeted_order(ugraph):
    """
    input: undirected graph in the form {node: set of node's neighbors}
    returns an ordered list of the nodes in decreasing order of their degrees
    """
    graph = copy_graph(ugraph)  # work on a copy so the input graph survives
    number_of_nodes = len(graph)
    degrees_dict = {degree: list() for degree in range(number_of_nodes)}
    for node in graph:
        degree = len(graph[node])
        degrees_dict[degree].append(node)
    ordered_degrees = OrderedDict(sorted(degrees_dict.items(),
                                         key=lambda key_value: key_value[0],
                                         reverse=True))
    ordered_nodes = []
    for degree, nodes in ordered_degrees.items():
        nodes_copy = nodes[:]
        for node in nodes_copy:
            if node in nodes:
                for neighbor in graph[node]:
                    neighbor_degree = len(graph[neighbor])
                    ordered_degrees[neighbor_degree].remove(neighbor)
                    if neighbor_degree:
                        ordered_degrees[neighbor_degree - 1].append(neighbor)
                    graph[neighbor].remove(node)
                graph.pop(node)
                ordered_degrees[degree].remove(node)
                ordered_nodes.append(node)
    return ordered_nodes
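The copy_graph helper isn't shown above; a minimal version of it, plus a usage sketch:

def copy_graph(ugraph):
    """Deep-copy the adjacency dict so the caller's graph is untouched."""
    return {node: set(neighbors) for node, neighbors in ugraph.items()}

graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(fast_targeted_order(graph))
# -> [0, 1, 3, 2] (degrees are recalculated after each removal)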

How to store a Euler graph struct?

I'm working on the Euler Path problem and ran into a question: how to define or store an Euler graph struct?
A usual way is using an adjacency matrix, where C[i][j] stores the edge between i and j. It's concise and effective! But this kind of matrix is limited to the situation where the edge between 2 nodes is unique (figure 1).
class EulerPath
{
    int[][] c; // adjacency matrix, c[i][j] means the edge between i and j
}
What if there are several edges (figure 2)? My solution might be to use customized classes, like "Graph", "Node", and "Edge", to store a graph. But dividing the graph into discrete structs means we have to take more class details into consideration, which may hurt efficiency and concision. So I'm very eager to hear your advice! Thanks a lot!
class EulerPath
{
    class Graph
    {
        Node[] Nodes;
        Edge[] Edges;
    }
    class Node { ... }
    class Edge { ... }
}
You can use an adjacency matrix to store graphs with multi-edges. You just let the value of c[i][j] be the number of times that vertex i is adjacent to vertex j. In your first case, it's 1, in your second case, it's 3. See also Wikipedia -- adjacency matrices aren't defined as being composed of only 1 and 0, that's just the special case of an adjacency matrix for a simple graph.
EDIT: You can represent your second graph in an adjacency matrix like this:
   1 2 3 4
1  0 3 1 1
2  3 0 1 1
3  1 1 0 0
4  1 1 0 0
You can do this in at least three ways:
Adjacency list
Meaning that you have a 2D array called al[N][N], where the first index is the node and the second index runs over that node's neighbour nodes.
Example, a graph with this input:
0 => 1
1 => 2
2 => 3
3 => 1
The adjacency list will look like this:
0 [1]
1 [2,3]
2 [1,3]
3 [1,2]
PS: Since this is a 2D array and not all horizontal cells are going to be used, you need to keep track of the number of connected neighbours for each node index, because some programming languages initialise array values with zero, which is itself a valid node index in the graph. This can be done easily by creating another array that counts the number of neighbours for each node index. Example for this case: numLinks: [1,2,2,2]
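As a sketch in a language with dynamic lists (Python here), that bookkeeping disappears because each row grows as needed:

edges = [(0, 1), (1, 2), (2, 3), (3, 1)]
al = {}
for u, v in edges:
    al.setdefault(u, []).append(v)
    al.setdefault(v, []).append(u)   # undirected: record both directions
# al == {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [2, 1]}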
Matrix
With a matrix, you create an N x N 2D array, and you put a 1 in each cell where the row node and column node are neighbours:
Example with the same input above:
   0 1 2 3
0  0 1 0 0
1  1 0 1 1
2  0 1 0 1
3  0 1 1 0
Class Node
The last method is creating a class called Node that contains a dynamic array of type Node, in which you store the other nodes it is connected to (a sketch follows below).
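A minimal sketch of that idea (Python; the names are illustrative):

class Node:
    def __init__(self, label):
        self.label = label
        self.neighbors = []          # the other Node objects this node connects to

    def connect(self, other):
        # for parallel edges, simply call connect once per edge
        self.neighbors.append(other)
        other.neighbors.append(self)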
Consider using a vector of linked lists. Add a class that will have a field for a Vertex as well as the Weight (let's name it Entry). Your weights should preferably be another vector or linked list (preferably a linked list) containing all possible weights to the corresponding Vertex. Your main class will have a vector of vectors, or a vector of linked lists (I'd prefer linked lists, since you will most likely not need random access and will be iterating through every Entry when performing any operation anyway). Your main class will have one more vector containing all vertices. In C++ this would look like this:
class Graph {
    std::vector<std::forward_list<Entry>> adj_list;
    std::vector<Vertex> vertices;
};
The Vertex that corresponds to vertices[i] has its list in adj_list[i]. Since every Entry contains the info about the Vertex to which you are connected and the corresponding weights, you will have your graph represented by this class.
Efficiency for what type of operation?
If you want to find a route between two IP addresses on the internet, then your adjacency matrix might be a million nodes squared, i.e., 10^12 entries. And since finding all the nodes connected to a given node goes up as n, you could be looking at a million lookups per node just to find the nodes connected to that node. Horribly inefficient.
If your problem only involves a few nodes and is run infrequently, then adjacency matrices are simple and intuitive.
For most problems which involve traversing graphs, a better solution could be to create a class called node, which has as a property a collection (say a List) of all the nodes it is connected to. For most real world applications, the list of connected nodes is much smaller than the total number of all nodes, so this works out as more compact. Plus it is highly efficient in finding edges - you can get a list of all connected nodes in fixed time per node.
If you use this structure, where a node class contains as a property a collection of all the nodes it is connected to, then when you create a new edge (say between node A and node B) you add B to the collection of nodes to which A is connected, and A to the collection of nodes to which B is connected. Excuse my Java/C#, something like:
class Node {
    ArrayList<Node> connectedNodes;

    public Node() { // initializer
        connectedNodes = new ArrayList<Node>();
    }
}

// and somewhere else you have this definition:
public void addEdgeBetween(Node firstNode, Node secondNode) {
    firstNode.connectedNodes.add(secondNode);
    secondNode.connectedNodes.add(firstNode);
}
And similarly to delete an edge, remove the reference in A to B's collection and vice versa. There is no need to define a separate edge class, edges are implicit in the structure which cross-links the two nodes.
And that's about all you have to do to implement this structure, which (for most real world problems) uses far less memory than an adjacency matrix, is much faster for large numbers of nodes, and is ultimately far more flexible.
Defining a node class also opens up a logical place to add enhancements of many sorts. For example, you might decide to generate for each node a list of all the nodes which are two steps away, because this improves path finding. You can easily add this in as another collection within the node class; this would be a pretty messy thing to do with adjacency matrices. You can obviously squeeze a lot more functionality into a class than a into a matrix of ints.
Your question concerning multiple links is unclear to me. If you want multiple edges between the same two points, then this can be accommodated both ways. In adjacency matrices, simply have a number at that row and column indicating the number of links. If you use a node class, just add each edge separately. Directed graphs work similarly: an edge pointing from A to B has a reference to B in A's list of connected nodes, but B doesn't have A in its list.

Algorithm for Finding Redundant Edges in a Graph or Tree

Is there an established algorithm for finding redundant edges in a graph?
For example, I'd like to find that a->d and a->e are redundant, and then get rid of them.
Edit: Strilanc was nice enough to read my mind for me. "Redundant" was too strong of a word, since in the example above, neither a->b nor a->c is considered redundant, but a->d is.
You want to compute the smallest graph which maintains vertex reachability.
This is called the transitive reduction of a graph. The Wikipedia article should get you started down the right road.
Since the Wikipedia article mentioned by @Craig gives only a hint about an implementation, I post my implementation with Java 8 streams:
// Assumes the following imports:
// import java.util.*;
// import java.util.Map.Entry;
// import static java.util.Collections.emptySet;
// import static java.util.stream.Collectors.*;

Map<String, Set<String>> reduction = usages.entrySet().stream()
        .collect(toMap(
                Entry::getKey,
                (Entry<String, Set<String>> entry) -> {
                    String start = entry.getKey();
                    Set<String> neighbours = entry.getValue();
                    // BFS over everything reachable through a neighbour
                    Set<String> visited = new HashSet<>();
                    Queue<String> queue = new LinkedList<>(neighbours);
                    while (!queue.isEmpty()) {
                        String node = queue.remove();
                        usages.getOrDefault(node, emptySet()).forEach(next -> {
                            if (next.equals(start)) {
                                throw new RuntimeException("Cycle detected!");
                            }
                            if (visited.add(next)) {
                                queue.add(next);
                            }
                        });
                    }
                    // keep only the edges not implied by a longer path
                    return neighbours.stream()
                            .filter(s -> !visited.contains(s))
                            .collect(toSet());
                }
        ));
Several ways to attack this, but first you're going to need to define the problem a little more precisely. First, the graph you have here is acyclic and directed: will this always be true?
Next, you need to define what you mean by a "redundant edge". In this case, you start with a graph which has two paths a->c: one via b and one direct one. From this I infer that by "redundant" you mean something like this. Let G = <V, E> be a graph, with V the set of vertices and E ⊆ V×V the set of edges. It kinda looks like you're defining edges from v_i to v_j that are shorter than the longest path between them as "redundant". So the easiest thing would be to use depth-first search, enumerate the paths, and when you find a new one that's longer, save it as the best candidate.
I can't imagine what you want it for, though. Can you tell?
I think the easiest way is to imagine how it would look in the real world. Imagine you have joints, like
(A->B)(B->C)(A->C). If the distance between neighbouring nodes equals 1, then
(A->B) = 1, (B->C) = 1, (A->C) = 2.
So you can remove the joint (A->C).
In other words, minimize.
This is just my initial idea of how I would approach it. There are various articles and sources on the net; you can look at them and go deeper.
Resources that will help you:
Algorithm for Removing Redundant Edges in the Dual Graph of a Non-Binary CSP
Graph Data Structure and Basic Graph Algorithms
Google Books, On finding minimal two connected Subgraphs
Graph Reduction
Redundant trees for preplanned recovery in arbitrary vertex-redundant or edge-redundant graphs
I had a similar problem and ended up solving it this way:
My data structure is a dependents dictionary, mapping a node id to the list of nodes that depend on it (i.e., its followers in the DAG). Note it works only for a DAG, that is, a directed acyclic graph. (The methods are shown as free functions below; dependents and nodes come from the surrounding data structure.)
I haven't calculated its exact complexity, but it swallowed my graph of several thousand nodes in a split second.
_transitive_closure_cache = {}

def transitive_closure(node_id):
    """returns a set of all the node ids reachable from the given node_id"""
    global _transitive_closure_cache
    if node_id in _transitive_closure_cache:
        return _transitive_closure_cache[node_id]
    c = set(d.id for d in dependents[node_id])
    for d in dependents[node_id]:
        c.update(transitive_closure(d.id))  # for the non-pythonists: update = union in place
    _transitive_closure_cache[node_id] = c
    return c

def can_reduce(source_id, dest_id):
    """returns True if the edge (source_id, dest_id) is redundant
    (dest_id can be reached from source_id without it)"""
    for d in dependents[source_id]:
        if d.id == dest_id:
            continue
        if dest_id in transitive_closure(d.id):
            return True  # dest can be reached by a less direct path, so this link is redundant
    return False

# Reduce redundant edges:
for node in nodes:
    dependents[node.id] = [d for d in dependents[node.id] if not can_reduce(node.id, d.id)]
