Isolate connected_components - filter

I have the following networkx graph excerpt:
I ran the following functions to explore the structure of the connected components, as I have a sparse network with many small, isolated components:
nx.number_connected_components(G)
>>> 702
list(nx.connected_components(G))
>>> [{120930, 172034},
{118787, 173867, 176202},
{50376, 151561},
...]
Question: How can I restrict my whole graph visualization to connected components with three or more nodes?

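# nx.connected_component_subgraphs was removed in networkx 2.4;
# building the subgraphs from nx.connected_components works on old and new versions.
graphs = [G.subgraph(c) for c in nx.connected_components(G)]
list_subgraphs = [node for g in graphs for node in g if len(g) >= 3]
F = G.subgraph(list_subgraphs)
This builds a flat list of the nodes of every component with at least three nodes and takes the subgraph they induce.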

We can create a subgraph containing the components with three or more nodes:
s = G.subgraph(
    set.union(
        *filter(lambda x: len(x) >= 3, nx.connected_components(G))
    )
)
Now you just need to visualize this subgraph s.
G.subgraph returns a SubGraph view of G whose structure cannot be changed. If you need an independent graph that you can modify, s = s.copy() makes a copy of the subgraph.
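One caveat with the set.union(*...) form is that it raises a TypeError when no component passes the filter, because set.union then receives no arguments. A minimal sketch of a variant that also handles that case is shown below; the helper name keep_large_components and the toy graph are made up for illustration.

```python
import networkx as nx


def keep_large_components(G, min_size=3):
    """Return a copy of G restricted to components with at least min_size nodes."""
    keep = set()
    for component in nx.connected_components(G):
        if len(component) >= min_size:
            keep |= component
    # .copy() turns the SubGraph view into an independent graph
    return G.subgraph(keep).copy()


# Toy graph with components of size 2, 3 and 4
G = nx.Graph()
G.add_edges_from([(1, 2), (3, 4), (4, 5), (6, 7), (7, 8), (8, 9)])
F = keep_large_components(G)
print(sorted(F.nodes()))
```

The result F can then be passed to nx.draw(F) (which needs matplotlib) for the visualization.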

Related

Z3 connected components

As part of a puzzle solver, I dynamically build a graph, nodes and edges, from user input.
Each node is assigned an integer const, representing which connected component it is part of.
Nodes are restricted to have the same connected component as their neighbors.
Currently, it can assign the same number to multiple components, but I want to restrict the solution so that each connected component must have a unique number.
I cannot wrap my head around how to express this constraint in Z3.
I'm not looking for working code, just a high level description of how to approach this, and I can take it from there.
Thanks in advance!
Nodes of a graph component connected via other nodes can be modelled using TransitiveClosure.
The following snippet might get you started:
# https://stackoverflow.com/q/56496558/1911064
from z3 import *
B = BoolSort()
NodeSort = DeclareSort('Node')
R = Function('R', NodeSort, NodeSort, B)
TC_R = TransitiveClosure(R)
s = Solver()
na, nb, nc = Consts('na nb nc', NodeSort)
s.add(R(na, nb))
s.add(R(nb, nc))
s.add(Not(TC_R(na, nc)))
print(s.check()) # produces unsat

Removing unnecessary nodes in graph

I have a graph that has two distinct classes of nodes, class A nodes and class B nodes.
Class A nodes are not connected to any other A nodes and class B nodes aren’t connected to any other B nodes, but B nodes are connected to A nodes and vice versa. Some B nodes are connected to lots of A nodes and most A nodes are connected to lots of B nodes.
I want to eliminate as many of the A nodes as possible from the graph.
I must keep all of the B nodes, and they must still be connected to at least one A node (preferably only one A node).
I can eliminate an A node when it has no B nodes connected only to it. Are there any algorithms that could find an optimal, or at least close to optimal, solution for which A nodes I can remove?
Old, Incorrect Answer, But Start Here
First, you need to recognize that you have a bipartite graph. That is, you can colour the nodes red and blue such that no edges connect a red node to a red node or a blue node to a blue node.
Next, recognize that you're trying to solve a vertex cover problem. From Wikipedia:
In the mathematical discipline of graph theory, a vertex cover (sometimes node cover) of a graph is a set of vertices such that each edge of the graph is incident to at least one vertex of the set. The problem of finding a minimum vertex cover is a classical optimization problem in computer science and is a typical example of an NP-hard optimization problem that has an approximation algorithm.
Since you have a special graph, it's reasonable to think that maybe the NP-hard doesn't apply to you. This thought brings us to Kőnig's theorem which relates the maximum matching problem to the minimum vertex cover problem. Once you know this, you can apply the Hopcroft–Karp algorithm to solve the problem in O(|E|√|V|) time, though you'll probably need to jigger it a bit to ensure you keep all the B nodes.
New, Correct Answer
It turns out this jiggering is the creation of a "constrained bipartite graph vertex cover problem", which asks whether there is a vertex cover that uses fewer than a A-nodes and fewer than b B-nodes. The problem is NP-complete, so that's a no-go. The jiggering was harder than I thought!
But using less than the minimum number of nodes isn't the constraint we want. We want to ensure that the minimum number of A-nodes is used and the maximum number of B-nodes.
Kőnig's theorem, above, is a special case of the maximum flow problem. Thinking about the problem in terms of flows brings us pretty quickly to minimum-cost flow problems.
In these problems we're given a graph whose edges have specified capacities and unit costs of transport. The goal is to find the minimum cost needed to move a supply of a given quantity from an arbitrary set of source nodes to an arbitrary set of sink nodes.
It turns out your problem can be converted into a minimum-cost flow problem. To do so, let us generate a source node that connects to all the A nodes and a sink node that connects to all the B nodes.
Now, let us make the cost of using a Source->A edge equal to 1 and give all other edges a cost of zero. Further, let us make the capacity of the Source->A edges equal to infinity and the capacity of all other edges equal to 1.
This looks like the following:
The red edges have Cost=1, Capacity=Inf. The blue edges have Cost=0, Capacity=1.
Now, solving the minimum-cost flow problem becomes equivalent to using as few red edges as possible. Any red edge that isn't used allocates 0 flow to its corresponding A node and that node can be removed from the graph. Conversely, each B node can only pass 1 unit of flow to the sink, so all B nodes must be preserved in order for the problem to be solved.
Since we've recast your problem into this standard form, we can leverage existing tools to get a solution; namely, Google's OR-Tools (Operations Research Tools).
Doing so gives the following answer to the above graph:
The red edges are unused and the black edges are used. Note that if a red edge emerges from the source the A-node it connects to generates no black edges. Note also that each B-node has at least one in-coming black edge. This satisfies the constraints you posed.
We can now detect the A-nodes to be removed by looking for Source->A edges with zero usage.
Source Code
The source code necessary to generate the foregoing figures and associated solutions is as follows:
#!/usr/bin/env python3
# Documentation: https://developers.google.com/optimization/flow/mincostflow
# Install dependencies: pip3 install ortools networkx matplotlib
# Note: this uses the legacy pywrapgraph interface; newer OR-Tools releases
# expose SimpleMinCostFlow under ortools.graph.python.min_cost_flow instead.
from ortools.graph import pywrapgraph
import matplotlib.pyplot as plt
import networkx as nx
import random
import sys
def GenerateGraph(Acount, Bcount):
    assert Acount > 5
    assert Bcount > 5
    G = nx.DiGraph()  # Directed graph
    source_node = Acount + Bcount
    sink_node = source_node + 1
    for a in range(Acount):
        # Connect each A node to roughly 20-30% of the B nodes.
        # Newer Python versions reject float arguments to randint, hence int().
        for i in range(random.randint(int(0.2 * Bcount), int(0.3 * Bcount))):
            # B index drawn from [0, Bcount), offset past the A indices
            b = Acount + random.randint(0, Bcount - 1)
            G.add_edge(source_node, a, capacity=99999, unit_cost=1, usage=1)
            G.add_edge(a, b, capacity=1, unit_cost=0, usage=1)
            G.add_edge(b, sink_node, capacity=1, unit_cost=0, usage=1)
            # G.nodes replaces G.node, which newer networkx releases no longer provide
            G.nodes[a]['type'] = 'A'
            G.nodes[b]['type'] = 'B'
    G.nodes[source_node]['type'] = 'source'
    G.nodes[sink_node]['type'] = 'sink'
    # Assumes every B node received at least one edge; otherwise the
    # flow problem below is infeasible.
    G.nodes[source_node]['supply'] = Bcount
    G.nodes[sink_node]['supply'] = -Bcount
    return G
def VisualizeGraph(graph, color_type):
    Acount = len([1 for p, d in graph.nodes(data=True) if d['type'] == 'A'])
    Bcount = len([1 for p, d in graph.nodes(data=True) if d['type'] == 'B'])
    if color_type == 'usage':
        edge_color = ['black' if d['usage'] > 0 else 'red' for u, v, d in graph.edges(data=True)]
    elif color_type == 'unit_cost':
        edge_color = ['red' if d['unit_cost'] > 0 else 'blue' for u, v, d in graph.edges(data=True)]
    # Lay the nodes out in four columns: source, A nodes, B nodes, sink
    Ai = 0
    Bi = 0
    pos = dict()
    for p, d in graph.nodes(data=True):
        if d['type'] == 'source':
            pos[p] = (0, Acount / 2)
        elif d['type'] == 'sink':
            pos[p] = (3, Bcount / 2)
        elif d['type'] == 'A':
            pos[p] = (1, Ai)
            Ai += 1
        elif d['type'] == 'B':
            pos[p] = (2, Bi)
            Bi += 1
    nx.draw(graph, pos=pos, edge_color=edge_color, arrows=False)
    plt.show()
def GenerateMinCostFlowProblemFromGraph(graph):
    min_cost_flow = pywrapgraph.SimpleMinCostFlow()
    # One arc per graph edge, carrying that edge's capacity and unit cost
    for node, neighbor, data in graph.edges(data=True):
        min_cost_flow.AddArcWithCapacityAndUnitCost(node, neighbor, data['capacity'], data['unit_cost'])
    # Only the source and the sink have a non-zero supply
    for p, d in graph.nodes(data=True):
        if (d['type'] == 'source' or d['type'] == 'sink') and 'supply' in d:
            min_cost_flow.SetNodeSupply(p, d['supply'])
    return min_cost_flow
def ColorGraphEdgesByUsage(graph, min_cost_flow):
    # Copy the flow on every arc back onto the matching graph edge
    for i in range(min_cost_flow.NumArcs()):
        graph[min_cost_flow.Tail(i)][min_cost_flow.Head(i)]['usage'] = min_cost_flow.Flow(i)
def main():
    """Build a random A/B graph, solve the min-cost flow problem and plot the result."""
    Acount = 20
    Bcount = 20
    graph = GenerateGraph(Acount, Bcount)
    VisualizeGraph(graph, 'unit_cost')
    min_cost_flow = GenerateMinCostFlowProblemFromGraph(graph)
    # Find the minimum cost flow from the source node to the sink node.
    if min_cost_flow.Solve() != min_cost_flow.OPTIMAL:
        print('Unable to find a solution! It is likely that one does not exist for this input.')
        sys.exit(-1)
    print('Minimum cost:', min_cost_flow.OptimalCost())
    ColorGraphEdgesByUsage(graph, min_cost_flow)
    VisualizeGraph(graph, 'usage')
if __name__ == '__main__':
    main()
Although this is an old question, I see it has not been correctly answered yet.
An analogous question to this one has also been answered earlier in this post.
The problem you are presenting here is in fact the Minimum Set Cover Problem, which is one of the well-known NP-hard problems. From Wikipedia, the Minimum Set Cover Problem can be formulated as:
Given a set of elements {1,2,...,n} (called the universe) and a collection S of m sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of S whose union equals the universe. For example, consider the universe U={1,2,3,4,5} and the collection of sets S={{1,2,3},{2,4},{3,4},{4,5}}. Clearly the union of S is U. However, we can cover all of the elements with the following, smaller number of sets: {{1,2,3},{4,5}}.
In your formulation, B nodes represent the elements in the universe, A nodes represent the sets, and edges between A nodes and B nodes determine which elements (B nodes) belong to each set (A node). The minimum set cover is then the smallest set of A nodes that is connected to all B nodes. Consequently, the A nodes that can be removed from the graph while every B node stays connected are exactly those which do not belong to the minimum set cover.
Since the problem is NP-hard, there is no known polynomial-time algorithm for computing the optimum, but a simple greedy algorithm can efficiently provide approximate solutions with tight bounds to the optimum. From Wikipedia:
There is a greedy algorithm for polynomial time approximation of set covering that chooses sets according to one rule: at each stage, choose the set that contains the largest number of uncovered elements.
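The greedy rule quoted above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the function name greedy_set_cover and the node names A1..A4 are hypothetical, and the input reuses the Wikipedia example, with each A node mapped to the set of B nodes it is connected to.

```python
def greedy_set_cover(neighbours):
    """neighbours maps each A node to the set of B nodes it is connected to.

    Returns the A nodes picked by the greedy rule: at every step take the
    A node that covers the largest number of still-uncovered B nodes.
    """
    uncovered = set().union(*neighbours.values())
    chosen = []
    while uncovered:
        best = max(neighbours, key=lambda a: len(neighbours[a] & uncovered))
        chosen.append(best)
        uncovered -= neighbours[best]
    return chosen


# The Wikipedia example: universe {1..5}, sets {1,2,3}, {2,4}, {3,4}, {4,5}
neighbours = {"A1": {1, 2, 3}, "A2": {2, 4}, "A3": {3, 4}, "A4": {4, 5}}
kept = greedy_set_cover(neighbours)
removable = [a for a in neighbours if a not in kept]
print(kept, removable)  # ['A1', 'A4'] ['A2', 'A3']
```

Every A node that is not chosen can be removed, and each B node is still connected to at least one remaining A node. The result is an approximation, so on other inputs it may keep more A nodes than the true optimum.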

Disjoint sets on apache spark

I am trying to find an algorithm for computing disjoint sets (connected components / union-find) on a large amount of data with Apache Spark.
The problem is the amount of data. Even the raw representation of the graph's vertices doesn't fit into RAM on a single machine, and neither do the edges.
The source data is a text file of graph edges on HDFS: "id1 \t id2".
The ids are string values, not ints.
The naive solution that I found is:
1. Take the RDD of edges -> [id1:id2] [id3:id4] [id1:id3]
2. Group the edges by key -> [id1:[id2;id3]][id3:[id4]]
3. For each record, assign the minimum id to every member of the group -> (flatMap) [id1:id1][id2:id1][id3:id1][id3:id3][id4:id3]
4. Reverse the RDD from step 3: [id2:id1] -> [id1:id2]
5. leftOuterJoin the RDDs from steps 3 and 4
6. Repeat from step 2 until the size of the RDD from step 3 stops changing
But this results in the transfer of large amounts of data between nodes (shuffling).
Any advice?
If you are working with graphs I would suggest that you take a look at either one of these libraries
GraphX
GraphFrames
They both provide the connected components algorithm out of the box.
GraphX:
val graph: Graph = ...
val cc = graph.connectedComponents().vertices
GraphFrames:
val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
In addition to @Marsellus Wallace's answer, below is the full code to get disjoint sets from an RDD of edges using GraphX.
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
val edges: RDD[(Long, Long)] = ???
val g = Graph.fromEdgeTuples(edges, -1L)
val disjointSets: RDD[Iterable[Long]] = g.connectedComponents()
  // Get tuples of (vertexId, componentId); the component ID is the lowest vertex ID in the component
  .vertices
  // Group by component ID so that each group is one disjoint set
  .groupBy(_._2)
  .values
  .map(_.map(_._1))
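To make the shape of the result concrete: GraphX documents connectedComponents as labelling every vertex with the lowest vertex ID in its component. The sketch below imitates that minimum-label propagation on a single machine in plain Python, using the string ids from the question. It only illustrates the idea and the final grouping step; it is not how Spark distributes the work, and the function name and the extra id7/id8 edge are made up for the example.

```python
def connected_components(edges):
    """Group vertices by component using minimum-label propagation."""
    # Every vertex starts out labelled with its own id
    label = {}
    for u, v in edges:
        label.setdefault(u, u)
        label.setdefault(v, v)
    # Push the smaller label across each edge until nothing changes
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    # Group vertices by their final label (the smallest id in the component)
    groups = {}
    for vertex, component in label.items():
        groups.setdefault(component, set()).add(vertex)
    return groups


edges = [("id1", "id2"), ("id3", "id4"), ("id1", "id3"), ("id7", "id8")]
groups = connected_components(edges)
print(groups)
```

Here the first three edges collapse into one set labelled id1, and the last edge forms a second set labelled id7.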

How to store a Euler graph struct?

I'm working on the Euler path problem and ran into a question: how should an Euler graph structure be defined or stored?
A usual way is to use an adjacency matrix, where c[i][j] stores the edge between i and j. It's concise and effective! But this kind of matrix is limited to the situation where the edge between two nodes is unique (figure 1).
class EulerPath
{
    int[][] c; // adjacency matrix, c[i][j] means the edge between i and j
}
What if there are several edges (figure 2)? My solution might be to use customized classes, like "Graph", "Node" and "Edge", to store the graph. But dividing the graph into discrete structs means we have to take more class details into consideration, which may hurt efficiency and concision. So I'm very eager to hear your advice! Thanks a lot!
class EulerPath
{
    class Graph
    {
        Node[] Nodes;
        Edge[] Edges;
    }
    class Node{...}
    class Edge{...}
}
You can use an adjacency matrix to store graphs with multi-edges. You just let the value of c[i][j] be the number of times that vertex i is adjacent to vertex j. In your first case, it's 1, in your second case, it's 3. See also Wikipedia -- adjacency matrices aren't defined as being composed of only 1 and 0, that's just the special case of an adjacency matrix for a simple graph.
EDIT: You can represent your second graph in an adjacency matrix like this:
    1  2  3  4
1   0  3  1  1
2   3  0  1  1
3   1  1  0  0
4   1  1  0  0
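As a small sketch of the count-valued matrix, the code below rebuilds the matrix above from an edge list and uses the row sums as vertex degrees. The edge list is an assumption read off the matrix itself (the original figure is not available), and the helper name adjacency_matrix is hypothetical.

```python
def adjacency_matrix(n, edges):
    """Count-valued adjacency matrix of an undirected multigraph on vertices 1..n.

    Assumes there are no self-loops.
    """
    c = [[0] * n for _ in range(n)]
    for u, v in edges:
        c[u - 1][v - 1] += 1
        c[v - 1][u - 1] += 1
    return c


# Three parallel edges between 1 and 2, plus four single edges
edges = [(1, 2), (1, 2), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
c = adjacency_matrix(4, edges)

# With no self-loops, the degree of a vertex is the sum of its row
degrees = [sum(row) for row in c]
odd = [i + 1 for i, d in enumerate(degrees) if d % 2 == 1]
print(degrees, odd)  # [5, 5, 2, 2] [1, 2]
```

A connected graph has an Euler path exactly when it has zero or two vertices of odd degree, so for this example a path exists and must start and end at vertices 1 and 2.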
You can do this in at least three ways:
Adjacency list
You keep a 2D array al[N][N]: the first index is the node index, and the entries in that row are the indices of its neighbour nodes.
Example, a graph with this input:
0 => 1
1 => 2
2 => 3
3 => 1
The adjacency list (treating the edges as undirected) will look like this:
0 [1]
1 [0,2,3]
2 [1,3]
3 [1,2]
PS: Since this is a 2D array and not every cell in a row is used, you need to keep track of the number of neighbours of each node, because some programming languages initialise array values to zero, which is itself a valid node index. This is easily done with a second array that counts the neighbours of each node. For this example: numLinks: [1,3,2,2]
Matrix
With a matrix, you create an N x N 2D array and put a 1 at the intersection of the row and column of each pair of neighbouring nodes.
Example with the same input as above:
    0  1  2  3
0   0  1  0  0
1   1  0  1  1
2   0  1  0  1
3   0  1  1  0
Class Node
The last method is to create a class called Node that contains a dynamic array of type Node, in which you store the other nodes it is connected to.
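A short sketch of the first two representations, built from the same four undirected edges as the example input. It uses growable Python lists, so no separate neighbour-count array is needed; the helper names are made up for illustration.

```python
from collections import defaultdict


def adjacency_list(edges):
    """Map every node to the list of its neighbours (undirected edges)."""
    al = defaultdict(list)
    for u, v in edges:
        al[u].append(v)
        al[v].append(u)
    return dict(al)


def to_matrix(al, n):
    """Turn an adjacency list over nodes 0..n-1 into an n x n matrix."""
    return [[al.get(u, []).count(v) for v in range(n)] for u in range(n)]


edges = [(0, 1), (1, 2), (2, 3), (3, 1)]
al = adjacency_list(edges)
matrix = to_matrix(al, 4)
print(al)
print(matrix)
```

The matrix produced this way matches the 4 x 4 matrix shown above, and because to_matrix counts occurrences it would also record parallel edges as values larger than 1.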
Consider using a vector of linked lists. Add a class that has a field for a Vertex as well as for the Weight (let's name it Entry). The weights should preferably be stored in another vector or linked list (preferably a linked list) containing all the weights of the edges to the corresponding Vertex. Your main class will have a vector of vectors, or a vector of linked lists (I'd prefer linked lists, since you will most likely not need random access and will iterate through every Entry when performing any operation). Your main class will have one more vector containing all vertices. In C++ this would look like this:
class Graph{
std::vector<std::forward_list<Entry>> adj_list;
std::vector<Vertex> vertices;
};
Where the Vertex that corresponds to vertices[i] has the corresponding list in adj_list[i]. Since every Entry contains the info regarding the Vertex to which you are connected and the according weights, you will have your graph represented by this class.
Efficiency for what type of operation?
If you want to find a route between two IP addresses on the internet, then your adjacency matrix might be a million nodes squared, i.e. 10^12 entries. And as finding all the nodes connected to a given node goes up as n, you could be looking at a million lookups per node just to find the nodes connected to that node. Horribly inefficient.
If your problem only involves a few nodes and is run infrequently, then adjacency matrices are simple and intuitive.
For most problems which involve traversing graphs, a better solution could be to create a class called node, which has as a property a collection (say a List) of all the nodes it is connected to. For most real-world applications, the list of connected nodes is much smaller than the total number of nodes, so this works out as more compact. Plus it is highly efficient at finding edges: you can get a list of all connected nodes in fixed time per node.
If you use this structure, where you have a node class which contains as a property a collection of all the nodes it is connected to, then when you create a new edge (say between node A and node B) then you add B to the collection of nodes to which A is connected, and A to the collection of nodes to which B is connected. Excuse my Java/C#, something like
import java.util.ArrayList;
class Node {
    ArrayList<Node> connectedNodes;
    public Node() { // constructor
        connectedNodes = new ArrayList<Node>();
    }
}
// and somewhere else (inside some class) you have this definition:
public static void addEdgeBetween(Node firstNode, Node secondNode) {
    firstNode.connectedNodes.add(secondNode);
    secondNode.connectedNodes.add(firstNode);
}
And similarly, to delete an edge, remove B from A's collection and A from B's collection. There is no need to define a separate edge class; edges are implicit in the structure which cross-links the two nodes.
And that's about all you have to do to implement this structure, which (for most real-world problems) uses far less memory than an adjacency matrix, is much faster for large numbers of nodes for most problems, and is ultimately far more flexible.
Defining a node class also opens up a logical place to add enhancements of many sorts. For example, you might decide to generate for each node a list of all the nodes which are two steps away, because this improves path finding. You can easily add this as another collection within the node class; this would be a pretty messy thing to do with adjacency matrices. You can obviously squeeze a lot more functionality into a class than into a matrix of ints.
Your question concerning multiple links is unclear to me. If you want multiple edges between the same two points, then this can be accommodated in both ways of doing it. In adjacency matrices, simply have a number at that row and column which indicates the number of links. If you use a node class, just add each edge separately. Similarly directional graphs; an edge pointing from A to B has a reference to B in A's list of connected nodes, but B doesn't have A in its list.
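The last paragraph can be sketched as follows. This is a minimal Python illustration with hypothetical names (Node, add_edge, remove_edge): parallel edges are simply repeated entries in the neighbour list, and a directed edge is recorded on one side only.

```python
class Node:
    def __init__(self, name):
        self.name = name
        # One entry per edge, so duplicates represent parallel edges
        self.connected = []


def add_edge(a, b, directed=False):
    """Add an edge from a to b; also from b to a unless directed."""
    a.connected.append(b)
    if not directed:
        b.connected.append(a)


def remove_edge(a, b, directed=False):
    """Remove one edge between a and b (a single occurrence on each side)."""
    a.connected.remove(b)
    if not directed:
        b.connected.remove(a)


a, b, c = Node("a"), Node("b"), Node("c")
for _ in range(3):
    add_edge(a, b)              # three parallel undirected edges a-b
add_edge(a, c, directed=True)   # directed edge a -> c only

parallel_before = a.connected.count(b)
remove_edge(a, b)
parallel_after = a.connected.count(b)
print(parallel_before, parallel_after, len(c.connected))
```

Counting how often b appears in a.connected plays the same role as the number stored in the adjacency matrix cell.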

How to find the set of trees every one of which spans over another given tree?

Suppose we are given a set of trees ST in which every vertex of every tree is labeled, and another tree T whose vertices are also labeled. How can I find which trees of ST can span over T, starting from the root of T, in such a way that the labels of the vertices of the spanning tree T' coincide with the labels of the corresponding vertices of T? Note that the children of every vertex of T must be either completely covered or not covered at all; partial covering of children is not allowed. Stated in other words: take a tree and the following procedure: pick a vertex and remove all vertices and edges below it (keeping the vertex itself). Find those trees of ST that can be generated by applying this procedure to T some number of times.
For example, given the tree T [figure], the trees [figure] cover T, and the tree [figure] does not, because this tree has children 3, 5, unlike T, which has 2, 3 as children.
The best thing I was able to think of was either to brute-force it, or to first find the set of trees that have the same root label as T and then search for the answer among those trees, but I guess neither of those two approaches is optimal. I was thinking of somehow hashing the trees, but nothing came of it. Any thoughts?
Notes:
The trees are not necessarily binary
A tree T can cover another tree T' if they share a root
The tree is ordered meaning that you cannot swap the position of any two children.
TL;DR: Find an efficient algorithm which, when queried with a given tree T, finds all trees from a given (fixed/static) set ST which are able to cover T.
I'll sketch an answer and then provide some working source code.
First off, you need an algorithm to hash a tree. We can assume, without loss of generality, that the children of each of your tree's nodes are ordered from least to greatest (or vice versa).
Run this algorithm on every member of ST and save the hashes.
Now, take your test tree T and generate all of its subtrees TP that retain the original root. You can do this (perhaps inefficiently) by:
1. Making a set S of its nodes
2. Generating the power set P of S
3. Generating the subtrees by removing the nodes present in each member of P from copies of T
4. Adding those subtrees which retain the original root to TP.
Now generate the set of all of the hashes of TP.
Now check each of your ST hashes for membership in that set of TP hashes.
ST hash storage requires O(n) space in the size of ST, and possibly the space to hold the trees.
You can optimize the membership code so that it requires no storage space (I have not done this in my test code). The code will require approximately 2**N checks, where N is the number of nodes in T.
So the algorithm runs in O(H 2**N), where H is the size of ST and N is the number of nodes in T. The best way of speeding this up is to find an improved algorithm for generating the subtrees of T.
The following Python code accomplishes this:
#!/usr/bin/env python3
# Ported to Python 3; hashlib.sha1 from the standard library replaces
# PyCrypto's Crypto.Hash.SHA (both compute SHA-1).
import copy
import hashlib
import itertools
import treelib
# Generate a hash of a tree by recursively hashing children
def HashTree(tree):
    digester = hashlib.sha1()
    digester.update(str(tree.get_node(tree.root).tag).encode())
    # Sort a copy of the child list by tag so the tree itself is not modified
    children = sorted(tree.children(tree.root), key=lambda node: node.tag)
    if children:
        for child in children:
            digester.update(HashTree(tree.subtree(child.identifier)).encode())
        return "1" + digester.hexdigest()
    return "0" + digester.hexdigest()
# Generate a power set of a set
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s) + 1))
# Generate all the subsets of a tree which still share the original root
# by using a power set of all the tree's nodes to remove nodes from the tree
def TreePowerSet(tree):
    nodes = [x.identifier for x in tree.nodes.values()]
    ret = []
    for s in powerset(nodes):
        culled_tree = copy.deepcopy(tree)
        for n in s:
            try:
                culled_tree.remove_node(n)
            except Exception:
                # n was already removed together with one of its ancestors
                pass
        # Removing the root empties the tree; keep only non-empty results
        if len(culled_tree.nodes) > 0:
            ret.append(culled_tree)
    return ret
def main():
    ST = []
    # Generate a member of ST
    treeA = treelib.Tree()
    treeA.create_node(1, 1)
    treeA.create_node(2, 2, parent=1)
    treeA.create_node(3, 3, parent=1)
    ST.append(treeA)
    # Generate a member of ST
    treeB = treelib.Tree()
    treeB.create_node(1, 1)
    treeB.create_node(2, 2, parent=1)
    treeB.create_node(3, 3, parent=1)
    treeB.create_node(4, 4, parent=2)
    treeB.create_node(5, 5, parent=2)
    ST.append(treeB)
    # Generate hashes for members of ST
    hashes = [(HashTree(tree), tree) for tree in ST]
    print(hashes)
    # Generate a test tree
    T = treelib.Tree()
    T.create_node(1, 1)
    T.create_node(2, 2, parent=1)
    T.create_node(3, 3, parent=1)
    T.create_node(4, 4, parent=2)
    T.create_node(5, 5, parent=2)
    T.create_node(6, 6, parent=3)
    T.create_node(7, 7, parent=3)
    # Generate all the subtrees of this tree which still retain the original root
    Tsets = TreePowerSet(T)
    # Hash all of the subtrees
    Thashes = set(HashTree(x) for x in Tsets)
    # For each member of ST, check to see if that member is present in the test
    # tree
    for tree_hash, tree in hashes:
        if tree_hash in Thashes:
            print(list(tree.expand_tree()))
if __name__ == '__main__':
    main()
To verify that one tree covers another, one must look at all vertices of the first tree at least once. It is trivial to verify that a tree covers another by looking at all vertices of the first tree exactly once. Thus the simplest possible algorithm is already optimal, if it's only needed to check one tree.
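A minimal sketch of that direct check, assuming each tree is written as a nested (label, children) pair with ordered children. This representation and the function name covers are choices made for the example, not something the question prescribes.

```python
def covers(s, t):
    """Return True if tree s can be obtained from tree t by pruning.

    Each tree is a (label, children) pair; children is an ordered list of
    trees. Every vertex of s is visited at most once.
    """
    s_label, s_children = s
    t_label, t_children = t
    if s_label != t_label:
        return False
    if not s_children:
        return True  # everything below this vertex was pruned away
    # Children are all-or-nothing and ordered, so compare them pairwise
    if len(s_children) != len(t_children):
        return False
    return all(covers(sc, tc) for sc, tc in zip(s_children, t_children))


T = (1, [(2, [(4, []), (5, [])]), (3, [(6, []), (7, [])])])
tree_a = (1, [(2, []), (3, [])])
tree_b = (1, [(2, [(4, []), (5, [])]), (3, [])])
tree_c = (1, [(3, []), (5, [])])   # root has children 3, 5 instead of 2, 3

print(covers(tree_a, T), covers(tree_b, T), covers(tree_c, T))  # True True False
```

Running covers once per member of ST gives the full answer in time proportional to the total number of vertices in ST.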
Everything below is the untested fruit of my sick imagination.
If there are many possible T that must be checked against the same ST, then it's possible to store trees of ST as sets of facts like these
root = 1
children of node 1 = (2, 3)
children of node 2 = ()
children of node 3 = ()
These facts can be stored in a standard relational DB in two tables, "roots" (fields "tree" and "rootnode") and "branches" (fields "tree", "node" and "children"). An SQL query or a series of queries can then be built to find matching trees quickly. My SQL-fu is rudimentary, so I could not manage it in a single query, but I believe it should be possible.
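One possible shape for such a query, sketched with Python's built-in sqlite3 module. Several things here are assumptions made for the example: node labels are unique within a tree, the children of a node are stored as a comma-separated string, the query tree T is loaded into the same two tables under the name "T", and the helper store_tree is hypothetical. A stored tree matches when its root equals T's root and every node of it that has children has exactly the same children in T.

```python
import sqlite3


def store_tree(conn, name, root, children_of):
    """Store one tree as a root fact plus one branch fact per node."""
    conn.execute("INSERT INTO roots VALUES (?, ?)", (name, root))
    for node, kids in children_of.items():
        conn.execute("INSERT INTO branches VALUES (?, ?, ?)",
                     (name, node, ",".join(map(str, kids))))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roots (tree TEXT, rootnode INTEGER)")
conn.execute("CREATE TABLE branches (tree TEXT, node INTEGER, children TEXT)")

store_tree(conn, "A", 1, {1: [2, 3], 2: [], 3: []})
store_tree(conn, "B", 1, {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []})
store_tree(conn, "C", 1, {1: [3, 5], 3: [], 5: []})
store_tree(conn, "T", 1, {1: [2, 3], 2: [4, 5], 3: [6, 7],
                          4: [], 5: [], 6: [], 7: []})

# Trees with T's root and no non-leaf branch fact that is missing from T
QUERY = """
SELECT r.tree FROM roots r, roots q
WHERE q.tree = :t AND r.tree <> :t AND r.rootnode = q.rootnode
  AND NOT EXISTS (
    SELECT 1 FROM branches b
    WHERE b.tree = r.tree AND b.children <> ''
      AND NOT EXISTS (
        SELECT 1 FROM branches tb
        WHERE tb.tree = :t AND tb.node = b.node
          AND tb.children = b.children))
ORDER BY r.tree
"""
matches = [row[0] for row in conn.execute(QUERY, {"t": "T"})]
print(matches)  # ['A', 'B']
```

Tree C is rejected because its root has children 3, 5, which is not a fact of T. With an index on branches(tree, node) the inner lookup stays cheap as ST grows.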

Resources