Example code for Spark GraphX squaring the adjacency matrix

I have a bipartite graph that I would like to project onto two new graphs (e.g. using the network of actors and movies from the Internet Movie Database, projecting out the actor network and the movie network, with weights corresponding to the number of movies (actors) in common).
In regular matrix notation, one simply needs to square the adjacency matrix and ignore the diagonal (which equals each vertex's degree in the original graph). Presumably there are quicker edge-based algorithms for large sparse graphs, which is why I became interested in using Spark / GraphX with Hadoop and MapReduce frameworks.
So, how do I do this A^2 calculation in Spark GraphX?
Say I begin with this code:
import org.apache.spark.graphx._
val orig_graph = GraphLoader.edgeListFile(sc, "bipartite_network.dat")
All of the functions I see are joins on vertices, joins on edges, or maps for vertex attributes or edge attributes. How do I loop over all edges doubly and create a new graph with EDGES that are based on vertex ids?
Here's some pseudocode:
for i = 1 to orig_graph.edges.count
    for j = i to orig_graph.edges.count
        var edge1 = orig_graph.edges.[i]
        var edge2 = orig_graph.edges.[j]
        if edge1.1 == edge2.1 then add new edge = (edge1.2, edge2.2)
        if edge1.1 == edge2.2 then add new edge = (edge1.2, edge2.1)
        if edge1.2 == edge2.1 then add new edge = (edge1.1, edge2.2)
        if edge1.2 == edge2.2 then add new edge = (edge1.1, edge2.1)
something like that.
Do I need to use Pregel and message passing, or just use the various graph.join functions?
See also http://apache-spark-developers-list.1001551.n3.nabble.com/GraphX-adjacency-matrix-example-td6579.html
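A minimal sketch of the edge-based idea, in plain Python rather than GraphX, so none of this is Spark API. Instead of looping over all pairs of edges, it groups edges by the shared vertex and emits one weighted edge per pair of neighbours, which is what the off-diagonal entries of A^2 count. The function name `project` and the sample data are made up for illustration. The same grouping step is the kind of thing a join of the edge list with itself, keyed on the shared vertex, would do in a distributed setting.

```python
from collections import defaultdict
from itertools import combinations


def project(edges):
    """Project a bipartite edge list onto its left-hand vertex set.

    edges: iterable of (left, right) pairs.
    Returns {(a, b): weight} with a < b, where weight is the number of
    right-hand vertices that a and b have in common.
    """
    # Group the left-hand vertices by the right-hand vertex they touch.
    by_right = defaultdict(set)
    for left, right in edges:
        by_right[right].add(left)

    # Every pair of left-hand vertices sharing a right-hand vertex
    # contributes 1 to the weight of the projected edge.
    weights = defaultdict(int)
    for lefts in by_right.values():
        for a, b in combinations(sorted(lefts), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical sample data: (actor, movie) pairs.
edges = [("alice", "m1"), ("bob", "m1"),
         ("alice", "m2"), ("bob", "m2"), ("carol", "m2")]

actor_net = project(edges)
# Swap the pair order to project onto the movies instead.
movie_net = project((movie, actor) for actor, movie in edges)
```

Self-pairs are never emitted, so the diagonal of A^2 is skipped automatically.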

Related

Group divided polygon into N contiguous shapes

Given the following polygon, which is divided into sub-polygons as depicted below [left], I would like to create n number of contiguous, equally sized groups of sub-polygons [right, where n=6]. There is no regular pattern to the sub-polygons, though they are guaranteed to be contiguous and without holes.
This is not splitting a polygon into equal shapes, it is grouping its sub-polygons into equal, contiguous groups. The initial polygon may not have a number of sub-polygons divisible by n, and in these cases non-equally sized groups are ok. The only data I have is n, the number of groups to create, and the coordinates of the sub-polygons and their outer shell (generated through a clipping library).
My current algorithm is as follows:
list sub_polygons[] # list of polygon objects
for i in range(n - 1):
    # start a new grouping
    pick random sub_polygon from list as a starting point
    remove this sub_polygon from the list
    add this sub_polygon to the current group
    while (number of shapes in group < number needed to be added):
        add a sub_polygon that the group borders to the group
        remove this sub-polygon from the sub-polygons list
add all remaining sub-shapes to the final group
This runs into problems with contiguity, however. The below illustrates the problem - if the red polygon is added to the blue group, it cuts off the green polygon such that it cannot be added to anything else to create a contiguous group.
It's simple to add a check for this when adding a sub-polygon to a group, such as
if removing sub-polygon from list will create non-contiguous union
    pass;
but this runs into edge conditions where every possible shape that can be added creates a non-contiguous union of the available sub-polygons. In the below, my current algorithm is trying to add a sub-polygon to the red group, and with the check for contiguity is unable to add any:
Is there a better algorithm for grouping the sub-polygons?
I think this is too complicated to solve in a single pass. Whatever criterion is used to select the next polygon, the algorithm may get stuck somewhere in the middle. So you need an algorithm that can go back and change previous decisions in such cases. The classic algorithm that does so is backtracking.
But before starting, let's change the representation of the problem. These polygons form a graph like this:
This is the pseudocode of the algorithm:
function [ selected, stop ] = BackTrack(G, G2, selected, lastGroupLen, groupSize)
    if (length(selected) == length(G.Node))
        stop = true;
        return;
    end
    stop = false;
    if (lastGroupLen==groupSize)
        // start a new group
        lastGroupLen=0;
    end;
    // check continuity of remaining part of graph
    if (discomp(G2) > length(selected))
        return;
    end
    if (lastGroupLen==0)
        available = G.Nodes-selected;
    else
        available = []
        // find all nodes connected to current group
        for each node in last lastGroupLen selected nodes
            available = union(available, neighbors(G, node));
        end
        available = available-selected;
    end
    if (length(available)==0)
        return;
    end
    lastSelected = selected;
    for each node in available
        [selected, stop] = BackTrack(G, removeEdgesTo(G2, node),
            Union(lastSelected, node), lastGroupLen+1, groupSize);
        if (stop)
            break;
        end
    end
end
where:
selected: an ordered set of nodes that can be divided into n consecutive groups
stop: becomes true when the solution was found
G: the initial graph
G2: what remains of the graph after removing all edges to last selected node
lastGroupLen: number of nodes selected for last group
groupSize: maximum allowable size of each group
discomp(): returns number of discontinuous components of the graph
removeEdgesTo(): removes all edges connected to a node
That should be called like:
[ selected, stop ] = BackTrack( G, G, [], 0, groupSize);
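For comparison, here is a rough Python sketch of the same backtracking idea. It is an interpretation, not a line-by-line port: the names `components` and `group_nodes` are made up, the graph is assumed to be a dict mapping each node to its neighbours, and the pruning step is written as a direct check that the not-yet-selected nodes still form one connected piece.

```python
def components(adj, nodes):
    """Count connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count


def group_nodes(adj, group_size):
    """Return an ordering of all nodes in which every consecutive chunk
    of `group_size` nodes is contiguous, or None if the search fails."""
    all_nodes = set(adj)

    def backtrack(selected, last_len):
        if len(selected) == len(all_nodes):
            return selected
        if last_len == group_size:
            last_len = 0  # start a new group
        remaining = all_nodes - set(selected)
        # Prune: the unselected part must stay in one piece.
        if components(adj, remaining) > 1:
            return None
        if last_len == 0:
            candidates = remaining
        else:
            # Only nodes bordering the current group may join it.
            current = selected[-last_len:]
            candidates = {v for u in current for v in adj[u]} & remaining
        for node in sorted(candidates):
            result = backtrack(selected + [node], last_len + 1)
            if result is not None:
                return result
        return None  # dead end, caller tries its next candidate

    return backtrack([], 0)


# Hypothetical example: a 2x3 grid of sub-polygons numbered row by row.
grid = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
        3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
order = group_nodes(grid, 3)
groups = [order[i:i + 3] for i in range(0, len(order), 3)]
```

Candidates are tried in sorted order here. Swapping in an ordering by centroid or by degree only changes the `sorted(candidates)` line.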
I hope that is clear enough. It goes like this:
Just keep in mind the performance of this algorithm can be severely affected by order of nodes. One solution to speed it up is to order polygons by their centroids:
But there is another option if, like me, you are not satisfied with this outcome. You can order the available set of nodes by their degrees in G2, so that in each step the nodes that are less likely to disconnect the graph are visited first:
As a more complicated problem, I tested a map of Iran, which has 262 counties, with groupSize set to 20:
I think you can just follow the procedure:
1. Take some contiguous group of sub-polygons lying on the perimeter of the current polygon (if the number of polygons on the perimeter is less than the target size of the group, just take all of them and take whatever more you need from the next perimeter, and repeat until you reach your target group size).
2. Remove this group and consider the new polygon that consists of the remaining sub-polygons.
3. Repeat until the remaining polygon is empty.
Implementation is up to you but this method should ensure that all formed groups are contiguous and that the remaining polygon formed at step 2 is contiguous.
EDIT:
Never mind, user58697 raises a good point, a counterexample to the algorithm above would be a polygon in the shape of an 8, where one sub-polygon bridges two other polygons.

divide and conquer algorithm for finding a 3-colored triangle in an undirected graph with the following properties?

In an undirected graph G=(V,E) the vertices are colored either red, yellow or green. Furthermore, there exists a way to partition the graph into two subsets so that |V1|=|V2| or |V1|=|V2|+1, where the following condition applies: either every vertex of V1 is connected to every vertex of V2, or no vertex of V1 is connected to a vertex of V2. This applies recursively to all induced subgraphs of V1 and V2.
I can find all triangles in the graph by cubing the adjacency matrix and then looking up the nodes corresponding to the nonzero entries of the main diagonal. Then I can check whether the nodes of each triangle are colored the right way. That takes about O(n^2.8) time. But given the special properties of the graph, I want to find the colored triangle using divide and conquer.
This is an example graph with the given properties. I need to find the bold triangle:
Blue boxes symbolize the partitions are fully connected, purple boxes mean no connection between the partitions
It can be done in O(E*V) without using the partition property.
Start by deleting all edges whose two endpoints have the same color; this can be done in O(E).
In the modified graph G', every triangle is a 3-colored triangle.
Finding the triangles in a graph:
for each edge e(u,v):
    for each vertex w:
        if e(v,w) and e(u,w) in G':
            add (u,v,w) to triangle list
If you keep adjacency list as well as adjacency matrix, you can improve the time of the inner loop by checking only w's in the adjacency list of v.
In that case the complexity is O(E * max(deg(v))).
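A small Python sketch of this approach, assuming the graph is given as an edge list plus a dict of vertex colours (the function name and input shape are my own choice). It drops same-coloured edges, then for each remaining edge intersects the neighbour sets of its two endpoints. With exactly three colours, every triangle that survives must use all three.

```python
from collections import defaultdict


def three_coloured_triangles(colour, edges):
    """Return the set of triangles (as frozensets of vertices) whose
    three vertices all have different colours.

    colour: dict vertex -> colour label (three distinct labels assumed).
    edges:  iterable of undirected (u, v) pairs.
    """
    # Build G': keep only edges whose endpoints differ in colour.
    adj = defaultdict(set)
    kept = []
    for u, v in edges:
        if colour[u] != colour[v]:
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))

    # For each kept edge, any common neighbour closes a triangle.
    triangles = set()
    for u, v in kept:
        for w in adj[u] & adj[v]:
            triangles.add(frozenset((u, v, w)))
    return triangles


# Hypothetical example: vertices 1 and 4 are both red.
colour = {1: "r", 2: "y", 3: "g", 4: "r"}
edges = [(1, 2), (2, 3), (1, 3), (1, 4), (3, 4), (2, 4)]
found = three_coloured_triangles(colour, edges)
```

Storing triangles as frozensets avoids reporting the same triangle once per edge.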
Problem Statement:
To find all the triangles in an undirected graph with vertices of different colours. (Red, Yellow and Green).
Assumptions:
There exists a way to partition the graph into two subsets so that |V1|=|V2| or |V1|=|V2|+1 where the following conditions apply: either every vertex of V1 is connected to every vertex of V2 or no Vertex of V1 is connected to a vertex of V2 . This applies recursively to all induced subgraphs of V1 and V2.
Logic:
We can break the graph into two subgraphs recursively and find a triangle formed between one vertex in V1 and other two in V2 or one vertex in V2 and other two in V1.
At each recursive call, we can partition the given graph into V1, V2 which will satisfy the above property (function partition already given). The recursion breaks when the size of either V1, V2 becomes zero or both become equal to 1. This function is called recursively for both V1 and V2. If there is no edge between V1 and V2, we need not consider this partition for our final triangle list; so we return from this call.
Now, for each edge inside V2, we add the edge to a globally declared map keyed by its colour pair (rg, gy or yr). Using this map, for each vertex in V1 we look up the edges whose colour pair complements the vertex's colour and add each resulting triple to the triangle list.
Pseudo Implementation
//let g be the given graph.
//let Vertex be the class representing each vertex (with attributes 'vertex_number' and 'colour')
//let Edge be the class representing edges (with attributes 'a' and 'b' for the two endpoints)
//let (v1,v2) = partition(g) be the given function which partitions the graph into v1, v2.
//let adjacency_list be the ArrayList<ArrayList<Vertex>> containing the adjacency list for the given vertices
//Main calling method
HashMap<String, List<Edge>> edge_list = new HashMap<String, List<Edge>>()
ArrayList<ArrayList<Vertex>> adjacency_list = new ArrayList<ArrayList<Vertex>>()
edge_list.put("rg", new ArrayList<Edge>())
edge_list.put("gy", new ArrayList<Edge>())
edge_list.put("yr", new ArrayList<Edge>())
ArrayList<ArrayList<Vertex>> triangle_list = new ArrayList<ArrayList<Vertex>>()
getColouredTriangles(g)
//Recursive implementation of the coloured triangle method
getColouredTriangles(g):
    (v1,v2) = partition(g)
    //If size is zero or both have size as 1 no triangles can be formed
    if v1.size() == 0 || v2.size() == 0 || (v1.size() == 1 && v2.size() == 1):
        return
    //Calling recursively for both v1 and v2
    getColouredTriangles(v1)
    getColouredTriangles(v2)
    //If there is no edge between the two subgraphs, return as no triangle is possible now between v1 and v2.
    if not edge(v1.get(0), v2.get(0)):
        return
    //call for one vertex in v1, two in v2
    getTrianglesInTwoGraphs(v1,v2)
    //call for one vertex in v2, two in v1
    getTrianglesInTwoGraphs(v2,v1)
//Method to get triangles between two graphs with one vertex in v1 and other two in v2.
getTrianglesInTwoGraphs(v1,v2):
    //Start from empty lists so edges from earlier calls do not leak in
    for key in edge_list.keySet():
        edge_list.get(key).clear()
    //Form edge_list having colour to Edge mapping
    for v in v2:
        for vertex in adjacency_list.get(v):
            if vertex in v2:
                String colour = v.colour + vertex.colour
                if(edge_list.get(colour) == null):
                    colour = vertex.colour + v.colour
                //same-coloured endpoints match no key and are skipped
                if(edge_list.get(colour) != null):
                    edge_list.get(colour).add(new Edge(v, vertex)) //Edge with a = v, b = vertex
    //for each v in v1, check other coloured edges from edge_list
    for v in v1:
        ArrayList<Edge> edges = new ArrayList<Edge>()
        if v.colour == "r":
            edges = edge_list.get("gy")
        else if v.colour == "g":
            edges = edge_list.get("yr")
        else:
            edges = edge_list.get("rg")
        for edge in edges:
            ArrayList<Vertex> vertices = new ArrayList<Vertex>()
            vertices.add(v)
            vertices.add(edge.a)
            vertices.add(edge.b)
            triangle_list.add(vertices)
Result:
The global variable triangle_list contains the vertex groups with coloured triangles.

Find symmetry distinct pairs on a Matrix

I have a square-planar lattice represented as an NxN grid Graph. Is there any way in Jung to get a symmetric pair of a specific vertex (given the axis of symmetry). Example: 8->0, 5->3.
My goal is to get distinct pairs of nodes, since pairs (4,1), (4,7), (4,3) and (4,5) are essentially the same; (1,3) would be the same as (3,7), etc. Perhaps some algorithm can be performed on a matrix and then translated to the Graph.
General graphs aren't really particularly well-suited to this sort of thing, because they don't have a built-in notion of rows, columns, symmetry about an axis, etc.; they're all about topology, not geometry.
If you really want something like this, you should either create a subtype of Graph that has the operations you want, and create an implementation to match, or just create the corresponding matrix (and a mapping from matrix locations to graph nodes) and do the operations on that matrix instead.
So far I was able to write an algorithm rotating a matrix 3 times and keeping track of nodes at fixed indices. The same can be written for any type of Graph, using the node's visual coordinates instead of indices.
fun rotateMatrix(matrix: List<IntArray>): List<IntArray> {/*---*/}
val reflections = mutableListOf<Pair<Number, Number>>()
(0..2).fold(mat) { a, b ->
    val new = rotateMatrix(a)
    mat.forEachIndexed { x, e ->
        e.forEachIndexed { y, e2 ->
            reflections.add(mat[x][y] to new[x][y])
        }
    }
    new
}
The result is a relationship describing that (0,2,8,6) are the "same"; (1,5,3,7) are the same etc. The only thing left to do is to use the output to determine which pairs of nodes correspond to which reflective siblings.
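As an illustration of the rotation idea, here is a short Python sketch (not the Kotlin above, and the helper names are invented). It rotates the grid of node ids three times and records, for every cell, which node ids land on that cell. The resulting sets are the groups of nodes that are equivalent under rotation.

```python
def rotate(matrix):
    """Rotate a square matrix 90 degrees clockwise."""
    return [list(row) for row in zip(*matrix[::-1])]


def rotation_orbits(matrix):
    """Group the entries of a square matrix into sets of entries that
    map onto each other under 90, 180 and 270 degree rotations."""
    n = len(matrix)
    # Each node starts in an orbit containing only itself.
    images = {matrix[r][c]: {matrix[r][c]}
              for r in range(n) for c in range(n)}
    current = matrix
    for _ in range(3):
        current = rotate(current)
        for r in range(n):
            for c in range(n):
                # The node now sitting at (r, c) is equivalent to the
                # node that sat there originally.
                images[matrix[r][c]].add(current[r][c])
    return {frozenset(s) for s in images.values()}


# The 3x3 lattice from the question, numbered row by row.
lattice = [[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]]
orbits = rotation_orbits(lattice)
```

Two pairs of nodes can then be compared by mapping each node to its orbit, though deciding whether two pairs are truly equivalent also needs the same rotation to be applied to both nodes of the pair.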

Optimizing the layout of a graph with given (erroneous) node-distances

I have a loosely connected graph. For every edge in this graph, I know the approximate distance d(v,w) between nodes v and w at positions p(v) and p(w) as a vector in R3, not only as a Euclidean distance. The error is small (let's say < 3%) and the first node is at <0,0,0>.
If there were no errors at all, I could calculate the node positions this way:
set p(first_node) = <0,0,0>
calculate_position(first_node)
calculate_position(v):
    for (v,w) in Edges:
        if p(w) is not set:
            set p(w) = p(v) + d(v,w)
            calculate_position(w)
    for (u,v) in Edges:
        if p(u) is not set:
            set p(u) = p(v) - d(u,v)
            calculate_position(u)
The errors of the distance are not equal. But to keep things simple, assume the relative error (d(v,w)-d'(v,w))/E(v,w) is N(0,1)-normal-distributed. I want to minimize the sum of the squared error
sum( ((p(v)-p(w)) - d(v,w) )^2/E(v,w)^2 ) for all edges
The graph may have a moderate number of nodes (> 100), but with just a few connections between the nodes, and it has been "prefiltered" (split into subgraphs wherever there is only one connection between them).
I have tried a simplistic "physical model" based on Hooke's law, but it's slow and unstable. Is there a better algorithm or heuristic for this kind of problem?
This looks like linear regression. Take error terms of the following form, i.e. without squares and split into separate coordinates:
(px(v) - px(w) - dx(v,w))/E(v,w)
(py(v) - py(w) - dy(v,w))/E(v,w)
(pz(v) - pz(w) - dz(v,w))/E(v,w)
If I understood you correctly, you are looking for values px(v), py(v) and pz(v) for all nodes v such that the sum of squares of the above terms is minimized.
You can do this by creating a matrix A and a vector b in the following way: every row corresponds to one equation of the above form, and every column of A corresponds to one variable, i.e. a single coordinate. For n vertices and m edges, the matrix A will have 3m rows (since you separate coordinates) and 3n−3 columns (since you also fix the first node px(0)=py(0)=pz(0)=0).
The row for (px(v) - px(w) - dx(v,w))/E(v,w) would have an entry 1/E(v,w) in the column for px(v) and an entry -1/E(v,w) in the column for px(w). All other columns would be zero. The corresponding entry in the vector b would be dx(v,w)/E(v,w).
Now solve the linear equation (A^T·A)x = A^T·b, where A^T denotes the transpose of A. The solution vector x will contain the coordinates for your vertices. You can break this into three independent problems, one for each coordinate direction, to keep the size of the linear equation system down.
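To make the construction concrete, here is a dependency-free Python sketch that builds A^T·A and A^T·b directly, one coordinate axis at a time, and solves them with plain Gaussian elimination. The function names and the edge tuple layout `(v, w, d, err)` are assumptions for illustration. It follows the sign convention of the error terms above, so `d` is taken to approximate p(v) − p(w); flip the sign of `d` if your data uses the opposite convention. Node 0 is fixed at the origin, and the graph is assumed connected so that the system has a unique solution.

```python
def solve(M, r):
    """Solve M x = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [row[:] + [r[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= f * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_positions(n, edges):
    """Weighted least-squares positions for nodes 0..n-1.

    edges: list of (v, w, d, err), where d is a 3-tuple approximating
    p(v) - p(w) and err is the error scale E(v,w) of that measurement.
    """
    positions = [[0.0, 0.0, 0.0] for _ in range(n)]
    for axis in range(3):
        # Normal equations for this axis; node 0 has no unknown.
        M = [[0.0] * (n - 1) for _ in range(n - 1)]
        r = [0.0] * (n - 1)
        for v, w, d, err in edges:
            k = 1.0 / err ** 2  # weight of this edge
            for node, sign in ((v, 1.0), (w, -1.0)):
                if node != 0:
                    M[node - 1][node - 1] += k
                    r[node - 1] += sign * k * d[axis]
            if v != 0 and w != 0:
                M[v - 1][w - 1] -= k
                M[w - 1][v - 1] -= k
        for i, value in enumerate(solve(M, r)):
            positions[i + 1][axis] = value
    return positions


# Hypothetical error-free example: p1 = (1,0,0), p2 = (1,2,0).
edges = [(1, 0, (1.0, 0.0, 0.0), 1.0),
         (2, 1, (0.0, 2.0, 0.0), 1.0),
         (2, 0, (1.0, 2.0, 0.0), 1.0)]
fitted = fit_positions(3, edges)
```

For a few hundred nodes a dense solve like this is fine; for larger graphs a sparse solver from a numerical library would be the more usual choice.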

Should vertex-order matter when trying to two-color a directed graph?

In The Algorithm Design Manual, the author provides an algorithm for two-coloring a graph. It's similar to the algorithm that counts the number of components, in that it iterates over all available vertices, and then colors and performs a BFS on that vertex only if it is not discovered:
for(i = 1; i <= (g->nvertices); i++) {
    if(discovered[i] == FALSE) {
        color[i] = WHITE;
        bfs(g, i);
    }
}
The BFS calls a process_edge function when y in the edge x -> y is not processed, or if the graph is directed. The BFS looks like this:
bfs(graph *g, int start) {
    queue q; //queue of vertices to visit
    int v; //current vertex
    int y; //successor vertex
    edgenode* p; //temporary pointer used to traverse adjacency list
    init_queue(&q);
    enqueue(&q, start);
    discovered[start] = TRUE;
    while(empty_queue(&q) == FALSE) {
        v = dequeue(&q);
        process_vertex_early(v); //function that handles early processing
        processed[v] = TRUE;
        p = g->edges[v];
        while(p != NULL) {
            y = p->y;
            //If the node hasn't already been processed, or if the graph is directed
            //process the edge. This part bothers me with undirected graphs
            //because how would you process an edge that was wrongly colored
            //in an earlier iteration when it will have processed[v] set to TRUE?
            if((processed[y] == FALSE) || g->directed) {
                process_edge(v, y); //coloring happens here
            }
            if(discovered[y] == FALSE) {
                enqueue(&q, y);
                discovered[y] = TRUE;
                parent[y] = v;
            }
            p = p->next;
        }
        process_vertex_late(v); //function that handles late processing
    }
}
The process_edge function looks like this:
process_edge(int x, int y) {
    if(color[x] == color[y]) {
        bipartite = FALSE;
        printf("Warning: not bipartite due to (%d, %d)\n", x, y);
    }
    color[y] = complement(color[x]);
}
Now assume we have a graph like this:
We can two-color it like this:
But if we are traversing it by vertex order, then we will initially start with node 1, and color it to WHITE. Then we will find node 13 and color it to BLACK. In the next iteration of the loop, we are looking at node 5, which is undiscovered and so we will color it WHITE and initiate a BFS on it. While doing this, we will discover a conflict between nodes 5 and 1 because 1 should be BLACK, but it was previously set to WHITE. We will then discover another conflict between 1 and 13, because 13 should be WHITE, but it was set to BLACK.
When performing a normal traversal of a graph through all components (connected or not), order does not matter, since we end up visiting all the nodes anyway. However, the order does seem to matter when coloring the graph. I didn't see a mention of this in the book and only ran across this issue when I was trying to two-color a randomly-generated graph like the one above. I was able to make a small change to the existing algorithm, which eliminated this problem:
for(i = 1; i <= (g->nvertices); i++) {
    //Only initiate a BFS on undiscovered vertices and vertices that don't
    //have parents.
    if(discovered[i] == FALSE && parent[i] == NULL) {
        color[i] = WHITE;
        bfs(g, i);
    }
}
Does this change make sense, or is it a hack due to my not understanding some fundamental concept?
UPDATE
Based on G. Bach's answer, assume we have the following graph:
I'm still confused as to how this would end up being two-colored properly. With the original algorithm, the first iteration will initiate a BFS with node 1 to give us a graph that is colored like this:
In the next iteration, we will initiate a BFS with node 5 to give us a graph that is colored like this:
The next iteration will initiate a BFS with node 6 to give us a graph that is colored like this:
But now we won't re-color 5 because we have already visited it and so this leaves us with a graph that hasn't been colored properly.
The directed nature of the graph has no bearing on the bipartite coloring problem you posed, unless you define a different problem where the direction would indeed begin to matter. So you can convert the graph you used in your example into an undirected graph and run the algorithm as given in the textbook.
While the textbook does not explicitly mention that the graph should be undirected, edge direction has no bearing on the common coloring problems we study. However, you could define problems which take into account edge directions (http://www.labri.fr/perso/sopena/pmwiki/index.php?n=TheOrientedColoringPage.TheOrientedColoringPage).
Note: I intended to write this as a comment, but as a newbie, I am not allowed to do so, until I accumulate a few reputation points.
Coloring a bipartite graph using BFS does not depend on vertex order. Call the two vertex sets that make up the partitions of the bipartite graph A and B; WLOG start at vertex a in A, color it WHITE; the first BFS will find the neighbors N(a), which will all be colored BLACK. For every v in N(a) with all v in B, this will then start a BFS (if I read this correctly), finding N(v) which is a subset of A again, coloring them all white and so on.
If you try to color a graph that is not bipartite using this, you will encounter a cycle of odd length (since having such a cycle as a subgraph is equivalent to not being bipartite) and the BFS will at some point encounter the start vertex of that odd cycle again and find that it already has a color, but not the one it wants to assign to it.
What I assume happens with your graph (since you do not include 5 in the BFS starting from 1) is that it is directed; I assume the algorithm you read was written for undirected graphs, because otherwise you do run into the problem you described. Also, coloring graphs is commonly defined on undirected graphs (can of course be transferred to directed ones).
Your fix will not solve the problem in general; add a new vertex 6 along with the edge 6->24 and you will run into the same problem. The BFS starting from 5 will want to color 1 black, the BFS starting from 6 will want to color it white. However, that graph would still be 2-colorable.
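To illustrate the point made in these answers, here is a minimal Python sketch (not the book's C code) that treats every edge as undirected before colouring. The function name and the 0-based vertex numbering are my own assumptions. Because each BFS then reaches its whole connected component, the order in which start vertices are picked no longer affects whether a valid two-colouring is found.

```python
from collections import deque


def two_colour(n, edges):
    """Two-colour vertices 0..n-1, ignoring edge direction.

    Returns a list of colours (0 or 1), or None if the underlying
    undirected graph contains an odd cycle and is not bipartite.
    """
    # Store each edge in both directions.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    colour = [None] * n
    for start in range(n):
        if colour[start] is not None:
            continue  # already reached by an earlier BFS
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if colour[v] is None:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # conflict: odd cycle
    return colour


# Hypothetical directed input where vertex 1 points at vertex 0,
# so a directed BFS from 0 alone would never reach 1.
path = two_colour(3, [(1, 0), (0, 2)])
odd_cycle = two_colour(3, [(0, 1), (1, 2), (2, 0)])
```

Colouring happens only when a vertex is first discovered, and an already coloured neighbour is only checked, never recoloured.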

Resources