Accessing Edges without iterating Vertices

Accessing Edges without iterating Vertices - performance

I´m currently working with janusGraph in an hadoop Environment.
I´ve already loaded a bigger amount of Vertices into the graph (about half a million) and got an index for the primary Key running.
Iterating every vertex takes about 3 minutes.
I´ve currently 0 edges in my graph.
For the loading of my graph-edges i´m reading out a csv-file which contains the data.
As i´m sometimes facing Timeouts (because of the environment) i´ve been first looking for the count of vertices and then skip to the correct row in the csv, to restart the loading.
However, asking for the count of edges to do the same with my edge-csv-file takes about 4 minutes and produces a timeout for my tinkerpopserver.
Is there a way to get the total count of edges in the graph without iterating every single vertex?
Adding the edges itself works fine, as the composite index for the vertices is quite fast.

Given the way that edges are stored for JanusGraph g.E() will basically iterate all vertices to get the edges so there's not much you can do to get a count. It is worth noting that iterating edges is a graph specific issue, so it is possible that other graphs may behave differently. For example, TinkerGraph handles counting with a strategy that bypasses iteration completely.

graph.traversal().E().count() should work.

Related

Fixing Karger's min cut algorithm with union-find data structure

I was trying to implement Karger's min cut algorithm in the same way it is explained here but I don't like the fact that at each step of the while loop we can pick an edge with it's two endpoints already in a supernode. More specifically, this part
// If two corners belong to same subset,
// then no point considering this edge
if (subset1 == subset2)
continue;
Is there a quick fix for avoiding this problem?

It might help to back up and think about why there’s a union-find structure here at all and why it’s worth improving on the continue statement.
Conceptually, each contraction performed changes the graph in the following way:
The nodes contracted get replaced with a single node.
The edges incident to either node get replaced with an edge to the new joint node.
The edges running between the two earlier nodes get removed.
The question, then, is how to actually do this to the graph. The code you’ve found does this lazily. It doesn’t actually change the graph when the contraction is done. Instead, it uses the union-find structure to show which nodes are now equivalent to one another. When it samples a random edge, it then has to check whether that edge is one of the ones that would have been deleted in step (3). If so, it skips it and moves on. This has the effect that early contractions are really fast (the likelihood of picking two edges that are part of contracted nodes is very low when few edges are contracted), but later contractions might be a lot slower (once edges have started being contracted, lots of edges may have been deleted).
Here’s a simple modification you can use to speed this step up. Whenever you pick an edge to contract and find that its endpoints are already connected, discard that edge, and remove it from the list of edges so that it never gets picked again. You can do this by swapping that edge to the end of the list of edges, then removing the last element of the list. This has the effect that every edge processed will never be seen again, so across all iterations of the algorithm every edge will be processed at most once. That gives a runtime of one randomized contraction phase as O(m + nα(n)), where m is the number of edges and n is the number of nodes. The factor of α(n) comes from the use of the union-find structure.
If you truly want to remove all semblance of that continue statement, an alternative approach would be to directly simulate the contraction. After each contraction, iterate over all m edges and adjust each one by seeing whether it needs to remain unchanged, point to the new contracted node, or be removed altogether. This will take time O(m) per contraction for a net cost of O(mn) for the overall min cut calculation.
There are ways to speed things up beyond this. Karger’s original paper suggests generating a random permutation of the edges and using binary search over that array with a clever use of BFS or DFS to find the cut produced in time O(m), which is slightly faster than the O(m + nα(n)) approach for large graphs. The basic idea is the following:
Probe the middle element of the list of edges.
Run a BFS on the graph formed by only using those edges and see if there are exactly two connected components.
If so, great! Those two CCs are the ones you want.
If there is only one CC, discard the back half of the array of edges and try again.
If there is more than one CC, contract each CC into a single node and update a global table indicating which CC each node belongs to. Then discard the first half of the array and try again.
The cost of each BFS is O(m), where m is the number of edges in the graph, and this gives the recurrence T(m) = T(m/2) + O(m) because at each stage we’re throwing away half of the edges. That solves to O(m) total time, though as you can see, it’s a much trickier way of coding this algorithm up!
To summarize:
With a very small modification to the provided code, you can keep the continue statement in but still have a very fast implementation of the randomized contraction algorithm.
To eliminate that continue without sacrificing the runtime of the algorithm, you need to do some major surgery and change approaches to something only marginally asymptotically faster than keeping the continue in.
Hope this helps!

Marking special nodes that traverses a certain length in a graph to fill up every node

So I am working on a problem where you have one graph, where all edges has a certain weight to it. Now the algorithm is supposed to select certain node(s) in the graph and the selected nodes must be able to traverse/span through all other nodes within a certain total weight. The output should be the minimum number of selected nodes and the position of the selected nodes (it doesn't matter which position it outputs as long as it is the minimum number of selected nodes)
I have thought of a couple simple solutions but neither of them seem too good so far.
Brute force by trying out every node. The algorithm will start with one selected node and then try every combination of selected nodes, which then loops through every other non-selected nodes and check if they are all within range. If not it increases one extra selected node and repeats the same process.
The algorithm makes a list of subgraph that can be traversed with 1 selected node at every node position. Then attempt to puzzle fit the subgraphs so that it re-creates the original graph and if it succeeds that would be the solution.
As an example, here's a picture of a grid as the graph.
The weights of every edge in this grid is 1, and each selected node can travel through a total weights of 2.
I'm not too familiar with graph problems so if there is a similar question out there or if anyone can provide any help with the solution that would be great!

First, you can get rid of the edge weights by simply making an edge between two nodes if they are within the distance limit. Then you need to find a subset of the nodes such that each node is either selected or neighbor of a selected node. This is known as the Dominating Set problem. Unfortunately, it is NP-hard, so typically it is solved heuristically or using Integer Linear Programming. It might be possible to take advantage of certain properties of the input, but it is hard to tell without knowing more about what they look like.

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes and then set2 comes in later and we apply business logic to create edges from set1 to set2(and disgard any Vertices from set1 that do not have edges to set2). Then at a later point, we get set3, set4, and so on and the same process applies between each set and its previous set.
Question, what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx,etc.. The problem I faced was when I was trying to run analytics between the current set and the previous set I would have to run a loop through the entire graph and look for all the nodes that started with 'setx'. It took a long time as the graph grew, so I thought of another solution which was to create a node called 'set1' and have it connected to all nodes for that particular set. I am testing it but I was wondering if there way a more efficient way or a build in way of handling data structures like this? Is there a way to somehow segment data like this?
I think a general solution would be application but if it helps I'm using neo4j(so any specific solution to that database would be good as well).

You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.

Your second solution is what I would do- create a setX node and connect all nodes belonging to that set to setX. That way your data is partitioned and it is easier to query.

Find connected-blocks with certain value in a grid

I'm having trouble finding an algorithm for my problem.
I have a grid of 8x8 blocks, each block has a value ranging from 0 to 9. And I want to find collections of connected blocks that match a total value of for example 15. My first approach was to start of at the border, that worked fine. But when starting in the middle of the grid my algorithm gets lost.
Would anyone know a simple algorithm to use or can you point me in the right direction?
Thanks!

As far as I know, no simple algorithm exists for this. As for pointing you in the right direction, an 8x8 grid is really just a special case of a graph, so I'd start with graph traversal algorithms. I find that in cases like this, it sometimes helps to think how you would solve the problem for a smaller grid (say, 3x3 or 4x4) and then see if your algorithm scales up to "full size."
EDIT :
My proposed algorithm is a modified depth-first traversal. To use it, you'll have to convert your grid into a graph. The graph should be undirected, since connected blocks are connected equally in both directions.
Each graph node represents a single block, containing the block's value and a visited variable. Edge weights represent their edges' resistance to being followed. Set them by summing the values of the nodes they connect. Depending on the sum you're looking for, you may be able to optimize this by removing edges that are guaranteed to fail. For example, if you're looking for 15, you can delete all edges with weight of 16 or greater.
The rest of the algorithm will be performed as many times as there are blocks, with each block serving as the starting block once. Traverse the graph by following the lowest-weighted edge from the current node, unless that takes you to a visited node. Push each visited node onto a stack and set its visited variable to true. Keep a running sum for every path followed.
Whenever the desired sum is reached, save the current path as one of your answers. Do not stop traversal, because the current node could be connected to a zero.
Whenever the total exceeds the desired sum, backtrack by setting visited to false and popping the current node off the stack.
Whenever all edges for a given node have been explored, backtrack.
After every possible path from a given starting node is analyzed, every answer that includes that node has been found. So, remove all edges touching the starting node and choose a new starting node.
I haven't fully analyzed the efficiency/running time of this algorithm yet, but... it's not good. (Consider the number of paths to be searched in a graph containing all zeroes.) That said, it's far better than pure brute force.

An algorithm to check if a vertex is reachable

Is there an algorithm that can check, in a directed graph, if a vertex, let's say V2, is reachable from a vertex V1, without traversing all the vertices?

You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the data in the graph to produce a matrix with a bit at the intersection of any V1 and V2 to indicate whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it can avoid traversing the graph at the time of the query. I.e., it's basically a greedy algorithm that assumes you're going to eventually use enough of the combinations that it's easiest to just traverse them all and store the result. Depending on the size of the graph, the pre-processing step may be slow, but once it's done executing a query becomes quite fast (constant time, and usually a pretty small constant at that).

Depth first search or breadth first search. Stop when you find one. But there's no way to tell there's none without going through every one, no. You can improve the performance sometimes with some heuristics, like if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map, and most of the time you know that there's going to be a mostly direct path, then you can attempt to have the depth-first search look along lines that "aim towards the target". However, imagine the case where the start and end points are right next to each other, but with no vector inbetween, and to find it, you have to go way out of the way. You have to check every case in order to be exhaustive.

I doubt it has a name, but a breadth-first search might go like this:
Add V1 to a queue of nodes to be visited
While there are nodes in the queue:
If the node is V2, return true
Mark the node as visited
For every node at the end of an outgoing edge which is not yet visited:
Add this node to the queue
End for
End while
Return false

Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)

In principle, you can't determine that a path exists without traversing some part of the graph, because the failure case (a path does not exist) cannot be determined without traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.

You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's Algorithm which creates an O(nlogn) space data structure that takes O(nlogn) time to build and answers queries in O(1) time.

Another approach to this problem would allow you to ignore all of the vertices. If you were to only look at the edges, you can produce a transitive closure array that will show you each vertex that is reachable from any other vertex.
Start with your list of edges:
Va -> Vc
Va -> Vd
....
Create an array with start location as the rows and end location as the columns. Fill the arrays with 0. For each edge in the list of edges, place a one in the start,end coordinate of the edge.
Now you iterate a few times until either V1,V2 is 1 or there are no changes.
For each row:
NextRowN = RowN
For each column that is true for RowN
Use boolean OR to OR in the results of that row of that number with the current NextRowN.
Set RowN to NextRowN
If you run this algorithm until the end, you will quickly have a complete list of all reachable vertices without looking at any of them. The runtime is proportional to the number of edges. This would work well with a reasonable implementation and a reasonable number of edges.
A slightly more complex version of this algorithm would be to only calculate the vertices reachable by V1. To do this, you would focus your scope on the ones that are currently reachable at any given time. You can also limit adding rows to only one time, since the other rows are never changing.

In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth first or breadth first search that stops when it encounters a vertex that it has already seen. The vertex will be processed on the first occurrence only. You need to make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio