Storing very large graphs on disk/streaming graph partitioning algorithms?

Storing very large graphs on disk/streaming graph partitioning algorithms? - performance

Suppose that I have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, ~10 edges per vertex), non-distributed and processed by single thread only and that I want to do breadth-first searches on it. I expect them to be I/O-bound, thus I need a good-for-BFS disk page layout, disk space is not an issue. The searches can start on every vertex with equal probability. Intuitively that means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem.
The graph itself looks like a spaghetti, think of random set of points randomly interconnected, with some bias towards shorter edges.
The problem is, how does one partition graph this large? The available graph partitioners I have found work with graphs that fit into memory only. I could not find any descriptions nor implementations of any streaming graph partitioning algorithms.
OR, maybe there is an alternative to partitioning graph for getting a disk layout that works well with BFS?
Right now as an approximation I use the fact that the vertices have spatial coordinates attached to them and put the vertices on disk in Hilbert sort order. This way spatially close vertices land on the same page, but the presence or absence of edge between them is completely ignored. Can I do better?
As an alternative, I can split graph into pieces using the Hilbert sort order for vertices, partition the subgraphs, stitch them back and accept poor partitioning on the seams.
Some things I have looked into already:
How to store a large directed unweighted graph with billions of nodes and vertices
http://neo4j.org/ - I found zero information on how does it do graph layout on disk
Partitioning implementations (unless I'm mistaken, all of them need to fit graph into memory):
http://glaros.dtc.umn.edu/gkhome/views/metis
http://www.sandia.gov/~bahendr/chaco.html
http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
http://www.cerfacs.fr/algor/Softs/MESHPART/
EDIT: info on how the graphs looks like and that BFS can start everywhere.
EDIT: idea on partitioning subgraphs

No algorithm truly needs to "fit into memory"--you can always page things in and out as needed. But you do want to avoid having the computation take unreasonably long--and global graph partitioning in the generic case is a NP-complete problem, which is "unreasonably long" for most problems that do not even fit in memory.
Fortunately, you want to do breadth-first searches, which means that you want a format where breadth-first is the easy computation. I don't know of any algorithms offhand that do this, but you can construct your own breadth-first layout if you're willing to allow a bit of extra disk space.
If the edges are not biased towards local interactions, then disentangling the graph will be difficult. If they are biased towards local interactions, then I suggest an algorithm like the following:
Pick a random set of vertices as starting points from throughout the entire data set.
For each vertex, collect all neighboring vertices (takes one sweep through the data set).
For each set of neighboring vertices collect the set of neighbors-of-neighbors and rank them according to how many edges connect to them. If you don't have space in a page to store them all, keep the most-connected vertices. If you do have space to save them all, you may wish to throw away the least useful ones (e.g. if the fraction of edges kept within a page / fraction of vertices needing storage ratio drops "too low"--where "too low" will depend on how much breadth your searches really need, and whether you can do any pruning and so on--then don't include those in the neighborhood.
Repeat the process of collecting and ranking neighbors until your neighborhood is full (e.g. fills some page size that suits you). Then check for repeats among the randomly chosen starts. If you have a small number of vertices appearing in both, remove them from one or the other, whichever breaks fewer edges. If you have a large number of vertices appearing in both, keep the neighborhood with the best (vertices in neighborhood/broken edge) ratio and throw the other away.
Now you have some local neighborhoods that are approximately locally optimal in that breadth-first searches tend to fall inside. If your breadth-first search prunes off unproductive branches pretty effectively, then this is probably good enough. If not, you probably want adjacent neighborhoods to cluster.
If you don't need adjacent neighborhoods to cluster too much, you set aside the vertices you've grouped into neighborhoods, and repeat the process on the remaining data until all vertices are accounted for. You change each vertex identifier to (vertex,neighborhood), and you're done: when following edges, you know exactly which page to grab, and most of them will be close by given the construction.
If you do need adjacent neighborhoods, then you'll need to keep track of your growing neighborhoods. You repeat the previous process (pick at random, grow neighborhoods), but now rank neighbors by both how many edges they satisfy within the neighborhood and what fraction of their edges that leave the neighborhood are in an existing group. You might need weighting factors, but something like
score = (# edges within) - (# neighborhoods outside) - (# neighborhoodless edges outside)
would probably do the trick.
Now, this is not globally or even locally optimal, but this or something very much like it should give a nicely locally-connected structure, and should let you produce a covering set of neighborhoods that have relatively high interconnectivity.
Again, it depends whether your breadth-first search prunes branches or not. If it does, the inexpensive thing to do is to maximize local interconnectivity. If it doesn't the thing to do is to minimize external connectivity--and in that case, I'd suggest just collecting breadth-first sets up to some size and saving those (with duplication at the edges of the sets--you're not badly limited by hard drive space, are you?).

You might want to look at HDF5. Despite the H standing for Hierarchical it can store graphs, check the documentation under the keyword 'Groups', and it is designed for very large datasets. If I understand correctly HDF5 'files' can be spread across multiple o/s 'files'. Now, HDF5 is only a data structure, plus a set of libraries for low- and high-level manipulations of the data structure. Off the top of my head I haven't a clue about streaming graph-partitioning algorithms, but I stick to the notion that if you get the data structure right algorithms become easier to implement.
What do you already know about the mega-graph ? Does it naturally partition into dense subgraphs which themselves are sparsely connected ? Would a topological sort of the graph be a better basis for storage on disk than the existing spatial sort?
Failing crisp answers to such questions, maybe you just have to bite the bullet and read the graph multiple times to build the partitions, in which case you just want the fastest I/O you can manage, and sophisticated layout of partitions on nodes is nice but not as important. If you can partition the graph into sub-graphs which themselves have single edges to the other sub-graphs you maybe able to make the problem more tractable.
You want a good-for-BFS layout, but BFS is usually applied to trees. Does your graph have a unique root from which to start all BFSes? If not, then layout for BFS from one vertex will be suboptimal for BFS from another vertex.

Related

How to divide a connected weighted graph to N semi-equal subgraphs

I have a graph of many hundred nodes that are mainly connected with each other. I can do processing on entire graph but it really takes a lot of time, so I would like to divide it to smaller sub-graphs of approximately similar size.
With other words. I have a collection of aerial images and I do pairwise image matching on all of them. As a result I get a set of matches for each pair (pixel from first image matched with pixel on second image). Number of matches is considered as weight of this (undirected) edge. These edges then form a graph mentioned above.
I'm not so familiar with graph theory (as it is a very broad topic). What is the best algorithm for this job?
Thank you.
Edit:
This problem has a perfect analogy which I think is easier to understand. Imagine you have a set of people and their connections/friendships, like I social network. Each friendship has a numeric value/weight representing how good friends they are. So in a large group of people I want to get k most interconnected sub-groups .

Unfortunately, the problem you're describing is almost certainly NP-hard. From a graph perspective, you have a graph where each edge has a weight on it. You're trying to split the graph into relatively equal pieces while cutting the lowest total cost of edges cut. This problem is called the maximum k-cut problem and is NP-hard. If you introduce the constraint that you also want to try to make the pieces roughly even in size, you have the balanced k-cut problem, which is also NP-hard.
The good news is that there are nice approximation algorithms for these problems, so if you're looking for solutions that are just "good enough," then you can probably find a library somewhere that implements them. There are also other techniques like spectral clustering which work well in practice and are really fast, but which don't have any guarantees on how well they'll do.

How to find certain sized clusters of points

Given a list of points, I'd like to find all "clusters" of N points. My definition of cluster is loose and can be adjusted to whatever allows an easiest solution: it could be N points within a certain size circle or N points that are all within a distance of each other or something else that makes sense. Heuristics are acceptable.
Where N=2, and we're just looking for all point pairs that are close together, it's pretty easy to do ~efficiently with a k-d tree (e.g. recursively break the space into octants or something, where each area is a different branch on the tree and then for each point, compare it to other points with the same parent (if near the edge of an area, check up the appropriate number of levels as well)). I recognize that inductively with a solution for N=N', I can find solution for N=N'+1 by taking the intersections between different N' solutions, but that's super inefficient.
Anyone know a decent way to go about this?

You start by calculating the Euclidean minimum spanning tree, e.g CGAL can do this. From there the precise algorithm depends on your specific requirements, but it goes roughly like this: You sort the edges in that graph by length. Then delete edges, starting with the longest one. It's a singly connected graph, so with each deleted edge you split the graph into two sub-graphs. Check each created sub-graph if it forms a cluster according to your conditions. If not, continue deleting edges.

graph algorithm to detect even cycles

I have an undirected graph. One edge in that graph is special. I want to find all other edges that are part of a even cycle containing the first edge.
I don't need to enumerate all the cycles, that would be inherently NP I think. I just need to know, for each each edge, whether it satisfies the conditions above.
A brute force search works of course but is too slow, and I'm struggling to come up with anything better. Any help appreciated.

I think we have an answer (I must credit my colleague with the idea). Essentially his idea is to do a flood fill algorithm through the space of even cycles. This works because if you have a large even cycle formed by merging two smaller cycles then the smaller cycles must have been both even or both odd. Similarly merging an odd and even cycle always forms a larger odd cycle.
This is a practical option only because I can imagine pathological cases consisting of alternating even and odd cycles. In this case we would never find two adjacent even cycles and so the algorithm would be slow. But I'm confident that such cases don't arise in real chemistry. At least in chemistry as it's currently known, 30 years ago we'd never heard of fullerenes.

If your graph has a small node degree, you might consider using a different graph representation:
Let three atoms u,v,w and two chemical bonds e=(u,v) and k=(v,w). A typical way of representing such data is to store u,v,w as nodes and e,k as edges in a graph.
However, one may represent e and k as nodes in the graph, having edges like f=(e,k) where f represents a 2-step link from u to w, f=(e,k) or f=(u,v,w). Running any algorithm to find cycles on such a graph will return all even cycles on the original graph.
Of course, this is efficient only if the original graph has a small node degree. When a user performs an edit, you can easily edit accordingly the alternative representation.

Looking for an algorithm by possibly Dijkstra

I am looking for an algorithm that distributes nodes on a plane, such that the edges are
all the same size. I think it is by Dijkstra, but I cannot remember.
Anyone heard of this algorithm?

In general this will be impossible. Effectively you want something similar to the finite pictures in tilings of the plane.
There are some simple cases - regular polygons and a few graphs which include joined polygons, but even something as simple as the complete graph for 4 points (tetrahedron) is impossible.
If you want something that tries to balance the impossible constraints, try graphviz and its neato program.

Well if you want to create any graph with such property, then there are number of graphs that may help you with that, for instance: a line, a ring, a tree etc .. but in here, you are the one who decide what edges to include or exclude.
If you have a certain graph, and you want to have all edges of the same size then this is impossible (because of some cases) - such as: a complete graph of more than 3 nodes, a star topology with one master and more than 5 slaves, and slaves that are directly close to each other are neighbors. [I believe the cases in the other posts tells you more]
A special case, is given a graph $G(V,E)$, draw $G$ such that the length of each edge in $e \in E$ is less than a unit. This is an NP-Hard problem. [That is, you cannot decide whether an arbitrary graph $G$ is a unit disk graph]

Generating a tower defense maze (longest maze with limited walls) - near-optimal heuristic?

In a tower defense game, you have an NxM grid with a start, a finish, and a number of walls.
Enemies take the shortest path from start to finish without passing through any walls (they aren't usually constrained to the grid, but for simplicity's sake let's say they are. In either case, they can't move through diagonal "holes")
The problem (for this question at least) is to place up to K additional walls to maximize the path the enemies have to take. For example, for K=14
My intuition tells me this problem is NP-hard if (as I'm hoping to do) we generalize this to include waypoints that must be visited before moving to the finish, and possibly also without waypoints.
But, are there any decent heuristics out there for near-optimal solutions?
[Edit] I have posted a related question here.

I present a greedy approach and it's maybe close to the optimal (but I couldn't find approximation factor). Idea is simple, we should block the cells which are in critical places of the Maze. These places can help to measure the connectivity of maze. We can consider the vertex connectivity and we find minimum vertex cut which disconnects the start and final: (s,f). After that we remove some critical cells.
To turn it to the graph, take dual of maze. Find minimum (s,f) vertex cut on this graph. Then we examine each vertex in this cut. We remove a vertex its deletion increases the length of all s,f paths or if it is in the minimum length path from s to f. After eliminating a vertex, recursively repeat the above process for k time.
But there is an issue with this, this is when we remove a vertex which cuts any path from s to f. To prevent this we can weight cutting node as high as possible, means first compute minimum (s,f) cut, if cut result is just one node, make it weighted and set a high weight like n^3 to that vertex, now again compute the minimum s,f cut, single cutting vertex in previous calculation doesn't belong to new cut because of waiting.
But if there is just one path between s,f (after some iterations) we can't improve it. In this case we can use normal greedy algorithms like removing node from a one of a shortest path from s to f which doesn't belong to any cut. after that we can deal with minimum vertex cut.
The algorithm running time in each step is:
min-cut + path finding for all nodes in min-cut
O(min cut) + O(n^2)*O(number of nodes in min-cut)
And because number of nodes in min cut can not be greater than O(n^2) in very pessimistic situation the algorithm is O(kn^4), but normally it shouldn't take more than O(kn^3), because normally min-cut algorithm dominates path finding, also normally path finding doesn't takes O(n^2).
I guess the greedy choice is a good start point for simulated annealing type algorithms.
P.S: minimum vertex cut is similar to minimum edge cut, and similar approach like max-flow/min-cut can be applied on minimum vertex cut, just assume each vertex as two vertex, one Vi, one Vo, means input and outputs, also converting undirected graph to directed one is not hard.

it can be easily shown (proof let as an exercise to the reader) that it is enough to search for the solution so that every one of the K blockades is put on the current minimum-length route. Note that if there are multiple minimal-length routes then all of them have to be considered. The reason is that if you don't put any of the remaining blockades on the current minimum-length route then it does not change; hence you can put the first available blockade on it immediately during search. This speeds up even a brute-force search.
But there are more optimizations. You can also always decide that you put the next blockade so that it becomes the FIRST blockade on the current minimum-length route, i.e. you work so that if you place the blockade on the 10th square on the route, then you mark the squares 1..9 as "permanently open" until you backtrack. This saves again an exponential number of squares to search for during backtracking search.
You can then apply heuristics to cut down the search space or to reorder it, e.g. first try those blockade placements that increase the length of the current minimum-length route the most. You can then run the backtracking algorithm for a limited amount of real-time and pick the best solution found thus far.

I believe we can reduce the contained maximum manifold problem to boolean satisifiability and show NP-completeness through any dependency on this subproblem. Because of this, the algorithms spinning_plate provided are reasonable as heuristics, precomputing and machine learning is reasonable, and the trick becomes finding the best heuristic solution if we wish to blunder forward here.
Consider a board like the following:
..S........
#.#..#..###
...........
...........
..........F
This has many of the problems that cause greedy and gate-bound solutions to fail. If we look at that second row:
#.#..#..###
Our logic gates are, in 0-based 2D array ordered as [row][column]:
[1][4], [1][5], [1][6], [1][7], [1][8]
We can re-render this as an equation to satisfy the block:
if ([1][9] AND ([1][10] AND [1][11]) AND ([1][12] AND [1][13]):
traversal_cost = INFINITY; longest = False # Infinity does not qualify
Excepting infinity as an unsatisfiable case, we backtrack and rerender this as:
if ([1][14] AND ([1][15] AND [1][16]) AND [1][17]:
traversal_cost = 6; longest = True
And our hidden boolean relationship falls amongst all of these gates. You can also show that geometric proofs can't fractalize recursively, because we can always create a wall that's exactly N-1 width or height long, and this represents a critical part of the solution in all cases (therefore, divide and conquer won't help you).
Furthermore, because perturbations across different rows are significant:
..S........
#.#........
...#..#....
.......#..#
..........F
We can show that, without a complete set of computable geometric identities, the complete search space reduces itself to N-SAT.
By extension, we can also show that this is trivial to verify and non-polynomial to solve as the number of gates approaches infinity. Unsurprisingly, this is why tower defense games remain so fun for humans to play. Obviously, a more rigorous proof is desirable, but this is a skeletal start.
Do note that you can significantly reduce the n term in your n-choose-k relation. Because we can recursively show that each perturbation must lie on the critical path, and because the critical path is always computable in O(V+E) time (with a few optimizations to speed things up for each perturbation), you can significantly reduce your search space at a cost of a breadth-first search for each additional tower added to the board.
Because we may tolerably assume O(n^k) for a deterministic solution, a heuristical approach is reasonable. My advice thus falls somewhere between spinning_plate's answer and Soravux's, with an eye towards machine learning techniques applicable to the problem.
The 0th solution: Use a tolerable but suboptimal AI, in which spinning_plate provided two usable algorithms. Indeed, these approximate how many naive players approach the game, and this should be sufficient for simple play, albeit with a high degree of exploitability.
The 1st-order solution: Use a database. Given the problem formulation, you haven't quite demonstrated the need to compute the optimal solution on the fly. Therefore, if we relax the constraint of approaching a random board with no information, we can simply precompute the optimum for all K tolerable for each board. Obviously, this only works for a small number of boards: with V! potential board states for each configuration, we cannot tolerably precompute all optimums as V becomes very large.
The 2nd-order solution: Use a machine-learning step. Promote each step as you close a gap that results in a very high traversal cost, running until your algorithm converges or no more optimal solution can be found than greedy. A plethora of algorithms are applicable here, so I recommend chasing the classics and the literature for selecting the correct one that works within the constraints of your program.
The best heuristic may be a simple heat map generated by a locally state-aware, recursive depth-first traversal, sorting the results by most to least commonly traversed after the O(V^2) traversal. Proceeding through this output greedily identifies all bottlenecks, and doing so without making pathing impossible is entirely possible (checking this is O(V+E)).
Putting it all together, I'd try an intersection of these approaches, combining the heat map and critical path identities. I'd assume there's enough here to come up with a good, functional geometric proof that satisfies all of the constraints of the problem.

At the risk of stating the obvious, here's one algorithm
1) Find the shortest path
2) Test blocking everything node on that path and see which one results in the longest path
3) Repeat K times
Naively, this will take O(K*(V+ E log E)^2) but you could with some little work improve 2 by only recalculating partial paths.
As you mention, simply trying to break the path is difficult because if most breaks simply add a length of 1 (or 2), its hard to find the choke points that lead to big gains.
If you take the minimum vertex cut between the start and the end, you will find the choke points for the entire graph. One possible algorithm is this
1) Find the shortest path
2) Find the min-cut of the whole graph
3) Find the maximal contiguous node set that intersects one point on the path, block those.
4) Wash, rinse, repeat
3) is the big part and why this algorithm may perform badly, too. You could also try
the smallest node set that connects with other existing blocks.
finding all groupings of contiguous verticies in the vertex cut, testing each of them for the longest path a la the first algorithm
The last one is what might be most promising
If you find a min vertex cut on the whole graph, you're going to find the choke points for the whole graph.

Here is a thought. In your grid, group adjacent walls into islands and treat every island as a graph node. Distance between nodes is the minimal number of walls that is needed to connect them (to block the enemy).
In that case you can start maximizing the path length by blocking the most cheap arcs.

I have no idea if this would work, because you could make new islands using your points. but it could help work out where to put walls.
I suggest using a modified breadth first search with a K-length priority queue tracking the best K paths between each island.
i would, for every island of connected walls, pretend that it is a light. (a special light that can only send out horizontal and vertical rays of light)
Use ray-tracing to see which other islands the light can hit
say Island1 (i1) hits i2,i3,i4,i5 but doesn't hit i6,i7..
then you would have line(i1,i2), line(i1,i3), line(i1,i4) and line(i1,i5)
Mark the distance of all grid points to be infinity. Set the start point as 0.
Now use breadth first search from the start. Every grid point, mark the distance of that grid point to be the minimum distance of its neighbors.
But.. here is the catch..
every time you get to a grid-point that is on a line() between two islands, Instead of recording the distance as the minimum of its neighbors, you need to make it a priority queue of length K. And record the K shortest paths to that line() from any of the other line()s
This priority queque then stays the same until you get to the next line(), where it aggregates all priority ques going into that point.

You haven't showed the need for this algorithm to be realtime, but I may be wrong about this premice. You could then precalculate the block positions.
If you can do this beforehand and then simply make the AI build the maze rock by rock as if it was a kind of tree, you could use genetic algorithms to ease up your need for heuristics. You would need to load any kind of genetic algorithm framework, start with a population of non-movable blocks (your map) and randomly-placed movable blocks (blocks that the AI would place). Then, you evolve the population by making crossovers and transmutations over movable blocks and then evaluate the individuals by giving more reward to the longest path calculated. You would then simply have to write a resource efficient path-calculator without the need of having heuristics in your code. In your last generation of your evolution, you would take the highest-ranking individual, which would be your solution, thus your desired block pattern for this map.
Genetic algorithms are proven to take you, under ideal situation, to a local maxima (or minima) in reasonable time, which may be impossible to reach with analytic solutions on a sufficiently large data set (ie. big enough map in your situation).
You haven't stated the language in which you are going to develop this algorithm, so I can't propose frameworks that may perfectly suit your needs.
Note that if your map is dynamic, meaning that the map may change over tower defense iterations, you may want to avoid this technique since it may be too intensive to re-evolve an entire new population every wave.

I'm not at all an algorithms expert, but looking at the grid makes me wonder if Conway's game of life might somehow be useful for this. With a reasonable initial seed and well-chosen rules about birth and death of towers, you could try many seeds and subsequent generations thereof in a short period of time.
You already have a measure of fitness in the length of the creeps' path, so you could pick the best one accordingly. I don't know how well (if at all) it would approximate the best path, but it would be an interesting thing to use in a solution.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio