How to divide a connected weighted graph into N semi-equal subgraphs - algorithm

I have a graph of many hundreds of nodes that are mostly connected with each other. I can process the entire graph, but that takes a lot of time, so I would like to divide it into smaller sub-graphs of approximately similar size.
In other words: I have a collection of aerial images and I do pairwise image matching on all of them. As a result I get a set of matches for each pair (a pixel from the first image matched with a pixel in the second image). The number of matches is taken as the weight of this (undirected) edge. These edges then form the graph mentioned above.
I'm not so familiar with graph theory (as it is a very broad topic). What is the best algorithm for this job?
Thank you.
Edit:
This problem has a perfect analogy which I think is easier to understand. Imagine you have a set of people and their connections/friendships, like a social network. Each friendship has a numeric value/weight representing how good friends they are. So, in a large group of people, I want to find the k most interconnected sub-groups.
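For concreteness, here is a minimal sketch (Python with networkx; the match_counts dictionary is a made-up stand-in for the real matching results) of how the pairwise match counts become the weighted undirected graph described above:

import networkx as nx

# Hypothetical matching results: (image_a, image_b) -> number of pixel matches.
match_counts = {
    ("img_001", "img_002"): 431,
    ("img_001", "img_003"): 12,
    ("img_002", "img_003"): 287,
}

G = nx.Graph()
for (img_a, img_b), n_matches in match_counts.items():
    if n_matches > 0:                  # only keep pairs that actually matched
        G.add_edge(img_a, img_b, weight=n_matches)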

Unfortunately, the problem you're describing is almost certainly NP-hard. From a graph perspective, you have a graph where each edge has a weight on it. You're trying to split the graph into k pieces while minimizing the total weight of the edges cut. This is the minimum k-cut problem, which is NP-hard when k is part of the input. If you add the constraint that the pieces should also be roughly equal in size, you get the balanced k-cut problem, which is also NP-hard.
The good news is that there are nice approximation algorithms for these problems, so if you're looking for solutions that are just "good enough," you can probably find a library somewhere that implements them. There are also other techniques, such as spectral clustering, that work well in practice and are really fast, but that come with no guarantees on solution quality.
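As an illustration of the spectral-clustering route, a minimal sketch, assuming the graph is available as a weighted networkx graph G and that scikit-learn is acceptable; the match counts are used directly as affinities, and note that nothing here guarantees equally sized parts (for balanced parts, a multilevel partitioner such as METIS, e.g. via the pymetis bindings, is the usual practical choice):

import networkx as nx
from sklearn.cluster import SpectralClustering

# G: the weighted graph from the question (assumed here to be a networkx Graph
# whose edge attribute "weight" holds the match counts); k: number of parts.
def spectral_partition(G, k):
    nodes = list(G.nodes())
    W = nx.to_numpy_array(G, nodelist=nodes, weight="weight")  # affinity matrix
    labels = SpectralClustering(
        n_clusters=k,
        affinity="precomputed",   # use the match counts directly as affinities
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(W)
    return {node: int(label) for node, label in zip(nodes, labels)}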

Related

Finding a minimum/maximum weight Steiner tree

I asked this question on reddit, but haven't converged on a solution yet. Since many of my searches bring me to Stack Overflow, I decided I would give this a try. Here is a simple formulation of my problem:
Given a weighted undirected graph G(V,E,w) and a subset of vertices S in G, find the min/max weight tree that spans S. Adding vertices is not allowed. An extension of the basic model is adding edges with 0 weight, and vertices that must be excluded. This seems similar to the question asked here:
Algorithm to find minimum spanning tree of chosen vertices
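If an approximate answer is acceptable, networkx ships a standard approximation for the minimum-weight Steiner tree over a terminal set. A minimal sketch, assuming the graph G and the terminal set S from the formulation above; note that the returned tree may pass through vertices outside S, so if that is not allowed, restrict G to the permitted vertices first (e.g. with G.subgraph(...)):

from networkx.algorithms.approximation import steiner_tree

# G: the weighted undirected graph, S: the subset of vertices that must be
# spanned (the terminal set). The returned tree may pass through vertices
# outside S, as in the classical Steiner tree problem.
def approx_min_steiner(G, S):
    T = steiner_tree(G, list(S), weight="weight")
    return T, T.size(weight="weight")   # the tree and its total edge weight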
There is also more insight into what values the edges can take. Each edge is actually a correlation probability, which I can encode in several ways, so the main questions I want to ask the graph are:
Given k vertices that must be connected, what are the top X min/max spanning trees that connect them, and what vertices do they pass through? As I understand it, this is the same as asking the graph for the highest probability of connecting all k vertices.
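On the probability reading of the weights: if each edge carries an independent probability p in (0, 1], maximizing the product of probabilities over the chosen tree is the same as minimizing the sum of -log(p), so the "highest probability of connecting the k vertices" question reduces to a minimum-weight Steiner tree on transformed weights. A small sketch, assuming a hypothetical edge attribute named "prob":

import math

# Assumes a hypothetical edge attribute "prob" in (0, 1]. Rewrites "weight"
# as -log(prob), so minimizing total weight maximizes the product of the
# probabilities (treating the edges as independent).
def add_neglog_weights(G, prob_attr="prob", weight_attr="weight"):
    for u, v, data in G.edges(data=True):
        data[weight_attr] = -math.log(data[prob_attr])
    return G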
More vaguely: is there a logical way to cluster the nodes?
As for implementation, I have the boost libraries installed, and once I get the framework rolling on this problem, I can deal with how to multi-thread it (if appropriate), what kind of graph to use, and how to store/cache the data, since the number of vertices and edges is going to be quite large.
Update
Looking at the problem I am trying to solve, it makes sense that it would be NP-complete. The real world problem that I am trying to solve involves medical diagnoses; specifically when the medical community is working on a problem with a specific idea in mind, and they need to take a step back and reconsider how they got there. What I want from the program I am trying to design is:
Given several conditions, tests, symptoms, age, gender, season, confirmed diagnosis, timeline, how can you relate them? What cells/tissues/organs/systems are touched? Are they even related?
Along with the defined groups that conditions/symptoms can belong to, is there a way to logically group the conditions/symptoms?
Example
Flu-like symptoms, red eyes, early pneumonia, and some of the signs of diabetes. Is there a way to relate all of the symptoms? Are there some tests that could be done to make it easier to determine? What systems are involved?
It just seemed natural to try and map this to a graph, or several graphs, and use probabilities as the correlation between different symptoms/conditions.
I have seen models for your problem that were mostly based on Bayesian inference and fuzzy logic. Bayesian inference networks express the relation between causes and effects, e.g. smoking and lung cancer. Look here for a quick tutorial. You can apply fuzzy logic to that modelling to try to take into account the variability in real life (as not everyone gets lung cancer).
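To make the cause/effect direction concrete, a tiny worked example with made-up numbers (not medical data) for the smoking and lung cancer relation, applying Bayes' rule directly rather than a full inference library:

# Made-up illustrative numbers, not medical data.
p_smoker = 0.25                     # P(smoking)
p_cancer_given_smoker = 0.08        # P(lung cancer | smoking)
p_cancer_given_nonsmoker = 0.01     # P(lung cancer | not smoking)

# Total probability of lung cancer in this toy population.
p_cancer = (p_cancer_given_smoker * p_smoker
            + p_cancer_given_nonsmoker * (1 - p_smoker))

# Bayes' rule gives the "diagnostic" direction, P(smoking | lung cancer).
p_smoker_given_cancer = p_cancer_given_smoker * p_smoker / p_cancer
print(f"P(smoker | lung cancer) = {p_smoker_given_cancer:.2f}")   # ~0.73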

graph algorithm to detect even cycles

I have an undirected graph. One edge in that graph is special. I want to find all other edges that are part of an even cycle containing that first edge.
I don't need to enumerate all the cycles; that would be inherently intractable, I think. I just need to know, for each edge, whether it satisfies the condition above.
A brute force search works of course but is too slow, and I'm struggling to come up with anything better. Any help appreciated.
I think we have an answer (I must credit my colleague with the idea). Essentially, his idea is to do a flood fill through the space of even cycles. This works because if a large even cycle is formed by merging two smaller cycles, then the smaller cycles must have been both even or both odd. Similarly, merging an odd and an even cycle always forms a larger odd cycle.
This is only a practical solution, though: I can imagine pathological cases consisting of alternating even and odd cycles, in which we would never find two adjacent even cycles and the algorithm would be slow. But I'm confident that such cases don't arise in real chemistry, at least in chemistry as it's currently known; 30 years ago we'd never heard of fullerenes.
If your graph has a small node degree, you might consider using a different graph representation:
Consider three atoms u, v, w and two chemical bonds e=(u,v) and k=(v,w). A typical way of representing such data is to store u, v, w as nodes and e, k as edges in a graph.
However, one may instead represent e and k as nodes of the graph, with edges like f=(e,k), where f represents the 2-step link from u to w (that is, the path u, v, w). Running any cycle-finding algorithm on such a graph will return all even cycles of the original graph.
Of course, this is efficient only if the original graph has a small node degree. When a user performs an edit, you can easily update the alternative representation accordingly.
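A minimal sketch of this bond-as-node representation, assuming the molecule is already stored as a networkx graph; nx.line_graph builds exactly the structure described (one node per original edge, adjacent when the edges share an atom), and nx.cycle_basis is one way to start enumerating cycles on it:

import networkx as nx

# G: the original graph with atoms as nodes and bonds as edges (assumed to be
# a networkx Graph). In the line graph L(G), every bond of G becomes a node,
# and two such nodes are adjacent when the bonds share an atom - exactly the
# f = (e, k) edges described above.
def edge_as_node_representation(G):
    L = nx.line_graph(G)          # nodes of L are the edges of G
    cycles = nx.cycle_basis(L)    # one way to start looking for cycles in L
    return L, cycles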

Why do these maze generation algorithms produce mazes with different properties?

I was browsing the Wikipedia entry on maze generation algorithms and found that the article strongly insinuated that different maze generation algorithms (randomized depth-first search, randomized Kruskal's, etc.) produce mazes with different characteristics. This seems to suggest that the algorithms produce random mazes with different probability distributions over the set of all single-solution mazes (spanning trees on a rectangular grid).
My questions are:
Is this correct? That is, am I reading this article correctly, and is the article correct?
If so, why? I don't see an intuitive reason why the different algorithms would produce different distributions.
Uh, well, I think it's pretty obvious that different algorithms generate different mazes. Let's just talk about spanning trees of a grid. Suppose you have a grid G and two algorithms for generating a spanning tree of the grid:
Algorithm A:
Pick any edge of the grid; with 99% probability choose a horizontal one, otherwise a vertical one
Add the edge to the maze, unless adding it would create a cycle
Stop when every vertex is connected to every other vertex (spanning tree complete)
Algorithm B:
As algorithm A, but set the probability to 1% instead of 99%
"Obviously" algorithm A produces mazes with lots of horizontal passages and algorithm B mazes with lots of vertical passages. That is, there is a statistical correlation between the number of horizontal passages in a maze and the maze being produced by algorithm A.
Of course the differences between the Wikipedia algorithms are more intricate but the principle is the same. The algorithms sample the space of possible mazes for a given grid in a non-uniform, structured way.
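A small simulation of Algorithm A makes the bias visible directly. A union-find structure handles the cycle check; the 99% horizontal preference comes from the algorithm as stated, while the grid size and everything else are illustrative assumptions:

import random

# Algorithm A from above on a w x h grid (the grid size is an arbitrary choice
# for illustration): repeatedly pick an edge, preferring horizontal ones with
# probability 0.99, and keep it unless it would close a cycle. A union-find
# structure does the cycle check.
def biased_spanning_tree(w, h, p_horizontal=0.99, seed=0):
    rng = random.Random(seed)
    parent = {(x, y): (x, y) for x in range(w) for y in range(h)}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    horizontal = [((x, y), (x + 1, y)) for x in range(w - 1) for y in range(h)]
    vertical = [((x, y), (x, y + 1)) for x in range(w) for y in range(h - 1)]
    tree = []
    while len(tree) < w * h - 1:            # stop when the spanning tree is complete
        pool = horizontal if rng.random() < p_horizontal else vertical
        u, v = rng.choice(pool)
        ru, rv = find(u), find(v)
        if ru != rv:                        # keeping this edge creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

tree = biased_spanning_tree(20, 20)
n_horizontal = sum(1 for u, v in tree if u[1] == v[1])
print(n_horizontal, "horizontal edges out of", len(tree))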
LOL, I remember a scientific conference where a researcher presented her results about her algorithm that did something "for graphs". The results were statistical and presented for "random graphs". Someone in the audience asked: "Which distribution of random graphs did you draw the graphs from?" The answer: "uh... they were produced by our graph generation program". Duh!
Interesting question. Here are my random 2c.
Comparing Prim's to, say, DFS, the latter seems to have a proclivity for producing deeper trees, simply because the first 'runs' have more space in which to create deep trees with fewer branches. Prim's algorithm, on the other hand, appears to create trees with more branching because any open branch can be selected at each iteration.
One way to ask this would be to look at the probability that each algorithm produces a tree of depth greater than N. I have a hunch they would be different. A more formal approach to proving this might be to assign weights to each part of the tree and show that some parts are more likely to be taken, or to attempt to characterize the space some other way, but I'll be hand-wavy and guess it's correct :). I'm interested in what led you to think it wouldn't be, because my intuition was the opposite. And no, the Wiki article doesn't give a convincing argument.
EDIT
One simple way to see this is to consider an initial tree with two children and a total of k nodes
e.g.,
*---* ... *
\--* ... *
Choose random start and end nodes. DFS will produce one of two mazes: either the entire tree, or the part of it containing the direct path from start to end. Prim's algorithm will produce the 'maze' consisting of the direct path from start to end plus secondary paths of length 1 ... k.
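To check the DFS-goes-deep versus Prim-branches intuition numerically, a rough sketch assuming networkx and a 20x20 grid; "prim" here means the variant that extends a uniformly random frontier edge, which is my reading of the randomized Prim's maze algorithm:

import random
import networkx as nx

# Grow a random spanning tree of G from `root`, either depth-first (always
# extend the most recently reached vertex) or "Prim-style" (extend a uniformly
# random frontier edge). The latter is my reading of the randomized Prim's
# variant; the 20x20 grid is an arbitrary choice.
def random_spanning_tree(G, root, style, seed=0):
    rng = random.Random(seed)
    T = nx.Graph()
    T.add_node(root)
    if style == "dfs":
        stack = [root]
        while stack:
            v = stack[-1]
            unvisited = [u for u in G[v] if u not in T]
            if not unvisited:
                stack.pop()
                continue
            u = rng.choice(unvisited)
            T.add_edge(v, u)
            stack.append(u)
    else:
        frontier = [(root, u) for u in G[root]]
        while frontier:
            v, u = frontier.pop(rng.randrange(len(frontier)))
            if u in T:
                continue
            T.add_edge(v, u)
            frontier.extend((u, w) for w in G[u] if w not in T)
    return T

G = nx.grid_2d_graph(20, 20)
root = (0, 0)
for style in ("dfs", "prim"):
    T = random_spanning_tree(G, root, style)
    depth = max(nx.shortest_path_length(T, root).values())
    print(style, "tree depth:", depth)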
It is not statistical until you request that each algorithm produce every solution it can.
What you are perceiving as statistical bias is only a bias towards the preferred, first solution.
That bias may not be algorithmic (set-theory-wise) but implementation dependent (like the bias in the choice of the pivot in quicksort).
Yes, it is correct. You can produce different mazes by starting the process in different ways. Some algorithms start with a fully closed grid and remove walls to generate a path through the maze, while others start with an empty grid and add walls, leaving a path behind. This alone can produce different results.

How to assign consecutive numbers to nodes of directed graph?

There's a graph with a lot of nodes and very few edges between them. The problem is assigning numbers to the nodes so that most edges connect node i with node i+1, or are otherwise short.
My problem is about printing graph data nicely, but an algorithm just like this is part of pretty much every compiler (intermediate code is just a graph, and the produced object code gets assigned memory locations).
I thought it was just a straightforward depth-first search, but the results of that aren't great: it seems to minimize the number of back links well enough, but the ones it leaves tend to be horrible (like 1 -> 500 -> 1).
Any better ideas?
This paper discusses this problem, if you use Eyal Schneider's formulation of minimizing the sum of the edge deltas (absolute value of the difference between the endpoints' labels). It's under #2, Optimal Linear Arrangements.
Sadly, there's no algorithm given for achieving an optimal ordering (or labeling), and the general problem is NP-complete. There are references to some polynomial-time algorithms for trees, though.
If you want to get into the academic stuff, google gives lots of hits for "Optimal Linear Arrangements".
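If a decent heuristic is enough in practice, one cheap option is a bandwidth-reducing ordering such as reverse Cuthill-McKee, which networkx provides; it tends to keep the two endpoints of each edge close together in the numbering. A hedged sketch, treating the directed graph as undirected just to compute the order:

import networkx as nx
from networkx.utils import reverse_cuthill_mckee_ordering

# G: the graph to number. Cuthill-McKee is a bandwidth-reduction ordering, so
# it tends to keep the endpoints of each edge close together in the numbering;
# the directed graph is treated as undirected only for computing the order.
def consecutive_numbering(G):
    order = list(reverse_cuthill_mckee_ordering(G.to_undirected()))
    numbering = {node: i for i, node in enumerate(order)}
    total_delta = sum(abs(numbering[u] - numbering[v]) for u, v in G.edges())
    return numbering, total_delta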

Storing very large graphs on disk/streaming graph partitioning algorithms?

Suppose that I have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, ~10 edges per vertex), non-distributed and processed by a single thread only, and that I want to do breadth-first searches on it. I expect them to be I/O-bound, so I need a disk page layout that is good for BFS; disk space is not an issue. The searches can start at any vertex with equal probability. Intuitively that means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem.
The graph itself looks like spaghetti: think of a random set of points randomly interconnected, with some bias towards shorter edges.
The problem is, how does one partition a graph this large? The available graph partitioners I have found only work with graphs that fit into memory. I could not find descriptions or implementations of any streaming graph partitioning algorithms.
OR, maybe there is an alternative to partitioning graph for getting a disk layout that works well with BFS?
Right now as an approximation I use the fact that the vertices have spatial coordinates attached to them and put the vertices on disk in Hilbert sort order. This way spatially close vertices land on the same page, but the presence or absence of edge between them is completely ignored. Can I do better?
As an alternative, I can split graph into pieces using the Hilbert sort order for vertices, partition the subgraphs, stitch them back and accept poor partitioning on the seams.
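For illustration, a sketch of the space-filling-curve layout idea; Morton (Z-order) interleaving is used here as a simpler stand-in for the Hilbert order described above, and `vertices` is a hypothetical dict of vertex id -> (x, y) coordinates:

# `vertices` is a hypothetical dict mapping vertex id -> (x, y) coordinates.
# Morton (Z-order) interleaving is used here as a simpler stand-in for the
# Hilbert order; both put spatially close vertices close together on disk.
def morton_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: y
    return key

def disk_order(vertices, bits=16):
    xs = [p[0] for p in vertices.values()]
    ys = [p[1] for p in vertices.values()]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def quantize(v, lo, hi):
        # Scale a coordinate into an integer in [0, 2**bits - 1].
        return int((v - lo) / (hi - lo + 1e-12) * (2 ** bits - 1))

    return sorted(
        vertices,
        key=lambda v: morton_key(quantize(vertices[v][0], x_lo, x_hi),
                                 quantize(vertices[v][1], y_lo, y_hi),
                                 bits),
    )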
Some things I have looked into already:
How to store a large directed unweighted graph with billions of nodes and vertices
http://neo4j.org/ - I found zero information on how does it do graph layout on disk
Partitioning implementations (unless I'm mistaken, all of them need to fit graph into memory):
http://glaros.dtc.umn.edu/gkhome/views/metis
http://www.sandia.gov/~bahendr/chaco.html
http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
http://www.cerfacs.fr/algor/Softs/MESHPART/
EDIT: added info on what the graph looks like and that BFS can start at any vertex.
EDIT: added the idea of partitioning the subgraphs.
No algorithm truly needs to "fit into memory"--you can always page things in and out as needed. But you do want to avoid having the computation take unreasonably long--and global graph partitioning in the generic case is an NP-complete problem, which is "unreasonably long" for most problems that do not even fit in memory.
Fortunately, you want to do breadth-first searches, which means that you want a format where breadth-first is the easy computation. I don't know of any algorithms offhand that do this, but you can construct your own breadth-first layout if you're willing to allow a bit of extra disk space.
If the edges are not biased towards local interactions, then disentangling the graph will be difficult. If they are biased towards local interactions, then I suggest an algorithm like the following:
Pick a random set of vertices as starting points from throughout the entire data set.
For each vertex, collect all neighboring vertices (takes one sweep through the data set).
For each set of neighboring vertices, collect the set of neighbors-of-neighbors and rank them according to how many edges connect to them. If you don't have space in a page to store them all, keep the most-connected vertices. If you do have space to save them all, you may wish to throw away the least useful ones (e.g. if the ratio of edges kept within a page to vertices needing storage drops "too low"--where "too low" will depend on how much breadth your searches really need, and whether you can do any pruning and so on--then don't include those vertices in the neighborhood).
Repeat the process of collecting and ranking neighbors until your neighborhood is full (e.g. it fills some page size that suits you). Then check for repeats among the randomly chosen starting points. If a small number of vertices appear in both neighborhoods, remove them from one or the other, whichever breaks fewer edges. If a large number of vertices appear in both, keep the neighborhood with the better (vertices in neighborhood / broken edges) ratio and throw the other away.
Now you have some local neighborhoods that are approximately locally optimal in that breadth-first searches tend to fall inside. If your breadth-first search prunes off unproductive branches pretty effectively, then this is probably good enough. If not, you probably want adjacent neighborhoods to cluster.
If you don't need adjacent neighborhoods to cluster too much, you set aside the vertices you've grouped into neighborhoods, and repeat the process on the remaining data until all vertices are accounted for. You change each vertex identifier to (vertex,neighborhood), and you're done: when following edges, you know exactly which page to grab, and most of them will be close by given the construction.
If you do need adjacent neighborhoods, then you'll need to keep track of your growing neighborhoods. You repeat the previous process (pick at random, grow neighborhoods), but now rank neighbors by both how many edges they satisfy within the neighborhood and what fraction of their edges that leave the neighborhood are in an existing group. You might need weighting factors, but something like
score = (# edges within) - (# neighborhoods outside) - (# neighborhoodless edges outside)
would probably do the trick.
Now, this is not globally or even locally optimal, but this or something very much like it should give a nicely locally-connected structure, and should let you produce a covering set of neighborhoods that have relatively high interconnectivity.
Again, it depends on whether your breadth-first search prunes branches or not. If it does, the inexpensive thing to do is to maximize local interconnectivity. If it doesn't, the thing to do is to minimize external connectivity--and in that case, I'd suggest just collecting breadth-first sets up to some size and saving those (with duplication at the edges of the sets--you're not badly limited by hard drive space, are you?).
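An in-memory sketch of the greedy neighborhood-growing idea above, assuming networkx and a hypothetical page_size capacity in vertices; it only illustrates the pick-a-seed-and-rank-neighbors part, and leaves out the repeat resolution, the adjacency scoring, and all the out-of-core machinery the answer describes:

import random
import networkx as nx

# A greedy, in-memory sketch of the neighborhood-growing idea: start from a
# seed vertex and repeatedly pull in the unassigned candidate with the most
# edges into the current neighborhood, until the "page" is full. `page_size`
# is a hypothetical capacity measured in vertices.
def grow_neighborhood(G, seed_vertex, page_size, assigned):
    nbhd = {seed_vertex}
    candidates = set(G[seed_vertex]) - assigned
    while candidates and len(nbhd) < page_size:
        # Rank candidates by how many edges they have into the neighborhood.
        best = max(candidates, key=lambda v: sum(1 for u in G[v] if u in nbhd))
        nbhd.add(best)
        candidates.discard(best)
        candidates |= set(G[best]) - assigned - nbhd
    return nbhd

def partition_into_pages(G, page_size, seed=0):
    rng = random.Random(seed)
    assigned = set()
    pages = []
    order = list(G.nodes())
    rng.shuffle(order)                  # random starting points
    for v in order:
        if v in assigned:
            continue
        nbhd = grow_neighborhood(G, v, page_size, assigned)
        assigned |= nbhd
        pages.append(nbhd)
    return pages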
You might want to look at HDF5. Despite the H standing for Hierarchical, it can store graphs (check the documentation under the keyword 'Groups'), and it is designed for very large datasets. If I understand correctly, HDF5 'files' can be spread across multiple OS files. Now, HDF5 is only a data structure, plus a set of libraries for low- and high-level manipulation of that data structure. Off the top of my head I haven't a clue about streaming graph-partitioning algorithms, but I stick to the notion that if you get the data structure right, the algorithms become easier to implement.
What do you already know about the mega-graph? Does it naturally partition into dense subgraphs which are themselves only sparsely connected? Would a topological sort of the graph be a better basis for storage on disk than the existing spatial sort?
Failing crisp answers to such questions, maybe you just have to bite the bullet and read the graph multiple times to build the partitions, in which case you just want the fastest I/O you can manage, and a sophisticated layout of partitions on nodes is nice but not as important. If you can partition the graph into sub-graphs which themselves have only single edges to the other sub-graphs, you may be able to make the problem more tractable.
You want a good-for-BFS layout, but BFS is usually applied to trees. Does your graph have a unique root from which to start all BFSes? If not, then layout for BFS from one vertex will be suboptimal for BFS from another vertex.
