top-down community detection in a network - algorithm

I'm trying to find a way to find network communities in a top-down way. Most of the algorithms available (e.g. in the igraph package) work bottom-up; that is, they start by assuming all nodes are singleton communities and then combine them into larger communities. I want to go the other way around, similar to how decision trees are built: start with the whole network, then find a split that improves some "measure of information", and so on.
Does anyone know of such an algorithm, or of such a measure? I can't find one in the literature, but maybe I am missing something.
Also, what bothers me with some measures of modularity is that if you think of the whole network as one module, then all edges are within the module and no between-module edges exist, so this seems like a perfect partition into modules. Is there a measure that overcomes this limitation?
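(For reference, the standard Newman–Girvan modularity is built to resist exactly this degenerate case, because it compares the within-module edge fraction against its expectation in a random graph with the same degree sequence:)

```latex
% Newman-Girvan modularity, for a partition assigning node i to community c_i:
\[
  Q \;=\; \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
\]
% A_{ij}: adjacency matrix, k_i: degree of node i, m: total number of edges.
% With the whole network as a single module, delta(c_i, c_j) = 1 everywhere;
% since sum_{ij} A_{ij} = 2m and sum_{ij} k_i k_j / (2m) = 2m, this gives Q = 0,
% so the trivial one-module partition is not rewarded.
```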

I think Newman's algorithm meets your requirements.
It works by computing "network modularity" and then splitting the network into two groups. After that it recursively applies the same principle to the newly formed groups until no further increase in modularity is possible.
It is also implemented in igraph, at least in the R version.
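If the divisive variant described above is the one meant, python-igraph exposes it as `community_leading_eigenvector`. A minimal sketch, assuming python-igraph is installed:

```python
# Sketch using python-igraph. community_leading_eigenvector implements
# Newman's divisive method: it recursively bisects the graph along the
# leading eigenvector of the modularity matrix until no split increases
# modularity.
import igraph as ig

g = ig.Graph.Famous("Zachary")                 # classic test network
clusters = g.community_leading_eigenvector()
print(clusters)                                # the communities found
print("modularity:", clusters.modularity)     # quality of the partition
```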

Related

Community Detection in complete and weighted networks

I have a complete network graph in which every vertex is connected to every other vertex, and the edges differ only in their weights. An example would be a trade network, where every country is somehow connected to every other country and the connections differ only in their trading volumes.
Now the question is how I could perform community detection on that kind of network. The usual suspects (algorithms) only perform well on either unweighted or incomplete networks. The main problem is that the geodesic distance is the same everywhere.
Two options came to mind:
Cut the network into smaller pieces by cutting at a certain "weight-threshold level"
Or use a hierarchical clustering algorithm to turn the whole network into a blockmodel. But I think the problem of "no variance in geodesic terms" will remain.
Several methods were suggested.
One simple yet effective method was suggested in Fast unfolding of communities in large networks (Blondel et al., 2008). It supports weighted networks. Quoting from the abstract:
We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity.
Quoting from the paper:
We now introduce our algorithm that finds high modularity partitions of large networks in short time and that unfolds a complete hierarchical community structure for the network, thereby giving access to different resolutions of community detection.
So it is supposed to work well for complete graphs, but you had better check.
A C++ implementation is available here (now maintained here).
Your other idea - using a weight threshold - may prove to be a good pre-processing step, especially for algorithms which won't partition complete graphs. I believe it is best to set it to some percentile (e.g. the median) of the weights.
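A rough sketch of that pipeline in python-igraph (the toy graph and the median threshold are illustrative; `community_multilevel` is igraph's implementation of the Blondel et al. method and accepts edge weights):

```python
import random
import igraph as ig

# Toy stand-in for the complete weighted trade network.
g = ig.Graph.Full(30)
g.es["weight"] = [random.random() for _ in g.es]

# Pre-processing: drop edges at or below the median weight.
weights = sorted(g.es["weight"])
threshold = weights[len(weights) // 2]
g.delete_edges(g.es.select(weight_le=threshold))

# Louvain ("fast unfolding", Blondel et al. 2008) on the surviving edges.
clusters = g.community_multilevel(weights="weight")
print(clusters.membership)
```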

Pathfinding with limited knowledge and no distance heuristic

I'm having trouble writing the pathfinding routine for the AI in a simple Elite-esque game I'm writing.
The world in this game is a handful of "systems" connected by "wormholes", and a ship can jump from the system it's in to any system it's linked to once per turn. The AI is limited to only knowing things that it should know; it doesn't know what links come from a system it hasn't been to (though it can work it out from the systems it has seen, since links are two-way). Other parts of the AI decide which system the ship needs to get to based on what goods it has in its inventory and how much it remembers things being worth on systems it has passed through.
The problem is, I don't know how to approach the problem of finding a path to the target system. I can't use A*; there's no way to determine the 'distance' to another system without pathing to it. I also need this algorithm to be efficient, since it'll need to run about 100 times every time the player takes his turn.
Does anyone know of a suitable algorithm?
I ended up implementing a bidirectional, greedy version of breadth-first search, which suits the purpose well enough. To put it simply, I just had the program look through each node its starting node connected to, then each node those nodes connected to, then each node those connected to... until the destination node was found.
Normally one would build a list of appropriate paths and pick the shortest one, but I tried a different method; I had the program run two searches in parallel, one from the starting point, and one from the end point. When the 'from' search found the last node of the 'to' search, the path was considered found.
It then optimizes the path by checking if each node on the path connects to a node further up in the path, and deleting each node in between them.
Whether or not this funky algorithm is actually any better than a straight BFS remains to be seen.
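A minimal sketch of that search (names are illustrative; `known_links` stands for whatever structure records the wormholes the ship has actually observed, and the path-smoothing pass is left out):

```python
from collections import deque

def bidirectional_bfs(known_links, start, goal):
    """known_links: dict mapping a system to the set of systems it is
    known to link to. Returns a path from start to goal, or None."""
    if start == goal:
        return [start]
    parents_fwd = {start: None}           # search growing from the start
    parents_bwd = {goal: None}            # search growing from the goal
    frontier_fwd, frontier_bwd = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        node = frontier.popleft()
        for nxt in known_links.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                if nxt in other_parents:  # the two searches met
                    return nxt
                frontier.append(nxt)
        return None

    while frontier_fwd and frontier_bwd:
        meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        if meet is None:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:              # stitch the two half-paths
            path, n = [], meet
            while n is not None:
                path.append(n)
                n = parents_fwd[n]
            path.reverse()
            n = parents_bwd[meet]
            while n is not None:
                path.append(n)
                n = parents_bwd[n]
            return path
    return None
```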
When it comes to unknown environments, I usually use an evolutionary algorithm approach. It doesn't guarantee that you'll find the best solution in the small timeframe you have, but it is a way to approach such a problem.
Have a look at Partially Observable Markov Decision Processes (POMDPs). You should be able to express your problem with this model.
Then you can use an algorithm that solves these problems to try to find a good solution. Note that solving POMDPs is usually very expensive and you probably have to resort to approximate methods.
The easiest way to cheat this would be to go through, or at least attempt to access, as many systems as possible, then implement the distance heuristic as the sum of all the systems you've been to.
Alternatively, and way cooler:
I've implemented something similar using ACO (ant colony optimization), and it worked pretty well combined with PSO (particle swarm optimization). However, the additional constraints your system imposes mean that you'll have to spend a few (at least one) sessions figuring out the environment layout, and if it's dynamic... well... tough.
The good thing is that this algorithm completely bypasses the need for heuristic generation, which is what you need since you are flying blind. Be advised, though, that this is a bad idea if your search space (number of runs) is small. (100 may be acceptable, but 10 or 5... not so much.)
This scales up quite nicely when dealing with large numbers of nodes (systems), and it bypasses the need to compute a heuristic distance for every node-to-node relationship, thereby making it more efficient.
Good luck.

Where to find a set of hard Traveling Salesman Problems (with known solutions/approximations)?

I want to try my hand at finding heuristics/approximations for solving the Traveling Salesman Problem, and in order to do that, I'm looking for some "hard" TSP instances (along with their best known solutions) so that I can try solving them and see how well I can do.
Ideally, they would be simply a text-based list of adjacency matrices or adjacency lists (I don't want to deal with parsing, just the algorithm).
By "hard", I mean that they should be practically impossible to solve or approximate using brute-force.
(This is so that I can be reasonably confident that if I find an answer close to the best known answer, then I'm actually doing something right, and not just getting lucky.)
Are there any lists that would work for this purpose? I searched around a bit but didn't find anything.
Here is another question on SE partially answering your problem (it lists problems, but most of these seem not to have a solution provided; you had better check the links anyway - things may have changed).
If you can't find them, what about randomly generating a set of nodes along with a path connecting them, saving the path length as "minimal" (making sure that the longest connection between two nodes on the path is never > X), and then adding a bunch of other connections, making sure these are all > X?
This way (unless I am missing something) you have a set of connected nodes "as complex as you want" and you know the actual shortest connecting path from the start...
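A sketch of such a generator (names are illustrative; note one strengthening of the idea above: to make the planted tour provably optimal even though the short weights vary, each non-tour edge is made longer than the entire planted tour, not just longer than X):

```python
import random

def planted_tsp_instance(n, seed=0):
    """Plant a hidden Hamiltonian cycle of short edges and make every
    other edge longer than the whole planted tour, so any tour using a
    non-planted edge is strictly worse. Returns the distance matrix,
    the optimal tour, and its length."""
    rng = random.Random(seed)
    tour = list(range(n))
    rng.shuffle(tour)
    short = {}                                  # edges on the planted cycle
    for k in range(n):
        a, b = tour[k], tour[(k + 1) % n]
        short[frozenset((a, b))] = rng.uniform(0.5, 1.0)
    tour_len = sum(short.values())
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = short.get(frozenset((i, j)),
                          tour_len + rng.uniform(1.0, 2.0))
            dist[i][j] = dist[j][i] = w
    return dist, tour, tour_len
```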
Addendum - if you really want to see how you compare to existing tools, then you have to run these on your generated problems. One that is free and accessible (but I don't know how "efficient" it may be) is the TSP Library for R.
Wikipedia has a list of other free software packages for this.
Maybe you could create a different SE question asking how to get other TSP tools.
The TSP gatech site seems to be the canonical site for TSP information.
Here's a list of the available datasets: http://www.tsp.gatech.edu/data/index.html
The optimal solution is available for some datasets with over 10 000 cities. And there are datasets available of over 1 000 000 cities.
There is a well-known algorithm for finding the optimum TSP solution - it is called brute force. The trouble is that its running time grows factorially with the number of cities, so it is useless beyond toy instances.
So the only real way you can compare two algorithms is on the quality of the solution as well as some other criteria - usually running time.
Even here you run into a problem. Many algorithms are effectively search algorithms, and the longer you search the more possible solutions are evaluated. The algorithms already trade off quality and running time. They may or may not result in the correct (best) answer for some or all graphs.
The only real way you are going to be able to compare your algorithm to others is by implementing the other algorithms and then running yours and them on the same hard problems (and, as others have pointed out, it is easy to generate at least some types of hard problems). Implementing these existing algorithms may also suggest ways of improving yours. http://en.wikipedia.org/wiki/Travelling_salesman_problem has plenty of algorithms, and at least a couple look very easy to code. Why not implement them as the first benchmark for your algorithm?
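For instance, nearest neighbour is one of the easy-to-code heuristics; a sketch, assuming a symmetric distance matrix:

```python
def nearest_neighbour_tour(dist, start=0):
    """From the current city, always travel to the nearest unvisited
    city. A quick first benchmark, not a good approximation in general."""
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        cur = tour[-1]
        nxt = min(unvisited, key=lambda j: dist[cur][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
    return tour, length
```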

How to sample a scale-free graph

Given a large scale-free graph (a social network graph), what's the best way to sample it such that the sample retains an acceptable abstraction of the properties of the original?
I have a large graph (Munmun's twitter dataset, if you know it). But I need a connected sample of that graph with a reasonably large diameter (tl;dr... reasons why on request... a diameter of 10 would be good).
The problem is that any kind of breadth-first search is likely to come across some massively connected nodes. So I start such a search, getting the friends of all nodes which I come across. I inevitably come across some massively connected nodes and have to get all their friends. This is a problem because I end up with a large number of nodes which are close to each other in the graph. To make programmatic analysis feasible, I have to limit the number of nodes (and edges). The whole point of this exercise is to find shortest paths between nodes, so I'm generally interested in ALL of a node's neighbours. And that's the problem.
One hack around this is to limit the max. number of nodes connected to a user which I'm interested in. For instance, if I come across #barackobama in my breadth-first search, I ensure that I only accept some small proportion of his friends and ignore the rest. But would this hacked graph be worth a damn, or am I losing too much information in terms of finding shortest paths?
Hope that makes sense...
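For concreteness, the capping hack described above might look like this (illustrative sketch; `neighbours(v)` stands for whatever call returns v's friend list in the dataset). One caveat: removing edges can never shorten a path, so shortest paths measured on the capped sample can only be equal to or longer than in the full graph.

```python
import random
from collections import deque

def capped_bfs_sample(neighbours, seed, max_neighbours=50, max_nodes=10000):
    """BFS from a seed node, but keep only a random subset of the
    friends of any hub with more than max_neighbours connections."""
    rng = random.Random(0)
    sampled, queue, edges = {seed}, deque([seed]), []
    while queue and len(sampled) < max_nodes:
        v = queue.popleft()
        friends = neighbours(v)
        if len(friends) > max_neighbours:          # cap the hubs
            friends = rng.sample(friends, max_neighbours)
        for u in friends:
            edges.append((v, u))
            if u not in sampled:
                sampled.add(u)
                queue.append(u)
    return sampled, edges
```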
Several sampling methods exist; how to choose one depends (amongst other things) on the properties you want to preserve. I found the literature review (section 3) in the thesis Sampling and Inference in Complex Networks [Maiya '11] very informative in that regard.
But you seem to have found a way of sampling your network, and now you want to find out whether the sample is representative of the whole graph in terms of shortest paths. You can have a look at this paper: Complex Network Measurements: Estimating the Relevance of Observed Properties [Latapy & Magnien '08]. They describe a method to assess the representativeness of a sample with regard to various classic topological properties. To summarize their approach: they initially have access to the whole studied network, and simulate some sampling process on these data with increasing sample size. They monitor how the properties evolve depending on sample size, and decide on an appropriate size when the properties of interest are stable enough. Their tool is freely available online.
Edit: the only ready-to-use tool I could find online is Albatross. The associated article Albatross Sampling: Robust and Effective Hybrid Vertex Sampling for Social Graphs [Jin et al. '11] also contains a nice review of existing sampling methods, some of which are implemented in the source code they provide.
Edit 2: I needed to use Albatross on a Linux system, so I did a Java port. It's very raw, but it seems to work fine. It's available on GitHub: https://github.com/vlabatut/Albatross
I am not sure I understand your question correctly. I think your main question is how to compute the shortest path between two nodes in a giant, directed graph. Creating a subsample of the graph seems to be your attempt to arrive at an efficient solution. (But I may have misunderstood you completely.)
Perhaps this SO-Question has some pointers for you: Efficiently finding the shortest path in large graphs
The graphs in that question seem to be significantly smaller, though.
You might want to check the following: Gscaler: https://github.com/jayCool/Gscaler
This is a recent tool which produces synthetic scaled graphs.
It contains the jar file and the related paper for your reference.

What are good examples of problems that graphs can solve better than the alternative? [closed]

After reading Steve Yegge's Get That Job At Google article, I found this little quote interesting:
Whenever someone gives you a problem, think graphs. They are the most fundamental and flexible way of representing any kind of a relationship, so it's about a 50–50 shot that any interesting design problem has a graph involved in it. Make absolutely sure you can't think of a way to solve it using graphs before moving on to other solution types. This tip is important!
What are some examples of problems that are best represented and/or solved by graph data structures/algorithms?
One example I can think of: navigation units (ala Garmin, TomTom), that supply road directions from your current location to another, utilize graphs and advanced pathing algorithms.
What are some others?
Computer Networks: Graphs intuitively model computer networks and the Internet. Often nodes will represent end-systems or routers, while edges represent connections between these systems.
Data Structures: Any data structure that makes use of pointers to link data together is making use of a graph of some kind. This includes tree structures and linked lists which are used all the time.
Pathing and Maps: Trying to find shortest or longest paths from some location to a destination makes use of graphs. This can include pathing like you see in an application like Google maps, or calculating paths for AI characters to take in a video game, and many other similar problems.
Constraint Satisfaction: A common problem in AI is to find some goal that satisfies a list of constraints. For example, for a University to set its course schedules, it needs to make sure that certain courses don't conflict, that a professor isn't teaching two courses at the same time, that the lectures occur during certain timeslots, and so on. Constraint satisfaction problems like this are often modeled and solved using graphs.
Molecules: Graphs can be used to model atoms and molecules for studying their interaction and structure among other things.
I am very interested in graph theory and I've used it to solve many different kinds of problems. You can solve a lot of path-related problems, matching problems, and structural problems using graphs.
Path problems have a lot of applications.
This was a CareerCup interview question.
Say you want to find the largest sum of a subarray. For example, [1, 2, 3, -1] has the largest sum, 6. Model it as a directed acyclic graph (DAG): add a dummy source and a dummy destination, and connect consecutive nodes with edges whose weights correspond to the numbers. Now use the longest-path algorithm for DAGs to solve this problem.
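A sketch of that formulation, and of why it works: with nodes 0..n, an edge i -> i+1 weighted a[i], and zero-weight edges from a dummy source to every node and from every node to a dummy sink, the longest-path DP over the topological order turns out to be exactly Kadane's algorithm.

```python
def max_subarray_sum_dag(a):
    """Longest source-to-sink path in the DAG described above, which
    equals the largest (possibly empty) contiguous sum."""
    n = len(a)
    dist = [0] * (n + 1)              # dist[i]: longest source -> node i path
    for i in range(1, n + 1):
        # arrive via the edge (i-1 -> i), or start fresh from the source
        dist[i] = max(0, dist[i - 1] + a[i - 1])
    return max(dist)                  # best hand-off to the dummy sink

print(max_subarray_sum_dag([1, 2, 3, -1]))   # 6
```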
Similarly, arbitrage problems in the financial world, or even geometric problems like finding the longest overlapping structure, are path problems of the same kind.
Some obvious ones would be network problems (where your network could contain computers, people, organisation charts, etc.).
You can glean a lot of structural information like
which point breaks the graph into two pieces
what is the best way to connect them
what is the best way to get from one place to another
is there a way to reach one place from another, etc.
I've solved a lot of project management related problems using graphs. A sequence of events can be pictured as a directed graph (if you don't have cycles then that's even better). So, now you can
sort the events according to their priority
you can find the event that is the most crucial (that is, the one that would free up a lot of other projects)
you can find the duration needed to complete the whole project (a longest-path problem, as sketched below), etc.
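A small sketch of that duration computation (the task names are made up): the dependencies form a DAG, and the minimum time to finish everything is the longest duration-weighted path through it.

```python
def project_duration(durations, deps):
    """durations: task -> time it takes; deps: task -> tasks it waits on.
    A task's earliest finish is its duration plus the latest finish among
    its dependencies; the project takes as long as the latest task."""
    finish = {}

    def finish_time(t):
        if t not in finish:
            finish[t] = durations[t] + max(
                (finish_time(d) for d in deps.get(t, ())), default=0)
        return finish[t]

    return max(finish_time(t) for t in durations)

# b and c wait on a; d waits on both b and c.
print(project_duration({"a": 3, "b": 2, "c": 4, "d": 1},
                       {"b": ["a"], "c": ["a"], "d": ["b", "c"]}))  # 8
```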
A lot of matching problems can be solved with graphs. For example, if you need to match processors to workloads, or workers to their jobs. In my final exam, I had to match people to tables in restaurants. It follows the same principle (bipartite matching -> network flow algorithms). It's simple yet powerful.
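A compact sketch of bipartite matching via augmenting paths (Kuhn's algorithm, equivalent to max flow on the unit-capacity network mentioned above; `adj[u]` lists the right-side options for left vertex u, e.g. the jobs a worker can do):

```python
def max_bipartite_matching(adj, n_right):
    """Returns the size of a maximum matching."""
    match_right = [-1] * n_right          # right vertex -> matched left vertex

    def try_augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # v is free, or its current partner can be re-matched elsewhere
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    return sum(try_augment(u, set()) for u in range(len(adj)))

# Two workers: both can do job 0, the second can also do job 1.
print(max_bipartite_matching([[0], [0, 1]], n_right=2))  # 2
```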
A special graph, a tree, has numerous applications in the computer science world. For example, in the syntax of a programming language, or in a database indexing structure.
Most recently, I also used graphs in compiler optimization problems. I am using Morgan's Book, which is teaching me fascinating techniques.
The list really goes on and on. Graphs are a beautiful mathematical abstraction for relations. You really can do wonders if you can model a problem correctly. And since graph theory has found so many applications, there is a lot of active research in the field, and that research keeps producing even more applications, which in turn fuels further research.
If you want to get started on graph theory, get a good beginner discrete math book (Rosen comes to mind), and you can buy books from authors like Foulds or Even. CLRS also has good graph algorithms.
Your source code is tree structured, and a tree is a type of graph. Whenever you hear people talking about an AST (Abstract Syntax Tree), they're talking about a kind of graph.
Pointers form graph structures. Anything that walks pointers is doing some kind of graph manipulation.
The web is a huge directed graph. Google's key insight, that led them to dominate in search, is that the graph structure of the web is of comparable or greater importance than the textual content of the pages.
State machines are graphs. State machines are used in network protocols, regular expressions, games, and all kinds of other fields.
It's rather hard to think of anything you do that does not involve some sort of graph structure.
An example most people are familiar with: build systems. Make is the typical example, but almost any good build system relies on a Directed Acyclic Graph. The basic idea is that the direction models the dependency between a source and a target, and you should "walk" the graph in a certain order to build things correctly -> this is an example of topological sorting.
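A sketch of that walk (Kahn's algorithm for topological sorting; the make-style targets are made up):

```python
from collections import deque

def topo_order(targets, deps):
    """deps[t] lists the targets t depends on; returns an order in which
    every target is built after its dependencies."""
    indeg = {t: 0 for t in targets}
    users = {t: [] for t in targets}      # who depends on t
    for t in targets:
        for d in deps.get(t, ()):
            indeg[t] += 1
            users[d].append(t)
    ready = deque(t for t in targets if indeg[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for u in users[t]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    if len(order) != len(targets):
        raise ValueError("dependency cycle detected")
    return order

# Link the app only after compiling its objects.
print(topo_order(["app", "main.o", "util.o"],
                 {"app": ["main.o", "util.o"]}))
```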
Another example is source control systems: again based on a DAG. It is used during merging, for example, to find the common ancestor of two revisions.
Well, many program optimization algorithms that compilers use are based on graphs (e.g., building the call graph, control flow analysis, lots of static analysis).
Many optimization problems are based on graphs. And since many problems are reducible to graph colouring and similar problems, many other problems are graph-based too.
I'm not sure I agree that graphs are the best way to represent every relation and I certainly try to avoid these "got a nail, let's find a hammer" approaches. Graphs often have poor memory representations and many algorithms are actually more efficient (in practice) when implemented with matrices, bitsets, and other things.
OCR. Picture a page of text scanned at an angle, with some noise in the image, where you must find the space between lines of text. One way is to make a graph of pixels, and find the shortest path from one side of the page to the other, where the difference in brightness is the distance between pixels.
This example is from the Algorithm Design Manual, which has lots of other real world examples of graph problems.
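A sketch of that seam search (illustrative only; `brightness` is a 2-D list of pixel values, each step moves one column right and may drift one row up or down, and the step cost is the brightness difference, so the cheapest left-to-right path hugs the uniform gap between lines):

```python
import heapq

def find_text_gap(brightness):
    """Dijkstra over the pixel grid: cheapest path from the left edge to
    the right edge, where edge cost is the brightness difference."""
    rows, cols = len(brightness), len(brightness[0])
    pq = [(0, r, 0) for r in range(rows)]        # start anywhere on the left
    heapq.heapify(pq)
    best = {(r, 0): 0 for r in range(rows)}
    while pq:
        cost, r, c = heapq.heappop(pq)
        if c == cols - 1:
            return cost                           # reached the right edge
        if cost > best.get((r, c), float("inf")):
            continue                              # stale queue entry
        for dr in (-1, 0, 1):
            nr = r + dr
            if 0 <= nr < rows:
                ncost = cost + abs(brightness[nr][c + 1] - brightness[r][c])
                if ncost < best.get((nr, c + 1), float("inf")):
                    best[(nr, c + 1)] = ncost
                    heapq.heappush(pq, (ncost, nr, c + 1))
    return None
```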
One popular example is garbage collection.
The collector starts with a set of references, then traverses all the objects they reference, then all the objects referenced there and so on. Everything it finds is added into a graph of reachable objects. All other objects are unreachable and collected.
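A toy sketch of that mark phase (the heap here is just a dict from each object to the objects it references):

```python
def reachable(roots, refs):
    """Walk references from the root set; anything not reached is garbage."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj not in marked:
            marked.add(obj)
            stack.extend(refs.get(obj, ()))
    return marked

heap = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}  # d is unrooted
live = reachable(["a"], heap)
print(live)                      # {'a', 'b', 'c'}
print(set(heap) - live)          # {'d'} would be collected
```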
To find out if two molecules can fit together. When developing drugs one is often interested in seeing if the drug molecules can fit into larger molecules in the body. The problem with determining whether this is possible is that molecules are not static. Different parts of the molecule can rotate around their chemical bonds, so that a molecule can change into quite a lot of different shapes.
Each shape can be said to represent a point in a space consisting of shapes. Solving this problem involves finding a path through this space. You can do that by creating a roadmap through the space, which is essentially a graph of legal shapes with edges saying which shapes can turn into each other. By using an A* graph search algorithm on this roadmap you can find a solution.
Okay that was a lot of babble that perhaps wasn't very understandable or clear. But my point was that graphs pop up in all kinds of problems.
Graphs are not data structures. They are mathematical representations of relations. Yes, you can think and theorize about problems using graphs, and there is a large body of theory about them. But when you need to implement an algorithm, you are choosing data structures to best represent the problem, not graphs. There are many data structures that represent general graphs, and even more for special kinds of graphs.
In your question, you mix these two things. The same theoretical solution may be in terms of graph, but practical solutions may use different data structures to represent the graph.
The following are based on graph theory:
Binary trees and other trees such as Red-black trees, splay trees, etc.
Linked lists
Anything that's modelled as a state machine (GUIs, network stacks, CPUs, etc)
Decision trees (used in AI and other applications)
Complex class inheritance
IMHO most of the domain models we use in normal applications are in some respect graphs. If you look at UML diagrams, you will notice that with a directed, labeled graph you could translate them almost directly into a persistence model. There are some examples of that over at Neo4j.
Social connections between people make an interesting graph example. I've tried to model these connections at the database level using a traditional RDBMS, but found it way too hard. I ended up choosing a graph database, and it was a great choice because it makes it easy to follow connections (edges) between people (nodes).
Graphs are great for managing dependencies.
I recently started to use the Castle Windsor Container, after inspecting the Kernel I found a GraphNodes property. Castle Windsor uses a graph to represent the dependencies between objects so that injection will work correctly. Check out this article.
I have also used simple graph theory to develop a plugin framework: each graph node represents a plugin, and once the dependencies have been defined I can traverse the graph to create a plugin load order.
I am planning to change the algorithm to implement Dijkstra's algorithm, so that each plugin is weighted with a specific version and a simple change will load only the latest version of each plugin.
I wish I had discovered this sooner. I like that quote: "Whenever someone gives you a problem, think graphs." I definitely think that's true.
Profiling and/or Benchmarking algorithms and implementations in code.
Anything that can be modelled as a foreign key in a relational database essentially defines edges between nodes in a graph.
Maybe that will help you think of examples, since most things are readily modelled in an RDBMS.
You could take a look at some of the examples in the Neo4j wiki,
http://wiki.neo4j.org/content/Domain_Modeling_Gallery
and the projects that Neo4j is used in (the known ones)
http://wiki.neo4j.org/content/Neo4j_In_The_Wild .
Otherwise, Recommender Algorithms are a good use for Graphs, see for instance PageRank, and other stuff at
https://github.com/tinkerpop/gremlin/wiki/pagerank
Analysing transaction serialisability in database theory.
You can utilise graphs anywhere you can map the problem domain's objects to nodes and the solution to the flow of control and/or data amongst the nodes.
Considering the fact that trees are connected acyclic graphs, there are even more areas where you can use graph theory.
Basically, nearly all common data structures - trees, lists, queues, etc. - can be thought of as types of graphs, some with different types of constraints.
In my experience, I have used graphs intensively in network flow problems, which arise in lots of areas like telecommunication network routing and optimisation, workload assignment, matching, supply chain optimisation and public transport planning.
Another interesting area is social network modelling, as a previous answer mentioned.
There are many more, like integrated circuit optimisation, etc.

Resources