Infomap community detection understanding - algorithm

I need an understandable description of the Infomap community detection algorithm. I read the papers, but they were not clear to me. My questions:
How does the algorithm basically work?
What do random walks have to do with it?
What is the map equation, and how exactly does it differ from modularity optimization? (There was an example given in the paper in Fig. 3, but I didn't get it.)
On their homepage, two improvements are given. The first is submodule movements and the second is single-node movements. Why are they used, and why are merged modules not separable?

The basic idea behind the InfoMap algorithm is to use community partitions of the graph as a Huffman code that compresses the information about a random walker exploring your graph.
Let's unpack what that means. The central object is a random walker exploring the network, with the probability that the walker transitions between two nodes given by its Markov transition matrix. At this point, we have effectively coded our network with an individual codeword for each node. However, in most real-world networks, we know that there are regions of the network such that once the random walker enters a region, it tends to stay there for a long time, and movements between the regions are relatively rare. This allows us to combine codewords into a two-level Huffman code: we can use a prefix code for each region, and then use a unique codeword for each node within a module, but we can reuse these node-level codewords across modules. The same intuition can be gathered by looking at street names: it would be crazy to have a unique street name for every street in the US. Instead, we use states and towns, and then specify a street name, allowing us to reuse street names between towns (how many Main Streets are there?).
Here is where the optimization algorithm comes onto the scene: when you use too few modules, you are effectively still back at the level of using an individual codeword for every node, but use too many modules, and the number of prefix codes becomes too large. So we need to find an optimal partition that assigns nodes to modules such that the information needed to compress the movement of our random walkers is minimized (equation 1 from their paper).
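To make the quality function concrete, here is a minimal sketch that evaluates the two-level map equation for a given partition. It assumes an undirected, unweighted graph, so that the walker's stationary visit rate of a node is simply its degree divided by twice the number of edges. The function names and the toy graph are made up for illustration and are not taken from the Infomap code.

```python
from collections import defaultdict
from math import log2


def plogp(x):
    """x * log2(x), with the convention 0 * log(0) = 0."""
    return x * log2(x) if x > 0 else 0.0


def map_equation(edges, module_of):
    """Description length (bits per step) of a random walk on an
    undirected, unweighted graph under the partition `module_of`."""
    total = 2 * len(edges)
    visit_node = defaultdict(float)    # stationary visit rate of each node
    exit_rate = defaultdict(float)     # rate at which the walker leaves a module
    for u, v in edges:
        visit_node[u] += 1 / total
        visit_node[v] += 1 / total
        if module_of[u] != module_of[v]:
            exit_rate[module_of[u]] += 1 / total
            exit_rate[module_of[v]] += 1 / total
    visit_module = defaultdict(float)  # total visit rate of each module
    for node, rate in visit_node.items():
        visit_module[module_of[node]] += rate

    q = sum(exit_rate.values())
    modules = list(visit_module)
    # Expanded form of L(M) = q * H(Q) + sum_i p_i * H(P_i).
    return (plogp(q)
            - 2 * sum(plogp(exit_rate[m]) for m in modules)
            - sum(plogp(rate) for rate in visit_node.values())
            + sum(plogp(exit_rate[m] + visit_module[m]) for m in modules))


# Toy graph: two triangles joined by a single bridge edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_module = {n: 0 for n in range(6)}
two_modules = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
singletons = {n: n for n in range(6)}

for name, part in [("one", one_module), ("two", two_modules), ("six", singletons)]:
    print(name, round(map_equation(edges, part), 4))
```

On this toy graph the two-triangle partition needs about 2.32 bits per step, against about 2.56 bits for a single module and about 4.56 bits when every node is its own module, which is the "too few versus too many modules" trade-off described above.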
The number of possible partitions grows super-exponentially in the number of nodes (it is given by the Bell numbers), so it's impossible to do a brute-force search. Instead, the authors leverage a variation of the Louvain method (originally designed for modularity maximization) to help them find appropriate partitions. The two "improvements" you ask about (question 4) are just heuristics that help explore partition space effectively: submodule movements check that we didn't create a module that was too large and should have been broken into smaller modules, while single-node movements allow individual nodes to shift between modules.
The InfoMap algorithm and Modularity are both instances of optimal community detection methods: they each have a quality function and then search the space of graph partitions to find the partition that optimizes that quality function. The difference is the quality function: InfoMap focuses on the information needed to compress the movement of the random walker, while Modularity defines modules based on edge density (more edges within a module than expected by chance).


Select relevant features with PCA and K-MEANS

I am trying to understand the PCA and K-Means algorithms in order to extract the relevant features from a set of features.
I don't know which branch of computer science studies these topics. There don't seem to be good resources on the internet, just some papers that I don't understand well.
I have CSV files of people's walks, composed as follows:
TIME, X, Y, Z, where these values are registered by the accelerometer
What I did
I transformed the dataset into a table in Python.
I used tsfresh, a Python library, to extract a feature vector from each walk. There are a lot of these features, 2k+ per walk.
I have to use PFA (Principal Feature Analysis) to select the relevant features from the set of feature vectors.
In order to do the last point, I have to reduce the dimension of the set of walk features with PCA (PCA makes the data different from the original because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here I have the first question:
What should the input of PCA look like? Should the rows be the walks and the columns the features, or vice versa, so that the rows are the features and the columns are the people's walks?
After I have reduced this data, I should use the K-Means algorithm on the reduced 'features' data. What should the input of K-Means look like? And what is the purpose of using this algorithm? All I know is that this algorithm is used to 'cluster' some data, so each cluster contains some 'points' based on some rule. What I did and think is:
If the input to PCA has the walks as rows and the features as columns, then for K-Means I should swap the columns with the rows, because that way each point is a feature (but this is not the original data with the features, just the reduced one, so I don't know). Then, for each cluster, I use the Euclidean distance to see which point is closest to the centroid and select that feature. So how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. How can I say that a point in the reduced data corresponds to a particular feature in the original set of features?
I know that what I am saying may not be correct, but I am trying to understand it. Can some of you help me? Am I on the right track? Thanks!
For the PCA, make sure you separate your understanding of the method the algorithm uses (eigenvectors and such) from the result. The result is a linear mapping from the original space A to a space A', where the dimension (the number of features in your case) is possibly smaller than in the original space A.
So the first feature/element in space A' is a linear combination of the features of A.
The row/column layout depends on the implementation, but if you use scikit-learn's PCA, the rows are the samples (your walks) and the columns are the features.
You can feed the PCA output, the A' space, to K-Means, and it will cluster the points in a space of usually reduced dimension.
Each point will be part of a cluster, and the idea is that if you ran K-Means on A, you would probably end up with the same or similar clusters as with A'. Computationally, A' is a lot cheaper. You now have a clustering on both A' and A, since we agree that points that are similar in A' are also similar in A.
The number of clusters is difficult to answer; if you don't know anything, look up the elbow method. But if you just want to get a sense of the different types of things you have, I would go for 3 to 8 clusters and not many more, then compare the 2-3 points closest to each center, and you have something digestible. The number of clusters can also be larger than the number of features. E.g., if we want to know the densest areas in some (2D) region, we can easily have 50 clusters to get a sense of where 50 cities could be. Here the number of clusters is way higher than the dimension of the space, and it makes sense.
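For the feature-selection part of the question, the following is a rough sketch of one common reading of Principal Feature Analysis, which differs from clustering the walks themselves: fit PCA with walks as rows, treat each row of the transposed component matrix as the representation of one original feature, cluster those rows with K-Means, and keep the feature closest to each centroid. The function name, the random data and the parameter values are placeholders, not part of tsfresh or scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def principal_feature_analysis(X, n_components, n_selected, seed=0):
    """Return sorted column indices of X (rows = walks, columns = features)."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(X_scaled)
    # Row i of `loadings` describes original feature i in the reduced space.
    loadings = pca.components_.T          # shape: (n_features, n_components)
    km = KMeans(n_clusters=n_selected, n_init=10, random_state=seed).fit(loadings)
    selected = []
    for c in range(n_selected):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(loadings[members] - km.cluster_centers_[c], axis=1)
        # Keep the cluster member closest to the centroid.
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected)


# Placeholder data: 40 walks with 12 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
chosen = principal_feature_analysis(X, n_components=5, n_selected=4)
print(chosen)  # indices into the original feature columns
```

Because row i of the loading matrix always belongs to original feature i, the returned indices map straight back to the tsfresh column names, and the number of clusters is simply the number of features you want to keep.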

Exploration Algorithm

Massively edited this question to make it easier to understand.
Given an environment with arbitrary dimensions and arbitrary positioning of an arbitrary number of obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the four cardinal directions of NSEW, one cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (yellow guy) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.
What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I would seek to maximise the sum of the utility of all cells SEEN (not visited) divided by the path length, although if that is too complex for any suitable algorithm then the number of cells seen will suffice. There is a maximum path length, but it is generally in the hundreds or higher. (The actual test environments used on my agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would thus increase accordingly.)
I consider BFS and DFS to be intractable, A* to be non-optimal given a lack of suitable heuristics, and Dijkstra's inappropriate for generating a single unbroken path. Is there any algorithm you can think of? Also, I need help with loop detection, as this is the first time I have allowed revisits and I've never done it before.
One approach I have considered is to reduce the map into a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:
In the resultant tree, the agent can go from a node to any adjacent nodes that are 0-1 turn away at intersections. This is as far as my thinking has gotten right now. A solution generated using this tree may not be optimal, but it should at least be near-optimal with much fewer cells being processed by the algorithm, so if that would make the algorithm more likely to be tractable, then I guess that is an acceptable trade-off. I'm still stuck with thinking how exactly to generate a path for this however.
Your problem is very similar to a canonical Reinforcement Learning (RL) problem, the Grid World. I would formalize it as a standard Markov Decision Process (MDP) and use any RL algorithm to solve it.
The formalization would be:
States s: your NxM discrete grid.
Actions a: UP, DOWN, LEFT, RIGHT.
Reward r: the value of the cells that the agent can see from the destination cell s', i.e. r(s,a,s') = sum(value(seen(s'))).
Transition function: P(s' | s, a) = 1 if s' is not out of the boundaries or a black cell, 0 otherwise.
Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has a cost of one, so you could subtract 1 from the immediate reward r at each time step, but this would not add anything since you will already average by the number of steps.
Since the problem is discrete the policy could be a simple softmax (or Gibbs) distribution.
As a solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if your grid is too big (and you said that there is no limit), I would suggest policy search algorithms, like policy gradient or relative entropy (although they guarantee convergence only to local optima). You can find something about Q-learning basically everywhere on the Internet. For a recent survey on policy search I suggest this.
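As a rough illustration of the formalization above, here is a small tabular Q-learning sketch. Everything concrete in it is an assumption made for the example: the 4x4 map, the sight radius of one cell, blocked moves leaving the agent in place, and a discount of 0.9 instead of the undiscounted average-reward setting. Like the formalization, it also rewards a cell every time it is seen again, so it is not a complete answer to the "seen at least once" objective.

```python
import random

ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

# Hypothetical map: cell utilities, with None marking an obstacle.
GRID = [
    [0, 0, 0, 1],
    [0, None, 0, 2],
    [0, 0, 0, 5],
    [1, 0, 0, 9],
]
ROWS, COLS = len(GRID), len(GRID[0])
SIGHT = 1  # assumed sight radius, measured in cells (including diagonals)


def is_free(cell):
    r, c = cell
    return 0 <= r < ROWS and 0 <= c < COLS and GRID[r][c] is not None


def step(state, action):
    """Deterministic transition: a blocked move leaves the agent in place."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if is_free(nxt) else state


def reward(state):
    """Sum of the utilities of all free cells visible from `state`."""
    r0, c0 = state
    return sum(GRID[r][c]
               for r in range(r0 - SIGHT, r0 + SIGHT + 1)
               for c in range(c0 - SIGHT, c0 + SIGHT + 1)
               if is_free((r, c)))


def q_learning(episodes=500, horizon=20, alpha=0.1, gamma=0.9,
               epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) if is_free((r, c))
         for a in ACTIONS}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(horizon):
            if rng.random() < epsilon:            # explore
                action = rng.choice(list(ACTIONS))
            else:                                 # exploit
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt = step(state, action)
            target = reward(nxt) + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return q


q_table = q_learning()
print(max(ACTIONS, key=lambda a: q_table[((0, 0), a)]))  # greedy first move
```

To respect the "seen, not visited" objective more faithfully, the state would have to include the set of cells already seen, which is exactly what makes the table explode on large grids and motivates the policy search methods mentioned above.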
The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and will try to maximize the cumulative long term reward as described by your MDP. So usually you initialize your policy with a high exploration (e.g., a complete random policy) and by trial and error the algorithm will make it deterministic and converge to the optimal one (however, sometimes also a stochastic policy is optimal).
The main difference between all the RL algorithms is how they perform the update of the policy at each iteration and manage the tradeoff exploration-exploitation (how much should I explore VS how much should I exploit the information I already have).
As suggested by Demplo, you could also use Genetic Algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation...).
I have also tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the maximum number of steps, and whether the initial position is fixed or random), I can test them more precisely.

A* Algorithm for very large graphs, any thoughts on caching shortcuts?

I'm writing a courier/logistics simulation on OpenStreetMap maps and have realised that the basic A* algorithm as pictured below is not going to be fast enough for large maps (like Greater London).
The green nodes correspond to ones that were put in the open set/priority queue and due to the huge number (the whole map is something like 1-2 million), it takes 5 seconds or so to find the route pictured. Unfortunately 100ms per route is about my absolute limit.
Currently, the nodes are stored in both an adjacency list and also a spatial 100x100 2D array.
I'm looking for methods where I can trade off preprocessing time, space and if needed optimality of the route, for faster queries. The straight-line Haversine formula for the heuristic cost is the most expensive function according to the profiler - I have optimised my basic A* as much as I can.
For example, I was thinking if I chose an arbitrary node X from each quadrant of the 2D array and run A* between each, I can store the routes to disk for subsequent simulations. When querying, I can run A* search only in the quadrants, to get between the precomputed route and the X.
Is there a more refined version of what I've described above or perhaps a different method I should pursue. Many thanks!
For the record, here are some benchmark results for arbitrarily weighting the heuristic cost and computing the path between 10 pairs of randomly picked nodes:
Weight   AvgDist (ratio)   Time (ms)
1        1                 1461.2
1.05     1                 1327.2
1.1      1                 900.7
1.2      1.019658848       196.4
1.3      1.027619169       53.6
1.4      1.044714394       33.6
1.5      1.063963413       25.5
1.6      1.071694171       24.1
1.7      1.084093229       24.3
1.8      1.092208509       22
1.9      1.109188175       22.5
2        1.122856792       18.2
2.2      1.131574742       16.9
2.4      1.139104895       15.4
2.6      1.140021962       16
2.8      1.14088128        15.5
3        1.156303676       16
4        1.20256964        13
5        1.19610861        12.9
Surprisingly, increasing the coefficient to 1.1 cut the execution time by almost 40% (from 1461.2 ms to 900.7 ms) whilst keeping the same route.
You should be able to make it much faster by trading off optimality. See Admissibility and optimality on wikipedia.
The idea is to use an epsilon value which will lead to a solution no worse than 1 + epsilon times the optimal path, but which will cause fewer nodes to be considered by the algorithm. Note that this does not mean that the returned solution will always be 1 + epsilon times the optimal path. This is just the worst case. I don't know exactly how it would behave in practice for your problem, but I think it is worth exploring.
You are given a number of algorithms that rely on this idea on wikipedia. I believe this is your best bet to improve the algorithm and that it has the potential to run in your time limit while still returning good paths.
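The idea above can be sketched as a weighted A*, where the heuristic is multiplied by a weight w = 1 + epsilon. This is a minimal illustration on a made-up four-node graph with planar coordinates, not the asker's implementation. It relies on the straight-line heuristic being consistent, in which case closed nodes need not be reopened for the cost bound of w times the optimum to hold.

```python
import heapq
from math import hypot, inf


def weighted_a_star(graph, coords, start, goal, weight=1.0):
    """Return (cost, path). With weight > 1 the search expands fewer nodes
    but may return a path up to `weight` times longer than the optimum."""
    def h(node):
        (x1, y1), (x2, y2) = coords[node], coords[goal]
        return hypot(x1 - x2, y1 - y2)

    best = {start: 0.0}
    parent = {start: None}
    frontier = [(weight * h(start), start)]
    closed = set()
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return best[goal], path[::-1]
        if node in closed:
            continue                      # stale queue entry
        closed.add(node)
        for nbr, cost in graph[node]:
            if nbr in closed:
                continue
            g = best[node] + cost
            if g < best.get(nbr, inf):
                best[nbr] = g
                parent[nbr] = node
                heapq.heappush(frontier, (g + weight * h(nbr), nbr))
    return inf, []


# Hypothetical toy network: adjacency lists of (neighbour, edge length).
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (1, 1)}
graph = {
    "A": [("B", 1.0), ("D", 1.5)],
    "B": [("A", 1.0), ("C", 1.0)],
    "C": [("B", 1.0), ("D", 1.5)],
    "D": [("A", 1.5), ("C", 1.5)],
}
print(weighted_a_star(graph, coords, "A", "C", weight=1.0))
print(weighted_a_star(graph, coords, "A", "C", weight=1.5))
```

Setting `weight=1.0` recovers plain A*, so the trade-off measured in the benchmark table can be reproduced by varying a single parameter.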
Since your algorithm does deal with millions of nodes in 5 seconds, I assume you also use binary heaps for the implementation, correct? If you implemented them manually, make sure they are implemented as simple arrays and that they are binary heaps.
There are specialist algorithms for this problem that do a lot of pre-computation. From memory, the pre-computation adds information to the graph that A* uses to produce a much more accurate heuristic than straight-line distance. Wikipedia gives the names of a number of these methods and says that Hub Labelling is the leader. A quick search also turns up an older method that uses A*.
Do you really need to use Haversine? To cover London, I would have thought you could have assumed a flat earth and used Pythagoras, or stored the length of each link in the graph.
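As a sketch of the flat-earth suggestion, the snippet below compares Haversine with a local equirectangular approximation in which longitudes are scaled by the cosine of a reference latitude before applying Pythagoras. The reference latitude, the Earth radius constant and the two sample coordinates are assumptions chosen for the example.

```python
from math import asin, cos, hypot, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def make_flat_distance(ref_lat):
    """Return a cheap distance function valid near latitude `ref_lat`."""
    k = cos(radians(ref_lat))  # computed once, not once per call

    def flat_m(lat1, lon1, lat2, lon2):
        dx = radians(lon2 - lon1) * k * EARTH_RADIUS_M
        dy = radians(lat2 - lat1) * EARTH_RADIUS_M
        return hypot(dx, dy)

    return flat_m


flat_m = make_flat_distance(51.5)          # roughly the latitude of London
a = (51.5074, -0.1278)                     # sample point, central London
b = (51.5555, -0.2795)                     # sample point, north-west London
exact = haversine_m(*a, *b)
approx = flat_m(*a, *b)
print(exact, approx, abs(exact - approx) / exact)
```

If node positions are converted to planar x/y metres once at load time, the heuristic inside the search loop shrinks to a single `hypot` call with no trigonometry at all.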
There's a really great article that Microsoft Research wrote on the subject:
The original paper is hosted here (PDF):
Essentially there's a few things you can try:
Start from both the source and the destination (a bidirectional search). This helps to minimize the amount of wasted work that you'd perform when traversing from the source outwards towards the destination.
Use landmarks and highways. Essentially, find some positions in each map that are commonly taken paths and perform some pre-calculation to determine how to navigate efficiently between those points. If you can find a path from your source to a landmark, then to other landmarks, then to your destination, you can quickly find a viable route and optimize from there.
Explore algorithms like the "reach" algorithm. This helps to minimize the amount of work that you'll do when traversing the graph by minimizing the number of vertices that need to be considered in order to find a valid route.
GraphHopper does two more things to get fast, non-heuristic and flexible routing (note: I'm the author and you can try it online here)
A not so obvious optimization is to avoid a 1:1 mapping of OSM nodes to internal nodes. Instead, GraphHopper uses only junctions as nodes and saves roughly 1/8th of traversed nodes.
It has efficient implementations of A*, Dijkstra and, e.g., one-to-many Dijkstra, which make a route through the whole of Germany possible in under 1s. The (non-heuristic) bidirectional version of A* makes this even faster.
So it should be possible to get you fast routes for Greater London.
Additionally, the default mode is the speed mode, which makes everything an order of magnitude faster (e.g. 30ms for Europe-wide routes) but less flexible, as it requires preprocessing (Contraction Hierarchies). If you don't like this, just disable it and further fine-tune the included streets for cars, or probably better, create a new profile for trucks, e.g. exclude service streets and tracks, which should give you a further 30% boost. And as with any bidirectional algorithm, you could easily implement a parallel search.
I think it's worth working out your idea with "quadrants". More strictly, I'd call it a low-resolution route search.
You may pick X connected nodes that are close enough to each other and treat them as a single low-resolution node. Divide your whole graph into such groups, and you get a low-resolution graph. This is the preparation stage.
In order to compute a route from source to target, first identify the low-resolution nodes they belong to and find the low-resolution route. Then improve your result by finding the route on the high-resolution graph, restricting the algorithm to nodes that belong to the low-resolution nodes of the low-resolution route (optionally you may also consider neighboring low-resolution nodes up to some depth).
This may also be generalized to multiple resolutions, not just high/low.
At the end you should get a route that is close enough to optimal. It's locally optimal, but may be somewhat worse than optimal globally by some extent, which depends on the resolution jump (i.e. the approximation you make when a group of nodes is defined as a single node).
There are dozens of A* variations that may fit the bill here. You have to think about your use cases, though.
Are you memory- (and also cache-) constrained?
Can you parallelize the search?
Will your algorithm implementation be used in one location only (e.g. Greater London and not NYC or Mumbai or wherever)?
There's no way for us to know all the details that you and your employer are privy to. Your first stop thus should be CiteSeer or Google Scholar: look for papers that treat pathfinding with the same general set of constraints as you.
Then downselect to three or four algorithms, do the prototyping, test how they scale up and finetune them. You should bear in mind you can combine various algorithms in the same grand pathfinding routine based on distance between the points, time remaining, or any other factors.
As has already been said, given the small scale of your target area, dropping Haversine is probably your first step, saving precious time on expensive trig evaluations. NOTE: I do not recommend using Euclidean distance on raw lat/lon coordinates. Reproject your map into, e.g., a transverse Mercator projection centered near the middle of the area and use Cartesian coordinates in yards or meters!
Precomputing is the second step, and changing compilers may be an obvious third idea (switch to C or C++).
Extra optimization steps may include getting rid of dynamic memory allocation, and using efficient indexing for search among the nodes (think R-tree and its derivatives/alternatives).
I worked at a major Navigation company, so I can say with confidence that 100 ms should get you a route from London to Athens even on an embedded device. Greater London would be a test map for us, as it's conveniently small (easily fits in RAM - this isn't actually necessary)
First off, A* is entirely outdated. Its main benefit is that it "technically" doesn't require preprocessing. In practice, you need to pre-process an OSM map anyway so that's a pointless benefit.
The main technique to give you a huge speed boost is arc flags. If you divide the map in say 5x6 sections, you can allocate 1 bit position in a 32 bits integer for each section. You can now determine for each edge whether it's ever useful when traveling to section {X,Y} from another section. Quite often, roads are bidirectional and this means only one of the two directions is useful. So one of the two directions has that bit set, and the other has it cleared. This may not appear to be a real benefit, but it means that on many intersections you reduce the number of choices to consider from 2 to just 1, and this takes just a single bit operation.
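The query-time side of arc flags can be sketched in a few lines. The preprocessing that decides which flags to set on each edge is the expensive part and is not shown here; the 5x6 grid, the edge tuple layout and the sample junction are illustrative assumptions.

```python
GRID_COLS, GRID_ROWS = 5, 6  # 30 sections, so one flag word fits in 32 bits


def section_bit(col, row):
    """Bit mask for the map section at (col, row)."""
    if not (0 <= col < GRID_COLS and 0 <= row < GRID_ROWS):
        raise ValueError("section outside the grid")
    return 1 << (row * GRID_COLS + col)


def useful_edges(outgoing, target_bit):
    """Keep only the edges flagged as useful for reaching the target section.

    `outgoing` holds (neighbour, cost, flags) tuples; `flags` is assumed to
    have been filled in by an offline preprocessing step.
    """
    return [(nbr, cost) for nbr, cost, flags in outgoing if flags & target_bit]


# Hypothetical junction with two outgoing edges and precomputed flags.
target = section_bit(3, 2)
outgoing = [
    ("north", 120.0, section_bit(3, 2) | section_bit(4, 2)),
    ("south", 95.0, section_bit(0, 5)),
]
print(useful_edges(outgoing, target))
```

Inside Dijkstra or A*, this filter would be applied when relaxing the edges of each expanded node, so pruned edges never reach the priority queue.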
Usually A* struggles with memory consumption rather than with time.
However, I think it could be useful to first compute only with nodes that are part of "big streets"; you would usually choose a highway over a tiny alley.
I guess you may already use this in your weight function, but you can be faster if you use a priority queue to decide which node to test next.
You could also try reducing the graph to only the nodes that are part of low-cost edges, and then find a way from the start and from the end to the closest of these nodes.
So you have two short paths: from the start to the "big street" and from the "big street" to the end.
You can now compute the best path between the two nodes that are part of the "big streets" in the reduced graph.
Old question, but yet:
Try heaps other than the binary heap. The heap with the best asymptotic complexity is definitely the Fibonacci heap, and its wiki page has a nice overview.
Note that a binary heap has simpler code and is implemented over an array, and array traversal is predictable, so modern CPUs execute binary heap operations much faster.
However, given a big enough dataset, other heaps will win over the binary heap because of their complexities.
This question seems to have a big enough dataset.

Graph Simplification Algorithm Advice Needed

I need to take a 2D graph of n points and reduce it to r points (where r is a specific number less than n). For example, I may have two datasets with slightly different numbers of total points, say 1021 and 1001, and I'd like to force both datasets to have 1000 points. I am aware of a couple of simplification algorithms: Lang simplification and Douglas-Peucker. I have used Lang in a previous project with slightly different requirements.
The specific properties of the algorithm I am looking for are:
1) must preserve the shape of the line
2) must allow me reduce dataset to a specific number of points
3) is relatively fast
This post is a discussion of the merits of the different algorithms. I will post a second message for advice on implementations in Java or Groovy (why reinvent the wheel).
I am concerned about requirement 2 above. I am not an expert enough in these algorithms to know whether I can dictate the exact number of output points. The implementation of Lang that I've used took lookAhead, tolerance and the array of Points as input, so I don't see how to dictate the number of points in the output. This is a critical requirement of my current needs. Perhaps this is due to the specific implementation of Lang we had used, but I have not seen a lot of information on Lang on the web. Alternatively we could use Douglas-Peucker but again I am not sure if the number of points in the output can be specified.
I should add I am not an expert on these types of algorithms or any kind of math wiz, so I am looking for mere mortal type advice :) How do I satisfy requirements 1 and 2 above? I would sacrifice performance for the right solution.
I think you can adapt Douglas-Peucker quite straightforwardly. Adapt the recursive algorithm so that rather than producing a list it produces a tree mirroring the structure of the recursive calls. The root of the tree will be the single-line approximation P0-Pn; the next level will represent the two-line approximation P0-Pm-Pn, where Pm is the point between P0 and Pn which is furthest from P0-Pn; the next level (if full) will represent a four-line approximation, etc. You can then trim the tree either on the basis of depth or on the basis of the distance of the inserted point from the parent line.
Edit: in fact, if you take the latter approach you don't need to build a tree. Instead you populate a priority queue where the priority is given by the distance of the inserted point from the parent line. Then when you've finished the queue tells you which points to remove (or keep, according to the order of the priorities).
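A minimal sketch of the priority-queue variant described above: start from the two endpoints and repeatedly insert the interior point that is farthest from its current parent segment, until exactly r points are kept. The function names are made up, and the distance is measured to the segment rather than to the infinite line, which is one common choice among several.

```python
import heapq
from math import hypot


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_to_count(points, r):
    """Douglas-Peucker style simplification that keeps exactly r points."""
    n = len(points)
    if r >= n:
        return list(points)
    if r < 2:
        raise ValueError("need to keep at least the two endpoints")
    keep = {0, n - 1}
    heap = []

    def push_farthest(i, j):
        # Queue the interior point of (i, j) farthest from segment i-j.
        best = None
        for k in range(i + 1, j):
            d = point_segment_distance(points[k], points[i], points[j])
            if best is None or d > best[0]:
                best = (d, k)
        if best is not None:
            heapq.heappush(heap, (-best[0], best[1], i, j))

    push_farthest(0, n - 1)
    while len(keep) < r and heap:
        _, k, i, j = heapq.heappop(heap)   # globally farthest candidate
        keep.add(k)
        push_farthest(i, k)
        push_farthest(k, j)
    return [points[k] for k in sorted(keep)]


line = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0), (5, 0.1), (6, 0)]
print(simplify_to_count(line, 3))
```

Because points are added in order of decreasing error, stopping at r points gives the exact output size that a tolerance-based run cannot guarantee.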
You can find my C++ implementation and article on Douglas-Peucker simplification here and here. I also provide a modified version of the Douglas-Peucker simplification that allows you to specify the number of points of the resulting simplified line. It uses a priority queue as mentioned by 'Peter Taylor'. It's a lot slower though, so I don't know if it would satisfy the 'is relatively fast' requirement.
I'm planning on providing an implementation for Lang simplification (and several others). Currently I don't see any easy way to adjust Lang to reduce to a fixed point count. If you could live with a less strict requirement, 'must allow me to reduce the dataset to an approximate number of points', then you could use an iterative approach. Guess an initial value for the lookahead: point count / desired point count. Then slowly increase the lookahead until you approximately hit the desired point count.
I hope this helps.
P.S.: I just remembered something, you could also try the Visvalingam-Whyatt algorithm. In short:
- compute the triangle area for each point with its direct neighbors
- sort these areas
- remove the point with the smallest area
- update the areas of its neighbors
- continue until the desired number of points remains
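The steps above can be sketched as follows. For brevity this version recomputes every triangle area after each removal instead of keeping a sorted structure and updating only the two neighbors, so it is quadratic rather than fast; the function names are invented for the example.

```python
def triangle_area(a, b, c):
    """Area of the triangle spanned by three 2D points."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (c[0] - a[0]) * (b[1] - a[1])) / 2


def visvalingam_whyatt(points, r):
    """Drop the least significant point until only r points remain.

    The two endpoints are always kept, so at least two points survive.
    """
    pts = list(points)
    while len(pts) > max(r, 2):
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        smallest = areas.index(min(areas)) + 1   # index into pts
        del pts[smallest]
    return pts


line = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0), (5, 0.1), (6, 0)]
print(visvalingam_whyatt(line, 3))
```

A production version would typically keep the areas in a min-heap and update only the two neighbors of the removed point, which is where the "sort" and "update" steps of the list come in.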

Checking if a graph is random using the Erdős–Rényi model?

Given some graph, I would like to determine how likely it is that it was generated randomly. I was told that a comparison to the Erdős–Rényi model was a good way to get this information, but I can't quite figure out how to do that.
Any advice?
The simplest way would probably be to compare the expected number of links with what you observed in the given graph. A slightly smarter method would be to examine the degree distribution. Erdős–Rényi graphs have a binomial degree distribution, while real-world networks often have heavy-tailed (roughly power-law) degree distributions.
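One crude way to act on the degree-distribution idea is to compare the observed variance of the degrees with the variance a binomial distribution would predict. This is only a sketch under the assumption that p is estimated from the graph's own edge density; a proper test would compare the whole distribution, and the helper name is made up.

```python
from math import comb


def degree_dispersion(n, edges):
    """Return (observed, expected) degree variance for an undirected graph
    on nodes 0..n-1, where `expected` assumes degrees ~ Binomial(n-1, p)."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    p_hat = len(edges) / comb(n, 2)          # estimated edge probability
    mean = sum(degree) / n
    observed = sum((d - mean) ** 2 for d in degree) / n
    expected = (n - 1) * p_hat * (1 - p_hat)
    return observed, expected


# A star graph: one hub connected to four leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
observed, expected = degree_dispersion(5, star)
print(observed, expected)
```

An observed variance far above the binomial value hints at hubs and therefore at a heavier tail than G(n,p) would produce; on a graph this small the comparison is only illustrative.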
It might also be easier to test if you had an idea as to what other kinds of models were being used to generate the graph.
You can have a look at the ergm package for R. Although you might not be able to say with 100% certainty that your observed network was produced by a random process, you will be able to assess the likelihood that it was produced by random or non-random partner selection processes. ergm has a function called gof, which stands for goodness-of-fit; it compares your observed network with simulated random networks and looks at network statistics such as the geodesic distance distribution, the edgewise shared partner distribution, the degree distribution and the triad census distribution. This will allow you to make an informed decision about whether you consider your network to be random or not.
You will not be able to say whether a single graph was generated randomly. If the generating algorithm is random, then you have to check for randomness in the distribution of edges. But you will need many instances generated by that algorithm. Better check the notion of randomness in mathematics, cryptography and information theory. [or maybe you want to start with RFC 1750]
The Erdős–Rényi model basically states that you take a number n of nodes and every possible edge exists with probability p [the G(n,p) model]. Thus from p you can derive the expected number of edges and the deviation from this expectation. If a significant fraction of the graphs is within one standard deviation of this expectation, you still cannot state that your algorithm is random, but you have at least uncovered one feature: the expected number of edges.
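The expected edge count and its deviation under G(n,p) follow directly from the binomial distribution over the n(n-1)/2 possible edges. A small sketch, where p has to be supplied from outside (estimating it from the same graph would make the comparison trivial) and the function name is invented:

```python
from math import comb, sqrt


def edge_count_z_score(n, p, observed_edges):
    """Standardized deviation of the observed edge count from G(n, p)."""
    pairs = comb(n, 2)                    # number of possible edges
    expected = pairs * p
    std = sqrt(pairs * p * (1 - p))
    return (observed_edges - expected) / std


# 100 nodes with p = 0.1 give 495 expected edges.
print(edge_count_z_score(100, 0.1, 495))
print(edge_count_z_score(100, 0.1, 540))
```

Applied to many graphs from the same generator, the fraction of z-scores within plus or minus one gives the "within standard deviation" check described above.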
But again, without having a lot of states (graphs, intermediary graph generation steps or similar) you will be lost there. Say, I give you a number: 4. Is it generated randomly or not?
