What algorithms exist for processing a large graph on a regular desktop computer (say, 8-16 GB of RAM)?
I need to process a fairly large graph (to calculate PageRank) that does not fit entirely into main memory under these conditions.
I would like to know what algorithms exist for this, or in which direction it is best to start reading. As I understand it, graph partitioning algorithms could help here, but it is not clear how, given that the whole graph cannot be built in the program at once.
Perhaps there are algorithms that calculate PageRank for each part of the graph separately and then combine the results.
UPD:
To be more specific: the task is to calculate PageRank on a large graph. The computation is done in a Python program.
The graph is built from the data using networkx, and the PageRank calculation is also done with networkx. The problem is the RAM limitation: the entire graph does not fit into memory.
So I wonder whether there are algorithms that would let me calculate PageRank on graphs (subgraphs?) smaller than the original one.
Generally speaking, if a graph is too large to fit into memory, it has to be split into multiple partitions.
Suppose the vertices fit into memory and the edges reside on disk. The program then loads one partition of the edges into memory at a time, updates the PageRank values, and loads the next partition. X-Stream gives a good solution for this case: http://sigops.org/s/conferences/sosp/2013/papers/p472-roy.pdf
A more complicated case is when neither the vertices nor the edges fit into memory; then both need to be loaded into memory many times. GridGraph gives a good solution for this case: https://www.usenix.org/system/files/conference/atc15/atc15-paper-zhu.pdf
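For the first case (rank vector in RAM, edges streamed from disk), a minimal sketch of the power iteration might look like the following. It assumes a hypothetical edge-list file with one "src dst" pair of integer IDs per line and nodes numbered 0..n-1; the out-degrees are precomputed with one extra pass over the same file.

```python
# Sketch: PageRank by power iteration with the rank vector in RAM and the
# edge list streamed from disk (the simpler case described above).
# Assumed input format (not from the original post): one "src dst" pair of
# integer node IDs per line, nodes numbered 0..n-1.
import numpy as np

def pagerank_out_of_core(edge_file, n, out_degree, d=0.85, iters=50):
    """n: number of nodes; out_degree: numpy array of out-degrees,
    precomputed with one extra pass over the edge file."""
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_rank = np.full(n, (1.0 - d) / n)
        # spread the rank of dangling nodes (out-degree 0) uniformly
        dangling = rank[out_degree == 0].sum()
        new_rank += d * dangling / n
        with open(edge_file) as f:
            for line in f:                      # stream edges, never load them all
                src, dst = map(int, line.split())
                new_rank[dst] += d * rank[src] / out_degree[src]
        rank = new_rank
    return rank
```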
I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I take it you are a developer without much experience in data science?
There are a number of clustering algorithms, each with its advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than it might seem.
I assume that N is small enough that you can store a matrix of N^2 float values in RAM? If so, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate it for all N^2 pairs and store the results in a matrix (it is symmetric, so only half of it needs to be stored). Make sure your similarity measure assigns a special value, such as 0 or infinity, to pairs of images whose similarity is below the X% threshold (which one depends on whether you treat the function as a similarity or as a distance). I think the cleanest solution is to assign 1 to pairs whose similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, say, a depth-first search or any other graph traversal; the vertices it reaches form your first cluster. Then take the first unvisited vertex and repeat the traversal. Of course, you can store the graph as an adjacency list to save memory.
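A minimal sketch of that procedure, assuming `similarity` is the measure you already know how to write and `threshold` is your X:

```python
# Sketch: threshold the pairwise similarities into an adjacency list and read
# the clusters off as connected components, using an iterative depth-first
# search. `similarity` and `threshold` are supplied by you.
def cluster_by_threshold(images, similarity, threshold):
    n = len(images)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):               # only the upper triangle
            if similarity(images[i], images[j]) > threshold:
                adj[i].append(j)
                adj[j].append(i)

    clusters, visited = [], set()
    for start in range(n):
        if start in visited:
            continue
        stack, component = [start], []
        visited.add(start)
        while stack:                            # iterative DFS
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in visited:
                    visited.add(w)
                    stack.append(w)
        clusters.append(component)              # one connected component = one cluster
    return clusters
```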
This algorithm assumes that you do not care exactly how similar the images are, or which pairs are more similar than others, only whether they are similar enough (i.e., the similarity measure exceeds the given threshold).
Unfortunately, in cluster analysis it is common that 100% of the possible pairs have to be computed. You can save some distance computations by using fancy data structures for k-nearest-neighbor search, but you have to make sure your similarity measure satisfies the triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time; at the top all images are in one cluster, and you have to cut the dendrogram at the proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I found a cluster where the distance is below the threshold. In the worst case there is no such cluster, but then I end up at a leaf of the dendrogram, which means that element is in a cluster of its own.
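For reference, the same "stop merging once the distance exceeds a threshold" behaviour can be expressed directly with scipy's hierarchical clustering; a sketch under the assumption that each image is encoded as a boolean layer-presence vector (the metric and linkage choices below are illustrative, not what the original poster necessarily used):

```python
# Sketch: cut a scipy dendrogram at a distance threshold, so that images whose
# merge distance exceeds max_distance end up in singleton clusters.
# Assumes `vectors` is a boolean (n, d) layer-presence matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def threshold_clusters(vectors, max_distance):
    dists = pdist(vectors, metric="jaccard")    # condensed pairwise distances
    tree = linkage(dists, method="average")     # agglomerative clustering
    # merges whose distance exceeds max_distance are not applied,
    # so poorly matched images stay in clusters of their own
    return fcluster(tree, t=max_distance, criterion="distance")
```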
I need to find the optimal path connecting two planar points. I'm given a function that determines the maximal advance velocity, which depends on both the location and the time.
My solution is based on Dijkstra's algorithm. First I cover the plane with a 2D lattice, considering only discrete points. Points are connected to their neighbors up to a specified order, to get sufficient direction resolution. Then I find the best path by (a sort of) Dijkstra's algorithm. Next I improve the resolution/quality of the found path: I increase the lattice density and neighbor connectivity order, while restricting the search to points close enough to the already-found path. This may be repeated until the needed resolution is achieved.
This works generally well, but I'd nevertheless like to improve the overall algorithm performance. I've implemented several tricks, such as variable lattice density and neighbor connectivity order, based on the "smoothness" of the price function. However, I believe there's potential to improve Dijkstra's algorithm itself (for my specific graph), which I haven't fully realized yet.
First let's agree on the terminology. I split all the lattice points into 3 categories:
cold - points that have not been reached by the algorithm.
warm - points that are reached, but not fully processed yet (i.e. have potential for improvement)
stable - points that are fully processed.
At each step, Dijkstra's algorithm picks the "cheapest" warm lattice point and then tries to improve the price of its neighbors. Because of the nature of my graph, I get a kind of cloud of stable points surrounded by a thin layer of warm points. At each step a warm point on the cloud perimeter is processed, added to the stable cloud, and the warm perimeter is (potentially) expanded.
The problem is that warm points processed consecutively by the algorithm are usually spatially (hence topologically) unrelated. A typical warm perimeter consists of hundreds of thousands of points. At each step the next warm point to process is pseudo-random (spatially), so there's virtually no chance that two related points are processed one after another.
This creates a problem with CPU cache utilization. At each step the CPU deals with a pseudo-random memory location. Since there's a large number of warm points, all the relevant data may not fit in the CPU cache (it's on the order of tens to hundreds of MB).
Well, this is indeed an implication of Dijkstra's algorithm. The whole idea is explicitly to pick the cheapest warm point, regardless of its other properties.
However, it's intuitively obvious that points on one side of a big cloud perimeter have no effect on points on the other side (in our specific case), so there's no problem with swapping their processing order.
Hence I thought about ways of "adjusting" the warm-point processing order without compromising the algorithm in general. I considered several ideas, such as dividing the plane into blocks and partially solving them independently until some criterion is met, meaning their solutions may start to interfere. Alternatively, one could ignore the interference and allow "re-solving" (i.e., transitions from stable back to warm).
However, so far I have not found a rigorous method.
Are there any ideas how to do this? Perhaps it's a known problem, with existing research and (hopefully) solutions?
Thanks in advance. And sorry for the long question.
What you're describing is the motivation behind the A* search algorithm, a modification of Dijkstra's algorithm that can dramatically improve the runtime by guiding the search in a direction that is likely to pick points that keep getting closer and closer to the destination. A* never does any more work than a naive Dijkstra's implementation, and typically tends to expand out nodes that are clustered on the frontier of the warm nodes that are closest to the destination node.
Internally, A* works by augmenting Dijkstra's algorithm with a heuristic function that estimates the remaining distance to the target node. This means that if you can get a rough approximation of how far away a given node is from the destination, you can end up ignoring nodes that don't need to be processed in favor of nodes that are likely to be better.
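A minimal, generic sketch of the algorithm (not tied to your lattice code); `neighbors` and the heuristic `h` are assumed to be supplied by the caller, and `h` must never overestimate the true remaining cost:

```python
# Sketch: A* is Dijkstra's algorithm with the priority changed from g to g + h,
# where h(node) is an admissible estimate of the remaining cost to the goal.
import heapq

def a_star(start, goal, neighbors, h):
    """neighbors(node) yields (next_node, edge_cost); h(node) estimates the
    remaining cost to `goal` without overestimating it."""
    g = {start: 0.0}                            # best known cost from start
    parent = {start: None}
    frontier = [(h(start), start)]              # the "warm" set, ordered by g + h
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == goal:
            break                               # goal is now "stable"
        for nxt, cost in neighbors(node):
            tentative = g[node] + cost
            if tentative < g.get(nxt, float("inf")):
                g[nxt] = tentative
                parent[nxt] = node
                heapq.heappush(frontier, (tentative + h(nxt), nxt))
    if goal not in parent:
        return None                             # goal unreachable
    path, node = [], goal
    while node is not None:                     # walk the parent chain back
        path.append(node)
        node = parent[node]
    return path[::-1]
```

For your setting, a natural heuristic would be the straight-line distance divided by the maximal velocity anywhere on the map, which cannot overestimate the true travel time.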
A* was not designed as a cache-optimal algorithm, but I believe that the increase in speed due to
Expanding fewer nodes (so less data needs to fit in the cache), and
Expanding nodes closer to the goal (which were processed more recently and thus more likely to be in the cache)
will give you a huge performance increase and better cache performance.
Hope this helps!
I want to explore relations between data items in a large array. Every data item is represented by a multidimensional vector. First of all, I've decided to use clustering. I'm interested in finding hierarchical relations between clusters (groups of data vectors). I'm able to calculate the distance between my vectors, so as a first step I find the minimum spanning tree. After that I need to group the data vectors according to the links in my spanning tree. But at this step I'm stuck: how do I combine different vectors into hierarchical clusters? I'm using a heuristic: if two vectors are linked and the distance between them is very small, they are in the same cluster; if two vectors are linked but the distance between them is larger than a threshold, they are in different clusters with a common root cluster.
But maybe there is better solution?
Thanks
P.S.
Thanks to all!
In fact I've tried to use k-means and some variation of CLOPE, but didn't get good results.
So now I know that the clusters in my dataset actually have a complex structure (much more complex than n-spheres).
That's why I want to use hierarchical clustering. I also guess that the clusters look like n-dimensional chains (like 3D or 2D chains), so I use the single-link strategy.
But I'm stuck on how to combine different clusters with each other (in which situations should I create a common root cluster, and in which should I merge all sub-clusters into one cluster?).
I'm using the following simple strategy:
If clusters (or vectors) are close enough to each other, I merge their contents into one cluster (regulated by a threshold).
If clusters (or vectors) are too far from each other, I create a root cluster and put them into it.
But with this strategy I get very large cluster trees. I'm trying to find a satisfactory threshold, but maybe there is a better strategy for generating the cluster tree?
(A simple picture illustrating the question was attached here.)
A lot of work has been done in this area. The usual advice is to start with K-means clustering unless you have a really good reason to do otherwise. K-means does not normally do hierarchical clustering, so you may well have such a reason, although it is entirely possible to do hierarchical K-means: do a first pass to create clusters, then do another pass using the centroid of each of those clusters as a point, and continue until you have as few high-level clusters as desired (see the sketch below).
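A minimal two-level sketch of that idea, assuming scikit-learn is available; `fine_k` and `coarse_k` are illustrative parameters rather than anything from the original answer:

```python
# Sketch of two-level hierarchical K-means: cluster the points, then cluster
# the resulting centroids. scikit-learn is assumed to be available.
import numpy as np
from sklearn.cluster import KMeans

def two_level_kmeans(X, fine_k=100, coarse_k=10, seed=0):
    fine = KMeans(n_clusters=fine_k, random_state=seed).fit(X)
    coarse = KMeans(n_clusters=coarse_k, random_state=seed).fit(fine.cluster_centers_)
    # map every point to its high-level cluster through its fine cluster
    top_label = coarse.labels_[fine.labels_]
    return top_label, fine.cluster_centers_, coarse.cluster_centers_
```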
There are quite a few other clustering models though, and quite a few papers covering relative strengths and weaknesses, such as the following:
Pairwise Clustering and Graphical Models
Beyond pairwise clustering
Parallel pairwise clustering
A fast greedy pairwise distance clustering algorithm and its use in discovering thematic structures in large data sets
Pairwise Clustering Algorithm
Hierarchical Agglomerative Clustering
A little Googling will turn up lots more. Glancing back through my research directory from when I was working on clustering, I have dozens of papers, and my recollection is that there were a lot more that I looked at but didn't keep around, and many more still that I never got a chance to really even look at.
There is a whole zoo of clustering algorithms. Among them, minimum spanning tree a.k.a. single linkage clustering has some nice theoretical properties, as noted e.g. at http://www.cs.uwaterloo.ca/~mackerma/Taxonomy.pdf. In particular, if you take a minimum spanning tree and remove all links longer than some threshold length, then the resulting grouping of points into clusters should have minimum total length of remaining links for any grouping of that size, for the same reason that Kruskal's algorithm produces a minimum spanning tree.
However, there is no guarantee that minimum spanning tree will be the best for your particular purpose, so I think you should either write down what you actually need from your clustering algorithm and then choose a method based on that, or try a variety of different clustering algorithms on your data and see which is best in practice.
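A small sketch of the MST-and-cut approach with networkx (assumed available); it builds the complete graph explicitly, so it is only practical for a modest number of vectors:

```python
# Sketch: single-linkage clustering by building the minimum spanning tree,
# removing edges longer than a threshold, and reading the clusters off as
# connected components. `distance` is your own distance function.
import networkx as nx

def mst_clusters(vectors, distance, threshold):
    n = len(vectors)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):               # complete graph: n*(n-1)/2 edges
            G.add_edge(i, j, weight=distance(vectors[i], vectors[j]))
    mst = nx.minimum_spanning_tree(G, weight="weight")
    long_edges = [(u, v) for u, v, d in mst.edges(data=True)
                  if d["weight"] > threshold]
    mst.remove_edges_from(long_edges)           # cut the long links
    return [sorted(c) for c in nx.connected_components(mst)]
```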
Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that divides the points evenly may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an octree. For 3D data with a limited value domain it scales very well to huge data sets.
Many of the aforementioned methods, such as locality-sensitive hashing, are approximate versions designed for much higher dimensionality, where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. Since you can stop when a cell has too few points, and build a deeper tree where there are a lot of points, this should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
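An illustrative counting octree in pure Python (a sketch of the idea rather than something that would handle a billion points as-is; a production version would need a compact, array-based encoding):

```python
# Sketch of a counting octree: each node stores a point count and splits into
# 8 children once it holds more than `leaf_max` points. Dense regions show up
# as the deepest, fullest branches.
class OctreeNode:
    def __init__(self, center, half, leaf_max=256, depth=0, max_depth=21):
        self.center, self.half = center, half   # cube centre and half-width
        self.leaf_max, self.depth, self.max_depth = leaf_max, depth, max_depth
        self.count = 0
        self.children = None                    # None while this node is a leaf
        self.points = []                        # points kept only while a leaf

    def _child_index(self, p):
        cx, cy, cz = self.center
        return (p[0] >= cx) | ((p[1] >= cy) << 1) | ((p[2] >= cz) << 2)

    def insert(self, p):
        self.count += 1
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.leaf_max and self.depth < self.max_depth:
                self._split()
            return
        self.children[self._child_index(p)].insert(p)

    def _split(self):
        h = self.half / 2.0
        cx, cy, cz = self.center
        self.children = []
        for i in range(8):                      # one child per octant
            child_center = (cx + (h if i & 1 else -h),
                            cy + (h if i & 2 else -h),
                            cz + (h if i & 4 else -h))
            self.children.append(OctreeNode(child_center, h, self.leaf_max,
                                            self.depth + 1, self.max_depth))
        for q in self.points:                   # push stored points down
            self.children[self._child_index(q)].insert(q)
        self.points = []
```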
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. Dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that, for a given set of points divided into buckets, the points satisfying your criteria will be found in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a stepwise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each merging stage you could apply different bucket sizes/offsets to remove this effect (e.g., merge spatially equivalent buckets as well as adjacent buckets). I believe this method could keep memory requirements small (i.e., you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
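As a concrete, simplified stand-in for the bucketing step, one could bin points into an R-sized grid and look at the fullest cells first; this is plain grid binning rather than a true LSH scheme, but it parallelises the same way (each worker bins its own chunk and the dictionaries are merged):

```python
# Sketch: hash each point into an R-sized grid cell and count the points per
# cell; the densest cells are candidate regions for the top-N points.
from collections import Counter
from math import floor

def grid_buckets(points, R):
    counts = Counter()
    for x, y, z in points:
        counts[(floor(x / R), floor(y / R), floor(z / R))] += 1
    return counts

# usage: the 50 fullest cells as candidate dense regions
# top_cells = grid_buckets(points, R).most_common(50)
```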
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
Here are some possible parts of a solution. There are various choices at each stage, which will depend on Ncluster, on how fast the data changes, and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) Quantize: reduce the input XYZ coordinates to, say, 8 bits each, by taking 2^8 percentiles of X, Y, Z separately. This will speed up the whole flow without much loss of detail. You could sort all 1G points, or just a random 1M, to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256 with 2^(30-8) points in each range. To map float X -> 8-bit x, an unrolled binary search is fast; see Bentley, Pearls p. 95.
Added: Kd-trees split any point cloud into different-sized boxes, each with ~ Leafsize points, which is much better than splitting X, Y, Z as above. But afaik you'd have to roll your own Kd-tree code to split only the first, say, 16M boxes, and to keep counts only, not the points.
2) Box: count the number of points in each 3D box [xj .. xj+1, yj .. yj+1, zj .. zj+1]. The average box will have 2^(30-3*8) points; the distribution will depend on how clumpy the data is. If some boxes are too big or get too many points, you could a) split them into 8, or b) track the centre of the points in each box; otherwise just take the box midpoints.
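A sketch of steps 1) and 2) with numpy, under the assumption that the points are available as an (n, 3) float array (Nbit, mentioned in step 3 below, is kept as a parameter):

```python
# Sketch: map each float coordinate to an Nbit-bit bin via percentile edges,
# then count points per 3D box. `points` is an (n, 3) float array.
import numpy as np

def quantize_and_box(points, Nbit=8, sample=1_000_000):
    nbins = 1 << Nbit
    sample_idx = np.random.choice(len(points), min(sample, len(points)),
                                  replace=False)
    # nbins - 1 interior percentile edges per axis, estimated from a sample
    edges = [np.percentile(points[sample_idx, d],
                           np.linspace(0, 100, nbins + 1)[1:-1])
             for d in range(3)]
    codes = np.stack([np.searchsorted(edges[d], points[:, d])
                      for d in range(3)], axis=1).astype(np.int64)
    # pack the three bin indices into a single box id
    box_id = (codes[:, 0] << (2 * Nbit)) | (codes[:, 1] << Nbit) | codes[:, 2]
    counts = np.bincount(box_id, minlength=nbins ** 3)
    return edges, counts                        # counts[i] = points in box i
```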
3) K-means clustering on the 2^(3*8) box centres. (Google parallel "k means" -> 121k hits.) This depends strongly on K aka Ncluster, and also on your radius R. A rough approach would be to grow a heap of the, say, 27*Ncluster boxes with the most points, then take the biggest ones subject to your radius constraint. (I like to start with a minimum spanning tree, then remove the K-1 longest links to get K clusters.) See also color quantization. I'd make Nbit, here 8, a parameter from the beginning. What is your Ncluster? Added: if your points are moving in time, see collision-detection-of-huge-number-of-circles on SO.
I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (i.e., 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, OctoMap is easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea: create a graph with the given points as vertices and an edge between two points whenever their distance is < R.
Creating this kind of graph is similar to a spatial decomposition. Your questions can then be answered with local searches in the graph: the first is finding the vertices with maximum degree, the second is finding a maximal unconnected (independent) set among those max-degree vertices.
I think both the graph creation and the search can be done in parallel. This approach can have a large memory requirement; splitting the domain and working with graphs for smaller volumes can reduce the memory needed.
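A small sketch of this idea using scipy's cKDTree to enumerate all pairs closer than R, then greedily picking high-degree points that are pairwise farther than R apart (the pair list is the memory bottleneck, as noted above):

```python
# Sketch: build the distance-R graph implicitly with a k-d tree, count degrees,
# then greedily select high-degree points that are mutually farther than R apart.
import numpy as np
from scipy.spatial import cKDTree

def top_n_dense_points(points, R, N):
    tree = cKDTree(points)
    degree = np.zeros(len(points), dtype=np.int64)
    for i, j in tree.query_pairs(r=R):          # edges of the distance-R graph
        degree[i] += 1
        degree[j] += 1
    chosen = []
    for idx in np.argsort(-degree):             # highest degree first
        p = points[idx]
        if all(np.linalg.norm(p - points[c]) > R for c in chosen):
            chosen.append(idx)
        if len(chosen) == N:
            break
    return chosen
```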
I was wondering what the data structure is in an application like google/bing maps. How is it that the results are returned so quickly when searching for directions?
What kind of algorithms are being used to determine this information?
Thanks.
There are two parts to your question:
What kind of data structure is used to store the map information.
What kind of algorithm is used to "navigate" from source to destination.
To this, I would add another question:
How is Google/Bing able to "stream in" the data? For example, you are able to zoom in from miles up to ground level seamlessly, all the while maintaining the coordinate system.
I will attempt to address each question in order. Do note, that I do not work for the Google Maps or the Bing team, so quite obviously, this information might not be completely accurate. I am basing this off of the knowledge gained from a good CS course about data structures and algorithms.
Ans 1) The map is stored in an Edge Weighted Directed Graph. Locations on the map are Vertices and the path from one location to another (from one vertex to another) are the Edges.
Quite obviously, since there can be millions of vertices and an order of magnitude more edges, the really interesting thing would be the representation of this Edge Weighted Digraph.
I would say that this would be represented by some kind of Adjacency List and the reason I say so is because, if you imagine a map, it is essentially a sparse graph. There are only a few ways to get from one location to another. Think about your house! How many roads (edges in our case) lead to it? Adjacency Lists are good for representing sparse graphs, and adjacency matrix is good for representing dense graphs.
Of course, even though we are able to efficiently represent sparse graphs in memory, given the sheer number of Vertices and Edges, it would be impossible to store everything in memory at once. Hence, I would imagine some kind of a streaming library underneath.
To create an analogy, if you have ever played an open-world game like World of Warcraft / Skyrim / GTA, you will have noticed that, for the most part, there is no loading screen. But quite obviously, it is impossible to fit everything into memory at once. Thus, using a combination of quad-trees and frustum-culling algorithms, these games are able to load resources (terrain, sprites, meshes, etc.) dynamically.
I would imagine something similar, but for graphs. I have not put a lot of thought into this particular aspect, but as a very basic system one can imagine an in-memory database that is queried to add and remove vertices and edges from the graph at run-time as needed. This brings us to another interesting point: since vertices and edges need to be removed and added at run-time, the classic implementation of an adjacency list will not cut it.
In a classic implementation, we simply store a List (a Vector in Java) in each element of an array Adj[]. I would imagine a linked list in place of the Adj[] array and a binary search tree in place of List[Edge]. The binary search tree would allow O(log N) insertion and removal of nodes. This is extremely desirable since, in the List implementation, addition is O(1) but removal is O(N), and when you are dealing with millions of edges, that is prohibitive.
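A small sketch of such a dynamic adjacency structure; the answer above suggests a BST per vertex, while this hash-based dict-of-dicts variant (the idiomatic Python equivalent, purely illustrative) offers the same add/remove interface with O(1) average cost:

```python
# Sketch: a dynamic weighted digraph with cheap run-time edge updates,
# using nested dicts instead of the array-of-lists adjacency representation.
class DynamicDigraph:
    def __init__(self):
        self.adj = {}                           # vertex -> {neighbor: weight}

    def add_edge(self, u, v, weight):
        self.adj.setdefault(u, {})[v] = weight
        self.adj.setdefault(v, {})              # make sure v exists as a vertex

    def remove_edge(self, u, v):
        self.adj.get(u, {}).pop(v, None)

    def remove_vertex(self, u):
        self.adj.pop(u, None)
        for nbrs in self.adj.values():          # drop edges pointing to u
            nbrs.pop(u, None)

    def neighbors(self, u):
        return self.adj.get(u, {}).items()
```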
A final point to note here is that until you actually start the navigation, there is "no" graph. Since there can be millions of users, it doesn't make sense to maintain one giant graph for everybody (this would be impossible due to the memory requirement alone). I would imagine that as you start the navigation process, a graph is created for you. Quite obviously, since you start from location A and go to location B (and possibly other locations after that), the graph created just for you should not take up a very large amount of memory (provided the streaming architecture is in place).
Ans 2) This is a very interesting question. The most basic algorithm for solving this problem would be Dijkstra's path-finding algorithm. Faster variations such as A* exist. I would imagine Dijkstra to be fast enough, if it could work properly with the streaming architecture discussed above. Dijkstra uses space proportional to V and time proportional to E lg V, which are very good figures, especially for sparse graphs. Do keep in mind that if the streaming architecture has not been nailed down, V and E will explode, and the space and run-time requirements of Dijkstra will make it prohibitive.
Ans 3) Streaming question: do not confuse this with the streaming architecture discussed above. This is basically asking how the seamless zoom is achieved.
A good algorithm for achieving this is the quad-tree algorithm (you can generalize this to an n-tree). You store coarser images higher up in the tree and higher-resolution images as you traverse down it. This is actually what KML (Keyhole) did with its mapping algorithm. Keyhole was a company that partnered with NVIDIA many years back to produce one of the first "Google Earth"-like pieces of software.
The inspiration for quad-tree culling comes from modern 3D games, where it is used to quickly cull away parts of the scene that are not in the view frustum.
To further clarify this, imagine that you are looking at the map of USA from really high up. At this level, you basically split the map into 4 sections and make each section a child of the Quad Tree.
Now, as you zoom in, you zoom in on one of the sections (quite obviously you can zoom right at the center, so that your zoom actually touches all 4 sections, but for simplicity's sake let's say you zoom in on one of them). When you zoom in on one section, you traverse that section's 4 children, which contain higher-resolution data than their parent. You can continue to zoom down until you hit a set of leaves, which contain the highest-resolution data. To make the jump from one resolution to the next seamless, a combination of blur and fade effects can be used.
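This is not necessarily what Google or Bing do internally, but the standard Web Mercator "slippy map" tiling used by many web maps makes the quad-tree idea concrete: zoom level z has 2^z x 2^z tiles, and each tile's four children at zoom z+1 cover the same area at twice the resolution:

```python
# Sketch: standard slippy-map tile addressing. tile_for() maps a lat/lon to the
# tile containing it at a given zoom; children() lists the four quad-tree
# children of a tile at the next zoom level.
from math import radians, log, tan, cos, pi, floor

def tile_for(lat_deg, lon_deg, zoom):
    n = 2 ** zoom                               # tiles per axis at this zoom
    x = floor((lon_deg + 180.0) / 360.0 * n)
    lat = radians(lat_deg)
    y = floor((1.0 - log(tan(lat) + 1.0 / cos(lat)) / pi) / 2.0 * n)
    return x, y

def children(x, y, zoom):
    # the four higher-resolution tiles covering tile (x, y) at zoom + 1
    return [(2 * x + dx, 2 * y + dy, zoom + 1) for dx in (0, 1) for dy in (0, 1)]
```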
As a follow-up to this post, I will try to add links to many of the concepts I put in here.
For this sort of application, you would want some sort of database to represent map features and the connections between them, and would then need:
spatial indexing of the map feature database, so that it can be efficiently queried by 2D coordinates; and
a good way to search the connections to find a least-cost route, for some measure of cost (e.g. distance).
For 1, an example would be the R-tree data structure.
For 2, you need a graph search algorithm, such as A*.
Look up the paper about Highway Dimension by Google authors. The idea is to precompute shortest paths between important nodes and then route everything through those nodes. You are not going to use residential streets to go from LA to Chicago, except for getting on and off the freeway at both ends.
I'm not sure of the internal data structure, but it may be some kind of 2D-coordinate-based tree structure that only displays a certain number of levels. The levels would correspond to zoom factors, so you could ignore as insignificant anything more than, say, 5 levels below the current level, as well as things above the current level.
Regardless of how it's structured, here's how you can use it:
http://code.google.com/apis/maps/documentation/reference.html
I would think of it as a computational geometry problem. When you click on a particular coordinate on the map, that information gives you the latitude and longitude of the location. Based on the latitude, longitude, and zoom level, the place can be identified.
Once you have identified the two places, the only remaining problem is to find the nearest route. This is the problem of finding the shortest path between two points, with polygonal blocks between them (corresponding to areas that contain no roads), where the only possible connections are roads. This is a known problem and efficient algorithms exist to solve it.
I am not sure if this is what Google is doing, but I hope they do something along these lines.
I am taking computational geometry this semester. Here is the course link: http://www.ams.sunysb.edu/~jsbm/courses/545/ams545.html. Check them if you are interested.
I was wondering what the data structure is in an application like google/bing maps.
To the user: XHTML/CSS/Javascript. Like any website.
On the server: who knows? Any Google devs around here? It certainly isn't PHP or ASP.net...
How is it that the results are returned so quickly when searching for directions?
Because Google spent years, manpower and millions of dollars on building up the architecture to get the fastest server reaction time possible?
What kind of algorithms are being used to determine this information?
A journey planner algorithm.