Our application displays potentially big graphs with lots of nodes and edges. We use things like dot, of course, to layout the graph, and they look well on screen. However, users would like to print them to paper. Now technically we can do it, we split the graph into little images and the users could print that. But there is no guarantee that by cutting at the page size we're not going to cut through nodes, have loads of edges between different pages, etc. I'm looking for an algorithm that would alter the layout of the graph so that it'd be more usable on printing:
- ensure no node will fall on page boundaries
- try to minimize edges between pages
It seems that looks like finding "clusters" of densely related nodes to put on the same page with few edges crossing to other pages to other clusters. Can anybody point me to relevant literature/tools to do this kind of things?
Thanks
A cluster analysis is a good start.
I would suggest the following way to process:
First define a cost function:
Given a page repartition : Cost(Repartition) = f(nodes near boundaries, multipages-edges)
THe objective will be to minimize this function.
chose a cluster analysis algorithm:
define a cluster algorithm (you can have a look at my post on DBSCAN : DBSCAN code in C# or vb.net , for Cluster Analysis
and add a "page" constraint to the clustering. (size of the page)
Depending on order you will go through your points the outcome will be different because of the "page constraint".
chose the one with the smallest cost, or stop when you find a acceptable cost.
The complexity of this algorithm would be O(n!) ... kind of big for big graphs.
Therefore you need to think of some more clustering constraints, or just make a partial search (test the n2 first to have it in O(n2)) to get a good approximation.
Depending of the acceptable cost you can get a result in an reasonnable time or not.
Related
I develop a small program where an user can create a simple diagrams with abstract blocks connected by lines, for example, flow charts or structural diagrams. One of the clause of the statement of work is that lines have to get around other blocks\lines and don't intersect them while moving.
Illustration
I tried to use pathfinding algorithms like A* or Lee's algorithm for it and consider a workspace (a window with diagram elements) like a graph - one pixel is one graph node. However, moving of blocks\lines causes the significant time delay (for example, pathfinding in the workspace with size 500x500 takes about 320-360 ms). It seems like the graph is too big for those algorithms.
Could you please tell me how to reduce the number of nodes with regard to this case? Maybe is there a way to speed up those algorithms or use something other for it?!
Don't think of this as a graph theory problem, think of it as a physics problem.
Visualize it as follows. Each block has a specified force pulling it towards the last place that it was put. Line segments, blocks, and the edge of the graph repel each other with an inverse square law (except that the end of the line you are drawing doesn't repel blocks in front of it). Under sufficient stress, a line segment can be broken into smaller line segments that have a pull towards returning to being a straight line.
The dynamics are complicated, but the number of entities is the number of objects you see on screen, not the number of pixels it is drawn on. Therefore you'll be able to do updates relatively quickly.
You'll need to play with the dynamics a bit to get a good experience, but this should be a more tractable approach.
I am calculating pathfinding inside a mesh which I have build a uniform grid around. The nodes (cells in the 3D grid) close to what I deem a "standable" surface I mark as accessible and they are used in my pathfinding. To get alot of detail (like being able to pathfind up small stair cases) the ammount of accessible cells in my grid have grown quite large, several thousand in larger buildings. (every grid cell is 0.5x0.5x0.5 m and the meshes are rooms with real world dimensions). Even though I only use a fraction of the actual cells in my grid for pathfinding the huge ammount slows the algorithm down. Other than that it works fine and finds the correct path through the mesh, using a weighted manhattan distance heuristic.
Imagine my grid looks like that and the mesh is inside it (can be more or less cubes but its always cubical), however the pathfinding will not be calculated on all the small cubes just a few marked as accessible (usually at the bottom of the grid but that can depend on how many floors the mesh has).
I am looking to reduce the search space for the pathfinding... I have looked at clustering like how HPA* does it and other clustering algorithms like Markov but they all seem to be best used with node graphs and not grids. One obvious solution would be to just increase the size of the small cubes building the grid but then I would lose alot of detail in the pathfinding and it would not be as robust. How could I cluster these small cubes? This is how a typical search space looks when I do my pathfinding (blue are accessible, green is path):
and as you see there is a lot of cubes to search through because the distance between them is quite small!
Never mind that the grid is an unoptimal solution for pathfinding for now.
Does anyone have an idea on how to reduce the ammount of cubes in the grid I have to search through and how would I access the neighbors after I reduce the space? :) Right now it only looks at the closest neighbors while expanding the search space.
A couple possibilities come to mind.
Higher-level Pathfinding
The first is that your A* search may be searching the entire problem space. For example, you live in Austin, Texas, and want to get into a particular building somewhere in Alberta, Canada. A simple A* algorithm would search a lot of Mexico and the USA before finally searching Canada for the building.
Consider creating a second layer of A* to solve this problem. You'd first find out which states to travel between to get to Canada, then which provinces to reach Alberta, then Calgary, and then the Calgary Zoo, for example. In a sense, you start with an overview, then fill it in with more detailed paths.
If you have enormous levels, such as skyrim's, you may need to add pathfinding layers between towns (multiple buildings), regions (multiple towns), and even countries (multiple regions). If you were making a GPS system, you might even need continents. If we'd become interstellar, our spaceships might contain pathfinding layers for planets, sectors, and even galaxies.
By using layers, you help to narrow down your search area significantly, especially if different areas don't use the same co-ordinate system! (It's fairly hard to estimate distance for one A* pathfinder if one of the regions needs latitude-longitude, another 3d-cartesian, and the next requires pathfinding through a time dimension.)
More efficient algorithms
Finding efficient algorithms becomes more important in 3 dimensions because there are more nodes to expand while searching. A Dijkstra search which expands x^2 nodes would search x^3, with x being the distance between the start and goal. A 4D game would require yet more efficiency in pathfinding.
One of the benefits of grid-based pathfinding is that you can exploit topographical properties like path symmetry. If two paths consist of the same movements in a different order, you don't need to find both of them. This is where a very efficient algorithm called Jump Point Search comes into play.
Here is a side-by-side comparison of A* (left) and JPS (right). Expanded/searched nodes are shown in red with walls in black:
Notice that they both find the same path, but JPS easily searched less than a tenth of what A* did.
As of now, I haven't seen an official 3-dimensional implementation, but I've helped another user generalize the algorithm to multiple dimensions.
Simplified Meshes (Graphs)
Another way to get rid of nodes during the search is to remove them before the search. For example, do you really need nodes in wide-open areas where you can trust a much more stupid AI to find its way? If you are building levels that don't change, create a script that parses them into the simplest grid which only contains important nodes.
This is actually called 'offline pathfinding'; basically finding ways to calculate paths before you need to find them. If your level will remain the same, running the script for a few minutes each time you update the level will easily cut 90% of the time you pathfind. After all, you've done most of the work before it became urgent. It's like trying to find your way around a new city compared to one you grew up in; knowing the landmarks means you don't really need a map.
Similar approaches to the 'symmetry-breaking' that Jump Point Search uses were introduced by Daniel Harabor, the creator of the algorithm. They are mentioned in one of his lectures, and allow you to preprocess the level to store only jump-points in your pathfinding mesh.
Clever Heuristics
Many academic papers state that A*'s cost function is f(x) = g(x) + h(x), which doesn't make it obvious that you may use other functions, multiply the weight of the cost functions, and even implement heatmaps of territory or recent deaths as functions. These may create sub-optimal paths, but they greatly improve the intelligence of your search. Who cares about the shortest path when your opponent has a choke point on it and has been easily dispatching anybody travelling through it? Better to be certain the AI can reach the goal safely than to let it be stupid.
For example, you may want to prevent the algorithm from letting enemies access secret areas so that they avoid revealing them to the player, and so that they AI seems to be unaware of them. All you need to achieve this is a uniform cost function for any point within those 'off-limits' regions. In a game like this, enemies would simply give up on hunting the player after the path grew too costly. Another cool option is to 'scent' regions the player has been recently (by temporarily increasing the cost of unvisited locations because many algorithms dislike negative costs).
If you know what places you won't need to search, but can't implement in your algorithm's logic, a simple increase to their cost will prevent unnecessary searching. There's a lot of ways to take advantage of heuristics to simplify and inform your pathfinding, but your biggest gains will come from Jump Point Search.
EDIT: Jump Point Search implicitly selects pathfinding direction using the same heuristics as A*, so you may be able to implement heuristics to a small degree, but their cost function won't be the cost of a node, but rather, the cost of traveling between the two nodes. (A* generally searches adjacent nodes, so the distinction between a node's cost and the cost of traveling to it tends to break down.)
Summary
Although octrees/quad-trees/b-trees can be useful in collision-detection, they aren't as applicable to searches because they section a graph based on its coordinates; not on its connections. Layering your graph (mesh in your vocabulary) into super graphs (regions) is a more effective solution.
Hopefully I've covered anything you'll find useful.
Good luck!
I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with eachother. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I believe you are a developer and you have no experience with data science?
There are a number of clustering algorithms and they have their advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think solution for your problem is easier than one can think.
I assume that N is small enough so you can store a matrix with N^2 float values in RAM memory? If this is the case, you are in a very comfortable situation. You write that you know how to implement similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (it is a symmetric matrix, so only half of it can be stored). Please ensure that your similarity measure assigns special value for pair of images, where similarity measure is less than some X%, like 0 or infinity (it depends on that you treat a function like similarity measure or like a distance). I think perfect solution is to assign 1 for pairs, where similarity is greater than X% threshold and 0 otherwise.
After that, treat is just like a graph. Get first vertex and make, e.g., deep first search or any other graph walking routine. This is your first cluster. After that get first not visited vertex and repeat graph walking. Of course you can store graph as an adjacency list to save memory.
This algorithm assumes that you really do not pay attention to that how much images are similar and which pairs are more similar than other, but if they are similar enough (similarity measure is greater than a given threshold).
Unfortunately in cluster analysis it is common that 100% of possible pairs has to be computed. It is possible to save some number of distance calls using some fancy data structures for k-nearest neighbor search, but you have to assure that your similarity measure hold triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time, at the top all images are in one cluster, you have to cut a dendrogram at proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below a threshold. Worst case there is no such cluster but then I'll end up in a leaf of the dendrogram which means that element is in a cluster of its own.
I need some input on solving the following problem:
Given a set of unordered (X,Y) points, I need to reduce/simplify the points and end up with a connected graph representation.
The following image show an example of an actual data set and the corresponding desired output (hand-drawn by me in MSPaint, sorry for shitty drawing, but the basic idea should be clear enough).
Some other things:
The input size will be between 1000-20000 points
The algorithm will be run by a user, who can see the input/output visually, tweak input parameters, etc. So automatically finding a solution is not a requirement, but the user should be able to achieve one within a fairly limited number of retries (and parameter tweaks). This also means that the distance between the nodes on the resulting graph can be a parameter and does not need to be derived from the data.
The time/space complexity of the algorithm is not important, but in practice it should be possible to finish a run within a few seconds on a standard desktop machine.
I think it boils down to two distinct problems:
1) Running a filtering pass, reducing the number of points (including some noise filtering for removing stray points)
2) Some kind of connect-the-dots graph problem afterwards. A very problematic area can be seen in the bottom/center part on the example data. Its very easy to end up connecting wrong parts of the graph.
Could anyone point me in the right direction for solving this? Cheers.
K-nearest neighbors (or, perhaps more accurately, a sigma neighborhood) might be a good starting point. If you're working in strictly Euclidean space, you may able to achieve 90% of what you're looking for by specifying some L2 distance threshold beyond which points are not connected.
The next step might be some sort of spectral graph analysis where you can define edges between points using some sort of spectral algorithm in addition to a distance metric. This would give the user a lot more knobs to turn with regards to the connectivity of the graph.
Both of these approaches should be able to handle outliers, e.g. "noisy" points that simply won't be connected to anything else. That said, you could probably combine them for the best possible performance (as spectral clustering performs a lot better when there are no 1-point clusters): run a basic KNN to identify and remove outliers, then a spectral analysis to more robustly establish edges.
I'm using Processing to develop a navigation system for complex data and processes. As part of that I have gotten into graph layout pretty deeply. It's all fun and my opinions on layout algorithms are : force-directed is for sissies (just look at it scale...haha), eigenvector projection is cool, Sugiyama layers look good but fail fast on graphy graphs, and although I have relied on eigenvectors thus far, I need to minimize edge crossings to really get to the point of the data. I know, I know NP-complete etc.
I should add that I have some good success from applying xy boxing and using Sugiyama-like permutation to reduce edge crossings across rows and columns. Viz: graph (|V|=90,avg degree log|V|) can go from 11000 crossings -> 1500 (by boxed eigenvectors) -> 300 by alternating row and column permutations to reduce crossings.
But the local minima... whatever it is sticks around this mark, and the result is not as clear as it could be. My research into the lit suggests to me that I really want to use the planarization algorithm like what they do use for VLSI:
Use BFS or something to make the maximal planar subgraph
1.a. Layout the planar subgraph nice-like
Cleverly add outstanding edges to recover the original graph
Please reply with your thoughts on the fastest planarization algorithm, you are welcome to go into some depth about any specific optimizations you have had familiarity with.
Thanks so much!
Given that all graphs are not planar (which you know), it might be better to go for a "next-best" approach rather than a "always provides best answer" approach.
I say this because my roommate in grad school had the same problem you did. He was trying to convert a graph to planar form and all the algorithms that guarantee minimum edge crossings were too slow. What he ended up doing what just implementing a randomized algorithm. Basically, lay out the graph, then jigger nodes that have edges with lots of crossings, and eventually you'll handle the worst clumps of edges.
The advantages were: you can cut it out after a specific amount of time, so if you're time limited this always comes in under a certain amount of time, if you get a crappy graph you can just run it again (over the already laid out graph) to get something better, and it's relatively easy to code up.
Disadvantage is you don't always get the global minima in crossings, and if the graph gets stuck in high crossing areas, his algorithm would sometimes shoot a node way off into the distance to try and resolve it, which sometimes causes really weird looking graphs.
You know already lots of graph layout but I believe your question is still, unfortunately, underspecified.
You can planarize any graph really quickly by doing the following: (1) randomly layout the graph on a plane (2) calculate where edges cross (3) create pseudo-vertices at the crossings (this is what you anyway do when you use planarity based layout for non-planar graphs) (4) the graph extended with the new vertices and split edges is automatically planar.
The first difficulty comes from having an algorithm that calculates a combinatorial embedding that minimizes the number of edge crossings. The second difficulty is using that combinatorial embedding to do layout in Euclidean plane that is visually appealing, e.g. for orthogonal graph layout you might want to minimize the number of bends, maximize the size of faces, minimize the area of the graph as a whole, etc., and these goals might conflict each other.