Storing nearest neighbor - algorithm

There are algorithms out there to find k nearest neighbors in many ways. I am eventually will have to apply these, however in my case, I can code my program to add points one-by-one rather than add all the points altogether, then run a algorithm. Is this make the problem easier, so that maybe I could use a tree, and add each node to a neighborhood tree or something. This seems like it would be faster than searching all the points linearly.
And in my program points will be moving constantly, so I will be required to update neighbors, that's why I thought it is better to use a tree or another construct to update records, rather than calculate nearest neighbors in every movement of these points. Do you know of such data structure ?

Maybe a graph datastructure/database is most appropriate due to the structural similarity.
Example: https://neo4j.com/graphgist/a7c915c8-a3d6-43b9-8127-1836fecc6e2f (I do not work for neo4j)

Related

Efficient algorithm to find closest line segments for each point

Given a polygonal subdivison S and a set of points P, find the closest line segment in S for each point (in 2-d space).
In my setting, I have hundreds of thousands of segments and a couple thousand points.
Checking each line for each point would take too long. Is there an efficient algorithm for this?
I was considering multiple options, but can't figure out which is best.
Build a trapezoidal map and query the face each point is in. Then go over the edges of the face (in the subdivision) to find the nearest line.
Build a range tree or segment tree. Query a box around the point and find the closest line segment in it. There has to be a segment in the box for this to find anything.
Build a line segment voronoi diagram. Each face describes the nearest segment, but I wouldn't know how to do a point query, since the edges can be parabolic arcs.
What is a good high-level approach for this problem?
Nearest Neighbor in Postgis
The approach of Postgis is to use R-trees with a custom search algorithm. While going down the tree like in a regular query, they keep track of the minimum and maximum distance to objects in the bounding regions they encounter in the tree. Each encountered branch of the tree is added to an Active Branch List (ABL), which are pruned using the distance metrics.
They pick a branch's bounding region in the ABL and apply the algorithm recursively. At a leaf (an object like a line segment), it updates the variable nearest. When returning from the recursion, they apply the nearest variable for more pruning of the ABL, repeating until the ABL is empty.
The theoretical worst case is linear per query, but it has much better results in practice.

Index for fast point-(> 100.000.000)-in-polygon-(> 10.000)-test

Problem
I'm working with openstreetmap-data and want to test for point-features in which polygon they lie. In total there are 10.000s of polygons and 100.000.000 of points. I can hold all this data in memory. The polygons usually have 1000s of points, hence making point-in-polygon-tests is very expensive.
Idea
I could index all polygons with an R-Tree, allowing me to only check the polygons whose bounding-box is hit.
Probable new problem
As the polygons are touching each other (think of administrative boundaries) there are many points in the bounding-box of more than one polygon, hence forcing many point-in-polygon-tests.
Question
Do you have any better suggestion than using an R-Tree?
Quad-Trees will likely work worse than rasterization - they are essentially a repeated rasterization to 2x2 images... But definitely exploit rasterization for all the easy cases, because testing the raster should be as fast as it gets. And if you can solve 90% of your points easily, you have more time for the remaining.
Also make sure to first remove duplicates. The indexes often suffer from duplicates, and they are obviously redundant to test.
R*-trees are probably a good thing to try, but you need to really carefully implement them.
The operation you are looking for is a containment spatial join. I don't think there is any implementation around that you could use - but for your performance issues, I would carefully implement it myself anyway. Also make sure to tune parameters and profile your code!
The basic idea of the join is to build two trees - one for the points, one for the polygons.
You then start with the root nodes of each tree, and repeat the following recursively until the leaf level:
If one is a non-directory node:
If the two nodes do not overlap: return
Decide by an heuristic (you'll need to figure this part out, "larger extend" may do for a start) which directory node to expand.
Recurse into each new node, plus the other non-opened node as new pair.
Leaf nodes:
fast test point vs. bounding box of polygon
slow test point in polygon
You can further accelerate this if you have a fast interior-test for the polygon, in particular for rectangle-in-polygon. It may be good enough if it is approximative, as long as it is fast.
For more detailed information, search for r-tree spatial join.
Try using quad trees.
Basically you can recursivelly partion space into 4 parts and then for each part you should know:
a) polygons which are superset of given part
b) polygons which intersect given part
This gives some O(log n) overhead factor which you might not be happy with.
The other option is to just partion space using grid. You should keep same information or each part of the grid as in the case above. This does only have some constant overhead.
Both this options assume, that the distribution of polygons is somehow uniform.
There is an other option, if you can process points offline (in other words you can pick the processing order of points). Then you can use some sweeping line techniques, where you sort points by one coordinate, you iterate over points in this sorted order and maintain only interesting set of polygons during iteration.

Can an unsorted binary tree be balanced using a distance-function?

I´ve got the following theoretical problem:
I have an amount n of cuboids in 3-dimensional space.
They are aligned to the coordinate-system, so that one cuboid can be described via a point (x,y,z) and dimensions (dimX,dimY,dimZ).
I want to organize these cuboids in a way that I´m able to check if a newly inserted cuboid intersecs with one of the existing (collision detection).
To do this I decided to use hierarchical bounding-boxes.
So in sum I have a binary-tree-structure of bounding volumes.
Insertion is then done by determining recursively the distance to both children (=the distance between the two centers of two cuboids) and inserting in the path with the smallest distance.
Collision detection works similar, but we take all bounding volumes in a sub-path which are intersecting a given cuboid.
The tricky part is, how to balance this tree to get better performance if some cuboids are very close to each other and others are far away.
So far, I´ve found no way to use e.g. an AVL-tree because then I´d have to be able to compare two cuboids in some way that does not break the conditions on which collision detection depends.
P.S.: I know there are libraries to do this, but I want to understand the principles of collision detection e.g. in games in detail and therefore want to implement this by myself.
I´ve now tried it with space-partitioning instead of object-partitioning. That´s not exactly what I wanted to do but I found much more helpful information about it, e.g.: https://en.wikipedia.org/wiki/Kd-tree
With this information it should be possible to implement it.

How to find the neighbors of a graph effiiciently

I have a program that create graphs as shown below
The algorithm starts at the green color node and traverses the graph. Assume that a node (Linked list type node with 4 references Left, Right, Up and Down) has been added to the graph depicted by the red dot in the image. Inorder to integrate the newly created node with it neighbors I need to find the four objects and link it so the graph connectivity will be preserved.
Following is what I need to clarify
Assume that all yellow colored nodes are null and I do not keep a another data structure to map nodes what is the most efficient way to find the existence of the neighbors of the newly created node. I know the basic graph search algorithms like DFS, BFS etc and shortest path algorithms but I do not think any of these are efficient enough because the graph can have about 10000 nodes and doing graph search algorithms (starting from the green node) to find the neighbors when a new node is added seems computationally expensive to me.
If the graph search is not avoidable what is the best alternative structure. I thought of a large multi-dimensional array. However, this has memory wastage and also has the issue of not having negative indexes. Since the graph in the image can grow in any directions. My solution to this is to write a separate class that consists of a array based data structure to portray negative indexes. However, before taking this option I would like to know if I could still solve the problem without resolving to a new structure and save a lot of rework.
Thank you for any feedback and reading this question.
I'm not sure if I understand you correctly. Do you want to
Check that there is a path from (0,0) to (x1,y1)
or
Check if any of the neighbors of (x1,y1) are in the graph? (even if there is no path from (0,0) to any of this neighbors).
I assume that you are looking for a path (otherwise you won't use a linked-list), which implies that you can't store points which have no path to (0,0).
Also, you mentioned that you don't want to use any other data structure beside / instead of your 2D linked-list.
You can't avoid full graph search. BFS and DFS are the classic algorithms. I don't think that you care about the shortest path - any path would do.
Another approaches you may consider is A* (simple explanation here) or one of its variants (look here).
An alternative data structure would be a set of nodes (each node is a pair < x,y > of course). You can easily run 4 checks to see if any of its neighbors are already in the set. It would take O(n) space and O(logn) time for both check and add. If your programming language does not support pairs as nodes of a set, you can use a single integer (x*(Ymax+1) + Y) instead.
Your data structure can be made to work, but probably not efficiently. And it will be a lot of work.
With your current data structure you can use an A* search (see https://en.wikipedia.org/wiki/A*_search_algorithm for a basic description) to find a path to the point, which necessarily finds a neighbor. Then pretend that you've got a little guy at that point, put his right hand on the wall, then have him find his way clockwise around the point. When he gets back, he'll have found the rest.
What do I mean by find his way clockwise? For example suppose that you go Down from the neighbor to get to his point. Then your guy should be faced the first of Right, Up, and Left which he has a neighbor. If he can go Right, he will, then he will try the directions Down, Right, Up, and Left. (Just imagine trying to walk through the maze yourself with your right hand on the wall.)
This way lies insanity.
Here are two alternative data structures that are much easier to work with.
You can use a quadtree. See http://en.wikipedia.org/wiki/Quadtree for a description. With this inserting a node is logarithmic in time. Finding neighbors is also logarithmic. And you're only using space for the data you have, so even if your graph is very spread out this is memory efficient.
Alternately you can create a class for a type of array that takes both positive and negative indices. Then one that builds on that to be 2-d class that takes both positive and negative indices. Under the hood that class would be implemented as a regular array and an offset. So an array that can start at some number, positive or negative. If ever you try to insert a piece of data that is before the offset, you create a new offset that is below that piece by a fixed fraction of the length of the array, create a new array, and copy data from the old to the new. Now insert/finding neighbors are usually O(1) but it can be very wasteful of memory.
You can use a spatial index like a quad tree or a r-tree.

Circular representation of a tree structure

I have some data in a tree structure, and I want to represent them in a graphical way, with the root node in the middle of the stage, his children displaced in a circle around him, and so on for every children, around their parent.
I don't want overlapping nodes, so the question is how to arrange space in an optimal way.
Something less or more like (found via google)
What algorhythms I have to search to realize something like this?
If you don't care about how it's done, but just that you are visualizing the data, then take a look at graphviz's radial layout. Although the example doesn't look exactly what you want, it is the layout you'd need. It'll also give you some ideas on how it's done too with the loads of research papers in there. Good luck!
You could also see how easy it is to extend this paper into a circular structure.
You can do it in an emergent way by setting up a system in which each tree node tries to keep as much distance from all other nodes (except parent) as possible, but as short a distance as possible from the parent (down to some minimum distance which it must maintain). If you run that algorithm for each node repeatedly until it stabilizes, you'll have an arrangement like the one you describe. I'm sure there are a lot of optimizations you can do to it, but I'm pretty sure this is going to be the simplest approach. Trying to calculate all the layout up front would be very complex...
You are trying to draw a planar representation of a graph.
Find some buzzwords and perhaps a resource here
And in wikipedia
Ah and I forgot: You can do this the newtonian way with forces.
Simply give all nodes a repelling potential, like make them all Protons which push each other away. Give the edges the properties of newtonian springs, exerting forces that pull them together and you are all set.
Could even create nice animations that way.
This is also an official way of graph drawing, but I don't know the name.
If you want to draw the tree with a minimum of wasted space and short connections, then you're in for a computationally expensive solution. It will be hard to get this to be realtime on a tree of decent size, not to mention that making small changes to the tree might result in a radically different equilibrium.
Another approach would be to abandon the physical simulation and just build it iteratively. I've done something similar last week, but my trees are probably a lot less involved than yours.
For this tree-layout, each node object has to store an angle and an offset. These two numbers control where on the graphics surface they end up.
Here is my basic algorithm:
1) recurse over your entire tree-data and find all the Leaf nodes.
2) while you're doing this, be sure to measure the length of each branching structure, so you know which is the longest.
3) once you have all your leaf nodes, distribute them equally over a concentric circle. You can either use the entire circle, or only some part of the angle domain.
4) once all Leaf nodes have been solved, you recurse again over the tree, going from the outside in. Each node you encounter that is not a leaf node is in need of layout. Essentially, every node from here on has an angle which is the average of all it's child nodes, and the offset is the graph_radius * (depth_of_node / maximum_depth)
I found this gives me a very decent and humanly readable distribution, albeit not a very efficient one in terms of screen usage. I uploaded an animation of my tree-display here: GIF anim

Resources