How to implement nearest neighbor search using KDTrees? - algorithm

So, I'm implementing a KD-Tree to do a nearest neighbor search. I've got the building the tree part working, but I don't think I understand the search part completely.
About traversing the tree to search for the neighbor, the Wikipedia article says the following:
Starting with the root node, the algorithm moves down the tree recursively, in the same
way that it would if the search point were being inserted (i.e. it goes right or left
depending on whether the point is greater or less than the current node in the split
dimension).
What does "greater or less than the current node in the spit dimension mean? Do we compare the points based on distance from the query or do we compare the points by the split dimension?
Also, could someone explain the part about hyperspace and hyperplane? I feel like I understand it, but since I'm not sure I'd like some more explanation.
Thanks!

Each node splits the space into 2 half-spaces along one axis. You look at where the point in question is relative to that split plane to decide which side of the tree to go down. For example if your point is (4,7,12) and you have a splitting plane that cuts the y axis at 9, you'd compare the 7 to the 9 and decide to go down the left (less than) side of the k-d tree first. After finding the nearest neighbor on the left side, you then check if it is closer than 2 (the distance to the split plane: 9-7). If it is closer than the split-plane you do not need to traverse the other half-tree at all. This is why it works so well, most of the time you will only have to traverse one sub-tree.
Hope that helps.

Related

Can a quad-tree be used to accurately determine the closest object to a point?

I have a list of coordinates and I need to find the closest coordinate to a specific point which I'll call P.
At first I tried to just calculate the distance from each coordinate to P, but this is too slow.
I then tried to store these coordinates as a quad-tree, find the leaf node that contains P, then find the closest coordinate in that leaf by comparing distances of every coordinate to P. This gives a good approximation for the closest coordinate, but can be wrong sometimes. (when a coordinate is outside the leaf node, but closer). I've also tried searching through the leaf node's parent, but while that makes the search more accurate, it doesn't make it perfect.
If it is possible for this to be done with a quad-tree, please let me know how, otherwise, what other methods/data structures could I used that are reasonably efficient, or is it even possible to do this perfectly in an efficient manner?
Try "loose quadtree". It does not have a fixed volume per node. So it can adjust each node's bounding volume to adapt to the items added.
If you don't like quadtree's traversing performance and if your objects are just points, adaptive-grid can perform fast or very close to O(N). But memory-wise, loose quadtree would be better.
There is an algorithm by Hjaltason and Samet described in their paper "Distance browsing in spatial databases". It can easily be applied to quadtrees, I have an implementation here.
Basically, you maintain a sorted list of object, the list is sorted by distance (closest first), and the objects are either point in your tree (you call the coordinates) or nodes in the tree (distance to closest corner, or distance=0 if they overlap with you search point).
You start adding all nodes that overlap with your search point, and add all points and subnodes in these points.
Then you simply return points from the top of the list until you have as many closest points as you want. If a node is at the top of the list, add points/subnodes from that node to the list and check the top of the list again. Repeat.
yes you can find the closest coordinate inside a quad-tree even when it is not directly inside the leaf. in order to do that, you can do the following search algorithm :
search the closest position inside the quad-tree.
take its distance from your initial position
search all the nodes inside this bounding box from your root node
return the closest node from all the nodes inside this bounding box
however, this is a very basic algorithm with no performance optimizations. among other things :
if the distance calculated in 2. is less than the distance to the border of the tree node, then you don't need to do 3 or 4. (or you can take a node that is not the root node)
also, 3 and 4 could be simplified into a single algorithm that only search inside the tree with the distance to the closest node as the bounding box.
And you could also sort the way you search for the nodes inside the bounding box by beginning to search for the nodes closest to your position first.
However, I have not made complexity calculation, but you should expect a worst case scenario on one node that is as bad if not worst than normal, but in general you should get a pretty decent speed up all the while being error free.

Algorithm to fill out a monochrome area - ideas?

I am looking for an algorithm which pretty much does the same as flood-fill, which is fill out a monochrome area. Instead of the recursion and the nearest-Neighbor approach i want the algorithm to be some sort of "turtle" or "mouse" that fills out the image, while leaving a path behind. This path must not contain diagonal movements. The result should be similar to a perfect Snake Game, where the entire square is filled (the snake represents the path in this case). It can cross its own path but that amount should be kept to a minimum and it should only occur in special cases (e.g: when the "mouse" enters a passage of width = 1px, where it would fill that passage out and turn around).The amount of changes in direction it takes should also be kept at a minimum.
P.S: not that this will be applied on an image, not a graph
I would model this as a graph problem where you are trying to visit all the nodes in the graph. Each pixel is a node and an edge exists between nodes that are directly next to each other.
I believe that creating an algorithm in which is a variation of a breadth first search on a graph modeled like this would achieve the result you were looking for. Ensuring that you do not visit nodes twice.
You would need to delve deeper into the implementation of the breadth first search in order to make sure it prefers a 'straight' path. Possibly when writing the breadth first search logic, create an is_straight() which checks the next node to see if it's straight or not, and prefer the straight node and only pick a non straight node when is_straight() returns false on all the child nodes.

Implementation of Planar Point Location using slabs

I am looking into a Planar Point Location, and right now thinking about make use of The Slabs approach.
I've read through a couple of articles:
Planar Point Location Using Persistent Search Trees
Persistent Data Structures
Point location on Wikipedia
I understand the concept behind a persistent balanced binary search tree, but let's omit them in this question, I don't really care about extra storage space overhead. The thing with the articles is they are discussing how speed can be improved, but don't explain basic stuff. For example:
Could you please correct me if I am wrong:
We draw a line across all intersections.
Now, we have slabs that are split by the segments at different angles.
Every intersection with any line is considered a vertex.
Slabs are sorted by order in a binary search tree (let's omit a partially persistent bst)
Somehow, sectors are sorted in those respective BST's, even though the segments dividing them are almost always at an angle. Does each node has to carry a definition of area?
Please refer to this example image:
Questions:
How would I actually figure out if the point lies in Node c and not in Node b? Is it somehow via area?
Do I instead need arrange my nodes to contain information about the segments? I could then check if the query point lies above a segment (if that's how I should determine my sectors)? If so, would I then search through a Polygon List after, to see which polygon this particular segment belongs to?
Maybe I need to store BST for each line and not a slab?
Would I then have to look at 2 BST's belonging to line on the left, and the 2nd one of the line to the right from the vertex? I could then sort the vertexes by y coordinate in each tree and return the y coordinate of a vertex (the end of a segment) right below my query point. Having done this for left and right line, I would then do a comparison to see if the names of the segment those vertexes come from actually match.
However, this will not give me the right answer, since even if the names do match, I might be below or above the segment (if I am close to it). Also, this implies I have to do 3 binary searches (1 for lines, 1 for the y-coordinate on left line, 1 for the right one), and the books says I only need to do 2 searches (1 for slab, 2nd for sector).
Could someone please point me in the direction to do it?
I probably just missed some essential thought or something.
Edit:
Here is another good article, that explains the solution to the problem, however, I don't quite understand how I would achieve the following:
"Consider any query point q ∈ R2. To find the face of G that contains q, we first use binary search with the x-coordinate of q to find the vertical slab s that contains q. Given s, we use binary search with the y-coordinate of q to find the edges of Es between which q lies. "
How exactly to find those 2 edges? Is it as simple as checking if the point lies below the segment? However, this seems like a complicated check to do (and expensive), as we descend down the tree, inspecting the other nodes.
Well, this was asked a year ago so I guess you're not going to be around to accept the answer. It's a fun question, though, so here goes anyway...
Yes, you can divide the plane into slabs using vertical lines through every intersection. The important result is that all line segments that touch a slab will go all the way through, and the order of those line segments will be the same across the entire slab.
In each slab, you make a binary search tree of those line segments, ordered by their order in the slab. The areas that you call "nodes" are the leaves of the tree, and the line segments are the internal nodes. I will call the leaves "areas" to avoid ambiguity. Since the order of the line segments is the same across the entire slab, the order in this tree is valid for every x coordinate in the slab.
to determine the area that contains any point, you find the slab that contains the point, and then do a simple search in the BST to find the area.
Lets say you're looking for point C, and you're at the node in slab 2 that contains (x2,y2) - (x3,y4). Determine whether C is above or below this line segment, and then take that branch in the tree.
For example, if C is (cx,cy) then the y coordinate of the segment at x=cx is:
testy = ((cx-x2)/(x3-x2)) * (y4-y2) + y2
If cy < testy, then take the upward branch, otherwise take the downward branch. (assuming positive Y goes downward).
When you get to a leaf, that's the area that contains C. You're done.
Now... Making a whole new tree for every slab takes a lot of space, since you can have N^2 intersection points and therefore N^2 slabs. Storing a separate tree in every slab would take O(N^3) space.
Using, say, persistent red-black trees, however, the slabs can actually share a lot of nodes. Using a simple purely functional implementation like Chris Okasaki's red-black tree takes O(log N) space per intersection for a total of O(N^2*log N) space.
I seem to remember that there is also a constant-space-per-change persistent red-black tree that would give you O(N^2), but I don't have a reference.
Note that for most real-life scenarios there are closer to O(N) intersections, because lines don't cross very much, but it's still nice to save the space by using a persistent tree.

Find nearest delivery centers to a given area code

This was asked during interviewing process for a company. Suppose there is an interface to look for nearest delivery center to your area. All you have to enter is your zipcode/pincode and it returns the nearest delivery center. What would be the data structure and algorithm to do this? Like, you have broken your phone and want to go to a service center. You go to the company website and enter your zipcode to find out the nearest repair center. How does it do that?
I suggested a graph + hashmap solution where I will return the neighbouring nodes from a given node and addresses will be stored in hashmap w.r.t zipcodes but that wasn't good enough as the interviewer kept pressing on using the geographical property saying that you are not given the distance between two centers so how do you know which is the nearest and also if asked for top 3 nearest centers. I was not able to come up with any solution then. He was also asking me again and again what data you need to solve this thing. Would be really helpful to know what could be the approach for this as it has been bugging me for days. Thanks
Most algorithms deal with single points - just taking the centre point of a zip code area should suffice.
For a single nearest neighbour, a Voronoi diagram seem like the way to go.
It separates the space into regions such that, given any query point, we know which point is closest.
Taken from Wikipedia:
A kd-tree is also an option:
The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces. Points to the left of this hyperplane are represented by the left subtree of that node and points right of the hyperplane are represented by the right subtree. The hyperplane direction is chosen in the following way: every node in the tree is associated with one of the k-dimensions, with the hyperplane perpendicular to that dimension's axis. So, for example, if for a particular split the "x" axis is chosen, all points in the subtree with a smaller "x" value than the node will appear in the left subtree and all points with larger "x" value will be in the right subtree. In such a case, the hyperplane would be set by the x-value of the point, and its normal would be the unit x-axis.
Finding the k nearest neighbours is significantly more difficult. There is a k nearest neighbours algorithm, but this is a classification algorithm, so I'm not sure it helps here.
One option is to create a grid of the region. Then, given a point, we know which cell it's in, and we can simply query that cell and its neighbours until we've found the desired number of neighbours.
One just has to be careful here, as the next nearest point can actually be in another cell, e.g.:
--------------
| B|
A | X |
| |
| |
--------------
Given point X, the closest point is A, but B would be returned if we simply look in the same cell. We also need to look at all neighbouring cells after we've found k points.
You need the whole road network which is a sparse matrix containing the distance between all of the nodes. You also need to have the list of nodes containing the service centers. Having this, I think the A* algorithm should do the job in determining the distance between a given location and each service center, then picking up the least three in distance. I am certain there are more efficient algorithms but I believe the interviewer should concentrate on the way you think to resolve a problem rather than asking for implementation details such as data structures. Would I have to solve such a problem in real life, I would do a literature research first.
I am not sure about what strategy is best when facing such interviewer and if he would have accepted such a response. Being assertive and providing an overview of the solution before diping into the details might have been better.
Do not have regrets though. Benefit from the experience and move on. You do not know what bounties God has in store for you.

How to find the neighbors of a graph effiiciently

I have a program that create graphs as shown below
The algorithm starts at the green color node and traverses the graph. Assume that a node (Linked list type node with 4 references Left, Right, Up and Down) has been added to the graph depicted by the red dot in the image. Inorder to integrate the newly created node with it neighbors I need to find the four objects and link it so the graph connectivity will be preserved.
Following is what I need to clarify
Assume that all yellow colored nodes are null and I do not keep a another data structure to map nodes what is the most efficient way to find the existence of the neighbors of the newly created node. I know the basic graph search algorithms like DFS, BFS etc and shortest path algorithms but I do not think any of these are efficient enough because the graph can have about 10000 nodes and doing graph search algorithms (starting from the green node) to find the neighbors when a new node is added seems computationally expensive to me.
If the graph search is not avoidable what is the best alternative structure. I thought of a large multi-dimensional array. However, this has memory wastage and also has the issue of not having negative indexes. Since the graph in the image can grow in any directions. My solution to this is to write a separate class that consists of a array based data structure to portray negative indexes. However, before taking this option I would like to know if I could still solve the problem without resolving to a new structure and save a lot of rework.
Thank you for any feedback and reading this question.
I'm not sure if I understand you correctly. Do you want to
Check that there is a path from (0,0) to (x1,y1)
or
Check if any of the neighbors of (x1,y1) are in the graph? (even if there is no path from (0,0) to any of this neighbors).
I assume that you are looking for a path (otherwise you won't use a linked-list), which implies that you can't store points which have no path to (0,0).
Also, you mentioned that you don't want to use any other data structure beside / instead of your 2D linked-list.
You can't avoid full graph search. BFS and DFS are the classic algorithms. I don't think that you care about the shortest path - any path would do.
Another approaches you may consider is A* (simple explanation here) or one of its variants (look here).
An alternative data structure would be a set of nodes (each node is a pair < x,y > of course). You can easily run 4 checks to see if any of its neighbors are already in the set. It would take O(n) space and O(logn) time for both check and add. If your programming language does not support pairs as nodes of a set, you can use a single integer (x*(Ymax+1) + Y) instead.
Your data structure can be made to work, but probably not efficiently. And it will be a lot of work.
With your current data structure you can use an A* search (see https://en.wikipedia.org/wiki/A*_search_algorithm for a basic description) to find a path to the point, which necessarily finds a neighbor. Then pretend that you've got a little guy at that point, put his right hand on the wall, then have him find his way clockwise around the point. When he gets back, he'll have found the rest.
What do I mean by find his way clockwise? For example suppose that you go Down from the neighbor to get to his point. Then your guy should be faced the first of Right, Up, and Left which he has a neighbor. If he can go Right, he will, then he will try the directions Down, Right, Up, and Left. (Just imagine trying to walk through the maze yourself with your right hand on the wall.)
This way lies insanity.
Here are two alternative data structures that are much easier to work with.
You can use a quadtree. See http://en.wikipedia.org/wiki/Quadtree for a description. With this inserting a node is logarithmic in time. Finding neighbors is also logarithmic. And you're only using space for the data you have, so even if your graph is very spread out this is memory efficient.
Alternately you can create a class for a type of array that takes both positive and negative indices. Then one that builds on that to be 2-d class that takes both positive and negative indices. Under the hood that class would be implemented as a regular array and an offset. So an array that can start at some number, positive or negative. If ever you try to insert a piece of data that is before the offset, you create a new offset that is below that piece by a fixed fraction of the length of the array, create a new array, and copy data from the old to the new. Now insert/finding neighbors are usually O(1) but it can be very wasteful of memory.
You can use a spatial index like a quad tree or a r-tree.

Resources