What kind of data structure could be used for an efficient nearest neighbor search in a large set of geo coordinates? With "regular" spatial index structures like R-Trees that assume planar coordinates, I see two problems (Are there others I have overlooked?):
Wraparound at the poles and the International Date Line
Distortion of distances near the poles
How can these factors be allowed for? I guess the second one could be compensated for by transforming the coordinates. Can an R-Tree be modified to take wraparound into account? Or are there specialized geo-spatial index structures?
Could you use a locality-sensitive hashing (LSH) algorithm in 3 dimensions? That would quickly give you an approximate neighboring group which you could then sanity-check by calculating great-circle distances.
Here's a paper describing an algorithm for efficient LSH on the surface of a unit d-dimensional hypersphere. Presumably it works for d=3.
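Here is a rough, hedged sketch of that idea in Python: map each latitude/longitude to a unit vector in 3D and bucket points with random-hyperplane LSH. The number of hyperplanes, the sample points, and the helper names are illustrative only; candidates from a bucket would still be verified with great-circle distances as suggested above.

```python
import numpy as np

def to_unit_vector(lat_deg, lon_deg):
    """Map geographic coordinates to a point on the unit sphere in 3D."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def lsh_key(v, hyperplanes):
    """Sign pattern of v against the random hyperplanes, used as a bucket key."""
    return tuple(bool(s) for s in (hyperplanes @ v > 0))

rng = np.random.default_rng(0)
hyperplanes = rng.normal(size=(16, 3))      # 16 random hyperplanes -> up to 2^16 buckets

points = [(52.52, 13.40), (48.86, 2.35), (40.71, -74.01)]   # Berlin, Paris, New York
buckets = {}
for lat, lon in points:
    key = lsh_key(to_unit_vector(lat, lon), hyperplanes)
    buckets.setdefault(key, []).append((lat, lon))

# A query point hashes the same way; candidates in its bucket (and, in practice,
# in neighbouring buckets) are then verified exactly with great-circle distances.
```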
Take a look at Geohash.
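For reference, a minimal Geohash encoder sketch (standard base32 alphabet, longitude bit first). Nearby points tend to share a hash prefix, which is what makes the code usable as a crude spatial index, with the usual caveat that neighbours straddling a cell boundary can have very different prefixes.

```python
# Minimal Geohash encoding sketch.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=11):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    code, bits, ch, even = [], 0, 0, True      # even bit -> longitude, odd bit -> latitude
    while len(code) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch, lon_lo = (ch << 1) | 1, mid
            else:
                ch, lon_hi = ch << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = (ch << 1) | 1, mid
            else:
                ch, lat_hi = ch << 1, mid
        even, bits = not even, bits + 1
        if bits == 5:                          # 5 bits per base32 character
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)

print(geohash_encode(57.64911, 10.40744))      # well-known test value: u4pruydqqvj
```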
Also, to compensate for wraparound, simply use not one but three orthogonal R-trees, so that there does not exist a point on the Earth's surface such that all three trees have a wraparound at that point. Then, two points are close if they are close according to at least one of these trees.
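One way to realise that idea (a hedged sketch, not necessarily what the answer had in mind): store each point three times, once per rotated coordinate frame, so that no location sits near the date line or a pole in all three frames at once. Each frame's (lat, lon) pairs would then go into an ordinary planar R-tree. The specific rotations below are illustrative.

```python
import numpy as np

def to_xyz(lat_deg, lon_deg):
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def to_latlon(v):
    x, y, z = v
    return np.degrees(np.arcsin(z)), np.degrees(np.arctan2(y, x))

ROTATIONS = [
    np.eye(3),
    np.array([[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]),   # 90 deg about x
    np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]),   # 90 deg about y
]

def frame_coordinates(lat, lon):
    """The (lat, lon) representation of one point in each of the three frames."""
    v = to_xyz(lat, lon)
    return [to_latlon(R @ v) for R in ROTATIONS]

# Each representation goes into its own planar R-tree; a query searches all three
# trees and keeps the best candidate after verifying with great-circle distance.
print(frame_coordinates(89.9, 0.0))   # near the pole in frame 0, but not in frames 1 or 2
```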
I'm reading about image search and I've gotten to the point where I have a basic understanding of feature vectors and a very basic (definitely incomplete) understanding of rotation-invariant and scale-invariant features, e.g. how you can look at multi-sampled images for scale invariance and at corners for rotational invariance.
To search a billion images though there is no way you could do a linear search. Most of my reading seems to imply a K-d tree is used as a partitioning data structure to improve the lookup times.
What metric is the K-d tree split on? If you use descriptors like SIFT, SURF, or ORB there is no guarantee your similar keypoints line up in the feature vectors, so I'm confused how you determine 'left' or 'right', since with features like this you need the split to be based on similarity. My guess is that you split on Euclidean distance from some 'standard' and then do a robust nearest neighbor search, but I would like some input on how the initial query into the K-d tree is handled before the nearest neighbor search. I would think a K-d tree needs to be comparing similar features in each dimension, but I don't see how that happens with many keypoints.
I can find a lot of papers on nearest neighbor search, but most seem to assume you already know how this is handled, so I'm missing something here.
It's quite simple. All of these feature descriptors represent an image as a point in multidimensional space. Just for the sake of simplicity, let's assume that your descriptor dimension is 2. Then all your images would be mapped onto a two-dimensional plane, and the kd-tree would split this plane into rectangular areas. Any images that fall within the same area would be considered similar.
That means, by the way, that two images which lie really close to each other but in different areas (leaves of the kd-tree) will not be considered similar.
To overcome this issue, cosine similarity can be used instead of Euclidean distance. You can read more about the subject on Wikipedia.
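A small, hedged illustration of both points, assuming SciPy is available: L2-normalising the descriptor vectors makes cosine similarity a monotone function of Euclidean distance, so the same kd-tree lookup covers both metrics. The 64-dimensional random "descriptors" below are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
descriptors = rng.random((10_000, 64))
unit = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)

tree = cKDTree(unit)                          # kd-tree over unit-length descriptors

query = rng.random(64)
query /= np.linalg.norm(query)

dist, idx = tree.query(query, k=5)            # 5 nearest descriptors by Euclidean distance
cosine_sim = 1 - dist**2 / 2                  # ||u - v||^2 = 2 - 2*cos(theta) on the unit sphere
print(idx, cosine_sim)
```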
I would like to know if I am missing any acceleration structure that is designed for retrieving k-nearest spheres within a range.
The context of my question is molecular visualization, specifically, I need to retrieve k-nearest spheres to a point to produce a function that will be used to guide sphere tracing step length.
To simplify, the search can be limited in range to the point being tested.
All the articles I have seen handle k-nearest points to a point, but my case is different, since I want to work with the spheres closest to a point. It seems possible to adapt kd-trees, changing the test from points to spheres, but I believe that would affect performance. So I wonder if there is a better structure or if I should adapt kd-trees.
Currently, I am using a hybrid Bounding Volume Hierarchy, but I think the search performance could be better with another structure, since I have a large overlap of bounding volumes due to the nature of the molecules.
PS: I don't care much about the construction time. I want good search performance and decent memory occupation.
You could use a 3-step approach:
Find the nearest neighbor using the center-points of the spheres.
For this nearest neighbor, take its center distance, subtract its radius and add the maximum radius. Then perform a spherical range query with this new distance. This will return the center points of all spheres that may be the closest to your query point.
Then you manually calculate the actual distance for each candidate sphere using its actual radius.
This should be reasonably efficient assuming that the radius of spheres is not massively bigger than their average distance.
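A rough sketch of those three steps in Python with a kd-tree over the sphere centres. The random centres and radii are placeholders, and the helper name `nearest_sphere` is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
centres = rng.random((1_000, 3)) * 50.0
radii = rng.uniform(0.5, 1.5, size=1_000)
r_max = radii.max()

tree = cKDTree(centres)

def nearest_sphere(query):
    # Step 1: nearest centre as an initial guess.
    d_centre, i = tree.query(query)
    # Step 2: any sphere whose surface could be closer has its centre within this range.
    search_radius = d_centre - radii[i] + r_max
    candidates = tree.query_ball_point(query, search_radius)
    # Step 3: exact surface distance (centre distance minus radius) for every candidate.
    surface_d = np.linalg.norm(centres[candidates] - query, axis=1) - radii[candidates]
    best = int(np.argmin(surface_d))
    return candidates[best], surface_d[best]

idx, dist = nearest_sphere(np.array([25.0, 25.0, 25.0]))
print(idx, dist)
```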
I have a set of 300,000 or so vectors which I would like to compare in some way: given one vector, I want to be able to find the closest vector. I have thought of four methods.
Simple Euclidean distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear: are you looking for a distance metric between vectors, or an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as the kd-tree (since you are just looking at points in d-dimensional space). See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are very sparse in nature and binary, you can use the Hamming or Hellinger distance. When your vector dimension is large, avoid using Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
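For illustration, a couple of hedged examples of those metrics computed directly on made-up data (Hamming via SciPy on binary vectors, Hellinger from its definition for discrete distributions):

```python
import numpy as np
from scipy.spatial.distance import hamming

# Hamming distance on binary vectors: the fraction of positions that differ.
a = np.array([0, 1, 0, 0, 1, 0, 0, 0], dtype=bool)
b = np.array([0, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
print(hamming(a, b))                     # 0.25 -> 2 of 8 positions differ

# Hellinger distance between two discrete probability distributions p and q.
p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.4, 0.4, 0.1, 0.1])
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
print(hellinger)
```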
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.
I'm working on an app that lets users select regions by finger painting on top of a map. The points then get converted to a latitude/longitude and get uploaded to a server.
The touch screen is delivering way too many points to be uploaded over 3G. Even small regions can accumulate up to ~500 points.
I would like to smooth this touch data (approximate it within some tolerance). The accuracy of drawing does not really matter much as long as the general area of the region is the same.
Are there any well-known algorithms to do this? Is this a job for a Kalman filter?
There is the Ramer–Douglas–Peucker algorithm (wikipedia).
The purpose of the algorithm is, given a curve composed of line segments, to find a similar curve with fewer points. The algorithm defines 'dissimilar' based on the maximum distance between the original curve and the simplified curve. The simplified curve consists of a subset of the points that defined the original curve.
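A compact, hedged Ramer–Douglas–Peucker implementation for 2D point sequences; `epsilon` is the maximum allowed deviation, in the same units as the coordinates (e.g. degrees of latitude/longitude).

```python
import numpy as np

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    if np.array_equal(a, b):
        return np.linalg.norm(p - a)
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / np.linalg.norm(b - a)

def rdp(points, epsilon):
    points = [np.asarray(p, dtype=float) for p in points]
    if len(points) < 3:
        return points
    # Distance of every interior point to the chord joining the endpoints.
    d = [point_line_distance(p, points[0], points[-1]) for p in points[1:-1]]
    i = int(np.argmax(d)) + 1
    if d[i - 1] <= epsilon:
        return [points[0], points[-1]]           # everything in between can be dropped
    # Otherwise keep the farthest point and simplify both halves recursively.
    return rdp(points[: i + 1], epsilon)[:-1] + rdp(points[i:], epsilon)

path = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(rdp(path, epsilon=1.0))
```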
You probably don't need anything too exotic to dramatically cut down your data.
Consider something as simple as this:
Construct some sort of error metric. An easy one would be a normalized sum of the distances from the omitted points to the line that was approximating them. Decide what a tolerable error using this metric is.
Then, starting from the first point, construct the longest line segment that falls within the tolerable error range. Repeat this process until you have converted the entire path into a polyline.
This will not give you the globally optimal approximation but it will probably be good enough.
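Sketched out, the greedy variant described above might look like this; the error metric used here (mean distance of the skipped points to the candidate segment) is just one possible choice, and the sample path is made up.

```python
import numpy as np

def seg_distance(p, a, b):
    """Distance from point p to the line through segment endpoints a and b."""
    p, a, b = (np.asarray(q, dtype=float) for q in (p, a, b))
    if np.allclose(a, b):
        return np.linalg.norm(p - a)
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / np.linalg.norm(b - a)

def greedy_simplify(points, tolerance):
    keep = [0]
    start, end = 0, 2
    while end < len(points):
        skipped = points[start + 1 : end]
        err = np.mean([seg_distance(p, points[start], points[end]) for p in skipped])
        if err > tolerance:
            keep.append(end - 1)          # segment grew too long; keep the previous end
            start, end = end - 1, end + 1
        else:
            end += 1                      # still within tolerance; try a longer segment
    keep.append(len(points) - 1)
    return [points[i] for i in keep]

path = [(0, 0), (1, 0.05), (2, 0.1), (3, 2), (4, 4), (5, 4.1), (6, 4.05)]
print(greedy_simplify(path, tolerance=0.1))   # keeps roughly the corner points
```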
If you want the approximation to be more "curvy" you might consider using splines or Bézier curves rather than straight line segments.
You want to subdivide the surface into a grid with a quadtree or a space-filling curve. An SFC reduces the 2D complexity to a 1D complexity. You want to look for Nick's Hilbert curve quadtree spatial index blog.
I was going to do something like this in an app, but was intending to generate a path from the points on the fly. I was going to use a technique mentioned in this Point Sequence Interpolation thread.
I just finished implementing a kd-tree for doing fast nearest neighbor searches. I'm interested in playing around with different distance metrics other than the Euclidean distance. My understanding of the kd-tree is that the speedy kd-tree search is not guaranteed to give exact searches if the metric is non-Euclidean, which means that I might need to implement a new data structure and search algorithm if I want to try out new metrics for my search.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of all differences in vector components), your hypersphere would become a (multidimensional) diamond. (This is easiest to visualise in 2D: if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must intersect a diamond shape that has width and height 2x and is centred on p.) This might make the hyperplane-crossing test more difficult to code or slower to run; however, the general principle still applies.
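As a concrete aside: if you happen to use SciPy rather than your own implementation, its kd-tree already exposes the whole Minkowski family, so the Manhattan case is just a query parameter rather than a new data structure. The random example data below is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((100_000, 3))
tree = cKDTree(points)

query = np.array([0.5, 0.5, 0.5])
d2, i2 = tree.query(query, k=3, p=2)     # Euclidean nearest neighbours (the default)
d1, i1 = tree.query(query, k=3, p=1)     # Manhattan nearest neighbours
print(i2, i1)
```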
I don't think you're tied to Euclidean distance - as j_random_hacker says, you can probably use Manhattan distance - but I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates. So you couldn't use a kd-tree to index a general metric space, for example.