Measuring distance between vectors - probability

I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest vector. I have thought of four methods:
1. Simple Euclidean distance
2. Cosine similarity
3. Use a kernel (for instance Gaussian) to calculate the Gram matrix
4. Treat the vectors as discrete probability distributions (which makes sense to do) and calculate some divergence measure
I do not really understand when it is useful to choose one rather than another. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of the four methods is best?
Sorry for the weak question, but I had to start somewhere...
Thank you!

Your question is not quite clear: are you looking for a distance metric between vectors, or for an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as a kd-tree, since you are just looking at points in d-dimensional space. See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
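For the numeric case, a minimal nearest-neighbour lookup with SciPy's kd-tree might look like the sketch below (the data shape and the use of SciPy are assumptions, and note that kd-trees lose their advantage as the dimension grows):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    vectors = rng.random((300_000, 16))   # stand-in for the ~300,000 vectors

    tree = cKDTree(vectors)               # built once, then queried many times

    query = rng.random(16)
    dist, idx = tree.query(query, k=1)    # distance and index of the nearest vector
    print(idx, dist)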

If your vectors are very sparse and binary, you can use the Hamming or Hellinger distance. When your vector dimension is large, avoid using the Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.
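As a concrete illustration of the two distances named above, here is a sketch assuming NumPy/SciPy (the Hellinger distance is computed by hand, since it applies to probability vectors):

    import numpy as np
    from scipy.spatial.distance import hamming

    u = np.array([1, 0, 0, 1, 0], dtype=bool)
    v = np.array([1, 1, 0, 0, 0], dtype=bool)
    print(hamming(u, v))                  # fraction of positions that differ: 0.4

    def hellinger(p, q):
        """Hellinger distance between two discrete distributions (each sums to 1)."""
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    p = np.array([0.7, 0.2, 0.1, 0.0])
    q = np.array([0.5, 0.3, 0.0, 0.2])
    print(hellinger(p, q))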

Related

Similarity for arrays of parts of speech

K-nearest neighbor and natural language processing: how do you measure the distance between arrays of parts of speech, e.g. ('verb','adverb','noun') and ('adjective','adverb','pronoun')?
A better-phrased question would be: how do you measure the similarity between the two in the context that they are parts of speech and not just strings?
As a general approach, you can use the cosine between POS vectors as a measure of their similarity. An alternative approach would be the Hamming distance between the two vectors.
There are plenty of other distance functions between vectors, but it really depends on what you want to do and what your data looks like. You should answer questions like: does position matter? How similar would you consider ('noun', 'verb') and ('verb', 'noun')? Is the distance between ('adverb') and ('adjective') smaller than the distance between ('adverb') and ('noun')? And so on.
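One way to realise the cosine suggestion is to turn each tag sequence into a bag-of-tags count vector over a shared vocabulary, as in this sketch (note that the count representation deliberately ignores order, which is exactly one of the choices the previous paragraph raises):

    import numpy as np

    def pos_counts(tags, vocab):
        """Bag-of-tags: count how often each vocabulary tag occurs in the sequence."""
        return np.array([tags.count(t) for t in vocab], dtype=float)

    a = ('verb', 'adverb', 'noun')
    b = ('adjective', 'adverb', 'pronoun')
    vocab = sorted(set(a) | set(b))

    va, vb = pos_counts(a, vocab), pos_counts(b, vocab)
    cosine = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
    print(cosine)                         # 1/3 here: only 'adverb' is shared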

Good algorithm for finding subsets of point sets

I'm trying to find suitable algorithms for searching for subsets of 2D points in a larger set.
A picture is worth a thousand words, so:
Any ideas on how one could achieve this? Note that the transformations are just rotation and scaling.
It seems that the most closely related problem is point set registration [1].
I was experimenting with CPD (Coherent Point Drift) and other rigid and non-rigid registration implementations, but they don't seem to perform well at finding small subsets within larger sets of points.
Another approach could be to use star-tracking algorithms such as the Angle method mentioned in [2], or more robust methods like [3]. But again, they all seem to be meant for large input and target sets. I'm looking for something less reliable but more minimalistic...
Thanks for any ideas!
[1]: http://en.wikipedia.org/wiki/Point_set_registration
[2]: http://www.acsu.buffalo.edu/~johnc/star_gnc04.pdf
[3]: http://arxiv.org/abs/0910.2233
Here are some papers probably related to your question:
Geometric Pattern Matching under Euclidean Motion (1993) by L. Paul Chew, Michael T. Goodrich, Daniel P. Huttenlocher, Klara Kedem, Jon M. Kleinberg, Dina Kravets.
A fast expected time algorithm for the 2-D point pattern (2004) by Van Wamelen, Iyengar.
Simple algorithms for partial point set pattern matching under rigid motion (2006) by Bishnu, Das, Nandy, Bhattacharya.
Exact and approximate Geometric Pattern Matching for point sets in the plane under similarity transformations (2007) by Aiger and Kedem.
and by the way, your last reference reminded me of:
An Application of Point Pattern Matching in Astronautics (1994) by G. Weber, L. Knipping and H. Alt.
I think you should start with a subset of the input points and determine the transformation required to match a subset of the large set. For example (a code sketch follows the list):
choose any two points of the input, say A and B;
map A and B to a pair of points in the large set; this determines the scale and two possible rotation angles (clockwise or counterclockwise);
apply the same scaling and rotation to a third input point C and check whether the corresponding point exists in the large set. You'll have to check two positions, one for each rotation angle. If C lands where it should in the large set, check the rest of the points;
repeat for each pair of points in the large set.
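A rough sketch of those steps, assuming NumPy/SciPy (the function name and the tolerance are illustrative). A complex multiplier encodes rotation plus scale in one factor, and trying every ordered pair covers both rotation directions:

    import numpy as np
    from scipy.spatial import cKDTree

    def find_pattern(pattern, points, tol=1e-6):
        """Search `points` (n x 2) for a rotated/scaled copy of `pattern` (m x 2)."""
        P = pattern[:, 0] + 1j * pattern[:, 1]   # pattern points as complex numbers
        S = points[:, 0] + 1j * points[:, 1]
        tree = cKDTree(points)
        A, B = P[0], P[1]                        # the two chosen input points
        for i in range(len(S)):
            for j in range(len(S)):              # every ordered pair: O(n^2) candidates
                if i == j:
                    continue
                a = (S[j] - S[i]) / (B - A)      # complex factor = rotation + scale
                b = S[i] - a * A                 # translation fixing A -> S[i]
                mapped = a * P + b               # image of the whole pattern
                d, _ = tree.query(np.c_[mapped.real, mapped.imag])
                if np.all(d < tol):              # every mapped point has a match
                    return a, b
        return None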
I think you could also try to match a subset of 3 input points, knowing that the angles of a triangle will be invariant under scaling and rotations.
Those are my ideas, I hope they help solve your problem.
I would try the Iterative Closest Point algorithm. A simple version like the one you need should be easy to implement.
Take a look at geometric hashing. It allows finding geometric patterns under different transformations. If you use only rotation and scale, it will be quite simple.
The main idea is to encode the pattern in "native" coordinates, which are invariant under the transformations.
You can try a geohash: translate the points to binary and interleave the bits, which gives a Z-order (Morton) curve. Measure the distance between the codes and compare it with the original distance. You can also try rotating the geohash.
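A minimal bit-interleaving sketch for two non-negative integer coordinates; nearby codes often, though not always, correspond to nearby points:

    def interleave_bits(x, y, bits=16):
        """Interleave the bits of x and y into a single Morton (Z-order) code."""
        code = 0
        for i in range(bits):
            code |= ((x >> i) & 1) << (2 * i)        # x occupies the even bit positions
            code |= ((y >> i) & 1) << (2 * i + 1)    # y occupies the odd bit positions
        return code

    print(bin(interleave_bits(0b1010, 0b0110)))      # 0b1101100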

Best algorithm to interpolate on a grid

I have a set of points whose coordinates are given by the arrays x, y and z, and the value of the density field at each point is stored in the array d.
I would like to reconstruct the density field on a uniform grid. What's the best algorithm to do that?
I know that in Python the scipy module comes in handy with the griddata function, but I would like to write my own code; I just need a hint.
If you have some sort of scalar field and the points are the origins of the field, you can implement a brute force approach by walking all lattice points and calculating the field intensity given the sources. There are both recursive methods that allow "blanking" wide volumes where the field is more or less constant, and techniques to save some CPU time by calculating the variations from one point to the next.
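A brute-force version of that walk might look like this (a sketch; the 1/r field law and all the sizes are arbitrary assumptions):

    import numpy as np

    sources = np.random.rand(50, 3)              # source positions in the unit cube
    strength = np.random.rand(50)                # per-source intensities

    axis = np.linspace(0.0, 1.0, 32)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing='ij')
    grid = np.stack([gx, gy, gz], axis=-1)       # (32, 32, 32, 3) lattice coordinates

    # distance from every lattice point to every source: shape (32, 32, 32, 50)
    r = np.linalg.norm(grid[..., None, :] - sources, axis=-1)
    field = (strength / np.maximum(r, 1e-9)).sum(axis=-1)   # 1/r falloff, summed over sources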
If the points you have are samplings of a value, then you will have to decompose your space in volumes and interpolate the values. You can employ a simple Voronoi decomposition - this is usually done in 2D for precipitation measurements - or a Delaunay tetrahedralization (you can look into TetGen's documentation). The first approach assumes that the function is constant throughout each Voronoi volume; the last allows rendering a trilinear interpolation.
If you need to smooth a 3D grid, the trilinear interpolation looks like the best approach.
There are also other methods used for fast visualization, which involve maintaining a list of 3D points in order of distance from any one given point in your regular grid. When moving through the grid, you recalculate distances using quadratic increments. Then, you perform a simple interpolation based on a subset of points of chosen cardinality (i.e., if you consider the four nearest points at distances d1..d4, you would calculate the value at P by proportionally weighting the values v1..v4). This approach is fast and easy to implement by yourself, but be warned that it underperforms wherever the minimum distance between points is less than the lattice step (you can compensate by considering more points where this happens; the effect is less evident if the sampled function is smooth at the same scale).
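The distance-weighted scheme from the last paragraph, reduced to its simplest form (a sketch; the helper name, k and the weighting power are assumptions):

    import numpy as np
    from scipy.spatial import cKDTree

    def idw_on_grid(samples, values, grid_nodes, k=4, power=2, eps=1e-12):
        """Interpolate scattered (point, value) samples onto grid_nodes (m x 3)."""
        tree = cKDTree(samples)
        dist, idx = tree.query(grid_nodes, k=k)      # k nearest samples per grid node
        weights = 1.0 / (dist ** power + eps)        # closer samples weigh more
        return (weights * values[idx]).sum(axis=1) / weights.sum(axis=1)

    # usage with the question's arrays: idw_on_grid(np.c_[x, y, z], d, grid_nodes)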
If you want to implement a mathematical method yourself, you need to learn the theory, of course. In this case, it's 3D scattered data interpolation.
Wikipedia, MATLAB help and scipy help say there are at least half a dozen different methods. WP has a fairly good description of them and there's a comparison article but I strongly suggest you find something in your native language on such a terminology-intensive subject.
One approach is to form the Delaunay triangulation of the scattered points [x,y,z], (actually a tetrahedralisation in your 3d case!) and perform interpolation within each element using a linear representation of the density field, defined at the tetrahedron vertices.
To evaluate the density at each structured grid point you would (i) determine which tetrahedron the point lay within and (ii) evaluate the linear interpolant.
Forming the Delaunay triangulation is non-trivial, but there are a few good libraries that can be used for this, depending on your language of choice. One good option is CGAL.
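If you do end up reaching for a library, SciPy wraps exactly this Delaunay-plus-linear-interpolant scheme; a sketch using the x, y, z and d arrays from the question (the grid resolution is an assumption):

    import numpy as np
    from scipy.interpolate import LinearNDInterpolator

    pts = np.column_stack([x, y, z])          # scattered sample locations
    interp = LinearNDInterpolator(pts, d)     # Delaunay tetrahedralisation + linear model

    axis = np.linspace(pts.min(), pts.max(), 64)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing='ij')
    density = interp(gx, gy, gz)              # NaN outside the convex hull of the samples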
Hope this helps.

Spatial index for geo coordinates?

What kind of data structure could be used for an efficient nearest neighbor search in a large set of geo coordinates? With "regular" spatial index structures like R-Trees that assume planar coordinates, I see two problems (Are there others I have overlooked?):
Wraparound at the poles and the International Date Line
Distortion of distances near the poles
How can these factors be allowed for? I guess the second one could be compensated for by transforming the coordinates. Can an R-Tree be modified to take wraparound into account? Or are there specialized geo-spatial index structures?
Could you use a locality-sensitive hashing (LSH) algorithm in 3 dimensions? That would quickly give you an approximate neighboring group which you could then sanity-check by calculating great-circle distances.
Here's a paper describing an algorithm for efficient LSH on the surface of a unit d-dimensional hypersphere. Presumably it works for d=3.
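If exactness matters, the same unit-sphere embedding works with a plain kd-tree instead of LSH, since chord (Euclidean) distance is monotone in great-circle distance. A sketch assuming NumPy/SciPy and a spherical earth:

    import numpy as np
    from scipy.spatial import cKDTree

    def to_unit_xyz(lat_deg, lon_deg):
        """Embed latitude/longitude (degrees) as points on the unit sphere."""
        lat, lon = np.radians(lat_deg), np.radians(lon_deg)
        return np.column_stack([np.cos(lat) * np.cos(lon),
                                np.cos(lat) * np.sin(lon),
                                np.sin(lat)])

    coords = np.random.uniform([-90, -180], [90, 180], size=(100_000, 2))
    tree = cKDTree(to_unit_xyz(coords[:, 0], coords[:, 1]))

    chord, idx = tree.query(to_unit_xyz([52.5], [13.4]))   # query point near Berlin
    angle = 2 * np.arcsin(chord / 2)                       # great-circle distance in radians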
Take a look at Geohash.
Also, to compensate for wraparound, simply use not one but three orthogonal R-trees, so that there does not exist a point on the earth's surface at which all three trees have a wraparound. Then, two points are close if they are close according to at least one of these trees.

Can I use arbitrary metrics to search KD-Trees?

I just finished implementing a kd-tree for doing fast nearest neighbor searches. I'm interested in playing around with distance metrics other than the Euclidean distance. My understanding is that the speedy kd-tree search is not guaranteed to give exact results if the metric is non-Euclidean, which means that I might need to implement a new data structure and search algorithm if I want to try out new metrics for my search.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of all differences in vector components), your hypersphere would become a (multidimensional) diamond. (This is easiest to visualise in 2D: if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must intersect a diamond shape that has width and height 2x and is centred on p.) This might make the hyperplane-crossing test more difficult to code or slower to run; however, the general principle still applies.
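In fact, SciPy's kd-tree already implements this generalisation for the Minkowski family via its p parameter, matching the hypersphere/diamond picture above (a sketch):

    import numpy as np
    from scipy.spatial import cKDTree

    pts = np.random.rand(10_000, 3)
    tree = cKDTree(pts)
    query = np.random.rand(3)

    d2, i2 = tree.query(query, k=1, p=2)           # Euclidean: hypersphere
    d1, i1 = tree.query(query, k=1, p=1)           # Manhattan: the "diamond" above
    dinf, iinf = tree.query(query, k=1, p=np.inf)  # Chebyshev: axis-aligned cube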
I don't think you're tied to Euclidean distance; as j_random_hacker says, you can probably use Manhattan distance. But I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates, so you couldn't use a kd-tree to index a general metric space, for example.
