What is the ε (epsilon) parameter in Locality Sensitive Hashing (LSH)? - computational-geometry

I've read the original paper about Locality Sensitive Hashing.
The complexity is given as a function of the parameter ε, but I don't understand what it is.
Can you explain its meaning please?

ε is the approximation parameter.
LSH (like FLANN and kd-GeRaF) is designed for high-dimensional data. In that space, exact k-NN doesn't work well; in fact, it is almost as slow as brute force because of the curse of dimensionality.
For that reason, we focus on solving the approximate k-NN. Check Definition 1 from our paper, which basically says that it's OK to return an approximate neighbor lying at most (1 + ε) times further away than the exact neighbor.
Check the image below:
Here you can see what it means to find the exact vs. the approximate NN. In the traditional problem of NNS (Nearest Neighbor Search), we are asked to find the exact NN. In the modern problem, approximate NNS, we are asked to find some neighbor inside the (1 + ε) radius, so either the exact NN or an approximate one is a valid answer!
So, with a high probability, LSH will return a NN inside that (1+ε) radius. For ε = 0, we actually solve the exact NN problem.
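To make the (1 + ε) guarantee concrete, here is a minimal Python/NumPy sketch (my own illustration, not from the paper) that checks whether a candidate neighbor satisfies it against a brute-force exact NN:

```python
import numpy as np

def is_valid_approximate_nn(query, candidate, points, eps):
    """Return True if `candidate` satisfies the (1 + eps) guarantee:
    its distance to the query is at most (1 + eps) times the distance
    of the true (exact) nearest neighbor in `points`."""
    exact_dist = np.min(np.linalg.norm(points - query, axis=1))
    cand_dist = np.linalg.norm(candidate - query)
    return cand_dist <= (1.0 + eps) * exact_dist

# Toy usage: the exact NN always passes the check, even with eps = 0.
points = np.random.rand(1000, 16)
query = np.random.rand(16)
exact_nn = points[np.argmin(np.linalg.norm(points - query, axis=1))]
print(is_valid_approximate_nn(query, exact_nn, points, eps=0.0))  # True
```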

Related

TSP Heuristics - Worst case ratio

I'm having some trouble trying to summarize the worst-case ratios of these heuristics for the metric traveling salesman problem (meaning the distances satisfy the triangle inequality):
Nearest neighbor
Nearest insertion
Cheapest insertion
Farthest insertion
Nearest neighbor:
Here it says that NN has a worst-case ratio of
This one (page 8), the same as this one, says that it is
Which changes a lot.
Insertion algorithms:
Pretty much everyone agrees that the worst-case ratio for cheapest and nearest insertion is <= 2 (again, only for instances satisfying the triangle inequality), but when it comes to farthest insertion, every source differs:
here:
(forgot to change NN to FI)
While here
It is
And here there is also a different one:
Regarding the FI, I think that it depends on the starting sub-tour.
But for NN, whether that bracket is a ceiling or a floor changes the result a lot, and since they all come from good sources, I can't figure out which is right.
Can someone summarize the actual known worst-case ratios for these algorithms?
NN: The correct bound uses ceiling, not floor (at least as proved in the original paper by Rosenkrantz et al. -- here, if you have access). I don't think there's a more recent bound that uses floor.
FI: Rosenkrantz et al. prove that the first bound applies to any insertion heuristic, including FI. Moreover, that bound is better than the other two (except for very small n), so I would use that bound. Note, however, that log really means log_2 in that formula. (I'm not sure where the other two bounds came from.)
One other note: It is known that there is no fixed worst-case bound for NN. It is not known whether there is a fixed worst-case bound for FI.
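For completeness, here is a rough Python sketch of the nearest neighbor construction heuristic the bounds above talk about; the distance-matrix representation and function names are my own assumptions for illustration, not taken from any of the cited sources:

```python
import numpy as np

def nearest_neighbor_tour(dist, start=0):
    """Greedy nearest-neighbor TSP heuristic on a symmetric distance
    matrix `dist` (assumed to satisfy the triangle inequality).
    Returns the visiting order as a list of city indices."""
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: dist[last][j])  # closest unvisited city
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(dist, tour):
    """Total length of the closed tour."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))
```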

How to calculate the average time complexity of the nearest neighbor search using kd-tree?

We know the complexity of nearest neighbor search in a kd-tree is O(log n). But how is that derived? The main problem is the average time complexity of the backtracking. I have tried to read the paper "An Algorithm for Finding Best Matches in Logarithmic Expected Time", but it is too complicated for me. Does anyone know a simple way to calculate it?
The calculation in the paper is about as simple as possible for a rigorous analysis.
(NB This is the price of being a true computer scientist and software engineer. You must put the effort into learning the math. Knowing the math is what separates people who think they can write solid programs from those who actually can. Jon Bentley, the guy who invented kd-trees, did so when he was in high school. Take this as inspiration.)
If you want a rough intuitive idea that is not rigorous, here is one.
Assume we are working in 2d. The sizes of the geometric areas represented by the 2d-tree are the key.
In the average case, one point partitions the domain into 2 roughly equal-sized rectangles, 3 points into 4, 7 points into 8, and so on. In general, N points lead to about N + 1 roughly equal-sized rectangles.
It is not hard to see that if the domain is 1x1, the length of a side of these parts is on average O(sqrt(1/N)).
When you search for a nearest neighbor, you descend the tree to the rectangle containing the search point. After doing this, you have used O(log N) effort to find a point within R = O(sqrt(1/N)) of the correct one. This is just a point contained in the leaf that you discovered.
But this rectangle is not the only one that must be searched. You must still look at all others containing a point no more than distance R away from the search point, refining R each time you find a closer point.
Fortunately, the O(sqrt(1/N)) limit on R provides a tight bound on the average number of other rectangles this can be. In the average case, it's about 8 because each equal-sized rectangle has no more than 8 neighbors.
So the total effort to search is O(8 log n) = O(log n).
Again, I repeat this is not a rigorous analysis, but it ought to give you a feel for why the algorithm is O(log N) in the average case.
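If you want a quick empirical sanity check of that intuition (this is just a measurement toy, assuming SciPy is available; it is not part of the analysis), you can watch the query time grow very slowly as N increases in 2-D:

```python
import time
import numpy as np
from scipy.spatial import cKDTree

# Build 2-D kd-trees of increasing size and time a fixed batch of queries.
# If the average query cost is O(log N), the timings grow only slightly.
for n in (10_000, 100_000, 1_000_000):
    points = np.random.rand(n, 2)
    tree = cKDTree(points)
    queries = np.random.rand(1_000, 2)
    t0 = time.perf_counter()
    tree.query(queries, k=1)   # nearest neighbor for each query point
    print(f"N = {n:>9}: {time.perf_counter() - t0:.4f} s for 1000 queries")
```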

Weighted A* on wikipedia

I think I found a problem on this wiki page:
I think the phrase "have a cost of at most ε times" in the Weighted A* algorithm part should be "have a cost less than ε times" instead.
Because here it assumes ε > 1. But I am not sure about it; I just want to hear anybody's opinion on this.
Thank you for your help in advance :)
I believe the paragraph starting "Weighted A*. If h_a(n) is" is correct, and a guarantee that the cost of the path found is at most ε times the cost of the best path is the sort of guarantee you want. Since you are looking for the least-cost path and trying to reduce CPU time, you are settling for a sub-optimal (higher-cost) solution, but obtaining a guarantee that the cost is not too bad: at most ε times the cost of the best path.
I do think that there is an inconsistency between the use of ε in this paragraph and that in the paragraph above; I don't know whether that is a mistake or whether it derives from an unfortunate difference of conventions between weighted A* and a more general definition of approximate solutions.
The paragraph is consistent with the notes at http://inst.eecs.berkeley.edu/~cs188/sp11/slides/SP11%20cs188%20lecture%204%20--%20CSPs%206PP.pdf (bottom of page 5 of the PDF) and with a rough proof. When weighted A* thinks it has a solution with cost g(x), all nodes still in play must have a predicted cost g(y) + ε·h(y) of at least this much. To get the largest possible error, assume that g(y) is zero and that ε·h(y) = g(x) for the correct solution y; then the solution weighted A* thinks it has found is ε times as expensive as y, since we presume that the original h() is admissible and therefore a lower bound on the remaining cost.
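To illustrate the mechanics being discussed, here is a minimal weighted A* sketch in Python using f(n) = g(n) + ε·h(n); the dict-based graph and heuristic table are illustrative assumptions, not the Wikipedia pseudocode:

```python
import heapq

def weighted_a_star(graph, h, start, goal, eps=1.5):
    """Weighted A*: expand nodes in order of f(n) = g(n) + eps * h(n).
    With an admissible h and eps >= 1, the returned path costs at most
    eps times the optimal cost. `graph` maps node -> {neighbor: edge cost},
    `h` maps node -> heuristic estimate of the remaining cost."""
    frontier = [(eps * h[start], 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nbr, cost in graph[node].items():
            new_g = g + cost
            if new_g < best_g.get(nbr, float("inf")):
                best_g[nbr] = new_g
                heapq.heappush(frontier, (new_g + eps * h[nbr], new_g, nbr, path + [nbr]))
    return None  # goal unreachable
```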

What are some fast approximations of Nearest Neighbor?

Say I have a huge list (a few million) of n vectors. Given a new vector, I need to find a pretty close one from the set, but it doesn't need to be the closest. (Nearest neighbor search finds the closest and runs in O(n) time.)
What algorithms are there that can approximate nearest neighbor very quickly at the cost of accuracy?
EDIT: Since it will probably help, I should mention the data are pretty smooth most of the time, with a small chance of spikiness in a random dimension.
There exist algorithms faster than O(n) for searching for the closest element under an arbitrary distance. Check http://en.wikipedia.org/wiki/Kd-tree for details.
If you are using high-dimensional vectors, like SIFT or SURF or any descriptor used in the multimedia sector, I suggest you consider LSH.
A PhD dissertation from Wei Dong (http://www.cs.princeton.edu/cass/papers/cikm08.pdf) might help you find an up-to-date algorithm for k-NN search, i.e., LSH. Unlike more traditional LSH, such as E2LSH (http://www.mit.edu/~andoni/LSH/) published earlier by MIT researchers, his algorithm uses multi-probing to better balance the trade-off between recall rate and cost.
A web search on "nearest neighbor" lsh library finds
http://www.mit.edu/~andoni/LSH/
http://www.cs.umd.edu/~mount/ANN/
http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm
For approximate nearest neighbour search, the fastest way is to use locality sensitive hashing (LSH). There are many variants of LSH; you should choose one depending on the distance metric of your data. The big-O of the query time for LSH is independent of the dataset size (not counting the time to output results), so it is really fast. This LSH library implements various LSH schemes for L2 (Euclidean) space.
Now, if the dimension of your data is less than 10, a kd-tree is preferred if you want exact results.

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2, if that makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results, I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
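As a rough sketch of that pipeline (not the ANN library itself; the 15-component choice and library calls below are my own assumptions), you could reduce with PCA and then run an exact kd-tree search in the reduced space:

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

# Hypothetical data shaped like the question: 16,000 points in 75 dimensions.
data = np.random.rand(16_000, 75)

# Project onto the first 15 principal components (an arbitrary choice here;
# in practice you would pick it from the explained-variance curve).
reduced = PCA(n_components=15).fit_transform(data)

# Exact 2-NN search in the reduced space (only approximate in the original space).
tree = cKDTree(reduced)
dists, idx = tree.query(reduced, k=3)  # k=3: each point returns itself plus its 2 nearest neighbours
```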
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data lives in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (for example, the 4th NN to your query, while you were looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to solving the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
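Neither of those libraries is reproduced here, but as a rough illustration of the bucketing idea, a minimal random-hyperplane (cosine) LSH could look like the sketch below; the class name and parameters are hypothetical:

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Minimal cosine LSH: each of `n_bits` random hyperplanes contributes one
    bit (which side of the plane a vector falls on); vectors sharing the same
    bit pattern land in the same bucket."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def index(self, vectors):
        for i, v in enumerate(vectors):
            self.buckets[self._key(v)].append(i)

    def query(self, v):
        # Candidates are the contents of the query's bucket; real libraries
        # also probe neighboring buckets and use several hash tables.
        return self.buckets.get(self._key(v), [])
```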
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (a different sort for EVERY observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the nearest neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect sorting, for example numpy.argpartition in the Python NumPy library, to sort only the closest K values you are interested in. There is no need to sort the entire array.
The cost of Grembo's answer above can be reduced significantly, as you only need the K nearest values, and there is no need to sort all of the distances from each point.
If you just need the K neighbours, this method works very well, reducing your computational cost and time complexity.
If you need the K neighbours sorted, sort the output again afterwards.
See the documentation for argpartition.
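As a hedged sketch of that suggestion (the helper name and the use of SciPy's cdist are my own choices), numpy.argpartition can pick the K smallest distances without a full sort:

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_argpartition(points, k=2):
    """Brute-force k-NN that uses np.argpartition instead of a full sort.
    Returns, for each point, the indices of its k nearest neighbours."""
    # Pairwise Euclidean distances; for 16,000 points this matrix is roughly
    # 2 GB, so chunk the computation if memory is tight.
    dists = cdist(points, points)
    np.fill_diagonal(dists, np.inf)  # exclude each point itself
    # argpartition moves the k smallest entries of each row into the first k slots.
    return np.argpartition(dists, k, axis=1)[:, :k]
```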
