k-Nearest Neighbor VS Similarity Search - image

Is there any difference between these two algorithms? At a glance, they seem identical to me.
Let's say we are searching for images: given a query image, one can search for k (e.g. k = 10) images with a k-NN algorithm. In a Similarity Search algorithm one can search for 10 images as well (with 10 acting as something like a threshold, I guess), and the results should be the same as with the k-NN algorithm, right?
Example of Similarity Search.

The main difference is that Similarity Search is the feature/product, while k-NN is an algorithm.
Similarity Search just says "give me the similar items"; this is the feature. It does not say how it should be done.
k-NN, on the other hand, is an algorithm, not a feature: it is a classification algorithm. It is possible (though unlikely) that Similarity Search will actually use k-NN under the hood.
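To make the distinction concrete, here is a minimal sketch in Python with NumPy (the function names, the cosine-similarity choice, and the random feature vectors are my own assumptions, not anything a particular product does) of a similarity-search feature served by a plain brute-force k-NN lookup:

    import numpy as np

    def build_index(feature_vectors):
        # Normalize the vectors once so that a dot product equals cosine similarity.
        norms = np.linalg.norm(feature_vectors, axis=1, keepdims=True)
        return feature_vectors / norms

    def similar_images(index, query_vector, k=10):
        # The "feature": give me the k most similar items.
        # The "algorithm" underneath: brute-force k-nearest-neighbour search.
        query = query_vector / np.linalg.norm(query_vector)
        scores = index @ query                # cosine similarity to every stored image
        top_k = np.argsort(-scores)[:k]       # indices of the k best matches
        return top_k, scores[top_k]

    # Usage: 10,000 images, 128-dimensional feature vectors (random stand-ins here).
    index = build_index(np.random.rand(10000, 128))
    ids, scores = similar_images(index, np.random.rand(128), k=10)

The user-facing feature is "give me the 10 most similar images"; the algorithm underneath happens to be k-NN.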

Related

Approximate nearest neighbor (A1NN) for high-dimensional spaces

I read this question about finding the closest neighbor for 3-dimensions points. Octree is a solution for this case.
kd-Tree is a solution for small spaces (generally less than 50 dimensions).
For higher dimensions (vectors of hundreds of dimensions and millions of points), LSH is a popular solution for solving the AKNN (Approximate K-NN) problem, as pointed out in this question.
However, LSH is popular for K-NN solutions, where K>>1. For example, LSH has been successfully used for Content Based Image Retrieval (CBIR) applications, where each image is represented through a vector of hundreds of dimensions and the dataset is millions (or billions) of images. In this case, K is the number of top-K most similar images w.r.t. the query image.
But what if we are interested in just the single approximate nearest neighbor (i.e. A1-NN) in high-dimensional spaces? Is LSH still the winner, or have ad-hoc solutions been proposed?
You might look at http://papers.nips.cc/paper/2666-an-investigation-of-practical-approximate-nearest-neighbor-algorithms.pdf and http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CTPAMI-TPTree.pdf. Both have figures and graphs showing the performance of LSH vs. the performance of tree-based methods which also produce only approximate answers, for different values of k, including k=1. The Microsoft paper claims that "It has been shown in [34] that randomized KD trees can outperform the LSH algorithm by about an order of magnitude". Table 2 on page 7 of the other paper appears to show speedups over LSH which are reasonably consistent for different values of k.
Note that this is not LSH vs. kd-trees. This is LSH vs. various cleverly tuned approximate search tree structures, where you typically search only the most promising parts of the tree (rather than every part that could possibly contain the closest point), and you search a number of different trees to get a decent probability of finding good points to compensate, tuning various parameters to get the fastest possible performance.
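As a quick, concrete baseline (not one of the tuned structures from those papers), SciPy's kd-tree exposes an approximate query mode: with eps > 0 the returned neighbor is only guaranteed to be within (1 + eps) times the true nearest distance. A minimal sketch, with an arbitrary synthetic data set:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.random((100_000, 100))   # 100,000 synthetic 100-dimensional points
    tree = cKDTree(points)

    query = rng.random(100)
    # eps > 0 makes the query approximate: the returned point is guaranteed to be
    # within (1 + eps) times the distance to the true nearest neighbor.
    dist, idx = tree.query(query, k=1, eps=0.5)

In genuinely high dimensions this still degrades toward brute force, which is exactly why the randomized and tuned variants in the papers above exist.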

What's the difference between Levenshtein distance and the Wagner-Fischer algorithm?

The Levenshtein distance is a string metric for measuring the difference between two sequences.
The Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters.
Both use a matrix, and I don't see the difference.
Is the difference the backtracking, or is there no further difference beyond the fact that one is the term from the literature and the other is the programming side?
Also, I am writing a thesis and I am not sure how to structure it: should I explain the Levenshtein distance first and the Wagner-Fischer algorithm afterwards, or cover both in one section? I got kind of confused here.
You actually answer the question yourself in the first paragraph.
In the second paragraph you mix them up a bit.
Levenshtein distance is an edit distance metric named after Vladimir Levenshtein, who considered this distance in 1965; it has nothing to do with the dynamic programming "matrix". The Wagner–Fischer algorithm, on the other hand, is a dynamic programming algorithm that computes the edit distance between two strings of characters.
However, the Levenshtein distance is normally computed using dynamic programming if what you need is a general-purpose computation, that is, calculating the edit distance between two arbitrary input strings. But the Levenshtein distance can also be used in a spell checker, where you compare one string with a dictionary. In cases like this it's normally too slow to use a general-purpose computation, and something like a Levenshtein automaton can provide linear time to get all spelling suggestions. By the way, this is also used in the fuzzy search in Lucene since version 4.
About your thesis, well, I think it depends. If it's about the actual Levenshtein metric then I think that's where you should start, and if it's about dynamic programming you should start with Wagner-Fischer. Anyway, that's my two cents about it. Good luck with your thesis.
Indeed, they are closely related, but they are not the same thing. The Levenshtein distance is a concept defined by a mathematical formula. However, trying to compute the Levenshtein distance by implementing the recursive formula directly will be horrendously slow. The Wagner-Fischer algorithm is a dynamic programming algorithm that computes it efficiently.
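To make that concrete, here is a short sketch of the Wagner-Fischer algorithm in Python (a straightforward textbook version, not tied to any particular library); the value it returns is the Levenshtein distance:

    def wagner_fischer(a, b):
        # Dynamic programming table: d[i][j] = edit distance between a[:i] and b[:j].
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                      # delete all of a[:i]
        for j in range(n + 1):
            d[0][j] = j                      # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution (or match)
        return d[m][n]

    print(wagner_fischer("kitten", "sitting"))   # 3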

Google + Hamming (or binary) Search

I've read here:
How do search engines merge results from an inverted index?
"query execution algorithms are actually fairly dumb"
and there are many sophisticated algorithms using Hamming distance that are supposed to be used in a search engine.
Those algorithms can be found in the papers:
Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce.
HmSearch: an efficient hamming distance query processing algorithm.
Multi-Index Hashing: A data structure for fast exact nearest neighbor search in Hamming space.
Does anybody know if algorithms in Hamming space are really useful?
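One reason Hamming-space methods are attractive: once items are encoded as fixed-length binary codes, the distance itself is just an XOR followed by a bit count, which is extremely cheap. A minimal Python sketch (the 64-bit codes here are arbitrary stand-ins, not taken from any of the papers above):

    def hamming_distance(a, b):
        # XOR leaves a 1 bit wherever the two codes differ; count those bits.
        return bin(a ^ b).count("1")

    # Two arbitrary 64-bit binary codes (e.g. image fingerprints).
    code1 = 0xB2F0B2F0B2F0B2F0
    code2 = 0xB2D0B2F8B2F0B2F1
    print(hamming_distance(code1, code2))   # number of differing bits

The papers listed above are about indexing so that you do not have to compute this against every item; the distance computation itself stays this simple.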

What are some fast approximations of Nearest Neighbor?

Say I have a huge list (a few million) of n vectors. Given a new vector, I need to find a pretty close one from the set, but it doesn't need to be the closest. (Nearest neighbor finds the closest and runs in O(n) time.)
What algorithms are there that can approximate nearest neighbor very quickly at the cost of accuracy?
EDIT: Since it will probably help, I should mention the data are pretty smooth most of the time, with a small chance of spikiness in a random dimension.
There exist algorithms faster than O(n) for finding the closest element under an arbitrary distance. Check http://en.wikipedia.org/wiki/Kd-tree for details.
If you are using high-dimensional vectors, like SIFT or SURF or any descriptor used in the multimedia sector, I suggest you consider LSH.
A PhD dissertation from Wei Dong (http://www.cs.princeton.edu/cass/papers/cikm08.pdf) might help you find an up-to-date algorithm for k-NN search, i.e. LSH. Unlike more traditional LSH schemes, like E2LSH (http://www.mit.edu/~andoni/LSH/) published earlier by MIT researchers, his algorithm uses multi-probing to better balance the trade-off between recall rate and cost.
A web search on "nearest neighbor" lsh library finds
http://www.mit.edu/~andoni/LSH/
http://www.cs.umd.edu/~mount/ANN/
http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm
For approximate nearest neighbour, the fastest way is to use locality-sensitive hashing (LSH). There are many variants of LSH. You should choose one depending on the distance metric of your data. The big-O of the query time for LSH is independent of the dataset size (not counting the time to output the results), so it is really fast. This LSH library implements various LSH schemes for the L2 (Euclidean) space.
Now, if the dimension of your data is less than 10, a kd-tree is preferred if you want an exact result.
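For a feel of what an L2 LSH function looks like, here is a minimal sketch of the random-projection-plus-quantization family used in E2LSH-style schemes (the dimension, the number of hashes, and the bucket width w are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_hashes, w = 128, 8, 8.0        # w is the bucket width (arbitrary here)

    # One hash per row: h(x) = floor((a . x + b) / w), a ~ Gaussian, b ~ Uniform(0, w).
    A = rng.normal(size=(n_hashes, dim))
    b = rng.uniform(0, w, size=n_hashes)

    def l2_lsh(x):
        # Points that are close in Euclidean distance tend to get the same key.
        return tuple(np.floor((A @ x + b) / w).astype(int))

    x = rng.random(dim)
    y = x + 0.01 * rng.normal(size=dim)   # a point very close to x
    print(l2_lsh(x))
    print(l2_lsh(y))                      # usually identical or nearly so

A real index keeps a hash table per group of such functions and only compares the query against the points that share its key.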

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2, if that makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.
Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data lives in.
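A minimal sketch of that idea, assuming scikit-learn is available (the target dimension of 15 is an arbitrary choice): project the 75-dimensional points down with PCA, then run an exact k-NN search in the reduced space.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.random((16000, 75))             # stand-in for the 16,000 x 75 data set

    # Reduce to 15 dimensions (arbitrary choice); some accuracy is traded for speed.
    X_reduced = PCA(n_components=15).fit_transform(X)

    # Exact k-NN in the reduced space (k=2 as in the question, plus the point itself).
    nn = NearestNeighbors(n_neighbors=3).fit(X_reduced)
    distances, indices = nn.kneighbors(X_reduced)
    neighbours = indices[:, 1:]             # drop column 0, which is the point itself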
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbor, but rather a good approximation of it (for example, the 4th NN to your query while you are looking for the 1st NN).
That approach costs you accuracy, but it increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality-Sensitive Hashing (LSH) is a modern approach to the nearest neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
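A rough sketch of that bucket idea for cosine similarity, using random-hyperplane hashing (this toy version is my own, not FALCONN's API; the number of bits is an arbitrary choice): each point gets a short bit signature, signatures index buckets, and a query only scans its own bucket.

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, n_bits = 75, 12
    planes = rng.normal(size=(n_bits, dim))      # random hyperplanes

    def signature(x):
        # One bit per hyperplane: which side of the plane does x fall on?
        return tuple((planes @ x > 0).astype(int))

    # Build the hash table: points close in angle tend to share a signature.
    X = rng.random((16000, dim))
    buckets = defaultdict(list)
    for i, x in enumerate(X):
        buckets[signature(x)].append(i)

    # Query: only the points in the query's bucket are candidate neighbours.
    q = rng.random(dim)
    candidates = buckets[signature(q)]

In practice you would build several such tables (and probe neighboring buckets) to boost recall, which is what libraries like FALCONN handle for you.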
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
There is no reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (a different sort for every observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the nearest-neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect sorting, for example numpy.argpartition in the NumPy library, to select only the closest K values you are interested in. There is no need to sort the entire array.
The cost in Grembo's answer above can be reduced significantly this way: you only need the K nearest values, and there is no need to sort the full list of distances from each point.
If you just need the K neighbours, this method works very well, reducing your computational cost and time complexity.
If you need the K neighbours sorted, sort the output again.
see
Documentation for argpartition
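A small sketch of that approach with NumPy (the array shapes mirror the question's 16,000 x 75 data set; the variable names are arbitrary): compute the distances from one query point and use np.argpartition to pull out the k smallest without a full sort.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((16000, 75))
    query = X[0]
    k = 2

    # Distances from the query to every point (brute force, O(n) per query).
    distances = np.linalg.norm(X - query, axis=1)

    # argpartition places the k+1 smallest distances first, in arbitrary order,
    # without sorting the whole array (the +1 accounts for the query itself).
    nearest = np.argpartition(distances, k + 1)[:k + 1]
    nearest = nearest[nearest != 0]          # drop the query point (index 0 here)

    # If you need them ordered, sort just those k values.
    nearest = nearest[np.argsort(distances[nearest])][:k]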

Resources