Fuzzy search over millions of strings with custom distance function - algorithm

I have a large pool of short strings and a custom distance function on them (let's say Damerau–Levenshtein distance).
Q: What is the state-of-the-art solution for getting top N strings from the pool according to the custom distance?
I am looking for both a theoretical approach to this problem as well as coded implementation (Java, Python, etc).

The straightforward approach is to iterate over all strings, calculate the distance for each, and keep only the best N as you go.
If you need to do this often, consider whether you can find a lower-bound (or upper-bound) estimate of the cost that is much cheaper to compute than the real distance function. For example, pre-calculate all n-grams (e.g. 3-grams) of your strings, or use the length difference, which already gives a lower bound for the distance. Then you can skip the distance calculation for every string whose lower bound is higher than the distance of your current N-th best match.
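A minimal sketch of that pruning idea, assuming plain Levenshtein distance stands in for the custom distance (the length difference is also a valid lower bound for Damerau-Levenshtein, since a transposition does not change the length). The function names `levenshtein` and `top_n` are made up for this illustration.

```python
import heapq

def levenshtein(a, b):
    """Plain Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def top_n(query, pool, n):
    """Return the n closest strings as (distance, string), best first."""
    heap = []  # max-heap via negated distances; heap[0] is the worst kept match
    for s in pool:
        lower_bound = abs(len(s) - len(query))
        if len(heap) == n and lower_bound >= -heap[0][0]:
            continue  # cannot beat the current n-th best, skip the real distance
        d = levenshtein(query, s)
        if len(heap) < n:
            heapq.heappush(heap, (-d, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, s))
    return sorted((-neg_d, s) for neg_d, s in heap)
```

For example, `top_n("kitten", ["sitting", "mitten", "kitchen", "a", "kitten"], 2)` never computes the full distance for `"a"`, because its length difference alone is already worse than the second-best match found so far.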

Related

Does the Levenshtein distance algorithm perform better than the Needleman-Wunsch algorithm?

I know that both Levenshtein and Needleman-Wunsch have a time complexity of O(N*M), but I was curious to know which one performs better and why.
I've studied both of them; both take the same time and are equally efficient.
If you want, you can see for yourself:
just print the CPU clock times for both algorithms and you won't see much difference
(maybe just a few ms, and even that differs from compiler to compiler).
There is no article that compares the two in detail, because it would be a waste of time.
These two algorithms do different things, so there is no point in comparing them.
The Levenshtein Distance Algorithm just calculates the Levenshtein distance of two strings (the minimum number of mutations needed to transform one into the other).
The Needleman-Wunsch algorithm finds the optimal alignment between two strings, which can be seen as the operations needed to transform one string into another.
The first part of this is similar to the Levenshtein distance algorithm, however, so if a time comparison is needed, Needleman-Wunsch will take longer, purely because it actually does more.
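To make the "does more" point concrete, here is a hedged sketch of Needleman-Wunsch with a traceback step. The function name and the scoring defaults are assumptions for illustration. The table fill is the same O(N*M) work as Levenshtein; the traceback that recovers the alignment is the extra part.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: same recurrence shape as Levenshtein, but maximising.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With `match=0, mismatch=-1, gap=-1` the returned score is the negated Levenshtein distance, for example -3 for `"kitten"` and `"sitting"`.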

Given n points in a 2-D plane we have to find k nearest neighbours of each point among themselves

I explored the method using a min-heap. For each point we can store a min-heap of size k, but that takes too much space for large n (I'm targeting n around 100 million). Surely there must be a better way that uses less space without affecting the time complexity much. Is there some other data structure?
This problem is a typical setup for a KD-tree. Such a solution would have linearithmic complexity but may be relatively complex to implement (if a ready implementation is not available).
An alternative approach is to use bucketing to reduce the complexity of the naive algorithm. The idea is to separate the plane into "buckets", i.e. squares of some size, and place each point in the bucket it belongs to. The closest points will come from the closest buckets. For random data this can be quite a good improvement, but the worst case is still the same as the naive approach.
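The bucketing idea can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `build_grid` and `k_nearest` and the `cell` size parameter are invented here, and the per-query scan of grid keys would be cached in real use. Cells are searched in growing rings around the query, stopping once no unvisited cell can contain a closer point.

```python
import heapq
import math
from collections import defaultdict

def build_grid(points, cell):
    """Map each (cell_x, cell_y) bucket to the indices of the points inside it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)
    return grid

def k_nearest(points, grid, cell, q, k):
    """Indices of the k points nearest to points[q], excluding q, nearest first."""
    qx, qy = points[q]
    cx, cy = math.floor(qx / cell), math.floor(qy / cell)
    # Farthest occupied ring; computed per call for brevity.
    max_ring = max(max(abs(gx - cx), abs(gy - cy)) for gx, gy in grid)
    best = []  # max-heap of (-distance, index) with at most k entries
    for r in range(max_ring + 1):
        for gx in range(cx - r, cx + r + 1):
            for gy in range(cy - r, cy + r + 1):
                if max(abs(gx - cx), abs(gy - cy)) != r:
                    continue  # visit only the cells lying on ring r
                for idx in grid.get((gx, gy), ()):
                    if idx == q:
                        continue
                    d = math.dist(points[idx], points[q])
                    if len(best) < k:
                        heapq.heappush(best, (-d, idx))
                    elif d < -best[0][0]:
                        heapq.heapreplace(best, (-d, idx))
        # Every point outside rings 0..r is farther than r * cell from the query.
        if len(best) == k and -best[0][0] <= r * cell:
            break
    return [idx for _, idx in sorted((-nd, idx) for nd, idx in best)]
```

Only one heap of size k is alive at a time, so the extra memory beyond the grid itself is O(k) per query rather than O(n*k) overall.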

Finding a single cluster of points with low variance

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is a way to do it (from what I have understood of the problem):
Select a point k from the dataset and sort the other points in ascending order of their distance from k, in O(N log N).
Keeping k as the mean, add points from the sorted list to a set while the variance stays below C, then stop.
Do this for every point.
Keep track of the largest set.
Time complexity: O(N^2 log N), where N is the size of the dataset.
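The steps above could look roughly like this in Python, with points as complex numbers. One reading is assumed here: "variance" is taken as the mean squared distance from the candidate centre point, not from the set's own mean, and `c` must be positive. The function names are invented for this sketch.

```python
def largest_low_variance_cluster(points, c):
    """For each candidate centre, grow a set by distance while variance < c.

    Returns the largest such set. Assumes c > 0 and a non-empty list of points.
    """
    best = []
    for center in points:
        ordered = sorted(points, key=lambda z: abs(z - center))
        total = 0.0       # running sum of squared distances to the centre
        members = []
        for z in ordered:
            sq = abs(z - center) ** 2
            if (total + sq) / (len(members) + 1) >= c:
                break     # adding z would push the variance to c or above
            total += sq
            members.append(z)
        if len(members) > len(best):
            best = members
    return best

def typical_value(points, c):
    """Mean of the largest low-variance set found above."""
    cluster = largest_low_variance_cluster(points, c)
    return sum(cluster) / len(cluster)
```

Because the points are visited in order of increasing distance, the running variance never decreases, so stopping at the first violation is safe under this reading.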
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such objects.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.

What are some fast approximations of Nearest Neighbor?

Say I have a huge list of n vectors (a few million). Given a new vector, I need to find a pretty close one from the set, but it doesn't need to be the closest. (Exact nearest neighbor finds the closest and runs in O(n) time.)
What algorithms are there that can approximate nearest neighbor very quickly at the cost of accuracy?
EDIT: Since it will probably help, I should mention the data are pretty smooth most of the time, with a small chance of spikiness in a random dimension.
There are algorithms faster than O(n) for finding the closest element. Check http://en.wikipedia.org/wiki/Kd-tree for details.
If you are using high-dimensional vectors, like SIFT or SURF or any descriptor used in the multimedia sector, I suggest you consider LSH.
A PhD dissertation from Wei Dong (http://www.cs.princeton.edu/cass/papers/cikm08.pdf) might help you find an up-to-date algorithm for KNN search, i.e., LSH. Unlike more traditional LSH such as E2LSH (http://www.mit.edu/~andoni/LSH/), published earlier by MIT researchers, his algorithm uses multi-probing to better balance the trade-off between recall rate and cost.
A web search on "nearest neighbor" lsh library finds
http://www.mit.edu/~andoni/LSH/
http://www.cs.umd.edu/~mount/ANN/
http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm
For approximate nearest neighbour, the fastest way is to use locality sensitive hashing (LSH). There are many variants of LSH; you should choose one depending on the distance metric of your data. The big-O query time for LSH is independent of the dataset size (not counting the time to output the result), so it is really fast. This LSH library implements various LSH schemes for L2 (Euclidean) space.
Now, if the dimension of your data is less than 10, a kd-tree is preferred if you want an exact result.
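As a rough illustration of the hashing idea for L2 distance, here is a small random-projection LSH sketch using NumPy. It is not the API of any of the libraries linked above; the class name `L2LSH` and its parameters (`n_tables`, `n_hashes`, `width`) are assumptions made up for this example, and the parameter values would need tuning for real data.

```python
from collections import defaultdict

import numpy as np

class L2LSH:
    """Toy LSH index: each table hashes v to floor((a . v + b) / width)."""

    def __init__(self, dim, n_tables=8, n_hashes=4, width=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_tables, n_hashes, dim))     # random directions
        self.b = rng.uniform(0.0, width, size=(n_tables, n_hashes))  # random offsets
        self.width = width
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.data = None

    def _keys(self, v):
        # One bucket key (a tuple of n_hashes integers) per table.
        codes = np.floor((self.a @ v + self.b) / self.width).astype(int)
        return [tuple(row) for row in codes]

    def index(self, data):
        self.data = np.asarray(data, dtype=float)
        for i, v in enumerate(self.data):
            for table, key in zip(self.tables, self._keys(v)):
                table[key].append(i)

    def query(self, q):
        """Index of the closest candidate sharing a bucket with q, or None."""
        q = np.asarray(q, dtype=float)
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, ()))
        if not candidates:
            return None
        cand = np.fromiter(candidates, dtype=int)
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return int(cand[np.argmin(dists)])
```

Only the points that collide with the query in at least one table are compared exactly, which is where the speed-up comes from; the price is that the true nearest neighbour can be missed if it never collides.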

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance, currently k=2 if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it is only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
One option is approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbor but is a good approximation of it (for example, the 4th NN to your query when you are looking for the 1st).
That approach costs you accuracy but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to solving the nearest neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
There is no reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to reduce this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (different for EVERY value of n). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized to high dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation is to sort the array of distances that you have computed for each data point.
As sorting the entire array can be very expensive, you can use a partial, indirect sort, for example numpy.argpartition in the Python NumPy library, to pick out only the closest K values you are interested in. There is no need to sort the entire array.
The cost in @Grembo's answer above can be reduced significantly, as you only need the K nearest values; there is no need to sort all the distances from each point.
If you just need K neighbours, this method works very well and reduces your computational cost and time complexity.
If you need the K neighbours in sorted order, sort the output again.
see
Documentation for argpartition
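A short sketch of this approach with NumPy, assuming an exhaustive distance matrix fits in memory (for 16,000 points a full float64 matrix is about 2 GB, so processing rows in chunks may be needed). The function name `knn_bruteforce` is made up for this example.

```python
import numpy as np

def knn_bruteforce(points, k):
    """Indices of the k nearest neighbours of every point, nearest first.

    Assumes k is smaller than the number of points.
    """
    pts = np.asarray(points, dtype=float)
    sq = np.sum(pts ** 2, axis=1)
    # Squared Euclidean distances via |x|^2 + |y|^2 - 2 x.y
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

    # argpartition puts the k smallest entries of each row first, in no particular order.
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]

    # Sort only those k entries per row to get them nearest first.
    order = np.argsort(np.take_along_axis(d2, nearest, axis=1), axis=1)
    return np.take_along_axis(nearest, order, axis=1)
```

The partition step is linear in the row length, so only the final sort of k entries per row pays the logarithmic factor.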

Resources