Locality-Sensitive Hashing using an arbitrary non-Euclidean metric - data-structures

I have a very specific question. I work on a project where I need to find nearest neighbours (k-nearest and near neighbours).
As I don't need the exact ones and want to be able to extend to high dimensions, I focused on LSH.
My data has a distance that is a metric, but non-Euclidean. I found many approaches for vector spaces with the Euclidean metric (e.g. p-stable distributions), binary codes (via projections) or strings.
What I am searching for are papers that present an LSH template for an arbitrary metric. Does anyone have a reference to such papers?
Thanks in advance
Dan

What you are looking for is quite new.
I think this paper may help:
http://www.aaai.org/ocs/index.php/aaai/aaai10/paper/download/1839/2032
It suggests strategies for non-metric data,
which is an even harder setting than the non-Euclidean case.
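As a complement to the paper: there is no single ready-made LSH family for an arbitrary metric, so what an "LSH template" usually boils down to is the metric-independent bucketing machinery, with the hash family left as a plug-in. A minimal Python sketch of that idea (the `hash_family` callable is a placeholder you would supply yourself, e.g. p-stable projections for L2 or bit sampling for Hamming):

```python
import random
from collections import defaultdict

class LSHIndex:
    """Metric-agnostic LSH skeleton: the distance-specific part is the
    user-supplied hash family; this class only does banding and bucketing."""

    def __init__(self, hash_family, bands=10, rows=4, seed=0):
        rng = random.Random(seed)
        # bands x rows hash functions drawn from the supplied family
        self.tables = [[hash_family(rng) for _ in range(rows)]
                       for _ in range(bands)]
        self.buckets = [defaultdict(list) for _ in range(bands)]

    def _keys(self, point):
        # one composite key per band: a tuple of the row hashes
        return [tuple(h(point) for h in band) for band in self.tables]

    def insert(self, point_id, point):
        for bucket, key in zip(self.buckets, self._keys(point)):
            bucket[key].append(point_id)

    def candidates(self, point):
        found = set()
        for bucket, key in zip(self.buckets, self._keys(point)):
            found.update(bucket.get(key, ()))
        return found  # re-rank these candidates with the exact distance
```

The only metric-dependent requirement is that hash functions drawn from the family collide more often for close points than for distant ones; everything else above is generic.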

Related

Fastest k nearest neighbor with arbitrary metric?

The gotcha with this question is "arbitrary metric". If you don't know what that is, it's just the way to measure distance between points. (In the "real" world, the 1-dimensional distance is just the absolute magnitude of the difference between two points.)
Enough of the preliminaries. I'm trying to find a fast k-nearest-neighbor algorithm with these properties:
works on an arbitrary metric
somewhat easy to implement
optimized for finding the distance of a set of points to another set of points
Wikipedia gives a list of algorithms and approaches but nothing on implementation.
UPDATE: the metric is the cosine similarity, which does not satisfy the triangle inequality. However, it seems that I can use the "angular similarity" (as per Wikipedia).
UPDATE: the use case is natural language processing. "Vectors" are the "context" of a given word, represented by binary properties (ex: the title of the document). So while there may be only a few properties (right now I'm just using 3), each vector has arbitrarily large dimension (in the title example, each title in the database would correspond to a dimension in the vector).
UPDATE: For the curious, I'm implementing this algorithm:
http://josquin.cs.depaul.edu/~mramezani/papers/IEEEIS.pdf
UPDATE: The algorithm will need to find nearest neighbors for about a dozen points out of a few hundred points. The average dimension will probably be very large, say 50 (I really don't know yet). And yes, I'm interested in an algorithm, not a library. And yes, estimates are probably good enough.
I would advise you to go for locality-sensitive hashing (LSH), which is popular right now. It reduces the dimensionality of high-dimensional data, but I am not sure how well your dimensionality will suit that algorithm. See the Wikipedia page for more.
You can use your own metric, and in general you can do that in many algorithms. Hope this helps.
You could also go for RKD trees (a forest of randomized kd-trees), but maybe that is overkill for now.
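Since the update says the measure is cosine/angular similarity, the standard LSH family for that case is random-hyperplane hashing (the sign pattern of a few random projections). A minimal sketch, assuming dense NumPy vectors; the function names and parameters are just for illustration, and the very high-dimensional binary vectors described above would want a sparse representation instead of a dense matrix product:

```python
import numpy as np
from collections import defaultdict

def build_simhash_index(vectors, n_planes=16, seed=0):
    """Random-hyperplane LSH: two vectors land in the same bucket with a
    probability that grows as the angle between them shrinks."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, vectors.shape[1]))
    keys = (vectors @ planes.T) > 0          # sign pattern = n_planes-bit key
    buckets = defaultdict(list)
    for i, key in enumerate(keys):
        buckets[key.tobytes()].append(i)
    return planes, buckets

def query(vec, planes, buckets):
    key = ((vec @ planes.T) > 0).tobytes()
    return buckets.get(key, [])              # candidates; re-rank by cosine
```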

Find all k-nearest neighbors

Problem:
I have N (~100m) strings, each D (e.g. 100) characters long, over a small alphabet (e.g. 4 possible characters). I would like to find the k nearest neighbors for every one of those N points (k ~ 0.1*D). Adjacency between strings is defined via Hamming distance. The solution doesn't have to be the best possible, but the closer the better.
Thoughts about the problem
I have a bad feeling this is a non-trivial problem. I have read many papers and algorithms, but most of them perform poorly in high dimensions and only work well when the dimension is less than 5. For example, this paper suggests an efficient algorithm, but its constant is exponential in the dimension.
Currently, I am investigating how I can reduce the dimension in a way that Hamming distance is preserved or can still be computed.
Another option is locality-sensitive hashing: points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Any help? Which option do you prefer?
One of the previously asked questions has some good discussion, so you can refer to that:
Nearest neighbors in high-dimensional data?
Other than this, you can also look at:
http://web.cs.swarthmore.edu/~adanner/cs97/s08/papers/dahl_wootters.pdf
A few papers which analyze different approaches:
http://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf
https://www.cse.ust.hk/~yike/sigmod09-lsb.pdf
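For the Hamming-distance setting in the question, the classic LSH family is bit sampling: hash a string by the characters at a few random positions, so strings with small Hamming distance collide in at least one table with high probability. A rough sketch with placeholder parameter values (to get the k nearest neighbours of every string, you would query each one against the tables and verify the candidates with the exact Hamming distance):

```python
import random
from collections import defaultdict

def build_tables(strings, positions_per_table=8, n_tables=20, seed=0):
    """Bit-sampling LSH for Hamming distance: each table hashes a string by
    the characters at a fixed random subset of positions."""
    rng = random.Random(seed)
    length = len(strings[0])
    tables = []
    for _ in range(n_tables):
        pos = rng.sample(range(length), positions_per_table)
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets[tuple(s[p] for p in pos)].append(idx)
        tables.append((pos, buckets))
    return tables

def candidates(query, tables):
    out = set()
    for pos, buckets in tables:
        out.update(buckets.get(tuple(query[p] for p in pos), ()))
    return out  # verify with the exact Hamming distance, keep the k closest
```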

k-NN search in HUGE dimensions (~100,000)

Are there any articles about the k-NN search problem for a really huge number of dimensions, like 10k-100k?
Most articles with tests on real-world data operate with 10-50 dimensions, and a few with 100-500.
In my case there are ~10^9 points in a ~100k-dimensional feature space, and there is no way to effectively reduce the number of dimensions.
UPDATE:
At the moment we are trying to adapt and implement VP-trees, but it's clear enough that any tree structure at this dimensionality won't work well.
The second approach is LSH, but there may be big problems with accuracy depending on the data distribution.
Take a look at the FLANN library.
In this paper you will find a discussion of how data dimensionality is one of the factors with the greatest impact on nearest-neighbour matching performance, and of the solutions adopted in FLANN.
Are you using a kd-tree for nearest-neighbour search? A kd-tree deteriorates to an almost exhaustive search in higher dimensions.
In higher dimensions, it is usually suggested to use approximate nearest-neighbour search. Here is the link to the original paper: http://cvs.cs.umd.edu/~mount/Papers/dist.pdf, and if that is a bit too heavy, try this: dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
There are many factors affecting the choice when it comes to nearest-neighbour search. Whether you need to load the points entirely into primary memory or can use secondary memory should also govern your decision.
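Not something the answers above prescribe, but as an illustration of the dimensionality-reduction route: if an approximate answer is acceptable, a low-effort baseline is to random-project the vectors down (Johnson-Lindenstrauss style) and run an exact search on the reduced data. A sketch with scikit-learn, on a toy-sized stand-in, since the real set (~10^9 points x ~100k dimensions) would need chunked or out-of-core processing:

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the real data; this only illustrates the projection + search step.
X = np.random.rand(10_000, 1_000)

proj = SparseRandomProjection(n_components=256, random_state=0)
X_low = proj.fit_transform(X)                      # JL-style reduction

nn = NearestNeighbors(n_neighbors=10, algorithm="brute").fit(X_low)
dist, idx = nn.kneighbors(X_low[:5])               # neighbours of the first 5 points
```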

How to find the closest 2 points in a 100 dimensional space with 500,000 points?

I have a database with 500,000 points in a 100-dimensional space, and I want to find the closest 2 points. How do I do it?
Update: The space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.
There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n log n) time. You can check it out on Google Books. In fact, I suggest it for everyone, as the way they apply the divide-and-conquer technique to this problem is simple, elegant and impressive.
Although it can't be extended directly to your problem (the constant 7 would be replaced with 2^101 - 1), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n log n * m) complexity, where n is the number of points and m is the number of dimensions.
edit
That's all assuming you have a Euclidean space, i.e. the length of vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
Use a kd tree. You're looking at a nearest neighbor problem and there are highly optimized data structures for handling this exact class of problems.
http://en.wikipedia.org/wiki/Kd-tree
P.S. Fun problem!
You could try the ANN library, but that only gives reliable results up to 20 dimensions.
Run PCA on your data to convert the vectors from 100 dimensions to, say, 20 dimensions. Then build a k-nearest-neighbor tree (kd-tree) and get the closest 2 neighbors based on Euclidean distance.
Generally, if the number of dimensions is very large, you have to either do a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
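A rough sketch of that PCA-then-tree pipeline with scikit-learn (sizes are placeholders; note that distances in the 20-dimensional PCA space only approximate the original ones, so the winning pair should be re-checked in the full 100 dimensions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(50_000, 100)          # smaller stand-in for the 500k x 100 set

X20 = PCA(n_components=20).fit_transform(X)

# Ask for 2 neighbors per point: the first one is the point itself (distance 0).
dist, idx = NearestNeighbors(n_neighbors=2).fit(X20).kneighbors(X20)
i = int(np.argmin(dist[:, 1]))           # point whose nearest other point is closest
closest_pair = (i, int(idx[i, 1]))       # candidate pair, found in PCA space
```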
Use the data structure known as a kd-tree. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.
http://en.wikipedia.org/wiki/Kd-tree
My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.
Here's his published paper:
http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

Determining the best k for a k nearest neighbour

I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
Assuming that this isn't NP-complete, what is the time complexity of finding a good k? (Probably reported as the number of times the k-means algorithm has to be run.)
Is k-means a good starting point for this type of problem? If so, what other approaches would you recommend? A specific example, backed by an anecdote/experience, would be great.
What shortcuts/approximations would you recommend to increase performance?
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the number of clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly groups the two closest centroids. Keeping track of the grouping sequence allows an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
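A minimal sketch of the "one tree, any number of clusters" idea above with SciPy; the linkage method is a choice ("centroid" mirrors the merge rule described above, "ward" would mirror the k-means objective):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(200, 2)               # placeholder 2-D data

# Build the merge tree once.
Z = linkage(X, method="centroid")

# Cut the same tree at different levels to get any number of clusters instantly.
labels_by_k = {k: fcluster(Z, t=k, criterion="maxclust") for k in range(2, 11)}
```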
In order to use k-means, you need to know how many clusters there are. You can't just try a naive meta-optimisation, since the more clusters you add (up to one cluster per data point), the more you drive it towards over-fitting. You may look for cluster-validation methods and optimise the k hyperparameter with them, but from my experience it rarely works well. It's very costly too.
If I were you, I would do a PCA, possibly in a polynomial feature space (mind your available time), depending on what you know of your input, and cluster along the most representative components.
More info on your data set would be very helpful for a more precise answer.
Here's my approximate solution:
1. Start with k = 2.
2. For a number of tries:
   a. Run the k-means algorithm to find k clusters.
   b. Find the mean square distance from the origin to the cluster centroids.
3. From those repeated runs, compute the standard deviation of the distances. This is a proxy for the stability of the clusters.
4. If the stability of the clusters for k < the stability of the clusters for k - 1, then return k - 1.
5. Otherwise, increment k by 1 and go back to step 2.
The thesis behind this algorithm is that the number of sets of k clusters is small for "good" values of k.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
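A rough Python sketch of this heuristic, assuming scikit-learn's KMeans and treating a low spread of the centroid statistic across random restarts as "stable" (the parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k(X, n_tries=10, k_max=20):
    """Heuristic from the answer above: a k is 'good' if the centroid
    configuration is stable across random restarts of k-means."""
    prev_std = None
    for k in range(2, k_max + 1):
        dists = []
        for trial in range(n_tries):
            km = KMeans(n_clusters=k, n_init=1, random_state=trial).fit(X)
            # mean squared distance from the origin to the centroids
            dists.append(np.mean(np.sum(km.cluster_centers_ ** 2, axis=1)))
        std = np.std(dists)                 # low spread = stable clustering
        if prev_std is not None and std > prev_std:
            return k - 1                    # stability dropped, keep previous k
        prev_std = std
    return k_max
```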
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exists a variation of the K-Means algorithm called X-Means which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to addressing the problem of scalability by using kd-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.
Finally, you might refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
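For the Elbow Method mentioned last, a minimal scikit-learn sketch: compute the within-cluster sum of squares (inertia) for a range of k and look for the k where the curve flattens out (the data here is just a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)               # placeholder data

# Within-cluster sum of squares for each k; look for the bend ("elbow").
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 11)}
for k in sorted(inertia):
    print(k, round(inertia[k], 2))
```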
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure that evaluates how closely a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best" or minimally ambiguous arrangement.
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
From your Wikipedia link:
Regarding computational complexity, the k-means clustering problem is:
NP-hard in general Euclidean space d, even for 2 clusters;
NP-hard for a general number of clusters k, even in the plane.
If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered.
Thus, a variety of heuristic algorithms are generally used.
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point; it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that subset to find your k means.
Choosing the best K can be seen as a Model Selection problem. One possible approach is Minimum Description Length, which in this context means: you could store a table with all the points (in which case K = N); at the other extreme, you have K = 1, and all the points are stored as their distances from a single centroid. This section from Introduction to Information Retrieval by Manning and Schütze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
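A small sketch of that idea, assuming the simple AIC-style criterion RSS(K) + 2*M*K described in that chapter (RSS is the within-cluster sum of squared distances, M the dimensionality); treat the exact form as an assumption and check the chapter for the version you want:

```python
import numpy as np
from sklearn.cluster import KMeans

def aic_choice(X, k_max=15):
    """Pick K minimising RSS(K) + 2*M*K, a simple AIC-style criterion."""
    m = X.shape[1]
    scores = {}
    for k in range(1, k_max + 1):
        rss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        scores[k] = rss + 2 * m * k
    return min(scores, key=scores.get)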
This problem belongs to the "internal evaluation" class of clustering optimisation problems, for which the current state-of-the-art solution seems to be the Silhouette coefficient, as stated here:
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample implementation of the methodology here:
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
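A condensed version of what that example does, picking the k with the highest average silhouette (placeholder data; scikit-learn's silhouette_score assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)               # placeholder data

# Average silhouette for each candidate k; pick the k with the highest score.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 11)}
best_k = max(scores, key=scores.get)
```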

Resources