How to find the nearest neighbor of a sparse vector - algorithm

I have about 500 vectors, each a 1500-dimensional vector, and almost every vector is very sparse -- only about 30-70 of its dimensions are non-zero.
Now, the problem is that I am given another 1500-dimensional vector, and I need to compare it against the 500 vectors to find which of the 500 is the nearest one (in Euclidean distance).
There is no doubt that the brute-force method is a solution, but I would need to calculate the distance 500 times, which takes a long time.
Yesterday I read the article "Object retrieval with large vocabularies and fast spatial matching", which says that using an inverted index will help.
But after my test it made almost no difference: imagine a 1500-dimensional vector in which 50 of the dimensions are non-zero; compared with another such vector, they will almost always share at least one non-zero dimension. In other words, this filter can only rule out a few vectors, and I still need to compare against many of the remaining ones.
Thank you for reading this far. My questions are:
1. Will this algorithm help?
2. Is there any other way to do what I want, such as FLANN or a k-d tree?
But I want the exact nearest neighbor; an approximate one is not enough.

This kind of index is called an inverted list, and it is commonly used for text.
For example, Apache Lucene uses this kind of indexing for text similarity search.
Essentially, you use a columnar layout, and you only store the non-zero values. For on-disk efficiency, various compression techniques can be employed.
You can then compute many similarities using set operations on these lists.
k-d-trees cannot be used here. They will be extremely inefficient if you have many duplicate (zero) values.
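Below is a minimal sketch of the inverted-list idea for the sparse exact-Euclidean case, assuming each vector is stored as a dict mapping dimension index to value (the names and the dict representation are illustrative, not any particular library's API). Note that it only scores vectors sharing at least one non-zero dimension with the query -- exactly the filtering discussed in the question -- so a vector sharing no dimension with the query is never scored; if such vectors can plausibly be nearest (e.g., vectors with very small norm), fall back to a full scan.

from collections import defaultdict
import math

def build_inverted_index(vectors):
    # For every non-zero dimension, remember which vectors use it
    index = defaultdict(set)
    for vec_id, vec in enumerate(vectors):
        for dim in vec:
            index[dim].add(vec_id)
    return index

def nearest_neighbor(query, vectors, index):
    # Candidate set: every stored vector sharing at least one non-zero
    # dimension with the query (a set union over the inverted lists)
    candidates = set()
    for dim in query:
        candidates |= index.get(dim, set())
    best_id, best_dist = None, float("inf")
    for vec_id in candidates:
        vec = vectors[vec_id]
        dims = set(query) | set(vec)
        dist = math.sqrt(sum((query.get(d, 0.0) - vec.get(d, 0.0)) ** 2 for d in dims))
        if dist < best_dist:
            best_id, best_dist = vec_id, dist
    return best_id, best_dist

# vectors and the query are dicts of {dimension_index: value}
vectors = [{0: 1.0, 7: 2.5}, {7: 0.5, 42: 3.0}, {3: 1.0}]
index = build_inverted_index(vectors)
print(nearest_neighbor({7: 1.0, 42: 2.0}, vectors, index))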

I don't know your context, but if you don't mind a long preprocessing step and you have to run this check often and fast, you can build a neighborhood graph and sort each point's neighbors by distance.
To build this graph efficiently you can use the taxicab (L1) distance or the squared distance to rank the points by distance (this avoids heavy calculations).
Then if you want the nearest neighbor you just have to pick the first neighbor in the sorted list.
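As a rough illustration of that preprocessing step (a sketch only, and limited to neighbors within the set itself rather than external query points), here is one way to precompute the sorted neighbor lists with squared distances; the array shapes match the question and everything else is an assumption:

import numpy as np

def build_neighborhood_graph(vectors):
    # Pairwise squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (vectors ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * vectors @ vectors.T
    np.fill_diagonal(sq_dists, np.inf)      # exclude each point from its own list
    return np.argsort(sq_dists, axis=1)     # row i: neighbours of point i, nearest first

vectors = np.random.rand(500, 1500)         # 500 vectors of dimension 1500
graph = build_neighborhood_graph(vectors)
nearest_of_first = graph[0, 0]              # nearest neighbour of vector 0 within the set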

Related

Fastest k nearest neighbor with arbitrary metric?

The gotcha with this question is "arbitrary metric". If you don't know what that is, it's just the way to measure distance between points. (In the "real" world, the 1-dimensional distance is just the absolute magnitude of the difference between the two points.)
Enough of the preliminaries. I'm trying to find a fast k nearest neighbor algorithm with these properties:
works on an arbitrary metric
somewhat easy to implement
optimized for finding the distance of a set of points to another set of points
Wikipedia gives a list of algorithms and approaches but nothing on implementation.
UPDATE: the metric is the cosine similarity, which does not satisfy the triangle inequality. However, it seems that I can use the "angular similarity" (as per Wikipedia).
UPDATE: the use case is natural language processing. "Vectors" are the "context" of a given word, represented by binary properties (ex: the title of the document). So while there may be only a few properties (right now I'm just using 3), each vector has arbitrarily large dimension (in the title example, each title in the database would correspond to a dimension in the vector).
UPDATE: For the curious, I'm implementing this algorithm:
http://josquin.cs.depaul.edu/~mramezani/papers/IEEEIS.pdf
UPDATE: The algorithm will need to find nearest neighbors for about a dozen points from about 100s of points. The average dimension will probably be very large, say 50, (I really don't know yet). And yes, I'm interested in an algorithm, not a library. And yes, estimates are probably good enough.
I would advise you to go for locality-sensitive hashing (LSH), which is quite popular right now. It reduces the dimensionality of high-dimensional data, but I am not sure how well your dimensionality will suit that algorithm. See the Wikipedia page for more.
You can use your own metric, but that is true of many algorithms. Hope this helps.
You could also go for RKD trees, a forest of them, but maybe this is too much for now.

Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because the triangle inequality guarantees that these distances are greater than the current known minimum distance (for instance, dist(B,C) >= dist(A,C) - dist(A,B) = 4.5).
Is it possible to use this kind of information to reduce the O(n²) cost to something like O(n log n)?
Is it possible to reduce the cost to something close to O(n log n) if I accept a kind of approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a good approximation for small n.
Processing time (measured in big O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points, thus it is a good search candidate).
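Here is a minimal sketch of that divide-and-conquer idea, substituting a simple random-center partition for the M-Tree leaf HashMaps (the bucket count and the plain-tuple point representation are illustrative assumptions, not the MTreeMapRepo API):

import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def approx_closest_pair(points, n_buckets=32):
    n_buckets = min(n_buckets, len(points))
    centers = random.sample(points, n_buckets)
    buckets = [[] for _ in range(n_buckets)]
    for p in points:                          # assign each point to its nearest "center"
        i = min(range(n_buckets), key=lambda c: dist(p, centers[c]))
        buckets[i].append(p)
    best = (float("inf"), None, None)
    for bucket in buckets:                    # brute force inside each small bucket only
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                d = dist(bucket[i], bucket[j])
                if d < best[0]:
                    best = (d, bucket[i], bucket[j])
    return best                               # approximate: the true pair may straddle two buckets

points = [(random.random(), random.random()) for _ in range(1000)]
print(approx_closest_pair(points))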
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, from your description it would seem that any valid solution needs to inspect every edge in order to rule out its having something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V²) by using planar distances to deduce limitations on sets of edges, allowing us to avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set in two parts: points that are closer to the first point and points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest-neighbour search algorithms can be used.
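A minimal sketch of that recursive construction, with random pivot choice and a leaf-size cutoff as illustrative assumptions (the search step over the resulting tree is not shown):

import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_partition_tree(points, leaf_size=16):
    if len(points) <= leaf_size:
        return {"leaf": points}
    p1, p2 = random.sample(points, 2)
    left  = [p for p in points if dist(p, p1) <= dist(p, p2)]
    right = [p for p in points if dist(p, p1) >  dist(p, p2)]
    if not left or not right:                 # degenerate split (e.g. duplicates): stop here
        return {"leaf": points}
    return {"pivots": (p1, p2),
            "left":  build_partition_tree(left,  leaf_size),
            "right": build_partition_tree(right, leaf_size)}

tree = build_partition_tree([(random.random(), random.random()) for _ in range(1000)])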

Two algorithms to find nearest neighbor with Locality-sensitive hashing, which one?

Currently I'm studying how to find a nearest neighbor using Locality-sensitive hashing. However while I'm reading papers and searching the web I found two algorithms for doing this:
1- Use L number of hash tables with L number of random LSH functions, thus increasing the chance that two documents that are similar to get the same signature. For example if two documents are 80% similar, then there's an 80% chance that they will get the same signature from one LSH function. However if we use multiple LSH functions, then there's a higher chance to get the same signature for the documents from one of the LSH functions. This method is explained in wikipedia and I hope my understanding is correct:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
2- The other algorithm uses a method from a paper (section 5) called: Similarity Estimation Techniques from Rounding Algorithms by Moses S. Charikar. It's based on using one LSH function to generate the signature and then apply P permutations on it and then sort the list. Actually I don't understand the method very well and I hope if someone could clarify it.
My main question is: why would anyone use the second method rather than the first method? As I find it's easier and faster.
I really hope someone can help!!!
EDIT:
Actually I'm not sure if @Raff.Edward was mixing up the "first" and the "second". Because only the second method uses a radius and the first just uses a new hash family g composed of the hash family F. Please check the wikipedia link. They just used many g functions to generate different signatures, and then for each g function there is a corresponding hash table. In order to find the nearest neighbor of a point you just let the point go through the g functions and check the corresponding hash tables for collisions. That is how I understood it: more functions ... more chance for collisions.
I didn't find any mentioning about radius for the first method.
For the second method they generate only one signature for each feature vector and then apply P permutations on it. Now we have P lists of permutations, where each contains n signatures. They then sort each of the P lists. After that, given a query point q, they generate its signature, apply the P permutations to it, and then use binary search on each permuted and sorted list to find the signature most similar to the query q. I concluded this after reading many papers about it, but I still don't understand why anyone would use such a method, because it doesn't seem fast for finding the hamming distance!
For me, I would simply do the following to find the nearest neighbor for a query point q. Given a list of signatures N, I would generate the signature for the query point q, then scan the list N and compute the hamming distance between each element in N and the signature of q. Thus I would end up with the nearest neighbor for q. And it takes O(N)!
Your understanding of the first one is a little off. The probability of a collision occurring is not proportional to the similarity, but whether or not it is less than the pre-defined radius. The goal is that anything within the radius will have a high chance of colliding, and anything outside the radius * (1+eps) will have a low chance of colliding (and the area in-between is a little murky).
The first algorithm is actually fairly difficult to implement well, but can get good results. In particular, the first algorithm is for the L1 and L2 (and technically a few more) metrics.
The second algorithm is very simple to implement, though a naive implementation may use up too much memory to be useful depending on your problem size. In this case, the probability of collision is proportional to the similarity of the inputs. However, it only works for the Cosine Similarity (or distance metrics based on a transform of the similarity.)
So which one you would use is based primarily on which distance metric you are using for Nearest Neighbor (or whatever other application).
The second one is actually much easier to understand and implement than the first one, the paper is just very wordy.
The short version: take a random vector V and give each index an independent random unit normal value. Create as many vectors as you want the signature length to be. The signature is the sign of each index when you do a matrix-vector product. Now the hamming distance between any two signatures is related to the cosine similarity between the respective data points.
Because you can encode the signature into an int array and use an XOR with a bit count instruction to get the hamming distance very quickly, you can get approximate cosine similarity scores very quickly.
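A minimal sketch of that signature scheme using numpy, with a plain boolean comparison standing in for the packed-int XOR/popcount trick (the dimension and bit count are illustrative assumptions):

import numpy as np

def make_planes(dim, n_bits, seed=0):
    # one random Gaussian vector per signature bit
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def signature(vec, planes):
    # each bit is the sign of a dot product with one random vector
    return (planes @ vec) >= 0

def hamming(sig_a, sig_b):
    # proportional to the angle, hence related to cosine similarity
    return int(np.count_nonzero(sig_a != sig_b))

dim, n_bits = 1500, 256
planes = make_planes(dim, n_bits)
a, b = np.random.rand(dim), np.random.rand(dim)
print(hamming(signature(a, planes), signature(b, planes)))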
LSH algorithms don't have a lot of standardization, and the two papers (and others) use different definitions, so it's all a bit confusing at times. I only recently implemented both of these algorithms in JSAT, and am still working on fully understanding them both.
EDIT: Replying to your edit. The wikipedia article is not great for LSH. If you read the original paper, the first method you are talking about only works for a fixed radius. The hash functions are then created based on that radius, and concatenated to increase the probability of getting nearby points to collide. They then construct a system for doing k-NN on top of this by determining the maximum value of k they want, and then finding the largest reasonable distance at which they would find the k'th nearest neighbor. In this way, a radius search will very likely return the set of k-NNs. To speed this up, they also create a few extra, smaller radii, since the density is often not uniform, and the smaller the radius you use, the faster the results.
The wikipedia section you linked is taken from the paper description for the "Stable Distribution" section, which presents the hash function for a search of radius r=1.
For the second paper, the "sorting" you describe is not part of the hashing, but part of one scheme for searching the hamming space more quickly. As I mentioned, I recently implemented this, and you can see from a quick benchmark I did that a brute force search over the signatures is still much faster than the naive method of NN. Again, you would also pick this method if you need the cosine similarity over the L2 or L1 distance. You will find many other papers proposing different schemes for searching the hamming space created by the signatures.
If you need help convincing yourself it can be faster even if you were still doing brute force - just look at it this way: let's say that the average sparse document has 40 common words with another document (a very conservative number in my experience). You have n documents to compare against. Brute force cosine similarity would then involve about 40*n floating point multiplications (and some extra work). If you have a 1024 bit signature, that's only 32 integers. That means we could do a brute force LSH search in 32*n integer operations, which are considerably faster than floating point operations.
There are also other factors at play here as well. For a sparse data set we have to keep both the doubles and integer indices to represent the non zero indexes, so the sparse dot product is doing a lot of additional integer operations to see which indices they have in common. LSH also allows us to save memory, because we don't need to store all of these integers and doubles for each vector, instead we can just keep its hash around - which is only a few bytes.
Reduced memory use can help us better exploit the CPU cache.
Your O(n) is the naive way I have used in my blog post. And it is fast. However, if you sort the bits beforehand, you can do the binary search in O(log(n)). Even if you have L of these lists, L << n, and so it should be faster. The only issue is that it gets you approximate hamming NNs, which are already approximating the cosine similarity, so the results can become a bit worse. It depends on what you need.

Collision detection complexity <O(n²): Simpler approach than grid, quadtrees, BSP?

I have a large number of objects (balls, for a start) that move stepwise in space, one at a time, and shall not overlap. Currently, for every move I check for collision with every other object. Several other questions here deal with this; however, I thought of a seemingly simple solution that does not seem to come up in this context, and I wonder why.
Why not simply keep 2 collections (for 2D, or 3 in three dimensions) of all objects, sorted by the x and y (and z) coordinate, respectively, and at every move look up all other objects within a given distance (ball diameter here) in each dimension and do the actual collision check only on objects in both (or all 3) result sets?
I realize this only works for equally-sized objects, but alternatively one could use twice as many collections, sorted by the (1) highest (2) lowest coordinate of each object for each dimension. Any reason why this would not work, or give significantly less of an improvement compared to going from the O(n) per-move pairwise check to the grid method or quad/octrees? I see the update of these sorted collections as the costly operation here, but using e.g. a TreeSet (my implementation would be in Java) it should still be significantly less than O(n), right?
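For concreteness, here is a rough Python sketch of the lookup being proposed (2D, equally sized objects, non-negative integer ids assumed; the incremental per-move updates that a TreeSet-style structure would provide are not shown):

import bisect

def build_axis_lists(positions):              # positions: {id: (x, y)}, integer ids assumed
    xs = sorted((p[0], i) for i, p in positions.items())
    ys = sorted((p[1], i) for i, p in positions.items())
    return xs, ys

def ids_in_range(axis_list, lo, hi):
    # all object ids whose coordinate on this axis lies in [lo, hi]
    start = bisect.bisect_left(axis_list, (lo, -1))
    end = bisect.bisect_right(axis_list, (hi, float("inf")))
    return {i for _, i in axis_list[start:end]}

def collision_candidates(positions, xs, ys, obj_id, diameter):
    x, y = positions[obj_id]
    cand = (ids_in_range(xs, x - diameter, x + diameter)
            & ids_in_range(ys, y - diameter, y + diameter))
    cand.discard(obj_id)
    return cand                                # exact circle-circle test still needed on these

positions = {0: (0.0, 0.0), 1: (0.5, 0.2), 2: (3.0, 3.0)}
xs, ys = build_axis_lists(positions)
print(collision_candidates(positions, xs, ys, 0, diameter=1.0))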
The check for which objects are in both result sets involves looking at all objects in two strips of the plane. That is a much larger area, and therefore involves more objects, than the enclosing square that a quadtree lets you immediately narrow down to. More objects means that it is slower.
You want to use a spatial index or a space-filling curve instead of a quadtree. An SFC reduces the 2D problem to a 1D one, and it differs from a quadtree in that it can only store one object per (x, y) pair. Maybe this works for your problem? You want to search for Nick's hilbert curve quadtree spatial index blog.

Nearest neighbors in high-dimensional data?

I asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and before I proceed further, because I am not from the domain of Machine Learning or Math, I am beginning to ask myself some fundamental questions:
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, I was suggested to use kd-Trees but the Wikipedia page clearly says that for high-dimensions, kd-Tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest-neighbors in a million point dataset efficiently?
Can someone please clarify some (or all) of the above questions?
I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
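As a minimal sketch of one standard LSH family for the Euclidean case (the p-stable scheme of [1]), where the bucket width r, the number of hash functions, and the class name are illustrative assumptions to be tuned:

import numpy as np
from collections import defaultdict

class L2LSHTable:
    # h(v) = floor((a . v + b) / r), with a ~ Gaussian and b ~ Uniform[0, r);
    # nearby points (in L2) are likely to land in the same bin.
    def __init__(self, dim, n_hashes=8, r=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((n_hashes, dim))
        self.b = rng.uniform(0.0, r, size=n_hashes)
        self.r = r
        self.bins = defaultdict(list)

    def _key(self, v):
        return tuple(np.floor((self.a @ v + self.b) / self.r).astype(int))

    def add(self, idx, v):
        self.bins[self._key(v)].append(idx)

    def query(self, v):
        return self.bins.get(self._key(v), [])   # candidate ids in the same bin

dim = 21
table = L2LSHTable(dim)
data = np.random.rand(100, dim)
for i, v in enumerate(data):
    table.add(i, v)
print(table.query(data[0]))                      # candidates to check exactly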
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.
I. The Distance Metric
First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:
the underlying statistical distribution of your data;
the relationship among the features that comprise your data (are they independent--i.e., what does the covariance matrix look like); and
the coordinate space from which your data was obtained.
If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.
The Euclidean metric is used in mega-scale Web recommendation engines as well as in current academic research. Distances calculated with it have intuitive meaning and the computation scales--i.e., Euclidean distance is calculated the same way whether the two points are in two-dimensional or in twenty-two-dimensional space.
It has only failed for me a few times, and in each of those cases Euclidean distance failed because the underlying (cartesian) coordinate system was a poor choice. You'll usually recognize this because, for instance, path lengths (distances) are no longer additive--e.g., when the metric space is a chessboard, Manhattan distance is better than Euclidean; likewise when the metric space is Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is 2.5 hours, Vienna to St. Petersburg is another 3 hours, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours; instead, it's a little over 3 hours).
But apart from those cases in which your data belongs in a non-cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student, comparing several distance metrics by examining their effect on a kNN classifier--chi square gives the best results, but the differences are not large; a more comprehensive study is in the academic paper Comparative Study of Distance Functions for Nearest Neighbors--Mahalanobis (essentially Euclidean distance normalized to account for covariance between dimensions) was the best in this study.)
One important proviso: for distance metric calculations to be meaningful, you must re-scale your data--rarely is it possible to build a kNN model that generates accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your predictor variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution of bodyfat % will be almost negligible. Put another way, if instead the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 66.1 would become 66,100, which would have a large effect on your results--exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd are calculated separately for each column, or feature, in the data set; X refers to an individual entry/cell within a data row):
X_new = (X_old - mu) / sigma
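For example, a minimal sketch of that rescaling step on a small made-up data matrix (rows are data points, columns are the four features above):

import numpy as np

def standardize(X):
    mu = X.mean(axis=0)                # per-feature mean
    sigma = X.std(axis=0)              # per-feature standard deviation
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X - mu) / sigma, mu, sigma

X = np.array([[180.4, 66.1, 11.3, 71.0],
              [165.0, 72.5, 18.2, 64.0],
              [172.3, 59.8,  9.7, 80.0]])
X_scaled, mu, sigma = standardize(X)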
II. The Data Structure
If you are concerned about the performance of the kd-tree structure, a Voronoi tessellation (VT) is a conceptually simple container that will drastically improve performance and scales better than kd-trees.
This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, are well-documented (see e.g. this Microsoft Research report). The practical significance of this is that, provided you are using a 'mainstream' language (e.g., in the TIOBE Index), you ought to find a library to perform VT. I know that in Python and R there are multiple options for each language (e.g., the voronoi package for R available on CRAN).
Using a VT for kNN works like this:
From your data, randomly select w points--these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to each center. Imagine assigning a different color to each of the Voronoi centers, so that each point assigned to a given center is painted that color. As long as you have sufficient density, doing this will nicely show the boundaries of each Voronoi cell (as the boundary that separates two colors).
How to select the Voronoi centers? I use two orthogonal guidelines. After randomly selecting the w points, calculate the VT for your training data. Next check the number of data points assigned to each Voronoi center--these values should be about the same (given uniform point density across your data space). In two dimensions, this would produce a VT with tiles of the same size. That's the first rule; here's the second. Select w by iteration--run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).
So imagine you have one million data points. If the points were persisted in an ordinary 2D data structure, or in a kd-tree, you would perform on average a couple of million distance calculations for each new data point whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a VT, the nearest-neighbor search is performed in two steps, one after the other, against two different populations of data--first against the Voronoi centers, then, once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations). Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tessellate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far less, on average just 125 + 2,000.
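A minimal sketch of that two-step lookup, using numpy on a modest data set (the center count and data sizes are illustrative; for millions of points the assignment step would need to be batched, and note the result is approximate in the sense that the true nearest neighbor can lie in a neighboring cell):

import numpy as np

def build_cells(data, w=250, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=w, replace=False)]
    # assign every data point to its nearest center
    assign = np.argmin(((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    cells = {c: np.where(assign == c)[0] for c in range(w)}
    return centers, cells

def query(q, data, centers, cells):
    c = int(np.argmin(((centers - q) ** 2).sum(axis=1)))     # step 1: nearest Voronoi center
    members = cells[c]
    if len(members) == 0:                                     # empty cell: fall back to all points
        members = np.arange(len(data))
    d = ((data[members] - q) ** 2).sum(axis=1)                # step 2: search only that cell
    return members[int(np.argmin(d))]

data = np.random.rand(5000, 21)
centers, cells = build_cells(data, w=50)
print(query(np.random.rand(21), data, centers, cells))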
III. Calculating the Result (the predicted response variable)
There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.
W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.
The second component is how to weight the contribution of each of the neighbors (assuming n > 1).
The simplest weighting technique is just multiplying each neighbor by a weighting coefficient, which is just 1/(dist * K), i.e., the inverse of the distance from that neighbor to the test instance, often multiplied by some empirically derived constant K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.
A much better weighting function, which substantially avoids this limitation, is the gaussian function, which in python looks like this:
import math

def weight_gauss(dist, sig=2.0):
    return math.e**(-dist**2 / (2 * sig**2))
To calculate a predicted value using your kNN code, you would identify the n nearest neighbors to the data point whose response variable you wish to predict ('test instance'), then call the weight_gauss function once for each of the n neighbors, passing in the distance between each neighbor and the test point. This function will return the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted average calculation.
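Putting those pieces together, a minimal sketch of the weighted-average prediction step, assuming the neighbors have already been found and are given as (distance, response value) pairs:

import math

def weight_gauss(dist, sig=2.0):           # as defined above
    return math.e**(-dist**2 / (2 * sig**2))

def predict(neighbors):
    # neighbors: list of (distance_to_test_instance, response_value) pairs
    weights = [weight_gauss(d) for d, _ in neighbors]
    values = [v for _, v in neighbors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# e.g. three nearest neighbors at distances 0.4, 1.1, 2.3 with responses 10, 12, 9
print(predict([(0.4, 10.0), (1.1, 12.0), (2.3, 9.0)]))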
What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions and possibly find a linear transformation which would allow you to use less than 21 with approximately the same result quality.
Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland and I think Matlab code for it is publicly available for download. PCA is a more widely used technique and I believe you should be able to find an R or other software implementation. PCA is performed by solving linear equations iteratively. I did it too long ago to remember how. =)
The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
Top answers are good but old, so I'd like to add a 2016 answer.
As said, in a high-dimensional space the curse of dimensionality lurks around the corner, making traditional approaches, such as the popular k-d tree, as slow as a brute force approach. As a result, we turn our interest to Approximate Nearest Neighbor Search (ANNS), which sacrifices some accuracy to speed up the process. You get a good approximation of the exact NN, with good probability.
Hot topics that might be worthy:
Modern approaches of LSH, such as Razenshteyn's.
RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
or in a more recent approach I was part of, kd-GeRaF.
LOPQ, which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko and Lempitsky approach.
You can also check my relevant answers:
Two sets of high dimensional points: Find the nearest neighbour in the other set
Comparison of the runtime of Nearest Neighbor queries on different data structures
PCL kd-tree implementation extremely slow
To answer your questions one by one:
No, Euclidean distance is a bad metric in high-dimensional space. Basically, in high dimensions data points tend to have large distances from each other, which decreases the relative difference between the distance from a given data point to its nearest and to its farthest neighbour.
There are lots of papers/research on high-dimensional data, but most of the stuff requires a lot of mathematical sophistication.
A KD tree is bad for high-dimensional data ... avoid it at all costs.
Here is a nice paper to get you started in the right direction: "When Is 'Nearest Neighbor' Meaningful?" by Beyer et al.
I work with text data of dimensions 20K and above. If you want some text related advice, I might be able to help you out.
Cosine similarity is a common way to compare high-dimension vectors. Note that since it's a similarity not a distance, you'd want to maximize it not minimize it. You can also use a domain-specific way to compare the data, for example if your data was DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
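As a quick illustration of cosine similarity (a minimal sketch, dense vectors assumed):

import math

def cosine_similarity(a, b):
    # similarity in [-1, 1]: higher means more alike, so you maximize it
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity([1.0, 0.0, 2.0], [0.5, 1.0, 1.0]))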
The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.
The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.
kd-trees indeed won't work very well on high-dimensional data, because the pruning step no longer helps much: the closest edge - a 1-dimensional deviation - will almost always be smaller than the full-dimensional deviation to the known nearest neighbors.
But furthermore, kd-trees only work well with Lp norms for all I know, and there is the distance concentration effect that makes distance based algorithms degrade with increasing dimensionality.
For further information, you may want to read up on the curse of dimensionality, and the various variants of it (there is more than one side to it!)
I'm not convinced there is a lot of use in just blindly approximating Euclidean nearest neighbors, e.g. using LSH or random projections. It may be necessary to use a much more fine-tuned distance function in the first place!
A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.
I think cosine on tf-idf of boolean features would work well for most problems. That's because it's a time-proven heuristic used in many search engines such as Lucene. Euclidean distance in my experience shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.
iDistance is probably the best for exact knn retrieval in high-dimensional data. You can view it as an approximate Voronoi tessellation.
I've experienced the same problem and can say the following.
Euclidean distance is a good distance metric; however, it's computationally more expensive than the Manhattan distance, and it sometimes yields slightly poorer results, thus I'd choose the latter.
The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value.
Both Euclidean and Manhattan distances respect the Triangle inequality, thus you can use them in metric trees. Indeed, KD-trees have their performance severely degraded when the data have more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.
KD trees work fine for 21 dimensions, if you quit early, after looking at say 5% of all the points. FLANN does this (and other speedups) to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric, and the fast and solid scipy.spatial.cKDTree does only Lp metrics; these may or may not be adequate for your data.) There is of course a speed-accuracy tradeoff here.
(If you could describe your Ndata, Nquery, and data distribution, that might help people to try similar data.)
Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.253
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.245
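For reference, a minimal sketch of stock scipy.spatial.cKDTree usage (the cutoff in the kdstats.py runs above is not a standard scipy option; scipy's eps parameter is one way to trade accuracy for speed):

import numpy as np
from scipy.spatial import cKDTree

# smaller N than the runs above, just to keep the sketch quick
data = np.random.uniform(size=(100_000, 21))
tree = cKDTree(data, leafsize=10)

queries = np.random.uniform(size=(1000, 21))
dists, idx = tree.query(queries, k=2, eps=0)   # exact 2-NN; eps > 0 allows approximate, faster queries
print(dists.mean(axis=0))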
You could try a z-order curve. It's easy for 3 dimensions.
I had a similar question a while back. For fast Approximate Nearest Neighbor Search you can use the annoy library from spotify: https://github.com/spotify/annoy
This is some example code for the Python API, which is optimized in C++.
from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10)  # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors
They provide different distance measurements. Which distance measurement you want to apply depends highly on your individual problem. Also consider prescaling (meaning weighting) certain dimensions for importance first. Those dimension or feature importance weights might be calculated by something like entropy loss or, if you have a supervised learning problem, Gini impurity gain or mean average loss, where you check how much worse your machine learning model performs if you scramble that dimension's values.
Often the direction of the vector is more important than its absolute value. For example, in the semantic analysis of text documents we want document vectors to be close when their semantics are similar, not their lengths. Thus we can either normalize those vectors to unit length or use angular distance (i.e. cosine similarity) as a distance measurement.
Hope this is helpful.
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
I would suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are calculated to find the most relevant dimensions. You can use these weights when using euclidean distance, for example. See curse of dimensionality for common problems, and this article can also enlighten you somewhat:
A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets

Resources