I'm trying to write my own KD-tree implementation and eventually a kNN implementation, and I'm having a bit of difficulty understanding how the KD-tree constructs the search tree.
On Wikipedia it says that you find the median of the values and use that as the root of the tree.
When there are many dimensions, however, how would you compute the median?
You don't find the median in several dimensions (in fact, there is no meaningful order for multidimensional numbers). At every level of the kd-tree, you focus on one dimension. You choose the median based on this dimension, ignoring the other components.
Note that you can use many criteria other than the median, depending on what you want to do. Likewise, selecting a good scheme for deciding the dimension for each node is an art, though virtually every scheme is correct.
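As a rough illustration of that per-axis splitting, here is a minimal Python sketch of a build (the function and field names are made up for the example): it sorts by the current axis, picks the median, and recurses while cycling through the dimensions.

```python
# Minimal, illustrative kd-tree construction sketch (not production code).
# Assumes all points are sequences of the same length k.

def build_kdtree(points, depth=0):
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                         # focus on one dimension per level
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2                # median along the current axis only
    return {
        "point": points[median],
        "axis": axis,
        "left": build_kdtree(points[:median], depth + 1),
        "right": build_kdtree(points[median + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```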
It is not required to find the medians. From Wikipedia:
Note also that it is not required to select the median point. In that case, the result is simply that there is no guarantee that the tree will be balanced. A simple heuristic to avoid coding a complex linear-time median-finding algorithm, or using an O(n log n) sort of all n points, is to use sort to find the median of a fixed number of randomly selected points to serve as the splitting plane. In practice, this technique often results in nicely balanced trees.
(Source: the Wikipedia article on k-d trees)
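For illustration, a minimal Python sketch of that sampling heuristic (the helper name is made up): pick the splitting value as the median of a small random sample along the current axis instead of the exact median.

```python
import random

def sample_median_split(points, axis, sample_size=15):
    """Splitting value = median of a small random sample along `axis`;
    a cheap stand-in for exact linear-time median finding."""
    sample = random.sample(points, min(sample_size, len(points)))
    sample.sort(key=lambda p: p[axis])
    return sample[len(sample) // 2][axis]
```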
You can simply sort the points according to one dimension, then choose the median as the root, then recursively construct the subtrees (sorting along the next dimension).
Here is an implementation:
https://github.com/tavaresdong/cs106l/blob/master/KDTree/src/KDTree.h
Related
Say I have a simulation that needs to pick between many discrete events with given weights, with the weights themselves following a known distribution.
I need some structure that supports updates, and efficiently picks random events.
It is trivial to get O(log(n)) insertion, deletion, and random-choice operations with a binary search tree. It should be possible to improve the random-choice operation, for example with interpolation search.
What are some theoretical results stronger than this, or known good implementations?
EDIT: I consider Niklas's reply to this comment about the O(log* n) algorithm (ftp.cs.brown.edu/pub/techreports/92/cs92-36.pdf) to be exactly what I was looking for.
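For reference, here is a minimal Python sketch of the O(log n) baseline mentioned above, using a Fenwick (binary indexed) tree over the event weights. The class and method names are made up for the example, and this is not the O(log* n) algorithm from the linked report.

```python
import random

class WeightedSampler:
    """Illustrative sketch: O(log n) weight updates and O(log n) weighted
    random choice via a Fenwick (binary indexed) tree of event weights."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)
        self.total = 0.0

    def update(self, i, delta):
        """Add `delta` to the weight of event i (0-based)."""
        self.total += delta
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def sample(self):
        """Return an event index with probability proportional to its weight."""
        target = random.random() * self.total
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]
                pos = nxt
            step >>= 1
        return pos                     # 0-based index of the chosen event

# Usage: four events with weights 0.1, 0.4, 0.3, 0.2
s = WeightedSampler(4)
for i, w in enumerate([0.1, 0.4, 0.3, 0.2]):
    s.update(i, w)
print(s.sample())
```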
I've searched around the web and visited the Wikipedia page for the median-of-medians algorithm, but I can't seem to find an explicit answer to my question:
If one has a very, very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sublists of varying sizes (equal sizes don't really matter), computing the medians of those smaller sublists, and then computing the median of those medians result in the median of the original large list?
Furthermore, is this statement also correct for any of the k-th order statistics? I'd be interested in links to research, etc., in this area.
The answer to your question is no.
If you want to understand how to actually select the k-th order statistic (including the median, of course) in a parallel setting (a distributed setting is not really different), take a look at this recent paper, in which I proposed a new algorithm improving on the previous state-of-the-art algorithm for parallel selection:
Deterministic parallel selection algorithms on coarse-grained multicomputers
Here, we use two weighted 3-medians as pivots, and partition around these pivots using five-way partitioning. We also implemented and tested the algorithm using MPI. Results are very good, taking into account that this is a deterministic algorithm exploiting the worst-case O(n) selection algorithm. Using the randomized O(n) QuickSelect algorithm provides an extremely fast parallel algorithm.
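For reference, plain randomized QuickSelect on a single machine looks roughly like this (Python sketch; this is only the sequential building block, not the parallel algorithm from the paper):

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element (0-based) of `a`.
    Randomized selection with expected O(n) time; illustrative only."""
    a = list(a)
    while True:
        pivot = random.choice(a)
        less = [x for x in a if x < pivot]
        equal_count = a.count(pivot)
        if k < len(less):
            a = less
        elif k < len(less) + equal_count:
            return pivot
        else:
            k -= len(less) + equal_count
            a = [x for x in a if x > pivot]

print(quickselect([7, 1, 5, 3, 9], 2))   # 5 (the median)
```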
If one has a very, very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sublists of varying sizes (equal sizes don't really matter), computing the medians of those smaller sublists, and then computing the median of those medians result in the median of the original large list?
No. The actual median of the entire list is not necessarily a median of any of the sublists.
Median-of-medians can give you a good choice of pivot for quickselect by virtue of being nearer the actual median than a randomly selected element, but you would have to do the rest of the quickselect algorithm to locate the actual median of the larger list.
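A tiny example makes this concrete (Python, standard library only): the median of the sublist medians is 4, while the true median of 1..9 is 5.

```python
from statistics import median

data = list(range(1, 10))                       # 1..9, true median is 5
sublists = [[1, 2, 9], [3, 4, 8], [5, 6, 7]]

sub_medians = [median(s) for s in sublists]     # [2, 4, 6]
print(median(sub_medians))                      # 4, not the true median
print(median(data))                             # 5
```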
I'm given a task to write an algorithm to compute the maximum two-dimensional subset of a matrix of integers. However, I'm not interested in help with such an algorithm; I'm more interested in knowing the best worst-case complexity with which this can possibly be solved.
Our current algorithm is roughly O(n^3).
I've been considering something like divide and conquer: splitting the matrix into a number of sub-matrices, simply adding up the elements within them, and thereby limiting the number of matrices one has to consider in order to find an approximate solution.
Worst case (exhaustive search) is definitely no worse than O(n^3). There are several descriptions of this on the web.
Best case can be far better: O(1). If all of the elements are non-negative, then the answer is the matrix itself. If the elements are non-positive, the answer is the element that has its value closest to zero.
Likewise if there are entire rows/columns on the edges of your matrix that are nothing but non-positive integers, you can chop these off in your search.
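For reference, here is a hedged Python sketch of the standard O(n^3)-style approach for the maximum-sum submatrix (the kind of description the answer refers to): fix a pair of rows, collapse everything between them into per-column sums, and run 1-D Kadane on that array.

```python
def max_sum_submatrix(matrix):
    """Maximum-sum contiguous submatrix in O(rows^2 * cols).
    Illustrative sketch, not the asker's algorithm."""
    rows, cols = len(matrix), len(matrix[0])
    best = matrix[0][0]
    for top in range(rows):
        col_sums = [0] * cols
        for bottom in range(top, rows):
            for c in range(cols):                 # add row `bottom` to the strip
                col_sums[c] += matrix[bottom][c]
            cur = col_sums[0]                     # 1-D Kadane over the strip
            best = max(best, cur)
            for c in range(1, cols):
                cur = max(col_sums[c], cur + col_sums[c])
                best = max(best, cur)
    return best

print(max_sum_submatrix([[1, -2, 3], [-4, 5, -6], [7, -8, 9]]))
```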
I've figured out that there isn't a better way to do it, at least not one known yet.
I'm going to stick with the solution I have, mainly because it's simple.
Say I have a huge list of n vectors (a few million). Given a new vector, I need to find a pretty close one from the set, but it doesn't need to be the closest. (Nearest-neighbor search finds the closest and runs in O(n) time.)
What algorithms are there that can approximate nearest neighbor very quickly at the cost of accuracy?
EDIT: Since it will probably help, I should mention the data are pretty smooth most of the time, with a small chance of spikiness in a random dimension.
There exist algorithms faster than O(n) for searching for the closest element under an arbitrary distance. Check http://en.wikipedia.org/wiki/Kd-tree for details.
If you are using high-dimensional vectors, like SIFT or SURF or any descriptor used in the multimedia sector, I suggest you consider LSH.
A PhD dissertation by Wei Dong (http://www.cs.princeton.edu/cass/papers/cikm08.pdf) might help you find up-to-date algorithms for kNN search, i.e., LSH. Unlike more traditional LSH, such as E2LSH (http://www.mit.edu/~andoni/LSH/) published earlier by MIT researchers, his algorithm uses multi-probing to better balance the trade-off between recall rate and cost.
A web search for a "nearest neighbor" LSH library finds:
http://www.mit.edu/~andoni/LSH/
http://www.cs.umd.edu/~mount/ANN/
http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm
For approximate nearest neighbour, the fastest way is to use locality-sensitive hashing (LSH). There are many variants of LSH. You should choose one depending on the distance metric of your data. The big-O of the query time for LSH is independent of the dataset size (not counting the time to output the results), so it is really fast. This LSH library implements various LSH schemes for L2 (Euclidean) space.
Now, if the dimension of your data is less than 10, a kd-tree is preferred if you want exact results.
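As a rough illustration of the LSH idea itself (not the API of any of the libraries above), here is a toy random-hyperplane sketch for cosine similarity in NumPy; the class and parameter names are invented for the example.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Toy LSH index for cosine similarity: each vector gets a bit signature
    from the signs of `num_bits` random projections; nearby vectors tend to
    land in the same bucket. Illustrative sketch only."""

    def __init__(self, dim, num_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(num_bits, dim))
        self.buckets = defaultdict(list)

    def _signature(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, idx, v):
        self.buckets[self._signature(v)].append(idx)

    def query(self, v):
        """Return candidate indices in the query's bucket (may be empty;
        real libraries probe several tables/buckets to boost recall)."""
        return list(self.buckets.get(self._signature(v), []))
```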
So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2 if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.
Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
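As a minimal sketch of that step (plain NumPy SVD, assuming your points are the rows of an array X; the function name is made up):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (n_samples x n_features) onto the top
    `n_components` principal directions. Illustrative NumPy-only sketch."""
    X_centered = X - X.mean(axis=0)
    # Economy-size SVD; the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# e.g. reduce 16,000 x 75 points down to 20 dimensions before the kNN search:
# X_reduced = pca_reduce(X, 20)
```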
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbor, but rather a good approximation of it (for example the 4th NN to your query, while you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the nearest-neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (a different sort for EVERY observation). Messy, but still polynomial time to get your answers.
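In NumPy terms, the brute-force procedure described above looks roughly like this (illustrative sketch, function name made up):

```python
import numpy as np

def knn_bruteforce_sorted(X, k):
    """All-pairs kNN: compute the full n x n distance matrix, then sort
    every row. O(n^2 d) for the distances plus O(n^2 log n) for the sorts."""
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    order = np.argsort(d2, axis=1)        # full sort of every row
    return order[:, :k]                   # indices of the k nearest neighbors
```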
A BK-tree isn't such a bad thought. Take a look at Nick's blog on Levenshtein automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know whether they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common approach is to sort the nearest-neighbour array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect partial sorting, for example numpy.argpartition in the Python NumPy library, to select only the closest K values you are interested in. There is no need to sort the entire array.
The cost of @Grembo's answer above can be reduced significantly, since you only need the K nearest values and there is no need to sort all the distances from each point.
If you just need K neighbours, this method will work very well, reducing your computational cost and time complexity.
If you need the K neighbours sorted, sort that output again.
See the documentation for argpartition.
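A minimal sketch of that approach, assuming the same n x n squared-distance matrix as in the earlier brute-force sketch (the function name is made up):

```python
import numpy as np

def knn_argpartition(X, k):
    """All-pairs kNN using np.argpartition: only the k smallest distances
    per row are separated out, so no full row sort is needed."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)                      # exclude the point itself
    nearest = np.argpartition(d2, k, axis=1)[:, :k]   # k nearest, unsorted
    # Optionally sort just those k entries per row:
    rows = np.arange(len(X))[:, None]
    return nearest[rows, np.argsort(d2[rows, nearest], axis=1)]
```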