k-NN search in HUGE dimensions (~100,000) - algorithm

Are there any articles about the k-NN search problem for a really huge number of dimensions, like 10k - 100k?
Most articles with tests on real-world data operate with 10-50 dims, and only a few operate with 100-500.
In my case there are ~10^9 points in a ~100k-dimensional feature space, and there is no way to effectively reduce the number of dimensions.
UPD.:
At the moment we are trying to adapt and implement VP-trees, but it's clear enough that any tree structure at this dimensionality won't work well.
The second approach is LSH, but there may be big trouble with accuracy depending on the data distribution.

Take a look at the FLANN library.
In this paper you will find a discussion of how data dimensionality is one of the factors that has a great impact on nearest neighbor matching performance, as well as the solutions adopted in FLANN.
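For example, here is a minimal usage sketch, assuming the pyflann Python bindings for FLANN are installed; the index type and parameters below are illustrative, not tuned for your data:

import numpy as np
from pyflann import FLANN   # assumes the pyflann bindings are installed

# Illustrative data: 10,000 points in 128 dimensions (substitute your own).
dataset = np.random.rand(10000, 128).astype(np.float32)
queries = np.random.rand(5, 128).astype(np.float32)

flann = FLANN()
# Index type and search parameters are guesses; FLANN can also auto-tune them.
neighbors, dists = flann.nn(dataset, queries, 10,
                            algorithm="kmeans", branching=32, iterations=7, checks=16)
print(neighbors.shape, dists.shape)   # (5, 10), (5, 10)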

Are you using a kd-tree for nearest neighbour search? A kd-tree deteriorates to almost exhaustive search in higher dimensions.
In higher dimensions, it is usually suggested to use approximate nearest neighbour search. Here is the link to the original paper: http://cvs.cs.umd.edu/~mount/Papers/dist.pdf, and if that is a bit too heavy, try this: dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
There are many factors affecting the choice when it comes to nearest neighbour search. Whether you need to load the points entirely into primary memory, or can use secondary memory, should also govern your decision.
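The references above are about approximate nearest neighbour search in general. As a rough Python illustration of the idea, here is a hedged sketch using Annoy, an approximate-NN library not mentioned above (the dimensionality, number of trees and search_k values are illustrative guesses):

import numpy as np
from annoy import AnnoyIndex   # Annoy is used here only as an example ANN library

dim = 1000                                   # high-dimensional points
rng = np.random.default_rng(0)
data = rng.standard_normal((10000, dim))

index = AnnoyIndex(dim, 'euclidean')
for i, v in enumerate(data):
    index.add_item(i, v.tolist())            # items must be added before building
index.build(20)                              # more trees: better accuracy, more memory

query = rng.standard_normal(dim)
# Approximate neighbours; raise search_k to trade speed for accuracy.
ids, dists = index.get_nns_by_vector(query.tolist(), 10, search_k=-1, include_distances=True)
print(ids, dists)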

Related

R-Tree construction algorithm using polygon MBR

I cannot seem to find any documentation on how to construct an R-Tree when I have all the known Minimum Bounding Rectangles (MBRs) of the polygons in my set. The R-Tree would be ideal for storing these spatial references, to replace my current brute-force check for polygon intersection:
for p1 in polygons:                      # O(n)
    for p2 in polygons:                  # O(n)
        if p2 is not p1:                 # O(1)
            if p2.intersects(p1):        # O(1); computed using De Morgan's law on vertices
                pass                     # do stuff
Does anybody have a reference that describes methods for determining the partitioning of the rectangles that encompass the MBRs of the polygons in the set?
There are many R-Tree partitioning algorithms; in my experience the best one is the R*-Tree (R-Star-Tree) by Beckmann et al. Just search for "The R*-tree: an efficient and robust access method for points and rectangles".
If you prefer reading code, there are also many open-source implementations, including my own in Java. Be warned though, the R*-Tree algorithm is not trivial.
If you are looking for something simpler, try quadtrees or octrees. If insertion and update speed is the top priority, have a look at the PH-Tree (again my own implementation), but it is also more complicated than quadtrees.
Another simpler solution is the AABB-Tree; it is like an R-Tree but with only two bounding boxes per node. It is used a lot in computer graphics, I think, but I don't know much about it except that it is relatively simple as R-Tree variants go.
EDIT (Update to answer comment)
If you are looking for a bulk loading strategy such as STR, here is the original paper. You can have a look at my R-Tree implementation, as it also provides an implementation of an STR-Loader that can handle points and rectangles.
Searching Stack Overflow I also found this answer, which apparently points to an alternative bulk loader specifically for storing rectangles.
I would like to point out that bulk loading (such as STR loading) is the fastest way to load an R-Tree. However, in my own experiments (see Figure 3 here), this is still 2-3 times slower than a good quadtree or a PH-Tree.
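As a hedged illustration of using an off-the-shelf R-Tree instead of the brute-force double loop from the question, here is a sketch with the Python rtree package, which wraps libspatialindex (it is not one of the implementations mentioned above, and the polygon interface is assumed to be shapely-like, with .bounds and .intersects()):

from rtree import index   # Python bindings for libspatialindex

def intersecting_pairs(polygons):
    # polygons: objects with .bounds == (minx, miny, maxx, maxy) and .intersects(other)
    idx = index.Index()
    for i, p in enumerate(polygons):
        idx.insert(i, p.bounds)                    # insert each MBR, keyed by list position
    pairs = []
    for i, p1 in enumerate(polygons):
        # candidate MBRs overlapping p1's box, instead of scanning all n polygons
        for j in idx.intersection(p1.bounds):
            if j > i and polygons[j].intersects(p1):   # exact test only on candidates
                pairs.append((i, j))
    return pairs

The package also accepts a generator of (id, bounds, obj) tuples in the index.Index() constructor for bulk (stream) loading, which plays the role of the bulk loading discussed in the edit above.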

how to find the nearest neighbor of a sparse vector

I have about 500 vectors; each vector is a 1500-dimensional vector,
and almost every vector is very sparse -- I mean only about 30-70 dimensions of each vector are not 0.
Now, the problem is that there is a given vector, also 1500-dimensional, and I need to compare it to the 500 vectors to find which of the 500 is the nearest one (in Euclidean distance).
There is no doubt that the brute-force method is a solution, but I need to calculate the distance 500 times, which takes a long time.
Yesterday I read the article "Object retrieval with large vocabularies and fast spatial matching", which says that using an inverted index will help.
But after my test, it made almost no difference: imagine a 1500-dimensional vector in which 50 of the dimensions are not zero; when it is compared to another vector, they almost always share at least one non-zero dimension. In other words, this approach can only rule out a few vectors, and I still need to compare against many of the remaining ones.
Thank you for reading this far. My questions are:
1. Does this algorithm make sense?
2. Is there any other way to do what I want, such as FLANN or a kd-tree?
But I want the exact nearest neighbor; an approximate one is not enough.
This kind of index is called inverted lists, and is commonly used for text.
For example, Apache Lucene uses this kind of indexing for text similarity search.
Essentially, you use a columnar layout, and you only store the non-zero values. For on-disk efficiency, various compression techniques can be employed.
You can then compute many similarities using set operations on these lists.
k-d-trees cannot be used here. They will be extremely inefficient if you have many duplicate (zero) values.
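A minimal sketch of this idea for the 500-vector case in the question, assuming the vectors are given as {dimension: value} dicts. The decomposition ||q - v||^2 = ||q||^2 + ||v||^2 - 2*(q . v) gives exact Euclidean distances while only touching non-zero entries:

from collections import defaultdict

def build_index(vectors):
    # vectors: list of sparse dicts {dim_index: value}; only non-zero entries stored
    norms = [sum(x * x for x in v.values()) for v in vectors]    # precompute ||v||^2
    inverted = defaultdict(list)                                 # dim -> [(vector_id, value), ...]
    for vid, v in enumerate(vectors):
        for dim, val in v.items():
            inverted[dim].append((vid, val))
    return inverted, norms

def exact_nearest(query, inverted, norms):
    # Accumulate dot products only over the query's non-zero dimensions.
    dots = defaultdict(float)
    for dim, qval in query.items():
        for vid, val in inverted.get(dim, ()):
            dots[vid] += qval * val
    q_norm = sum(x * x for x in query.values())
    # ||q - v||^2 = ||q||^2 + ||v||^2 - 2*(q . v); vectors that share no dimension
    # with the query simply have a zero dot product here.
    return min(range(len(norms)),
               key=lambda vid: q_norm + norms[vid] - 2.0 * dots.get(vid, 0.0))

This still ranks all 500 candidates at the end, but the per-dimension work is proportional to the number of non-zero entries, and the result is the exact nearest neighbour, which is what the question asks for.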
I don't know your context, but if you don't mind a long preprocessing step and you have to make this check often and fast, you can build a neighborhood graph and sort each point's neighbors by distance.
To build this graph efficiently you can use the taxicab distance or the squared distance to sort the points by distance (this avoids heavy calculations).
Then if you want the nearest neighbor you just have to pick the first neighbor :p.

Fastest k nearest neighbor with arbitrary metric?

The gotcha with this question is "arbitrary metric". If you don't know what that is, it's just the way to measure distance between points. (In the "real" world, the 1-dimensional distance is just the absolute magnitude of the difference between two points.)
Enough of the preliminaries. I'm trying to find a fast k nearest neighbor algorithm with these properties:
works on an arbitrary metric
somewhat easy to implement
optimized for finding the distance of a set of points to another set of points
Wikipedia gives a list of algorithms and approaches but nothing on implementation.
UPDATE: the metric is the cosine similarity, which does not satisfy the triangle inequality. However, it seems that I can use the "angular similarity" (as per Wikipedia).
UPDATE: the use case is natural language processing. "Vectors" are the "context" of a given word, represented by binary properties (ex: the title of the document). So while there may be only a few properties (right now I'm just using 3), each vector has arbitrarily large dimension (in the title example, each title in the database would correspond to a dimension in the vector).
UPDATE: For the curious, I'm implementing this algorithm:
http://josquin.cs.depaul.edu/~mramezani/papers/IEEEIS.pdf
UPDATE: The algorithm will need to find nearest neighbors for about a dozen points from among a few hundred points. The average dimension will probably be very large, say 50 (I really don't know yet). And yes, I'm interested in an algorithm, not a library. And yes, estimates are probably good enough.
I would advise you to go for locality-sensitive hashing (LSH), which is in fashion right now. It reduces the dimensionality of high-dimensional data, but I am not sure if your dimensionality will work well with that algorithm. See the Wikipedia page for more.
You can use your own metric, but in general you can do that with many algorithms. Hope this helps.
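Since the metric is cosine similarity, the classic LSH family here is random-hyperplane hashing (SimHash), whose collision probability is 1 - angle/pi, so it directly approximates the angular similarity mentioned in the question. A minimal numpy sketch (the numbers of planes and tables are illustrative guesses):

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_tables(data, n_planes=16, n_tables=8):
    dim = data.shape[1]
    planes = rng.standard_normal((n_tables, n_planes, dim))
    tables = [defaultdict(list) for _ in range(n_tables)]
    for t in range(n_tables):
        keys = data @ planes[t].T > 0          # sign pattern of projections = hash key
        for i, key in enumerate(keys):
            tables[t][key.tobytes()].append(i)
    return planes, tables

def approx_nearest(query, data, planes, tables, k=5):
    candidates = set()
    for t in range(len(tables)):
        key = (planes[t] @ query > 0).tobytes()
        candidates.update(tables[t].get(key, []))
    if not candidates:                          # all buckets empty: fall back to brute force
        candidates = range(len(data))
    cand = np.fromiter(candidates, dtype=int)
    sims = (data[cand] @ query) / (np.linalg.norm(data[cand], axis=1) * np.linalg.norm(query) + 1e-12)
    return cand[np.argsort(-sims)[:k]]          # rank candidates by true cosine similarity

# Example on random sparse binary "context" vectors:
data = (rng.random((2000, 300)) < 0.05).astype(float)
planes, tables = build_tables(data)
print(approx_nearest(data[0], data, planes, tables))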
You could go for RKD-trees (a forest of them), but maybe that is too much for now.

Comparison of the runtime of Nearest Neighbor queries on different data structures

Given n points in d-dimensional space, there are several data structures, such as kd-trees, quadtrees, etc., to index the points. On these data structures it is possible to implement a straightforward algorithm for nearest neighbor queries around a given input point.
Is there a book, paper, survey, ... that compares the theoretical (mostly expected) runtime of the nearest neighbor query on different data structures?
The data I am looking at is composed of fairly small point clouds, so it can all be processed in main memory. For the sake of simplicity, I assume the data to be uniformly distributed. That is, I am not interested in real-world performance, but rather in theoretical results.
You leave the dimension of the points undefined and only give an approximation for the number of points. What does small mean? It is relative; what one person means by small may not be small to another.
What you are searching for, of course, doesn't exist. Your question is pretty much this:
Question:
For a small (whatever small means to you) dataset, of any dimension, with data that follow a uniform distribution, what is the optimal data structure to use?
Answer:
There is no such data structure.
Wouldn't it be strange to have an answer to that? A rough analogy would be the "Which is the optimal programming language?" question that most first-year undergrads ask. Your question is not that naive, but it is walking down the same track.
Why is there no such data structure?
Because the dimension of the dataset is variable. You might have a dataset in 2 dimensions, but you could also have a dataset in 1000 dimensions, or better yet, a dataset in 1000 dimensions with an intrinsic dimension that is much less than 1000. Think about it: could one propose a data structure that would behave equally well for the three datasets I mentioned? I doubt it.
In fact, there are some data structures that behave really nicely in low dimensions (quadtrees and kd-trees, for example), while others do much better in higher dimensions (an RKD-tree forest, for instance).
Moreover, the algorithms and the expected performance for nearest neighbour search depend heavily on the dimension of the dataset, as well as on its size and the nature of the queries (for example, a query that is far from the dataset, or equidistant from many of its points, will probably result in slow search performance).
In lower dimensions, one would perform an exact k-nearest neighbour (k-NN) search. In higher dimensions, it would be wiser to perform a k-approximate NN search. In this case, we accept the following trade-off:
Speed vs. accuracy
What happens is that we achieve faster execution by sacrificing the correctness of our result. In other words, our search routine will be relatively fast, but it may not return the true NN, only an approximation of it (how likely that is depends on many parameters, such as the nature of your problem and the library you are using). For example, it might return the third NN to the query point instead of the exact NN. You could also check the approximate-nn-searching wiki tag.
Why not always search for the exact NN? Because of the curse of dimensionality, which causes the solutions that work well in lower dimensions to perform no better than brute force (searching all the points in the dataset for every query).
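To make the trade-off concrete, here is a small numpy sketch (the random-projection filter below is only an illustrative stand-in for a real approximate index) that measures how often an approximate search returns the true nearest neighbour:

import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((20000, 64))
queries = rng.standard_normal((200, 64))

def exact_nn(q):
    # exact NN by brute force: the baseline we compare against
    return int(np.argmin(np.linalg.norm(data - q, axis=1)))

# Approximate NN: project to 8 dimensions, keep the 50 best candidates there,
# then refine with exact distances on those candidates only.
proj = rng.standard_normal((64, 8)) / np.sqrt(8)
data_lo = data @ proj

def approx_nn(q, n_candidates=50):
    cand = np.argpartition(np.linalg.norm(data_lo - q @ proj, axis=1), n_candidates)[:n_candidates]
    return int(cand[np.argmin(np.linalg.norm(data[cand] - q, axis=1))])

recall = np.mean([exact_nn(q) == approx_nn(q) for q in queries])
print("fraction of queries where the approximate NN is the true NN:", recall)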
You see, my answer has already gotten long, so I should stop here. Your question is too broad, but interesting, I must admit. :)
In conclusion, the optimal data structure (and algorithm) to use depends on your problem. The size of the dataset you are handling, and the dimension and intrinsic dimension of the points, play a key role. The number and the nature of the queries also play an important role.
For nearest neighbor searches of potentially non-uniform point data, I think a kd-tree will give you the best performance in general. As far as broad overviews and theoretical cost analysis go, I think Wikipedia is an OK place to start, but keep in mind it does not cover much real-world optimization:
http://en.wikipedia.org/wiki/Nearest_neighbor_search
http://en.wikipedia.org/wiki/Space_partitioning
Theoretical performance is one thing, but real-world performance is something else entirely. Real-world performance depends as much on the details of the data structure implementation as it does on the type of data structure. For example, a pointer-less (compact array) implementation can be many times faster than a pointer-based implementation because of improved cache coherence and faster data allocation. Wider branching may be slower in theory but faster in practice if you leverage SIMD to test several branches simultaneously.
Also, the exact nature of your point data can have a big impact on performance. Uniform distributions are less demanding and can be handled quickly with simpler data structures. Non-uniform distributions require more care. (Kd-trees work well for both uniform and non-uniform data.) Also, if your data is too large to process in-core, then you will need to take an entirely different approach compared to smaller data sets.
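As one concrete example of a compact, array-based implementation, scipy's cKDTree is a C-level kd-tree that, as far as I know, keeps its nodes in flat arrays rather than as individually allocated records; a quick usage sketch (all parameters left at defaults):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((100000, 3))        # small in-memory point cloud, uniform by assumption

tree = cKDTree(points)                  # built once, queried many times
queries = rng.random((10, 3))
dists, idx = tree.query(queries, k=5)   # 5 nearest neighbours per query point
print(dists.shape, idx.shape)           # (10, 5) (10, 5)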

kNN with dynamic insertions in high-dim space

I am looking for a method to do fast nearest neighbour search (hopefully O(log n)) for high-dimensional points (typically ~11-13 dimensions). I would like it to behave optimally during insertions after the structure has been initialized. A kd-tree came to mind, but if you do dynamic insertions instead of bulk loading, the kd-tree ceases to be balanced, and AFAIK rebalancing is an expensive operation.
So I wanted to know what data structures you would prefer for this kind of setting: you have high-dimensional points and you would like to do insertions and query for the nearest neighbour.
Another data structure that comes to mind is the cover tree. Unlike KD trees which were originally developed to answer range queries, this data structure is optimal for nearest neighbor queries. It has been used in n-body problems that involve computing the k nearest neighbors of all the data points. Such problems also occur in density estimation schemes (Parzen windows).
I don't know enough about your specific problem, but I do know that there are online versions of this data structure. Check out Alexander Gray's page and this link.
The Curse of Dimensionality gets in the way here. You might consider applying Principal Component Analysis (PCA) to reduce the dimensionality, but as far as I know, nobody has a great answer for this.
I have dealt with this type of problem before (in audio and video fingerprinting), sometimes with up to 30 dimensions. Analysis usually revealed that some of the dimensions did not contain relevant information for searches (actually fuzzy searches, my main goal), so I omitted them from the index structures used to access the data, but included them in the logic to determine matches from a list of candidates found during the search. This effectively reduced the dimensionality to a tractable level.
I simplified things further by quantizing the remaining dimensions severely, such that the entire multidimensional space was mapped into a 32-bit integer. I used this as the key in an STL map (a red-black tree), though I could have used a hash table. I was able to add millions of records dynamically to such a structure (RAM-based, of course) in about a minute or two, and searches took about a millisecond on average, though the data was by no means evenly distributed. Searches required careful enumeration of values in the dimensions that were mapped into the 32-bit key, but were reliable enough to use in a commercial product. I believe it is used to this day in iTunes Match, if my sources are correct. :)
The bottom line is that I recommend you take a look at your data and do something custom that exploits features in it to make for fast indexing and searching. Find the dimensions that vary the most and are the most independent of each other. Quantize those and use them as the key in an index. Each bucket in the index contains all items that share that key (there will likely be more than one). To find nearest neighbors, look at "nearby" keys and within each bucket, look for nearby values. Good luck.
p.s. I wrote a paper on my technique, available here. Sorry about the paywall. Perhaps you can find a free copy elsewhere. Let me know if you have any questions about it.
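A toy sketch of the quantize-and-bucket idea described above; the choice of 4 dimensions at 8 bits each, and the +/-1 neighbouring-cell enumeration, are illustrative assumptions, not the scheme from the paper:

from collections import defaultdict
from itertools import product
import math

N_DIMS = 4        # the 4 "most informative" dimensions, chosen by prior analysis
BITS = 8          # 8 bits per dimension -> 4 * 8 = 32-bit key
LEVELS = 1 << BITS

def quantize(x):                          # assumes each selected dimension is scaled to [0, 1)
    return min(LEVELS - 1, max(0, int(x * LEVELS)))

def make_key(vec):                        # pack the quantized dimensions into one 32-bit int
    key = 0
    for d in range(N_DIMS):
        key = (key << BITS) | quantize(vec[d])
    return key

buckets = defaultdict(list)               # key -> list of (record_id, full_vector)

def insert(record_id, vec):
    buckets[make_key(vec)].append((record_id, vec))

def search(query, radius=1):
    # Enumerate "nearby" keys (+/- radius in each quantized dimension),
    # then scan only the matching buckets with exact distances.
    cells = [range(max(0, quantize(query[d]) - radius),
                   min(LEVELS, quantize(query[d]) + radius + 1)) for d in range(N_DIMS)]
    best, best_d = None, float("inf")
    for combo in product(*cells):
        key = 0
        for q in combo:
            key = (key << BITS) | q
        for record_id, vec in buckets.get(key, ()):
            d = math.dist(query, vec)      # exact distance on the full vector
            if d < best_d:
                best, best_d = record_id, d
    return best, best_d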
If you use a bucket kd-tree with a reasonably large bucket size, the tree gets a better idea of where to split when the leaves get too full. The guys in Robocode do this under extremely harsh time constraints, with random insertions happening on the fly and kNN with k > 80, d > 10 and n > 30k in under 1 ms. Check out this kD-Tree Tutorial, which explains a bunch of kd-tree enhancements and how to implement them.
In my experience, 11-13 dimensions is not too bad -- if you bulk-load. Both bulk-loaded R-trees (in contrast to k-d-trees these remain balanced!) and k-d-trees should still work much better than linear scanning.
Once you go fully dynamic, my experiences are much worse. Roughly: with bulk-loaded trees I'm seeing 20x speedups, with incrementally built R-trees just 7x. So it does pay off to frequently rebuild the tree. And depending on how you organize your data, it may be much faster than you think. The bulk load for the k-d-tree that I'm using is O(n log n), and I read that there is an O(n log log n) variant, too, with a low constant factor. For the R-tree, Sort-Tile-Recursive is the best bulk load I have seen so far, also O(n log n) with a low constant factor.
So yes, in high dimensionality I would consider just rebuilding the tree from time to time.
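A rough sketch of that "rebuild from time to time" strategy, using scipy's cKDTree as a stand-in for a bulk-loaded tree (the buffer threshold is an arbitrary choice): new points go into a small brute-force buffer, queries check both the tree and the buffer, and the tree is rebuilt once the buffer grows too large.

import numpy as np
from scipy.spatial import cKDTree

class RebuildingKNN:
    def __init__(self, points, rebuild_threshold=1000):
        self.points = np.asarray(points, dtype=float)
        self.tree = cKDTree(self.points)           # bulk-loaded, balanced tree
        self.buffer = []                           # recent insertions, scanned linearly
        self.rebuild_threshold = rebuild_threshold

    def insert(self, p):
        self.buffer.append(np.asarray(p, dtype=float))
        if len(self.buffer) >= self.rebuild_threshold:
            # periodic rebuild: fold the buffer into the main tree via a fresh bulk load
            self.points = np.vstack([self.points, np.array(self.buffer)])
            self.tree = cKDTree(self.points)
            self.buffer = []

    def query(self, q, k=1):
        q = np.asarray(q, dtype=float)
        dists, idx = self.tree.query(q, k=k)
        best = list(zip(np.atleast_1d(dists), np.atleast_1d(idx)))
        # also check the not-yet-indexed buffer by brute force
        for p in self.buffer:
            best.append((float(np.linalg.norm(p - q)), None))   # None: point not in the tree yet
        best.sort(key=lambda t: t[0])
        return best[:k]                            # list of (distance, index-or-None) pairs

# Usage: 12-dimensional points, as in the question.
knn = RebuildingKNN(np.random.rand(100000, 12))
knn.insert(np.random.rand(12))
print(knn.query(np.random.rand(12), k=3))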
