Algorithm to find k neighbors in a certain range?

Suppose there is a point cloud of 50,000 points in x-y-z 3D space. For every point in this cloud, what algorithms or data structures should be used to find its k neighbours that lie within a distance range of [R,r]? The naive way is to go through each of the other 49,999 points for each of the 50,000 points and do a metric test, but this approach takes far too long. Just as there is the k-d tree for finding a nearest neighbour quickly, is there a real-time data structure / algorithm implementation out there to pre-process the point cloud and achieve this goal in the shortest time?

Your problem is part of the topic of Nearest Neighbor Search, or more precisely, k-Nearest Neighbor Search. The answer to your question depends on the data structure you are using to store the points. If you use R-trees or variants like R*-trees, and you are doing multiple searches on your database, you will likely find a substantial performance improvement in two or three-dimensional space compared with naive linear search. In higher dimensions, space partitioning schemes tend to underperform linear search.
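For concreteness, here is a minimal sketch (not from the answer above) of the preprocess-once, query-many pattern using SciPy's cKDTree; the sizes and the random cloud are purely illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.random((50_000, 3))           # the 50,000-point cloud
    tree = cKDTree(points)                     # built once, reused for every query

    dists, idx = tree.query(points[0], k=6)    # 6 nearest neighbours (the point itself included)
    print(idx[1:], dists[1:])                  # drop the query point itself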

As some answers already suggest, for NN search you could use a tree structure such as a k-d tree. Implementations are available for every major programming language.
If your notation [R,r] describes a hollow sphere (a spherical shell), you should compare one-stage testing (distance within the interval) against two stages (test against the outer radius, then remove the samples that also pass the test against the inner radius); see the sketch below.
You also did not mention your performance requirements (timing or frame rate?) or your intended application (which determines what approach is feasible).
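As an illustration of the two-stage test, here is a minimal sketch using SciPy's cKDTree; the radii, the random cloud and the function name neighbours_in_shell are made up for the example, and the interval is assumed to mean "between an inner and an outer radius".

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    points = rng.random((50_000, 3))
    tree = cKDTree(points)                      # one-time preprocessing

    def neighbours_in_shell(tree, points, query, r_inner, r_outer, k):
        # Stage 1: everything within the outer radius.
        candidates = np.asarray(tree.query_ball_point(query, r_outer))
        # Stage 2: drop candidates that also pass the inner-radius test.
        d = np.linalg.norm(points[candidates] - query, axis=1)
        keep = d >= r_inner
        order = np.argsort(d[keep])
        return candidates[keep][order][:k]      # up to k shell neighbours, closest first

    print(neighbours_in_shell(tree, points, points[0], 0.05, 0.20, k=10))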

If you are using an ordinary Euclidean metric, you could go through the list once per dimension and extract the points that lie within R of the query in each dimension, essentially extracting the enclosing cube. Searching the resulting list would still be O(n^2), but on a much smaller n; see the sketch below.
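A minimal numpy sketch of that enclosing-cube prefilter, with illustrative names and data; the exact Euclidean test then runs only on the much smaller candidate set.

    import numpy as np

    def cube_prefilter(points, query, R):
        # Axis-aligned cube test: within R of the query in every dimension.
        mask = np.all(np.abs(points - query) <= R, axis=1)
        return points[mask]

    rng = np.random.default_rng(2)
    pts = rng.random((50_000, 3))
    q = pts[0]
    candidates = cube_prefilter(pts, q, 0.1)
    # Exact (Euclidean) test only on the candidate set.
    close = candidates[np.linalg.norm(candidates - q, axis=1) <= 0.1]
    print(len(candidates), len(close))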

There are efficient algorithms (on average, for random data); see Nearest neighbor search.
Your approach is simple, but not efficient.
Please read through, check your requirements and get back so we can help.

Related

How to efficiently find the two farthest points (Euclidean distance) in a 4-dimensional space, given m points?

Given m 4-dimensional points, what is an efficient way to find the two points that are the maximum Euclidean distance apart?
Currently I am just using the brute-force approach and checking every pairwise distance with two nested for loops (O(m^2)), but this is very bad as it does not scale.
The cost of this problem scales with the dimensionality. At around 4 dimensions you're usually better off with brute force.
If there is some known structure or usage pattern for this data you can cut things down. For example, if you run this query a lot but the points don't change much, you can build the answer incrementally: each time you add a new point, check it against the existing points for the farthest point, caching the data from the brute force. You'd get O(N) per insertion and O(N) per farthest query, but you'd need to do that N times, giving you O(N^2) overall.
You could reduce this a bit if you also clustered the data: if you define clusters of points during insertion, you can determine that, since your house is in New York, no house in Paris can be farther away once you've already compared against a house in Australia. You can do that because the data is in clusters. But that's not going to save you much, because in 4D things get really hard to optimize: you need far more boxes to store the clusters, and most of the useful optimizations rely on proving that, having already exceeded some distance, you can rule out all the other points in a region. That works nicely in 2D, but those tricks become progressively messier with each added dimension.
Please look at the answer to this question: How to find two most distant points?
To find the convex hull you can use this: https://en.wikipedia.org/wiki/Gift_wrapping_algorithm
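A hedged sketch of the hull-then-brute-force idea behind the linked answers: the two farthest points are always vertices of the convex hull, so one can compute the hull first (here with SciPy's Qhull wrapper) and brute-force only the hull vertices. In 4D the hull can still contain many vertices, so treat this as a possible speed-up rather than a guarantee; the data here is made up.

    import numpy as np
    from scipy.spatial import ConvexHull
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(3)
    pts = rng.random((2_000, 4))                  # m points in 4D

    hull_idx = ConvexHull(pts).vertices           # indices of hull vertices
    hull_pts = pts[hull_idx]
    d = squareform(pdist(hull_pts))               # pairwise distances on the hull only
    i, j = np.unravel_index(np.argmax(d), d.shape)
    print(hull_idx[i], hull_idx[j], d[i, j])      # the two farthest points and their distance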

3D Nearest neighbour for query points located far away from set of points

I need to answer a lot of queries that ask for the nearest neighbour in a point set when the query point is located far away from that set. All the approaches I've found so far behave badly in this case (for example, a k-d tree may take O(N) per query) or require a Voronoi diagram (I have ~10M points, so a Voronoi diagram is too expensive).
Is there any known algorithm designed for such a task?
The problem here is the distances. When a query point is far from your dataset, the k-d tree has to check many points, which slows the query down.
The scenario you are facing is hard for nearest-neighbour structures in general (and it's not the usual case), but if I were you I would give Balanced Box-Decomposition trees a shot, where you can read more about their algorithm and data structure.
Some multidimensional indexes have kNN queries that could easily be adapted to your needs, especially with k == 1.
kNN algorithms usually have to first estimate the approximate nearest neighbour distance, then they use this distance to perform a range query.
In R-Trees or quadtrees, this estimation can be done efficiently by finding the node that is closest to your search point. Then they take one point from the closest node, calculate the distance to the search point, and then perform a range query based on this distance, usually with some multiplier because k>1.
This should be reasonably efficient even if the search point is far away.
If you are searching for one point only (k=1), you can adapt this algorithm to use a range query based exactly on the closest point you found, with no extra widening to get k>1 points.
If you are using Java, you could use my open-source implementations here. There is also a PH-Tree (a kind of quadtree, but much more space efficient and faster to load), which uses the same kNN approach.
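As a language-agnostic illustration of the estimate-then-range-query idea described above (this is not the linked Java code), here is a Python sketch in which a random sample stands in for "one point from the closest node"; real implementations get a much tighter radius from the tree itself.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(4)
    pts = rng.random((200_000, 3))
    tree = cKDTree(pts)

    def far_nn(tree, pts, query, sample_size=1_000):
        # Estimate: the distance to any sampled point bounds the true NN distance.
        sample = pts[rng.choice(len(pts), sample_size, replace=False)]
        est = np.min(np.linalg.norm(sample - query, axis=1))
        # Range query based exactly on that estimate (the k == 1 case); for k > 1
        # you would widen the radius by a small multiplier, as described above.
        cand = tree.query_ball_point(query, est)
        d = np.linalg.norm(pts[cand] - query, axis=1)
        return cand[int(np.argmin(d))]

    far_query = np.array([100.0, 100.0, 100.0])    # far away from the unit cube
    print(far_nn(tree, pts, far_query))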

Clustering elements based on highest similarity

I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I take it you are a developer and have no experience with data science?
There are a number of clustering algorithms, each with its advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than one might think.
I assume that N is small enough that you can store a matrix of N^2 float values in RAM? If so, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (it is a symmetric matrix, so only half of it needs to be stored). Make sure your similarity measure assigns a special value to pairs of images whose similarity is below the X% threshold, such as 0 or infinity (depending on whether you treat the function as a similarity measure or as a distance). The cleanest solution is to assign 1 to pairs whose similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, e.g., a depth-first search or any other graph-walking routine; that is your first cluster. After that, take the first unvisited vertex and repeat the graph walk (a sketch follows below). Of course you can store the graph as an adjacency list to save memory.
This algorithm assumes that you really do not care how similar the images are or which pairs are more similar than others, only whether they are similar enough (similarity measure greater than the given threshold).
Unfortunately, in cluster analysis it is common that all of the possible pairs have to be computed. It is possible to save some of the distance calls by using fancy data structures for k-nearest-neighbour search, but then you have to make sure that your similarity measure satisfies the triangle inequality.
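A minimal sketch of the thresholded-similarity-graph approach described above; the similarity function, the toy layer sets and the Jaccard measure are placeholders for whatever measure the asker already has. Note, as said above, that connected components only require a chain of similar pairs, not pairwise similarity within a cluster.

    from collections import defaultdict

    def cluster_by_threshold(images, similarity, threshold):
        n = len(images)
        # Build the graph: an edge wherever similarity exceeds the threshold.
        adj = defaultdict(list)
        for i in range(n):
            for j in range(i + 1, n):            # symmetric, so only half the pairs
                if similarity(images[i], images[j]) > threshold:
                    adj[i].append(j)
                    adj[j].append(i)
        # Depth-first search: each connected component is one cluster.
        clusters, seen = [], set()
        for start in range(n):
            if start in seen:
                continue
            component, stack = [], [start]
            while stack:
                v = stack.pop()
                if v in seen:
                    continue
                seen.add(v)
                component.append(v)
                stack.extend(adj[v])
            clusters.append(component)
        return clusters

    # Toy usage with a layer-overlap (Jaccard) similarity on hypothetical data.
    imgs = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"y", "z"}]
    jaccard = lambda s, t: len(s & t) / len(s | t)
    print(cluster_by_threshold(imgs, jaccard, 0.4))   # -> [[0, 1], [2], [3]]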
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow to compute; at the top of the dendrogram all images are in one cluster, so you have to cut the dendrogram at the proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram from top to bottom until I find a cluster where the distance is below the threshold. In the worst case there is no such cluster, but then I end up in a leaf of the dendrogram, which means that element is in a cluster of its own.
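For reference, cutting a dendrogram at a distance threshold is exactly what SciPy's fcluster does; a minimal sketch, assuming a precomputed distance matrix defined as 1 - similarity (the toy values are made up). Complete linkage is used here because it guarantees that every pair inside a cluster is within the cut distance, which matches the "more than X% with each other" requirement.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    distance = np.array([[0.0, 0.2, 0.9, 0.8],
                         [0.2, 0.0, 0.85, 0.9],
                         [0.9, 0.85, 0.0, 0.1],
                         [0.8, 0.9, 0.1, 0.0]])      # toy 1 - similarity values

    Z = linkage(squareform(distance), method='complete')
    labels = fcluster(Z, t=0.5, criterion='distance')   # cut the dendrogram at 0.5
    print(labels)                                        # e.g. [1 1 2 2]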

efficient algorithm to find nearest point in a graph that does not have a known equation

I'm asking this question out of curiosity, since my quick-and-dirty implementation seems to be good enough. However, I'm curious what a better implementation would be.
I have a graph of real-world data. There are no duplicate X values and the X value increments at a consistent rate across the graph, but the Y data comes from real-world measurements. I want to find the nearest point on the graph to an arbitrary given point P programmatically. I'm trying to find an efficient (i.e. fast) algorithm for doing this. I don't need the exact closest point; I can settle for a point that is 'nearly' the closest.
The obvious lazy solution is to iterate through every single point in the graph, calculate the distance, and then take the minimum. This, however, could be slow for large graphs; too slow for what I want.
Since I only need an approximate closest point, I imagine the fastest approach would involve generating a best-fit line and using that line to calculate where the point should be in real time, but that sounds like a mathematical headache I'm not about to take on.
My solution is a hack that works only because I assume my point P isn't arbitrary; namely, I assume that P will usually be close to my graph line, and when that happens I can rule out the distant X values from consideration. I calculate how close the point on the line that shares P's X coordinate is, and I use the distance between that point and P to work out the largest/smallest X values that could possibly contain closer points.
I can't help but feel there should be a faster algorithm than my solution (which is only useful because I assume 99% of the time my point P will already be close to the line). I tried googling for better algorithms, but found so many that didn't quite fit that it was hard to locate what I was looking for amongst the clutter of inappropriate ones. So, does anyone here have a suggestion for a more efficient algorithm? Keep in mind I don't need a full implementation, since what I have works for my needs; I'm just curious what the proper solution would have been.
If you store the [x,y] points in a quadtree you'll be able to find the closest one quickly (something like O(log n)). I think that's the best you can do without making assumptions about where the point is going to be. Rather than repeat the algorithm here have a look at this link.
Your solution is pretty good. By examining how the points vary in y, couldn't you calculate a bound for the number of points along the x axis you need to examine, instead of using an arbitrary one?
Let's say your point is P=(x,y) and your real-world data is a function y=f(x).
Step 1: Calculate r=|f(x)-y|.
Step 2: Find points in the interval I=(x-r,x+r)
Step 3: Find the closest point in I to P.
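A small sketch of those three steps on uniformly sampled data (x evenly spaced, y measured); the sample curve and the function name are illustrative. Since the data is discrete, the pruning radius is taken as the full distance to the sample nearest P in x, which is a safe stand-in for r = |f(x) - y|.

    import numpy as np

    def near_point_on_curve(xs, ys, px, py):
        # Step 1: distance to the sample directly above/below P in x.
        i = int(np.argmin(np.abs(xs - px)))
        r = np.hypot(xs[i] - px, ys[i] - py)
        # Step 2: only samples with x in [px - r, px + r] can be closer than r.
        mask = np.abs(xs - px) <= r
        # Step 3: exact distance test on the (usually small) window.
        d2 = (xs[mask] - px) ** 2 + (ys[mask] - py) ** 2
        j = np.flatnonzero(mask)[int(np.argmin(d2))]
        return xs[j], ys[j]

    xs = np.linspace(0.0, 10.0, 1_000)
    ys = np.sin(xs) + 0.05 * np.random.default_rng(5).standard_normal(xs.size)
    print(near_point_on_curve(xs, ys, 3.2, 0.1))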
If you can use a data structure, some common data structures for spatial searching (including nearest neighbour) are...
quad-tree (and octree etc).
kd-tree
bsp tree (only practical for a static set of points).
r-tree
The r-tree comes in a number of variants. It's very closely related to the B+ tree, but with (depending on the variant) different orderings on the items (points) in the leaf nodes.
The Hilbert R tree uses a strict ordering of points based on the Hilbert curve. The Hilbert curve (or rather a generalization of it) is very good at ordering multi-dimensional data so that nearby points in space are usually nearby in the linear ordering.
In principle, the Hilbert ordering could be applied by sorting a simple array of points. The natural clustering in this would mean that a search would usually only need to search a few fairly-short spans in the array - with the complication being that you need to work out which spans they are.
I used to have a link for a good paper on doing the Hilbert curve ordering calculations, but I've lost it. An ordering based on Gray codes would be simpler, but not quite as efficient at clustering. In fact, there's a deep connection between Gray codes and Hilbert curves - that paper I've lost uses Gray code related functions quite a bit.
EDIT - I found that link - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133.7490
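The Hilbert ordering itself is fiddly to code, so as a stand-in this sketch orders points by Morton (Z-order) code, a cruder space-filling-curve ordering in the same family: after sorting, nearby points usually land near each other in the array. Names and data are illustrative.

    import numpy as np

    def interleave_bits(ix, iy, bits=16):
        # Morton code: interleave the bits of the two integer coordinates.
        code = 0
        for b in range(bits):
            code |= ((ix >> b) & 1) << (2 * b)
            code |= ((iy >> b) & 1) << (2 * b + 1)
        return code

    def morton_sort(points, bits=16):
        # Quantize coordinates to integers, then sort by interleaved bits.
        mins, maxs = points.min(axis=0), points.max(axis=0)
        scaled = ((points - mins) / (maxs - mins) * ((1 << bits) - 1)).astype(np.int64)
        codes = np.array([interleave_bits(int(x), int(y), bits) for x, y in scaled])
        return points[np.argsort(codes)]

    pts = np.random.default_rng(6).random((1_000, 2))
    ordered = morton_sort(pts)          # nearby points tend to be adjacent here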

3D clustering Algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbours within a given distance R. A further condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points may help. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an octree. For 3D data with a limited value domain it scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. Since you can stop when there are too few points in a cell, and build a deeper tree where there are a lot of points, this should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to stay balanced, which makes it harder to find the most dense areas. For your particular task, the octree's lack of balance, usually a drawback, is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found in approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the octree without even having to look at the actual points yet!
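A minimal octree sketch along those lines: split each cell into 8 children, stop when a cell holds few points, and keep only per-cell counts, so dense regions show up as deep, well-populated leaves. The class and parameter names are illustrative, not from any particular library.

    import numpy as np

    class Octree:
        def __init__(self, points, centre, half, depth=0, max_points=64, max_depth=10):
            self.centre, self.half, self.count = centre, half, len(points)
            self.children = []
            if len(points) > max_points and depth < max_depth:
                octant = (points >= centre).astype(int)            # 0/1 code per axis
                for code in range(8):
                    bits = np.array([(code >> a) & 1 for a in range(3)])
                    mask = np.all(octant == bits, axis=1)
                    if mask.any():
                        child_centre = centre + (bits - 0.5) * half
                        self.children.append(
                            Octree(points[mask], child_centre, half / 2,
                                   depth + 1, max_points, max_depth))

    def deep_leaves(node, depth=0):
        # Dense regions show up as deep leaves that still hold many points.
        if not node.children:
            return [(depth, node.count, node.centre)]
        return [leaf for c in node.children for leaf in deep_leaves(c, depth + 1)]

    rng = np.random.default_rng(7)
    pts = np.vstack([rng.random((5_000, 3)),                       # background noise
                     rng.normal(0.5, 0.01, (5_000, 3))])           # one dense clump
    root = Octree(pts, centre=np.array([0.5, 0.5, 0.5]), half=0.5)
    leaves = deep_leaves(root)
    print(max(leaves, key=lambda t: (t[0], t[1])))                 # deepest, fullest leaf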
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
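A rough sketch of the bucket-by-R grid hashing idea from the answer above: hash every point into an R-sized cell, so the fullest cells point at the densest regions and can be processed or merged independently (edge cases that straddle neighbouring cells still need the care described above). Names and data are illustrative.

    import numpy as np
    from collections import Counter

    def dense_buckets(points, R, top=10):
        # Integer cell index per point; cell side length equals R.
        cells = np.floor(points / R).astype(np.int64)
        counts = Counter(map(tuple, cells))
        return counts.most_common(top)        # (cell index, point count) pairs

    rng = np.random.default_rng(8)
    pts = np.vstack([rng.random((20_000, 3)),
                     rng.normal(0.25, 0.02, (20_000, 3))])
    print(dense_buckets(pts, R=0.05, top=3))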
Here are some possible parts of a solution. There are various choices at each stage, which will depend on Ncluster, on how fast the data changes, and on what you want to do with the means.
Three steps: quantize, box, K-means.
1) Quantize: reduce the input XYZ coordinates to, say, 8 bits each by taking 2^8 percentiles of X, Y, Z separately. This will speed up the whole flow without much loss of detail. You could sort all 1G points, or just a random 1M, to get 8-bit thresholds x0 < x1 < ... < x256, y0 < y1 < ... < y256, z0 < z1 < ... < z256 with 2^(30-8) points in each range. To map a float X to an 8-bit x, an unrolled binary search is fast; see Bentley, Pearls p. 95.
Added: k-d trees split any point cloud into different-sized boxes, each with ~Leafsize points, which is much better than splitting X, Y, Z as above. But AFAIK you'd have to roll your own k-d tree code to split only the first, say, 16M boxes, and keep counts only, not the points.
2) Box: count the number of points in each 3D box [xj .. xj+1, yj .. yj+1, zj .. zj+1]. The average box will have 2^(30-3*8) points; the distribution will depend on how clumpy the data is. If some boxes are too big or get too many points, you could a) split them into 8, or b) track the centre of the points in each box; otherwise just take box midpoints.
3) K-means clustering on the 2^(3*8) box centres. (Google parallel "k means" -> 121k hits.) This depends strongly on K, a.k.a. Ncluster, and also on your radius R. A rough approach would be to grow a heap of, say, the 27*Ncluster boxes with the most points, then take the biggest ones subject to your radius constraint. (I like to start with a minimum spanning tree, then remove the K-1 longest links to get K clusters.) See also Color quantization.
I'd make Nbit, here 8, a parameter from the beginning. What is your Ncluster?
Added: if your points are moving in time, see collision-detection-of-huge-number-of-circles on SO.
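A small numpy sketch of steps 1) and 2) above, quantizing to Nbit = 8 bits per axis via percentiles and then counting points per box; one million random points stand in for the 1G.

    import numpy as np

    rng = np.random.default_rng(9)
    pts = rng.normal(size=(1_000_000, 3))
    NBIT = 8

    # 1) quantize: per-axis percentile edges -> 8-bit codes per coordinate.
    edges = np.percentile(pts, np.linspace(0, 100, 2**NBIT + 1), axis=0)
    codes = np.stack([np.clip(np.searchsorted(edges[:, a], pts[:, a]) - 1, 0, 2**NBIT - 1)
                      for a in range(3)], axis=1)

    # 2) box: count points per box via a flat index (sparse, only occupied boxes).
    flat = (codes[:, 0] << (2 * NBIT)) | (codes[:, 1] << NBIT) | codes[:, 2]
    box_ids, box_counts = np.unique(flat, return_counts=True)
    print(box_ids[np.argmax(box_counts)], box_counts.max())   # the fullest box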
I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (i.e. every 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, OctoMap is easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea: create a graph with the given points, with an edge between two points whenever their distance is < R.
Building this kind of graph is similar to a spatial decomposition. Your questions can then be answered with local search in the graph: the top points are the vertices with the highest degree, and the separation condition amounts to finding a maximal independent set among those high-degree vertices.
I think both the graph construction and the search can be parallelised. This approach can have a large memory requirement; splitting the domain and working with graphs for smaller volumes can reduce it.
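Tying the thread together, here is a hedged sketch of the selection step from the original question using the neighbour-count idea above: count neighbours within R for every point, then greedily pick the densest points subject to the mutual-separation constraint. For a billion points this would have to run per spatial block rather than globally; the sizes and names here are illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    def top_n_dense(points, R, N):
        tree = cKDTree(points)
        # Degree of each vertex in the "distance < R" graph (self included).
        degrees = np.array([len(nb) for nb in tree.query_ball_point(points, R)])
        chosen = []
        for i in np.argsort(-degrees):                 # densest first
            p = points[i]
            if all(np.linalg.norm(p - points[j]) > R for j in chosen):
                chosen.append(i)
                if len(chosen) == N:
                    break
        return chosen

    rng = np.random.default_rng(10)
    pts = rng.random((20_000, 3))
    print(top_n_dense(pts, R=0.05, N=5))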
