Fixed radius nearest neighbours, with sets - algorithm

I need to efficiently solve the following problem, a variant of the Fixed radius nearest neighbours problem:
Given a list of n sets S, where each set S[i] consists of (2-dimensional) input points, and a query point q: List the indices of all sets in S such that at least one point of the set is within distance 'r' of q.
Approaches involving range trees, k-d trees and similar data structures storing all the points solve this in running times on the order of O(log(n) + k), where n is the total number of points and k is the number of results (points) returned. My problem is that each set is quite large, and while I can deal with large values of n, large values of k make my algorithm run very slowly and consume prohibitive amounts of space, when I actually only need the indices of the valid sets rather than all of the individual points or the nearest point in each set.
If I build a randomized k-d tree for each set and then query each of them with q (correct me if I'm wrong), I can solve the problem in O(m*log(n/m)) amortized time, where m is the number of sets. That is a significant improvement over the first approach, but before implementing it I wonder whether there are better practical ways of solving the problem, especially since m and n could grow to 10x or more of their current values, and I am also concerned about the space/memory this approach uses. Elements can also be added to the sets, which may leave the k-d trees unbalanced and require frequent rebuilds.
Other approaches I've tried involve partitioning the 2-d space into a grid and then using bloom filters (taking their union), but that takes a prohibitive amount of space, and I still need to query m sets. I also can't use a disjoint-set structure to compute unions, because the points in each partition are not disjoint and cannot be made disjoint.
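Roughly, the grid idea looks like this (untested sketch only, with an exact per-cell set of indices shown in place of a bloom filter; class and method names are just for illustration):

```java
import java.util.*;

// Illustrative sketch: a uniform grid where each cell remembers only the
// indices of the sets that have at least one point in it.
class GridSetIndex {
    private final double cellSize;
    private final Map<Long, Set<Integer>> cells = new HashMap<>();

    GridSetIndex(double cellSize) { this.cellSize = cellSize; }

    private long key(double x, double y) {
        long cx = (long) Math.floor(x / cellSize);
        long cy = (long) Math.floor(y / cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    // Record that set 'setId' has a point at (x, y).
    void add(double x, double y, int setId) {
        cells.computeIfAbsent(key(x, y), k -> new HashSet<>()).add(setId);
    }

    // Indices of sets that *may* have a point within r of (qx, qy):
    // the union of the index sets of all cells overlapping the query disk.
    Set<Integer> query(double qx, double qy, double r) {
        Set<Integer> result = new HashSet<>();
        long cx0 = (long) Math.floor((qx - r) / cellSize);
        long cx1 = (long) Math.floor((qx + r) / cellSize);
        long cy0 = (long) Math.floor((qy - r) / cellSize);
        long cy1 = (long) Math.floor((qy + r) / cellSize);
        for (long cx = cx0; cx <= cx1; cx++)
            for (long cy = cy0; cy <= cy1; cy++) {
                Set<Integer> ids = cells.get((cx << 32) ^ (cy & 0xffffffffL));
                if (ids != null) result.addAll(ids);
            }
        return result;
    }
}
```

Choosing the cell size close to r keeps the number of cells per query small, but this is still only a coarse filter: a returned set may have all its points in a corner of a cell just outside the radius, so an exact per-set check may still be needed afterwards.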
Current values I am working with:
Total number of points: 250 million (could become 10x larger)
Number of sets: 50,000
The number of points in a set is thus ~5,000 on average, but some sets have 200,000+ points.
Values of k (number of matching points), for radii of interest: up to 40 million when there are 250 million points. The points are very densely clustered in some places. Even for such a large value of k, the number of matching sets is only 30,000 or so.
I'd welcome an approach along the lines of "once you've found any point in a set within the radius, don't bother processing the set's other points." Any other approach that solves this problem efficiently is, of course, equally welcome.
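In other words, something like this untested sketch: a toy 2-d tree per set whose query bails out at the first point found within r (none of this is from an existing library):

```java
// Illustrative sketch of "stop at the first hit": a tiny unbalanced 2-d tree
// whose query returns true as soon as any point lies within distance r of q.
class EarlyExitKdTree {
    private static final class Node {
        final double x, y;
        Node left, right;
        Node(double x, double y) { this.x = x; this.y = y; }
    }
    private Node root;

    void insert(double x, double y) { root = insert(root, x, y, 0); }

    private Node insert(Node n, double x, double y, int depth) {
        if (n == null) return new Node(x, y);
        boolean byX = (depth % 2 == 0);
        double q = byX ? x : y, p = byX ? n.x : n.y;
        if (q < p) n.left = insert(n.left, x, y, depth + 1);
        else       n.right = insert(n.right, x, y, depth + 1);
        return n;
    }

    // True iff some stored point is within distance r of (qx, qy).
    boolean anyWithin(double qx, double qy, double r) {
        return anyWithin(root, qx, qy, r * r, 0);
    }

    private boolean anyWithin(Node n, double qx, double qy, double r2, int depth) {
        if (n == null) return false;
        double dx = n.x - qx, dy = n.y - qy;
        if (dx * dx + dy * dy <= r2) return true;          // early exit on first hit
        boolean byX = (depth % 2 == 0);
        double diff = byX ? qx - n.x : qy - n.y;           // signed distance to the split
        Node near = diff < 0 ? n.left : n.right;
        Node far  = diff < 0 ? n.right : n.left;
        if (anyWithin(near, qx, qy, r2, depth + 1)) return true;
        // Only cross the splitting line if the query disk actually reaches it.
        return diff * diff <= r2 && anyWithin(far, qx, qy, r2, depth + 1);
    }
}
```

With one such tree per set, a query costs at most one early-exit search per set, and sets whose bounding box is farther than r from q could presumably be skipped without descending into them at all.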
I don't have to store the entire data structure in memory; I can store it in a database and retrieve the parts that are needed.
On a side note, I'd also appreciate it if someone could point me to a well-tested k-d tree implementation in Java that works well for at least 2 dimensions and serializes and deserializes properly.

Related

Most efficient implementation to get the closest k items

In the K-Nearest-Neighbor algorithm, we find the top k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:
Approach 1
Calculate the distances to the new point from each of N observations
Sort the distances using quicksort and take the top k points
This would take O(N + N log N) = O(N log N) time.
Approach 2
Create a max-heap of size k
Calculate the distance from the new point for the first k points
For each following observation, if its distance is less than the max in the heap, pop the max from the heap and push the current observation
Re-heapify (log k operations for each of the N points)
Continue until there are no more observations, at which point the heap holds only the k closest distances.
This approach would take O(N + N log k) = O(N log k) operations.
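For concreteness, here is a rough, untested sketch of Approach 2 using a bounded max-heap (it only returns the k smallest distances for brevity, and it is not how sklearn actually implements this):

```java
import java.util.PriorityQueue;

// Illustrative sketch of Approach 2: keep at most k distances in a max-heap,
// so each of the N observations costs O(log k) at worst.
class KnnHeap {
    static double[] kSmallestDistances(double[][] points, double[] query, int k) {
        // Max-heap: the largest of the k best distances sits on top.
        PriorityQueue<Double> heap = new PriorityQueue<>((a, b) -> Double.compare(b, a));
        for (double[] p : points) {
            double dx = p[0] - query[0], dy = p[1] - query[1];
            double dist = Math.sqrt(dx * dx + dy * dy);
            if (heap.size() < k) {
                heap.offer(dist);
            } else if (dist < heap.peek()) {
                heap.poll();            // drop the current worst of the best k
                heap.offer(dist);
            }
        }
        double[] result = new double[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) result[i] = heap.poll();
        return result;                  // distances in ascending order
    }
}
```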
Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!
Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search
What you describe is linear search (since you need to compute the distance to every point in the dataset).
The good thing is that this always works. The bad thing is that it is slow, especially if you query it a lot.
If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is uniformly distributed (this doesn't mean perfectly, just not in very dense and very tight clusters), then space partitioning works great because it quickly discards the points that are too far away anyway (complexity O(log N)). It also works for higher dimensionality or when there are some clusters, but the performance suffers a bit (still better overall than linear search).
Usually space partitioning or locality sensitive hashing are enough for common datasets.
The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.

Given n points in a 2-D plane we have to find k nearest neighbours of each point among themselves

I explored the method using a min-heap. For each point we can store a min-heap of size k, but that takes too much space for large n (I'm targeting n around 100 million). Surely there must be a better way of doing this that uses less space without affecting the time complexity much. Is there some other data structure?
This problem is a typical setup for a k-d tree. Such a solution would have linearithmic complexity, but it may be relatively complex to implement (if a ready implementation is not available).
An alternative approach could be using bucketing to reduce the complexity of the naive algorithm. The idea is to separate the plane into "buckets", i.e. squares of some size, and place each point in the bucket it belongs to. The closest points will come from the closest buckets. For random data this can be quite a good improvement, but the worst case is still the same as the naive approach.

Fewest subsets with sum less than N

I have a specific sub-problem for which I am having trouble coming up with an optimal solution. This problem is similar to the subset sum group of problems as well as space filling problems, but I have not seen this specific problem posed anywhere. I don't necessarily need the optimal solution (as I am relatively certain it is NP-hard), but an effective and fast approximation would certainly suffice.
Problem: Given a list of positive integers, find the fewest number of disjoint subsets covering the entire list, where each subset sums to less than N. Obviously no integer in the original list can be greater than N.
In my application I have many lists, and I can concatenate them into columns of a matrix as long as they fit in the matrix together. For downstream purposes I would like to have as little "wasted" space as possible in the resulting ragged matrix, hence the similarity to space-filling problems.
Thus far I am employing a greedy-like approach, processing from the largest integers down and finding the largest integer that fits into the current subset under the limit N. Once the smallest integer no longer fits into the current subset I proceed to the next subset similarly until all numbers are exhausted. This almost certainly does not find the optimal solution, but was the best I could come up with quickly.
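Roughly, the greedy looks like this (untested sketch; I'm assuming "sums to less than N" means the running total must stay at or below N - 1):

```java
import java.util.*;

// Sketch of the greedy described above (illustrative, not claimed optimal):
// sort descending, then repeatedly put the largest remaining integer that
// still fits into the current subset; when nothing fits, open a new subset.
class GreedyPacker {
    static List<List<Integer>> pack(List<Integer> values, int limitN) {
        List<Integer> remaining = new ArrayList<>(values);
        remaining.sort(Collections.reverseOrder());           // largest first
        List<List<Integer>> subsets = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<Integer> current = new ArrayList<>();
            int room = limitN - 1;                             // subset must sum to < N
            boolean placed = true;
            while (placed) {
                placed = false;
                // Find the largest remaining value that still fits.
                for (Iterator<Integer> it = remaining.iterator(); it.hasNext(); ) {
                    int v = it.next();
                    if (v <= room) {
                        current.add(v);
                        room -= v;
                        it.remove();
                        placed = true;
                        break;                                 // restart from the largest
                    }
                }
            }
            if (current.isEmpty())
                throw new IllegalArgumentException(
                        "value too large for any subset: " + remaining.get(0));
            subsets.add(current);
        }
        return subsets;
    }
}
```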
BONUS: My application actually requires batches, where there is a limit on the number of subsets in each batch (M). Thus the larger problem is to find the fewest batches where each batch contains at most M subsets and each subset sums to less than N.
Straight from Wikipedia (with some amendments in brackets):
In the bin packing problem, objects [Integers] of different volumes [values] must be packed into a finite number of bins [sets] or containers each of volume V [summation of the subset < V] in a way that minimizes the number of bins [sets] used. In computational complexity theory, it is a combinatorial NP-hard problem.
https://en.wikipedia.org/wiki/Bin_packing_problem
As far as I can tell, this is exactly what you are looking for.

Finding a single cluster of points with low variance

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is a way to do it (from what I have understood of the problem):
Select a point k from the dataset and compute the list of all points sorted in ascending order of their distance from k, in O(N log N).
Keeping k as the mean, add points from the sorted list into the set while the variance stays below C, and then stop.
Do this for all points.
Keep track of the largest set found.
Time complexity: O(N^2 log N), where N is the size of the dataset.
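A rough sketch of the above (illustrative only; here "variance" is taken as the mean squared distance from the chosen anchor point, since the anchor is kept as the mean), with each point stored as an (re, im) pair:

```java
import java.util.*;

// Illustrative sketch of the O(N^2 log N) procedure described above.
class LargestLowVarianceCluster {
    static List<double[]> largestCluster(List<double[]> points, double C) {
        List<double[]> best = Collections.emptyList();
        for (double[] anchor : points) {
            // Sort all points by squared distance from the anchor: O(N log N).
            List<double[]> byDist = new ArrayList<>(points);
            byDist.sort(Comparator.comparingDouble(
                    (double[] p) -> sq(p[0] - anchor[0]) + sq(p[1] - anchor[1])));
            // Add points in that order while the mean squared distance stays < C.
            // Distances are non-decreasing, so this mean can only grow.
            double sum = 0;
            int count = 0;
            for (double[] p : byDist) {
                double d2 = sq(p[0] - anchor[0]) + sq(p[1] - anchor[1]);
                if ((sum + d2) / (count + 1) >= C) break;
                sum += d2;
                count++;
            }
            if (count > best.size()) best = new ArrayList<>(byDist.subList(0, count));
        }
        return best;                    // largest cluster found, possibly empty
    }
    private static double sq(double x) { return x * x; }
}
```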
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such objects.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.

Searching for a tuple with all elements greater than a given tuple efficiently

Consider the following list of tuples:
[(5,4,5), (6,9,6), (3,8,3), (7,9,8)]
I am trying to devise an algorithm to check whether there exists at least one tuple in the list where all elements of that tuple are greater than or equal to a given tuple (the needle).
For example, for a given tuple (6,5,7), the algorithm should return True, as every element of (6,5,7) is less than or equal to the corresponding element of the last tuple in the list, (7,9,8). However, for a given tuple (9,1,9), the algorithm should return False, as there is no tuple in the list whose every element is greater than or equal to the corresponding element of the given tuple. In particular, this is due to the first element 9 of the given tuple, which is larger than the first element of every tuple in the list.
A naive algorithm would loop through the tuples in the list one by one, and loop through the elements of each tuple in the inner loop. Assuming there are n tuples, each with m elements, this gives a complexity of O(nm).
I am thinking whether it would be possible to have an algorithm to produce the task with a lower complexity. Pre-processing or any fancy data-structure to store the data is allowed!
My original thought was to make use of some variant of binary search, but I can't seem to find a data structure that allows us to avoid falling back to the naive solution once we have eliminated some tuples based on the first element, which implies that this approach could still end up being O(nm).
Thanks!
Consider the 2-tuple version of this problem. Each tuple (x,y) corresponds to an axis-aligned rectangle in the plane with its upper right corner at (x,y), extending to (-oo,-oo) at the lower left. The collection corresponds to the union of these rectangles. Given a query point (needle), we need only determine whether it's in the union. Knowing the boundary is sufficient for this. It's an axis-aligned polyline that's monotonically non-increasing in y with respect to x: a "downward staircase" in the x direction. With any reasonable data structure (e.g. an x-sorted list of points on the polyline), it's simple to make the decision in O(log n) time for n rectangles. It's not hard to see how to construct the polyline in O(n log n) time by inserting rectangles one at a time, each with O(log n) work.
Here's a visualization. The four dots are input tuples. The area left and below the blue line corresponds to "True" return values:
Tuples A, B, C affect the boundary. Tuple D doesn't.
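Here is an illustrative sketch of that 2-tuple structure (my own code, not a library): only the maximal (undominated) tuples are kept, keyed by x in a TreeMap; the needle is dominated iff the staircase entry with the smallest x >= qx also has y >= qy.

```java
import java.util.*;

// Illustrative "staircase" for the 2-tuple case: keep only the maximal
// tuples, indexed by x, and answer dominance queries in O(log n).
class Staircase2D {
    private final TreeMap<Integer, Integer> stair = new TreeMap<>(); // x -> y of maximal tuples

    Staircase2D(int[][] tuples) {
        // Sweep by x descending, keeping the best y seen so far.
        Arrays.sort(tuples, (a, b) -> Integer.compare(b[0], a[0]));
        int bestY = Integer.MIN_VALUE;
        for (int[] t : tuples) {
            if (t[1] > bestY) {         // t is not dominated by anything with larger x
                stair.put(t[0], t[1]);
                bestY = t[1];
            }
        }
    }

    // True iff some input tuple dominates (qx, qy) component-wise.
    boolean dominated(int qx, int qy) {
        Map.Entry<Integer, Integer> e = stair.ceilingEntry(qx); // smallest stored x >= qx
        return e != null && e.getValue() >= qy;
    }
}
```

The sweep gives the O(n log n) construction mentioned above; the ceiling lookup is the O(log n) query.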
So the question is whether this 2-tuple version generalizes nicely to 3. The union of semi-infinite axis-aligned rectangles becomes a union of rectangular prisms instead. The boundary polyline becomes a 3d surface.
There exist a few common ways to represent problems like this. One is as an octree. Computing the union of octrees is a well-known standard algorithm and fairly efficient. Querying one for membership requires O(log k) time where k is the biggest integer coordinate range contained in it. This is likely to be the simplest option. But octrees can be relatively slow and take a lot of space if the integer domain is big.
Another candidate without these weaknesses is a Binary Space Partition, which can handle arbitrary dimensions. BSPs use (hyper)planes of dimension n-1 to recursively split n-d space. A tree describes the logical relationship of the planes. In this application, you'll need 3 planes per tuple. The intersection of the "True" half-spaces induced by the planes will be the True semi-infinite prism corresponding to the tuple. Querying a needle is traversing the tree to determine whether you're inside any of the prisms. Average case behavior of BSPs is very good, but the worst case size of the tree is terrible: O(n) search time over a tree of size O(2^n). In real applications, tricks are used to find BSPs of modest size at creation time, starting with randomizing the insertion order.
K-d trees are another tree-based space partitioning scheme that could be adapted to this problem. This will take some work, though, because most presentations of k-d trees are concerned with searching for points, not representing regions. They'd have the same worst case behavior as BSPs.
The other bad news is that these algorithms aren't well-suited to tuples much bigger than 3. Trees quickly become too big. Searching high dimensional spaces is hard and a topic of active research. However, since you didn't say anything about tuple length, I'll stop here.
This kind of problem is addressed by spatial indexing systems. There are many data structures that allow your query to be executed efficiently.
Let S be a topologically sorted copy of the original set of n m-tuples. Then we can use binary search for any test tuple in S, at a cost of O(m ln n) per search (due to at most lg n search plies with at most m comparisons per ply).
Note: suppose there exist tuples P, Q in S such that P ≤ Q (that is, no element of Q is smaller than the corresponding element of P). Then tuple P can be removed from S, since any needle dominated by P is also dominated by Q. In practice this often might cut the size of S to a small multiple of m, which would give O(m ln m) performance; but in the worst case it will provide no reduction at all.
Trying to answer
"all corresponding elements greater than or equal to a given tuple (needle)"
(using y and z for members of the set/haystack, x for the query tuple/needle, and writing x ≪ y when xₐ ≤ yₐ for every element index a, i.e. x is dominated by y):
compute telling summary information like min, sum and max of all tuple elements
order criteria by selectivity
weed out dominated tuples
build a k-d-tree
top off with lower and upper bounding boxes:
one tuple lower consisting of the minimum values for each element (if lower dominates x return True)
and upper consisting of the maximum values for each element (if any element of x exceeds the corresponding element of upper, return False)
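As a tiny illustrative sketch of those last two items (all names are mine): lower holds the component-wise minima and upper the component-wise maxima; they decide the easy queries, and everything else falls through to the k-d tree.

```java
import java.util.Arrays;

// Illustrative pre-check using component-wise bounds. Returns Boolean.TRUE or
// Boolean.FALSE when the bounds already decide the query, or null when the
// full index (k-d tree etc.) still has to be consulted.
class BoundsPrecheck {
    private final int[] lower;   // component-wise minimum over all tuples
    private final int[] upper;   // component-wise maximum over all tuples

    BoundsPrecheck(int[][] tuples) {
        int m = tuples[0].length;
        lower = new int[m];
        upper = new int[m];
        Arrays.fill(lower, Integer.MAX_VALUE);
        Arrays.fill(upper, Integer.MIN_VALUE);
        for (int[] t : tuples)
            for (int a = 0; a < m; a++) {
                lower[a] = Math.min(lower[a], t[a]);
                upper[a] = Math.max(upper[a], t[a]);
            }
    }

    Boolean quickAnswer(int[] x) {
        boolean lowerDominatesX = true;
        for (int a = 0; a < x.length; a++) {
            if (x[a] > upper[a]) return Boolean.FALSE;  // no tuple can dominate x
            if (x[a] > lower[a]) lowerDominatesX = false;
        }
        // Every tuple is >= lower component-wise, so if lower dominates x,
        // every tuple dominates x.
        return lowerDominatesX ? Boolean.TRUE : null;   // null: undecided
    }
}
```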
