The problem I'm tackling is as follows:
We have a system with thousands of drivers sending their location data to our back-end services. The problem: given a location (lat, long) and a radius, find which vehicles/drivers are inside that circle.
The obvious and easy answer is a brute-force approach: take every driver's last known location, calculate its distance to the center point, and check whether it falls inside the circle.
However, I believe this approach is neither scalable nor efficient; with thousands of such queries, the system might get overwhelmed.
So my question is: what are some better approaches? Are there better algorithms? Are there third-party tools/technologies that could help (such as PostGIS)?
Thanks for your attention
Consider using a kD-tree (k=2) or a vantage-point-tree.
Anyway, as the vehicles constantly move, you should keep the old positions (from when the tree was built) as well as the current positions, and search with an extra allowance on the radius to cover how far a vehicle could have moved since the last build. Then periodically rebuild the tree.
If a rebuild costs O(N log N) and M queries cost O(M log N), your total cost is roughly a fraction (log N)/M + (log N)/N of the exhaustive solution's O(MN).
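As a concrete illustration (a close relative of the suggestion above rather than a literal kd-tree): scikit-learn's BallTree supports the haversine metric, so the radius query can be prototyped directly on lat/long. This is a minimal sketch; the coordinates, radius, and function name are made up.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Hypothetical snapshot of the drivers' last known positions (lat, lon in degrees).
driver_positions = np.array([
    [40.7128, -74.0060],
    [40.7306, -73.9352],
    [40.6413, -73.7781],
])

# Build the tree once per snapshot; the haversine metric expects [lat, lon] in radians.
tree = BallTree(np.radians(driver_positions), metric="haversine")

def drivers_in_circle(center_lat, center_lon, radius_km):
    """Return indices of drivers within radius_km of the center point."""
    center = np.radians([[center_lat, center_lon]])
    # query_radius takes the radius in the metric's units (radians of arc here).
    idx = tree.query_radius(center, r=radius_km / EARTH_RADIUS_KM)
    return idx[0]

print(drivers_in_circle(40.72, -74.0, 5.0))
```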
In the K-Nearest-Neighbor algorithm, we find the top k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:
Approach 1
Calculate the distances to the new point from each of N observations
Sort the distances using quicksort and take the top k points
This would take O(N + N log N) = O(N log N) time.
Approach 2
Create a max-heap of size k
Calculate the distance from the new point for the first k points
For each following observation, if the distance is less than the max in the heap, pop that point from the heap and replace it with the current observation
Re-heapify (O(log k) work, done for up to N points)
Continue until there are no more observations, at which point the heap holds only the k closest distances.
This approach would take O(N + N log k) = O(N log k) operations.
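A minimal sketch of Approach 2 in Python, using heapq with negated distances to emulate a size-k max-heap (the point format and helper name are illustrative):

```python
import heapq
import math

def k_nearest(points, query, k):
    """Keep the k smallest distances seen so far in a size-k max-heap
    (emulated by pushing negated distances onto Python's min-heap)."""
    heap = []  # entries are (-distance, index)
    for i, p in enumerate(points):
        d = math.dist(p, query)            # Euclidean distance (Python 3.8+)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:              # closer than the farthest point currently kept
            heapq.heapreplace(heap, (-d, i))
    # The heap now holds the k nearest; sort them (O(k log k)) for a closest-first list.
    return [(-nd, i) for nd, i in sorted(heap, reverse=True)]

print(k_nearest([(0, 0), (1, 1), (3, 4), (0.5, 0.2)], (0, 0), k=2))
```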
Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!
Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search
What you describe is linear search (since you need to compute the distance to every point in the dataset).
The good thing is that this always works. The bad thing is that it is slow, especially if you query it a lot.
If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is roughly uniformly distributed (not perfectly, just not concentrated in very dense, very tight clusters), then space partitioning works great because it quickly discards the points that are too far away anyway (complexity O(log N)). Space partitioning also works for higher dimensionality or when there are some clusters, but performance suffers a bit (still better overall than linear search).
Usually space partitioning or locality sensitive hashing are enough for common datasets.
The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.
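On the sklearn part of the question: scikit-learn's NearestNeighbors wraps brute force, a k-d tree, and a ball tree behind one interface and picks between them with algorithm='auto'. A small sketch with made-up data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.random((10_000, 3))                 # training points (made up)
queries = rng.random((5, 3))

# 'auto' lets scikit-learn pick brute force, kd_tree, or ball_tree
# based on the number of samples, dimensionality, and metric.
nn = NearestNeighbors(n_neighbors=5, algorithm="auto").fit(X)
distances, indices = nn.kneighbors(queries)
print(indices)                              # 5 nearest training points per query
```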
When m is the number of features and n is the number of samples, the scikit-learn documentation (http://scikit-learn.org/stable/modules/tree.html) states that the runtime to construct a binary decision tree is O(mn log n).
I understand that the log n comes from the average height of the tree after splitting. I understand that at each split you have to look at each of the m features and choose the best one to split on, and that this is done by calculating a "best metric" (in my case, the Gini impurity) over the n samples at that node. However, to find the best split, don't you have to look at every possible way to partition the samples for each feature? And wouldn't that be something like 2^(n-1) * m rather than just mn? Am I thinking about this wrong? Any advice would help. Thank you.
One way to build a decision tree would be, at each point, to do something like this:
For each possible feature to split on:
    Find the best possible split for that feature.
    Determine the "goodness" of this fit.
Of all the options tried above, take the best and use that for the split.
The question is how to perform each step. If you have continuous data, a common technique for finding the best possible split is to sort the data points into ascending order along that feature, then consider all possible partition points between adjacent data points and take the one that minimizes the entropy. The sorting step takes O(n log n) time, which dominates the runtime. Since we do this for each of the O(m) features, the runtime works out to O(mn log n) total work per node.
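A rough sketch of that per-feature step, assuming a classification setup with an entropy criterion: sort once, then sweep the candidate thresholds while updating the child class counts incrementally, so the sweep itself is O(n). The function name is illustrative.

```python
import numpy as np

def best_split_one_feature(x, y):
    """Best threshold on one continuous feature: sort once (O(n log n)),
    then sweep candidate split points while updating the class counts of
    the two children incrementally."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes, y_idx = np.unique(y, return_inverse=True)
    n = len(y)
    left = np.zeros(len(classes))                        # class counts left of the split
    right = np.bincount(y_idx, minlength=len(classes)).astype(float)

    def entropy(counts):
        p = counts[counts > 0] / counts.sum()
        return -(p * np.log2(p)).sum()

    best_thr, best_score = None, np.inf
    for i in range(n - 1):
        left[y_idx[i]] += 1                              # move sample i to the left child
        right[y_idx[i]] -= 1
        if x[i] == x[i + 1]:                             # only split between distinct values
            continue
        score = ((i + 1) * entropy(left) + (n - i - 1) * entropy(right)) / n
        if score < best_score:
            best_score, best_thr = score, (x[i] + x[i + 1]) / 2
    return best_thr, best_score

x = np.array([2.0, 1.0, 3.5, 2.5, 4.0])
y = np.array([0, 0, 1, 0, 1])
print(best_split_one_feature(x, y))   # threshold 3.0, weighted entropy 0
```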
I am looking for some data structures for range searching. I think range trees offer a good time complexity (but with some storage requirements).
However, it seems to me that other data structures, like KD-trees, are more discussed and recommended than range trees. Is this true? If so, why?
I would expect that it is because kd-trees can straightforwardly be extended to contain objects other than points. This gives them many applications in e.g. virtual worlds, where we want quick querying of triangles. Similar extensions of range trees are not straightforward, and in fact I've never seen any.
To give a quick recap: a kd-tree can preprocess a set of n points in d-dimensional space in O(n log n) time into a structure using O(n) space, such that any d-dimensional range query can be answered in O(n^(1-1/d) + k) time, where k is the number of answers. A range tree takes O(n log^(d-1) n) time to preprocess, uses O(n log^(d-1) n) space, and can answer range queries in O(log^(d-1) n + k) time.
The query time for a range tree is obviously a lot better than that of a kd-tree if we're talking about 2- or 3-dimensional space. However, the kd-tree has several advantages. First, it always requires only linear storage. Second, it is always constructed in O(n log n) time. Third, if the dimensionality is very high, it will outperform a range tree unless your point sets are very large (although arguably, at that point a linear search will be almost as fast as a kd-tree).
I think another important point is that kd-trees are more well known by people than range trees. I'd never heard of a range tree before taking a course in computational geometry, but I'd heard of and worked with kd-trees before (albeit in a computer graphics setting).
EDIT: You ask what is a better data structure for 2D or 3D fixed-radius search when you have millions of points. I really can't tell you! I'd be inclined to say a range tree will be faster if you perform many queries, but for 3D its construction will be slower by a factor of O(log n), and memory use may become an issue before speed does. I'd recommend integrating good implementations of both structures and simply testing which does a better job for your particular requirements.
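For the testing suggestion, the kd-tree side is easy to exercise off the shelf; here is a small sanity-check sketch with SciPy's cKDTree on random 3-D points (I'm not aware of an equally standard range tree implementation to drop in alongside it, so that half would need its own code):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.random((1_000_000, 3))           # a million random 3-D points
tree = cKDTree(pts)                        # O(n log n) build, O(n) storage

query, radius = np.array([0.5, 0.5, 0.5]), 0.05
idx = tree.query_ball_point(query, r=radius)    # indices inside the ball

# Brute-force check on the same query.
mask = np.linalg.norm(pts - query, axis=1) <= radius
assert set(idx) == set(np.flatnonzero(mask))
```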
For your needs (particle simulation in 2D or 3D), you want a data structure that supports all-nearest-neighbors queries (finding the nearest neighbors of every point at once). The cover tree is a data structure that is well suited to this task. I came across it while computing the nearest neighbors for kernel density estimation. This Wikipedia page explains the basic definition of the tree, and John Langford's page has a link to a C++ implementation.
The running time of a single query is O(c^12 log n), where c is the expansion constant of the dataset. This is an upper bound - in practice, the data structure performs faster than others. This paper shows that the running time of batch processing of all nearest neighbors (for all the data points), as needed for a particle simulation, is O(c^16 n), and this theoretical linear bound is also practical for your needs. Construction time is O(n log n) and storage is O(n).
Sports tracker applications usually record a timestamp and a location at regular intervals in order to store the entire track. Analytical applications then let you compute certain statistics, such as the fastest track subsection of a fixed distance (e.g. the time needed for 5 miles) or, vice versa, the longest distance traversed in a certain time span (e.g. the Cooper distance in 12 minutes).
I'm wondering what's the most elegant and/or efficient approach to compute such sections.
In a naive approach, I'd normalize and interpolate the waypoints to get a more fine-grained list of waypoints, either with a fixed time interval or with fixed distance steps. Then I'd move a sliding window representing my time span (or distance segment) over the list and search for the best sub-list matching my criteria. Is there any better way?
Elegance and efficiency are in the eye of the beholder.
Personally, I think your interpolation idea is elegant.
I imagine the interpolation algorithm is easy to build and the search you'll perform on the resulting data is easy to perform. This can lead to tight code whose correctness is easy to verify. Furthermore, interpolation algorithms probably already exist and are multi-purpose, so you don't have to repeat yourself (DRY). Your suggested solution has the benefit of separating data processing from data analysis. Modularity of this nature is often considered a component of elegance.
Efficiency - are we talking about speed, space, or lines of code? You could try to combine the interpolation step with the search step to save space, but this will probably sacrifice speed and code simplicity. Certainly speed is sacrificed in the sense that multiple queries cannot take advantage of previous calculations.
When you consider the efficiency of your code, worry not so much about how the computer will handle it, or how you will code it. Think more deeply about the intrinsic time complexity of your approach. I suspect both the interpolation and search can be made to take place in O(N) time, in which case it would take vast amounts of data to bog you down: it is difficult to make an O(N) algorithm perform very badly.
In support of the above, interpolation is just estimating intermediate points between two values, so this is linear in the number of values and linear in the number of intermediate points. Searching could probably be done with a numerical variant of the Knuth-Morris-Pratt Algorithm, which is also linear.
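To make that concrete, here is a minimal sketch of the fixed-duration case (e.g. the Cooper distance) in Python; it assumes timestamps in seconds and cumulative distances in meters, both increasing, and uses linear interpolation in place of explicit resampling, so it stays essentially linear apart from one sort of the candidate start times. The toy track at the end is made up.

```python
import numpy as np

def max_distance_in_window(t, d, window):
    """Largest distance covered in any time span of length `window`.

    t: sample timestamps in seconds, strictly increasing
    d: cumulative distance in meters at each timestamp
    Assumes the track is at least `window` seconds long.
    """
    t, d = np.asarray(t, float), np.asarray(d, float)
    # With piecewise-linear interpolation, the optimal window starts either at a
    # sample or exactly `window` seconds before one, so those are the only
    # candidate start times that need checking.
    starts = np.unique(np.concatenate([t, t - window]))
    starts = starts[(starts >= t[0]) & (starts + window <= t[-1])]
    gained = np.interp(starts + window, t, d) - np.interp(starts, t, d)
    return float(gained.max())

# Toy track: 20 one-minute samples, speeding up after minute 9.
t = np.arange(20) * 60.0
d = np.cumsum(np.concatenate([[0.0], np.full(9, 150.0), np.full(10, 250.0)]))
print(max_distance_in_window(t, d, window=12 * 60))   # 2800.0 for this toy track
```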
So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2 if that makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.
Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
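As a sketch of that pipeline in Python (using scikit-learn rather than the ANN C++ library; the choice of 15 components is only a placeholder you'd tune against explained variance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((16_000, 75))                    # stand-in for the real 75-D data

# Reduce dimensionality first; 15 components is only a guess to tune.
X_low = PCA(n_components=15).fit_transform(X)

# Query the training set against itself: ask for k+1 neighbours and drop
# column 0, which is each point matching itself at distance 0.
k = 2
nn = NearestNeighbors(n_neighbors=k + 1, algorithm="kd_tree").fit(X_low)
distances, indices = nn.kneighbors(X_low)
neighbours = indices[:, 1:]                     # the k nearest other points
print(neighbours[:5])
```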
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbor, but rather a good approximation of it (for example, the 4th NN to your query, while you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You can read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the nearest neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
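To illustrate the hashing idea itself (this is not the FALCONN or DOLPHINN API): a bare-bones random-hyperplane LSH for cosine similarity, where a point's signature is the pattern of signs of its dot products with random directions, and only points in the query's bucket are scored exactly. Real libraries use multiple tables and probe neighboring buckets; the sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(100_000, 64))           # made-up 64-D dataset
n_bits = 16                                      # hash length; tune for bucket size
planes = rng.normal(size=(64, n_bits))           # random hyperplane normals

def signature(x):
    """Hash a vector (or a batch of rows) to sign bits, one per hyperplane."""
    return (x @ planes > 0).astype(np.uint8)

# Index: bucket key -> list of point indices sharing that sign pattern.
buckets = {}
for i, sig in enumerate(map(bytes, signature(data))):
    buckets.setdefault(sig, []).append(i)

def query(q):
    """Candidates from the query's bucket, ranked by exact cosine similarity."""
    cand = buckets.get(bytes(signature(q)), [])
    if not cand:
        return None                              # real libraries also probe nearby buckets
    sims = data[cand] @ q / (np.linalg.norm(data[cand], axis=1) * np.linalg.norm(q))
    return cand[int(np.argmax(sims))]

print(query(rng.normal(size=64)))
```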
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate the n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculations and n log n for each sort, but you have to do the sort n times (once for every observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know whether they've been generalized to large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation is to sort the array of nearest-neighbour distances that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect partial sorting, for example numpy.argpartition in the NumPy library, to select only the closest K values you are interested in. There is no need to sort the entire array.
The cost of @Grembo's answer above can be reduced significantly, since you only need the K nearest values and there is no need to sort all the distances from each point.
If you just need the K neighbours, this method works very well and reduces your computational cost and time complexity.
If you need the K neighbours sorted, sort that output again afterwards.
See the documentation for argpartition.
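A small sketch of that approach (random data; the sizes and K are illustrative): np.argpartition moves the K smallest distances of each row into the first K slots without sorting the rest, and only that slice gets sorted if an ordered list is needed.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
points = rng.random((2_000, 10))
k = 5

dist = cdist(points, points)                   # full n x n distance matrix
np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour

# argpartition puts the k smallest entries of each row in the first k
# positions in O(n) per row, without sorting the rest.
knn = np.argpartition(dist, k, axis=1)[:, :k]

# Sort only those k per row if an ordered neighbour list is needed.
rows = np.arange(len(points))[:, None]
knn_sorted = knn[rows, np.argsort(dist[rows, knn], axis=1)]
print(knn_sorted[:3])
```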