I need to solve a neighbor search problem, that is, for every given element, find all neighbor elements within a fixed distance.
I just learned about the range tree data structure; it seems to be able to solve this problem in O(N*(log(N)^(d-1))) complexity, where d is the dimension of the space.
I know nothing about R-trees, but I just saw this on Wikipedia:
A common real-world usage for an R-tree might be to .... then find answers quickly to queries such as "Find all museums within 2 km of my current location",
which seems exactly the problem I want to solve.
So should I learn and use this data structure?
Related
There are two real-life problems that I am struggling to find answers to:
Restaurant Service: When I use my food ordering app (like Foodpanda, Zomato, etc.), the app detects my location when I log in and suggests nearby restaurants accordingly (probably within a range small enough that the selected restaurant can deliver the food).
Cab Service: When I use a cab service (like Uber or Ola), it also detects my location when I try to book a cab and suggests the nearby cabs available at that time.
Question :
How are the nearest restaurants and the nearest cabs found? Which specific algorithm is used in practice? The two cases differ, since the search data is static in one case and constantly changing in the other.
My Take on the question :
After some brainstorming on the topic, I learned that, since restaurants are fixed entities, we can store them in a k-d tree (which supports spatial indexing). Based on the customer's location, we can search the k-d tree for a set of nearby restaurants. Building the k-d tree takes O(n log n) time and a search takes O(log n) time on average, n being the number of nodes in the tree. This approach seems good enough to me, as I am not aware of a better one, but I am still looking for an answer.
In the case of the cab service, the positions of the cabs aren't static (unlike the restaurants). So rebuilding the k-d tree every time a cab's location changes seems like too much overhead. How can I find, say, the 5 nearest cabs, given the cabs' current locations? Which algorithm do cab services use in practice?
Any insight will be highly appreciated.
P.S.: I also came across the k-nearest-neighbor search algorithm, which again leads to k-d trees.
A quadtree is one data structure that can serve as a dynamic alternative to a 2D k-d tree. In practice, though, real implementations may not involve heavyweight data structures. Consider a simpler method like this one, which may well outperform a dynamic k-d tree:
Divide the map into a rectangular grid
When a cab changes its location (a POST request containing its GPS coordinates), check whether it is still in the same rectangle; if not, update the information held by the old and new rectangles
When a user wants to find the nearest cab, locate the user's rectangle in the grid and find the nearest cab by exhaustive search
You may ask why. In reality, the simple method above makes it much easier to add concurrency and parallelization, whereas a k-d tree can be very hard to modify.
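A minimal sketch of the grid idea above, assuming a flat (x, y) plane rather than real GPS coordinates. The class name CabGrid, the cell_size parameter and the ring-by-ring widening of the search are my own illustrative choices, not something the services mentioned are known to use:

```python
import math
from collections import defaultdict

class CabGrid:
    """Uniform grid over a flat (x, y) plane; cell_size is an assumed tuning knob."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (col, row) -> ids of cabs in that cell
        self.positions = {}             # cab id -> (x, y)

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def update(self, cab_id, x, y):
        """Handle a position report; move the cab only if it changed cell."""
        new_cell = self._cell(x, y)
        old = self.positions.get(cab_id)
        if old is not None:
            old_cell = self._cell(*old)
            if old_cell != new_cell:
                self.cells[old_cell].discard(cab_id)
        self.cells[new_cell].add(cab_id)
        self.positions[cab_id] = (x, y)

    def nearest(self, x, y, k=5, max_rings=50):
        """Search the user's cell, then widen ring by ring until k cabs are certain."""
        cx, cy = self._cell(x, y)
        found = []
        for ring in range(max_rings + 1):
            for col in range(cx - ring, cx + ring + 1):
                for row in range(cy - ring, cy + ring + 1):
                    if max(abs(col - cx), abs(row - cy)) != ring:
                        continue        # visit only the new outer ring
                    for cab_id in self.cells.get((col, row), ()):
                        px, py = self.positions[cab_id]
                        found.append((math.hypot(px - x, py - y), cab_id))
            found.sort()
            # Any cab outside the searched rings is at least ring * cell_size away.
            if len(found) >= k and found[k - 1][0] <= ring * self.cell_size:
                break
        return [cab_id for _, cab_id in found[:k]]

grid = CabGrid(cell_size=1.0)
grid.update("cab-1", 0.5, 0.5)
grid.update("cab-2", 3.2, 0.5)
print(grid.nearest(0.4, 0.4, k=1))
```

Each cell only needs its own small lock (or its own worker), which is where the concurrency argument comes from.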
I am working on devising an indexing strategy for finding similar hashes. The hashes are generated from images, e.g.:
String A = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000" //Image 1
String B = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000" //Image 2
These two hashes are similar (based on Hamming distance and Levenshtein distance), and hence the images are similar. I have more than 190 million such hashes. I have to select a suitable indexing data structure for which the worst-case complexity of finding a similar hash is not O(n). A hash data structure won't work because it will search for <, = and > (or will it?). I can compute the Hamming distance or some other distance to measure similarity, but in the worst case I will end up computing it 190 million times.
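For reference, a small sketch of the Hamming distance on hex-encoded hashes like the two above, counting differing bits rather than differing characters. The function name hamming_hex is only illustrative:

```python
def hamming_hex(a: str, b: str) -> int:
    """Number of differing bits between two equal-length hex strings."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(a, 16) ^ int(b, 16)).count("1")

A = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000"  # Image 1
B = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000"  # Image 2
print(hamming_hex(A, B), "of", len(A) * 4, "bits differ")
```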
This is my strategy now:
Currently I am working on a B-tree in which I rank all the keys in a node by the number of consecutive matching characters and traverse the highest-ranked key. If a child's key ranks lower than another key in the parent node, I go back and traverse that key in the parent node. If all the keys in the parent have the same rank, I do a normal B-tree traversal (givenkey < nodeKey --> go to the child node of nodeKey, using ASCII comparison), which is where my issue is.
This would lead to a lot of false negatives in the search: in the worst case I traverse only one part of the tree, while a similar key could be found along other paths. Otherwise I have to search the entire tree, which is again O(n), so I might as well not have a tree.
I feel there has to be a better way and right now I am stuck and it would be great to hear any inputs on breaking down the problem. Please share your thoughts.
P.S.: I cannot use any external database.
First, this is a very difficult problem. Don't expect neat, tidy answers.
One approximate data structure I have seen is the Spatial Approximation Sample Hierarchy (SASH).
A SASH (Spatial Approximation Sample Hierarchy) is a general-purpose data structure for efficiently computing approximate answers for similarity queries. Similarity queries naturally arise in a number of important computing contexts, in particular content-based retrieval on multimedia databases, and nearest-neighbor methods for clustering and classification.
SASH uses only a distance function to build a data structure, so the distance function (and in your case, the image hash function as well) needs to be "good". The basic intuition is roughly that if A ~ B (image A is close to image B) and B ~ C, then usually A ~ C. The data structure creates links between items that are relatively close, and you prune your search by only looking for things that are closer to your query. Whether this strategy actually works depends on the nature of your data and the distance function.
It has been 10 years or so since I looked at SASH, so there are probably newer developments as well. Michael Houle's page seems to indicate he has newer research on something called Rank Cover Trees, which seem similar in purpose to SASH. This should at least get you started on research in the area; read some papers and follow the reference trail.
I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because it is guaranteed that these distances are greater than the current known minimum distance.
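The guarantee in this example comes from the triangle inequality, so it assumes the distances form a metric. A small sketch of the bound (the helper name lower_bound is made up for illustration):

```python
def lower_bound(d_xy: float, d_yz: float) -> float:
    """Smallest possible dist(x, z) in a metric space, given dist(x, y) and dist(y, z)."""
    return abs(d_xy - d_yz)

known_min = 0.5                  # dist(A,B)
print(lower_bound(0.5, 5.0))     # dist(B,C) >= 4.5, via A
print(lower_bound(5.0, 2.0))     # dist(A,D) >= 3.0, via C
```

Both bounds exceed the current minimum of 0.5, so neither distance has to be evaluated.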
Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?
Is it possible to reduce the cost to something close to O(nlogn) if I accept an approximate solution? In this case, I am thinking about some technique based on reinforcement learning that converges to the real solution only as the number of reinforcements goes to infinity, but provides a good approximation for small n.
Processing time (measured in big O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or anything related. I have just been thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread "Looking for a mature M-Tree implementation"), so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will also increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result to be, you may be able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points, it is a good search candidate).
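To make the bucket idea concrete, here is a rough Python sketch. It is not the API of the repository linked above: the helper split_into_buckets is a hypothetical stand-in that splits around two randomly chosen pivots, and max_bucket plays the role of the adjustable parameter.

```python
import itertools
import math
import random

def split_into_buckets(points, dist, max_bucket, rng):
    """Recursively split around two pivots until every bucket is small enough."""
    if len(points) <= max_bucket:
        return [points]
    p, q = rng.sample(points, 2)
    near_p = [x for x in points if dist(x, p) <= dist(x, q)]
    near_q = [x for x in points if dist(x, p) > dist(x, q)]
    if not near_p or not near_q:     # degenerate split, e.g. duplicate pivots
        mid = len(points) // 2
        near_p, near_q = points[:mid], points[mid:]
    return (split_into_buckets(near_p, dist, max_bucket, rng)
            + split_into_buckets(near_q, dist, max_bucket, rng))

def approx_closest_pair(points, dist, max_bucket=32, seed=0):
    """Brute-force each bucket, then keep the best per-bucket winner.

    Approximate: a pair that straddles two buckets is never compared.
    Assumes max_bucket >= 2.
    """
    rng = random.Random(seed)
    best = None
    for bucket in split_into_buckets(list(points), dist, max_bucket, rng):
        for a, b in itertools.combinations(bucket, 2):
            d = dist(a, b)
            if best is None or d < best[0]:
                best = (d, a, b)
    return best

pts = [(0, 0), (10, 10), (0.1, 0), (20, 20), (30, 5)]
print(approx_closest_pair(pts, math.dist, max_bucket=4))
```

With n points and buckets of size b, the brute-force part costs roughly (n/b) * b^2 = n*b distance evaluations, which is why shrinking the bucket size speeds things up at the cost of accuracy.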
Good Luck
If you only have sample distances, and not original point locations in a plane that you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out its having something interesting to say; conversely, inspecting every edge and taking the smallest solves the problem.
Planar versions beat O(V^2) by using planar distances to deduce limits on sets of edges, which lets us avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: the points that are closer to the first point and the points that are closer to the second. This is the same as splitting the points by the line (the perpendicular bisector) that passes between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest neighbour search algorithms can be used.
I have a database with 500,000 points in a 100 dimensional space, and I want to find the closest 2 points. How do I do it?
Update: Space is Euclidean, Sorry. And thanks for all the answers. BTW this is not homework.
There's a section in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n*logn) time. You can check it out on Google Books. In fact, I recommend it to everyone, as the way they apply the divide-and-conquer technique to this problem is very simple, elegant and impressive.
Although it can't be extended directly to your problem (as the constant 7 would be replaced with 2^101 - 1), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n*logn*m) complexity, where n is the number of points and m is the number of dimensions.
edit
That's all assuming you have a Euclidean space, i.e., the length of vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
Use a kd tree. You're looking at a nearest neighbor problem and there are highly optimized data structures for handling this exact class of problems.
http://en.wikipedia.org/wiki/Kd-tree
P.S. Fun problem!
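A minimal k-d tree sketch for Euclidean points, in case it helps to see the shape of it. It answers a single nearest-neighbor query; for the closest pair you would query every point for its nearest other point. The tuple-based node layout and function names are my own simplifications, and in 100 dimensions the pruning step will rarely cut anything off.

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree as nested (point, left, right) tuples by median split."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, target)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # The far side can only help if the splitting plane is closer than the best so far.
    if abs(diff) < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))
```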
You could try the ANN library, but that only gives reliable results up to 20 dimensions.
Run PCA on your data to reduce the vectors from 100 dimensions to, say, 20 dimensions. Then build a k-d tree and get the closest 2 neighbors based on Euclidean distance.
Generally, if the number of dimensions is very large, you have to use either a brute-force approach (parallel + distributed/map-reduce) or a clustering-based approach.
Use the data structure known as a KD-TREE. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.
http://en.wikipedia.org/wiki/Kd-tree
My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.
Here's his published paper:
http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf
I'm wondering if there is an algorithm for calculating the nearest locations (represented by lat/long) in better than O(n) time.
I know I could use the Haversine formula to get the distance from the reference point to each location and sort ASC, but this is inefficient for large data sets.
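For concreteness, the baseline I have in mind looks roughly like this sketch. The function names are mine, and the Earth radius of 6371 km is the usual mean-radius approximation:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def nearest_locations(ref, locations, k=1):
    """Baseline: compute every distance, then sort ascending and keep the first k."""
    return sorted(locations, key=lambda loc: haversine_km(*ref, *loc))[:k]

print(nearest_locations((0.0, 0.0), [(10.0, 10.0), (1.0, 1.0), (5.0, 5.0)], k=2))
```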
How does the MySQL DISTANCE() function perform? I'm guessing O(n)?
If you use a kd-tree to store your points, you can do this in O(log n) time (expected) or O(sqrt(n)) worst case.
You mention MySQL, but there are some pretty sophisticated spatial features in SQL Server 2008, including a geography data type. There's some information out there about doing the kinds of things you are asking about. I don't know spatial well enough to talk about performance, and I doubt there is a bounded-time algorithm for what you're asking, but you might be able to do some fast set operations on locations.
If the data set being searched is static, e.g., the coordinates of all gas stations in the US, then a proper index (BSP) would allow for efficient searching. Postgres has had good support since the mid 90's for 2-dimensional indexed data so you can do just this sort of query.
Better than O(n)? Only if you go the way of radix sort or store the locations with hash keys that represent the general location they are in.
For instance, you could divide the globe by latitude and longitude down to the minute, enumerate the resulting areas, and use a location's area as its hash key. When the time comes to get the closest location, you only need to check at most 9 hash keys; you can test beforehand whether an adjacent grid cell can possibly provide a closer location than the best found so far, thus reducing the set of locations to compute the distance to. It's still O(n), but with a much smaller constant factor. Properly implemented, you won't even notice it.
Or, if the data is in memory or otherwise randomly accessible, you could store it sorted by both latitude and longitude. You then use binary search to find the closest latitude and longitude in the respective data sets. Next, you keep reading locations with increasing or decreasing latitude or longitude (i.e., the preceding and succeeding locations) until it becomes impossible to find a closer location.
You know you can't find a closer location when the latitude of the next location on either side in the latitude-sorted data wouldn't be closer than the best found so far, even if it had the same longitude as the point from which the distance is being calculated. A similar test applies to the longitude-sorted data.
This actually gives you better than O(n), closer to O(log N), I think, but it does require random, instead of sequential, access to the data, and duplication of all the data (or at least of the keys to the data).
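A simplified sketch of that sorted-scan idea, using a single sorted axis and flat Euclidean distance instead of two sorted copies and great-circle distance. The function name and the planar simplification are assumptions made for illustration:

```python
import bisect
import math

def nearest_sorted_scan(points_by_x, target):
    """points_by_x must be sorted by x; returns (distance, point) nearest to target."""
    xs = [p[0] for p in points_by_x]    # in practice, precompute this once
    right = bisect.bisect_left(xs, target[0])
    left = right - 1
    best = (math.inf, None)
    while left >= 0 or right < len(points_by_x):
        # The gap along x alone is a lower bound on the true distance.
        gap_left = target[0] - xs[left] if left >= 0 else math.inf
        gap_right = xs[right] - target[0] if right < len(xs) else math.inf
        if min(gap_left, gap_right) >= best[0]:
            break                       # nothing further out can be closer
        if gap_left <= gap_right:
            candidate, left = points_by_x[left], left - 1
        else:
            candidate, right = points_by_x[right], right + 1
        d = math.dist(candidate, target)
        if d < best[0]:
            best = (d, candidate)
    return best

pts = sorted([(1, 9), (2, 2), (4, 4), (6, 1), (7, 8), (9, 3)])
print(nearest_sorted_scan(pts, (5, 5)))
```

The worst case is still O(n), for example when all points share nearly the same x, but on spread-out data the scan stops after a handful of candidates.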
I wrote an article about Finding the nearest Line at DDJ a couple of years ago, using a grid (I call it quadrants). Using it to find the nearest point (instead of lines) would just be a simplification of it.
Using quadrants reduces the time drastically, although the complexity cannot be determined mathematically (all points could theoretically lie in a single quadrant). A precondition for using quadrants/grids is that you have a maximum distance for the point searched. If you just look for the nearest point without giving a maximum distance, you can't use quadrants.
In that case, have a look at A Template for the Nearest Neighbor Problem (Larry Andrews at DDJ), which has a retrieval complexity of O(log n). I did not compare the runtime of both algorithms. Probably, if you have a reasonable maximum width, quadrants are better. The better general-purpose algorithm is the one from Larry Andrews.
If you are looking for the (1) closest location, there's no need to sort. Simply iterate through your list, calculating the distance to each point and keeping track of the closest one. By the time you get through the list, you'll have your answer.
Even better would be to introduce the concept of a grid. You would assign each point to a grid cell. Then, for your search, first determine the cell you are in and perform your calculations on the points in that cell. You'll need to be a little careful, though: if the test location is close to the boundary of a cell, you'll need to search the neighboring cell(s) as well. Still, this is likely to be highly performant.
I haven't looked at it myself, but Postgres does have a module dedicated to the management of GIS data.
In an application I worked on in a previous life, we took all of the data, computed its key for a quad-tree (for 2D space) or an oct-tree (for 3D space) and stored that in the database. It was then a simple matter of loading the values from the database (so that you don't have to recompute the quad-tree) and following the standard quad-tree search algorithm.
This does of course mean you will touch all of the data at least once to get it into the data structure. But persisting this data-structure means you can get better lookup speeds from then on. I would imagine you will do a lot of nearest-neighbour checks for each data-set.
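I no longer have that code, so the following is only a sketch of one common way to compute such a key, not what we actually shipped: the key is the path of quadrant digits from the root down to the cell containing the point. The function name, the digit encoding and the unit-square default bounds are all assumptions.

```python
def quadtree_key(x, y, depth, bounds=(0.0, 0.0, 1.0, 1.0)):
    """Path from the root to the depth-level cell containing (x, y), one digit per level."""
    min_x, min_y, max_x, max_y = bounds
    digits = []
    for _ in range(depth):
        mid_x = (min_x + max_x) / 2
        mid_y = (min_y + max_y) / 2
        quadrant = 0
        if x >= mid_x:          # right half
            quadrant += 1
            min_x = mid_x
        else:
            max_x = mid_x
        if y >= mid_y:          # upper half
            quadrant += 2
            min_y = mid_y
        else:
            max_y = mid_y
        digits.append(str(quadrant))
    return "".join(digits)

print(quadtree_key(0.1, 0.1, depth=4), quadtree_key(0.12, 0.11, depth=4))
```

Points in the same cell share a key prefix, so an ordinary index on the stored key column can fetch a whole cell with a prefix match.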
(For k-d trees, Wikipedia has a good explanation: http://en.wikipedia.org/wiki/Kd-tree)
You need a spatial index. Fortunately, MySQL provides just such an index, in its Spatial Extensions. They use an R-Tree index internally - though it shouldn't really matter what they use. The manual page referenced above has lots of details.
I guess you could do it theoretically if you had a large enough table to do this... secondly, perhaps caching correctly could get you very good average case?
An R-Tree index can be used to speed spatial searches like this. Once created, it allows such searches to be better than O(n).