How could I make this KD-tree? - sorting

I have a very long, semi-sorted list of latitude, longitude, and time zone triplets. I want to be able to search this list quickly to find the closest time zone to any given latitude and longitude, so I would like to make this list into a KD tree.
I'm thinking that I should read the entire file first into some sort of data structure (what data structure? Possibly ArrayList<Triplet<Double, Double, String>>?). Then take the median element in that structure and make it the root, leaving me with a left and right list. Then keep taking the median element of each list and adding it as a left or right child.
A first attempt at this seemed slow and inefficient... But I feel like I did it wrong. Can you provide me with an algorithm or pseudocode for what I'm trying to do?

If it helps, I have a KD-tree in Java which takes in XYZ as doubles in an inner class called XYZPoint. You could augment XYZPoint with the time zone data and use X for longitude, Y for latitude, and zero for Z. It could at least be a starting point.
You could then use the nearest-neighbor (Euclidean distance) method, which is already implemented, to find the closest time zone to a point.
Also, for populating the KD-tree, Wikipedia suggests using heapsort (my Java implementation linked) and repeatedly removing the median.
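To make the construction concrete, here is a minimal sketch of the sort-and-take-the-median build in Java. Point, Node, and build are illustrative names, not part of the linked implementation:

    import java.util.*;

    // Minimal sketch of balanced KD-tree construction by recursive median
    // splitting, alternating the splitting axis at each level.
    class KdTreeBuilder {

        static class Point {
            final double lat, lon;
            final String timeZone;
            Point(double lat, double lon, String timeZone) {
                this.lat = lat; this.lon = lon; this.timeZone = timeZone;
            }
            double coord(int axis) { return axis == 0 ? lon : lat; } // axis 0 = x = longitude
        }

        static class Node {
            final Point point;
            final Node left, right;
            Node(Point point, Node left, Node right) {
                this.point = point; this.left = left; this.right = right;
            }
        }

        // Sort the current sublist on the splitting axis, make the median
        // the root, and recurse on the two halves.
        static Node build(List<Point> points, int depth) {
            if (points.isEmpty()) return null;
            final int axis = depth % 2;
            points.sort(Comparator.comparingDouble(p -> p.coord(axis)));
            int median = points.size() / 2;
            return new Node(points.get(median),
                    build(new ArrayList<>(points.subList(0, median)), depth + 1),
                    build(new ArrayList<>(points.subList(median + 1, points.size())), depth + 1));
        }
    }

Re-sorting at every level makes this O(n log² n); presorting one index array per axis up front, or the heapsort-and-remove-the-median idea above, brings the build down to O(n log n).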


Query from list of 2D data points (C++11)

I'm finding it hard to describe (and then search) what I want, so I will try here.
I have a list of 2D data points (time and distance). You could say it's like a vector of pairs, although the exact data type doesn't matter, as finding the best one is part of what I'm asking. It is (or can be) sorted on time.
Here is some example data to help me explain (the original question showed a spreadsheet of time/distance samples and a graph of the line through them):
So I want to store a fairly large number of data points like those, and then be able to query them.
So if I say get_distance(0.2); it would return 1.1. This is quite simple.
Something like a map sounds sensible for storing the data, with time as the key. But then I hit a problem: what happens if the time I am querying isn't in the map?
If I say get_distance(0.45);, I want it to average between the two nearest points, just like the line on the graph, and return 2.
All I have in my head at the minute is to loop through the vector of data points, find the point with the closest time below the one I want and the point with the closest time above it, and average their distances. That doesn't sound efficient, especially with a large number of data points (probably up to around 10000, possibly more), and I want to run this query fairly often.
If anyone has a nice data type or algorithm that would work for me and could point me in that direction I would be grateful.
The STL is the way to go.
If your query time is not in the data, you want the largest that is smaller and the smallest that is larger so you can interpolate.
https://cplusplus.com/reference/algorithm/lower_bound/
https://cplusplus.com/reference/algorithm/upper_bound/
Note that since your data is already sorted, you do not need a map; a vector is fine and saves the time taken to populate the map.
You can still achieve this with a std::map: building the map is O(N log N), N being the number of data points, and each query is then O(log N).
You can first check whether there is an exact match. Use something like std::map::find to achieve this.
If there is no exact match, query for the largest key less than, and the smallest key greater than, the query value (basically, find the two keys that "sandwich" the query).
To do this, use std::map::lower_bound (std::map::upper_bound behaves identically here, since there is no exact match). The iterator it returns already points to the smallest key greater than the query; to get the largest key less than the query, step the iterator back once (if itr is the iterator, look at std::prev(itr)).
find, lower_bound, and upper_bound are each O(log N), and stepping the iterator is O(1) amortized, so every query costs O(log N) on top of the O(N log N) to build the map. That should be efficient enough in your case.
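To make the lookup concrete, here is a hedged sketch in Java. The question is C++11, where std::map::find/lower_bound play the role that TreeMap's floorEntry/ceilingEntry play here; DistanceTable and its methods are illustrative names, and it assumes at least one sample has been added:

    import java.util.Map;
    import java.util.TreeMap;

    // Sketch of the lookup logic above: find the two samples that sandwich
    // the query time, then linearly interpolate between them.
    class DistanceTable {
        private final TreeMap<Double, Double> samples = new TreeMap<>(); // time -> distance

        void add(double time, double distance) { samples.put(time, distance); }

        double getDistance(double t) {
            Map.Entry<Double, Double> lo = samples.floorEntry(t);   // largest time <= t
            Map.Entry<Double, Double> hi = samples.ceilingEntry(t); // smallest time >= t
            if (lo == null) return hi.getValue();   // query before the first sample
            if (hi == null) return lo.getValue();   // query after the last sample
            if (lo.getKey().equals(hi.getKey())) return lo.getValue(); // exact match
            // Linear interpolation between the two neighboring samples.
            double w = (t - lo.getKey()) / (hi.getKey() - lo.getKey());
            return lo.getValue() + w * (hi.getValue() - lo.getValue());
        }
    }

Each query is a pair of O(log N) tree lookups, matching the complexity discussed above.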

Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D): by the triangle inequality, dist(B,C) >= dist(A,C) - dist(A,B) = 4.5 and dist(A,D) >= dist(A,C) - dist(C,D) = 3, so both are guaranteed to be greater than the current known minimum of 0.5.
Is it possible to use this kind of information to reduce the O(n²) cost to something like O(n log n)?
Is it possible to reduce the cost to something close to O(n log n) if I accept a kind of approximate solution? In that case, I am thinking about some technique based on reinforcement learning that only converges to the true solution as the number of reinforcements goes to infinity, but provides a good approximation after a small number of them.
Processing time (measured in big-O notation) is not the only issue; keeping a very large number of previously calculated distances can also be a problem.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or anything like that; I have just been thinking about it.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius ratio (because that HashMap contains the highest density of points, making it a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out its having something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V²) by using planar distances to deduce limitations on sets of edges, allowing us to avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: the points closer to the first point and the points closer to the second. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest-neighbour search algorithms can be used.
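Here is a hedged sketch of that two-pivot split in Java. Because it only ever calls dist(), it works with distances alone, as the question requires; the Node layout, the random pivot choice, and the leaf size are illustrative decisions, not part of a standard algorithm:

    import java.util.*;

    // Sketch of recursive two-pivot partitioning: pick two points, send
    // every other point to the side of the closer pivot, and recurse.
    class PivotPartitionTree {
        static final int LEAF_SIZE = 16;   // tune to taste

        static class Node {
            double[] pivotA, pivotB;       // the two chosen split points
            Node nearA, nearB;             // children: points closer to each pivot
            List<double[]> bucket;         // points stored at a leaf
        }

        static Node build(List<double[]> pts, Random rnd) {
            Node n = new Node();
            if (pts.size() <= LEAF_SIZE) { n.bucket = pts; return n; }
            n.pivotA = pts.get(rnd.nextInt(pts.size()));
            n.pivotB = pts.get(rnd.nextInt(pts.size()));
            List<double[]> a = new ArrayList<>(), b = new ArrayList<>();
            for (double[] p : pts)
                (dist(p, n.pivotA) <= dist(p, n.pivotB) ? a : b).add(p);
            if (a.isEmpty() || b.isEmpty()) { n.bucket = pts; return n; } // degenerate split
            n.nearA = build(a, rnd);
            n.nearB = build(b, rnd);
            return n;
        }

        static double dist(double[] p, double[] q) {
            double dx = p[0] - q[0], dy = p[1] - q[1];
            return Math.sqrt(dx * dx + dy * dy);
        }
    }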

Algorithm for nearest point

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
You could accelerate the search by partitioning the 2D space with a quad-tree or a kd-tree, and then, once you've reached a leaf node, compare the remaining distances one by one until you find the closest match.
See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.
You can get a very fast upper-bound estimator on distance using Manhattan distance (scaled for latitude); this should be good enough for rejecting 99.9% of candidates if they're not close. (EDIT: since then you've told us they are close. In that case, your metric should be distance-squared, as per Lars H's comment.)
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby, so here is the algorithm in pseudocode:
Let the latitude and longitude of your reference point P be (pa,po), and those of another point X be (xa,xo).
Precompute ka, the scaling factor that longitude distances shrink by at latitude pa: ka = cos(pa in °). (Strictly, ka = constant is a linearized approximation in the vicinity of P.)
Then the distance estimator is: D(X,P) = |xa-pa| + ka*|xo-po| = da + ka*do
where |z| means abs(z). At worst this overestimates the true distance by a factor of √2 (when da == ka*do), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan distance estimate seen so far.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin: their true distance √(da² + (ka*do)²) is at least D(X,P)/√2, which exceeds Dmin. That should eliminate 99.9% of points.
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin, and update Dmin whenever you find a new fifth-smallest D. A priority queue, or else a list of (coord, D) pairs, are good data structures.
Note that the filtering pass never computes a Euclidean distance; it only uses float multiplication and addition.
(Consider this similar to a quadtree, except that we filter out everything outside the region of interest, so there is no need to compute accurate distances upfront or to build the data structure.)
It would help if you told us the expected spread in latitudes and longitudes (degrees, minutes, or what?). If all the points are close, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table-based distance estimator would then be preferable.
Pseudocode:
initialize Dmin with the largest D(X,P) among the first five points (i.e., the fifth-smallest D seen so far)
for each point X in the list:
    if D(X,P) <= √2 * Dmin:
        insert the pair (X, D) into the priority queue of candidates
        update Dmin to the new fifth-smallest D
# after the first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
# then make a second pass over the candidates to find the lowest 5 exact distances
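For illustration, here is a hedged Java version of the same two-pass filter. It simplifies the bookkeeping by computing all the cheap estimates first and filtering with the final Dmin (equivalent, since the final cutoff is applied at the end anyway); ManhattanPrefilter and Pt are made-up names, and a non-empty input list is assumed:

    import java.util.*;

    // Two passes: cheap scaled-Manhattan estimates first, then exact
    // distances only for the survivors of the sqrt(2)*Dmin cutoff.
    class ManhattanPrefilter {
        record Pt(double lat, double lon) {}

        static List<Pt> nearest5(List<Pt> pts, Pt p) {
            double ka = Math.cos(Math.toRadians(p.lat())); // longitude scaling at P
            // Pass 1: estimates; a size-5 max-heap tracks the fifth-smallest (Dmin).
            double[] d = new double[pts.size()];
            PriorityQueue<Double> worstOfBest5 = new PriorityQueue<>(Comparator.reverseOrder());
            for (int i = 0; i < pts.size(); i++) {
                Pt x = pts.get(i);
                d[i] = Math.abs(x.lat() - p.lat()) + ka * Math.abs(x.lon() - p.lon());
                worstOfBest5.offer(d[i]);
                if (worstOfBest5.size() > 5) worstOfBest5.poll();
            }
            double cutoff = Math.sqrt(2) * worstOfBest5.peek(); // sqrt(2) * Dmin
            // Pass 2: exact scaled-Euclidean distance, candidates only.
            List<Pt> candidates = new ArrayList<>();
            for (int i = 0; i < pts.size(); i++)
                if (d[i] <= cutoff) candidates.add(pts.get(i));
            candidates.sort(Comparator.comparingDouble(x ->
                    Math.hypot(x.lat() - p.lat(), ka * (x.lon() - p.lon()))));
            return candidates.subList(0, Math.min(5, candidates.size()));
        }
    }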
Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or kd-tree are the usual approaches to spatial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)-ish.
EDIT: It's worth noting that building the spatial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.
Rather than pure brute force, for 5000 nodes I would calculate the individual x+y (Manhattan) distance for every node rather than the straight-line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either the x or the y distance alone is > 38. This way, you can rule out a lot of nodes without having to calculate the straight-line distance. Then brute-force calculate the straight-line distance for the remaining nodes.
These algorithms are not easily explained, so I will only give you some hints in the right direction. You should look for Voronoi diagrams. With a Voronoi diagram you can precompute a graph in O(n log n) time and find the closest point in O(log n) time per query.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could store the k closest neighbors of each of your 5000 points, start from the nearest point given by the Voronoi diagram, and search for the remaining 4 points among those precomputed neighbors.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg et al.: Computational Geometry: Algorithms and Applications (2008), chapters 7.1 and 7.2
Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other, which is an O(n²) operation: with n = 5000, roughly 12.5 million iterations of a suitable algorithm, just storing the relevant results. This would have sub-100 ms execution time in C, so we are looking at a second or two at most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT: I re-read your question, and it seems the user provides their own query point each time. In that case it's faster to just do an O(n) linear search through your set each time the user provides a point.
If you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation), then you can use a locality-sensitive-hashing (kind of) approach that's fairly intuitive:
take your (x, y) pairs and create two lists, one of (x, i) and one of (y, i), where i is the index of the point
sort both lists
then, when given a point (X, Y),
binary-search for X and Y in the respective lists
expand outwards on both lists, looking for common indices
for common indices, calculate exact distances
stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points
All you're doing is saying that a nearby point must have a similar x and a similar y value; a simplified sketch follows.
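Here is a simplified, hedged sketch of this in Java. It uses only the x-sorted list and computes an exact distance for every point it visits; adding the y-sorted list, as described above, prunes the exact-distance computations further but does not change the stopping logic. All names are illustrative:

    import java.util.*;

    // Walk outward from the query's x position, keep the 5 best exact
    // distances, and stop once the x-gap alone exceeds the current
    // fifth-best distance: no unseen point can then do better.
    class SortedListSearch {
        final double[][] pts;   // points sorted by x; pts[i] = {x, y}

        SortedListSearch(double[][] points) {
            pts = points.clone();
            Arrays.sort(pts, Comparator.comparingDouble(p -> p[0]));
        }

        // Returns up to 5 nearest points, in no particular order.
        double[][] nearest5(double qx, double qy) {
            int hi = lowerBound(qx);   // first index with x >= qx
            int lo = hi - 1;
            PriorityQueue<double[]> best = new PriorityQueue<>(
                    Comparator.comparingDouble((double[] p) -> dist(p, qx, qy)).reversed());
            while (lo >= 0 || hi < pts.length) {
                // Take whichever side currently has the smaller x-gap.
                boolean takeLo = hi >= pts.length
                        || (lo >= 0 && qx - pts[lo][0] <= pts[hi][0] - qx);
                double[] p = takeLo ? pts[lo--] : pts[hi++];
                double xGap = Math.abs(p[0] - qx);
                // Every unseen point has an even larger x-gap, so we can stop.
                if (best.size() == 5 && xGap > dist(best.peek(), qx, qy)) break;
                best.offer(p);
                if (best.size() > 5) best.poll();
            }
            return best.toArray(new double[0][]);
        }

        private int lowerBound(double x) {
            int a = 0, b = pts.length;
            while (a < b) { int m = (a + b) >>> 1; if (pts[m][0] < x) a = m + 1; else b = m; }
            return a;
        }

        private static double dist(double[] p, double qx, double qy) {
            return Math.hypot(p[0] - qx, p[1] - qy);
        }
    }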

Possible to calculate closest locations via lat/long in better than O(n) time?

I'm wondering if there is an algorithm for calculating the nearest locations (represented by lat/long) in better than O(n) time.
I know I could use the Haversine formula to get the distance from the reference point to each location and sort ASC, but this is inefficient for large data sets.
How does the MySQL DISTANCE() function perform? I'm guessing O(n)?
If you use a kd-tree to store your points, you can do this in O(log n) time (expected) or O(sqrt(n)) worst case.
You mention MySQL, but there are some pretty sophisticated spatial features in SQL Server 2008, including a geography data type. There's some information out there about doing the kinds of things you are asking about. I don't know spatial search well enough to talk about performance, but I doubt there is a bounded-time algorithm to do what you're asking; you might, however, be able to do some fast set operations on locations.
If the data set being searched is static, e.g., the coordinates of all gas stations in the US, then a proper index (BSP) would allow for efficient searching. Postgres has had good support since the mid 90's for 2-dimensional indexed data so you can do just this sort of query.
Better than O(n)? Only if you go the way of radix sort or store the locations with hash keys that represent the general location they are in.
For instance, you could divide the globe with latitude and longitude to the minutes, enumerate the resulting areas, and make the hash for a location its area. When the time comes to find the closest location, you then need to check at most 9 hash keys: you can test beforehand whether an adjacent grid cell can possibly provide a closer location than the best found so far, thus decreasing the set of locations for which to compute the distance. It's still O(n) in the worst case, but with a much smaller constant factor; properly implemented, you won't even notice it.
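A hedged sketch of that grid bucketing in Java; the one-arc-minute cell size follows the suggestion above, and GeoGrid, Key, and Loc are made-up names. (A full implementation would widen the search ring when the 3x3 block of cells is empty.)

    import java.util.*;

    // Bucket points by a coarse lat/lon cell; a query gathers candidates
    // from its own cell plus the 8 neighbors, then brute-forces those.
    class GeoGrid {
        static final double CELL_DEG = 1.0 / 60.0;   // one arc-minute cells

        record Key(int latCell, int lonCell) {}
        record Loc(double lat, double lon) {}

        final Map<Key, List<Loc>> cells = new HashMap<>();

        void add(Loc p) {
            cells.computeIfAbsent(keyFor(p.lat(), p.lon()), k -> new ArrayList<>()).add(p);
        }

        // Candidates from the 3x3 block of cells around the query point.
        List<Loc> candidates(double lat, double lon) {
            Key k = keyFor(lat, lon);
            List<Loc> out = new ArrayList<>();
            for (int dl = -1; dl <= 1; dl++)
                for (int dn = -1; dn <= 1; dn++)
                    out.addAll(cells.getOrDefault(new Key(k.latCell() + dl, k.lonCell() + dn), List.of()));
            return out;   // compute exact distances over this small set
        }

        static Key keyFor(double lat, double lon) {
            return new Key((int) Math.floor(lat / CELL_DEG), (int) Math.floor(lon / CELL_DEG));
        }
    }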
Alternatively, if the data is in memory or otherwise randomly accessible, you could store it sorted by latitude and, separately, by longitude. You then use binary search to find the closest latitude and longitude in the respective data sets. Next, you keep reading locations with increasing latitude or longitude (i.e., the preceding and succeeding locations) until it becomes impossible to find a closer location.
You know you can't find a closer location when the latitude of the next location to either side in the latitude-sorted data wouldn't be closer than the best found so far, even if it had the same longitude as the query point. A similar test applies to the longitude-sorted data.
This actually gives you better than O(n), closer to O(log n) I think, but it does require random rather than sequential access to the data, and duplication of all the data (or at least of the keys).
I wrote an article about finding the nearest line at DDJ a couple of years ago, using a grid (I call them quadrants). Using it to find the nearest point (instead of lines) is just a reduction of that.
Using quadrants reduces the time drastically, although the complexity cannot be pinned down mathematically (all points could theoretically lie in a single quadrant). A precondition of using quadrants/grids is that you have a maximum distance for the point searched; if you just look for the nearest point without giving a maximum distance, you can't use quadrants.
In that case, have a look at A Template for the Nearest Neighbor Problem (Larry Andrews at DDJ), which has a retrieval complexity of O(log n). I did not compare the runtimes of the two algorithms. Probably, if you have a reasonable maximum width, quadrants are better; the better general-purpose algorithm is Larry Andrews's.
If you are looking for the single closest location, there's no need to sort. Simply iterate through your list, calculating the distance to each point and keeping track of the closest one. By the time you get through the list, you'll have your answer.
Even better would be to introduce the concept of grids. You would assign each point to a grid cell. Then, for your search, first determine the cell you are in and perform your calculations on the points in that cell. You'll need to be a little careful, though: if the test location is close to the boundary of its cell, you'll need to search the neighboring cell(s) as well. Still, this is likely to be highly performant.
I haven't looked at it myself, but Postgres does have a module dedicated to the management of GIS data.
In an application I worked on in a previous life, we took all of the data, computed its key for a quad-tree (for 2D space) or an oct-tree (for 3D space), and stored that in the database. It was then a simple matter of loading the values from the database (to avoid recomputing the quad-tree) and following the standard quad-tree search algorithm.
This does of course mean you will touch all of the data at least once to get it into the data structure. But persisting this data structure means you can get better lookup speeds from then on. I would imagine you will do a lot of nearest-neighbour checks for each data set.
(for kd-trees, Wikipedia has a good explanation: http://en.wikipedia.org/wiki/Kd-tree)
You need a spatial index. Fortunately, MySQL provides just such an index, in its Spatial Extensions. They use an R-Tree index internally - though it shouldn't really matter what they use. The manual page referenced above has lots of details.
I guess you could do it theoretically if you had a large enough table to do this... Secondly, perhaps caching correctly could get you a very good average case?
An R-Tree index can be used to speed spatial searches like this. Once created, it allows such searches to be better than O(n).

Data structure for quick time interval look up

I have a set of time intervals I_n = (a_n, b_n). I need to run lots of lookups where I'm given a time t and need to quickly return the intervals that contain t, i.e., those intervals such that a_n <= t <= b_n.
What is a good data structure or algorithm for this?
If it matters, in my case the a_n and b_n are integers.
What you are looking for is an Interval Tree (which is a type of Range Tree).
These have logarithmic lookup time like other tree structures (e.g., RB trees), so you should see comparable performance to using something like a Java TreeMap or an STL map.
Code for Red-black trees and interval trees from MIT
There is a C++ implementation in the CGAL Library.
Here's a C# Implementation
This is basically a space partitioning question. You have a large space with containers and a specific point in that space, what containers does it touch? A lot of games have to solve this problem, so it wouldn't be a bad idea to start there. Look for articles on "broad phase collision detection".
The simplest way to do this is to divide your number space into a constant number of pieces. Each piece knows which sets intersect with it, something that you calculate whenever a new set is added. Now, rather than testing every single set when you have a point, you only need to check the sets contained within the piece that the point is in.
Another general and efficient algorithm used is Binary Space Partitioning. This algorithm subdivides your space into two sides, and each side knows which sets intersect it. You can recursively repeat this process to your desired precision (although it doesn't make sense to create a subdivision smaller than the range of your smallest set).
You are welcome to check out the C# implementation I've posted on CodePlex for IntervalTree, which solves this problem exactly.
Ido.
Your problem is only one dimensional, so it's a bit simpler than the space partitioning problems found in most games.
You could just use a simple BST over the interval endpoints, where each leaf remembers the list of intervals covering the segment immediately to its left.
If you had intervals A = (0, 10) and B = (5, 15), the leaves of the tree would be (0, with an empty list), (5, with A), (10, with A and B), and (15, with B).
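Here is a hedged sketch of that scheme in Java, using a TreeMap as the BST and hard-coding the example's leaves; StabbingIndex and stab are made-up names. Note that with closed intervals the exact endpoints need a small extra check (as the example shows, t = 0 maps to the empty list even though A contains 0):

    import java.util.*;

    // Each endpoint key stores the intervals covering the segment between
    // it and the previous endpoint; a stabbing query is one ceiling lookup.
    class StabbingIndex {
        final TreeMap<Integer, List<String>> segments = new TreeMap<>();

        static StabbingIndex example() {
            StabbingIndex ix = new StabbingIndex();
            ix.segments.put(0, List.of());            // (-inf, 0]: nothing
            ix.segments.put(5, List.of("A"));         // (0, 5]:   A
            ix.segments.put(10, List.of("A", "B"));   // (5, 10]:  A and B
            ix.segments.put(15, List.of("B"));        // (10, 15]: B
            return ix;
        }

        // All intervals containing t, in O(log n) per query.
        List<String> stab(int t) {
            Map.Entry<Integer, List<String>> e = segments.ceilingEntry(t);
            return e == null ? List.of() : e.getValue();
        }
    }

For example, stab(7) returns [A, B] and stab(12) returns [B], matching the leaves described above.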
