Binary search algorithm for 2-dimensional approximate data - sorting

Here is my specific problem. I need to write an algorithm that:
1) Takes these 2 arrays:
a) An array of about 3000 postcodes (or zip codes if you're in the US), with the longitude and latitude of the center point of the areas they cover (that is, 3 numbers per array element)
b) An array of about 120,000 locations, consisting of longitude and latitude
2) Converts each location to the postcode whose center point is closest to the given longitude and latitude
Note that the longitudes and latitudes of the locations are very unlikely to precisely match those in the postcodes array. That's why I'm looking for the shortest distance to the center point of the area covered by the postcode.
I know how to calculate the distance between two longitude/latitude pairs. I also appreciate that being closest to the center point of an area covered by a postcode doesn't necessarily mean you are in the area covered by that postcode - if you're in a very big postcode area but close to the border, you may be closer to the center point of a neighbouring postcode area. However, in this case I don't have to take this into account - shortest distance to center point is enough.
A very simple way to solve this problem would be to visit each of the 120,000 locations, and find the postcode with the closest centerpoint by calculating the distance to each of the 3000 postcode centerpoints. That would mean 3000 x 120,000 = 360,000,000 distance calculations though.
If postcodes and locations were in a one-dimensional space (that is, identified by 1 number instead of 2), I could simply sort the postcode array by its one-dimensional centerpoint and then do a binary search in the postcode array for each location.
So I guess what I'm looking for is a way to sort the two-dimensional space of longitudes and latitudes of the postcode center points, so I can perform a two-dimensional binary search for each location. I've seen solutions to this problem, but those only work for exact matches, while I'm looking for the center point closest to a given location.
I am considering caching solutions, but if there is a fast two-dimensional binary search that I could use, that would make the solution much simpler.
This will be part of a batch program, so I'm not counting milliseconds, but it can't take days either. It will run once a month without manual intervention.

You can use a space-filling curve and a quadkey instead of a quadtree or a spatial index. There are some very interesting space-filling curves, like the Hilbert curve and the Moore curve, with good locality-preserving patterns.
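For illustration, here is a minimal sketch of the quadkey idea using a Z-order (Morton) curve, which is easier to compute than the Hilbert or Moore curves: quantize each center point's lat/lon, interleave the bits into a single sortable key, then binary-search that key for each location. The postcode entries and the window size below are made up for the example. Note that Z-order neighbors in key space are not always spatial neighbors, so checking a window of candidates around the binary-search position is a heuristic; a real implementation would verify candidates more carefully and use a proper geographic distance instead of squared degrees.

```python
from bisect import bisect_left

def interleave16(x, y):
    """Interleave the bits of two 16-bit integers into one 32-bit Morton code."""
    code = 0
    for i in range(16):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def quadkey(lat, lon, bits=16):
    """Quantize a lat/lon pair to a Z-order (Morton) key."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return interleave16(x, y)

# Hypothetical sample data: (postcode, lat, lon) of each center point.
postcodes = [("AB1", 52.1, 4.3), ("AB2", 52.4, 4.9), ("AB3", 51.9, 4.1)]
index = sorted((quadkey(lat, lon), code, lat, lon) for code, lat, lon in postcodes)
keys = [entry[0] for entry in index]

def nearest_postcode(lat, lon, window=8):
    """Binary-search the sorted keys, then pick the best of a window of
    neighboring entries by (simplified, squared-degree) distance."""
    i = bisect_left(keys, quadkey(lat, lon))
    lo, hi = max(0, i - window), min(len(index), i + window)
    return min(index[lo:hi],
               key=lambda e: (e[2] - lat) ** 2 + (e[3] - lon) ** 2)[1]

print(nearest_postcode(52.05, 4.25))  # -> AB1
```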

Related

How to count (only once) a common latitude or longitude lying on the edge between two blocks of a grid

I built a grid that includes the US map. The grid consists of latitudes and longitudes that represent small rectangles inside the US map. A small rectangle consists of many latitudes and longitudes.
From a data set that contains population figures, I can place each data point in its cell on the US grid based on its latitude and longitude.
My goal is to let the user enter a queryBorders (inside the grid) to get an accurate population count.
The problem is that some points' (or population) latitudes and longitudes lie on the border between two neighboring cells, which causes those points to be counted more than once. This gives inaccurate results.
In the illustration above, how do I get the accurate result (excluding the repeated points) for the queryBorders (a,b,c,d)?
(getting help from "How to scale down a range of numbers with a known min and max value" to get space for each block!)
Thanks in advance.

Matching massive amounts of coordinates from two lists

I'm searching for a more efficient algorithm to match coordinates between two lists.
Given are two lists of lat/long values. My goal is to find, for every coordinate in the first list, all matching coordinates from the other list within a given radius, like 500 meters for example.
Right now it's just brute-forced by two for loops, calculating the distance and checking whether it's within my radius for every pair of coordinates. But that brings me to a complexity of O(n²).
To improve this, my first idea would be to do something similar to a Hashmap:
Classify the first list to bigger "fields" by cutting off some decimals at the end. An example would be:
lat: 44.7261 long: 8.2831 -> lat: 44.72 long: 8.28
lat: 43.8102 long: 9.7612 -> lat: 43.81 long: 9.76
lat: 44.7281 long: 8.2899 -> lat: 44.72 long: 8.28
So some "groups" of coordinates are created.
Now I only need to iterate once over the second list, look up which group a given coordinate lies in, and do the calculation against all coordinates in that group.
Visually, you could describe the idea as creating squares on the map that act as my hashes: first look up which hash the current coordinate lies in, then compare all coordinates in that hash with the current one.
Like this I can reduce the complexity from O(n²) to O(n+m*(average_size_of_groups))
If a coordinate will be at the border of a group I'll need to check the neighbours of this group too.
But somehow I believe there is a more efficient way to match these two lists. I was looking for algorithms that treat such kind of problems, but my google searches weren't successful.
Thank you very much :)
Your algorithm is pretty good, but the best size for your groups is smaller than you seem to be guessing, and that means you're doing too many comparisons.
Instead of just cutting off a few decimal places, you should divide the points into squares that are the same size as your radius.
Then each point is compared with the points in its own group and the 8 neighboring groups.
A common optimization for this kind of thing is to pre-process your array of points and create a two-dimensional array of "buckets", with each bucket holding a list of points. One dimension is latitude, and the other is longitude. If you want a granularity of 500 meters, then each bucket represents a 500x500 meter square.
You'll need a way to map a lat/lon value to an x/y index into your matrix. You decide which lat/lon corresponds to your 0,0 matrix square. Then, to compute the bucket for any point, you subtract that offset from the point's lat/lon, convert the latitude and longitude differences to meters, divide each by 500, and put the point in the resulting bucket.
This gets a little tricky, of course, because the distance between degrees of longitude depends on the latitude, as described in https://gis.stackexchange.com/questions/142326/calculating-longitude-length-in-miles.
Now, when somebody says "Give me all the points within 500 meters of Austin", you can get the lat/lon of Austin, convert to bucket coordinates as described above, and then compare that with all the points from that bucket and the 8 surrounding buckets.
The size of the array is the range of latitude, converted to meters and divided by 500, multiplied by the range of longitude, also converted to meters and divided by 500.
The Earth's circumference of approximately 40,100 km gives you an estimated maximum size for this array: 80,200 x 80,200, or about 6.432 billion buckets if you want 500-meter buckets. If you want to cover that large a range, you'll probably want to use a sparse matrix representation.
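As a rough sketch of the bucket idea (not the exact meter-based mapping described above): here the cell size is given directly in degrees and distances are squared planar degrees, which sidesteps the lat/lon-to-meters conversion, and a sparse dict stands in for the matrix. The points, cell size, and radius are invented for the example.

```python
from collections import defaultdict
from math import floor

CELL = 0.005  # bucket size in degrees -- crude stand-in for a ~500 m cell

def bucket(lat, lon):
    return (floor(lat / CELL), floor(lon / CELL))

def build_index(points):
    """Hash every point into its grid cell; the sparse dict acts as the matrix."""
    grid = defaultdict(list)
    for p in points:
        grid[bucket(*p)].append(p)
    return grid

def within_radius(grid, lat, lon, radius_deg):
    """Compare only against the 9 cells around the query point."""
    bi, bj = bucket(lat, lon)
    hits = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for (plat, plon) in grid.get((bi + di, bj + dj), ()):
                if (plat - lat) ** 2 + (plon - lon) ** 2 <= radius_deg ** 2:
                    hits.append((plat, plon))
    return hits

pts = [(44.7261, 8.2831), (43.8102, 9.7612), (44.7281, 8.2899)]
grid = build_index(pts)
print(within_radius(grid, 44.7270, 8.2850, 0.006))  # the two points near (44.727, 8.285)
```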

Finding Closest Point to a Set of Points (in Lat/Long)

I am given a random set of points (not told how many) with a latitude and longitude and need to sort them.
I will then be given another random point (lat/long) and need to find the point closest to it (from the set above).
When I implement a linear search (put the set in a list and then calculate all the distances), it takes approximately 7 times longer than permitted.
So, I need a much more efficient sorting algorithm, but I am unsure how I could do this, especially since I'm given points that don't exist on a flat plane.
If your points are moderately well distributed then geometric hashing is a simple method to speed up nearest neighbor searches in practice. The idea is simply to register your objects in grid cells and do your search cell-wise so you can restrict your search to a local neighborhood.
This little python demo applied the idea to circles in the plane:
So in your case you can choose some fixed N and split the longitude coordinates in [0, 2pi] into N equal parts and the latitude coordinates in [0, pi] into N parts. This gives you N^2 cells on the sphere. You register all your initial points at these cells.
When you are given the query point p then you start searching in the cell that is hit by p and in a large enough neighborhood such that you cannot miss the closest point.
If you initially register n points then you could choose N to something like sqrt(n)/4 or so.
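A rough sketch of that spherical grid in Python, under some simplifying assumptions: coordinates are in radians (latitude in [-π/2, π/2], longitude in [-π, π]), the search expands ring by ring and checks one extra ring past the first hit, and there is no special handling of the shrinking cells near the poles, so it is only approximate there. The sample points are invented.

```python
from math import pi, sin, cos, acos
from collections import defaultdict

N = 16  # grid resolution; the answer above suggests roughly sqrt(n)/4

def cell(lat, lon):
    """Map (lat, lon) in radians to one of N x N grid cells."""
    i = min(N - 1, int((lat + pi / 2) / pi * N))
    j = min(N - 1, int((lon + pi) / (2 * pi) * N))
    return i, j

def register(points):
    grid = defaultdict(list)
    for p in points:
        grid[cell(*p)].append(p)
    return grid

def angular_dist(p, q):
    """Great-circle angle between two (lat, lon) points, in radians."""
    (la1, lo1), (la2, lo2) = p, q
    c = sin(la1) * sin(la2) + cos(la1) * cos(la2) * cos(lo1 - lo2)
    return acos(max(-1.0, min(1.0, c)))

def nearest(grid, p):
    i0, j0 = cell(*p)
    candidates, found_at = [], None
    for r in range(N + 1):
        if found_at is not None and r > found_at + 1:
            break  # searched one full ring past the first hit
        for i in range(i0 - r, i0 + r + 1):
            if not 0 <= i < N:
                continue  # latitude does not wrap
            for j in range(j0 - r, j0 + r + 1):
                if max(abs(i - i0), abs(j - j0)) == r:
                    candidates.extend(grid.get((i, j % N), ()))  # wrap longitude
        if candidates and found_at is None:
            found_at = r
    return min(candidates, key=lambda q: angular_dist(p, q)) if candidates else None

pts = [(0.1, 0.1), (0.5, 1.0), (-0.3, 2.0)]
grid = register(pts)
print(nearest(grid, (0.12, 0.15)))  # -> (0.1, 0.1)
```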

How to find if a set of location points contains points more than 1 km apart

I have a set of users, each user has a set of points (n~5000) represented by latitude and longitude.
I need to find static users. By 'static' I mean users for which no pair of points is further than 1 km apart. What is the best algorithm for this?
The maximum distance between any pair of points in a set of points is called the diameter of the set.
Here is one efficient algorithm, based on the convex hull, for solving this problem:
http://www.tcs.fudan.edu.cn/rudolf/Courses/Algorithms/Alg_ss_07w/Webprojects/Qinbo_diameter/2d_alg.htm
http://cgm.cs.mcgill.ca/~athens/cs507/Projects/2000/MS/diameter/node2.html
Since you probably don't care about exactness here, it would be easier to just find the minimum and maximum latitude and longitude over all of the points, and test whether a side of the box defined by these extrema is larger than some threshold. This works assuming you don't care about users near the north or south pole.
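A sketch of that bounding-box test in Python. Using the box diagonal (rather than a single side, as suggested above) gives a guaranteed upper bound on the diameter of the set; the 111 km-per-degree figure and the cos(latitude) correction for longitude are rough approximations that, as noted, break down near the poles.

```python
from math import cos, radians

def is_static(points, threshold_km=1.0):
    """Approximate diameter check via the lat/lon bounding box.

    No pair of points can be further apart than the box diagonal, so if
    the diagonal is below the threshold, the user is definitely static.
    """
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    mid_lat = (min(lats) + max(lats)) / 2
    height_km = (max(lats) - min(lats)) * 111.0  # ~111 km per degree of latitude
    width_km = (max(lons) - min(lons)) * 111.0 * cos(radians(mid_lat))
    return (height_km ** 2 + width_km ** 2) ** 0.5 <= threshold_km

print(is_static([(52.0, 4.0), (52.001, 4.001)]))  # True: box is ~130 m across
print(is_static([(52.0, 4.0), (52.1, 4.1)]))      # False: box is ~13 km across
```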

Using a list of scattered points in a 2D tilemap, how would I find the center of those points?

Have a look at this image for example (just a random image of scattered points from image search):
You'll see the locations with blue points. Let's say the blue represents what I'm looking for. I want to find the coordinates where there is the most blue - meaning the densest area, or the center of most points (in the picture, it would be approximately [.5, .5]).
If I have an arrayList of each and every blue point (x,y coordinates), then how do I use those points to find the center/most dense area of those points?
There are several options, depending on what precisely you need. The simplest is the mean, the average of all points: sum all the points and divide by their number. Finding the most dense area is more complicated, because first you have to come up with a definition of "dense". One option: for each point P, find its 7 nearest neighbors N_P1...N_P7. The point P for which the 7th neighbor has the smallest distance |P-N_P7| is the point with the highest density around it, and you pick that P as the center. You can replace 7 with any number that works for you. You could even derive it from your data set, say 1/3 of the total number of points.
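Both suggestions can be sketched in a few lines of Python. The nearest-neighbor loop below is brute force (O(n²) overall), which is fine for moderate point counts; the sample points are invented (a tight cluster plus two outliers).

```python
from math import dist  # Euclidean distance, Python 3.8+

def mean_point(points):
    """Centroid: average of all x and all y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def densest_point(points, k=7):
    """Pick the point whose k-th nearest neighbor is closest: a small
    k-th-neighbor distance means many points are packed around it."""
    def kth_dist(p):
        return sorted(dist(p, q) for q in points if q is not p)[k - 1]
    return min(points, key=kth_dist)

# A tight cluster of 8 points around y = 0.5, plus two far-away outliers.
points = [(0.5 + 0.01 * i, 0.5) for i in range(8)] + [(0.0, 0.0), (1.0, 1.0)]
print(mean_point(points))     # centroid, pulled slightly toward the outliers
print(densest_point(points))  # a point inside the tight cluster
```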
