Fast way to search millions of coordinates by distance - Go

I have a data set of about 20 million coordinates. I want to be able to pass in a latitude, longitude, and a distance in miles and return all coordinates that are within that distance of the given point. Ideally the response time would be under 50 ms.
I have tried loading all coordinates into memory in a Go service which, on every request, loops through the data and uses the haversine formula to keep only the coordinates that are within the given distance of my coordinate.
With this method the results return in around 2 seconds. What approach would be good for increasing the speed of the results? I am open to any suggestions.
I am toying with the idea of grouping all coordinates by degree and only filtering the ones nearest to the given coordinate, but I haven't had any luck improving the response times yet. My data set is only a test one, too; the real data could potentially be in the hundreds of millions.
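For reference, a minimal sketch of the brute-force scan described in the question (struct and function names are my own, not from the original code):

```go
package main

import (
	"fmt"
	"math"
)

// Coord is one stored data point.
type Coord struct {
	Lat, Lng float64
}

const earthRadiusMiles = 3958.8

// haversineMiles returns the great-circle distance between two points in miles.
func haversineMiles(lat1, lng1, lat2, lng2 float64) float64 {
	toRad := func(d float64) float64 { return d * math.Pi / 180 }
	dLat, dLng := toRad(lat2-lat1), toRad(lng2-lng1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLng/2)*math.Sin(dLng/2)
	return 2 * earthRadiusMiles * math.Asin(math.Sqrt(a))
}

// withinRadius is the O(n) scan: every request walks the whole slice,
// which is why 20 million points take on the order of seconds.
func withinRadius(points []Coord, lat, lng, miles float64) []Coord {
	var out []Coord
	for _, p := range points {
		if haversineMiles(lat, lng, p.Lat, p.Lng) <= miles {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	points := []Coord{{51.5074, -0.1278}, {51.75, -1.26}, {48.8566, 2.3522}}
	fmt.Println(withinRadius(points, 51.5, -0.12, 60)) // London-area query, 60-mile radius
}
```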

I think this is more of a data structure problem. One good way to store large sets of geospatial coordinates is an R-tree, which gives O(log n) search. I have limited knowledge of Go, but I have used an R-tree to great effect for similarly sized datasets in a similar use case in a JS application. From a quick search it appears there are at least a couple of Go R-tree implementations out there.
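I can't vouch for any particular Go library's exact API, but the query pattern is the same everywhere: convert the mile radius into a latitude/longitude bounding box, ask the tree for everything intersecting that box, then refine the candidates with an exact haversine check. A sketch against a hypothetical SpatialIndex interface (all names here are placeholders, not a real library):

```go
package rtreesearch

import "math"

// Coord is one stored data point.
type Coord struct{ Lat, Lng float64 }

// SpatialIndex is a hypothetical stand-in for an R-tree library: it returns
// every point that falls inside the given bounding rectangle.
type SpatialIndex interface {
	SearchRect(minLat, minLng, maxLat, maxLng float64) []Coord
}

// QueryRadius turns a radius in miles into a bounding box, queries the index,
// and refines the candidates with an exact distance function (e.g. the
// haversine function from the sketch above).
func QueryRadius(idx SpatialIndex, distMiles func(lat1, lng1, lat2, lng2 float64) float64,
	lat, lng, miles float64) []Coord {

	// ~69 miles per degree of latitude; longitude degrees shrink with cos(lat).
	dLat := miles / 69.0
	dLng := miles / (69.0 * math.Cos(lat*math.Pi/180))

	candidates := idx.SearchRect(lat-dLat, lng-dLng, lat+dLat, lng+dLng)

	out := make([]Coord, 0, len(candidates))
	for _, c := range candidates {
		if distMiles(lat, lng, c.Lat, c.Lng) <= miles {
			out = append(out, c)
		}
	}
	return out
}
```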

The idea would be to have a "grid" that partitions the coordinates, so that when you do a lookup you can safely return every coordinate in a cell that lies entirely within the radius, skip the cells that are too far from the target, and only do a per-coordinate comparison for the cells that straddle the boundary (i.e. contain some coordinates within the distance and some outside it).
Simplified to 1D:
Coordinates range from 1 to 100.
You partition them into blocks of 10.
Somebody looks for all coordinates within distance 25 of 47, i.e. the range [22, 72].
You return all coordinates in blocks [30,39], [40,49], [50,59], [60,69] outright, and then, after a per-coordinate check of the boundary blocks [20,29] and [70,79], you additionally return 22, 23, 24, 25, 26, 27, 28, 29 and 70, 71, 72.
Unfortunately I have no realistic way to estimate the speedup of this approach, so you would need to implement and benchmark it yourself.
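A rough Go sketch of this grid idea in 2D (cell size, struct names, and the ~69-miles-per-degree approximation are my own assumptions): bucket points by truncated lat/lng cell, collect only the cells a query radius can touch, and run haversine on the points inside them. For brevity it checks every candidate point rather than special-casing cells that lie entirely inside the radius, which is the further optimisation described above.

```go
package gridsearch

import "math"

type Coord struct{ Lat, Lng float64 }

type cellKey struct{ latIdx, lngIdx int }

// Grid buckets coordinates into fixed-size lat/lng cells.
type Grid struct {
	cellDeg float64
	cells   map[cellKey][]Coord
}

func NewGrid(cellDeg float64) *Grid {
	return &Grid{cellDeg: cellDeg, cells: make(map[cellKey][]Coord)}
}

func (g *Grid) key(lat, lng float64) cellKey {
	return cellKey{int(math.Floor(lat / g.cellDeg)), int(math.Floor(lng / g.cellDeg))}
}

func (g *Grid) Insert(c Coord) {
	k := g.key(c.Lat, c.Lng)
	g.cells[k] = append(g.cells[k], c)
}

// Search visits only the cells that the query circle can possibly touch and
// runs the exact haversine check on the points inside them.
func (g *Grid) Search(lat, lng, miles float64) []Coord {
	latSpan := miles / 69.0                               // ~69 miles per degree of latitude
	lngSpan := miles / (69.0 * math.Cos(lat*math.Pi/180)) // longitude degrees shrink with cos(lat)

	minK := g.key(lat-latSpan, lng-lngSpan)
	maxK := g.key(lat+latSpan, lng+lngSpan)

	var out []Coord
	for li := minK.latIdx; li <= maxK.latIdx; li++ {
		for lo := minK.lngIdx; lo <= maxK.lngIdx; lo++ {
			for _, c := range g.cells[cellKey{li, lo}] {
				if haversineMiles(lat, lng, c.Lat, c.Lng) <= miles {
					out = append(out, c)
				}
			}
		}
	}
	return out
}

func haversineMiles(lat1, lng1, lat2, lng2 float64) float64 {
	toRad := func(d float64) float64 { return d * math.Pi / 180 }
	dLat, dLng := toRad(lat2-lat1), toRad(lng2-lng1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLng/2)*math.Sin(dLng/2)
	return 2 * 3958.8 * math.Asin(math.Sqrt(a))
}
```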

MongoDB has various geographic search features; $geoNear will allow you to search for points within a specific distance of a point or within a shape.
https://docs.mongodb.com/manual/reference/operator/aggregation/geoNear/
PostGIS for Postgres has something similar, but I am not too familiar with it.
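If you go the MongoDB route, a sketch of the query from Go using the official mongo-go-driver might look roughly like this. The database, collection, and field names here are assumptions, the collection would need a 2dsphere index on a GeoJSON "location" field, and $geoNear works in meters:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	// Assumed database/collection; documents need a GeoJSON "location" field
	// with a 2dsphere index for $geoNear to use.
	coll := client.Database("geo").Collection("coordinates")

	const metersPerMile = 1609.34
	pipeline := mongo.Pipeline{
		{{Key: "$geoNear", Value: bson.D{
			{Key: "near", Value: bson.D{
				{Key: "type", Value: "Point"},
				{Key: "coordinates", Value: bson.A{-0.1278, 51.5074}}, // lng, lat
			}},
			{Key: "distanceField", Value: "distMeters"},
			{Key: "maxDistance", Value: 5 * metersPerMile}, // 5-mile radius
			{Key: "spherical", Value: true},
		}}},
	}

	cur, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		panic(err)
	}
	var results []bson.M
	if err := cur.All(ctx, &results); err != nil {
		panic(err)
	}
	fmt.Printf("%d points within 5 miles\n", len(results))
}
```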

Related

Reducing a graph's datapoints while maintaining its main features

I have a large set of data, which needs to be displayed on a graph repeatedly.
The graph has a width of 1400 pixels. The data contains more than 30,000 datapoints.
Thus, I would like to reduce the datapoints to a number roughly around 1400, while still maintaining the main features of the graph (max, min, etc.).
If you look at programs like LabVIEW and MATLAB, they are able to display graphs containing a large number of datapoints by compressing the data without losing the graph's main features.
I am unable to use a simple decimation, average or moving average, as this would not maintain the features of the graph.
Does anyone know of any algorithms that are used by these kinds of programs or that would give me the expected results?
I am also interested in performance algorithms.
LabVIEW makes use of a max-min decimation algorithm.
As you can see from the reference, a run of data points is compressed into a maximum and a minimum value, and both points are then plotted at the same x-axis value with the vertical pixels between them filled in.
If you don't have control over how each pixel of the plot is rendered, you can try implementing something similar: take, say, eight points, find the maximum and minimum values, and pass those to the plotting function/tool (accounting for the order in which they occur in the series), giving you a decimation factor of four.
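A rough sketch of that max-min idea (the bucket size and names are my choice): split the series into fixed-size buckets and keep only each bucket's minimum and maximum, in their original order. With a bucket of eight points this gives the factor-of-four reduction mentioned above.

```go
package decimate

// Point is one sample of the series to plot.
type Point struct {
	X, Y float64
}

// MaxMin keeps only the minimum and maximum sample of every bucket of
// bucketSize points, preserving the order in which they occur, so peaks
// and troughs survive the reduction.
func MaxMin(data []Point, bucketSize int) []Point {
	if bucketSize < 2 {
		return data
	}
	out := make([]Point, 0, 2*(len(data)/bucketSize+1))
	for start := 0; start < len(data); start += bucketSize {
		end := start + bucketSize
		if end > len(data) {
			end = len(data)
		}
		minIdx, maxIdx := start, start
		for i := start + 1; i < end; i++ {
			if data[i].Y < data[minIdx].Y {
				minIdx = i
			}
			if data[i].Y > data[maxIdx].Y {
				maxIdx = i
			}
		}
		// Emit the two extremes in the order they appear in the series.
		switch {
		case minIdx == maxIdx:
			out = append(out, data[minIdx])
		case minIdx < maxIdx:
			out = append(out, data[minIdx], data[maxIdx])
		default:
			out = append(out, data[maxIdx], data[minIdx])
		}
	}
	return out
}
```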
I've already used the Ramer–Douglas–Peucker algorithm in LabVIEW for a project that had several graphs updating continuously, and it worked fine!
The algorithm doesn't take a target number of output points, but you can tune its tolerance parameter to reach your desired output size.
I don't have my implementation to share with you, but the algorithm is very simple and can easily be implemented in LabVIEW or another language. In LabVIEW you can put it inside an XControl to abstract it away from your code and reuse it.
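For reference, a compact (non-optimised) Ramer–Douglas–Peucker sketch in Go; epsilon is the tolerance mentioned above, and a larger value gives fewer output points. It recurses and copies for clarity; an iterative version with an explicit stack avoids the recursion for very long series.

```go
package simplify

import "math"

type Point struct{ X, Y float64 }

// RDP returns a simplified polyline: points whose perpendicular distance to
// the chord between the kept neighbours is below epsilon are dropped.
func RDP(points []Point, epsilon float64) []Point {
	if len(points) < 3 {
		return points
	}
	// Find the point farthest from the chord between the endpoints.
	maxDist, maxIdx := 0.0, 0
	for i := 1; i < len(points)-1; i++ {
		d := perpendicularDistance(points[i], points[0], points[len(points)-1])
		if d > maxDist {
			maxDist, maxIdx = d, i
		}
	}
	if maxDist <= epsilon {
		// Everything between the endpoints is close enough to the chord.
		return []Point{points[0], points[len(points)-1]}
	}
	// Otherwise keep that point and recurse on both halves.
	left := RDP(points[:maxIdx+1], epsilon)
	right := RDP(points[maxIdx:], epsilon)
	res := make([]Point, 0, len(left)+len(right)-1)
	res = append(res, left[:len(left)-1]...)
	res = append(res, right...)
	return res
}

// perpendicularDistance is the distance from p to the line through a and b.
func perpendicularDistance(p, a, b Point) float64 {
	dx, dy := b.X-a.X, b.Y-a.Y
	if dx == 0 && dy == 0 {
		return math.Hypot(p.X-a.X, p.Y-a.Y)
	}
	// Area of the parallelogram divided by the base length.
	return math.Abs(dy*p.X-dx*p.Y+b.X*a.Y-b.Y*a.X) / math.Hypot(dx, dy)
}
```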

Kalman Filter on a set of points belonging to the same object?

Let's say you're tracking a set of 20 segments with the same length belonging to the same 3D plane.
To visualize, imagine that you're drawing a set of segments of length 10 cm randomly on a sheet of paper, and then have someone move this sheet in front of the camera.
Let's say those segments are represented by two points, A and B.
Let's assume we manage to track A_t and B_t for all the segments. The tracked points aren't stable from frame to frame, resulting in occasional jitter, which might be solved by a Kalman filter.
My questions concern the state vector:
A Kalman filter for A and B of each segment (with 20 segments this results in 40 KFs) is the obvious solution, but it looks too heavy, given that this should run in real time.
Since all the tracked points share the same properties (they belong to the same 3D plane and have the same length), isn't it possible to create one big KF with all those variables?
Thanks.
Runtime: keep in mind that the Kalman equations involve matrix multiplications and one matrix inversion, so having 40 states means working with 40x40 matrices. That will always take longer to compute than running 40 one-state filters, whose matrices are 1x1 (scalar). In any case, running the big filter only makes sense if you know of a mathematical relationship between your states (i.e. correlation); otherwise its output is the same as running 40 one-state filters.
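To illustrate the "many small filters" option, here is a minimal scalar Kalman filter (constant-value process model; the noise parameters are my own placeholders). You would run one of these independently per tracked scalar, which is exactly the trade-off above: cheap, but it ignores any correlation between the segments.

```go
package tracking

// Scalar1D is a one-state Kalman filter with a constant-value process model:
// x_k = x_{k-1} + process noise, z_k = x_k + measurement noise.
type Scalar1D struct {
	X float64 // state estimate
	P float64 // estimate variance
	Q float64 // process noise variance
	R float64 // measurement noise variance
}

// Update performs one predict + correct step for measurement z and returns
// the filtered estimate.
func (f *Scalar1D) Update(z float64) float64 {
	// Predict: state unchanged, uncertainty grows by the process noise.
	p := f.P + f.Q
	// Correct: the Kalman gain blends prediction and measurement.
	k := p / (p + f.R)
	f.X = f.X + k*(z-f.X)
	f.P = (1 - k) * p
	return f.X
}
```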
With the information given, that's really hard to tell. For example, if your segments always form a polyline, you could describe that differently than if you knew nothing about the shape.

Fetching the nearest location of points, while accounting for bodies of water

I've got a database of points (in this case, schools), and in my app users search for the ones nearest to them. Under the hood we're currently using ElasticSearch to filter by lat/lng (using Geo Distance, which measures distance as the crow flies). For the majority of places this works fine, but in some coastal areas it picks up places that are impossible to get to; in the example below, a radius of 20 miles picks up schools in Weston-Super-Mare, which is in reality 55 miles away:
I initially decided to use the Google Maps Distance Matrix API to filter my initial as-the-crow-flies search, but there's a limit of 25 destinations per query, and as the requests are dynamic and user-facing, it's not practical to parcel them up into small pieces and push them into a background job.
Is there any way to carry out these calculations while accounting for bodies of water at the database level? The schools are stored in a Postgres database, so I thought about using PostGIS and some kind of point-in-polygon query, but I have no idea where to start looking.
Any ideas are very much appreciated!

Distance matrix between 500,000 sets of coordinates

I'm working on a project with 500,000 participants. We have in our database the precise coordinates of their home, and we want to release this data to someone who needs it to evaluate how close our participants live to one another.
We are very reluctant to release the precise coordinates, because this is an anonymized project and the risk for re-identification would be very high. Rounded coordinates (to something like 100m or 1km) are apparently not precise enough for what they're trying to achieve.
A nice workaround would have been to send them a 500,000 by 500,000 matrix with the absolute distance between each pair of participants, but this means 250 billion entries, or rather 125 billion if we remove half the matrix since |A-B| = |B-A|.
I've never worked with this type of data before, so I was wondering if anyone had a clever idea on how to deal with this? (Something that would not involve sending them 2 TB of data!)
Thanks.
Provided that the recipient of the data is happy to perform the great-circle calculation to work out the distances themselves, you only need to send the 500,000 rows, but with translated latitudes and longitudes.
First, identify an approximate geospatial centre of your dataset, and then work out the offsets needed to move this centre to 0°N, 0°E. Then apply these same offsets to the users' latitudes and longitudes. This will centre the results around the equator and the prime meridian.
Provided your real data isn't too close to the poles, the distance calculated between real points A and B will be very close to the distance between the corresponding offset points.
Obviously the offsets applied need to be kept secret.
This approach may not work if it is known that your data is based around a particular place - the recipient may be able to deduce where the real points are - but that is something you'll need to decide yourself.
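A tiny sketch of that transformation (assuming offsets in plain degrees and ignoring longitude wrap-around near ±180°):

```go
package anonymize

// Coord is one participant's home location in degrees.
type Coord struct {
	Lat, Lng float64
}

// Shift translates every coordinate by the same offsets so the dataset is
// re-centred near 0°N, 0°E. The offsets themselves must be kept secret.
func Shift(points []Coord, latOffset, lngOffset float64) []Coord {
	out := make([]Coord, len(points))
	for i, p := range points {
		out[i] = Coord{Lat: p.Lat - latOffset, Lng: p.Lng - lngOffset}
	}
	return out
}
```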

Strategies to detect and delete cluttering aggregations of GPS points?

My problem is that I have a large set of GPS tracks from different GPS loggers used in cars. When not turned off, these cheap devices log phantom movements even while standing still:
As you can see in the image above, about a thousand points get visualized in a kind of congestion. Now I want to remove all of these points so that the red track coming from the left ends before the jitter starts.
My approach is to "draw" two or three circles around each point in the track, check how many other points are located within these circles and check the ratio:
(#points / covered area) > threshold?
If the ratio exceeds a certain threshold (purple circles), I could delete all points within the circle. So: an easy method, but it has huge disadvantages, e.g. computation time, it deletes "innocent" track points that are only passing through the circle, and it doesn't detect outliers like the single points at the bottom of the picture.
I am looking for a better way to detect large heaps of points like in the picture. It should not produce false positives (aggregations of perhaps 5 or 10 points don't matter to me and should be left alone). Also, it should not simplify the rest of the track!
Edit: The result for the given example should look like this:
My first step would be to investigate the speeds implied by the 'movements' of your stationary car, and the changes in altitude. If either of these changes too quickly or too slowly (you'll have to decide the thresholds here), then you can probably conclude that they are due to GPS jitter.
What information, other than position at a given time, does your GPS device report?
EDIT (after OP's comment)
The problem is to characterise part of the log as 'car moving' and part of the log as 'car not moving but GPS location jittering'. I suggested one approach, Benjamin suggested another. If speed doesn't discriminate accurately enough, try acceleration. Try rate of change of heading. If none of these simple approaches work, I think it's time for you to break out your stats textbooks and start figuring out autocorrelation of random processes and the like. At this point I quietly slink away ...
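A sketch of the speed heuristic (the threshold, window size, and Fix struct are assumptions, and it only implements the "implausibly slow" half of the suggestion; altitude or acceleration checks would follow the same pattern): compute the speed implied by each consecutive pair of fixes and flag stretches where the car never reaches a plausible driving speed.

```go
package gpsclean

import (
	"math"
	"time"
)

// Fix is one logged GPS sample.
type Fix struct {
	Lat, Lng float64
	T        time.Time
}

const earthRadiusM = 6371000.0

func haversineM(a, b Fix) float64 {
	toRad := func(d float64) float64 { return d * math.Pi / 180 }
	dLat, dLng := toRad(b.Lat-a.Lat), toRad(b.Lng-a.Lng)
	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(a.Lat))*math.Cos(toRad(b.Lat))*math.Sin(dLng/2)*math.Sin(dLng/2)
	return 2 * earthRadiusM * math.Asin(math.Sqrt(h))
}

// ImpliedSpeeds returns the speed (m/s) implied by each consecutive pair of fixes.
func ImpliedSpeeds(track []Fix) []float64 {
	speeds := make([]float64, 0, len(track))
	for i := 1; i < len(track); i++ {
		dt := track[i].T.Sub(track[i-1].T).Seconds()
		if dt <= 0 {
			speeds = append(speeds, 0)
			continue
		}
		speeds = append(speeds, haversineM(track[i-1], track[i])/dt)
	}
	return speeds
}

// Stationary marks fixes whose implied speed stays below minSpeed (e.g. ~1 m/s)
// for the whole surrounding window: the "car not moving but jittering" case.
func Stationary(track []Fix, minSpeed float64, window int) []bool {
	speeds := ImpliedSpeeds(track)
	flags := make([]bool, len(track))
	for i := range track {
		lo, hi := i-window, i+window
		if lo < 0 {
			lo = 0
		}
		if hi > len(speeds) {
			hi = len(speeds)
		}
		slow := true
		for j := lo; j < hi; j++ {
			if speeds[j] >= minSpeed {
				slow = false
				break
			}
		}
		flags[i] = slow && hi > lo
	}
	return flags
}
```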
Similarly to High Performance Mark's answer, you could look for line intersections that happen within a small number of points. When driving on a road, the route formed by the last n points rarely intersects itself, but in your stationary situation it does because of the jitter. A single intersection could be a person doubling back or circling a block, but multiple intersections should be rarer. The angle of intersection will also be sharper in the jitter case.
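A sketch of that intersection heuristic (window size and names are my own): count how often the segment between consecutive fixes crosses the other segments in a short trailing window; repeated self-intersection within a few points suggests jitter rather than driving. For short windows, treating lat/lng as planar x/y is good enough.

```go
package gpsintersect

// pt is a 2D point; for short windows, lat/lng can be treated as planar x/y.
type pt struct{ X, Y float64 }

// segmentsIntersect reports whether segments p1-p2 and p3-p4 properly cross,
// using the standard orientation (cross product) test; collinear touching is ignored.
func segmentsIntersect(p1, p2, p3, p4 pt) bool {
	d := func(a, b, c pt) float64 {
		return (b.X-a.X)*(c.Y-a.Y) - (b.Y-a.Y)*(c.X-a.X)
	}
	d1, d2 := d(p3, p4, p1), d(p3, p4, p2)
	d3, d4 := d(p1, p2, p3), d(p1, p2, p4)
	return ((d1 > 0 && d2 < 0) || (d1 < 0 && d2 > 0)) &&
		((d3 > 0 && d4 < 0) || (d3 < 0 && d4 > 0))
}

// SelfIntersections counts, for each point i, how many times the segment
// track[i-1]->track[i] crosses one of the previous `window` segments.
// Several crossings in a short window are a strong hint of stationary jitter.
func SelfIntersections(track []pt, window int) []int {
	counts := make([]int, len(track))
	for i := 2; i < len(track); i++ {
		lo := i - window
		if lo < 1 {
			lo = 1
		}
		for j := lo; j < i-1; j++ { // skip the adjacent segment, which shares an endpoint
			if segmentsIntersect(track[i-1], track[i], track[j-1], track[j]) {
				counts[i]++
			}
		}
	}
	return counts
}
```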
What is the data interval of the GPS points? It seems these are logged every second. There may be one other way to add to the logic mentioned previously: compare the straight-line distance between the first and last points of a window with the summed point-to-point distance over that window, e.g.
straight_line_distance(p0, pn) >= 80% of (dist(p0,p1) + dist(p1,p2) + ... + dist(pn-1,pn))
When this holds, the car is genuinely covering ground; when the summed step distance is far larger than the displacement, the window is jitter. The window can be scanned in larger and smaller chunks, since the distance travelled while stationary will not be much. For example, you could iterate over windows of maybe 60 points initially, and within each window step through in chunks of 10 points.
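A sketch of that windowed ratio test (the window size and 0.8 threshold are just the values mentioned above): compare the straight-line displacement across a window with the summed point-to-point distance, and flag windows where the displacement is a small fraction of the distance travelled.

```go
package gpsratio

import "math"

// Fix is one logged GPS sample.
type Fix struct{ Lat, Lng float64 }

const earthRadiusM = 6371000.0

func haversineM(a, b Fix) float64 {
	toRad := func(d float64) float64 { return d * math.Pi / 180 }
	dLat, dLng := toRad(b.Lat-a.Lat), toRad(b.Lng-a.Lng)
	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(a.Lat))*math.Cos(toRad(b.Lat))*math.Sin(dLng/2)*math.Sin(dLng/2)
	return 2 * earthRadiusM * math.Asin(math.Sqrt(h))
}

// JitterWindows marks windows where the straight-line displacement between the
// first and last fix is less than `ratio` (e.g. 0.8) of the summed step distance.
func JitterWindows(track []Fix, window int, ratio float64) []bool {
	flags := make([]bool, len(track))
	for start := 0; start+window <= len(track); start += window {
		end := start + window - 1
		var pathLen float64
		for i := start + 1; i <= end; i++ {
			pathLen += haversineM(track[i-1], track[i])
		}
		straight := haversineM(track[start], track[end])
		if pathLen > 0 && straight < ratio*pathLen {
			for i := start; i <= end; i++ {
				flags[i] = true
			}
		}
	}
	return flags
}
```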
