Distance matrix between 500,000 sets of coordinates

I'm working on a project with 500,000 participants. We have in our database the precise coordinates of their home, and we want to release this data to someone who needs it to evaluate how close our participants live to one another.
We are very reluctant to release the precise coordinates, because this is an anonymized project and the risk for re-identification would be very high. Rounded coordinates (to something like 100m or 1km) are apparently not precise enough for what they're trying to achieve.
A nice workaround would have been to send them a 500,000 by 500,000 matrix with the absolute distance between each pair of participants, but this means 250 billion entries, or rather 125 billion if we remove half the matrix since |A-B| = |B-A|.
I've never worked with this type of data before, so I was wondering if anyone had a clever idea on how to deal with this? (Something that would not involve sending them 2 TB of data!)
Thanks.

Provided that the recipient of the data is happy to perform the great circle calculation themselves, you only need to send the 500,000 lines, but with shifted latitudes and longitudes.
First of all identify an approximate geospatial centre of your dataset, and then work out the offsets needed to move this centre to 0°N and 0°E. Then apply these same offsets to the users' latitudes and longitudes. This will centre the results around the equator and the prime meridian.
Provided your real data isn't too close to the poles and the latitude offset is small, the distance calculated between the offset points will be very close to the distance between the corresponding real points A and B. A longitude offset preserves great circle distances exactly. A latitude offset does not, because the spacing between meridians shrinks with the cosine of the latitude, so a large latitude shift stretches or compresses east-west distances.
Obviously the offsets applied need to be kept secret.
This approach may not work if it is known that your data is based around a particular place - the recipient may be able to deduce where the real points are - but that is something you'll need to decide yourself.
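As a rough illustration of the great circle calculation the recipient would run, here is a minimal Python sketch using the haversine formula. The function names, the Earth radius of 6371 km, the offsets and the sample coordinates are all assumptions made for illustration. It compares the real distance with the distance after a longitude-only offset and after an additional latitude offset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def apply_offset(point, lat_offset, lon_offset):
    """Shift a (lat, lon) pair by secret offsets (hypothetical helper)."""
    lat, lon = point
    return (lat + lat_offset, lon + lon_offset)


# Two hypothetical homes on the same parallel at 50 degrees north.
a = (50.00, 8.00)
b = (50.00, 8.10)

true_d = haversine_km(*a, *b)
# Longitude-only offset: the distance is unchanged.
lon_shift_d = haversine_km(*apply_offset(a, 0, -8.05), *apply_offset(b, 0, -8.05))
# Latitude moved to the equator as well: the east-west distance is stretched.
lat_shift_d = haversine_km(*apply_offset(a, -50, -8.05), *apply_offset(b, -50, -8.05))
```

Comparing `true_d`, `lon_shift_d` and `lat_shift_d` on your own data is a quick way to check how much error a given pair of offsets introduces before releasing anything.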

Related

Kalman Filter on a set of points belonging to the same object?

Let's say you're tracking a set of 20 segments with the same length belonging to the same 3D plane.
To visualize, imagine that you're drawing a set of segments of length 10 cm randomly on a sheet of paper. And make someone move this sheet in front of the camera.
Let's say those segments are represented by two points A and B.
Let's assume we manage to track A_t and B_t for all the segments. The tracked points aren't stable from frame to frame resulting in occasional jitter which might be solved by a Kalman filter.
My questions are concerning the state vector:
A Kalman filter for A and B for each segment (with 20 segments this results in 40 KF) is an obvious solution but it looks too heavy (knowing that this should run in real-time).
Since all the tracked points have the same properties (belonging to the same 3D plane, have the same length) isn't it possible to create one big KF with all those variables?
Thanks.
Runtime: keep in mind that the Kalman equations involve matrix multiplications and one inversion. So having 40 states means having some 40x40 matrices. That will always take longer to calculate than running 40 one-state filters, where your matrices are 1x1 (scalar). In any case, running the big filter only makes sense if you know of a mathematical relationship between your states (i.e. a correlation); otherwise its output is the same as that of the 40 one-state filters.
With the information given, that's really hard to tell. E.g. if your segments always form a polyline, you could describe them differently than if you knew nothing about the shape.
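To make the cost of a one-state filter concrete, here is a minimal Python sketch of a scalar Kalman filter. It assumes a random-walk (constant position) model for a single coordinate; the class name and the noise variances `q` and `r` are illustrative assumptions, not values taken from the question.

```python
class ScalarKalman:
    """One-state Kalman filter with a random-walk (constant position) model."""

    def __init__(self, x0, p0=1.0, q=1e-3, r=1e-1):
        self.x = x0  # state estimate
        self.p = p0  # variance of the estimate
        self.q = q   # process noise variance (assumed value)
        self.r = r   # measurement noise variance (assumed value)

    def update(self, z):
        # Predict: the state is assumed unchanged, its uncertainty grows by q.
        self.p += self.q
        # Correct: with 1x1 matrices the inversion is a plain division.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x


# Hypothetical jittery measurements of one coordinate of point A.
measurements = [10.2, 9.7, 10.4, 9.9, 10.1]
kf = ScalarKalman(x0=measurements[0])
smoothed = [kf.update(z) for z in measurements]
```

Each update is a handful of scalar operations, so running one such filter per coordinate of every tracked point is cheap. A joint filter would only pay off if the shared plane or the fixed segment length were written into the model.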

Fast way search millions of coordinates by distance

I have a data set of about 20 million coordinates. I want to be able to pass in a latitude, longitude, and distance in miles and return all coordinates that are within the mile range of my given coordinates. I need the response time to ideally be sub 50ms.
I have tried loading all coordinates in memory in a golang service which, on every request, will loop through the data and using haversine filter all coordinates which are within the given miles distance of my given coordinate.
This method sees the results return in around 2 seconds. What approach would be good to increase the speed of the results? I am open to any suggestions.
I am toying around with the idea of grouping all coordinates by degree and only filtering by the nearest to the given coordinates. Haven't had any luck improving the response times yet though. My data set is only a test one too as the real data could potentially be in the hundreds of millions.
I think that this is more of a data structure problem. One good way to store large sets of geospatial coordinates is with an R-tree. It provides search in roughly O(log_M n) time, where M is the maximum number of entries per node. I have limited knowledge of Go, but I have used an R-tree to great effect for similarly sized datasets in a similar use case in a JS application. From a quick search it appears as though there are at least a couple of Go R-tree implementations out there.
The idea would be to have a "grid" that partitions the coordinates, so that when you do a lookup you can safely return all coordinates in cells that lie entirely within the distance, skip every cell that is too far away from the target, and only do a per-coordinate comparison for cells that contain some coordinates within the distance and some outside it.
Simplified to 1D:
Coordinates are from 0 to 99
you partition into 10 blocks of 10
When somebody looks for all coordinates within distance 25 from 47
you return all coordinates in blocks [30,39], [40,49], [50,59], [60,69], and then, after doing a per-coordinate analysis for blocks [20,29] and [70,79], you additionally return 22,23,24,25,26,27,28,29, 70,71,72.
Unfortunately I have no realistic way to estimate speedup of this approach so you would need to implement it and benchmark it by yourself.
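The 1D example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: integer coordinates, cells of width 10 matching the block boundaries listed above, and a hypothetical class name `Grid1D`. A real implementation for latitude and longitude would use 2D cells and a proper distance function for the partially covered cells.

```python
from collections import defaultdict


class Grid1D:
    """Buckets integer coordinates into fixed-width cells (illustrative only)."""

    def __init__(self, values, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for v in values:
            self.cells[v // cell_size].append(v)

    def within(self, center, distance):
        lo, hi = center - distance, center + distance
        result = []
        # Only visit cells that overlap [lo, hi]; all others are skipped.
        for idx in range(lo // self.cell_size, hi // self.cell_size + 1):
            cell_lo = idx * self.cell_size
            cell_hi = cell_lo + self.cell_size - 1  # integer coordinates
            bucket = self.cells.get(idx, [])
            if lo <= cell_lo and cell_hi <= hi:
                # Cell lies fully inside the range: no per-coordinate test.
                result.extend(bucket)
            else:
                # Boundary cell: test each coordinate individually.
                result.extend(v for v in bucket if lo <= v <= hi)
        return result


grid = Grid1D(range(0, 100), cell_size=10)
found = sorted(grid.within(47, 25))
```

The cell size is the main tuning knob: smaller cells mean fewer per-coordinate checks in boundary cells but more cells to visit, so it is worth benchmarking a few sizes against typical query distances.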
MongoDB has various geographic searches. $geoNear will allow you to search for points within a specific distance from a point, and $geoWithin for points within a shape.
https://docs.mongodb.com/manual/reference/operator/aggregation/geoNear/
PostGIS for Postgres has something similar, but I am not too familiar with it.

Finding the angle of stripeline/ Angle of rotation

So I’m trying to find the rotation angle of the stripe lines in images like the attached photo.
The only assumption is that the lines are parallel and that their orientation is approximately 90 degrees [say, within a 5 degree tolerance].
I have to make sure the stripe lines in the result image will be 100% vertical. The quality of the images varies, as do their histogram/greyscale values. So methods based on non-adaptive thresholding have already failed for my cases [I’m not interested in thresholding-based methods if I cannot make them adaptive]. Also, there are sometimes random black clusters on top of the stripe lines.
What I did so far:
1) Of course HoughLines is the first option, but I couldn’t make it work for all my images, I had some partial success though following this great article:
http://felix.abecassis.me/2011/09/opencv-detect-skew-angle/.
The main reason for the failure, to my understanding, was that I needed to fine-tune the parameters for different images: parameters for Canny/BW/morphological edge detection (if needed) and parameters such as minLineLength/maxLineGap/etc. For sure there’s a way to hack into this and make it work, but to me this is a fragile solution!
2) What I’m working on right now is to divide the image into a top slice and a bottom slice, then find the peaks and valleys of each slice, and then find the angle using the width of the image and the translation of the peaks. I’m currently working on finding which peak of the top slice belongs to which peak of the bottom slice, since there will be some false positive peaks in my computation due to the black/white clusters on top of the stripe lines.
Example: location of peaks for the slices:
Top slice = {1, 33, 67, 90, 110}
Bottom slice = {3, 14, 35, 63, 90, 104}
I am actually getting similar vectors when extracting peaks. As can be seen, the length of the vectors might vary. Any idea how I can get a grouping like:
{{1,3},{33,35},{67,63},{90,90},{110,104}}
I’m open to any idea about improving any of these algorithms or a completely new approach. If needed, I can upload more images.
If you can get a list of points for a single line, a linear regression will give you a formula for the straight line that best fits the points. A simple trig operation will convert the line formula to an angle.
You can probably use some line thinning operation to turn the stripes into a list of points.
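A minimal Python sketch of the regression step, assuming you already have a list of (x, y) pixel coordinates for one stripe. Because the stripes are close to vertical, this sketch fits x as a function of y (the slope of y against x would be close to infinite); that choice, the function name and the sample points are assumptions for illustration.

```python
import math


def stripe_angle_from_vertical(points):
    """Least-squares fit of x = m*y + c; returns the tilt from vertical in degrees."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = s_xy / s_yy  # change in x per unit of y
    return math.degrees(math.atan(slope))


# Hypothetical stripe drifting 0.05 px to the right per row.
points = [(100 + 0.05 * y, y) for y in range(0, 200, 10)]
angle = stripe_angle_from_vertical(points)
```

Rotating the image by the negative of `angle` would then bring the stripe to vertical. Averaging the angle over several stripes should reduce the influence of the black clusters.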
You can run an accumulator of spatial derivatives along different angles. If you want half-degree precision and a sample of 5 lines, you have a maximum of 10*5*1500 = 75,000 iterations. You can safely reduce the sampling rate along the line tenfold, which will give you 150 points per sample and reduce the number of iterations to 7,500. Somewhere around that point the operation of straightening the image ought to become the bottleneck.

Strategies to detect and delete cluttering aggregations of GPS points?

My problem is that I have a large set of GPS tracks from different GPS loggers used in cars. When not turned off, these cheap devices log phantom movements even when standing still:
As you can see in the image above, about a thousand points get visualized in a kind of congestion. Now I want to remove all of these points so that the red track coming from the left ends before the jitter starts.
My approach is to "draw" two or three circles around each point in the track, check how many other points are located within these circles and check the ratio:
(#points / covered area) > threshold?
If the ratio exceeds a certain threshold (purple circles), I could delete all points within. So: an easy method, but it has huge disadvantages, e.g. computation time, deleting "innocent" tracks only passing through the circle, and not detecting outliers like the single points at the bottom of the picture.
I am looking for a better way to detect large heaps of points like in the picture. It should not remove false positives (of perhaps 5 or 10 points, these aggregations don't matter to me). Also, it should not simplify the rest of the track!
Edit: The result in given example should look like this:
My first step would be to investigate the speeds implied by the 'movements' of your stationary car and the changes in altitude. If either of these changes too quickly or too slowly (you'll have to decide the thresholds here) then you can probably conclude that they are due to the GPS jitter.
What information, other than position at time, does your GPS device report?
EDIT (after OP's comment)
The problem is to characterise part of the log as 'car moving' and part of the log as 'car not moving but GPS location jittering'. I suggested one approach, Benjamin suggested another. If speed doesn't discriminate accurately enough, try acceleration. Try rate of change of heading. If none of these simple approaches work, I think it's time for you to break out your stats textbooks and start figuring out autocorrelation of random processes and the like. At this point I quietly slink away ...
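As a starting point for the speed check, here is a minimal Python sketch. It assumes the track has already been projected to planar coordinates in metres with timestamps in seconds; the function names and the 1 m/s threshold are illustrative assumptions that you would need to tune.

```python
import math


def implied_speeds(track):
    """track: list of (t_seconds, x_m, y_m). Returns the speed in m/s of each step."""
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        # Guard against duplicate timestamps.
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else 0.0)
    return speeds


def flag_slow_steps(track, min_speed=1.0):
    """True for every step whose implied speed is below min_speed (assumed threshold)."""
    return [s < min_speed for s in implied_speeds(track)]


# Hypothetical track: two steps of driving, then jitter around a parking spot.
track = [(0, 0.0, 0.0), (1, 12.0, 0.0), (2, 24.0, 0.0),
         (3, 24.3, 0.2), (4, 24.1, -0.1), (5, 24.4, 0.1)]
flags = flag_slow_steps(track)
```

The same loop can be extended to acceleration or rate of change of heading by differencing the speeds or the step directions.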
Similarly to High Performance Mark's answer, you could look for line intersections that happen within a short number of points. When driving on a road, the route of the last n points rarely intersects with itself, but it does in your stationary situation because of the jitter. A single intersection could be a person doubling-back or circling around a block, but multiple intersections should be rarer. The angle of intersection will also be sharper for the jitter case.
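The intersection idea can be sketched in Python as follows. This is a brute-force illustration over a small window of points, using a standard orientation test for segment crossing; the function names and sample tracks are assumptions, and the quadratic loop is only reasonable for short windows.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True if segment ab properly crosses segment cd (touching ends are ignored)."""
    d1, d2 = _orient(c, d, a), _orient(c, d, b)
    d3, d4 = _orient(a, b, c), _orient(a, b, d)
    return d1 * d2 < 0 and d3 * d4 < 0


def count_self_intersections(points):
    """Number of crossings between non-adjacent segments of a polyline."""
    segs = list(zip(points, points[1:]))
    count = 0
    for i in range(len(segs)):
        for j in range(i + 2, len(segs)):  # skip neighbours sharing a point
            if segments_cross(*segs[i], *segs[j]):
                count += 1
    return count


# Hypothetical windows: a car on a road versus a jittering stationary logger.
road = [(0, 0), (10, 1), (20, 0), (30, 1), (40, 0)]
jitter = [(0, 0), (2, 2), (2, 0), (0, 2), (1, -1), (1, 3)]
road_crossings = count_self_intersections(road)
jitter_crossings = count_self_intersections(jitter)
```

Sliding this over the track with a window of the last n points and flagging windows with more than one or two crossings would follow the reasoning above.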
What is the data interval of the GPS points? It seems that these are in seconds. There may be one other way to add to the logic previously mentioned: compare the straight-line distance between the first and last point of a range with the total distance summed along all points in that range.
distance(d0, dn) >= 80% of sum_of_distance(d0, d1, d2, ..., dn)
A moving car satisfies this, while jitter around a fixed position does not, because the net distance travelled within a jittering range will not be much. The range 0 to n can be iterated over in larger and smaller chunks. So you can iterate over maybe 60 points of data initially, and within that data iterate over 10 points at a time.
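One reading of this comparison is a "straightness" ratio: the net displacement over a window divided by the distance summed along its points. Here is a minimal Python sketch under that interpretation, assuming planar coordinates in metres; the 80% ratio and the window of 10 points come from the text, while the function names are assumptions.

```python
import math


def path_length(points):
    """Distance summed along consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def is_moving(points, min_ratio=0.8):
    """True if the net displacement is at least min_ratio of the path length."""
    total = path_length(points)
    if total == 0:
        return False  # no movement at all
    return math.dist(points[0], points[-1]) / total >= min_ratio


def moving_windows(points, size=10):
    """Classify consecutive, non-overlapping windows of `size` points."""
    return [is_moving(points[i:i + size])
            for i in range(0, len(points) - size + 1, size)]


# Hypothetical data: 10 points of driving followed by 10 points of jitter.
drive = [(i * 10.0, 0.0) for i in range(10)]
jitter = [(0.0, 0.0), (1.0, 0.0)] * 5
labels = moving_windows(drive + jitter, size=10)
```

Windows labelled `False` are candidates for deletion; running a coarse pass first and refining only around the boundaries keeps the cost low.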

Automatic tracking algorithm

I'm trying to write a simple tracking routine to track some points on a movie.
Essentially I have a series of 100-frames-long movies, showing some bright spots on dark background.
I have ~100-150 spots per frame, and they move over the course of the movie. I would like to track them, so I'm looking for some efficient routine to do that (but preferably not one that is overkill to implement).
Some more information:
the spots are a few pixels in size (e.g. 5x5)
the movements are not big. A spot generally does not move more than 5-10 pixels from its original position. The movements are generally smooth.
the "shape" of these spots is generally fixed, they don't grow or shrink BUT they become less bright as the movie progresses.
the spots don't move in a particular direction. They can move right and then left and then right again
the user will select a region around each spot and then this region will be tracked, so I do not need to automatically find the points.
As the videos are b/w, I thought I should rely on brightness. For instance, I thought I could move around the region and calculate the correlation of the region's area in the previous frame with that in the various positions in the next frame. I understand that this is quite a naïve solution, but do you think it may work? Does anyone know specific algorithms that do this? It doesn't need to be superfast; as long as it is accurate I'm happy.
Thank you
nico
Sounds like a job for Blob detection to me.
I would suggest Pearson's product-moment correlation. Having a model (which could be any template image), you can measure the correlation of the template with any section of the frame.
The result is a coefficient between -1 and 1 which measures how well the sample correlates with the template. It is readily applicable to 2D cases.
It has the advantage of being independent of the absolute value of the samples, since the result depends on the covariance relative to the mean of the samples.
Once you detect a high correlation, you can track the successive frames in the neighbourhood of the original position, and select the best correlation factor.
However, the size and the rotation of the template matter, although as far as I can understand that is not an issue in this case. You can customize the detection for any shape, since the template image could represent any configuration.
Here is a single-pass algorithm implementation that I've used and that works correctly.
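Independently of the implementation linked above, the idea can be sketched in plain Python as follows. Frames are assumed to be lists of rows of grey values; the function names, the square patch shape and the search radius are illustrative assumptions, and a real implementation would use an image library for speed.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally sized sequences of pixel values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    return cov / math.sqrt(var_a * var_b)


def patch(frame, top, left, size):
    """Flatten the size x size region whose top-left corner is (top, left)."""
    return [frame[r][c]
            for r in range(top, top + size)
            for c in range(left, left + size)]


def track(template, frame, top, left, size, radius):
    """Search +/- radius pixels around (top, left) for the best-correlated patch."""
    best = (-2.0, top, left)
    rows = range(max(0, top - radius), min(len(frame) - size, top + radius) + 1)
    cols = range(max(0, left - radius), min(len(frame[0]) - size, left + radius) + 1)
    for r in rows:
        for c in cols:
            score = pearson(template, patch(frame, r, c, size))
            if score > best[0]:
                best = (score, r, c)
    return best  # (correlation, top, left)


def make_frame(top, left, gain):
    """Hypothetical 12x12 test frame containing one 3x3 spot."""
    spot = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    frame = [[0.0] * 12 for _ in range(12)]
    for r in range(3):
        for c in range(3):
            frame[top + r][left + c] = spot[r][c] * gain
    return frame


frame1 = make_frame(2, 2, gain=50.0)
frame2 = make_frame(4, 5, gain=25.0)  # spot has moved and is half as bright
template = patch(frame1, 2, 2, 3)
score, new_top, new_left = track(template, frame2, 2, 2, size=3, radius=4)
```

Because the coefficient is normalised by the variance of both patches, the dimming of the spots over the course of the movie does not by itself reduce the score.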
This has got to be a well-researched topic and I suspect there won't be any 100% accurate solution.
Some links which might be of use:
Learning patterns of activity using real-time tracking. A paper by two guys from MIT.
Kalman Filter. Especially the Computer Vision part.
Motion Tracker. A student project, which also has code and sample videos I believe.
Of course, this might be overkill for you, but hope it helps giving you other leads.
Simple is good. I'd start doing something like:
1) over a small rectangle, that surrounds a spot:
2) apply a weighted average of all the pixel coordinates in the area
3) call the averaged X and Y values the objects position
4) while scanning these pixels, do something to approximate the bounding box size
5) repeat next frame with a slightly enlarged bounding box so you don't clip spot that moves
The weight for the average should go to zero for pixels below some threshold. Number 4 can be as simple as tracking the min/max position of anything brighter than the same threshold.
This will of course have issues with spots that overlap or cross paths. But for some reason I keep thinking you're tracking stars with some unknown camera motion, in which case this should be fine.
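Steps 1 to 4 above might look like this in Python. It is a minimal sketch assuming the frame is a list of rows of grey values; the function name, the choice of `value - threshold` as the weight (so that it goes to zero at the threshold) and the test frame are assumptions.

```python
def weighted_centroid(frame, top, left, height, width, threshold):
    """Brightness-weighted centroid and bounding box of pixels above threshold.

    Returns ((cx, cy), (min_x, min_y, max_x, max_y)), or None if no pixel
    in the rectangle is brighter than the threshold.
    """
    total = sum_x = sum_y = 0.0
    xs, ys = [], []
    for y in range(max(0, top), min(len(frame), top + height)):
        for x in range(max(0, left), min(len(frame[0]), left + width)):
            value = frame[y][x]
            if value <= threshold:
                continue  # weight is zero at or below the threshold
            weight = value - threshold
            total += weight
            sum_x += weight * x
            sum_y += weight * y
            xs.append(x)
            ys.append(y)
    if total == 0:
        return None
    return (sum_x / total, sum_y / total), (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 10x10 frame with a 3x3 spot centred at x=5, y=4.
frame = [[0] * 10 for _ in range(10)]
for y in range(3, 6):
    for x in range(4, 7):
        frame[y][x] = 100

result = weighted_centroid(frame, top=1, left=2, height=7, width=7, threshold=10)
```

For step 5, the returned bounding box, enlarged by the expected per-frame motion (5-10 pixels in the question), would become the search rectangle for the next frame.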
I'm afraid that blob tracking is not simple, not if you want to do it well.
Start with blob detection as genpfault says.
Now you have spots on every frame and you need to link them up. If the blobs are moving independently, you can use some sort of correspondence algorithm to link them up. See for instance http://server.cs.ucf.edu/~vision/papers/01359751.pdf.
Now you may have collisions. You can use mixture of gaussians to try to separate them, give up and let the tracks cross, use any other before-and-after information to resolve the collisions (e.g. if A and B collide and A is brighter before and will be brighter after, you can keep track of A; if A and B move along predictable trajectories, you can use that also).
Or you can collaborate with a lab that does this sort of stuff all the time.
