Geopandas and PostGIS query with points and lines - performance

I have a geodataframe with 10e5 rows containing points. I also have a PostGIS database with a table that contains a road network (2*10e6 rows). I need an efficient query that will find the road segment in the PostGIS database that intersects with each of the points in the geodataframe.
Would I basically just loop through each row of the geodataframe using ST_INTERSECTS? Or would I use a UNION and do all intersects at the same time? Is there a completely different approach that I am not considering?

Related

Geopandas polygons which share edges with a linestring

I have a MultiLineString `lines` which runs along the edges of several polygons `poly`. All are GeoPandas GeoDataFrames.
The polygons are triangles formed through scipy.spatial.Delaunay.
I'm trying to get 3 separate bool masks to identify those polygons which share 1, 2, or 3 edges with the MultiLineString.
I've been trying every combination of:
contains/crosses/intersects/overlaps/touches/within/covers/
mask = poly.<method>(lines)
Any ideas?
Give the geopandas docs on merging data a close read. All of those methods are binary predicates that align the dataframes on their indexes, so unless the dataframes are already aligned you'll get incorrect results or NaNs.
You want a spatial join using GeoDataFrame.sjoin:
poly.sjoin(lines, how="left")
This will give you any lines which intersect the polygons.
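As a sketch of this route on toy triangles (the data and names here are invented, not the asker's): plain `intersects` also matches where a segment merely touches a polygon at a single vertex, so one option is to explode the MultiLineString into single segments first and join with the `covers` predicate (geopandas ≥ 0.10), then count matches per polygon:

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Two toy triangles sharing the vertical edge from (1, 0) to (1, 1).
poly = gpd.GeoDataFrame(geometry=[
    Polygon([(0, 0), (1, 0), (1, 1)]),
    Polygon([(1, 0), (2, 0), (1, 1)]),
])
# The MultiLineString exploded into single segments, one row per edge.
lines = gpd.GeoDataFrame(geometry=[
    LineString([(1, 0), (1, 1)]),  # shared edge of both triangles
    LineString([(0, 0), (1, 0)]),  # bottom edge of the first triangle only
])

# A polygon "covers" a segment only if no point of the segment lies in the
# polygon's exterior, which rules out touching at a single vertex.
joined = poly.sjoin(lines, how="left", predicate="covers")
edge_counts = joined.groupby(level=0)["index_right"].count()

mask_one = edge_counts == 1  # polygons sharing exactly one edge
mask_two = edge_counts == 2  # ... exactly two, and likewise for three
```

Here `edge_counts` comes out as [2, 1] for the two triangles. To explode a real MultiLineString column you could use `lines.explode(index_parts=False)` first.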

Compute pairwise geospatial distance matrix using GeoPandas

I have a large array that has user IDs and their respective geospatial locations at different times. I am trying to rank these users in terms of their geospatial activity; that is, how much each user moves around, on average. Hence, I am trying to compute a distance matrix for each user that computes their pairwise distances for each of their locations. I have a geometry column in my geospatial dataframe already, and I'm hoping that there is an efficient method (i.e., not cycling through each entry and computing gdf.geometry.distance(entry) over and over) to compute the pairwise distances from each entry in the geometry column to every other entry in the geometry column. Any ideas?
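One loop-free approach (a sketch, not from the thread) is to pull the coordinates into a NumPy array, e.g. `np.column_stack([gdf.geometry.x, gdf.geometry.y])`, and let broadcasting build the whole matrix at once. This assumes projected (planar) coordinates; for raw lon/lat you would substitute a haversine formula:

```python
import numpy as np

# Toy projected coordinates for one user's recorded locations.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]])

# Broadcasting: diff has shape (n, n, 2), one entry per pair of locations.
diff = coords[:, None, :] - coords[None, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

# One activity score per user: the mean over the n*(n-1) off-diagonal entries.
n = len(coords)
mean_distance = dist_matrix.sum() / (n * (n - 1))
```

`scipy.spatial.distance.pdist` computes the same distances in condensed form with less memory if n is large; grouping the frame by user ID and applying this per group then gives the ranking.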

Spatial index on a sorted set

I have a large set of objects to render in 2D, which I have sorted from bottom to top. I'm currently using an R-tree to get a subset of them out that are within the current viewport. However, after getting them out of the spatial index, I have to re-sort them by their Z order. That sorting takes about 6 times longer than looking up the list of them in the spatial index (where several hundred items have matched my query).
Is there a kind of 2D spatial index which has fast lookup by rectangular bounding box, which will return the elements in a sorted order?
You can build the R-tree on the Z-order directly.
Usually the Hilbert order is preferred; this is known as a Hilbert R-tree.
But you can do the same with the Z-order, too.
However, you may also consider storing the data fully in Z-order right away, in a B+-tree for example.
Instead of querying with a rectangle, translate your query into Z-order intervals and query for the Z indexes. This is a classic approach predating R-trees:
Morton, G. M. (1966). A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing. Technical report, Ottawa, Canada: IBM Ltd.
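For illustration, a small sketch (my own, not from the report) of the 2D Morton/Z-order key via bit interleaving; sorting points by this key is the "store fully in Z-order" idea above:

```python
def part1by1(n: int) -> int:
    """Spread the low 16 bits of n, inserting a zero bit between each."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x: int, y: int) -> int:
    """Interleave bits: x occupies the even bit positions, y the odd ones."""
    return (part1by1(y) << 1) | part1by1(x)

# Sorting by the Morton key linearizes 2D points along the Z-curve.
pts = [(3, 5), (1, 0), (0, 1)]
pts.sort(key=lambda p: morton2d(*p))   # -> [(1, 0), (0, 1), (3, 5)]
```

A B+-tree (or any sorted structure) keyed on `morton2d(x, y)` then supports range queries; the remaining step, decomposing a query rectangle into runs of consecutive Z values, is what the interval translation above refers to.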

What's an efficient algorithm to find points in a circle based on geographic coordinates?

I'm working on an implementation for finding the nearest person based on geographic coordinates. For example, person A has coordinates (longitude and latitude) m, and I want to find the people within the circle of center m and radius x.
I plan to store the geographic coordinates in a MySQL database. What's an efficient way to search for the nearest coordinates?
My specific questions are:
What fields should be stored in the database for an efficient search? I plan to store only the persons' coordinates.
What's the algorithm for finding the nearest person? I plan to calculate the distance between person A's coordinates and the coordinates stored in the database.
However, I think that is inefficient if the number of coordinates in the database is huge.
What's a better way to build this application?
A simple application of the Pythagorean theorem:
-- The coordinates of the person at the center of the "circle"
set @x0 = (select x from persons where id = 1),
    @y0 = (select y from persons where id = 1);
-- Calculate the distance to each point in the persons table, excluding the "center"
select *, sqrt(pow(x - @x0, 2) + pow(y - @y0, 2)) as distance
from persons
where id != 1
order by distance;
I realize this talks about SQL Server specifically, and not MySQL. However, that does not mean you cannot implement a similar solution (and it seems MySQL has spatial indexes too).
This is implemented in multiple database technologies using grids. There is a great video from SQLBits, Creating High Performance Spatial Database, that goes in depth into how grid coordinates work.
Here are a few images of the grid technique from the MSDN Spatial Indexing Overview.
Break the earth down into grids at multiple levels.
Locate which grid cells contain some or all of the object (square, circle, whatever).
This technique allows inclusion and exclusion very quickly, without the need to calculate the distance to every object (which gets insanely time-consuming).
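The exclude-then-test shape of the grid technique can be sketched in a few lines (a toy uniform grid with a single level; the names and cell size are my own, not from the video):

```python
import math
from collections import defaultdict

CELL = 1.0  # cell size, in the same units as the coordinates

def cell_of(x, y):
    return (math.floor(x / CELL), math.floor(y / CELL))

class GridIndex:
    def __init__(self):
        self.cells = defaultdict(list)   # (i, j) -> points in that cell

    def insert(self, pid, x, y):
        self.cells[cell_of(x, y)].append((pid, x, y))

    def within(self, x0, y0, radius):
        # Visit only cells overlapping the circle's bounding square
        # (cheap exclusion), then run the exact distance test on the rest.
        r = int(math.ceil(radius / CELL))
        ci, cj = cell_of(x0, y0)
        hits = []
        for i in range(ci - r, ci + r + 1):
            for j in range(cj - r, cj + r + 1):
                for pid, x, y in self.cells.get((i, j), ()):
                    if (x - x0) ** 2 + (y - y0) ** 2 <= radius ** 2:
                        hits.append(pid)
        return hits

idx = GridIndex()
idx.insert("a", 0.5, 0.5)
idx.insert("b", 2.5, 0.5)
idx.insert("c", 10.0, 10.0)
nearby = idx.within(0.0, 0.0, 3.0)   # "c" is excluded without a distance test
```

Real spatial indexes use several grid levels and tessellated tiles, but the principle is the same; for lon/lat the final test would be a haversine distance instead of Pythagoras.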

Querying a collection of rectangles for the overlap of an input rectangle

In a multi-dimensional space, I have a collection of rectangles, all of which are aligned to the grid. (I am using the word "rectangles" loosely - in a three dimensional space, they would be rectangular prisms.)
I want to query this collection for all rectangles that overlap an input rectangle.
What is the best data structure for holding the collection of rectangles? I will be adding rectangles to and removing rectangles from the collection from time to time, but these operations will be infrequent. The operation I want to be fast is the query.
One solution is to keep the corners of the rectangles in a list, and do a linear scan over the list, finding which rectangles overlap the query rectangle and skipping over the ones that don't.
However, I want the query operation to be faster than linear.
I've looked at the R-tree data structure, but it holds a collection of points, not a collection of rectangles, and I don't see any obvious way to generalize it.
The coordinates of my rectangles are discrete, in case you find that helpful.
I am interested in the general solution, but I will also tell you the properties of my specific problem: my problem space has three dimensions, and their cardinalities vary wildly. The first dimension has two possible values, the second dimension has 87 values, and the third dimension has 1.8 million values.
You can probably use KD-Trees, which can be used for rectangles according to the wiki page:
Instead of points, a kd-tree can also contain rectangles or hyperrectangles. A 2D rectangle is considered a 4D object (xlow, xhigh, ylow, yhigh). Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along xhigh, we check the xlow coordinate of the search rectangle. If the median is less than the xlow coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle, so that branch can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.
Let's call the original problem PN, where N is the number of dimensions.
Suppose we know the solution for P1, the 1-dimensional problem: find whether a new interval overlaps any interval in a given collection.
Once we know how to solve it, we can check whether the new rectangle overlaps the collection of rectangles in each of the x/y/z projections.
So P3 reduces to P1_x AND P1_y AND P1_z. (Note this is a candidate filter rather than an exact test: a different rectangle may account for each projection's overlap, so the survivors still need a direct check.)
In order to solve P1 efficiently we can use a sorted list. Each node of the list will hold a coordinate and the number of intervals open up to that coordinate.
Suppose we have the following intervals:
[1,5]
[2,9]
[3,7]
[0,2]
then the list will look as follows:
{0,1} , {1,2} , {2,2}, {3,3}, {5,2}, {7,1}, {9,0}
if we receive a new interval, say [6,7], we find the largest item in the list that is smaller than 6: {5,2}, and the smallest item that is greater than 7: {9,0}.
So it is easy to say that the new interval does overlap with the existing ones.
And the search in the sorted list is faster than linear :)
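The structure above maps naturally onto `bisect` over a plain sorted list (a sketch; closed intervals that merely touch count as overlapping here):

```python
import bisect

def build_nodes(intervals):
    # One node per distinct endpoint: (coordinate, intervals open at it).
    coords = sorted({c for lo, hi in intervals for c in (lo, hi)})
    return [(c, sum(1 for lo, hi in intervals if lo <= c < hi)) for c in coords]

def overlaps(nodes, qlo, qhi):
    coords = [c for c, _ in nodes]
    i = bisect.bisect_left(coords, qlo)          # first node >= qlo
    if i > 0 and nodes[i - 1][1] > 0:            # something open just before qlo
        return True
    return i < len(nodes) and coords[i] <= qhi   # an endpoint inside [qlo, qhi]

nodes = build_nodes([(1, 5), (2, 9), (3, 7), (0, 2)])
# nodes == [(0, 1), (1, 2), (2, 2), (3, 3), (5, 2), (7, 1), (9, 0)]
```

For [6,7] the predecessor of 6 is (5, 2) with two intervals open, so the query answers True after one O(log n) binary search.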
You have to use some sort of partitioning technique. However, because your problem is constrained (you use only rectangles), the data structure can be simplified a little. I haven't thought this through in detail, but something like this should work ;)
Using the discrete-value constraint, you can create a secondary table-like data structure where you store the discrete values of the second dimension (the 87 possible values). Assume that these values represent planes perpendicular to this dimension. For each of these planes you can store, in this secondary table, the rectangles that intersect it.
Similarly, for the third dimension you can use another table with as many equally spaced values as you need (1.8 million is too many, so you would probably want to make this at least a couple of orders of magnitude smaller), and map to each slot the rectangles that fall between two adjacent values.
Given a query rectangle, you can query the first table in constant time to determine the set of rectangles that possibly intersect the query. Then you can query the second table and intersect the two result sets. This should narrow down the number of actual intersection tests that you have to perform.
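A toy sketch of this two-table filter over two discrete dimensions (names and data are invented; surviving candidates would still get an exact overlap test):

```python
from collections import defaultdict

def build_tables(rects, dim_values):
    # One table per dimension: discrete value -> ids of rectangles spanning it.
    tables = [defaultdict(set) for _ in dim_values]
    for rid, *ranges in rects:
        for d, (lo, hi) in enumerate(ranges):
            for v in dim_values[d]:
                if lo <= v <= hi:
                    tables[d][v].add(rid)
    return tables

def candidates(tables, q_ranges, dim_values):
    # Per dimension: union over the values the query covers; then intersect
    # across dimensions, as described above.
    result = None
    for d, (lo, hi) in enumerate(q_ranges):
        hits = set()
        for v in dim_values[d]:
            if lo <= v <= hi:
                hits |= tables[d][v]
        result = hits if result is None else result & hits
    return result

rects = [("a", (0, 0), (0, 2)), ("b", (1, 1), (3, 4)), ("c", (0, 1), (1, 3))]
dim_values = [(0, 1), (0, 1, 2, 3, 4)]
tables = build_tables(rects, dim_values)
found = candidates(tables, [(0, 0), (2, 3)], dim_values)   # -> {"a", "c"}
```

Because the stored values enumerate every possible coordinate of a dimension, the per-dimension union is exact for that projection; bucketing the 1.8M-value dimension into coarser slots only adds false positives, never misses.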
