I have a large array that contains user IDs and their respective geospatial locations at different times. I am trying to rank these users by their geospatial activity, that is, how much each user moves around on average. To do that, I want to compute, for each user, a distance matrix holding the pairwise distances between all of their locations. I already have a geometry column in my geospatial dataframe, and I'm hoping there is an efficient method (i.e., not cycling through each entry and computing gdf.geometry.distance(entry) over and over) to compute the pairwise distances from each entry in the geometry column to every other entry in the geometry column. Any ideas?
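For concreteness, here is a minimal sketch of the kind of ranking I am after, using scipy's pdist on the raw coordinates instead of looping over gdf.geometry.distance. It assumes point geometries in a projected CRS (so plain Euclidean distance is meaningful) and a hypothetical user_id column:

```python
import numpy as np
from scipy.spatial.distance import pdist

def rank_by_mobility(gdf):
    scores = {}
    for user, sub in gdf.groupby("user_id"):
        # Stack each user's coordinates and compute all pairwise distances
        # in one vectorized call; pdist returns the condensed distance matrix.
        pts = np.column_stack([sub.geometry.x, sub.geometry.y])
        if len(pts) > 1:
            scores[user] = pdist(pts).mean()  # average pairwise distance
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```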
When we talk about PCA we say that we use it to reduce the dimensionality of the data. I have 2-d data, and using PCA reduced the dimensionality to 1-d.
Now,
The first component is chosen in such a way that it captures the maximum variance. What does it mean for the 1st component to have maximum variance?
Also, if we take 3-d data and reduce its dimensionality to 2-d, will the 1st component be built with maximum variance along the x-axis or the y-axis?
PCA works by first centering the data at the origin (subtracting the mean from each data point), and then rotating it to be in line with the axes (diagonalizing the covariance matrix into a “variance” matrix). The components are then sorted so that the diagonal of the variance matrix is in descending order, which translates to the first component having the largest variance, the second having the next largest variance, etc. Later, you squish your original data by zero-ing out less important components (projecting onto principal components), and then undoing the aforementioned transformations.
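As a concrete illustration, here is a minimal numpy sketch of those steps on made-up 2-d data (not code from the book):

```python
import numpy as np

# Toy correlated 2-d data, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center at the origin
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # diagonalize into a "variance" matrix
order = np.argsort(eigvals)[::-1]       # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_1d = Xc @ eigvecs[:, :1]                         # keep only the 1st component
X_back = X_1d @ eigvecs[:, :1].T + X.mean(axis=0)  # undo the transformations
```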
To answer your questions:
The first component having the max variance means that its corresponding entry in the variance matrix is the largest one.
I suppose it depends on what you call your axes: the principal components are the new axes produced by the rotation, so the first component need not line up with your original x- or y-axis.
Source: Probability and Statistics for Computer Science by David Forsyth.
I have a large set of objects to render in 2D, which I have sorted from bottom to top. I'm currently using an R-tree to get a subset of them out that are within the current viewport. However, after getting them out of the spatial index, I have to re-sort them by their Z order. That sorting takes about 6 times longer than looking up the list of them in the spatial index (where several hundred items have matched my query).
Is there a kind of 2D spatial index which has fast lookup by rectangular bounding box, which will return the elements in a sorted order?
You can build the R-tree on the Z-order directly.
Usually, the Hilbert order is preferred; this is known as a Hilbert R-tree.
But you can do the same with the Z-order, too.
However, you may also consider storing the data fully in Z-order right away, in a B+-tree for example.
Instead of querying with a rectangle, translate your query into Z-order intervals, and query for the Z indexes. This is a classic approach predating R-trees:
Morton, G. M. (1966). A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. Technical Report, Ottawa, Canada: IBM Ltd.
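As a hedged illustration of the idea (the 16-bit coordinate width is an arbitrary assumption), a Morton key simply interleaves the bits of x and y so that sorting by the key gives the Z-order:

```python
# Morton (Z-order) key for 16-bit x/y coordinates: interleave the bits of x
# and y so that sorting by the key walks the plane in Z-order.
def morton_key(x: int, y: int) -> int:
    key = 0
    for i in range(16):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Points close in the plane tend to be close in key order, so a B+-tree
# (or any sorted structure) over these keys can answer rectangle queries
# by decomposing the query into Z-order intervals.
```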
In my problem, there are N points in the domain and they are more or less randomly distributed. For each point I need to find all neighbor points whose distance to it is less than a given double precision floating-point number, DIST.
Is there an efficient way to do this in Thrust?
In serial, I would use a neighborhood table and hope to achieve approximately O(n) instead of the naive O(n^2) algorithm.
I have found a Thrust example for 2D bucket sort, which is a perfect fit for the first part of my problem. But that is not enough, because for each bucket I need to find all points in the neighboring buckets, then compute their distances and see if any of them is less than DIST. Finding neighbors and computing distances should be relatively easy, but adding the eligible points to a result array seems really difficult for me to implement in Thrust.
A way to rephrase this particular problem is this: I have two 2D arrays A1 and A2, where the column number represents the index of the 2D bucket, and each column has a different number of elements, which are indices of my points. Each element in column(i) of A1 will form a potential pair with each element in column(i) of A2, and all eligible pairs should be recorded to a result array.
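For concreteness, here is a serial Python sketch of that rephrasing; A1, A2 and DIST are the names used above, and points is assumed to be an array of 2D coordinates indexed by point index. These nested loops are exactly what I am trying to express in Thrust:

```python
import numpy as np

def eligible_pairs(A1, A2, points, DIST):
    pairs = []
    for col1, col2 in zip(A1, A2):      # column i of A1 with column i of A2
        for i in col1:
            for j in col2:
                # Keep only pairs closer than the cutoff distance.
                if np.linalg.norm(points[i] - points[j]) < DIST:
                    pairs.append((i, j))
    return pairs
```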
I could use a CUDA kernel and allocate tons of potentially unused memory as a workaround, but that would be the last thing I would want to do.
Thanks in advance.
The full solution is out of the scope of a single Stack Overflow answer, but there's a discussion on how to use Thrust to build a 2D spatial index in this repository:
https://github.com/jaredhoberock/thrust-workshop
Another possibility, simpler than creating a quad-tree, is using a neighborhood matrix.
First place all your points into a 2D square matrix (or a 3D cubic grid, if you are dealing with three dimensions). Then you can run a full or partial spatial sort, so that points become ordered inside the matrix.
Points with small Y could move to the top rows of the matrix, and likewise, points with large Y would go to the bottom rows. The same will happen with points with small X coordinates, which should move to the columns on the left; symmetrically, points with large X values will go to the right columns.
After you have done the spatial sort (there are many ways to achieve this, with either serial or parallel algorithms), you can look up the nearest points of a given point P by just visiting the cells adjacent to the cell where P is actually stored in the neighborhood matrix.
If this matrix is placed into texture memory, you can use all the spatial caching from CUDA to have very fast accesses to all neighbors!
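As a hedged sketch (not code from the paper), the lookup step could look like this if the matrix is stored as a list of lists with None marking empty cells:

```python
# Candidate neighbors of the point stored at (row, col) are simply the points
# stored in the surrounding 3x3 block of cells of the spatially sorted matrix.
def neighbor_candidates(grid, row, col):
    rows, cols = len(grid), len(grid[0])
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < rows and 0 <= c < cols:
                if grid[r][c] is not None:   # None marks an empty cell
                    out.append(grid[r][c])
    return out
```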
You can read more details about this idea in the following paper (you will find PDF copies of it online): Supermassive Crowd Simulation on GPU based on Emergent Behavior.
The sorting step gives you interesting choices. You can use just the even-odd transposition sort described in the paper, which is very simple to implement (even in CUDA). If you run just one pass of this, it will give you a partial sort, which can already be useful if your matrix is near-sorted. That is, if your points move slowly, it will save you a lot of computation.
If you need a full sort, you can run such an even-odd transposition pass several times (as described in the following Wikipedia page):
http://en.wikipedia.org/wiki/Odd%E2%80%93even_sort
There is a second paper from the same authors, describing an extension to 3D that uses three passes of the bitonic sort (which is highly parallel, but not a spatial sort). They claim it is both more precise than a single even-odd transposition pass and more efficient than a full sort. The paper is A Neighborhood Grid Data Structure for Massive 3D Crowd Simulation on GPU.
Suppose we are given a small number of objects and "distances" between them -- what algorithm exists for fitting these objects to points in two-dimensional space in a way that approximates these distances?
The difficulty here is that the "distance" is not distance in Euclidean space -- this is why we can only fit/approximate.
(For those interested in what the notion of distance is precisely: it is the symmetric difference metric on the power set of a (finite) set.)
Given that the number of objects is small, you can create an undirected weighted graph, where these objects would be nodes and the edge between any two nodes has the weight that corresponds to the distance between these two objects. You end up with n*(n-1)/2 edges.
Once the graph is created, there is a lot of graph visualization software, and many layout algorithms, that you can apply to it.
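For example, a hedged sketch with networkx (one possible tool, not necessarily the one you will pick); names and dist are hypothetical inputs, a list of objects and a dict mapping each pair to its distance:

```python
import networkx as nx

# The Kamada-Kawai layout tries to place nodes so that Euclidean distances in
# the drawing approximate the graph distances implied by the edge weights.
def layout_from_distances(names, dist):
    G = nx.Graph()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            G.add_edge(a, b, weight=dist[(a, b)])      # n*(n-1)/2 weighted edges
    return nx.kamada_kawai_layout(G, weight="weight")  # dict: node -> (x, y)
```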
Try a triangulation method, something like this:
Start by taking three objects with known distances between them, and create a triangle in an arbitrary grid based on the side lengths.
For each additional object that has not been placed, find at least three other objects that have been placed that you have known distances to, and use those distances to place the object using distance / distance intersection (i.e. the intersection point of the three circles centred around the fixed points with radii of the distances)
Repeat until all objects have been placed, or no more objects can be placed.
For unplaced objects, you could start another similar exercise, and then use any available distances to relate the separate clusters. Look up triangulation and trilateration networks for more info.
Edit: As per the comment below, where the distances are approximate and include an element of error, the above approach may be used to establish provisional coordinates for each object, and those coordinates may then be adjusted using a least squares method such as variation of coordinates. This would also cater for weighting distances based on their magnitude as required. For a more detailed description, check Ghilani & Wolf's book on the subject. This depends very much on the nature of the differences between your distances and how you would like your objects represented in Euclidean space based on those distances. The relationship needs to be modelled and applied as part of any solution.
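As a hedged sketch of the distance/distance intersection step (standard trilateration algebra, not code taken from this answer), a new point can be placed from three fixed points and three measured distances like this:

```python
import numpy as np

def trilaterate(p1, p2, p3, d1, d2, d3):
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    # Subtracting the circle equation at p1 from those at p2 and p3 turns the
    # three quadratic constraints into two linear equations in (x, y).
    A = 2.0 * np.array([p2 - p1, p3 - p1])
    b = np.array([
        d1**2 - d2**2 + p2 @ p2 - p1 @ p1,
        d1**2 - d3**2 + p3 @ p3 - p1 @ p1,
    ])
    return np.linalg.solve(A, b)

print(trilaterate((0, 0), (4, 0), (0, 3), 5.0, 3.0, 4.0))  # -> [4. 3.]
```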
This is an example of multidimensional scaling or, more generally, nonlinear dimensionality reduction. There are a fair number of tools/libraries available for doing this.
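For example, a minimal sketch with scikit-learn (one possible library choice; the 3x3 matrix of "distances" is made up):

```python
import numpy as np
from sklearn.manifold import MDS

# D stands in for the given symmetric matrix of pairwise "distances".
D = np.array([[0., 3., 4.],
              [3., 0., 5.],
              [4., 5., 0.]])
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)   # one 2-d point per object
```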
In a multi-dimensional space, I have a collection of rectangles, all of which are aligned to the grid. (I am using the word "rectangles" loosely - in a three dimensional space, they would be rectangular prisms.)
I want to query this collection for all rectangles that overlap an input rectangle.
What is the best data structure for holding the collection of rectangles? I will be adding rectangles to and removing rectangles from the collection from time to time, but these operations will be infrequent. The operation I want to be fast is the query.
One solution is to keep the corners of the rectangles in a list, and do a linear scan over the list, finding which rectangles overlap the query rectangle and skipping over the ones that don't.
However, I want the query operation to be faster than linear.
I've looked at the R-tree data structure, but it holds a collection of points, not a collection of rectangles, and I don't see any obvious way to generalize it.
The coordinates of my rectangles are discrete, in case you find that helpful.
I am interested in the general solution, but I will also tell you the properties of my specific problem: my problem space has three dimensions, and the number of possible values in each varies wildly. The first dimension has two possible values, the second dimension has 87 values, and the third dimension has 1.8 million values.
You can probably use kd-trees, which can be used for rectangles according to the Wikipedia page:
Variations: Instead of points
Instead of points, a kd-tree can also contain rectangles or hyperrectangles. A 2D rectangle is considered a 4D object (xlow, xhigh, ylow, yhigh). Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along xhigh, we check the xlow coordinate of the search rectangle. If the median is less than the xlow coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle and so can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.
Let's call the original problem PN, where N is the number of dimensions.
Suppose we know the solution for P1, the 1-dimensional problem: find whether a new interval overlaps any interval in a given collection.
Once we know how to solve it, we can check whether the new rectangle overlaps the collection of rectangles in each of the x/y/z projections.
So the solution of P3 is equivalent to P1_x AND P1_y AND P1_z.
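For a single pair of boxes this projection test is exact; a hedged sketch:

```python
# Two axis-aligned boxes overlap exactly when their projections overlap on
# every axis. Each box is a sequence of (low, high) pairs, one per dimension.
def boxes_overlap(a, b):
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

print(boxes_overlap([(0, 1), (0, 87)], [(1, 3), (50, 60)]))   # True
print(boxes_overlap([(0, 1), (0, 87)], [(2, 3), (50, 60)]))   # False
```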
In order to solve P1 efficiently we can use a sorted list. Each node of the list will include a coordinate and the number of intervals that are open at that coordinate.
Suppose we have the following intervals:
[1,5]
[2,9]
[3,7]
[0,2]
then the list will look as follows:
{0,1} , {1,2} , {2,2}, {3,3}, {5,2}, {7,1}, {9,0}
If we receive a new interval, say [6,7], we find the largest item in the list that is smaller than 6, which is {5,2}, and the smallest item that is greater than 7, which is {9,0}.
Since {5,2} tells us that two intervals are still open at 6, it is easy to say that the new interval does overlap with the existing ones.
And the search in the sorted list is faster than linear :)
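A hedged Python sketch of this 1-d structure (treating the intervals as half-open, which matches the counts in the example above):

```python
import bisect

# Build the sorted event list: for each endpoint coordinate, store how many
# intervals are open there (intervals treated as half-open [lo, hi)).
def build_events(intervals):
    coords = sorted({p for iv in intervals for p in iv})
    return [(x, sum(1 for lo, hi in intervals if lo <= x < hi)) for x in coords]

# A query interval [a, b] overlaps some stored interval iff any event from the
# last event at or before a up to the last event at or before b has count > 0.
def overlaps_any(events, a, b):
    coords = [x for x, _ in events]
    i = max(bisect.bisect_right(coords, a) - 1, 0)   # last event at or before a
    j = bisect.bisect_right(coords, b)               # first event after b
    return any(count > 0 for _, count in events[i:j])

events = build_events([(1, 5), (2, 9), (3, 7), (0, 2)])
# events == [(0, 1), (1, 2), (2, 2), (3, 3), (5, 2), (7, 1), (9, 0)]
print(overlaps_any(events, 6, 7))   # True: two intervals are still open at 6
```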
You have to use some sort of a partitioning technique. However, because your problem is constrained (you use only rectangles), the data-structure can be a little simplified. I haven't thought this through in detail, but something like this should work ;)
Using the discrete value constraint, you can create a secondary table-like data structure where you store the discrete values of the second dimension (the 87 possible values). Assume that these values represent planes perpendicular to this dimension. For each of these planes you can store, in this secondary table, the rectangles that intersect that plane.
Similarly, for the third dimension you can use another table with as many equally spaced values as you need (1.8 million is too much, so you would probably want to make this at least a couple of orders of magnitude smaller), and create a map of the rectangles that lie between each pair of chosen values.
Given a query rectangle, you can query the first table in constant time to determine a candidate set of rectangles that possibly intersect the query. Then you can do another query on the second table and intersect the results of the two queries. This should narrow down the number of actual intersection tests that you have to perform.