Use of Hilvert Curve to query a rectangular area and see if it overlaps other rectangles

Use of Hilvert Curve to query a rectangular area and see if it overlaps other rectangles - algorithm

I am looking for a method that can help me in project I am working on. The idea is that there are some rectangles in 2d space, I want to query a rectangular area and see if it overlaps any rectangle in that area. If it does, it fails. If it doesn't, meaning the space is empty, it succeeds.
I was linked to z-order curves to help turn 2d coordinates into 1d. While I was reading about it, I encountered the Hilbert curve. I read that the Hilbert curve is preferred over a z-order curve because it maintains better proximity of points. I also read that the Hilbert curve is used to make more efficient quadtrees and octrees.
I was reading this article https://sigmodrecord.org/publications/sigmodRecord/0103/3.lawder.pdf for a possible solution but I don't know if this applies to my case.
I also saw this comment https://news.ycombinator.com/item?id=14217760 which mentioned multiple index entries for non point objects.
Is there an elegant method where I can use the Hilbert curve to achieve this? Is it possible with just an array of rectangles?

I am pretty sure it is possible and can be very efficient. But there is a lot of complexity involved.
I have implemented this for a z-order curve and called it PH-Tree, you find implementations in Java and C++, as well as theoretical background in the link. The PH-Tree is similar to a z-ordered quadtree but with several modifications that allow taking full advantage of the space filling curve.
There are a few things to unpack here.
Hilbert curves vs z-order curves: Yes, Hilbert curves have a slightly better proximity than z-order curves. Better proximity does two things: 1) reduce the number of elements (nodes, data) to look at and 2) improve linear access patterns (less hopping around). Both is important if the spatial index is stored on disk and I/O is expensive (especially old disk drives).
If I remember correctly, Hilbert curve and z-order are similar enough that the always access the same amount of data. The only thing that Hilbert curves are better at is linear access.
However, Hilbert curves are much more complicated to calculate, in the tests I made with an in-memory index (not very thorough testing, I admit) I found that z-order curves are considerably more efficient because the reduced CPU time outweighs the cost of accessing data slightly out of order.
Space filling curves (at least Hilbert and z-curve) allow for some neat bit-level hacks that can speed up index operations. However, even for z-ordering I found that getting these right required a lot of thinking (I wrote a whole paper about it). I believe these operations can be adapted for Hilbert curves but I may take some time and, as I wrote above, it may not improve performance at all.
Storing rectangles in a spatial curve index:
There are different approaches to encode rectangles in a spatial curve. All approaches that I am aware of encode the rectangle in a multi-dimensional point. I found the following encoding to work best (assuming axis aligned rectangles). We defined the rectangle by the lower left minimum-corner and the upper right maximum corner, e.g. min={min0, min1}={2,3}/max={max0, max1}={8,4}. We transform this (by interleaving the min/max values) into a 4-dimensional point {2,8,3,4} and store this point in the index. Other approaches use a different ordering (2,3,8,4) or, instead of two corners, store the center point and the lengths of the edges.
Querying:
If you want to find any rectangles that overlap/intersect with a given region (we call that a window query) we need to create a 4-dimensional query box, i.e. an axis aligned box that is defined by a 4D min-point and 4D max-point (copied from here):
min = {−∞, min_0, −∞, min_1} = {−∞, 2, −∞, 3}
max = {max_0, +∞, max_1, +∞} = {8, +∞, 4, +∞}
We can process the dimensions one by one. The first min/max pair is {−∞, 8}, that will match any 2D rectangle whose minimum x-coordinate is is 8 or lower. All coordinates:
d=0: min/max pair is {−∞, 8}: matches any 2D min-x <= 8
d=1: min/max pair is {2, +∞}: matches any 2D max-x >= 2
d=2: min/max pair is {−∞, 4}: matches any 2D min-y <= 4
d=3: min/max pair is {3, +∞}: matches any 2D max-y <= 3
If all these conditions hold true, then the stored rectangle overlaps with the query window.
Final words:
This sounds complicated but can be implemented very efficiently (also lends itself to vectorization). I found that is on par with other indexes (quadtree, R-Star-Tree), see some benchmarks I made here.
Specifically, I found that the z-ordered indexes have very good insertion/update/removal times (I don't know whether that matters for you) and is very good for small query result sizes (it sounds like you often expect it be zero, i.e. no overlapping rectangle found). It generally works well with large datasets (1000s or millions of entries) and datasets that have strong clusters.
For smaller datasets of if you expect to typically find many result (you can of course abort a query early once you find the first match!) other index types may be better.
On a last note, I found the dataset characteristics to have considerable influence on which index worked best. Also, implementation appears to be at least as important as the underlying algorithms. Especially for quadtrees I found huge variations in performance from different implementations.

Related

Using a spatial index to find points within range of each other

I'm trying to find a spatial index structure suitable for a particular problem : using a union-find data structure, I want to connect\associate points that are within a certain range of each other.
I have a lot of points and I'm trying to optimize an existing solution by using a better spatial index.
Right now, I'm using a simple 2D grid indexing each square of width [threshold distance] of my point map, and I look for potential unions by searching for points in adjacent squares in the grid.
Then I compute the squared Euclidean distance to the adjacent cells combinations, which I compare to my squared threshold, and I use the union-find structure (optimized using path compression and etc.) to build groups of points.
Here is some illustration of the method. The single black points actually represent the set of points that belong to a cell of the grid, and the outgoing colored arrows represent the actual distance comparisons with the outside points.
(I'm also checking for potential connected points that belong to the same cells).
By using this pattern I make sure I'm not doing any distance comparison twice by using a proper "neighbor cell" pattern that doesn't overlap with already tested stuff when I iterate over the grid cells.
Issue is : this approach is not even close to being fast enough, and I'm trying to replace the "spatial grid index" method with something that could maybe be faster.
I've looked into quadtrees as a suitable spatial index for this problem, but I don't think it is suitable to solve it (I don't see any way of performing repeated "neighbours" checks for a particular cell more effectively using a quadtree), but maybe I'm wrong on that.
Therefore, I'm looking for a better algorithm\data structure to effectively index my points and query them for proximity.
Thanks in advance.

I have some comments:
1) I think your problem is equivalent to a "spatial join". A spatial join takes two sets of geometries, for example a set R of rectangles and a set P of points and finds for every rectangle all points in that rectangle. In Your case, R would be the rectangles (edge length = 2 * max distance) around each point and P the set of your points. Searching for spatial join may give you some useful references.
2) You may want to have a look at space filling curves. Space filling curves create a linear order for a set of spatial entities (points) with the property that points that a close in the linear ordering are usually also close in space (and vice versa). This may be useful when developing an algorithm.
3) Have look at OpenVDB. OpenVDB has a spatial index structure that is highly optimized to traverse 'voxel'-cells and their neighbors.
4) Have a look at the PH-Tree (disclaimer: this is my own project). The PH-Tree is a somewhat like a quadtree but uses low level bit operations to optimize navigation. It is also Z-ordered/Morten-ordered (see space filling curves above). You can create a window-query for each point which returns all points within that rectangle. To my knowledge, the PH-Tree is the fastest index structure for this kind of operation, especially if you typically have only 9 points in a rectangle. If you are interested in the code, the V13 implementation is probably the fastest, however the V16 should be much easier to understand and modify.
I tried on my rather old desktop machine, using about 1,000,000 points I can do about 200,000 window queries per second, so it should take about 5 second to find all neighbors for every point.
If you are using Java, my spatial index collection may also be useful.

A standard approach to this is the "sweep and prune" algorithm. Sort all the points by X coordinate, then iterate through them. As you do, maintain the lowest index of the point which is within the threshold distance (in X) of the current point. The points within that range are candidates for merging. You then do the same thing sorting by Y. Then you only need to check the Euclidean distance for those pairs which showed up in both the X and Y scans.
Note that with your current union-find approach, you can end up unioning points which are quite far from each other, if there are a bunch of nearby points "bridging" them. So your basic approach -- of unioning groups of points based on proximity -- can induce an arbitrary amount of distance error, not just the threshold distance.

Rearranging pixels to co-locate correlated pixels

Suppose we have a 2D array of gray-scale pixels. As we iterate through a large number of images, we find that each pixel has some level of correlation with all the others - across all images, certain subsets of pixels tend to be "on" at the same time as each other. How would I algorithmically rearrange the pixels' locations so that pixels which correlate with each other are also located near each other in the image?
Below is a visual (though maybe not technically accurate) depiction of what I want to do. Imagine that similarly-colored pixels are correlated across all images. We want to rearrange the pixels' locations so that these correlated pixels are co-located on the grid:
I have tried a genetic algorithm; the fitness function considered both euclidean distance between two pixels and their correlation. However it's too slow for my application, and I feel like their should be a more elegant approach.
Any thoughts would be appreciated.

Here is my formulation of your problem: You have a 2-dimensional array of vectors with correlations between these vectors. You want to rearrange the array so that vectors which are highly-correlated with each other are close together. The problem, presumably, is that any objective function which involves iterating over the entire array is too expensive to use with something like a genetic algorithm.
Here is one idea: Decide on a notion of the neighborhood of a cell (position in the array). I would go discrete rather than Euclidean since there is less computational overhead. Maybe the neighborhood of a cell is just itself and its immediate neighbors -- but you could go farther. The important thing is to keep the neighborhood small compared to the overall array. For each neighborhood -- calculate something like the average correlation between vectors in that neighborhood. Think of this as a local objective function. The overall objective function could be obtained by summing or averaging the values of these smaller objective functions.
Use some sort of hill-climbing approach (or maybe simulated annealing or tabu search) which involves making local changes on a single array rather than maintaining an entire population of arrays. Use local changes which consist of swapping two entries (and also experiment with such things as permuting triples of elements). A key insight is that such a local change only involves changing a handful of these local objective functions. In particular -- you can reject a move as being non-improving without having to recompute the overall objective function at all. Furthermore -- once you accept a move as improving, you can update the overall objective function without having to recompute very much.
I am vaguely uncomfortable with the idea of using correlation as the measure of similarity since correlation is signed. If you can't get good results with your present approach, perhaps rather than maximizing correlation you can try to minimize the squared distance between neighboring vectors.

CUDA Thrust find near neighbor points

In my problem, there are N points in the domain and they are somehow randomly distributed. For each point I need to find all neighbor points with distance less than a given double precision floating number, DIST.
Is there an efficient way to do this in Thrust?
In serial, I would use a neighborhood table and hope to achieve approximately O(n) instead of naive algorithm of O(n^2).
I have found a thrust example for 2D bucket sort, which is a perfect fit for the first part of my problem. But that is not enough, because for each bucket I need to find all points in the neighbor buckets, and then compute their distances and see if any of them is less than DIST. Finding neighbors and compute distance should be relatively easy, but adding those eligible points to a result array seems really difficult for me to implement in Thrust.
A way to rephrase this particular problem is this -- I have two 2D arrays A1 and A2, the column number represent the index of the 2D bucket and each column have different number of elements that are indices of my points. Each element in column(i) of A1 will form a potential pair with each element in colunm(i) of A2, and all eligible pairs should be recorded to a result array.
I could use a CUDA kernel and allocating tons of potentially unused memory as a workaround, but that would be the last thing I would want to do.
Thanks in advance.

The full solution is out of the scope of a single Stack Overflow answer, but there's a discussion on how to use Thrust to build a 2D spatial index in this repository:
https://github.com/jaredhoberock/thrust-workshop

Another possibility, simpler than creating a quad-tree, is using a neighborhood matrix.
First place all your points into a 2D square matrix (or 3D cubic grid, if you are dealing with three dimensions). Then you can run a full or partial spatial sort, so points will became ordered inside the matrix.
Points with small Y could move to the top rows of the matrix, and likewise, points with large Y would go to the bottom rows. The same will happen with points with small X coordinates, that should move to the columns on the left. And symmetrically, points with large X value will go to the right columns.
After you did the spatial sort (there are many ways to achieve this, both by serial or parallel algorithms) you can lookup the nearest points of a given point P by just visiting the adjacent cells where point P is actually stored in the neighborhood matrix.
If this matrix is placed into texture memory, you can use all the spatial caching from CUDA to have very fast accesses to all neighbors!
You can read more details for this idea in the following paper (you will find PDF copies of it online): Supermassive Crowd Simulation on GPU based on Emergent Behavior.
The sorting step gives you interesting choices. You can use just the even-odd transposition sort described in the paper, which is very simple to implement (even in CUDA). If you run just one pass of this, it will give you a partial sort, which can be already useful if your matrix is near-sorted. That is, if your points move slowly, it will save you a lot of computation.
If you need a full sort, you can run such even-odd transposition pass several times (as described in the following Wikipedia page):
http://en.wikipedia.org/wiki/Odd%E2%80%93even_sort
There is a second paper from the same authors, describing an extension to 3D and using three passes of the bitonic sort (which is highly parallel, but it is not a spatial sort). they claim it is both more precise than a single even-odd transposition pass and more efficient than a full sort. The paper is A Neighborhood Grid Data Structure for Massive 3D Crowd Simulation on GPU.

Sparse (Pseudo) Infinite Grid Data Structure for Web Game

I'm considering trying to make a game that takes place on an essentially infinite grid.
The grid is very sparse. Certain small regions of relatively high density. Relatively few isolated nonempty cells.
The amount of the grid in use is too large to implement naively but probably smallish by "big data" standards (I'm not trying to map the Internet or anything like that)
This needs to be easy to persist.
Here are the operations I may want to perform (reasonably efficiently) on this grid:
Ask for some small rectangular region of cells and all their contents (a player's current neighborhood)
Set individual cells or blit small regions (the player is making a move)
Ask for the rough shape or outline/silhouette of some larger rectangular regions (a world map or region preview)
Find some regions with approximately a given density (player spawning location)
Approximate shortest path through gaps of at most some small constant empty spaces per hop (it's OK to be a bad approximation often, but not OK to keep heading the wrong direction searching)
Approximate convex hull for a region
Here's the catch: I want to do this in a web app. That is, I would prefer to use existing data storage (perhaps in the form of a relational database) and relatively little external dependency (preferably avoiding the need for a persistent process).
Guys, what advice can you give me on actually implementing this? How would you do this if the web-app restrictions weren't in place? How would you modify that if they were?
Thanks a lot, everyone!

I think you can do everything using quadtrees, as others have suggested, and maybe a few additional data structures. Here's a bit more detail:
Asking for cell contents, setting cell contents: these are the basic quadtree operations.
Rough shape/outline: Given a rectangle, go down sufficiently many steps within the quadtree that most cells are empty, and make the nonempty subcells at that level black, the others white.
Region with approximately given density: if the density you're looking for is high, then I would maintain a separate index of all objects in your map. Take a random object and check the density around that object in the quadtree. Most objects will be near high density areas, simply because high-density areas have many objects. If the density near the object you picked is not the one you were looking for, pick another one.
If you're looking for low-density, then just pick random locations on the map - given that it's a sparse map, that should typically give you low density spots. Again, if it doesn't work right try again.
Approximate shortest path: if this is a not-too-frequent operation, then create a rough graph of the area "between" the starting point A and end point B, for some suitable definition of between (maybe the square containing the circle with the midpoint of AB as center and 1.5*AB as diameter, except if that diameter is less than a certain minimum, in which case... experiment). Make the same type of grid that you would use for the rough shape / outline, then create (say) a Delaunay triangulation of the black points. Do a shortest path on this graph, then overlay that on the actual map and refine the path to one that makes sense given the actual map. You may have to redo this at a few different levels of refinement - start with a very rough graph, then "zoom in" taking two points that you got from the higher level as start and end point, and iterate.
If you need to do this very frequently, you'll want to maintain this type of graph for the entire map instead of reconstructing it every time. This could be expensive, though.
Approx convex hull: again start from something like the rough shape, then take the convex hull of the black points in that.
I'm not sure if this would be easy to put into a relational database; a file-based storage could work but it would be impractical to have a write operation be concurrent with anything else, which you would probably want if you want to allow this to grow to a reasonable number of players (per world / map, if there are multiple worlds / maps). I think in that case you are probably best off keeping a separate process alive... and even then making this properly respect multithreading is going to be a headache.

A kd tree or a quadtree is a good data structure to solve your problem. Especially the latter it's a clever way to address the grid and to reduce the 2d complexity to a 1d complexity. Quadtrees is also used in many maps application like bing and google maps. Here is a good start: Nick quadtree spatial index hilbert curve blog.

How to subsample a 2D polygon?

I have polygons that define the contour of counties in the UK. These shapes are very detailed (10k to 20k points each), thus rendering the related computations (is point X in polygon P?) quite computationaly expensive.
Thus, I would like to "subsample" my polygons, to obtain a similar shape but with less points. What are the different techniques to do so?
The trivial one would be to take one every N points (thus subsampling by a factor N), but this feels too "crude". I would rather do some averaging of points, or something of that flavor. Any pointer?

Two solutions spring to mind:
1) since the map of the UK is reasonably squarish, you could choose to render a bitmap with the counties. Assign each a specific colour, and then render the borders with a 1 or 2 pixel thick black line. This means you'll only have to perform the expensive interior/exterior calculation if a sample happens to lie on the border. The larger the bitmap, the less often this will happen.
2) simplify the county outlines. You can use a recursive Ramer–Douglas–Peucker algorithm to recursively simplify the boundaries. Just make sure you cache the results. You may also have to solve this not for entire county boundaries but for shared boundaries only, to ensure no gaps. This might be quite tricky.

Here you can find a project dealing exactly with your issues. Although it works primarily with an area "filled" by points, you can set it to work with a "perimeter" type definition as yours.
It uses a k-nearest neighbors approach for calculating the region.
Samples:
Here you can request a copy of the paper.
Seemingly they planned to offer an online service for requesting calculations, but I didn't test it, and probably it isn't running.
HTH!

Polygon triangulation should help here. You'll still have to check many polygons, but these are triangles now, so they are easier to check and you can use some optimizations to determine only a small subset of polygons to check for a given region or point.
As it seems you have all the algorithms you need for polygons, not only for triangles, you can also merge several triangles that are too small after triangulation or if triangle count gets too high.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio