Assuming I have a large set of coordinates such as (3,4), (5,-6), etc., where x and y are integers, is it possible to order them using a BST?
How can I go about determining what should be on the left vs right node?
The reason why I'm looking at BST instead of simply using a list of coordinates is so that I can more efficiently (vs linear search) determine those coordinates that would be in the Moore neighborhood (Chebyshev distance 1) of another.
I've thought about alternating comparisons to x and y values; is that a good approach?
How else can I apply BST to this situation? Or is using BST untenable?
I suggest you create a grid of cells. Each cell (which is actually a list) contains all coordinates that lie within it.
If you need to find the neighbors of a coordinate, you just look at the coordinates that lie within the same cell (or in neighboring cells).
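For concreteness, a minimal sketch of that idea in Java: cells are keyed in a HashMap (so empty cells cost nothing), with cell size 1, which is exactly what a Moore neighbourhood (Chebyshev distance 1) needs. All names here are illustrative, not from the original answer:

import java.util.*;

class CellGrid {
    // cell (cx,cy) -> coordinates stored in that cell
    final Map<Long, List<int[]>> cells = new HashMap<>();

    static long key(int cx, int cy) { // pack two ints into one map key
        return ((long) cx << 32) ^ (cy & 0xffffffffL);
    }

    void add(int x, int y) {
        cells.computeIfAbsent(key(x, y), k -> new ArrayList<>()).add(new int[]{x, y});
    }

    // all stored coordinates within Chebyshev distance 1 of (x,y)
    List<int[]> mooreNeighbours(int x, int y) {
        List<int[]> out = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                for (int[] p : cells.getOrDefault(key(x + dx, y + dy), Collections.emptyList()))
                    if (p[0] != x || p[1] != y) out.add(p);
        return out;
    }
}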
Despite the simplicity of aioobe's approach of creating a grid of cells (a two-dimensional array), it was a bit heavy-handed/inefficient to store the state of all possible cells/coordinates, especially in cases where I have only a handful of actual coordinates in a very large space (a sparse array).
Ultimately I realised that using a BST is feasible (there are other approaches), and this is what I did to find Moore neighbours efficiently using a balanced BST:
Impose an ordinal relation on the coordinates, i.e. (x1,y1) > (x2,y2) <=> (x1 > x2) || (x1 == x2 && y1 > y2)
Sort the list of coordinates by that relation (so that a balanced tree can be generated)
Generate the BST from the sorted list recursively: insert the median element as the node, slice the list below and above the median, and pass the slices to recursive calls that generate the left and right subtrees
Then, to search for the Moore neighbours of a coordinate, I just look up each of the 8 possible neighbour coordinates in the tree (as aioobe suggests), each in O(log n).
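A minimal sketch of this approach in Java: a TreeSet (a self-balancing red-black tree) stands in for the hand-built balanced BST of steps 2-3, its comparator implements the ordinal relation from step 1, and contains() is the O(log n) lookup. The class and method names are illustrative:

import java.util.*;

class MooreBst {
    // (x1,y1) > (x2,y2) <=> x1 > x2, or x1 == x2 and y1 > y2
    static final Comparator<int[]> LEX =
        Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);

    final TreeSet<int[]> tree = new TreeSet<>(LEX);

    MooreBst(Collection<int[]> coords) { tree.addAll(coords); }

    // look up the 8 candidate neighbours of (x,y), each in O(log n)
    List<int[]> mooreNeighbours(int x, int y) {
        List<int[]> found = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++) {
                if (dx == 0 && dy == 0) continue;
                int[] candidate = {x + dx, y + dy};
                if (tree.contains(candidate)) found.add(candidate);
            }
        return found;
    }
}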
N.B: there's a major edit at the bottom of the question - check it out
Question
Say I have a set of points:
I want to find the point with the most points surrounding it, within radius R of the point (i.e. a circle), or within an R-sized box around it (i.e. a square), for 2 dimensions. I'll refer to it as the densest point function.
For the diagrams in this question, I'll represent the surrounding region as circles. In the image above, the middle point's surrounding region is shown in green. This middle point has the most surrounding points of all the points within radius R, and would be returned by the densest point function.
What I've tried
A viable way to solve this problem would be to use a range searching solution; this answer explains further, and states that it has "O(log n + k) worst-case time". Using this, I could get the number of points surrounding each point and choose the point with the largest surrounding point count.
However, if the points were extremely densely packed (in the order of a million), as such:
then each of these million points (10^6) would need to have a range search performed. The worst-case time O(log n + k), where k is the number of points returned in the range, holds for the following point tree types:
kd-trees of two dimensions (which are actually slightly worse, at O(sqrt(n) + k)),
2d-range trees,
Quadtrees, which have a worst-case time of O(n).
So, for a group of 10^6 points that all lie within radius R of one another, a range search around each point returns k ≈ 10^6 points, i.e. about 10^6 operations per point. This yields over a trillion operations!
Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably O(n log n) or less)?
EDIT
Turns out that the method above is correct! I just need help implementing it.
(Semi-)Solution
If I use a 2d-range tree:
A range reporting query costs O(log^2 n + k), for k returned points,
For a range tree with fractional cascading (also known as a layered range tree) the complexity is O(log^(d-1) n + k),
For 2 dimensions, that is O(log n + k),
Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs O(log n).
I'd perform this on every point, yielding the O(n log n) complexity I desired!
Problem
However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.
I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree pseudocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it runs in O(log n).
I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure the latter uses fractional cascading, as it states above the countInRange method that
It returns the number of such points in worst case
* O(log(n)^d) time. It can also return the points that are in the rectangle in worst case
* O(log(n)^d + k) time where k is the number of points that lie in the rectangle.
which suggests to me it does not apply fractional cascading.
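For reference, here is a sketch (in Java, names illustrative) of the counting query without fractional cascading: a balanced tree over the points sorted by x, where each node keeps the y values of its subtree in sorted order. A query decomposes the x-range into O(log n) canonical nodes and binary-searches each, i.e. the O(log^2 n) variant; fractional cascading would replace the inner binary searches with O(1) cascaded lookups:

import java.util.*;

class MergeSortTree {
    final long[] xs;      // point x coordinates, sorted ascending
    final long[][] node;  // node[v] = sorted y values of segment v
    final int n;

    MergeSortTree(long[][] pts) { // pts must be sorted by x: pts[i] = {x, y}
        n = pts.length;
        xs = new long[n];
        node = new long[4 * n][];
        for (int i = 0; i < n; i++) xs[i] = pts[i][0];
        build(1, 0, n - 1, pts);
    }

    void build(int v, int l, int r, long[][] pts) {
        if (l == r) { node[v] = new long[]{pts[l][1]}; return; }
        int m = (l + r) / 2;
        build(2 * v, l, m, pts);
        build(2 * v + 1, m + 1, r, pts);
        node[v] = new long[r - l + 1];      // merge children's sorted y lists
        long[] a = node[2 * v], b = node[2 * v + 1];
        int i = 0, j = 0, k = 0;
        while (k < node[v].length)
            node[v][k++] = (j >= b.length || (i < a.length && a[i] <= b[j])) ? a[i++] : b[j++];
    }

    // number of points with x in [x1,x2] and y in [y1,y2]
    int count(long x1, long x2, long y1, long y2) {
        int lo = lowerBound(xs, x1), hi = upperBound(xs, x2) - 1;
        return hi < lo ? 0 : count(1, 0, n - 1, lo, hi, y1, y2);
    }

    int count(int v, int l, int r, int lo, int hi, long y1, long y2) {
        if (hi < l || r < lo) return 0;
        if (lo <= l && r <= hi) // canonical node: binary search its y list
            return upperBound(node[v], y2) - lowerBound(node[v], y1);
        int m = (l + r) / 2;
        return count(2 * v, l, m, lo, hi, y1, y2)
             + count(2 * v + 1, m + 1, r, lo, hi, y1, y2);
    }

    static int lowerBound(long[] a, long v) { // first index with a[i] >= v
        int lo = 0, hi = a.length;
        while (lo < hi) { int m = (lo + hi) >>> 1; if (a[m] < v) lo = m + 1; else hi = m; }
        return lo;
    }
    static int upperBound(long[] a, long v) { // first index with a[i] > v
        int lo = 0, hi = a.length;
        while (lo < hi) { int m = (lo + hi) >>> 1; if (a[m] <= v) lo = m + 1; else hi = m; }
        return lo;
    }
}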
Refined question
To answer the question above, therefore, all I need to know is whether there are any libraries with 2d-range trees with fractional cascading that have a range counting query of O(log n) complexity, so I don't go reinventing any wheels; or can you help me write/modify the resources above to perform a query of that complexity?
I also won't complain if you can provide me with any other method of achieving a range counting query of 2d points in O(log n)!
I suggest using a plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries, which is more efficient and simpler, and in the case of a square neighborhood does not require fractional cascading:
Sort the points by Y coordinate into array S.
Advance 3 pointers into array S: one (C) for the currently inspected (center) point; another one, A (a little bit ahead), for the nearest point at distance > R below C; and the last one, B (a little bit behind), for the farthest point at distance < R above it.
Insert the points passed by A into an order statistic tree (ordered by X coordinate) and remove the points passed by B from this tree. Use this tree to find the points at distance R to the left/right of C, and use the difference of these points' positions (ranks) in the tree to get the number of points in the square area around C.
Use results of previous step to select "most surrounded" point.
This algorithm could be optimized if you rotate the points (or just exchange the X and Y coordinates) so that the width of the occupied area is not larger than its height. You could also cut the points into vertical slices (with R-sized overlap) and process the slices separately if there are too many elements in the tree for it to fit in the CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).
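Here is a rough sketch of the square-neighborhood sweep in Java. Integer coordinates are assumed, a Fenwick (binary indexed) tree over rank-compressed X values stands in for the order statistic tree, and the A/B pointers maintain the window of points whose Y lies within R of the current center; all names are my own:

import java.util.*;

class DensestPointSweep {
    static int[] tree; // Fenwick tree over x-ranks, 1-based internally

    static void add(int i, int delta) {
        for (i++; i < tree.length; i += i & -i) tree[i] += delta;
    }
    static int prefix(int i) { // number of inserted points with x-rank <= i
        int s = 0;
        for (i++; i > 0; i -= i & -i) s += tree[i];
        return s;
    }

    // index of the point with the most other points inside its 2R x 2R square
    static int mostSurrounded(long[] xs, long[] ys, long r) {
        int n = xs.length;
        Integer[] byY = new Integer[n];
        for (int i = 0; i < n; i++) byY[i] = i;
        Arrays.sort(byY, (p, q) -> Long.compare(ys[p], ys[q])); // array S

        long[] sortedX = xs.clone();
        Arrays.sort(sortedX); // for rank compression of x values
        tree = new int[n + 2];

        int a = 0, b = 0, best = -1, bestCount = -1;
        for (int ci = 0; ci < n; ci++) {
            int c = byY[ci];
            while (a < n && ys[byY[a]] <= ys[c] + r) // pointer A: insert y <= yc + r
                add(lowerBound(sortedX, xs[byY[a++]]), +1);
            while (b < n && ys[byY[b]] < ys[c] - r)  // pointer B: remove y < yc - r
                add(lowerBound(sortedX, xs[byY[b++]]), -1);
            int lo = lowerBound(sortedX, xs[c] - r);     // leftmost rank with x >= xc - r
            int hi = upperBound(sortedX, xs[c] + r) - 1; // rightmost rank with x <= xc + r
            int count = prefix(hi) - (lo > 0 ? prefix(lo - 1) : 0) - 1; // minus c itself
            if (count > bestCount) { bestCount = count; best = c; }
        }
        return best;
    }

    static int lowerBound(long[] a, long v) { // first index with a[i] >= v
        int lo = 0, hi = a.length;
        while (lo < hi) { int m = (lo + hi) >>> 1; if (a[m] < v) lo = m + 1; else hi = m; }
        return lo;
    }
    static int upperBound(long[] a, long v) { // first index with a[i] > v
        int lo = 0, hi = a.length;
        while (lo < hi) { int m = (lo + hi) >>> 1; if (a[m] <= v) lo = m + 1; else hi = m; }
        return lo;
    }
}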
For a circular neighborhood (if R is not too large and the points are evenly distributed) you could approximate the circle with several rectangles:
In this case, step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And in step 3 you should do a linear search near the points at the proper distance (<= R) to distinguish points inside the circle from points outside it.
Another way to deal with a circular neighborhood is to approximate the circle with rectangles of equal height (but here the circle should be split into more pieces). This results in a much simpler algorithm (where sorted arrays are used instead of order statistic trees):
Cut the area occupied by the points into horizontal slices, sort the slices by Y, then sort the points inside each slice by X.
For each point in each slice, assume it to be a "center" point and do step 3.
For each nearby slice, use binary search to find the points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop the linear search where the slice is completely inside the circle, and count the remaining points by the difference of positions in the array.
Use results of previous step to select "most surrounded" point.
This algorithm allows optimizations mentioned earlier as well as fractional cascading.
I would start by creating something like a k-d tree (https://en.wikipedia.org/wiki/K-d_tree): a tree with points at the leaves and, at each node, information about its descendants. At each node I would keep a count of the number of descendants and a bounding box enclosing those descendants.
Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside R. In the first case I can use the count of the number of descendants of the current node to increase the count of points within R of the current point and return up one level of the recursion. In the second case I can simply return up one level of the recursion without incrementing anything. It is only in the intermediate case that I need to continue recursing down the tree.
So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.
If the points are spread out evenly then I think you will end up constructing a k-d tree where the lower levels are close to a regular grid. If the grid is of size A x A, then in the worst case R is large enough that its boundary is a circle intersecting O(A) low-level cells, so with O(n) points I think you could expect this to cost about O(n * sqrt(n)).
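A sketch of the pruned count in Java (the node layout is my own assumption): each node carries its descendant count and bounding box, and the three cases above become two distance tests against the box:

class BoxNode {
    double minX, minY, maxX, maxY; // bounding box of all descendants
    int count;                     // number of points below this node
    BoxNode left, right;           // null at leaves
    double px, py;                 // point stored at a leaf

    int countWithin(double cx, double cy, double r) {
        // squared distance from (cx,cy) to the nearest point of the box
        double dx = Math.max(0, Math.max(minX - cx, cx - maxX));
        double dy = Math.max(0, Math.max(minY - cy, cy - maxY));
        if (dx * dx + dy * dy > r * r) return 0;        // box entirely outside R
        // squared distance to the farthest corner of the box
        double fx = Math.max(cx - minX, maxX - cx);
        double fy = Math.max(cy - minY, maxY - cy);
        if (fx * fx + fy * fy <= r * r) return count;   // box entirely inside R
        if (left == null && right == null) {            // leaf: test the point itself
            double ex = px - cx, ey = py - cy;
            return ex * ex + ey * ey <= r * r ? 1 : 0;
        }
        return (left  != null ? left.countWithin(cx, cy, r)  : 0)
             + (right != null ? right.countWithin(cx, cy, r) : 0);
    }
}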
You can speed up whatever algorithm you use by preprocessing your data in O(n) time to estimate the number of neighbouring points.
For a circle of radius R, create a grid whose cells have dimension R in both the x- and y-directions. For each point, determine to which cell it belongs. For a given cell c this test is easy:
c.x<=p.x && p.x<=c.x+R && c.y<=p.y && p.y<=c.y+R
(You may want to think deeply about whether a closed or half-open interval is correct.)
If you have relatively dense/homogeneous coverage, then you can use an array to store the values. If coverage is sparse/heterogeneous, you may wish to use a hashmap.
Now, consider a point on the grid. The extremal locations of a point within a cell are as indicated:
Points at the corners of the cell can only be neighbours with points in four cells. Points along an edge can be neighbours with points in six cells. Points not on an edge are neighbours with points in 7-9 cells. Since it's rare for a point to fall exactly on a corner or edge, we assume that any point in the focal cell is neighbours with the points in all 8 surrounding cells.
So, if a point p is in a cell (x,y), N[p] identifies the number of neighbours of p within radius R, and Np[y][x] denotes the number of points in cell (x,y), then N[p] is given by:
N[p] = Np[y][x]+
Np[y][x-1]+
Np[y-1][x-1]+
Np[y-1][x]+
Np[y-1][x+1]+
Np[y][x+1]+
Np[y+1][x+1]+
Np[y+1][x]+
Np[y+1][x-1]
Once we have the number of neighbours estimated for each point, we can heapify that data structure into a max-heap in O(n) time (with, e.g., make_heap). The structure is now a priority queue, and we can pull points off in O(log n) time per query, ordered by their estimated number of neighbours.
Do this for the first point and use an O(log n + k) circle search (or some more clever algorithm) to determine the actual number of neighbours that point has. Make a note of this point in a variable best_found and update its N[p] value.
Peek at the top of the heap. If the estimated number of neighbours is less than N[best_found] then we are done. Otherwise, repeat the above operation.
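A minimal sketch of that verify-until-bounded loop in Java, assuming the grid estimates N[p] are already computed (estimate[] here) and that exactCount performs the expensive exact search; both names are illustrative:

import java.util.*;
import java.util.function.IntUnaryOperator;

class BestPoint {
    static int bestPoint(int n, int[] estimate, IntUnaryOperator exactCount) {
        PriorityQueue<Integer> heap = // max-heap on the estimated neighbour count
            new PriorityQueue<>((p, q) -> Integer.compare(estimate[q], estimate[p]));
        for (int p = 0; p < n; p++) heap.add(p);

        int bestFound = heap.poll();                       // most promising point
        int bestExact = exactCount.applyAsInt(bestFound);  // its true neighbour count

        // while some estimate could still beat the best exact count, keep checking
        while (!heap.isEmpty() && estimate[heap.peek()] > bestExact) {
            int p = heap.poll();
            int exact = exactCount.applyAsInt(p);
            if (exact > bestExact) { bestExact = exact; bestFound = p; }
        }
        return bestFound;
    }
}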
To improve estimates you could use a finer grid, like so:
along with some clever sliding-window techniques to reduce the amount of processing required (see, for instance, this answer for rectangular cases; for circular windows you should probably use a collection of FIFO queues). To guard against pathological inputs you can randomize the origin of the grid.
Considering again the example you posed:
It's clear that this heuristic has the potential to save considerable time: with the above grid, only a single expensive check would need to be performed in order to prove that the middle point has the most neighbours. Again, a higher-resolution grid will improve the estimates and decrease the number of expensive checks which need to be made.
You could, and should, use a similar bounding technique in conjunction with mcdowella's answer; however, his answer does not provide a good place to start looking, so it is possible to spend a lot of time exploring low-value points.
I have an STL file that contains the coordinates (x,y,z) of the 3 points (p0, p1, p2) of each triangle. These triangles represent a 3D surface f(x,y,z). The STL file might have over a thousand triangles to represent a complex geometry.
For my application, I need to know the neighboring triangles for each triangle entry from the STL file. Meaning that for each triangle, I have to pick the 3 pairs of points pair1=(p0,p1), pair2=(p0,p2), pair3=(p1,p2) and compare them with the pairs of points in the other triangles in the set.
What's the best and most efficient algorithm to achieve this purpose? Can I use a hashtree, or a hashmap?
Change the mesh representation to a point table and a triangle (face) table. STL demands that all triangles are joined at their vertexes (no edges are cut), which means neighboring triangles always share one complete edge.
double pnt[points][3];
int tri[triangles][3];
pnt should be a list of all distinct points (index-sort it to improve speed for high point counts). tri should contain the 3 indexes of the points used in each triangle; sort them (ascending or descending) to improve match speed.
Now if any triangle tri[i] shares an edge with tri[j], then those two are neighboring triangles.
if ((tri[i][0]==tri[j][0])&&(tri[i][1]==tri[j][1])
||(tri[i][0]==tri[j][1])&&(tri[i][1]==tri[j][2])) triangles i,j are neighbors
Add all combinations ...
If you need just the neighboring points, then find all triangles containing the point in question; all the other points used in those triangles are its neighbors.
To load an STL into such a structure, do this:
Clear the pnt[], tri[] lists/tables.
Process each triangle of the STL.
For each point of the triangle:
Look it up in pnt[]; if present, use its index for the new triangle. If not, add the point to pnt and use its new index. When all 3 points are done, add the new triangle to tri.
Improving pnt[] performance
Add an index sort for pnt[], sorted by any coordinate (for example x), to improve the performance of checking whether a point is already present in pnt.
So while adding (xi,yi,zi) into pnt[], find via binary search the index i0 of the first point with pnt[i0][0] >= xi, then scan the points only until x crosses xi (i.e. stop once xi < pnt[i1][0]); this way you do not need to check all points.
If this is too slow (usually when the number of points is bigger than 40000), you can improve performance further with segmented index sorting (dividing the index sort into segment pages of finite size, like 8192 points).
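A sketch of that deduplication in Java (the C-style tables above become dynamic lists here; EPS and all names are my own choices): idx keeps pnt's indices sorted by x, so the membership test is one binary search plus a short scan over near-equal x values:

import java.util.*;

class PointTable {
    final ArrayList<double[]> pnt = new ArrayList<>();
    final ArrayList<Integer> idx = new ArrayList<>(); // pnt indices sorted by x
    static final double EPS = 1e-9;

    // returns the index of (x,y,z) in pnt, adding the point if not present
    int indexOf(double x, double y, double z) {
        // binary search for the first idx entry whose x is >= x - EPS
        int lo = 0, hi = idx.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (pnt.get(idx.get(mid))[0] < x - EPS) lo = mid + 1; else hi = mid;
        }
        // scan only while the x coordinate stays within EPS of x
        for (int i = lo; i < idx.size() && pnt.get(idx.get(i))[0] <= x + EPS; i++) {
            double[] p = pnt.get(idx.get(i));
            if (Math.abs(p[1] - y) < EPS && Math.abs(p[2] - z) < EPS)
                return idx.get(i);
        }
        pnt.add(new double[]{x, y, z});
        idx.add(lo, pnt.size() - 1); // keep idx sorted by x
        return pnt.size() - 1;
    }
}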
Improving tri[] performance
You can also sort the tri[] by tri[i][0] so you can use binary search similarly to pnt[].
I would suggest going with a hashmap where the values are sets (tree-based) of references to Triangles, the keys are those pairs of Points (let's call these pairs simply Sides), and a hashing function that takes into account the property that the hash of Side (a,b) should equal the hash of (b,a).
Some kind of algorithm:
Read 3 Points and create from them 3 Sides and a Triangle.
Add all of that to the hashmap: map[side[i]].insert(triangle)
Repeat 1-2 until you have read all the data.
Now you have a map filled with data. About the complexity of filling: insertion into a hashmap is constant-time on average (it also depends on the hash function) and insertion into a set is logarithmic, so the complete complexity of filling the data is O(n*log m), where n is the number of Sides and m is the average number of Triangles sharing the same Side.
Normally each set would contain around 4 Triangles (1 + 3 side-neighbours), so log m is relatively small (compared to n) and can be treated as a constant. These observations lead to the following conclusion: the best-case complexity for filling is O(n) (no collisions, no rehashing, etc.) and the worst case is O(n*log n) (worst-case insertion of n Sides into the map, with log n insertion into a single set, meaning all Triangles share the same Side).
Now to get all side-neighbours for some Triangle:
Get the 3 sets for the Sides of that Triangle (e.g. set[i] = map[triangle.sides[i]]).
Take the union of those 3 sets (excluding the triangle itself) to get its side-neighbours.
Done.
About the complexity of getting side-neighbours: it is linearly dependent on the size of the sets, which is relatively small compared to n in the normal case.
Note: to get point-neighbours rather than side-neighbours (if any 2 Triangles with a common Point, not Side, count as neighbours), simply key the map by Points instead of Sides. The above time-complexity estimates hold, except for the constants.
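A minimal sketch of this in Java (names illustrative): Side normalizes its endpoints so that (a,b) and (b,a) compare and hash identically, which is the property required of the hash function above; plain HashSets stand in for the tree-based sets:

import java.util.*;

class Side {
    final int a, b;
    Side(int p, int q) { a = Math.min(p, q); b = Math.max(p, q); } // order-independent
    @Override public boolean equals(Object o) {
        return o instanceof Side && ((Side) o).a == a && ((Side) o).b == b;
    }
    @Override public int hashCode() { return 31 * a + b; }
}

class NeighbourIndex {
    final Map<Side, Set<Integer>> bySide = new HashMap<>();

    // tri[i] = the 3 vertex indices of triangle i
    void build(int[][] tri) {
        for (int i = 0; i < tri.length; i++) {
            bySide.computeIfAbsent(new Side(tri[i][0], tri[i][1]), k -> new HashSet<>()).add(i);
            bySide.computeIfAbsent(new Side(tri[i][0], tri[i][2]), k -> new HashSet<>()).add(i);
            bySide.computeIfAbsent(new Side(tri[i][1], tri[i][2]), k -> new HashSet<>()).add(i);
        }
    }

    // all triangles sharing an edge with triangle i
    Set<Integer> sideNeighbours(int[][] tri, int i) {
        Set<Integer> out = new HashSet<>();
        out.addAll(bySide.get(new Side(tri[i][0], tri[i][1])));
        out.addAll(bySide.get(new Side(tri[i][0], tri[i][2])));
        out.addAll(bySide.get(new Side(tri[i][1], tri[i][2])));
        out.remove(i); // exclude the triangle itself
        return out;
    }
}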
Consider a 2000 x 2000 2D bool array. 100,000 elements are set to true, the rest to false.
Given a cell (x1,y1) we need to find the nearest cell (x2,y2) (by manhattan distance: abs(x1-x2) + abs(y1-y2)) that is false.
One way to do that would be to:
for (int dist = 0; true; dist++)
    for (int dx = -dist; dx <= dist; dx++)
        foreach (int dy in new[] { dist - Math.Abs(dx), Math.Abs(dx) - dist }) // both halves of the ring: |dx| + |dy| == dist
        {
            int x2 = x1 + dx, y2 = y1 + dy;
            if (x2 >= 0 && x2 < 2000 && y2 >= 0 && y2 < 2000 && !array[x2, y2])
                return (x2, y2);
        }
In the worst case we would have to iterate through 100,000 cells before finding a free one.
Is there a data structure we could use rather than a 2D array that would allow us to perform this search quicker?
If the data is constant and you have many queries on it:
You might want to use a k-d tree and look for the nearest neighbor. Insert (i,j) for each element such that arr[i][j] = false. The standard k-d tree uses Euclidean distance, but I think it can be modified to use Manhattan distance instead.
If the data is used for one query:
You will need at least Omega(n*m) operations to read the data and insert it into any data structure, so there is no point in doing that: a plain brute-force scan will outperform the mere construction of any data structure.
You might be interested in looking into the region quadtree. Here the entire image is initially modeled by the root, since the image contains all 0s (assumption). Then, when a particular pixel is set, the image is first divided into 4 quadrants; the 3 quadrants that do not include the pixel are left as leaves, and the remaining quadrant is subdivided again, and so on, until we reach 4 single-pixel leaves, one of which is set.
This representation helps rule out entire regions during the search, and the search time can be optimized to O(log n).
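A sketch of such a quadtree search in Java (construction omitted; the node layout is my own assumption): a node is a leaf when its region is uniformly true or uniformly false, and whole regions are ruled out by comparing their minimum Manhattan distance against the best distance found so far. Called as root.nearestFalse(qx, qy, Integer.MAX_VALUE):

class QNode {
    int x0, y0, size;  // region [x0, x0+size) x [y0, y0+size)
    Boolean uniform;   // true/false if the whole region has that value, null if mixed
    QNode[] kids;      // 4 children when mixed

    int minDist(int qx, int qy) { // min Manhattan distance from (qx,qy) to this region
        int dx = Math.max(0, Math.max(x0 - qx, qx - (x0 + size - 1)));
        int dy = Math.max(0, Math.max(y0 - qy, qy - (y0 + size - 1)));
        return dx + dy;
    }

    // distance to the nearest false cell in this subtree, or best if none is closer
    int nearestFalse(int qx, int qy, int best) {
        if (minDist(qx, qy) >= best) return best;       // prune: region can't improve
        if (uniform != null) {
            if (uniform) return best;                   // all true: nothing here
            return Math.min(best, minDist(qx, qy));     // all false: closest cell wins
        }
        for (QNode k : kids) best = k.nearestFalse(qx, qy, best);
        return best;
    }
}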
I have an unordered list of horizontal/vertical segments of length 1, which build one or many polygons. I now need to find the list of all connected corners in each polygon.
Example:
[Horizontal (0,0), Vertical (2,0), Horizontal (1,0), Horizontal (2,2), Vertical (2,1)]
represents a line like this
2       X--X
        |
1       X
        |
0 X--X--X
  0  1  2
I would be looking for the corners [(2,0), (2,2)].
In imperative languages I would probably use a (doubly-)linked data structure and traverse it. I can't come up with an elegant representation for this in Haskell. How would you do it?
Before we go looking for corners, let's take a step back. What are you trying to do?
I have an unordered list of horizontal/vertical segments of length 1, which build one or many polygons. I now need to find the list of all connected corners in each polygon.
"Searching" and "unordered lists" don't really go together, as I'm sure you realize! This is true even in simple lookups, but it's even worse for what you're doing, which is conceptually closer to finding duplicates because it requires correlating elements of the collection with each other, instead of inspecting each independently.
So, you're definitely going to want something with a lot more structure to it. One option would be a more semantically-meaningful representation in terms of complete polygons, allowing a simple traversal of an unbroken perimeter, but I'm guessing you don't have that information available (for instance, if you're trying to create such a structure here).
Now, in a comment you said:
The reason for this is, that the segments were stored in "Set"s before, in order to remove overlapping segments. This representation guarantees that there is only one segment (x,y)--(x+1,y).
This is worth further thought. How does Data.Set work, and why is it better for removing duplicates than an unordered list? That last bit's kind of a give-away, because Data.Set is precisely an ordered collection, so by giving each item a representation that sorts uniquely, you get the combined benefits of automatically removing duplicates and fast lookup.
As mentioned above, your actual problem is conceptually similar to finding duplicates; instead of finding overlapping segments, you want adjacent ones. Can using Data.Set help you here as well?
Alas, it cannot. To see why, think about how the sorting works. Given two items X and Y, there are three possible comparisons: X < Y, X == Y, or X > Y. If distinct, adjacent elements differ by the minimum amount possible, you can safely examine only elements that are adjacent in the sorted collection. But this cannot generalize to line segments for multiple reasons, the simplest being that up to four distinct elements can be adjacent, which cannot be described in a sorted sequence.
Hopefully I've been heavy-handed enough with my hints that you're now wondering what a sorted collection that does allow four distinct elements to be adjacent would look like, and whether it would allow easy searching the way Data.Set does.
The answer to the latter is yes, absolutely, and the answer to the first is that it would be a higher-dimensional search tree. A simple binary search tree looks something like this:
data Tree a = Leaf | Branch a (Tree a) (Tree a)
...where you ensure that at any branch, all leaf values in the left half are smaller than those in the right. A simple 2-dimensional search tree would instead look something like this:
data Tree a = Leaf | Branch a (Tree a) (Tree a) (Tree a) (Tree a)
...where each branch represents a quadrant, sorting by comparing on the two axes independently. Otherwise, it works just like familiar 1-dimensional search trees, with straightforward translations of many standard algorithms, and given a specific line segment you can quickly search for any adjacent segments.
Edit: In hindsight, I got a little too wrapped-up in exposition and forgot to give references. This is not at all a novel concept, and has many extant variations:
What I described would be called a point Quadtree and is a simple extension of binary search trees, like Data.Set.
The same concept can be done with regions instead of discrete points, with lookups ending at regions that are either fully included or excluded. These are similar extensions of tries, like Data.IntSet.
A variation called the R-tree is similar to a B-tree and has useful performance characteristics for some purposes.
The concepts extend just as well to higher dimensions, as well. Data structures along these lines are used for rendering and collision detection in simulations and video games, spatial databases with "nearest neighbor" searches, as well as more abstract applications you wouldn't normally think of geometrically, where sparse data points can be sorted along multiple axes and some combined notion of "distance" is meaningful.
Oddly enough, I've been unable to find any implementation of such data structures on Hackage, besides one incomplete and seemingly-abandoned package.
If I understand the problem description correctly, each segment can take part in up to four possible corners, each of which identifies a specific complementary segment. Given a list of segments, we can then walk down the list, seeing which possible two-segment corners are present, and then figure out where those segments meet. This is a very slow approach due to the repeated list traversals, but if you are only dealing with handfuls of segments it is at least fairly concise.
data Segment = Horizontal (Int,Int) | Vertical (Int,Int) deriving (Eq, Show)
example = [ Horizontal (0,0)
          , Vertical (2,0)
          , Horizontal (1,0)
          , Horizontal (2,2)
          , Vertical (2,1) ]

corners [] = []
corners (Horizontal (x,y):xs) = ns ++ corners xs
  where ns = map cornerLoc . filter (`elem` xs) $
             map Vertical [(x,y),(x+1,y),(x,y-1),(x+1,y-1)]
        cornerLoc (Vertical (x',_)) = (max x x', y)
corners (Vertical (x,y):xs) = ns ++ corners xs
  where ns = map cornerLoc . filter (`elem` xs) $
             map Horizontal [(x,y),(x,y+1),(x-1,y),(x-1,y+1)]
        cornerLoc (Horizontal (_,y')) = (x, max y y')
My goal is a more efficient implementation of the algorithm posed in this question.
Consider two sets of points (in N-space; 3-space for the example case of RGB colorspace, while a solution for 1-space or 2-space differs only in the distance calculation). How do you find the point in the first set that is farthest from its nearest neighbor in the second set?
In a 1-space example, given the sets A: {2,4,6,8} and B: {1,3,5}, the answer would be 8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. Edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.
The solution in the source question involves a brute-force comparison of every point in one set (all R,G,B where 512 >= R+G+B >= 256 and R%4 == 0 and G%4 == 0 and B%4 == 0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is generated programmatically instead of iterated over as a stored list like the second set.
First you need to find every element's nearest neighbor in the other set.
To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.
Do this once for each element in the smaller set: take each element of the smaller set in turn and run the algorithm against the tree built on the larger set to find its nearest neighbor.
From this you should be able to get a list of nearest neighbors for each element.
While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.
Then, once you're done simply ask the heap for the max.
The run time for this breaks down as follows:
N = size of smaller set
M = size of the larger set
N * O(log M + 1) for all the kd-tree nearest neighbor checks.
N * O(1) for calculating the Euclidean distance before adding it to the heap.
N * O(log N) for adding the pairs into the heap.
O(1) to get the final answer :D
So in the end the whole algorithm is O(N*log M).
If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.
*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
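For the overall flow, a small Java sketch (names are illustrative): it returns the point of A whose nearest neighbour in B is farthest. nearestDist is written brute-force here only to keep the sketch self-contained; the answer above would replace it with a kd-tree query:

class FarthestNearest {
    static double[] farthestNearest(double[][] A, double[][] B) {
        double[] best = null;
        double bestDist = -1;
        for (double[] a : A) {
            double d = nearestDist(a, B); // distance from a to its nearest neighbour in B
            if (d > bestDist) { bestDist = d; best = a; }
        }
        return best;
    }

    // brute-force stand-in for the kd-tree nearest-neighbour query
    static double nearestDist(double[] p, double[][] set) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] q : set) {
            double s = 0;
            for (int i = 0; i < p.length; i++) s += (p[i] - q[i]) * (p[i] - q[i]);
            min = Math.min(min, s);
        }
        return Math.sqrt(min);
    }
}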
The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.
Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.
It's O(n log n) to build the tree and O(log n) for one search, so the whole thing should run in O(n log n).
To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.
For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.
In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.
Then, to test point n, find the square/cube/group that contains n and test the distance to those points. You only need to test points from neighbouring groups if point n is closer to the edge of its group than to its nearest neighbour within the group.
For each point in set B, find the distance to its nearest neighbor in set A.
To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.
Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.
EDIT: I meant O(n log n), where n is the sum of the sizes of both sets.
In the 1-space case, you could do something like this (pseudocode):
Use a structure like this
Struct Item {
int value
int setid
}
(1) Max Distance = 0
(2) Read all the sets into Item structures
(3) Create an Array of pointers to all the Items
(4) Sort the array of pointers by Item->value field of the structure
(5) Walk the array from beginning to end, checking whether the Item->setid differs from the previous Item->setid
if (SetIDs are different)
    check if the distance between the two items is greater than MaxDistance; if so, set MaxDistance to this distance
Return the max distance.