I have a large number of objects each representing a numeric range (e.g. [1,3], [2,8], [3,3]). I want to be able to rapidly query for all of the ranges overlapping a given range. This is the one-dimensional equivalent of standard 2D or 3D spatial indexes, such as R-trees.
For example:
Data = [0,1], [1,3], [2,8], [3,3], [5,9]
Query = [2,4]
Output = [1,3], [2,8], [3,3]
I'd like adding items to the structure and removing items from it to typically run in O(log N), and searching the structure to typically run in O(log N) as well.
Is there a good fit among well-understood, standard data structures?
An interval tree comes to mind. It is (description adapted from Wikipedia):
A tree with each node storing:
A center point
A pointer to another node containing all intervals completely to the left of the center point
A pointer to another node containing all intervals completely to the right of the center point
All intervals overlapping the center point sorted by their beginning point
All intervals overlapping the center point sorted by their ending point
It allows for O(log n + m) interval intersection queries, where m is the number of intersecting intervals.
You'll have to look at the Wikipedia article for more details regarding construction and querying.
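To make this concrete, here is a rough, static C++ sketch of such a centered interval tree (build once, then query); it does not support the O(log N) insert/delete asked for, and all names (Interval, Node, build, query) are illustrative rather than taken from any particular source:

#include <algorithm>
#include <memory>
#include <vector>

struct Interval { double lo, hi; };

struct Node {
    double center;                       // the node's center point
    std::vector<Interval> byLo;          // intervals overlapping center, sorted by lo
    std::vector<Interval> byHi;          // intervals overlapping center, sorted by hi
    std::unique_ptr<Node> left, right;   // intervals entirely left / right of center
};

std::unique_ptr<Node> build(std::vector<Interval> ivals) {
    if (ivals.empty()) return nullptr;
    auto node = std::make_unique<Node>();
    // pick the median endpoint as the center point
    std::vector<double> pts;
    for (const Interval& iv : ivals) { pts.push_back(iv.lo); pts.push_back(iv.hi); }
    std::nth_element(pts.begin(), pts.begin() + pts.size() / 2, pts.end());
    node->center = pts[pts.size() / 2];

    std::vector<Interval> leftSet, rightSet;
    for (const Interval& iv : ivals) {
        if (iv.hi < node->center)      leftSet.push_back(iv);
        else if (iv.lo > node->center) rightSet.push_back(iv);
        else { node->byLo.push_back(iv); node->byHi.push_back(iv); }
    }
    std::sort(node->byLo.begin(), node->byLo.end(),
              [](const Interval& a, const Interval& b) { return a.lo < b.lo; });
    std::sort(node->byHi.begin(), node->byHi.end(),
              [](const Interval& a, const Interval& b) { return a.hi < b.hi; });
    node->left  = build(std::move(leftSet));
    node->right = build(std::move(rightSet));
    return node;
}

// report every stored interval overlapping the closed query [qlo, qhi]
void query(const Node* n, double qlo, double qhi, std::vector<Interval>& out) {
    if (!n) return;
    if (qlo <= n->center && n->center <= qhi) {
        // the query spans the center: every interval stored here overlaps it
        out.insert(out.end(), n->byLo.begin(), n->byLo.end());
        query(n->left.get(),  qlo, qhi, out);
        query(n->right.get(), qlo, qhi, out);
    } else if (qhi < n->center) {
        // only stored intervals whose lo is <= qhi can overlap the query
        for (const Interval& iv : n->byLo) { if (iv.lo > qhi) break; out.push_back(iv); }
        query(n->left.get(), qlo, qhi, out);
    } else {  // qlo > center
        // only stored intervals whose hi is >= qlo can overlap the query
        for (auto it = n->byHi.rbegin(); it != n->byHi.rend(); ++it) {
            if (it->hi < qlo) break;
            out.push_back(*it);
        }
        query(n->right.get(), qlo, qhi, out);
    }
}

Running query on the example data with [2,4] reports exactly [1,3], [2,8] and [3,3].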
Related
I recently came across this algorithmic question in an interview. The question goes something like:
Initially a rectangle is given (starting at the origin (0,0) and ending at (n,m)). Then there are q queries of the form x = r or y = c, each of which cuts the current rectangles into smaller rectangles. After each query, we have to return the size of the largest rectangle currently present.
See the diagram:
So, here we were initially given a rectangle from (0,0) to (6,6) [a square, in fact!]. Now, after the 1st query x = 2 (shown as a dotted line in the diagram), the largest rectangle size is 24. After the second query y = 1, the largest rectangle size is 20. And this is how it goes on.
My approach to solving this:
At every query, find:
The largest interval on the x axis (maxX) [keep storing all the x = r values in a list]
The largest interval on y axis (maxY) [keep storing all the y = c values in another list]
At every query, your answer is (maxX * maxY)
For finding maxX and maxY, I would have to iterate through the whole list, which is not very efficient.
So, I have 2 questions:
Is my solution correct? If not, what is the correct approach to the problem? If yes, how can I optimise my solution?
It's correct but takes O(n) time per query.
You could, for each dimension, have one binary search tree (or other sorted container with O(log n) operations) for the coordinates (initially two) and one for the interval sizes. Then for each query in that dimension:
Add the new coordinate to the coordinates.
From its neighbors, compute the interval's old size and remove that from the sizes.
Compute the two new intervals' sizes and add them to the sizes.
The largest size is at the end of the sizes.
This would be O(log n) per query.
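For illustration, a minimal C++ sketch of that bookkeeping for one axis, assuming each cut coordinate is distinct from the existing ones; the Axis struct and its member names are mine, not part of the answer:

#include <cstdio>
#include <iterator>
#include <set>

struct Axis {
    std::multiset<long long> cuts;   // cut coordinates, including the two borders
    std::multiset<long long> sizes;  // current interval lengths

    explicit Axis(long long length) {
        cuts.insert(0);
        cuts.insert(length);
        sizes.insert(length);
    }

    // assumes c lies strictly between two existing cuts
    void cut(long long c) {
        auto hi = cuts.lower_bound(c);       // neighbor to the right
        auto lo = std::prev(hi);             // neighbor to the left
        sizes.erase(sizes.find(*hi - *lo));  // the old interval disappears
        sizes.insert(c - *lo);               // ...and two new ones appear
        sizes.insert(*hi - c);
        cuts.insert(c);
    }

    long long largest() const { return *sizes.rbegin(); }
};

int main() {
    Axis x(6), y(6);                  // the 6x6 square from the example
    x.cut(2);
    std::printf("%lld\n", x.largest() * y.largest());  // 24
    y.cut(1);
    std::printf("%lld\n", x.largest() * y.largest());  // 20
}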
Yes, your algorithm is correct.
To optimize it, first of all, consider only one dimension, because the two dimensions in your geometry are fully orthogonal.
So, you need to have a data structure which holds a partitioning of an interval into sub-intervals, and supports fast application of these two operations:
Split a given interval into two
Find a largest interval
You can do that by using two sorted lists, one sorted by coordinate, and the other sorted by size. You should have pointers from one data structure to the other, and vice-versa.
To implement the "splitting" operation:
Find the interval which you should split, using binary search in the coordinate-sorted list
Remove the interval from both lists
Add two smaller intervals to both lists
I have an STL file that contains the coordinates (x,y,z) of the 3 points (p0, p1, p2) of each triangle. These triangles represent a 3D surface f(x,y,z). The STL file might have over 1000 triangles to represent a complex geometry.
For my application, I need to know the neighboring triangles for each triangle entry from the STL file. That means that for each triangle I have to pick 3 pairs of points, pair1=(p0,p1), pair2=(p0,p2), pair3=(p1,p2), and compare them with the pairs of points of the other triangles in the set.
What's the best and most efficient algorithm to achieve this? Can I use a hash tree or hashmap?
Change the mesh representation to a point table and a triangle-face table. STL demands that all triangles are joined at their vertices, with no cutting of edges, which means neighboring triangles always share one complete edge.
double pnt[points][3];    // all distinct vertices
int    tri[triangles][3]; // 3 indices into pnt[] per triangle
The pnt table should be a list of all distinct points (index-sort it to improve speed for high point counts). Each tri entry should contain the 3 indexes of the points used in that triangle. Sort them (ascending or descending) to improve match speed.
Now, if any triangle tri[i] shares an edge with tri[j], then those two are neighboring triangles, e.g.:
if ((tri[i][0]==tri[j][0]) && (tri[i][1]==tri[j][1])
 || (tri[i][0]==tri[j][1]) && (tri[i][1]==tri[j][2]))
    // then triangles i, j are neighbors
Add all combinations ...
If you need just neighboring points, then find all triangles containing that point; all the other points used in those triangles are its neighbors.
To load an STL into such a structure, do this:
clear pnt[],tri[] lists/tables
process each triangle of STL
for each point of triangle
look whether it is already in pnt[]; if yes, use its index for the new triangle; if not, add the new point to pnt[] and use its index. When all 3 points are done, add the new triangle to tri[] (see the sketch below).
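A rough sketch of this loading step. For brevity it uses std::map for the "is this point already in pnt[]?" lookup instead of the index sort described below; the Mesh/buildMesh names are made up for the example:

#include <algorithm>
#include <array>
#include <map>
#include <vector>

using Point = std::array<double, 3>;

struct Mesh {
    std::vector<Point>              pnt;  // all distinct vertices
    std::vector<std::array<int, 3>> tri;  // 3 vertex indices per triangle
};

Mesh buildMesh(const std::vector<std::array<Point, 3>>& stlTriangles) {
    Mesh m;
    std::map<Point, int> index;           // exact point -> position in m.pnt
    for (const auto& t : stlTriangles) {
        std::array<int, 3> face{};
        for (int k = 0; k < 3; ++k) {
            auto it = index.find(t[k]);
            if (it == index.end()) {      // new point: append it and remember its index
                it = index.emplace(t[k], (int)m.pnt.size()).first;
                m.pnt.push_back(t[k]);
            }
            face[k] = it->second;
        }
        std::sort(face.begin(), face.end());  // sorted indices simplify edge matching
        m.tri.push_back(face);
    }
    return m;
}

The exact coordinate comparison is fine here because STL repeats shared vertices with identical values.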
Improving pnt[] performance
Add an index sort for pnt[], sorted by any coordinate (for example x), to improve the performance of checking whether a point is already present in pnt[].
So while adding (xi,yi,zi) into pnt[], use binary search to find the index i0 of the point with the biggest x still satisfying xi >= pnt[i0][0], and then scan the points in pnt[] until x crosses xi (i.e. until xi < pnt[i1][0]). This way you do not need to check all points.
If this is too slow (usually when the number of points is bigger than 40000), you can improve performance further with segmented index sorting (dividing the index sort into segment pages of finite size, e.g. 8192 points).
Improving tri[] performance
You can also sort the tri[] by tri[i][0] so you can use binary search similarly to pnt[].
I would suggest going with a hashmap where the values are sets (tree-based) of references to Triangles, the keys are pairs of Points (let's call these pairs simply Sides), and the hashing function takes into account the property that the hash of Side (a,b) should be equal to the hash of (b,a).
Some kind of algorithm:
Read 3 Points and create from them 3 Sides and a Triangle.
Add all that to the hashmap: map[side[i]].insert(triangle)
Repeat 1-2 until you read all the data
Now you have a map filled with data. About the complexity of filling: insertion into a hashmap is constant time on average (it also depends on the hash function) and insertion into a set is logarithmic, so the complete complexity of filling the data is O(n*log m), where n is the number of Sides and m is the average number of Triangles with the same Side.
Normally each set would contain around 2 Triangles (in a well-formed mesh each Side is shared by exactly two Triangles), so log m is relatively small (compared to n) and can be treated as a constant. This leads to the following conclusion: the best-case complexity for filling is O(n) (no collisions, no rehashing, etc.) and the worst case is O(n*log n) (average-case insertion of n Sides into the map, but log n insertion into a single set, meaning all Triangles share the same Side).
Now to get all side-neighbours for some Triangle:
Get all 3 sets, one for each Side of that Triangle (e.g. set[i] = map[triangle.sides[i]]).
Take the union of those 3 sets and exclude the triangle itself to get only its side-neighbours (each side-neighbour appears in exactly one of the three sets).
Done.
About the complexity of getting side-neighbours: it depends linearly on the sizes of the sets, which are small compared to n in the normal case.
Note: to get point-neighbours instead of side-neighbours (i.e. if any 2 Triangles with a common Point, not a common Side, are called neighbours), simply use single Points instead of Sides as keys. The above assumptions about time complexity hold except for constants.
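A compact sketch of this idea, assuming vertices have already been deduplicated to integer indices (as in the previous answer); SideHash, makeSide and the sample data are invented for the example:

#include <array>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

using Side = std::pair<int, int>;

// store a side with its smaller index first, so (a,b) and (b,a) get the same key
Side makeSide(int a, int b) { return a < b ? Side{a, b} : Side{b, a}; }

struct SideHash {
    size_t operator()(const Side& s) const {
        return std::hash<std::uint64_t>()(((std::uint64_t)s.first << 32) ^ (unsigned)s.second);
    }
};

int main() {
    std::vector<std::array<int, 3>> tri = {{0, 1, 2}, {1, 2, 3}, {2, 3, 4}};
    std::unordered_map<Side, std::set<int>, SideHash> bySide;  // Side -> triangle indices

    for (int i = 0; i < (int)tri.size(); ++i) {
        bySide[makeSide(tri[i][0], tri[i][1])].insert(i);
        bySide[makeSide(tri[i][0], tri[i][2])].insert(i);
        bySide[makeSide(tri[i][1], tri[i][2])].insert(i);
    }

    // side-neighbours of triangle 1: every other triangle appearing in one of its 3 sets
    std::set<int> neighbours;
    for (Side s : {makeSide(1, 2), makeSide(1, 3), makeSide(2, 3)})
        for (int t : bySide[s])
            if (t != 1) neighbours.insert(t);
    // neighbours now holds {0, 2}
}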
My problem is:
We have a set of N points in a 2D space, each of which has a weight. Given any rectangle region R, how to efficiently return the point with the largest weight inside R?
Note that all query regions R have the same shape, i.e. same lengths and widths. And point and rectangle coordinates are float numbers.
My initial idea is to use an R-tree to store the points. For a region R, extract all points in R, and then find the point with the maximum weight. The time complexity is O(log N + V), where V is the number of points in R. Can we do better?
I tried to search for a solution, but without success. Any suggestions?
This sounds like a 2D range maximum query problem, for which some very good algorithms exist.
Simpler would be to just use a 2D segment tree.
Basically, each node of your segment tree will store a 2D region instead of a 1D interval. So each node will have 4 children, which reduces it to a quadtree that you can then operate on as you would a classic segment tree.
This will be O(log n) per query, where n is the total number of points. And it also allows you to do a lot more operations, such as update a point's weight, update a region's weights etc.
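As a rough illustration (not the exact structure from the answer), here is a C++ sketch of a point quadtree whose nodes cache the maximum weight of their subtree; QNode, queryMax and the depth cutoff of 10 are arbitrary choices for the example:

#include <algorithm>
#include <limits>
#include <memory>
#include <vector>

struct Pt { double x, y, w; };

struct QNode {
    double x0, y0, x1, y1;              // region covered by this node
    double maxW = -std::numeric_limits<double>::infinity();  // max weight in the subtree
    std::vector<Pt> pts;                // points stored at the deepest level
    std::unique_ptr<QNode> child[4];
    QNode(double a, double b, double c, double d) : x0(a), y0(b), x1(c), y1(d) {}
};

void insert(QNode* n, const Pt& p, int depth = 0) {
    n->maxW = std::max(n->maxW, p.w);
    if (depth == 10) { n->pts.push_back(p); return; }        // stop subdividing
    double mx = (n->x0 + n->x1) / 2, my = (n->y0 + n->y1) / 2;
    int q = (p.x >= mx) + 2 * (p.y >= my);                   // which quadrant p falls into
    double cx0 = (q & 1) ? mx : n->x0, cx1 = (q & 1) ? n->x1 : mx;
    double cy0 = (q & 2) ? my : n->y0, cy1 = (q & 2) ? n->y1 : my;
    if (!n->child[q]) n->child[q] = std::make_unique<QNode>(cx0, cy0, cx1, cy1);
    insert(n->child[q].get(), p, depth + 1);
}

// maximum weight among points inside [qx0,qx1] x [qy0,qy1]
double queryMax(const QNode* n, double qx0, double qy0, double qx1, double qy1) {
    if (!n || n->x1 < qx0 || qx1 < n->x0 || n->y1 < qy0 || qy1 < n->y0)
        return -std::numeric_limits<double>::infinity();     // disjoint: prune
    if (qx0 <= n->x0 && n->x1 <= qx1 && qy0 <= n->y0 && n->y1 <= qy1)
        return n->maxW;                                      // fully contained: cached answer
    double best = -std::numeric_limits<double>::infinity();
    for (const Pt& p : n->pts)
        if (qx0 <= p.x && p.x <= qx1 && qy0 <= p.y && p.y <= qy1)
            best = std::max(best, p.w);
    for (const auto& c : n->child)
        best = std::max(best, queryMax(c.get(), qx0, qy0, qx1, qy1));
    return best;
}

The root is created with a bounding box of all the points. Note that increasing a point's weight only needs maxW refreshed along its path, while decreasing it requires recomputing the maxima of the affected nodes.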
How about adding an additional attribute to each tree node which contains the maximum weight of all the points contained by any of its children?
This will be easy to update while adding points. It'll be a little more work to maintain when you remove a point that changes the maximum. You'll have to traverse the tree backwards and update the max value for all parent nodes.
With this attribute, if you want to retrieve the maximum-weight point then when you query the tree with your query region you only inspect the child node with the maximum weight as you traverse the tree. Note, you may have more than one point with the same maximum weight so you may have more than one child node to inspect.
Only inspecting the child nodes with a maximum weight attribute will improve your query throughput at the expense of more memory and slower time building/modifying the tree.
Look up so-called range trees, which in your case you would want to implement in two dimensions. This would be a two-layer "tree of trees": you first split the set of points based on x-coordinate, and then for each set of points at a node of the resulting tree, you build a tree based on y-coordinate for the points at that node. You can look up how to adapt a 2D range tree to return the number of points in a query rectangle in O((log n)^2) time, independent of the number of points. Similarly, instead of storing the count of points for sub-rectangles in the range tree, you can store the maximum objective value of points within that rectangle. This gives you guaranteed O(n log n) storage and construction time, and O((log n)^2) query time, regardless of the number of points in the query rectangle.
An adaptation of so-called "fractional cascading" for the range-tree "find all points in the query rectangle" query might even be able to get your query time down to O(log n), but I'm not sure, since you are taking the max of the values of points within the query rectangle.
Hint:
Every point has a "zone of influence", which is the locus of positions of the (top-left corner of the) query rectangle such that this point dominates. The set of zones of influence defines a partition of the plane. Every edge of the partition occurs at the abscissa [ordinate] of a given point or at its abscissa [ordinate] minus the width [height] of the query region.
If you map the coordinate values to their ranks (by sorting on both axes), you can represent the partition as a digital image of size 4N². To precompute this image, initialize it with minus infinity, and for every point fill its zone of influence with its weight, taking the maximum. If the query window covers R² pixels on average, the cost of constructing the image is NR².
A query is made by finding the row and column of the relevant pixel and returning the pixel value. This takes two binary (dichotomic) searches, in O(log N) time.
This approach is only realistic for moderate values of N (say up to 1000). Better insight can be gained on this problem by studying the geometry of the partition map.
You can try a weighted Voronoi diagram where the (positive) weight is subtracted from the Euclidean distance. Sites with big weights tend to have big cells when the nearby sites have small weights. Then sort the cells by the number of sites and compute a minimum bounding box for each cell. Match it against the rectangular search box.
Let's say we have n empty boxes in a row. We are going to put m groups of coins into some consecutive boxes, which are known in advance. We put the 1st group of coins into boxes i_1 through j_1, the 2nd group into boxes i_2 through j_2, and so on.
Let c_i be the number of coins in box i after putting all the coins into the boxes. We want to be able to quickly determine how many coins there are in the boxes with indexes i = s, s+1, ..., e-1, e, i.e. we want to compute the sum
c_s + c_(s+1) + ... + c_e
efficiently. This can be done using a Fenwick tree. Without any improvements, a Fenwick tree needs O(n) space for storing the c_i's (in a table; actually tree[i] != c_i, the values are stored in a smarter way) and O(log n) time for computing the above sum.
If we have the case where
n is too big for us to make a table of length n (let's say ~ 10 000 000 000)
m is sufficiently small (let's say ~ 500 000)
there is a way to compress the coordinates (indexes) of the boxes, i.e. it suffices to store just the boxes with indexes i_1, j_1, ..., i_m, j_m. Since the value stored in tree[i] depends on the binary representation of i, my idea is to sort the indexes i_1, j_1, i_2, j_2, ..., i_m, j_m and build a tree of length O(m). Adding a new value to the tree would then be straightforward. Also, to compute the above sum, we only have to find the largest stored index not greater than e and the smallest not smaller than s. Both can be done with binary search. After that the sum can be easily computed.
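For reference, a hedged C++ sketch of the compressed 1D version (the CoinBoxes/Fenwick names and interface are mine). Writing d for the difference array of c, a prefix sum can be expressed as prefix(i) = (i+1) * (sum of d_k for k <= i) - (sum of k*d_k for k <= i); both inner sums only ever touch the interval endpoints, so two Fenwick trees of size O(m) over the compressed endpoints suffice. The coordinate list passed to the constructor must contain every i_k and every j_k + 1:

#include <algorithm>
#include <cstdint>
#include <vector>

struct Fenwick {
    std::vector<long long> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}
    void add(int i, long long v) { for (++i; i < (int)t.size(); i += i & -i) t[i] += v; }
    long long sum(int i) const {                 // sum of entries at positions 0..i
        long long s = 0;
        for (++i; i > 0; i -= i & -i) s += t[i];
        return s;
    }
};

struct CoinBoxes {
    std::vector<int64_t> xs;   // compressed coordinates (sorted, unique)
    Fenwick d, kd;             // d_k and k*d_k, stored at compressed positions

    explicit CoinBoxes(std::vector<int64_t> coords) : xs(std::move(coords)), d(0), kd(0) {
        std::sort(xs.begin(), xs.end());
        xs.erase(std::unique(xs.begin(), xs.end()), xs.end());
        d  = Fenwick((int)xs.size());
        kd = Fenwick((int)xs.size());
    }
    int pos(int64_t x) const {                   // x must be one of the stored coordinates
        return (int)(std::lower_bound(xs.begin(), xs.end(), x) - xs.begin());
    }
    void addCoins(int64_t i, int64_t j, long long v) {   // v coins into each of boxes i..j
        d.add(pos(i), v);       kd.add(pos(i), v * i);
        d.add(pos(j + 1), -v);  kd.add(pos(j + 1), -v * (j + 1));
    }
    long long prefix(int64_t i) const {          // total coins in boxes up to and including i
        int p = (int)(std::upper_bound(xs.begin(), xs.end(), i) - xs.begin()) - 1;
        if (p < 0) return 0;
        return (i + 1) * d.sum(p) - kd.sum(p);
    }
    long long rangeSum(int64_t s, int64_t e) const { return prefix(e) - prefix(s - 1); }
};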
The problem occurs in the 2D case. Now we have an area of points (x,y) in the plane, 0 < x,y < n. There are m rectangles in that area. We know the coordinates of their lower-left and upper-right corners, and we want to compute how many rectangles contain a point (a,b). The simplest (and my only) idea is to follow the approach from the 1D case: for each corner x-coordinate x_i, store all the corner y-coordinates. The idea is not so clever, since it needs O(m^2) space, which is too much. My question is:
How to store coordinates in the tree in a more efficient way?
Solutions of the problem that use Fenwick trees are preferred, but every solution is welcome!
The easiest approach is to use a map/unordered_map instead of a 2D array. In that case you don't even need coordinate compression. The map will create a key-value pair only when it is needed, so it creates log^2(n) key-value pairs for each point from the input.
Alternatively, you could use a segment tree based on pointers (instead of arrays) with lazy initialisation (you create a node only when it is needed).
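A short sketch of the map-backed 2D Fenwick idea (the SparseBIT2D, addRectangle and countContaining names are invented): each rectangle contributes +1/-1 at its four corners, i.e. a 2D difference update, and a prefix sum at (a,b) then counts the rectangles containing that point; std::map only materialises the cells that are actually touched:

#include <cstdint>
#include <map>
#include <utility>

struct SparseBIT2D {
    int64_t n;
    std::map<std::pair<int64_t, int64_t>, long long> t;   // only touched cells exist

    explicit SparseBIT2D(int64_t n) : n(n) {}

    void add(int64_t x, int64_t y, long long v) {
        for (int64_t i = x; i <= n; i += i & -i)
            for (int64_t j = y; j <= n; j += j & -j)
                t[{i, j}] += v;
    }
    long long prefixSum(int64_t x, int64_t y) const {      // sum over [1..x] x [1..y]
        long long s = 0;
        for (int64_t i = x; i > 0; i -= i & -i)
            for (int64_t j = y; j > 0; j -= j & -j) {
                auto it = t.find({i, j});
                if (it != t.end()) s += it->second;
            }
        return s;
    }
    // rectangle [x1,x2] x [y1,y2] now contains every point inside it
    // (updates at coordinate n+1 fall outside every query and are silently dropped)
    void addRectangle(int64_t x1, int64_t y1, int64_t x2, int64_t y2) {
        add(x1, y1, +1);     add(x1, y2 + 1, -1);
        add(x2 + 1, y1, -1); add(x2 + 1, y2 + 1, +1);
    }
    long long countContaining(int64_t a, int64_t b) const { return prefixSum(a, b); }
};

Swapping std::map for an unordered_map with a pair hash gives the hash-based variant mentioned above.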
Use a 2D segment tree. Notice that for each canonical segment over the y-coordinate you can build a (1D) segment tree over x-coordinates, using only the points lying in the zone y_min <= y < y_max, where y_min and y_max are the bounds of that canonical y-segment. This implies that each input point appears in only log(n) of the x-coordinate segment trees, which makes O(n log n) memory in total.
Consider a 2000 x 2000 2D bool array. 100,000 elements are set to true, the rest to false.
Given a cell (x1,y1) we need to find the nearest cell (x2,y2) (by manhattan distance: abs(x1-x2) + abs(y1-y2)) that is false.
One way to do that would be to:
for (int dist = 0; true; dist++)
    for ((x2,y2) in all cells dist away from (x1,y1))
        if (!array[x2,y2])
            return (x2,y2);
In the worst case we would have to iterate through 100,000 cells before finding the free one.
Is there a data structure we could use rather than a 2D array that would allow us to perform this search quicker?
If the data is constant and you have many queries on it:
You might want to use a k-d tree and look for the nearest neighbor. Insert (i,j) for each element such that arr[i][j] = false. The standard k-d tree uses Euclidean distance, but I think one can modify it to use Manhattan distance instead; a sketch follows.
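A rough sketch of such a k-d tree over the false cells with a Manhattan-distance nearest query (the KDTree class and its method names are illustrative, not a standard library). The usual "can the far side of the split still contain something closer?" test, |signed offset to the split| < best, remains valid for Manhattan distance:

#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>

struct KDTree {
    struct Node { int x, y; Node* lo; Node* hi; };
    std::vector<Node> pool;   // reserved up front so node pointers stay valid
    Node* root = nullptr;

    explicit KDTree(std::vector<std::pair<int, int>> pts) {
        pool.reserve(pts.size());
        root = build(pts.begin(), pts.end(), 0);
    }

    Node* build(std::vector<std::pair<int, int>>::iterator first,
                std::vector<std::pair<int, int>>::iterator last, int axis) {
        if (first == last) return nullptr;
        auto mid = first + (last - first) / 2;
        std::nth_element(first, mid, last,
                         [axis](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                             return axis ? a.second < b.second : a.first < b.first;
                         });
        pool.push_back({mid->first, mid->second, nullptr, nullptr});
        Node* n = &pool.back();
        n->lo = build(first, mid, axis ^ 1);
        n->hi = build(mid + 1, last, axis ^ 1);
        return n;
    }

    // nearest stored cell to (qx,qy) by Manhattan distance
    std::pair<int, int> nearest(int qx, int qy) const {
        int best = 1 << 30;
        std::pair<int, int> bestPt{-1, -1};
        search(root, 0, qx, qy, best, bestPt);
        return bestPt;
    }

    void search(const Node* n, int axis, int qx, int qy,
                int& best, std::pair<int, int>& bestPt) const {
        if (!n) return;
        int d = std::abs(n->x - qx) + std::abs(n->y - qy);
        if (d < best) { best = d; bestPt = {n->x, n->y}; }
        int diff = axis ? qy - n->y : qx - n->x;     // signed offset from the splitting line
        const Node* nearSide = diff < 0 ? n->lo : n->hi;
        const Node* farSide  = diff < 0 ? n->hi : n->lo;
        search(nearSide, axis ^ 1, qx, qy, best, bestPt);
        if (std::abs(diff) < best)                   // the far side may still hold a closer cell
            search(farSide, axis ^ 1, qx, qy, best, bestPt);
    }
};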
If the data is used for one query:
You will need at least Omega(n*m) operations to read the data and insert it into any data structure, so there is no point in doing that: the approach suggested in the question will outperform even just building up any data structure.
You might be interested in looking into a Region QuadTree. Initially the entire image is modeled as the root, since the image contains all 0s (by assumption). Then, when a particular pixel is set, the image is first divided into 4 quadrants and the 3 quadrants that do not include the pixel are left as leaves. The remaining quadrant is subdivided again, and so on, until we have 4 single-pixel leaves, one of which is set.
This representation helps rule out entire regions during the search, and the search time can be optimized to O(log n).