Interview question:
Given billions of rectangles, find the rectangle with the minimum area that contains a given point P(x,y).
There is a simple way to get the answer in O(n) time by processing each rectangle sequentially, but can it be optimized further, given the very large number of rectangles?
My best approach would be to check each rectangle, see if the point is inside, then calculate its area and compare it with the current smallest area. This can be done in a single pass. I can't think of any other method that doesn't require checking all the rectangles.
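For concreteness, a minimal sketch of that single pass, assuming each rectangle is an axis-aligned tuple (x1, y1, x2, y2):

```python
# A minimal sketch of the single pass: rectangles are assumed to be
# axis-aligned tuples (x1, y1, x2, y2).
def smallest_rect_containing(rects, px, py):
    best, best_area = None, float("inf")
    for (x1, y1, x2, y2) in rects:
        if x1 <= px <= x2 and y1 <= py <= y2:   # does this rectangle cover P?
            area = (x2 - x1) * (y2 - y1)
            if area < best_area:                # keep the smallest one so far
                best_area, best = area, (x1, y1, x2, y2)
    return best
```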
If you use the same rectangle set for many point queries, then an R-tree data structure lets you find which rectangles contain a given point without checking all of them.
You do need to process all rectangles at least once, of course. But use that first pass wisely and you can get multiple faster lookups later on.
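A rough sketch of that idea using the third-party rtree package (which wraps libspatialindex); the rectangle layout and function names here are just illustrative:

```python
# Sketch of repeated point queries against a prebuilt R-tree
# (third-party "rtree" package; rectangles are (x1, y1, x2, y2) tuples).
from rtree import index

def build_rtree(rects):
    idx = index.Index()
    for i, (x1, y1, x2, y2) in enumerate(rects):
        idx.insert(i, (x1, y1, x2, y2))          # one-time build over all rects
    return idx

def smallest_at_point(idx, rects, px, py):
    best, best_area = None, float("inf")
    # intersection() with a degenerate box yields only candidate ids,
    # so each query touches far fewer than n rectangles on average.
    for i in idx.intersection((px, py, px, py)):
        x1, y1, x2, y2 = rects[i]
        area = (x2 - x1) * (y2 - y1)
        if area < best_area:
            best_area, best = area, rects[i]
    return best
```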
I would insert the rectangles into a K-d tree or a Quadtree.
Most probably you are not telling us the full question, because as it stands now, your solution is optimal.
No matter what, you have to go through each rectangle at least once to check whether it actually covers the point and to calculate its area. There is no point in preprocessing them in any way, because you need to answer only one question.
Preprocessing only makes sense if you will need to answer many similar questions in the future.
Related
I've got the following theoretical problem:
I have n cuboids in 3-dimensional space.
They are axis-aligned, so each cuboid can be described by a point (x,y,z) and dimensions (dimX,dimY,dimZ).
I want to organize these cuboids in a way that lets me check whether a newly inserted cuboid intersects with one of the existing ones (collision detection).
To do this I decided to use hierarchical bounding-boxes.
So in sum I have a binary-tree-structure of bounding volumes.
Insertion is then done by recursively computing the distance to both children (the distance between the centers of the two cuboids) and descending into the child with the smaller distance.
Collision detection works similarly, but we follow every sub-path whose bounding volume intersects the given cuboid.
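A rough, self-contained sketch of that insert-by-nearest-child rule and the matching collision query; all names are mine, not from any library, and a cuboid is ((x, y, z), (dimX, dimY, dimZ)):

```python
# Rough sketch of the descend-to-nearest-child insertion and the
# collision query described above; names are illustrative only.
class Node:
    def __init__(self, box, left=None, right=None):
        self.box = box            # bounding box of everything below this node
        self.left = left
        self.right = right

def center(box):
    (x, y, z), (dx, dy, dz) = box
    return (x + dx / 2, y + dy / 2, z + dz / 2)

def dist2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def union(a, b):
    (ax, ay, az), (adx, ady, adz) = a
    (bx, by, bz), (bdx, bdy, bdz) = b
    lo = (min(ax, bx), min(ay, by), min(az, bz))
    hi = (max(ax + adx, bx + bdx), max(ay + ady, by + bdy), max(az + adz, bz + bdz))
    return (lo, tuple(h - l for h, l in zip(hi, lo)))

def overlaps(a, b):
    (ax, ay, az), (adx, ady, adz) = a
    (bx, by, bz), (bdx, bdy, bdz) = b
    return (ax < bx + bdx and bx < ax + adx and
            ay < by + bdy and by < ay + ady and
            az < bz + bdz and bz < az + adz)

def insert(node, box):
    if node is None:
        return Node(box)
    if node.left is None:                      # leaf: pair old and new cuboid
        return Node(union(node.box, box), left=node, right=Node(box))
    # descend into the child whose center is closer to the new cuboid's center
    c = center(box)
    if dist2(c, center(node.left.box)) <= dist2(c, center(node.right.box)):
        node.left = insert(node.left, box)
    else:
        node.right = insert(node.right, box)
    node.box = union(node.box, box)            # grow the bound on the way up
    return node

def collisions(node, box, out):
    """Collect every stored cuboid reachable through intersecting bounds."""
    if node is None or not overlaps(node.box, box):
        return out
    if node.left is None:                      # leaf: an actual cuboid
        out.append(node.box)
    else:
        collisions(node.left, box, out)
        collisions(node.right, box, out)
    return out
```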
The tricky part is how to balance this tree to get good performance when some cuboids are very close to each other and others are far away.
So far I've found no way to use e.g. an AVL tree, because then I'd have to be able to compare two cuboids in some way that does not break the conditions on which collision detection depends.
P.S.: I know there are libraries for this, but I want to understand the principles of collision detection (e.g. in games) in detail, and therefore want to implement it myself.
I've now tried space-partitioning instead of object-partitioning. That's not exactly what I wanted to do, but I found much more helpful information about it, e.g. https://en.wikipedia.org/wiki/Kd-tree
With this information it should be possible to implement it.
I'll do my best to keep my scenario simple.
1) Let's suppose we need to store a lot of rectangles in some kind of array/data structure. The rectangles are of different sizes and positioned at various parts of the screen. Rectangles can't overlap.
2) Then we click on the screen with the mouse at point [x,y].
Now we need to determine whether we clicked inside one of the rectangles. It would be insane to iterate through all the rectangles and compare each one, especially if there is a huge number of them.
What would be the fastest technique/algorithm to do it with as little steps as possible? What would be the best data structure to use in such case?
One way would be to use a quadtree to store the rectangles: The root represents the whole area that contains all rectangles, and this area is then recursively subdivided as required.
If you want to test whether a certain coordinate is within one of the rectangles, you start at the root and walk down the tree until you either find a containing rectangle or run out of nodes to visit.
This can be done in O(log n) time.
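A minimal sketch of such a quadtree (capacity-based subdivision; rectangles that straddle a dividing line stay at the parent node; all names are illustrative and the root bounds are assumed to cover every rectangle):

```python
# Minimal quadtree sketch: rectangles are (x1, y1, x2, y2), stored at the
# smallest node that fully contains them; query descends toward the point.
class QuadTree:
    CAPACITY = 4

    def __init__(self, x1, y1, x2, y2):
        self.bounds = (x1, y1, x2, y2)
        self.rects = []
        self.children = None

    def _subdivide(self):
        x1, y1, x2, y2 = self.bounds
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2
        self.children = [QuadTree(x1, y1, mx, my), QuadTree(mx, y1, x2, my),
                         QuadTree(x1, my, mx, y2), QuadTree(mx, my, x2, y2)]
        for r in self.rects[:]:                 # push fully contained rects down
            for c in self.children:
                if _contains(c.bounds, r):
                    self.rects.remove(r)
                    c.insert(r)
                    break

    def insert(self, rect):
        if self.children is not None:
            for c in self.children:
                if _contains(c.bounds, rect):
                    c.insert(rect)
                    return
        self.rects.append(rect)                 # straddlers stay at this node
        if self.children is None and len(self.rects) > self.CAPACITY:
            self._subdivide()

    def query(self, px, py):
        for (x1, y1, x2, y2) in self.rects:
            if x1 <= px <= x2 and y1 <= py <= y2:
                return (x1, y1, x2, y2)
        if self.children is not None:
            for c in self.children:
                bx1, by1, bx2, by2 = c.bounds
                if bx1 <= px <= bx2 and by1 <= py <= by2:
                    hit = c.query(px, py)
                    if hit is not None:
                        return hit
        return None

def _contains(bounds, rect):
    bx1, by1, bx2, by2 = bounds
    rx1, ry1, rx2, ry2 = rect
    return bx1 <= rx1 and by1 <= ry1 and rx2 <= bx2 and ry2 <= by2
```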
Suppose I have a path, for example one composed of a set of 'p' points.
I also have a set of 'n' randomly placed points, both inside and outside the path.
Comparing all of the random points against all of the points on the path would be expensive. I first thought it would be something like n^p, but it's actually O(n*p).
So to solve the problem I am thinking you could subdivide the area into a region that is completely outside the path, another that is completely inside, and the remainder, as in the figure.
The green set would be inside, the black outside, and the orange would be iterated over again several times.
Is this possible, and more importantly, is it efficient?
Consider looking at point-in-polygon testing literature.
With this large a number of points, I can imagine a y-sweeping approach being sensible and efficient, much more so than the iteration you are trying to do.
Sure, it's possible. But without any idea of how costly it is to FIND these regions, or how costly it is to partition the points into inside and outside, there's no way to evaluate its efficiency.
A sampling-based approach may be your best bet; not a particle filter exactly, but the same sort of idea. Sequential Monte Carlo methods are also something to look into.
http://en.wikipedia.org/wiki/Particle_filter
http://en.wikipedia.org/wiki/Monte_Carlo_method
Essentially, take a random sample of the points and test whether they are inside or outside to build your map. Repeat the sampling a number of times until the map is good enough. You can adjust the sample size and the number of repeats to balance efficiency against quality of the result.
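For instance, a sketch of a single sampling pass, assuming the path is a closed polygon and the points sit in an (N, 2) numpy array, using matplotlib's Path.contains_points to classify a random subset (the map-building/refinement loop is left out):

```python
# Sketch of the sampling step: classify random subsets of the points against
# the path (assumed to be a closed polygon) and average the results.
import numpy as np
from matplotlib.path import Path

def sample_inside_fraction(path_vertices, points, sample_size, repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    poly = Path(path_vertices)                      # the path as a polygon
    fractions = []
    for _ in range(repeats):
        idx = rng.choice(len(points), size=sample_size, replace=False)
        inside = poly.contains_points(points[idx])  # boolean mask for the sample
        fractions.append(inside.mean())
    return float(np.mean(fractions))                # averaged estimate
```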
A similar (i.e., geometric rather than spatial) approach might be to first use a convex hull to exclude points that are definitely outside the path.
Then, recursively decompose the path into subpaths which are not on the hull (call them "concavities"), and apply the same convex hull approach to them. You will end up with a tree of regions (each bounded by the convex hull at its root), such that all children are to be subtracted from the convex bound defined by their parent.
Unfortunately, this alone will not guarantee an efficient query (as there is no upper bound on either the depth or the number of children of a particular node; also note that the child bounds can overlap even if the path does not...). You will still need some form of acceleration structure -- but that should be significantly easier now that your regions are convex.
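A sketch of just the first step (discarding points outside the convex hull of the path), using a monotone-chain hull and a cross-product containment test; the recursive decomposition into concavities is not shown, and both path and query points are assumed to be (x, y) tuples:

```python
# Sketch of the first filtering step: drop points outside the convex hull of
# the path (monotone-chain hull + cross-product containment test).
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(pts):
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]        # counter-clockwise hull

def inside_convex(hull, p):
    # p is inside (or on) a CCW convex polygon iff it is left of every edge
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def maybe_inside_path(path_points, query_points):
    hull = convex_hull(path_points)
    return [q for q in query_points if inside_convex(hull, q)]
```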
What is the original representation of your path? An implicit function? Parametric? A bitmap? A polyline? The answer varies greatly depending on this.
I have a set of points contained within a rectangle. I'd like to split the rectangle into subrectangles based on point density (specifying either a number of subrectangles or a desired density, whichever is easier).
The partitioning doesn't have to be exact (almost any approximation better than a regular grid would do), but the algorithm has to cope with a large number of points - approx. 200 million. The desired number of subrectangles, however, is substantially lower (around 1000).
Does anyone know any algorithm which may help me with this particular task?
Just to understand the problem.
The following is crude and performs badly, but I want to know if the result is what you want:
Assumption: the number of rectangles is even.
Assumption: the point distribution is markedly 2D (no big accumulation along one line).
Procedure:
Bisect n/2 times in either axis: loop from one end to the other of each previously determined rectangle, counting the "passed" points and storing the count at each iteration. Once counted, bisect the rectangle according to the points counted in each loop.
Is that what you want to achieve?
I think I'd start with the following, which is close to what #belisarius already proposed. If you have any additional requirements, such as preferring 'nearly square' rectangles to 'long and thin' ones you'll need to modify this naive approach. I'll assume, for the sake of simplicity, that the points are approximately randomly distributed.
Split your initial rectangle in 2 with a line parallel to the short side of the rectangle and running exactly through the mid-point.
Count the number of points in both half-rectangles. If they are equal (enough) then go to step 4. Otherwise, go to step 3.
Based on the distribution of points between the half-rectangles, move the line to even things up again. So if, perchance, the first cut split the points 1/3, 2/3, move the line half-way into the heavy half of the rectangle. Go to step 2. (Be careful not to get trapped here, moving the line in ever decreasing steps first in one direction, then the other.)
Now, pass each of the half-rectangles in to a recursive call to this function, at step 1.
I hope that outlines the proposal well enough. It has limitations: it will produce a number of rectangles equal to some power of 2, so adjust it if that's not good enough. I've phrased it recursively, but it's ideal for parallelisation. Each split creates two tasks, each of which splits a rectangle and creates two more tasks.
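A rough numpy sketch of that recursive split; to keep it short it cuts directly at the median along the longer side instead of nudging the line iteratively, which balances the two halves in one step (the bounds and point layout are assumptions of mine):

```python
# Rough sketch of the recursive split: cut the longer side at the median
# x or y coordinate, then recurse on each half until the depth runs out.
import numpy as np

def split(points, bounds, depth):
    """points: (N, 2) array; bounds: (xmin, ymin, xmax, ymax); returns leaf bounds."""
    xmin, ymin, xmax, ymax = bounds
    if depth == 0 or len(points) < 2:
        return [bounds]
    axis = 0 if (xmax - xmin) >= (ymax - ymin) else 1   # cut across the long side
    cut = float(np.median(points[:, axis]))
    left_mask = points[:, axis] <= cut
    if axis == 0:
        b1, b2 = (xmin, ymin, cut, ymax), (cut, ymin, xmax, ymax)
    else:
        b1, b2 = (xmin, ymin, xmax, cut), (xmin, cut, xmax, ymax)
    return (split(points[left_mask], b1, depth - 1) +
            split(points[~left_mask], b2, depth - 1))

# e.g. split(pts, (0, 0, 1, 1), depth=10) yields up to 2**10 = 1024 rectangles
```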
If you don't like that approach, perhaps you could start with a regular grid with some multiple (10 - 100 perhaps) of the number of rectangles you want. Count the number of points in each of these tiny rectangles. Then start gluing the tiny rectangles together until the less-tiny rectangle contains (approximately) the right number of points. Or, if it satisfies your requirements well enough, you could use this as a discretisation method and integrate it with my first approach, but only place the cutting lines along the boundaries of the tiny rectangles. This would probably be much quicker as you'd only have to count the points in each tiny rectangle once.
I haven't really thought about the running time of either of these; I have a preference for the former approach 'cos I do a fair amount of parallel programming and have oodles of processors.
You're after a standard Kd-tree or binary space partitioning tree, I think. (You can look it up on Wikipedia.)
Since you have very many points, you may wish to only approximately partition the first few levels. In this case, you should take a random sample of your 200M points--maybe 200k of them--and split the full data set at the midpoint of the subsample (along whichever axis is longer). If you actually choose the points at random, the probability that you'll miss a huge cluster of points that need to be subdivided will be approximately zero.
Now you have two problems of about 100M points each. Divide each along the longer axis. Repeat until you stop taking subsamples and split along the whole data set. After ten breadth-first iterations you'll be done.
If you have a different problem--you must provide tick marks along the X and Y axis and fill in a grid along those as best you can, rather than having the irregular decomposition of a Kd-tree--take your subsample of points and find the 0/32, 1/32, ..., 32/32 percentiles along each axis. Draw your grid lines there, then fill the resulting 1024-element grid with your points.
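A numpy sketch of that grid variant, assuming the points sit in an (N, 2) array; the percentiles of a subsample define the tick marks, and np.digitize bins the full data set into the resulting 32x32 (1024-element) grid:

```python
# numpy sketch of the grid variant: subsample, take 0/32 ... 32/32 percentile
# ticks per axis, then count how many points land in each grid cell.
import numpy as np

def percentile_grid(points, sample_size=200_000, bins=32, seed=0):
    rng = np.random.default_rng(seed)
    take = min(sample_size, len(points))
    sample = points[rng.choice(len(points), size=take, replace=False)]
    qs = np.linspace(0, 100, bins + 1)
    x_ticks = np.percentile(sample[:, 0], qs)        # grid lines along x
    y_ticks = np.percentile(sample[:, 1], qs)        # grid lines along y
    # digitize against the interior ticks so indices fall in 0..bins-1
    xi = np.clip(np.digitize(points[:, 0], x_ticks[1:-1]), 0, bins - 1)
    yi = np.clip(np.digitize(points[:, 1], y_ticks[1:-1]), 0, bins - 1)
    counts = np.zeros((bins, bins), dtype=np.int64)
    np.add.at(counts, (xi, yi), 1)                   # points per grid cell
    return x_ticks, y_ticks, counts
```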
R-tree
Good question.
I think the area you need to investigate is "computational geometry" and the "k-partitioning" problem. There's a link that might help get you started here
You might find that the problem itself is NP-hard which means a good approximation algorithm is the best you're going to get.
Would K-means clustering or a Voronoi diagram be a good fit for the problem you are trying to solve?
That looks like cluster analysis.
Would a QuadTree work?
A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J.L. Bentley in 1974. A similar partitioning is also known as a Q-tree. All forms of Quadtrees share some common features:
They decompose space into adaptable cells
Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
The tree directory follows the spatial decomposition of the Quadtree
For a 2D game I am working on, I am using y-axis sorting in a simple rectangle-based collision-detection scheme. This is working fine, and now I want to efficiently find the nearest empty rectangle of a given size at a given location. How can I do this? Is there an algorithm?
I could think of a simple brute-force grid test (with each grid cell the size of the empty space we're looking for), but obviously this is slow and not even a complete test.
Consider using quad-trees to store your rectangles.
See http://en.wikipedia.org/wiki/Quadtree for more information.
If you're already using axis sorting, then presumably you've computed a list of your rectangles sorted by their positions.
Perhaps I am misunderstanding, but could you not just look at the two rectangles before and after the rectangle in question, and decide which one is closer? If you're talking about finding the closest rectangle to an arbitrary point, then you could simply walk through the list until you find the first rectangle with a greater position than your arbitrary point, and use that rectangle and the one before it as the two to compare.
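A sketch of that neighbor lookup with Python's bisect module, assuming the rectangles are (x1, y1, x2, y2) tuples kept sorted by their left edge and that "position" means the left edge x1 (the names and the 1-D distance measure are illustrative):

```python
# Sketch of the neighbor lookup on a position-sorted rectangle list.
import bisect

def nearest_by_left_edge(sorted_rects, xs, px):
    # xs is the precomputed sorted list of left edges: [r[0] for r in sorted_rects]
    i = bisect.bisect_right(xs, px)                 # first rect starting past px
    candidates = sorted_rects[max(0, i - 1):i + 1]  # the rect before and after px
    if not candidates:
        return None
    return min(candidates, key=lambda r: abs(r[0] - px))
```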