I recently came across this algorithmic question in an interview. The question goes something like:
Initially we are given a rectangle starting at the origin (0,0) and ending at (n,m). Then there are q queries of the form x=r or y=c, each of which cuts the current rectangles into smaller ones. After each query, we have to return the size of the largest rectangle currently present.
See the diagram:
So, here we were initially given a rectangle from (0,0) to (6,6) [a square, in fact!]. After the 1st query, x = 2 (shown as a dotted line above), the largest rectangle size is 24. After the second query, y = 1, the largest rectangle size is 20. And this is how it goes on and on.
My approach to solving this:
At every query, find:
The largest interval on the x axis (maxX) [keep storing all the x = r values in a list]
The largest interval on the y axis (maxY) [keep storing all the y = c values in another list]
At every query, your answer is (maxX * maxY)
For finding maxX and maxY, I will have to iterate through the whole list, which is not very efficient.
So, I have 2 questions:
Is my solution correct? If not, what is the correct approach to the problem? If yes, how can I optimise my solution?
It's correct but takes O(n) time per query.
You could, for each dimension, have one binary search tree (or other sorted container with O(log n) operations) for the coordinates (initially two) and one for the interval sizes. Then for each query in that dimension:
Add the new coordinate to the coordinates.
From its neighbors, compute the interval's old size and remove that from the sizes.
Compute the two new intervals' sizes and add them to the sizes.
The largest size is at the end of the sizes.
This would be O(log n) per query.
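For concreteness, here is a minimal C++ sketch of that per-dimension structure, using std::set for the cut coordinates and std::multiset for the interval sizes. The names are illustrative, and it assumes each query coordinate is new and strictly inside the side:

    #include <cstdint>
    #include <iterator>
    #include <set>

    struct Dimension {
        std::set<int64_t> coords;       // cut coordinates, initially {0, side}
        std::multiset<int64_t> sizes;   // sizes of the current intervals

        explicit Dimension(int64_t side) {
            coords.insert(0);
            coords.insert(side);
            sizes.insert(side);
        }
        // process a query "x = r" (or "y = c") in this dimension
        void cut(int64_t r) {
            auto right = coords.lower_bound(r);         // nearest cut above r (r is new)
            auto left = std::prev(right);               // nearest cut below r
            sizes.erase(sizes.find(*right - *left));    // the old interval disappears
            sizes.insert(r - *left);                    // two new intervals appear
            sizes.insert(*right - r);
            coords.insert(r);
        }
        int64_t largest() const { return *sizes.rbegin(); }
    };

With Dimension xs(n), ys(m), a query x = r becomes xs.cut(r), and the answer after any query is xs.largest() * ys.largest(); in the 6x6 example, after x = 2 this gives 4 * 6 = 24, and after y = 1 it gives 4 * 5 = 20, matching the figure.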
Yes, your algorithm is correct.
To optimize it, first of all, consider only one dimension, because the two dimensions in your geometry are fully orthogonal.
So, you need to have a data structure which holds a partitioning of an interval into sub-intervals, and supports fast application of these two operations:
Split a given interval into two
Find a largest interval
You can do that by using two sorted lists, one sorted by coordinate, and the other sorted by size. You should have pointers from one data structure to the other, and vice-versa.
To implement the "splitting" operation:
Find the interval which you should split, using binary search in the coordinate-sorted list
Remove the interval from both lists
Add two smaller intervals to both lists
Related
My problem is: we have N points in a 2D space, each with a positive weight. Given a query consisting of two real numbers a, b and one integer k, find the position of an a x b rectangle, with edges parallel to the axes, such that the sum of the weights of the top-k points covered by the rectangle (i.e. the k points with the highest weights) is maximized.
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle with the highest total weight sum. Complexity: O(N log N).
Top-k query for orthogonal ranges: find the top-k points in a given rectangle. Complexity: O(log^2 N + k).
You can reduce this problem to finding two points in the rectangle: the rightmost and the topmost. So effectively you can select every pair of points and calculate the top-k weight (which, according to you, is O(log^2 N + k)). Complexity: O(N^2 * (log^2 N + k)).
Now, a given pair of points might not be valid: the two points might be too far apart, or one point may be both to the right of and above the other. So, in reality, this will be much faster.
My guess is the optimal solution will be a variation of maximum region sum problem. Could you point to a link describing that algorithm?
A non-optimal answer is the following:
Generate all the possible k-plets of points (there are N × (N-1) × … × (N-k+1) of them, so this is O(N^k); it can be done via recursion).
Filter this list down by eliminating all k-plets which are not enclosed in an a×b rectangle: this is O(k N^k) at worst.
Find the k-plet which has the maximum weight: this is O(k N^(k-1)) at worst.
Thus, this algorithm is O(k N^k).
Improving the algorithm
Step 2 can be integrated into step 1 by stopping the branch recursion when a set of points is already too large. This does not change the need to scan the elements at least once, but it can reduce their number significantly: think of cases where there is no solution because all points are farther apart than the size of the rectangle; that can be detected in O(N^2).
Also, the permutation generator in step 1 can be made to return the points in order of x or y coordinate, by pre-sorting the point array accordingly. This is useful because it lets us discard many more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, suppose we are discarding a branch because it contains a point whose y coordinate is outside the maximum rectangle; then we can also discard all the following sibling branches, because their y coordinates will be greater than or equal to the current one, which is already out of bounds.
This adds O(N log N) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The coordinate to sort by should be the one for which the rectangle side is smallest relative to the corresponding side of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate over all points.
Finally, if all the points lie within an a×b rectangle, then the algorithm still performs as O(k N^k). If this is a concrete possibility, it should be checked (an easy O(N) loop), and if it holds, it's enough to return the k points with the top weights, which is also O(N).
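A rough C++ sketch of this enumeration, with the first improvement folded in (the recursion prunes a branch as soon as the partial selection no longer fits in an a×b box). It enumerates unordered k-subsets rather than ordered k-plets, since the order does not affect the weight sum; the Point type and the helper names are mine, and the pre-sorting/sibling-pruning improvement is left out for brevity:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Point { double x, y, w; };

    // Recursively extend a partial selection of points; prune as soon as its
    // bounding box no longer fits in an a x b rectangle (step 2 folded into step 1).
    void search(const std::vector<Point>& pts, std::size_t start, std::size_t k,
                std::vector<std::size_t>& chosen, double a, double b, double& best) {
        if (chosen.size() == k) {
            double sum = 0;
            for (std::size_t i : chosen) sum += pts[i].w;
            best = std::max(best, sum);
            return;
        }
        for (std::size_t i = start; i < pts.size(); ++i) {
            chosen.push_back(i);
            double xmin = pts[chosen[0]].x, xmax = xmin;
            double ymin = pts[chosen[0]].y, ymax = ymin;
            for (std::size_t j : chosen) {                 // bounding box of the selection
                xmin = std::min(xmin, pts[j].x); xmax = std::max(xmax, pts[j].x);
                ymin = std::min(ymin, pts[j].y); ymax = std::max(ymax, pts[j].y);
            }
            if (xmax - xmin <= a && ymax - ymin <= b)      // still fits: keep extending
                search(pts, i + 1, k, chosen, a, b, best);
            chosen.pop_back();
        }
    }

Calling it with an empty chosen vector and best initialised to a very small value returns the best weight sum over all k-subsets that fit in some a×b box.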
Let's say we have n empty boxes in a row. We are going to put m groups of coins into some consecutive boxes, and the groups are known in advance. We put the 1st group of coins in boxes i_1 to j_1, the 2nd group in boxes i_2 to j_2, and so on.
Let c_i be the number of coins in box i after putting all the coins in the boxes. We want to be able to quickly determine how many coins there are in the boxes with indexes i = s, s+1, ..., e-1, e, i.e. we want to compute the sum
c_s +c_(s+1) + ... + c_e
efficiently. This can be done using a Fenwick tree. Without any improvements, a Fenwick tree needs O(n) space for storing the c_i's (in a table; actually, tree[i] != c_i, the values are stored more cleverly) and O(log n) time for computing the above sum.
If we have the case where
n is too big for us to make a table of length n (let's say ~ 10 000 000 000)
m is sufficiently small (let's say ~ 500 000)
there is a way to compress the coordinates (indexes) of the boxes, i.e. it suffices to store just the boxes with indexes i_1, i_2, ..., i_m. Since the value stored in tree[i] depends on the binary representation of i, my idea is to sort the indexes i_1, j_1, i_2, j_2, ..., i_m, j_m and build a tree of length O(m). Adding a new value to the tree is then straightforward. Also, to compute that sum, we only have to find the first index that is not greater than e and the last that is not smaller than s. Both can be done with binary search, after which the sum can be easily computed.
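In code, the 1D idea looks roughly like this: a Fenwick tree over the sorted, de-duplicated endpoints, with binary search mapping a real index to a compressed one. The class and method names are only for illustration, and the wiring to the coin-group sums is left out:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct CompressedFenwick {
        std::vector<int64_t> coords;   // sorted, distinct endpoints i_1, j_1, ..., i_m, j_m
        std::vector<int64_t> tree;     // 1-based Fenwick array over compressed indexes

        explicit CompressedFenwick(std::vector<int64_t> endpoints) : coords(std::move(endpoints)) {
            std::sort(coords.begin(), coords.end());
            coords.erase(std::unique(coords.begin(), coords.end()), coords.end());
            tree.assign(coords.size() + 1, 0);
        }
        // 1-based compressed position: how many stored endpoints are <= x
        int pos(int64_t x) const {
            return int(std::upper_bound(coords.begin(), coords.end(), x) - coords.begin());
        }
        void add(int64_t x, int64_t delta) {               // point update at a stored endpoint
            for (int i = pos(x); i <= int(coords.size()); i += i & -i) tree[i] += delta;
        }
        int64_t prefix(int64_t x) const {                  // sum over stored endpoints <= x
            int64_t s = 0;
            for (int i = pos(x); i > 0; i -= i & -i) s += tree[i];
            return s;
        }
    };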
The problem occurs in the 2D case. Now we have an area of points (x,y) in the plane, 0 < x,y < n. There are m rectangles in that area. We know the coordinates of their lower-left and upper-right corners, and we want to compute how many rectangles contain a given point (a,b). The simplest (and my only) idea is to follow the approach from the 1D case: for each coordinate x_i of a corner, store all the coordinates y_i of the corners. The idea is not so clever, since it needs O(m^2) space, which is too much. My question is:
How to store coordinates in the tree in a more efficient way?
Solutions that use Fenwick trees are preferred, but any solution is welcome!
The easiest approach is to use a map/unordered_map instead of a 2D array. In that case you don't even need coordinate compression. The map creates a key-value pair only when it is needed, so it creates O(log^2 n) key-value pairs for each point from the input.
Also, you could use a segment tree based on pointers (instead of arrays) with lazy initialisation (create a node only when it is needed).
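A minimal sketch of the map/unordered_map suggestion: a 2D Fenwick (BIT) whose cells live in a hash map and are created only when touched, so memory is proportional to the number of updates times log^2(n) rather than to n^2. The key packing assumes coordinates fit in 32 bits; all names here are illustrative:

    #include <cstdint>
    #include <unordered_map>

    struct Sparse2DBit {
        int n;                                             // coordinates are in [1, n]
        std::unordered_map<int64_t, int64_t> data;         // key = packed (i, j), created lazily

        explicit Sparse2DBit(int n) : n(n) {}
        static int64_t key(int i, int j) { return (int64_t(i) << 32) | uint32_t(j); }

        void add(int x, int y, int64_t delta) {            // point update
            for (int i = x; i <= n; i += i & -i)
                for (int j = y; j <= n; j += j & -j)
                    data[key(i, j)] += delta;
        }
        int64_t prefix(int x, int y) const {               // sum over [1, x] x [1, y]
            int64_t s = 0;
            for (int i = x; i > 0; i -= i & -i)
                for (int j = y; j > 0; j -= j & -j) {
                    auto it = data.find(key(i, j));
                    if (it != data.end()) s += it->second;
                }
            return s;
        }
    };

For the stated problem (how many rectangles contain (a,b)), one standard way to use it is the 2D difference trick: for a rectangle [x1,x2] x [y1,y2], do add(x1,y1,+1), add(x1,y2+1,-1), add(x2+1,y1,-1), add(x2+1,y2+1,+1); then prefix(a,b) is exactly the number of rectangles containing (a,b).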
Use a 2D segment tree. Notice that for each canonical segment by y-coordinate you can build a (1D) segment tree over x-coordinates only for the points lying in the zone y_min <= y < y_max, where y_min and y_max are the bounds of that canonical segment. This implies that each input point appears in only log(n) of the segment trees for x-coordinates, which makes O(n log n) memory in total.
Suppose I have a set of points;
then I define a line L. How do I obtain b, d, and f?
Can this be solved using a kd-tree (with a slight modification)?
==EDIT==
How my program works:
Define a set of points
L is defined later; it has nothing to do with the point set
My only idea right now:
Get the middle point m of line L.
Based on point m, get all points within a radius of length(L)/2 using the KD-Tree.
For every such point, test whether it lies on line L.
Perhaps I'll add a collinearity threshold in case some points lie only approximately on the query line.
The running time of my approach will depend on the length of L: the longer the line, the bigger the query and the more points need to be checked.
You can have logarithmic-time look-up. My algorithm achieves that at the cost of a giant memory usage (up to cubic in the number of points):
If you know the direction of the line in advance, you can achieve logarithmic-time lookup quite easily: let a*x + b*y = c be the equation of the line, then a / b describes the direction, and c describes the line position. For each a, b (except [0, 0]) and point, c is unique. Then sort the points according to their value of c into an index; when you get the line, do a search in this index.
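A minimal sketch of that fixed-direction index, assuming a Point type and an epsilon for the comparison (both my additions): precompute c = a*x + b*y per point, sort by it, then answer a line a*x + b*y = c with a binary search:

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct Point { double x, y; };

    struct DirectionIndex {
        double a, b;                                   // the fixed, known direction
        std::vector<std::pair<double, Point>> byC;     // (a*x + b*y, point), sorted by the key

        DirectionIndex(double a, double b, const std::vector<Point>& pts) : a(a), b(b) {
            for (const Point& p : pts) byC.push_back({a * p.x + b * p.y, p});
            std::sort(byC.begin(), byC.end(),
                      [](const auto& l, const auto& r) { return l.first < r.first; });
        }
        // all points lying (within eps) on the line a*x + b*y = c
        std::vector<Point> onLine(double c, double eps = 1e-9) const {
            std::vector<Point> out;
            auto it = std::lower_bound(byC.begin(), byC.end(), c - eps,
                                       [](const auto& e, double v) { return e.first < v; });
            for (; it != byC.end() && it->first <= c + eps; ++it) out.push_back(it->second);
            return out;
        }
    };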
If all your lines are orthogonal, it takes two indexes, one for x, one for y. If you use four indexes, you can look up lines at 45° as well. You don't need to get the direction exact; if you know the bounding region of all the points, you can search every point in a strip parallel to the indexed direction that spans the query line within the bounding region.
The above paragraphs define "direction" as the ratio a / b. This yields an infinite ratio for vertical lines, however. A better definition treats a "direction" as a pair (a, b) where at least one of a, b is non-zero, and two pairs (a1, b1), (a2, b2) define the same direction iff a1 * b2 == b1 * a2. Then { (a / b, 1) for b non-zero, (1, 0) for b zero } is one particular way of describing the space of directions. We can choose (1, 0) as the "direction at infinity" and order all other directions by their first component.
Be aware of floating point inaccuracies. Rational arithmetic is recommended. If you choose floating point arithmetic, be sure to use epsilon comparison when checking point-line incidence.
Algorithm 1: Just choose some value n, prepare n indexes, then choose one at query time. Unfortunately, the downside is obvious: the lookup is still a range sweep and thus linear, and the expected speedup drops as the direction gets further away from an indexed one. It also doesn't provide anything useful if the bounding region is much bigger than the region where most of the points are (you could search extremal points separately from the dense region, however).
The theoretical lookup speed is still linear.
In order to achieve logarithmic lookup this way, we need an index for every possible direction. Unfortunately, we can't have infinitely many indexes. Fortunately, similar directions still produce similar indexes - indexes that differ by only a few swaps. If the directions are similar enough, they produce identical indexes. Thus, we can use the same index for an entire range of directions: the index can only change at a direction along which two different points line up.
Algorithm 2 achieves the logarithmic lookup time at the cost of a huge index:
When preparing:
For each pair of points (A, B), determine the direction from A to B. Collect the directions into an ordered set, calling the set the set of significant directions.
Turn this set into a list and add the "direction at infinity" to both ends.
For each pair of consecutive significant directions, choose an arbitrary direction within that range and prepare an index of all points for that direction. Collect the indexes into a list. Do not store any specific values of key in this index, only references to points.
Prepare an index over these indexes, where the direction is the key.
When looking up points by a line:
determine the line direction.
look up the right point index in the index of indexes. If the line direction falls at the boundary between two ranges, choose one arbitrarily. If not, you are guaranteed to find at most one point on the line.
Since there are only O(n^2) significant directions, there are O(n^2) ranges in this index. The lookup will take O(log n) time to find the right one.
look up the points in the index for this range, using the position with respect to the line direction as the key. This lookup will take O(log n) time.
Slight improvement can be obtained because the first and the last index are identical if the "direction at infinity" is not among the significant directions. Further improvements can be performed depending on what indexes are used. An array of indexes into an array of points is very compact, but if a binary search tree (such as a red-black tree or an AVL tree) is used for the index of points, you can do further improvements by merging subtrees identical by value to be identical by reference.
If the points are uniformly distributed, you could divide the plane into a Sqrt(n) x Sqrt(n) grid. Every grid cell then contains 1 point on average, which is a constant.
Every line intersects at most 2 * Sqrt(n) grid cells [right? Proof needed :)]. Inspecting those cells takes O(Sqrt(n)) time, because each cell contains a constant number of points on average (of course this does not hold if the points are distributed with some bias).
Compute the bounding box of all of your points
Divide that bounding box in a uniform grid of x by y cells
Store each of your points in the cell it belongs to
Now for each line you want to test, all you have to do is find the cells it intersects, and test the points in those cells with "distance to line = 0".
Of course, this is only efficient if you are going to test many lines against a given set of points.
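A rough C++ sketch of this grid, with points bucketed in a hash map keyed by cell. The step-wise walk along the segment (checking each sample's cell and its 8 neighbours) and the epsilon are my simplifications; a DDA-style traversal would be the rigorous way to enumerate exactly the cells the line intersects:

    #include <cmath>
    #include <cstdint>
    #include <set>
    #include <unordered_map>
    #include <vector>

    struct Point { double x, y; };

    struct PointGrid {
        double cell;                                           // cell side length
        std::unordered_map<int64_t, std::vector<Point>> buckets;

        static int64_t key(int cx, int cy) { return (int64_t(cx) << 32) | uint32_t(cy); }

        PointGrid(double cellSize, const std::vector<Point>& pts) : cell(cellSize) {
            for (const Point& p : pts)
                buckets[key(int(std::floor(p.x / cell)), int(std::floor(p.y / cell)))].push_back(p);
        }

        // points within eps of the line through (x1,y1)-(x2,y2), found near the segment
        std::vector<Point> query(double x1, double y1, double x2, double y2,
                                 double eps = 1e-9) const {
            std::vector<Point> out;
            double dx = x2 - x1, dy = y2 - y1, len = std::hypot(dx, dy);
            if (len == 0) return out;                          // degenerate segment
            int steps = int(std::ceil(len / (cell * 0.5)));
            std::set<int64_t> seen;
            for (int s = 0; s <= steps; ++s) {
                double t = double(s) / steps;
                int cx = int(std::floor((x1 + t * dx) / cell));
                int cy = int(std::floor((y1 + t * dy) / cell));
                for (int ix = -1; ix <= 1; ++ix)               // sample cell plus 8 neighbours
                    for (int iy = -1; iy <= 1; ++iy) {
                        int64_t k = key(cx + ix, cy + iy);
                        if (!seen.insert(k).second) continue;  // cell already visited
                        auto it = buckets.find(k);
                        if (it == buckets.end()) continue;
                        for (const Point& p : it->second) {
                            // perpendicular distance from p to the (infinite) line
                            double d = std::fabs(dy * (p.x - x1) - dx * (p.y - y1)) / len;
                            if (d <= eps) out.push_back(p);
                        }
                    }
            }
            return out;
        }
    };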
You can try the following:
For each point, find the distance from the point to the line.
Even simpler: for each point, plug the point's coordinates into the line equation; if it satisfies it (meaning 0 = 0), then the point is on the line.
EDIT:
If you have many points - there is another way.
If you can sort the points, create 2 sorted lists:
one sorted by x values
one sorted by y values
Let's say your line starts at (x1,y1) and ends at (x2,y2).
It's easy to filter out all the points whose x value is not between [x1,x2] OR whose y value is not between [y1,y2].
If no points remain, then there are no points on this line.
Now split the line in 2, so you have 2 lines; run the same process again - you can see where this is going.
Once you have a small enough number of points (you choose how many - let's say 10), check whether they are on the line in the usual way.
This also enables you to get "as near" as you need to the line, and to skip regions where there are no relevant points.
If you have enough memory, then a Hough-algorithm-like approach is possible.
Fill an r-theta array with lists of matching points (not counters). Then for every line find its r-theta representation, and check the points in the list at the given r-theta coordinates.
The subtle part is how to choose the array resolution.
Consider a 2000 x 2000 2D bool array. 100,000 elements are set to true, the rest to false.
Given a cell (x1,y1) we need to find the nearest cell (x2,y2) (by manhattan distance: abs(x1-x2) + abs(y1-y2)) that is false.
One way to do that would be to:
for (int dist = 0; true; dist++)
for ((x2,y2) in all cells dist away from (x1,y1))
if (!array[x2,y2])
return (x2,y2);
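In concrete C++, the ring-by-ring scan above looks roughly like this; the 2000x2000 size is from the question, the rest is illustrative:

    #include <cstdlib>
    #include <utility>
    #include <vector>

    const int N = 2000;

    // Scan outward ring by ring; each ring holds the cells at Manhattan distance `dist`.
    std::pair<int, int> nearestFalse(const std::vector<std::vector<bool>>& a, int x1, int y1) {
        for (int dist = 0; dist <= 2 * (N - 1); ++dist) {
            for (int dx = -dist; dx <= dist; ++dx) {
                int dy = dist - std::abs(dx);              // |dx| + |dy| == dist
                for (int s = -1; s <= 1; s += 2) {
                    int x2 = x1 + dx, y2 = y1 + s * dy;
                    if (x2 >= 0 && x2 < N && y2 >= 0 && y2 < N && !a[x2][y2]) return {x2, y2};
                    if (dy == 0) break;                    // (dx, 0) would otherwise repeat
                }
            }
        }
        return {-1, -1};                                   // every cell is true
    }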
In the worst case we would have to iterate through 100,000 cells before finding the free one.
Is there a data structure we could use rather than a 2D array that would allow us to perform this search quicker?
If the data is constant and you have many queries on it:
You might want to use a k-d tree and look for the nearest neighbor. Insert (i,j) for each element such that arr[i][j] = false. The standard k-d tree uses Euclidean distance, but I think it can be modified to use Manhattan distance instead.
If the data is used for one query:
You will need at least Omega(n*m) operations just to read the data and insert it into any data structure, so there is no point in doing that: the naive search from the question costs no more than merely building any such data structure would.
You might be interested in looking into the Region QuadTree. Here the entire image is initially modeled as the root, since the image contains all 0s (assumption). Then, when a particular pixel is set, the image is first divided into 4 quadrants, and the 3 quadrants that do not contain the pixel are left as leaves. The remaining quadrant is subdivided again, and so on, until we reach 4 single-pixel leaves, one of which is set.
This representation helps rule out entire regions during the search, and the search time can be optimized to O(log n).
In a multi-dimensional space, I have a collection of rectangles, all of which are aligned to the grid. (I am using the word "rectangles" loosely - in a three dimensional space, they would be rectangular prisms.)
I want to query this collection for all rectangles that overlap an input rectangle.
What is the best data structure for holding the collection of rectangles? I will be adding rectangles to and removing rectangles from the collection from time to time, but these operations will be infrequent. The operation I want to be fast is the query.
One solution is to keep the corners of the rectangles in a list, and do a linear scan over the list, finding which rectangles overlap the query rectangle and skipping over the ones that don't.
However, I want the query operation to be faster than linear.
I've looked at the R-tree data structure, but it holds a collection of points, not a collection of rectangles, and I don't see any obvious way to generalize it.
The coordinates of my rectangles are discrete, in case you find that helpful.
I am interested in the general solution, but I will also tell you the properties of my specific problem: my problem space has three dimensions, and the number of distinct values in each varies wildly. The first dimension has two possible values, the second dimension has 87 values, and the third dimension has 1.8 million values.
You can probably use KD-Trees which can be used for rectangles according to the wiki page:
Variations: Instead of points
Instead of points, a kd-tree can also contain rectangles or hyperrectangles[5]. A 2D rectangle is considered a 4D object (xlow, xhigh, ylow, yhigh). Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along xhigh, we check the xlow coordinate of the search rectangle. If the median is less than the xlow coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle and so can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.
Let's call the original problem PN, where N is the number of dimensions.
Suppose we know the solution for P1, the 1-dimensional problem: find whether a new interval overlaps a given collection of intervals.
Once we know how to solve it, we can check whether the new rectangle overlaps the collection of rectangles in each of the x/y/z projections.
So the solution of P3 is equivalent to P1_x AND P1_y AND P1_z.
In order to solve P1 efficiently we can use a sorted list. Each node of the list will contain a coordinate and the number-of-opened-intervals-up-to-this-coordinate.
Suppose we have the following intervals:
[1,5]
[2,9]
[3,7]
[0,2]
then the list will look as follows:
{0,1} , {1,2} , {2,2}, {3,3}, {5,2}, {7,1}, {9,0}
If we receive a new interval, say [6,7], we find the largest item in the list that is smaller than 6, namely {5,2}, and the smallest item that is greater than 7, namely {9,0}.
So it is easy to say that the new interval does overlap with the existing ones.
And the search in the sorted list is faster than linear :)
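A small sketch of that structure with std::map (coordinate -> number of open intervals), showing only the overlap test; building the map as intervals are inserted is left out, and the names are mine:

    #include <iterator>
    #include <map>

    struct IntervalSet {
        // coordinate -> number of intervals open immediately after this coordinate,
        // e.g. {0,1},{1,2},{2,2},{3,3},{5,2},{7,1},{9,0} for the intervals above
        std::map<long long, int> open;

        // does a new interval [lo, hi] overlap any stored interval?
        bool overlaps(long long lo, long long hi) const {
            auto it = open.lower_bound(lo);                // first entry >= lo
            if (it != open.end() && it->first <= hi)       // some endpoint falls inside [lo, hi]
                return true;
            if (it == open.begin()) return false;          // nothing before lo at all
            return std::prev(it)->second > 0;              // intervals still open just before lo
        }
    };

For the query [6,7] above, lower_bound(6) lands on {7,1}, which is <= 7, so the overlap is reported; the predecessor {5,2}, with its count of 2, tells the same story, which is the check described in the answer.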
You have to use some sort of a partitioning technique. However, because your problem is constrained (you use only rectangles), the data-structure can be a little simplified. I haven't thought this through in detail, but something like this should work ;)
Using the discrete value constraint, you can create a secondary table-like data structure where you store the discrete values of the second dimension (the 87 possible values). Assume that these values represent planes perpendicular to this dimension. For each of these planes you can store, in this secondary table, the rectangles that intersect it.
Similarly, for the third dimension you can use another table with as many equally spaced values as you need (1.8 million is too many, so you would probably want to make this at least a couple of orders of magnitude smaller), and map to each pair of consecutive values the rectangles that lie between them.
Given a query rectangle, you can query the first table in constant time to determine the set of rectangles which possibly intersect the query. Then you can do another query on the second table and intersect the results of the two queries. This should narrow down the number of actual intersection tests that you have to perform.