Definition 1: Point (x,y) is controlling point (x',y') if and only if x < x' and y < y'.
Definition 2: Point (x,y) is controlled by point (x',y') if and only if x' < x and y' < y.
I'm trying to come up with a data structure to support the following operations:
Add(x,y) - Adds a point (x,y) to the system in O(logn) complexity, where n is the number of points in the system.
Remove(x,y) - Removes a point (x,y) from the system in O(logn) complexity, where n is the number of points in the system.
Score(x,y) - Returns the number of points (x,y) controls minus the number of points that (x,y) is controlled by. Worst case complexity O(logn).
I've tried to solve it using multiple AVL trees, but could not come up with an elegant enough solution.
Point (x,y) is controlling point (x',y') if and only if x < x' and y < y'.
Point (x,y) is controlled by point (x',y') if and only if x' < x and y' < y.
Let's assume that (x,y) is the centre of the square, which is split into quadrants A, B, C and D.
(x,y) is controlling the points in square B and is being controlled by the points in square C.
The output required is the number of points (x,y) controls minus the number of points (x,y) is controlled by, i.e. the number of points in B minus the number of points in C, B - C (referring to the number of points in A, B, C, D simply as A, B, C, D).
We can easily calculate the number of points in A+C: that's simply the number of points with x' < x.
The same goes for C+D (points with y' < y), A+B (points with y' > y) and B+D (points with x' > x).
We add A+C to C+D, which gives A+2C+D.
Add A+B to B+D, which gives A+2B+D.
Subtract the two: A+2B+D - (A+2C+D) = 2B - 2C; divide by two: (2B - 2C)/2 = B - C, which is the output needed.
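In other words, writing the region letters for the number of points they contain, the whole derivation is the single identity:

[(A+B) + (B+D) - (A+C) - (C+D)] / 2 = (2B - 2C) / 2 = B - C = Score(x,y)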
(I'm assuming handling the 1D case is simple enough and there is no need to explain.)
For the sake of future reference
Solution outline:
We will maintain two AVL trees.
Tree_X: will hold points sorted by their X coordinate.
Tree_Y: will hold points sorted by their Y coordinate.
Each node within both trees will hold the following additional data:
Number of leaves in left sub-tree.
Number of leaves in right sub-tree.
For a point $(x,y)$ we will define regions A, B, C, D:
Point (x',y') is in A if x' < x and y' > y.
Point (x',y') is in B if x' > x and y' > y.
Point (x',y') is in C if x' < x and y' < y.
Point (x',y') is in D if x' > x and y' < y.
Now it is clear that Score(x,y) = |B|-|C|.
Moreover |A|+|C|, |B|+|D|, |A|+|B|, |C|+|D| can easily be retrieved from our two AVL trees, as we will soon see.
And notice that [(|A| + |B| + |B| + |D|) - (|A| + |C| + |C| + |D|)]/2 = |B|-|C|
Implementation of required operations:
Add(x,y) - We will add point (x,y) to both of our AVL trees. Since the additional data we are storing is affected only along the insertion path and since the insertion costs O(logn), the total cost of Add(x,y) is O(logn).
Remove(x,y) - We will remove point (x,y) from both of our AVL trees. Since the additional data we are storing is affected only along the removal path and since the removal costs O(logn), the total cost of Remove(x,y) is O(logn).
Score(x,y) - I will show how to calculate $|B|+|D|$; the other counts are computed in a similar way with the same complexity. It is clear that $|B|+|D|$ is the number of points which satisfy $x' > x$. To calculate this number we will:
Find x in Tree_X. Complexity O(logn).
Walk upwards from that node to the root: take the number of elements in the node's right sub-tree, and for every ancestor we reach from its left child add one plus the number of elements in that ancestor's right sub-tree. The total is the number of points with x' > x. Complexity O(logn).
Total cost of Score(x,y) is O(logn).
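For reference, a minimal Ruby sketch of the counting logic (not of the trees themselves): the two coordinate collections are kept as plain sorted arrays, so Add/Remove are O(n) here, but swapping in the rank-augmented AVL trees described above gives the stated O(logn) bounds while the Score formula stays exactly the same.

class PointScore
  def initialize
    @xs = []   # x coordinates, kept sorted (stands in for Tree_X)
    @ys = []   # y coordinates, kept sorted (stands in for Tree_Y)
  end

  def add(x, y)
    @xs.insert(rank_below(@xs, x), x)
    @ys.insert(rank_below(@ys, y), y)
  end

  def remove(x, y)          # assumes the point is present
    @xs.delete_at(@xs.index(x))
    @ys.delete_at(@ys.index(y))
  end

  # Score = |B| - |C| = [(|A|+|B|) + (|B|+|D|) - (|A|+|C|) - (|C|+|D|)] / 2
  def score(x, y)
    a_plus_c = rank_below(@xs, x)   # points with x' < x
    b_plus_d = rank_above(@xs, x)   # points with x' > x
    c_plus_d = rank_below(@ys, y)   # points with y' < y
    a_plus_b = rank_above(@ys, y)   # points with y' > y
    ((a_plus_b + b_plus_d) - (a_plus_c + c_plus_d)) / 2
  end

  private

  def rank_below(arr, v)    # how many elements are strictly less than v
    arr.bsearch_index { |e| e >= v } || arr.size
  end

  def rank_above(arr, v)    # how many elements are strictly greater than v
    arr.size - (arr.bsearch_index { |e| e > v } || arr.size)
  end
end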
Related
Question:
Given two sorted sequences in increasing order, X and Y.
Y is of size k and X is of size m.
I would like to take a subset of X i.e. X' of size k, and consider the following optimization problem:
d(Y, X') = sum( (Y[j] - X'[j])^2 ) for j in [1, k], where:
X' is a subset of X of size k,
Y[j] and X'[j] are the j-th elements of Y and X',
and I want to find the subset X' of X that minimizes d(Y, X').
d(Y, X') is the sum of the squared distances between elements of Y and X'.
Note that X' could have k! possible rearrangements, so its order is totally unknown.
I want to use a DP approach to solve this problem.
My thoughts so far:
I plan to go over all the elements in Y and compute their squared distance with each element in X. But I'm totally at a loss as to what to do next.
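One way a DP could be set up, as a hedged sketch (the function name is illustrative, and it assumes the optimal X' keeps the elements in X's sorted order): let dp[i][j] be the minimum cost of matching the first j elements of Y using only the first i elements of X.

# dp[i][j]: minimum d over matchings of Y[1..j] against a subset of X[1..i]
def min_squared_distance(x, y)
  m, k = x.size, y.size
  inf = Float::INFINITY
  dp = Array.new(m + 1) { Array.new(k + 1, inf) }
  (0..m).each { |i| dp[i][0] = 0.0 }                        # matching nothing costs 0
  (1..m).each do |i|
    (1..k).each do |j|
      skip_xi = dp[i - 1][j]                                 # X[i] is not used in X'
      use_xi  = dp[i - 1][j - 1] + (y[j - 1] - x[i - 1])**2  # X[i] is paired with Y[j]
      dp[i][j] = [skip_xi, use_xi].min
    end
  end
  dp[m][k]                                                   # O(m*k) time overall
end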
I have a set of 2D points, and I want to be able to make the following query with arguments x_min and n: what are the n points with largest y which have x > x_min?
To rephrase in Ruby:
class PointsThing
  def initialize(points)
    @points = points
  end

  def query(x_min, n)
    @points.select { |point| point.x > x_min }.sort_by { |point| -point.y }.take(n)
  end
end
Ideally, my class would also support an insert and delete operation.
I can't think of a data structure for this which would enable the query to run in less than O(|@points|) time. Does anyone know one?
Sort the points by x descending. For each point in order, insert it into a purely functional red-black tree ordered by y descending. Keep all of the intermediate trees in an array.
To look up a particular x_min, use binary search to find the intermediate tree where exactly the points with x > x_min have been inserted. Traverse this tree to find the first n points.
The preprocessing cost is O(p log p) in time and space, where p is the number of points. The query time is O(log p + n), where n is the number of points to be returned in the query.
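A rough Ruby sketch of the bookkeeping around this idea, assuming a hypothetical purely functional PersistentTree class whose constructor takes an ordering key block, whose insert returns a new tree (old versions stay valid) and whose take(n) returns the first n elements in tree order; the tree class itself is not shown.

class PointsThing
  def initialize(points)
    @by_x_desc = points.sort_by { |p| -p.x }          # points sorted by x descending
    @trees = []                                       # @trees[i] = tree holding the first i+1 points
    tree = PersistentTree.new { |p| -p.y }            # hypothetical: ordered by y descending
    @by_x_desc.each do |p|
      tree = tree.insert(p)                           # persistent insert, old tree kept intact
      @trees << tree
    end
  end

  def query(x_min, n)
    # count of points with x > x_min, via binary search on the x-descending array
    count = @by_x_desc.bsearch_index { |p| p.x <= x_min } || @by_x_desc.size
    return [] if count.zero?
    @trees[count - 1].take(n)                         # first n points by largest y
  end
end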
If your data are not sorted, then you have no choice but to check every point since you cannot know if there exists another point for which y is greater than that of all other points and for which x > x_min. In short: you can't know if another point should be included if you don't check them all.
In that case, I would assume that it would be impossible to check in sublinear time as you ask for, since you have to check them all. Best case for searching all would be linear.
If your data are sorted, then your best case will be constant time (all n points are those with the greatest y), and worst case would be linear (all n points are those with least y). Average case would be closer to constant I think if your x and x_min are both roughly random within a specific range.
If you want this to scale (that is, you could have large values of n), you will want to keep your resultant set sorted as well since you will need to check new potential points against it and to drop the lowest value when you insert (if size > n). Using a tree, this can be log time.
So, to do the entire thing, worst case is for unsorted points, in which case you're looking at nlog(n) time. Sorted points is better, in which case you're looking at average case of log(n) time (again, assuming roughly randomly distributed values for x and x_min), which yes is sub-linear.
In case it isn't at first obvious why sorted points will have constant time to search through, I will go over that here quickly.
If the n points with the greatest y values all had x > x_min (the best case) then you are just grabbing what you need off the top, so that case is obvious.
For the average case, assuming roughly randomly distributed x and x_min, the odds that x > x_min are basically half. For any two random numbers a and b, a > b is just as likely to be true as b > a. This is the same thing with x and x_min; x > x_min is equally as likely to be true as x_min > x, meaning 0.5 probability. This means that, for your points, on average every second point checked will meet your x > x_min requirement, so on average you will check 2n points to find the n highest points that meet your criteria. So the best case was c time, average is 2c which is still constant.
Note, however, that for values of n approaching the size of the set this hides the fact that you are going through the entire set, essentially bringing it right back up to linear time. So my assertion that it is constant time does not hold true if you assume random values of n within the range of the size of your set.
If this is not a purely academic question and is prompted by some actual need, then it depends on the situation.
(edit)
I just realized that my constant-time assertions were assuming a data structure where you have direct access to the highest value and can go sequentially to lower values. If the data structure that those are provided to you in does not fit that description, then obviously that will not be the case.
Some precomputation would help in this case.
First partition the set of points, taking x_min as the pivot element.
Then, for the set of points lying to the right of x_min, build a max_heap based on y coordinates.
Now run your query as: perform n extract_max operations on the built max_heap.
The running time of your query would be log X + log(X-1) + ... + log(X-(n-1)).
log X: for the first extract_max operation.
log(X-1): for the second extract_max operation, and so on.
X: the size of the original max-heap.
Even in the worst case, when n << X, the time taken would be O(n log X).
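A rough Ruby sketch of this answer (the MaxHeap class is just my own illustration of an array-based binary heap keyed on y; points are assumed to respond to #x and #y):

class MaxHeap
  def initialize(items, &key)
    @key = key
    @a = []
    items.each { |it| push(it) }
  end

  def push(item)                           # sift the new item up
    @a << item
    i = @a.size - 1
    while i > 0 && @key.call(@a[(i - 1) / 2]) < @key.call(@a[i])
      @a[(i - 1) / 2], @a[i] = @a[i], @a[(i - 1) / 2]
      i = (i - 1) / 2
    end
  end

  def extract_max                          # pop the root, sift the last item down
    return nil if @a.empty?
    top = @a[0]
    last = @a.pop
    unless @a.empty?
      @a[0] = last
      i = 0
      loop do
        l, r, largest = 2 * i + 1, 2 * i + 2, i
        largest = l if l < @a.size && @key.call(@a[l]) > @key.call(@a[largest])
        largest = r if r < @a.size && @key.call(@a[r]) > @key.call(@a[largest])
        break if largest == i
        @a[i], @a[largest] = @a[largest], @a[i]
        i = largest
      end
    end
    top
  end
end

def query(points, x_min, n)
  right = points.select { |p| p.x > x_min }              # the partition step
  heap = MaxHeap.new(right) { |p| p.y }                  # max-heap on y coordinates
  Array.new([n, right.size].min) { heap.extract_max }    # n extract_max operations
end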
Notation
Let P be the set of points.
Let top_y(n, x_min) describe the query to collect the n points from P with the largest y-coordinates among those with x-coordinate greater than or equal to x_min.
Let x_0 be the minimum of the x coordinates in your point set. Partition the x axis to the right of x_0 into a set of left-hand closed, right-hand open intervals I_i, determined by the x coordinates of the point set P, such that min(I_i) is the i-th smallest x coordinate in P. Define the coordinate rank r(x) of x as the index of the interval x is an element of, or 0 if x < x_0.
Note that r(x) can be computed in O(log #({I_i})) using a binary search tree.
Simple Solution
Sort your point set by decreasing y-coordinates and save this array A in time O(#P log #P) and space O(#P).
Process each query top_y(n, x_min) by traversing this array in order, skipping over items A_i with A_i.x < x_min and counting all other entries until the counter reaches n or you are at the end of A. This processing takes O(n) time and O(1) space.
Note that this may already be sufficient: queries top_y(n_0, a_0) with a_0 < min { p.x | p \in P } and n_0 = c * #P, c = const, require step 1 anyway, and for n << #P and 'infrequent' queries any further optimization may not be worth the effort.
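A small Ruby sketch of the simple solution, assuming a is the array A of points already sorted by decreasing y and points respond to #x:

def top_y(a, n, x_min)
  result = []
  a.each do |p|
    next if p.x < x_min          # skip entries left of x_min
    result << p
    break if result.size == n    # stop once n points have been collected
  end
  result
end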
Observation
Consider the sequences s_i, s_(i+1) of points with x-coordinates greater than or equal to min(I_i), min(I_(i+1)) respectively, ordered by decreasing y-coordinate. s_(i+1) is a strict subsequence of s_i.
If p_1 \in s_(i+1) and p_2.x >= p_1.x then p_2 \in s_(i+1).
Refined Solution
A refined data structure allows for O(n) + O(log #P) query processing time.
Annotate the array A from the simple solution with a 'successor dispatch' for precisely those elements A_i with A_(i+1).x < A_i.x; This dispatch data would consist of an array disp:[r(A_(i+1).x) + 1 .. r(A_i.x)] of A-indexes of the next element in A whose x-coordinate ranks at least as high as the index into disp. The given dispatch indices suffice for processing the query, since ...
... disp[j] = disp[r(A_(i+1).x) + 1] for each j <= r(A_(i+1).x).
... for any x_min with r(x_min) > r(A_i.x), the algorithm wouldn't be here
The proper index to access disp is r(x_min) which remains constant throughout a query and thus takes O(log #P) to compute once per query while the index selection itself is O(1) at each A element.
disp can be precomputed. No two disp entries across all disp arrays are identical (Proof skipped, but it's easy [;-)] to see given the construction). Therefore the construction of disp arrays can be performed stack-based in a single sweep through the point set sorted in A. As there are #P entries, the disp structure takes O(#P) space and O(#P) time to construct, being dominated by space and time requirements for y-sorting. So in a certain sense, this structure comes for free.
Time requirements for query top_y(n,x_min)
Computing r(x_min): O(log #P);
Passage through A: O(n);
I was at a high-frequency trading firm interview, and they asked me:
Find a square whose side length is R, given n points in the 2D plane.
Conditions:
-- sides parallel to the axes
-- it contains at least 5 of the n points
-- the running complexity must not depend on R
They told me to give them an O(n) algorithm.
Interesting problem, thanks for posting! Here's my solution. It feels a bit inelegant but I think it meets the problem definition:
Inputs: R, P = {(x_0, y_0), (x_1, y_1), ..., (x_N-1, y_N-1)}
Output: (u,v) such that the square with corners (u,v) and (u+R, v+R) contains at least 5 points from P, or NULL if no such (u,v) exist
Constraint: asymptotic run time should be O(n)
Consider tiling the plane with RxR squares. Construct a sparse matrix, B defined as
B[i][j] = {(x,y) in P | floor(x/R) = i and floor(y/R) = j}
As you are constructing B, if you find an entry that contains at least five elements stop and output (u,v) = (i*R, j*R) for i,j of the matrix entry containing five points.
If the construction of B did not yield a solution then either there is no solution or else the square with side length R does not line up with our tiling. To test for this second case we will consider points from four adjacent tiles.
Iterate the non-empty entries in B. For each non-empty entry B[i][j], consider the collection of points contained in the tile represented by the entry itself and in the tiles above and to the right. These are the points in entries: B[i][j], B[i+1][j], B[i][j+1], B[i+1][j+1]. There can be no more than 16 points in this collection, since each entry must have fewer than 5. Examine this collection and test if there are 5 points among the points in this collection satisfying the problem criteria; if so stop and output the solution. (I could specify this algorithm in more detail, but since (a) such an algorithm clearly exists, and (b) its asymptotic runtime is O(1), I won't go into that detail).
If after iterating the entries in B no solution is found then output NULL.
The construction of B involves just a single pass over P and hence is O(N). B has no more than N elements, so iterating it is O(N). The algorithm for each element in B considers no more than 16 points and hence does not depend on N and is O(1), so the overall solution meets the O(N) target.
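Roughly, in Ruby (points given as [x, y] pairs). The inner brute force uses the fact that any qualifying square can be slid so that its lower-left corner takes its x from one covered point and its y from another, so only O(1) candidate corners per group need checking:

def find_square(points, r)
  tiles = Hash.new { |h, k| h[k] = [] }
  points.each do |x, y|
    key = [(x.to_f / r).floor, (y.to_f / r).floor]
    tiles[key] << [x, y]
    return [key[0] * r, key[1] * r] if tiles[key].size >= 5   # a single tile already works
  end
  tiles.keys.each do |i, j|
    # the tile plus its right / upper / upper-right neighbours: at most 16 points here
    group = tiles.fetch([i, j], []) + tiles.fetch([i + 1, j], []) +
            tiles.fetch([i, j + 1], []) + tiles.fetch([i + 1, j + 1], [])
    group.each do |gx, _|
      group.each do |_, gy|
        inside = group.count { |px, py| px >= gx && px <= gx + r && py >= gy && py <= gy + r }
        return [gx, gy] if inside >= 5                        # square (gx, gy)-(gx + r, gy + r)
      end
    end
  end
  nil
end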
Run through the set once, keeping the 5 largest x values in a (sorted) local array a. Maintaining the sorted local array is O(N) (constant time performed at most N times).
Define xMax and xMin as the x-coordinates of the points with the largest and 5th largest x values respectively (i.e. a[0] and a[4]).
Sort a[] again on Y value, and set yMin and yMax as above, again in constant time.
Define deltaX = xMax- xMin, and deltaY as yMax - yMin, and R = largest of deltaX and deltaY.
The square of side length R located with upper-right at (xMax,yMax) meets the criteria.
Observation if R is fixed in advance:
O(N) complexity means no sort is allowed except on a fixed number of points, as only a Radix sort would meet the criteria and it requires a constraint on the values of xMax-xMin and of yMax-yMin, which was not provided.
Perhaps the trick is to start with the point furthest down and left, and move up and right. The lower-left-most point can be determined in a single pass of the input.
Moving up and right in steps and counting points in the square requires sorting the points on X and Y in advance, which, to be done in O(N) time, requires that the Radix sort constraint be met.
I have to design an algorithm with running time O(nlogn) for the following problem:
Given a set P of n points, determine a value A > 0 such that the shear transformation (x,y) -> (x+Ay,y) does not change the order (in x direction) of points with unequal x-coordinates.
I am having a lot of difficulty even figuring out where to begin.
Any help with this would be greatly appreciated!
Thank you!
I think y = 0.
When x = 0, A > 0
(x,y) -> (x+Ay,y)
-> (0+(A*0),0) = (0,0)
When x = 1, A > 0
(x,y) -> (x+Ay,y)
-> (1+(A*0),0) = (1,0)
with unequal x-coordinates, (2,0), (3,0), (4,0)...
So, I think that the begin point may be (0,0), x=0.
Suppose all x,y coordinates are positive numbers. (Without loss of generality, one can add offsets.) In time O(n log n), sort a list L of the points, primarily in ascending order by x coordinates and secondarily in ascending order by y coordinates. In time O(n), process consecutive point pairs (in L order) as follows. Let p, q be two consecutive points in L (p before q), and let px, qx, py, qy denote their x and y coordinate values. From there you just need to consider several cases and it should be obvious what to do: If px = qx, do nothing. Else, if py <= qy, do nothing. Else (px < qx, py > qy) require that px + A*py < qx + A*qy, i.e. A < (qx - px)/(py - qy).
So: go through L in order, and find the smallest value of (qx - px)/(py - qy) over all point pairs with px < qx and py > qy; call it A'. Then choose a value of A that's a little less than A', for example A'/2. (Or, if the object of the problem is to find the largest such A, just report the A' value.)
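A short Ruby sketch of this answer (points as [x, y] pairs):

def shear_coefficient(points)
  l = points.sort_by { |x, y| [x, y] }              # ascending by x, then by y
  bound = Float::INFINITY
  l.each_cons(2) do |(px, py), (qx, qy)|
    next if px == qx || py <= qy                    # such pairs impose no constraint
    bound = [(qx - px).to_f / (py - qy), bound].min # need A < (qx - px)/(py - qy)
  end
  bound.infinite? ? 1.0 : bound / 2                 # any A in (0, bound) works
end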
Ok, here's a rough stab at a method.
Sort the list of points by x order. (This gives the O(nlogn)--all the following steps are O(n).)
Generate a new list of dx_i = x_(i+1) - x_i, the differences between the x coordinates. As the x_i are ordered, all of these dx_i >= 0.
Now for some A, the transformed dx_i(A) will be x_(i+1) - x_i + A * (y_(i+1) - y_i). There will be an order change if this is negative or zero (x_(i+1)(A) <= x_i(A)).
So for each dx_i, find the value of A that would make dx_i(A) zero, namely
A_i = - (x_(i+1) - x_i)/(y_(i+1) - y_i). You now have a list of coefficients that would 'cause' an order swap between a consecutive (in x-order) pair of points. Watch for division by zero, but that's the case where two points have the same y, these points will not change order. Some of the A_i will be negative, discard these as you want A>0. (Negative A_i will also induce an order swap, so the A>0 requirement is a little arbitrary.)
Find the smallest A_i > 0 in the list. So any A with 0 < A < A_i(min) will be a shear that does not change the order of your points. Pick A_i(min) as that will bring two points to the same x, but not past each other.
Consider some points on a 2D plane and the function f(x) = ax + b, where b = 0. Let's say a point is a 1x1 square.
Now we want to tell how many points are between the function f(x) and the y axis, as in the picture below.
Black points are valid, white ones are not. We also say a point is valid if it:
intersects with the y axis;
or with the function f(x);
or is between them.
As denoted in the picture :
How can we solve this, assuming that we don't remove any of the points and we don't add them? Is there any other approach than standard brute force?
If I am understanding this right the points are random and given to you by their coordinates, and the line is also given to you. If that is the case, there cannot be any a priori knowledge about any relationship between the points, so you'd have to go through them, in the order given, and compare their x coordinate with 0 and their y coordinate with f(x). If a point passes the check you increment the counter, otherwise you don't. The algorithm runs in O(n) time and I highly doubt you can do any better than that without some extra information about the points.
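As a tiny Ruby sketch of that linear scan (treating each point by its coordinates, ignoring the 1x1-square extent, and taking 'valid' to mean x >= 0 and y >= a*x):

def count_valid(points, a)
  points.count { |x, y| x >= 0 && y >= a * x }   # on the y axis, on f(x) = ax, or between them
end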
The question is quite unclear but it appears from comment "I mean find that a in f(x)=ax to have maximum points which are valid and their amount doesn't exceed some value X" that you want to find a such that N(a)=X, where by N(a) I mean number of points right of the y axis and above line y=ax; or if no such a exists, find a such that m = N(a)<X and N(b)<m implies N(b)<X.
Here's an O(n*ln(n)) algorithm: For each point p, excluding any p below y=0, compute slope M_p as ratio of p's y and x coordinates, or DBL_MAX if x=0. Sort the M's into ascending order (this is the O(n*ln(n)) step), and call the sorted array S.
Now we will set up an array T such that when any X is given, S[T[X-1]] is a slope that will place X points on or above that slope:
S[n] = DBL_MAX;                             /* sentinel above every finite slope */
for (k = 0, j = n - 1; j >= 0; --j) {
    T[j] = (k <= n) ? k : n;                /* clamp once all slopes are consumed */
    do ++k; while (k <= n && S[k] == S[k-1]);
}
Thereafter, let any X be given. Let h = T[X-1]. If h<n then N(S[h]) <= X; if h==n, there are multiple points on the Y axis and no finite slope will work.
This algorithm uses time O(n*ln(n)) and space O(n) to preprocess a set of n first-quadrant points, and thereafter uses time O(1) to find an a for any given X, 0 < X <= n, such that N(a) = X, if such a exists, else returns a such that N(a) < X < N(b) if b>a, else returns DBL_MAX.