Finding closest pair of points in the plane with non-distinct x-coordinates in O(nlogn) - algorithm

Most of the implementations of the algorithm to find the closest pair of points in the plane that I've seen online have one of two deficiencies: either they fail to meet an O(nlogn) runtime, or they fail to accommodate the case where some points share an x-coordinate. Is a hash map (or equivalent) required to solve this problem optimally?
Roughly, the algorithm in question is (per CLRS Ch. 33.4):
For an array of points P, create additional arrays X and Y such that X contains all points in P, sorted by x-coordinate and Y contains all points in P, sorted by y-coordinate.
Divide the points in half - drop a vertical line so that you split X into two arrays, XL and XR, and divide Y similarly, so that YL contains all points left of the line and YR contains all points right of the line, both sorted by y-coordinate.
Make recursive calls for each half, passing XL and YL to one and XR and YR to the other, and finding the minimum distance, d in each of those halves.
Lastly, determine if there's a pair with one point on the left and one point on the right of the dividing line with distance smaller than d; through a geometric argument, we find that we can adopt the strategy of just searching through the next 7 points for every point within distance d of the dividing line, meaning the recombination of the divided subproblems is only an O(n) step (even if it looks n2 at first glance).
This has some tricky edge cases. One way people deal with this is sorting the strip of points of distance d from the dividing line at every recombination step (e.g. here), but this is known to result in an O(nlog2n) solution.
Another way people deal with edge cases is by assuming each point has a distinct x-coordinate (e.g. here): note the snippet in closestUtil which adds to Pyl (or YL as we call it) if the x-coordinate of a point in Y is <= the line, or to Pyr (YR) otherwise. Note that if all points lie on the same vertical line, this would result us writing past the end of the array in C++, as we write all n points to YL.
So the tricky bit when points can have the same x-coordinate is dividing the points in Y into YL and YR depending on whether a point p in Y is in XL or XR. The pseudocode in CLRS for this is (edited slightly for brevity):
for i = 1 to Y.length
if Y[i] in X_L
Y_L.length = Y_L.length + 1;
Y_L[Y_L.length] = Y[i]
else Y_R.length = Y_R.length + 1;
Y_R[Y_R.length] = Y[i]
However, absent of pseudocode, if we're working with plain arrays, we don't have a magic function that can determine whether Y[i] is in X_L in O(1) time. If we're assured that all x-coordinates are distinct, sure - we know that anything with an x-coordinate less than the dividing line is in XL, so with one comparison we know what array to partition any point p in Y into. But in the case where x-coordinates are not necessarily distinct (e.g. in the case where they all lie on the same vertical line), do we require a hash map to determine whether a point in Y is in XL or XR and successfully break down Y into YL and YR in O(n) time? Or is there another strategy?

Yes, there are at least two approaches that work here.
The first, as Bing Wang suggests, is to apply a rotation. If the angle is sufficiently small, this amounts to breaking ties by y coordinate after comparing by x, no other math needed.
The second is to adjust the algorithm on G4G to use a linear-time partitioning algorithm to divide the instance, and a linear-time sorted merge to conquer it. Presumably this was not done because the author valued the simplicity of sorting relative to the previously mentioned algorithms in most programming languages.

Tardos & Kleinberg suggests annotating each point with its position (index) in X.
You could do this in N time, or, if you really, really want to, you could do it "for free" in the sorting operation.
With this annotation, you could do your O(1) partitioning, and then take the position pr of the right-most point in Xl in O(1), using it to determine weather a point in Y goes in Yl (position <= pr), or Yr (position > pr). This does not require an extra data structure like a hash map, but it does require that those same positions are used in X and Y.
NB:
It is not immediately obvious to me that the partitioning of Y is the only problem that arises when multiple points have the same coordinate on the x-axis. It seems to me that the proof of linearity of the comparisons neccesary across partitions breaks, but I have seen only the proof that you need only 15 comparisons, not the proof for the stricter 7-point version, so i cannot be sure.

Related

How many paths of length n with the same start and end point can be found on a hexagonal grid?

Given this question, what about the special case when the start point and end point are the same?
Another change in my case is that we must move at every step. How many such paths can be found and what would be the most efficient approach? I guess this would be a random walk of some sort?
My think so far is, since we must always return to our starting point, thinking about n/2 might be easier. At every step, except at step n/2, we have 6 choices. At n/2 we have a different amount of choices depending on if n is even or odd. We also have a different amount of choices depending on where we are (what previous choices we made). For example if n is even and we went straight out, we only have one choice at n/2, going back. But if n is even and we didn't go straight out, we have more choices.
It is all the cases at this turning point that I have trouble getting straight.
Am I on the right track?
To be clear, I just want to count the paths. So I guess we are looking for some conditioned permutation?
This version of the combinatorial problem looks like it actually has a short formula as an answer.
Nevertheless, the general version, both this and the original question's, can be solved by dynamic programming in O (n^3) time and O (n^2) memory.
Consider a hexagonal grid which spans at least n steps in all directions from the target cell.
Introduce a coordinate system, so that every cell has coordinates of the form (x, y).
Let f (k, x, y) be the number of ways to arrive at cell (x, y) from the starting cell after making exactly k steps.
These can be computed either recursively or iteratively:
f (k, x, y) is just the sum of f (k-1, x', y') for the six neighboring cells (x', y').
The base case is f (0, xs, ys) = 1 for the starting cell (xs, ys), and f (0, x, y) = 0 for every other cell (x, y).
The answer for your particular problem is the value f (n, xs, ys).
The general structure of an iterative solution is as follows:
let f be an array [0..n] [-n-1..n+1] [-n-1..n+1] (all inclusive) of integers
f[0][*][*] = 0
f[0][xs][ys] = 1
for k = 1, 2, ..., n:
for x = -n, ..., n:
for y = -n, ..., n:
f[k][x][y] =
f[k-1][x-1][y] +
f[k-1][x][y-1] +
f[k-1][x+1][y] +
f[k-1][x][y+1]
answer = f[n][xs][ys]
OK, I cheated here: the solution above is for a rectangular grid, where the cell (x, y) has four neighbors.
The six neighbors of a hexagon depend on how exactly we introduce a coordinate system.
I'd prefer other coordinate systems than the one in the original question.
This link gives an overview of the possibilities, and here is a short summary of that page on StackExchange, to protect against link rot.
My personal preference would be axial coordinates.
Note that, if we allow standing still instead of moving to one of the neighbors, that just adds one more term, f[k-1][x][y], to the formula.
The same goes for using triangular, rectangular, or hexagonal grid, for using 4 or 8 or some other subset of neighbors in a grid, and so on.
If you want to arrive to some other target cell (xt, yt), that is also covered: the answer is the value f[n][xt][yt].
Similarly, if you have multiple start or target cells, and you can start and finish at any of them, just alter the base case or sum the answers in the cells.
The general layout of the solution remains the same.
This obviously works in n * (2n+1) * (2n+1) * number-of-neighbors, which is O(n^3) for any constant number of neighbors (4 or 6 or 8...) a cell may have in our particular problem.
Finally, note that, at step k of the main loop, we need only two layers of the array f: f[k-1] is the source layer, and f[k] is the target layer.
So, instead of storing all layers for the whole time, we can store just two layers, as we don't need more: one for odd k and one for even k.
Using only two layers is as simple as changing all f[k] and f[k-1] to f[k%2] and f[(k-1)%2], respectively.
This lowers the memory requirement from O(n^3) down to O(n^2), as advertised in the beginning.
For a more mathematical solution, here are some steps that would perhaps lead to one.
First, consider the following problem: what is the number of ways to go from (xs, ys) to (xt, yt) in n steps, each step moving one square north, west, south, or east?
To arrive from x = xs to x = xt, we need H = |xt - xs| steps in the right direction (without loss of generality, let it be east).
Similarly, we need V = |yt - ys| steps in another right direction to get to the desired y coordinate (let it be south).
We are left with k = n - H - V "free" steps, which can be split arbitrarily into pairs of north-south steps and pairs of east-west steps.
Obviously, if k is odd or negative, the answer is zero.
So, for each possible split k = 2h + 2v of "free" steps into horizontal and vertical steps, what we have to do is construct a path of H+h steps east, h steps west, V+v steps south, and v steps north. These steps can be done in any order.
The number of such sequences is a multinomial coefficient, and is equal to n! / (H+h)! / h! / (V+v)! / v!.
To finally get the answer, just sum these over all possible h and v such that k = 2h + 2v.
This solution calculates the answer in O(n) if we precalculate the factorials, also in O(n), and consider all arithmetic operations to take O(1) time.
For a hexagonal grid, a complicating feature is that there is no such clear separation into horizontal and vertical steps.
Still, given the starting cell and the number of steps in each of the six directions, we can find the final cell, regardless of the order of these steps.
So, a solution can go as follows:
Enumerate all possible partitions of n into six summands a1, ..., a6.
For each such partition, find the final cell.
For each partition where the final cell is the cell we want, add multinomial coefficient n! / a1! / ... / a6! to the answer.
Just so, this takes O(n^6) time and O(1) memory.
By carefully studying the relations between different directions on a hexagonal grid, perhaps we can actually consider only the partitions which arrive at the target cell, and completely ignore all other partitions.
If so, this solution can be optimized into at least some O(n^3) or O(n^2) time, maybe further with decent algebraic skills.

Picking the "spread" from the points on a line

I'm facing an algorithmic problem described as follows: Given a line from 0 to N (really big N), a list of X points on said line, and a number Z (0<=Z<=X) pick Z points from X to maximize the distance between two closest points. The brute-force solution in O(n^2) doesn't seem that difficult but I'm looking for something more sophisticated that can be done in O(n log n) time. Any clues, solutions, advice is very appreciated.
Edit: Answering the question in the first post-it is the minimal distance (between the two closest points) that has to be maximized.
One easy approach is O(XlogN).
First, sort the points.
Next observe that if you already know the minimum distance (call it d) between the points, it's O(X) to see if there's a way of picking Z points all of which are at least distance d apart: take the left-most element, then the next that's at least distance d away, then the next that's at least distance d away from that, and so on. If by the time you've got to the end of the array you have at least Z points, then you have a solution, and if you don't, there is no solution.
Now, you can use a binary search on [0, N] to find the largest d with a solution.
The sort is O(XlogX), the binary search takes O(logN) trials, and each is O(X). Overall, that's O(XlogX + XlogN), but since N >= X that simplifies to O(XlogN).

Algorithm for finding all combinations of (x,y,z,j) that satisfy w+x = y+j, where w,x,y,j are integers between -N...N inclusive

I'm working on a problem that requires an array (dA[j], j=-N..N) to be calculated from the values of another array (A[i], i=-N..N) based on a conservation of momentum rule (x+y=z+j). This means that for a given index j for all the valid combinations of (x,y,z) I calculate A[x]A[y]A[z]. dA[j] is equal to the sum of these values.
I'm currently precomputing the valid indices for each dA[j] by looping x=-N...+N,y=-N...+N and calculating z=x+y-j and storing the indices if abs(z) <= N.
Is there a more efficient method of computing this?
The reason I ask is that in future I'd like to also be able to efficiently find for each dA[j] all the terms that have a specific A[i]. Essentially to be able to compute the Jacobian of dA[j] with respect to dA[i].
Update
For the sake of completeness I figured out a way of doing this without any if statements: if you parametrize the equation x+y=z+j given that j is a constant you get the equation for a plane. The constraint that x,y,z need to be integers between -N..N create boundaries on this plane. The points that define this boundary are functions of N and j. So all you have to do is loop over your parametrized variables (s,t) within these boundaries and you'll generate all the valid points by using the vectors defined by the plane (s*u + t*v + j*[0,0,1]).
For example, if you choose u=[1,0,-1] and v=[0,1,1] all the valid solutions for every value of j are bounded by a 6 sided polygon with points (-N,-N),(-N,-j),(j,N),(N,N),(N,-j), and (j,-N).
So for each j, you go through all (2N)^2 combinations to find the correct x's and y's such that x+y= z+j; the running time of your application (per j) is O(N^2). I don't think your current idea is bad (and after playing with some pseudocode for this, I couldn't improve it significantly). I would like to note that once you've picked a j and a z, there is at most 2N choices for x's and y's. So overall, the best algorithm would still complete in O(N^2).
But consider the following improvement by a factor of 2 (for the overall program, not per j): if z+j= x+y, then (-z)+(-j)= (-x)+(-y) also.

Algorithm for finding path combinations?

Imagine you have a dancing robot in n-dimensional euclidean space starting at origin P_0 = (0,0,...,0).
The robot can make m types of dance moves D_1, D_2, ..., D_m
D_i is an n-vector of integers (D_i_1, D_i_2, ..., D_i_n)
If the robot makes dance move i than its position changes by D_i:
P_{t+1} = P_t + D_i
The robot can make any of the dance moves as many times as he wants and in any order.
Let a k-dance be defined as a sequence of k dance moves.
Clearly there are m^k possible k-dances.
We are interested to know the set of possible end positions of a k-dance, and for each end position, how many k-dances end at that location.
One way to do this is as follows:
P0 = (0, 0, ..., 0);
S[0][P0] = 1
for I in 1 to k
for J in 1 to m
for P in S[I-1]
S[I][P + D_J] += S[I][P]
Now S[k][Q] will tell you how many k-dances end at position Q
Assume that n, m, |D_i| are small (less than 5) and k is less than 40.
Is there a faster way? Can we calculate S[k][Q] "directly" somehow with some sort of linear algebra related trick? or some other approach?
You could create an adjacency matrix that would contain dance-move transitions in your space (the part of it that's reachable in k moves, otherwise it would be infinite). Then, the P_0 row of n-th power of this matrix contains the S[k] values.
The matrix in question quickly gets enormous, something like (k*(max(D_i_j)-min(D_i_j)))^n (every dimension can be halved if Q is close to origin), but that's true for your S matrix as well
Since dance moves are interchangable you can assume that for a i < j the robot first makes all the D_i moves before the D_j moves, thus reducing the number of combinations to actually calculate.
If you keep track of the number of times each dance move was made calculating the total number of combinations should be easy.
Since the 1-dimensional problem is closely related to the subset sum problem, you could probably take a similar approach - find all of the combinations of dance vectors that add together to have the correct first coordinate with exactly k moves; then take that subset of combinations and check to see which of those have the right sum for the second, and take the subset which matches both and check it for the third, and so on.
In this way, you get to at least only have to perform a very simple addition for the extremely painful O(n^k) step. It will indeed find all of the vectors which will hit a given value.

Partition a set into k groups with minimum number of moves

You have a set of n objects for which integer positions are given. A group of objects is a set of objects at the same position (not necessarily all the objects at that position: there might be multiple groups at a single position). The objects can be moved to the left or right, and the goal is to move these objects so as to form k groups, and to do so with the minimum distance moved.
For example:
With initial positions at [4,4,7], and k = 3: the minimum cost is 0.
[4,4,7] and k = 2: minimum cost is 0
[1,2,5,7] and k = 2: minimum cost is 1 + 2 = 3
I've been trying to use a greedy approach (by calculating which move would be shortest) but that wouldn't work because every move involves two elements which could be moved either way. I haven't been able to formulate a dynamic programming approach as yet but I'm working on it.
This problem is a one-dimensional instance of the k-medians problem, which can be stated as follows. Given a set of points x_1...x_n, partition these points into k sets S_1...S_k and choose k locations y_1...y_k in a way that minimizes the sum over all x_i of |x_i - y_f(i)|, where y_f(i) is the location corresponding of the set to which x_i is assigned.
Due to the fact that the median is the population minimizer for absolute distance (i.e. L_1 norm), it follows that each location y_j will be the median of the elements x in the corresponding set S_j (hence the name k-medians). Since you are looking at integer values, there is the technicality that if S_j contains an even number of elements, the median might not be an integer, but in such cases choosing either the next integer above or below the median will give the same sum of absolute distances.
The standard heuristic for solving k-medians (and the related and more common k-means problem) is iterative, but this is not guaranteed to produce an optimal or even good solution. Solving the k-medians problem for general metric spaces is NP-hard, and finding efficient approximations for k-medians is an open research problem. Googling "k-medians approximation", for example, will lead to a bunch of papers giving approximation schemes.
http://www.cis.upenn.edu/~sudipto/mypapers/kmedian_jcss.pdf
http://graphics.stanford.edu/courses/cs468-06-winter/Papers/arr-clustering.pdf
In one dimension things become easier, and you can use a dynamic programming approach. A DP solution to the related one-dimensional k-means problem is described in this paper, and the source code in R is available here. See the paper for details, but the idea is essentially the same as what #SajalJain proposed, and can easily be adapted to solve the k-medians problem rather than k-means. For j<=k and m<=n let D(j,m) denote the cost of an optimal j-medians solution to x_1...x_m, where the x_i are assumed to be in sorted order. We have the recurrence
D(j,m) = min (D(j-1,q) + Cost(x_{q+1},...,x_m)
where q ranges from j-1 to m-1 and Cost is equal to the sum of absolute distances from the median. With a naive O(n) implementation of Cost, this would yield an O(n^3k) DP solution to the whole problem. However, this can be improved to O(n^2k) due to the fact that the Cost can be updated in constant time rather than computed from scratch every time, using the fact that, for a sorted sequence:
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_1...x_h)-x_1 if h is odd
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_2...x_h)-x_1 if h is even
See the writeup for more details. Except for the fact that the update of the Cost function is different, the implementation will be the same for k-medians as for k-means.
http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf
as I understand, the problems is:
we have n points on a line.
we want to place k position on the line. I call them destinations.
move each of n points to one of the k destinations so the sum of distances is minimum. I call this sum, total cost.
destinations can overlap.
An obvious fact is that for each point we should look for the nearest destinations on the left and the nearest destinations on the right and choose the nearest.
Another important fact is all destinations should be on the points. because we can move them on the line to right or to left to reach a point without increasing total distance.
By these facts consider following DP solution:
DP[i][j] means the minimum total cost needed for the first i point, when we can use only j destinations, and have to put a destination on the i-th point.
to calculate DP[i][j] fix the destination before the i-th point (we have i choice), and for each choice (for example k-th point) calculate the distance needed for points between the i-th point and the new point added (k-th point). add this with DP[k][j - 1] and find the minimum for all k.
the calculation of initial states (e.g. j = 1) and final answer is left as an exercise!
Task 0 - sort the position of the objects in non-decreasing order
Let us define 'center' as the position of the object where it is shifted to.
Now we have two observations;
For N positions the 'center' would be the position which is nearest to the mean of these N positions. Example, let 1,3,6,10 be the positions. Then mean = 5. Nearest position is 6. Hence the center for these elements is 6. This gives us the position with minimum cost of moving when all elements need to be grouped into 1 group.
Let N positions be grouped into K groups "optimally". When N+1 th object is added, then it will disturb only the K th group, i.e, first K-1 groups will remain unchanged.
From these observations, we build a dynamic programming approach.
Let Cost[i][k] and Center[i][k] be two 2D arrays.
Cost[i][k] = minimum cost when first 'i' objects are partitioned into 'k' groups
Center[i][k] stores the center of the 'i-th' object when Cost[i][k] is computed.
Let {L} be the elements from i-L,i-L+1,..i-1 which have the same center.
(Center[i-L][k] = Center[i-L+1][k] = ... = Center[i-1][k]) These are the only objects that need to be considered in the computation for i-th element (from observation 2)
Now
Cost[i][k] will be
min(Cost[i-1][k-1] , Cost[i-L-1][k-1] + computecost(i-L, i-L+1, ... ,i))
Update Center[i-L ... i][k]
computecost() can be found trivially by finding the center (from observation 1)
Time Complexity:
Sorting O(NlogN)
Total Cost Computation Matrix = Total elements * Computecost = O(NK * N)
Total = O(NlogN + N*NK) = O(N*NK)
Let's look at k=1.
For k=1 and n odd, all points should move to the center point. For k=1 and n even, all points should move to either of the center points or any spot between them. By 'center' I mean in terms of number of points to either side, i.e. the median.
You can see this because if you select a target spot, x, with more points to its right than it's left, then a new target 1 to the right of x would result in a cost reduction (unless there is exactly one more point to the right than the left and the target spot is a point, in which case n is even and the target is on/between the two center points).
If your points are already sorted, this is an O(1) operation. If not, I believe it's O(n) (via an order statistic algorithm).
Once you've found the spot that all points are moving to, it's O(n) to find the cost.
Thus regardless of whether the points are sorted or not, this is O(n).

Resources