Finding the distance between two sets in Manhattan distance - algorithm

I'm studying for an algorithms exam and I came across a problem that I haven't been able to solve for a few days, so I'm asking here for help.
Given two disjoint sets of points in the plane:
G={(x_1^G, y_1^G), (x_2^G, y_2^G), ..., (x_n^G, y_n^G)}
D={(x_1^D, y_1^D), (x_2^D, y_2^D), ..., (x_n^D, y_n^D)}
where for every 1 <= i, j <= n we have y_i^D < y_j^G, so G lies above D.
Find an efficient algorithm that computes the distance between them, defined as:
d(G,D) = min{ d(a,b): a \in G and b\in D },
where d(a,b) = |x_a - x_b| + |y_a - y_b|
O(n^2) is trivial, so it is not the answer.
I hope the solution isn't too hard, since it comes from the review materials for the test. Can anybody help?
I suspect it will turn out that this is a special case of some common problem. But if it is a special case, maybe the solution can be simpler?

There are a few different ways to do this in O(n log n) time.
One: Compute the Manhattan-distance Voronoi diagram of the G points and build a point location data structure based on it. This takes O(n log n) time. For each D point, find the closest G point using the point location data structure. This takes O(log n) time per D point. Take the minimum of the distances between the pairs you just found and that's your answer.
Two: You can adapt Fortune's algorithm to this problem; just keep separate binary trees for D and G points. Kind of annoying to describe.
The next idea computes the distance of the closest pair for the infinity-norm, which is max(|x1-x2|, |y1-y2|). You can tilt your problem 45 degrees (substituting u = x-y, v = x+y) to get it into the appropriate form.
Three (variant of two): Sort all of the points by y coordinate. Maintain d, the distance between the closest pair seen so far. We'll sweep a line from top to bottom, maintaining two binary search trees, one of G points and one of D points. When a point is d or farther above the sweep line, we remove it from its binary search tree. When a point is first encountered by the sweep line, say a D point, we (1) check the G binary search tree to see if it has any elements whose x-coordinate is within d of the new point's, updating d as necessary, and (2) insert the new point into D's binary search tree. Each point only causes a constant number of binary search tree operations plus a constant amount of additional work, so the sweep is O(n log n). The sort is too, unsurprisingly, so our overall time complexity is as desired.
You can probably make a divide-and-conquer strategy work too based on similar ideas to three.
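For concreteness, here is a rough Python sketch of the sweep in Three (my own illustration, not part of the original answer). It uses the 45-degree rotation from above, and the third-party sortedcontainers library stands in for a hand-rolled balanced binary search tree; the candidate scan inside the loop is written naively, and the complexity claim is the answer's.

import math
from collections import deque
from sortedcontainers import SortedList   # assumed available: pip install sortedcontainers

def manhattan_closest_bichromatic(G, D):
    # Rotate 45 degrees: the Manhattan distance between (x, y) points equals the
    # Chebyshev (L-infinity) distance between the rotated points (u, v) = (x - y, x + y).
    pts = [((x - y, x + y), 0) for x, y in G] + [((x - y, x + y), 1) for x, y in D]
    pts.sort(key=lambda p: -p[0][1])        # sweep top to bottom, i.e. by decreasing v
    trees = (SortedList(), SortedList())    # active G points and active D points, ordered by u
    window = deque()                        # the same active points in sweep order (largest v first)
    best = math.inf
    for (u, v), colour in pts:
        # Evict points that are more than 'best' above the sweep line.
        while window and window[0][0][1] - v > best:
            (ou, ov), oc = window.popleft()
            trees[oc].remove((ou, ov))
        # Check opposite-colour points whose u-coordinate is within 'best' of u.
        for cu, cv in trees[1 - colour].irange((u - best, -math.inf), (u + best, math.inf)):
            best = min(best, max(abs(cu - u), abs(cv - v)))
        trees[colour].add((u, v))
        window.append(((u, v), colour))
    return best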

Related

Find the closest set of points to another set

I have two sets, A and B, with N and M points in R^n respectively. I know that N < M always.
The distance between two points, P and Q, is denoted by d( P,Q ). As the problem is generic, this distance could be any function (e.g. Euclidean distance).
I want to find the closest subset of B to A. Mathematically I would say, I want to find the subset C of B with size N with the minimal global distance to A. The global distance between A and C is given by
D(A,C) = min([sum(d(P_i,Q_i),i=1,N) with P_i in A and Q_i in C* for C* in Permutations of C])
I've been thinking about this problem and I came up with an algorithm that finds a local optimum, but not necessarily the global optimum:
Step 1) For each point of A, find its nearest point in B. If no point of B was chosen more than once, I have found the optimal subset and the algorithm finishes. However, if some points are repeated, go to step 2.
Step 2) Among the points of A that share the same closest point, compare their distances. The one with the minimal distance keeps the point previously found, and the others switch to the "next" closest point that has not already been selected by another point.
Step 3) Check if all the chosen points are different. If they are, finish. If not, go back to step 2.
Any ideas? Trying all the combinations is not an option (I would have to calculate M!/(M-N)! global distances).
If M = N, this problem could be formulated as a minimum-weight perfect matching in a bipartite graph or, in other words, an assignment problem. A well-known method for solving an assignment problem is the Hungarian algorithm.
To make the Hungarian algorithm applicable in the case N < M, you could extend set A with (M-N) additional elements (each having zero distance to all elements of B).
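As a concrete illustration (my own sketch, not part of the answer above), SciPy's linear_sum_assignment solves the rectangular assignment problem directly, which has the same effect as padding A with M-N zero-cost dummy elements:

import numpy as np
from scipy.optimize import linear_sum_assignment   # assumed available

def closest_subset(A, B):
    # A: (N, n) array, B: (M, n) array with N <= M.
    # Returns the indices into B of the optimal subset C and the global distance D(A, C).
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)                       # optimal one-to-one assignment
    return cols, cost[rows, cols].sum()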

Show that, given a query point q, it can be tested in time O(log n) whether q lies inside P

I am trying to solve some exercises from chapter 6 (Point Location) of the book "Computational Geometry: Algorithms and Applications, 3rd ed." by de Berg et al. Unfortunately, I have no idea how to solve the following exercise:
A convex polygon P is given as an array of its n vertices in sorted order along the boundary. Show that, given a query point q, it can be tested in time O(log n) whether q lies inside P.
My idea so far:
The only way I know to determine whether a point lies inside P in O(log n) is to use a directed acyclic graph. But in order to use a directed acyclic graph I would first need to build it, which cannot be done in O(log n). So somehow I need to use the ordered array, and the only solution I know that uses just an array costs O(n).
I hope that someone could help me.
The idea is basically to do a binary search, to find which 'segment' the point belongs to. The assumption here is that the polygon wraps around some fixed origin O, which is necessary to define an angular sorting routine.
To find whether q lies on the 'left' or 'right' of P[n/2] (by which I mean an anticlockwise or clockwise rotational difference about O), we compute the 2D cross product
a x b = a_x * b_y - a_y * b_x.
This is a real scalar. If it is positive, then a is to the right of b, and vice versa. In our code a = q - O and b = P[i] - O, where i is the index of the polygon vertex we are testing q against.
We can then use this test to find which 'segment' or 'wedge' q is in, i.e. which two adjacent vertices of the polygon q lies between (call them Pl and Pr), using a binary search, which is O(log n). (I'll assume you know how to do this.)
Once we know that, we need to know whether q is inside the 'wedge'.
From https://en.wikipedia.org/wiki/Line%E2%80%93line_intersection, for two lines defined by pairs of points [(x1, y1), (x2, y2)] and [(x3, y3), (x4, y4)] respectively, their intersection point (Px, Py) is given by
Px = ((x1*y2 - y1*x2)*(x3 - x4) - (x1 - x2)*(x3*y4 - y3*x4)) / ((x1 - x2)*(y3 - y4) - (y1 - y2)*(x3 - x4))
Py = ((x1*y2 - y1*x2)*(y3 - y4) - (y1 - y2)*(x3*y4 - y3*x4)) / ((x1 - x2)*(y3 - y4) - (y1 - y2)*(x3 - x4))
Compute the intersection between [Pl, Pr] and [q, O] to give s, and compute the distance |s - O|. If this is greater than |q - O| then q is inside the polygon P, and vice versa.
(This step is of course O(1). There may however be more elegant ways of doing it - I'm just illustrating the logic behind it)
The total complexity is then O(log n) + O(1) = O(log n).
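Here is a small Python sketch of the whole test (my own code, assuming the vertices are given in counter-clockwise order). It pivots at vertex P[0] rather than an interior origin O, and it replaces the final intersection-distance computation with an equivalent side-of-edge test:

def point_in_convex_polygon(P, q):
    # P: list of (x, y) vertices in counter-clockwise order, q: query point.
    # Returns True if q lies inside P or on its boundary, in O(log n).
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    n = len(P)
    # q must lie inside the angular wedge at P[0] spanned by P[1] and P[n-1].
    if cross(P[0], P[1], q) < 0 or cross(P[0], P[n - 1], q) > 0:
        return False
    # Binary search for the wedge (P[0], P[lo], P[lo + 1]) that contains q.
    lo, hi = 1, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cross(P[0], P[mid], q) >= 0:
            lo = mid
        else:
            hi = mid
    # q is inside iff it lies on the inner side of the edge P[lo] -> P[lo + 1].
    return cross(P[lo], P[lo + 1], q) >= 0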

Given a list of points, how to determine which are within a certain distance of each other

Given a list of 2D points and a maximum distance d, what is a better-than-O(n^2) way of finding which points are located within d of each other? I don't need a full solution, just some starting ideas.
Use a spatial indexing structure such as a k-d tree and you can get O(n log n).
Edit:
Ah, I think I misunderstood your comment. If you query the n nearest neighbours, in the worst case a single search costs O(n log n), but you can put a flag on each found point to denote that it already belongs to a particular cluster. Then you don't have to execute the nearest-neighbour query again for those points, so the final complexity is still O(n log n). Here are some more details on such range searches: http://www.cs.utah.edu/~lifeifei/cs6931/kdtree.pdf .
I am assuming here that the desired behaviour is to remove a point from consideration once it already belongs to a cluster. Perhaps you can clarify the problem specification a bit?
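For what it's worth, if SciPy is acceptable this is already packaged: a k-d tree can report all pairs within distance d in one call (a minimal sketch; the variable names are mine):

import numpy as np
from scipy.spatial import cKDTree   # assumed available

points = np.random.rand(1000, 2)            # example data
d = 0.05
pairs = cKDTree(points).query_pairs(r=d)    # set of index pairs (i, j), i < j, within distance d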
There can be n^2 point pairs to "find", so it's not really clear what you're after.
An "output-sensitive" way to do this whose running time is something like O(n log(n) + h), where h is the number of pairs you "find", is as follows:
Sort the points in order by y coordinate.
Sweep a line downward, throwing a point into a balanced binary tree when the sweep line hits it, and removing it once it's d above the sweep line.
When you hit a point with the sweep line, iterate over everything in the balanced binary tree that's at most d left and at most d right of the new point. "Find" every point that's within a distance d of the new point.
In the third step, if you have to look at k >= 6 points, there will be at least floor((k/6)^2) pairs to "find" (exercise!), so the number of pairs considered is proportional to the number of pairs "found."
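A rough Python sketch of this sweep (my own illustration; it uses the third-party sortedcontainers library as the balanced binary search tree) might look like this:

import math
from collections import deque
from sortedcontainers import SortedList   # assumed available: pip install sortedcontainers

def pairs_within(points, d):
    # Returns all pairs of points whose Euclidean distance is at most d.
    pts = sorted(points, key=lambda p: -p[1])   # sweep the plane from top to bottom
    active = SortedList(key=lambda p: p[0])     # points whose y is within d of the sweep line, by x
    order = deque()                             # the same points in sweep (decreasing y) order
    found = []
    for p in pts:
        # Remove points that are more than d above the sweep line.
        while order and order[0][1] - p[1] > d:
            active.remove(order.popleft())
        # Examine active points at most d to the left or right of p.
        for q in active.irange_key(p[0] - d, p[0] + d):
            if math.dist(p, q) <= d:
                found.append((q, p))
        active.add(p)
        order.append(p)
    return found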

Efficiently checking which of a large collection of nodes are close together?

I'm currently interested in generating random geometric graphs. For my particular problem, we randomly place node v in the unit square, and add an edge from v to node u if they have Euclidean distance <= D, where D=D(u,n) varies with u and the number of nodes n in the graph.
Important points:
A1: It is costly to compute D, so I'd like to minimize the number of calls to this function.
A2: The vast majority of the time, when v is added, edges uv will be added to only a small number of nodes u (usually 0 or 1).
Question: What is an efficient method for checking which vertices u are "close enough" to v?
The brute-force algorithm is to compute and compare dist(v,u) and D(u,n) for all extant nodes u. This requires O(n^2) calls to D.
I feel we should be able to do much better than this. Perhaps some kind of binning would work. We could divide the space up into bins, then for each vertex u, store a list of bins where a newly placed vertex v could result in the edge uv. If v ends up placed outside of u's list of bins (which should happen most of the time), then it's too far away, and we don't need to compute D. This is somewhat of an off-the-top-of-my-head suggestion, and I don't know if it would work well (e.g., computing the sufficiently close bins might itself be too costly), so I'm after feedback.
Based on your description of the problem, I would choose an R-tree as your data structure.
It allows for very fast searching by drastically narrowing the set of vertices you need to run D against. However, in the worst case an insertion requires O(n) time. Thankfully, you're quite unlikely to hit the worst case with a typical data set.
I would probably just use a binning approach.
Say we cut the unit square in m x m subsquares (each having side length 1/m of course). Since you place your vertices uniformly at random (or so I assumed), every square will contain n / m^2 vertices on average.
Depending on A1, A2, m and n, you can probably determine the maximum radius you need to check. Say that's at most 1/m (one square side). Then, after inserting v, you would need to check the square in which it landed, plus all adjacent squares. Anyway, this is a constant number of squares, so for every insertion you'll need to check O(n / m^2) other vertices on average.
I don't know the best value for m (as said, that depends on A1 and A2), but say it would be sqrt(n), then your entire algorithm could run in O(n) expected time.
EDIT
A small addition: you could keep track of vertices with many neighbors (so with high radius, which extends over multiple squares) and check them for every inserted vertex. There should only be few, so that's no problem.
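Here is a rough sketch of the binning idea in Python (my own code; the function name and the signature of D are assumptions). It checks only the bin a new vertex lands in plus the adjacent bins, which is only valid while D(u, n) does not exceed the bin side length; larger-radius vertices would have to be handled separately, as in the edit above.

import math
import random
from collections import defaultdict

def random_geometric_graph(n, D, cell=None):
    # Place n uniform points in the unit square; add edge (u, v) when dist(u, v) <= D(u, n).
    if cell is None:
        cell = 1.0 / math.sqrt(n)            # bin side length, a tuning knob
    grid = defaultdict(list)                 # (i, j) bin -> points already placed there
    points, edges = [], []
    for _ in range(n):
        v = (random.random(), random.random())
        ci, cj = int(v[0] / cell), int(v[1] / cell)
        for di in (-1, 0, 1):                # the bin of v and its neighbours
            for dj in (-1, 0, 1):
                for u in grid.get((ci + di, cj + dj), ()):
                    if math.dist(u, v) <= D(u, n):   # D is only called for nearby candidates
                        edges.append((u, v))
        grid[(ci, cj)].append(v)
        points.append(v)
    return points, edges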

Algorithm that finds the connectivity distance of a graph on uniform points on the unit square

Situation
Suppose we are given n points on the unit square [0, 1]x[0, 1] and a positive real number r. We define the graph G(point 1, point 2, ..., point n, r) as the graph on vertices {1, 2, ..., n} such that there is an edge connecting two given vertices if and only if the distance between the corresponding points is less than or equal to r. (You can think of the points as transmitters, which can communicate with each other as long as they are within range r.)
Given n points on the unit square [0, 1]x[0, 1], we define the connectivity distance as the smallest possible r for which G(point 1, point 2, ..., point n, r) is connected.
Problem 1) find an algorithm that determines if G(point 1, point 2, ..., point n, r) is connected
Problem 2) find an algorithm that finds the connectivity distance for any n given points
My partial solution
I have an algorithm (Algorithm 1) in mind for problem 1. I haven't implemented it yet, but I'm convinced it works. (Roughly, the idea is to start from vertex 1, and try to reach all other vertices through the edges. I think it would be somewhat similar to this.)
All that remains is problem 2. I also have an algorithm in mind for this one. However, I think it is not efficient time wise. I'll try to explain how it works:
You must first convince yourself that the connectivity distance rmin is necessarily the distance between two of the given points, say p and q. Hence, there are at most n(n-1)/2 possible values for rmin.
So, first, my algorithm would measure all n(n-1)/2 distances and store them (in an array in C, for instance) in increasing order. Then it would use Algorithm 1 to test each stored value (in increasing order) to see if the graph is connected with such a range. The first value that does the job is the answer, rmin.
My question is: is there a better (time wise) algorithm for problem 2?
Remarks: the points will be randomly generated (something like 10000 of them), so that's the type of thing the algorithm is supposed to solve "quickly". Furthermore, I'll implement this in C. (If that makes any difference.)
Here is an algorithm which requires O(n^2) time and O(n) space.
It's based on the observation that if you partition the points into two sets, then the connectivity distance cannot be less than the distance of the closest pair of points one from each set in the partition. In other words, if we build up the connected graph by always adding the closest point, then the largest distance we add will be the connectivity distance.
Create two sets, A and B. Put a random point into A and all the remaining points into B.
Initialize r (the connectivity distance) to 0.
Initialize a map M holding, for every point in B, its distance to the point in A.
While there are still points in B:
    Select the point b in B whose distance M[b] is the smallest.
    If M[b] is greater than r, set r to M[b].
    Remove b from B and add it to A.
    For each point p in M:
        If p is b, remove it from M.
        Otherwise, if the distance from b to p is less than M[p], set M[p] to that distance.
When all the points are in A, r will be the connectivity distance.
Each iteration of the while loop takes O(|B|) time: first to find the minimum value in M (whose size is equal to the size of B), and second to update the values in M. Since a point is moved from B to A in each iteration, there will be exactly n iterations, and thus the total execution time is O(n^2).
The algorithm presented above is an improvement to a previous version, which used an (unspecified) solution to the bichromatic closest pair (BCP) problem to recompute the closest neighbour to A in every cycle. Since there is an O(n log n) solution to BCP, this implied a solution to the original problem in O(n^2 log n). However, maintaining and updating the list of closest points is actually much simpler, and only requires O(n). Thanks to @LajosArpad for a question which triggered this line of thought.
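A direct Python translation of this algorithm (my own sketch, not the answerer's code) could look like this:

import math

def connectivity_distance(points):
    # Prim-style growth of the connected set A; the longest edge ever added is the answer.
    n = len(points)
    if n < 2:
        return 0.0
    in_A = [False] * n
    in_A[0] = True
    M = [math.dist(points[0], p) for p in points]   # M[i] = current distance from point i to A
    r = 0.0
    for _ in range(n - 1):
        b = min((i for i in range(n) if not in_A), key=lambda i: M[i])   # closest point of B
        r = max(r, M[b])
        in_A[b] = True
        for i in range(n):                          # update distances now that b has joined A
            if not in_A[i]:
                M[i] = min(M[i], math.dist(points[b], points[i]))
    return r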
I think your ideas are reasonably good, however, I have an improvement for you.
As it stands, you build up an array of all pairwise distances and sort it. That is fine, at least when there are not too many points.
The number n(n-1)/2 is a logical consequence of your pairing requirement. So, for 10000 points, you will have 49,995,000 distances. You will need to increase the speed significantly! Also, storing that many values would eat a lot of memory.
How can you achieve greater speed?
First of all, don't build a new array; you already have one (the points). Secondly, you can solve your problem with a single traversal. Let's suppose you have a function which determines whether a given distance is enough to connect all the nodes; let's call this function "valid". Checking validity alone is not enough, because you need to find the minimal possible value. So, if you don't have more information about the nodes prior to the execution of the algorithm, then my suggestion is this solution:
import math

def smallest_connecting_distance(elements, valid):
    # valid(r) must answer whether range r connects all the nodes
    # (Algorithm 1 from the question can serve as this test).
    lower_bound = 0.0
    upper_bound = math.inf
    n = len(elements)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(elements[i], elements[j])
            # Only distances strictly between the current bounds can change the answer.
            if lower_bound < distance < upper_bound:
                if valid(distance):
                    upper_bound = distance
                else:
                    lower_bound = distance
    return upper_bound
After traversing all the pairs, the value of upper_bound holds the smallest distance which still connects the network. You didn't store all the distances, as there were far too many, and you have solved your problem in a single pass. I hope you find my answer helpful.
If some distance makes the graph connected, any larger distance would make it connected too. To find the minimal connecting distance, just sort all pairwise distances and binary-search over them.
Time complexity is O(n^2 log n), space complexity is O(n^2).
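A small Python sketch of that approach (my own illustration), with a union-find connectivity test standing in for the unspecified one:

import math
from itertools import combinations

def connectivity_distance_by_search(points):
    n = len(points)
    if n < 2:
        return 0.0

    def connected(r):
        # Union-find connectivity test for the graph with range r.
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i, j in combinations(range(n), 2):
            if math.dist(points[i], points[j]) <= r:
                parent[find(i)] = find(j)
        return len({find(i) for i in range(n)}) == 1

    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    lo, hi = 0, len(dists) - 1        # the answer's index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if connected(dists[mid]):
            hi = mid
        else:
            lo = mid + 1
    return dists[lo]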
You can start with some small distance d and then check for connectivity. If the graph is connected, you're done; if not, increment d by a small amount and check connectivity again.
You would also need a clever algorithm to avoid O(n^2) work in case n is big.
